Kernel Fusion ============= PyKokkos includes an automatic kernel fusion feature that can significantly improve performance when executing many kernel calls. Kernel fusion dynamically traces and combines multiple kernel launches into a single fused kernel, reducing launch overhead and improving execution efficiency through better data reuse and improved compiler optimizations. Fusion process uses lazy evaluation to record kernel calls in traces and fuses them when the result is requested by the application. This happens automatically and transparently without requiring any changes to your PyKokkos code. Enabling Fusion --------------- Kernel fusion is controlled by the ``PK_FUSION`` environment variable. To enable fusion, set this variable before running your PyKokkos application: .. code-block:: bash export PK_FUSION="naive" To disable fusion (default behavior): .. code-block:: bash unset PK_FUSION Performance Example ------------------- The following example demonstrates the performance benefit of kernel fusion when executing many small kernel calls in a loop: .. code-block:: python import cupy as cp import pykokkos as pk @pk.workunit def work(wid, a): a[wid] = a[wid] + 1 def main(): B = 100000 N = 10 a = cp.ones((B, N)) pk.set_default_space(pk.Cuda) for batch in range(B): pk.parallel_for("work", 10, work, a=a[batch]) print(a) main() Performance Comparison ^^^^^^^^^^^^^^^^^^^^^^ Running the above example with and without fusion shows significant speedup: Machine Specification ^^^^^^^^^^^^^^^^^^^^ The following hardware was used for the performance measurements: .. code-block:: text CPU: Intel(R) Xeon(R) w5-3433, 32 (16 cores, 2 threads per core) GPU: NVIDIA RTX 5000 Ada Generation 32 GB (2x) CUDA: 12.4 OS: Ubuntu 24.04 (Linux) **Without fusion** (``unset PK_FUSION``): .. code-block:: text real 0m27.213s user 0m35.134s sys 0m0.990s **With fusion** (``export PK_FUSION="naive"``): .. code-block:: text real 0m14.840s user 0m22.729s sys 0m1.136s In this example, kernel fusion provides approximately **1.8x speedup** by fusing 100,000 kernel calls into fewer fused kernels, reducing kernel launch overhead and enabling better data reuse and compiler optimizations. When to Use Fusion ------------------ Kernel fusion is most beneficial when: * Executing many kernel calls consecutively * Kernel launch overhead dominates execution time * Multiple kernels operate on shared data .. note:: Fusion is currently most effective on GPU execution spaces where kernel launch overhead is more significant. Fusion can achieve speedups on NVIDIA and AMD GPUs as well as Intel and AMD CPUs. For more details on the kernel fusion implementation, see the fuser paper: `Dynamically Fusing Python HPC Kernels `_. .. toctree:: :maxdepth: 2 :caption: Contents: