Kernel Fusion

PyKokkos includes an automatic kernel fusion feature that can significantly improve performance when executing many kernel calls. Kernel fusion dynamically traces and combines multiple kernel launches into a single fused kernel, reducing launch overhead and improving execution efficiency through better data reuse and improved compiler optimizations.

Fusion process uses lazy evaluation to record kernel calls in traces and fuses them when the result is requested by the application. This happens automatically and transparently without requiring any changes to your PyKokkos code.

Enabling Fusion

Kernel fusion is controlled by the PK_FUSION environment variable. To enable fusion, set this variable before running your PyKokkos application:

export PK_FUSION="naive"

To disable fusion (default behavior):

unset PK_FUSION

Performance Example

The following example demonstrates the performance benefit of kernel fusion when executing many small kernel calls in a loop:

import cupy as cp
import pykokkos as pk

@pk.workunit
def work(wid, a):
    a[wid] = a[wid] + 1

def main():
    B = 100000
    N = 10
    a = cp.ones((B, N))
    pk.set_default_space(pk.Cuda)
    for batch in range(B):
        pk.parallel_for("work", 10, work, a=a[batch])
    print(a)

main()

Performance Comparison

Running the above example with and without fusion shows significant speedup:

Machine Specification

The following hardware was used for the performance measurements:

CPU:  Intel(R) Xeon(R) w5-3433, 32 (16 cores, 2 threads per core)
GPU:  NVIDIA RTX 5000 Ada Generation 32 GB (2x)
CUDA: 12.4
OS:   Ubuntu 24.04 (Linux)

Without fusion (unset PK_FUSION):

real    0m27.213s
user    0m35.134s
sys     0m0.990s

With fusion (export PK_FUSION="naive"):

real    0m14.840s
user    0m22.729s
sys     0m1.136s

In this example, kernel fusion provides approximately 1.8x speedup by fusing 100,000 kernel calls into fewer fused kernels, reducing kernel launch overhead and enabling better data reuse and compiler optimizations.

When to Use Fusion

Kernel fusion is most beneficial when:

  • Executing many kernel calls consecutively

  • Kernel launch overhead dominates execution time

  • Multiple kernels operate on shared data

Note

Fusion is currently most effective on GPU execution spaces where kernel launch overhead is more significant. Fusion can achieve speedups on NVIDIA and AMD GPUs as well as Intel and AMD CPUs.

For more details on the kernel fusion implementation, see the fuser paper: Dynamically Fusing Python HPC Kernels.