Kernel Fusion
PyKokkos includes an automatic kernel fusion feature that can significantly improve performance when executing many kernel calls. Kernel fusion dynamically traces and combines multiple kernel launches into a single fused kernel, reducing launch overhead and improving execution efficiency through better data reuse and improved compiler optimizations.
The fusion process uses lazy evaluation: kernel calls are recorded in traces and fused when the application requests the result. This happens automatically and transparently, without requiring any changes to your PyKokkos code.
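The record-then-fuse idea can be sketched in plain Python. This is a toy model only, not the PyKokkos implementation: the `LazyTracer` class, its `flush` method, and the `launches` counter are hypothetical names introduced here for illustration.

```python
# Toy illustration of lazy tracing and fusion; NOT the PyKokkos implementation.
class LazyTracer:
    def __init__(self):
        self.trace = []      # recorded (kernel, iteration count, arguments) tuples
        self.launches = 0    # number of "launches" actually performed

    def parallel_for(self, n, kernel, **kwargs):
        # Record the call instead of running it right away.
        self.trace.append((kernel, n, kwargs))

    def flush(self):
        # Run every recorded kernel inside a single fused "launch".
        if not self.trace:
            return
        pending, self.trace = self.trace, []
        self.launches += 1
        for kernel, n, kwargs in pending:
            for i in range(n):
                kernel(i, **kwargs)


def add_one(i, a):
    a[i] = a[i] + 1


tracer = LazyTracer()
data = [1, 1, 1, 1]
for _ in range(3):
    tracer.parallel_for(len(data), add_one, a=data)

assert data == [1, 1, 1, 1]   # nothing has executed yet
tracer.flush()                # the application "requests the result"
print(data, tracer.launches)
```

Three recorded calls are executed in one pass, which is the effect fusion aims for: fewer launches for the same work.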
Enabling Fusion
Kernel fusion is controlled by the PK_FUSION environment variable.
To enable fusion, set this variable before running your PyKokkos application:
export PK_FUSION="naive"
To disable fusion (default behavior):
unset PK_FUSION
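When a script is launched from another Python program, for example a benchmark driver, the same setting can be passed through the child process environment. The sketch below uses only the standard library. The helper names `fusion_env` and `run_script` are hypothetical, and the script path is a placeholder.

```python
import os
import subprocess
import sys


def fusion_env(enabled, base=None):
    """Return a copy of the environment with PK_FUSION set or removed."""
    env = dict(os.environ if base is None else base)
    if enabled:
        env["PK_FUSION"] = "naive"
    else:
        env.pop("PK_FUSION", None)
    return env


def run_script(script, enabled):
    # Same effect as `export PK_FUSION="naive"` (or `unset PK_FUSION`)
    # followed by running the script with the current interpreter.
    return subprocess.run([sys.executable, script],
                          env=fusion_env(enabled), check=True)
```

Calling `run_script("example.py", enabled=True)` would run a placeholder script named `example.py` with fusion enabled, without changing the environment of the calling process.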
Performance Example
The following example demonstrates the performance benefit of kernel fusion when executing many small kernel calls in a loop:
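import cupy as cp
import pykokkos as pk


@pk.workunit
def work(wid, a):
    # Increment one element of the row passed in as `a`
    a[wid] = a[wid] + 1


def main():
    B = 100000  # number of rows, one kernel launch per row
    N = 10      # elements per row
    a = cp.ones((B, N))
    pk.set_default_space(pk.Cuda)

    # Launch B small kernels back to back
    for batch in range(B):
        pk.parallel_for("work", N, work, a=a[batch])

    print(a)


main()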
Performance Comparison
Running the above example with and without fusion shows a significant speedup:
Machine Specification
The following hardware was used for the performance measurements:
CPU: Intel(R) Xeon(R) w5-3433, 32 logical CPUs (16 cores, 2 threads per core)
GPU: NVIDIA RTX 5000 Ada Generation 32 GB (2x)
CUDA: 12.4
OS: Ubuntu 24.04 (Linux)
Without fusion (unset PK_FUSION):
real 0m27.213s
user 0m35.134s
sys 0m0.990s
With fusion (export PK_FUSION="naive"):
real 0m14.840s
user 0m22.729s
sys 0m1.136s
In this example, kernel fusion provides an approximately 1.8x speedup in wall-clock time (27.2 s down to 14.8 s) by fusing 100,000 kernel calls into fewer fused kernels, which reduces kernel launch overhead and enables better data reuse and compiler optimizations.
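The speedup figure follows directly from the two `real` times reported above. The short sketch below redoes the arithmetic. The `parse_time` helper is a hypothetical name, and the per-call figure is only an average over the whole run (it includes interpreter startup and any compilation), not a measurement of launch overhead alone.

```python
import re


def parse_time(text):
    """Convert a `time`-style duration such as '0m27.213s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


KERNEL_CALLS = 100_000                      # B in the example
without_fusion = parse_time("0m27.213s")    # real time, PK_FUSION unset
with_fusion = parse_time("0m14.840s")       # real time, PK_FUSION="naive"

speedup = without_fusion / with_fusion
saved_per_call_us = (without_fusion - with_fusion) / KERNEL_CALLS * 1e6

print(f"speedup: {speedup:.2f}x")
print(f"average time saved per kernel call: {saved_per_call_us:.0f} us")
```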
When to Use Fusion
Kernel fusion is most beneficial when:
Executing many kernel calls consecutively
Kernel launch overhead dominates execution time
Multiple kernels operate on shared data
Note
Fusion is currently most effective on GPU execution spaces, where kernel launch overhead is more significant. It can achieve speedups on NVIDIA and AMD GPUs as well as on Intel and AMD CPUs.
For more details on the kernel fusion implementation, see the fuser paper: Dynamically Fusing Python HPC Kernels.