Field report for the Kokkos User Group Meeting 2026

Posted on March 31, 2026 • 5 min read • 1,050 words
blog  


The High Performance Software Foundation (HPSF) Conference 2026 took place in Chicago, IL from March 16th to March 20th. There were Kokkos talks all week, with the last two days having a dedicated track for the Kokkos User Group (KUG) meeting. The program brought together Kokkos developers, library maintainers, and application teams to share updates on performance portability work across the Kokkos ecosystem, with recurring discussion topics including GPU performance tuning, interoperability (especially with Fortran), memory management, and distributed execution.

Day 1 (Thursday March 19th):

Thursday focused primarily on Kokkos usage in applications and libraries, along with adoption experiences and training-related discussions.

A number of talks highlighted end-to-end application workflows and the combination of Kokkos with other ecosystem components.

In multi-GPU radiative transport work integrated into OpenFOAM, Nicolas Tricard described performance and scalability improvements obtained by grouping rays and using ArborX for spatial searches; he reported speedups of approximately 400× over serial and 10× over OpenMP.

In high-order CFD, Sana Nazir presented batching as a key technique for improving GPU performance, and emphasized the importance of evaluating the performance of the complete solver, not only individual kernels.

For particle systems with CabanaPD, Sam Reeve reported that a multimaterial extension is now usable and that performance on MI250 GPUs remained reasonable even with conditional logic in the main kernel path.

Several presentations centered on ecosystem libraries and portability challenges encountered in practice.

Ramzi Messahel gave a talk on a portable mesh interpolation library that uses ArborX for neighbor searches while implementing the interpolation itself outside of ArborX. The talk also outlined the migration away from Eigen toward Kokkos-Kernels due to incomplete GPU support in Eigen.

Memory management was also a topic: Kristi Belcher presented UmpireSpace, an experimental implementation of Umpire as a Kokkos memory space.

Adoption-oriented talks addressed programming model choices and developer experience.

Nigel Tan presented on performance-portable SIMD for vector Particle-in-Cell codes and noted that automatic compiler vectorization can be effective but may not match hand-optimized performance.

Framework integration perspectives were provided by two talks:

Namjae Choi spoke on MOOSE’s integration of Kokkos in a code based heavily on dynamic polymorphism. The talk identified separate compilation and RDC (relocatable device code) as the main challenges for accelerator use in their case.

Timo Heister reported on the deal.II finite element library bundling Kokkos and highlighted the role of Kokkos training in enabling student contributions to deal.II.

On the Python front, a pyKokkos update by Ivan Grigorik described kernel fusion efforts and a shift away from exposing Kokkos::View directly. pyKokkos is moving toward NumPy and CuPy as its main data structures, allowing easier interoperability and a more Python-like feel.

A training and education session concluded the day with a panel discussion on teaching and training (panelists: Daniel Holladay, Pariksheet Nanda, Hariprasad Kannan, and John K. Holmen). The panelists emphasized the value of hands-on components, the usefulness of recorded materials, and the observation that C++ fundamentals are often a larger barrier than Kokkos concepts. Kokkos’ documentation search behavior was also identified as an area for improvement (e.g., common terms not consistently surfacing the most relevant pages).

As a prelude to the panel discussion, Pariksheet Nanda spoke about the challenge of teaching domain scientists just enough C++ for accelerators, and Daniel Holladay talked about his experiences with transitioning Fortran developers to Kokkos.

Day 2 (Friday March 20th):

Friday broadened the scope to performance mechanisms, distributed execution, build and packaging, and Fortran interoperability and migration paths.

Performance-focused talks included updates on library capabilities and execution policies.

Yuuichi Asahi reported that MPI support in Kokkos-FFT is a key next step and that communication costs were a dominant factor in some performance testing.

A talk by Hariprasad Kannan on MDRangePolicy efficiency discussed differences between serial loop behavior (including limited autovectorization) and parallel_for execution, and outlined possible directions for improved SIMD utilization while noting the difficulty of vectorizing arbitrary user functors.

For spectral element kernels, Rohit Kakodkar reported that hardware-aware tiling and chunking are effective for accelerating 3D stencil computations. Comparisons noted that CuTe achieved higher performance in part through overlapping memory loads and computation, motivating interest in more explicit asynchronous data-movement strategies in Kokkos.

Trung Nguyen talked about application package updates including mixed precision support in the Kokkos package for LAMMPS.

Jakob Bludau presented the Kokkos Build & Packaging working group. The group maintains and improves Kokkos’ build system, tries to simplify downstream consumption of Kokkos, and supports cases where integrating Kokkos into a software stack is complicated. The talk also referenced updates on Spack packaging, Kokkos-on-Godbolt, and ongoing efforts toward binary distribution.

Several talks addressed extending Kokkos to new environments and scaling models.

A neuromorphic integration update by Bradley Theilman described ongoing work toward a Kokkos backend targeting neuromorphic/heterogeneous systems, with incremental implementation progress and early support for deep_copy and work toward parallel_for.

For distributed multi-GPU execution, KokkosComm was presented by Nicolas Morales as a communication layer designed to manage lifetimes and abstract non-contiguous data handling, with NCCL and RCCL backends.

A substantial portion of Friday covered Fortran interoperability and modernization strategies.

Bruno Turcksin gave an update on Kokkos Fortran interop. He recommended using the develop branch, given the age of the available pre-release versions, and noted current limitations such as the lack of automatic memory management and support primarily for SharedSpace.

In parallel, an approach for automatic translation of Fortran to Kokkos was presented by Brayden Wagoner. The approach uses flang-derived AST information to generate equivalent C++ code.

In terms of integrating Kokkos into existing Fortran applications, Jian Sun introduced a C++/Kokkos dynamical core behind an existing Fortran interface in an Earth system model.

Yuuichi Asahi then talked about porting portions of a Fortran plasma simulation library to C++/Kokkos. The approach uses C bindings and identified memory access patterns as the primary performance concern.

Additional sessions addressed specialized performance and scaling issues.

An in-kernel inference library (PONNI) was presented by Matthew Norman. In his performance testing, bfloat16 DRAM loads/stores were close in cost to float, an observation he flagged for further investigation.

ExaCA performance optimization work presented by Matt Rolchigo described imbalance challenges in microstructure modeling. The talk also presented efforts to overcome the imbalance, but pointed to remaining scaling limitations.

Daniel Holladay ended the week with a proposal for RangePolicy-compatible per-iteration scratch memory, discussing the desired functionality and potential contributions to Kokkos.