API: Batched Dense (DLA) ######################## .. toctree:: :maxdepth: 2 :hidden: dense/batched_iamax dense/batched_trsv dense/batched_tbsv dense/batched_ger dense/batched_syr dense/batched_getrf dense/batched_getrs dense/batched_gbtrf dense/batched_gbtrs dense/batched_pbtrf dense/batched_pbtrs dense/batched_pttrf dense/batched_pttrs dense/batched_lacgv dense/batched_laswp dense/batched_apply_householder dense/batched_householder Our implementation of batched dense linear algebra (DLA) allows user to compose various batched DLA operations. For example, consider a case where small element matrices are created via `gemm` and those matrices are triangular solved by `lu` and `trsv`. An approach using conventional batched BLAS interfaces would be like the following pseudo code: .. code-block:: c++ int N = 1000; /// # of batch items int m = 8; /// a square matrix size Kokkos::View AA("AA", N, m, m); /// element matrices Kokkos::View LL("LL", N, m); /// left basis vectors Kokkos::View RR("RR", N, m); /// right basis vectors Kokkos::View BB("BB", N, m); /// load vector and would be overwritten by a solution batched_dgemm(LL, RR, AA); /// construct element matrices via batched dgemm batched_dgetrf(AA); /// perform lu decomposition of each instance of A matrix array batched_dtrsv("Lower", AA, BB) /// perform forward substitution batched_dtrsv("Upper", AA, BB) /// perform backward substitution Clearly, a performance problem of the above code comes from the fact that the sequence of batched functions does not exploit temporal locality between DLA functions; each batched function sweeps the entire set of batched matrices or vectors in parallel for a single DLA operation. On the other hand, Kokkos batched APIs provide functor-level interfaces so that a user can compose a new batched function. The following example has the same functionality as the above: .. code-block:: c++ int N = 1000; /// # of batch items int m = 8; /// a square matrix size Kokkos::View AA("AA", N, m, m); /// element matrices Kokkos::View LL("LL", N, m); /// left basis vectors Kokkos::View RR("RR", N, m); /// right basis vectors Kokkos::View BB("BB", N, m); /// load vector and would be overwritten by a solution using namespace KokkosBatched; Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) { auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL()); /// ith matrix auto L = Kokkos::subview(LL, i, Kokkos::ALL()); /// ith left vector auto R = Kokkos::subview(RR, i, Kokkos::ALL()); /// ith right vector auto B = Kokkos::subview(BB, i, Kokkos::ALL()); /// ith load/solution vector SerialGemm ::invoke(one, L, R, zero, A); SerialLU ::invoke(A); SerialTrsv ::invoke(one, A, B); }); In this example, a single `parallel_for` is launched to compute a sequence of DLA operations i.e., `gemm`, `lu` and `trsv`. Then, one may ask, "How is this different from putting BLAS and LAPACK functions inside a parallel loop ?". The main difference is that Kokkos batched APIs are very lightweight generic implementations focusing on small matrix sizes (kernels are developed and tuned from the application context). Most vendor provided DLA libraries such as Intel MKL and NVIDIA CUBLAS perform well for large problem sizes. For tiny problem sizes like 3x3 or 5x5 matrix problems, it is not feasible to use vendor optimized DLA libraries as even error checking already puts a quite amount of overhead for such tiny problem sizes. Furthermore, CUBLAS (or GPU numeric libraries) cannot be nested inside of a parallel region, which is why CUBLAS provides separate batched APIs. Kokkos batched APIs provide generic header-only functor-level implementations that can be embedded in a parallel region. The APIs can be mapped to Kokkos hierarchical parallelism and also provide various implementations of algorithms with a template argument, which allows users to choose (or customize) the batched routines for their application context. BLAS support ============ Below are tables summarizing the currently supported function calls in Kokkos Kernels. BLAS 1 ------ .. list-table:: :widths: 12 10 10 10 :header-rows: 1 * - BLAS Call - Serial - Team - TeamVector * - ROTG - -- - -- - -- * - ROTMG - -- - -- - -- * - ROT - -- - -- - -- * - ROTM - -- - -- - -- * - SWAP - `kokkos_swap(a, b) `_ - -- - -- * - SCAL - `Blas::SerialScale` - `Blas::TeamScale` - `Blas::TeamVectorScale` * - COPY - `SerialCopy` - `TeamCopy` - `TeamVectorCopy` * - AXPY - `SerialAxpy` - `TeamAxpy` - `TeamVectorAxpy` * - DOT* - `SerialDot` - `TeamDot` - `TeamVectorDot` * - DOTU - -- - -- - -- * - DOTC* - -- - -- - -- * - NRM2 - -- - -- - -- * - ASUM - -- - -- - -- * - IAMAX - :doc:`SerialIamax ` - -- - -- BLAS 2 ------ .. list-table:: :widths: 12 10 10 10 :header-rows: 1 * - BLAS Call - Serial - Team - TeamVector * - GEMV - `Blas::SerialGemv` - `TeamGemv` - `TeamVectorGemv` * - GBMV - -- - -- - -- * - SYMV - -- - -- - -- * - SBMV - -- - -- - -- * - SPMV - -- - -- - -- * - TRMV - -- - -- - -- * - TBMV - -- - -- - -- * - TPMV - -- - -- - -- * - TRSV - :doc:`SerialTrsv ` - :doc:`TeamTrsv ` - :doc:`TeamVectorTrsv ` * - TBSV - :doc:`SerialTbsv ` - -- - -- * - TPSV - -- - -- - -- * - GER - :doc:`SerialGer ` - -- - -- * - GERU - :doc:`SerialGer ` - -- - -- * - GERC - :doc:`SerialGer ` - -- - -- * - SYR - :doc:`SerialSyr ` - -- - -- * - HER - :doc:`SerialSyr ` - -- - -- * - SPR - -- - -- - -- * - SYR2 - -- - -- - -- BLAS 3 ------ .. list-table:: :widths: 12 10 10 10 :header-rows: 1 * - BLAS Call - Serial - Team - TeamVector * - GEMM - `SerialGemm` - `TeamGemm` - `TeamVectorGemm` * - SYMM - -- - -- - -- * - SYRK - -- - -- - -- * - SYRK2 - -- - -- - -- * - HEMM - -- - -- - -- * - HERK - -- - -- - -- * - HERK2 - -- - -- - -- * - TRMM - `SerialTrmm` - `TeamTrmm` - `TeamVectorTrmm` * - TRSM - `SerialTrsm` - `TeamTrsm` - `TeamVectorTrsm` LAPACK support ============== Below are tables summarizing the currently supported function calls in Kokkos Kernels. .. list-table:: :widths: 12 10 10 10 :header-rows: 1 * - LAPACK Call - Serial - Team - TeamVector * - getrf - :doc:`SerialGetrf ` - -- - -- * - getrs - :doc:`SerialGetrs ` - -- - -- * - gesv - `SerialGesv` - `TeamGesv` - `TeamVectorGesv` * - gbtrf - :doc:`SerialGbtrf ` - -- - -- * - gbtrs - :doc:`SerialGbtrs ` - -- - -- * - pbtrf - :doc:`SerialPbtrf ` - -- - -- * - pbtrs - :doc:`SerialPbtrs ` - -- - -- * - pttrf - :doc:`SerialPttrf ` - -- - -- * - pttrs - :doc:`SerialPttrs ` - -- - -- * - getri - -- - -- - -- * - trtri - `SerialTrtri` - -- - -- * - geequ - -- - -- - -- * - geqrf - `SerialQR` - `TeamQR` - `TeamVectorQR` * - ungqr - -- - -- - -- * - unmqr - -- - -- - -- * - gesvd - `SerialSVD` - -- - -- * - geev - -- - -- - -- * - lacgv - :doc:`SerialLacgv ` - -- - -- * - laswp - :doc:`SerialLaswp ` - -- - --