API: Batched Dense (DLA)¶

Our implementation of batched dense linear algebra (DLA) allows user to compose various batched DLA operations. For example, consider a case where small element matrices are created via gemm and those matrices are triangular solved by lu and trsv. An approach using conventional batched BLAS interfaces would be like the following pseudo code:

int N = 1000; /// # of batch items
int m = 8;    /// a square matrix size

Kokkos::View<double***> AA("AA", N, m, m); /// element matrices
Kokkos::View<double**>  LL("LL", N, m);    /// left basis vectors
Kokkos::View<double**>  RR("RR", N, m);    /// right basis vectors
Kokkos::View<double**>  BB("BB", N, m);    /// load vector and would be overwritten by a solution

batched_dgemm(LL, RR, AA);     /// construct element matrices via batched dgemm
batched_dgetrf(AA);            /// perform lu decomposition of each instance of A matrix array
batched_dtrsv("Lower", AA, BB) /// perform forward substitution
batched_dtrsv("Upper", AA, BB) /// perform backward substitution

Clearly, a performance problem of the above code comes from the fact that the sequence of batched functions does not exploit temporal locality between DLA functions; each batched function sweeps the entire set of batched matrices or vectors in parallel for a single DLA operation. On the other hand, Kokkos batched APIs provide functor-level interfaces so that a user can compose a new batched function. The following example has the same functionality as the above:

int N = 1000; /// # of batch items
int m = 8;    /// a square matrix size

Kokkos::View<double***> AA("AA", N, m, m); /// element matrices
Kokkos::View<double**>  LL("LL", N, m);    /// left basis vectors
Kokkos::View<double**>  RR("RR", N, m);    /// right basis vectors
Kokkos::View<double**>  BB("BB", N, m);    /// load vector and would be overwritten by a solution

using namespace KokkosBatched;
Kokkos::parallel_for(N, KOKKOS_LAMBDA(const int i) {
  auto A = Kokkos::subview(AA, i, Kokkos::ALL(), Kokkos::ALL()); /// ith matrix
  auto L = Kokkos::subview(LL, i, Kokkos::ALL());                /// ith left vector
  auto R = Kokkos::subview(RR, i, Kokkos::ALL());                /// ith right vector
  auto B = Kokkos::subview(BB, i, Kokkos::ALL());                /// ith load/solution vector

  SerialGemm<Trans::NoTranspose,Trans::NoTranspose,Algo::Gemm::Unblocked>
    ::invoke(one, L, R, zero, A);
  SerialLU<Algo::LU::Unblocked>
    ::invoke(A);
  SerialTrsv<Uplo::Lower,Trans::NoTranspose,Diag::UnitDiag,Algo::Trsv::Unblocked>
    ::invoke(one, A, B);
});

In this example, a single parallel_for is launched to compute a sequence of DLA operations i.e., gemm, lu and trsv. Then, one may ask, “How is this different from putting BLAS and LAPACK functions inside a parallel loop ?”. The main difference is that Kokkos batched APIs are very lightweight generic implementations focusing on small matrix sizes (kernels are developed and tuned from the application context). Most vendor provided DLA libraries such as Intel MKL and NVIDIA CUBLAS perform well for large problem sizes. For tiny problem sizes like 3x3 or 5x5 matrix problems, it is not feasible to use vendor optimized DLA libraries as even error checking already puts a quite amount of overhead for such tiny problem sizes. Furthermore, CUBLAS (or GPU numeric libraries) cannot be nested inside of a parallel region, which is why CUBLAS provides separate batched APIs. Kokkos batched APIs provide generic header-only functor-level implementations that can be embedded in a parallel region. The APIs can be mapped to Kokkos hierarchical parallelism and also provide various implementations of algorithms with a template argument, which allows users to choose (or customize) the batched routines for their application context.

BLAS support¶

Below are tables summarizing the currently supported function calls in Kokkos Kernels.

BLAS 1¶

BLAS Call	Serial	Team	TeamVector
ROTG	–	–	–
ROTMG	–	–	–
ROT	–	–	–
ROTM	–	–	–
SWAP	kokkos_swap(a, b)	–	–
SCAL	Blas::SerialScale	Blas::TeamScale	Blas::TeamVectorScale
COPY	SerialCopy	TeamCopy	TeamVectorCopy
AXPY	SerialAxpy	TeamAxpy	TeamVectorAxpy
DOT*	SerialDot	TeamDot	TeamVectorDot
DOTU	–	–	–
DOTC*	–	–	–
NRM2	–	–	–
ASUM	–	–	–
IAMAX	SerialIamax	–	–

BLAS 2¶

BLAS Call	Serial	Team	TeamVector
GEMV	Blas::SerialGemv	TeamGemv	TeamVectorGemv
GBMV	–	–	–
SYMV	–	–	–
SBMV	–	–	–
SPMV	–	–	–
TRMV	–	–	–
TBMV	–	–	–
TPMV	–	–	–
TRSV	SerialTrsv	TeamTrsv	TeamVectorTrsv
TBSV	SerialTbsv	–	–
TPSV	–	–	–
GER	SerialGer	–	–
GERU	SerialGer	–	–
GERC	SerialGer	–	–
SYR	SerialSyr	–	–
HER	SerialSyr	–	–
SPR	–	–	–
SYR2	–	–	–

BLAS 3¶

BLAS Call	Serial	Team	TeamVector
GEMM	SerialGemm	TeamGemm	TeamVectorGemm
SYMM	–	–	–
SYRK	–	–	–
SYRK2	–	–	–
HEMM	–	–	–
HERK	–	–	–
HERK2	–	–	–
TRMM	SerialTrmm	TeamTrmm	TeamVectorTrmm
TRSM	SerialTrsm	TeamTrsm	TeamVectorTrsm

LAPACK support¶

Below are tables summarizing the currently supported function calls in Kokkos Kernels.

LAPACK Call	Serial	Team	TeamVector
getrf	SerialGetrf	–	–
getrs	SerialGetrs	–	–
gesv	SerialGesv	TeamGesv	TeamVectorGesv
gbtrf	SerialGbtrf	–	–
gbtrs	SerialGbtrs	–	–
pbtrf	SerialPbtrf	–	–
pbtrs	SerialPbtrs	–	–
pttrf	SerialPttrf	–	–
pttrs	SerialPttrs	–	–
getri	–	–	–
trtri	SerialTrtri	–	–
geequ	–	–	–
geqrf	SerialQR	TeamQR	TeamVectorQR
ungqr	–	–	–
unmqr	–	–	–
gesvd	SerialSVD	–	–
geev	–	–	–
lacgv	SerialLacgv	–	–
laswp	SerialLaswp	–	–