API: BLAS

Kokkos Kernels strives to make most BLAS capabilities, and more through extensions, available to users. Our native implementations, built on the capabilities of the Kokkos Core library, provide portability and performance, which lets users explore the performance of early hardware for which no other implementation is available. Additionally, we support a large third-party layer that wraps specialized, high-performance implementations of the BLAS specification.

Kokkos objects for linear algebra

Our implementations heavily leverage the capabilities of Kokkos::View to simplify the interface of the standard BLAS specification. A Kokkos::View can be thought of as a tensor: depending on its rank it represents a scalar value, a vector, or a matrix. Each rank has an associated extent, which specifies the number of values stored along that rank, and a stride, which records the corresponding leading dimension. Because these are encoded in the View, Kokkos Kernels does not require the user to specify the dimensions of vectors or matrices, nor information such as the leading dimension of a matrix. The View also carries the notion of an increment, used to step through a vector with a given spacing; the increment is encoded in the layout of the View using a LayoutStride. Below we show the signatures of the gemv function in standard BLAS and in Kokkos Kernels to illustrate this:

dgemv(character TRANS,
      integer M,
      integer N,
      double precision ALPHA,
      double precision, dimension(lda,*) A,
      integer LDA,
      double precision, dimension(*) X,
      integer INCX,
      double precision BETA,
      double precision, dimension(*) Y,
      integer INCY)

template <class ExecutionSpace, class AViewType, class XViewType, class YViewType>
void gemv(const ExecutionSpace& space,
          const char trans[],
          typename AViewType::const_value_type& alpha,
          const AViewType& A,
          const XViewType& x,
          typename YViewType::const_value_type& beta,
          const YViewType& y)
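
For illustration, here is a minimal sketch of a gemv call; the matrix and vector names and sizes are arbitrary. Note that the extents and the leading dimension are taken from the Views rather than passed as arguments:

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_gemv.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // A 100x50 matrix, an input vector of length 50 and an output vector of length 100.
    // The extents and the leading dimension are carried by the Views.
    Kokkos::View<double**> A("A", 100, 50);
    Kokkos::View<double*> x("x", 50);
    Kokkos::View<double*> y("y", 100);
    Kokkos::deep_copy(A, 1.0);
    Kokkos::deep_copy(x, 1.0);

    // y = 1.0 * A * x + 0.0 * y; no M, N, LDA, INCX or INCY arguments are needed.
    KokkosBlas::gemv("N", 1.0, A, x, 0.0, y);
  }
  Kokkos::finalize();
}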

Now we provide an example that uses a LayoutStride to compute the nrm1 of the even entries and of the odd entries of a vector; the example can of course be adapted to larger increments.

#include "Kokkos_Core.hpp"
#include "KokkosBlas1_nrm1.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using vector = Kokkos::View<double*, Kokkos::DefaultExecutionSpace>;

    // Create a vector of length 20.
    // Fill it with the value 2.
    // Compute its norm1: 20*2 = 40.
    vector a("a", 20);
    Kokkos::deep_copy(a, 2);
    auto nrm1_a = KokkosBlas::nrm1(a);

    // Create a strided layout of length 10 and stride 2
    // Make a new view using the data from vector a and the new layout
    // Set the values in the even vector to 1
    // Compute the norm1 of the even vector: 10*1 = 10
    Kokkos::LayoutStride layout(10, 2);
    Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> even_a(a.data(), layout);
    Kokkos::deep_copy(even_a, 1);
    auto nrm1_even_a = KokkosBlas::nrm1(even_a);

    // Repeat the operations to create an odd vector using a pointer offset and the layout
    // Set the values of the odd vector to 3
    // Compute the norm1 of the odd vector: 10*3 = 30
    Kokkos::View<double*, Kokkos::LayoutStride, Kokkos::DefaultExecutionSpace> odd_a(a.data()+1, layout);
    Kokkos::deep_copy(odd_a, 3);
    auto nrm1_odd_a = KokkosBlas::nrm1(odd_a);

    // Print all the norms out and the values in the original vector.
    // Note that to do this with device views you will need to create
    // mirror views and copy data from device to host with deep_copy.
    std::cout << "nrm1(a)=" << nrm1_a << ", nrm1(even_a)=" << nrm1_even_a << ", nrm1(odd_a)=" << nrm1_odd_a << std::endl;
    std::cout << "a={ ";
    for(int idx = 0; idx < a.extent_int(0); ++idx) {
      std::cout << a(idx) << " ";
    }
    std::cout << "}" << std::endl;
  }
  Kokkos::finalize();
}
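
When this example is run with a host execution space, so that the entries of a can be printed directly, it reports nrm1(a)=40, nrm1(even_a)=10 and nrm1(odd_a)=30, and the printed vector alternates between 1 and 3.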

Types and precision

Kokkos Kernels relies on template parameters to specify the types of input and output variables, which allows us to provide generic implementations that are not precision specific. This is reflected in the names of our functions, which do not include the usual S, D, C and Z letters that identify the parameter types in reference BLAS. Besides making our naming scheme simpler, the use of template parameters also allows us to naturally support mixed-precision operations. For instance, the gemv function above can be called with vectors of double and a matrix of float.
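
As a small illustration of this flexibility, the sketch below calls the same gemv with a single-precision matrix and double-precision vectors; the names and sizes are arbitrary:

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_gemv.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Matrix stored in single precision, vectors in double precision.
    Kokkos::View<float**> A("A", 100, 50);
    Kokkos::View<double*> x("x", 50);
    Kokkos::View<double*> y("y", 100);
    Kokkos::deep_copy(A, 1.0f);
    Kokkos::deep_copy(x, 2.0);

    // The same templated interface handles the mixed-precision combination;
    // per the signature above, alpha has the value type of A (float) and
    // beta that of y (double).
    KokkosBlas::gemv("N", 1.0f, A, x, 0.0, y);
  }
  Kokkos::finalize();
}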

Execution spaces, non-blocking behavior and thread safety

All Kokkos Kernels BLAS calls can be made with an execution space instance as their first parameter. This allows users to attach a stream/queue to the space instance, which is then used to launch the BLAS kernel on that stream/queue. For this strategy to be useful, the BLAS functions are implemented in a non-blocking fashion, except when a return value is expected, for instance when computing a norm or a dot product, since the return value must be obtained before the following kernel can start. Finally, while we strive to provide a thread-safe BLAS implementation, note that when an execution space instance is used to select a particular stream/queue for a kernel, selecting that queue is not a thread-safe operation. More details can be found in the vendors' documentation, for instance that of cuBLAS.
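
As an illustration, here is a minimal sketch that submits two independent gemv calls to two execution space instances. It assumes a Kokkos version providing Kokkos::Experimental::partition_space; on stream-based backends such as CUDA and HIP each instance maps to its own stream/queue. The matrix and vector names are arbitrary:

#include "Kokkos_Core.hpp"
#include "KokkosBlas2_gemv.hpp"

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    using exec_space = Kokkos::DefaultExecutionSpace;

    Kokkos::View<double**> A("A", 1000, 1000), B("B", 1000, 1000);
    Kokkos::View<double*> x("x", 1000), y("y", 1000), z("z", 1000);
    Kokkos::deep_copy(A, 1.0);
    Kokkos::deep_copy(B, 2.0);
    Kokkos::deep_copy(x, 1.0);

    // Split the default execution space into two instances; on stream-based
    // backends each instance is associated with its own stream/queue.
    auto instances = Kokkos::Experimental::partition_space(exec_space(), 1, 1);

    // Both gemv calls return immediately and may execute concurrently since
    // they run on different instances and write to disjoint output vectors.
    KokkosBlas::gemv(instances[0], "N", 1.0, A, x, 0.0, y);
    KokkosBlas::gemv(instances[1], "N", 1.0, B, x, 0.0, z);

    // Wait for both instances before using y and z.
    instances[0].fence();
    instances[1].fence();
  }
  Kokkos::finalize();
}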

BLAS support

Below are tables summarizing the function calls currently supported by Kokkos Kernels and the third-party libraries they can dispatch to. The tables are updated with each release of the library to reflect recently added support.

BLAS 1

=========  ============================  =========  ====  ======  =======  ======
BLAS Call  API Call                      Reference  BLAS  cuBLAS  rocBLAS  oneMKL
=========  ============================  =========  ====  ======  =======  ======
SROTG      rotg(a, b, c, s)              X          X     X       X        X
SROTMG     rotmg(d1, d2, x1, y1, param)  X          X     X       X
SROT       rot(X, Y, c, s)               X          X     X       X
SROTM      rotm(X, Y, param)             X          X     X
SSWAP      swap(X, Y)                    X          X     X       X
SSCAL      scal(y,a,x)                   X          X     X       X
CSSCAL     scal(y,a,x)                   X          X     X       X
SCOPY      deep_copy(y,x)                X
SAXPY      axpy(a,x,y)                   X          X     X
SDOT*      dot(x,y)                      X          X     X       X        X
SDSDOT*    dot(x,y)                      X          X     X       X        X
CDOTU
CDOTC*     dot(x,y)                      X          X     X       X        X
SNRM2      nrm2(x)                       X          X     X       X        X
SCNRM2     nrm2(x)                       X          X     X       X        X
SASUM      nrm1(x)                       X          X     X       X        X
ISAMAX     iamax(x)                      X          X     X       X
=========  ============================  =========  ====  ======  =======  ======

BLAS 2

=========  ====================================  =========  ====  ======  =======  ======
BLAS Call  API Call                              Reference  BLAS  cuBLAS  rocBLAS  oneMKL
=========  ====================================  =========  ====  ======  =======  ======
SGEMV      gemv(trans,a,A,x,b,y)                 X          X     X       X        X
SGBMV
SSYMV
SSBMV
SSPMV
STRMV      trmm(side,uplo,trans,diag,alpha,A,B)  X          X     X
STBMV
STPMV
STRSV      trsm(side,uplo,trans,diag,alpha,A,B)  X          X     X
STBSV
STPSV
SGER       ger(trans,a,x,y,A)                    X          X     X       X
CGERU      ger(trans,a,x,y,A)                    X          X     X       X
CGERC      ger(trans,a,x,y,A)                    X          X     X       X
SSYR       syr(trans,uplo,a,x,A)                 X          X     X       X
SSPR
SSYR2      syr2(trans,uplo,a,x,y,A)              X          X     X       X
=========  ====================================  =========  ====  ======  =======  ======

BLAS 3

=========  ======================================  =========  ====  ======  =======  ======
BLAS Call  API Call                                Reference  BLAS  cuBLAS  rocBLAS  oneMKL
=========  ======================================  =========  ====  ======  =======  ======
SGEMM      gemm(transA,transB,a,A,B,b,C)           X          X     X       X
SSYMM
SSYRK
SSYR2K
CHEMM
CHERK
CHER2K
STRMM      trmm(side, uplo, trans, diag, a, A, B)  X          X     X
STRSM      trsm(side, uplo, trans, diag, a, A, B)  X          X     X
=========  ======================================  =========  ====  ======  =======  ======