I am using the C++ API of LibTorch 2.5.1 (to be clear, I am not using any Python code) and have implemented a few self-written CUDA and HIP kernels, e.g.:
/**
 * @brief Compute Greville abscissae
 */
template <typename real_t>
__global__ void
greville_kernel(torch::PackedTensorAccessor64<real_t, 1> greville,
                const torch::PackedTensorAccessor64<real_t, 1> knots,
                int64_t ncoeffs, short_t degree, bool interior)
{
  // Grid-stride loop over the coefficients
  for (int64_t k = blockIdx.x * blockDim.x + threadIdx.x;
       k < ncoeffs - (interior ? 2 : 0); k += blockDim.x * gridDim.x) {
    for (short_t l = 1; l <= degree; ++l)
      greville[k] += knots[k + (interior ? 1 : 0) + l];
    greville[k] /= real_t(degree);
  }
}
that I call from within my regular C++ code as follows:
// CUDA
int blockSize, minGridSize, gridSize;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, (const void *)greville_kernel<real_t>, 0, 0);
gridSize = (ncoeffs_[j] + blockSize - 1) / blockSize;
greville_kernel<<<gridSize, blockSize>>>(greville, knots, ncoeffs_[j], degrees_[j], interior);
// HIP
int blockSize, minGridSize, gridSize;
static_cast<void>(hipOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, (const void *)greville_kernel<real_t>, 0, 0));
gridSize = (ncoeffs_[j] + blockSize - 1) / blockSize;
greville_kernel<<<gridSize, blockSize>>>(greville, knots, ncoeffs_[j], degrees_[j], interior);
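For completeness, here is a plain-CPU reference I use to sanity-check the kernel results. The helper name and the use of std::vector are my own choices for this post, not part of my library, but the arithmetic is exactly what the kernel performs (note the kernel assumes the output tensor is zero-initialized, which the reference makes explicit):

#include <cassert>
#include <cstdint>
#include <vector>

// Serial reference implementation of the Greville abscissae:
// greville[k] = (knots[k+1] + ... + knots[k+degree]) / degree,
// with an optional offset when only interior abscissae are wanted.
template <typename real_t>
std::vector<real_t> greville_reference(const std::vector<real_t> &knots,
                                       std::int64_t ncoeffs, int degree,
                                       bool interior) {
  std::vector<real_t> greville(ncoeffs - (interior ? 2 : 0), real_t(0));
  for (std::int64_t k = 0; k < static_cast<std::int64_t>(greville.size()); ++k) {
    for (int l = 1; l <= degree; ++l)
      greville[k] += knots[k + (interior ? 1 : 0) + l];
    greville[k] /= real_t(degree);
  }
  return greville;
}

For the knot vector {0,0,0,1,2,3,3,3} with degree 2 and 5 coefficients this yields {0, 0.5, 1.5, 2.5, 3}, matching what the CUDA/HIP kernels produce.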
The code is implemented as a header-only library (the CUDA/HIP kernels live in a regular .hpp file), and the main application is compiled with nvcc and hipcc, respectively.
I managed to compile my code with Intel GPU support enabled by following the installation instructions in Getting Started on Intel GPU — PyTorch 2.5 documentation and pointing CMake to the LibTorch library and header files in the Python site-packages directory. My code works fine except for the custom kernels.
I would appreciate some help with implementing the above (and similar) custom kernels in SYCL and calling them from the C++ code. I am familiar with CUDA/HIP programming but not yet with SYCL.
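For concreteness, after skimming the SYCL documentation I imagine the kernel might translate to something like the sketch below, using raw pointers (e.g., from tensor.data_ptr<real_t>()) instead of the packed accessors, since I do not know whether those work in SYCL. I am also only guessing that at::xpu::getCurrentXPUStream().queue() is the right way to obtain LibTorch's SYCL queue — please correct me if this is wrong or unidiomatic:

#include <sycl/sycl.hpp>
#include <cstdint>

// Rough SYCL sketch of the same computation; a simple range replaces the
// CUDA grid-stride loop, and the pointers are assumed to be USM device
// pointers backing the XPU tensors. Untested — this is my best guess.
template <typename real_t>
void greville_kernel_sycl(sycl::queue &q, real_t *greville,
                          const real_t *knots, std::int64_t ncoeffs,
                          int degree, bool interior) {
  const std::int64_t n = ncoeffs - (interior ? 2 : 0);
  q.parallel_for(sycl::range<1>(static_cast<std::size_t>(n)),
                 [=](sycl::id<1> idx) {
    const std::int64_t k = static_cast<std::int64_t>(idx[0]);
    for (int l = 1; l <= degree; ++l)
      greville[k] += knots[k + (interior ? 1 : 0) + l];
    greville[k] /= real_t(degree);
  });
}

In particular, I am unsure how launch configuration should be handled here — whether a plain range is sufficient or whether I should pick an nd_range work-group size myself (the analogue of cudaOccupancyMaxPotentialBlockSize), and how synchronization with LibTorch's stream is supposed to work.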