I’m excited to share an idea that I believe could significantly enhance the performance of time-series analysis in PyTorch. I’ve developed a CUDA-accelerated implementation of the Dynamic Time Warping (DTW) algorithm and would love to get your feedback.
## Why CUDA-Accelerated DTW?
Dynamic Time Warping (DTW) is essential for measuring similarity between temporal sequences. However, it can be computationally intensive, especially with large datasets or real-time applications. By leveraging CUDA, we can significantly speed up DTW computations, making it feasible for high-frequency trading, real-time signal processing, and more.
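To make the DTW recurrence concrete, here is a minimal, unoptimized pure-Python reference (the function name is my own, not part of the proposed API). `R[i][j]` holds the cost of the best warping path aligning the first `i` points of `x` with the first `j` points of `y`:

```python
import math

def dtw_distance(x, y):
    n, m = len(x), len(y)
    # R[i][j] = cost of the best warping path aligning x[:i] with y[:j].
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])      # local distance
            R[i][j] = d + min(R[i - 1][j - 1],  # match
                              R[i - 1][j],      # insertion
                              R[i][j - 1])      # deletion
    return R[n][m]

print(dtw_distance([0, 1, 2], [0, 1, 1, 2]))  # 0.0: y is x with one point repeated
```

Note how DTW, unlike Euclidean distance, tolerates the repeated point in `y` at zero cost; this O(n·m) loop nest is exactly what becomes expensive at scale.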
## Key Benefits
- Performance: Accelerates DTW computations by utilizing GPU parallelism.
- Scalability: Efficiently handles larger datasets, suitable for enterprise-level applications.
- Seamless Integration: Designed to integrate smoothly with existing PyTorch workflows.
## Implementation Overview
The CUDA-accelerated DTW implementation includes:
- CUDA kernels for forward and backward DTW computations
- Custom PyTorch function integrating with autograd
- Numba-accelerated CPU fallback for non-CUDA systems
- High-level PyTorch module for easy integration
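As a rough sketch of the CPU fallback's shape (the function name and signature here are my own, not the proposed API): in the real fallback this loop nest would be decorated with `numba.njit`, but plain NumPy/Python is shown so the sketch runs anywhere. It takes a precomputed pairwise-distance tensor `D` of shape `(B, N, M)`:

```python
import numpy as np

def dtw_forward_cpu(D):
    """Batched DTW forward pass over a distance tensor D of shape
    (B, N, M). A Numba version would decorate this with @njit."""
    B, N, M = D.shape
    R = np.full((B, N + 1, M + 1), np.inf)
    R[:, 0, 0] = 0.0
    for b in range(B):
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                # Same three-way recurrence as the CUDA kernel below.
                R[b, i, j] = D[b, i - 1, j - 1] + min(
                    R[b, i - 1, j - 1], R[b, i - 1, j], R[b, i, j - 1])
    return R[:, N, M]

# Example: one pair of sequences, distance matrix via broadcasting.
x = np.array([[0., 1., 2.]])
y = np.array([[0., 1., 1., 2.]])
D = np.abs(x[:, :, None] - y[:, None, :])   # shape (1, 3, 4)
print(dtw_forward_cpu(D))                    # [0.]
```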
Here’s a glimpse of the CUDA kernel for forward computation:
```python
@cuda.jit
def compute_dtw_cuda(D, max_i, max_j, R):
    # One block per batch element; thread tid is responsible for row i == tid.
    b = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    I = tid
    for j in range(1, max_j + 1):
        for i in range(1, max_i + 1):
            if I == i:
                r0 = R[b, i - 1, j - 1]  # diagonal (match)
                r1 = R[b, i - 1, j]      # up (insertion)
                r2 = R[b, i, j - 1]      # left (deletion)
                R[b, i, j] = D[b, i - 1, j - 1] + min(r0, r1, r2)
            # Barrier after every cell keeps the recurrence's
            # cross-thread dependency order intact.
            cuda.syncthreads()
```
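This column-by-column schedule largely serializes the threads; a common alternative is to sweep anti-diagonals of `R`, since all cells on one anti-diagonal are mutually independent and need only one barrier per diagonal. Below is a pure-Python emulation of that schedule (helper names are hypothetical), checked against the serial recurrence:

```python
import math

def dtw_serial(D):
    """Reference: serial DTW forward on one (N, M) cost matrix."""
    N, M = len(D), len(D[0])
    R = [[math.inf] * (M + 1) for _ in range(N + 1)]
    R[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            R[i][j] = D[i-1][j-1] + min(R[i-1][j-1], R[i-1][j], R[i][j-1])
    return R[N][M]

def dtw_wavefront(D):
    """Same recurrence, visiting one anti-diagonal per pass -- the
    order a CUDA block of N threads could use, with a single
    syncthreads() at the end of each pass."""
    N, M = len(D), len(D[0])
    R = [[math.inf] * (M + 1) for _ in range(N + 1)]
    R[0][0] = 0.0
    for p in range(N + M - 1):      # one pass per anti-diagonal
        for tid in range(N):        # these iterations are independent
            i, j = tid + 1, p - tid + 1
            if 1 <= j <= M:
                R[i][j] = D[i-1][j-1] + min(R[i-1][j-1], R[i-1][j], R[i][j-1])
    return R[N][M]

D = [[1., 2., 0.], [0., 3., 1.]]
print(dtw_serial(D), dtw_wavefront(D))  # 4.0 4.0
```

All cells `(i, j)` on pass `p` share `i + j = p + 2`, while their three dependencies all have a smaller `i + j`, so every dependency is complete before the pass that reads it.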
I’m currently drafting an RFC (Request for Comments) to formally propose this feature. I would appreciate any feedback, suggestions, or use cases you think should be considered.
Thank you for sharing your idea.
You can use my code for short sequences, but I would like to adapt it to compare a huge number of sequences with lengths of about 1000.
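For sequences of that length, one standard way to cut the per-pair cost is a Sakoe–Chiba band constraint, which only fills cells within a fixed radius of the diagonal, reducing the work from O(N·M) to roughly O(N·radius). A minimal sketch (the helper below is my own illustration, not part of the code above):

```python
import math

def dtw_banded(x, y, radius):
    """DTW restricted to a Sakoe-Chiba band of the given radius
    around the (length-scaled) diagonal."""
    n, m = len(x), len(y)
    R = [[math.inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit columns within `radius` of the scaled diagonal.
        center = i * m // n
        lo = max(1, center - radius)
        hi = min(m, center + radius)
        for j in range(lo, hi + 1):
            d = abs(x[i - 1] - y[j - 1])
            R[i][j] = d + min(R[i - 1][j - 1], R[i - 1][j], R[i][j - 1])
    return R[n][m]

print(dtw_banded([0., 1., 2., 3.], [0., 1., 2., 3.], radius=1))  # 0.0
```

The band is an approximation: it returns the exact DTW distance whenever the optimal path stays inside the band, which is typically the case for sequences that are roughly aligned.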