ROCm vs OpenCL/dlprimitives

I finally managed to upgrade my PC, now running Ubuntu 24.04, so I could properly install ROCm 6.1 and test the out-of-the-box PyTorch 2.4 ROCm build. It was (almost) straightforward.*

GPU: AMD RX 6600 XT 8GB. I still compared against PyTorch 1.13 for OpenCL, since I hadn't yet completed 2.4 support in the pytorch/opencl backend. For ROCm I used the official 2.4 build.

Training

Time in ms per batch.

| Training | batch size | rocm/hip | opencl | Ratio % |
|---|---|---|---|---|
| alexnet | 64 | 57.848 | 74.965 | 77.2 |
| resnet18 | 64 | 146.917 | 238.581 | 61.6 |
| resnet50 | 32 | 266.441 | 358.45 | 74.3 |
| vgg16 | 16 | 206.312 | 342.292 | 60.3 |
| densenet161 | 16 | 296.807 | 490.319 | 60.5 |
| mobilenet_v2 | 32 | 157.476 | 198.891 | 79.2 |
| mobilenet_v3_small | 64 | 92.506 | 123.889 | 74.7 |
| mobilenet_v3_large | 64 | 286.795 | 325.736 | 88.0 |
| resnext50_32x4d | 32 | 336.464 | 491.016 | 68.5 |
| wide_resnet50_2 | 32 | 466.841 | 644.114 | 72.5 |
| mnasnet1_0 | 32 | 159.97 | 167.829 | 95.3 |
| efficientnet_b0 | 32 | 205.69 | 306.328 | 67.1 |
| regnet_y_400mf | 64 | 171.691 | 245.65 | 69.9 |
| convnext_small | 16 | 337.252 | 591.211 | 57.0 |
| Average | | | | 71.9 |
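The per-batch timings above come from averaging over many iterations; a minimal sketch of such a measurement loop is below. This is not the exact benchmark harness used here — the `step` and `sync` callables are illustrative placeholders (e.g. a forward/backward pass and `torch.cuda.synchronize`), needed because GPU work is asynchronous and the clock must only be read after the queue is flushed.

```python
import time
import statistics

def time_per_batch_ms(step, batches, warmup=3, sync=None):
    """Run `step` on each batch and return the median time in ms.

    `step` is a callable taking one batch; `sync` is an optional
    callable (e.g. torch.cuda.synchronize) that flushes pending
    asynchronous GPU work before the clock is read.
    """
    # Warm-up iterations are discarded: they absorb kernel
    # compilation, memory-pool growth, and cache effects.
    for b in batches[:warmup]:
        step(b)
    if sync:
        sync()
    times = []
    for b in batches[warmup:]:
        t0 = time.perf_counter()
        step(b)
        if sync:
            sync()  # make sure the batch actually finished
        times.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(times)
```

The median is used rather than the mean so a single slow outlier batch does not skew the result.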

Inference

For inference the batch size is always 64; time in ms per batch.

| Inference, batch=64 | rocm/hip | opencl | Ratio % |
|---|---|---|---|
| convnext_small | 476.371 | 602.858 | 79.0 |
| alexnet | 24.564 | 25.866 | 95.0 |
| resnet18 | 41.478 | 59.095 | 70.2 |
| resnet50 | 165.507 | 196.455 | 84.2 |
| vgg16 | 205.215 | 309.509 | 66.3 |
| densenet161 | 409.825 | 414.051 | 99.0 |
| inception_v3 | 90.632 | 131.78 | 68.8 |
| mobilenet_v2 | 77.652 | 93.449 | 83.1 |
| mobilenet_v3_small | 22.17 | 25.647 | 86.4 |
| mobilenet_v3_large | 63.12 | 70.016 | 90.2 |
| resnext50_32x4d | 245.001 | 274.578 | 89.2 |
| wide_resnet50_2 | 319.019 | 400.626 | 79.6 |
| mnasnet1_0 | 74.205 | 74.835 | 99.2 |
| efficientnet_b0 | 104.285 | 114.732 | 90.9 |
| efficientnet_b4 | 302.771 | 276.257 | 109.6 |
| regnet_y_400mf | 43.253 | 56.814 | 76.1 |
| Average | | | 85.4 |

Summary

Basically, OpenCL performance with my dlprimitives backend is lower, but it still delivers very good performance — especially considering that it does not require the ROCm infrastructure and thus isn't limited to Linux or to a small set of officially supported devices.

*) I needed to set the environment variable `export HSA_OVERRIDE_GFX_VERSION=10.3.0`, since my RX 6600 XT (gfx1032) is not officially supported, so I had to override it with the gfx1030 architecture.
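For reference, the override plus a quick sanity check might look like this (the Python one-liner is just an illustration and assumes a ROCm build of PyTorch is installed; ROCm exposes the GPU through the CUDA API surface):

```shell
# Tell the ROCm runtime to treat the gfx1032 GPU as gfx1030
# (same RDNA2 ISA family, so the gfx1030 kernels run fine).
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Check that PyTorch now sees the device.
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
```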
