I finally upgraded my PC to Ubuntu 24.04, so I could properly install ROCm 6.1 and test the out-of-the-box PyTorch 2.4 ROCm build. It was (almost) straightforward *
The GPU is an AMD RX 6600 XT 8GB. For OpenCL I still used PyTorch 1.13, since I haven't yet completed 2.4 support in the pytorch/opencl backend. For ROCm I used the official 2.4 build.
Training
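Time in ms per batch (lower is better). Ratio % is the ROCm time divided by the OpenCL time, i.e. how much of the ROCm speed the OpenCL backend reaches.
Model | batch size | rocm/hip | opencl | Ratio % |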
---|---|---|---|---|
alexnet | 64 | 57.848 | 74.965 | 77.2 |
resnet18 | 64 | 146.917 | 238.581 | 61.6 |
resnet50 | 32 | 266.441 | 358.45 | 74.3 |
vgg16 | 16 | 206.312 | 342.292 | 60.3 |
densenet161 | 16 | 296.807 | 490.319 | 60.5 |
mobilenet_v2 | 32 | 157.476 | 198.891 | 79.2 |
mobilenet_v3_small | 64 | 92.506 | 123.889 | 74.7 |
mobilenet_v3_large | 64 | 286.795 | 325.736 | 88.0 |
resnext50_32x4d | 32 | 336.464 | 491.016 | 68.5 |
wide_resnet50_2 | 32 | 466.841 | 644.114 | 72.5 |
mnasnet1_0 | 32 | 159.97 | 167.829 | 95.3 |
efficientnet_b0 | 32 | 205.69 | 306.328 | 67.1 |
regnet_y_400mf | 64 | 171.691 | 245.65 | 69.9 |
convnext_small | 16 | 337.252 | 591.211 | 57.0 |
Average | | | | 71.9 |
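The ratio column can be recomputed from the timings. Below is a minimal sketch that copies the numbers from the training table; the names `timings` and `ratio_percent` are just illustrative, they are not part of any benchmark script.

```python
# Timings in ms per batch, copied from the training table: (rocm/hip, opencl)
timings = {
    "alexnet": (57.848, 74.965),
    "resnet18": (146.917, 238.581),
    "resnet50": (266.441, 358.45),
    "vgg16": (206.312, 342.292),
    "densenet161": (296.807, 490.319),
    "mobilenet_v2": (157.476, 198.891),
    "mobilenet_v3_small": (92.506, 123.889),
    "mobilenet_v3_large": (286.795, 325.736),
    "resnext50_32x4d": (336.464, 491.016),
    "wide_resnet50_2": (466.841, 644.114),
    "mnasnet1_0": (159.97, 167.829),
    "efficientnet_b0": (205.69, 306.328),
    "regnet_y_400mf": (171.691, 245.65),
    "convnext_small": (337.252, 591.211),
}


def ratio_percent(rocm_ms, opencl_ms):
    """ROCm time as a percentage of OpenCL time (100 means equal speed)."""
    return 100.0 * rocm_ms / opencl_ms


ratios = {name: ratio_percent(r, o) for name, (r, o) in timings.items()}
# Plain arithmetic mean of the per-network ratios
average = sum(ratios.values()) / len(ratios)

for name, value in ratios.items():
    print(f"{name:20s} {value:5.1f}")
print(f"{'Average':20s} {average:5.1f}")
```

A value below 100 means OpenCL was slower than ROCm for that network; a value above 100 would mean OpenCL was faster.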
Inference
For inference the batch size is always 64; time in ms per batch.
Inference batch=64 | rocm/hip | opencl | Ratio % |
---|---|---|---|
convnext_small | 476.371 | 602.858 | 79.0 |
alexnet | 24.564 | 25.866 | 95.0 |
resnet18 | 41.478 | 59.095 | 70.2 |
resnet50 | 165.507 | 196.455 | 84.2 |
vgg16 | 205.215 | 309.509 | 66.3 |
densenet161 | 409.825 | 414.051 | 99.0 |
inception_v3 | 90.632 | 131.78 | 68.8 |
mobilenet_v2 | 77.652 | 93.449 | 83.1 |
mobilenet_v3_small | 22.17 | 25.647 | 86.4 |
mobilenet_v3_large | 63.12 | 70.016 | 90.2 |
resnext50_32x4d | 245.001 | 274.578 | 89.2 |
wide_resnet50_2 | 319.019 | 400.626 | 79.6 |
mnasnet1_0 | 74.205 | 74.835 | 99.2 |
efficientnet_b0 | 104.285 | 114.732 | 90.9 |
efficientnet_b4 | 302.771 | 276.257 | 109.6 |
regnet_y_400mf | 43.253 | 56.814 | 76.1 |
Average | | | 85.4 |
Summary
Basically, the OpenCL performance of my dlprimitives backend is lower than ROCm's (on average about 72% of ROCm speed in training and 85% in inference), but that is still very good considering that it does not require the ROCm infrastructure and thus isn't limited to Linux and to a few specific devices.
*) I needed to set an environment variable: export HSA_OVERRIDE_GFX_VERSION=10.3.0
My RX 6600 XT (gfx1032) is not officially supported, so I had to override it with the gfx1030 architecture.
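The same override can also be applied from inside a Python script instead of the shell. This is only a sketch, and it assumes the variable is read when the ROCm runtime is initialised, which is why it is set before `torch` is imported. The check at the end is optional and only runs if PyTorch is installed.

```python
import os

# Set the override before importing torch so the ROCm/HIP runtime sees it.
# setdefault keeps a value that was already exported in the shell.
os.environ.setdefault("HSA_OVERRIDE_GFX_VERSION", "10.3.0")

try:
    import torch
except ImportError:
    torch = None  # PyTorch is not installed; nothing to check

if torch is not None:
    # ROCm builds of PyTorch expose the GPU through the torch.cuda API;
    # torch.version.hip is None on non-ROCm builds.
    print("HIP version:", torch.version.hip)
    print("GPU available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
```

If the GPU is reported as available and the device name is printed, the override was picked up.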