[RFC] Performance profiling at scale with detailed NVTX annotations

Hi folks,

I’ve been looking at nsys traces for end-to-end training runs, and once a trace grows large enough, Nsight Systems becomes infeasible to use.

So I hacked together a simple solution based on NVTX annotations that works both in the GUI and offline, e.g. by querying the exported SQLite database.

Here is an example of the usage:

sudo nsys profile \
    --trace=cuda,nvtx \
    --sample=process-tree --export sqlite \
    --output annotated_training --force-overwrite true \
    $(which python) benchmark.py

This run produces two artifacts: annotated_training.nsys-rep and annotated_training.sqlite. Both include NVTX annotations for whole training passes (e.g. forward, backward) together with annotations for each computed buffer (e.g. forward_buf0, backward_buf42).
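To illustrate what nested annotations of this kind look like, here is a minimal hand-written sketch. It is not the PoC code, which emits the annotations from Inductor. The helper name annotate is hypothetical; it wraps torch.cuda.nvtx.range_push and torch.cuda.nvtx.range_pop, and falls back to a no-op when PyTorch or CUDA is not available.

```python
import contextlib

# Assumption: NVTX ranges are only emitted when PyTorch with CUDA is present;
# otherwise the helper silently does nothing so the script still runs.
try:
    import torch
    _nvtx = torch.cuda.nvtx if torch.cuda.is_available() else None
except ImportError:
    _nvtx = None


@contextlib.contextmanager
def annotate(name):
    """Hypothetical helper: open an NVTX range called `name` around a block."""
    if _nvtx is not None:
        _nvtx.range_push(name)
    try:
        yield
    finally:
        # Close the range even if the wrapped code raises.
        if _nvtx is not None:
            _nvtx.range_pop()


def training_step():
    # The outer range covers the whole pass, inner ranges cover single buffers.
    with annotate("forward"):
        with annotate("forward_buf0"):
            pass  # compute buf0 here
        with annotate("forward_buf2"):
            pass  # compute buf2 here
```

Ranges opened this way nest, so the per-buffer ranges show up inside the enclosing forward range in the trace.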

With Nsight Systems it looks like this:

And with the SQLite database it looks as follows:

> sqlite3 annotated_training.sqlite 'select start, end, text from NVTX_EVENTS limit 10'
start        end          text
-----------  -----------  ----------
13886331301  14953769147  forward
13886345012  13974259050  buf0
13974281040  14062378232  buf2
14062397372  14065141875  buf3
14065151855  14065428789  buf4_buf5_
14065438109  14154165230  buf10
14154195241  14154685368  buf12
14154702679  14158116132  buf13
14158125482  14158127282  buf14
14158128932  14158130202  buf15

This data makes it possible to pinpoint where time is spent during training.
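As a rough sketch of the offline analysis, the query below sums the duration of each annotation and lists the most expensive ones first. It relies only on the NVTX_EVENTS columns shown above (start, end, text). The function name slowest_ranges is hypothetical, and treating the timestamps as nanoseconds is an assumption.

```python
import sqlite3


def slowest_ranges(conn, limit=5):
    """Return (text, calls, total_duration) rows, longest total first.

    Hypothetical helper; durations are end - start in the trace's own
    time unit (assumed to be nanoseconds).
    """
    query = """
        SELECT text, COUNT(*) AS calls, SUM("end" - start) AS total
        FROM NVTX_EVENTS
        WHERE text IS NOT NULL AND "end" IS NOT NULL
        GROUP BY text
        ORDER BY total DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


def print_report(conn, limit=5):
    # Print one line per annotation: name, number of calls, total milliseconds.
    for text, calls, total in slowest_ranges(conn, limit):
        print(f"{text:<20} calls={calls:<6} total={total / 1e6:.3f} ms")
```

To run it against the export from the command above, open the file with sqlite3.connect("annotated_training.sqlite") and pass the connection to print_report. Note that an enclosing range such as forward includes the time of the buffer ranges nested inside it, so the totals are not additive across nesting levels.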

The PoC code is available here: Inductor annotations by AlexDenisov (Pull Request #130429, pytorch/pytorch on GitHub).

The RFC part of this post:

Would this feature be useful for the PyTorch community?

If that’s the case, I’d be happy to discuss what the implementation should look like to be accepted upstream, and to do the work needed to get it merged.


