Hi folks,
I’ve been looking at nsys traces for end-to-end training runs, and at some point, as the trace size grows, it becomes infeasible to use Nsight Systems.
So I hacked together a simple solution based on NVTX annotations that can be consumed both through the GUI and offline, e.g. by querying the exported sqlite database.
Here is an example of the usage:
```
sudo nsys profile \
  --trace=cuda,nvtx \
  --sample=process-tree --export sqlite \
  -e TORCHINDUCTOR_ANNOTATE_TRAINING=1,TORCHINDUCTOR_ANNOTATE_BUFFERS=1 \
  --output annotated_training --force-overwrite true \
  $(which python) benchmark.py
```
This run produces two artifacts, `annotated_training.nsys-rep` and `annotated_training.sqlite`, which include NVTX annotations for the whole training passes (e.g. `forward`, `backward`) together with annotations for each computed buffer (e.g. `forward_buf0`, `backward_buf42`).
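To illustrate the structure of these annotations: each pass range encloses the per-buffer ranges, which is why the buffer timestamps in the sqlite output below fall inside the `forward` span. The following is a minimal, self-contained sketch of that nesting; the `nvtx_range` helper and the collected `events` list are illustrative stand-ins (the real generated code would use `torch.cuda.nvtx.range_push`/`range_pop`, and the timestamps would come from the driver).

```python
# Illustrative sketch of nested NVTX-style ranges, NOT the actual
# Inductor codegen: `nvtx_range` and `events` are hypothetical names.
from contextlib import contextmanager
import time

events = []  # collected (start_ns, end_ns, text), mirroring NVTX_EVENTS

@contextmanager
def nvtx_range(text):
    # The real annotations would call torch.cuda.nvtx.range_push(text)
    # here and torch.cuda.nvtx.range_pop() in the finally block.
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        events.append((start, time.perf_counter_ns(), text))

with nvtx_range("forward"):          # whole-pass annotation
    with nvtx_range("forward_buf0"): # per-buffer annotation
        pass  # buf0 computation would run here
```

Because the inner range closes first, `forward_buf0` is recorded before `forward`, and its interval is contained in the `forward` interval, matching the layout of the rows shown below.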
In Nsight Systems the annotations show up as NVTX ranges on the timeline. With the sqlite database, the same data can be queried directly:
```
> sqlite3 annotated_training.sqlite 'select start, end, text from NVTX_EVENTS limit 10'
start        end          text
-----------  -----------  ----------
13886331301  14953769147  forward
13886345012  13974259050  buf0
13974281040  14062378232  buf2
14062397372  14065141875  buf3
14065151855  14065428789  buf4_buf5_
14065438109  14154165230  buf10
14154195241  14154685368  buf12
14154702679  14158116132  buf13
14158125482  14158127282  buf14
14158128932  14158130202  buf15
```
This data makes it possible to pinpoint exactly where time is spent during training.
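For example, one could aggregate the per-range durations offline with a few lines of Python. The sketch below fills an in-memory database with a subset of the rows from the output above so it is self-contained; for a real trace, point `sqlite3.connect` at `annotated_training.sqlite` instead (the assumed `NVTX_EVENTS` schema is just the `start`, `end`, `text` columns used in the query above).

```python
# Offline analysis sketch: wall time per NVTX range, longest first.
# Uses sample rows from the post; replace ":memory:" with
# "annotated_training.sqlite" to analyze a real export.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE NVTX_EVENTS (start INTEGER, [end] INTEGER, text TEXT)")
rows = [
    (13886331301, 14953769147, "forward"),
    (13886345012, 13974259050, "buf0"),
    (13974281040, 14062378232, "buf2"),
    (14062397372, 14065141875, "buf3"),
]
con.executemany("INSERT INTO NVTX_EVENTS VALUES (?, ?, ?)", rows)

# Duration in milliseconds per annotation ([end] is quoted because
# END is a reserved word in SQLite).
for text, ms in con.execute(
    "SELECT text, ([end] - start) / 1e6 FROM NVTX_EVENTS "
    "ORDER BY [end] - start DESC"
):
    print(f"{text:10s} {ms:10.1f} ms")
```

The same query against a full trace immediately surfaces the most expensive buffers, which is exactly the kind of analysis that becomes painful in the GUI once the trace grows large.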
The PoC code is available here: Inductor annotations by AlexDenisov · Pull Request #130429 · pytorch/pytorch · GitHub
The RFC part of this post:
Would this feature be useful for the PyTorch community?
If so, I’d be happy to discuss what the implementation should look like to be included upstream, and to do the work needed to get it merged.
Thanks,
Alex.