How to profile PyTorch using Nsight Compute?

I am currently trying to profile the evaluation of RT-DETR (RT-DETR/rtdetr_pytorch at main · lyuwenyu/RT-DETR (github.com)) using Nsight Compute. However, the following problem occurs.

ncu -o ./rt_deter_profile --target-processes all  --replay-mode range torchrun --nproc_per_node=2 tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml -r rtdetr_r50vd_6x_coco_from_paddle.pth --test-only


==WARNING== Please consult the documentation for current range-based replay mode 
limitations and requirements.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
==PROF== Connected to process 596 (/home/oongjoon/anaconda3/envs/RT-DETR/bin/python3.11)
==PROF== Connected to process 595 (/home/oongjoon/anaconda3/envs/RT-DETR/bin/python3.11)
Initialized distributed mode...
Load PResNet50 state_dict
loading annotations into memory...
Done (t=0.39s)   
creating index...
index created!
resume from rtdetr_r50vd_6x_coco_from_paddle.pth
Loading ema.state_dict
Test:  [  0/313]  eta: 0:11:21    time: 2.1777  data: 0.7534  max mem: 1634
Test:  [ 10/313]  eta: 0:02:40    time: 0.5302  data: 0.0928  max mem: 1643
Test:  [ 20/313]  eta: 0:02:12    time: 0.3661  data: 0.0249  max mem: 1643
Test:  [ 30/313]  eta: 0:01:59    time: 0.3621  data: 0.0230  max mem: 1643
Test:  [ 40/313]  eta: 0:01:52    time: 0.3692  data: 0.0232  max mem: 1643
Test:  [ 50/313]  eta: 0:01:51    time: 0.4246  data: 0.0242  max mem: 1643
Test:  [ 60/313]  eta: 0:01:44    time: 0.4171  data: 0.0242  max mem: 1643
Test:  [ 70/313]  eta: 0:01:38    time: 0.3553  data: 0.0229  max mem: 1643
Test:  [ 80/313]  eta: 0:01:32    time: 0.3541  data: 0.0223  max mem: 1643
Test:  [ 90/313]  eta: 0:01:28    time: 0.3635  data: 0.0227  max mem: 1643
Test:  [100/313]  eta: 0:01:23    time: 0.3622  data: 0.0228  max mem: 1643
Test:  [110/313]  eta: 0:01:19    time: 0.3650  data: 0.0235  max mem: 1643
Test:  [120/313]  eta: 0:01:14    time: 0.3575  data: 0.0225  max mem: 1643
Test:  [130/313]  eta: 0:01:10    time: 0.3579  data: 0.0214  max mem: 1643
Test:  [140/313]  eta: 0:01:07    time: 0.4284  data: 0.0214  max mem: 1643
Test:  [150/313]  eta: 0:01:03    time: 0.4201  data: 0.0208  max mem: 1643
Test:  [160/313]  eta: 0:00:59    time: 0.3504  data: 0.0202  max mem: 1643
Test:  [170/313]  eta: 0:00:54    time: 0.3394  data: 0.0199  max mem: 1643
Test:  [180/313]  eta: 0:00:50    time: 0.3483  data: 0.0217  max mem: 1643
Test:  [190/313]  eta: 0:00:46    time: 0.3542  data: 0.0229  max mem: 1643
Test:  [200/313]  eta: 0:00:42    time: 0.3574  data: 0.0228  max mem: 1643
Test:  [210/313]  eta: 0:00:39    time: 0.3616  data: 0.0225  max mem: 1643
Test:  [220/313]  eta: 0:00:35    time: 0.3638  data: 0.0220  max mem: 1643
Test:  [230/313]  eta: 0:00:31    time: 0.3619  data: 0.0216  max mem: 1643
Test:  [240/313]  eta: 0:00:27    time: 0.4185  data: 0.0221  max mem: 1643
Test:  [250/313]  eta: 0:00:23    time: 0.4127  data: 0.0221  max mem: 1643
Test:  [260/313]  eta: 0:00:20    time: 0.3719  data: 0.0217  max mem: 1643
Test:  [270/313]  eta: 0:00:16    time: 0.4079  data: 0.0223  max mem: 1643
Test:  [280/313]  eta: 0:00:12    time: 0.3976  data: 0.0227  max mem: 1643
Test:  [290/313]  eta: 0:00:08    time: 0.3948  data: 0.0220  max mem: 1643
Test:  [300/313]  eta: 0:00:04    time: 0.4103  data: 0.0216  max mem: 1643
Test:  [310/313]  eta: 0:00:01    time: 0.4575  data: 0.0215  max mem: 1643
Test:  [312/313]  eta: 0:00:00    time: 0.5282  data: 0.0211  max mem: 1643
Test: Total time: 0:02:02 (0.3922 s / it)
Averaged stats:
Accumulating evaluation results...
DONE (t=10.45s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.531  
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.712  
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.577  
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.347  
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.577  
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.701  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.390  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.655  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.721  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.547  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.765  
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.881  
==PROF== Disconnected from process 595
==PROF== Disconnected from process 596
==WARNING== No ranges were profiled.

No kernels are profiled, and the run ends with the output ==WARNING== No ranges were profiled.
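
For what it's worth, my understanding from the Nsight Compute documentation is that in --replay-mode range only work enclosed in an explicit range is profiled, e.g. one delimited by the CUDA Profiler Start/Stop API (or NVTX). A minimal sketch of what that could look like around the forward pass (model and samples are placeholder names, not the actual RT-DETR identifiers):

import torch

torch.cuda.profiler.start()   # calls cudaProfilerStart(), opening the range
outputs = model(samples)      # placeholder forward pass to profile
torch.cuda.synchronize()      # wait for the kernels in the range to finish
torch.cuda.profiler.stop()    # calls cudaProfilerStop(), closing the range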

This is the driver version reported by nvidia-smi:


+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 561.09         CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   62C    P8             18W /  420W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  |   00000000:02:00.0 Off |                  N/A |
|  0%   43C    P8             19W /  420W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The Nsight Compute version is as follows:

NVIDIA (R) Nsight Compute Command Line Profiler
Copyright (c) 2018-2024 NVIDIA Corporation
Version 2024.3.2.0 (build 34861637) (public-release)

torch version: 2.0.1+cu117

Hey!

Did you have a chance to take a look at this great post on the topic: Using Nsight Systems to profile GPU workload?
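
For context, a typical Nsight Systems invocation along those lines might look like the sketch below (the script path and arguments are copied from your command; the trace selection is just an example):

nsys profile --trace=cuda,nvtx -o rt_detr_timeline python tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml -r rtdetr_r50vd_6x_coco_from_paddle.pth --test-only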

Hello @albanD

After referring to the above resources and other NVIDIA docs, I found that the problem occurs when doing distributed inference with torchrun. Therefore, I decided to profile with a single GPU. Thanks for your answer.
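
For anyone who hits the same issue, the single-GPU run is just the original command without torchrun (same paths and config; presumably the range demarcation caveat noted above still applies):

ncu -o ./rt_deter_profile --target-processes all --replay-mode range python tools/train.py -c configs/rtdetr/rtdetr_r50vd_6x_coco.yml -r rtdetr_r50vd_6x_coco_from_paddle.pth --test-only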
