Fast combined C++/Python/TorchScript/Inductor tracebacks

PyTorch combines many different languages, so tracebacks from only one language are difficult to interpret when debugging because much of the information is missing. For instance, a matmul will have a Python traceback in the forward pass, but the backward for the same operator exists only in C++.

Fast combined tracebacks show frames from all of our language technologies interleaved together in a single view of program state at the point the traceback is gathered (… marks places where I cut out some lines manually):

pytorch/c10/cuda/CUDACachingAllocator.cpp:872:c10::cuda::CUDACachingAllocator
pytorch/c10/cuda/CUDACachingAllocator.cpp:2400:c10::cuda::CUDACachingAllocato
pytorch/c10/cuda/CUDACachingAllocator.cpp:2574:c10::cuda::CUDACachingAllocato
pytorch/c10/core/StorageImpl.h:66:c10::StorageImpl::StorageImpl(c10::StorageI
... ^-- C++ frames, v-- Python Frames
pytorch/torch/utils/_device.py:63:__torch_function__
pytorch/torch/csrc/dynamo/eval_frame.c:370:eval_frame_default
pytorch/torch/csrc/dynamo/eval_frame.c:642:_custom_eval_frame_shim
pytorch/torch/csrc/dynamo/eval_frame.c:351:custom_eval_frame_shim
... ^-- dynamo extension frames, v-- inductor frames, including original gene
/tmp/torchinductor_eellison/qc/cqcj3npdlbtx7nexsinkq576bfwjc4u4yx7yns7gidaaev
transformers/models/albert/modeling_albert.py:410:ff_chunk
transformers/pytorch_utils.py:246:apply_chunking_to_forward
transformers/models/albert/modeling_albert.py:397:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:432:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:480:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:733:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:1281:forward
pytorch/torch/csrc/dynamo/eval_frame.c:370:eval_frame_default
pytorch/torch/csrc/dynamo/eval_frame.c:642:_custom_eval_frame_shim
pytorch/torch/csrc/dynamo/eval_frame.c:351:custom_eval_frame_shim
... v-- C++ frames from the backwards worker thread.
pytorch/torch/csrc/autograd/python_function.cpp:108:torch::autograd::PyNode::
pytorch/torch/csrc/autograd/function.h:184:torch::autograd::Node::operator()(
pytorch/torch/csrc/autograd/engine.cpp:919:torch::autograd::call_function(std
pytorch/torch/csrc/autograd/engine.cpp:984:torch::autograd::Engine::evaluate_
pytorch/torch/csrc/autograd/engine.cpp:545:torch::autograd::Engine::thread_ma
pytorch/torch/csrc/autograd/engine.cpp:357:torch::autograd::Engine::thread_in
pytorch/torch/csrc/autograd/python_engine.cpp:83:torch::autograd::python::Pyt
...

Unified View: Frames from Python, C++, TorchScript, and PyTorch 2.0’s inductor all appear in a single traceback with function names, filenames, and line numbers. Tools that use combined tracebacks do not need to know which language technology is being executed.

Fast: Gathering combined tracebacks is extremely fast. For instance, C++ tracebacks can be gathered at about 50ns/frame, or about 2us for a full stack as seen in PyTorch executions. This is ~10x faster than the built-in backtrace unwinder due to a cached implementation of DWARF unwinding. It is on the same order of magnitude as Linux’s perf tools but uses an order of magnitude less RAM since it does not have to copy the stack. Full Python stacks are gathered even faster. The couple-microsecond speed is important because, for instance, it makes it practical to record a traceback for every CUDA memory allocation without measurable increases in overhead.

Memory Snapshots

Interactive Viewer

Fast combined tracebacks are now enabled by default in CUDA memory snapshots. This is particularly useful for debugging memory allocations that occur in the backwards pass, which have no Python frames, or when using TorchScript.

Combined tracebacks are also available as a simple library that can be used to build additional Python or C++ tools:

From Python

# gather a combined traceback

from torch._C._profiler import gather_traceback, symbolize_tracebacks
from pprint import pprint

tb0 = gather_traceback(python=True, script=True, cpp=True) 
tb1 = gather_traceback(python=True, script=True, cpp=True) 

# symbolize takes a couple of seconds to read C++ debug info,
# so the API is batched: tools can symbolize many gathered
# tracebacks at once
symbolized_tracebacks = symbolize_tracebacks([tb0, tb1])

pprint(symbolized_tracebacks[0])
# ... example lines
[{'filename': '/raid/zdevito/pytorch/torch/csrc/profiler/unwind/unwind.cpp',
  'line': 293,
  'name': 'torch::unwind::unwind()'},
 ...
 {'filename': '/raid/zdevito/pytorch/testit.py', 'line': 7, 'name': '<module>'},
 ...
 {'filename': '??', 'line': 0, 'name': '_start'}]

From C++

#include <torch/csrc/profiler/combined_traceback.h>
...

std::shared_ptr<torch::CapturedTraceback> tb0 = 
  torch::CapturedTraceback::gather(/*python=*/true, /*script=*/true, /*cpp=*/true);
  
std::shared_ptr<torch::CapturedTraceback> tb1 = 
  torch::CapturedTraceback::gather(/*python=*/true, /*script=*/true, /*cpp=*/true);

torch::SymbolizedTracebacks r = torch::symbolize({tb0.get(), tb1.get()});

Tracebacks gathered from Python can be symbolized in C++ and vice-versa. We make calls to addr2line to get the most accurate file and line information possible.
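Since the symbolized output is just a list of frame dicts with `filename`, `line`, and `name` keys (as in the pprint output above), rendering it for a tool is straightforward. A small stdlib-only sketch, using made-up frame data that mirrors the shape of the example output:

```python
# Render symbolized frames (filename/line/name dicts, the shape shown
# in the pprint example above) into one frame per line.
# The frame data below is illustrative, not real symbolizer output.
frames = [
    {'filename': 'pytorch/torch/csrc/profiler/unwind/unwind.cpp',
     'line': 293, 'name': 'torch::unwind::unwind()'},
    {'filename': 'testit.py', 'line': 7, 'name': '<module>'},
    {'filename': '??', 'line': 0, 'name': '_start'},
]

def format_frames(frames):
    return "\n".join(f"{f['filename']}:{f['line']}:{f['name']}" for f in frames)

print(format_frames(frames))
```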


Thanks! This is really cool.

I wonder why we need to call torch::CapturedTraceback::gather twice with the same arguments.

The 2 calls are just to demonstrate that symbolize can take multiple tracebacks at a time (because it is a slow process to read the debug information). In real use, only one call is needed.


Hi,

This is really interesting.
I am unable to use this for pytorch 2.0.0 CPU versions.
Can you please provide more details on how to enable these tracebacks generally for any code and get all the call logs in one place?

Thanks


Here is a small microbenchmark that shows how fast gather_traceback is:

(/home/ezyang/local/b/pytorch-env) [ezyang@devgpu005.nha1 ~/local/b/pytorch (3577ae3e)]$ python a.py
0.01536102406680584
0.21078353887423873
(/home/ezyang/local/b/pytorch-env) [ezyang@devgpu005.nha1 ~/local/b/pytorch (3577ae3e)]$ cat a.py
import torch
import timeit
import traceback

from torch._C._profiler import gather_traceback, symbolize_tracebacks

def f():
    print(timeit.timeit("gather_traceback(python=True, script=False, cpp=False)", globals=globals(), number=10000))
    print(timeit.timeit("traceback.extract_stack()", globals=globals(), number=10000))

def g():
    f()

def h():
    g()

h()
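Dividing those timeit totals by the 10,000 iterations gives the per-call cost, which works out to roughly 1.5us per gather_traceback call (Python frames only) versus about 21us for traceback.extract_stack:

```python
# Per-call cost derived from the timeit totals printed above
# (each measured over 10,000 iterations).
gather_total = 0.01536102406680584   # gather_traceback (python=True only)
extract_total = 0.21078353887423873  # traceback.extract_stack

per_call_us = [t / 10_000 * 1e6 for t in (gather_total, extract_total)]
print(f"gather_traceback:        {per_call_us[0]:.2f} us/call")   # ~1.54
print(f"traceback.extract_stack: {per_call_us[1]:.2f} us/call")   # ~21.08
print(f"speedup: {extract_total / gather_total:.1f}x")            # ~13.7x
```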