Fast combined C++/Python/TorchScript/Inductor tracebacks
PyTorch combines many different language technologies, so a traceback from only one of them is difficult to interpret when debugging because a lot of information is missing. For instance, a matmul will have a Python traceback in the forward pass, but the backward for the same operator exists only in C++.
Fast combined tracebacks show frames from all of our language technologies interleaved together in a single view of program state at the point the traceback is gathered (… marks places where I cut out some lines manually):
pytorch/c10/cuda/CUDACachingAllocator.cpp:872:c10::cuda::CUDACachingAllocator
pytorch/c10/cuda/CUDACachingAllocator.cpp:2400:c10::cuda::CUDACachingAllocato
pytorch/c10/cuda/CUDACachingAllocator.cpp:2574:c10::cuda::CUDACachingAllocato
pytorch/c10/core/StorageImpl.h:66:c10::StorageImpl::StorageImpl(c10::StorageI
... ^-- C++ frames, v-- Python Frames
pytorch/torch/utils/_device.py:63:__torch_function__
pytorch/torch/csrc/dynamo/eval_frame.c:370:eval_frame_default
pytorch/torch/csrc/dynamo/eval_frame.c:642:_custom_eval_frame_shim
pytorch/torch/csrc/dynamo/eval_frame.c:351:custom_eval_frame_shim
... ^-- dynamo extension frames, v-- inductor frames, including original gene
/tmp/torchinductor_eellison/qc/cqcj3npdlbtx7nexsinkq576bfwjc4u4yx7yns7gidaaev
transformers/models/albert/modeling_albert.py:410:ff_chunk
transformers/pytorch_utils.py:246:apply_chunking_to_forward
transformers/models/albert/modeling_albert.py:397:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:432:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:480:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:733:forward
pytorch/torch/nn/modules/module.py:1501:_call_impl
transformers/models/albert/modeling_albert.py:1281:forward
pytorch/torch/csrc/dynamo/eval_frame.c:370:eval_frame_default
pytorch/torch/csrc/dynamo/eval_frame.c:642:_custom_eval_frame_shim
pytorch/torch/csrc/dynamo/eval_frame.c:351:custom_eval_frame_shim
... v-- C++ frames from the backwards worker thread.
pytorch/torch/csrc/autograd/python_function.cpp:108:torch::autograd::PyNode::
pytorch/torch/csrc/autograd/function.h:184:torch::autograd::Node::operator()(
pytorch/torch/csrc/autograd/engine.cpp:919:torch::autograd::call_function(std
pytorch/torch/csrc/autograd/engine.cpp:984:torch::autograd::Engine::evaluate_
pytorch/torch/csrc/autograd/engine.cpp:545:torch::autograd::Engine::thread_ma
pytorch/torch/csrc/autograd/engine.cpp:357:torch::autograd::Engine::thread_in
pytorch/torch/csrc/autograd/python_engine.cpp:83:torch::autograd::python::Pyt
...
Unified View: Frames from Python, C++, TorchScript, and PyTorch 2.0’s inductor all appear in a single traceback with function names, filenames, and line numbers. Tools that use combined tracebacks do not need to know which language technology is being executed.
Fast: Gathering combined tracebacks is extremely fast. For instance, C++ tracebacks can be gathered at about 50ns/frame, or about 2us for a full stack as typically seen in PyTorch executions. This is ~10x faster than the built-in backtrace unwinder, due to a cached implementation of DWARF unwinding. It is on the same order of magnitude as Linux’s perf tools but uses an order of magnitude less RAM since it does not have to copy the stack. Full Python stacks are gathered even faster. The couple-microsecond speed is important because, for instance, it makes it practical to record a traceback for every CUDA memory allocation without measurable increases in overhead.
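If you want to check the per-call cost on your own machine, a quick micro-benchmark is enough (a sketch; the numbers depend on stack depth and on what debug info is available):

import time
from torch._C._profiler import gather_traceback

N = 10_000
start = time.perf_counter()
for _ in range(N):
    gather_traceback(python=True, script=True, cpp=True)
elapsed = time.perf_counter() - start
# gathering is the cheap part; symbolization is deferred until requested
print(f"{elapsed / N * 1e6:.2f} us per combined traceback")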
Memory Snapshots
Fast combined tracebacks are now enabled by default in CUDA memory snapshots. This is particularly useful for debugging memory allocations that occur in the backward pass, which have no Python frames, or when using TorchScript.
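To try this out, something like the following records allocation tracebacks and grabs a snapshot (a sketch using the private _record_memory_history/_snapshot APIs; their argument names have shifted across PyTorch releases, so check your version’s signature):

import torch

# start recording a traceback for every allocation the caching allocator makes
torch.cuda.memory._record_memory_history()

x = torch.randn(1024, 1024, device="cuda", requires_grad=True)
y = (x @ x).sum()
y.backward()  # backward-pass allocations get C++ frames in their tracebacks

# the snapshot is a plain dict of segments/blocks, each annotated with frames
snapshot = torch.cuda.memory._snapshot()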
Combined tracebacks are also available as a simple library that can be used to build additional Python or C++ tools:
From Python
# gather a combined traceback
from torch._C._profiler import gather_traceback, symbolize_tracebacks
from pprint import pprint
tb0 = gather_traceback(python=True, script=True, cpp=True)
tb1 = gather_traceback(python=True, script=True, cpp=True)
# symbolize takes a couple seconds to read C++ debug info
# so the API is batched so that tools can symbolize a bunch
# of gathered tracebacks at once
symbolized_tracebacks = symbolize_tracebacks([tb0, tb1])
pprint(symbolized_tracebacks[0])
# ... a few example lines of the output:
[{'filename': '/raid/zdevito/pytorch/torch/csrc/profiler/unwind/unwind.cpp',
'line': 293,
'name': 'torch::unwind::unwind()'},
...
{'filename': '/raid/zdevito/pytorch/testit.py', 'line': 7, 'name': '<module>'},
...
{'filename': '??', 'line': 0, 'name': '_start'}]
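Since the symbolized frames are plain dicts, rendering the filename:line:name view shown at the top of this post is a one-liner (a sketch):

def format_traceback(frames):
    # each frame is a dict with 'filename', 'line', and 'name' keys
    return "\n".join(f"{f['filename']}:{f['line']}:{f['name']}" for f in frames)

print(format_traceback(symbolized_tracebacks[0]))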
From C++
#include <torch/csrc/profiler/combined_traceback.h>
...
std::shared_ptr<torch::CapturedTraceback> tb0 =
    torch::CapturedTraceback::gather(/*python=*/true, /*script=*/true, /*cpp=*/true);
std::shared_ptr<torch::CapturedTraceback> tb1 =
    torch::CapturedTraceback::gather(/*python=*/true, /*script=*/true, /*cpp=*/true);
torch::SymbolizedTracebacks r = torch::symbolize({tb0.get(), tb1.get()});
Tracebacks gathered from Python can be symbolized in C++ and vice versa. Symbolization makes calls to addr2line to get the most accurate file and line information possible.