Debugging segfault by going back in time

Intro

While enabling new CPython versions or debugging some of the more intricate PyTorch code, we often end up with a segfault or a crash in C++.
The best tool for debugging these issues is gdb, which lets us inspect the program state at the time of the error. It is an amazing tool, but we are often left looking at an object in a corrupted or inconsistent state that should never have occurred, with no indication of how it got there.

While looking into one such issue, @colesbury introduced me to rr (https://rr-project.org/).
This tool changes your gdb workflow: you first record the run, then replay it under gdb, where the extra “reverse-continue” command (shortcut: rc) lets you move backward in time!

Example

While writing the PR “Update to compile and run test_torch with 3.13t” (pytorch/pytorch #130689), I hit an issue where python -Xgil=1 test_serialization.py would run just fine but segfault at the very end.
Since this is a C++ failure, gdb is the right tool to inspect the state. Unfortunately, in this case the objects were indeed in an invalid state (see below).

So I turned to rr to investigate.
In all the snippets below, I use bracketed notes such as [Omitted …] to shorten the output.

First, let’s record the failure:

$ rr record -v LD_PRELOAD=libstdc++.so.6 python -Xgil=1 test_serialization.py 
[Omitted prints and warnings from the tests]
...........
----------------------------------------------------------------------
Ran 135 tests in 21.060s

OK (skipped=7)
Segmentation fault

The -v option sets an environment variable for the recorded process. We preload libstdc++ because PyTorch loads it in a local namespace, which prevents rr from accessing it.

Then we can replay it to reach the failure again:

$ rr replay
GNU gdb (Fedora Linux) 14.2-1.fc39
[Omitted debug symbols loading]
0x00007f7b087e8bc0 in _start () from /lib64/ld-linux-x86-64.so.2            
(rr) # Load gdb CPython script to be able to inspect objects
(rr) source /home/albandes/local/installs/python3.13.0a6/nogil/source/python-gdb.py 
(rr) c
[Omitted prints and warnings from the tests (exact same as the ones above)]
...........
----------------------------------------------------------------------
Ran 135 tests in 21.060s

OK (skipped=7)
[New Thread 578664.578667]
[New Thread 578664.578668]
[New Thread 578664.578669]
[New Thread 578664.578670]
[New Thread 578664.578671]
[New Thread 578664.578672]
[New Thread 578664.578673]
[New Thread 578664.578674]
[New Thread 578664.578675]
[New Thread 578664.578676]
[New Thread 578664.578677]
[New Thread 578664.578678]
[New Thread 578664.578679]

Thread 1 received signal SIGSEGV, Segmentation fault.
_Py_DECREF_DecRefTotal () at Objects/object.c:238
238	    reftotal_add(_PyThreadState_GET(), -1);
(rr) 

We are now exactly where you would be with a regular gdb run.

Let’s inspect the stack trace:

(rr) bt 3
#0  _Py_DECREF_DecRefTotal () at Objects/object.c:238
#1  0x00007f7afa5daf5c in Py_DECREF (op=<function at remote 0x20003302710>, lineno=1036, 
    filename=0x7f7afa63f000 "/home/albandes/local/installs/python3.13.0a6/nogil/install/include/python3.13td/object.h") at /home/albandes/local/installs/python3.13.0a6/nogil/install/include/python3.13td/object.h:878
#2  Py_XDECREF (op=<function at remote 0x20003302710>)
    at /home/albandes/local/installs/python3.13.0a6/nogil/install/include/python3.13td/object.h:1036
#3  Py_XDECREF (op=<function at remote 0x20003302710>)
    at /home/albandes/local/installs/python3.13.0a6/nogil/install/include/python3.13td/object.h:1033
#4  pybind11::handle::dec_ref() const & (this=<optimized out>)
    at /home/albandes/local/pytorch/3.13.0a6_nogil_source/third_party/pybind11/include/pybind11/pytypes.h:282
#5  pybind11::object::~object (this=<optimized out>, __in_chrg=<optimized out>)
    at /home/albandes/local/pytorch/3.13.0a6_nogil_source/third_party/pybind11/include/pybind11/pytypes.h:378
#6  pybind11::detail::accessor<pybind11::detail::accessor_policies::str_attr>::~accessor (this=<optimized out>, 
    __in_chrg=<optimized out>)
    at /home/albandes/local/pytorch/3.13.0a6_nogil_source/third_party/pybind11/include/pybind11/pytypes.h:1016
#7  0x00007f7b0825efd6 in __run_exit_handlers (status=status@entry=0, listp=<optimized out>, 
    run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:111
#8  0x00007f7b0825f11e in __GI_exit (status=status@entry=0) at exit.c:141
#9  0x00000000006391d9 in Py_Exit (sts=0) at Python/pylifecycle.c:3396
#10 0x000000000063f531 in handle_system_exit () at Python/pythonrun.c:604
(More stack frames follow...)

So we are exiting the CPython runtime, running the C++ exit handlers (which destroy the static variables), and destroying a pybind11 object there.

Now we can look at which object is problematic:

(rr) # Select a frame where the op object is available
(rr) f 2
(rr) # Print the object and try to find more info about it
(rr) p op
$1 = <function at remote 0x20003302710>
(rr) p ((PyFunctionObject*)op)->func_doc
$5 = None
(rr) p ((PyFunctionObject*)op)->func_name
$2 = <unknown at remote 0x20003f14290>
(rr) p ((PyFunctionObject*)op)->func_name->ob_type
$3 = (PyTypeObject *) 0xdddddddddddddddd

So we have a Python function, but it has no docstring and the type pointer of its name looks very suspicious. Note that we would expect the name’s type to be “str”.
A quick look at Objects/object.c in python/cpython (at commit d005f2c1861dbf0ab3d9f80b54d05d0c0b522c3c) shows that this special address is used to tag memory that has already been de-allocated.
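As a rough illustration of that check, here is a minimal Python sketch that tests whether a pointer value is made up entirely of one of the fill bytes written by CPython’s debug memory hooks (0xCD for newly allocated memory, 0xDD for freed memory, 0xFD for the guard bytes around a block). The helper name classify_pointer is hypothetical and is not a CPython API; only the byte values come from the CPython documentation.

```python
# Fill bytes documented for CPython's debug memory hooks.
DEBUG_FILL_BYTES = {
    0xCD: "newly allocated, never initialized",
    0xDD: "freed",
    0xFD: "guard bytes around a block",
}


def classify_pointer(value: int, pointer_size: int = 8):
    """Return a description if every byte of `value` is the same debug fill byte.

    Hypothetical helper for illustration; returns None for ordinary pointers.
    """
    raw = value.to_bytes(pointer_size, "little")
    first = raw[0]
    if first in DEBUG_FILL_BYTES and all(b == first for b in raw):
        return DEBUG_FILL_BYTES[first]
    return None


# The ob_type value printed by gdb above.
print(classify_pointer(0xDDDDDDDDDDDDDDDD))
# The address of PyFunction_Type from the same session, an ordinary pointer.
print(classify_pointer(0x928320))
```

The first call reports “freed” and the second returns None, which matches the reading above: the name object’s memory was released before we got to it.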

This starts to look like a function object that outlived the Python runtime (the string type has already been destroyed), yet we are still trying to properly delete it. This is a common problem that we have hit before in PyTorch; see, for example, the PR “protect destructors of python bindings that can be kept alive by c++ objects” (pytorch/pytorch #57488).

Unfortunately, this one is not as simple: the type is not a custom one as in the previous cases within PyTorch, and it could be any pybind11 object stored in a static variable.

Let’s use our time travelling skills to find the object

A good way to figure out which object this is, is to find where it was allocated. To do so, we can watch the memory associated with the object and see who touches it.
We expect quite a lot of code to touch this object, though, so instead we are going to watch the memory holding its type field, as this field is usually set once and never changed.

Note that below we use the raw address stored in “op” rather than the name “op” itself: we do not want to watch the variable called “op” in the current scope, but anything that touches this address.

(rr) watch *(PyObject**)(((PyObject*)0x20003302710)->ob_type)
Hardware watchpoint 3: *(PyObject**)(((PyObject*)0x20003302710)->ob_type)
(rr) rc
[Omitted prints and warnings from the tests again (I am not sure if it is same order or reverse tbh)]


Thread 1 hit Hardware watchpoint 3: *(PyObject**)(((PyObject*)0x20003302710)->ob_type)

Old value = 0x0
New value = <unreadable>
_PyObject_Init (op=op@entry=<unknown at remote 0x20003302710>, typeobj=typeobj@entry=0x928320 <PyFunction_Type>)
    at ./Include/internal/pycore_object.h:283
283	    Py_SET_TYPE(op, typeobj);

Looks like we are indeed at the spot where the function was created:

(rr) bt
#0  _PyObject_Init (op=op@entry=<unknown at remote 0x20003302710>, typeobj=typeobj@entry=0x928320 <PyFunction_Type>)
    at ./Include/internal/pycore_object.h:283
#1  0x000000000060510d in _PyObject_GC_New (tp=tp@entry=0x928320 <PyFunction_Type>)
    at Python/gc_free_threading.c:1677
#2  0x00000000004d5b2b in PyFunction_NewWithQualName (code=code@entry=<code at remote 0x20003560190>, 
    globals={'__name__': 'torch._library.simple_registry', '__doc__': None, '__package__': 'torch._library', '__loader__': <SourceFileLoader(name='torch._library.simple_registry', path='/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/simple_registry.py') at remote 0x2000089f1a0>, '__spec__': <ModuleSpec(name='torch._library.simple_registry', loader=<...>, origin='/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/simple_registry.py', loader_state=None, submodule_search_locations=None, _uninitialized_submodules=[], _set_fileattr=True, _cached='/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/__pycache__/simple_registry.cpython-313.pyc', _initializing=True) at remote 0x20002f27a60>, '__file__': '/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/simple_registry.py', '__cached__': '/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/__pycache__/simple_registry.cpython-313.pyc', '__builtins__': {'__name__': 'builtins', '__doc__': "Built-in functio...(truncated), 
    qualname=<optimized out>, qualname@entry=0x0) at Objects/funcobject.c:185
#3  0x00000000004d632c in PyFunction_New (code=code@entry=<code at remote 0x20003560190>, globals=<optimized out>)
    at Objects/funcobject.c:370
#4  0x00000000005d6e5b in _PyEval_EvalFrameDefault (tstate=0x9bab80 <_PyRuntime+417152>, frame=0x7f7b087a7860, 
    throwflag=0) at Python/generated_cases.c.h:4784
[Omitted remaining stack frames]

This is not within PyTorch C++ code but within the CPython interpreter, so the object was created by some Python file.
Let’s use the CPython gdb plugin to find out which:

(rr) py-bt
Traceback (most recent call first):
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/simple_registry.py", line 81, in <module>
    def find_torch_dispatch_rule(op, torch_dispatch_class: type) -> Optional[Callable]:
  [Omitted importlib lines]
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/_library/__init__.py", line 3, in <module>
    import torch._library.simple_registry
  [Omitted importlib lines]
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/library.py", line 13, in <module>
    import torch._library as _library
  [Omitted importlib lines]
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/backends/mps/__init__.py", line 7, in <module>
    from ...library import Library as _Library
  [Omitted importlib lines]
  <built-in method __import__ of module object at remote 0x200003900f0>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1415, in _handle_fromlist
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/backends/__init__.py", line 59, in <module>
    from torch.backends import (
  [Omitted importlib lines]
  <built-in method __import__ of module object at remote 0x200003900f0>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/nn/attention/__init__.py", line 8, in <module>
    from torch.backends.cuda import (
  [Omitted importlib lines]
  <built-in method __import__ of module object at remote 0x200003900f0>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1415, in _handle_fromlist
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/nn/__init__.py", line 8, in <module>
    from torch.nn import (
  [Omitted importlib lines]
  <built-in method __import__ of module object at remote 0x200003900f0>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1310, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/functional.py", line 7, in <module>
    import torch.nn.functional as F
  [Omitted importlib lines]
  <built-in method __import__ of module object at remote 0x200003900f0>
  File "<frozen importlib._bootstrap>", line 488, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1415, in _handle_fromlist
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/torch/__init__.py", line ?, in <module>
    (failed to get frame line number)
  [Omitted importlib lines]
  File "/home/albandes/local/pytorch/3.13.0a6_nogil_source/test/test_serialization.py", line 21, in <module>
    import torch

So the function in question is “find_torch_dispatch_rule”, defined in torch/_library/simple_registry.py.
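The C-level frames seen earlier (PyFunction_New called from _PyEval_EvalFrameDefault) are, as far as I can tell, simply what a module-level def statement looks like from gdb: running the module body executes a MAKE_FUNCTION instruction, and that is when the function object gets allocated. A small sketch of this, using a hypothetical one-function module in place of simple_registry.py:

```python
import dis
import types

# Hypothetical one-function module, standing in for simple_registry.py.
SOURCE = "def find_rule(op):\n    return None\n"

module_code = compile(SOURCE, "<hypothetical_module>", "exec")

# The compiled module body contains a MAKE_FUNCTION instruction for the def.
opnames = [ins.opname for ins in dis.get_instructions(module_code)]
has_make_function = "MAKE_FUNCTION" in opnames

# Running the module body is what actually allocates the function object.
namespace = {}
exec(module_code, namespace)
fn = namespace["find_rule"]

print(has_make_function, isinstance(fn, types.FunctionType))
# The code object remembers the file and line of the def statement,
# which is the same information py-bt surfaced from inside gdb.
print(fn.__code__.co_filename, fn.__code__.co_firstlineno)
```

This is why the watchpoint landed in the middle of an import: the function did not exist until the interpreter reached its def line.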

The answer

What is special about this function? A quick search for that name points to torch/csrc/utils/python_arg_parser.cpp (pytorch/pytorch at commit 774ca93fd2b99808761010528267ef14fa816bda), in code added by @zou3519 as a way to speed up library lookup.
It does indeed stash this Python function in a C++ static variable, making it outlive the CPython runtime!

How do we fix that? Well, it turns out that @XuehaiPan already has a PR for this: “Fix static `py::object` dangling pointer with `py::gil_safe_call_once_and_store`” (pytorch/pytorch #130341). The fix is to “leak” this object: it still outlives the CPython runtime, but we never de-allocate it (by making sure its refcount never reaches 0), so its destructor never runs against the torn-down runtime and all is good!
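The actual fix is in C++ and lives in that PR. As a loose Python-level analogy only, the sketch below shows what “leaking” means in refcount terms: it takes one extra reference through ctypes.pythonapi.Py_IncRef and never releases it, so the object is never finalized even once every Python name for it is gone. The Resource class and the leak and survives_after_del helpers are hypothetical names for illustration, not part of PyTorch or pybind11, and the sketch assumes CPython.

```python
import ctypes
import gc
import weakref

# Py_IncRef is part of the CPython C API; declare its signature for ctypes.
ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_IncRef.restype = None


class Resource:
    """Hypothetical stand-in for an object that native code holds on to."""


def leak(obj):
    """Take one reference that is never released, so obj is never deallocated."""
    ctypes.pythonapi.Py_IncRef(obj)


def survives_after_del(leak_it: bool) -> bool:
    """Report whether a Resource is still alive after its last name is deleted."""
    obj = Resource()
    probe = weakref.ref(obj)  # observes deallocation without owning obj
    if leak_it:
        leak(obj)
    del obj
    gc.collect()
    return probe() is not None


print(survives_after_del(leak_it=False))
print(survives_after_del(leak_it=True))
```

On CPython the first call prints False (the object dies with its last reference) and the second prints True (the leaked reference keeps it alive for the rest of the process). The trade-off is the same as in the C++ fix: a small amount of memory is never reclaimed, in exchange for never running a destructor at an unsafe time.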

Conclusion

I have a few things to say to conclude.
First, if you keep a Python object alive from C++, make sure that it either dies before the Python runtime does or is leaked and never deleted.
Second, mixing Python and C++ is a tricky alchemy, and even the simplest code can have unexpected consequences.
Third, being able to go both forward and backward when debugging is an extremely powerful concept: it lets you quickly make sense of a corrupted or inconsistent environment.
And finally, rr is very much a unique tool that is going to save you a lot of time if you’re debugging relatively low-level code!

I will finish with a question: could we do the same with a pure Python program?
After all, in this example, we did go back in time in a Python program. I think the missing piece is the gdb integration needed to put breakpoints in Python code directly and to watch Python variables. It sounds feasible in theory, but I haven’t seen it yet. Let me know if you know of someone who did it!
