Fixing torch.compile Reference Leaks (Automatic Deletion of Dynamo Code Objects)

We recently fixed several torch.compile refleaks (reference leaks) as part of improving Dynamo reliability. Refleaks were reported in issues such as "CUDA Memory leak w/ torch.compile in both stable and trunk" (pytorch/pytorch issue #119607) and "Memory leak in torch.compile after aborting backward" (pytorch/pytorch issue #112090).

With these torch.compile refleaks fixed, users can now sequentially compile and run more than one large model per process without having to resort to tricks like torch.compiler.reset()/torch._dynamo.reset(), which resets the entire Dynamo state and forces recompilations. We also added refleak tests to prevent developers from introducing new refleaks.

The remaining known refleaks are:

Debugging torch.compile Reference Leaks

Weakrefs (weak references) are crucial in detecting and debugging reference leaks. In contrast to regular (or strong) references, weakrefs are non-owning references to objects - they do not increase an object’s reference count. A weakref’s object can be directly accessed (this creates a strong reference) by calling the weakref. When the underlying object is freed by Python, any finalizer functions registered to the weakref are called, and calling the weakref from then on returns None.
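As a minimal illustration of the weakref behavior described above, here is a sketch without torch. The `Layer` class and the `events` list are hypothetical names used only for this example.

```python
import gc
import weakref


class Layer:
    """Hypothetical stand-in for a user object such as an nn.Module."""


events = []
layer = Layer()

# The callback runs when the referent is freed.
ref = weakref.ref(layer, lambda _: events.append("freed"))

# Calling the weakref returns a strong reference while the object is alive.
strong = ref()
same_object = strong is layer

# Drop every strong reference; the weakref alone does not keep the object alive.
del strong
del layer
gc.collect()

dead = ref() is None
print(same_object, dead, events)
```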

The first debugging approach is to use a weakref on the leaked object and to call gc.get_referrers() repeatedly until we encounter an object we suspect is holding on to a reference when it should not be.

import gc
import weakref

import torch

mod = MyModel()  # placeholder: the user's nn.Module
# watch mod itself, or something in mod, e.g. mod.sublayer
ref = weakref.ref(mod, lambda _: print("obj freed!"))
opt_mod = torch.compile(mod)
opt_mod(example_input)  # placeholder: an input accepted by MyModel

del mod
del opt_mod
gc.collect()

# if there is no leak, "obj freed!" has been printed and ref() returns None
if ref() is not None:
    # otherwise, we have to go digging...
    referrers = gc.get_referrers(ref())
    print(referrers)
    print(gc.get_referrers(referrers[0]))
    # etc.

Admittedly, this approach is quite manual and tedious. We tried using a graph visualizer on the GC referrers graph, but we found that the graph was too large and difficult to read.
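To make the referrer walk concrete without torch, the sketch below plants a deliberate leak and then finds it with gc.get_referrers(). The `registry` list is a hypothetical stand-in for any long-lived structure that forgot to drop a reference.

```python
import gc
import weakref


class Layer:
    """Hypothetical stand-in for a leaked user object."""


# Hypothetical long-lived container that accidentally keeps objects alive.
registry = []


def run():
    layer = Layer()
    registry.append(layer)  # the "leak"
    return weakref.ref(layer)


probe = run()
gc.collect()

still_alive = probe() is not None

# First level: which GC-tracked objects refer to the leaked object?
first_level = gc.get_referrers(probe())
found_registry = any(r is registry for r in first_level)

# Second level: who refers to the container we just found?
second_level_types = sorted({type(r).__name__ for r in gc.get_referrers(registry)})
print(still_alive, found_registry, second_level_types)
```

In a real Dynamo session the referrer lists are much larger, so filtering by type or identity as above is usually more practical than printing everything.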

In addition, C code can cause refleaks since we can arbitrarily increase an object’s reference count. In this case, we can’t use gc.get_referrers() to determine where in the C code a reference count was incorrectly increased. However, we can compare the number of referrers (len(gc.get_referrers())) and the reference count (sys.getrefcount()) - the difference between these two numbers should be constant across different objects (in our debugging, normally the difference is 3).
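The comparison can be sketched as follows. This is an illustration only: it assumes CPython, and it uses ctypes to call Py_IncRef as a stand-in for buggy C code. The helper name `refcount_gap` and the `Payload` class are hypothetical. The absolute gap depends on the Python version and on how the helper is called, so only the difference between objects measured the same way is meaningful.

```python
import ctypes
import gc
import sys


class Payload:
    """Hypothetical object whose reference count we inspect."""


def refcount_gap(obj):
    """Reference count minus the number of referrers visible to the GC."""
    return sys.getrefcount(obj) - len(gc.get_referrers(obj))


clean = Payload()
leaked = Payload()

# Simulate a C-level leak: bump the reference count with no Python-visible owner.
ctypes.pythonapi.Py_IncRef.argtypes = [ctypes.py_object]
ctypes.pythonapi.Py_IncRef.restype = None
ctypes.pythonapi.Py_IncRef(leaked)

gap_clean = refcount_gap(clean)
gap_leaked = refcount_gap(leaked)

# The leaked object has one more reference than its referrers can explain.
print(gap_clean, gap_leaked)
```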

So we can use gc.get_referrers()/sys.getrefcount() to determine that C code is causing a refleak, but in order to triage further, we can make a few changes to Dynamo:

  1. In C Dynamo, in order to run compiled bytecode, we create a new “shadow” frame based on the original code object, but using Dynamo-compiled bytecode. Instead of running the shadow frame, we can just run the original frame (e.g. change this line). If we refleak only on the shadow frame, then it is likely that the generated bytecode is causing the refleak.

  2. We can even skip creating the shadow frame entirely and default-evaluate the original frame (e.g. change this line). This can help us to determine if the refleak is being caused when we create the shadow frame.

  3. In Python Dynamo, we can skip the tracing and return the original code object with a dummy guard function (e.g. return early in this function).

Leak 1: Cache Invalidation

PRs:

The Dynamo code cache prevents unnecessary recompilations. In general, each original code object that Dynamo traces through has its own cache - each entry in the cache contains Dynamo-compiled bytecode and its guard function. The guard function determines whether running the bytecode on the current execution context (function arguments, locals, etc.) is sound. Before attempting to trace through code, Dynamo tries to find the first entry in the code object’s cache with a passing guard function. If such an entry is found, we can use the previously constructed bytecode, thus avoiding compilation.
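The lookup described above can be modeled with a toy cache. This is a simplified sketch, not Dynamo's implementation: `ToyCodeCache`, `CacheEntry`, and the type-based guard are hypothetical, and "compilation" here simply reuses the original function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CacheEntry:
    guard: Callable[[dict], bool]  # is the cached result sound for this context?
    compiled: Callable             # stand-in for Dynamo-compiled bytecode


class ToyCodeCache:
    def __init__(self):
        self.entries = {}          # code object -> list of CacheEntry
        self.compile_count = 0

    def _compile(self, fn, context):
        # Toy guard: the argument types must match those seen at compile time.
        expected = {name: type(value) for name, value in context.items()}

        def guard(ctx):
            return {name: type(value) for name, value in ctx.items()} == expected

        self.compile_count += 1
        return CacheEntry(guard, fn)

    def call(self, fn, **context):
        entries = self.entries.setdefault(fn.__code__, [])
        for entry in entries:      # first entry with a passing guard wins
            if entry.guard(context):
                return entry.compiled(**context)
        entry = self._compile(fn, context)
        entries.append(entry)
        return entry.compiled(**context)


def scale(x, factor):
    return x * factor


cache = ToyCodeCache()
first = cache.call(scale, x=2, factor=3)     # miss: compiles
second = cache.call(scale, x=5, factor=7)    # hit: same argument types
third = cache.call(scale, x=2.0, factor=3)   # miss: guard on type(x) fails
print(first, second, third, cache.compile_count)
```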

A guard function may become invalid - for example, the guard function may check for object identity with an object that was garbage collected. In this case, the guard function should never succeed, and so the associated compiled bytecode is no longer necessary. However, Dynamo did not have a mechanism to automatically remove this invalid cache entry from the cache. Thus, there may be a reference leak as the compiled bytecode can continue to refer to some of the original model’s resources.

Dynamo uses weakref finalizers to invalidate guards. For example, when a guard is created on an object’s id(), the guard only stores its id() - the guard does not take a reference to the object. We then register a weakref finalizer on the original object so that when the original object is deleted, the finalizer will signal to the guard to invalidate itself.

The fix for this reference leak was to enable cache entry deletion. This change was not straightforward as the cache was implemented with a C data structure with complex reference counting rules - we first rewrote sections of the C parts of Dynamo to C++ to take advantage of RAII/pybind/std containers. Additionally, we had to be careful to set references correctly to prevent strong reference cycles and use-after-free.
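A rough Python model of guard invalidation plus cache entry deletion is shown below. The real implementation lives in C/C++; `IdGuard` and `ToyCache` are hypothetical names, and the string payload stands in for compiled bytecode.

```python
import gc
import weakref


class IdGuard:
    """Stores only id(obj); a finalizer invalidates the guard when obj dies."""

    def __init__(self, obj, on_invalidate):
        self.expected_id = id(obj)  # no strong reference to obj is kept
        self.valid = True
        weakref.finalize(obj, self._invalidate, on_invalidate)

    def _invalidate(self, on_invalidate):
        self.valid = False
        on_invalidate(self)

    def check(self, obj):
        return self.valid and id(obj) == self.expected_id


class ToyCache:
    def __init__(self):
        self.entries = []  # list of (guard, compiled payload)

    def add(self, obj, compiled):
        self.entries.append((IdGuard(obj, self._drop), compiled))

    def _drop(self, guard):
        # Cache entry deletion: remove the entry whose guard was invalidated.
        self.entries = [(g, c) for g, c in self.entries if g is not guard]


class Model:
    """Hypothetical user object that is guarded on by identity."""


model = Model()
cache = ToyCache()
cache.add(model, "compiled bytecode for model")
entries_before = len(cache.entries)

del model
gc.collect()
entries_after = len(cache.entries)
print(entries_before, entries_after)
```

Note that this only works because the payload does not itself hold a strong reference to the guarded object; if it did, the finalizer would never run.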

Leak 2: Python-level Reference Leaks

PRs:

The following 3 refleaks occurred at the Python level, where objects in Python Dynamo would hold on to references to user objects.

The first was caused by the Dynamo code context map holding strong references to Dynamo-generated GraphModules. The purpose of the code context map is to provide a mapping from a GraphModule’s code object to the original GraphModule object - most of Dynamo only has access to the code object, but not the original function object. In particular, the code context map is used to preserve FX node metadata when Dynamo traces through GraphModules. Although the map uses weakrefs (to code objects) as keys, leaks still occurred because the lifetime of the key was directly linked to the lifetime of the value (generated GraphModule). The solution was to make the values weakrefs as well.
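The key/value lifetime coupling can be reproduced with a small sketch. `FakeCode` and `FakeGraphModule` are hypothetical stand-ins, and a `weakref.WeakKeyDictionary` is assumed here as a model of the code context map.

```python
import gc
import weakref


class FakeCode:
    """Hypothetical stand-in for a GraphModule's code object."""


class FakeGraphModule:
    def __init__(self):
        self.code = FakeCode()  # the module keeps its own code object alive


def module_leaks(weak_values):
    context = weakref.WeakKeyDictionary()
    gm = FakeGraphModule()
    probe = weakref.ref(gm)
    # Key: weakly referenced code object. Value: strong or weak reference to gm.
    context[gm.code] = weakref.ref(gm) if weak_values else gm
    del gm
    gc.collect()
    return probe() is not None


# Strong value -> value keeps key alive -> weak key never expires -> leak.
leaks_with_strong_values = module_leaks(weak_values=False)
# Weak value -> nothing owns the module once the caller drops it.
leaks_with_weak_values = module_leaks(weak_values=True)
print(leaks_with_strong_values, leaks_with_weak_values)
```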

The second was caused by invalidation finalizer functions having indirect strong references to user objects. In particular, finalizers held a strong reference to Dynamo’s OutputGraph, which is responsible for generating the output FX graph. OutputGraph keeps references to its generated GraphModules, which in turn can keep references to user models’ layers, parameters, etc. There are cases where objects that we finalize on aren’t cleaned up when we expect them to - by CPython design. In these cases, the finalizer is not run, so it (and its references) continues to exist in memory. The solution was simply to remove the reference to OutputGraph when no longer needed.
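The effect of a finalizer that captures extra state can be sketched as follows. The `OutputGraph` and `UserObject` classes here are empty hypothetical stand-ins; the point is only that arguments registered with a finalizer are held strongly until the finalizer runs.

```python
import gc
import weakref


class OutputGraph:
    """Hypothetical stand-in for Dynamo's OutputGraph."""


class UserObject:
    """Hypothetical long-lived object that a guard finalizes on."""


def install_finalizer(obj, keep_graph):
    graph = OutputGraph()
    probe = weakref.ref(graph)
    if keep_graph:
        # The finalizer's arguments are strong references.
        weakref.finalize(obj, lambda g: None, graph)
    else:
        # Same finalizer, but without holding on to the graph.
        weakref.finalize(obj, lambda: None)
    return probe


long_lived = UserObject()  # never freed during this sketch
leaky_probe = install_finalizer(long_lived, keep_graph=True)
clean_probe = install_finalizer(long_lived, keep_graph=False)
gc.collect()

graph_leaked = leaky_probe() is not None
graph_freed = clean_probe() is None
print(graph_leaked, graph_freed)
```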

The third was caused by the following cycle:

  1. Dynamo code cache entry refers to generated bytecode
  2. Generated bytecode refers to generated GraphModule
  3. GraphModule can refer to user layers
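  4. Guards on user objects are watched using weakref finalizers, which invalidate the code cache entry (the fix for leak #1 was already in place by this point). The finalizers never run, because the cache entry itself keeps the user objects alive.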

This refleak does not happen all the time because the user layers referred to by the generated GraphModule (point 3) are not necessarily guarded on (point 4). But, for example, in the case where we torch.compile a built-in nn.Module subclass directly (without inlining inbuilt nn.Modules), we can leak:

import gc
import weakref

import torch

# torch._dynamo.config.inline_inbuilt_nn_modules = False
mod = torch.nn.Linear(10, 10)
ref = weakref.ref(mod, lambda _: print("mod deleted!"))
opt_mod = torch.compile(mod)
opt_mod(torch.randn(10, 10))
del mod
del opt_mod
gc.collect()
print("done!")
# Output when leaking ("mod deleted!" only appears at interpreter exit):
# done!
# mod deleted!

The refleak was fixed by removing the GraphModule’s strong references to user layers (point 3). We cannot simply use a weakref here since (1) the GraphModule’s forward function would need to call the weakref in order to use the original layers and (2) downstream programs may inspect layer types and operate on the GraphModule, and they do not assume that the GraphModule uses weakrefs. A weakref.proxy removes the need to make a call to retrieve the original layer, but the original layer type is still hidden. We could instead create a shallow copy of the layer via copy.copy, but it does not copy over every attribute we need, such as hooks.

The solution was to create a “stronger shallow copy” - we create a new layer object by using the layer’s __new__ method and copy over the original layer’s __dict__. Here, the new layer is a distinct object - not a strong reference to the original layer - but it points to the same underlying data as the original layer. We also call this new object a “proxy,” as these objects are intended to have similar behavior to a weakref.proxy, except that the type of the new object reflects the original.
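A minimal sketch of such a proxy is shown below, using a plain Python class in place of an nn.Module. The helper name `make_proxy` is hypothetical, and sharing the `__dict__` (rather than copying it) is an assumption made for this illustration.

```python
import weakref


class Linear:
    """Hypothetical stand-in for a user layer."""

    def __init__(self, weight):
        self.weight = weight
        self.hooks = []


def make_proxy(layer):
    # Create an instance of the same type without running __init__ ...
    proxy = type(layer).__new__(type(layer))
    # ... and let it see the same attributes (assumption: the dict is shared).
    proxy.__dict__ = layer.__dict__
    return proxy


weight = [1.0, 2.0]
layer = Linear(weight)
probe = weakref.ref(layer)

proxy = make_proxy(layer)
same_type = type(proxy) is Linear      # unlike weakref.proxy, the type is preserved
shares_data = proxy.weight is weight   # same underlying data
distinct = proxy is not layer

# The proxy holds no reference to the original layer object itself.
del layer
original_freed = probe() is None
print(same_type, shares_data, distinct, original_freed)
```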

This bug was difficult to find and fix since all strong references were intentional, so gc.get_referrers() and sys.getrefcount() were not helpful. We discovered this reference cycle by reasoning about the Dynamo caching implementation at a high level and verifying our hypothesis. We were initially unsure which strong reference to remove to break the reference cycle since it seemed necessary to keep all of the references strong. The proxy solution we implemented also has fragility concerns.

Leak 3: 3.11+ Reference Leak

PR: [dynamo] fix 3.11+ refleak by williamwen42 · Pull Request #124238 · pytorch/pytorch · GitHub

Python 3.11+ was especially susceptible to reference leaks - we found that even very simple programs would leak. gc.get_referrers() revealed that there was a cell object that held a strong reference to a user object.

Cell objects are used in Python to implement closures - if a function’s local variable will be used in an inner function, the local variable will be wrapped in a cell object, which the inner function can later access when called. Cell objects are necessary to extend the lifetime of the local variable, since otherwise, a local variable will be deleted at the end of a function’s execution.
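The following sketch shows a cell object extending the lifetime of a local variable. `Payload` and `make_reader` are hypothetical names used only for illustration.

```python
import gc
import weakref


class Payload:
    """Hypothetical object captured by a closure."""


def make_reader():
    payload = Payload()  # local variable used by the inner function

    def read():
        return payload   # `payload` is stored in a cell shared with `read`

    return read, weakref.ref(payload)


read, probe = make_reader()

# make_reader() has returned, but the cell keeps the payload alive.
cell = read.__closure__[0]
free_vars = read.__code__.co_freevars
kept_alive = cell.cell_contents is probe()

# Once the closure and the cell are gone, the payload is released.
del read, cell
gc.collect()
released = probe() is None
print(free_vars, kept_alive, released)
```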

Since a cell object was holding on to the leaked object, we suspected some closure was being leaked or a closure was holding on to the cell object when it should not be. Although we could determine where the leaking cell object originated from, we could not use gc.get_referrers() to determine which object continued to hold on to a reference to the cell. We tried using some other Dynamo reference leak debugging techniques:

  • Running the original bytecode instead of the optimized bytecode still resulted in a leak. So the leak is not caused by faulty generated bytecode.
  • Skipping Python Dynamo (and using the original bytecode instead) still resulted in a leak. So Dynamo is not holding on to a strong reference it shouldn’t be.
  • Running the original frame instead of the shadow frame did not result in a leak. So the issue had to do with running the shadow frame.

At this point, we noticed that the bytecode started with the instruction COPY_FREE_VARS, which copies the cell objects in a code object’s closure into the execution frame. We discovered that when Dynamo simply skipped emitting this bytecode, no leaks occurred. Given that we observed (1) that we leaked if and only if COPY_FREE_VARS was present, (2) that a cell variable was being leaked, and (3) that COPY_FREE_VARS deals with cell variables, we concluded that this instruction was the likely culprit of the leak.
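The presence of this instruction can be checked with the standard dis module. The sketch below uses a hypothetical closure `make_adder`; the exact instruction sequence varies between CPython versions, so it only checks whether COPY_FREE_VARS appears at all.

```python
import dis
import sys


def make_adder(offset):
    def add(x):
        return x + offset  # `offset` is a free variable of `add`

    return add


add = make_adder(1)

# Collect the opcode names of the inner function's bytecode.
opnames = [ins.opname for ins in dis.get_instructions(add)]
has_copy_free_vars = "COPY_FREE_VARS" in opnames

# Print the Python version and the first couple of instructions for inspection.
print(sys.version_info[:2], opnames[:2], has_copy_free_vars)
```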

Upon further investigation, we noted that the shadow frame gets its execution context (local variables, cell variables, free variables, etc.) by copying its contents from the original frame. In this process, each copied variable has its reference count incremented. Then when the shadow frame is executed, COPY_FREE_VARS again increments the reference counts of the cell variables. But when we clean up the shadow frame, each variable’s reference count is only decremented once.

But why would calling COPY_FREE_VARS on the shadow frame cause a reference leak while the original frame did not leak? Recall that before C Dynamo runs bytecode, we run guards on the execution context of the original frame. In Python 3.11+, execution frames changed from being Python objects to C structs, for performance reasons. As a result, the variables of the frame are not immediately accessible in Python. Therefore, in order to run guards, we run some C code copy-pasted from CPython that constructs a Python dict that holds the frame variables. But because a frame’s execution context includes its cell variables, this C code necessarily mimics running the COPY_FREE_VARS instruction. To prevent duplicating the COPY_FREE_VARS behavior, CPython updates the frame’s instruction pointer to start after the COPY_FREE_VARS instruction.

Thus, our solution was to skip the COPY_FREE_VARS instruction in the shadow frame if it was also skipped in the original frame. This prevents cell variables from having their reference counts incremented an additional time.

Leak 4: 3.12+ Reference Leak

Links:

During our work on torch.compile support for Python 3.12, one of the refleak tests started to fail in Python 3.12. The code is similar to:

import gc
import weakref

import torch

def test():
    param = torch.nn.Parameter(torch.randn(100, 100))
    class MyModel(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.param = param

        def forward(self, x):
            return self.param * x
    mod = MyModel()
    ref = weakref.ref(param, lambda _: print("param deleted!"))
    opt_mod = torch.compile(mod)
    opt_mod(torch.randn(100, 100))  # input shape must broadcast with param
    return ref

ref = test()
gc.collect()
print("done!")

We determined that the refleak was happening in Python Dynamo, since the leak did not occur if we short-circuited in Python Dynamo (i.e. skipped the tracing and returned the original code object in convert_frame.py). However, none of the Dynamo changes made on the Python side for 3.12 seemed to be relevant to this leaking parameter case.

Using gc.get_referrers(), we determined that functools.lru_cache was indirectly holding a strong reference to the parameter. This LRU cache is used to cache calls to a private function in inspect.py, a Python standard library module, which was indirectly called by inspect.getattr_static. In particular, Dynamo would call getattr_static on the user object, which would result in the object’s class being cached by the inspect module. If the class refers to other objects, for example through a class member or, as in our case above, through a cell, then references to user objects can persist even after the class and all its instances are deleted. Further, by removing the lru_cache on the private function in inspect, the above code no longer refleaks.
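The mechanism can be reproduced without torch or inspect. In the sketch below, `_cached_lookup` is a hypothetical stand-in for the private cached function; it is not the actual inspect helper.

```python
import functools
import gc
import weakref


@functools.lru_cache()
def _cached_lookup(klass):
    """Hypothetical stand-in for a cached helper keyed on a class."""
    return klass.__mro__


class Param:
    """Hypothetical stand-in for a torch.nn.Parameter."""


def build_and_inspect():
    param = Param()

    class Model:
        def get(self):
            return param  # closes over `param` through a cell

    _cached_lookup(Model)  # the cache now holds a strong reference to Model
    return weakref.ref(param)


probe = build_and_inspect()
gc.collect()
# cache -> Model -> Model.get -> cell -> param
leaked = probe() is not None

# Dropping the cache entry breaks the chain.
_cached_lookup.cache_clear()
gc.collect()
freed = probe() is None
print(leaked, freed)
```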

We reported this bug to CPython. Although there was hesitation on removing the LRU cache since performance would regress, we agreed that unexpectedly longer lifetimes of objects were not desirable. The fix made by CPython was to make the cache on the private function take weak references.
