In torch.compile, the original bytecode of user-defined function will be transformed by Dynamo. The pull request adds an API torch._dynamo.convert_frame.register_bytecode_hook to PyTorch, which enables hooks to log additional information of the bytecode and optionally to replace the transformed bytecode. Currently, the major usecase of this API is the depyf project for helping users to understand and adapt to torch.compile, and there might be more usecases in the future.
Replacing bytecode is very error-prone in Python, as it involves many implementation details of CPython interpreter.
This discussion thread aims to clarify what is the relationship requirement among these code objects. To be specific, we are talking about three code objects:
Original code object, from user defined functions, dubbed as original_code.
Transformed code object, generated by Dynamo, dubbed as transformed_code.
Replaced code object, returned by bytecode hooks, dubbed as replaced_code.
The requirement is with regard to the following code attributes:
discussion on this PR seems to indicate that new code objects must have more local variables than the original code objects (i.e. more names in co_varnames).
I once met a strange failing test that fails PyTorch in Python 3.11 . After I make sure the replaced code objects have more local variables than the original code objects, the error disappears.
The Python interpreter seems to work as long as the new bytecode does not access a name that does not exist:
def f():
a = 1
b = 2
return a + b
def g():
a = 1
b = 2
c = 3
return a + b + c
g.__code__ = f.__code__
print(g()) # outputs 3
In this example, we replace g’s bytecode with f’s bytecode, and the frame of g can execute f’s bytecode without any problem, even though the replaced bytecode has less local variables than the original bytecode.
f_localsplus of original_code needs to be a prefix of both transformed_code->f_localsplus and replaced_code->f_localsplus. Since we copy f_localsplus from the old frame to the new frame.
This means:
The new code must have more locals than the old code
And, they must be in the same order
f_localsplus contains: args, free/cell vars, local variables.
Technically the order of locals can change (since they are all just NULL when the function is called), but args/free/cell vars must be in the exact same positions as the original function.
To be clear, the sole purpose is to make sure new_frame can execute the transformed or replaced bytecode correctly. Therefore, ideally, we need to look up necessary variables used in new_frame, and copy them from old_frame.
There are four kinds of variables we need to consider:
arguments, passed by the caller function.
local variables, created during execution and destroyed after the frame finishes execution. They are NULL when the frame starts execution.
free variables, variables that are referenced by this frame but come from outer functions.
cell variables, variables that are inside this frame but referenced by inner functions. I suppose they are initially NULL and set by STORE_DEREF later.
Ideally, new_frame should look up variables it requires from old_frame. The current implementation tries to copy all the variables from old_frame to new_frame, that’s why @jansel says there are two constraints:
The new code must have more locals than the old code
And, they must be in the same order
A more natural solution would be:
for i, name in enumerate(old_frame_names):
name_to_idx[name] = i
for i, name in enumerate(new_frame_names):
if name not in name_to_idx:
continue
fastlocals_new[i] = fastlocals_old[name_to_idx[name]]
This might get rid of those constraints. But it incurs higher runtime cost.
cell variables, variables that are inside this frame but referenced by inner functions. I suppose they are initially NULL and set by STORE_DEREF later.
cell variables add a layer of indirection. So they won’t be NULL, they will be a cell containing NULL.
While I agree that adding a name_to_idx would make things more general, it is unnecessary because the order of vars in the old frame is a prefix of the new frame by construction in dynamo (since dynamo creates the new frame by copying and modifying the old frame).
This code is on the hot path (every function call), so we care a lot about performance here.
Agree performance is critical here. I created a PR to skip copying local variables, which can not only bring speedup but also relax these two constraints.
And I will try to follow these constraints to be compatible to previous PyTorch versions. This discussion thread is left for future reference in case anyone runs into the problem.
Update: after this PR, the only constraint is that original_code/transformed_code/replaced_code have the same args/cell variables/free variables. There are no constraints on newly introduced local variables.
However, PyTorch prior to this PR would have these constraints:
The new code must have more locals than the old code
And, they must be in the same order
args/cell variables/free variables must be exactly the same.