I’m trying to reproduce the “Without inlining” ablation from the PyTorch 2.0 paper (the table that reports speedups for “All TorchInductor optimizations”, then “Without inlining”, “Without fusion”, “Without fusion & inlining”, etc.). My research goal is to turn off inlining (as well as other optimizations) in TorchInductor while keeping the rest of the pipeline intact, so I can attribute speedups precisely.
You can verify it is working by compiling something like torch.sin(torch.cos(x)): with inlining you should see one SchedulerNode; without it you should see two.
In the code, “realized” buffers are not inlined, while unrealized ones are.
Thanks for the pointers! I tried setting the inlining knobs to zero, but I still see sin(cos(x)) ending up in a single SchedulerNode (and one kernel mapping to ["cos","sin"]) in the debug dumps.
The script I used:
import torch, torch._inductor.config as cfg
cfg.realize_reads_threshold = 0
cfg.realize_opcount_threshold = 0
cfg.trace.enabled = True
def f(x):
    return torch.sin(torch.cos(x))
x = torch.randn(1_000_000, device="cpu")
g = torch.compile(f, fullgraph=True)(x)
I ran this with the following environment variables:
TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_FX_GRAPH_CACHE=0 python verify_inline.py
When I diff the two latest debug runs, the only difference I see is that one run’s generated fx_graph_runnable.py includes the threshold assignments and the other does not:
diff -r torch_compile_debug/run_2025_08_22_11_27_35_941034-pid_8665/ \
Here is a snippet of inductor_provenance_tracking_node_mappings.json:
{"preToPost": {"cos": ["cos"], "sin": ["sin"]}, "postToPre": {"cos": ["cos"], "sin": ["sin"]}, "cppCodeToPost": {"cpp_fused_cos_sin_0": ["sin", "cos"]}, "postToCppCode": {"sin": ["cpp_fused_cos_sin_0"], "cos": ["cpp_fused_cos_sin_0"]}}
Questions:
Are these thresholds read early enough that setting them inside the generated runnable wouldn’t affect lowering-time inlining? (i.e., do they need to be set even earlier than I am?)
Is there a specific file/decision point in lowering where “unrealized → inline” vs “realized → materialize” happens that I can instrument to confirm we’re hitting the intended branch?
Here is a workaround (the realize_reads_threshold path only applies when num_users > 1, so single-user buffers are inlined regardless of the thresholds):
diff --git a/torch/_inductor/graph.py b/torch/_inductor/graph.py
index 31be050ab28..8d36cbbd4fe 100644
--- a/torch/_inductor/graph.py
+++ b/torch/_inductor/graph.py
@@ -1712,7 +1712,9 @@ class GraphLowering(torch.fx.Interpreter):
         # Realize if (1) any user need inputs realized, or (2) there is
         # already too many reads and rematerializing can be bad.
         num_users = len(OrderedSet(n.users))
-        if num_users > 1 and isinstance(result, TensorBox):
+        if num_users > 0 and isinstance(result, TensorBox):
+            result.realize()
+
             for user in n.users:
                 if user.target in needs_realized_inputs:
                     result.realize_hint()