Why are AOTI-packaged models faster than torch.compile models with the same inputs?

Hi all,

I’ve been experimenting with both torch.compile and AOTI-compiled packaged models.
Here’s my setup:

  • Same model architectures (e.g., ResNet18, etc.)

  • Same input shapes (e.g., [1, 3, 224, 224])

  • Both run in eval() mode, torch.no_grad(), CUDA device, and warmed up before timing

However, when I benchmark them:

  • The AOTI-compiled and packaged models (torch._inductor.aoti_compile_and_package + torch._inductor.aoti_load_package) run noticeably faster.

  • The torch.compile models (using the Inductor backend) show higher per-iteration runtime, even though the kernels should be the same.
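For context, here is roughly the harness I’m using (a minimal sketch; the `bench` helper and iteration counts are illustrative, and the export-based AOTI call assumes a recent PyTorch where `aoti_compile_and_package` accepts an `ExportedProgram`):

```python
import torch
from torchvision.models import resnet18  # stand-in for "same model architectures"

device = "cuda"
model = resnet18().eval().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

def bench(fn, iters=100):
    # CUDA-event timing after warmup; returns ms per iteration
    for _ in range(10):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

with torch.no_grad():
    compiled = torch.compile(model)  # Inductor backend (the default)

    # AOTI: export, compile to a .pt2 package on disk, load it back
    pkg_path = torch._inductor.aoti_compile_and_package(
        torch.export.export(model, (x,))
    )
    aoti_model = torch._inductor.aoti_load_package(pkg_path)

    print(f"torch.compile: {bench(compiled):.3f} ms/iter")
    print(f"AOTI:          {bench(aoti_model):.3f} ms/iter")
```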

Questions:

  • Why is there a performance gap between AOTI packaged models and torch.compile models when everything else (model, input, device) is identical?
  • Is there a way to visualize the graph of compiled models?

Thanks,
Shriraj

AOTI generates C++ wrapper code, which eliminates the Python overhead of invoking the model. ResNet18 is a small model and your input is small, so the kernels finish quickly and AOTI’s benefit tends to be more noticeable.

BTW, you can run with `TORCH_LOGS=output_code` to inspect the generated code from both paths, to double-check whether there is any disparity in the generated kernel code.
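For example (the script name is a placeholder; `torch._logging.set_logs` should be the in-process equivalent of the env var, if I remember the API correctly):

```python
# From the shell:
#   TORCH_LOGS=output_code python bench.py
# Or programmatically, before compiling:
import torch
torch._logging.set_logs(output_code=True)
```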


The most likely culprit here is that your model is too small to saturate the GPU, so it ends up being overhead-bound (rather than compute-bound, as a larger model would be). AOTI has lower overhead, so it is faster; if you scale the model up, the two should run at the same speed.

I’d suggest enabling CUDA graphs with `torch.compile(mode="reduce-overhead")`, which will help a lot for tiny models.
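A minimal sketch (the warmup count is illustrative; the first few calls compile and let CUDA graphs capture before steady state):

```python
import torch
from torchvision.models import resnet18

model = resnet18().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")
compiled = torch.compile(model, mode="reduce-overhead")  # enables CUDA graphs

with torch.no_grad():
    for _ in range(3):       # warmup: compilation + CUDA graph capture
        compiled(x)
    torch.cuda.synchronize()
    out = compiled(x)        # steady-state calls replay the captured graph
```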

You could also try a much larger batch size, so each kernel does more compute.
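For instance (batch sizes and iteration counts are arbitrary; each new input shape can trigger a recompile, hence the per-shape re-warm):

```python
import torch
from torchvision.models import resnet18

model = torch.compile(resnet18().eval().cuda())

with torch.no_grad():
    for batch in (1, 16, 64):
        x = torch.randn(batch, 3, 224, 224, device="cuda")
        for _ in range(10):   # re-warm: a new shape can trigger recompilation
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(50):
            model(x)
        end.record()
        torch.cuda.synchronize()
        print(f"batch={batch}: {start.elapsed_time(end) / 50:.3f} ms/iter")
```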