Why are AOTI-packaged models faster than torch.compile models with the same inputs?

Hi all,

I’ve been experimenting with both torch.compile and AOTI-compiled packaged models.
Here’s my setup:

  • Same model architectures (e.g., ResNet18, etc.)

  • Same input shapes (e.g., [1, 3, 224, 224])

  • Both run in eval() mode, torch.no_grad(), CUDA device, and warmed up before timing

However, when I benchmark them:

  • The AOTI-compiled and packaged models (torch._inductor.aoti_compile_and_package + torch._inductor.aoti_load_package) run noticeably faster.

  • The torch.compile models (using the Inductor backend) show higher per-iteration runtime, even though the kernels should be the same.
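For context, here is roughly the harness I’m using (a minimal sketch; the `bench` helper and iteration counts are illustrative, and the export-based AOTI call assumes a recent PyTorch where `aoti_compile_and_package` accepts an `ExportedProgram`):

```python
import torch
from torchvision.models import resnet18  # stand-in for "same model architectures"

device = "cuda"
model = resnet18().eval().to(device)
x = torch.randn(1, 3, 224, 224, device=device)

def bench(fn, iters=100):
    # CUDA-event timing after warmup; returns ms per iteration
    for _ in range(10):
        fn(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

with torch.no_grad():
    compiled = torch.compile(model)  # Inductor backend (the default)

    # AOTI: export, compile to a .pt2 package on disk, load it back
    pkg_path = torch._inductor.aoti_compile_and_package(
        torch.export.export(model, (x,))
    )
    aoti_model = torch._inductor.aoti_load_package(pkg_path)

    print(f"torch.compile: {bench(compiled):.3f} ms/iter")
    print(f"AOTI:          {bench(aoti_model):.3f} ms/iter")
```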

Questions:

  • Why is there a performance gap between AOTI packaged models and torch.compile models when everything else (model, input, device) is identical?
  • Is there a way to visualize the graph of compiled models?

Thanks,
Shriraj

AOTI generates C++ wrapper code, which eliminates the Python overhead of invoking the model. ResNet18 is a small model and your input is small, so the kernels finish quickly and AOTI’s benefit tends to be more noticeable.

BTW, you can run with `TORCH_LOGS=output_code` to inspect the generated code from both paths, to double-check whether there is any disparity in the generated kernel code.
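For example (the script name is a placeholder; `torch._logging.set_logs` should be the in-process equivalent of the env var, if I remember the API correctly):

```python
# From the shell:
#   TORCH_LOGS=output_code python bench.py
# Or programmatically, before compiling:
import torch
torch._logging.set_logs(output_code=True)
```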


The most likely culprit here is that your model is too small to saturate the GPU, so it ends up being overhead-bound (rather than compute-bound, as a larger model would be). AOTI has lower overhead, so it is faster; if you scale the model up, the two should run at the same speed.

I’d suggest enabling CUDA graphs with `torch.compile(mode="reduce-overhead")`, which will help a lot for tiny models.
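A minimal sketch (the warmup count is illustrative; the first few calls compile and let CUDA graphs capture before steady state):

```python
import torch
from torchvision.models import resnet18

model = resnet18().eval().cuda()
x = torch.randn(1, 3, 224, 224, device="cuda")
compiled = torch.compile(model, mode="reduce-overhead")  # enables CUDA graphs

with torch.no_grad():
    for _ in range(3):       # warmup: compilation + CUDA graph capture
        compiled(x)
    torch.cuda.synchronize()
    out = compiled(x)        # steady-state calls replay the captured graph
```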

You could also try a much larger batch size, so each kernel does more compute.
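For instance (batch sizes and iteration counts are arbitrary; each new input shape can trigger a recompile, hence the per-shape re-warm):

```python
import torch
from torchvision.models import resnet18

model = torch.compile(resnet18().eval().cuda())

with torch.no_grad():
    for batch in (1, 16, 64):
        x = torch.randn(batch, 3, 224, 224, device="cuda")
        for _ in range(10):   # re-warm: a new shape can trigger recompilation
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(50):
            model(x)
        end.record()
        torch.cuda.synchronize()
        print(f"batch={batch}: {start.elapsed_time(end) / 50:.3f} ms/iter")
```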