AOTI generates C++ wrapper code, which means it can eliminate the Python overhead around the model code. ResNet18 is a small model and your input is small, so AOTI's benefit tends to be more noticeable.
BTW, you can run with `TORCH_LOGS=output_code` to inspect the generated code from both paths and double-check whether the kernel code actually differs.
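For example (a minimal sketch; `benchmark.py` is a stand-in for your own script, and the in-process call assumes a recent `torch._logging` API):

```python
# From the shell, for either the torch.compile run or the AOTI build step:
#   TORCH_LOGS=output_code python benchmark.py
# Or, equivalently, from inside the script before compiling:
import torch
torch._logging.set_logs(output_code=True)  # prints Inductor's generated kernels/wrapper code
```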
The most likely culprit here is that your model is too small to saturate the GPU, so it ends up being overhead bound (rather than compute bound, like a larger model would be). AOTI has lower overhead, so it is faster; if you scale the model up, the two should run at about the same speed.
I'd suggest enabling CUDA graphs with `torch.compile(mode="reduce-overhead")`, which will help a lot for tiny models.
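Something along these lines (a rough benchmarking sketch; the warm-up iterations matter because the first few calls trigger compilation and CUDA graph capture):

```python
import torch
import torchvision

model = torchvision.models.resnet18().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")

# reduce-overhead enables CUDA graphs, which amortizes per-kernel launch overhead
compiled = torch.compile(model, mode="reduce-overhead")

with torch.no_grad():
    for _ in range(3):  # warm-up: compilation + CUDA graph capture
        compiled(x)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(100):
        compiled(x)
    end.record()
    torch.cuda.synchronize()
    print(f"avg iter: {start.elapsed_time(end) / 100:.3f} ms")
```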
You could also try a much larger batch size, so each kernel launch does more compute.
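E.g. in the benchmark sketch above, just make the input bigger (256 here is an arbitrary example, pick whatever fits in memory):

```python
# Larger batch -> more work per kernel, so fixed per-call overhead matters less
x = torch.randn(256, 3, 224, 224, device="cuda")
```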