TorchInductor Update 3: E2E model training with TorchDynamo + Inductor gets 1.67x/2.1x speedup

Our previous posts introduced the idea of TorchInductor and shared some early but extremely promising inference and training results.

Since last time, we have added E2E model training support to TorchDynamo + Inductor, which brings a 1.67x (FP32) / 2.1x (AMP) speedup over native PyTorch.

Why do we need the E2E model training benchmark?

Previous TorchDynamo + Inductor work has shown extremely promising results on both model inference and training performance, but we lacked an end-to-end training performance benchmark on a real dataset to help users better understand the overall performance gain. So we added this E2E training benchmark, which trains the popular Hugging Face BERT model on the Yelp Reviews dataset.

Differences between E2E training and the existing OSS training benchmark

The major differences are:

  • The OSS training benchmark only tests accuracy and speedup of the forward and backward passes on example inputs;
  • The E2E training benchmark checks model loss convergence and the speedup of the whole training pipeline (forward + backward + optimizer.step) on real datasets.

To close this gap, we added several new features and fixed a number of bugs along the way.

How to onboard the TorchDynamo stack with your own E2E model training?

Look at training_iter_fn, which is accelerated by the TorchDynamo + Inductor stack; it includes the forward pass, the backward pass, and the optimizer step. Currently both the forward pass and the optimizer step are captured and accelerated by Dynamo + Inductor. The backward pass still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, and also runs the backward functions of any non-compiled eager ops.
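
For illustration, here is a minimal sketch of that structure. It uses torch._dynamo.optimize("inductor") as the entry point (newer releases expose the same stack through torch.compile); the tiny linear model, loss, and random data are hypothetical stand-ins for the hf_Bert model and Yelp Reviews data used in the actual benchmark.

```python
import torch
import torch._dynamo as dynamo

# Hypothetical stand-ins for the real model/data in the benchmark.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

@dynamo.optimize("inductor")
def training_iter_fn(inputs, labels):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()   # backward is still driven by eager autograd; compiled backward graphs run as single ops
    optimizer.step()  # the optimizer step is captured by Dynamo along with the forward pass
    return loss

inputs, labels = torch.randn(8, 16), torch.randint(0, 2, (8,))
for _ in range(3):
    loss = training_iter_fn(inputs, labels)
```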

Ensure model training with the TorchDynamo stack converges

To make sure training converges, we check that the final loss obtained from the TorchDynamo stack is less than or equal to the loss of native PyTorch after the same number of iterations.
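
A self-contained sketch of that kind of convergence check (not the benchmark code itself): both runs start from identical weights and data, train for the same number of steps, and the compiled run's final loss must not be worse than eager's, modulo a small floating-point tolerance.

```python
import copy
import torch
import torch._dynamo as dynamo

torch.manual_seed(0)
loss_fn = torch.nn.CrossEntropyLoss()
x, y = torch.randn(64, 16), torch.randint(0, 2, (64,))  # hypothetical toy data

def run(model, compile_step, steps=50):
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    def step_fn():
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
        return loss

    if compile_step:
        step_fn = dynamo.optimize("inductor")(step_fn)
    for _ in range(steps):
        loss = step_fn()
    return loss.item()

base = torch.nn.Linear(16, 2)
eager_loss = run(copy.deepcopy(base), compile_step=False)
dynamo_loss = run(copy.deepcopy(base), compile_step=True)
assert dynamo_loss <= eager_loss + 1e-4  # small tolerance for floating-point noise
```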

E2E model training performance benchmark

Float type | Speedup w/o warm-up | Speedup w/ warm-up
FP32       | 1.36                | 1.67
AMP        | 1.7                 | 2.1

Since we already have the hf_Bert model in TorchBench, the speedup we observed on E2E training is strongly correlated with the one in TorchBench (1.12x for FP32 and 1.7x for AMP).

Summary & best practices

Right now the benchmark supports all TorchDynamo backends and all PyTorch optimizers, but only the hf_Bert model; we will add more models in the future.

It’s very straightforward to follow the E2E training benchmark to onboard your own models, but we recommend the following:

  • Use the Adam optimizer, which supports the capturable flag and avoids unnecessary graph breaks (see the sketch after this list).
  • TorchInductor takes some time to compile the graph, so running more epochs/iterations amortizes the compilation cost and yields a more noticeable overall acceleration. We set epochs = 100 when running the benchmark.
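
As a hedged sketch of the first recommendation: capturable=True keeps Adam's step counters as on-device tensors, which helps Dynamo capture optimizer.step() without graph breaks, but it requires parameters and optimizer state on a supported device such as CUDA.

```python
import torch

model = torch.nn.Linear(16, 2)  # hypothetical model
if torch.cuda.is_available():
    model = model.cuda()
    # capturable=True keeps the step counter on-device so optimizer.step()
    # can be captured without unnecessary graph breaks.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, capturable=True)
else:
    # capturable requires device (e.g. CUDA) tensors, so fall back to the default on CPU.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```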

Feel free to try it in your environment; feedback is welcome!

For the backward pass, I wonder if the graph optimization can also be done in eager mode, since we can get a backward graph.
If so, does TorchDynamo mainly speed up the forward step?

Actually there are two levels of optimization: TorchDynamo captures/generates better graphs, and TorchInductor converts the captured graphs into faster machine code.

For backward, it still triggers eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, and also runs the .backward() functions of any non-compiled eager ops. TorchDynamo captures graphs for forward() and optimizer.step(), but the TorchInductor backend can speed up forward(), backward(), and optimizer.step().
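
A minimal sketch of that division of labor (hypothetical tiny model, using torch._dynamo.optimize("inductor") as the entry point): the forward graph is captured by Dynamo and compiled by Inductor, while backward() is still dispatched by eager autograd, which executes the compiled backward graph as a single op.

```python
import torch
import torch._dynamo as dynamo

model = torch.nn.Linear(16, 2)                       # hypothetical tiny model
compiled_model = dynamo.optimize("inductor")(model)  # Dynamo captures, Inductor compiles

x = torch.randn(4, 16)
out = compiled_model(x).sum()  # forward: captured by Dynamo, compiled by Inductor
out.backward()                 # backward: eager autograd runs the compiled backward graph as one op
```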

@yanboliang thanks for your reply