Our previous posts introduced the idea of TorchInductor and shared some early but extremely promising inference and training results.
Since then, we have added E2E model training support to TorchDynamo + Inductor, which brings a 1.67x (FP32) / 2.1x (AMP) speedup over native PyTorch.
Why do we need the E2E model training benchmark?
Previous TorchDynamo + Inductor work has shown extremely promising results on both model inference and training performance, but we were lacking an end-to-end training performance benchmark on a real dataset to help users better understand the overall performance gain. So we added this E2E training benchmark, which runs the popular Hugging Face BERT model on the Yelp Reviews dataset.
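For context, the data and model setup roughly follows the standard Hugging Face fine-tuning recipe. The sketch below is only an illustration of that setup, not a copy of the benchmark code; the checkpoint name, tokenization options, and label count are assumptions:

```python
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Yelp Reviews: 5-way star-rating classification over review text.
dataset = load_dataset("yelp_review_full")

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(examples):
    # Pad/truncate so every example has the same sequence length.
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
)
```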
Difference between E2E training and the existing OSS training benchmark
The major difference is:
- The OSS training benchmark only tests the accuracy and speedup of the forward and backward passes on example inputs;
- The E2E training benchmark checks model loss convergence and the speedup of the whole training pipeline (forward + backward + optimizer.step) on a real dataset.
To close this gap, we added several new features and fixed bugs. Here are the highlighted changes:
- Support capturing optimizer.step.
- Add a new DefaultDictVariable, which is widely used across optimizers.
- Handle the torch.autograd.profiler.* context wrappers to avoid unnecessary graph breaks.
- Add missing torch.set_grad_enabled calls in the FX graph.
- Inductor supports alpha= in add/sub.
- Fix caching of simplify_loops results.
- GradModeVariable should guard on the initial grad state.
How to onboard your own E2E model training with the TorchDynamo stack?
Take a look at the training_iter_fn, which is accelerated by the TorchDynamo + Inductor stack; it includes the forward pass, the optimizer step, and the backward pass. Currently, both the forward pass and the optimizer step are captured/accelerated by Dynamo + Inductor. The backward pass still goes through eager's autograd engine, which runs each 'compiled backward' graph as if it were one op, and also runs any non-compiled eager ops' backward functions.
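As a minimal sketch (assuming the standalone `torchdynamo` package API of that era; recent PyTorch releases expose the same functionality via `torch._dynamo` / `torch.compile`, and `model`, `optimizer`, and `train_dataloader` here are placeholders):

```python
import torchdynamo  # standalone package; in newer PyTorch: from torch import _dynamo as torchdynamo

def training_iter_fn(batch, model, optimizer):
    # The forward pass and optimizer.step() are captured/accelerated by Dynamo + Inductor.
    outputs = model(**batch)
    loss = outputs.loss
    # The backward pass still goes through eager's autograd engine, which runs
    # each compiled backward graph as if it were a single op.
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss

# Compile the whole training iteration with the Inductor backend.
opt_training_iter_fn = torchdynamo.optimize("inductor")(training_iter_fn)

for batch in train_dataloader:
    loss = opt_training_iter_fn(batch, model, optimizer)
```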
Ensure model training with the TorchDynamo stack converges
To make sure the training loss converges, we check that the final loss obtained with the TorchDynamo stack is less than or equal to the loss of native PyTorch after the same number of iterations.
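A minimal sketch of that acceptance check (the `train_loop` helper and its arguments are hypothetical placeholders, not part of the benchmark code):

```python
# Run the same number of iterations with native PyTorch and with the
# TorchDynamo + Inductor stack, then compare the final training losses.
native_loss = train_loop(model_native, num_iterations)  # eager PyTorch
dynamo_loss = train_loop(model_dynamo, num_iterations)  # TorchDynamo + Inductor

assert dynamo_loss <= native_loss, "TorchDynamo stack did not converge as well as native PyTorch"
```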
E2E model training performance benchmark
| Precision | Speedup w/o warm-up | Speedup w/ warm-up |
| --- | --- | --- |
| FP32 | 1.36x | 1.67x |
| AMP | 1.7x | 2.1x |
Since the hf_Bert model is already in TorchBench, we can compare results: the speedup we observed on E2E training is strongly correlated with the one reported in TorchBench (1.12x for FP32 and 1.7x for AMP).
Summary & best practice
Right now the benchmark supports all TorchDynamo backends and all PyTorch optimizers, but only the hf_Bert model; we will add more models in the future.
It’s very straightforward to follow the E2E training benchmark to onboard your own models, but we recommend:
- Use the Adam optimizer, which supports the capturable flag and avoids unnecessary graph breaks (see the sketch after this list).
- TorchInductor takes some time to compile the graph, so running more epochs/iterations amortizes the compilation cost and yields a larger overall speedup. We set epochs = 100 when running the benchmark.
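For the first recommendation, a minimal sketch (the learning rate is a placeholder, and `model` comes from your own setup):

```python
import torch

# capturable=True keeps Adam's state (e.g. the step count) as device tensors,
# which lets optimizer.step() be captured rather than falling back to the
# Python-scalar path that would introduce extra graph breaks.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5, capturable=True)
```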
Feel free to try it in your environment; feedback is welcome!