TL;DR: Previously, torchdynamo disrupted compute-communication overlap in DDP badly enough that DDP training with dynamo was up to 25% slower than DDP training with eager. We modified dynamo to add additional graph breaks when DDP is detected, in order to restore opportunities for compute-communication overlap. With these changes, DDP with dynamo is never more than 1% slower than DDP with eager, and is up to 15% faster than eager on 64 GPUs when compiled with torchinductor. These results are based on benchmarks from 6 OSS models.
If you are new to TorchDynamo, the links below will help you catch up on the project so far. TorchDynamo generates FX graphs from Python bytecode, and various backends are integrated with TorchDynamo to complete inference/training of the model. In the future, with the help of a cost model, TorchDynamo could automate the selection of the best backend for each subgraph to achieve optimal performance.
- Update 1: An Experiment in Dynamic Python Bytecode Transformation
- Update 2: 1.48x Geomean Speedup on TorchBench CPU Inference
- Update 3: GPU Inference Edition
- Update 4: Lazy Tensors & nvFuser Experiments
- Update 5: Improved Capture and Bigger Graphs
- Update 6: Training support with AOTAutograd
- Update 7: Inference with FX2TRT
- Update 8: TorchDynamo passed correctness check on 7k+ github models
DDP (Distributed Data Parallel) is a tool for distributed training. It's used to synchronously train single-GPU models in parallel.
DDP training generally goes as follows:
- Each rank will start with an identical copy of a model. A rank is a process; different ranks can be on the same machine (perhaps on different gpus) or on different machines.
- Pass a different batch of data to each rank
- Run the forward pass (on each rank)
- Run the backward pass (on each rank). Now, the gradients on each rank will be different because different data was used on each rank
- Synchronize gradients with an allreduce call. An allreduce call will communicate across different ranks so that after the allreduce, the gradients on all the ranks will be identical (e.g. they will all be the average of the gradients across the various ranks)
- Run the optimizer
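To make the synchronization step concrete, here is a tiny plain-Python simulation of an averaging allreduce. There's no real process group here; each "rank" is just a list of per-parameter gradients:

```python
# Simulate step 5 (allreduce) from the list above, without a real process group.
# Each "rank" holds its own gradients; an averaging allreduce makes them identical.

def allreduce_mean(per_rank_grads):
    """Average gradients element-wise across ranks (what DDP's allreduce does)."""
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    averaged = [
        sum(rank[p] for rank in per_rank_grads) / world_size
        for p in range(num_params)
    ]
    # After the allreduce, every rank sees the same averaged gradients.
    return [list(averaged) for _ in per_rank_grads]

# Two ranks saw different batches, so their gradients differ before the allreduce.
rank0 = [1.0, 2.0, 3.0]
rank1 = [3.0, 4.0, 5.0]
synced = allreduce_mean([rank0, rank1])
print(synced)  # -> [[2.0, 3.0, 4.0], [2.0, 3.0, 4.0]]
```

After the allreduce, every rank applies the same averaged gradients in the optimizer step, keeping the model copies identical.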
In the figure above, we see that steps 4 and 5 are combined by allowing allreduces to start before the backward pass has completed. This speeds up the training process by overlapping some of the communication with the rest of the computation of the backward pass. This is how eager DDP training works today. By default, the allreduces are gathered into “buckets”, which are chosen via a heuristic that tries to produce 25MB buckets (the size is configurable).
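As a rough illustration of the bucketing idea, here's a hypothetical greedy sketch. This is a simplification, not DDP's actual heuristic (which walks parameters in roughly reverse registration order and has additional special cases):

```python
# Greedy sketch of DDP-style gradient bucketing (hypothetical simplification,
# not DDP's real implementation). Sizes are in bytes; the cap defaults to 25 MB.

def bucket_params(param_sizes, cap_bytes=25 * 1024 * 1024):
    buckets, current, current_bytes = [], [], 0
    # Reverse order approximates backward-pass order: gradients for the last
    # layers are ready first, so their bucket can be allreduced earliest.
    for idx, size in reversed(list(enumerate(param_sizes))):
        current.append(idx)
        current_bytes += size
        if current_bytes >= cap_bytes:
            buckets.append(current)
            current, current_bytes = [], 0
    if current:
        buckets.append(current)
    return buckets

# Three 10 MB params and one 20 MB param: the first bucket closes at ~30 MB.
mb = 1024 * 1024
print(bucket_params([20 * mb, 10 * mb, 10 * mb, 10 * mb]))  # -> [[3, 2, 1], [0]]
```

Each bucket's allreduce can fire as soon as all gradients in that bucket are ready, which is what enables the overlap shown in the figure.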
However, once we enable dynamo, it compiles the individual kernels into a single graph (or a small number of graphs, if graph breaks occur). Synchronization then can't begin until the entire backward pass has completed.
In the diagram above, we see that with naive usage of dynamo, DDP allreduces don’t start until the entire backward pass computation is finished. In many cases, the lost opportunity for communication/compute overlap can be more significant than the speedup provided by inductor.
@wconstab implemented the DDPOptimizer, which does the following:
- Detects if DDP is active
- If so, it identifies the DDP bucket size and splits the dynamo graph into subgraphs, so that the sum of the parameter sizes in each subgraph is roughly equal to the bucket size.
The heuristic used by the DDPOptimizer won’t always produce buckets that are identical to those produced by eager DDP; we are assuming that the eager DDP strategy heuristic isn’t perfect either, especially in cases where additional graph breaks may occur.
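Conceptually, the splitting works something like the sketch below. The names and structure are purely illustrative, not the real DDPOptimizer code:

```python
# Hypothetical sketch of DDPOptimizer-style graph splitting: walk the traced
# graph's nodes in order and emit a subgraph boundary whenever the parameters
# seen so far reach the DDP bucket size.

def split_graph(nodes, bucket_bytes):
    """nodes: list of (name, param_bytes) pairs. Returns lists of node names."""
    subgraphs, current, acc = [], [], 0
    for name, param_bytes in nodes:
        current.append(name)
        acc += param_bytes
        if acc >= bucket_bytes:
            subgraphs.append(current)
            current, acc = [], 0
    if current:
        subgraphs.append(current)
    return subgraphs

nodes = [("conv1", 12), ("relu", 0), ("conv2", 16), ("fc", 8)]
print(split_graph(nodes, bucket_bytes=25))  # -> [['conv1', 'relu', 'conv2'], ['fc']]
```

Each resulting subgraph is compiled separately, so a graph break falls roughly at each bucket boundary and the corresponding allreduce can overlap with the remaining backward computation.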
Without the DDPOptimizer, we can compare DDP+dynamo latency to DDP+eager latency, and we find that for >1 node, dynamo can sometimes perform as much as 25% worse than eager.
The figure above shows latency comparisons between eager and inductor on DDP training, without DDPOptimizer. For example, the ~25% slowdown in timm_VIT is based on ~1720ms eager latency on 64 gpus, compared to ~2300ms inductor latency on 64 gpus.
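The quoted slowdown follows directly from those latencies:

```python
# Reproduce the ~25% slowdown quoted for timm_VIT on 64 GPUs.
eager_ms = 1720     # eager DDP latency
inductor_ms = 2300  # inductor (no DDPOptimizer) latency

slowdown = 1 - eager_ms / inductor_ms
print(f"{slowdown:.1%}")  # -> 25.2%
```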
With DDPOptimizer enabled, we do the same comparison and see that DDP performs no more than 1% worse than eager, and is up to 15% faster in some of the 64-GPU configurations.
We can also compare the speedup that DDPOptimizer provides for each of the models. This chart compares DDP + dynamo latency to DDP + dynamo + DDPOptimizer latency:
We see that in most cases DDPOptimizer provides very little benefit (or even a slowdown) for the 1-node (i.e. 8-GPU) configuration, where communication takes less time. But for the multi-node configurations, where communication goes over a network, we see larger speedups, especially for larger models like hf_T5_large or timm_VIT.
We also ran correctness checks on the 6 OSS models, using the same multi-node setup used for the performance measurements. Note that these benchmarks were performed using a separate torchbench benchmark setup, so we wanted a minimal set of correctness checks on the same suite used for measuring performance. In these tests, 5/6 models passed; the remaining model (resnet50) also fails on non-distributed inductor tests.
- Hardware - AWS cluster, A100 40GB
- 8 nodes, 8 gpus per node = 64 GPUs
- EFA network configuration
- DDP run with static_graph=False
- Models (batch size):
- resnet50 (128) [Note: we also collected results with batch_size=32, where communication latency dominates; in this case, we see no speedup]
- hf_T5 (12)
- hf_T5_large (4)
- timm_vision_transformer_large (16)
- hf_GPT2_large (4)
- hf_Bert (32)
- Results are based on 19 samples taken over a 30-hour period, on a static set of nodes on the AWS cluster.
Static graph is an optimization for eager DDP. It relies on assumptions about the behavior of the program remaining the same - e.g. gradients for the same set of parameters must always be made available in the same order on each invocation. It allows a few optimizations:
- Re-ordering buckets to more accurately match the actual execution order
- Skipping the find_unused_parameters step, which usually needs to run each iteration to identify which parameters are needed in the backward pass.
- Activation checkpointing
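As a toy illustration of the first optimization (this is not DDP internals, just the idea), bucket re-ordering amounts to sorting buckets by when their gradients were observed to become ready on the first iteration:

```python
# Toy illustration of static-graph bucket re-ordering: record the order in
# which gradients became ready on iteration 1, then re-order buckets so that
# later iterations allreduce in that observed order.

def reorder_buckets(buckets, observed_grad_order):
    """Sort buckets by when their earliest gradient became ready."""
    first_ready = {g: i for i, g in enumerate(observed_grad_order)}
    return sorted(buckets, key=lambda b: min(first_ready[g] for g in b))

buckets = [["w0", "w1"], ["w2", "w3"]]  # initial heuristic order
observed = ["w3", "w2", "w1", "w0"]     # backward produces grads in reverse
print(reorder_buckets(buckets, observed))  # -> [['w2', 'w3'], ['w0', 'w1']]
```

This only works if the gradient ready-order really is the same on every iteration, which is exactly the static-graph assumption.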
In the 6 OSS models we tested, we don’t see any measurable impact on performance from static_graph=True (at least in eager mode). However, some other models are known to benefit from these optimizations.
Unfortunately, dynamo + DDP currently doesn't work with static_graph=True. (This is because DDP interprets any tracing as the first training step, during which it collects data about the run; subsequent iterations then fail some assertions.)
We expect that it should be possible to add some workarounds to support this - but currently, static_graph needs to be turned off to work with dynamo.
Cudagraphs show performance improvements in many scenarios, but also increase memory usage in others. Because DDPOptimizer creates additional graphs, it exacerbates these memory issues, so we expect many users will need to turn off cudagraphs to run with DDPOptimizer.
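If you hit memory issues, cudagraphs can be disabled via inductor's config. The exact flag has moved around between versions; this fragment assumes the `torch._inductor.config.triton.cudagraphs` flag that exists at the time of writing:

```python
import torch._inductor.config

# Disable cudagraphs to reduce memory pressure when using DDPOptimizer.
# Note: the flag's location/name may differ in your version of pytorch.
torch._inductor.config.triton.cudagraphs = False
```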
- FSDP - @wconstab and @aazzolini have already started investigating issues that arise when running dynamo with FSDP models.
- Better integration with DDP could provide support for static_graph=True, or further performance improvements. Currently, DDPOptimizer makes a best attempt at matching DDP's buckets, and then DDP re-buckets based on its own heuristics, which may not always match DDPOptimizer's; this can delay allreduce calls. If DDPOptimizer could instead provide its bucket choices directly to DDP, this wouldn't be a problem.
Huge thanks to @sanketpurandare; the benchmark scripts were adapted from the scripts and results of his initial work on DDP benchmarking. He also provided a lot of help debugging bad performance.
OSS DDP benchmarks can be run using the same script at:
python userbenchmark/ddp_experiments/__init__.py --job_dir /full/path/to/shared/directory --repeat 20
Note that the script is mostly targeted for use with a slurm-based setup.
Feel free to adapt these scripts for your own distributed benchmarking.
We’re working on merging this into the typical userbenchmark workflow for torchbench, so that testing can be automated. A big thanks goes to Xu Zhao for his help.
A few lessons we learned when running these experiments:
- Run experiments on the same slurm allocation to ensure that the same hardware is used for all measurements. This is especially important for distributed benchmarking, where network topology can significantly affect communication performance.
- Don’t build pytorch with DEBUG=1 when you want to run benchmarks! While this is generally true, it’s particularly true with distributed benchmarks; we found that NCCL on EFA performs very badly with DEBUG=1 (50x slower in some configurations).
- Watch out for issues with CPU/GPU affinity; we suspect that this caused issues with stragglers that may have increased noise in some measurements.