Enabling Float8 All-Gather in FSDP2

with Andrew Gu, Wanchao Liang, Driss Guessous, Vasiliy Kuznetsov, Brian Hirsh

TL;DR

  • We focus on float8 because it speeds up large GEMMs on H100s and saves network bandwidth with reduced message size.
  • We enabled float8 all-gather in FSDP2. Readers can find the training recipe for Llama3 in TorchTitan and the float8 dtype implementation in TorchAO/float8.
  • We observed a 1.50x speedup with float8 compared with bfloat16, with on-par numerics. 20% of the speedup comes from float8 all-gather, while the remaining 80% comes from float8 compute. The result was benchmarked by pretraining Llama3-70B on 128 H100s*.

* Meta’s H100s are customized on Grand Teton. Specifications might differ from public ones.

Float8 Training and Why Float8 All-Gather Matters

Float8 data types are natively supported in NVIDIA H100. Float8 training enablement can be divided into 2 parts: float8 compute and float8 communication.

Float8 compute: support float8 matrix multiplication with torch._scaled_mm. Unlike bfloat16, float8 requires both the raw tensor and its scale to preserve numeric accuracy. Users need to maintain scales in the training loop. There are different scaling strategies, including tensor-wise, row/col-wise, group-wise, and block-wise scaling. In this note, we focus on tensor-wise scaling and dynamic scaling [ref], where scales are computed from the current high-precision tensor.
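As a rough illustration of tensor-wise dynamic scaling, here is a minimal sketch in plain Python. It is not the TorchAO implementation: the helper names are made up for this example, the e4m3 format (largest finite value 448) is assumed, and the actual rounding to float8 is skipped. It only shows how one scale for the whole tensor is derived from the current AMAX.

```python
# Sketch only: tensor-wise dynamic scaling without real float8 rounding.
E4M3_MAX = 448.0  # largest finite value of float8 e4m3 (assumed target format)


def dynamic_tensorwise_scale(values, eps=1e-12):
    """Return one scale for the whole tensor, computed from its current AMAX."""
    amax = max(abs(v) for v in values)
    return E4M3_MAX / max(amax, eps)


def scale_and_saturate(values, scale):
    """Scale into the float8 range and clamp; a real cast would also round."""
    return [max(-E4M3_MAX, min(E4M3_MAX, v * scale)) for v in values]


weights = [0.5, -2.0, 0.25, 1.0]
scale = dynamic_tensorwise_scale(weights)   # AMAX is 2.0, so scale = 448 / 2
scaled = scale_and_saturate(weights, scale)
restored = [v / scale for v in scaled]      # dividing by the scale undoes it
```

The scaled values are what would be stored in float8, and the scale travels alongside them so the matmul can undo it.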

Float8 communication (Float8 all-gather): with float8 compute, doing all-gathers in float8 is almost a “free lunch” because the parameters must be cast to float8 either before or after the all-gather anyway. Casting before the all-gather saves 50% bandwidth (vs bfloat16) at the cost of one all-reduce for AMAX. Float8 can be applied to model weights, activations and gradients. We prioritized float8 weights since they are more stable numerically through the training loop and fit better with low-precision dtypes. We focus on Llama3 models in this note.
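The 50% figure follows directly from element sizes: bfloat16 uses 2 bytes per element and float8 uses 1. A back-of-the-envelope sketch, using the 4096 x 4096 attention projection shape of Llama3-8B as the example weight (the function name is made up for illustration):

```python
def all_gather_payload_bytes(numel, bytes_per_element):
    """Total bytes of parameter data moved to materialize the full weight."""
    return numel * bytes_per_element


numel = 4096 * 4096  # one attention projection weight in Llama3-8B

bf16_bytes = all_gather_payload_bytes(numel, 2)  # bfloat16: 2 bytes/element
fp8_bytes = all_gather_payload_bytes(numel, 1)   # float8: 1 byte/element
saving = 1 - fp8_bytes / bf16_bytes

print(f"bf16: {bf16_bytes / 2**20:.0f} MiB, fp8: {fp8_bytes / 2**20:.0f} MiB, "
      f"saving: {saving:.0%}")
```

The extra AMAX all-reduce moves only a scalar per weight, which is negligible in bytes; its cost is latency, not bandwidth.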


Applying FSDP2 to Llama3 with Float8 Weights

Float8 Model (code): PyTorch native float8 requires minimal changes to models. Taking the Llama3-8B model as an example, we convert the bfloat16 model to a float8 model by swapping every nn.Linear with a Float8Linear, so that we can perform float8 compute.

TransformerBlock(
    (attention): Attention(
        (wq/wk/wv/wo): Float8Linear(in=4096, out=4096, bias=False) 
    )
    (feed_forward): FeedForward(
        (w1/w2/w3): Float8Linear(in=4096, out=14336, bias=False)
    )
    (attention_norm / ffn_norm): RMSNorm()
)

Applying FSDP2 (code): The UX of wrapping a float8 model is the same as wrapping a bfloat16 model. To keep track of scales efficiently, we call precompute_float8_dynamic_scale_for_fsdp after the optimizer step, so we can get replicated scales for float8 casting before float8 all-gather.

# wrapping each TransformerBlock, then root model
# the UX is the same across float8 model and bfloat16 model
for transformer_block in model.layers.values():
    fully_shard(transformer_block)
fully_shard(model)

# training loop
# ...
optimizer.step()
# all-reduce AMAX for Float8Linear.weight
precompute_float8_dynamic_scale_for_fsdp(model)

FSDP2 extensions for float8 tensor subclass: We keep FSDP2 UX the same across bfloat16 models and float8 models because we implemented the float8 casting in FSDP2 extensions. The float8 linear module’s weight is a tensor subclass that knows how to cast to float8. We can customize the casting logic before and after all-gather, as shown by the following figure.

  • fsdp_pre_all_gather (code): casting the bfloat16 weight into a float8 weight according to the latest replicated AMAX/scale (requiring all-reduce). Note the bfloat16 weight here is sharded by 1/NGPU. Since we all-reduce to get the replicated AMAX and scale on all ranks, casting the sharded bfloat16 parameters to float8 before all-gather is equivalent to all-gathering bfloat16 parameters and then casting to float8.
  • fsdp_post_all_gather (code): constructing Float8Tensor from all-gathered float8 data and replicated scale so they are ready for float8 compute in forward and backward.
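The equivalence claimed for fsdp_pre_all_gather can be checked with a small simulation. This is a sketch in plain Python, not FSDP2 code: ranks are simulated with lists, all_reduce_max and cast are made-up stand-ins, and rounding to the nearest integer is used as a crude substitute for real float8 rounding.

```python
E4M3_MAX = 448.0  # assumed float8 e4m3 range


def local_amax(shard):
    return max(abs(v) for v in shard)


def all_reduce_max(value_per_rank):
    """Stand-in for all-reduce(MAX): every rank ends up with the global max."""
    return max(value_per_rank)


def cast(values, scale):
    """Stand-in for the float8 cast: scale, round, saturate (element-wise)."""
    return [max(-E4M3_MAX, min(E4M3_MAX, float(round(v * scale))))
            for v in values]


# One weight, sharded across 4 simulated ranks.
shards = [[0.1, -0.7], [1.5, 0.2], [-3.5, 0.9], [0.05, 2.0]]

# Replicated AMAX and scale, identical on every rank after the all-reduce.
global_amax = all_reduce_max([local_amax(s) for s in shards])
scale = E4M3_MAX / global_amax

# Path 1: cast each shard, then all-gather (concatenate).
cast_then_gather = [v for s in shards for v in cast(s, scale)]

# Path 2: all-gather the high-precision shards, then cast.
gather_then_cast = cast([v for s in shards for v in s], scale)

# Without the all-reduce, each rank would use a different scale.
local_scales = [E4M3_MAX / local_amax(s) for s in shards]
```

Both paths give the same result because the cast is element-wise once every rank shares the same scale; the per-rank local scales differ, which is why the all-reduce is needed.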

Deep Dive Into Performance

We discuss key optimizations in float8 to reach 1.50x speed up vs bfloat16.

Float8 Compute + Bfloat16 All-Gather (1.40x speed up, code): When swapping nn.Linear with Float8Linear, it’s possible to keep the bfloat16 weight as is. We simply treat Float8Linear like a normal nn.Linear and perform bfloat16 all-gather in FSDP2 (stream 22). Float8Linear.forward is responsible for both bfloat16-to-float8 casting and float8 matmul (stream 7). This approach achieved a 1.40x speed up and is a strong baseline to showcase the importance of float8 compute. However, it wastes 50% of the bandwidth by communicating bfloat16 parameters that are eventually cast to float8 during forward.

Float8 All-Gather with Individual AMAX All-Reduce (+0.02x on top of 1.40x, code): We perform float8 casting before all-gather to save 50% bandwidth (stream 22). As a result, Float8Linear.forward uses float8 weights directly without the need for casting (stream 7). However, float8 casting requires a global AMAX (the maximum absolute value), so we need to all-reduce the partial AMAX (a scalar) across N ranks (stream 22 and 35). Each float8 parameter requires 1 all-reduce. Those small all-reduces degraded overall performance.

Combined AMAX All-Reduce (+0.08x on top of 1.42x, code): We perform a single all-reduce for all float8 parameters after the optimizer step. As a result, we avoid small all-reduces inside FSDP hooks (stream 47). We achieved horizontal fusion by calculating AMAX for all float8 parameters at once.
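To make the difference between individual and combined AMAX all-reduces concrete, here is a small simulation in plain Python. It is a sketch, not the precompute_float8_dynamic_scale_for_fsdp implementation: ranks are simulated with a dict, and all_reduce_max is a made-up stand-in for an all-reduce(MAX) over equally sized vectors.

```python
def all_reduce_max(vector_per_rank):
    """Stand-in for one all-reduce(MAX): element-wise max across ranks."""
    return [max(column) for column in zip(*vector_per_rank)]


# Local AMAX of 3 float8 parameters on each of 2 simulated ranks.
local_amaxes = {
    0: [0.5, 2.0, 0.1],
    1: [1.5, 1.0, 0.3],
}
ranks = sorted(local_amaxes)
num_params = 3

# Individual: one small collective per parameter.
individual = []
num_collectives_individual = 0
for p in range(num_params):
    reduced = all_reduce_max([[local_amaxes[r][p]] for r in ranks])
    individual.append(reduced[0])
    num_collectives_individual += 1

# Combined: pack all local AMAX values into one vector, reduce once.
combined = all_reduce_max([local_amaxes[r] for r in ranks])
num_collectives_combined = 1
```

Both variants produce the same global AMAX values; the combined variant pays the collective launch latency once instead of once per parameter.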

SM contention between NCCL and Float8 compute: Depending on the NCCL version and the total number of SMs on the GPU, there are sometimes bubbles in float8 compute (stream 7). Both float8 compute (sm90_xmm) and float8 all-gather (ncclDevKernel) compete for SMs. The ideal case is to always prioritize float8 compute for layer k over float8 all-gather for layer k+1. Contention eases if NCCL uses fewer SMs (at the cost of slower communication) or if float8 compute uses fewer SMs. We find it useful to set NCCL_MAX_CTAS to 16 or 8 during benchmarking to resolve contention.
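One way to apply this setting from a Python entry point is sketched below. NCCL_MAX_CTAS is an NCCL environment variable; our assumption here is that it has to be in the environment before the NCCL communicators are created, so it is set at the very top of the script. Exporting it in the launching shell should work equally well.

```python
import os

# Cap the number of CTAs NCCL may use per communicator. Assumption: this is
# read when NCCL communicators are created, so set it before initializing
# the process group. 16 is one of the two values tried in this note.
os.environ["NCCL_MAX_CTAS"] = "16"

# ... initialize the process group and build the model after this point ...
```

The best value depends on the NCCL version and GPU, so treat 16 and 8 as starting points for benchmarking rather than defaults.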

Future Work
We are actively exploring the following directions (see more in the PyTorch roadmap):

Float8 in tensor parallel and pipeline parallel: for tensor parallel (including sequence parallel), we shard module input along sequence dim and would need float8 all-gather for inputs. For pipeline parallel, we are verifying if there are any performance gaps for float8.

Delayed scaling: Compared to dynamic scaling, delayed scaling gains perf by deriving AMAX from previous iterations. The cost is a potential loss of numerical accuracy. In practice, float8 weights remain stable across adjacent iterations. We want to support delayed scaling to reach full performance.
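The trade-off can be sketched in a few lines of plain Python. DelayedScale is a hypothetical helper written for this illustration, not an API from TorchAO; the history length and initial AMAX are arbitrary, and the e4m3 range is assumed.

```python
from collections import deque

E4M3_MAX = 448.0  # assumed float8 e4m3 range


class DelayedScale:
    """Hypothetical helper: derive the scale from AMAX of earlier iterations."""

    def __init__(self, history_len=4, initial_amax=1.0):
        self.history = deque([initial_amax], maxlen=history_len)

    def scale(self):
        # Only past AMAX values are used, so casting the current tensor does
        # not have to wait for a reduction over that tensor.
        return E4M3_MAX / max(self.history)

    def record(self, values):
        # Record the AMAX of the current tensor for use in later iterations.
        self.history.append(max(abs(v) for v in values))


ds = DelayedScale()
s0 = ds.scale()            # based on the initial AMAX of 1.0
ds.record([0.5, -2.0])
s1 = ds.scale()            # history max is now 2.0
current = [0.5, -4.0]      # values grew since the last recorded AMAX
stale_overflow = max(abs(v) for v in current) * s1 > E4M3_MAX
ds.record(current)
s2 = ds.scale()            # catches up one iteration later
```

Here stale_overflow is true: the stale scale pushes the new maximum outside the float8 range, so it would saturate. This is the accuracy risk mentioned above, and it stays small as long as AMAX changes slowly between iterations.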

Row-wise scaling: Compared to tensor-wise scaling, row-wise scaling preserves better numerical accuracy by having fine-grained scales for each row. The cost is the complexity in the backward, because matrices are transposed from row-wise to column-wise. It requires special treatment for float8 all-gather in FSDP2. This is still a highly exploratory direction.
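A small numeric sketch of why row-wise scales help and why the transpose complicates things. The helper names are made up for this example, the e4m3 range is assumed, and rounding is omitted.

```python
E4M3_MAX = 448.0  # assumed float8 e4m3 range


def tensorwise_scale(matrix):
    """One scale for the whole matrix."""
    return E4M3_MAX / max(abs(v) for row in matrix for v in row)


def rowwise_scales(matrix):
    """One scale per row."""
    return [E4M3_MAX / max(abs(v) for v in row) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


# Row 0 holds small values, row 1 holds large ones.
w = [[0.5, -0.25],
     [64.0, 8.0]]

t_scale = tensorwise_scale(w)            # dominated by the 64.0 outlier
r_scales = rowwise_scales(w)             # row 0 gets its own, larger scale
rt_scales = rowwise_scales(transpose(w)) # row-wise scales of the transpose
```

With the tensor-wise scale, row 0 only reaches 0.5 * 7 = 3.5 of the available 448; with its own scale it uses the full range. The row-wise scales of the transposed matrix differ from those of the original, which hints at why the backward pass cannot simply reuse the forward scales.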


I’m running this

 NGPU=4 CONFIG_FILE="./train_configs/llama3_3b.toml" ./run_llama_train.sh --float8.enable_float8_linear

and the training loop freezes at the model forward call, i.e.

pred = model(input_ids)

when using NGPU>1.

+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/llama3_3b.toml
+ overrides=
+ '[' 1 -ne 0 ']'
+ overrides=--float8.enable_float8_linear
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/llama3_3b.toml --float8.enable_float8_linear
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793]
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] *****************************************
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 15:35:20.973000 1110135 torch/distributed/run.py:793] *****************************************
[rank0]:2024-08-08 15:35:23,985 - root - INFO - Starting job: Llama 3 3B training
[rank0]:2024-08-08 15:35:25,681 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-08-08 15:35:25,683 - root - INFO - GPU capacity: NVIDIA H100 80GB HBM3 (0) with 79.11GiB memory
[rank0]:2024-08-08 15:35:25,683 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-08-08 15:35:25,685 - root - INFO - Building tiktoken tokenizer locally from /u/tmhoangt/tokenizers/llama3/original/tokenizer.model
[rank0]:2024-08-08 15:35:25,855 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2024-08-08 15:35:25,855 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2024-08-08 15:35:30,775 - root - INFO - Building llama3 3B with ModelArgs(dim=4096, n_layers=12, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_batch_size=32, max_seq_len=4096, depth_init=True, norm_type='rmsnorm')
[rank0]:2024-08-08 15:35:30,902 - root - INFO - lm_eval is not installed, GPTQ may not be usable
[rank0]:2024-08-08 15:35:30,907 - root - INFO - Float8 training active
[rank0]:2024-08-08 15:35:30,917 - root - INFO - Swapped to Float8Linear layers with enable_fsdp_float8_all_gather=False
[rank0]:2024-08-08 15:35:30,917 - root - INFO - Model llama3 3B size: 3,668,021,248 total parameters
[rank0]:2024-08-08 15:35:30,918 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-08-08 15:35:30,946 - root - INFO - Applied FSDP to the model
[rank0]:2024-08-08 15:35:34,855 - root - INFO - GPU memory usage for model: 3.44GiB(4.35%)
[rank0]:2024-08-08 15:35:34,857 - root - INFO - Training starts at step 1, with local batch size 2, global batch size 8, sequence length 4096, total steps 1000 (warmup 200)

The packages I use are built for CUDA 12.1:

 pip install https://download.pytorch.org/whl/nightly/cpu/torchdata-0.9.0.dev20240808%2Bcpu-cp311-cp311-linux_x86_64.whl
 pip install https://download.pytorch.org/whl/nightly/cu121/torchaudio-2.4.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl
 pip install https://download.pytorch.org/whl/nightly/cu121/torchvision-0.20.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl
 pip install https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2Bdedb7bdf33-cp311-cp311-linux_x86_64.whl
 pip install https://download.pytorch.org/whl/nightly/cu121/torch-2.5.0.dev20240808%2Bcu121-cp311-cp311-linux_x86_64.whl

Hi Tuan, thanks for bringing this up. Did you see any NCCL watchdog timeout, or a stack trace when hitting ctrl+c to terminate the job? I tried to reproduce locally but it went through. Would love to help more.

Hi Wei,
Thanks for reminding me of NCCL. It reminded me of the issue with CUDA 12.1, and I made the necessary change to get it working. I will try to reproduce the reported numbers on my side.

Nice! Looking forward to your benchmark results, and happy to discuss if you need anything.

Hi @weifengpy. We are happy to share that we achieved good performance with FP8 in larger-scale testing (256 H100s):

  • 50% throughput improvement on 70B and 405B llama3 architectures with TP
  • 30% throughput improvement on 8B model scale

We are deploying a run with a bigger dataset now. Attached is the loss parity between FP8 and BF16 on the 8B model.

Thanks for sharing the results! Glad that the speedup holds at larger scale.