Hey, I am here to give another dev update, this time for the PyTorch/XLA 2.2 release. It is a relatively short update compared to 2.1 because it only covers three months of development, and that window overlapped with a bunch of year-end vacations. We shared a pretty detailed roadmap for many features in my 2.1 update; we still plan to work on most of those, but I don’t want to copy the same things over. In this update I will focus on the features that I think will interest the community the most.
As before, this dev update overlaps with our release notes. The idea here is for me to share why we are making certain technical decisions and what our plan is for the next few months.
SPMD
SPMD continues to be the biggest area of development in the 2.2 release, since we were wrapping up a bunch of features for our big LLM push last year.
DTensor Integration
We have successfully integrated with PyTorch’s DTensor API, and we plan to shift most of our use cases to DTensor over time. The idea is that the user only needs to specify xla as the backend when creating the mesh. A sample code snippet looks something like:
import torch
import torch_xla.runtime as xr
from torch.distributed import DeviceMesh, Shard, distribute_tensor
# distribute_tensor now works with the `xla` backend using PyTorch/XLA SPMD.
world_size = xr.global_runtime_device_count()  # number of addressable XLA devices
mesh = DeviceMesh("xla", list(range(world_size)))
big_tensor = torch.randn(100000, 88)
my_dtensor = distribute_tensor(big_tensor, mesh, [Shard(0)])
FSDP via SPMD (FSDP v2)
The naming can be a bit confusing, so let me explain. PyTorch/XLA currently has an FSDP wrapper (API-wise it is very similar to FSDP from PyTorch upstream), and that wrapper will continue to exist. The idea behind FSDP v2 is simple: FSDP is a very user-friendly way of sharding large models, and it actually turns out to be very compute-efficient in many cases, so we decided to implement a very similar wrapper using the SPMD technology. Advantages of this approach (compared to the existing FSDP wrapper) include faster graph compilation time (since SPMD only needs to compile one graph per device) and more efficient collectives (the current FSDP wrapper cannot use asynchronous all-gather on TPUs). You can find more details in this doc.
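As a rough sketch (assuming the experimental SpmdFullyShardedDataParallel wrapper and the helper names below; my_module is a placeholder for your nn.Module), wrapping a model looks like:

import numpy as np
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
from torch_xla.experimental.spmd_fully_sharded_data_parallel import SpmdFullyShardedDataParallel as FSDPv2

xr.use_spmd()  # enable SPMD execution mode
num_devices = xr.global_runtime_device_count()
# The mesh needs an axis named 'fsdp' that weights and activations are sharded on.
mesh = xs.Mesh(np.array(range(num_devices)), (num_devices, 1), ('fsdp', 'model'))
# my_module is a placeholder nn.Module; wrap it like you would with regular FSDP.
sharded_model = FSDPv2(my_module, mesh=mesh)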
Sharding Device Visualization
We also added two APIs to visualize the device sharding, given either a sharded tensor or a sharding string. They can be very handy when you want to figure out the most efficient device placement. You can find more details in this doc.
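A minimal sketch of how the tensor-based visualization can be used, assuming the debugging helper lives under torch_xla.distributed.spmd.debugging (the mesh shape and tensor here are arbitrary):

import numpy as np
import torch
import torch_xla.core.xla_model as xm
import torch_xla.runtime as xr
import torch_xla.distributed.spmd as xs
from torch_xla.distributed.spmd.debugging import visualize_tensor_sharding

xr.use_spmd()
num_devices = xr.global_runtime_device_count()
mesh = xs.Mesh(np.array(range(num_devices)), (1, num_devices), ('x', 'y'))

t = torch.randn(8, 8, device=xm.xla_device())
xs.mark_sharding(t, mesh, (0, 1))  # shard dim 0 over axis 'x', dim 1 over axis 'y'
visualize_tensor_sharding(t)       # prints a per-device view of the shard placement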
(WIP) Auto Sharding
We have been working on bringing XLA’s auto-sharding to PyTorch/XLA. This feature will allow XLA to pick sharding annotations without any user sharding hints. The XLA auto-sharding service is based on published research work, Alpa. Our preliminary benchmarks on TPU look very promising; for more detail please take a look at this RFC. The current plan is to release this feature as experimental in the 2.3 release.
Usability
Persistent Compilation Cache
Long compilation time is one of the most noticeable usability issues for PyTorch/XLA. SPMD partially addresses the issue by reducing the number of graphs, but that is not enough. The persistent compilation cache serializes the compiled program and saves it to local disk, so the saved computation can be loaded the next time the model is run. Note that if you update the model code and it results in a different HLO, you still need to recompile. For more detail please take a look at this doc.
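Enabling the cache is a one-liner at startup; a minimal sketch, assuming the torch_xla.runtime.initialize_cache helper (the cache directory below is a placeholder):

import torch_xla.runtime as xr

# Point the runtime at a directory for the compilation cache; readonly=False
# lets this process both read and write cache entries.
xr.initialize_cache('/tmp/xla_compile_cache', readonly=False)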
Compilation/Execution analysis
One common complaint about PyTorch/XLA is that users don’t understand what is actually going on at the PyTorch/XLA level. This time we enhanced the PT_XLA_DEBUG=1 flag so that it dumps an analysis for every compilation and execution. A sample analysis looks like:
Execution Analysis: ================================================================================
Execution Analysis: Execution Cause
Execution Analysis: user mark_step
Execution Analysis: Graph Info:
Execution Analysis: Graph Hash: 537d4b0264b029688281412214d252e9
Execution Analysis: Number of Graph Inputs: 588
Execution Analysis: Number of Graph Outputs: 320
Execution Analysis: Python Frame Triggered Execution:
Execution Analysis: mark_step (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:840)
Execution Analysis: broadcast_master_param (/workspaces/dk2/pytorch/xla/torch_xla/core/xla_model.py:1230)
Async profile captures
Previously the user had to manually run a script in a separate terminal to capture a profile while the model was running, which is not very scalable. trace_detached was added to provide a way to automatically profile the model from within the training script.
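Roughly, usage looks like the sketch below (the port, log directory, and duration are placeholders, and this assumes the helpers in torch_xla.debug.profiler):

import torch_xla.debug.profiler as xp

# Start the profiler server inside the training script.
server = xp.start_server(9012)

# Kick off a trace capture in a background thread while training continues.
xp.trace_detached('localhost:9012', logdir='/tmp/profile', duration_ms=10000)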
Dynamo
(WIP) Single Step Graph
When tracing a model, Dynamo currently generates a single forward graph and produces the backward graph using AOTAutograd. Splitting a single step across multiple graphs brings performance and memory drawbacks compared to a single graph. We worked with the PyTorch team to propose using CompiledAutograd to generate the backward graph and inlining it with Dynamo. The RFC can be found here and the POC here. The current plan is to break the POC down and merge it gradually. We expect to merge some of the code changes into the 2.3 release, but the whole feature will likely take longer.
GPU
Multi-host training
We officially support multi-node training on GPU in the 2.2 release through the PJRT runtime and torchrun.
(WIP) GPU SPMD
The SPMD work has so far focused on TPU, but we want to bring more attention to GPU SPMD for the 2.3 release. The plan is to unblock the functionality in the 2.3 release and perform more detailed performance analysis/optimization in later releases.
(WIP) Benchmark
We are working on setting up daily torchbench runs on GPU (and TPU) to understand XLA:GPU and PyTorch/XLA:GPU performance against other compiler backends like TorchInductor. Throughout this process we fixed a bunch of issues in our benchmarking scripts and models. The plan is to set up the automatic benchmark, collect data, and publish it on a public dashboard by 2.3.
Core Aten Ops
As of the 2.2 release, PyTorch/XLA supports ~80% of the core aten ops. The goal is to increase that percentage to near 100% (except for a couple of dynamic ops) by the 2.3 release. This will enable better performance when training or running inference with a PyTorch/XLA model and significantly increase the model coverage for torch.export.
Checkpointing
Distributed checkpointing is supported using torch.distributed.checkpoint. To enable more advanced checkpointing features, the CheckpointManager class was introduced. Features enabled include per-step checkpoint management, asynchronous checkpointing, and auto-checkpointing on preemption. See the docs for a more complete description.
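A minimal sketch of the per-step flow, assuming CheckpointManager is importable from torch_xla.experimental.distributed_checkpoint (the checkpoint path, save interval, and the model/step variables are placeholders):

from torch_xla.experimental.distributed_checkpoint import CheckpointManager

# Take a checkpoint every 100 steps under the given (placeholder) directory.
chkpt_mgr = CheckpointManager('gs://my-bucket/checkpoints', save_interval=100)

state_dict = {'model': model.state_dict()}

# Resume from the most recent tracked step, if any checkpoints exist.
tracked_steps = chkpt_mgr.all_steps()
if tracked_steps:
    chkpt_mgr.restore(max(tracked_steps), state_dict)
    model.load_state_dict(state_dict['model'])

# Inside the training loop: save() only writes when `step` hits the interval.
chkpt_mgr.save(step, state_dict)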
Async Checkpointing
One key feature of the CheckpointManager API is async checkpointing, which allows training to continue while the checkpoint is written to persistent storage in the background. This is accessible through the CheckpointManager.save_async method.
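Continuing the sketch above, the async variant takes the same arguments as save:

# save_async returns quickly; the checkpoint data is staged off the device and
# the write to persistent storage happens on a background thread.
chkpt_mgr.save_async(step, state_dict)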
Auto Checkpointing on Preemption
When using a Cloud TPU with autocheckpointing enabled, the CheckpointManager class will take a checkpoint when a preemption is detected. The checkpoint will be identified by the step at which the preemption was detected.