Meta PyTorch Team 2024 H2 Roadmaps

We’ve been thinking about how to share the roadmaps for the work we are doing on PyTorch here at Meta. We do planning on a half-year basis so these are some public versions of our 2024 H2 OSS plans for a number of key areas within PyTorch.

Compiler Core
Compiler Deployment
Distributed
Core Libraries
Core Performance
Developer Infrastructure
torchtune
TorchRec
Torchvision
Edge
DataLoading

38 Likes

Great Roadmap! I’m very excited about [O2] in the Distributed section.

  • From what I’ve seen so far, FSDP is scaling very well on a 512/1k GPU scale (following LLM Foundry’s blogs). Since Meta is training on a 24k GPU cluster, do you think FSDP will automatically scale well on clusters (with good interconnect) outside of Meta, or will additional customization be required for distributed training? For example, if we want to scale beyond a 1k cluster.
  • Will you also support async checkpointing natively?
  • Does PyTorch have any plans to integrate changes from transformer_engine as they did for FSDP, TP, PP with DeepSpeed and/or Megatron-LM?

Sorry if these features are already available and I’m unaware of them. Additionally, what is 5DParallel = HSDP2 + Async SP + CP + PP? What does HSDP2 stand for? Is there any link or documentation available?

I’m not as excited about the torchtune stuff. I think this is too small a problem for the PyTorch team to solve (of course, it has the most impact and clout).

3 Likes

Typically for >1k GPUs, we start looking at composing with model parallelism techniques like tensor parallelism and pipeline parallelism to scale to a higher number of GPUs. These compositions are notationally expressed as 2D/3D parallel.
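To make the notation concrete, here’s a minimal sketch of composing FSDP2 with tensor parallelism over a 2-D DeviceMesh. The mesh shape, toy model, and sharding plan are all illustrative assumptions, not a recipe:

```python
# Minimal 2D-parallel sketch (illustrative shapes and module names).
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torch.distributed._composable.fsdp import fully_shard  # FSDP2

# 64 GPUs arranged as 8-way data parallel x 8-way tensor parallel.
mesh_2d = init_device_mesh("cuda", (8, 8), mesh_dim_names=("dp", "tp"))

model = nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))

# Shard the two linears across the "tp" mesh dimension...
parallelize_module(
    model,
    mesh_2d["tp"],
    {"0": ColwiseParallel(), "1": RowwiseParallel()},
)
# ...then shard the resulting parameters across "dp" with FSDP2.
fully_shard(model, mesh=mesh_2d["dp"])
```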

We support native async distributed checkpointing as of the PT 2.4 release. You can also look at examples in GitHub - pytorch/torchtitan: A native PyTorch Library for large model training, which is our reference framework for large-scale distributed training, including 2D and 3D parallel.
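For reference, here is roughly what the async checkpointing flow looks like in 2.4. This is a sketch; the checkpoint path is made up, and `model`/`optimizer` are assumed to already be set up under FSDP/DDP:

```python
# Sketch of async distributed checkpointing (prototype in PT 2.4).
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict

# Gather sharded model/optimizer state in DCP's canonical format.
model_sd, optim_sd = get_state_dict(model, optimizer)
state_dict = {"model": model_sd, "optim": optim_sd}

# async_save returns a Future; training continues while the checkpoint
# is staged and written in the background.
future = dcp.async_save(state_dict, checkpoint_id="checkpoints/step_1000")

# ... training steps ...

future.result()  # ensure the previous save finished before the next one
```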

We have FSDP + native FP8 integration in torchtitan now. This also composes with 2D parallel. We don’t have a plan for TE yet.
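If I recall the torchtitan setup correctly, the FP8 path goes through torchao’s float8 prototype; a hedged sketch (the entry point is experimental and may change):

```python
# Sketch: float8 training via torchao's prototype, as used by torchtitan.
# torchao must be installed; the API below is prototype-level.
from torchao.float8 import convert_to_float8_training

# Swap eligible nn.Linear layers for float8 training variants
# *before* applying FSDP, so sharding sees the converted modules.
convert_to_float8_training(model)
```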

This is just a notation we are using to denote multiple ways to split a model and compose them with various data parallel and model parallel techniques.

HSDP2 is the HSDP implementation on top of FSDP2 (the per-parameter-sharding version of FSDP). torchtitan is the best place to see the actual code in action.
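Concretely, HSDP2 is just fully_shard over a 2-D mesh; a minimal sketch (the host/GPU counts are illustrative):

```python
# Sketch: HSDP2 = FSDP2's fully_shard over a 2-D mesh that replicates
# across one dimension and shards parameters within the other.
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard  # FSDP2

# e.g. 4 hosts x 8 GPUs: replicate across hosts, shard within each host.
mesh = init_device_mesh("cuda", (4, 8), mesh_dim_names=("replicate", "shard"))
fully_shard(model, mesh=mesh)
```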

2 Likes

The roadmaps are appreciated! Thanks, @gottbrath.

I’d be curious to learn more about, and contribute to, any upcoming scientific PyTorch features. I’m especially interested in areas like complex numbers, integration, and interpolation. I’d also love to hear more about any forthcoming changes to optim or distributions.

In addition, it’d be nice if the core members (@smth, @ezyang, @albanD, @Lezcano, etc.) could articulate their current vision for scientific PyTorch. I am especially interested to learn where the core members currently draw the line between features provided by PyTorch, first-party PyTorch packages (e.g., TorchAudio and TorchVision), and third-party PyTorch packages (e.g., PyTorch-Geometric and TorchMetrics), and the reasoning behind that thinking as it relates to infrastructural issues like compilation.

We have been depending on the torchdata datapipes in our new GNN dataloader for PyTorch, called GraphBolt. It would be nice if the datapipes were not removed from torchdata, or if there were an easy way to migrate from datapipes to whatever new design will be implemented.

1 Like

Hey @0x00b1 great to hear from you again. What features are you interested in contributing to?

Broadly speaking, we don’t have goals for the next few months around areas like complex numbers or interpolation. However, we’re always excited to partner on how PyTorch can better serve the scientific community. There’s much happening at the intersection of hybrid modeling, integration, higher-order gradients, etc. My personal hope is that driving performance, usability, and composability wins within generative AI modeling will help provide improved tooling to accelerate exciting sub-fields such as foundation models for multi-omic biology.

That said, I’d love to collaborate on specific features! Feel free to submit ideas via Issues or reach out on Slack.

1 Like

Hi @mfbalin thanks for raising this issue.

Happy to follow up on this offline to figure out the best way forward here such that we don’t break what you’ve built!
Could you ping me on the pytorch slack and we can continue the discussion there?

@0x00b1 this is a hard question you have here!
For the scientific PyTorch part, I would defer to @jisaacso 's great answer above.

My personal view on how and when we use different repos is as follows:

  • pytorch/pytorch is for features that are a) useful for almost all of our users, b) battle-tested and stable, and c) something we are ready to support for the next 5 years.
  • pytorch/* is for a variety of purposes, all driven (or once driven) by core contributors. First, extensions for a large group of users: by domain (vision, audio, rl, torchrec), use case (torchx, torchtune, torchtitan), or backend (executorch, xla, cpuinfo). Second, exploration (data, ao) and supporting infra repos (tutorials, examples, rfcs, test-infra, builder, pytorch.github.io). And many more (we have 99 repos in the org right now, haha).
  • third-party packages: basically the same as above, with the following delta: remove “driven (or once driven) by core contributors” and add even more categories and diversity in the repos: Ecosystem | PyTorch (which lists only a small shortlist).

Defining specific rules for these boundaries is quite hard, as there are specific circumstances and other constraints (related to infrastructure, technical dependencies, lack of extension points, etc.) that also weigh on the final decision. So it is usually made on a case-by-case basis.
Overall, though, I think we are pushing hard to keep the core components small, stable, and equipped with all the necessary extension points, and to recommend that users leverage (and participate in building) the very rich Python ecosystem instead of making PyTorch into a monorepo that must contain every feature/idea/option.

Hope that answers your question a little bit!

3 Likes

torch.compile’s cold start time is really a big deal.

KR 4.2: Cold startup times improved
TorchTitan (3x-4x reduction with hierarchical compilation)

Is there any material about hierarchical compilation?

@strint here’s a slide deck about Hierarchical Compilation with torch.compile: Hierarchical Compilation - Google Slides
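The deck is the authoritative source, but a minimal sketch of the underlying idea may help in the meantime: compile the repeated block rather than the whole model, so identical layers reuse one compiled artifact. Sizes and the toy block below are illustrative:

```python
# Sketch: regional compilation of a repeated block, so compilation cost
# is paid once for the block rather than for the full model graph.
import torch
import torch.nn as nn

# Depending on the PyTorch version, this flag may be needed so that
# identical module instances share one compiled artifact.
torch._dynamo.config.inline_inbuilt_nn_modules = True

class Block(nn.Module):
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        return x + self.ff(x)

model = nn.Sequential(*[Block() for _ in range(32)])

# Compile each region in place; the 32 identical blocks hit the same
# compiled code instead of triggering 32 separate cold starts.
for block in model:
    block.compile()
```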

5 Likes

Thanks a lot for the reply.

1 Like

Nothing for torchaudio? =(

@gottbrath Thank you for this release and for sharing Meta’s internal vision/OKRs.

To summarize, we are at an important moment where the compiler stack (core) will be tightly integrated with the 5D + EP parallelism scheme to support LLMs (torchtitan).

With the enhancement of DTensor functionality for 5D parallelism + EP, how would the team like to integrate it with the compiler?

Here is an example: the Graphcore team developed an IO_Overlap pass and partitioner with a sophisticated code topology and liveness analyzer (the key author recently joined AMD), plus a cost-model IP (a CPU device to simulate hardware).

2 Likes

Hi @awsaf49 thank you for your reply. Out of curiosity, are there specific features that you wanted for torchaudio?

@yiakwy-xpu-ml-framew,

Thanks for the comment.

@gnadathur can give more specific feedback. But I would venture to say that we are working towards two goals for our PyTorch-native parallelism building blocks:

  1. composable with one another
  2. traceable with the PT2 compiler stack

I do think that is an area where we would welcome collaboration. Were you thinking along these lines?

A further-out goal is automatic or analytically guided partitioning. It sounds like your work, or proposed work, is perhaps more in this space. Would love to follow up and learn more.

1 Like

Hi @yiakwy-xpu-ml-framew

Thanks for the comment. Our current approach to compiling distributed workloads is to focus on regional compilation of the DTensor portions of the graph (see: Meta PyTorch Team 2024 H2 Roadmaps - #9 by smth). One such example is the async tensor parallel implementation. We are seeing good gains from this approach in both WPS and compile-time reduction for distributed.
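Roughly, torchtitan’s enablement path for async TP looks like the following. Both knobs are prototype/private APIs and may change; `tp_mesh` stands in for the tensor-parallel submesh from your own parallelization setup:

```python
# Sketch of async-TP enablement, mirroring torchtitan's prototype path.
import torch
from torch.distributed._symmetric_memory import enable_symm_mem_for_group

# Private inductor knob: micro-pipeline (overlap) TP collectives.
torch._inductor.config._micro_pipeline_tp = True

# Register the TP process group for symmetric-memory communication.
enable_symm_mem_for_group(tp_mesh.get_group().group_name)

compiled_model = torch.compile(model)  # regional or full, per the above
```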

Full-model compilation of various compositions of parallelism is a work in progress as we work through gaps in tracing and optimizing the distributed graph. There is no definitive ETA yet.

Thanks for the comment.

To add more color on “full model compilation of various compositions of parallelism”:

  • For 1D parallelism (DDP or FSDP): we are shipping Compiled DDP (prototype) in PyTorch 2.4 and Compiled FSDP2 (prototype) in 2.5 or 2.6 (see the sketch after this list).
  • For 2D parallelism (DDP/FSDP + TP): we aim to have a functional prototype by the end of 2024; the exact ship date is still TBD.
  • For 3D parallelism (DDP/FSDP + TP + PP): the core API for PP is still being iterated on. Once Compiled 2D is ready, we will start working on +PP.
  • For composability with other parallelisms: once the above foundational features are ready, we will start working on it. It would be great to learn from you on which combination of parallelisms is most important for your use case.
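As referenced in the first bullet, here is a minimal sketch of turning on the Compiled DDP prototype. The config flag is prototype-level and may change, and the model is a stand-in:

```python
# Sketch: Compiled DDP prototype (PT 2.4). The "python_reducer" mode
# replaces DDP's C++ reducer with a traceable Python one so that
# torch.compile can capture and optimize the whole training step.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

torch._dynamo.config.optimize_ddp = "python_reducer"

model = nn.Linear(1024, 1024).cuda()  # stand-in for a real model
ddp_model = DDP(model)
compiled_model = torch.compile(ddp_model)
```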
1 Like