The Meta team is happy to make our 2025 H1 roadmaps available. We plan on a half-year basis and optimize globally across the work we do for our users here at Meta and across the wider community. We've removed a few references to internal systems and teams, but are eager to share the bulk of what we are planning this half to encourage robust technical collaboration across the PyTorch community on these topics.
On the topic of collaboration, we are excited to highlight that we have a new Discord server to make collaborating on PyTorch Edge easier. If you are building on-device AI applications or want to contribute to PyTorch Edge, come join our Discord server.
I haven’t heard that anyone is looking at that specifically.
Is there a technical discussion (RFC, dev-discuss thread, or Slack thread) where this has been proposed and folks have thought through any pros and cons?
While cudaMallocManaged offers convenient automatic memory management, I’d strongly advise against using it everywhere. When GPU memory gets saturated, UVM has to perform costly double transfers, evicting pages to CPU before bringing in new ones. This effectively halves your memory bandwidth. For DL workloads that fit in GPU memory (which is most cases), explicit placement consistently outperforms UVM since there are no page faults and access patterns remain predictable.
Instead, leverage PyTorch’s MemPool API for selectively using different allocators in the same PyTorch program.
Use UVM where it makes sense: Let's say you have a big embedding table and you are just crafting a model for now and don't want to think about how to shard the table for optimal performance. You could create a custom allocator using cudaMallocManaged with a preferred CPU location (one way to construct such an allocator is sketched after the snippet below) and get your model working first. After that, you can think about how to optimize the placement of your embedding table:
# uvm_allocator: a pluggable allocator backed by cudaMallocManaged (see the sketch below)
pool = torch.cuda.MemPool(uvm_allocator)
with torch.cuda.use_mem_pool(pool):
    # Allocations made inside this context are served by the UVM-backed pool
    embedding_matrix = torch.rand(...).cuda()
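For reference, uvm_allocator above is not a built-in. One way to construct it (a minimal sketch, not the exact implementation from the linked examples) is with torch.cuda.memory.CUDAPluggableAllocator, pointed at a small compiled shim whose entry points call cudaMallocManaged/cudaFree (plus cudaMemAdvise for the preferred CPU location). The shim name uvm_alloc.so and the entry-point names uvm_malloc/uvm_free are illustrative:

import torch

# Assumption: uvm_alloc.so is a separately compiled CUDA shim exporting
#   void* uvm_malloc(ssize_t size, int device, cudaStream_t stream)  // cudaMallocManaged + cudaMemAdvise
#   void  uvm_free(void* ptr, ssize_t size, int device, cudaStream_t stream)  // cudaFree
uvm_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    "uvm_alloc.so", "uvm_malloc", "uvm_free"
)
# Depending on the PyTorch version, MemPool may expect the underlying
# allocator handle rather than the wrapper object:
uvm_allocator = uvm_alloc.allocator()

The same plug-in pattern can back the nccl_allocator and multicast_host_allocator pools below, with the shim wrapping the corresponding allocation calls (e.g. ncclMemAlloc/ncclMemFree) instead.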
NVLS for Fast Multi-GPU Reduction: On systems with NVSwitch, you can tap into NCCL’s memory allocator for zero-copy reductions. This is especially useful when overlapping GEMMs with communication kernels, where there is resource contention (SMs, Copy Engines).
# nccl_allocator: an allocator backed by NCCL's memory allocator (ncclMemAlloc/ncclMemFree)
pool = torch.cuda.MemPool(nccl_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
# Register the pool with the process group so the collective can take the zero-copy NVLS path
dist.register_mem_pool(pool)
dist.all_reduce(tensor)  # Uses NVLS path
As seen from the following experiment (thanks to @kwen2501!), using the allocator selectively improved the performance of both the communication and compute kernels. Keep in mind that being selective here is important: ncclMemAlloc can allocate more than the requested size, so if you used it everywhere, you would quickly run into OOM.
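To make the selectivity point concrete, here is a minimal sketch (reusing the nccl_allocator handle from the snippet above; pool registration and the collective itself are as shown there) that scopes only communication buffers to the NCCL-backed pool:

import torch

# Only tensors that participate in collectives come from the NCCL-backed pool;
# since ncclMemAlloc can allocate more than requested, keeping its use narrow
# avoids wasting memory on tensors that never take the NVLS path.
comm_pool = torch.cuda.MemPool(nccl_allocator)
with torch.cuda.use_mem_pool(comm_pool):
    grad_bucket = torch.empty(1024 * 1024, device="cuda")

# Everything else stays on PyTorch's default caching allocator.
activations = torch.empty(4096, 4096, device="cuda")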
Extended GPU Memory (EGM)-based all-gather: On systems that support EGM, if you are all-gathering a tensor from peer GPUs to the host CPU (e.g. when checkpointing a model during training), you can allocate the gather's output list with an EGM allocator, and the gather will then go over high-bandwidth NVLink:
# multicast_host_allocator: an EGM-backed allocator for the host-side output buffers
pool = torch.cuda.MemPool(multicast_host_allocator)
with torch.cuda.use_mem_pool(pool):
    output_list = [torch.empty_like(local_shard) for _ in range(size)]
dist.gather(local_shard, output_list)  # Uses EGM path
To summarize, for standard deep learning workloads, stick with PyTorch's default caching allocator - it's battle-tested and optimized for DL memory patterns, with features like expandable segments and smart block reuse. When you want to further optimize your model's memory access patterns, you can selectively use a different allocator for the part of your model that needs it via the torch.cuda.MemPool API. You can find full examples of the above snippets here for now; more documentation to follow.
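As a side note on the default allocator: expandable segments can be enabled through the documented PYTORCH_CUDA_ALLOC_CONF setting, for example:

import os

# Must be set before the first CUDA allocation in the process; exporting
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the shell works as well.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1024, 1024, device="cuda")  # served by the (expandable) caching allocator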
In 2024, we will have built a lot of new APIs like FSDP2, PP, CP and composed them as 1D-4D parallelism.
you mean in 2025?
Looking forward to the fault-tolerant training implementation.
For the distributed section, I think it would help the community a lot if, in addition to this roadmap, the PyTorch team could regularly benchmark TP, PP, CP, etc. on a large cluster setup (which is usually not available to mere mortals).
Also, lately, converting a torch distributed checkpoint to an HF checkpoint has become extremely painful. NVIDIA has apparently decided not to contribute to that for the sake of their NeMo framework. It would be really beneficial for the community if there were starter code and/or an implementation for converting distributed checkpoints to HF Transformers checkpoints.
Huge props for this:
- Develop Efficient MoE inference.
- Harden PT-D parallelism API and improve adoption.
In the developer infra doc, O[3] mentions PEP 759, which has been withdrawn (see here).
Yes, that is unfortunate, but it was deemed not the best way forward.
We are still looking into alternatives to achieve the same goal, and you can follow the wheelnext project here for details: wheelnext · GitHub
The speed-of-light analysis is quite interesting – I think it’s something we can share publicly so I’ll encourage the engineer who did it to post here!