The Meta team is happy to make our 2025 H1 roadmaps available. We plan on a half-year basis and optimize globally across the work we do for our users here at Meta and across the wider community. We've removed a few references to internal systems and teams, but are eager to share the bulk of what we are planning this half to encourage robust technical collaboration across the PyTorch community on these topics.
On the topic of collaboration, we are excited to highlight that we have a new Discord server to make collaborating on PyTorch Edge easier. If you are building on-device AI applications or want to contribute to PyTorch Edge, come join our Discord server.
I haven’t heard that anyone is looking at that specifically.
Is there a technical discussion (RFC, dev-discuss thread, or Slack thread) where this has been proposed and folks have thought through any pros and cons?
While cudaMallocManaged offers convenient automatic memory management, I’d strongly advise against using it everywhere. When GPU memory gets saturated, UVM has to perform costly double transfers, evicting pages to CPU before bringing in new ones. This effectively halves your memory bandwidth. For DL workloads that fit in GPU memory (which is most cases), explicit placement consistently outperforms UVM since there are no page faults and access patterns remain predictable.
Instead, leverage PyTorch’s MemPool API for selectively using different allocators in the same PyTorch program.
Use UVM where it makes sense: Let's say you have a big embedding table and you are just crafting a model for now and don't want to think about how to shard the table for optimal performance. You could create a custom allocator using cudaMallocManaged with a preferred CPU location (one way to construct such an allocator is sketched after the snippet below) and get your model working first. After that, you can think about how to optimize the placement of your embedding table:
# uvm_allocator: a pluggable allocator backed by cudaMallocManaged (see the sketch below)
pool = torch.cuda.MemPool(uvm_allocator)
with torch.cuda.use_mem_pool(pool):
    # Allocations made inside this context are served by the UVM-backed pool
    embedding_matrix = torch.rand(...).cuda()
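For reference, uvm_allocator above is not a built-in. One way to construct it (a minimal sketch, not the exact implementation from the linked examples) is with torch.cuda.memory.CUDAPluggableAllocator, pointed at a small compiled shim whose entry points call cudaMallocManaged/cudaFree (plus cudaMemAdvise for the preferred CPU location). The shim name uvm_alloc.so and the entry-point names uvm_malloc/uvm_free are illustrative:

import torch

# Assumption: uvm_alloc.so is a separately compiled CUDA shim exporting
#   void* uvm_malloc(ssize_t size, int device, cudaStream_t stream)  // cudaMallocManaged + cudaMemAdvise
#   void  uvm_free(void* ptr, ssize_t size, int device, cudaStream_t stream)  // cudaFree
uvm_alloc = torch.cuda.memory.CUDAPluggableAllocator(
    "uvm_alloc.so", "uvm_malloc", "uvm_free"
)
# Depending on the PyTorch version, MemPool may expect the underlying
# allocator handle rather than the wrapper object:
uvm_allocator = uvm_alloc.allocator()

The same plug-in pattern can back the nccl_allocator and multicast_host_allocator pools below, with the shim wrapping the corresponding allocation calls (e.g. ncclMemAlloc/ncclMemFree) instead.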
NVLS for Fast Multi-GPU Reduction: On systems with NVSwitch, you can tap into NCCL’s memory allocator for zero-copy reductions. This is especially useful when overlapping GEMMs with communication kernels, where there is resource contention (SMs, Copy Engines).
# nccl_allocator: an allocator backed by NCCL's memory allocator (ncclMemAlloc/ncclMemFree)
pool = torch.cuda.MemPool(nccl_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
# Register the pool with the process group so the collective can take the zero-copy NVLS path
dist.register_mem_pool(pool)
dist.all_reduce(tensor)  # Uses NVLS path
As seen from the following experiment (thanks to @kwen2501!), using the allocator selectively improved the performance of both the communication and compute kernels. Keep in mind that being selective here is important: ncclMemAlloc can allocate more than the requested size, so if you used it everywhere, you would quickly run into OOM.
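To make the selectivity point concrete, here is a minimal sketch (reusing the nccl_allocator handle from the snippet above; pool registration and the collective itself are as shown there) that scopes only communication buffers to the NCCL-backed pool:

import torch

# Only tensors that participate in collectives come from the NCCL-backed pool;
# since ncclMemAlloc can allocate more than requested, keeping its use narrow
# avoids wasting memory on tensors that never take the NVLS path.
comm_pool = torch.cuda.MemPool(nccl_allocator)
with torch.cuda.use_mem_pool(comm_pool):
    grad_bucket = torch.empty(1024 * 1024, device="cuda")

# Everything else stays on PyTorch's default caching allocator.
activations = torch.empty(4096, 4096, device="cuda")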
Extended GPU Memory (EGM)-based all-gather: On systems that support EGM, if you are all-gathering a tensor from peer GPUs to the host CPU (e.g. when checkpointing a model during training), you can allocate the gather's output list with an EGM allocator, and the gather will then go over high-bandwidth NVLink:
# multicast_host_allocator: an EGM-backed allocator for the host-side output buffers
pool = torch.cuda.MemPool(multicast_host_allocator)
with torch.cuda.use_mem_pool(pool):
    output_list = [torch.empty_like(local_shard) for _ in range(size)]
dist.gather(local_shard, output_list)  # Uses EGM path
To summarize, for standard deep learning workloads, stick with PyTorch's default caching allocator - it's battle-tested and optimized for DL memory patterns, with features like expandable segments and smart block reuse. When you want to further optimize your model's memory access patterns, you can selectively use a different allocator for the part of your model that needs it via the torch.cuda.MemPool API. You can find full examples of the above snippets here for now; more documentation to follow.
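As a side note on the default allocator: expandable segments can be enabled through the documented PYTORCH_CUDA_ALLOC_CONF setting, for example:

import os

# Must be set before the first CUDA allocation in the process; exporting
# PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True in the shell works as well.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

x = torch.empty(1024, 1024, device="cuda")  # served by the (expandable) caching allocator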
In 2024, we will have built a lot of new APIs like FSDP2, PP, CP and composed them as 1D-4D parallelism.
you mean in 2025?
Looking forward to the fault-tolerant training implementation.
For the distributed section, I think it would help the community a lot if, in addition to this roadmap, the PyTorch team could regularly benchmark TP, PP, CP, etc. on a large cluster setup (which is usually not available to mere mortals).
Also, lately, converting a torch distributed checkpoint to an HF checkpoint has become extremely painful. NVIDIA has apparently decided not to contribute to that for the sake of their NeMo framework. It would be really beneficial for the community if there were starter code and/or an implementation for converting distributed checkpoints to HF Transformers checkpoints.
Huge props for this:
- Develop Efficient MoE inference.
- Harden PT-D parallelism API and improve adoption.
In the developer infra doc, O[3] mentions PEP 759, which has been withdrawn (see here).
Yes, that is unfortunate, but it was deemed not the best way forward.
We are still looking into alternatives to achieve the same goal, and you can follow the wheelnext project here for details: wheelnext · GitHub
The speed-of-light analysis is quite interesting – I think it’s something we can share publicly so I’ll encourage the engineer who did it to post here!