Hi everyone,
As we head into the PyTorch 2.12 release cycle, this is a reminder for feature owners: please file a release-highlight issue for every feature you want tracked in the release.
IMPORTANT: Any feature not going through this process will NOT be mentioned in the release blog or other release communications.
How to submit a feature
Please either:
- Open a new issue on pytorch/pytorch using the New Feature for Release template, or
- Tag an existing RFC / tracking issue with the release-feature-request label.
For each feature, include:
- What will be available in release 2.12 (API surface, stability tag: Stable / Beta / Prototype)
- Link(s) to relevant tutorials (new or updated)
- A short blog-post write-up (2-3 paragraphs, usable as-is in the release blog)
- Any platform or backend caveats (CUDA / ROCm / XPU / MPS / CPU-Arm)
The running list of tracked features lives here: https://github.com/pytorch/pytorch/issues?q=label%3Arelease-feature-request+is%3Aissue
2.12 Features Identified by AI
The following major features were identified by an AI-assisted scan of the release/2.11 → release/2.12 commit diff. This list is a starting point, not authoritative: descriptions and stability tags have not been reviewed by the feature owners.
Note: If your team is responsible for one of the features below, please submit a New Feature for Release issue so we can track the official description, tutorial, and blog write-up. Corrections and missing features are equally welcome; reply on this thread or open an issue.
Compile & Tracing
- User-Stream Support in Inductor: torch.compile now traces and respects user-created CUDA streams, including stream.synchronize(), stream.record_event(), stream.wait_event(), multi-stream buffer reuse, and combo-kernel fusion across stream boundaries (see the stream sketch after this list).
- Helion + torch.compile Integration: Codegen hooks for fusion-aware autotuning in external Triton template backends, with per-template prologue/epilogue fusion flags, enabling the Helion compiler to participate in Inductor's scheduling and fusion pipeline.
- Opaque Object Tracing: Objects like Generator, DeviceMesh, and ProcessGroup can now be traced through Dynamo, AOT Autograd, and Inductor as "opaque objects"; they can be graph inputs/outputs and survive partitioning, which is critical for compiling distributed and RNG-dependent code.
- Opaque Kernels / AOT Autograd Opaque Objects: Custom ops can register kernels with opaque types, and the AOT Autograd graph partitioner tolerates opaque objects flowing across partitions.
- Stateless RNG APIs: New side-effect-free uniform / normal entry points that take an explicit generator state, enabling pure random sampling inside compiled / replayable code paths.
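To make the user-stream item concrete, here is a minimal sketch. Every stream and event call below is an existing torch.cuda API; the 2.12 change is that torch.compile traces through them instead of graph-breaking:

```python
import torch

# Minimal sketch: overlapping two matmuls on separate streams inside a
# compiled region. The new 2.12 behavior is that torch.compile traces
# these stream/event ops rather than falling back to eager.
@torch.compile
def overlapped_matmuls(x, y):
    side = torch.cuda.Stream()
    a = x @ x                        # runs on the current (default) stream
    with torch.cuda.stream(side):
        b = y @ y                    # runs on the user-created side stream
        done = side.record_event()   # stream.record_event()
    torch.cuda.current_stream().wait_event(done)  # join before combining
    return a + b

if torch.cuda.is_available():
    x = torch.randn(512, 512, device="cuda")
    y = torch.randn(512, 512, device="cuda")
    out = overlapped_matmuls(x, y)
```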
Distributed
- DTensor Pipeline Parallelism: First-class pipeline parallelism built on DTensor, covering the metadata foundation, DTensor-aware stage/schedule refactoring, and integration tests. Also includes expanded sharding propagation rules for conv, LayerNorm, RMSNorm, interpolate, and index ops.
- Twice-Differentiable DTensor Redistribution: DTensor redistribution now supports double-backward, enabling higher-order gradients through distributed collectives (see the sketch after this list).
- FSDP Copy-Engine All-Gather: An all-gather path that uses the GPU copy engine, reducing compute-SM pressure during parameter materialization.
- NCCL Comm Suspend/Resume + Memory Stats: New APIs to suspend and resume NCCL communicators and query their memory usage, aimed at fault-tolerance and long-running-job use cases.
- Store::barrier / TCPStore BARRIER: A first-class barrier() on Store and a BARRIER command in TCPStoreClient reduce sync round-trips during rendezvous.
- FlightRecorder ncclx + gloo Backends: FlightRecorder trace analysis now supports ncclx and gloo alongside nccl.
- Symmetric Memory is_symm_mem_tensor: Public check for whether a tensor was allocated in symmetric memory.
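A minimal sketch of the twice-differentiable redistribution, assuming a 1-D mesh and a torchrun launch; the mesh shape and placements here are illustrative, not the feature's exact test setup:

```python
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

# Run under torchrun so WORLD_SIZE is set and a process group can form.
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))
x = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
y = x.redistribute(mesh, [Replicate()])                  # distributed collective
loss = (y * y).sum()
(g,) = torch.autograd.grad(loss, x, create_graph=True)   # first-order grad
(gg,) = torch.autograd.grad(g.sum(), x)                   # second-order grad: the 2.12 addition
```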
Attention & Inference
- Variable-Length Attention for Inference: Adds paged-attention support (page_table, seqused_k), an out variant, a GQA flag, FLOP registration, and AOTInductor integration for variable-length attention, targeting inference serving workloads.
- CK SDPA Varlen Attention: The Composable Kernel SDPA backend gained varlen-attention support on ROCm.
- FlexAttention enable_gqa on Varlen: varlen_attn now takes an enable_gqa flag for grouped-query attention (sketch below).
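For context, this is what grouped-query attention looks like through the existing scaled_dot_product_attention flag; per the item above, varlen_attn is expected to accept the same enable_gqa flag, though its exact signature is not shown here:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention: 8 query heads share 2 KV heads.
q = torch.randn(2, 8, 128, 64)   # (batch, q_heads, seq, head_dim)
k = torch.randn(2, 2, 128, 64)   # (batch, kv_heads, seq, head_dim)
v = torch.randn(2, 2, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```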
Accelerator Frontend
- Unified torch.accelerator Graph/Stream API: New torch.accelerator.Graph for device-agnostic capture/replay, is_capturing() on c10::Stream / torch.Stream, and XPU coverage of the same surface, bringing cross-backend parity to stream and graph management.
- torch.cond with CUDA Graphs: Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs.
- torch.cuda.graph(enable_annotations=…): New kwarg on the CUDA Graph capture context to inject NVTX-style annotations into captured graphs (see the sketch after this list).
- CUDA Green Context Workqueue Limit: Green Context now supports a workqueue limit, giving finer-grained control over resource partitioning.
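A sketch of the capture/replay flow the annotation kwarg hooks into. CUDAGraph and torch.cuda.graph are existing APIs; enable_annotations is the new 2.12 kwarg, with the name taken from the feature list above and the final signature unverified:

```python
import torch

if torch.cuda.is_available():
    static_in = torch.randn(64, 64, device="cuda")
    static_in @ static_in        # warm up cuBLAS before capture
    torch.cuda.synchronize()

    g = torch.cuda.CUDAGraph()
    # 2.12 (per the item above): torch.cuda.graph(g, enable_annotations=True)
    with torch.cuda.graph(g):
        static_out = static_in @ static_in

    static_in.copy_(torch.randn(64, 64, device="cuda"))
    g.replay()  # re-runs the captured kernels against the updated static input
```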
Inductor Backends
- Inductor CUTLASS FP8 / FP4: The CUTLASS backend gains float8_e5m2 support, and FP4 block-scaled GEMM now uses nvMatmulHeuristics for kernel selection (see the sketch after this list).
- Inductor XPU GEMM via SYCL-TLA: Intel GPU GEMM in Inductor can now target SYCL-TLA (Intel CUTLASS), closing the gap with the CUDA CUTLASS backend.
- Inductor TMA for Lazy Triton Kernel Compilation: Inductor's lazy Triton kernel compilation path now supports TMA descriptors on Hopper.
- Inductor DeviceInterface for TPU (Pallas): Initial DeviceInterface for TPU alongside Pallas backend improvements (scalar prefetch/indirect access, strided access via reshape, tiling, scalar outputs, copy elimination).
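A sketch of how a model would opt a compiled GEMM into the CUTLASS backend where the new FP8 support applies. The config knobs are existing Inductor settings; whether CUTLASS actually wins the autotune depends on your build and GPU:

```python
import torch
import torch._inductor.config as inductor_config

# Let max-autotune consider CUTLASS alongside the Triton and ATen backends.
inductor_config.max_autotune_gemm_backends = "CUTLASS,TRITON,ATEN"

@torch.compile(mode="max-autotune")
def gemm(a, b):
    return a @ b
```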
Profiler
- Kineto NCCL Collective Sequence Numbers: NCCL collective traces now carry a seq_num field, making it easier to align traces across ranks.
- Profiler events() Flow / Activity Metadata: events() now returns flow IDs, flow types, activity types, and unfinished / Python events, enabling richer post-hoc analysis (sketch below).
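A sketch of iterating events(); the name/time fields below are long-standing, while the flow and activity-type fields are the new 2.12 additions (exact attribute names unverified here):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(128, 128) @ torch.randn(128, 128)

for evt in prof.events():
    # In 2.12, per the item above, each event additionally exposes flow
    # and activity metadata alongside these existing fields.
    print(evt.name, evt.cpu_time_total)
```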
Export
- torch.export Expanded dtype Serde: torch.export.save now supports torch.uint32 / torch.uint64, and float8_e8m0fnu round-trips through export serde (sketch below).
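A minimal sketch of the expanded serde, exporting and saving a program with a torch.uint32 input (the dtype support is the 2.12 addition described above):

```python
import torch
from torch.export import export, save

class Passthrough(torch.nn.Module):
    def forward(self, x):
        return x.clone()

# Export with a uint32 example input, then serialize it.
ep = export(Passthrough(), (torch.zeros(4, dtype=torch.uint32),))
save(ep, "passthrough.pt2")  # uint32/uint64 now serialize per the item above
```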
NN / Frontend
- torch.backends.python_native (Native DSL): New surface under torch.backends landed as part of the "Native DSL" work.
- Fused Adagrad Python Frontend: The Python frontend for the fused Adagrad optimizer landed.
- torch.nn.init on Meta / 0-Element Tensors: Init methods (including trunc_normal_) now work on meta-device and zero-element tensors (sketch below).
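A short sketch of the new init behavior; previously these calls raised:

```python
import torch
import torch.nn as nn

# Meta-device tensor: shape-only, no storage. Init is now accepted.
w_meta = torch.empty(256, 256, device="meta")
nn.init.trunc_normal_(w_meta)

# Zero-element tensor: init is now a well-defined no-op.
w_empty = torch.empty(0, 8)
nn.init.kaiming_uniform_(w_empty)
```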
Platform Enablement
- ROCm Expandable Segments + rocSHMEM: Expandable memory segments on AMD GPUs (ROCm ≥ 7.02) and rocSHMEM enablement for on-GPU communication (see the sketch after this list).
- ROCm hipSPARSELt Broader Coverage + FP8 2:4 Sparsity: Expanded hipSPARSELt integration, including FP8 semi-structured sparsity on ROCm ≥ 7.12.
- ROCm Inductor FlexAttention Pipelining: Pipelined FlexAttention on AMD GPUs.
- XPU FMA-Based addcdiv Lowering: Fused multiply-add lowering for addcdiv on Intel GPUs.
- MPS Metal-4 Offline Shader Compilation: Apple Silicon binary wheels ship ahead-of-time-compiled Metal-4 shaders, reducing first-run latency.
- Arm Full AArch64 PR CI: Every pull request now runs AArch64 tests on Graviton 4 / Neoverse V2.
- Elastic Windows stdout/stderr Redirects: torchrun-style elastic launches now support stdout/stderr redirects on Windows.
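For the ROCm expandable-segments item, a sketch of opting in via the allocator env var; PYTORCH_HIP_ALLOC_CONF is the existing ROCm analogue of PYTORCH_CUDA_ALLOC_CONF, and the allocator reads it once before the first allocation:

```python
import os

# Must be set before torch allocates anything, hence before the import.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

x = torch.randn(1024, 1024, device="cuda")  # served from expandable segments
```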
2.12 Release Timeline
Official dates from the PyTorch Release 2.12 Key Dates post:
| Milestone | What | Date |
|---|---|---|
| M1 | Release announcement | 27 Mar 2026 |
| M2 | All PRs landed in pytorch/pytorch / Feature submission closed | 10 Apr 2026 |
| M3.1 | Release branch cut | 13–15 Apr 2026 (done) |
| M4 | Release branch finalized, final launch date announced, feature classifications published; final RC produced | Week of 27 Apr 2026 |
| M4.1 | Tutorial drafts submission deadline | 6 May 2026 |
| M5 | External-facing content finalized | 8 May 2026 |
| M6 | Release Day | 13 May 2026 |
How to help
- Feature owners: file a release-highlight issue (template above) with your blog blurb + tutorial links.
- Tutorial authors: PRs to pytorch/tutorials with the 2.12-release label are tracked against M4.1.
- Everyone else: please review the feature list above and reply in this thread if something major is missing, particularly for MPS, Quantization, and CPU/Arm, where the diff-based scan caught few candidates.
If you have questions, reach out to us in the replies to this post.
Cheers, Team PyTorch