Hi everyone,
As we head into the PyTorch 2.12 release cycle, this is a reminder for feature owners: please file a release-highlight issue for every feature you want tracked in the release.
IMPORTANT: Any feature not going through this process will NOT be mentioned in the release blog or other release communications.
How to submit a feature
Please either:
- Open a new issue on pytorch/pytorch using the New Feature for Release template, or
- Tag an existing RFC / tracking issue with the release-feature-request label.
For each feature, include:
- What will be available in release 2.12 (API surface, stability tag: Stable / Beta / Prototype)
- Link(s) to relevant tutorials (new or updated)
- A short blog-post write-up (2-3 paragraphs, usable as-is in the release blog)
- Any platform or backend caveats (CUDA / ROCm / XPU / MPS / CPU-Arm)
The running list of tracked features lives here: https://github.com/pytorch/pytorch/issues?q=label%3Arelease-feature-request+is%3Aissue
2.12 Features Identified by AI
The following major features were identified by an AI-assisted scan of the release/2.11 → release/2.12 commit diff. This list is a starting point, not authoritative: descriptions and stability tags have not been reviewed by the feature owners.
Note: If your team is responsible for one of the features below, please submit a New Feature for Release issue so we can track the official description, tutorial, and blog write-up. Corrections and missing features are equally welcome; reply on this thread or open an issue.
Compile & Tracing
- User-Stream Support in Inductor: torch.compile now traces and respects user-created CUDA streams, including stream.synchronize(), stream.record_event(), stream.wait_event(), multi-stream buffer reuse, and combo-kernel fusion across stream boundaries (see the stream sketch after this list).
- Helion + torch.compile Integration: Codegen hooks for fusion-aware autotuning in external Triton template backends, with per-template prologue/epilogue fusion flags, enabling the Helion compiler to participate in Inductor's scheduling and fusion pipeline.
- Opaque Object Tracing: Objects like Generator, DeviceMesh, and ProcessGroup can now be traced through Dynamo, AOT Autograd, and Inductor as "opaque objects"; they can be graph inputs/outputs and survive partitioning, which is critical for compiling distributed and RNG-dependent code.
- Opaque Kernels / AOT Autograd Opaque Objects: Custom ops can register kernels with opaque types, and the AOT Autograd graph partitioner tolerates opaque objects flowing across partitions.
- Stateless RNG APIs: New side-effect-free uniform / normal entry points that take an explicit generator state, enabling pure random sampling inside compiled / replayable code paths.
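To make the user-stream item concrete, here is a minimal sketch. Every stream and event call below is an existing torch.cuda API; the 2.12 change is that torch.compile traces through them instead of graph-breaking:

```python
import torch

# Minimal sketch: overlapping two matmuls on separate streams inside a
# compiled region. The new 2.12 behavior is that torch.compile traces
# these stream/event ops rather than falling back to eager.
@torch.compile
def overlapped_matmuls(x, y):
    side = torch.cuda.Stream()
    a = x @ x                        # runs on the current (default) stream
    with torch.cuda.stream(side):
        b = y @ y                    # runs on the user-created side stream
        done = side.record_event()   # stream.record_event()
    torch.cuda.current_stream().wait_event(done)  # join before combining
    return a + b

if torch.cuda.is_available():
    x = torch.randn(512, 512, device="cuda")
    y = torch.randn(512, 512, device="cuda")
    out = overlapped_matmuls(x, y)
```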
Distributed
- DTensor Pipeline Parallelism: First-class pipeline parallelism built on DTensor, covering the metadata foundation, DTensor-aware stage/schedule refactoring, and integration tests. Also includes expanded sharding propagation rules for conv, LayerNorm, RMSNorm, interpolate, and index ops.
- Twice-Differentiable DTensor Redistribution: DTensor redistribution now supports double-backward, enabling higher-order gradients through distributed collectives (see the sketch after this list).
- FSDP Copy-Engine All-Gather: An all-gather path that uses the GPU copy engine, reducing compute-SM pressure during parameter materialization.
- NCCL Comm Suspend/Resume + Memory Stats: New APIs to suspend and resume NCCL communicators and query their memory usage, aimed at fault-tolerance and long-running-job use cases.
- Store::barrier / TCPStore BARRIER: A first-class barrier() on Store and a BARRIER command in TCPStoreClient reduce sync round-trips during rendezvous.
- FlightRecorder ncclx + gloo Backends: FlightRecorder trace analysis now supports ncclx and gloo alongside nccl.
- Symmetric Memory is_symm_mem_tensor: Public check for whether a tensor was allocated in symmetric memory.
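A minimal sketch of the twice-differentiable redistribution, assuming a 1-D mesh and a torchrun launch; the mesh shape and placements here are illustrative, not the feature's exact test setup:

```python
import os
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

# Run under torchrun so WORLD_SIZE is set and a process group can form.
mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))
x = distribute_tensor(torch.randn(8, 8, requires_grad=True), mesh, [Shard(0)])
y = x.redistribute(mesh, [Replicate()])                  # distributed collective
loss = (y * y).sum()
(g,) = torch.autograd.grad(loss, x, create_graph=True)   # first-order grad
(gg,) = torch.autograd.grad(g.sum(), x)                   # second-order grad: the 2.12 addition
```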
Attention & Inference
- Variable-Length Attention for Inference: Adds paged-attention support (page_table, seqused_k), an out variant, a GQA flag, FLOP registration, and AOTInductor integration for variable-length attention, targeting inference serving workloads.
- CK SDPA Varlen Attention: The Composable Kernel SDPA backend gained varlen-attention support on ROCm.
- FlexAttention enable_gqa on Varlen: varlen_attn now takes an enable_gqa flag for grouped-query attention (sketch below).
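For context, this is what grouped-query attention looks like through the existing scaled_dot_product_attention flag; per the item above, varlen_attn is expected to accept the same enable_gqa flag, though its exact signature is not shown here:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention: 8 query heads share 2 KV heads.
q = torch.randn(2, 8, 128, 64)   # (batch, q_heads, seq, head_dim)
k = torch.randn(2, 2, 128, 64)   # (batch, kv_heads, seq, head_dim)
v = torch.randn(2, 2, 128, 64)
out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```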
Accelerator Frontend
- Unified torch.accelerator Graph/Stream API: New torch.accelerator.Graph for device-agnostic capture/replay, is_capturing() on c10::Stream / torch.Stream, and XPU coverage of the same surface, bringing cross-backend parity to stream and graph management.
- torch.cond with CUDA Graphs: Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs.
- torch.cuda.graph(enable_annotations=…): New kwarg on the CUDA Graph capture context to inject NVTX-style annotations into captured graphs (see the sketch after this list).
- CUDA Green Context Workqueue Limit: Green Context now supports a workqueue limit, giving finer-grained control over resource partitioning.
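A sketch of the capture/replay flow the annotation kwarg hooks into. CUDAGraph and torch.cuda.graph are existing APIs; enable_annotations is the new 2.12 kwarg, with the name taken from the feature list above and the final signature unverified:

```python
import torch

if torch.cuda.is_available():
    static_in = torch.randn(64, 64, device="cuda")
    static_in @ static_in        # warm up cuBLAS before capture
    torch.cuda.synchronize()

    g = torch.cuda.CUDAGraph()
    # 2.12 (per the item above): torch.cuda.graph(g, enable_annotations=True)
    with torch.cuda.graph(g):
        static_out = static_in @ static_in

    static_in.copy_(torch.randn(64, 64, device="cuda"))
    g.replay()  # re-runs the captured kernels against the updated static input
```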
Inductor Backends
- Inductor CUTLASS FP8 / FP4: The CUTLASS backend gains float8_e5m2 support, and FP4 block-scaled GEMM now uses nvMatmulHeuristics for kernel selection (see the sketch after this list).
- Inductor XPU GEMM via SYCL-TLA: Intel GPU GEMM in Inductor can now target SYCL-TLA (Intel CUTLASS), closing the gap with the CUDA CUTLASS backend.
- Inductor TMA for Lazy Triton Kernel Compilation: Inductor's lazy Triton kernel compilation path now supports TMA descriptors on Hopper.
- Inductor DeviceInterface for TPU (Pallas): Initial DeviceInterface for TPU alongside Pallas backend improvements (scalar prefetch/indirect access, strided access via reshape, tiling, scalar outputs, copy elimination).
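A sketch of how a model would opt a compiled GEMM into the CUTLASS backend where the new FP8 support applies. The config knobs are existing Inductor settings; whether CUTLASS actually wins the autotune depends on your build and GPU:

```python
import torch
import torch._inductor.config as inductor_config

# Let max-autotune consider CUTLASS alongside the Triton and ATen backends.
inductor_config.max_autotune_gemm_backends = "CUTLASS,TRITON,ATEN"

@torch.compile(mode="max-autotune")
def gemm(a, b):
    return a @ b
```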
Profiler
- Kineto NCCL Collective Sequence Numbers: NCCL collective traces now carry a seq_num field, making it easier to align traces across ranks.
- Profiler events() Flow / Activity Metadata: events() now returns flow IDs, flow types, activity types, and unfinished / Python events, enabling richer post-hoc analysis (sketch below).
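A sketch of iterating events(); the name/time fields below are long-standing, while the flow and activity-type fields are the new 2.12 additions (exact attribute names unverified here):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.randn(128, 128) @ torch.randn(128, 128)

for evt in prof.events():
    # In 2.12, per the item above, each event additionally exposes flow
    # and activity metadata alongside these existing fields.
    print(evt.name, evt.cpu_time_total)
```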
Export
- torch.export Expanded dtype Serde: torch.export.save now supports torch.uint32 / torch.uint64, and float8_e8m0fnu round-trips through export serde (sketch below).
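A minimal sketch of the expanded serde, exporting and saving a program with a torch.uint32 input (the dtype support is the 2.12 addition described above):

```python
import torch
from torch.export import export, save

class Passthrough(torch.nn.Module):
    def forward(self, x):
        return x.clone()

# Export with a uint32 example input, then serialize it.
ep = export(Passthrough(), (torch.zeros(4, dtype=torch.uint32),))
save(ep, "passthrough.pt2")  # uint32/uint64 now serialize per the item above
```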
NN / Frontend
- torch.backends.python_native (Native DSL): New surface under torch.backends landed as part of the "Native DSL" work.
- Fused Adagrad Python Frontend: The Python frontend for the fused Adagrad optimizer landed.
- torch.nn.init on Meta / 0-Element Tensors: Init methods (including trunc_normal_) now work on meta-device and zero-element tensors (sketch below).
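A short sketch of the new init behavior; previously these calls raised:

```python
import torch
import torch.nn as nn

# Meta-device tensor: shape-only, no storage. Init is now accepted.
w_meta = torch.empty(256, 256, device="meta")
nn.init.trunc_normal_(w_meta)

# Zero-element tensor: init is now a well-defined no-op.
w_empty = torch.empty(0, 8)
nn.init.kaiming_uniform_(w_empty)
```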
Platform Enablement
- ROCm Expandable Segments + rocSHMEM: Expandable memory segments on AMD GPUs (ROCm ≥ 7.02) and rocSHMEM enablement for on-GPU communication (see the sketch after this list).
- ROCm hipSPARSELt Broader Coverage + FP8 2:4 Sparsity: Expanded hipSPARSELt integration, including FP8 semi-structured sparsity on ROCm ≥ 7.12.
- ROCm Inductor FlexAttention Pipelining: Pipelined FlexAttention on AMD GPUs.
- XPU FMA-Based addcdiv Lowering: Fused multiply-add lowering for addcdiv on Intel GPUs.
- MPS Metal-4 Offline Shader Compilation: Apple Silicon binary wheels ship ahead-of-time-compiled Metal-4 shaders, reducing first-run latency.
- Arm Full AArch64 PR CI: Every pull request now runs AArch64 tests on Graviton 4 / Neoverse V2.
- Elastic Windows stdout/stderr Redirects: torchrun-style elastic launches now support stdout/stderr redirects on Windows.
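For the ROCm expandable-segments item, a sketch of opting in via the allocator env var; PYTORCH_HIP_ALLOC_CONF is the existing ROCm analogue of PYTORCH_CUDA_ALLOC_CONF, and the allocator reads it once before the first allocation:

```python
import os

# Must be set before torch allocates anything, hence before the import.
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

x = torch.randn(1024, 1024, device="cuda")  # served from expandable segments
```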
2.12 Release Timeline
Official dates from the PyTorch Release 2.12 Key Dates post:
| Milestone | What | Date |
|---|---|---|
| M1 | Release announcement | 27 Mar 2026 |
| M2 | All PRs landed in pytorch/pytorch / Feature submission closed | 10 Apr 2026 |
| M3.1 | Release branch cut | 13–15 Apr 2026 (done) |
| M4 | Release branch finalized, final launch date announced, feature classifications published; final RC produced | Week of 27 Apr 2026 |
| M4.1 | Tutorial drafts submission deadline | 6 May 2026 |
| M5 | External-facing content finalized | 8 May 2026 |
| M6 | Release Day | 13 May 2026 |
How to help
- Feature owners: file a release-highlight issue (template above) with your blog blurb + tutorial links.
- Tutorial authors: PRs to pytorch/tutorials with the 2.12-release label are tracked against M4.1.
- Everyone else: please review the feature list above and reply in this thread if something major is missing, particularly for MPS, Quantization, and CPU/Arm, where the diff-based scan caught few candidates.
If you have questions, reach out to us in the replies to this post.
Cheers, Team PyTorch