Reminder – Call for Features: PyTorch 2.12

Hi everyone,

As we head into the PyTorch 2.12 release cycle, this is a reminder to feature owners to file a release-highlight issue for every feature you want tracked in the release.

IMPORTANT: Any feature not going through this process will NOT be mentioned in the release blog or other communications.

How to submit a feature

Please either:

  1. Open a new issue on pytorch/pytorch using the New Feature for Release template, or
  2. Tag an existing RFC / tracking issue with the release-feature-request label.

For each feature, include:

  • What will be available in release 2.12 (API surface, stability tag: Stable / Beta / Prototype)
  • Link(s) to relevant tutorials (new or updated)
  • A short blog-post write-up (2-3 paragraphs, usable as-is in the release blog)
  • Any platform or backend caveats (CUDA / ROCm / XPU / MPS / CPU-Arm)

The running list of tracked features lives here: https://github.com/pytorch/pytorch/issues?q=label%3Arelease-feature-request+is%3Aissue

2.12 Features Identified by AI

The following major features were identified by an AI-assisted scan of the release/2.11 → release/2.12 commit diff. This list is a starting point, not authoritative: descriptions and stability tags have not been reviewed by the feature owners.

Note: If your team is responsible for one of the features below, please submit a New Feature for Release issue so we can track the official description, tutorial, and blog write-up. Corrections and missing features are equally welcome β€” reply on this thread or open an issue.

Compile & Tracing

  • User-Stream Support in Inductor – torch.compile now traces and respects user-created CUDA streams, including stream.synchronize(), stream.record_event(), stream.wait_event(), multi-stream buffer reuse, and ComboKernel fusion across stream boundaries (see the sketch after this list).
  • Helion + torch.compile Integration – Codegen hooks for fusion-aware autotuning in external Triton template backends, with per-template prologue/epilogue fusion flags, enabling the Helion compiler to participate in Inductor's scheduling and fusion pipeline.
  • Opaque Object Tracing – Objects like Generator, DeviceMesh, and ProcessGroup can now be traced through Dynamo, AOT Autograd, and Inductor as "opaque objects": they can be graph inputs/outputs and survive partitioning, which is critical for compiling distributed and RNG-dependent code.
  • Opaque Kernels / AOT Autograd Opaque Objects – Custom ops can register kernels with opaque types, and the AOT Autograd graph partitioner tolerates opaque objects flowing across partitions.
  • Stateless RNG APIs – New side-effect-free uniform / normal entry points that take an explicit generator state, enabling pure random sampling inside compiled / replayable code paths.
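
As a concrete illustration of the user-stream item, here is a minimal sketch, assuming the 2.12 tracing behavior described above; the stream and event APIs shown are the existing public surface, and only their handling inside torch.compile is new.

```python
import torch

s = torch.cuda.Stream()

@torch.compile
def step(x, y):
    # Work issued on a user-created stream; 2.12 Inductor is described as
    # tracing and respecting this instead of graph-breaking on it.
    with torch.cuda.stream(s):
        a = x.relu()
    # Order the default stream after the side stream before consuming `a`.
    torch.cuda.current_stream().wait_stream(s)
    return a + y

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")
out = step(x, y)
```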

Distributed

  • DTensor Pipeline Parallelism – First-class pipeline parallelism built on DTensor: metadata foundation, DTensor-aware stage/schedule refactoring, and integration tests. Also includes expanded sharding-propagation rules for conv, LayerNorm, RMSNorm, interpolate, and index ops.
  • Twice-Differentiable DTensor Redistribution – DTensor redistribution now supports double-backward, enabling higher-order gradients through distributed collectives (see the sketch after this list).
  • FSDP Copy-Engine All-Gather – An all-gather path that uses the GPU copy engine, reducing compute-SM pressure during parameter materialization.
  • NCCL Comm Suspend/Resume + Memory Stats – New APIs to suspend and resume NCCL communicators and query their memory usage, aimed at fault tolerance and long-running jobs.
  • Store::barrier / TCPStore BARRIER – A first-class barrier() on Store and a BARRIER command in TCPStoreClient reduce sync round-trips during rendezvous.
  • FlightRecorder: ncclx + gloo Backends – FlightRecorder trace analysis now supports ncclx and gloo alongside nccl.
  • Symmetric Memory: is_symm_mem_tensor – Public check for whether a tensor was allocated in symmetric memory.
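
A minimal sketch of twice-differentiable redistribution, assuming the double-backward support lands as described above; the DTensor APIs used here already exist, and the script is meant to run under torchrun with one GPU per rank.

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard, Replicate

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (dist.get_world_size(),))

# Shard a matrix over ranks, then redistribute to replicated (an all-gather).
x = distribute_tensor(torch.randn(8, 8), mesh, [Shard(0)]).requires_grad_()
y = x.redistribute(mesh, [Replicate()])
loss = (y * y).sum()

# First-order grad with create_graph=True keeps the graph alive...
(g,) = torch.autograd.grad(loss, x, create_graph=True)
# ...so a second-order grad through the collective is possible (the new part).
(gg,) = torch.autograd.grad(g.sum(), x)
```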

Attention & Inference

  • Variable-Length Attention for Inference – Adds paged-attention support (page_table, seqused_k), an out variant, a GQA flag, flop registration, and AOTInductor integration for variable-length attention, targeting inference-serving workloads.
  • CK SDPA Varlen Attention – The Composable Kernel SDPA backend gained varlen attention support on ROCm.
  • FlexAttention enable_gqa on Varlen – varlen_attn now takes an enable_gqa flag for grouped-query attention (see the sketch after this list).
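
For context on the GQA flag: torch.nn.functional.scaled_dot_product_attention already exposes enable_gqa, shown below; per the bullet above, 2.12 brings the same flag to the varlen entry point (whose exact signature is not verified here).

```python
import torch
import torch.nn.functional as F

# 32 query heads attending against 8 KV heads: groups of 4 (GQA).
q = torch.randn(2, 32, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

out = F.scaled_dot_product_attention(q, k, v, enable_gqa=True)
```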

Accelerator Frontend

  • Unified torch.accelerator Graph/Stream API – New torch.accelerator.Graph for device-agnostic capture/replay, is_capturing() on c10::Stream / torch.Stream, and XPU coverage of the same surface, bringing cross-backend parity to stream and graph management.
  • torch.cond with CUDA Graphs – Control-flow regions using torch.cond can now be captured and replayed as part of CUDA Graphs (see the sketch after this list).
  • torch.cuda.graph(enable_annotations=…) – New kwarg on the CUDA Graph capture context to inject NVTX-style annotations into captured graphs.
  • CUDA Green Context Workqueue Limit – Green Context now supports a workqueue limit, giving finer-grained control over resource partitioning.
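
A minimal sketch of torch.cond under CUDA Graph capture, assuming the 2.12 support described above; torch.cond and the torch.cuda.graph context manager are existing public APIs, and only their combination under capture is new.

```python
import torch

def step(x):
    # Data-dependent branch expressed with the functional control-flow op.
    return torch.cond(x.sum() > 0, lambda t: t.sin(), lambda t: t.cos(), (x,))

fn = torch.compile(step)
x = torch.randn(16, device="cuda")
fn(x)  # warm up so compilation happens outside capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    y = fn(x)

g.replay()  # replays the captured region, branch included
```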

Inductor Backends

  • Inductor CUTLASS FP8 / FP4 – The CUTLASS backend gains float8_e5m2 support, and FP4 block-scaled GEMM now uses nvMatmulHeuristics for kernel selection (see the sketch after this list).
  • Inductor XPU GEMM via SYCL-TLA – Intel GPU GEMM in Inductor can now target SYCL-TLA (Intel CUTLASS), closing the gap with the CUDA CUTLASS backend.
  • Inductor TMA for Lazy Triton Kernel Compilation – Inductor's lazy Triton kernel compilation path now supports TMA descriptors on Hopper.
  • Inductor DeviceInterface for TPU (Pallas) – Initial DeviceInterface for TPU, alongside Pallas backend improvements (scalar prefetch/indirect access, strided access via reshape, tiling, scalar outputs, copy elimination).
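
To make the CUTLASS item concrete: the sketch below steers Inductor's GEMM autotuning toward the CUTLASS backend using the existing config knob; the float8_e5m2 coverage is the new 2.12 piece, and a local CUTLASS checkout may additionally be required by the backend.

```python
import torch
import torch._inductor.config as inductor_config

# Let autotuning consider CUTLASS kernels alongside Triton and ATen.
inductor_config.max_autotune_gemm_backends = "CUTLASS,TRITON,ATEN"

@torch.compile(mode="max-autotune")
def matmul(a, b):
    return a @ b

a = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.float16)
matmul(a, b)  # autotuning may now select a CUTLASS kernel
```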

Profiler

  • Kineto: NCCL Collective Sequence Numbers – NCCL collective traces now carry a seq_num field, making it easier to align traces across ranks.
  • Profiler events() Flow / Activity Metadata – events() now returns flow ids, flow types, activity types, and unfinished / Python events, enabling richer post-hoc analysis (see the sketch after this list).
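
A short sketch of pulling events out of a trace; profile and events() are the existing API, while the flow/activity fields are the new additions (their exact attribute names are not verified here).

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    a = torch.randn(1024, 1024, device="cuda")
    (a @ a).sum().item()

for evt in prof.events():
    # 2.12 is described as attaching flow and activity metadata to these events.
    print(evt.name, evt.device_type)
```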

Export

  • torch.export: Expanded dtype Serde – torch.export.save now supports torch.uint32 / torch.uint64, and float8_e8m0fnu round-trips through export serde (see the sketch below).
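
A minimal round-trip sketch for the new dtypes, assuming the serde support lands as described (unverified here); export, save, and load are the existing torch.export API.

```python
import torch
from torch.export import export, save, load

class M(torch.nn.Module):
    def forward(self, x):
        return x.to(torch.int64) + 1

# A uint32 example input should now survive serialization, per the bullet.
ep = export(M(), (torch.zeros(4, dtype=torch.uint32),))
save(ep, "m.pt2")
ep2 = load("m.pt2")
```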

NN / Frontend

  • torch.backends.python_native (Native DSL) – New surface under torch.backends, landed as part of the "Native DSL" work.
  • Fused Adagrad Python Frontend – The fused-Adagrad Python frontend landed.
  • torch.nn.init on Meta / 0-Element Tensors – Init methods (including trunc_normal_) now work on meta-device and zero-element tensors (see the sketch after this list).
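
A minimal sketch of the init change, assuming the behavior described above: these calls previously raised on meta and zero-element tensors.

```python
import torch

w = torch.empty(4096, 4096, device="meta")
torch.nn.init.trunc_normal_(w, std=0.02)  # shape/dtype only; meta holds no data

z = torch.empty(0, 16)
torch.nn.init.xavier_uniform_(z)  # zero-element tensors also supported, per the bullet
```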

Platform Enablement

  • ROCm: Expandable Segments + rocSHMEM – Expandable memory segments on AMD GPUs (ROCm ≥ 7.02) and rocSHMEM enablement for on-GPU communication (see the sketch after this list).
  • ROCm: hipSPARSELt Broader Coverage + FP8 2:4 Sparsity – Expanded hipSPARSELt integration, including FP8 semi-structured sparsity on ROCm ≥ 7.12.
  • ROCm: Inductor FlexAttention Pipelining – Pipelined FlexAttention on AMD GPUs.
  • XPU: FMA-based addcdiv Lowering – Fused multiply-add lowering for addcdiv on Intel GPUs.
  • MPS: Metal 4 Offline Shader Compilation – Apple Silicon binary wheels ship ahead-of-time-compiled Metal 4 shaders, reducing first-run latency.
  • Arm: Full AArch64 PR CI – Every pull request now runs AArch64 tests on Graviton 4 / Neoverse V2.
  • Elastic: Windows stdout/stderr Redirects – torchrun-style elastic launches now support stdout/stderr redirects on Windows.
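
For the expandable-segments item, a hedged sketch of opting in on ROCm via the allocator config environment variable (the same knob the CUDA allocator uses); it must be set before the first GPU allocation.

```python
import os
os.environ["PYTORCH_HIP_ALLOC_CONF"] = "expandable_segments:True"

import torch  # ROCm builds expose HIP devices through the "cuda" device type
x = torch.randn(1024, device="cuda")
```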

2.12 Release Timeline

Official dates from the PyTorch Release 2.12 Key Dates post:

| Milestone | What | Date |
| --- | --- | --- |
| M1 | Release announcement | 27 Mar 2026 |
| M2 | All PRs landed in pytorch/pytorch; feature submission closed | 10 Apr 2026 |
| M3.1 | Release branch cut | 13–15 Apr 2026 (done) |
| M4 | Release branch finalized; final launch date announced; feature classifications published; final RC produced | Week of 27 Apr 2026 |
| M4.1 | Tutorial draft submission deadline | 6 May 2026 |
| M5 | External-facing content finalized | 8 May 2026 |
| M6 | Release day | 13 May 2026 |

How to help

  • Feature owners: file a release-highlight issue (template above) with your blog blurb + tutorial links.
  • Tutorial authors: PRs to pytorch/tutorials with the 2.12-release label are tracked against M4.1.
  • Everyone else: please review the feature list above and reply in this thread if something major is missing, particularly for MPS, Quantization, and CPU/Arm, where the diff-based scan caught few candidates.

If you have questions, reach out in the comments under this post.

Cheers, Team PyTorch
