Intel GPU & CPU Enabling Status and Feature Plan – 2026 H1 Update

Intel GPU

Recap

In 2025, the PyTorch Intel GPU (XPU) team made substantial contributions to upstream PyTorch, spanning multiple PyTorch release cycles. The contributions cover new features, performance optimizations, infrastructure improvements, distributed capabilities, bug fixes, and documentation — strengthening Intel GPU support and significantly improving the PyTorch user experience on Intel GPUs.

Areas Where We Excelled in Recent PyTorch Releases

  • Full-stack AI acceleration coverage: FP8, FP16, BF16, INT8, and INT4 (WOQ) quantization is now supported across the entire stack. Attention mechanisms are fully enabled, including FlexAttention, Scaled Dot Product Attention (SDPA), and FlashAttention (based on SYCL-TLA). End-to-end optimization through torch.compile is production-ready.

  • Windows platform breakthrough: torch.compile, AOT Inductor, Kineto Profiler, and SYCL C++ Extension are all now enabled on Windows XPU — delivering a compelling cross-platform accelerator experience for Intel GPUs in PyTorch.

  • Distributed capabilities from the ground up on Intel® Data Center GPU Max series: Out-of-the-box XCCL backend, FlightRecorder support, FSDP2 enablement, and DTensor RNG support laid a solid foundation for distributed workloads.

  • Continuous hardware coverage expansion: From the Intel® Data Center GPU Max Series to client GPUs (Intel® Arc™ A/B-Series Graphics) to the latest Intel® Core™ Ultra Mobile Processors (Series 3) with Intel® Arc™ Graphics (code-named Panther Lake), we maintain comprehensive support across the Intel GPU product portfolio.

  • Engineering quality and velocity: Three software stack upgrades within the year, CI infrastructure modernization, significantly expanded test coverage, and package size optimizations all reflect a mature and high-velocity engineering cadence.
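
To make the out-of-the-box experience above concrete, here is a minimal sketch of targeting an Intel GPU from standard PyTorch, falling back to CPU when no XPU is present. The toy model and shapes are illustrative, not from the original post.

```python
import torch

# The same script targets an Intel GPU ("xpu") when one is available and
# falls back to CPU otherwise (torch.xpu ships with recent PyTorch builds).
device = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"

def tiny_mlp(x, w):
    return torch.nn.functional.relu(x @ w)

# torch.compile wraps the function for Inductor; compilation happens lazily
# on the first call, so defining it here incurs no cost.
compiled_mlp = torch.compile(tiny_mlp)

x = torch.randn(8, 16, device=device)
w = torch.randn(16, 32, device=device)
out = tiny_mlp(x, w)  # eager reference result
```

The same code path works for FP16/BF16 inputs via `torch.autocast(device)`, which is where the low-precision coverage described above comes into play.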

Continued Efforts for Future PyTorch Releases

To further establish Intel GPUs as a broadly adopted accelerator in the PyTorch ecosystem, our continued efforts focus on advancing performance, portability, ecosystem integration, and long-term architectural sustainability.

  • Enable production-ready support for leading serving frameworks — Provide comprehensive upstream support for serving frameworks such as vLLM and SGLang. Enable efficient and scalable inference on Intel GPUs across diverse deployment environments.

  • Advance and promote torch.accelerator toward becoming a well-adopted programming interface across heterogeneous accelerators — Advance the vision of enabling the same PyTorch program to run seamlessly across accelerators. Improve portability, reduce backend fragmentation, and strengthen PyTorch's long-term architectural foundation.

  • Advance performance leadership across key workloads and architectural layers — Continuously improve benchmark results and real-world model performance on Intel GPUs through coordinated innovations across the runtime, compiler, kernel, and precision stack:

    • Advance XPUGraph integration in PyTorch to unlock graph-based performance optimizations where applicable.
    • Deliver comprehensive low-precision support aligned with evolving ecosystem standards to enable efficient and scalable training and inference.
    • Integrate SYCL-TLA as a complementary kernel backend alongside oneDNN and Triton, providing a flexible and scalable path to accelerate GEMM, SDPA, and other performance-critical operations.
    • Expand and optimize ATen operator implementations to improve performance coverage and ensure efficient execution across diverse workloads.

  • Enable seamless support for next-generation Intel GPU platforms — Expand PyTorch enablement to upcoming Intel GPU architectures, ensuring a consistent user experience across Intel’s GPU portfolio.

  • Advance compiler innovation through Helion to unlock future performance opportunities on Intel GPUs — Improve Helion support on Intel GPUs and demonstrate its potential through select high-impact kernels, establishing a foundation for future compiler-driven performance scaling.

  • Strengthen CI/CD and release infrastructure — Generalize test cases to support a broader range of accelerators, using Intel GPUs as a showcase platform. Continue improving Intel GPU CI infrastructure to support more demanding workloads. Streamline the release process for Intel GPU support to align with PyTorch’s accelerated release cadence.

Focus Areas for Continued Efforts

As part of our continued efforts in future PyTorch releases, the Intel PyTorch team will focus on the following key areas for Intel GPUs in 2026:

PyTorch Compiler

We will continue improving the Triton backend for Intel GPUs and introduce a new SYCL-TLA backend to further enhance performance through torch.compile. Meanwhile, we will proactively assess and optimize vLLM performance across Inductor, Helion, and custom kernels for leading models on Intel GPUs.

Reliability Infrastructure Improvements

  • Improve the reliability of vLLM x PT2 tests in the PyTorch repo CI on Intel GPUs, and follow the community process to make and keep tests green
  • Add Intel GPUs to vLLM x PT2 Performance Dashboard

XPUGraphTree

  • Integrate XPUGraphTree into Inductor to improve performance where applicable
  • Generalize the Graph implementations across different accelerators on the Inductor side

SYCL-TLA Backend

  • Integrate SYCL-TLA into Inductor as a new backend for Intel Discrete GPUs on Linux
  • Provide more SYCL-TLA kernel templates to support custom kernels

Triton Backend

  • Continue improving FlexAttention forward and backward performance to accelerate out-of-the-box (OOB) workloads
  • Optimize FlexAttention for vLLM production inference
  • Provide more Triton kernel templates to support custom kernels

Helion

  • Improve Intel GPU support in Helion and enable Intel GPU CI in Helion repo
  • Demonstrate promising performance for selected workloads on Intel GPUs with Helion

PyTorch Framework Performance

ATen Operations

  • Extend ATen operations on Intel GPUs driven by UT coverage
  • Support and optimize batch-invariant kernels in PyTorch on Intel GPUs
  • Optimize quant/dequant kernels for FP8, MXFP8, and MXFP4
  • Optimize GEMM kernels for FP8, MXFP8, and MXFP4

Attention Stack

  • Improve library-based SDPA operation on top of oneDNN
  • Integrate SYCL-TLA as a flexible and scalable path to accelerate GEMM, SDPA, and other performance-critical operations
  • Improve FlexAttention forward and backward performance for OOB models
  • Explore other approaches to implement and accelerate FlexAttention besides Triton
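
As a concrete reference point, the entry point the items above target is the standard `torch.nn.functional.scaled_dot_product_attention` API, which dispatches to the backend implementations (oneDNN, FlashAttention, etc.) described here. Shapes below are toy values for illustration.

```python
import torch
import torch.nn.functional as F

# Tensors are laid out as (batch, num_heads, seq_len, head_dim).
q = torch.randn(1, 4, 32, 64)
k = torch.randn(1, 4, 32, 64)
v = torch.randn(1, 4, 32, 64)

# SDPA picks an available fused backend for the current device;
# is_causal=True applies the standard causal (autoregressive) mask.
attn_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```

Because the backend selection is internal to SDPA, library-level improvements such as the oneDNN and SYCL-TLA paths above benefit existing models without code changes.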

Torch Accelerator Generalization

  • Generalize Graph record/replay into a portable and backend-agnostic execution model, and expose it through torch.accelerator to unlock consistent performance benefits across accelerators.
  • Continue expanding runtime generalization, enabling torch.accelerator to support broader workloads, execution modes, and system configurations.
  • Collaborate closely with the PyTorch community to drive ecosystem adoption of torch.accelerator, reduce backend-specific maintenance overhead, and improve long-term sustainability of accelerator integrations.
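
To illustrate the portability goal, here is a minimal device-agnostic sketch using the `torch.accelerator` API (available in recent PyTorch releases); the CPU fallback and the helper name are our own, not part of the API.

```python
import torch

def pick_device() -> torch.device:
    # torch.accelerator resolves uniformly to whichever accelerator backend
    # is present (cuda, xpu, mps, ...), so the same script runs unchanged
    # across vendors. hasattr guards older PyTorch versions (assumption).
    if hasattr(torch, "accelerator") and torch.accelerator.is_available():
        return torch.accelerator.current_accelerator()
    return torch.device("cpu")

x = torch.ones(4, device=pick_device())
```

This is the kind of backend-agnostic pattern that generalizing graph record/replay and runtime support aims to extend to more execution modes.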

TorchAO

  • Ensure TorchAO performs well for different lower precision data types on Intel GPUs, including INT4, INT8, FP8, MXFP8, and MXFP4
  • Make MXFP4 production-ready with fast 4-bit PTQ (GPTQ) and an accuracy debugging toolkit in TorchAO on Intel GPUs. Verify the end-to-end flow on selected large models.

TorchCodec

  • Add support for Intel Xe media decoding and encoding in TorchCodec
  • Integrate the Intel TorchCodec support with the Hugging Face Datasets and Transformers libraries. Explore other popular frameworks that would benefit from similar enabling.

TorchRec

  • Develop SYCL versions of the FBGEMM-based operators invert_permute, jagged_index_select, permute_1D_sparse_data, and expand_into_jagged_permute, which are used in row-wise and table-wise sharding strategies.
  • Develop SYCL versions of the FBGEMM-based operators that support non-pooled embedding functionality: dense_embedding_codegen_lookup_function and split_embedding_codegen_lookup_rowwise_adagrad_function_pt2 (forward and backward passes).

PyTorch Distributed

torch.distributed / torchcomms

  • Adoption & coverage: Enable in-tree TorchComms support for the XCCL backend, add CI/CD testing on Intel GPUs, and provide coverage for all collective and P2P operations across scale-up + scale-out.
  • Migration & parity: Drive feature/perf parity vs. ProcessGroupXCCL and support upstream migration — while ensuring compatibility with key distributed infra features like FlightRecorder, torch.compile, DeviceMesh, and DTensor on XCCL.
  • Symmetric memory & customized communication ops: Enable an intra-node XPU symmetric memory backend to support asynchronous Tensor Parallelism (TP) and customized communication ops on XPU devices.

Distributed Parallelism and Compilation

  • Compiler-first distributed training (Intel GPU enablement): Deliver initial SimpleFSDP support on Intel GPUs (compiler-based sharding, bucketing/reordering, parallelisms, activation checkpointing, and mixed precision), along with initial Compiler Toolkit (AOT) support, targeting a merged path.
  • Maintain/expand TorchTitan OSS momentum: Enable DeepEP support for Intel GPUs (explicitly dependent on underlying library support) and validate against a subset of SOTA models in the TorchTitan context.
  • Seamless scaling via auto-parallelization: Enable automatic parallelization on Intel GPUs, including Pipeline Parallel (noted as potentially merging with SimpleFSDP + Compiler Toolkit), dependent on PyTorch Autoparallel support.
  • Expand DTensor support for Intel GPUs, improve sharding propagation and operator coverage.

Intel CPU

Recap

On the CPU side, the PyTorch Intel CPU team continued contributing to PyTorch upstream throughout 2025 in the following areas:

  • Expanded Low-Precision Enablement: We broadened quantization support on CPU, with more quantization recipes and continued performance improvements for INT8 workloads. We now provide comprehensive precision coverage across FP32, BF16, FP16, and INT8, enabling flexible deployment configurations.

  • Attention Optimization: Over the past year, we have continuously optimized attention mechanisms, including SDPA, FlashAttention, and FlexAttention.

  • torch.compile Enhancements: We have added foundational INT8 quantization patterns in Inductor, expanded fusion coverage, and further optimized the GEMM template. In addition, we improved memory layout propagation to avoid unnecessary data movement, enhanced parallelization strategies and vectorization for CPU code generation, and refined reduction implementations to improve both numerical accuracy and execution efficiency.

  • Engineering Responsiveness and Stability: We have remained highly engaged with the PyTorch community by actively triaging CPU-related issues, addressing regressions, and delivering timely fixes.

Focus Areas for Continued Efforts

As part of our continued efforts in future PyTorch releases, the Intel PyTorch team will focus on the following key areas for Intel CPUs in 2026:

Framework Performance Optimization (Intel CPU Focus)

Torch Operator Enhancements

We will continue expanding operator coverage and improving kernel efficiency across low-precision and reduced-precision workloads:

  • Add FP8 support for applicable operators/kernels, together with performance optimization
  • Improve existing BF16, FP16, INT8 datatype-based operator/kernel coverage and performance

Attention Stack

We will continue to advance transformer attention performance on Intel CPU platforms by enabling next-generation FlashAttention optimizations and extending FlexAttention with low-precision and decoding optimizations:

  • Integrate FlashAttention V3 optimizations into existing FlashAttention kernels
  • Further incorporate FlashAttention V4 enhancements to FlashAttention kernels
  • Optimize FlexAttention with AVX512-FP16 instructions
  • Add Flash decoding optimizations within FlexAttention to improve decoding efficiency
  • Add FP8 KV cache support within FlexAttention to reduce memory footprint and improve token generation throughput

Torch Compile / Inductor

We will continue investing in the Inductor CPP backend to enhance graph compilation and runtime performance:

  • Optimize the Inductor CPP backend by improving GEMM template and fusion opportunities
  • Add corresponding patterns support for FP8 and complete remaining INT8 lowering in Inductor
  • Add FP8 and remaining INT8 kernels/templates support
  • Optimize hotspots identified through benchmarking against the Triton CPU backend
  • Optimize CPP backend build time
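
To ground this, here is a minimal sketch of the kind of GEMM-plus-pointwise pattern the Inductor CPP backend fuses into a single vectorized C++ kernel on CPU. The function is a toy example; compilation happens lazily on the first call and requires a C++ toolchain at runtime.

```python
import torch

def fused_block(x):
    # A GEMM (x @ x.T) followed by a pointwise op (gelu) -- the template +
    # epilogue-fusion pattern the items above target in the CPP backend.
    return torch.nn.functional.gelu(x @ x.T)

# torch.compile routes through Inductor; on CPU this generates C++/OpenMP
# code. No compilation occurs until compiled(...) is first invoked.
compiled = torch.compile(fused_block)

ref = fused_block(torch.ones(2, 3))  # eager reference result
```

Fusing the epilogue into the GEMM template avoids materializing the intermediate matmul result, which is where much of the CPU-side speedup comes from.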

AOT Inductor

To improve the deployment experience of compiled models, we will enhance AOT Inductor’s packaging and loading workflow:

  • Improve model package extraction and loading mechanisms to reduce redundant unpacking and lower disk usage

Distributed Improvement

To strengthen CPU-based distributed inference scenarios (single node), we will optimize Tensor Parallel performance using shared-memory-based collective primitives:

  • Add SHM-based intra-node support in Gloo
  • Enable PyTorch to utilize SHM-based primitives for Tensor Parallel scenarios

Enabling of the Next Generation Intel® Xeon® Scalable Processor

The next-generation Intel® Xeon® Scalable processor (code-named Diamond Rapids) brings a set of new hardware features, some of which are already described in the GCC enablement blog. To bring these capabilities to the AI domain, we will continue adding new features that leverage the power of the new platform:

  • FP8 operator/kernel support in ATen and the Inductor CPP backend — FP8 data type support for applicable operators/kernels; Inductor pattern matching for these new operators and related template optimizations
  • TF32 support for applicable operators
  • Leverage new instructions (AVX512 FP16 instructions, etc.) to optimize existing kernels through Inductor CPP backend
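
As a small illustration of how a TF32-style opt-in surfaces at the framework level, PyTorch exposes a global matmul-precision knob; which kernels actually get selected is backend- and hardware-dependent, so treat this as a sketch rather than a guarantee of TF32 execution on any given CPU.

```python
import torch

# "high" permits reduced-precision internal math (e.g. TF32-style
# accumulation) for float32 matmuls where the backend supports it;
# "highest" (the default) keeps full FP32 accumulation.
torch.set_float32_matmul_precision("high")

a = torch.randn(64, 64)
b = torch.randn(64, 64)
y = a @ b  # eligible for the faster path on supporting hardware
```

Operators that do not support the relaxed mode on a given platform simply keep running in full FP32, so enabling it is safe from a correctness standpoint (with a small, bounded accuracy trade-off).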

Engineering Quality and Velocity

We will continue improving robustness, maintainability, and ecosystem alignment:

  • Upgrade oneDNN for bug fixes and performance improvements
  • Continue strengthening CPU issue triage and resolution
  • Improve UTs/CI coverage

Summary

In this update, we summarized the progress made across recent PyTorch releases and outlined our continued efforts as a north star for advancing Intel GPU and CPU support in future releases.

These achievements would not have been possible without the invaluable support from Alban, Andrey, Bin, Chris, Driss, Jason, Jerry, Nikita, and Tristan, along with many other maintainers and reviewers across the PyTorch community. We sincerely appreciate their collaboration and contributions.

Looking ahead, the Intel PyTorch team will continue working closely with the community to further improve Intel GPU and CPU functionality, performance, and overall user experience.
