Intel GPU
Recap
In 2025, the PyTorch Intel GPU (XPU) team made substantial contributions to upstream PyTorch, spanning multiple PyTorch release cycles. The contributions cover new features, performance optimizations, infrastructure improvements, distributed capabilities, bug fixes, and documentation — strengthening Intel GPU support and significantly improving the PyTorch user experience on Intel GPUs.
Areas Where We Excelled in Recent PyTorch Releases
- Full-stack AI acceleration coverage: FP8, FP16, BF16, INT8, and INT4 (WOQ) quantization is now supported across the entire stack. Attention mechanisms are fully enabled, including FlexAttention, Scaled Dot Product Attention (SDPA), and FlashAttention (based on SYCL-TLA). End-to-end optimization through torch.compile is production-ready.
- Windows platform breakthrough: torch.compile, AOT Inductor, the Kineto Profiler, and the SYCL C++ Extension are all now enabled on Windows XPU, making Intel GPUs a compelling cross-platform accelerator experience in PyTorch.
- Distributed capabilities from the ground up on the Intel® Data Center GPU Max Series: The out-of-the-box XCCL backend, FlightRecorder support, FSDP2 enablement, and DTensor RNG support laid a solid foundation for distributed workloads.
- Continuous hardware coverage expansion: From the Intel® Data Center GPU Max Series to client GPUs (Intel® Arc™ A/B-Series Graphics) to the latest Intel® Core™ Ultra Mobile Processors (Series 3) with Intel® Arc™ Graphics (Panther Lake), we are maintaining comprehensive support across the Intel GPU product portfolio.
- Engineering quality and velocity: Three software stack upgrades within the year, CI infrastructure modernization, significantly expanded test coverage, and package size optimizations all reflect a mature and high-velocity engineering cadence.
Continued Efforts for Future PyTorch Releases
To further establish Intel GPUs as a broadly adopted accelerator in the PyTorch ecosystem, our continued efforts focus on advancing performance, portability, ecosystem integration, and long-term architectural sustainability.
- Enable production-ready support for leading serving frameworks: Provide comprehensive upstream support for serving frameworks such as vLLM and SGLang, enabling efficient and scalable inference on Intel GPUs across diverse deployment environments.
- Advance and promote torch.accelerator toward becoming a well-adopted programming interface across heterogeneous accelerators: Advance the vision of enabling the same PyTorch program to run seamlessly across accelerators, improving portability, reducing backend fragmentation, and strengthening PyTorch's long-term architectural foundation.
- Advance performance leadership across key workloads and architectural layers: Continuously improve benchmark results and real-world model performance on Intel GPUs through coordinated innovations across the runtime, compiler, kernel, and precision stack:
  - Advance XPUGraph integration in PyTorch to unlock graph-based performance optimizations where applicable.
  - Deliver comprehensive low-precision support aligned with evolving ecosystem standards to enable efficient and scalable training and inference.
  - Integrate SYCL-TLA as a complementary kernel backend alongside oneDNN and Triton, providing a flexible and scalable path to accelerate GEMM, SDPA, and other performance-critical operations.
  - Expand and optimize ATen operator implementations to improve performance coverage and ensure efficient execution across diverse workloads.
- Enable seamless support for next-generation Intel GPU platforms: Expand PyTorch enablement to upcoming Intel GPU architectures, ensuring a consistent user experience across Intel's GPU portfolio.
- Advance compiler innovation through Helion to unlock future performance opportunities on Intel GPUs: Improve Helion support on Intel GPUs and demonstrate its potential through selective high-impact kernels, establishing a foundation for future compiler-driven performance scaling.
- Strengthen CI/CD and release infrastructure: Generalize test cases to support a broader range of accelerators, using Intel GPUs as a showcase platform. Continue improving Intel GPU CI infrastructure to support more demanding workloads. Streamline the release process for Intel GPU support to align with PyTorch's accelerated release cadence.
Focus Areas for Continued Efforts
As part of our continued efforts in future PyTorch releases, the Intel PyTorch team will focus on the following key areas for Intel GPUs in 2026:
PyTorch Compiler
We will continue improving the Triton backend for Intel GPUs and introduce a new SYCL-TLA backend to further enhance performance through torch.compile. Meanwhile, we will proactively assess and optimize vLLM performance across Inductor, Helion, and custom kernels for leading models on Intel GPUs.
Reliability Infrastructure Improvements
- Improve the reliability of vLLM x PT2 tests in PyTorch repo CI on Intel GPUs, and follow the community process to make and keep tests green
- Add Intel GPUs to the vLLM x PT2 Performance Dashboard
XPUGraphTree
- Integrate XPUGraphTree into Inductor to improve performance where applicable
- Generalize the Graph implementations across different accelerators on the Inductor side
SYCL-TLA Backend
- Integrate SYCL-TLA into Inductor as a new backend for Intel discrete GPUs on Linux
- Provide more SYCL-TLA kernel templates to support custom kernels
Triton Backend
- Continue improving FlexAttention forward and backward performance to accelerate out-of-the-box (OOB) workloads
- Optimize FlexAttention for vLLM production inference
- Provide more Triton kernel templates to support custom kernels
Helion
- Improve Intel GPU support in Helion and enable Intel GPU CI in the Helion repo
- Demonstrate promising performance for selective workloads on Intel GPUs with Helion
PyTorch Framework Performance
ATen Operations
- Extend ATen operations on Intel GPUs driven by UT coverage
- Support and optimize batch-invariant kernels in PyTorch on Intel GPUs
- Optimize quant/dequant kernels for FP8, MXFP8, and MXFP4
- Optimize GEMM kernels for FP8, MXFP8, and MXFP4
Attention Stack
- Improve the library-based SDPA operation on top of oneDNN
- Integrate SYCL-TLA as a flexible and scalable path to accelerate GEMM, SDPA, and other performance-critical operations
- Improve FlexAttention forward and backward performance for OOB models
- Explore approaches other than Triton to implement and accelerate FlexAttention
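From the user's side, all of this backend work sits behind one call: scaled_dot_product_attention dispatches to a fused kernel when one is available for the device and dtype, and otherwise falls back to the math implementation. A minimal sketch:

```python
import torch
import torch.nn.functional as F

# SDPA picks the best available backend for the input device/dtype
# (e.g. a fused oneDNN-based kernel on XPU, flash kernels on CPU/CUDA).
q = torch.randn(2, 4, 32, 64)  # (batch, heads, seq_len, head_dim)
k = torch.randn(2, 4, 32, 64)
v = torch.randn(2, 4, 32, 64)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 32, 64])
```

Model code stays identical across devices; the items above change only which kernel is selected underneath.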
Torch Accelerator Generalization
- Generalize Graph record/replay into a portable and backend-agnostic execution model, and expose it through torch.accelerator to unlock consistent performance benefits across accelerators
- Continue expanding runtime generalization, enabling torch.accelerator to support broader workloads, execution modes, and system configurations
- Collaborate closely with the PyTorch community to drive ecosystem adoption of torch.accelerator, reduce backend-specific maintenance overhead, and improve the long-term sustainability of accelerator integrations
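The torch.accelerator namespace (available since PyTorch 2.6) already offers backend-agnostic device discovery; a small sketch, with a hasattr guard as an assumption so it also runs on older builds:

```python
import torch

# Backend-agnostic device discovery: the same code reports "xpu",
# "cuda", etc., depending on which accelerator backend is present.
if hasattr(torch, "accelerator") and torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()  # e.g. device(type='xpu')
    count = torch.accelerator.device_count()
else:
    device, count = torch.device("cpu"), 0

x = torch.ones(3, device=device)
print(device.type, count, x.sum().item())
```

Code written this way needs no per-backend branches, which is exactly the fragmentation reduction the bullets above aim at.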
TorchAO
- Ensure TorchAO performs well for different lower-precision data types on Intel GPUs, including INT4, INT8, FP8, MXFP8, and MXFP4
- Make MXFP4 production-ready with fast 4-bit PTQ (GPTQ) and an accuracy-debugging toolkit in TorchAO on Intel GPUs, and verify the end-to-end flow on selective large models
TorchCodec
- Add support for Intel Xe media decoding and encoding in TorchCodec
- Integrate the Intel TorchCodec support with the Hugging Face Datasets and Transformers libraries, and explore other popular frameworks that would benefit from similar enabling
TorchRec
- Develop SYCL versions of the FBGEMM-based operators invert_permute, jagged_index_select, permute_1D_sparse_data, and expand_into_jagged_permute, which are used in row-wise and table-wise sharding strategies
- Develop SYCL versions of the FBGEMM-based operators dense_embedding_codegen_lookup_function and split_embedding_codegen_lookup_rowwise_adagrad_function_pt2 (forward and backward pass) to support non-pooling functionalities in embeddings
PyTorch Distributed
torch.distributed / torchcomms
- Adoption & coverage: Enable in-tree TorchComms support for the XCCL backend, add CI/CD testing on Intel GPUs, and provide coverage for all collective and P2P operations across scale-up and scale-out.
- Migration & parity: Drive feature and performance parity vs. ProcessGroupXCCL and support upstream migration, while ensuring compatibility with key distributed infrastructure features such as FlightRecorder, torch.compile, DeviceMesh, and DTensor on XCCL.
- Symmetric memory & customized communication ops: Enable an intra-node XPU symmetric-memory backend to support asynchronous Tensor Parallelism (TP) and customized communication ops on XPU devices.
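The backend work above plugs into the standard ProcessGroup API, so user code only changes the backend string. A single-process sketch; "gloo" is used here purely as an assumption so the example runs on any machine, whereas on Intel GPUs the backend would be "xccl":

```python
import os
import torch
import torch.distributed as dist

# Minimal single-process process group; real jobs launch one rank per
# device (e.g. via torchrun) and would pass backend="xccl" on XPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group(backend="gloo", rank=0, world_size=1)

t = torch.ones(4)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # identity with world_size=1
print(t.tolist())  # [1.0, 1.0, 1.0, 1.0]

dist.destroy_process_group()
```

Because collectives are backend-neutral at the API level, parity work on XCCL directly benefits existing distributed scripts without code changes.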
Distributed Parallelism and Compilation
- Compiler-first distributed training (Intel GPU enablement): Deliver initial SimpleFSDP support on Intel GPUs, covering compiler-based sharding, bucketing/reordering, parallelisms, activation checkpointing, and mixed precision, plus initial Compiler Toolkit (AOT) support, targeting a merged path.
- Maintain and expand TorchTitan OSS momentum: Enable DeepEP support for Intel GPUs (explicitly dependent on underlying library support) and validate against a subset of SOTA models in the TorchTitan context.
- Seamless scaling via auto-parallelization: Enable automatic parallelization on Intel GPUs, including Pipeline Parallel (noted as potentially merging with SimpleFSDP + Compiler Toolkit), dependent on PyTorch Autoparallel support.
- Expand DTensor support for Intel GPUs, improving sharding propagation and operator coverage.
Intel CPU
Recap
On the CPU side, the PyTorch Intel CPU team continued contributing to upstream PyTorch throughout 2025, focusing on the following areas:
- Expanded Low-Precision Enablement: We broadened quantization support on CPU, with more quantization recipes and continued performance improvements for INT8 workloads. We now provide comprehensive precision coverage across FP32, BF16, FP16, and INT8, enabling flexible deployment configurations.
- Attention Optimization: Over the past year, we continuously optimized attention mechanisms, including SDPA, FlashAttention, and FlexAttention.
- PyTorch Compile: We added foundational INT8 quantization patterns in Inductor, expanded fusion coverage, and further optimized the GEMM template. In addition, we improved memory-layout propagation to avoid unnecessary data movement, enhanced parallelization strategies and vectorization for CPU code generation, and refined reduction implementations to improve both numerical accuracy and execution efficiency.
- Engineering Responsiveness and Stability: We remained highly engaged with the PyTorch community by actively triaging CPU-related issues, addressing regressions, and delivering timely fixes.
Focus Areas for Continued Efforts
As part of our continued efforts in future PyTorch releases, the Intel PyTorch team will focus on the following key areas for Intel CPUs in 2026:
Framework Performance Optimization (Intel CPU Focus)
Torch Operator Enhancements
We will continue expanding operator coverage and improving kernel efficiency across low-precision and reduced-precision workloads:
- Add FP8 support for applicable operators/kernels, together with performance optimization
- Improve the coverage and performance of existing BF16, FP16, and INT8 operators/kernels
Attention Stack
We will continue to advance transformer attention performance on Intel CPU platforms by enabling next-generation FlashAttention optimizations and extending FlexAttention with low-precision and decoding optimizations:
- Integrate FlashAttention V3 optimizations into the existing FlashAttention kernels
- Further incorporate FlashAttention V4 enhancements into the FlashAttention kernels
- Optimize FlexAttention with AVX512-FP16 instructions
- Add Flash decoding optimizations within FlexAttention to improve decoding efficiency
- Add FP8 KV-cache support within FlexAttention to reduce memory footprint and improve token-generation throughput
Torch Compile / Inductor
We will continue investing in the Inductor CPP backend to enhance graph compilation and runtime performance:
- Optimize the Inductor CPP backend by improving the GEMM template and fusion opportunities
- Add the corresponding pattern support for FP8 and complete the remaining INT8 lowering in Inductor
- Add FP8 and the remaining INT8 kernel/template support
- Optimize hotspots identified through benchmarking against the Triton CPU backend
- Optimize CPP backend build time
AOT Inductor
To improve the deployment experience of compiled models, we will enhance AOT Inductor’s packaging and loading workflow:
- Improve model package extraction and loading mechanisms to reduce redundant unpacking and lower disk usage
Distributed Improvement
To strengthen CPU-based distributed inference scenarios (single node), we will optimize Tensor Parallel performance using shared-memory-based collective primitives:
- Add SHM-based intra-node support in Gloo
- Enable PyTorch to utilize SHM-based primitives for Tensor Parallel scenarios
Enabling of the Next Generation Intel® Xeon® Scalable Processor
The next-generation Intel® Xeon® Scalable processor (code-named Diamond Rapids) brings a set of new hardware features, with some information already available through the GCC enabling blog. To enable the applicable hardware capabilities in the AI domain, we will continue adding new features to leverage the power of the new platform:
- FP8 operator/kernel support in ATen and the Inductor CPP backend: FP8 datatype support for applicable operators/kernels, Inductor pattern matching for those new operators, and related template optimizations
- TF32 support for applicable operators
- Leverage new instructions (AVX512 FP16 instructions, etc.) to optimize existing kernels through the Inductor CPP backend
Engineering Quality and Velocity
We will continue improving robustness, maintainability, and ecosystem alignment:
- Upgrade oneDNN for bug fixes and performance improvements
- Continue strengthening CPU issue triage and resolution
- Improve UT/CI coverage
Summary
In this update, we summarized the progress made across recent PyTorch releases and outlined our continued efforts as a north star for advancing Intel GPU and CPU support in future releases.
These achievements would not have been possible without the invaluable support from Alban, Andrey, Bin, Chris, Driss, Jason, Jerry, Nikita, and Tristan, along with many other maintainers and reviewers across the PyTorch community. We sincerely appreciate their collaboration and contributions.
Looking ahead, the Intel PyTorch team will continue working closely with the community to further improve Intel GPU and CPU functionality, performance, and overall user experience.