Intel GPU
Context
We previously published the Intel GPU Enabling Status and Feature Plan to introduce Intel GPU support in PyTorch. Following the recent release of the Meta PyTorch Team 2025 H1 Roadmaps, the Intel PyTorch team has refreshed the status and feature plan to stay aligned with PyTorch community strategies, such as driving organic open-source adoption in head repositories through UX enhancements, and pursuing foundational performance optimizations that deliver sustained performance improvements.
Recap
Intel GPU in PyTorch is designed to provide a seamless GPU programming experience, covering both front-end and back-end integration. By leveraging Intel’s advancements in GPU technology, we enhance PyTorch’s performance and versatility, enabling significant workload acceleration and improved processing efficiency. This ensures an out-of-the-box experience for users on the Intel GPU platform while benefiting the broader PyTorch community.
Areas Where We Excelled in Recent PyTorch Releases:
- Over time, expanded Intel GPU support to a wide range across Linux and Windows, including:
  - Simple installation of torch-xpu PIP wheels and an effortless setup experience
  - High ATen operation coverage with SYCL and oneDNN for smooth eager mode support in both functionality and performance
  - Notable speedups with torch.compile through the default TorchInductor and Triton backend, proven by measurable performance gains on the Hugging Face*, TIMM, and TorchBench benchmarks (a minimal usage sketch follows this list)
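To illustrate the torch.compile point above, here is a minimal usage sketch (assuming a PyTorch build with Intel GPU/XPU support; the toy model is purely illustrative) that runs a module on the xpu device and compiles it with the default TorchInductor backend, which generates Triton kernels on Intel GPUs:

```python
import torch

# Requires a PyTorch build with Intel GPU (XPU) support.
assert torch.xpu.is_available()

# A toy module placed on the Intel GPU.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
).to("xpu")

# The default torch.compile backend is TorchInductor, which emits Triton kernels for XPU.
compiled = torch.compile(model)

x = torch.randn(8, 1024, device="xpu")
with torch.no_grad():
    y = compiled(x)
print(y.shape)
```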
Continued Efforts for Future PyTorch Releases:
- Achieve PyTorch-native SOTA performance across important model categories and benchmarks on Intel GPUs
  - Enable and improve Template-based GEMM performance for torch.compile
  - Improve performance for LLM models through optimized SDPA, FlexAttention, and lower-precision data types (an illustrative sketch follows this list)
- Deliver a more streamlined and user-friendly experience on Intel GPUs
  - Improve feature coverage, including higher ATen operation coverage, improved torch.compile, Distributed enabling, better documentation, etc.
  - Expand Intel GPU support to the head libraries of the PyTorch ecosystem, such as torchao, torchtune, and torchtitan
  - Enable more Intel GPU platforms to provide cutting-edge hardware features that benefit community users broadly
- Improve generalization of the PyTorch runtime and backend infrastructure to support a wide range of hardware backends
  - Maintain close collaboration with the community to generalize the PyTorch runtime further
  - Explore generalization of the CI test infrastructure to provide enhanced flexibility across diverse hardware backends
  - Investigate extending the API set in torch.compile to enable flexible support across diverse hardware backends
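As an illustration of the SDPA and lower-precision item above, the following sketch (shapes are arbitrary and the bf16 choice is just an example) calls scaled_dot_product_attention on an Intel GPU; which fused attention backend serves the call depends on the PyTorch build and hardware:

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: batch, heads, sequence length, head dimension.
B, H, S, D = 2, 16, 2048, 64
q, k, v = (
    torch.randn(B, H, S, D, device="xpu", dtype=torch.bfloat16) for _ in range(3)
)

# Causal SDPA in bf16; the dispatcher selects a fused kernel where available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape, out.dtype)
```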
Focus Areas for Continued Efforts
PyTorch Compiler Core
- PyTorch-native SOTA performance across key model categories and benchmarks, including generation models
  - Enable and optimize Triton templates for Intel GPUs to deliver promising performance
  - Continue optimizing underperforming Triton kernels on Intel GPUs, covering GEMM and additional operations
  - Integrate and enable the XPU SYCL/C++ GEMM template library to accelerate computation on Intel GPUs
  - Support lower-precision data types to run larger models and mitigate GPU memory capacity constraints
- Exportability on Intel GPUs for deployment (a sketch of the export + AOTInductor path follows this list)
  - Add Intel GPU support to AOTI and ensure XPU works well with torch.export, as measured by the pass rate on custom OSS models
  - Ensure torch.export works properly with custom Triton ops for Intel GPU by aligning the design with the torch.compile caching workstream for Intel GPU, which uses device binaries for SerDe and has global states
- Coherent programming model across the board for the PyTorch Compiler (covering both the compile and export paths) on Intel GPUs for better development UX
  - Improve the error messages and documentation for Intel GPU in PyTorch
  - Improve the documentation for torch.compile on Windows for Intel GPUs
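The export and AOTInductor path referenced above can be sketched as follows; this assumes a recent PyTorch release where aoti_compile_and_package and aoti_load_package are available, and the exact API surface may differ across versions:

```python
import torch
from torch._inductor import aoti_compile_and_package, aoti_load_package

class MLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(128, 128)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = MLP().to("xpu").eval()
example_inputs = (torch.randn(4, 128, device="xpu"),)

# Capture the model with torch.export, then build an AOTInductor package.
exported = torch.export.export(model, example_inputs)
package_path = aoti_compile_and_package(exported)

# Deployment side: load the packaged artifact and run it.
runner = aoti_load_package(package_path)
out = runner(*example_inputs)
print(out.shape)
```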
PyTorch Core Libraries
- Improve inference and training on Intel GPUs
  - Ensure Intel GPU compatibility with torchao for popular lower-precision data types and dense computations
  - Achieve SOTA performance for inference on Intel GPUs
  - Deliver competitive fine-tuning and training performance on Intel GPUs
- Reduce overall library entropy while maturing infrastructure to be more secure, performant, and stable on Intel GPUs
- Multi-Backend Evolution
  - Continue generalizing the caching allocator, Event API, and Python bindings for multiple devices, using Intel GPU as a showcase
  - Continue expanding the torch.accelerator namespace to improve device-agnostic programming (see the combined sketch after this list)
  - Collaborate closely with the community to define backend responsibilities and capabilities
- Improve optimizer out-of-the-box (OOTB) performance for Intel GPUs by enabling the fused implementations (also covered in the sketch after this list)
- Continue responding to community-raised issues on Intel GPUs in a timely manner
- Evolve Intel GPU support in PyTorch to embrace the emergent trend of test-time computation
  - Initiate Intel GPU support for the algorithm-zoo library
  - Enable Intel GPU support for PyTorch-native MCTS
  - Enable Intel GPU support for agentic system solutions in PyTorch
- Architect Intel GPU support in TorchCodec
  - Contribute to multi-GPU support in TorchCodec
  - Propose an implementation of Intel GPU support in TorchCodec
- Collaborate with the community to explore establishing a baseline OSS LLM for Intel GPU kernel generation
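A combined sketch for the two items flagged above (device-agnostic programming via torch.accelerator, and the fused optimizer path); whether either is available on a given backend such as xpu depends on the PyTorch build, which is exactly what these roadmap items target:

```python
import torch

# Device-agnostic selection: fall back to CPU when no accelerator is present.
device = (
    torch.accelerator.current_accelerator()
    if torch.accelerator.is_available()
    else torch.device("cpu")
)

model = torch.nn.Linear(256, 256).to(device)

# Request the fused optimizer implementation; per-backend support depends on the build.
use_fused = device.type != "cpu"
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, fused=use_fused)

loss = model(torch.randn(32, 256, device=device)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```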
PyTorch Core Performance
- Attention Module on Intel GPUs (a FlexAttention sketch follows this list)
  - Support FlexAttention on Intel GPUs via Triton and explore SYCL-based implementations
  - Improve oneDNN backend performance for SDPA
  - Explore INT8 and FP8 support for SDPA
- INT8 Quantization - PT2E (a sketch of the PT2E flow follows this list)
  - Onboard Intel GPU to the pt2e quantization flow and its beta features
  - Support group-wise and codebook quantization (palettization) on Intel GPUs
  - Improve Quantization-Aware Training (QAT) by investing in techniques such as improving fine-tuning throughput during QAT
- Support various performance and accuracy techniques in torchao, specifically:
  - Research-friendly building blocks in torchao on Intel GPUs, enabling researchers to develop algorithms (e.g., SpinQuant, GPTQ, and pruning techniques) in torchao on Intel GPUs
- Ensure torchao performs well on a diverse range of Intel GPU hardware for selected models such as Llama 3.1 8B and 3.2 11B (a weight-only quantization sketch follows this list):
  - Demonstrate promising INT8/INT4 performance based on projections
  - Integrate Intel GPU support into the torchao CI/CD pipeline
- Contribute to torchao to establish a robust benchmarking infrastructure to track inference performance for a set of key models on Intel GPUs
  - Run the e2e model benchmark suite in torchbench to measure the performance (latency + accuracy) of torchao inference APIs on key models (e.g., HF optimum and torchao/_models) on Intel GPUs
  - Contribute to torchao to run kernel micro-benchmarks on Intel GPUs
- Integrate torchao + XPU with SGLang, vLLM, HF Transformers, and HF Diffusers
- Align the fine-tuning UX with torchao and enable torchao on Intel GPUs in torchtune:
  - Provide a unified user-facing UX for converting models to low-precision fine-tuning on Intel GPUs
  - Enable unified composability between DTensor and torchao subclasses (e.g., MXFP4Tensor, AffineQuantizedTensor, Float8Tensor)
- Support MX dtypes in both PyTorch and torchao for Intel GPUs
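A minimal FlexAttention sketch for the attention items above; the shapes and the causal score_mod are illustrative, and it assumes an Intel GPU build where torch.compile lowers flex_attention to Triton kernels:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 1024, 64
q, k, v = (
    torch.randn(B, H, S, D, device="xpu", dtype=torch.float16) for _ in range(3)
)

# score_mod implementing causal masking: future positions are pushed to -inf.
def causal(score, b, h, q_idx, kv_idx):
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

# Compiling flex_attention generates fused attention kernels for the target backend.
flex_compiled = torch.compile(flex_attention)
out = flex_compiled(q, k, v, score_mod=causal)
print(out.shape)
```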
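A hedged sketch of the PT2E quantization flow mentioned above, shown with the existing CPU-side X86InductorQuantizer as a stand-in; the roadmap item onboards Intel GPU to the same export-prepare-calibrate-convert flow, and no XPU-specific quantizer names are assumed here:

```python
import torch
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# 1) Export the model, 2) annotate with a quantizer, 3) calibrate, 4) convert.
exported = torch.export.export_for_training(model, example_inputs).module()
quantizer = X86InductorQuantizer()
quantizer.set_global(get_default_x86_inductor_quantization_config())
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)  # calibration pass with representative data
quantized = convert_pt2e(prepared)

# The quantized graph can then be compiled with Inductor for fused INT8 kernels.
optimized = torch.compile(quantized)
```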
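A hedged sketch of torchao weight-only INT4 quantization for the torchao items above; int4_weight_only exists in current torchao, while availability of its kernels on the xpu device depends on the torchao and PyTorch versions, which is what these items track:

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# Toy model in bf16; INT4 weight-only kernels generally expect bf16 activations.
model = (
    torch.nn.Sequential(torch.nn.Linear(4096, 4096))
    .to("xpu", torch.bfloat16)
    .eval()
)

# Swap Linear weights for packed INT4 tensor subclasses in place.
quantize_(model, int4_weight_only())

x = torch.randn(1, 4096, device="xpu", dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)
print(y.shape, y.dtype)
```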
PyTorch Distributed
- Native support for Intel® oneAPI Collective Communications Library (oneCCL) as the xccl backend in torch.distributed (a minimal initialization sketch follows this list)
- Ensure full functionality and UT coverage
- Harden parallelism APIs and broaden adoption across the PyTorch ecosystem
  - FSDP2, PP, and CP, and their composition into 1D-4D parallelism
- End-to-end model convergence (accuracy) and performance for baseline models
- Mature profiling and debugging tools for large scale-up and scale-out configurations
- torchtitan: Enable XCCL support in torchtitan and showcase pretraining for the Llama 3.1 LLM and a reference MoE model
- torchtune: Enable XCCL distributed fine-tuning for standard Llama 3.x models and enable end-to-end distributed recipes such as torchtune/recipes/full_finetune_distributed.py
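A minimal initialization sketch for the xccl backend noted above; it assumes a PyTorch build where the oneCCL-backed xccl backend is registered and that the script is launched with torchrun so the rank/world-size environment variables are set:

```python
import torch
import torch.distributed as dist

# Launch with, e.g.: torchrun --nproc_per_node=<num_xpus> this_script.py
dist.init_process_group(backend="xccl")
rank = dist.get_rank()
torch.xpu.set_device(rank % torch.xpu.device_count())

# Simple collective over oneCCL: sum a tensor across all ranks.
t = torch.full((4,), float(rank), device="xpu")
dist.all_reduce(t, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {t.tolist()}")

dist.destroy_process_group()
```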
Intel CPU
Context
The Intel PyTorch team continues to enhance the performance and feature set of PyTorch on Intel CPU platforms. In past PyTorch releases, the Intel PyTorch team has:
- Optimized FlexAttention on Intel CPU platforms, which significantly improves PyTorch LLM performance for both first-token and next-token latency;
- Matured BF16/FP16 support to Beta level for both eager mode and Inductor mode on Intel CPUs;
- Matured the TorchInductor CPP backend to Beta level on Intel CPUs.
More can be found in past PyTorch release blogs.
Focus Areas for Continued Efforts
- PyTorch-native SOTA performance for key LLM/image-generation models
  - Continue improving PyTorch-native inference performance for key models such as Llama 3.1 and Stable Diffusion to achieve SOTA on Intel CPUs with the Inductor CPP backend
- Competitive GEMM performance on Intel CPUs
  - In past PyTorch releases, the GEMM template in the Inductor C++ backend was already optimized for FP32, BF16, and FP16 data types. As weight-only INT8/INT4 GEMM has become popular, achieving competitive performance for these data types through the Inductor C++ backend will benefit related scenarios with the TorchAO + PyTorch-native stack (a compilation sketch follows this list)
- Mature FlexAttention to ensure reliability and high performance across different scenarios on Intel CPUs
  - FlexAttention is crucial for LLM performance, as different optimization methods apply to different scenarios. The Inductor C++ backend introduced support for FlexAttention in PyTorch 2.6 and optimized it for large-batch-size scenarios in PyTorch 2.7. In PyTorch 2.8, we plan to further enhance its performance with optimizations for block sparsity (improving sparse attention variants) and flash decoding (benefiting long-context scenarios)
- Reduce overall PyTorch Core/library entropy to be more stable and performant on Intel CPUs
  - Resolve 75% of newly opened issues and 90% of newly opened high-priority issues for TorchInductor on Intel CPUs
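A small sketch of the Inductor C++ backend path on CPU referenced in the GEMM item above; mode="max-autotune" turns on template-based GEMM autotuning (choosing between the C++ GEMM template and ATen kernels), and the bf16 toy model is illustrative:

```python
import torch

# bf16 linear layer compiled for CPU with Inductor.
model = torch.nn.Linear(1024, 1024).to(dtype=torch.bfloat16).eval()
x = torch.randn(8, 1024, dtype=torch.bfloat16)

# max-autotune enables template-based GEMM selection in the C++ backend.
compiled = torch.compile(model, mode="max-autotune")
with torch.no_grad():
    y = compiled(x)
print(y.shape, y.dtype)
```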
Summary
In this update, we summarized the progress we’ve made in recent PyTorch releases and outlined our continued efforts as a North Star to improve Intel GPU and Intel CPU support in future releases. Our achievements would not have been possible without the invaluable support from Alban, Andrey, Bin, Jason, Jerry, and Nikita — we sincerely thank them for their contributions.
The Intel PyTorch team will continue to work closely with the PyTorch community to enhance Intel GPU and Intel CPU functionality and performance, covering eager mode, torch.compile, Distributed, and core libraries. At the same time, we are committed to contributing to the broader PyTorch ecosystem, helping make PyTorch increasingly friendly and extensible for integration with diverse hardware backends.