Intel GPU & CPU Enabling Status and Feature Plan – 2025 H1 Update

Intel GPU

Context

We previously published the Intel GPU Enabling Status and Feature Plan to introduce Intel GPU support in PyTorch. Following the recent release of the Meta PyTorch Team 2025 H1 Roadmaps, the Intel PyTorch team has refreshed its status and feature plan to stay aligned with PyTorch community strategies, such as driving open-source adoption in head repositories (organic growth) through UX enhancements, and pursuing foundational optimizations that deliver sustained performance improvements.

Recap

Intel GPU in PyTorch is designed to provide a seamless GPU programming experience, covering both front-end and back-end integration. By leveraging Intel’s advancements in GPU technology, we enhance PyTorch’s performance and versatility, enabling significant workload acceleration and improved processing efficiency. This ensures an out-of-the-box experience for users on the Intel GPU platform while benefiting the broader PyTorch community.

Areas Where We Excelled in Recent PyTorch Releases:

Continued Efforts for Future PyTorch Releases:

  • Achieve PyTorch-native SOTA performance across important model categories and benchmarks on Intel GPUs

    • Enable and improve Template-based GEMM performance for torch.compile
    • Improve performance for LLM models through optimized sdpa, FlexAttention, and lower precision data types
  • Deliver a more streamlined and user-friendly experience on Intel GPUs

    • Improve feature coverage, including higher ATen operation coverage, improved torch.compile, distributed training enablement, better documentation, etc.
    • Expand Intel GPU support to the head libraries of the PyTorch ecosystem, such as torchao, torchtune, torchtitan
    • Enable more Intel GPU platforms to provide cutting-edge hardware features to benefit community users broadly
  • Improve generalization of the PyTorch runtime and backend infrastructure to support a wide range of hardware backends

    • Maintain close collaboration with the community to generalize the PyTorch runtime further
    • Explore generalization of the CI test infrastructure to provide enhanced flexibility across diverse hardware backends
    • Investigate extending the API set in torch.compile to enable flexible support across diverse hardware backends

Focus Areas for Continued Efforts

PyTorch Compiler Core

  • PyTorch-native SOTA performance across key model categories and benchmarks, including generation models

    • Enable and optimize Triton templates for Intel GPUs to deliver promising performance (see the torch.compile sketch after this list)
    • Continue optimizing underperforming Triton kernels on Intel GPUs, covering GEMM and additional operations
    • Integrate and enable XPU SYCL/C++ GEMM template library to accelerate computation on Intel GPUs
    • Support lower-precision data types to run larger models and mitigate GPU memory capacity constraints
  • Exportability on Intel GPUs for deployment

    • Add Intel GPU support to AOTI and ensure XPU works well with torch.export as measured by the pass rate on custom OSS models
    • Ensure torch.export works properly with custom Triton ops for Intel GPU by aligning the design with the torch.compile caching workstream for Intel GPU, which uses device binaries for SerDe and has global state
  • Coherent programming model across the board for the PyTorch Compiler (including both compile and export path) on Intel GPUs for better development UX

    • Improve the error messages and documentation for Intel GPU in PyTorch
    • Improve the documentation for torch.compile on Windows for Intel GPUs
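
As a concrete reference for the torch.compile work above, below is a minimal sketch of compiling a model on an Intel GPU. It assumes a PyTorch build with XPU support and an available Intel GPU; the model and shapes are illustrative only.

```python
import torch

# Minimal sketch: torch.compile on the Intel GPU ("xpu") device.
# Assumes a PyTorch build with XPU support; TorchInductor lowers most
# operations to Triton kernels on this device.
device = "xpu" if torch.xpu.is_available() else "cpu"

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096),
    torch.nn.GELU(),
    torch.nn.Linear(4096, 1024),
).to(device)

compiled_model = torch.compile(model)  # default Inductor backend

x = torch.randn(8, 1024, device=device)
with torch.no_grad():
    y = compiled_model(x)
print(y.shape)  # torch.Size([8, 1024])
```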

PyTorch Core Libraries

  • Improve inference and training on Intel GPUs

    • Ensure Intel GPU compatibility with torchao for popular lower-precision data types and dense computations
    • Achieve SOTA performance for inference on Intel GPUs
    • Deliver competitive fine-tuning and training performance on Intel GPUs
  • Reduce overall library entropy while maturing infrastructure to be more secure, performant, and stable on Intel GPUs

    • Multi-Backend Evolution

      • Continue generalizing the caching allocator, Event API, and Python bindings for multiple devices, using Intel GPU as a showcase
      • Continue expanding the torch.accelerator namespace to improve device-agnostic programming (see the sketch after this list)
      • Collaborate closely with the community to define backend responsibilities and capabilities
    • Improve out-of-the-box (OOTB) optimizer performance for Intel GPUs by enabling fused implementations

    • Continue responding to community-raised issues on Intel GPUs in a timely manner

  • Evolve Intel GPU support in PyTorch to embrace the emerging trend of test-time computation

    • Initiate Intel GPU support for the algorithm-zoo library
    • Enable Intel GPU support for PyTorch-native MCTS
    • Enable Intel GPU support for agentic system solutions in PyTorch
  • Architect Intel GPU support in TorchCodec

    • Contribute to multi-GPU support in TorchCodec
    • Propose implementation of Intel GPU support in TorchCodec
  • Collaborate with the community to explore establishing a baseline OSS LLM for Intel GPU kernel generation
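
To illustrate the device-agnostic programming and fused-optimizer items above, here is a minimal sketch using the torch.accelerator namespace. It assumes a recent PyTorch where torch.accelerator is available and where the chosen backend provides a fused AdamW implementation; the exact API surface is still evolving.

```python
import torch

# Minimal device-agnostic sketch via the torch.accelerator namespace.
if torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator()  # e.g. xpu or cuda
    print(f"{torch.accelerator.device_count()} accelerator(s) of type {device.type}")
else:
    device = torch.device("cpu")

model = torch.nn.Linear(256, 256).to(device)
# fused=True selects the fused optimizer path (assumes the backend supports it)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, fused=True)

x = torch.randn(32, 256, device=device)
loss = model(x).square().mean()
loss.backward()
opt.step()

if torch.accelerator.is_available():
    torch.accelerator.synchronize()  # device-agnostic synchronization
```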

PyTorch Core Performance

  • Attention Module on Intel GPUs

    • Support FlexAttention on Intel GPUs via Triton and explore SYCL-based implementations (see the sketch after this list)
    • Improve oneDNN backend performance for SDPA
    • Explore INT8 and FP8 support for SDPA
  • INT8 Quantization - PT2E

    • Onboard Intel GPU to the PT2E quantization flow and mature it to a Beta feature
    • Support group-wise and codebook quantization (palettization) on Intel GPUs
    • Improve Quantization-Aware Training (QAT) by investing in techniques such as improving fine-tuning throughput during QAT
  • Support various performance and accuracy techniques in torchao, specifically

    • Provide research-friendly building blocks in torchao on Intel GPUs so researchers can develop algorithms such as SpinQuant, GPTQ, and pruning techniques there
  • Ensure torchao performs well on a diverse range of Intel GPU hardware for selected models such as Llama 3.1 8B and 3.2 11B:

    • Demonstrate promising INT8/INT4 performance based on projections
    • Integrate Intel GPU support into the torchao CI/CD pipeline
  • Contribute to torchao to establish a robust benchmarking infrastructure to track performance for inference on a set of key models on Intel GPUs

    • Run the e2e model benchmark suite in torchbench to measure the performance (latency + accuracy) of torchao inference APIs on key models (e.g., HF optimum and torchao/_models) on Intel GPUs
    • Contribute to torchao to run kernel micro-benchmark on Intel GPUs
  • Integrate torchao + XPU with SGLang, vLLM, HF Transformers, and HF Diffusers

  • Align fine-tuning UX with torchao and enable torchao on Intel GPUs in torchtune:

    • Provide a unified user-facing UX for converting models to low precision for fine-tuning on Intel GPUs
    • Enable unified composability between DTensor and torchao subclasses (e.g., MXFP4Tensor, AffineQuantizedTensor, Float8Tensor)
  • Support MX dtypes in both PyTorch and torchao for Intel GPUs
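
As a reference for the FlexAttention item above, here is a minimal sketch of the FlexAttention API with a causal block mask. It assumes a PyTorch build where torch.nn.attention.flex_attention works on the chosen device ("xpu" shown as the assumption, with a CPU fallback); shapes and dtypes are illustrative.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Assumes XPU availability; falls back to CPU purely for illustration.
device = "xpu" if torch.xpu.is_available() else "cpu"

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device=device, dtype=torch.bfloat16) for _ in range(3))

# Causal masking expressed as a mask_mod; FlexAttention turns this into a
# block mask so fully masked blocks can be skipped (the block-sparsity case above).
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=device)

# flex_attention is typically compiled for performance.
compiled_flex = torch.compile(flex_attention)
out = compiled_flex(q, k, v, block_mask=block_mask)
print(out.shape)  # (B, H, S, D)
```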

PyTorch Distributed

  • Native support for Intel® oneAPI Collective Communications Library (oneCCL) as the xccl backend in torch.distributed (see the sketch after this list)

    • Ensure full functionality and UT coverage

    • Harden parallelism APIs and broaden adoption across the PyTorch ecosystem

      • FSDP2, PP, and CP, and their composition into 1D-4D parallelism
    • End-to-end model convergence (accuracy) and performance for baseline models

  • Mature profiling and debugging tools for large scale-up and scale-out configurations

  • torchtitan: Enable XCCL support and showcase pretraining for the Llama 3.1 LLM and a reference MoE model

  • torchtune: Enable XCCL distributed fine-tuning for standard Llama 3.x models and enable end-to-end distributed recipe(s) such as torchtune/recipes/full_finetune_distributed.py
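
As a reference for the xccl backend item above, here is a minimal sketch of initializing torch.distributed with the xccl backend and running an all-reduce. It assumes a PyTorch build with XPU and oneCCL (xccl) support and launch via torchrun, which sets the usual rank/world-size environment variables.

```python
import os
import torch
import torch.distributed as dist

# Minimal sketch: torch.distributed with the oneCCL-based "xccl" backend on Intel GPUs.
# Assumes launch via `torchrun --nproc_per_node=<num_gpus> this_script.py`.
def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.xpu.set_device(local_rank)          # one Intel GPU per process
    dist.init_process_group(backend="xccl")

    t = torch.ones(4, device=f"xpu:{local_rank}") * (dist.get_rank() + 1)
    dist.all_reduce(t, op=dist.ReduceOp.SUM)  # sum across all ranks
    print(f"rank {dist.get_rank()}: {t}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```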

Intel CPU

Context

The Intel PyTorch team continues to enhance the performance and feature set of PyTorch on Intel CPU platforms. In past PyTorch releases, the Intel PyTorch team has:

  • Optimized FlexAttention on Intel CPU platforms, significantly improving PyTorch LLM performance for both first-token latency and next-token latency;
  • Matured BF16/FP16 support to Beta level for both eager mode and Inductor mode on Intel CPUs;
  • Matured the TorchInductor CPP backend to Beta level on Intel CPUs.

More can be found in past PyTorch release blogs.

Focus Areas for Continued Efforts

  • PyTorch-native SOTA performance of key LLM/Image-Generation models

    • Continue improving PyTorch-native inference performance for key models like Llama 3.1 and Stable Diffusion to achieve SOTA on Intel CPUs with the Inductor CPP backend
  • Competitive GEMM performance on Intel CPUs

    • In past PyTorch releases, the GEMM template in the Inductor C++ backend was already optimized for the FP32, BF16, and FP16 data types. As weight-only INT8/INT4 GEMM has become popular, achieving competitive performance for these data types through the Inductor C++ backend will benefit related scenarios in the torchao + PyTorch-native stack (see the sketch after this list)
  • Mature FlexAttention to ensure reliability and high performance for different scenarios on Intel CPUs

    • FlexAttention is crucial for LLM performance, as different optimization methods apply to different scenarios. The Inductor C++ backend introduced support for FlexAttention in PyTorch 2.6 and optimized it for large batch size scenarios in PyTorch 2.7. In PyTorch 2.8, we plan to further enhance its performance with optimizations for block sparsity (improving sparse attention variants) and flash decoding (benefiting long-context scenarios)
  • Reduce overall PyTorch Core/library entropy to be more stable and performant on Intel CPUs

    • Resolve 75% of newly opened issues and 90% of newly opened high-priority issues for TorchInductor on Intel CPUs
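
To make the weight-only GEMM item above concrete, here is a minimal sketch of weight-only INT8 quantization with torchao followed by torch.compile, so the Inductor C++ backend can generate the corresponding GEMM kernels on CPU. It assumes torchao is installed; the helper names reflect current torchao releases and may evolve.

```python
import torch
from torchao.quantization import quantize_, int8_weight_only

# Minimal sketch: torchao weight-only INT8 quantization + torch.compile on CPU.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).eval()

quantize_(model, int8_weight_only())   # swap Linear weights to INT8 tensor subclasses
compiled = torch.compile(model)        # Inductor generates the weight-only GEMM path

x = torch.randn(1, 4096)
with torch.inference_mode():
    y = compiled(x)
print(y.shape)  # torch.Size([1, 4096])
```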

Summary

In this update, we summarized the progress we’ve made in recent PyTorch releases and outlined our continued efforts as a North Star to improve Intel GPU and Intel CPU support in future releases. Our achievements would not have been possible without the invaluable support from Alban, Andrey, Bin, Jason, Jerry, and Nikita — we sincerely thank them for their contributions.

The Intel PyTorch team will continue to work closely with the PyTorch community to enhance Intel GPU and Intel CPU functionality and performance, covering eager mode, torch.compile, Distributed, and core libraries. At the same time, we are committed to contributing to the broader PyTorch ecosystem, helping make PyTorch increasingly friendly and extensible for integration with diverse hardware backends.


Thanks a lot for sharing these updates!