Introduction
Intel GPU in PyTorch is designed to offer a seamless GPU programming experience, accommodating both the front-end and back-end.
The upstreaming process for Intel GPU begins with torch.compile as the initial step and progressively enables eager-mode ATen operations. Functionality and performance are benchmarked with the Dynamo benchmark suites, specifically HF, TIMM, and TorchBench.
The [RFC] Intel GPU Upstreaming provides more details about the Intel GPU upstreaming design. From an implementation perspective, Intel GPU support will be integrated gradually into PyTorch, enhancing its maturity and capabilities with each PyTorch release.
NOTE: The device name for Intel GPU in PyTorch is XPU. Therefore, XPU represents Intel GPU in this post.
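A minimal sketch of addressing Intel GPU through the XPU device, assuming a PyTorch build with XPU support is installed:

```python
import torch

# "xpu" is the PyTorch device name for Intel GPU, analogous to "cuda" for NVIDIA GPUs.
if torch.xpu.is_available():
    device = torch.device("xpu")
    x = torch.randn(4, 4, device=device)  # allocate directly on the Intel GPU
    y = torch.ones(4, 4).to(device)       # or move an existing CPU tensor over
    print((x + y).device)                 # xpu:0
```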
PyTorch 2.4
PyTorch 2.4 has been released with essential Intel GPU support for both eager mode and torch.compile as a prototype feature. Users can get Intel GPU support by building PyTorch from source.
- Eager mode: Implemented key ATen operators with the SYCL programming language.
- torch.compile: Integrated the Intel GPU backend for Inductor on top of Triton.
- The most performance-critical operators, such as Conv and GEMM, are highly optimized for both eager mode and torch.compile using the oneAPI Deep Neural Network Library (oneDNN).
The blog PyTorch 2.4 Supports Intel® GPU Acceleration of AI Workloads provides further details.
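As a rough illustration of the prototype support, here is a sketch, assuming a source build of PyTorch 2.4 with XPU enabled, that runs a small model whose dominant operators are Conv and GEMM under torch.compile:

```python
import torch
import torch.nn as nn

# A small model dominated by Conv and GEMM, the operators oneDNN accelerates on XPU.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 10),
).to("xpu")

compiled_model = torch.compile(model)  # Inductor generates Triton kernels for XPU
out = compiled_model(torch.randn(8, 3, 32, 32, device="xpu"))
print(out.shape)  # torch.Size([8, 10])
```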
PyTorch 2.5
For the upcoming PyTorch 2.5 release, we aim to broaden ATen operation coverage to improve eager-mode support, while continuing to develop and refine torch.compile functionality for Intel GPU. We will also focus on optimizing the performance of both eager mode and torch.compile on Intel GPU.
Scope of PyTorch 2.5
- Enhanced Functionality and Performance: Improve both torch.compile and eager mode functionality, aiming to enhance overall performance.
- Better Intel GPU Support: Extend support to selected Intel GPUs from both data center and client categories.
- Cross-Platform Compatibility: Ensure compatibility with both Linux and Windows.
Current Status for PyTorch 2.5
- torch.compile
  - Implemented JIT mode (Python wrapper).
  - Successfully passed accuracy mode tests for the Dynamo HF, TorchBench, and TIMM suites (see the sketch after this list).
- Eager Mode
  - Implemented a significant portion of ATen operations, prioritized by Dynamo benchmarks and other platforms.
  - Successfully passed accuracy mode tests for the Dynamo HF, TorchBench, and TIMM suites.
- Runtime
  - Essential Intel GPU runtime support is ready.
- ABI Mode
  - Ensured compatibility for both ABI=0 and ABI=1.
- CI/CD
  - Infrastructure: Finished Stage 0 (on-demand Intel GPU CI).
  - Test Cases: Enabled Inductor test cases for Intel GPU.
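The accuracy mode tests above run through the Dynamo benchmark suites; conceptually, they compare compiled output against an eager reference. A toy sketch of that methodology (the model and tolerances here are illustrative placeholders, not the suites' actual settings):

```python
import torch

# Run the same module in eager mode and through torch.compile (Inductor) on XPU,
# then compare the outputs, mirroring the accuracy-mode checks at a small scale.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.GELU()).to("xpu").eval()
x = torch.randn(8, 64, device="xpu")

with torch.no_grad():
    reference = model(x)  # eager reference
    compiled = torch.compile(model, backend="inductor")
    candidate = compiled(x)

# Illustrative tolerances; the real suites use per-model thresholds.
torch.testing.assert_close(candidate, reference, rtol=1e-3, atol=1e-3)
print("accuracy check passed")
```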
PyTorch 2.6
For PyTorch 2.6, we aim to implement most ATen operations and further enable the torch.compile stack for Intel GPU, including support for AOTInductor and torch.export. Enhancements to eager mode and torch.compile are planned, along with initial support for distributed computing.
- torch.compile
  - Functionality: Support AOTInductor, torch.export, and INT8 quantization (PT2E).
  - Performance: Improve performance continuously.
- Eager mode
  - Functionality: Support most ATen operations and INT8 quantization (quantization ops).
  - Performance: Improve performance continuously.
- Torch libraries
  - Add Intel GPU support to selected torch libraries.
- Distributed (Linux, Ponte Vecchio only)
  - Provide initial FSDP/DDP support for Intel GPU (see the sketch after this list).
- Platforms
  - OS: Support WSL2 in addition to Linux and Windows.
  - Hardware: Support selected Intel GPUs from both data center and client categories.
- CI/CD
  - Enable the Intel GPU build CI by default to gate PyTorch PRs.
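To illustrate what initial DDP support could look like, here is a hedged sketch of a distributed data-parallel step on XPU devices. The process group backend name ("ccl", tied to oneCCL bindings) is an assumption and may differ in the eventual upstream release; the rest uses standard torch.distributed APIs:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumption: a oneCCL-based process group backend is registered for XPU.
# The backend name "ccl" is an assumption, not a confirmed upstream API.
dist.init_process_group(backend="ccl")

rank = dist.get_rank()
device = torch.device(f"xpu:{rank % torch.xpu.device_count()}")
torch.xpu.set_device(device)

model = torch.nn.Linear(1024, 1024).to(device)
ddp_model = DDP(model)  # gradients are all-reduced across ranks on backward
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

inputs = torch.randn(32, 1024, device=device)
loss = ddp_model(inputs).sum()
loss.backward()
optimizer.step()

dist.destroy_process_group()
```

Launched, for example, with torchrun so that each rank binds to one XPU device.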
Post PyTorch 2.6
Beyond the detailed plan outlined above, our ongoing efforts will focus on enhancing functionality and performance to expand usability. Key areas of development will include:
- Functionality
  - Achieve mature eager mode support.
  - Augment torch.compile capabilities, including model coverage, performance improvements, and feature enhancements.
  - Broaden distributed support.
  - Enhance support for quantization, sparsity, and low-precision techniques for both training and inference.
  - XPUGraph: Upstream XPUGraph to streamline performance optimizations.
- Torch Libraries
  - Extend support and integration for Intel GPU across the Torch ecosystem libraries.
- Workloads
  - Enhance performance across all phases of the large language model and generative AI lifecycle, from pre-training and fine-tuning to inference.