Intel GPU Enabling Status and Feature Plan

Introduction

Intel GPU support in PyTorch is designed to offer a seamless GPU programming experience, covering both the front end and the back end.

The upstreaming process for Intel GPU begins with torch.compile as the initial step and progressively enables eager-mode ATen operations. Functionality and performance are benchmarked with the Dynamo benchmark suites: HF, TIMM, and TorchBench.

The [RFC] Intel GPU Upstreaming provides more details about the Intel GPU upstreaming design. From an implementation perspective, Intel GPU support will be integrated into PyTorch gradually, maturing in functionality and capability with each PyTorch release.

NOTE: The device name for Intel GPU in PyTorch is XPU. Therefore, XPU represents Intel GPU in this post.
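
Since torch.xpu mirrors the familiar torch.cuda device APIs, a quick sanity check looks like the following minimal sketch (assuming a PyTorch build with XPU support):

    import torch

    # Confirm the XPU runtime is visible to this PyTorch build.
    if torch.xpu.is_available():
        x = torch.randn(4, 4, device="xpu")  # allocate directly on the Intel GPU
        y = (x @ x).cpu()                    # compute on XPU, copy back to host
        print(torch.xpu.get_device_name(0), y.shape)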

PyTorch 2.4

PyTorch 2.4 has been released, with Intel GPU providing essential support for both eager mode and torch.compile as a prototype feature. Users can get Intel GPU support by building PyTorch from source.

  • Eager mode: Implemented key ATen operators with the SYCL programming language.
  • torch.compile: Integrated Intel GPU backend for Inductor on top of Triton.
  • Performance-critical operators such as Conv and GEMM are highly optimized for both eager mode and torch.compile using the oneAPI Deep Neural Network Library (oneDNN).

The blog PyTorch 2.4 Supports Intel® GPU Acceleration of AI Workloads provides further details.
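
As a minimal sketch of what the 2.4 prototype enables (assuming a source build with XPU support), the same model can run in eager mode and through the Inductor/Triton backend by changing only the device:

    import torch

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 8, kernel_size=3),  # Conv/GEMM paths use oneDNN
        torch.nn.ReLU(),
    ).to("xpu")
    x = torch.randn(1, 3, 32, 32, device="xpu")

    eager_out = model(x)             # eager mode: SYCL-implemented ATen ops
    compiled = torch.compile(model)  # torch.compile: Inductor on top of Triton
    compiled_out = compiled(x)
    print(torch.allclose(eager_out, compiled_out, atol=1e-4))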

PyTorch 2.5

For the upcoming PyTorch 2.5 release, we aim to broaden ATen operation coverage to improve eager-mode support, while continuing to develop and refine torch.compile for Intel GPU. We will also focus on optimizing performance for both eager mode and torch.compile.

Scope of PyTorch 2.5

  • Enhanced Functionality and Performance: Improve both torch.compile and eager-mode functionality along with overall performance.
  • Better Intel GPU Support: Extend support to selected Intel GPUs from both data center and client categories.
  • Cross-Platform Compatibility: Ensure compatibility with both Linux and Windows.

Current Status for PyTorch 2.5

  • torch.compile
    • Implemented JIT mode/Python wrapper.
    • Successfully passed accuracy mode tests for Dynamo HF, TorchBench, and TIMM.
  • Eager Mode
    • Implemented a significant portion of ATen operations, prioritized by the Dynamo benchmarks and usage on other platforms.
    • Successfully passed accuracy mode tests for Dynamo HF, TorchBench, and TIMM.
  • Runtime
    • Completed essential Intel GPU runtime support.
  • ABI Mode
    • Ensured compatibility with both ABI=0 and ABI=1 builds (_GLIBCXX_USE_CXX11_ABI).
  • CI/CD
    • Infrastructure: Finished Stage 0 – On-demand Intel GPU CI.
    • Test Cases: Enabled Inductor test cases for Intel GPU.

PyTorch 2.6

For PyTorch 2.6, we aim to implement most ATen operations and further enable the torch.compile stack for Intel GPU, including support for AOTInductor and torch.export. Enhancements for eager mode and torch.compile are planned, along with initial support for distributed computing.

  • torch.compile
    • Functionality: Support AOTInductor, torch.export, and INT8 quantization (PT2E); see the sketches after this list.
    • Performance: Improve performance continuously.
  • Eager mode support
    • Functionality: Support most ATen operations and INT8 quantization operators.
    • Performance: Improve performance continuously.
  • Torch libraries
    • Add Intel GPU support to selected torch libraries.
  • Distributed (Linux, Ponte Vecchio only)
    • Provide initial FSDP/DDP support for Intel GPU; see the DDP sketch after this list.
  • Platforms
    • OS: Support WSL2 in addition to Linux and Windows.
    • Hardware: Support selected Intel GPUs from both data center and client categories.
  • CI/CD
    • Enable Intel GPU Build-CI by default to gate PyTorch PRs.
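
For the torch.export and AOTInductor items above, the intended flow looks roughly like this sketch; the packaging entry points follow the AOTInductor tutorials and have shifted between releases, so treat them as assumptions:

    import torch

    class MLP(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.fc = torch.nn.Linear(64, 64)

        def forward(self, x):
            return torch.relu(self.fc(x))

    model = MLP().to("xpu").eval()
    example = (torch.randn(8, 64, device="xpu"),)

    # Capture a whole-graph ExportedProgram ahead of time.
    exported = torch.export.export(model, example)

    # Ahead-of-time compile and load; these helpers mirror the AOTInductor
    # tutorial and may differ by release (an assumption, not a fixed API).
    package = torch._inductor.aoti_compile_and_package(exported)
    runner = torch._inductor.aoti_load_package(package)
    print(runner(*example).shape)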
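
For the INT8 (PT2E) item, post-training quantization follows the standard prepare/convert pattern sketched below; the XNNPACK quantizer stands in for illustration, since the XPU-specific quantizer this roadmap implies is an assumption and may ship under a different name:

    import torch
    from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
    from torch.ao.quantization.quantizer.xnnpack_quantizer import (
        XNNPACKQuantizer,
        get_symmetric_quantization_config,
    )

    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
    example = (torch.randn(2, 16),)

    # Capture the graph that PT2E quantization expects.
    captured = torch.export.export_for_training(model, example).module()

    quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
    prepared = prepare_pt2e(captured, quantizer)
    prepared(*example)                  # calibrate with representative inputs
    quantized = convert_pt2e(prepared)
    print(quantized(*example).shape)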
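
And for the initial FSDP/DDP item, standard DDP usage carries over to XPU as sketched below; the process-group backend name is an assumption (for example "ccl" as registered by Intel's oneccl_bindings_for_pytorch), since upstream backend naming may differ:

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    # Launch with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set, e.g.:
    #   torchrun --nproc_per_node=2 ddp_xpu.py
    # "ccl" is the backend name used by Intel's oneCCL bindings; treat it
    # as an assumption rather than an upstream default.
    dist.init_process_group(backend="ccl")
    torch.xpu.set_device(dist.get_rank() % torch.xpu.device_count())

    model = torch.nn.Linear(16, 16).to("xpu")
    ddp_model = DDP(model)  # device is inferred from the module's parameters

    out = ddp_model(torch.randn(4, 16, device="xpu"))
    out.sum().backward()
    dist.destroy_process_group()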

Post PyTorch 2.6

Beyond the detailed plan outlined above, our ongoing efforts will focus on enhancing functionality and performance to expand usability. Key areas of development will include:

  • Functionality
    • Achieve mature eager mode support.
    • Augment torch.compile capabilities, including model coverage, performance, and feature enhancements.
    • Broaden distributed support.
    • Enhance support for quantization, sparsity, and low-precision techniques for both training and inference.
    • XPUGraph: Upstream XPUGraph to streamline performance optimizations.
  • Torch Libraries
    • Extend support and integration for Intel GPU across the Torch ecosystem libraries.
  • Workloads
    • Enhance performance across all phases of the large language model and generative AI lifecycle, from pre-training and fine-tuning to inference.