IBM Spyre Accelerator: PyTorch Enabling Status and Feature Plan - 1H 2026

Introduction

This roadmap outlines IBM’s plan for integrating the Spyre accelerator with the PyTorch ecosystem during the first half of 2026. Our goal is to provide seamless PyTorch support for Spyre by building on existing PyTorch ecosystem components, ensuring minimal runtime overhead while maximizing performance and developer productivity. Equally important, we are committed to contributing back to the community — generalizing dataflow accelerator enablement in torch.inductor and lower layers, contributing OpenReg testing infrastructure into PyTorch core, and establishing out-of-tree CI/CD infrastructure — so that our work benefits not just Spyre but the broader ecosystem of dataflow accelerators.

Scope for 1H 2026: All work in this roadmap is scoped to inference only at FP16 and FP8 precisions, starting with the following priority models:

  1. GPT-OSS (20B)
  2. Granite 4 Hybrid (30B)
  3. Mistral-small
  4. Qwen 2.5 VL 7B
  5. Llama 3.1 8B-instruct
  6. Ministral 8B
  7. Ministral 14B

Motivation

We are building PyTorch-native support for the Spyre accelerator, designed from the ground up around upstream integration mechanisms. This approach is driven by several key goals:

Ecosystem-First Development:

  • Leveraging torch.inductor and out-of-tree extensions as the primary compilation and integration path
  • Minimizing custom code by building on community-maintained infrastructure
  • Freeing up engineering capacity to contribute hardware-specific optimizations and improvements back to the PyTorch ecosystem

Ecosystem Access and Model Coverage:

  • Direct access to the rapidly expanding PyTorch model ecosystem (vLLM, Hugging Face, etc.)
  • Faster time-to-support for new model architectures
  • Ability to leverage community tools, optimizations, and best practices

Sustainability and Community Collaboration:

  • Alignment with PyTorch standards ensures long-term compatibility as the ecosystem evolves
  • Opportunities to contribute improvements back to the community
  • Easier collaboration with external partners and customers already using PyTorch

Contributing Back: Dataflow Accelerator Enablement:

  • Generalizing torch.inductor pathways to work across dataflow architectures, not just Spyre-specific optimizations
  • Contributing a tile-level intermediate representation that any dataflow accelerator can use for lower-level kernel scheduling
  • Ensuring our upstream contributions are designed for broad adoption, making it easier for the next dataflow accelerator to onboard

Production-Grade Performance:

  • Ensuring the ecosystem integration approach meets production deployment requirements
  • Validating that the PyTorch-native approach can deliver high performance on dataflow accelerators
  • Establishing CI/CD infrastructure for continuous validation and regression detection across the full stack

This approach represents a strategic investment in long-term sustainability while delivering the performance standards our users expect.

Roadmap Overview

Our work focuses on four key pillars:

  1. PyTorch Core Integration - Deep integration with torch.inductor, runtime mechanisms, distributed inference, and profiling
  2. Backend Compiler - Building a robust compiler backend with KernelTile IR (KTIR) as the community-aligned intermediate representation for dataflow accelerators
  3. vLLM Integration - Production inference support for large language models
  4. CI/CD Infrastructure - Establishing out-of-tree CI support for the PyTorch ecosystem

We are committed to contributing generic primitives and improvements back to the PyTorch community, including OpenReg testing infrastructure and broader KTIR adoption across AI accelerators.

For detailed technical specifications and design documents, please refer to our RFCs repository.


PyTorch Core Integration

torch.inductor Integration

Our primary objective is to integrate Spyre into torch.inductor with maximal performance at the PyTorch level.

Inductor extensions for dataflow accelerators:

  • Introduce tile-based tensor layout representations in inductor
  • Implement multi-core work division over tiles as part of inductor passes
  • Add scratchpad memory optimizations that enable dataflow accelerators
  • Performant multi-card inference optimization is out of scope for 1H 2026 (functional distributed inference covered separately below)
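As a concrete illustration of the integration point, an out-of-tree torch.compile backend can be registered by name and handed the captured FX graph. This is a minimal pass-through sketch: the backend name "spyre_sketch" is a stand-in, and a real integration would forward the graph to spyre-inductor for tile-level lowering rather than running it unmodified.

```python
import torch
from torch._dynamo import register_backend

# Hypothetical out-of-tree backend entry point; "spyre_sketch" is a
# placeholder name. A real integration would lower the FX graph
# through spyre-inductor instead of returning it unchanged.
@register_backend
def spyre_sketch(gm: torch.fx.GraphModule, example_inputs):
    # Pass-through: run the captured graph as-is on CPU.
    return gm.forward

@torch.compile(backend="spyre_sketch")
def scaled_add(x, y):
    return 2 * x + y

result = scaled_add(torch.ones(4), torch.ones(4))
print(result)  # tensor([3., 3., 3., 3.])
```

The registry lookup by string name is what lets users select the backend with plain `torch.compile(backend="spyre_sketch")` without importing vendor code directly.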

Key Metrics:

  • Achieve time-to-first-token (TTFT), inter-token latency (ITL), and throughput metrics that meet production deployment requirements on a single card
  • Priority models compiling end-to-end through torch.inductor by end of 1H 2026

torch.runtime Integration

Enable Spyre through PyTorch's built-in runtime mechanisms to ensure minimal overhead and support eager mode execution.

Device Registration and Extensions:

  • Integrate device registration and startup using core PyTorch out-of-tree extensions
  • Implement memory management, data transfer, and dispatcher as out-of-tree extensions with minimal overhead
  • Contribute generic primitives of OpenReg testing infrastructure into PyTorch core for broader community benefit
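The out-of-tree registration path above can be sketched with PyTorch's reserved PrivateUse1 hook. Everything here is illustrative: "spyre" is used only as a placeholder device name, and the stub module stands in for a real C++ extension that would implement the allocator, copy kernels, and dispatcher registrations.

```python
import types
import torch

# Hedged sketch of out-of-tree device registration via the reserved
# PrivateUse1 backend; a real integration would load a compiled
# extension providing memory management and dispatcher kernels.
torch.utils.rename_privateuse1_backend("spyre")

# Minimal stub device module (no hardware behind it in this sketch).
spyre_mod = types.ModuleType("torch_spyre_stub")
spyre_mod.is_available = lambda: False
spyre_mod.device_count = lambda: 0
torch._register_device_module("spyre", spyre_mod)

print(torch.device("spyre", 0))  # spyre:0
```

After registration, `torch.device("spyre")` parses like any first-class device string, which is what keeps user model code free of vendor-specific imports.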

Key Metrics:

  • 100% of device lifecycle (registration, startup, teardown) implemented via out-of-tree extensions
  • Runtime overhead from PyTorch integration layer: <5% compared to direct device access
  • OpenReg primitives contributed upstream: ≥3 PRs merged into PyTorch core

Op Coverage (Integration in Core PyTorch)

Integrate new ops into torch.inductor for Spyre, focusing on registration and layout constraint propagation.

Torch Op Integration:

  • Increase torch op coverage to enable the priority models
  • Register new ops in torch.inductor and implement layout constraint propagation for each op
  • Enable seamless integration of custom kernels with torch.compile workflow
  • Implement new ops through the backend compiler IR as the single entry point for all op lowering and code generation

Key Metrics:

  • Op integration in torch.inductor sufficient to support priority models listed above
  • Layout constraint propagation validated for all integrated ops
  • End-to-end torch.compile workflow validated on priority models

Distributed Inference

Enable multi-card inference for Spyre using PyTorch distributed primitives, with a phased approach from compiled functional collectives to full torch.distributed integration.

Compiled Functional Collectives (1H 2026):

  • Support compilation of functional collective operations (all-reduce, all-gather) through torch.inductor
  • Distributed inference support targeting all priority models via compiled functional collectives

Migration to torch.distributed (1H 2026 and beyond):

  • Transition to torch.distributed for eager mode collective operations
  • Eventual migration to torch.comms as the long-term community communication layer

Key Metrics:

  • All priority models running distributed inference end-to-end via compiled functional collectives
  • Functional correctness validated: distributed inference results match single-card reference outputs

Profiling and Performance Analysis

Build a profiling toolkit for Spyre that integrates with PyTorch profiling infrastructure and provides performance visibility from end-to-end model execution down to intra-kernel behavior.

Spyre Profiling Toolkit:

  • Build a System Management Interface (Spyre SMI) for device monitoring: power, temperature, utilization, memory bandwidth, and per-process resource usage
  • Integrate with PyTorch Profiler via upstream Kineto plugin to trace Spyre kernel execution, memory timelines, and call stacks
  • Extend the Spyre trace analyzer to reach metric parity with Holistic Trace Analysis (HTA) and support multi-Spyre setups
  • Design IR instrumentation-based profiler using FX graph observability hooks for selective operator-level and intra-kernel profiling
  • Extend Inductor provenance tracking to show IR after any user-specified compiler pass
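The Kineto integration point can be shown with a CPU-only profile. This is a sketch of where a Spyre plugin would attach: a device plugin would add kernel activities and memory timelines alongside the CPU events collected here.

```python
import torch
from torch.profiler import ProfilerActivity, profile

# CPU-only sketch of the PyTorch Profiler integration point; a Spyre
# Kineto plugin would contribute device activities next to these
# CPU-side events.
model = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Aggregate per-op statistics; the linear layer shows up as aten ops.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```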

Key Metrics:

  • Profiling overhead: <5% on application performance at standard device instrumentation level (SMI)
  • 100% identical results with and without profiling enabled
  • All profiling tools designed for upstream contribution or open-source release

Backend Compiler

IR Integration Across the Compilation Stack

The backend compiler connects the spyre-inductor frontend (our out-of-tree torch.inductor extension) to Spyre code generation through optimization passes exercising a set of mid-level and low-level IRs, enabling clean separation of concerns and community-aligned extensibility.

SuperDSC (SDSC) — Backend Compiler IR:

  • SuperDSC (SDSC) is the compiler IR produced after the spyre-inductor frontend’s optimization passes; the Spyre backend compiler consumes it for further lowering and optimization ahead of code generation
  • Enable clean separation of concerns between the PyTorch integration layer and hardware-specific optimizations
  • All op lowering and code generation flows through SDSC as the single entry point into the backend
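Since SDSC itself is not public, the sketch below only illustrates the hand-off point: the traced FX graph is the kind of high-level program that spyre-inductor would lower into SDSC before the backend compiler takes over.

```python
import torch
import torch.fx

# A small model fragment; its traced graph stands in for the input
# that spyre-inductor would lower into SDSC (not shown here).
def layer(x, w):
    return torch.relu(x @ w)

gm = torch.fx.symbolic_trace(layer)
print(gm.graph)  # placeholder, call_function, and output nodes

# Node kinds the lowering would walk when producing backend IR.
ops = [node.op for node in gm.graph.nodes]
```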

KernelTile IR (KTIR) — Community-Aligned Specification:

  • KernelTile IR (KTIR) is the longer-term community-aligned specification, designed for adoption across dataflow accelerators
  • KTIR generalizes the tile-level intermediate representation so that other dataflow accelerators can leverage it for lower-level scheduling

Dataflow Scheduling and Code Generation

Realize efficient dataflow scheduling and code generation for Spyre hardware.

Dataflow Scheduling:

  • Open-source contributions to automatic dataflow scheduling
  • Design scheduling algorithms to be useful and adaptable for other dataflow accelerators
  • Develop a native programming language for dataflow accelerators used for development and verification

Key Metrics:

  • 100% of torch ops required by priority models expressible in SDSC
  • All priority models compiling end-to-end through the backend compiler
  • SDSC generation time from spyre-inductor lowering: a few minutes per priority model
  • Complete KTIR spec published

vLLM Integration

Enable Spyre in the vLLM ecosystem and expand supported models for production inference workloads.

Model Support:

  • Adopt modeling code from vLLM, consolidating on upstream model implementations
  • Enable Spyre support for the priority models through vLLM

Performance Optimization:

  • Develop a new Spyre attention backend for vLLM that removes the homogeneous sequence-length constraint
  • Improve how upstream vLLM handles caching of torch.compile artifacts

API Stability:

  • Collaborate with the vLLM community to stabilize the platform plugin interface
  • Establish a predictable release cadence for platform plugin API changes

Key Metrics:

  • Priority models serving inference end-to-end through vLLM on Spyre
  • Inter-token latency: significant reduction via new attention backend
  • Startup time: a few seconds with torch.compile artifact caching
  • Breaking changes in platform plugin interface: ≤1 per quarter

Testing and CI/CD Infrastructure

Establish a new CI/CD pipeline for using the Spyre accelerator with PyTorch and vLLM, enabling out-of-tree CI support in collaboration with the broader PyTorch community. Testing spans from op-level checks through full vLLM inference validation, all scoped to the priority models.

Test Categories:

  • Op-level Tests: Validate individual torch ops for the 7 priority models (see Introduction)
  • Inductor Tests: Ensure torch.inductor integration correctness
    • Compilation accuracy and performance validation
    • Lowering and code generation verification
  • Module-level Tests: Test PyTorch module components and building blocks
    • Attention mechanisms, normalization layers, activations
    • Memory management and data transfer correctness
  • Top-level Model Tests: End-to-end model validation comparable to vLLM’s model test suite
    • Quality metrics: accuracy, convergence, numerical stability
    • Performance metrics: throughput, latency, memory utilization
  • vLLM Integration Tests: End-to-end inference validation through vLLM on Spyre
    • Model loading, compilation, and serving for priority models
    • Throughput, latency, and correctness validation matching vLLM benchmarks

Key Metrics:

  • PyTorch/vLLM ecosystem OSS tests relevant to priority models running nightly
  • Test pass rate on executed suite: >95% on nightly runs
  • Nightly regression detection: failures flagged within a few hours of a commit
  • CI pipeline uptime: >99% availability
  • Average full pipeline run time: <3 hours

Summary

IBM’s Spyre accelerator integration with PyTorch in 1H 2026 focuses on co-development with the PyTorch ecosystem, leveraging PyTorch’s out-of-tree extension mechanisms to minimize custom code while maximizing performance. Our comprehensive approach spans backend compiler integration with torch.inductor, runtime integration (device registration, memory management, distributed inference, and profiling), ecosystem tooling (CI/CD infrastructure), and production deployment (vLLM support).

We are committed to contributing generic improvements back to the PyTorch community, including OpenReg testing primitives, KTIR generalization for AI accelerators, and collaboration on out-of-tree CI support infrastructure. This ensures that our work benefits not just Spyre users but the broader PyTorch ecosystem.

