[tac] Follow up: Inductor HW backend implementation

Continuing a discussion on Inductor Hardware backends by @AnthonyB @jgong5 @albanD

Original topic from @AnthonyB

Hi,

At Graphcore we’re in a similar position to AMD: as part of our experiments we’ve tried to re-use either parts of the CPU OpenMP / C++ codegen or parts of the CUDA / Triton backend, and it hasn’t been straightforward:

  • There isn’t always a clear device abstraction:

    • device type and Inductor backend are used interchangeably when they shouldn’t be: CPU <==> CppScheduling and CUDA <==> TritonScheduling

      • For example, the Triton project is working on adding a CPU Triton backend: I’m pretty sure it would be hard today for them to register a Triton backend for a CPU device.
    • Device-type-specific code leaking into the implementation (e.g. cuda.synchronize(), _cudaDeviceProperties)

    • In some places the list of device types is hardcoded (for example in benchmarking: if device == “cpu” then benchmark_cpu else benchmark_gpu)

  • No documentation available for HW vendors: the device_interface, for example, is a great idea for abstracting things, but it contains a lot of methods and it’s not very clear what they’re used for or where, many type annotations are missing, some methods are oddly specific (e.g. is_bf16_supported; why not is_dtype_supported?), etc. There’s a rough sketch of what I mean just after this list.
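
To make that concrete, here is roughly what registering a device interface for a new device type looks like today. This is only a sketch: the method names come from my reading of torch/_dynamo/device_interface.py on a recent main and may be incomplete or change between releases, and the “ipu” key is just a placeholder.

```python
# Sketch only: names taken from torch/_dynamo/device_interface.py and
# may differ across PyTorch versions; "ipu" is a placeholder device key.
from torch._dynamo.device_interface import (
    DeviceInterface,
    register_interface_for_device,
)


class IpuInterface(DeviceInterface):
    # Many of the CUDA-flavoured hooks can be no-ops or constants on
    # hardware without streams/events.
    @staticmethod
    def is_available() -> bool:
        return True

    @staticmethod
    def device_count() -> int:
        return 1

    @staticmethod
    def current_device() -> int:
        return 0

    @staticmethod
    def synchronize(device=None) -> None:
        pass  # nothing asynchronous to wait on in this sketch

    @staticmethod
    def is_bf16_supported(including_emulation: bool = False) -> bool:
        # The "oddly specific" method mentioned above; something like
        # is_dtype_supported(dtype) would generalise better.
        return True


register_interface_for_device("ipu", IpuInterface)
```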

Overall I don’t think we’re in a bad place; it just needs a bit of polishing / clean-up, probably better definitions of the device type ↔ device interface ↔ Inductor backend relationships, and maybe some more tests with dummy Inductor backends to make sure device-type / scheduler assumptions don’t sneak into the code base?

We’ve only started contributing small patches to help improve things in these areas recently but we’d be keen on joining a larger effort to standardise things a bit better for HW vendors if some other people were up for it.

Thanks,

Anthony


@jgong5 says

I guess the Inductor backend extension interface is not as mature as that of torch.compile. Was there any particular consideration behind preferring to start by extending Inductor rather than torch.compile?
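
For reference, the torch.compile extension point I have in mind is the custom Dynamo backend path, which is quite small to implement. A minimal sketch (the backend just prints the FX graph and falls back to eager execution; a real one would hand the graph to the vendor compiler):

```python
import torch


# Minimal custom torch.compile (Dynamo) backend: Dynamo hands us an FX
# GraphModule plus example inputs, and we return any callable to run in
# place of the captured graph.
def my_vendor_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # inspect the captured FX graph
    return gm.forward  # fall back to eager execution for this sketch


@torch.compile(backend=my_vendor_backend)
def fn(x, y):
    return torch.relu(x) + y


fn(torch.randn(4), torch.randn(4))
```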

@AnthonyB replies:

In both cases we tried to reuse existing backend extensions mostly to save time and avoid code duplication:

  • A lot of the Inductor graph passes are relevant regardless of what the code gets lowered to, and the graph fusion part seems fairly device- and backend-agnostic too? (Most of the implementations seem to fuse nodes based on the input graph’s topology more than on HW capabilities?)

  • For the CppScheduler it was just a starting point, because we wanted to mix stock ATen handlers with generated code. (We didn’t go very far down that route because it seemed hard to decouple the threading from the vectorised codegen, and in the end it was unlikely we would have re-used either, so we didn’t persevere too much.)

  • For the TritonScheduler: most of it can be re-used. I believe the main part that might need to be made a bit more generic is the heuristics (especially for non-GPU HW: the balance between grid sizes and block sizes might be very different) and maybe some concepts like streams, which don’t necessarily apply to non-GPUs (although it seems fairly easy to turn those into no-ops at the device_interface level). There’s a rough sketch of the backend registration side below.
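
For completeness, reusing the Triton codegen for a new device is, as far as I can tell, meant to go through the per-device backend registry in Inductor. A rough sketch (the exact module and class names, e.g. TritonScheduling and WrapperCodeGen, have moved around between releases, so treat this as pseudocode):

```python
# Sketch: register existing Inductor codegen classes for a new device
# key so the stock Triton path gets reused as-is. Class locations may
# differ in your PyTorch version.
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.triton import TritonScheduling
from torch._inductor.codegen.wrapper import WrapperCodeGen

register_backend_for_device(
    "ipu",             # placeholder device key, matching the interface above
    TritonScheduling,  # reuse the stock Triton kernel codegen
    WrapperCodeGen,    # Python wrapper that calls the generated kernels
)
```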

Anthony

@albanD replies:

Hey all,

Just wanted to chime in to give some of the context I have on my end:

First, for the non-Inductor part of things: the lack of documentation for device interfaces and, more generally, the way some abstractions are missing or the code is not as device-generic as it should be.

I’m afraid this is a side effect of the historical backend situation in the framework (CUDA was the only widely used backend for many years) and of general developer habits (most devs are familiar with CUDA and can debug it; most need help when looking at other backends).

I think we started making significant progress on device-generic APIs within PyTorch in the last year or so, with improvements ranging from torch.Stream/Event to autocast, activation checkpointing (AC), etc.

This work has benefitted all backend vendors (including those using the out-of-tree backend extension point) and we want to continue in this direction. There is still a lot of work there and significant design decisions that need to be made. The one you raised about the “device module”, where we expose generic capabilities and API options, is an important one under discussion.

I would encourage you to reach out to me or the amazing engineers driving the particular issues on GitHub about any of these missing pieces, to help design and implement them.

Second, on the Inductor side, I think this has been covered in detail below. It is a complex question whose answer keeps changing as we fine-tune the overall stack. Hopefully things like the new Halide backend in Inductor can give an end-to-end example of how this can be done.

I think contributions are welcome in this space as well and you should reach out to the team with all this specific feedback such that we can fix these together.

Cheers,

Alban

@jgong5 replies

Perhaps this is a good topic to discuss on GitHub instead of here. At Intel, we also wanted to share some common FX graph passes from Inductor with the Gaudi torch.compile backend, so it might be a common requirement across device backends… I would like to learn more about how you plan to leverage the Inductor codegen parts on your IPUs, though. Do you support the Triton language directly? If so, the TritonScheduler should be a good fit for you.


@AnthonyB

This should no longer be the case with the Halide backend that recently landed. You can set cpu_backend=“halide” or cuda_backend=“halide” and completely replace CppScheduling/TritonScheduling.
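
Concretely, something like the following should swap both codegen paths over to Halide on a recent build (the knobs live in torch/_inductor/config.py; check that file for the exact spelling in your version):

```python
import torch._inductor.config as inductor_config

# Route Inductor codegen through Halide instead of the defaults
# (C++/OpenMP for CPU, Triton for CUDA). These options were added
# alongside the Halide backend.
inductor_config.cpu_backend = "halide"
inductor_config.cuda_backend = "halide"
```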

There is already a draft PR for this:

This abstraction should be handled through the DeviceInterface API here:

We support NVIDIA, Intel, and AMD GPUs through Inductor+Triton. The XPU version should provide a good example here since it has its own functions for things like synchronize.

Most of these should have already been generalized by prior efforts to support XPU (Intel GPU), HalideGPU, HalideCPU, TritonCPU, etc. Pull requests are welcome if we missed updating anything here.

aside: I’m curious to know about Intel Gaudi and OpenVINO and how they align and differ.

Intel Gaudi torch.compile mode is implemented as a Dynamo mid-layer backend instead of as a low-level Inductor backend. You can get more details from Getting Started with Inference on Intel Gaudi — Gaudi Documentation 1.18.0 documentation. Essentially, Intel Gaudi registers its mid-layer backend compiler in the Dynamo backend registry at runtime, receives FX graphs from Dynamo during execution, and converts them to its backend-executable form.
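
For anyone curious, usage on the Gaudi side looks roughly like this, based on my reading of the documentation linked above (the import and the “hpu_backend” name come from that documentation and are not something I have re-verified here):

```python
import torch

# Importing the Habana bridge is what registers the "hpu_backend"
# compiler with the Dynamo backend registry at runtime (per the docs).
import habana_frameworks.torch.core  # noqa: F401

model = torch.nn.Linear(8, 8).to("hpu")
compiled = torch.compile(model, backend="hpu_backend")
out = compiled(torch.randn(4, 8, device="hpu"))
```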