[tac] Follow up: Inductor HW backend implementation

Continuing a discussion on Inductor Hardware backends by @AnthonyB @jgong5 @albanD

Original topic from @AnthonyB

Hi,

At Graphcore we’re in a similar position to AMD: as part of our experiments we’ve tried to re-use either parts of the CPU OpenMP / C++ codegen or parts of the CUDA / Triton backend, and it hasn’t been straightforward:

  • There isn’t always a clear device abstraction:

    • device type and Inductor backend are used interchangeably when they shouldn’t be: CPU <==> CppScheduling and CUDA <==> TritonScheduling

      • For example, the Triton project is working on adding a CPU Triton backend: I’m pretty sure it would be hard today for them to register a Triton backend for a CPU device.
    • Device-type-specific code leaking into the implementation (e.g. cuda.synchronize(), _cudaDeviceProperties)

    • In some places the list of device types is hardcoded (for example in benchmarking: if device == “cpu” then benchmark_cpu else benchmark_gpu)

  • No documentation available for HW vendors: the device_interface, for example, is a great idea for abstracting things, but it contains a lot of methods and it’s not very clear what they’re used for or where, many type annotations are missing, some methods are oddly specific (e.g. is_bf16_supported; why not is_dtype_supported?), etc. There’s a rough sketch of what I mean just after this list.
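
To make that concrete, here is roughly what registering a device interface for a new device type looks like today. This is only a sketch: the method names come from my reading of torch/_dynamo/device_interface.py on a recent main and may be incomplete or change between releases, and the “ipu” key is just a placeholder.

```python
# Sketch only: names taken from torch/_dynamo/device_interface.py and
# may differ across PyTorch versions; "ipu" is a placeholder device key.
from torch._dynamo.device_interface import (
    DeviceInterface,
    register_interface_for_device,
)


class IpuInterface(DeviceInterface):
    # Many of the CUDA-flavoured hooks can be no-ops or constants on
    # hardware without streams/events.
    @staticmethod
    def is_available() -> bool:
        return True

    @staticmethod
    def device_count() -> int:
        return 1

    @staticmethod
    def current_device() -> int:
        return 0

    @staticmethod
    def synchronize(device=None) -> None:
        pass  # nothing asynchronous to wait on in this sketch

    @staticmethod
    def is_bf16_supported(including_emulation: bool = False) -> bool:
        # The "oddly specific" method mentioned above; something like
        # is_dtype_supported(dtype) would generalise better.
        return True


register_interface_for_device("ipu", IpuInterface)
```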

Overall I don’t think we’re in a bad place; it just needs a bit of polishing / clean-up, probably better definitions of the device type ↔ device interface ↔ Inductor backend relationships, and maybe some more tests with dummy Inductor backends to make sure device-type / scheduler assumptions don’t sneak into the code base?

We’ve only started contributing small patches to help improve things in these areas recently but we’d be keen on joining a larger effort to standardise things a bit better for HW vendors if some other people were up for it.

Thanks,

Anthony


@jgong5 says

I guess the Inductor backend extension interface is not as mature as that of torch.compile. Was there any particular consideration behind preferring to start by extending Inductor rather than torch.compile?
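
For reference, the torch.compile extension point I have in mind is the custom Dynamo backend path, which is quite small to implement. A minimal sketch (the backend just prints the FX graph and falls back to eager execution; a real one would hand the graph to the vendor compiler):

```python
import torch


# Minimal custom torch.compile (Dynamo) backend: Dynamo hands us an FX
# GraphModule plus example inputs, and we return any callable to run in
# place of the captured graph.
def my_vendor_backend(gm: torch.fx.GraphModule, example_inputs):
    gm.graph.print_tabular()  # inspect the captured FX graph
    return gm.forward  # fall back to eager execution for this sketch


@torch.compile(backend=my_vendor_backend)
def fn(x, y):
    return torch.relu(x) + y


fn(torch.randn(4), torch.randn(4))
```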

@AnthonyB replies:

In both cases we tried to reuse existing backend extensions mostly to save time and avoid code duplication:

  • A lot of the Inductor graph passes are relevant regardless of what the code gets lowered to, and the graph fusion part seems fairly device- and backend-agnostic too? (Most of the implementations seem to fuse nodes based on the input graph’s topology more than on HW capabilities?)

  • For the CppScheduler it was just a starting point, because we wanted to mix stock ATen handlers with generated code. (We didn’t go very far down that route because it seemed hard to decouple the threading from the vectorised codegen, and in the end it was unlikely we would have re-used either, so we didn’t persevere too much.)

  • For the TritonScheduler: most of it can be re-used. I believe the main part that might need to be made a bit more generic is the heuristics (especially for non-GPU HW: the balance between grid sizes and block sizes might be very different) and maybe some concepts like streams, which don’t necessarily apply to non-GPUs (although it seems fairly easy to turn those into no-ops at the device_interface level). There’s a rough sketch of the backend registration side below.
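
For completeness, reusing the Triton codegen for a new device is, as far as I can tell, meant to go through the per-device backend registry in Inductor. A rough sketch (the exact module and class names, e.g. TritonScheduling and WrapperCodeGen, have moved around between releases, so treat this as pseudocode):

```python
# Sketch: register existing Inductor codegen classes for a new device
# key so the stock Triton path gets reused as-is. Class locations may
# differ in your PyTorch version.
from torch._inductor.codegen.common import register_backend_for_device
from torch._inductor.codegen.triton import TritonScheduling
from torch._inductor.codegen.wrapper import WrapperCodeGen

register_backend_for_device(
    "ipu",             # placeholder device key, matching the interface above
    TritonScheduling,  # reuse the stock Triton kernel codegen
    WrapperCodeGen,    # Python wrapper that calls the generated kernels
)
```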

Anthony

@albanD replies:

Hey all,

Just wanted to chime in to give some of the context I have on my end:

First, for the non-Inductor part of things: the lack of documentation for device interfaces and, more generally, the way some abstractions are missing or the code is not as device-generic as it should be.

I’m afraid this is a side effect of the historical backend situation in the framework (CUDA was the only widely used backend for many years) and of general developer habits (most devs are familiar with CUDA and can debug it; most need help when looking at other backends).

I think we started making significant progress on device-generic APIs within PyTorch in the last year or so, with improvements ranging from torch.Stream/Event to autocast, activation checkpointing (AC), etc.

This work has benefitted all backend vendors (including those using the out-of-tree backend extension point) and we want to continue in this direction. There is still a lot of work there and significant design decisions that need to be made. The one you raised about the “device module”, where we expose generic capabilities and API options, is an important one under discussion.

I would encourage you to reach out to me or the amazing engineers driving the particular issues on GitHub about any of these missing pieces, to help design and implement them.

Second, on the Inductor side, I think this has been covered in detail below. It is a complex question whose answer keeps changing as we fine-tune the overall stack. Hopefully things like the new Halide backend in Inductor can give an end-to-end example of how this can be done.

I think contributions are welcome in this space as well and you should reach out to the team with all this specific feedback such that we can fix these together.

Cheers,

Alban

@jgong5 replies

Perhaps this is a good topic to discuss on GitHub instead of here. At Intel, we also wanted to share some common FX graph passes from Inductor with the Gaudi torch.compile backend, so it might be a common requirement across device backends… I would like to learn more about how you plan to leverage the Inductor codegen parts on your IPUs, though. Do you support the Triton language directly? If so, the TritonScheduler should be a good fit for you.


@AnthonyB

This should no longer be the case with the Halide backend that recently landed. You can set cpu_backend=“halide” or cuda_backend=“halide” and completely replace CppScheduling/TritonScheduling.
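
Concretely, something like the following should swap both codegen paths over to Halide on a recent build (the knobs live in torch/_inductor/config.py; check that file for the exact spelling in your version):

```python
import torch._inductor.config as inductor_config

# Route Inductor codegen through Halide instead of the defaults
# (C++/OpenMP for CPU, Triton for CUDA). These options were added
# alongside the Halide backend.
inductor_config.cpu_backend = "halide"
inductor_config.cuda_backend = "halide"
```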

There is already a draft PR for this:

This abstraction should be handled through the DeviceInterface API here:

We support NVIDIA, Intel, and AMD GPUs through Inductor+Triton. The XPU version should provide a good example here since it has its own functions for things like synchronize.

Most of these should have already been generalized by prior efforts to support XPU (Intel GPU), HalideGPU, HalideCPU, TritonCPU, etc. Pull requests are welcome if we missed updating anything here.

aside: I’m curious to know about Intel Gaudi and OpenVINO and how they align and differ.

Intel Gaudi torch.compile mode is implemented as a Dynamo mid-layer backend instead of as a low-level Inductor backend. You can get more details from Getting Started with Inference on Intel Gaudi — Gaudi Documentation 1.18.0 documentation. Essentially, Intel Gaudi registers its mid-layer backend compiler in the Dynamo backend registry at runtime, receives FX graphs from Dynamo during execution, and converts them to its backend-executable form.
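
For anyone curious, usage on the Gaudi side looks roughly like this, based on my reading of the documentation linked above (the import and the “hpu_backend” name come from that documentation and are not something I have re-verified here):

```python
import torch

# Importing the Habana bridge is what registers the "hpu_backend"
# compiler with the Dynamo backend registry at runtime (per the docs).
import habana_frameworks.torch.core  # noqa: F401

model = torch.nn.Linear(8, 8).to("hpu")
compiled = torch.compile(model, backend="hpu_backend")
out = compiled(torch.randn(4, 8, device="hpu"))
```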