PyTorch 2 Quantization: How does it work?

Hi,

Is there a detailed technical document about the PyTorch 2 quantization roadmap? As of today, I wonder about the following subjects (labeled with names for future reference):
[1] pytorch-2-quantization-GPU-target, status and plans?
[2] pytorch-2-quantization-CPU-target, status and plans?
[3] pytorch-2-quantization-datatypes: INT8, Others?
[4] pytorch-2-quantization-layers: what layers are automatically supported?
[5] pytorch-2-quantization-errors: how can we improve the errors reported when something goes wrong during quantization, e.g. quantizing a matmul/linear layer requires an input shape of length >= 2?
[6] Any more ideas by the community?


A lot of the development work for quantization has moved out of core and now lives in the torchao repo (GitHub - pytorch/ao: Create and integrate custom data types, layouts and kernels with up to 2x speedups with 65% less VRAM for inference and support for training). You can see what we've done over the past two releases in the release notes (Release v0.2.0 · pytorch/ao · GitHub), with a 0.3 release planned for Wednesday, June 26.

There we're prioritizing making it easy to create and integrate high-performance dtypes and layouts into PyTorch programs, so this already includes quantization for both CPU and GPU targets, although GPU is the stronger focus right now.
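
As a concrete illustration, here is a minimal sketch of applying weight-only int8 quantization with torchao. The exact entry points have moved between torchao releases, so `quantize_` and `int8_weight_only` are assumptions based on recent versions of the library; check the README for the API matching your installed release.

```python
# Minimal sketch, assuming quantize_ / int8_weight_only are exported
# from torchao.quantization in your installed release.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)
).cuda()

# Swap each nn.Linear weight for an int8 weight-only quantized tensor in place.
quantize_(model, int8_weight_only())

# Compile so the quantized ops get good kernel selection on GPU.
model = torch.compile(model, mode="max-autotune")
out = model(torch.randn(8, 1024, device="cuda"))
```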

As far as dtypes go, this includes int8 and int4, and we even have bit-packing support covering int2 through int7. We also have the MX data types fp4/fp6/fp8 and are now working on fp5 as well. We've listed a few in the README (GitHub - pytorch/ao: Create and integrate custom data types, layouts and kernels with up to 2x speedups with 65% less VRAM for inference and support for training), and the more experimental bit-packing work is located here.
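
To make the bit-packing idea concrete: sub-byte dtypes like int2 don't exist natively in PyTorch, so multiple values are packed into each uint8. The sketch below is written with plain PyTorch ops purely to illustrate the principle; it is not torchao's internal implementation.

```python
# Illustrative only: four 2-bit values packed into one uint8 byte.
import torch

def pack_int2(x: torch.Tensor) -> torch.Tensor:
    # x holds values in [0, 3]; numel must be divisible by 4.
    x = x.to(torch.uint8).reshape(-1, 4)
    return x[:, 0] | (x[:, 1] << 2) | (x[:, 2] << 4) | (x[:, 3] << 6)

def unpack_int2(packed: torch.Tensor) -> torch.Tensor:
    # Recover the four 2-bit fields from each byte, in packing order.
    return torch.stack(
        [(packed >> shift) & 0b11 for shift in (0, 2, 4, 6)], dim=-1
    ).reshape(-1)

vals = torch.randint(0, 4, (16,))
assert torch.equal(unpack_int2(pack_int2(vals)), vals.to(torch.uint8))
```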

As far as which layers are supported: typically we make sure that at least nn.Linear works, since that's what most people care about. But keep in mind we're a dtype library, not a dtype-linear library, so we want to make it easy to support everything; some of those plans are shared in The next tutorials · Issue #426 · pytorch/ao · GitHub.
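
For intuition on why nn.Linear is the workhorse, here is a from-scratch sketch of per-channel weight-only int8 quantization for a linear layer. The function names are illustrative, not torchao APIs; real kernels fuse the dequantize into the matmul instead of materializing it.

```python
# From-scratch sketch of weight-only int8 quantization for nn.Linear.
import torch
import torch.nn as nn

@torch.no_grad()
def quantize_linear_weight(linear: nn.Linear):
    w = linear.weight                                   # (out_features, in_features)
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0   # per-output-channel scale
    w_int8 = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return w_int8, scale

@torch.no_grad()
def int8_linear(x, w_int8, scale, bias=None):
    # Dequantize-on-the-fly matmul; production kernels fuse this step.
    y = x @ (w_int8.to(x.dtype) * scale).t()
    return y + bias if bias is not None else y

lin = nn.Linear(64, 32)
w_int8, scale = quantize_linear_weight(lin)
x = torch.randn(4, 64)
print((int8_linear(x, w_int8, scale, lin.bias) - lin(x)).abs().max())
```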

Regarding how to improve the errors: this is a multidimensional optimization problem, so our approach has been to use autotuning to make it easier for people to handle, via torchao.autoquant() (ao/torchao/quantization at main · pytorch/ao · GitHub).
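
Usage follows the torchao README: autoquant benchmarks candidate quantization techniques per layer on your actual inputs and picks the fastest, so you don't have to hand-tune per-layer choices.

```python
# autoquant usage as shown in the torchao README.
import torch
import torchao

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().half()

# Wrap with torch.compile first, then let autoquant profile on real inputs.
model = torchao.autoquant(torch.compile(model, mode="max-autotune"))
model(torch.randn(8, 1024, device="cuda", dtype=torch.float16))
```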

And if you're interested in co-developing a feature with us, feel free to open an issue on GitHub or join us in the torchao channel on discord.gg/cudamode.
