Clarification of PyTorch Quantization Flow Support (in pytorch and torchao)

a) torchao quantization can support backend-specific int8 kernels; you can expose them through a “layout” (which selects a different packing format). An example is the CPU layout for int4 weight-only quantization: ao/test/integration/test_integration.py at f38c2722d953ea9352268f0f43f0889041423f27 · pytorch/ao · GitHub. See the "Quantization Overview" page in the torchao 0.9 documentation for a more detailed explanation.
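To make the "layout = packing format" idea concrete, here is a minimal, library-free sketch. It is not torchao's implementation or API: the function names (`quantize_int4`, `pack_plain`, `pack_nibbles`, and the unpack helpers) are hypothetical and exist only for illustration. It shows the same int4 values stored in two different byte layouts, which is roughly the kind of choice a backend-specific kernel cares about.

```python
# Conceptual sketch only; all names are hypothetical, not torchao APIs.

def quantize_int4(values, scale):
    """Symmetric quantization: round to the signed int4 range [-8, 7]."""
    return [max(-8, min(7, round(v / scale))) for v in values]

def pack_plain(q):
    """'Plain' layout: one int4 value per byte (two's-complement nibble)."""
    return bytes(x & 0xF for x in q)

def pack_nibbles(q):
    """'Packed' layout: two int4 values per byte, low nibble first."""
    if len(q) % 2:
        raise ValueError("need an even number of values to pack in pairs")
    return bytes((q[i] & 0xF) | ((q[i + 1] & 0xF) << 4)
                 for i in range(0, len(q), 2))

def _to_signed(nibble):
    """Interpret a 4-bit pattern as a signed two's-complement value."""
    return nibble - 16 if nibble >= 8 else nibble

def unpack_plain(data):
    return [_to_signed(b) for b in data]

def unpack_nibbles(data):
    out = []
    for b in data:
        out.append(_to_signed(b & 0xF))  # low nibble holds the first value
        out.append(_to_signed(b >> 4))   # high nibble holds the second value
    return out

weights = [0.5, -1.0, 0.25, 0.875]
q = quantize_int4(weights, scale=0.125)

plain = pack_plain(q)
packed = pack_nibbles(q)

# Both layouts decode to the same quantized values; only the storage differs.
assert unpack_plain(plain) == unpack_nibbles(packed) == q
print(len(plain), len(packed))
```

The quantized values are identical in both cases; a kernel written for one layout simply expects the bytes arranged that way, which is why the layout is the natural place to plug in a backend-specific format.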

b) Yeah, it can be extended to other ops as we work more on optimizations; ideally this is driven by specific important models / use cases. Let me know if you feel any model is bottlenecked by these ops and we can take a look. One op I have in mind is SDPA (scaled dot-product attention), and maybe MoE next.