Clarification of PyTorch Quantization Flow Support (in PyTorch and torchao)

There have been some questions around PyTorch's support for the torchao quantization flow (quantize_, autoquant, QAT, etc.) and the pt2e quantization flow (prepare_pt2e, prepare_qat_pt2e, convert_pt2e). This note gives a brief summary of the support plan and how people can choose between the two types of flow.

As for the older eager mode quantization and FX graph mode quantization, our plan is to deprecate them; see the Torch.ao.quantization Migration Plan for more details.

Support

Currently, we are committed to supporting both the torchao and pt2e quantization flows, as they address different use cases, but we aim to share common building blocks between the two flows, for example quantize/dequantize ops, observers, etc. Support means maintaining the flows and developing new features as requested by users.
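To make the distinction concrete, here is a minimal sketch of the two flows side by side. This is illustrative only: Int8WeightOnlyConfig and XNNPACKQuantizer are example choices, and the exact capture API and import paths vary across PyTorch/torchao versions.

```python
import torch
# torchao eager flow (config name may differ by torchao version)
from torchao.quantization import quantize_, Int8WeightOnlyConfig
# pt2e flow (quantizer location may differ by PyTorch version)
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
from torch.ao.quantization.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(128, 128)

    def forward(self, x):
        return self.linear(x)

example_inputs = (torch.randn(1, 128),)

# torchao flow: quantize the eager model in place
m1 = M().eval()
quantize_(m1, Int8WeightOnlyConfig())

# pt2e flow: export first, then prepare / calibrate / convert against a backend quantizer
m2 = torch.export.export(M().eval(), example_inputs).module()
quantizer = XNNPACKQuantizer().set_global(get_symmetric_quantization_config())
m2 = prepare_pt2e(m2, quantizer)
m2(*example_inputs)          # calibration pass
m2 = convert_pt2e(m2)
```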

Recommendations for Modeling Users

For pt2e quantization, we are also moving the implementation to torchao and sharing more components between pt2e quantization and torchao quantization (specifically, both will use the same quantization primitive ops, quantize_affine/dequantize_affine). In H1 2025, we also plan to support blockwise quantization and codebook quantization with the pt2e quantization flow.
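For readers unfamiliar with these primitives, here is a rough reference implementation of the per-tensor int8 case of what affine quantize/dequantize compute. This is an illustration of the semantics only; the actual quantize_affine/dequantize_affine ops in torchao additionally take a block_size and support more dtypes and granularities.

```python
import torch

def quantize_affine_ref(x, scale, zero_point, quant_min=-128, quant_max=127):
    # q = clamp(round(x / scale) + zero_point, quant_min, quant_max)
    q = torch.round(x / scale) + zero_point
    return torch.clamp(q, quant_min, quant_max).to(torch.int8)

def dequantize_affine_ref(q, scale, zero_point):
    # x_hat = (q - zero_point) * scale
    return (q.to(torch.float32) - zero_point) * scale

x = torch.randn(4, 16)
scale = (x.amax() - x.amin()) / 255
zero_point = torch.round(-x.amin() / scale) - 128
q = quantize_affine_ref(x, scale, zero_point)
x_hat = dequantize_affine_ref(q, scale, zero_point)
```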

Please let us know if you have any questions.


Hi @jerryzh168, thanks for the update!

As a Torch user in the embedded space, I'd like to ask two things. Do you have plans to constrain the export feature in torchao quantization? As far as I understand, you recommend excluding export from the torchao flow for speedup reasons. However, it's still interesting to export models quantized with advanced techniques in order to deploy them on custom backends.

Additionally, it seems that during export with torch.export() the ops in the generated IR depend on the package we use. For example, we obtain prims ops when we export with torchao, but ATen (and Core ATen) ops with pt2e. I've read some discussions stating that it's possible to control how much an op is decomposed, but today it's a bit opaque for users. Do you intend to somehow expose the degree of decomposition, for example to decompose the ATen ops generated with pt2e down to prim ops, or vice versa?

I’m happy to help!

  1. We do support export for torchao quantization as well; a good example is Int8DynamicActivationInt4WeightConfig. You can export the model and see the quantize/dequantize ops (used in pt2e) show up in the exported model: ao/test/integration/test_integration.py at 8f93751cd6533732dcce0cdd336d04a204f2adc0 · pytorch/ao · GitHub. See ao/tutorials/developer_api_guide/export_to_executorch.py at main · pytorch/ao · GitHub for more explanation of how you can preserve a high-level op during export so that it can be consumed (lowered) by subsequent transformations. A rough sketch is shown below.
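As a sketch (assuming a recent torchao; the unwrap_tensor_subclass step and the exact config name may differ across versions), exporting a torchao-quantized model can look like this:

```python
import torch
from torchao.quantization import quantize_, Int8DynamicActivationInt4WeightConfig
from torchao.utils import unwrap_tensor_subclass

model = torch.nn.Sequential(torch.nn.Linear(128, 128)).eval()
example_inputs = (torch.randn(1, 128),)

quantize_(model, Int8DynamicActivationInt4WeightConfig())
model = unwrap_tensor_subclass(model)   # flatten tensor subclasses so export can trace them
ep = torch.export.export(model, example_inputs)
print(ep.graph_module.code)             # quantize/dequantize ops show up in the exported graph
```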

Do you intend to somehow expose the degree of decomposition, for example to decompose the ATen ops generated with pt2e down to prim ops, or vice versa?

Export itself seems to allow people to specify a decomposition table. I'm not very familiar with this part; see Get `aot_autograd`'ed graph without `torch.compile` and freeze constants without Inductor context · Issue #140205 · pytorch/pytorch · GitHub for an example of using a decomposition table after export. My understanding is that we first get ATen IR with export and then use a decomp table to decompose it to prim IR.
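A minimal sketch of that, assuming the current torch.export API (core_aten_decompositions is just one example table; a larger or custom table would lower further toward prims):

```python
import torch
from torch._decomp import core_aten_decompositions

model = torch.nn.Sequential(torch.nn.Linear(16, 16)).eval()
ep = torch.export.export(model, (torch.randn(1, 16),))

# run_decompositions applies the given table to the exported ATen graph;
# here the core ATen decomposition table lowers ops to the Core ATen opset.
decomp_table = core_aten_decompositions()
ep_decomposed = ep.run_decompositions(decomp_table)
print(ep_decomposed.graph_module.code)
```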

But we also have the tutorial mentioned above for preserving specific high-level ops: ao/tutorials/developer_api_guide/export_to_executorch.py at main · pytorch/ao · GitHub

Please let me know if there are any questions about that.

A side question: where are ATen IR and prim IR defined? Per my understanding, ATen IR is defined at pytorch/aten/src/ATen/native/native_functions.yaml at main · pytorch/pytorch · GitHub, and prim IR is defined at Redirecting... , but we could not visit this webpage now.