Quantization in Pytorch

jedreky · December 4, 2023, 8:03pm

Hi Everyone,

I recently started looking at quantisation in Pytorch. I’m interested in it, because I want to quantise an LLM (like Llama) without using external libraries like GGML or AutoGPTQ, simply because they do not seem stable enough to be included in a production stack.

I’ve read this (Quantization — PyTorch 2.1 documentation) and then tried to follow this tutorial (Dynamic Quantization — PyTorch Tutorials 2.1.1+cu121 documentation), but it does not work on my system. I get an Illegal instruction (core dumped) error. I’m using torch==2.1.1 and Python 3.11.6.

I’m writing here, because I’m interested in the general state of quantisation in Pytorch. This seems to be a very hot topic right now and many people would be interested in using it. Is anyone actively working on it and if yes, what’s the current scope? This is something I’m strongly interested in and I could contribute – I have an applied maths background, so I think I would be a good fit

thanks,
jed

jerryzh168 · January 16, 2024, 11:36pm

Hi @jedreky thanks for your interest, yeah we are actively working on this.

There are currently two paths:

gpt-fast (GitHub - pytorch-labs/gpt-fast: Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.): quantizing model in eager mode and then deploy with torch.compile, has some common LLM quantizations like GPTQ implemented, but it’s not exported yet
pt2 export quantization (Quantization — PyTorch main documentation): first export the model, and then do quantization, this is mostly supporting traditional static quantization, but no LLM specific quantization implemented yet (like GPTQ)

We are planning to support LLM quantization for executorch as well. Currently I’m trying to reuse the GPTQ quantization implemented in gpt-fast and get an exported model and lower to executorch.

I think maybe you can help contribute (e.g. implementation of new quantization techniques etc.) after we have initial quantize → export flow working. Or simply just use our flow and report back any issues to help us improve.

jedreky · January 17, 2024, 3:35pm

Hey @jerryzh168 , that sounds very interesting, I’d be interested to contribute, maybe just let me know when you have the initial flow ready, ok? thanks

jerryzh168 · February 24, 2025, 9:47pm

Just posted a clarification here as well: Clarification of PyTorch Quantization Flow Support (in pytorch and torchao)

Topic		Replies	Views
PyTorch 2 Quantization, How it works?	1	481	June 24, 2024
Next Steps for PyTorch Compilers compiler	9	10629	October 21, 2021
Clarification of PyTorch Quantization Flow Support (in pytorch and torchao)	9	729	May 9, 2025
FP8 datatype in PyTorch	6	10211	February 27, 2023
State of PyTorch core: September 2021 edition frontend API	1	9427	September 21, 2021

Quantization in Pytorch

Related topics