Quantization in PyTorch

Hi Everyone,

I recently started looking at quantisation in PyTorch. I’m interested in it because I want to quantise an LLM (such as Llama) without relying on external libraries like GGML or AutoGPTQ, which do not seem stable enough to include in a production stack.

I’ve read the Quantization page of the PyTorch 2.1 documentation and then tried to follow the Dynamic Quantization tutorial (PyTorch Tutorials 2.1.1+cu121 documentation), but it does not work on my system: it crashes with an Illegal instruction (core dumped) error. I’m using torch==2.1.1 and Python 3.11.6.
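For anyone following along who is new to the topic, here is a rough sketch of the arithmetic behind int8 weight quantisation, in plain Python so it does not depend on any PyTorch quantized backend. It is only an illustration of symmetric per-tensor quantisation, not what the tutorial or PyTorch does internally, and the names `quantize_int8` and `dequantize` are made up for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantisation of floats to the int8 range [-127, 127].

    Hypothetical helper for illustration; not a PyTorch API.
    """
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate floats."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The round-trip error per weight is bounded by half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The point of the sketch is that each weight is stored as a small integer plus one shared float scale, so the reconstruction error is at most half a step (`scale / 2`) per weight.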

I’m writing here because I’m interested in the general state of quantisation in PyTorch. This seems to be a very hot topic right now, and many people would be interested in using it. Is anyone actively working on it and, if so, what is the current scope? This is something I’m strongly interested in and could contribute to – I have an applied maths background, so I think I would be a good fit :slight_smile:



Hi @jedreky, thanks for your interest. Yes, we are actively working on this.

There are currently two paths:

We are planning to support LLM quantization for ExecuTorch as well. Currently I’m trying to reuse the GPTQ quantization implemented in gpt-fast, get an exported model, and lower it to ExecuTorch.

I think you could help contribute (e.g. by implementing new quantization techniques) once we have the initial quantize → export flow working. Or you could simply use our flow and report back any issues to help us improve it.

Hey @jerryzh168, that sounds very interesting. I’d be glad to contribute; just let me know when you have the initial flow ready, OK? Thanks :slight_smile:

