A few months ago, I had the idea that current LLMs might be using overly precise vectors for word embeddings. After some research, I found that NVIDIA has introduced int8 precision (which I assume is meant to address this). The idea is that with lower precision you can increase the number of dimensions, at least from a storage perspective. From a computational standpoint, optimizations are needed, and I’m not sure the gains would be linear, given the implementation tricks used when multiplying floating-point tensors. I thought it would be promising to explore how far this discretization could go.
The current issue is that a single boolean is stored in one byte, which is convenient for the implementation but eight times larger than the single bit it ideally needs. My idea was to create a sort of bit-array tensor that supports all the key operations already implemented in the Torch framework.
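For illustration, here is a minimal sketch (not an existing PyTorch feature) of what the packing could look like on top of plain `torch.uint8` storage; the helper names `pack_bool`/`unpack_bool` are hypothetical:

```python
import torch

# Powers of two used to combine 8 boolean values into one byte.
_WEIGHTS = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)

def pack_bool(t: torch.Tensor) -> torch.Tensor:
    """Pack a 1-D torch.bool tensor (1 byte per value) into uint8 (1 bit per value)."""
    assert t.dtype == torch.bool and t.dim() == 1
    pad = (-t.numel()) % 8                       # pad to a multiple of 8 bits
    t = torch.cat([t, t.new_zeros(pad)])
    bits = t.view(-1, 8).to(torch.uint8)
    return (bits * _WEIGHTS).sum(dim=1, dtype=torch.uint8)

def unpack_bool(packed: torch.Tensor, numel: int) -> torch.Tensor:
    """Inverse of pack_bool: recover the original boolean values."""
    bits = (packed.unsqueeze(1) & _WEIGHTS) != 0
    return bits.reshape(-1)[:numel]

x = torch.rand(1000) > 0.5        # 1000 booleans -> 1000 bytes as torch.bool
p = pack_bool(x)                  # 125 bytes as packed uint8
assert torch.equal(unpack_bool(p, x.numel()), x)
```

This already gives the 8x storage reduction; the hard part is making every operation work on the packed representation without unpacking first.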
As a researcher at ENS Paris-Saclay, I planned to write a paper on the potential gains from such drastic discretization. However, I wasn’t keen on reimplementing an entire neural network framework just for this purpose.
I’m certainly not an expert in CUDA, and I’m unsure whether basic operations have been implemented for bit arrays; still, I believe it’s feasible in C++. The question is: is it simply a matter of creating a new dtype, or are there technical constraints I’m not aware of?
If anyone has deeper insights on this topic, I would love to hear them!
Thanks a lot for reading!
This is indeed a good question, and I’m not aware of any existing implementation of this.
I think the main question is what you want the end result to be: a fully working implementation that can run a couple of example models, or a library targeted at other researchers that should work with every other PyTorch feature and that you plan to maintain long term?
A good example of implementing these esoteric dtypes can be found in this repo: GitHub - pytorch/ao: PyTorch native quantization and sparsity for training and inference
In particular, look at the technique of having a tensor subclass pretend to be a given dtype (so it works with autograd and the other systems) while a custom implementation operates on the actual quantized data inside.
The scaffolding to make that happen should only be a couple hundred lines of Python, but then you will need the actual CUDA kernels that can take advantage of the packed data, which I expect will be a bit more challenging to write.
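For what it’s worth, here is a rough sketch of that scaffolding, along the lines of what torchao does with `torch.Tensor._make_wrapper_subclass` and `__torch_dispatch__`; the class and attribute names are just placeholders:

```python
import torch

class BitPackedTensor(torch.Tensor):
    """Sketch of a subclass that advertises a logical bool dtype/shape
    while storing bit-packed uint8 data internally."""

    @staticmethod
    def __new__(cls, packed: torch.Tensor, logical_shape):
        # The wrapper reports the logical shape and dtype to the rest of
        # PyTorch (autograd, shape checks) without allocating that storage.
        return torch.Tensor._make_wrapper_subclass(
            cls, logical_shape, dtype=torch.bool, device=packed.device
        )

    def __init__(self, packed: torch.Tensor, logical_shape):
        self.packed = packed  # the real, 8x smaller payload

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Every ATen op lands here; a real implementation would route
        # supported ops to kernels that read `packed` directly, and fall
        # back by unpacking otherwise.
        raise NotImplementedError(f"no bit-packed kernel for {func}")
```

The wrapper advertises the logical dtype and shape, and every ATen call funnels through `__torch_dispatch__`, where you decide whether a packed kernel exists or you fall back by unpacking.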
Thanks a lot for your quick feedback (and for your recommendations). I guess the first step is to create a prototype, see how it performs, and depending on the outcome, decide whether or not it should be properly implemented.
However, in my mind things seem simple: overload all torch functions for the new dtype. Since I don’t see any operation that can be done on floats but not on booleans, it seems fairly straightforward if I start from the basic operations. But I know things are rarely that simple. As you mentioned, CUDA will probably be the most challenging part, which is why I was wondering whether you think it could initially be done without GPU acceleration.
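To sketch what a CPU-only version might look like: a binary dot product over packed ±1 vectors can be written today with existing bitwise ops and a popcount lookup table, no custom kernels needed. The `binary_dot` helper below is hypothetical, and the packed inputs are assumed to come from something like the `pack_bool` sketch above:

```python
import torch

# Popcount lookup table for every byte value; plain indexing, no custom kernels.
_POPCOUNT = torch.tensor([bin(i).count("1") for i in range(256)], dtype=torch.int64)

def binary_dot(a_packed: torch.Tensor, b_packed: torch.Tensor, n_bits: int) -> torch.Tensor:
    """Dot product of two {-1, +1} vectors stored as packed bits (bit 1 = +1, bit 0 = -1):
    dot = n_bits - 2 * (number of differing bits). Zero padding in both inputs cancels out."""
    xor = torch.bitwise_xor(a_packed, b_packed)   # 1 where the signs disagree
    mismatches = _POPCOUNT[xor.long()].sum()
    return n_bits - 2 * mismatches

a = torch.tensor([0b10110010], dtype=torch.uint8)   # 8 packed signs each
b = torch.tensor([0b10010110], dtype=torch.uint8)
print(binary_dot(a, b, n_bits=8))                   # tensor(4): 6 matches, 2 mismatches
```

It wouldn’t be fast, but it might be enough to measure accuracy on a prototype before deciding whether the CUDA kernels are worth writing.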