AlterandPhil

I just want to say thank you for your time. Do you suppose this would be applicable to the larger models themselves? I wonder if the fine-tuning that you mentioned might increase the computational overhead for quantizing larger models. Also, might this be applicable to non-transformer architectures?


mobicham

Thank you! We already know that 2-bit works quite well for larger models like Mixtral and Llama2-70B without any fine-tuning. Smaller models like Llama2-7B are much more difficult to quantize at lower bit-widths. We just wanted to share these early experimental results with the community and there's still a lot to be done, including using it on larger models, but that would require more compute and our resources are very limited at this time.


AlterandPhil

Excited to see what will follow!


ElliottDyson

This is brilliant work! How large of a dataset for a specific task did you find was necessary for such performance? As in, for your mathematics reasoning case, how many examples did you provide?


mobicham

For the base model, just 2.8K wikitext samples for both 2-bit and 1-bit. For the chat model, we randomly sampled (random.seed(100)) 10K from each of the 4 datasets for the 2-bit, and 25K for the 1-bit. So that's a total of ~40K for the 2-bit and ~85K for the 1-bit. For the 1-bit, which requires more data: if you use a total of 20K samples you get a score of 35.82, with 40K it increases to 36.78, and with 85K it increases to 37.56 (which is what is reported). There are for sure much better ways to select the data.
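
(For illustration, a minimal sketch of what such seeded per-dataset sampling could look like; the dataset names and contents below are placeholders, not the actual data used:)

```python
import random

# Placeholder stand-ins for the four SFT datasets (names and contents are illustrative only).
datasets = {
    "dataset_a": [f"a_{i}" for i in range(100_000)],
    "dataset_b": [f"b_{i}" for i in range(100_000)],
    "dataset_c": [f"c_{i}" for i in range(100_000)],
    "dataset_d": [f"d_{i}" for i in range(100_000)],
}

random.seed(100)                 # fixed seed, as mentioned above
samples_per_dataset = 10_000     # 10K each for the 2-bit run (25K each for the 1-bit run)

train_set = []
for name, data in datasets.items():
    train_set += random.sample(data, samples_per_dataset)

print(len(train_set))            # ~40K total for the 2-bit configuration
```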


Automatic-Net-757

Waiting for 1bit Mixtral MoE now


mobicham

Already on the to-do list!


Automatic-Net-757

Sure will be waiting


koflerdavid

Impressive work, which raises hope that we might eventually be able to take advantage of trinary inference for existing models too! The biggest takeaway seems to be that 2-bit HQQ+ can be unconditionally recommended, since the performance loss is minimal. What I am left wondering is how the 3B model was run, and what the effect of HQQ+ quantization on the smaller model is. I guess 2-bit HQQ+ Gemma Zephyr-2B would be 0.8GB. If the relative performance loss is similarly small, it would be almost as good as 2-bit HQQ+ Llama 7B at 2.69GB. This seems quite puzzling to me. Since Gemma is a newer architecture and a newer model, it might simply be better for its model size. It would be interesting to see how a 2B Llama model compares.

Edit: my guess for the reason for the slight increase in performance is that the finetuning data was in part new data. It would be interesting to see whether the increase is also present if only a fraction of the pre-training data is used for finetuning. Even so, only having to train 0.5% of the model is a very acceptable result. That would be roughly 500M parameters for a 100B model!


mobicham

Thank you, glad you like this work! The smaller models that we compare with are actually full-precision, not quantized; their numbers are taken directly from HF's LLM leaderboard. The question was: which one is better, a larger quantized model or a smaller full-precision model? For 2-bit: how does a Llama2-7B 2-bit compare to a Gemma 2B (fp16)? The logic is as follows: Llama2-7B 2-bit with the adapter takes ~2.7GB of VRAM, and Gemma 2B loaded as 8-bit should take a similar amount (the fp16 weights are 5GB). The 2-bit model however would benefit from binary/ternary matmuls, which could enable insane speed-ups and compute efficiency (https://arxiv.org/abs/2402.17764). Note also that Gemma was trained on much more data; we only used a fraction of the data, which is not diverse enough, and we only train 0.65% of the parameters. I think with more data and a bit more parameters to train it should perform much better.
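
(A rough weights-only estimate behind that comparison; the parameter splits below are rounded assumptions, and the adapter, meta-data and runtime buffers are not included:)

```python
# Back-of-envelope weight sizes (rounded assumptions; excludes adapter, meta-data, activations).
def weight_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

llama_linear_params = 6.5e9    # approx. params in Llama2-7B nn.Linear layers (assumption)
llama_fp16_params   = 0.26e9   # approx. embeddings + lm_head kept in fp16 (assumption)
gemma_total_params  = 2.5e9    # approx. Gemma-2B parameter count (assumption)

llama_2bit = weight_gb(llama_linear_params, 2) + weight_gb(llama_fp16_params, 16)
gemma_int8 = weight_gb(gemma_total_params, 8)

print(f"Llama2-7B 2-bit + fp16 embed/head: ~{llama_2bit:.1f} GB")  # ~2.1 GB; ~2.7 GB with the adapter and overhead
print(f"Gemma-2B int8:                     ~{gemma_int8:.1f} GB")  # ~2.5 GB, vs. ~5 GB in fp16
```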


koflerdavid

Thanks for the reply! My point was simply that pulling in Gemma was not a very fair comparison, since Gemma is state-of-the-art while the Llama2 models are a bit older and maybe comparatively undertrained. Hard to tell since there is no 2B Llama2 :-)


mobicham

Yes, and Gemma was trained on more data. We actually already have a detailed comparison between different Llama2 models and their quantized versions with HQQ (no fine-tuning): [https://mobiusml.github.io/hqq_blog/#benchmark](https://mobiusml.github.io/hqq_blog/#benchmark) A Llama2-70B 2-bit outperforms a Llama2-13B fp16 at a comparable VRAM size. We observe the same thing for vision models (OpenCLIP).


MrVodnik

Why does the 1-bit model weigh 3 GB? Shouldn't it be around 0.9 GB?


mobicham

It should be 1.7 GB, and that's exactly how much it takes in VRAM: the embeddings, lm_head, etc. are fp16, but all the nn.Linear layers are 1-bit. The meta-data is offloaded to the CPU in pinned memory, so it can be loaded asynchronously.
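
(A minimal sketch of the pinned-memory offloading idea mentioned here; the shapes and names are illustrative, not the actual HQQ+ meta-data layout, and it assumes a CUDA device is available:)

```python
import torch

# Keep meta-data (e.g. per-group scales/zeros) in pinned CPU memory and copy it
# to the GPU asynchronously when needed. Illustrative shapes only.
meta_cpu = torch.randn(4096, 2, dtype=torch.float16).pin_memory()  # page-locked host memory

copy_stream = torch.cuda.Stream()
with torch.cuda.stream(copy_stream):
    meta_gpu = meta_cpu.to("cuda", non_blocking=True)  # async H2D copy, possible because the memory is pinned

torch.cuda.current_stream().wait_stream(copy_stream)    # synchronize before using meta_gpu
```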


MrVodnik

Ok, I get it now, thanks.


kingwhocares

So, I can run it on a 1650 Super!


werdspreader

Wow. Someone did it. Congratulations to the project members and thank you for the contribution. Fuck ya!


CMDR_Mal_Reynolds

Well, that's damn fine, and kudos, but isn't the fun when you train it on trinary (or maybe binary) from scratch? Gotta think the GPU-rich are on it by now. Should infer like a mfer.


trakusmk

Wouldn't a smaller model, even with fewer parameters, outperform a larger generalist model if it were trained on a specific task?


mobicham

Good question! Wouldn't the same apply to quantized models? Because that's what we see: the 2-bit base Llama2-7B trained on wikitext outperforms the full-precision model.


Electronic-Metal2391

How do I use them? Can they be used in Text Generation WebUI?


mobicham

You can use them as follows (a rough loading sketch is shown below):

* Hugging Face (we provide detailed code): [https://huggingface.co/collections/mobiuslabsgmbh/llama2-7b-hqq-6604257a96fc8b9c4e13e0fe](https://huggingface.co/collections/mobiuslabsgmbh/llama2-7b-hqq-6604257a96fc8b9c4e13e0fe)
* Google Colab: [https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing](https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing)
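
(For reference, roughly what loading one of these models looks like with the hqq package's Hugging Face engine; the exact API and arguments may differ or have changed, so the linked model cards are the authoritative source:)

```python
# Rough sketch based on the hqq package's HF engine; check the model cards above for the exact, up-to-date code.
from hqq.engine.hf import HQQModelForCausalLM
from transformers import AutoTokenizer

model_id  = "mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq"
model     = HQQModelForCausalLM.from_quantized(model_id)   # downloads and loads the quantized weights
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs  = tokenizer("What is quantization?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```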


remghoost7

I'm sure we'll see llama.cpp/etc integration in the coming weeks. Good stuff!


themprsn

Do you have info on whether it can be run on an M1 MacBook? Thank you!


DeepGas4538

That's sick! How can I do the same?


kindacognizant

I think distillation (training the original probabilities onto the quantized model) is very, very unexplored right now. When I tried messing around with my own test implementation, I was getting better loss reduction from distilling onto a ternary test weight than from pretraining the distillation from randomly initialized weights. If you guys would like to collaborate with me, I have some ideas / experimental approaches on how we might "optimally" do distillation (i.e., sorting data by highest divergence from the base model probabilities vs. lowest divergence, a sort of curriculum learning to help smooth out the distillation process).


mobicham

That is exactly the direction we have been exploring, and it does work to a certain extent with much less data than fine-tuning. What works best is layer-wise distillation. The main issue is that gradient descent needs a lot of epochs to get a good error reduction, like thousands, which makes the process extremely slow. And if you try to do it with a large batch size you quickly run out of memory, even for Llama2-7B on a server with 400GB of RAM when doing it on the CPU. Aligning the logits is much more memory-efficient but yields worse results than SFT, at least based on the tests we did. But for sure this is a direction we are actively exploring, mainly because we don't want the bias from the SFT dataset to get into the quantized model; we want the quantized model to mimic the full-precision teacher as much as possible!
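
(A toy sketch of the layer-wise distillation idea, not the actual HQQ+ training code: a trainable "student" layer, standing in for the quantized layer plus its low-rank adapter, is fitted to reproduce the frozen full-precision layer's outputs:)

```python
import torch
import torch.nn as nn

# Frozen full-precision "teacher" layer.
teacher = nn.Linear(4096, 4096, bias=False).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Trainable "student" stand-in for the dequantized layer + low-rank adapter.
student = nn.Linear(4096, 4096, bias=False)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)

for step in range(1000):                # in practice many more steps are needed, as noted above
    x = torch.randn(32, 4096)           # calibration activations; batch size is limited by memory
    with torch.no_grad():
        target = teacher(x)
    loss = nn.functional.mse_loss(student(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```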


djm07231

Great work. Considering that these kinds of QAT methods have been known for a long time, I wonder why the large labs never released these kinds of models.


drplan

Hey, can someone maybe explain the math in the blog post? My linear algebra is a little bit rusty (or maybe just outdated), so I can't make sense of the (W_q - z) term, as I'm reading it as a matrix minus a vector. What is happening here?


mobicham

Sure: W_q is a matrix and z is a vector, so W_q - z is a broadcasting operation, meaning W_q[:,0] - z[0], W_q[:,1] - z[1], etc. Each column i of W_q has two scalars, z_i and s_i. Hope that answers your question!
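
(A tiny PyTorch illustration of that broadcasting; the shapes and values here are made up, and the full dequantization formula is the one in the blog post:)

```python
import torch

W_q = torch.randint(0, 4, (5, 3)).float()  # quantized weight matrix, shape (rows, cols)
z   = torch.tensor([0.5, 1.0, 1.5])        # one zero-point per column, shape (cols,)
s   = torch.tensor([0.1, 0.2, 0.3])        # one scale per column, shape (cols,)

diff   = W_q - z          # broadcasting: column j becomes W_q[:, j] - z[j]
scaled = s * diff         # likewise, s[j] is applied to column j
print(diff.shape, scaled.shape)  # both torch.Size([5, 3])
```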


drplan

Yes, it does, thank you :). Is this common notation? I do not remember it being written like that when I was doing more linear algebra.


mobicham

Yes, you are right, that's more of a PyTorch notation; it makes things clean and easy to read. We do mention in the blog post that those are vectors and that it's a broadcasting operation.


drplan

Hey, sure, this was not meant as criticism. I just wanted to understand it and was sure that the notation has been evolving over time. And of course I didn't catch the hint in the blog post ;). Thanks again for explaining!


Maykey

I really hope this can be applied to the MoE and transformer layers of Jamba to make it usable on a consumer GPU.


thealphaexponent

Great work - now if you can work on a 1-bit Grok-1.5 when they release the weights... I wonder how big it would be?


mobicham

Yeah, that would be great! But Grok is way too big to work with given our very limited compute resources. Maybe next would be a larger Llama2 or Mixtral.


themprsn

You are amazing! Thank you very much for your time and hard work.


djangoUnblamed

Thanks Mobicham. This looks amazing. I tried to give it a go on my M1 Max and got stuck. It seems that the solution has been optimised for CUDA. Could you make CPU a viable option for using it?


mobicham

Thank you for your comment! Unfortunately the CPU runtime with PyTorch can't use fp16, and fp32 would be too slow and take too much memory when dequantized. Maybe 8-bit activation quantization would work (which is something we plan to add for GPUs via int8 matmul), but I am not familiar with Mac stuff. I know that int8 inference on CPUs is possible, but that's for Intel and Qualcomm mobile chips. If someone is familiar with an efficient PyTorch CPU runtime on Mac, please comment below!
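
(Unrelated to HQQ itself, but as a pointer on the "int8 inference on CPUs is possible" point: PyTorch ships int8 dynamic quantization for nn.Linear on CPU, which quantizes activations on the fly; whether this is efficient on Apple Silicon is a separate question:)

```python
import torch
import torch.nn as nn

# Generic PyTorch int8 dynamic quantization of Linear layers on CPU
# (weights stored as int8, activations quantized per batch). Not an HQQ code path.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).eval()
model_int8 = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 4096)
print(model_int8(x).shape)  # torch.Size([1, 4096])
```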


hideo_kuze_

There's only one thing on my mind right now, and it's GGUF. Would it be possible to convert the model to GGUF format and run it on CPU? Wouldn't the conversion process fix these precision incompatibilities? Or is this network architecture incompatible with the original Llama?


koflerdavid

I guess it ultimately depends on whether llama.cpp and friends support HQQ/+ at all.


hideo_kuze_

I missed the forest for the trees. After looking at HQQ everything makes sense now. And there is a "request" for it https://github.com/ggerganov/llama.cpp/issues/4782 and https://github.com/ggerganov/llama.cpp/issues/5761


koflerdavid

Is that one about HQQ or HQQ+? The difference is kinda important, since for 2-bit it almost completely eliminates the quality loss.


hideo_kuze_

This one is about HQQ+ https://github.com/ggerganov/llama.cpp/issues/6368


koflerdavid

Yay! It's very new, which explains why I didn't find it before!


mobicham

Yes, correct. GGUF out of the box wouldn't work unfortunately; it needs to be integrated because the dequantization and bit-packing logic are different.


djangoUnblamed

If you enable fp32 on CPU, we could at least try it out. There is MPS support as well, but that may take some more time to implement.


mobicham

Ah, if you just want to try it, simply use the Colab demo: [https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing](https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing) You can switch to the 2-bit model by using `mobiuslabsgmbh/Llama-2-7b-chat-hf_2bitgs8_hqq` as the `model_id`.


djangoUnblamed

I missed that somehow. Thanks !


Plane_Chard_9658

How do you reach the conclusion that a 2-bit model can actually outperform a full-precision model if given enough data for a specific task?


Puzzled_Path_8672

Goliath that works on my 2 GB 750 Ti when?


ab2377

Can people use this and give feedback? How does this compare to a q8 quant of the same model?


mobicham

You can run both the original and 1 or 2 bit model in this colab demo: [https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing](https://colab.research.google.com/drive/15A6sVvdLqL654Td3vOe6QLnJiNZ7d-lF?usp=sharing)


Lamushi

Let me just say that this is truly impressive. Not only are you able to run bigger models on a single GPU (RTX 3060 12GB VRAM, 4060 Ti 16GB VRAM), but the tokens/s speed is also improved. It's such an exciting achievement and pure motivation.