stefanof93 t1_jbzeots wrote
Has anyone evaluated all the quantized versions and compared them against smaller models yet? How many bits can you throw away before you're better off picking a smaller model?
Amazing_Painter_7692 OP t1_jbzov27 wrote
https://github.com/qwopqwop200/GPTQ-for-LLaMa
Performance is quite good.
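For the kind of comparison asked about above, the usual metric is perplexity on a held-out text. Here's a minimal sketch (not from the GPTQ-for-LLaMa repo; the function name and the chunked evaluation strategy are my own simplifications) that works with any Hugging Face causal LM, quantized or not:

```python
# Minimal perplexity evaluation sketch for comparing a quantized model
# against a smaller full-precision one. Assumes a standard Hugging Face
# causal LM and tokenizer; chunking is a simplification of the usual
# sliding-window evaluation.
import torch

def perplexity(model, tokenizer, text, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll = 0.0
    for i in range(0, input_ids.size(1) - 1, stride):
        # Each chunk predicts `stride` tokens (fewer for the final chunk).
        chunk = input_ids[:, i : i + stride + 1]
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        total_nll += out.loss.item() * (chunk.size(1) - 1)
    return torch.exp(torch.tensor(total_nll / (input_ids.size(1) - 1)))

# usage (names are placeholders):
# ppl = perplexity(model, tokenizer, open("wiki.test.txt").read())
```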
LetterRip t1_jc4rifv wrote
It depends on the model. Some degrade noticeably even with full 8-bit quantization; others can go down to 4-bit relatively easily. There is some research suggesting 3-bit is about the useful limit, with 2-bit working only rarely, for certain models.
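As a toy illustration of why quality falls off sharply below 3-4 bits (this is plain round-to-nearest, not GPTQ, and the tensor is random rather than a real weight matrix):

```python
# Symmetric round-to-nearest quantization of a weight-like tensor at
# different bit widths, reporting the reconstruction error.
import torch

def quantize_rtn(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize to signed `bits`-bit integers with a single per-tensor scale."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = w.abs().max() / levels
    q = torch.clamp(torch.round(w / scale), -levels - 1, levels)
    return q * scale                                  # dequantized approximation

w = torch.randn(4096, 4096)                           # stand-in for a weight matrix
for bits in (8, 4, 3, 2):
    err = (quantize_rtn(w, bits) - w).pow(2).mean().sqrt()
    print(f"{bits}-bit RMS error: {err:.4f}")         # error roughly doubles per bit removed
```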