stefanof93 t1_jbzeots wrote
Has anyone evaluated all the quantized versions and compared them against smaller models yet? How many bits can you throw away before you're better off picking a smaller model?
Amazing_Painter_7692 OP t1_jbzov27 wrote
https://github.com/qwopqwop200/GPTQ-for-LLaMa
Performance is quite good.
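For the kind of comparison asked about above, the usual metric is perplexity on a held-out text. Here's a minimal sketch (not from the GPTQ-for-LLaMa repo; the function name and the chunked evaluation strategy are my own simplifications) that works with any Hugging Face causal LM, quantized or not:

```python
# Minimal perplexity evaluation sketch for comparing a quantized model
# against a smaller full-precision one. Assumes a standard Hugging Face
# causal LM and tokenizer; chunking is a simplification of the usual
# sliding-window evaluation.
import torch

def perplexity(model, tokenizer, text, stride=512):
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    total_nll = 0.0
    for i in range(0, input_ids.size(1) - 1, stride):
        # Each chunk predicts `stride` tokens (fewer for the final chunk).
        chunk = input_ids[:, i : i + stride + 1]
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        total_nll += out.loss.item() * (chunk.size(1) - 1)
    return torch.exp(torch.tensor(total_nll / (input_ids.size(1) - 1)))

# usage (names are placeholders):
# ppl = perplexity(model, tokenizer, open("wiki.test.txt").read())
```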
LetterRip t1_jc4rifv wrote
It depends on the model. Some degrade noticeably even with full 8-bit quantization; others can go down to 4-bit relatively easily. There is some research suggesting 3-bit is about the useful limit, with 2-bit working only rarely, for certain models.
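As a toy illustration of why quality falls off sharply below 3-4 bits (this is plain round-to-nearest, not GPTQ, and the tensor is random rather than a real weight matrix):

```python
# Symmetric round-to-nearest quantization of a weight-like tensor at
# different bit widths, reporting the reconstruction error.
import torch

def quantize_rtn(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Quantize to signed `bits`-bit integers with a single per-tensor scale."""
    levels = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = w.abs().max() / levels
    q = torch.clamp(torch.round(w / scale), -levels - 1, levels)
    return q * scale                                  # dequantized approximation

w = torch.randn(4096, 4096)                           # stand-in for a weight matrix
for bits in (8, 4, 3, 2):
    err = (quantize_rtn(w, bits) - w).pow(2).mean().sqrt()
    print(f"{bits}-bit RMS error: {err:.4f}")         # error roughly doubles per bit removed
```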