light24bulbs t1_jc2s2oc wrote
Reply to comment by Kinexity in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Oh, definitely, it's an amazing optimization.
But less than a token a second is going to be too slow for a lot of real-time applications like human chat.
Still very cool, though.
Lajamerr_Mittesdine t1_jc5b99n wrote
I imagine 1 token per 0.2 seconds would be fast enough. At roughly 0.75 English words per token, that's about 225 WPM, comfortably above a fast ~60 WPM typist (rough conversion sketched below).
Someone should benchmark it on an AMD Ryzen 9 7950X3D or Intel Core i9-13900KS.
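For anyone who wants to sanity-check the conversion, here's a minimal sketch; the ~0.75 words-per-token ratio is an assumption about typical English tokenization, not something measured in this thread:

```python
# Back-of-the-envelope conversion from generation speed to an equivalent typing speed.
# Assumes ~0.75 English words per token (a rough average, not a measured value).

def tokens_per_sec_to_wpm(tokens_per_sec: float, words_per_token: float = 0.75) -> float:
    """Convert a generation rate in tokens/second to words per minute."""
    return tokens_per_sec * words_per_token * 60

if __name__ == "__main__":
    for rate in (1.0, 5.0):  # ~1 tok/s (roughly current CPU speed) vs. 1 token per 0.2 s
        print(f"{rate:.1f} tok/s ~ {tokens_per_sec_to_wpm(rate):.0f} WPM")
    # Prints roughly:
    #   1.0 tok/s ~ 45 WPM
    #   5.0 tok/s ~ 225 WPM
```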
light24bulbs t1_jc5e0zk wrote
Yeah, there's definitely a threshold in there where it's fast enough for human interaction. It's only an order of magnitude off; that's not too bad.