
Think_Olive_1000 t1_j3tqojo wrote

Nah, there's already work that can cut a generic LLM's size in half without losing any performance. And I think LLMs will be great as foundation models for training smaller, more niche models for narrower tasks - people already use OpenAI's API to generate data to fine-tune their own niche models. I think we'll look back at current LLMs and realise just how inefficient they were - though a necessary evil to prove that something like this CAN be done.
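
Not anyone's workflow specifically, just a rough sketch of that "generate data with a big API, fine-tune something small on it" pattern, assuming the pre-1.0 `openai` Python client; the prompt, model name, and output file are placeholders:

```python
# Sketch: use a large hosted LLM to generate labelled examples for a
# smaller, task-specific model. Prompt, model name, and file format
# are illustrative placeholders, not a specific production setup.
import json
import openai  # assumes the pre-1.0 client and an API key in OPENAI_API_KEY

PROMPT = (
    "Write one short customer-support question about password resets "
    "and a concise answer. Return JSON with keys 'question' and 'answer'."
)

examples = []
for _ in range(100):  # generate 100 synthetic training pairs
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    try:
        examples.append(json.loads(response.choices[0].message.content))
    except json.JSONDecodeError:
        continue  # the model won't always return valid JSON; skip those

# Dump in a JSONL format suitable for fine-tuning a smaller model later.
with open("synthetic_train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```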

1

suflaj t1_j3twskh wrote

Half is not enough. We're thinking on the order of 100x or even more. Don't forget that even ordinary BERT is not really commercially viable as-is.

I mean, sure, you can use them to get a nicer distribution for your dataset. But at the end of the day the API is too slow to train any "real" model, and you can probably collect and generate data for smaller models yourself anyway. So as a shortcut for lazy people - sure, I think ChatGPT by itself probably has the potential to answer most repetitive questions people ask on the internet. But it won't be used like that at scale, so ultimately it is not useful.

If it wasn't clear enough by now, I'm not skeptical because of what LLMs are, but because of how they simply do not scale to real-world requirements. Ultimately, people do not have datacenters at home, and OpenAI and other vendors do not have the hardware for any actual volume of demand beyond a niche, hobbyist one. And the investment to develop something like ChatGPT is too big to justify for that use.

All of this was ignoring the obvious legal risks from using ChatGPT generations commercially!

−1

Think_Olive_1000 t1_j3u3k7w wrote

BERT is being used by Google for search under the hood. It's how they've got that instant fancy extractive-answers box. I don't disagree that LLMs are large. So was the Saturn V.

1

suflaj t1_j3u4smq wrote

Google's use of BERT is not a commercial, consumer product - it's an enterprise one (Google uses it and runs it on their own hardware). They presumably use the large version, or something even bigger than the pretrained weights available on the internet, and to hit the latencies they do, they rely on datacenters and non-trivial distribution schemes, not consumer hardware.

Meanwhile, your average CPU will need anywhere from 1-4 seconds for one inference pass in ONNX Runtime - much less on a GPU, of course, but to be truly cross-platform you're targeting JS in most cases, which means CPU, and a stack nowhere near as mature as Python/C++/CUDA.
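
If you want to check that number on your own machine, here's a minimal sketch of the kind of timing I mean, assuming you've already exported a BERT-base encoder to ONNX; the file name and input names are assumptions about the export:

```python
# Sketch: rough single-example CPU latency check for a BERT-style
# encoder exported to ONNX. Adjust the path and input names to match
# your own export.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("bert-base.onnx", providers=["CPUExecutionProvider"])

seq_len = 128
inputs = {
    "input_ids": np.random.randint(0, 30000, (1, seq_len), dtype=np.int64),
    "attention_mask": np.ones((1, seq_len), dtype=np.int64),
    "token_type_ids": np.zeros((1, seq_len), dtype=np.int64),
}

# Warm up once, then time a handful of single-example passes.
session.run(None, inputs)
runs = 10
start = time.perf_counter()
for _ in range(runs):
    session.run(None, inputs)
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```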

What I'm saying is:

  • people have said no to paid services, they want free products
  • consumer hardware has not scaled nearly as fast as DL
  • even ancient models are still too slow to run on consumer hardware after years of improvement
  • distilling, quantizing and optimizing them gets them running just fast enough not to be a nuisance, but is often too tedious to be worth it for a free product (rough sketch of the cheapest of these below)
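
For reference, the cheapest of those tricks - post-training dynamic quantization - is about this much code; the checkpoint name is just the standard Hugging Face one, and real gains depend on your CPU and workload:

```python
# Sketch: post-training dynamic quantization of a BERT encoder's linear
# layers to int8. No calibration data needed; activations are quantized
# on the fly at inference time.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Replace nn.Linear modules with int8-weight equivalents.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("Is this fast enough to ship?", return_tensors="pt")
with torch.no_grad():
    logits = quantized(**inputs).logits
print(logits.shape)  # smaller weights on disk, typically faster on CPU
```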
−1