Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
YoloSwaggedBased t1_jdp9cge wrote
Reply to comment by drinkingsomuchcoffee in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
I can't find it now, but I've read a paper that essentially proposed this, at least for inference. You attach a model output and a task loss after every n layers of the model. At training time you produce outputs all the way to the end of the architecture, and at inference time you utilise some heuristic that trades how much accuracy loss you're willing to sacrifice against how many layers you can drop.
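Roughly like this, I think. A minimal sketch of the early-exit idea, assuming a PyTorch-style model; the class name, dimensions, and confidence threshold are all illustrative, not from the paper I'm remembering:

```python
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    """Toy model with an auxiliary classifier head after every block.
    Training uses all heads; inference exits at the first confident head."""

    def __init__(self, dim=128, n_blocks=4, n_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_blocks)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_blocks)]
        )

    def forward(self, x):
        # Training: return logits from every exit head so each gets a task loss.
        logits = []
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            logits.append(head(x))
        return logits

    @torch.no_grad()
    def predict(self, x, confidence=0.9):
        # Inference (single example for simplicity): stop at the first head
        # whose max softmax probability clears the threshold, skipping the rest.
        for block, head in zip(self.blocks, self.heads):
            x = block(x)
            probs = head(x).softmax(dim=-1)
            if probs.max() >= confidence:
                return probs.argmax(dim=-1)
        return probs.argmax(dim=-1)  # fall back to the final head

# Training loss would sum (or weight) the cross-entropy over every exit, e.g.
# loss = sum(nn.functional.cross_entropy(l, targets) for l in model(batch))
```

The heuristic at inference is just the confidence threshold here; the paper presumably used something more principled to pick the exit layer per accuracy budget.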
drinkingsomuchcoffee t1_jdpg1cb wrote
The problem is that learned features aren't factored neatly into a minimal set of parameters. For example, identifying whether an image is a cat may take thousands of parameters spread over n layers when it could actually be expressed with 10 parameters over fewer layers. A small model does this automatically, since it's physically constrained; a large model has no such constraint, so it's wasteful.

There are probably many ways to get the best of both worlds at training time, but it's by no means an easy problem, and the current distillation and retraining methods feel clunky. What we actually want is for the big model to use all its parameters efficiently rather than waste them, and it's likely wasting them if much more compact models can get similar results. It's probably extremely wasteful if it takes an order of magnitude more parameters to gain a few percentage points of improvement. Compare that to biological entities, where an order of magnitude increase in size yields huge cognitive improvements.
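For context, the "clunky" distillation I mean usually amounts to something like the standard Hinton-style loss below. This is just a sketch; the temperature T and mixing weight alpha are illustrative defaults, not anything specific:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-label KL term that pushes
    the small (student) model toward the big (teacher) model's distribution."""
    hard = F.cross_entropy(student_logits, targets)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients to account for the temperature
    return alpha * hard + (1 - alpha) * soft
```

It works, but it means training the big model first and then retraining a second model from scratch, which is exactly the kind of two-pass workflow that feels wasteful.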