Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
currentscurrents t1_jdmyjrb wrote
Reply to comment by Crystal-Ammunition in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
Bigger models are more sample efficient, so they should need less data.
But - didn't the Chinchilla paper say bigger models need more data? Yes, but that's only true because right now compute is the limiting factor. They're intentionally training a smaller model on more data to get the most out of a fixed compute budget.
As computers get faster and models bigger, data will increasingly become the limiting factor, and people will trade off in the opposite direction instead.
itshouldjustglide t1_jdoazux wrote
Don't bigger models need more data so that all of the neurons can be trained so as to reduce unnecessary noise and randomness?
ganzzahl t1_jdovu3h wrote
I'm also very interested in this – does anyone have papers similar to Chinchilla, but without the training FLOPs restriction, and instead comparing identical dataset sizes?
An aside: I feel like I remember some older MT papers where LSTMs outperformed Transformers for some low-resource languages, but I think that's outdated – using transfer learning, multilingual models and synthetic data, I'm fairly certain Transformers always outperform nowadays.
PilotThen t1_jdpnoul wrote
I didn't find a paper, but I think that is sort of what EleutherAI was doing with their Pythia models.
You'll find the models on Hugging Face, and I'd say that they are also interesting from an open-source perspective because of their license (Apache-2.0).
(Also, Open-Assistant seems to be building on top of them.)
AllowFreeSpeech t1_je3rjmv wrote
The Chinchilla rule of thumb is roughly a 20:1 ratio of training tokens to parameters.
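For anyone who wants to plug numbers in, here is a rough sketch of what a 20:1 tokens-to-params ratio implies. The function names are made up for illustration, and the training-cost formula C ≈ 6 * N * D (FLOPs ≈ 6 × params × tokens) is a commonly used approximation that I'm assuming here; it isn't stated anywhere in this thread.

```python
import math

# Rule-of-thumb ratio quoted in the thread: about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0


def tokens_for_params(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Training tokens suggested by the tokens:params rule of thumb."""
    return ratio * n_params


def approx_training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, ASSUMING the common C ~= 6 * N * D estimate."""
    return 6.0 * n_params * n_tokens


def params_for_budget(flops: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Model size that spends a FLOP budget at the given tokens:params ratio.

    With D = ratio * N and C = 6 * N * D, we get N = sqrt(C / (6 * ratio)).
    """
    return math.sqrt(flops / (6.0 * ratio))


if __name__ == "__main__":
    n = 70e9  # hypothetical 70B-parameter model
    d = tokens_for_params(n)
    c = approx_training_flops(n, d)
    print(f"params={n:.3g} tokens={d:.3g} flops={c:.3g}")
    print(f"params recovered from budget: {params_for_budget(c):.3g}")
```

Under those assumptions a fixed compute budget pins down both model size and token count, which is the compute-limited trade-off discussed further up. If data were the limit instead, you would fix the token count and let the model grow past that ratio.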