Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
harharveryfunny t1_jdm3bm4 wrote
It seems most current models don't need the number of parameters they have. DeepMind did a study of model size vs. number of training tokens and concluded that for each doubling of parameter count, the number of training tokens also needs to double, and that a model like GPT-3, trained on 300B tokens, would really need to be trained on roughly 3.7T tokens (more than a 10x increase) to take advantage of its size.
To test their scaling law, DeepMind built the 70B-parameter Chinchilla model, trained it on the predicted compute-optimal 1.4T (!) tokens, and found that it outperformed GPT-3.
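As a rough sanity check of those numbers, here's a minimal Python sketch using the ~20 training tokens per parameter rule of thumb commonly quoted from the Chinchilla paper. The ratio and the helper function are illustrative approximations, not the paper's full fitted scaling law:

```python
# Rough check of the Chinchilla rule of thumb (~20 training tokens per
# parameter). The exact ratio is an approximation, not the fitted law.

def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens for a given model size."""
    return n_params * tokens_per_param

for name, n_params in [("GPT-3 (175B)", 175e9), ("Chinchilla (70B)", 70e9)]:
    tokens = compute_optimal_tokens(n_params)
    print(f"{name}: ~{tokens / 1e12:.1f}T tokens")
    # GPT-3 (175B): ~3.5T tokens (vs. the 300B it was actually trained on)
    # Chinchilla (70B): ~1.4T tokens
```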
alrunan t1_jdmbv4k wrote
The Chinchilla scaling laws are just used to calculate the optimal dataset size and model size for a particular training compute budget.
You should read the LLaMA paper.
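For context on what that budget-based calculation looks like, here's a hedged sketch that assumes the common C ≈ 6·N·D FLOPs approximation plus the ~20 tokens-per-parameter rule of thumb; the function name and the example budget (implied by 70B params × 1.4T tokens) are illustrative, not figures taken from the paper:

```python
import math

# Sketch of splitting a fixed training-compute budget between model size N
# and dataset size D, assuming C ≈ 6 * N * D and D ≈ 20 * N (both rules of
# thumb, not the paper's fitted coefficients).

def chinchilla_optimal_split(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust the compute budget."""
    # C ≈ 6 * N * D and D ≈ r * N  =>  C ≈ 6 * r * N^2
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Example: the budget implied by 70B params * 1.4T tokens (~5.9e23 FLOPs)
params, tokens = chinchilla_optimal_split(5.9e23)
print(f"~{params / 1e9:.0f}B params, ~{tokens / 1e12:.1f}T tokens")
```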
harharveryfunny t1_jdmd38s wrote
>You should read the LLaMA paper.
OK - will do. What specifically did you find interesting (related to scaling or not)?