alrunan t1_jdmbv4k wrote
Reply to comment by harharveryfunny in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
The Chinchilla scaling laws are just used to calculate the optimal dataset and model size for a particular training compute budget.
You should read the LLaMA paper.
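For a rough sense of what "compute-optimal" means, here is a minimal sketch assuming the common approximations C ≈ 6·N·D for training FLOPs and the ~20-tokens-per-parameter rule of thumb from the Chinchilla paper (the fitted exponents in the paper differ slightly; this is illustrative only):

```python
# Rough sketch of the Chinchilla compute-optimal trade-off, assuming
# C ~= 6 * N * D (training FLOPs) and ~20 tokens per parameter.
# Illustrative only; not the paper's exact fitted scaling law.

def chinchilla_optimal(flops_budget: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly balance model size and data
    for a given training-compute budget in FLOPs."""
    # Solve C = 6 * N * D with D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N^2  ->  N = sqrt(C / (6 * tokens_per_param))
    n_params = (flops_budget / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # e.g. a ~5.76e23 FLOP budget (roughly Chinchilla-70B scale)
    n, d = chinchilla_optimal(5.76e23)
    print(f"~{n / 1e9:.0f}B params, ~{d / 1e9:.0f}B tokens")
```

With that budget the sketch lands near 70B parameters and 1.4T tokens, which is roughly the Chinchilla configuration.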
alrunan t1_jdmm3lw wrote
Reply to comment by harharveryfunny in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
The LLaMA 7B model is trained on 1T tokens and performs really well for its parameter count.
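As a back-of-the-envelope comparison (again assuming the ~20-tokens-per-parameter heuristic and C ≈ 6·N·D, which are simplifications), the Chinchilla-optimal dataset for a 7B model would be around 140B tokens, so LLaMA-7B's 1T tokens is roughly 7x past that point, trading extra training compute for a stronger model at a fixed size:

```python
# Back-of-the-envelope comparison of LLaMA-7B's training run against the
# Chinchilla-optimal token count, using ~20 tokens/param and C ~= 6 * N * D.
# Numbers are approximate and for illustration only.

n_params = 7e9           # LLaMA-7B parameters (approximate)
tokens_trained = 1e12    # 1T training tokens reported in the LLaMA paper

chinchilla_tokens = 20 * n_params              # ~140B tokens
extra_factor = tokens_trained / chinchilla_tokens

flops_used = 6 * n_params * tokens_trained         # ~4.2e22 FLOPs
flops_optimal = 6 * n_params * chinchilla_tokens   # ~5.9e21 FLOPs

print(f"Chinchilla-optimal tokens for 7B: ~{chinchilla_tokens / 1e9:.0f}B")
print(f"LLaMA-7B trained on ~{extra_factor:.0f}x that many tokens")
print(f"Training FLOPs: ~{flops_used:.1e} vs ~{flops_optimal:.1e} compute-optimal")
```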