Submitted by __Maximum__ t3_11l3as6 in MachineLearning
Chinchilla states that the model-size-to-dataset-size ratio should be about 1 to 20 (roughly 20 training tokens per parameter), and shows it experimentally. LLaMA states that their 7B model continued to improve even after 1T tokens, which is a ratio of about 1 to 142. Has anyone figured out how to reconcile the two?
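A quick sanity check of those ratios (the parameter and token counts are the ones quoted above; this is just illustrative arithmetic):

```python
# Tokens-per-parameter ratios implied by the numbers quoted above.
llama_7b_params = 7e9    # LLaMA-7B parameter count
llama_7b_tokens = 1e12   # ~1T training tokens reported for LLaMA-7B
chinchilla_ratio = 20    # ~20 tokens per parameter (Chinchilla heuristic)

llama_ratio = llama_7b_tokens / llama_7b_params
print(f"LLaMA-7B: ~{llama_ratio:.0f} tokens/parameter vs ~{chinchilla_ratio} for Chinchilla")
```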
CKtalon t1_jbaogg3 wrote
Chinchilla just says that, for a given compute budget, there is an optimal amount of data to train on to get the best bang for your buck. It doesn't mean the model converges to its best possible performance once it reaches the Chinchilla-optimal token count. Ergo, you can keep training past that point if you have plenty of budget.
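A minimal sketch of that trade-off, assuming the common C ≈ 6·N·D approximation for training FLOPs and the ~20 tokens-per-parameter rule of thumb (the exact fitted coefficients in the Chinchilla paper differ slightly; the function names below are illustrative):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token count under the ~20 tokens/parameter rule of thumb."""
    return tokens_per_param * n_params

n = 7e9                               # LLaMA-7B parameter count
d_opt = chinchilla_optimal_tokens(n)  # ~1.4e11 tokens (~140B) would be "compute-optimal"
d_actual = 1e12                       # LLaMA-7B was actually trained on ~1T tokens

print(f"Chinchilla-optimal tokens for 7B: {d_opt:.1e}")
print(f"Compute spent vs compute-optimal: {training_flops(n, d_actual) / training_flops(n, d_opt):.1f}x")
```

In other words, LLaMA-7B spent roughly 7x the Chinchilla-optimal training compute for its size. Chinchilla calls that inefficient per training FLOP, but the model keeps improving past the optimal point, which is exactly why you can keep training if budget allows.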