
Taenk t1_jbdidpy wrote

Can you rephrase that a little bit? Does it mean that Chinchilla answers "assuming you have a fixed compute budget (some number of FLOPs), use 20 tokens of data per model parameter; beyond that you hit diminishing returns, in the sense that you could train another model from scratch faster", while LLaMA answers "assuming you want optimal performance at inference time, regardless of training compute budget, even small models can benefit from larger datasets"?

1

CKtalon t1_jbdjaxa wrote

Instead of choosing a huge model and having it undertrained due to a limited compute budget, choose a smaller model, but the biggest one your compute budget lets you train to roughly the compute-optimal token count, using their estimates. It doesn't necessarily mean that a small model trained on a larger dataset will naturally beat a bigger model.
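
As a rough back-of-the-envelope sketch (my own illustration, not from the paper's full fits): combining the common FLOP estimate C ≈ 6·N·D with the ~20 tokens per parameter rule D ≈ 20·N gives N ≈ sqrt(C / 120). The function name and the example budget are just placeholders.

```python
import math

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Rough Chinchilla-style sizing: given a training budget C in FLOPs,
    combine C ~= 6 * N * D with D ~= 20 * N, so N ~= sqrt(C / 120)."""
    n_params = math.sqrt(compute_flops / 120.0)  # compute-optimal parameter count
    n_tokens = 20.0 * n_params                   # ~20 tokens per parameter
    return n_params, n_tokens

# Example: a 1e23 FLOP budget suggests roughly a ~29B-parameter model on ~580B tokens.
n, d = chinchilla_optimal(1e23)
print(f"params ~ {n:.3g}, tokens ~ {d:.3g}")
```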

1

__Maximum__ OP t1_jbdr6zj wrote

Not quite. Given a fixed compute budget, the rule of thumb is roughly 20 tokens of training data per parameter: for a model with 1B parameters, use a dataset of about 20B tokens. Look at the figures in the Chinchilla paper; they demonstrate it nicely.
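
A quick sanity check of that 1B-parameter example, under the same assumptions as the sketch above (the ~20 tokens per parameter rule and the common C ≈ 6·N·D FLOP estimate):

```python
# Back-of-the-envelope check for the 1B-parameter example above.
# Assumptions: ~20 tokens per parameter, training FLOPs ~= 6 * N * D.
n_params = 1e9
n_tokens = 20 * n_params               # ~2e10 = 20B tokens
train_flops = 6 * n_params * n_tokens  # ~1.2e20 FLOPs of training compute
print(f"tokens ~ {n_tokens:.3g}, training FLOPs ~ {train_flops:.3g}")
```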

−1