Submitted by __Maximum__ t3_11l3as6 in MachineLearning
adt t1_jbbzba8 wrote
Reply to comment by __Maximum__ in [D] Can someone explain the discrepancy between the findings of LLaMA and Chinchilla? by __Maximum__
There are a few that 'feel' that way. Try Megatron-11B (~200:1) based on RoBERTa (6,198:1). Wayyyyy ahead of its time, and I've matched it with much larger models in some testing.
Here's the full table of Chinchilla-aligned comparisons:
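(As a rough sketch, not the linked table: here is how the tokens-per-parameter ratios quoted above stack up against the ~20 tokens/param Chinchilla heuristic. The token counts are back-of-the-envelope estimates reverse-engineered from the quoted ratios, not official figures.)

```python
# Rough comparison of the quoted tokens-per-parameter ratios with the
# ~20 tokens/param Chinchilla rule of thumb (Hoffmann et al., 2022).
# Token counts are approximate, derived from the ratios cited above.

CHINCHILLA_TOKENS_PER_PARAM = 20

models = {
    # name: (parameters, approx. training tokens)
    "RoBERTa-large": (355e6, 2.2e12),  # ~6,200 tokens/param
    "Megatron-11B": (11e9, 2.2e12),    # ~200 tokens/param (roughly the same corpus)
    "Chinchilla": (70e9, 1.4e12),      # ~20 tokens/param by design
}

for name, (params, tokens) in models.items():
    ratio = tokens / params
    print(f"{name:>14}: {ratio:8.0f} tokens/param "
          f"({ratio / CHINCHILLA_TOKENS_PER_PARAM:6.1f}x the Chinchilla-optimal ratio)")
```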
whata_wonderful_day t1_jbcxdwf wrote
Nice! How did you get access to Megatron-11B? I can't find it online anywhere
Jepacor t1_jbdrovb wrote
The link to the model is in the Google Sheet they linked: https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md
whata_wonderful_day t1_jbhp4gb wrote
Thanks. Alas, I thought it was an encoder model. I've been on the lookout for a big one; the largest I've seen is DeBERTa V2 with 1.5B params.
__Maximum__ OP t1_jbdqy5c wrote
Thanks for the links. It looks like RoBERTa didn't gain much from the additional training, only minor improvements, but yeah, it was a tiny model. How was this not a good lesson? Why did people need Chinchilla? Maybe it's just that gathering a lot of data comes easily, so people collect as much as possible even though they know they'll go at most one epoch over it.
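(On "why did people need Chinchilla": the paper's contribution was less "more data helps" than a quantitative recipe for splitting a fixed compute budget between parameters and tokens. A minimal sketch of that rule of thumb, using the commonly cited approximations C ≈ 6·N·D FLOPs and D ≈ 20·N at the optimum; the budgets below are illustrative, not figures from the thread.)

```python
# Sketch of the Chinchilla compute-optimal allocation: for a fixed training
# budget C ~ 6*N*D FLOPs, parameters N and tokens D scale together,
# with roughly D = 20*N at the optimum.

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Return (params N, tokens D) that spend c_flops with D = tokens_per_param * N,
    using the common approximation C ~ 6 * N * D."""
    n = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    d = tokens_per_param * n
    return n, d

for c in (1e21, 1e22, 1e23, 1e24):
    n, d = compute_optimal(c)
    print(f"C={c:.0e} FLOPs -> N~{n / 1e9:6.1f}B params, D~{d / 1e9:7.0f}B tokens")
```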