andreichiffa t1_jdvojfg wrote
Reply to comment by shanereid1 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
It's a common result from the flat-minima literature: during training, the model needs to be overparameterized to smooth the loss landscape and avoid getting stuck in poor local minima.
However, the overparameterization needed at the training stage can be trimmed away (e.g., by pruning) at the inference stage.
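As a minimal sketch of that trim-after-training idea (not the commenter's specific method), here is magnitude pruning of a hypothetical trained PyTorch model using the built-in `torch.nn.utils.prune` utilities:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Hypothetical trained network; stands in for any overparameterized model.
    model = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))

    # Remove 80% of the smallest-magnitude weights in each Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.8)
            prune.remove(module, "weight")  # make the pruning permanent

    # The sparsified model can then be stored/served more cheaply at inference.
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    total = sum(p.numel() for p in model.parameters())
    print(f"overall sparsity: {zeros / total:.2%}")

The pruning ratio (0.8 here) is illustrative; in practice it's tuned against a held-out accuracy budget.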