andreichiffa t1_jdvojfg wrote
Reply to comment by shanereid1 in [D] Do we really need 100B+ parameters in a large language model? by Vegetable-Skill-9700
It's a common result from the flat-minima literature: during training, the model needs to be overparameterized to smooth the loss landscape and avoid getting stuck in poor local minima.
However, the overparameterization needed at the training stage can be trimmed away (e.g., by pruning) at the inference stage.
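As a minimal sketch of that trim-after-training idea (not the commenter's specific method), here is magnitude pruning of a hypothetical trained PyTorch model using the built-in `torch.nn.utils.prune` utilities:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    # Hypothetical trained network; stands in for any overparameterized model.
    model = nn.Sequential(nn.Linear(784, 4096), nn.ReLU(), nn.Linear(4096, 10))

    # Remove 80% of the smallest-magnitude weights in each Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.8)
            prune.remove(module, "weight")  # make the pruning permanent

    # The sparsified model can then be stored/served more cheaply at inference.
    zeros = sum((p == 0).sum().item() for p in model.parameters())
    total = sum(p.numel() for p in model.parameters())
    print(f"overall sparsity: {zeros / total:.2%}")

The pruning ratio (0.8 here) is illustrative; in practice it's tuned against a held-out accuracy budget.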