
kilow4tt t1_jd68jnd wrote

Was there any effort to go from 75% sparsity during pre-training to a lower sparsity (e.g., 25%) during fine-tuning, rather than strictly going from 75% sparsity to 0%?

6

brownmamba94 t1_jd6j6pt wrote

Hi, this is the first author on the paper. You asked a great question, and it’s something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during fine-tuning. But as for future work, you’re right, we can certainly vary the amount of “redensification” as well (e.g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.

9
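
(Not from the paper, just a reader's sketch.) A minimal PyTorch illustration of what such a "redensification" schedule during fine-tuning could look like, assuming a simple magnitude-based weight mask that is relaxed linearly from 75% to 25% sparsity over the fine-tuning run. The helper names (`apply_magnitude_mask`, `sparsity_schedule`) are hypothetical, not from the authors' code.

```python
# Sketch: fine-tune a sparse-pretrained layer while gradually relaxing its
# sparsity from 75% to 25%, instead of jumping straight to fully dense.
import torch
import torch.nn as nn

def apply_magnitude_mask(module: nn.Linear, sparsity: float) -> None:
    """Zero out roughly the smallest-magnitude `sparsity` fraction of weights."""
    with torch.no_grad():
        w = module.weight
        k = int(sparsity * w.numel())
        if k == 0:
            return
        threshold = w.abs().flatten().kthvalue(k).values
        w.mul_((w.abs() > threshold).float())

def sparsity_schedule(step: int, total_steps: int,
                      start: float = 0.75, end: float = 0.25) -> float:
    """Linearly interpolate sparsity from `start` to `end` over fine-tuning."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start + (end - start) * frac

# Usage: re-apply the (progressively looser) mask after each optimizer step.
model = nn.Linear(1024, 1024)      # stand-in for one layer of the pretrained model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
total_steps = 1000
for step in range(total_steps):
    x = torch.randn(32, 1024)
    loss = model(x).pow(2).mean()  # placeholder for the downstream fine-tuning loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    apply_magnitude_mask(model, sparsity_schedule(step, total_steps))
```

Because the mask is recomputed from weight magnitudes at each step, previously pruned weights can "come back" as the target sparsity drops, which is one simple way to realize the partial-redensification idea discussed above.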