Submitted by CS-fan-101 t3_11xskuk in MachineLearning
kilow4tt t1_jd68jnd wrote
Was there any effort to go from 75% sparsity during pre-training to a lower sparsity (e.g., 25%) during fine-tuning, rather than strictly going from 75% sparsity to 0%?
brownmamba94 t1_jd6j6pt wrote
Hi, this is the first author on the paper. You asked a great question and it’s something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during fine-tuning. But as for future work, you’re right, we can certainly vary the amount of “redensification” as well (e.g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.
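(A minimal sketch of what such a partial redensification schedule could look like, assuming magnitude-based masking and a linear relaxation of sparsity during fine-tuning; the function names and schedule are illustrative and not taken from the paper.)

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Binary mask keeping the largest-magnitude (1 - sparsity) fraction of weights."""
    k = int(sparsity * weight.numel())
    if k <= 0:
        return torch.ones_like(weight)
    # k-th smallest magnitude is the pruning threshold
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

def sparsity_schedule(step: int, total_steps: int,
                      start_sparsity: float = 0.75,
                      end_sparsity: float = 0.25) -> float:
    """Linearly relax sparsity (e.g., 75% -> 25%) over the fine-tuning run."""
    frac = min(step / max(total_steps, 1), 1.0)
    return start_sparsity + frac * (end_sparsity - start_sparsity)

# Hypothetical fine-tuning loop: re-apply the mask at the scheduled sparsity
# so the model only partially redensifies instead of going fully dense.
# sparsity = sparsity_schedule(step, total_steps)
# for name, p in model.named_parameters():
#     if p.dim() > 1:  # typically only weight matrices are sparsified
#         p.data *= magnitude_mask(p.data, sparsity)
```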