brownmamba94 t1_jddhxdb wrote

Hi, yes, this is a great question. When we say FLOP-equivalent, we mean that on ideal hardware that can accelerate unstructured weight sparsity, the total compute time would also be equivalent. The difference is that we show we can actually improve the accuracy of the original dense model for the same compute budget with these Sparse Iso-FLOP Transformations (e.g., Sparse Wide, Sparse Parallel, etc.).
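As a rough sketch of the FLOP accounting (my own illustrative numbers and variable names, not the paper's exact configs): for a linear layer, compute scales with the number of nonzero weights, so a Sparse Wide style transformation can widen each layer by 1/sqrt(1-s) at sparsity s and stay FLOP-equivalent to the dense baseline.

```python
import math

def sparse_wide_factor(sparsity):
    """Widening factor k so a layer widened by k at the given unstructured
    sparsity matches the dense baseline's nonzero-weight FLOPs:
    (1 - s) * (k * d)**2 == d**2  =>  k = 1 / sqrt(1 - s)."""
    return 1.0 / math.sqrt(1.0 - sparsity)

d = 1024                        # hypothetical hidden size
s = 0.75                        # 75% unstructured sparsity
k = sparse_wide_factor(s)       # widening factor: 2.0
d_wide = int(round(k * d))      # widened hidden size: 2048

# Same effective compute budget for both models:
dense_flops = d * d
sparse_wide_flops = (1 - s) * d_wide * d_wide
assert math.isclose(dense_flops, sparse_wide_flops)
```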

In Section 4 of our paper, we actually make comparisons for inference and training on hardware with and without support for sparsity acceleration.

In theory, there should be no increase in wall-clock time, but on today's GPUs, which don't accelerate unstructured sparsity, there would be a significant increase. However, emerging hardware accelerators like the Cerebras CS-2 are doing hardware-software co-design for sparse techniques, which allows us to take advantage of sparse acceleration during training.

0

brownmamba94 t1_jdd1otu wrote

Hi, thank you for the feedback. This was a genuine oversight, and we will correct the paper with a new acronym in the revised version of the manuscript. You can expect the changes soon. I look forward to any feedback you have on the research itself, cheers!

8

brownmamba94 t1_jdawqp9 wrote

That's a pretty interesting thought... it reminds me of this research from MIT that came out last summer: how computationally complex is a single neuron? Work like this can potentially help advance the field of analog deep learning. I think sparsity will play a role here at both the connection level and the neuron level, potentially further reducing energy consumption and allowing for better resource utilization.

1

brownmamba94 t1_jdaq0gn wrote

Hi, thanks for acknowledging the novelty of our work and finding our paper a good read. We look forward to releasing our code so you and others can experiment with the different SIFT transformations. And yes, this is the first time sparsity is being used to improve accuracy!

5

brownmamba94 t1_jd8lqry wrote

Also, the N:M sparsity structure is much more constrained in terms of mask diversity compared to unstructured sparsity. In Table 1 of the N:M Transposable sparsity paper, they compare mask diversity across different sparsity techniques (both unstructured and structured), and, as expected, unstructured sparsity achieves the highest diversity. I think this is especially important for dynamic sparse training, because the algorithm then has a much larger search space of sparse subnetworks to explore. Also, imposing structure like N:M sparsity tends to reduce the expressivity of a weight matrix at higher sparsity levels, which can be a constraint if you want high compression ratios.
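To make the mask-diversity gap concrete, here's a toy count (my own illustrative example, not the table from the paper): for a flat group of 16 weights keeping 8 nonzeros, 2:4 structured sparsity only allows C(4,2) choices per block of 4, while unstructured sparsity can place the 8 nonzeros anywhere.

```python
from math import comb

def n_m_mask_count(n_weights, n, m):
    """Number of valid N:M masks for a flat group of weights
    (keep n nonzeros in every consecutive block of m)."""
    assert n_weights % m == 0
    return comb(m, n) ** (n_weights // m)

def unstructured_mask_count(n_weights, n_kept):
    """Number of unstructured masks keeping n_kept nonzeros anywhere."""
    return comb(n_weights, n_kept)

structured = n_m_mask_count(16, 2, 4)          # 6**4    = 1296
unstructured = unstructured_mask_count(16, 8)  # C(16,8) = 12870
```

Even at this tiny scale, unstructured sparsity has roughly 10x more masks to choose from; the gap grows exponentially with matrix size.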

3

brownmamba94 t1_jd6zyd5 wrote

I totally agree, and I really wonder how the landscape will look in 10 years when it comes to ML model architectures, training strategies, optimization techniques, etc. It'll be very interesting. Although plasticity-based learning, spiking neural networks, and other neuromorphic algorithms that use local learning rules don't get the same kind of attention as gradient-based learning, I do believe mimicking the neural activity of the brain by emulating spiking neural networks could one day be a good solution for inference (in terms of cost and power efficiency). Currently, though, implementing spike-based learning and training has proven to be a challenge. But hey, one thing they have in common is that sparsity is a key enabler for these types of hardware.

3

brownmamba94 t1_jd6xm1n wrote

Yes, that's right, it's usually the other way around, because for the average researcher it's computationally expensive to pre-train an LLM from scratch. So they typically take existing pre-trained LLM checkpoints and fine-tune them on a domain-specific task. Pre-training requires several orders of magnitude more FLOPs than fine-tuning.

In this work, like you said, we're aiming to show that thanks to the Cerebras CS-2, we can achieve faster pre-training with unstructured weight sparsity, and then fine-tune dense to recover the performance on the downstream task. The ability to do faster pre-training opens up a lot of potential for new directions in LLM research. Note that an interesting extension of our work is to do sparse pre-training followed by parameter-efficient fine-tuning using techniques like LoRA from Microsoft.
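As a rough sketch of the recipe (hypothetical NumPy code, not our actual training stack): during sparse pre-training, a fixed unstructured mask zeros out pruned weights and their updates; for dense fine-tuning, the mask is simply dropped so every weight becomes trainable again.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_unstructured_mask(shape, sparsity, rng):
    """Binary mask with exactly `sparsity` fraction of entries pruned (0)."""
    n = int(np.prod(shape))
    mask = np.ones(n)
    pruned = rng.choice(n, size=int(round(sparsity * n)), replace=False)
    mask[pruned] = 0.0
    return mask.reshape(shape)

# Sparse pre-training: weights and gradient updates are both masked,
# so pruned connections stay exactly zero throughout.
W = rng.normal(size=(8, 8))
mask = random_unstructured_mask(W.shape, sparsity=0.75, rng=rng)
W *= mask
grad = rng.normal(size=W.shape)       # stand-in for a real gradient
W -= 0.1 * (grad * mask)              # masked SGD step

# Dense fine-tuning: drop the mask; pruned weights rejoin training from zero.
W_finetune = W.copy()
W_finetune -= 0.1 * grad              # unmasked step updates all entries
```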

There are actually a couple of really nice blog posts from Sean Lie, our Co-founder and Chief Hardware Architect, discussing how the Cerebras CS-2 can translate unstructured sparsity into realized gains, unlike traditional GPUs. All the experiments in our paper were done on the CS-2, including the 1.3B-parameter GPT-3 XL; there was no GPU training here. I encourage you to check out these blog posts:

- Harnessing the Power of Sparsity for Large GPT AI Models
- Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning

4

brownmamba94 t1_jd6j6pt wrote

Hi, this is the first author on the paper. You asked a great question, and it's something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during fine-tuning. But as for future work, you're right, we can certainly vary the amount of "redensification" as well (e.g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.
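For illustration (my own toy formulation, not something from the paper): partial redensification can be described as the sparsity level left after re-enabling a chosen fraction of the pruned weights, possibly following a schedule over fine-tuning steps.

```python
import math

def redensified_sparsity(initial_sparsity, redensify_frac):
    """Sparsity remaining after re-enabling `redensify_frac` of the pruned weights."""
    return initial_sparsity * (1.0 - redensify_frac)

# E.g., pre-train at 80% sparsity, then re-enable half of the pruned weights:
s_finetune = redensified_sparsity(0.80, 0.50)   # 40% sparsity for fine-tuning

def sparsity_at_step(step, total_steps, s0, target_frac):
    """Linear schedule: redensify gradually from s0 over the fine-tuning run."""
    frac = min(step / total_steps, 1.0) * target_frac
    return redensified_sparsity(s0, frac)
```

A schedule like this would let the model recover downstream performance while spending only as much extra capacity as the task actually needs.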

9