brownmamba94 t1_jddhxdb wrote
Reply to comment by GamerMinion in [R] Introducing SIFT: A New Family of Sparse Iso-FLOP Transformations to Improve the Accuracy of Computer Vision and Language Models by CS-fan-101
Hi, yes, this is a great question. When we say FLOP-equivalent, we mean that on ideal hardware that can accelerate unstructured weight sparsity, the total compute time would also be equivalent. The key point is that we can actually improve the accuracy of the original dense model for the same compute budget with these Sparse Iso-FLOP Transformations (e.g., Sparse Wide, Sparse Parallel, etc.).
In Section 4 of our paper, we make comparisons for inference and training on hardware with and without support for sparsity acceleration.
In theory, there should be no increase in wall-clock time, but on GPUs, which don't accelerate unstructured sparsity, there would be a significant increase. However, emerging hardware accelerators like the Cerebras CS-2 are doing hardware-software co-design for sparse techniques, which allows us to take advantage of sparse acceleration during training.
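To make the iso-FLOP idea a bit more concrete, here's a rough PyTorch-style sketch of a "Sparse Wide" style layer. The helper name, random mask, and 2x widening factor are just illustrative assumptions, not the exact recipe from the paper: widen the layer, then raise the sparsity so the number of non-zero weights (and hence FLOPs) matches the dense baseline.

```python
import torch
import torch.nn as nn

def sparse_wide_linear(d_in, d_out, widening=2.0, seed=0):
    """Toy 'Sparse Wide' style layer: widen the output dimension by `widening`
    and sparsify the weights so the non-zero count matches the dense
    d_in x d_out baseline (iso-FLOP). The random unstructured mask is purely
    for illustration; the paper's actual mask/update rules may differ."""
    torch.manual_seed(seed)
    wide_out = int(d_out * widening)
    layer = nn.Linear(d_in, wide_out, bias=False)

    # Dense baseline does d_in * d_out multiply-adds per token;
    # keep the same number of non-zeros in the widened weight matrix.
    sparsity = 1.0 - 1.0 / widening          # e.g. widening=2 -> 50% sparse
    n_keep = int(layer.weight.numel() * (1.0 - sparsity))

    # Random unstructured mask (illustrative only).
    flat = torch.zeros(layer.weight.numel())
    flat[torch.randperm(flat.numel())[:n_keep]] = 1.0
    mask = flat.view_as(layer.weight)
    layer.weight.data.mul_(mask)
    return layer, mask

dense = nn.Linear(512, 512, bias=False)
wide, mask = sparse_wide_linear(512, 512, widening=2.0)
print(dense.weight.numel(), int(mask.sum()))  # same number of non-zero FLOPs
```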
brownmamba94 t1_jdd1otu wrote
Reply to comment by mouldygoldie in [R] Introducing SIFT: A New Family of Sparse Iso-FLOP Transformations to Improve the Accuracy of Computer Vision and Language Models by CS-fan-101
Hi, thank you for the feedback. This was a genuine oversight, and we will correct the paper with a new acronym in the revised version of the manuscript. You can expect the changes soon. I look forward to any feedback you have on the research itself, cheers!
brownmamba94 t1_jdawqp9 wrote
Reply to comment by Carrasco_Santo in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
That's a pretty interesting thought... it reminds me of research from MIT that came out last summer asking how computationally complex a single neuron is. Work like this can potentially help advance the field of analog deep learning. I think sparsity will play a role here both at the connection level and the neuron level, potentially further reducing energy consumption and allowing for better resource utilization.
brownmamba94 t1_jdaq0gn wrote
Reply to comment by elisiyumali in [R] Introducing SIFT: A New Family of Sparse Iso-FLOP Transformations to Improve the Accuracy of Computer Vision and Language Models by CS-fan-101
Hi, thanks for acknowledging the novelty of our work and finding our paper a good read. We look forward to releasing our code so you and others can experiment with the different SIFT transformations. And yes, this is the first time sparsity is being used to improve accuracy!
brownmamba94 t1_jd8lqry wrote
Reply to comment by maizeq in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
Also, the N:M sparsity structure is much more constrained in terms of mask diversity compared to unstructured sparsity. In Table 1 of the N:M Transposable sparsity paper, they compare the mask diversity of different sparsity techniques (both unstructured and structured), and, as expected, unstructured sparsity achieves the highest. I think this is especially important for dynamic sparse training, because the algorithm then has a much larger search space of sparse subnetworks to explore. Also, imposing structure like N:M sparsity tends to reduce the expressivity of a weight matrix at higher sparsity levels, which can be a constraint if you want high compression ratios.
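As a toy illustration of the mask-diversity point (my own sketch, not code from either paper): for the same 50% sparsity, a 2:4 mask is forced to keep exactly two weights in every group of four, while an unstructured magnitude mask can place its non-zeros anywhere in the matrix.

```python
import torch

def nm_mask(weight, n=2, m=4):
    """Keep the n largest-magnitude weights in every contiguous group of m
    along the input dimension (the usual 2:4 pattern is n=2, m=4)."""
    out_f, in_f = weight.shape
    groups = weight.abs().view(out_f, in_f // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.view(out_f, in_f)

def unstructured_mask(weight, sparsity=0.5):
    """Keep the top-(1 - sparsity) fraction of weights by magnitude, anywhere."""
    k = int(weight.numel() * (1.0 - sparsity))
    thresh = weight.abs().flatten().topk(k).values.min()
    return (weight.abs() >= thresh).float()

w = torch.randn(8, 16)
print(nm_mask(w).sum(), unstructured_mask(w).sum())  # both ~50% dense...
# ...but the 2:4 mask must spread its non-zeros evenly across groups,
# which is exactly the reduced "mask diversity" discussed above.
```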
brownmamba94 t1_jd6zyd5 wrote
Reply to comment by Carrasco_Santo in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
I totally agree, and I really wonder how the landscape will look in 10 years when it comes to ML model architectures, training strategies, optimization techniques, etc. It'll be very interesting. Although plasticity-based learning, spiking neural networks, and other neuromorphic algorithms that use local learning rules don't get the same kind of attention as gradient-based learning, I do believe mimicking the neural activity of the brain by emulating spiking neural networks could one day be a good solution for inference (in terms of cost and power efficiency). Currently, though, implementing spike-based learning and training has proven to be a challenge. But hey, one thing they have in common is that sparsity is a key enabler for these types of hardware.
brownmamba94 t1_jd6xm1n wrote
Reply to comment by _Arsenie_Boca_ in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
Yes, that's right, it's usually the other way around, and that's because for the average researcher it's computationally expensive to pre-train an LLM from scratch. So they typically take existing pre-trained LLM checkpoints and fine-tune them on a domain-specific task. Pre-training requires several orders of magnitude more FLOPs than fine-tuning.
In this work, like you said, we're aiming to show that thanks to the Cerebras CS-2, we can achieve faster pre-training with unstructured weight sparsity, and then fine-tune dense to recover performance on the downstream task. The ability to do faster pre-training opens up a lot of potential for new directions in LLM research. Note that an interesting extension of our work is to do sparse pre-training followed by parameter-efficient fine-tuning using techniques like LoRA from Microsoft.
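At a very high level, the sparse pre-training / dense fine-tuning recipe looks something like the sketch below (PyTorch-style, with made-up helper names, a toy loss, and a fixed random mask purely for illustration; on a GPU the mask only simulates sparsity, it doesn't actually make the matmuls cheaper the way the CS-2 can).

```python
import torch
import torch.nn as nn

def make_masks(model, sparsity=0.75):
    """Random unstructured masks per linear weight (illustrative only; the
    paper's actual sparsity pattern/selection may differ)."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() == 2:
            masks[name] = (torch.rand_like(p) > sparsity).float()
    return masks

def apply_masks(model, masks):
    """Zero out masked weights; on a GPU this only simulates sparsity."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
masks = make_masks(model, sparsity=0.75)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# --- Sparse pre-training: re-apply the fixed mask after each optimizer step. ---
for step in range(10):
    x = torch.randn(32, 256)
    loss = model(x).pow(2).mean()        # stand-in for the LM loss
    opt.zero_grad(); loss.backward(); opt.step()
    apply_masks(model, masks)

# --- Dense fine-tuning: simply stop masking, so all weights can update. ---
for step in range(10):
    x = torch.randn(32, 256)
    loss = model(x).pow(2).mean()        # stand-in for the downstream loss
    opt.zero_grad(); loss.backward(); opt.step()
```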
There are actually a couple of really nice blog posts from Sean Lie, our co-founder and Chief Hardware Architect, discussing how the Cerebras CS-2 can translate unstructured sparsity into realized gains, unlike traditional GPUs. All the experiments in our paper were done on the CS-2, including the 1.3B GPT-3 XL; there was no GPU training here. I encourage you to check out these posts:
- Harnessing the Power of Sparsity for Large GPT AI Models
- Cerebras Architecture Deep Dive: First Look Inside the HW/SW Co-Design for Deep Learning
brownmamba94 t1_jd6j6pt wrote
Reply to comment by kilow4tt in [R] SPDF - Sparse Pre-training and Dense Fine-tuning for Large Language Models by CS-fan-101
Hi, this is the first author on the paper. You asked a great question, and it's something we are pursuing internally. In this study we kept things simple and switched from sparse to completely dense during fine-tuning. But as for future work, you're right, we can certainly vary the amount of "redensification" as well (e.g., 25%, 50%, or possibly some schedule). This is a very interesting research direction, because the full dense capacity of the model may not be needed to recover performance on the downstream task.
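For illustration, partial "redensification" could look something like the sketch below (my own simplification: the freed weights are chosen at random, whereas a magnitude- or schedule-based rule would be a natural alternative).

```python
import torch

def redensify(mask, fraction=0.5, seed=0):
    """Flip a `fraction` of the currently-masked (zero) positions back to
    trainable (one). fraction=1.0 recovers full dense fine-tuning; 0.0 keeps
    the pre-training sparsity. Random selection is purely illustrative."""
    torch.manual_seed(seed)
    zeros = (mask == 0).nonzero(as_tuple=False)
    n_free = int(len(zeros) * fraction)
    picked = zeros[torch.randperm(len(zeros))[:n_free]]
    new_mask = mask.clone()
    new_mask[picked[:, 0], picked[:, 1]] = 1.0
    return new_mask

mask = (torch.rand(1024, 1024) > 0.8).float()     # ~80% sparse pre-training mask
for frac in (0.25, 0.5, 1.0):
    m = redensify(mask, fraction=frac)
    print(frac, f"density: {m.mean():.2f}")
```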
brownmamba94 t1_jddzjgc wrote
Reply to comment by __Maximum__ in [R] Introducing SIFT: A New Family of Sparse Iso-FLOP Transformations to Improve the Accuracy of Computer Vision and Language Models by CS-fan-101
Thanks for your inquiry. We are working with our legal team to figure out the best path forward, but most likely, we'll be releasing under some permissive license that allows you to use the code for your applications.