LetterRip t1_jc79qjb wrote
Reply to comment by farmingvillein in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
Stability.AI has been funding RWKV's training.
LetterRip t1_jc4rifv wrote
Reply to comment by stefanof93 in [P] Discord Chatbot for LLaMA 4-bit quantized that runs 13b in <9 GiB VRAM by Amazing_Painter_7692
Depends on the model. Some have difficulty even with full 8-bit quantization; others can go to 4-bit relatively easily. There is some research suggesting 3-bit may be the practical limit, with only the occasional model holding up at 2-bit.
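As a rough illustration of what dropping bits does (plain round-to-nearest absmax quantization - real 4-bit schemes like GPTQ do considerably better than this sketch):

```python
import torch

def absmax_quantize(w: torch.Tensor, bits: int = 4):
    """Round-to-nearest symmetric quantization of a weight matrix,
    one scale per output row. Returns the integer codes and the scales."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit, 127 for 8-bit
    scale = w.abs().amax(dim=1, keepdim=True) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.to(torch.int8), scale        # int8 storage even for sub-8-bit codes

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

w = torch.randn(256, 256)
for bits in (8, 4, 3, 2):
    q, s = absmax_quantize(w, bits)
    err = (dequantize(q, s) - w).abs().mean().item()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```

The point is that the error-vs-bits curve is model dependent; some architectures tolerate 4-bit fine while others already degrade at 8-bit without tricks like outlier handling.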
LetterRip t1_jc3864s wrote
Reply to comment by cyvr_com in [R] Stanford-Alpaca 7B model (an instruction tuned version of LLaMA) performs as well as text-davinci-003 by dojoteef
The source code and the weights are under different licenses.
The LLaMA license in the request form appears to be the same.
Relevant part here:
> a. Subject to your compliance with the Documentation and Sections 2, 3, and 5, Meta grants you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty free and limited license under Meta’s copyright interests to reproduce, distribute, and create derivative works of the Software solely for your non-commercial research purposes. The foregoing license is personal to you, and you may not assign or sublicense this License or any other rights or obligations under this License without Meta’s prior written consent; any such assignment or sublicense will be void and will automatically and immediately terminate this License.
https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform
as linked from
LetterRip t1_jbtn573 wrote
Reply to comment by Simusid in [Discussion] Compare OpenAI and SentenceTransformer Sentence Embeddings by Simusid
The total number of tokens in the input plus the output.
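If it helps, the billed count can be reproduced locally with the tiktoken tokenizer (the model name below is just an example; the encoding depends on which model you call):

```python
import tiktoken

def count_tokens(text: str, model: str = "text-embedding-ada-002") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

prompt = "Compare these two sentence embeddings."
print(count_tokens(prompt))  # number of billed tokens for this input text
```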
LetterRip t1_jbks0mg wrote
Reply to comment by Aran_Komatsuzaki in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).
Thanks for sharing your results. It is being tuned to longer context lengths; the current checkpoint is
RWKV-4-Pile-14B-20230228-ctx4096-test663.pth
https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main
There should soon be a 6k and 8k as well.
So hopefully you should see better results with longer contexts soon.
> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.
Could you clarify - was one of those meant to be 'former' and the other 'latter'?
LetterRip t1_jbkmk5e wrote
Reply to comment by ThePerson654321 in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> He makes it sound extraordinary
The problem is that extraordinary claims raise 'quack' suspicions when there isn't much evidence provided in support.
> The most extraordinary claim I got stuck up on was "infinite" ctx_len. One of the biggest limitations of transformers today is imo their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAi etc would want to investigate?
Regarding the infinite context length - that applies to inference, and it is more accurately stated as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length:
> It borrows ideas from Attention Free Transformers, meaning the attention is a linear in complexity. Allowing for infinite context windows.
> Blink DL mentioned that when training with GPT Mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000 so it can extrapolate and compress the prompt context a bit further. This is due to the fact that the model likely doesn't know how to handle samples beyond that size. This implies that the hidden state allows for the prompt context to be infinite, if we can fine tune it properly. ( Unclear right now how to do so )
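As a minimal sketch of why RNN-mode inference has no hard window (the GRU cell below is just a stand-in for RWKV's actual time-mix/channel-mix recurrence): the model only ever sees the current token plus a fixed-size state, so you can keep feeding tokens indefinitely; what degrades is how much the state still remembers from far back.

```python
import torch
import torch.nn as nn

class TinyRecurrentLM(nn.Module):
    """Stand-in for an RNN-mode language model: fixed-size state, one token in."""
    def __init__(self, vocab: int = 100, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.cell = nn.GRUCell(dim, dim)    # placeholder for RWKV's recurrence
        self.head = nn.Linear(dim, vocab)

    def step(self, token: torch.Tensor, state: torch.Tensor):
        state = self.cell(self.embed(token), state)
        return self.head(state), state

model = TinyRecurrentLM().eval()
state = torch.zeros(1, 64)                  # the entire "context" lives here
tokens = torch.randint(0, 100, (10_000,))   # far longer than any training context
with torch.no_grad():
    for t in tokens:
        logits, state = model.step(t.view(1), state)
# Memory use is constant in sequence length; a transformer's KV cache would grow
# linearly and its attention cost quadratically over the same 10k tokens.
```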
LetterRip t1_jbkdshr wrote
Reply to comment by farmingvillein in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
Here is what the author stated in the thread,
> Tape-RNNs are really good (both in raw performance and in compression i.e. very low amount of parameters) but they just can't absorb the whole internet in a reasonable amount of training time... We need to find a solution to this!
I think they knew it existed (i.e. they knew there was a deep learning project named RWKV), but they appear not to have known that it met their scaling needs.
LetterRip t1_jbjphkw wrote
Reply to comment by ThePerson654321 in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
> I don't buy the argument that it's too new or hard to understand. Some researcher at, for example, Deepmind would have been able to understand it.
This was posted by DeepMind a month ago,
I emailed them that RWKV exactly met their desire for a way to train RNNs 'on the whole internet' in a reasonable time.
So prior to a month ago they didn't know it existed (edit - or at least not much more than that it existed), or that it happened to meet their use case.
> RWKV 7B came out 7 months ago but the concept has been promoted by the developer much longer.
There was no evidence it was going to be interesting. There are lots of ideas that work on small models that don't work on larger models.
> 2) This might actually be a problem. But the code is public so it shouldn't be that difficult to understand it.
Until it had proved itself there was no motivation to put in the effort to figure it out. The lower the effort threshold, the more likely people are to have a look; the higher the threshold, the more likely they are to invest their limited time in the hundreds of other interesting bits of research that come out each week.
> If your idea is truly good you will get at attention sooner or later anyways.
Or be ignored for all time till someone else discovers the idea and gets credit for it.
In this case the idea has started to catch on and be discussed by 'the Big Boys'; people are cautiously optimistic and are investing time to start learning about it.
> I don't buy the argument that it's too new or hard to understand.
It isn't "too hard to understand" - it simply hadn't shown itself to be interesting enough to worth more than minimal effort to understand it. Without a paper that exceeded the minimal effort threshold. Now it has proven itself with the 14B that it seems to scale. So people are beginning to invest the effort.
> It does not work as well as the developer claim or have some other flaw that makes it hard to scale for example (time judge of this)
No, it simply hadn't been shown to scale. Now we know it scales to at least 14B, and there is no reason to think it won't scale the same as any other GPT model.
The DeepMind paper lamenting the need for a fast way to train RNN models was only about a month ago.
LetterRip t1_jbjfiyg wrote
Reply to [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
- The larger models (3B, 7B, 14B) have only been released quite recently.
- Information about the design has been fairly scarce/hard to track down because no paper has been written and submitted on it.
- People want to know that it actually scales before investing work into it.
- Mostly people are learning about it from the release links posted to reddit, and those posts haven't been written in a way that attracts interest.
LetterRip t1_jb5bgvj wrote
Greatly appreciated; you might run it on aesthetic and 5B also.
LetterRip t1_javpxbv wrote
Reply to comment by CellWithoutCulture in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> I mean... why were they not doing this already? They would have to code it but it seems like low hanging fruit
GPT-3 came out in 2020 (they had their initial price, then a modest price drop early on).
FlashAttention is from June of 2022.
Quantization we've only recently figured out how to do fairly losslessly (especially int4); Tim Dettmers' LLM.int8() is from August 2022.
https://arxiv.org/abs/2208.07339
> That seems large, which paper has that?
See
https://github.com/HazyResearch/flash-attention/raw/main/assets/flashattn_memory.jpg
>We show memory savings in this graph (note that memory footprint is the same no matter if you use dropout or masking). Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length. We see 10X memory savings at sequence length 2K, and 20X at 4K. As a result, FlashAttention can scale to much longer sequence lengths.
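As a hedged illustration of the quadratic-vs-linear memory point (using PyTorch 2.x's built-in fused attention as the memory-efficient path rather than the FlashAttention repo itself; the shapes and the CUDA/float16 setup are just for the example):

```python
import torch
import torch.nn.functional as F

B, H, N, D = 1, 16, 4096, 64                     # batch, heads, seq len, head dim
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Naive attention materializes an (N x N) score matrix per head:
# 16 heads * 4096^2 * 2 bytes is roughly 0.5 GiB just for the scores.
scores = (q @ k.transpose(-2, -1)) / D ** 0.5
out_naive = torch.softmax(scores, dim=-1) @ v

# Fused/memory-efficient attention never materializes that matrix, so
# activation memory scales linearly with N instead of quadratically.
out_fused = F.scaled_dot_product_attention(q, k, v)

print((out_naive - out_fused).abs().max())       # small numerical difference only
```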
LetterRip t1_janljeo wrote
Reply to comment by lucidraisin in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Ah, I'd not seen the Block Recurrent Transformers paper before, interesting.
LetterRip t1_jani50o wrote
Reply to comment by jinnyjuice in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
We don't know the supply/demand curve, so we can't know for sure that revenue increased.
LetterRip t1_jal4y8i wrote
Reply to comment by bjergerk1ng in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Certainly that is also a possibility. Or they might have done teacher student distillation.
LetterRip t1_jal4vgs wrote
Reply to comment by cv4u in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Yep, or a mix between the two.
GLM-130B quantized to int4, OPT and BLOOM to int8:
https://arxiv.org/pdf/2210.02414.pdf
Often you'll want to keep the first and last layers as int8 and can do everything else in int4. You can quantize based on each layer's sensitivity, etc. I also (vaguely) recall a mix of 8-bit for weights and 4-bit for biases (or vice versa?).
Here is a survey on quantization methods; for mixed int8/int4 see Section IV, "ADVANCED CONCEPTS: QUANTIZATION BELOW 8 BITS":
https://arxiv.org/pdf/2103.13630.pdf
Here is a talk on auto48 (automatic mixed int4/int8 quantization)
https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41611/
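A toy sketch of that "first/last layer at int8, everything else int4" idea - the position-based rule and the helper name are just illustrative, not any library's policy:

```python
import torch.nn as nn

def assign_bits(model: nn.Module) -> dict:
    """Map each Linear layer name to a bit width: int8 for the first and
    last Linear, int4 for everything in between."""
    linears = [name for name, m in model.named_modules() if isinstance(m, nn.Linear)]
    plan = {}
    for name in linears:
        plan[name] = 8 if name in (linears[0], linears[-1]) else 4
    return plan

model = nn.Sequential(
    nn.Linear(512, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 512),
)
print(assign_bits(model))
# {'0': 8, '2': 4, '4': 8}
```

In a real pipeline you would then quantize each layer at the chosen width, or rank layers by measured sensitivity instead of position.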
LetterRip t1_jajezib wrote
Reply to comment by minimaxir in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
June 11, 2020 is the date the GPT-3 API was introduced. There was no int4 support, and the Ampere architecture with int8 support had only been introduced weeks prior, so the pricing was set based on float16 hardware.
Memory efficient attention is from a few months ago.
ChatGPT was just introduced a few months ago.
The question was how OpenAI could be making a profit: if they were making a profit at GPT-3's 2020 pricing, then they should be making 90% more profit per token at the new pricing.
LetterRip t1_jaj1kp3 wrote
Reply to [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
> I have no idea how OpenAI can make money on this.
Quantizing to mixed int8/int4 gives roughly a 70% hardware reduction and a 3x speed increase compared to float16, with essentially no loss in quality.
A × 0.3 / 3 = 10% of the cost.
Switching from quadratic to memory-efficient attention gives a 10x-20x increase in batch size.
So we are talking about it taking roughly 1% of the resources against a 10x price reduction - they should be 90% more profitable than when they introduced GPT-3.
edit - see MS DeepSpeed-MII, which shows a 40x per-token cost reduction for BLOOM-176B vs. the default implementation:
https://github.com/microsoft/DeepSpeed-MII
There are also additional ways to reduce cost not covered above - pruning, graph optimization, teacher-student distillation. I think teacher-student distillation is extremely likely given reports that it has difficulty with more complex prompts.
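Spelling out that arithmetic (the factors are the rough estimates above, not measured numbers):

```python
# Rough estimates from the comment above, not measured numbers.
quant_hw_fraction = 0.30      # int8/int4: ~70% hardware reduction
quant_speedup = 3.0           # ~3x tokens/sec vs float16
attention_batch_gain = 10.0   # memory-efficient attention: 10-20x batch size

cost_factor = quant_hw_fraction / quant_speedup / attention_batch_gain
print(f"cost per token vs. 2020 float16 serving: {cost_factor:.2f}")          # ~0.01

price_factor = 0.10           # ChatGPT API priced at 1/10th of GPT-3
print(f"cost as a share of the new price: {cost_factor / price_factor:.2f}")  # ~0.10
# i.e. if GPT-3's 2020 pricing was at least break-even, roughly 90% of the new
# per-token price would be margin under these assumptions.
```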
LetterRip t1_ja88v3i wrote
Reply to comment by badabummbadabing in [D] Training a UNet-like architecture for semantic segmentation with 200 outcome classes. by Scared_Employer6992
Which particular paper?
LetterRip t1_ja4d12c wrote
Reply to comment by coconautico in [P] [N] Democratizing the chatGPT technology through a Q&A game by coconautico
It appears they have changed the ToS. It used to restrict usage of output.
LetterRip t1_ja3rzqk wrote
Reply to comment by coconautico in [P] [N] Democratizing the chatGPT technology through a Q&A game by coconautico
> I have manually copy-pasted a few interesting questions that I asked chatGPT and encouraged lateral thinking or required specialized knowledge.
Don't do that - it violates ChatGPT's TOS which could result in a lawsuit against the model developers.
LetterRip t1_j9ker51 wrote
Reply to [D] Faster Flan-T5 inference by _learn_faster_
See this tutorial - it converts to ONNX (CPU), then to TensorRT for a 3-6x speedup.
https://developer.nvidia.com/blog/optimizing-t5-and-gpt-2-for-real-time-inference-with-tensorrt/
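The tutorial walks through the whole encoder/decoder + TensorRT flow; purely as a hedged sketch of the first step (the model name, opset, and wrapper are my own illustrative choices), exporting just the Flan-T5 encoder to ONNX might look like:

```python
import torch
from transformers import AutoTokenizer, T5EncoderModel

name = "google/flan-t5-base"   # illustrative checkpoint; use your own
tok = AutoTokenizer.from_pretrained(name)
enc = T5EncoderModel.from_pretrained(name).eval()

class EncoderWrapper(torch.nn.Module):
    """Return a plain tensor so tracing/ONNX export is straightforward."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
    def forward(self, input_ids, attention_mask):
        return self.encoder(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state

inputs = tok("summarize: studies have shown that ...", return_tensors="pt")
torch.onnx.export(
    EncoderWrapper(enc),
    (inputs["input_ids"], inputs["attention_mask"]),
    "flan_t5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
    opset_version=13,
)
# The resulting graph can then be built into a TensorRT engine (e.g. via trtexec);
# the decoder needs a similar export, as the tutorial walks through.
```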
LetterRip t1_j8dpgxc wrote
Reply to comment by diviludicrum in [R] [N] Toolformer: Language Models Can Teach Themselves to Use Tools - paper by Meta AI Research by radi-cho
There are plenty of examples of tool use in nature that don't require intelligence - for instance, ants:
https://link.springer.com/article/10.1007/s00040-022-00855-7
The tool use demonstrated by Toolformer can be purely statistical in nature; no intelligence is needed.
LetterRip t1_jcl6axl wrote
Reply to comment by bo_peng in [R] RWKV 14B ctx8192 is a zero-shot instruction-follower without finetuning, 23 token/s on 3090 after latest optimization (16G VRAM is enough, and you can stream layers to save more VRAM) by bo_peng
Rocky