Submitted by ThePerson654321 t3_11lq5j4 in MachineLearning
ThePerson654321 OP t1_jbk8kxy wrote
Reply to comment by farmingvillein in [D] Why isn't everyone using RWKV if it's so much better than transformers? by ThePerson654321
I'm basically just referring to the claims by the developer. He makes it sound extraordinary:
> best of RNN and transformer, great performance, fast inference, saves VRAM, fast training, "infinite" ctx_len, and free sentence embedding.
> Inference is very fast (only matrix-vector multiplications, no matrix-matrix multiplications) even on CPUs, so you can even run a LLM on your phone.
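For what it's worth, the "matrix-vector only" part is at least plausible for any RNN-style model. Here's a toy sketch (my own, with made-up shapes and weights, not RWKV's actual code) of why per-token inference stays matrix-vector:

```python
import numpy as np

# Toy sketch of why RNN-style inference only needs matrix-vector products.
# Hypothetical shapes/weights for illustration; not RWKV's actual code.
d = 512                               # hidden size (made up)
W = np.random.randn(d, d) * 0.01      # recurrence weights
U = np.random.randn(d, d) * 0.01      # input weights

def rnn_step(h, x):
    # One token of inference: two matrix-vector products, no matrix-matrix ops.
    return np.tanh(W @ h + U @ x)

h = np.zeros(d)
for x in np.random.randn(100, d):     # stream 100 token embeddings
    h = rnn_step(h, x)                # state stays one fixed-size vector
```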
The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?
I definitely agree that there might be an incompatibility with the already existing transformer-specific infrastructure.
But thanks for your answer. It might be one or more of the following:
- The larger organizations haven't noticed/cared about it yet
- I overestimate how good it is (from the developer's description)
- It has some unknown flaw that's not obvious to me and not stated in the repository's ReadMe.
- All the existing infrastructure is tailored for transformers and is not compatible with RWKV
At least we'll see in time.
farmingvillein t1_jbkwkgl wrote
> most extraordinary claim I got stuck on was "infinite" ctx_len.
All RNNs have that capability, on paper. But the question is how well the model actually remembers and utilizes things that happened a long time ago (e.g., things that happened beyond the window that a transformer has). In simpler RNN models, the answer is usually "not very".
Which doesn't mean that there can't be real upside here--just that it is not a clear slam-dunk, and that it has not been well-studied/ablated. And obviously there has been a lot of work in extending transformer windows, too.
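To make the "not very" concrete, here's a toy experiment (a generic vanilla RNN with made-up sizes, nothing to do with RWKV's actual architecture): perturb the first vs. the last input token and see how much the final state changes.

```python
import numpy as np

# Toy vanilla RNN (not RWKV): perturb the first vs. the last input and compare
# how much the final hidden state moves. Early information tends to wash out.
rng = np.random.default_rng(0)
d, T = 64, 200
W = rng.standard_normal((d, d)) * 0.05
U = rng.standard_normal((d, d)) * 0.05
xs = rng.standard_normal((T, d))

def final_state(xs):
    h = np.zeros(d)
    for x in xs:
        h = np.tanh(W @ h + U @ x)
    return h

base = final_state(xs)
early = xs.copy(); early[0] += 1.0    # nudge the very first token
late = xs.copy(); late[-1] += 1.0     # nudge the most recent token
print(np.linalg.norm(final_state(early) - base))  # usually ~0: forgotten
print(np.linalg.norm(final_state(late) - base))   # noticeably larger
```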
LetterRip t1_jbkmk5e wrote
> He makes it sound extraordinary
The problem is that extraordinary claims raise the 'quack' suspicion when there isn't much evidence provided in support.
> The most extraordinary claim I got stuck on was "infinite" ctx_len. One of the biggest limitations of transformers today is, imo, their context length. Having an "infinite" ctx_len definitely feels like something DeepMind, OpenAI, etc. would want to investigate?
Regarding the infinite context length - that is for inference, and it is more accurately stated as not having a fixed context length. While infinite "in theory", in practice the 'effective context length' is about double the trained context length.
> It borrows ideas from Attention Free Transformers, meaning the attention is linear in complexity, allowing for infinite context windows.
> BlinkDL mentioned that when training in GPT mode with a context length of 1024, he noticed that RWKV_RNN deteriorated around a context length of 2000, so it can extrapolate and compress the prompt context a bit further. This is likely because the model doesn't know how to handle samples beyond that size. This implies that the hidden state allows the prompt context to be infinite, if we can fine-tune it properly. (Unclear right now how to do so.)
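Roughly, the "linear in complexity" part comes from the fact that an AFT/RWKV-style output can be maintained as a running decayed sum, so each new token only updates a couple of d-sized accumulators instead of attending over all previous tokens. A heavily simplified sketch (my own; it ignores RWKV's receptance, the 'u' bonus term, and numerical-stability tricks):

```python
import numpy as np

# Heavily simplified AFT/RWKV-style recurrence (my own sketch): the output for
# each token is a decayed, exp(k)-weighted running average of past values, so a
# step only updates two d-sized accumulators instead of attending to all tokens.
rng = np.random.default_rng(0)
T, d = 16, 8
k = rng.standard_normal((T, d))   # per-token "keys" (hypothetical)
v = rng.standard_normal((T, d))   # per-token "values" (hypothetical)
decay = np.exp(-0.1)              # per-step decay of older tokens (made up)

num = np.zeros(d)                 # running sum of decayed exp(k_i) * v_i
den = np.zeros(d)                 # running sum of decayed exp(k_i)
outs = []
for t in range(T):
    num = decay * num + np.exp(k[t]) * v[t]
    den = decay * den + np.exp(k[t])
    outs.append(num / den)        # output for token t, O(d) work and state
```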