Aran_Komatsuzaki t1_jbkjgzf wrote on March 9, 2023 at 6:28 PM

I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 at a comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the latter scored better on the first 1024 tokens. While RWKV performs comparably to Transformer on tasks with short context (e.g. the tasks used in its repo for evaluating RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).
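
For readers who want to reproduce this kind of split, here is a minimal sketch of perplexity computed separately for the first 1024 positions and for everything after them. It is an illustration only, not the evaluation code used for the comparison above: the function name `bucketed_perplexity` and its `token_nlls` input (per-token negative log-likelihoods in nats, in sequence order, already obtained from whichever model is being scored) are assumptions.

```python
import math

def bucketed_perplexity(token_nlls, split=1024):
    """Return (ppl_head, ppl_tail) for one sequence.

    token_nlls: per-token negative log-likelihoods in nats, ordered by position.
    split: number of leading positions that form the "head" bucket.
    A bucket with no tokens yields None.
    """
    def ppl(nlls):
        if not nlls:
            return None
        # Perplexity is exp of the mean negative log-likelihood.
        return math.exp(sum(nlls) / len(nlls))

    return ppl(token_nlls[:split]), ppl(token_nlls[split:])


if __name__ == "__main__":
    # Toy input: early tokens are "harder" (NLL ln 4) than later ones (NLL ln 2).
    toy = [math.log(4)] * 1024 + [math.log(2)] * 1024
    head, tail = bucketed_perplexity(toy)
    print(f"first 1024 tokens: {head:.3f}, remaining tokens: {tail:.3f}")
```

Running both models over the same documents and comparing the two buckets is what separates "better early in the context" from "better late in the context"; a single averaged perplexity would hide that difference.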

RWKV has fast decoding speed, but multiquery attention decoding is nearly as fast w/ comparable total memory use, so that's not necessarily what makes RWKV attractive. If you set the context length to 100k or so, RWKV would be faster and memory-cheaper, but it doesn't seem that RWKV can utilize most of the context at this range, not to mention that vanilla attention is also not feasible at this range.
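
The memory side of this argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is an assumption-laden illustration, not a measurement: the model shape (40 layers, 40 heads, head size 128, width 5120), fp16 storage, and the figure of 5 state vectors per layer for the recurrent model are all hypothetical inputs, and the helper names are made up for this example. It counts only the decoding cache or state, not the weights.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of an attention KV cache for one sequence.

    Keys and values are each stored per layer with shape
    [seq_len, n_kv_heads, head_dim]. Multi-query attention corresponds to
    n_kv_heads=1; standard multi-head attention uses n_kv_heads=n_heads.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers, d_model, vectors_per_layer=5, bytes_per_elem=2):
    """Size of a fixed recurrent state that does not grow with sequence length.

    vectors_per_layer is an assumed value, used here only for illustration.
    """
    return n_layers * vectors_per_layer * d_model * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the numbers tangible.
    n_layers, n_heads, head_dim = 40, 40, 128
    d_model = n_heads * head_dim

    for seq_len in (2048, 100_000):
        mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)
        mqa = kv_cache_bytes(n_layers, 1, head_dim, seq_len)
        rnn = recurrent_state_bytes(n_layers, d_model)
        print(f"ctx={seq_len}: multi-head={mha / 2**20:.1f} MiB, "
              f"multi-query={mqa / 2**20:.1f} MiB, recurrent={rnn / 2**20:.1f} MiB")
```

Under these assumptions the multi-query cache is smaller than the multi-head cache by a factor equal to the number of heads, which is why it stays small next to the weights at a 2048-token context, while at 100k tokens it grows linearly and the recurrent state stays constant.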

LetterRip t1_jbks0mg wrote on March 9, 2023 at 7:21 PM

> I've compared Pythia (GPT-3 variants) w/ context length = 2048 vs. RWKV w/ context length = 4096 of comparable compute budget, and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens. While RWKV performs well on the tasks with short context (e.g. the tasks used in its repo for evaluating the RWKV), it may still not be possible to replace Transformer for longer context tasks (e.g. typical conversation with ChatGPT).

Thanks for sharing your results. It is being tuned to longer context lengths, current is

RWKV-4-Pile-14B-20230228-ctx4096-test663.pth

https://huggingface.co/BlinkDL/rwkv-4-pile-14b/tree/main

There should soon be a 6k and 8k as well.

So hopefully you should see better results with longer contexts soon.

> and the former scored clearly better perplexity on the tokens after the first 1024 tokens, while the former scores better for the first 1024 tokens.

Could you clarify - was one of those meant to be former and the other latter?

Aran_Komatsuzaki t1_jbkyegs wrote on March 9, 2023 at 8:01 PM

> Thanks for sharing your results. It is being tuned to longer context lengths, current is

I tried the one w/ context length = 4096 for RWKV :)

> Could you clarify - was one of those meant to be former and the other latter?

Sorry for the typo. The second 'former' was meant to be 'latter'.