Submitted by bo_peng t3_1135aew in MachineLearning

Hi everyone. I am an independent researcher working on my pure-RNN language model, RWKV. I have finished training RWKV-4 14B (FLOPs sponsored by Stability and EleutherAI - thank you!), and it is indeed very scalable. Note that RWKV is parallelizable too, so it combines the best of RNNs and transformers.
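For readers who want a feel for how a pure RNN can act like attention, below is a heavily simplified sketch of the kind of WKV time-mixing recurrence RWKV runs at inference time (single step, per channel). Token shift, channel mixing, layer norm, and the numerical-stability tricks in the real code are all omitted; see RWKV-LM for the actual formulation.

```python
import numpy as np

def wkv_step(x, state, params):
    """One recurrent step of a simplified RWKV-style time-mixing block.

    state = (a, b): running numerator / denominator of an exponentially
    decayed weighted average over past values - this is the entire "memory".
    params: projection matrices Wr, Wk, Wv, Wo, a per-channel decay w > 0,
    and a per-channel bonus u for the current token.
    """
    Wr, Wk, Wv, Wo, w, u = params
    a, b = state

    r = 1.0 / (1.0 + np.exp(-(Wr @ x)))   # receptance gate in (0, 1)
    k = Wk @ x                            # "key": log-weight of this token
    v = Wv @ x                            # "value"

    # attention-like weighted average over the whole history,
    # computed recurrently instead of attending over all past tokens
    wkv = (a + np.exp(u + k) * v) / (b + np.exp(u + k))
    out = Wo @ (r * wkv)

    # decay the old history and fold in the current token
    a = np.exp(-w) * a + np.exp(k) * v
    b = np.exp(-w) * b + np.exp(k)
    return out, (a, b)
```

During training the same quantity can be computed for all positions in parallel (the sum over the past is a weighted cumulative sum), which is what makes the model GPT-like to train while staying RNN-like to run.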

The ChatRWKV project (let's build together):

https://github.com/BlinkDL/ChatRWKV

Zero-shot comparison with NeoX / Pythia (same dataset: the Pile) at the same parameter count (14.2B):


https://preview.redd.it/f6lxnjgfceia1.png?width=1174&format=png&auto=webp&v=enabled&s=54de7568974fc187584bd6825d92935baa079e83

Generation results (simply topP=0.85, no repetition penalty) - they look great with my magic prompt (sometimes even better than NeoX 20B):

https://preview.redd.it/99deuc17ceia1.png?width=1878&format=png&auto=webp&v=enabled&s=456c8d9bb2a968d73f44a0d3589cf6b893be31f4


https://preview.redd.it/g62e4l48ceia1.png?width=1887&format=png&auto=webp&v=enabled&s=c997bf27692d7e53d07de19048b6cbf3d2c9ebff


https://preview.redd.it/379egq09ceia1.png?width=1808&format=png&auto=webp&v=enabled&s=895f05fe14e2a3a41863802858114f3096d0ed77


https://preview.redd.it/pcgq7gz9ceia1.png?width=1886&format=png&auto=webp&v=enabled&s=138b0aec404b8f7f49f585d00284edbac791ffaf


https://preview.redd.it/rn743etbceia1.png?width=1715&format=png&auto=webp&v=enabled&s=6d83cc2a200bdd655b690f56559dda43490ed2b3


https://preview.redd.it/uhal4dkcceia1.png?width=1879&format=png&auto=webp&v=enabled&s=3b3db0b96456df9590a8b38ebe7d58509ebccb20
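For anyone curious about the sampling settings above: topP=0.85 is plain nucleus sampling with no other tricks. Here is a minimal standalone sketch of that sampler (illustrative only, not ChatRWKV's actual implementation):

```python
import numpy as np

def sample_top_p(logits, top_p=0.85, temperature=1.0):
    """Nucleus sampling: keep the smallest set of tokens whose cumulative
    probability exceeds top_p, then sample from that renormalized set."""
    z = logits / temperature
    probs = np.exp(z - np.max(z))          # numerically stable softmax
    probs /= probs.sum()

    order = np.argsort(probs)[::-1]        # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    n_keep = np.searchsorted(cumulative, top_p) + 1
    nucleus = order[:n_keep]

    kept = probs[nucleus] / probs[nucleus].sum()
    return int(np.random.choice(nucleus, p=kept))
```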

Explanation, fine-tuning, training and more:

https://github.com/BlinkDL/RWKV-LM

268

Comments


mz_gt t1_j8ofbiq wrote

This is really awesome! I’ve been seeing the progress of your work on RWKV and I have to ask: I know you’ve mentioned that a lot of RWKV builds on tricks from here and there, plus a lot of your own tweaks of course, but have you considered writing a paper? There are plenty of highly renowned published works with less to say than RWKV.

I think a renewed discussion about RNNs is more than warranted right now given the current direction with transformers, and the highly complicated nature of HiPPO-based approaches is not something I personally see replacing transformers anytime soon.

60

bo_peng OP t1_j8qhad9 wrote

Thank you :) Too busy for that at this moment, but I will get a paper out later this year.

26

MysteryInc152 t1_j8oj9qx wrote

Fantastic work. Thanks for doing this. Good luck scaling to 24B. I hope more people catch on, because the lack of a limited context length is a game changer.

17

farmingvillein t1_j8p269l wrote

> I hope more people catch on, because the lack of a limited context length is a game changer.

I'd be cautious about concluding this, without more testing.

RNNs, in some theoretical sense, support infinite context more easily than N^2 transformers; in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

30

maizeq t1_j8p3f1s wrote

Any papers I can refer to for that last paragraph? I expect it is true but would love to see some empirical work.

8

farmingvillein t1_j8p7qa8 wrote

Any of the papers that address building NLP models for long contexts will tend to have a relevant related-works section. E.g., https://arxiv.org/pdf/2109.00301.pdf.

(The one qualifier here is that, at "modern" scale, RNNs have not really been well-tested (since people tend to just use...transformers). So, maaaybe they are actually simply superior. Evidence so far says "doubtful", however (at least for more vanilla implementations).)

12

gwern t1_j8ph8as wrote

I don't think the Related Works section of that paper provides any useful references. It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

5

farmingvillein t1_j8piz80 wrote

Not clear to me what you are looking for here.

> It simply provides doodads people claim help memory without papers showing that the memory doesn't work.

The very first reference I pulled, Graves 2014, specifically compares w/ and w/o memory.

Or Dai et al., who try to compare against various RNN-style baselines with similar parameter counts.

Perhaps we're talking past each other?

2

gwern t1_j8psc8m wrote

> Not clear to me what you are looking for here.

The question asked was pretty clear: to justify the statement:

>> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

Simply comparing RNNs with memory and RNNs without memory doesn't tell you anything about how fast the memory fades out, or whether it ever winds up being effectively larger than a Transformer's. For example, you could construct a toy problem which requires memory reaching back exactly 1 state, and show that an arch with any memory outperforms a memoryless arch; this would obviously tell you nothing of interest, like 'this memory makes little use of history further back than 50 steps and none past 200 (and so is easily outperformed by history-stacking like a Transformer)'. Nor does comparing a Transformer with a history of, say, l=500 and an RNN, and the Transformer winning, tell you anything about why the RNN lost - OK, the Transformer did better, great, we have a superior new tool, but why? Maybe it has similar memory problems and is just way better at the modeling part, or memorizes better, or something entirely different.

Likewise, unless you are comparing RNN baselines which somehow have known hard history constraints, they cannot tell you anything useful about how fast the effective memory fades out, how the accuracy of the memory is 'distributed' over the effective context window, if there are hard cutoffs, if the RNN is basically only using the last few states and so on.

In contrast, a Transformer has direct shortcut access to the history (we don't need any paper to know this; literally any GPT output exhibiting coherent long-range references past a few paragraphs demonstrates it directly), and so if you show that an RNN uses primarily the past 50 steps and simply 'fades out' completely past 200 steps, and so the 'infinite history' is meaningless in practice, well, we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show), so a direct comparison is otiose. Directly examining an RNN's understanding of its history, as those papers do, is much better than some higher-level performance comparison, which is what most of those referenced papers do; direct performance comparisons are great, but do not ablate where the problem is on the RNN's end. (Although if I really needed one, I would prefer to point at the RNN vs. Transformer scaling laws in context window anyway, like Kaplan et al. 2020 IIRC, to show that the Transformers are making good use of it, not merely some sort of better-than-RNN use or gains elsewhere.)
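To make the toy-problem idea above concrete, here is one way such a probe is often set up: the target can only be predicted by recalling a token exactly `lag` steps back, so accuracy as a function of lag traces out the effective memory length independently of general modeling quality. The sketch below is illustrative only (the names and interface are not from any of the cited papers):

```python
import numpy as np

def make_recall_probe(n_examples, seq_len, lag, vocab=100, seed=0):
    """Random-token sequences where the target is the token `lag` positions
    before the final position, so solving the task requires memory that
    reaches back at least `lag` steps."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, vocab, size=(n_examples, seq_len))
    y = x[:, -1 - lag]
    return x, y

def memory_curve(predict, lags, seq_len=512, n_examples=256):
    """Accuracy vs. lag for any `predict(x) -> predicted targets` function.
    A curve that collapses past some lag is the 'memory fades out' picture."""
    curve = {}
    for lag in lags:
        x, y = make_recall_probe(n_examples, seq_len, lag)
        curve[lag] = float((predict(x) == y).mean())
    return curve
```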

4

farmingvillein t1_j8qipd4 wrote

Let's think step by step:

You:

> I don't think the Related Works section of that paper provides any useful references.

Your own response to the question that was posed:

> https://arxiv.org/abs/1805.04623
> https://arxiv.org/abs/1702.04521

There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

E.g., "Sharp Nearby, Fuzzy Far Away" is directly discussed in the cited "Transformer-XL":

> Empirically, previous work has found that LSTM language models use 200 context words on average (Khandelwal et al., 2018), indicating room for further improvement

> Simply comparing RNNs with memory and RNNs without memory doesn't tell you anything about how fast the memory fades out, or whether it ever winds up being effectively larger than a Transformer's

I never said this, so I'm not sure what your argument is.

> we know perfectly well that Transformers make excellent use of context windows larger than 50 or 200 tokens (as my two references show)

Neither of the papers you link to (assuming you are talking about your own comment at https://www.reddit.com/r/MachineLearning/comments/1135aew/r_rwkv4_14b_release_and_chatrwkv_a_surprisingly/j8pg3g7/) makes any reference to Transformers.

If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're once again arguing (as you seem to be in your entire post) against a strawman. Re-read what I actually wrote:

> in practice, their effective "context window" often doesn't look much different than a reasonable transformer, when we look at performance metrics against long sequences.

My statement here is an empirical one around performance--which, among other things, is why I reference Dai et al., who (among others!) do a fairly extensive breakdown of empirical performance differences of RNN- versus transformer-type architectures against long text sequences.

The whole point is that a poster said that RNNs were attractive because of the theoretical infinite context--but my response was that 1) we don't really see that in practice when we try to measure it directly (as both of our sources point out), and 2) we don't see evidence of superior long-distance behavior when testing against real-world(ish) data sets that should theoretically reward it. Both of these points are covered if you follow the reference I shared (or, as I noted, most reasonable "long-distance transformer" papers).

(As with all things research...someone may come out with a small modification tomorrow that invalidates everything above--but, for now, it represents the broad public (i.e., non-private) understanding of architecture behaviors.)

−1

gwern t1_j8s2du5 wrote

> There is no possible way that you actually read the Related Works section you dismissed, given that the papers you cited are already covered in the same references you dismissed.

Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest, and no, I did not recurse down n deep in a breadth-first search. I read the Related Works of that paper, as I said ("I don't think the Related Works section of that paper"), noted that they were a bunch of memory-related papers which might or might not cite the actually relevant research I had in mind, but life was too short to queue up a dozen papers just to check their RW when I already knew some useful ones. Giving someone a random reference and telling them to manually crawl the literature is not helpful. In contrast, the two references I provided directly bore on the question, they didn't maybe cite papers which might bury something relevant in a footnote or cite papers which might someday answer the question...

> I never said this, so I'm not sure what your argument is.

I was pointing out why it was irrelevant to bring up a paper which "compares w/ and w/o memory." Mildly interesting but such a comparison cannot show what was asked about the effective memory of RNNs. Of course it is better to have (any) memory than not.

> which, among other things, is why I reference Dai et al, who (among others!) do a fairly extensive breakdown of empirical performance differences of RNNs- versus transformer-type architectures against long text sequences.

Dai would in fact have been useful, had you referenced it in your original comment. Unless by 'referenced' you mean 'vaguely gestured in the direction of a paper which has 50+ references, with 35 in the RW section alone, any of which could have been relevant; where the relevant benchmarking in Dai was not highlighted in the paper to begin with; and where the relevant context work is not mentioned in the abstract of Dai but buried at the end of the paper (with the RNN results hidden inside a table), so you just have to know it's already there.' Then sure, yeah, that was a useful reference. Thanks for the input.

> If your claim is that the papers indicated that RNNs have a small window (sure) and that Transformers have a longer one, you're arguing (as you seem to be in your entire post) again against a strawman.

It's not a strawman. It's not obvious a priori that Transformers would work so much better, or that RNN histories fade out so fast, which is why it had to be empirically established that the history fades out completely, as opposed to any of the other reasons that RNNs could underperform (maybe they have history but can't learn a good algorithm for exploiting their memory, say, or they could but are poorly optimized - there are so many ways for NNs to break), and people were surprised by how well Transformers work. It is completely understandable that OP would expect RNN history to work better than it does, and would want some hard, citeable evidence that it works so badly that Transformers, with their apparently brutal hard cutoff, wind up with something much closer to 'infinite context' than RNNs themselves.

Thus, it's useful to provide references showing that. (Not references to unspecified references which may or may not show that - gl.)

1

farmingvillein t1_j8s7ygo wrote

This...is pretty astounding. Just have the grace to admit you were wrong, and move on.

> Telling someone to read the Related Works section of every one of a dozen papers in the Related Works section of a paper is a ridiculous thing to suggest

Then how can you possibly say:

> I don't think the Related Works section of that paper provides any useful references.

?

This is hardcore trolling. You can, and frequently do, do better than this.

You are literally pushing posts that are factually incorrect, and that you either know are factually incorrect, or are too lazy to validate either way.

This is the type of thing which blows up post quality in this sub.

> Giving someone a random reference and telling them to manually crawl the literature is not helpful.

This...is ridiculous. This is--traditionally--a very academic-friendly sub. This is how research works. "Here is where you can start a literature review on a bundle of related papers" is an extremely classic response which is generally considered helpful to complex and nuanced questions.

And the underlying issue is actually very complex, as evidenced in part by the fact that your references do not actually answer the question. "Go read the related works" can be obnoxious when there are one or two papers that do answer the question--but that is not the case here.

> In contrast, the two references I provided directly bore on the question

No they did not. They did not touch at all upon Transformers versus RNNs, which was the question. You've chosen to cherry-pick one slice of the problem and declare victory.

> It's not a strawman.

You don't seem to understand what a strawman is. Strawman:

> an intentionally misrepresented proposition that is set up because it is easier to defeat than an opponent's real argument.

I was not making this argument. You were making this argument. QED, this is a strawman.

2

bo_peng OP t1_j8qhiyk wrote

RWKV is the exception. When you look at loss against token position, it is comparable with transformers.

You can tell that from the generation results too.
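For anyone who wants to check a claim like this themselves, the usual recipe is to average next-token loss by position over long held-out documents and see whether the curve keeps improving as more context accumulates. A rough sketch follows; the model/tokenizer interface here is a placeholder, not the actual RWKV evaluation code:

```python
import numpy as np
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_by_position(model, tokenize, docs, ctx_len=4096):
    """Mean next-token cross-entropy at each position, averaged over docs.

    If the curve keeps dropping at large positions, the model is still
    extracting signal from distant context; if it flattens early, the
    'infinite context' is not doing much work in practice.
    """
    totals = np.zeros(ctx_len)
    counts = np.zeros(ctx_len)
    for doc in docs:
        ids = tokenize(doc)[: ctx_len + 1]
        if len(ids) < 2:
            continue
        x = torch.tensor(ids[:-1]).unsqueeze(0)   # (1, T) input tokens
        y = torch.tensor(ids[1:]).unsqueeze(0)    # (1, T) next-token targets
        logits = model(x)                         # (1, T, vocab) assumed
        loss = F.cross_entropy(logits[0], y[0], reduction="none")  # (T,)
        t = loss.shape[0]
        totals[:t] += loss.cpu().numpy()
        counts[:t] += 1
    return totals / np.maximum(counts, 1)
```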

3

farmingvillein t1_j8qj1u7 wrote

> RWKV is the exception. When you look at loss against token position, it is comparable with transformers.

Can you link to what you are referring to? If I missed it in the OP post, my apologies.

2

csreid t1_j8p5z30 wrote

But they theoretically support infinite context length. Getting it in practice is a problem to be solved, not a fundamental incompatibility as it is with transformers.

2

farmingvillein t1_j8p7lci wrote

Neither really works for super long contexts, so it is kind of a moot point.

Both--empirically--end up with bolt-on approaches to enhance memory over very long contexts, so it isn't really clear (a priori) that the RNN has a true advantage here.

8

rapist1 t1_j8ppons wrote

Could you please write up the methods of RWKV as an arXiv paper, a standalone README, or even a blog post? I have read the description on the GitHub repository, and it is very scattered and hard to read.

16

MustBeSomethingThere t1_j8r2v8m wrote

I ran the model "RWKV-4-Pile-14B-20230213-8019" on CPU with 64 GB of main RAM (not on GPU). It fits in that much RAM, and the answer below took 5 minutes to generate (in my opinion not bad for a 6-core CPU, and CPU utilization was only about 30%). Below are the question and answer:

User: +gen \nAsk Expert\n\nQuestion:\nWhat are some good plans to get rich fast?\n\nExpert Full Answer:\n

From living in developing countries to be a millionaire. The reality is that is a long process that requires a lot of planning and effort. Here is a basic outline of the process.

  1. Start out with the right mindset

You need to have a good mindset in place. The ability to handle failure and accept challenges are going to be essential. If you can handle that then you can go through the journey of trying to be rich. It's like going on a diet or working out for the first time. If you have a negative mindset then you are never going to get past the first few steps.

  1. Make sure you have a work ethic that is impeccable

Make sure you have a work ethic that is impeccable. This is because it will require many hours of work over a long period of time to build up a large fortune. You can't expect to get rich overnight. Just like any successful business it will take you years to get the rewards
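Why a 14B recurrent model can run like this on CPU at all: generation carries only a fixed-size state between steps, with no growing KV cache, so per-token compute and memory stay constant no matter how long the output gets. A rough sketch of the loop, using a placeholder `model.forward(tokens, state) -> (logits, state)` interface rather than ChatRWKV's exact API:

```python
import torch

@torch.no_grad()
def generate_on_cpu(model, tokenizer, prompt, n_tokens=200):
    """Token-by-token sampling with a recurrent model on CPU."""
    tokens = tokenizer.encode(prompt)
    logits, state = model.forward(tokens, None)        # ingest the prompt
    generated = []
    for _ in range(n_tokens):
        probs = torch.softmax(logits.float(), dim=-1)  # logits for last token
        token = int(torch.multinomial(probs, num_samples=1))
        generated.append(token)
        logits, state = model.forward([token], state)  # O(1) work per token
    return tokenizer.decode(generated)
```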

7

waffles2go2 t1_j8p9q79 wrote

Could you explain the table highlighting?

4

redv t1_j8opi69 wrote

Is it possible to run this on a laptop using the CPU and with less than 16 GB of RAM? If yes, how does one do this? Thanks.

1

avocadoughnut t1_j8p3psq wrote

He has trained several smaller RWKV models. You can find them on Hugging Face.

10

Kiseido t1_j8qfm0j wrote

What version of Python is used for this project? I cannot find a number anywhere!

1

hfnuser0000 t1_j8qoshn wrote

I am interested in the theoretical aspect of how your model works. Take transformers: you have tokens that attend to other tokens. In the case of RNNs, a piece of information can be preserved for later use, but at the cost of reducing the memory capacity available for other information, and once the information is lost, it's lost forever. So I think the context length of an RNN scales linearly with its memory capacity (and indirectly with the number of parameters), right?

1

syb3ria t1_j8u9p09 wrote

Thanks for sharing your work, OP. How would you compare it to BLOOM?

1

Gody_Godee t1_j91z1cm wrote

Could you test it against LRA (Long Range Arena), please?

1