
ReadSeparate t1_iywxpuh wrote

I wonder how feasible it is to use an external database to store/retrieve important information to achieve coherency.

If it’s not, then I guess we’ll have to wait for something to replace Transformers. Perhaps there’s a self-attention mechanism out there which runs in constant time.

3

Nameless1995 t1_iyyl3m5 wrote

Technically, LaMDA already uses an "external database", i.e., external tools (the internet, a calculator, etc.) to retrieve information (a toy sketch of that kind of loop is below the link):

https://arxiv.org/pdf/2201.08239.pdf (Section 6.2)
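Very roughly, the flow is "generate, optionally call a tool, re-generate conditioned on what came back". Everything in this sketch (function names, arguments) is my own invention for illustration, not LaMDA's actual interface:

```python
# Hypothetical tool-dispatch loop in the spirit of LaMDA's toolset (Section 6.2).
# All names below are illustrative stand-ins, not LaMDA's real API.

def generate_with_tools(generate, propose_tool_query, run_tool, prompt, max_rounds=3):
    """generate / propose_tool_query / run_tool are stand-ins for the model and its tools."""
    draft = generate(prompt, evidence=None)
    for _ in range(max_rounds):
        query = propose_tool_query(draft)            # model decides if a tool call would help
        if query is None:                            # nothing to look up -> keep the draft
            return draft
        evidence = run_tool(query)                   # e.g. web search, calculator
        draft = generate(prompt, evidence=evidence)  # re-draft conditioned on the retrieval
    return draft
```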

It doesn't solve /u/ThePahtomPhoton's memory problem (I don't remember what GPT-3's exact solution is), but solutions already exist (just not scaled up to GPT-3 level).

One solution is using a kNN lookup in a non-differentiable manner: https://arxiv.org/abs/2203.08913
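A minimal sketch of what that lookup can look like (my own simplification; the paper also gates this against local attention, which I omit here):

```python
import torch
import torch.nn.functional as F

# Sketch of a non-differentiable kNN memory lookup, in the spirit of
# Memorizing Transformers (arXiv:2203.08913); shapes are simplified.

def knn_memory_attention(q, mem_k, mem_v, top_k=32):
    """q: (n, d) queries; mem_k, mem_v: (m, d) keys/values cached from the past."""
    with torch.no_grad():                              # the retrieval itself is non-differentiable
        _, idx = (q @ mem_k.t()).topk(top_k, dim=-1)   # top-k most similar past entries per query
    k_sel, v_sel = mem_k[idx], mem_v[idx]              # (n, top_k, d)
    scores = (q.unsqueeze(1) * k_sel).sum(-1) / q.shape[-1] ** 0.5
    attn = F.softmax(scores, dim=-1)                   # attend only over the retrieved entries
    return (attn.unsqueeze(-1) * v_sel).sum(dim=1)     # (n, d)

# Example: 4 queries against a memory of 1000 cached vectors.
out = knn_memory_attention(torch.randn(4, 64), torch.randn(1000, 64), torch.randn(1000, 64))
```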

One solution is making Transformers semi-recurrent: process within chunks in parallel, then sequentially pass along a coarse, compressed chunk-level representation. This allows information to be carried forward through the sequential process (rough sketch after the links):

https://arxiv.org/pdf/2203.07852

https://openreview.net/forum?id=mq-8p5pUnEX
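The rough pattern, assuming a Block-Recurrent-style layer (module layout and names are my own simplification, not the exact architecture from either paper):

```python
import torch
import torch.nn as nn

# Sketch of a semi-recurrent layer: attention runs in parallel inside each chunk,
# while a small set of recurrent memory vectors is carried sequentially across chunks.

class SemiRecurrentLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_mem=8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.read_mem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.write_mem = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mem_init = nn.Parameter(torch.randn(1, n_mem, d_model))

    def forward(self, chunks):                        # chunks: (batch, n_chunks, chunk_len, d)
        b = chunks.shape[0]
        mem = self.mem_init.expand(b, -1, -1)         # recurrent state carried across chunks
        outputs = []
        for i in range(chunks.shape[1]):              # sequential over chunks only
            x = chunks[:, i]                          # parallel attention inside the chunk
            x = x + self.local_attn(x, x, x)[0]
            x = x + self.read_mem(x, mem, mem)[0]     # tokens read from the carried memory
            mem = mem + self.write_mem(mem, x, x)[0]  # memory decides what to keep from this chunk
            outputs.append(x)
        return torch.stack(outputs, dim=1), mem

y, final_mem = SemiRecurrentLayer()(torch.randn(2, 5, 32, 256))  # 5 chunks of 32 tokens
```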

Another solution is to augment the Transformer with a state space model; these have shown great promise on the Long Range Arena benchmark (toy recurrence sketched after the links):

https://arxiv.org/abs/2206.13947

https://arxiv.org/pdf/2206.12037

https://arxiv.org/abs/2209.10655
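At the core of those layers is just a linear state-space recurrence, x_k = A x_{k-1} + B u_k, y_k = C x_k; the state acts as a running summary of the whole history. A naive toy version (real S4-style layers use special initializations and a much faster convolutional/scan evaluation):

```python
import torch

# Toy linear state-space scan: the state x compresses everything seen so far
# into a fixed-size vector, one update per step.

def ssm_scan(u, A, B, C):
    """u: (seq_len, d_in); A: (n, n); B: (n, d_in); C: (d_out, n)."""
    x = torch.zeros(A.shape[0])
    ys = []
    for u_k in u:
        x = A @ x + B @ u_k            # constant-size state update per step
        ys.append(C @ x)
    return torch.stack(ys)

y = ssm_scan(torch.randn(16, 4), 0.9 * torch.eye(8), torch.randn(8, 4), torch.randn(2, 8))
```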

13

ReadSeparate t1_iyz1wbt wrote

Awesome comment, thank you, I'm gunna check all of these out. For the external database thing, to clarify, I was wondering if part of the model training could be learning which information to store so that it can be remembered for later. Like for example, in a conversation with someone, their name can be stored in a database and retrieved later when they want to reference the person's name, even if that's not in the context window any longer.

1

Nameless1995 t1_iyz4iie wrote

Yes, this is partly done in the semi-recurrent Transformers. The model has to decide which information to store in the compressed recurrent chunk-wise memory for future use.

What you have in mind is probably closer to a form of "long-term memory", while what the semi-recurrent Transformers model is arguably better described as short-term memory (although S4 itself can capture strong long-range dependencies, I'm not sure how that translates to the more complex real world), i.e., recurrently updating some k vectors (which can serve as a short-term, or working, memory). In theory this short-term memory, as implemented in semi-recurrent Transformers, can still give access to information from far back in the past, so "short-term" may be a misnomer (perhaps "working" memory is the better term); the limitation is that its bandwidth is low (analogous to our own working memory): everything beyond the chunk window has to be compressed into those k vectors. That may suffice for practical use like a conversation lasting a few hours, but perhaps not for "lifetime agents" that build up their own profile through a lifetime of memory (I'm skeptical that our "slow" memory of the salient things we have experienced throughout life can be compressed into a few vectors of a recurrent memory).

However, parts of the solution to that problem are also here. For example, the Memorizing Transformers paper (https://arxiv.org/abs/2203.08913) that I mentioned already allows kNN retrieval over the model's entire past representations, which could be a lifetime of conversational history without compression. Basically, in this setup "everything is stored" but only relevant items are retrieved as needed by the kNN lookup, so the burden of learning what to "store" is removed and the main burden is in retrieval: finding the top-k relevant items from memory. If we need to bound total memory, we can add an adaptive deletion mechanism, for example one based on surprisal: "more surprising" information (quantifiable by how hard it is to predict, which is easy to measure with NNs) can be made more persistent in memory, i.e., more resistant to deletion.

This is similar to retrieval-augmented generation, where the model retrieves information from external sources like Wikipedia, except the same technique is aimed at the model's own past. Combining this kNN retrieval with a more local "working memory" (from the semi-recurrent Transformer papers) could be much more powerful. I think most of the elementary tools for building some uber-powerful model (leaps beyond GPT) are already here; the challenge is in engineering, making a scalable solution under limited computation and developing an elegant integration (but with the rise in brute computational power, the challenges will only get weaker even if we don't come up with many new concepts on the modeling side).
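As a toy illustration of that surprisal-gated deletion (the class and its fields are hypothetical, not something from the papers above):

```python
import heapq

# Bounded external memory that, when full, evicts the entries the model found
# easiest to predict (lowest surprisal), keeping the surprising ones around.

class BoundedMemory:
    def __init__(self, capacity=10_000):
        self.capacity = capacity
        self.entries = []                               # min-heap ordered by surprisal

    def add(self, key_vec, value_vec, surprisal):
        """surprisal: e.g. the model's negative log-likelihood on this span."""
        heapq.heappush(self.entries, (surprisal, id(key_vec), key_vec, value_vec))
        if len(self.entries) > self.capacity:
            heapq.heappop(self.entries)                 # drop the least surprising entry

    def retrieval_candidates(self):
        return [(k, v) for _, _, k, v in self.entries]  # hand these to a kNN lookup
```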

2