
Nameless1995 t1_iyz4iie wrote

Yes, this is partly done in semi-recurrent Transformers. The model has to decide which information to store in the compressed, chunk-wise recurrent memory for future use.

What you have in mind is probably closer to a form of "long-term memory", while, arguably, what the semi-recurrent transformer models is better described as short-term memory (although S4 itself can model strong long-term dependencies; I'm not sure how that would translate to more complex real-world settings), i.e. it recurrently updates some k vectors, which act as a short-term or working memory. In theory this short-term memory, as implemented in semi-recurrent transformers, may still give access to information from far back in the past, so "short-term" may be a misnomer (perhaps "working" memory is the better term). The limitation is that its bandwidth is still low (analogous to our own working memory): everything beyond the chunk window has to be compressed into those k vectors. This may suffice for practical uses like a conversation lasting a few hours, but perhaps not for "life-time agents" building up their own profile through a lifetime of memory (I would be skeptical that our "slow" memory of salient things we have experienced throughout life can be compressed into a few vectors of a recurrent memory).
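
To make the "compress the past into k vectors" idea concrete, here is a minimal numpy sketch of one chunk-wise recurrent update, roughly in the spirit of the semi-recurrent transformer papers. The function names, shapes, and the single-head attention pooling are my own illustrative assumptions, not any specific paper's architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def update_memory(memory, chunk, W_q, W_k, W_v):
    """One recurrent step: compress a chunk of token states into k memory slots.

    memory: (k, d) current compressed memory
    chunk:  (c, d) token representations for the current chunk
    The k slots attend over [memory; chunk], so the model has to decide what
    from the past and the present is worth keeping in the fixed-size memory.
    """
    context = np.concatenate([memory, chunk], axis=0)    # (k + c, d)
    q = memory @ W_q                                     # (k, d)
    keys = context @ W_k                                 # (k + c, d)
    vals = context @ W_v                                 # (k + c, d)
    attn = softmax(q @ keys.T / np.sqrt(q.shape[-1]))    # (k, k + c)
    return attn @ vals                                   # new (k, d) memory

# toy run: d=16, k=4 memory slots, chunks of c=8 tokens
rng = np.random.default_rng(0)
d, k, c = 16, 4, 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
memory = np.zeros((k, d))
for chunk in rng.normal(size=(5, c, d)):   # 5 chunks of a longer sequence
    memory = update_memory(memory, chunk, W_q, W_k, W_v)
# `memory` is now a fixed-size (k, d) summary of everything seen so far
```

However far back the sequence goes, the summary stays (k, d), which is exactly the low-bandwidth bottleneck described above.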

However, aspects of the solution to that problem are also here. For example, the Memorizing Transformers paper (https://arxiv.org/abs/2203.08913) that I mentioned already allows kNN retrieval over the model's whole past representations, which can be a lifetime of conversational history without compression. Basically, in this case everything is stored but only relevant things are retrieved as needed via kNN, so the burden of learning what to "store" is removed and the main burden shifts to retrieval: finding the top-k relevant items from memory. If we need to bound total memory, we can add an adaptive deletion mechanism on top, for example one based on surprisal: more surprising information (quantifiable by how hard it is to predict, which NNs can estimate easily) can be made more persistent in memory, i.e. more resistant to deletion. This is similar to retrieval-augmented generation, where the model retrieves information from external sources like Wikipedia, except here the same kind of technique is applied to the model's own past information. The combination of this kNN retrieval with a more local "working memory" (from the semi-recurrent transformer papers) could potentially be much more powerful. I think most of the elementary tools for making some uber-powerful model (leaps beyond GPT) are already here; the challenge is in engineering -- making a scalable solution given limited computation and developing an elegant integration (but with the rise in brute computational power, those challenges will only grow weaker even if we don't come up with many new concepts on the modeling side).
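
And a minimal sketch of the "store everything, retrieve only what's relevant" side, with an optional surprisal-gated eviction rule for when total memory has to be bounded. The class, the brute-force dot-product kNN, and the eviction rule are illustrative assumptions of mine; Memorizing Transformers itself does approximate kNN over cached key/value pairs inside an attention layer:

```python
import numpy as np

class RetrievalMemory:
    """Sketch of kNN retrieval over a model's own past states,
    with surprisal-gated eviction when capacity is bounded."""

    def __init__(self, capacity=10_000):
        self.keys, self.values, self.surprisals = [], [], []
        self.capacity = capacity

    def store(self, key, value, surprisal):
        # surprisal ~ -log p(token) under the model; higher means harder
        # to predict, so more worth keeping around.
        self.keys.append(key)
        self.values.append(value)
        self.surprisals.append(surprisal)
        if len(self.keys) > self.capacity:
            self._evict()

    def _evict(self):
        # drop the least surprising entry instead of simply the oldest one
        i = int(np.argmin(self.surprisals))
        for buf in (self.keys, self.values, self.surprisals):
            buf.pop(i)

    def retrieve(self, query, top_k=8):
        # brute-force kNN by dot product; a real system would use an
        # approximate index instead of scanning everything
        K = np.stack(self.keys)                  # (n, d)
        scores = K @ query                       # (n,)
        idx = np.argsort(scores)[-top_k:][::-1]
        return [self.values[i] for i in idx]

# usage: store every past hidden state, retrieve only what the current
# query needs (the "store everything, retrieve selectively" idea above)
rng = np.random.default_rng(0)
mem = RetrievalMemory(capacity=100)
for _ in range(500):
    h = rng.normal(size=32)
    mem.store(key=h, value=h, surprisal=float(rng.exponential()))
neighbours = mem.retrieve(query=rng.normal(size=32), top_k=4)
```

In a real system the brute-force search would be replaced by an approximate index (e.g. FAISS or ScaNN), and the retrieved values would be fed back into attention alongside the local working memory from the sketch above.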
