Submitted by Awekonti t3_zqitxl in MachineLearning

Hey! I am currently reading papers on deep-learning-based recommender systems. After around 20 papers, I've realised the base idea is the same across them - the task is either Top-K recommendation or simply predicting the utility (I am not talking about frameworks that only model auxiliary information). The papers differ in their base models (I am reading DNN/MLP-, autoencoder- and attention-based ones), but the methodology is the same: replace the way the matrix is factorized to find the latent feature vectors of users/items/social relations, and only some papers introduce a custom loss function with regularisation terms (mostly just to model the social network, I would say). And all these models claim "state-of-the-art" performance. My question is: where is this research field going/developing? All these findings/performance results are purely empirical, with no theoretical evidence.
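To make concrete what I mean by "the same base idea", here is a minimal toy sketch (my own illustration, not from any particular paper): predict the utility as an inner product of learned user/item vectors, trained with a regularised loss. Most of the papers just swap out how those vectors are produced (MLP, autoencoder, attention) or which extra terms go into the loss.

```python
# Toy sketch of the shared core: score(u, i) = <p_u, q_i>, MSE loss + L2 regularisation.
import torch
import torch.nn as nn

class ToyMF(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.user = nn.Embedding(n_users, dim)
        self.item = nn.Embedding(n_items, dim)

    def forward(self, u, i):
        return (self.user(u) * self.item(i)).sum(-1)  # predicted utility

model = ToyMF(n_users=1000, n_items=5000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)  # L2 term

u = torch.randint(0, 1000, (256,))
i = torch.randint(0, 5000, (256,))
r = torch.rand(256) * 5                      # fake ratings, just for the sketch
loss = nn.functional.mse_loss(model(u, i), r)
opt.zero_grad(); loss.backward(); opt.step()
```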

56

Comments


__lawless t1_j0ymvpl wrote

There are some methods that are matrix-factorization free, like sequential recommendation models.

2

Awekonti OP t1_j0yroqs wrote

But most of the papers focus on collaborative filtering -> MF. My assumption is that this approach still has potential - please tell me whether that is a right assumption.

2

__lawless t1_j0yuzij wrote

It really depends on the problem/situation. Of course MF is very powerful and interpretable despite being very simple, but for my job we did not have success with it; on the other hand, we had great success with transformer-based sequential recommendation models. Another method you want to look into is GNNs. We did not invest in them because they are harder (not impossible) to scale. For example, look into PinSage.
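A rough sketch of what "transformer-based sequential recommendation" usually means here (a SASRec-style toy, not the commenter's actual model): causally encode the user's item history with self-attention, then score every item as the "next" one.

```python
# Toy SASRec-style sequential recommender; all sizes are made up for the example.
import torch
import torch.nn as nn

class TinySeqRec(nn.Module):
    def __init__(self, n_items, dim=64, max_len=50):
        super().__init__()
        self.item_emb = nn.Embedding(n_items + 1, dim, padding_idx=0)  # id 0 = padding
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, seq):                        # seq: (batch, max_len) item ids
        L = seq.size(1)
        pos = torch.arange(L, device=seq.device)
        h = self.item_emb(seq) + self.pos_emb(pos)
        causal = torch.triu(torch.full((L, L), float("-inf"), device=seq.device), diagonal=1)
        h = self.encoder(h, mask=causal)           # causal self-attention over the history
        return h[:, -1] @ self.item_emb.weight.T   # next-item scores (includes padding id)

model = TinySeqRec(n_items=10_000)
scores = model(torch.randint(1, 10_000, (8, 50)))  # (8, 10001) next-item logits
```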

2

cnapun t1_j0z2van wrote

User behavior is pretty stochastic and not really well captured in the datasets available to academia. There's also a second class of papers that explore ranking more than candidate generation, which imo are usually more interesting, but also harder to find good data for in academia.

I take all results in papers discussing embeddings/two-tower models (for retrieval) with a grain of salt because, in my experience, the number one thing that matters for these in practice is negative sampling (but people rarely do ablations on this; see this paper, which shows how metric learning hasn't really progressed as much as papers would have you think). They can still be good to read for ideas, though.
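For context, a minimal sketch of the two-tower retrieval setup being discussed, with in-batch softmax negatives - the place where the negative-sampling choice tends to matter more than the tower architecture. Feature dimensions and the temperature are made up for the example.

```python
# Toy two-tower retrieval model trained with in-batch softmax negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, n_features, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)   # unit-norm embeddings

user_tower, item_tower = Tower(100), Tower(300)

users = torch.randn(512, 100)                     # a batch of (user, positive item) pairs
pos_items = torch.randn(512, 300)
u, v = user_tower(users), item_tower(pos_items)

logits = (u @ v.T) / 0.05                         # every other item in the batch is a negative
labels = torch.arange(u.size(0))                  # diagonal = the true positive
loss = F.cross_entropy(logits, labels)            # in-batch softmax loss
```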

18

micro_cam t1_j0zxjpw wrote

So any recommender system at scale is going to need to generate something you can use to retrieve content.

Embeddings and approximate nearest neighbor lookup systems are a popular way to do this at an infrastructure level, with vector databases like Milvus. Most papers are targeting these systems and end up including an inner product of some embeddings, and thus look a lot like classic matrix factorization.

(If you're a mathematician you might say all deep learning looks like matrix factorization, just with some of it additive or otherwise not an inner product.)
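A toy illustration of that retrieval pattern: serving reduces to an inner-product top-k lookup over item embeddings, which is exactly the matmul an ANN index such as Milvus or Faiss approximates at scale. Brute force here, with made-up sizes.

```python
# Brute-force inner-product top-k over item embeddings (the thing an ANN index speeds up).
import numpy as np

rng = np.random.default_rng(0)
item_emb = rng.normal(size=(100_000, 64)).astype(np.float32)  # catalogue embeddings
user_emb = rng.normal(size=(64,)).astype(np.float32)          # one user/query embedding

scores = item_emb @ user_emb                    # an MF-style inner-product score
top_k = np.argpartition(-scores, 20)[:20]       # 20 best item ids, unordered
top_k = top_k[np.argsort(-scores[top_k])]       # sorted by score
```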

One cool paper from ByteDance (TikTok's owner) that didn't generate much buzz tried to use deep learning to generate discrete retrieval codes. Apparently this works well enough to be used in production at ByteDance, but the reviews and rejection of this paper are a great example of how hard it is to really set up good comparisons on public datasets.

Another cool area is multi-modal learning, like here... This can still be an inner-product embedding, but with some additional structure to allow multiple interests.

I wouldn't put that much weight on papers though. There isn't a lot of theory in this field, and a lot of the stuff that gets published couldn't be used in production for performance reasons. And really good stuff might not be published at all, as it provides a competitive edge.

And in practice, things like how you sample your data, how often you retrain, which features you can get into the model, and how quickly you update data make a much bigger difference than model architecture.

3

ParsnipIntrepid9234 t1_j0zxmtp wrote

Hey, I agree with you - it looks like the field is still not mature enough.

Btw, it looks like there is no open-source code available for robust recommendation systems.

By that I mean one that can take into account item features and user features, and that handles warm-start scenarios (new users with items) or cold-start scenarios (new users with no items, only their features).

If you guys have anything besides LightFM (not good for new users not present in the training data) or Polara's HybridSVD (not maintained anymore), that would be awesome :)
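For what it's worth, the hybrid/cold-start idea being asked for can be hand-rolled in a LightFM-like way: represent a user as the sum of the embeddings of their features, so a brand-new user with zero interactions can still be scored. A minimal sketch (toy dimensions and feature ids, not a drop-in library):

```python
# Hand-rolled hybrid scorer: users and items are sums of their feature embeddings.
import torch
import torch.nn as nn

class HybridScorer(nn.Module):
    def __init__(self, n_user_feats, n_item_feats, dim=32):
        super().__init__()
        self.user_feat_emb = nn.Embedding(n_user_feats, dim)
        self.item_feat_emb = nn.Embedding(n_item_feats, dim)

    def forward(self, user_feats, item_feats):       # tensors of feature ids
        u = self.user_feat_emb(user_feats).sum(-2)   # user = sum of its feature vectors
        v = self.item_feat_emb(item_feats).sum(-2)
        return (u * v).sum(-1)

model = HybridScorer(n_user_feats=500, n_item_feats=2000)
# Cold-start user: no interaction history, only 3 known features (e.g. age bucket,
# country, signup channel), scored against an item described by 4 content features.
score = model(torch.tensor([[3, 41, 207]]), torch.tensor([[10, 55, 900, 1337]]))
```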

−2

domestication_never t1_j107cnt wrote

I made a dumb-ass recommendation system using a graph database. The graph had links for all the "likes", relationships, and anything else I could glean about the demographics of the user. Nodes were the items of interest and the users. The recommendations were just a walk around the user object, ranked by how many times it hit an "interest" node.

I thought at the time: If I could ML better, here is where it'd go. ML over the graph to draw inferences. But, alas, I was much too stupid to try it.
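A rough reconstruction of the walk-and-count approach described above (graph, node names and walk parameters invented for the example): start at a user node, take short random walks over the edges, and rank items by how often the walks land on them.

```python
# Toy random-walk recommender over a dict-based graph.
import random
from collections import Counter

graph = {                                     # adjacency list: node -> neighbours
    "user:alice":  ["item:dune", "user:bob", "tag:scifi"],
    "user:bob":    ["item:dune", "item:matrix"],
    "tag:scifi":   ["item:matrix", "item:alien"],
    "item:dune":   ["tag:scifi", "user:alice", "user:bob"],
    "item:matrix": ["tag:scifi", "user:bob"],
    "item:alien":  ["tag:scifi"],
}

def recommend(user, walks=1000, length=3):
    hits = Counter()
    for _ in range(walks):
        node = user
        for _ in range(length):
            node = random.choice(graph.get(node, [user]))
            if node.startswith("item:"):
                hits[node] += 1                # count every item the walk touches
    already_liked = set(graph[user])
    return [item for item, _ in hits.most_common() if item not in already_liked]

print(recommend("user:alice"))                 # e.g. ['item:matrix', 'item:alien']
```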

0

cnapun t1_j10a9jz wrote

In my experience, negative sampling is super application-dependent (especially for retrieval), sadly. FB had a paper discussing how they train a search retrieval model (no hard negatives), while Amazon used hard negatives combined with easy negatives in product search (the FB paper mentioned they tried this but it didn't help, though they did some other stuff). Both of them use a hinge loss, but other places use softmax more often. I'm a fan of random negatives (and distance-weighted sampling), but eventually we found that mixed negatives + softmax with sample-probability correction work a little better for a lot of cases.

One of the big challenges is that there are so many possible hyperparameters here: do you concatenate negatives or sum losses, how many in-batch negatives do you use, if you have things that are from a different distribution than the positives can you use them as in-batch negatives, what's the ratio of in-batch to random negatives. And depending on the application, different configurations here can yield better or worse results.
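One concrete version of the "mixed negatives + softmax with sample-probability correction" recipe (a sketch of the general idea, not the commenter's exact setup): concatenate in-batch and random negatives and subtract log q(item) from each logit, where q is the probability of that item being sampled as a negative, so popular items aren't over-penalised. All shapes and values below are made up.

```python
# Sampled softmax over mixed (in-batch + random) negatives with logQ correction.
import torch
import torch.nn.functional as F

def corrected_softmax_loss(user_emb, pos_emb, rand_neg_emb, pos_q, rand_q, temp=0.05):
    """user_emb, pos_emb: (B, d); rand_neg_emb: (N, d); pos_q: (B,); rand_q: (N,)."""
    in_batch_logits = user_emb @ pos_emb.T / temp - torch.log(pos_q)     # (B, B)
    rand_logits = user_emb @ rand_neg_emb.T / temp - torch.log(rand_q)   # (B, N)
    logits = torch.cat([in_batch_logits, rand_logits], dim=1)            # concatenate negatives
    labels = torch.arange(user_emb.size(0))                              # diagonal positives
    return F.cross_entropy(logits, labels)

B, N, d = 256, 1024, 64
loss = corrected_softmax_loss(
    torch.randn(B, d), torch.randn(B, d), torch.randn(N, d),
    pos_q=torch.rand(B).clamp(min=1e-6),          # estimated sampling prob of in-batch items
    rand_q=torch.full((N,), 1.0 / 50_000),        # uniform random negatives from a 50k catalogue
)
```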

Some not super-recent papers I can think of:

https://research.google/pubs/pub50257/

https://arxiv.org/abs/1706.07567

https://arxiv.org/abs/2010.14395

https://arxiv.org/abs/1907.00937 (3.2)

https://arxiv.org/abs/2006.11632 (2.2/2.4,6.1)

5