Submitted by derpderp3200 t3_zwd49c in MachineLearning
I don't remember where I read about this, but it left a lasting impression on me because it feels intuitively true and impactful: in a sense, the update on each datapoint pulls the network toward encoding that individual example, and shared features only emerge stochastically. That emergence in turn depends on a dataset-to-model-size ratio that prevents overfitting, and on a balanced dataset.
Has there been any research into counteracting this phenomenon, such as more purposeful feature extraction, clever batching schemes, synthetic datapoints, or anything else along those lines?
magpiesonskates t1_j1u1oag wrote
This is only true if you use a batch size of 1. Randomly sampled batches should average out the effect you're describing.
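A minimal numpy sketch of this point, using a hypothetical toy linear model with squared-error loss (all data here is made up): the gradient computed on a batch is just the mean of the per-example gradients, so the idiosyncratic pull of any single example toward being memorized gets diluted by the rest of the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 toy examples, 3 features
y = rng.normal(size=8)       # toy targets
w = rng.normal(size=3)       # current weights

def per_example_grad(x, t, w):
    # Gradient of 0.5 * (x @ w - t)**2 with respect to w:
    # each example "pulls" w in its own direction.
    return (x @ w - t) * x

# Stack the individual per-example pulls.
pulls = np.stack([per_example_grad(X[i], y[i], w) for i in range(len(y))])

# Vectorized gradient of the mean batch loss.
batch_grad = X.T @ (X @ w - y) / len(y)

# The batch gradient is exactly the average of the per-example pulls.
print(np.allclose(pulls.mean(axis=0), batch_grad))  # True
```

With batch size 1 each step follows one example's pull directly; with a larger randomly sampled batch, only the directions shared across examples survive the averaging.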