Submitted by derpderp3200 t3_zwd49c in MachineLearning
I don't remember where I read about this, but it left a lasting impression on me because it feels intuitively true and impactful: in a sense, the update on each datapoint pulls the network toward encoding that individual example, and shared features only emerge stochastically. That emergence in turn depends on a dataset-to-model-size ratio that prevents overfitting, and on a balanced dataset.
Has there been any research into counteracting this phenomenon, such as more purposeful feature extraction, clever batching schemes, synthetic datapoints, or anything else along those lines?
magpiesonskates t1_j1u1oag wrote
This is only true if you use a batch size of 1. Randomly sampled batches should average out the effect you're describing.
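A minimal numpy sketch of this point, using a hypothetical toy linear model with squared-error loss (all data here is made up): the gradient computed on a batch is just the mean of the per-example gradients, so the idiosyncratic pull of any single example toward being memorized gets diluted by the rest of the batch.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # 8 toy examples, 3 features
y = rng.normal(size=8)       # toy targets
w = rng.normal(size=3)       # current weights

def per_example_grad(x, t, w):
    # Gradient of 0.5 * (x @ w - t)**2 with respect to w:
    # each example "pulls" w in its own direction.
    return (x @ w - t) * x

# Stack the individual per-example pulls.
pulls = np.stack([per_example_grad(X[i], y[i], w) for i in range(len(y))])

# Vectorized gradient of the mean batch loss.
batch_grad = X.T @ (X @ w - y) / len(y)

# The batch gradient is exactly the average of the per-example pulls.
print(np.allclose(pulls.mean(axis=0), batch_grad))  # True
```

With batch size 1 each step follows one example's pull directly; with a larger randomly sampled batch, only the directions shared across examples survive the averaging.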