Submitted by hopedallas t3_zmaobm in MachineLearning
I am working on a problem where the ratio of negative/0 labels to positive/1 labels is 180MM/10MM. The data size is around 25GB and I have >500 features. I certainly don't want to use all 180MM rows of the majority class to train my model, due to computational limitations. Currently, I simply under-sample the majority class. However, I have been reading that this may cause loss of useful information or make it harder to determine the decision boundary between the classes (see https://machinelearningmastery.com/undersampling-algorithms-for-imbalanced-classification/). When I under-sample, I try to make sure that the distribution of my data stays the same. I am wondering if there is a better way to handle this?
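For concreteness, the current approach might look something like the sketch below. It keeps all positives and randomly samples the negatives; the DataFrame `df`, the `label` column, and the helper name are all hypothetical, not the actual pipeline:

```python
import pandas as pd

def undersample_majority(df: pd.DataFrame, ratio: float = 4.0, seed: int = 0) -> pd.DataFrame:
    """Keep all positives and a random sample of `ratio` negatives per positive."""
    pos = df[df["label"] == 1]
    neg = df[df["label"] == 0].sample(n=int(len(pos) * ratio), random_state=seed)
    # Shuffle so the two classes are interleaved for downstream training.
    return pd.concat([pos, neg]).sample(frac=1.0, random_state=seed)
```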
Far-Butterscotch-436 t1_j0a9083 wrote
5% imbalance isn't bad. Just use a cost function that accounts for the imbalance, e.g. a weighted binomial deviance (class-weighted log loss), and you'll be fine.
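For example, scikit-learn exposes this as `class_weight`, which reweights each class inversely to its frequency, so no rows have to be thrown away. A minimal sketch on a tiny synthetic stand-in for the real 190MM-row dataset (with 25GB of data you'd likely want an out-of-core or gradient-boosted learner instead, but the weighting idea is the same):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in with ~5% positives (shapes and names are illustrative).
X, y = make_classification(n_samples=10_000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)

# class_weight="balanced" upweights the minority class inversely to its
# frequency, i.e. a weighted binomial deviance objective.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X, y)
```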
Also, you can build a downsampling ensemble and compare its performance, as in the sketch below. Don't downsample to 50/50; aim for at least 10% positives.
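A hedged sketch of that ensemble idea: train several models, each on all positives plus a different random slice of the negatives (9 negatives per positive, i.e. ~10% positives), then average the predicted probabilities. It reuses the hypothetical `undersample_majority` helper sketched earlier; none of these names come from a specific library:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ensemble(df, feature_cols, n_models=5):
    """Fit one model per random majority-class subsample."""
    models = []
    for seed in range(n_models):
        sub = undersample_majority(df, ratio=9.0, seed=seed)  # ~10% positives
        m = LogisticRegression(max_iter=1000)
        m.fit(sub[feature_cols], sub["label"])
        models.append(m)
    return models

def predict_ensemble(models, X):
    # Average the positive-class probabilities across ensemble members.
    return np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

This way each negative example still gets a chance to inform some member of the ensemble, which addresses the information-loss worry about a single undersampled training set.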
You've got a good problem: lots of observations and relatively few features.