Oversampling

Oversampling increases the frequency of occurrence of a certain characteristic in your data. For example, assume you are looking for people who will buy race cars, and you are starting with a dataset of 50,000 records of which only 1,000 (2%) are records of car buyers. This percentage is so small that data mining tools cannot train effectively: so few occurrences exist in the dataset that they can be treated as noise and therefore ignored. So there is a need to increase the frequency of occurrence of these rare records.

There are multiple ways to do this. You could retain all 1,000 buyer records and randomly select 9,000 of the nonbuyer records to keep alongside them, discarding the rest. Your oversampled dataset would then contain 1,000 buyers out of 10,000 records (10%). In most cases, this base rate is high enough that machine learning algorithms can effectively learn to detect these records.
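A minimal sketch of this random selection with pandas, assuming a DataFrame with a binary buyer column (the file name and column name are hypothetical):

```python
import pandas as pd

# Hypothetical dataset: 50,000 rows with a binary 'buyer' column
# (1 = buyer, 0 = nonbuyer); file and column names are assumptions.
df = pd.read_csv("race_car_prospects.csv")

buyers = df[df["buyer"] == 1]                 # all 1,000 buyer records
nonbuyers = df[df["buyer"] == 0].sample(      # randomly keep 9,000 of the
    n=9_000, random_state=42                  # 49,000 nonbuyer records
)

# 1,000 buyers out of 10,000 rows gives the 10% base rate.
oversampled = pd.concat([buyers, nonbuyers]).sample(frac=1, random_state=42)
```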

Now consider the case where you have less data to work with, so you cannot drastically reduce the number of nonbuyers in the dataset. Assume a dataset of 10,000 records of which only 200 (2%) are buyers. You can take the buyer records and put them in the dataset five times, giving you 1,000 buyer records. Added to the 9,800 nonbuyer records, this yields a base rate of 1,000/(9,800 + 1,000) = 9.26%.
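Continuing the same sketch, the duplication approach could look like this (again assuming the hypothetical buyer column and an illustrative file name):

```python
import pandas as pd

# Hypothetical smaller dataset: 10,000 rows, 200 buyers.
df = pd.read_csv("small_prospects.csv")

buyers = df[df["buyer"] == 1]
nonbuyers = df[df["buyer"] == 0]

# Repeat every buyer record five times: 200 -> 1,000 buyer rows.
oversampled = pd.concat([nonbuyers] + [buyers] * 5).sample(frac=1, random_state=42)

# 1,000 / (9,800 + 1,000) = 0.0926, the 9.26% base rate from the text.
print(oversampled["buyer"].mean())
```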

Before oversampling, set aside the original, non-oversampled dataset untouched; make a copy and apply oversampling to the copy. Train your model on the oversampled copy, then apply it back to the original dataset. The confusion matrix computed on the original data reflects what you can expect when you apply the model to new, unseen data.
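One way this workflow might look, reusing the df and oversampled frames from the sketches above and assuming purely numeric feature columns (the choice of classifier is illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Train on the oversampled copy...
model = LogisticRegression(max_iter=1000)
model.fit(oversampled.drop(columns="buyer"), oversampled["buyer"])

# ...but evaluate on the untouched original, so the confusion matrix
# reflects the true 2% base rate you will see on new, unseen data.
predictions = model.predict(df.drop(columns="buyer"))
print(confusion_matrix(df["buyer"], predictions))
```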

Python has useful routines for oversampling; beyond the manual pandas approaches above, packages such as imbalanced-learn provide ready-made resamplers.

Over-sampling in Python
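The code listing for this section is not included above; as one possible sketch, the imbalanced-learn package offers a RandomOverSampler that duplicates minority records automatically (the synthetic data and the sampling ratio chosen here are assumptions):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Synthetic stand-in for the buyer data: 10,000 rows, about 2% positives.
X, y = make_classification(
    n_samples=10_000, weights=[0.98], flip_y=0, random_state=42
)
print(Counter(y))  # roughly Counter({0: 9800, 1: 200})

# Randomly duplicate minority records until they reach 10% of the
# majority count, comparable to the manual approaches above.
ros = RandomOverSampler(sampling_strategy=0.1, random_state=42)
X_res, y_res = ros.fit_resample(X, y)
print(Counter(y_res))  # minority class grown to about 980 records
```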