Underfitting and Overfitting

Machine learning algorithms attempt to find general patterns in the data. The general pattern is then used to create a model, which in turn is used to predict values of the output variable for given values of the input variables. When a learning algorithm fits too simple a model to the data, it is referred to as underfitting the data. When the learning algorithm creates an overly complex model, it is referred to as overfitting the data.

Underfitting

The image below shows three models created to predict price from size. All three models were created from the same data, and representative observations are shown: observations in the training partition are shown in red, and observations in the test partition are shown in green. The blue line in each graph represents the model generated by the training process.

The left graph shows an example of underfitting. The essential shape of the data in the graph is curvilinear, but the predictor function is a linear function that can only fit a straight line. The modeling method is not complex enough to capture the curvilinear pattern. During learning, a linear function is fit to the data, but it cannot map the data well. Training will result in the best possible fit that a linear model can achieve; however, the approach is not complex enough to model the general pattern in the data.

Figure 2.1: Underfitting vs. overfitting in numerical prediction

This underfitting results in a biased model. Bias is the systematic over- or under-prediction by the model for certain inputs. Comparing the actual data values to the values predicted by the model shows that this model overestimates price when x is small and when x is large, and predicts prices that are too low when x is in the middle of its range.

This underfitting also produces low variance. Variance is the difference between how the model performs on the (unseen) test data and how it performs on the (seen) training data. The underfit model performs poorly on both the training data and the unseen data, so the model fit indicators are very similar for the training partition and the test partition.
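The sketch below illustrates this combination of high bias and low variance in Python with NumPy. The data are simulated for illustration (they are not the data behind the figure): price has a curvilinear relationship with size, but only a straight line is fit to it.

```python
# A minimal sketch with simulated data: fit a straight line to curvilinear
# price-vs-size data and compare the error on the two partitions.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    size = rng.uniform(0.5, 3.5, n)              # size in 1,000s of sq. ft. (simulated)
    price = 50 + 120 * size - 20 * size**2       # curved "true" pattern, in $1,000s
    return size, price + rng.normal(0, 5, n)     # plus random noise

x_train, y_train = make_data(30)
x_test, y_test = make_data(30)

# A degree-1 polynomial (a straight line) is too simple for the curved pattern.
line = np.polyfit(x_train, y_train, deg=1)

def rmse(coeffs, x, y):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Both errors come out high (the bias) but close to each other (the low variance).
print(f"train RMSE: {rmse(line, x_train, y_train):.1f}")
print(f"test  RMSE: {rmse(line, x_test, y_test):.1f}")
```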

Underfitting can also occur when predictive input variables are missing. In the present example, size is all that is necessary to predict price. But consider a different situation where price depends on both size and quality. If we have only one of these predictors, the model will be underfit.

The remedy for underfitting is twofold: 1) use machine learning algorithms that can recognize and model more complex relationships, and 2) give the learning algorithms the relevant inputs that will allow for the prediction of price. Failure to do this will result in bias.
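As a rough illustration of the second remedy, the hypothetical sketch below simulates a case where price depends on both size and a quality rating. The model given only size carries the error left by the missing predictor; adding quality removes most of it. The variable names and coefficients are invented for illustration.

```python
# A minimal sketch with simulated data: compare a size-only model with a
# size-plus-quality model when price truly depends on both.
import numpy as np

rng = np.random.default_rng(3)

n = 200
size = rng.uniform(0.5, 3.5, n)       # size in 1,000s of sq. ft. (simulated)
quality = rng.uniform(1, 10, n)       # hypothetical quality rating
price = 30 + 60 * size + 12 * quality + rng.normal(0, 5, n)

def fit_rmse(X, y):
    # Ordinary least squares fit, then root-mean-squared error of the fit.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sqrt(np.mean((X @ w - y) ** 2))

ones = np.ones(n)
# The size-only model leaves the quality-driven variation unexplained (bias).
print(f"size only     : {fit_rmse(np.column_stack([ones, size]), price):.1f}")
print(f"size + quality: {fit_rmse(np.column_stack([ones, size, quality]), price):.1f}")
```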

Overfitting

At the other extreme, consider the right-most model. Overfitting occurs when a model learns the training data too well. When a learning algorithm perceives idiosyncratic data as reflecting a general pattern, it overfits the data. The noise, or random fluctuations, in the training data is picked up and learned, so it is present in the model. Because the detail and noise in the training data do not accurately reflect the general pattern in the unseen data, the values predicted by the model are distorted. Therefore, the model performs poorly when it is applied to new data.

In this example, the function mapped to the data has too many terms, e.g., the θ₃x³ and θ₄x⁴ terms in the graphic, which make the model too complex to capture the general pattern. In this extreme example, the extra terms let the learning algorithm map the relationship between size and price exactly for the training data, so when considering only the training data, the model looks perfect. When the model encounters inputs from the test data, however, it does very poorly. It has high bias: for some values of size, the predicted price is much too high, and for other values it is much too low. It also has high variance: the model fit indicators for the training partition are excellent (perfect in this example) but poor when applied to the unseen data. The model's predictions are influenced too much by specific training observations. Too much complexity in the modeling method, caused by too many terms in the polynomial, lets this happen.
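A minimal sketch of this situation, again with simulated data rather than the figure's data: a fourth-degree polynomial is fit to only five training observations, so it passes through them exactly, yet its error on fresh test data is far larger.

```python
# A minimal sketch with simulated data: a degree-4 polynomial fit to five
# training points interpolates them exactly but generalizes poorly.
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    size = rng.uniform(0.5, 3.5, n)              # size in 1,000s of sq. ft. (simulated)
    price = 50 + 120 * size - 20 * size**2       # curved "true" pattern, in $1,000s
    return size, price + rng.normal(0, 5, n)

x_train, y_train = make_data(5)    # only five training observations
x_test, y_test = make_data(30)

# A degree-4 polynomial (terms up to x^4) passes exactly through five points.
coeffs = np.polyfit(x_train, y_train, deg=4)

def rmse(x, y):
    return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

print(f"train RMSE: {rmse(x_train, y_train):.2f}")   # essentially zero
print(f"test  RMSE: {rmse(x_test, y_test):.2f}")     # far larger: the variance
```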

Having too many input variables is another way to create overfitting. When the learning algorithm sees many input variables that may or may not have anything to do with predicting the value of the output variable, the computations become complex and the learning algorithm can be misled. The algorithm can assign meaning to the values of the different variables, some of which may be genuinely predictive, while others appear predictive but are related to the outcome variable only by chance.
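The sketch below simulates this effect with invented data: price truly depends only on size, but the model is also given a couple of dozen random, non-predictive columns. Ordinary least squares assigns weights to those columns anyway, which lowers the training error while raising the test error.

```python
# A minimal sketch with simulated data: irrelevant input columns improve the
# apparent fit on the training partition but hurt the fit on the test partition.
import numpy as np

rng = np.random.default_rng(2)

n_train, n_test, n_noise = 30, 200, 25
size_tr = rng.uniform(0.5, 3.5, n_train)
size_te = rng.uniform(0.5, 3.5, n_test)
price_tr = 50 + 120 * size_tr - 20 * size_tr**2 + rng.normal(0, 5, n_train)
price_te = 50 + 120 * size_te - 20 * size_te**2 + rng.normal(0, 5, n_test)

def design(size, noise_cols):
    # Intercept, size, size^2, plus irrelevant random columns.
    return np.column_stack([np.ones_like(size), size, size**2, noise_cols])

X_train = design(size_tr, rng.normal(size=(n_train, n_noise)))
X_test = design(size_te, rng.normal(size=(n_test, n_noise)))   # fresh, unrelated noise

w, *_ = np.linalg.lstsq(X_train, price_tr, rcond=None)

def rmse(X, y):
    return np.sqrt(np.mean((X @ w - y) ** 2))

print(f"train RMSE: {rmse(X_train, price_tr):.1f}")   # deceptively low
print(f"test  RMSE: {rmse(X_test, price_te):.1f}")    # noticeably worse
```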

The remedy for overfitting is twofold: 1) don't use overly complex modeling methods when simple ones will do the job, and 2) get rid of non-predictive inputs and include only a smaller set of relevant predictors. Failure to do this will result in bias and variance.

Good Fit

Too simple and too complex are both bad. We want the right degree of complexity, not too little and not too much. When the learning algorithm can find and represent the general pattern in the data, as captured by the second-order polynomial in this example, it achieves an appropriate level of fit. Consequently, when the model is used to predict values for new data, its predictions follow the curved pattern. This results in the desirable condition of low bias and low variance.
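Tying the three cases together, the sketch below (still simulated data) fits polynomials of degree 1, 2, and 10 to the same training partition. Degree 1 underfits, degree 10 overfits, and degree 2, which matches the true second-order pattern, typically gives low and similar errors on both partitions.

```python
# A minimal sketch with simulated data: compare under-, good, and overfitting
# by sweeping the polynomial degree.
import numpy as np

rng = np.random.default_rng(4)

def make_data(n):
    size = rng.uniform(0.5, 3.5, n)              # size in 1,000s of sq. ft. (simulated)
    price = 50 + 120 * size - 20 * size**2       # true pattern is second order
    return size, price + rng.normal(0, 5, n)

x_train, y_train = make_data(12)
x_test, y_test = make_data(100)

for degree in (1, 2, 10):     # too simple, about right, too complex
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    tr = np.sqrt(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    te = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"degree {degree:2d}: train RMSE {tr:8.1f}   test RMSE {te:8.1f}")
```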

Under and Overfitting with Categorical Predictions

Underfitting and overfitting can also occur when creating a model for categorical predictions. The image below shows a graphical representation of different classifiers. The straight line used to separate the two categories in the underfitting example does not capture the best separation boundary. The properly fit model captures the separation available with a curved separation boundary. It is not a perfect classifier, but it is a very good one, probably the best that can be done for this problem. Although the overfitted classifier makes no classification mistakes on the training data, it is not ideal because it fits this specific training data set too well. A different data set will include observations that match the general pattern but not this overly specific one, so some of those observations will fall on the wrong side of the convoluted separating line.
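Assuming scikit-learn is available, the sketch below makes the same point with a simulated two-class problem (not the data behind the figure): a straight-line classifier underfits, a one-nearest-neighbor classifier memorizes the training points and overfits, and a moderately flexible classifier typically generalizes best.

```python
# A minimal sketch with a simulated two-class data set: compare training and
# test accuracy for classifiers of increasing flexibility.
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "underfit (straight-line boundary)": LogisticRegression(),
    "overfit  (1-nearest neighbor)    ": KNeighborsClassifier(n_neighbors=1),
    "good fit (RBF-kernel SVM)        ": SVC(kernel="rbf"),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: train accuracy {model.score(X_train, y_train):.2f}, "
          f"test accuracy {model.score(X_test, y_test):.2f}")
```

The one-nearest-neighbor model scores perfectly on the training partition but noticeably lower on the test partition, the same train/test gap described above for the overfit regression model.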

Figure 2.2: Underfitting and overfitting with categorical prediction

The Signal and the Noise

We are trying to find the real pattern in the data, the true signal, and separate it from the random noise and irrelevant information in the dataset. A variety of causes can lead to underfitting or overfitting your data, and we will explore a variety of ways to control them throughout this course.

One way a learning algorithm may obtain an inaccurate signal is by training on data that does not represent the population it is supposed to represent. Let’s say you’re modeling height and weight vs. age in children. If a large, representative sample of the population is analyzed, a clear relationship is found. The model reflected in the graph below was developed by studying large samples of the U.S. population of young people. Because so many people from the overall population were sampled, an accurate distribution was produced, one so well defined that it describes what proportion of the population falls into which percentiles. This clear, accurate model is a good example of a well-described signal: it is a clear map of what the real pattern looks like. With such an accurate and clear model, it is easy to tell where any given child fits into the overall population. That is, you can tell whether a child is average, below average, or above average; for example, you could tell if a child is in the 95th percentile.

You can also tell if a child is not representative of this general pattern. For example, a severely undersized child, such as might result from extreme undernourishment or a genetic defect, would not even fit on this graph. Likewise, an abnormally large child would also not fit. This is an important use of a good model: observations that do not fit can be identified as outside the norm. In such a case, a parent or pediatrician might search for the reason the child was so far from the norm.

Figure 2.3: Growth Chart

Now assume a model was developed from young people in one small town. There may be characteristics of the local population that are not representative of the overall population. The local population may be made up of children who are shorter, taller, thinner, or heavier than the general population. Since the model learned from data that does not reflect the overall population, the sample is biased. Because of this, when the model is applied to children in other cities, it will not produce accurate predictions. For example, a normal child from a different town might appear as an outlier.
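The short simulation below illustrates that mechanism with invented numbers: a height-vs-age line is fit to children from a hypothetical town whose children run shorter than average, and the model then systematically under-predicts the heights of children drawn from the general population.

```python
# A minimal sketch with simulated data: a model trained on a non-representative
# local sample is biased when applied to the general population.
import numpy as np

rng = np.random.default_rng(5)

def heights(age, shift=0.0):
    # Crude linear growth pattern in cm, purely illustrative.
    return 75 + 6 * age + shift + rng.normal(0, 4, age.shape)

age_town = rng.uniform(2, 12, 100)
age_pop = rng.uniform(2, 12, 100)
h_town = heights(age_town, shift=-8)   # local children systematically shorter
h_pop = heights(age_pop)               # general population

model = np.polyfit(age_town, h_town, deg=1)
errors = np.polyval(model, age_pop) - h_pop

# The mean error comes out clearly negative (roughly the size of the shift):
# the model under-predicts for normal children, so ordinary children from
# other towns look like outliers relative to it.
print(f"mean prediction error on the general population: {errors.mean():.1f} cm")
```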