11.2 The number of neighbors
The k in KNN refers to the number of nearest neighbors that are considered when predicting a new record. For example, if k = 1, only the single nearest neighbor is used; if k = 5, the five nearest neighbors are used.
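The short Python sketch below illustrates how the choice of k changes a prediction. It is only an illustration, not the JMP workflow described in this chapter: it assumes the scikit-learn library and a small made-up training set, and it omits the feature scaling you would normally apply before KNN.

```python
# A minimal sketch of how k changes a KNN prediction (hypothetical data,
# scikit-learn assumed; feature scaling omitted for brevity).
from sklearn.neighbors import KNeighborsClassifier

X_train = [[22, 40000], [25, 42000], [47, 95000], [52, 110000], [46, 88000]]
y_train = ["no", "no", "yes", "yes", "no"]
new_record = [[45, 90000]]

for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    # k = 1 uses only the single closest training record;
    # k = 5 takes a majority vote over the five closest records.
    print(k, model.predict(new_record)[0])
```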
Choosing the number of neighbors
The best value for k is situation-specific. In some situations a higher k will produce better predictions on new records; in others a lower k will. Because of this, some BI tools, including JMP, let you specify the maximum k to consider. The tool then starts at that maximum and also tries lower values of k to identify which value produces the best results. Typically, the maximum k should be less than 15.
It is typically best to choose the value of k with the lowest error rate on the validation data. If more than one value of k produces the same accuracy, choose the smallest, because simpler models are generally better than more complicated models.
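The following sketch shows one way this search could be written by hand. It assumes scikit-learn and a train/validation split already held in X_train, y_train, X_valid, and y_valid (all hypothetical names); the cap of 15 follows the guideline above.

```python
# A sketch of choosing k by validation error, assuming scikit-learn and a
# hypothetical train/validation split. Not the JMP procedure itself.
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X_train, y_train, X_valid, y_valid, max_k=15):
    best_k, best_error = None, float("inf")
    for k in range(1, max_k + 1):
        model = KNeighborsClassifier(n_neighbors=k)
        model.fit(X_train, y_train)
        error = 1 - model.score(X_valid, y_valid)  # validation error rate
        # Strict inequality keeps the smaller k when error rates tie,
        # favoring the simpler model as described above.
        if error < best_error:
            best_k, best_error = k, error
    return best_k, best_error
```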
When k is too low
There are pros and cons to lower and higher values of k. Overfitting can result when k is small: a small k captures the local structure in the data, but the few observations it relies on may not represent the broader pattern that emerges when more observations are considered.
Another reason overfitting can occur when k is small is that an idiosyncratic record may be selected as a neighbor. Idiosyncratic in this context means that the relationship between the record's predictor values and its outcome value is unusual. Using such a neighbor to predict the new record can produce uncharacteristic predictions, and the lower the k, the greater the share of influence an idiosyncratic neighbor has on the prediction.
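The symptom of overfitting with a small k can be seen by comparing training and validation error, as in the sketch below. It assumes scikit-learn and a synthetic dataset generated purely for illustration; with k = 1, each training record is its own nearest neighbor, so training error is near zero while validation error is typically worse than with a moderate k.

```python
# A sketch of the overfitting symptom described above (synthetic data,
# scikit-learn assumed; numbers are illustrative only).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=5, flip_y=0.2, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5, random_state=1)

for k in (1, 9):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    # Print k, training error, and validation error: k = 1 fits the training
    # data almost perfectly but tends to generalize worse than a moderate k.
    print(k, 1 - model.score(X_tr, y_tr), 1 - model.score(X_va, y_va))
```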
When k is too large
On the other hand, there are also problems if k is too large. Having many neighbors reduces the likelihood of overfitting but can cause too much averaging out. Some of the neighbors will be very similar to the new record to be predicted, but as k increases, you will also select as neighbors records that are quite different from it.
Taking this to the extreme, when k equals the sample size, the prediction for a continuous numeric outcome variable will simply be the average value of all records in the sample. When the outcome variable is categorical, the result is the same as using the naïve rule: all predicted values will be the majority class. Thus, when all records are treated as neighbors, the algorithm's ability to select a small set of similar records is lost.
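This extreme case can be verified directly, as in the sketch below. It assumes scikit-learn and a tiny hypothetical dataset: with k equal to the sample size, every training record is a "neighbor," so the regression prediction collapses to the overall mean and the classification prediction collapses to the majority class.

```python
# A sketch of the k = sample-size extreme (hypothetical data, scikit-learn assumed).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_num = np.array([10.0, 12.0, 11.0, 30.0, 27.0])
y_cat = np.array(["no", "no", "yes", "no", "yes"])

n = len(X)
reg = KNeighborsRegressor(n_neighbors=n).fit(X, y_num)
clf = KNeighborsClassifier(n_neighbors=n).fit(X, y_cat)

print(reg.predict([[2.5]])[0], np.mean(y_num))  # both 18.0: the overall mean
print(clf.predict([[2.5]])[0])                  # "no": the majority class (naive rule)
```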