11.1 Background
The K Nearest Neighbor (KNN) algorithm finds neighbor records that are similar to the one being predicted rather than building models characterized by interpretable parameters or trees. These similar records are then used to predict the outcome value for new records. This is somewhat different from the predictive algorithms we have studied so far in this course. Before describing how KNN works, let's briefly review some characteristics of those earlier algorithms; this provides a context for seeing how KNN differs from them.
Characteristics of other algorithms
The algorithms that we have studied so far produce interpretable models that are separate from the data and that work best when data meets certain criteria.
Interpretable Models
The algorithms that we have examined up to now produce interpretable models. Multiple linear regression (MLR) and logistic regression (LogReg) produce models that include estimated intercepts and regression coefficients, so the models can be written out explicitly in terms of those coefficients. MLR-based time-series forecast (TSF) models are a specific type of MLR model in which characteristics of time are represented as predictor variables.
The classification and regression tree (CART) algorithm produces models in the form of interpretable binary trees, whose branches are characterized by decision rules.
Model Separation
The MLR, TSF, and CART algorithms create models that are based on patterns in the training data. After the training data is used to define a model, the model is stored so that it can be used to predict outcomes for new instances of data. Thus, after training, the model reflects what was learned from the training data and can be used without referring back to the training data.
Overview of KNN
KNN is different from the previous algorithms we have examined. KNN is an instance-based learner: it does not produce a separate model with interpretable coefficients, nor does it produce an interpretable tree. Instead, the nearest neighbors themselves serve as the basis for predicting new records.
Nearest neighbors
KNN compares each new instance (record) to be predicted with all instances in the training set to find those that are nearest. Nearest in this context means the predictor variable values are most alike. Specifically, KNN finds which records in the training data have predictor attribute values most similar to those of the record to be predicted.
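As a rough illustration of this neighbor search, here is a minimal sketch in Python. It assumes NumPy, a numeric predictor matrix whose columns have already been scaled to comparable ranges, and a hypothetical helper name nearest_neighbor_indices; it is not taken from any particular library.

    import numpy as np

    def nearest_neighbor_indices(X_train, new_record, k):
        """Return the indices of the k training records closest to new_record.

        Assumes X_train is a 2-D array of numeric, comparably scaled
        predictors and new_record is a 1-D array with the same columns.
        """
        # Euclidean distance from the new record to every training record
        distances = np.sqrt(((X_train - new_record) ** 2).sum(axis=1))
        # Indices of the k smallest distances, i.e. the nearest neighbors
        return np.argsort(distances)[:k]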
Predicting outcome values
Once these nearest neighbors have been identified, their outcome values are used to predict the outcome value for the new record.
If the outcome is a continuous numeric value, the outcome values of the neighbors are averaged, and this average is assigned as the predicted outcome value of the new record.
If the outcome is a categorical value, KNN determines the predominant class among the k nearest neighbors, where k is the number of neighbors being considered. The predominant class becomes the prediction for the new example. That is, classification is done by majority vote of the k nearest neighbors.
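Continuing the sketch above (again an illustrative example rather than any particular library's implementation), the neighbors' outcome values might be combined like this, with predict_numeric and predict_class as hypothetical helper names:

    import numpy as np
    from collections import Counter

    def predict_numeric(y_train, neighbor_idx):
        """Regression: average the outcome values of the nearest neighbors."""
        return float(np.mean(y_train[neighbor_idx]))

    def predict_class(y_train, neighbor_idx):
        """Classification: majority vote among the nearest neighbors."""
        votes = Counter(y_train[neighbor_idx])
        return votes.most_common(1)[0][0]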
Number of data partitions
The number of neighbors, k, can only be found by trying different values, so the validation set is used to fine-tune the algorithm. Because the validation set is used to choose k, some overfitting to the validation set can be introduced. Therefore, some authors recommend three partitions, as follows (see the sketch after this list):
- Train with the training set
- Use the validation set to find the best k
- Use the test set to evaluate the performance of the tuned model
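A minimal sketch of this three-partition workflow, assuming scikit-learn and already-prepared predictors X and class labels y (hypothetical names), might look like the following:

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # X and y are assumed to hold the predictors and class labels.
    # Split into 60% training, 20% validation, 20% test (one common choice).
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=1)
    X_valid, X_test, y_valid, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)

    # Use the validation set to find the best k.
    best_k, best_acc = 1, 0.0
    for k in range(1, 21):
        model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        acc = model.score(X_valid, y_valid)
        if acc > best_acc:
            best_k, best_acc = k, acc

    # Evaluate the tuned model once on the held-out test set.
    final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print("best k:", best_k, "test accuracy:", final_model.score(X_test, y_test))

The key point of the sketch is that the test set is touched only once, after k has been chosen, so the final accuracy estimate is not inflated by the tuning process.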