Distance

Euclidean Distance

Below is an example of using the Pythagorean Theorem to calculate the distance between two records when there are two predictor variables x1 and x2. The difference on the x1 dimension is 3 - 1 = 2. The difference on the x2 dimension is 2 - 1 = 1. When these differences are squared, they become 4 and 1, respectively. When 4 and 1 are added, the sum is 5. The square root of 5 is 2.24.
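The same arithmetic can be written out as a short calculation. The sketch below simply reproduces the numbers from the paragraph above, assuming the two records are (1, 1) and (3, 2).

```python
import math

# Two-variable example from the walk-through above: record A = (1, 1), record B = (3, 2).
diff_x1 = 3 - 1                                  # difference on the x1 dimension -> 2
diff_x2 = 2 - 1                                  # difference on the x2 dimension -> 1
distance = math.sqrt(diff_x1**2 + diff_x2**2)    # sqrt(4 + 1) = sqrt(5)
print(round(distance, 2))                        # 2.24
```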

Euclid extended the principle used by Pythagoras to more than two variables. In the example below there are two records with four predictor variables.
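Since the four-variable example is shown only in the figure, the sketch below uses hypothetical values to illustrate the same idea: the squared differences are summed across all predictors before taking the square root.

```python
import math

def euclidean_distance(record_a, record_b):
    """Euclidean distance between two records with any number of numeric predictors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(record_a, record_b)))

# Hypothetical four-predictor records (illustrative values, not taken from the figure).
old_record = [2.0, 5.0, 1.0, 7.0]
new_record = [3.0, 4.0, 2.0, 6.0]
print(euclidean_distance(old_record, new_record))  # sqrt(1 + 1 + 1 + 1) = 2.0
```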

Standardization of numeric variables

Distance is very sensitive to the magnitude of the predictor variables. A variable that typically has large values relative to the value of other variables can dominate the combined distance calculation.

Consider the example below, where two input variables have very different magnitudes. Income is so much larger than age that it dominates the distance calculation. When a difference of 10,000 is squared, it is 100,000,000. This larger value dwarfs the square of 3, which is 9, such that when they are added together and the square root is taken, the influence of age is effectively lost in the calculation.
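A small numeric sketch makes the dominance visible. The 10,000 income gap and 3-year age gap below mirror the differences described above; the absolute ages and incomes themselves are made up for illustration.

```python
import math

# Illustrative records: (age, income).
person_1 = (35, 60_000)
person_2 = (38, 70_000)

age_part = (person_2[0] - person_1[0]) ** 2      # 3^2 = 9
income_part = (person_2[1] - person_1[1]) ** 2   # 10,000^2 = 100,000,000
print(math.sqrt(age_part + income_part))         # ~10,000 -- age barely changes the result
```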

To avoid this problem, when variables have very different magnitudes, it is important to standardize the data before measuring distance. Some data mining tools, including JMP, automatically standardize all continuous predictor variables before running the KNN algorithm.

The image below shows the most common standardization method, which converts values so they are expressed in standard deviations from the mean. This produces both negative and positive values. The example below shows data before and after it has gone through the standardization process.
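A minimal sketch of this z-score standardization, (value - mean) / standard deviation, is shown below. The ages are illustrative; tools such as JMP perform this step automatically.

```python
from statistics import mean, stdev

ages = [25, 30, 35, 40, 45]                      # illustrative raw values
age_mean, age_sd = mean(ages), stdev(ages)
standardized = [(a - age_mean) / age_sd for a in ages]
print(standardized)                              # values expressed in standard deviations from the mean
```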

Distance for categorical variables

The previous examples have shown how to calculate distance with continuous numeric variables. How can distance be calculated with categorical variables? For example, assume color is an input variable, with common values such as red, green, blue, yellow, and others.

In such cases, if the old record has the same color value as the new record, the distance is zero. Otherwise the distance is 1.

The example below shows how Euclidean Distance is calculated when both continuous numeric variables and categorical variables are included. In this example, no standardization is performed because the predictor variables have similar magnitudes.
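The sketch below combines the two rules under stated assumptions: numeric predictors contribute their squared difference, and a categorical predictor such as color contributes 0 on a match and 1 otherwise. The records and the helper function are hypothetical, not the textbook's example.

```python
import math

def mixed_distance(record_a, record_b, categorical):
    """Euclidean distance where categorical predictors contribute 0 on a match, 1 otherwise."""
    total = 0.0
    for value_a, value_b, is_cat in zip(record_a, record_b, categorical):
        if is_cat:
            total += 0 if value_a == value_b else 1
        else:
            total += (value_a - value_b) ** 2
    return math.sqrt(total)

# Hypothetical records: two numeric predictors and one color predictor.
old = (4.0, 7.0, "red")
new = (5.0, 9.0, "blue")
print(mixed_distance(old, new, categorical=(False, False, True)))  # sqrt(1 + 4 + 1) ~= 2.45
```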

Examples of how k can affect outcomes

The image below shows an example of how selecting a different number of neighbors can affect the prediction for a classification. Assume a new record is represented by the X at the center of the inner circle. Notice that with a classifier based on k = 3, the inner circle contains the three nearest neighbors, which all have a value of class B. Therefore, the predicted value of X will be class B.

When k = 11, the outer circle contains the eleven nearest neighbors. In this case, the majority class is class A, so X would be assigned to class A.

Figure 11.1: Number of neighbors matters
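The majority vote itself is easy to express in code. The neighbor classes below are hypothetical, ordered from closest to farthest, and chosen so they behave like the picture in Figure 11.1.

```python
from collections import Counter

# Hypothetical classes of the eleven nearest neighbors of X, closest first.
neighbor_classes = ["B", "B", "B", "A", "A", "B", "A", "A", "A", "A", "B"]

for k in (3, 11):
    votes = Counter(neighbor_classes[:k])
    print(k, votes.most_common(1)[0][0])  # k = 3 -> B, k = 11 -> A
```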

The examples below show how the Euclidean Distance calculation is used when k = 3, and how a different result is obtained when k = 1.
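Since those worked examples appear only in the figures, the sketch below ties the pieces together with hypothetical training records: the new record's distance to every training record is computed, the k closest are kept, and the majority class among them is the prediction.

```python
import math
from collections import Counter

# Hypothetical training records ((x1, x2), class); the values are illustrative only.
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "B"), ((1.1, 1.3), "B"), ((5.0, 5.0), "A")]
new_record = (1.0, 1.1)

def knn_predict(new_rec, data, k):
    """Classify new_rec by majority vote among its k nearest training records."""
    by_distance = sorted(data, key=lambda item: math.dist(new_rec, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict(new_record, training, k=1))  # the single nearest neighbor decides -> "A"
print(knn_predict(new_record, training, k=3))  # majority of the three nearest decides -> "B"
```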