Measures of spread

Measures of spread include the variance, the standard deviation, and the interquartile range.

Variance

Variance is a measure of how far the values in a data set are spread from their mean. It is the average of the squared deviations from the mean.

Squaring the deviations ensures that negative and positive deviations do not cancel each other out. A variance of zero means that there is no variability: all the numbers in the data set are the same.

The formula for variance is shown below. Sigma-squared of x refers to the variance of x. N is the total number of observations. X̄ (X-bar) refers to the mean, and Xi refers to the ith observation.

Figure 20.6: Variance Formula
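
To make the definition concrete, here is a minimal Python sketch of the divide-by-N variance described above. The function name population_variance and the sample values are illustrative assumptions, not part of the original text.

```python
def population_variance(values):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(values)
    mean = sum(values) / n
    # Squaring keeps negative and positive deviations from cancelling out.
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / n


# Hypothetical data: the mean is 5 and the squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(data))       # 4.0

# Identical values have no variability, so the variance is zero.
print(population_variance([3, 3, 3]))  # 0.0
```

The second call illustrates the zero-variance case: when every value equals the mean, every deviation is zero.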

Standard Deviation

The standard deviation is equal to the square root of the variance. Though the variance gives you an idea of how far the data spread from the mean, it is expressed in squared units. The standard deviation is easier to interpret because it is expressed in the same units as the data, so it can be read as a typical distance from the mean.

Figure 20.7: Standard Deviation
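
As a rough illustration, the standard deviation can be computed by taking the square root of the divide-by-N variance. The function name population_std and the sample values below are assumptions chosen for the example; statistics.pstdev from the Python standard library is used only as a cross-check.

```python
import math
import statistics


def population_std(values):
    """Square root of the divide-by-N variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / n
    return math.sqrt(variance)


# Hypothetical data with mean 5 and variance 4.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))     # 2.0

# The standard library's population standard deviation should agree.
print(statistics.pstdev(data))
```

Here the variance is 4 (in squared units) while the standard deviation is 2, in the same units as the data.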

Notice that in the formulas for variance and standard deviation, we divide by N instead of N - 1. In classic inferential statistics, these measures are divided by N - 1 when they are calculated on samples. Why? Dividing by N - 1 slightly inflates the variance and standard deviation to compensate for the fact that a sample, especially a small one, tends to underestimate the spread of the population. However, in data mining we typically work with large samples, so the effect of subtracting one would be negligible. Thus, we divide by N.
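
The sketch below compares the two divisors on a small and a large sample. It assumes Python's standard statistics module, where pvariance divides by N and variance divides by N - 1. The helper name compare_divisors, the random seed, and the simulated data are illustrative assumptions.

```python
import random
import statistics


def compare_divisors(values):
    """Return (divide-by-N variance, divide-by-(N - 1) variance)."""
    return statistics.pvariance(values), statistics.variance(values)


random.seed(0)  # fixed seed so the simulated samples are repeatable
small = [random.gauss(50, 10) for _ in range(5)]
large = [random.gauss(50, 10) for _ in range(100_000)]

for name, sample in [("small", small), ("large", large)]:
    by_n, by_n_minus_1 = compare_divisors(sample)
    # The relative gap between the two results is 1 / (N - 1).
    relative_gap = (by_n_minus_1 - by_n) / by_n
    print(f"{name}: N={len(sample)}, relative gap={relative_gap:.6f}")
```

Because the two results differ only by the factor N / (N - 1), the gap is 25% for N = 5 but only about 0.001% for N = 100,000, which is why the choice of divisor hardly matters for large data sets.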

Interquartile Range

The interquartile range (IQR) is the distance between the 25th and 75th percentiles, so it spans the middle 50% of your data. The IQR is represented by the box on a boxplot. The red bracket above the boxplot marks this middle 50% of the data distribution.

Figure 20.8: Interquartile Range
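
A minimal sketch of the IQR calculation follows, using statistics.quantiles from the Python standard library (Python 3.8 or later). The function name interquartile_range, the sample values, and the choice of the "inclusive" quantile method are assumptions. Other quantile conventions can give slightly different quartiles on small data sets.

```python
import statistics


def interquartile_range(values):
    """Distance between the 75th and 25th percentiles."""
    # n=4 splits the data into quarters and returns the three cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical data: the 25th percentile is 3 and the 75th percentile is 7.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(interquartile_range(data))  # 4.0
```

On a boxplot of this data, the box would extend from 3 to 7.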