16.4 How to Measure Clustering Quality
Because in cluster analysis we never know whether we have “the correct answer,” we need a way to evaluate a clustering’s quality. In principle, a clustering based on proximity is valid if its clusters are individually cohesive (tightly packed around a centroid) and distinctly separated from the other clusters in the clustering.
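The mean and standard deviation are statistics that can be used to determine whether clusters are both internally cohesive and distinctly separated from one another. A small standard deviation within a cluster indicates cohesion, and cluster means that lie far apart relative to those standard deviations indicate separation. As shown in this image, the mean and standard deviation are reported for each cluster.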
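As a minimal sketch of this kind of summary, the following Python snippet computes the mean and sample standard deviation of one input variable for each cluster. The observations and the cluster labels "A" and "B" are hypothetical values made up for illustration; they are not taken from the image.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical observations: (cluster label, value of one input variable).
observations = [
    ("A", 2.0), ("A", 4.0), ("A", 6.0),
    ("B", 20.0), ("B", 22.0), ("B", 24.0),
]

# Group the values by the cluster they were assigned to.
values_by_cluster = defaultdict(list)
for label, value in observations:
    values_by_cluster[label].append(value)

# Per-cluster (mean, sample standard deviation).
summary = {
    label: (mean(values), stdev(values))
    for label, values in values_by_cluster.items()
}

for label, (m, s) in sorted(summary.items()):
    print(f"cluster {label}: mean={m:.2f}, std={s:.2f}")
```

In this made-up data the two cluster means differ by 18 while each standard deviation is 2, which is the pattern one would hope to see for cohesive, well-separated clusters.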
Parallel coordinate plots provide a useful way to visualize whether the means of inputs to the clustering process are similar or different across clusters.
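One way such a plot might be produced is sketched below, assuming pandas and matplotlib are available. The data frame, its column names (age, income, visits), and the cluster labels are hypothetical. Standardizing the inputs before averaging is a design choice made here so that all axes share a comparable scale; it is not something the text prescribes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical data: three inputs and a cluster label per observation.
data = pd.DataFrame({
    "age":     [23, 25, 31, 52, 55, 49],
    "income":  [30, 34, 41, 88, 92, 79],
    "visits":  [12, 10, 11, 3, 2, 4],
    "cluster": ["A", "A", "A", "B", "B", "B"],
})
inputs = ["age", "income", "visits"]

# Standardize each input (z-scores) so the axes are comparable.
z = (data[inputs] - data[inputs].mean()) / data[inputs].std()
z["cluster"] = data["cluster"]

# One row per cluster: the mean of each standardized input.
cluster_means = z.groupby("cluster", as_index=False)[inputs].mean()

# Draw one line per cluster across the input axes.
fig, ax = plt.subplots()
parallel_coordinates(cluster_means, class_column="cluster", ax=ax, colormap="viridis")
ax.set_ylabel("standardized mean")
```

Lines that sit far apart on an axis suggest that input differs across clusters; lines that nearly coincide suggest it does not. Calling `fig.savefig(...)` or `plt.show()` (with an interactive backend) would render the figure.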
Finally, if the clusters are going to be used to help improve the accuracy of predictions, model quality statistics for models built on each candidate clustering can be compared across clusterings.
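A rough sketch of such a comparison follows. It assumes a deliberately simple stand-in model, in which each observation is predicted by the mean target value of its cluster, and uses root mean squared error (RMSE) as the quality statistic. The function name `rmse_of_cluster_mean_model`, the target values, and both clusterings are hypothetical; a real comparison would use whatever model and statistic the project calls for, ideally evaluated on held-out data.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean

def rmse_of_cluster_mean_model(labels, targets):
    """In-sample RMSE when each observation is predicted by its cluster's mean target."""
    by_cluster = defaultdict(list)
    for label, y in zip(labels, targets):
        by_cluster[label].append(y)
    cluster_mean = {label: mean(ys) for label, ys in by_cluster.items()}
    squared_errors = [
        (y - cluster_mean[label]) ** 2 for label, y in zip(labels, targets)
    ]
    return sqrt(mean(squared_errors))

# Hypothetical target values and two candidate clusterings of the same rows.
targets = [10.0, 12.0, 11.0, 30.0, 32.0, 31.0]
clustering_1 = ["A", "A", "A", "B", "B", "B"]
clustering_2 = ["A", "B", "A", "B", "A", "B"]

rmse_1 = rmse_of_cluster_mean_model(clustering_1, targets)
rmse_2 = rmse_of_cluster_mean_model(clustering_2, targets)

print(f"clustering 1 RMSE: {rmse_1:.3f}")
print(f"clustering 2 RMSE: {rmse_2:.3f}")
```

Under these assumptions, the clustering with the lower RMSE is the one whose clusters carry more information about the target.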