16.3 Algorithms for Cluster Analysis
Basic guidelines for clustering
Before we look at specific clustering algorithms, here are some basic guidelines to follow when performing cluster analysis.
- Eliminate outliers before deriving clusters. The presence of outliers can confuse clustering algorithms; a simple screening approach is sketched after this list.
- Use only input variables when deriving clusters. Never include the output variable among the inputs used to derive clusters.
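One simple way to carry out the outlier screening is a per-column z-score filter, as in the sketch below. The function name, the NumPy-array input X, and the cutoff of three standard deviations are illustrative assumptions; other screening rules (interquartile-range limits, domain knowledge) work just as well.

```python
import numpy as np

def remove_outliers(X, z_cutoff=3.0):
    """Keep only rows whose value in every column lies within
    z_cutoff standard deviations of that column's mean (illustrative rule)."""
    z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
    return X[(z < z_cutoff).all(axis=1)]

# The screened data, remove_outliers(X), is then used to derive the clusters.
```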
K-Means clustering
K-means is a simple, yet widely used, clustering algorithm. With this method, clusters are identified by the algorithm based on proximity. It uses the concept of a centroid, defined as the mean of a group of points. In a dataset defined in n dimensions, that is, with n attributes or columns, each centroid has a value in each of the n dimensions. Before beginning a cluster analysis using K-means, the analyst must first choose K, the number of expected clusters.
The steps of the algorithm are:
- Randomly locate K initial centroids within the n-dimensional space. (Alternatively, randomly choose K observations from the dataset to serve as the initial centroids.)
- Repeat until observation assignments to centroids no longer change:
- Assign each of the observations in the dataset to the nearest centroid.
- Recompute each centroid’s location as the mean of all observations assigned to that centroid.
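As a concrete illustration, here is a minimal sketch of these steps in Python with NumPy. It assumes X is an (observations x dimensions) NumPy array; a production analysis would normally rely on a library implementation such as scikit-learn's KMeans.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=None):
    """Basic K-means on an (observations x dimensions) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: choose K observations at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = np.full(len(X), -1)
    for _ in range(max_iter):
        # Assign each observation to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # Stop once assignments to centroids no longer change.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Recompute each centroid as the mean of its assigned observations.
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return centroids, assignments
```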
Issues with K-Means Clustering
Although K-means is simple to understand and implement, it does have shortcomings:
- K, the number of clusters, must be set before initiating the process.
- K-means generates a complete partitioning of the observations. There is no option to exclude observations from the clustering.
- When initial centroids are randomly located, the resulting clusterings may vary from execution to execution. The end result is not deterministic, as illustrated in the snippet following this list.
- K-means does not handle datasets that contain clusters of varying size well. In general, it tends to split the larger clusters and may merge the smaller ones.
- K-means often produces a few tiny clusters alongside some very large ones.
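The sensitivity to the random initial centroids is easy to demonstrate by running the algorithm several times with a single random start. The snippet below uses scikit-learn's KMeans; the data and parameter settings are purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # illustrative data only

for seed in range(3):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    # Different random starts can yield different partitions and different
    # within-cluster sums of squares (inertia).
    print(seed, round(km.inertia_, 2))
```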
Hierarchical clustering
Another relatively simple method to perform cluster analysis is hierarchical clustering, which generates a taxonomy or hierarchy of clusters. It has two alternative approaches: bottom-up and top-down.
In bottom-up hierarchical clustering, each observation starts in its own cluster. The two closest clusters are then repeatedly merged until only one cluster remains or until each cluster reaches a predetermined minimum measure of cluster cohesiveness.
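A bottom-up (agglomerative) clustering can be sketched with SciPy's hierarchical clustering routines. The linkage method and the cut at four clusters below are illustrative choices, not recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))  # illustrative data only

# Each observation starts in its own cluster; Ward linkage repeatedly merges
# the pair of clusters whose merge least increases within-cluster variance.
Z = linkage(X, method="ward")

# Cut the resulting hierarchy to obtain, for example, four clusters.
labels = fcluster(Z, t=4, criterion="maxclust")
```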
In top-down hierarchical clustering, we begin with one cluster containing all observations. Clusters are then repeatedly divided until all clusters reach a predefined maximum measure of cluster cohesiveness. Top-down hierarchical clustering has the additional complexity that there must be a way to select the next cluster to be split and, once it is selected, a way to allocate its observations to the two newly created clusters.
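One common way to realize the top-down approach is bisecting K-means, sketched below: the cluster selected for the next split is the one with the largest within-cluster sum of squares, and its observations are allocated to two new clusters by a two-cluster K-means run. The selection rule and function name are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, n_clusters):
    """Top-down clustering sketch: repeatedly split the cluster with the
    largest within-cluster sum of squares using a two-cluster K-means run."""
    clusters = [np.arange(len(X))]  # begin with one cluster holding every row
    while len(clusters) < n_clusters:
        # Select the next cluster to split: here, the most dispersed one.
        sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        idx = clusters.pop(int(np.argmax(sse)))
        # Allocate its observations to the two newly created clusters.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters  # a list of index arrays, one per cluster
```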
The process of top-down clustering is similar to the process of tree building. In decision trees, the degree of homogeneity of a node is based on the single classification variable, and the split of a node is based on criteria that result in the most homogeneous child nodes. In cluster analysis there is no classification variable; hence, all attributes (dimensions) must be used to compute a measure of cluster dispersion when selecting the nodes to be split. This will be discussed later when we present measures of individual cluster and overall clustering quality.
Over the years numerous other methodologies have been proposed for cluster analysis. Some have enhanced the existing K-means and hierarchical proximity-based methodologies, while others have focused on density- or connectedness-based methodologies. Self-organizing maps, which will be introduced later in this chapter, can be thought of as an enhanced K-means algorithm. For more information on these algorithms, the reader is directed to books dedicated primarily to cluster analysis.
Self-Organizing Maps (SOM)
The self-organizing map algorithm was developed by Teuvo Kohonen. It is similar in approach to K-means and its variants. The primary differences are that the SOM algorithm uses a topologically fixed centroid structure and that, based on that topology, when one centroid is updated, its neighboring centroids are also updated. JMP uses a two-dimensional grid in which the user can specify the number of rows and columns.
The steps of the algorithm are:
1. Select a topology, including the number of centroids (cells).
2. Randomly initialize all centroids.
3. Repeat until a predetermined number of training iterations has been completed or the centroid updates fall below a minimal change threshold:
3.1. Select the next observation.
3.2. Locate the centroid closest to the observation (the winning centroid).
3.3. Update the winning centroid and the other centroids in the winner’s neighborhood, nudging each closer to the selected observation.
4. Assign each observation to its winning centroid to form the clusters.
Observations are sequentially selected in step 3.1 above. Once the last observation has been applied, the selection process returns to the first. One cycle through the list of observations is known as an iteration.
For a given centroid, the magnitude of the update in step 3.3 above depends on its proximity to the winning centroid. The winning centroid itself is nudged the most, while the centroids immediately surrounding the winner are nudged less, and so forth, based on a Mexican hat neighborhood function. The result is that as training progresses, similar observations migrate toward the same or neighboring centroids. In the end, the most similar observations are assigned to the same cluster (centroid), neighboring clusters are more similar to one another, and distant clusters are the least similar.
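A compact sketch of this training loop is given below. For simplicity it uses a Gaussian neighborhood that shrinks over the iterations rather than the Mexican hat function described above, and the grid size, learning rate, and decay schedule are illustrative assumptions; dedicated SOM implementations handle these details for you.

```python
import numpy as np

def train_som(X, n_rows, n_cols, n_iterations=20, lr=0.5, sigma=1.0, seed=None):
    """Minimal SOM: an (n_rows x n_cols) grid of centroids trained on X."""
    rng = np.random.default_rng(seed)
    n_cells, n_dims = n_rows * n_cols, X.shape[1]
    centroids = rng.normal(size=(n_cells, n_dims))  # step 2: random initialization
    # Grid coordinates of each cell, used to measure neighborhood distance.
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], dtype=float)

    for it in range(n_iterations):  # step 3: one pass over the data per iteration
        decay = 1.0 - it / n_iterations  # shrink learning rate and neighborhood
        for x in X:  # step 3.1: select the next observation
            # Step 3.2: the winning centroid is the one closest to the observation.
            winner = np.argmin(((centroids - x) ** 2).sum(axis=1))
            # Step 3.3: neighborhood weights fall off with grid distance from the winner.
            grid_dist = ((grid - grid[winner]) ** 2).sum(axis=1)
            h = np.exp(-grid_dist / (2 * (sigma * decay + 1e-9) ** 2))
            centroids += (lr * decay) * h[:, None] * (x - centroids)
    # Step 4: assign each observation to its winning centroid to form the clusters.
    assignments = np.array([np.argmin(((centroids - x) ** 2).sum(axis=1)) for x in X])
    return centroids, assignments
```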