How do we find subpopulations?

Three ways to find subpopulations

There are three primary ways to find subpopulations: use domain knowledge, use visualizations, and perform cluster analysis. They are presented here in the order in which they are typically applied.

The first step in identifying subpopulations is to use domain knowledge to brainstorm about which subpopulations might exist in your data. Which logical subgroups probably have different characteristics that are likely to matter for predicting the outcome variable? If you lack the domain knowledge yourself, seek it from subject matter experts.

The second step is to create visualizations of the data. Histograms and boxplots may show multimodal populations or intense concentrations of outliers.

Figure 16.1: Bi-modal histogram
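
As a rough illustration of how a bimodal histogram can reveal subpopulations, the following Python sketch uses NumPy to bin one variable and print a text histogram. The data are simulated for this example only, and the names (group_a, group_b, values) are hypothetical; they do not come from the dataset behind Figure 16.1.

```python
import numpy as np

# Synthetic data (an assumption for illustration): one variable measured on
# two hidden subgroups that have different typical values.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=20.0, scale=2.0, size=500)
group_b = rng.normal(loc=40.0, scale=2.0, size=500)
values = np.concatenate([group_a, group_b])

# Bin the values the same way a histogram plot would.
counts, edges = np.histogram(values, bins=15)

# Print a text histogram: one row per bin, one '#' per 10 observations.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    bar = "#" * (int(count) // 10)
    print(f"{left:6.1f} to {right:6.1f} | {bar}")
```

Two separate humps with a nearly empty stretch between them suggest that the variable may be mixing two subpopulations, which is a cue to investigate further rather than proof that the groups exist.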

3-dimensional scatterplots can also be very useful for finding subgroups.

Figure 16.2: 3-Dimensional scatter plot

A parallel coordinates plot (PCP) is a visualization technique used to plot individual data elements across many dimensions. Each dimension corresponds to a vertical axis, and each data observation (record) is displayed as a single line that spans all of the dimension axes. This automobile example shows that cars were very heavy in the 1970s and became lighter in later years. It also shows how horsepower relates to the number of cylinders and MPG.

Figure 16.3: Parallel coordinate plot
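
A plot of this kind can be sketched in Python with the parallel_coordinates function from pandas.plotting. The small table below contains made-up values chosen only to resemble the automobile example; it is not the data behind Figure 16.3, and the column names are assumptions. Because pandas draws every axis on one shared scale, the sketch first rescales each dimension to the range 0 to 1.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import parallel_coordinates

# Made-up illustrative values, NOT the data behind the figure.
cars = pd.DataFrame({
    "model_year": [70, 70, 74, 76, 80, 82],
    "cylinders": [8, 8, 6, 6, 4, 4],
    "horsepower": [165, 150, 110, 100, 75, 70],
    "weight": [3700, 3500, 3000, 2900, 2200, 2100],
    "mpg": [15, 16, 20, 22, 32, 34],
})
dimensions = ["model_year", "weight", "horsepower", "mpg"]

# Rescale each dimension to 0-1 so that no single axis dominates the plot.
span = cars[dimensions].max() - cars[dimensions].min()
scaled = (cars[dimensions] - cars[dimensions].min()) / span
scaled["cylinders"] = cars["cylinders"]  # used only to color the lines

# One line per car, one vertical axis per dimension, colored by cylinder count.
fig, ax = plt.subplots(figsize=(8, 4))
parallel_coordinates(scaled, class_column="cylinders", cols=dimensions, ax=ax)
ax.set_ylabel("value rescaled to 0-1")
# In an interactive session, call plt.show() or fig.savefig(...) to view it.
```

Lines that travel together across the axes indicate records with similar profiles, which is one visual hint that a subgroup may exist.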

Lastly, you can perform cluster analysis. With this method, we use data mining algorithms that look for meaningful groups of observations.

Cluster analysis as unsupervised learning

Cluster analysis is the process of grouping observations based on the similarity of their attributes. The results of a cluster analysis are called clusters.
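
To make the idea concrete, here is a minimal sketch that uses k-means, one of many clustering algorithms, as implemented in scikit-learn. The choice of k-means is an assumption made for illustration; this text does not prescribe a particular algorithm. The observations are simulated, with two attributes each.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic observations (illustration only): two attributes per record,
# generated as two groups that sit far apart.
rng = np.random.default_rng(seed=0)
low = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
high = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))
X = np.vstack([low, high])

# Group the records by similarity of their attribute values.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)  # one cluster label per observation

print("cluster sizes:", np.bincount(labels))
print("cluster centers:")
print(model.cluster_centers_)
```

Note that the algorithm never sees which group a record was generated from. It assigns labels using only the attribute values.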

Cluster analysis itself is a form of exploratory data analysis, as it involves exploring the various clusters within a dataset. Additional steps can then be taken to fit models to each cluster you identified during cluster analysis. The reading and homework will include both cluster analysis and the fitting of models to the clusters identified.
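
The value of fitting a separate model to each cluster can be sketched with a small simulation. In the Python example below, the data are synthetic, the cluster labels are assumed to have come from an earlier cluster analysis, and a straight-line fit with NumPy's polyfit stands in for whatever model you would actually use.

```python
import numpy as np

# Synthetic data (illustration only). The cluster labels are assumed to be
# the output of an earlier cluster analysis.
rng = np.random.default_rng(seed=1)
x = rng.uniform(0.0, 10.0, size=200)
cluster = np.repeat([0, 1], 100)
true_slope = np.where(cluster == 0, 2.0, -1.0)  # the groups behave differently
y = true_slope * x + rng.normal(scale=0.5, size=200)

# One pooled straight-line fit that ignores the clusters.
pooled_slope, pooled_intercept = np.polyfit(x, y, deg=1)
print(f"pooled slope: {pooled_slope:.2f}")

# A separate straight-line fit for each cluster.
slopes = {}
for k in np.unique(cluster):
    mask = cluster == k
    slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
    slopes[int(k)] = slope
    print(f"cluster {k} slope: {slope:.2f}")
```

In this simulation the pooled slope lands between the two cluster slopes and describes neither group well, which is the kind of situation where modeling subpopulations separately pays off.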

Most of what we have done in this course up to now is supervised learning. In supervised learning, there is a known correct answer. For example, if we know which individuals do and do not have breast cancer, we can train a predictive algorithm to see the patterns in the input variables that differentiate those with cancer from those without. Because we know which people do and do not have cancer, algorithms can learn (that is, be supervised) based on a known answer.

Cluster analysis is a form of unsupervised learning. In unsupervised learning, there is no known, objectively correct answer.

Because cluster analysis is a form of unsupervised learning, there is no objective answer as to how many clusters are best, which instances should be placed in each cluster, or which attributes and attribute values should be used to group instances into clusters. There is no objectively correct grouping.
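
One way to see that the choice of attributes changes the grouping is to cluster the same records twice using different attributes. The sketch below assumes scikit-learn's KMeans and adjusted_rand_score, and it uses simulated customers whose age and income vary independently. All names and values are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic customers (illustration only): age and income vary independently,
# so every combination of young/old and low/high income occurs equally often.
rng = np.random.default_rng(seed=2)
age = np.repeat([25.0, 60.0], 100) + rng.normal(scale=2.0, size=200)
income = np.tile(np.repeat([30.0, 90.0], 50), 2) + rng.normal(scale=3.0, size=200)

# Cluster the same customers twice, each time on a different attribute.
by_age = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    age.reshape(-1, 1)
)
by_income = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    income.reshape(-1, 1)
)

# Adjusted Rand index: 1 means identical groupings, about 0 means unrelated.
agreement = adjusted_rand_score(by_age, by_income)
print(f"agreement between the two groupings: {agreement:.3f}")
```

Both groupings are internally sensible, yet they barely agree with each other. Which one is more useful depends on the purpose of the analysis.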

For example, assume we are attempting to group customers into segments. We want meaningful subgroups whose members think and act alike, but think and act differently from members of other subgroups. But what attributes will accomplish this objective? Customer types probably differ on many dimensions, such as age, gender, income, family size, technical sophistication, frequency of computer use, frequency of cell phone use, and so forth.

There are many different groups of customers that could be derived. Should there be four, six, or eight segments? Should some of the clusters be big and others small? The answers to these questions cannot be objectively known. Further, customers may be broken up into different numbers of segments for different purposes. For example, we may decide that six segments is a useful grouping for understanding the characteristics of different types of customers, but that four segments best reflects how customers respond to marketing campaigns.
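
The following sketch shows why the data alone do not settle the number of segments. It assumes scikit-learn's KMeans and simulated, already standardized customer attributes with no built-in grouping. The candidate values of k (four, six, and eight) mirror the question above.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic, already standardized customer attributes (illustration only),
# generated with no built-in segment structure.
rng = np.random.default_rng(seed=3)
customers = rng.normal(size=(300, 3))

results = {}
for k in (4, 6, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(customers)
    sizes = np.bincount(model.labels_, minlength=k)
    results[k] = (model.inertia_, sizes)
    print(
        f"k={k}: within-cluster sum of squares={model.inertia_:.1f}, "
        f"segment sizes={sizes.tolist()}"
    )
```

The within-cluster sum of squares shrinks as k grows, so this measure by itself will always favor more segments. Choosing k still comes down to which grouping is most useful for the purpose at hand.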

The objective of cluster analysis is to discover useful groupings according to the attribute values of the records in the dataset. Instead of finding the "correct" clusters, we are looking for useful clusters. We can look at similarities and differences in customer attributes and make educated guesses about which customers are alike enough to be in the same cluster, but the answer is still just an educated guess. Even so, educated guesses are the best that can be accomplished, and they often produce cluster groupings that are very helpful.