16.1 Introduction to Segmentation and Clustering
The Importance of Subpopulations
Does your dataset contain one type of thing or multiple types of things? That is a very important question. Just because we have a sample of data does not mean that it represents only one population. A dataset may be composed of observations drawn from multiple subpopulations, each with different characteristics. Relationships that are apparent within a single subpopulation may not be as readily identified when exploring the full dataset. Unless we search for these subpopulations, they typically remain unrecognized.
Why Look for Subpopulations
There are two primary reasons to look for subpopulations when data mining. The first is to identify and understand the subpopulations that exist in our sample. The second is to increase the quality of our predictions. If there are meaningful subgroups in a population, it is often better to figure out what these subgroups are and then build a separate model for each subgroup. This often produces better prediction accuracy than creating one model for all instances in the population combined.
Consider a residential real-estate example. If we select all of the homes in a county, there are typically meaningful groups of homes that are similar to each other but different from other groups of homes. Some are located in rich parts of the county, and some are located in poor parts. Some homes are large, some are medium, and some are small. Some homes are in larger cities, and some are on small farms. Some are old, and some are new. Some are stand-alone homes, some are condominiums or townhomes, and some are mobile homes. Homes in the city are probably different from homes on farms.
If we are to create a pricing model for homes in the county, we should probably separate homes by type. For example, we may build one model for stand-alone single-family homes and a different model for condominiums, because these two types of homes are fundamentally different. Most condos are small, have only a few small rooms, and have little or no yard. As a result, the number of bedrooms and the lot size will have a different relationship to price for condos than for single-family homes, which have more rooms, larger rooms, and much larger yards.
What happens if we build one predictive pricing model based on data from both condos and stand-alone single-family homes? The relationships between the predictors and the outcome variable get averaged together across the two types of homes. As a result, the error in the resulting pricing estimates typically goes up, often considerably. We make less accurate predictions for condos and for single-family homes. In effect, by averaging two fundamentally different types of homes, we build a model that does poorly for both.
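The averaging effect can be illustrated with a small sketch. The numbers below are made up for illustration only (they are not real housing data), and the helper names fit_line, squared_errors, and rmse are hypothetical. The sketch fits a simple least-squares line of price on bedrooms three times: once on condos, once on single-family homes, and once on both pooled together, and then compares the root-mean-squared error (RMSE) of the pooled model with that of the two separate models.

```python
import math

# Made-up illustrative data: (bedrooms, price in $1000s).
# Condo prices rise slowly with bedrooms; house prices rise quickly.
condos = [(1, 150), (2, 170), (3, 190), (1, 155), (2, 165), (3, 195)]
houses = [(2, 250), (3, 330), (4, 410), (5, 490), (3, 320), (4, 420)]

def fit_line(rows):
    """Least-squares fit of price = intercept + slope * bedrooms."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def squared_errors(model, rows):
    """Squared prediction error of the model on each row."""
    slope, intercept = model
    return [(y - (intercept + slope * x)) ** 2 for x, y in rows]

def rmse(squared):
    """Root-mean-squared error from a list of squared errors."""
    return math.sqrt(sum(squared) / len(squared))

# One model for everything versus one model per type of home.
pooled_model = fit_line(condos + houses)
condo_model = fit_line(condos)
house_model = fit_line(houses)

pooled_rmse = rmse(squared_errors(pooled_model, condos + houses))
segmented_rmse = rmse(
    squared_errors(condo_model, condos) + squared_errors(house_model, houses)
)

print(f"slope per bedroom: condos {condo_model[0]:.1f}, "
      f"houses {house_model[0]:.1f}, pooled {pooled_model[0]:.1f}")
print(f"RMSE pooled: {pooled_rmse:.1f}   RMSE segmented: {segmented_rmse:.1f}")
```

For simplicity, the errors here are measured on the same rows used to fit the models. In practice the comparison would normally be made on held-out data.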
In fact, that is how analysts typically win data mining competitions. They break the data into subgroups where items within each subgroup are alike but different from items in other subgroups, and they try different numbers of subgroups. Then the analysts determine which number of subgroups is best and which models work best for each respective subgroup.
A common approach involves the following steps:
1) Divide the sample into cluster sets with different numbers of clusters. For example, you might try cluster sets with three, four, and five clusters. Keep increasing the number of clusters in each set until you have a variety of sets. This is necessary because, when creating clusters, you do not know in advance whether few or many clusters will be most helpful for optimizing your predictive power.
2) Apply various data mining algorithms to each cluster.
3) Find the cluster set, and the model for each of its clusters, with the highest predictive capability.
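The steps above can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive implementation: the data are synthetic, a simple one-feature k-means stands in for the clustering method, and a least-squares line is the only model tried in step 2 (the text does not prescribe particular algorithms). All function names (fit_line, nearest, kmeans_1d, fit_segmented, holdout_rmse) are hypothetical. Clustering uses only the predictor (square footage), because the price is not known at prediction time, and each candidate number of clusters is scored on held-out homes.

```python
import math
import random

def fit_line(rows):
    """Least-squares fit of y = intercept + slope * x; returns (slope, intercept)."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    if sxx == 0:  # all x identical: fall back to predicting the mean
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in rows) / sxx
    return slope, mean_y - slope * mean_x

def nearest(centroids, x):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: abs(centroids[i] - x))

def kmeans_1d(values, k, iterations=25):
    """Lloyd's algorithm on a single feature; returns the final centroids."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for v in values:
            groups[nearest(centroids, v)].append(v)
        # Empty clusters are dropped rather than re-seeded.
        centroids = [sum(g) / len(g) for g in groups if g]
    return centroids

def fit_segmented(train, k):
    """Steps 1 and 2: cluster on the predictor, then fit one model per cluster."""
    centroids = kmeans_1d([x for x, _ in train], k)
    overall = fit_line(train)
    models = []
    for c in range(len(centroids)):
        members = [(x, y) for x, y in train if nearest(centroids, x) == c]
        # Clusters that are too small borrow the overall model.
        models.append(fit_line(members) if len(members) >= 2 else overall)
    return centroids, models

def holdout_rmse(centroids, models, rows):
    """Route each row to its nearest cluster's model and measure the error."""
    squared = []
    for x, y in rows:
        slope, intercept = models[nearest(centroids, x)]
        squared.append((y - (intercept + slope * x)) ** 2)
    return math.sqrt(sum(squared) / len(squared))

# Synthetic homes (assumption): condos and houses follow different price rules.
random.seed(0)

def make_home(kind):
    if kind == "condo":
        sqft = random.uniform(500, 1200)
        price = 80_000 + 100 * sqft
    else:
        sqft = random.uniform(2000, 3500)
        price = -50_000 + 220 * sqft
    return sqft, price + random.gauss(0, 10_000)

homes = [make_home("condo") for _ in range(60)]
homes += [make_home("house") for _ in range(60)]
random.shuffle(homes)
train, holdout = homes[:80], homes[80:]

# Step 3: compare the candidate cluster counts on the held-out homes.
candidate_ks = [1, 2, 3, 4]
rmse_by_k = {}
for k in candidate_ks:
    centroids, models = fit_segmented(train, k)
    rmse_by_k[k] = holdout_rmse(centroids, models, holdout)
    print(f"k={k}: holdout RMSE = {rmse_by_k[k]:,.0f}")

best_k = min(rmse_by_k, key=rmse_by_k.get)
print("best number of clusters on the holdout set:", best_k)
```

With k=1 the procedure reduces to a single pooled model, which gives a baseline for judging whether segmentation helps. A fuller version would also try several model types within each cluster, as step 2 describes, and keep the best one per cluster.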