7.4 Reconciling multiple names for a category
Sometimes a single category in a categorical variable column appears under multiple spellings, or some values are abbreviated and others are not. This typically occurs because of loose data input controls, where the column is populated as free-form text instead of being picked from a predefined list. The problem also sometimes results from combining data from multiple sources, where different sources used different names or abbreviations.
The image below shows an example of this common problem, where most names are included twice. Most names have one version that is all capitalized and one where just the first character of each word in the name is capitalized. For example, the data mining tools will think that Provo is not the same as PROVO, when, in fact, they should be recognized as instances of the same category.
Why is this a problem? It causes an unnecessary and undesirable proliferation of the categorical values recognized in the data column. This inhibits the ability of the learning algorithm to associate outcomes with a single category. In the case of MLR or LogReg, where dummy variables are generated for each value of a categorical variable, it leads to far too many dummy variables being created. In the case of the cities shown in the image above, coefficients will be estimated for 41 categories, when far fewer should be estimated.
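To make the dummy-variable proliferation concrete, here is a minimal Python sketch. The city values are made up for illustration and are not the data set shown in the image, and the helper names `dummy_columns` and `one_hot` are hypothetical. It counts one dummy column per distinct spelling and ignores the reference level that a regression would normally drop.

```python
# Hypothetical column values, made up for illustration only.
cities = ["Provo", "PROVO", "Orem", "OREM", "Lehi"]

def dummy_columns(values):
    """Return the distinct categories; each one would get its own dummy column."""
    return sorted(set(values))

def one_hot(values):
    """Encode each value as a 0/1 row with one position per distinct category."""
    columns = dummy_columns(values)
    return [[1 if value == col else 0 for col in columns] for value in values]

# Raw data: every spelling is treated as its own category.
raw_columns = dummy_columns(cities)

# After normalizing capitalization, the case variants collapse together.
clean_columns = dummy_columns([city.title() for city in cities])

print(len(raw_columns), raw_columns)
print(len(clean_columns), clean_columns)
```

In this toy column, five spellings produce five dummy columns, while only three real cities are present.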
It also means that fewer observations are used to estimate each coefficient, which can lead to unstable and unreliable estimates. In the Cities example, there are 39 records for Provo and 67 for PROVO. If we do not fix the disparate names, some observations will be used to estimate a coefficient for PROVO and the others to estimate a coefficient for Provo, when in reality all 67 + 39 = 106 observations should be used to make a single estimate for that city.
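The split in observation counts can be reproduced with a short Python sketch. It builds a synthetic list using the two counts quoted above (39 and 67) and assumes, for illustration, that title-casing is an acceptable way to merge the two spellings.

```python
from collections import Counter

# Synthetic records that reuse the counts from the Cities example.
records = ["Provo"] * 39 + ["PROVO"] * 67

# Before recoding: the observations are split across two categories.
before = Counter(records)

# After recoding to a single capitalization: one category holds every record.
after = Counter(name.title() for name in records)

print(before["Provo"], before["PROVO"])
print(after["Provo"])
```

After the merge, a single category carries all 106 observations instead of two categories carrying 39 and 67.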
To avoid these problems, it is best to recode disparate names so that they become a single name. There are many ways to do this, and JMP provides an easy way to handle this type of problem. The following video shows how to do this in JMP.
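For readers working outside of JMP, the same recoding idea can be sketched in Python. This is only an illustration under stated assumptions, not the JMP procedure: the `RECODE` table, its entries, and the `recode` function are hypothetical. It assumes that capitalization and stray whitespace can be normalized automatically, while abbreviations need an explicit lookup.

```python
# Hypothetical recode table: maps a normalized variant to its canonical name.
RECODE = {
    "slc": "Salt Lake City",
    "salt lake": "Salt Lake City",
}

def recode(value, mapping=RECODE):
    """Return a single canonical name for a free-form category value."""
    # Collapse repeated whitespace and ignore case before looking up the value.
    key = " ".join(value.split()).casefold()
    if key in mapping:
        return mapping[key]
    # No explicit rule: fall back to capitalizing the first letter of each word.
    return key.title()

raw = ["PROVO", "Provo", " provo ", "SLC", "Salt  Lake City"]
cleaned = [recode(value) for value in raw]
print(cleaned)
```

The title-case fallback is a simplification. Names with internal capitals, such as "McAllen", would be changed to "Mcallen", so they would need their own entries in the lookup table.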