Handle missing values

In this section, we describe the process of handling missing values. Data preparation includes handling missing values. Missing values occur when one or more records are missing one or more attributes (fields). Most data mining software will not run if the dataset has missing values. This conservative policy is warranted because software that is trying to estimate the effects of variables can be confused and produce calculation errors when it encounters missing values.

Sometimes the data that is collected is returned with missing values. Having missing values in our data can often time prohibit us from creating our models. The video below shows how to easily handle missing values using Microsoft Azure Machine Learning Studio.

Common strategies for dealing data include ommitting records that are not complete, omitting columns if many records lack that column of data, and subtituting or estimating values for the missing attribute in records that are missing that attribute.

Omission (most common strategy)

If there are plenty of records in the dataset and only a few records have missing values, you can omit the records with missing values with little or no impact on the outcome of the models. However, if there are many records that are missing values on certain a certain variable(s) then you would want to drop those variables. The figures below show an example of omitting records with missing values on the homes dataset.

Figure 22.1: Omission pills and settings
Figure 22.2: Before omission
Figure 22.3: After omission

Sometimes eliminating records or variables is not practical when data is scarce or is expensive. Make sure you know your data and its limitation.

Substitution

Another way of handling missing values in the dataset is to replace missing values with reasonable substitutes. This helps you keep the record and use the rest of its (non-missing) information.

There are several ways you can do this substitution. One way you can do this is substituting missing values on a variable with a specific replacement value. Figure 3.7 below displays an example of how this can be done.

Figure 22.4: Substituting with a specific replacement value

Another way of substitution is by the data type of the variable. If it is a numerical variable you can substitute its missing values with its mean or median. If it is a categorical variable you can replace its missing values the mode. Figures 3.8-3.10 displays an example of replacing the missing values of the bedroom variable with its mean.

Figure 22.5: Substituting missing values with mean pills and settings
Figure 22.6: Before substituting missing values with mean
Figure 22.7: After substituting missing values with mean

Notice that the overall mean of the dataset changed when we replaced the missing values with the mean. Instead of replacing the missing values of the bedrooms with the mean 3.7165 Microsoft Azure Studio rounded it up. Once you think about it, rounding it up to a whole number makes more sense because homes don’t usually have 3.7165 bedrooms.

Converting file to CSV

If you want to convert your final output of your experiment to a CSV search for the "Convert to CSV" pill and connect your last Clean Missing Data pill to your Convert to CSV pill. After you run the experiment right click on that pill and click on Results dataset-->Download.

Figure 22.8: Convert to CSV Pill