2.3 Understand and Prepare the Data
Understand the Data
One of the responsibilities of an analyst is to know and understand the data they are working with. Analysts must understand the variable types and what the values of each variable in the dataset represent. This takes time, but it is essential. Look at each attribute in your data and make sure you understand what it contains and what it means. The figure below displays some records from the homes dataset.
It can be tempting to run data through learning algorithms before you understand what the data represents to see if you can produce a good outcome. Before you perform modeling, you should take the time to understand what the data represents. This can save you from making the following mistakes.
-
Non-representative data. Using data that does not represent the population.
-
Non-predictive input data. Some input columns are likely to have predictive value and some are not. Is the data adequate to capture the relationships that you think will let you obtain the desired predictions? What other data might help you that is not there? Where might you be able to find it? If you bypass this step, you may miss the opportunity to collect additional useful information.
-
Incomplete records or fields. Many machine learning environments will not run when some of the data is missing.
-
Data errors or miscoded data. Learning algorithms get confused by bogus inputs. This can pollute your model.
-
Observations that are outliers. The real estate listings in an area likely contain very cheap and very elaborate, expensive houses. If we include these listings along with listings for more typical houses, it will likely reduce the quality of the model intended to be used for more typical houses.
-
Numbers that represent categories. Algorithms use categorical and numeric values differently. So if a number represents a category (a zip code, for example), it should be converted to a categorical type so that the algorithm does not treat it as a quantity.
-
Related inputs. Ideally, inputs are independent of other inputs. Otherwise, they contain partially overlapping information. Collinearity is when two numeric inputs that help predict the outcome partially overlap. For example, age and height are related, and both can be used to predict weight, so age and height are not independent. Categorical dependencies are when categories depend on other categories. For instance, city, state, and zip code are related. If you know the city, you know the state and you know that the zip code is one of a finite number of specific zip codes. So including related dependent categories might confuse a modeling algorithm.
-
Inputs derived from the output. Assume you are trying to predict housing price (the outcome variable). Inputs could include square footage and price per square foot. But you have to know the outcome variable (price) to calculate price per square foot. So using price per square foot to predict price is inappropriate.
-
Do not use an output as an input. Sometimes multiple related outputs are in the data. For example, net income (NI), earnings before income taxes (EBIT), and earnings per share (EPS) are all indicators of profitability. It is easy to predict net income from EBIT or EPS. So if you are trying to predict income, don't use one or more measures of income to predict another measure of income. That is cheating because it is using the answer to predict the answer.
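Several of the checks in the list above can be sketched in a few lines of code. The example below is a minimal illustration in plain Python, not a prescribed procedure. The records, the column names (price, sqft, bedrooms, zip_code), and the helper functions are all hypothetical, and the 1.5 x IQR rule used to flag outliers is one common convention chosen here as an assumption.

```python
from statistics import mean, quantiles

# Hypothetical records; column names and values are invented for illustration.
homes = [
    {"price": 250_000, "sqft": 1500, "bedrooms": 3, "zip_code": 84601},
    {"price": 310_000, "sqft": 1900, "bedrooms": 4, "zip_code": 84604},
    {"price": 275_000, "sqft": 1650, "bedrooms": 3, "zip_code": 84601},
    {"price": 2_900_000, "sqft": 7200, "bedrooms": 7, "zip_code": 84604},
    {"price": 295_000, "sqft": None, "bedrooms": 3, "zip_code": 84606},
    {"price": 330_000, "sqft": 2100, "bedrooms": 4, "zip_code": 84606},
    {"price": 265_000, "sqft": 1600, "bedrooms": 3, "zip_code": 84601},
    {"price": 340_000, "sqft": 2200, "bedrooms": 4, "zip_code": 84604},
]


def incomplete_rows(records):
    """Return the indices of records that have at least one missing field."""
    return [i for i, rec in enumerate(records)
            if any(value is None for value in rec.values())]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Incomplete records: find rows with missing values.
missing = incomplete_rows(homes)

# Outliers: flag prices far outside the typical range.
outliers = iqr_outliers([rec["price"] for rec in homes])

# Numbers that represent categories: treat zip codes as labels, not quantities.
zip_labels = sorted({str(rec["zip_code"]) for rec in homes})

# Related inputs: measure how strongly two numeric inputs overlap.
complete = [rec for rec in homes if rec["sqft"] is not None]
corr = pearson([rec["sqft"] for rec in complete],
               [rec["bedrooms"] for rec in complete])

print("rows with missing fields:", missing)
print("price outliers:", outliers)
print("zip code categories:", zip_labels)
print("correlation of sqft and bedrooms:", round(corr, 2))
```

A high correlation between two inputs, as between sqft and bedrooms in this made-up sample, suggests they carry overlapping information. What to do about flagged rows (drop, correct, or model separately) remains a judgment call for the analyst.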
Transform Data to Extract Information
Sometimes the data has the potential to reveal more information, but it needs to be converted to make that information available. For example, price and square footage are attributes in the homes dataset. This is useful information. But by dividing price by square feet you can find price per square foot, which reveals whether a home is expensive or cheap for its size. This can reveal important information not otherwise obvious in the data. For example, some homes may be small but are made of very high quality materials and have extra nice wall coverings and floors. Others are large but are poorly constructed and made from cheap materials. Price and size by themselves do not tell this story, but price per square foot does. This produces a valuable insight. Not only can the analyst see price per size, but it can also lead the analyst to ask what other data could be collected that might make the model even better. For example, perhaps data could be collected on the quality of construction and the quality of flooring and wall coverings. A new model created with this added information would probably better predict price.
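The price-per-square-foot transformation can be sketched as follows. This is a minimal example under stated assumptions: the two records and the field names (id, price, sqft, price_per_sqft) are hypothetical, as is the helper function. The derived column is meant for exploring the data. As noted earlier, it should not be fed back in as an input when price is the outcome being predicted.

```python
def add_price_per_sqft(records):
    """Return copies of the records with a derived price_per_sqft field."""
    enriched = []
    for rec in records:
        row = dict(rec)  # copy so the original record is left unchanged
        row["price_per_sqft"] = round(rec["price"] / rec["sqft"], 2)
        enriched.append(row)
    return enriched


# Hypothetical homes with the same price but very different sizes.
homes = [
    {"id": "A", "price": 300_000, "sqft": 1_000},  # small, expensive for its size
    {"id": "B", "price": 300_000, "sqft": 3_000},  # large, cheap for its size
]

enriched = add_price_per_sqft(homes)
for row in enriched:
    print(row["id"], row["price_per_sqft"])
```

Price alone cannot distinguish the two homes, while the derived value separates them clearly (300.0 versus 100.0 per square foot in this made-up data).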
Partition the Data
When we create a model, we randomly partition the data into subsamples. The learning algorithm learns on one subsample (the training data), and the model is then tested on records it has not seen (the validation data). This lets us see how much overfitting occurred while training the model. That is, did the model treat random noise in the training data as the true pattern that it should model? You can tell that overfitting occurred when the model quality indicators are noticeably higher for the training data than for the validation data.
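A random partition and the training-versus-validation comparison can be sketched as below. Everything in the example is an assumption made for illustration: the 70/30 split, the fixed seeds, the synthetic data, and the deliberately overfit "memorizer" model, which simply looks up the training answers and falls back to the training mean for unseen records. Quality is measured here as mean absolute error, so better quality on the training data shows up as a lower error there.

```python
import random


def partition(records, validation_fraction=0.3, seed=42):
    """Shuffle a copy of the records and split it into (training, validation)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


def mean_abs_error(model, records):
    """Average absolute difference between predictions and actual outcomes."""
    return sum(abs(model(x) - y) for x, y in records) / len(records)


# Synthetic (input, outcome) pairs: a linear pattern plus small random noise.
noise = random.Random(0)
data = [(x, 100 * x + noise.randint(-20, 20)) for x in range(10, 30)]

train, valid = partition(data)

# A deliberately overfit model: it memorizes every training record,
# noise included, and guesses the training mean for anything unseen.
lookup = dict(train)
fallback = sum(y for _, y in train) / len(train)


def memorizer(x):
    return lookup.get(x, fallback)


train_error = mean_abs_error(memorizer, train)
valid_error = mean_abs_error(memorizer, valid)

print("training records:", len(train), "validation records:", len(valid))
print("training error:", train_error)
print("validation error:", round(valid_error, 1))
```

The memorizer scores perfectly on the training data and poorly on the validation data. That gap is the signature of overfitting described above, and it would go unnoticed without the held-out subsample.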