7.2 Understanding the data
It is important to understand the data you have available. Do you have measures of an abstract construct? Do you have more than one measure of the same construct? If so, you should combine them in some way. Are your measures of more concrete, observable phenomena like height or age? Are the measures that you have valid and adequate indicators of the constructs or variables you are trying to assess?
What constructs and variables likely influence other variables? Consider an example of building a model to predict diabetes.
Understand the Data
Some of the columns in the data may represent easily understood attributes like age in years, gender, or height. Other columns may be indicators or measures of more abstract constructs. You should determine which columns represent concrete variables and which are measures of constructs. Below is the data description for a diabetes problem. The goal with this dataset is to predict which individuals will have diabetes. So what is being described by the data columns? The answer requires some domain knowledge.
Name | Description |
---|---|
Diabetes | Has diabetes (Y or N) |
Pregnancy | Number of times pregnant |
Glucose | Plasma glucose concentration 2 hours after ingesting glucose |
Insulin | 2-Hour serum insulin (mu U/ml) |
Diastolic | Diastolic blood pressure (mm Hg) |
BMI | Body mass index (weight in kg/(height in m)^2) |
Skinfold | Triceps skin fold thickness (mm) |
FamHist | Diabetes pedigree function |
Age | Age in years |
Four measures of constructs are included in the data. There are two measures for tests of diabetes and two measures of weight.
Tests for diabetes
Glucose. Glucose tests are used to diagnose diabetes. Two hours after a person drinks oral glucose, blood is drawn and the level of blood glucose is measured. Diabetics have higher blood glucose than non-diabetics because diabetes reduces the ability of a person to absorb glucose into their cells, so blood glucose remains high.
Insulin. Insulin is the chemical in the body that signals cells to absorb glucose. Blood insulin level depends on age, weight, gender, and many things, so there is no standard amount of insulin. People with diabetes are insulin resistant, so the body produces higher levels of insulin to help the body absorb sugar.
Measures of weight
BMI. The Body Mass Index is a standardized measure of whether a person has a healthy weight or is underweight or overweight. It takes into account both weight and height so that the index works for people of all heights. Thresholds exist for BMI, with levels for underweight, normal, overweight, and obese. BMI is more useful than a simple measure of weight alone, such as pounds or kilograms, because it indicates healthy weight, which is weight in relation to height.
Skinfold. Skinfold measures are taken with calipers to measure loose flesh. Thin people have less fat and therefore smaller skinfold measures than heavier people.
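The BMI definition in the data description (weight in kg divided by the square of height in m) can be sketched in a few lines. This is a minimal illustration, not part of the dataset: the function names are hypothetical, and the cut-offs of 18.5, 25, and 30 are the commonly cited WHO adult thresholds, which the text does not state and which are assumed here.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kg divided by height in m, squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Map a BMI value to a level name."""
    # Assumed cut-offs (commonly cited WHO adult thresholds), not from the text.
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"


# The same weight at two different heights gives two different indexes,
# which is why BMI says more about healthy weight than weight alone.
for height in (1.60, 1.90):
    value = bmi(70.0, height)
    print(f"70 kg at {height:.2f} m -> BMI {value:.1f} ({bmi_category(value)})")
```

In this made-up example, 70 kg falls in the overweight band at 1.60 m but in the normal band at 1.90 m.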
Conceptual Maps
With the above data description in mind, we can construct a conceptual map. A conceptual map is a useful way to help you make sense of your problem and clarify what is represented by the data columns in a dataset.
A conceptual map includes all of the data columns in the dataset. For example, the conceptual map below reflects all columns in the Diabetes dataset.
A conceptual map helps you determine what the measurements represent in the data and what inputs are expected to help predict the outcome variable.
Determine what each measure represents
You should determine what each measure (column) in your data represents. Do the columns in your data represent measures of simple concrete variables or are they measures of constructs?
Simple measures of variables. A solid box represents a simple measure of a variable. When a measure directly assesses a concrete variable, like age or number of pregnancies, it can be considered a simple measure.
Constructs. Constructs are represented by a cloud symbol. It is impossible to directly measure constructs like health or happiness because these are abstract concepts. Therefore, we use measures that are intended to reflect the construct.
Measures of constructs. Boxes with dashed lines represent a measure of a construct. Measures are connected to the construct that they measure by lines with no arrowheads. The lack of an arrowhead signifies that the measures do not cause the construct. Rather, they reflect the construct. In the diabetes example, skinfold and BMI are two measures of healthy weight. Glucose and insulin are the two measures for diabetes. In fact, people with diabetes routinely check their blood glucose level to see how well they are managing their diabetes. High blood sugar levels suggest that they are not absorbing blood sugar well. High glucose levels are also diagnostic indicators of diabetes. Since diabetics are insulin resistant, the body produces more insulin to promote absorption of blood glucose into the cells.
In this example, notice that there is a solid rectangular box labeled diabetes connected to the diabetes construct with a simple line. In the diabetes problem, whether a person had diabetes was determined from blood insulin and blood glucose levels. If both were high, an expert designated that the person has diabetes (diabetes = yes). So the column named "Diabetes" is not a measure of diabetes. It is the outcome variable that records whether a person was determined to be a diabetic. Since it is an indicator of diabetic status, and not a measure of diabetes, it has a solid rectangle instead of a dashed rectangle.
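One way to make the three node types concrete is to record them in a small data structure. The sketch below is only an illustration: the class and field names are hypothetical, and because the map itself is not reproduced here, treating Pregnancy, Age, Diastolic, and FamHist as simple measures is an assumption based on the statement that only four columns are measures of constructs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeKind(Enum):
    SIMPLE_MEASURE = "solid box"       # directly observed variable
    CONSTRUCT = "cloud"                # abstract concept, not directly measurable
    CONSTRUCT_MEASURE = "dashed box"   # indicator that reflects a construct


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind
    # Construct this node is joined to by a plain line (no arrowhead), if any.
    linked_construct: Optional[str] = None


DIABETES = "Diabetes (construct)"
HEALTHY_WEIGHT = "Healthy weight (construct)"

NODES = [
    Node(DIABETES, NodeKind.CONSTRUCT),
    Node(HEALTHY_WEIGHT, NodeKind.CONSTRUCT),
    Node("Glucose", NodeKind.CONSTRUCT_MEASURE, DIABETES),
    Node("Insulin", NodeKind.CONSTRUCT_MEASURE, DIABETES),
    Node("BMI", NodeKind.CONSTRUCT_MEASURE, HEALTHY_WEIGHT),
    Node("Skinfold", NodeKind.CONSTRUCT_MEASURE, HEALTHY_WEIGHT),
    # The outcome column is a solid box: an expert designation, not a measure.
    Node("Diabetes", NodeKind.SIMPLE_MEASURE, DIABETES),
    # Assumed to be simple measures (the figure is not shown in the text).
    Node("Pregnancy", NodeKind.SIMPLE_MEASURE),
    Node("Age", NodeKind.SIMPLE_MEASURE),
    Node("Diastolic", NodeKind.SIMPLE_MEASURE),
    Node("FamHist", NodeKind.SIMPLE_MEASURE),
]


def measures_of(construct: str) -> list[str]:
    """Return the dashed-box columns that reflect the given construct."""
    return [
        node.name
        for node in NODES
        if node.kind is NodeKind.CONSTRUCT_MEASURE
        and node.linked_construct == construct
    ]


print(measures_of(DIABETES))
print(measures_of(HEALTHY_WEIGHT))
```

Writing the map down this way forces a decision about every column, which is the point of the exercise.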
What contributes to the outcome
Contribution Relationships. The second important aspect of a conceptual map is the designation of which variables are hypothesized to contribute, positively or negatively, to the outcome variable. Contribution arrows are included in the conceptual map to represent the hypothesized relationships between constructs and variables. In this example, pregnancy, age, heredity, healthy weight, and blood pressure are hypothesized to contribute to diabetes.
Such a map can help you think through what variables might “cause” or influence other variables. These hypothesized relationships are based on domain knowledge drawn from past research on contributors to diabetes. Domain knowledge is necessary to construct such a map. If you lack such domain knowledge you may need to learn more about the domain or consult a domain expert.
It is reasonable to assume that diabetes does not contribute to pregnancies, age, and whether someone has parents or relatives with diabetes. It is more logical to conclude that these factors may contribute to diabetes.
Diagnostic Tests for the Outcome Variable. Since glucose and insulin levels are diagnostic tests for the diabetes construct, they should not be used to predict diabetes. Doing so is like running a test to determine whether a person has cancer and then using that test result to predict whether the person has cancer. It is more useful to determine whether we can predict diabetes from other possible contributors, without the direct diagnostic tests. Then, for informational purposes, we could add the diagnostic tests to the model to see how much predictive ability changes, but we should explain in our results that these measures are indicators of diabetes, not hypothesized contributors to diabetes.
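The distinction between hypothesized contributors and diagnostic tests can be written down as two candidate input lists. This is a sketch under assumptions: the mapping of "heredity" to the FamHist column and "blood pressure" to the Diastolic column is inferred from the data description, and the function names are hypothetical.

```python
OUTCOME = "Diabetes"

# Hypothesized contribution arrows into the outcome, as listed in the text.
# Each contributor is expanded to the column(s) that represent it.
# Assumption: heredity -> FamHist, blood pressure -> Diastolic.
CONTRIBUTORS = {
    "pregnancy": ["Pregnancy"],
    "age": ["Age"],
    "heredity": ["FamHist"],
    "healthy weight": ["BMI", "Skinfold"],
    "blood pressure": ["Diastolic"],
}

# Columns that are diagnostic tests of the outcome construct.
DIAGNOSTIC_TESTS = ["Glucose", "Insulin"]


def primary_predictors() -> list[str]:
    """Inputs for the main model: hypothesized contributors only."""
    return [col for cols in CONTRIBUTORS.values() for col in cols]


def secondary_predictors() -> list[str]:
    """Inputs for an informational model that also includes the tests."""
    return primary_predictors() + DIAGNOSTIC_TESTS


print("primary:  ", primary_predictors())
print("secondary:", secondary_predictors())
```

Keeping the two lists separate makes it easy to report how much predictive ability changes when the diagnostic tests are added, and to state clearly which model includes them.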
Other Inappropriate Predictors
Diagnostic tests of outcome variables
It is inappropriate to use the result of a diagnostic test of an outcome to predict the value of the outcome variable.
Consider the example earlier in this chapter where we were trying to predict diabetes. In the data, the outcome variable was whether a person had diabetes (Yes/No). How was this status determined? A medical doctor used two other values in the data about each patient to determine whether the patient had diabetes: the levels of glucose and insulin in the patient's blood. These are in fact diagnostic tests of diabetes. Since they were used to determine the outcome variable, they should not be used to predict diabetes. Other variables like family history of diabetes and patient BMI would be appropriate predictors of diabetes.
Consider another example. Suppose you were trying to predict wear levels of an automobile transmission. How do transmission experts determine the level of wear? One way is to measure how much thinning has occurred on the intermeshing teeth of the gears in the transmission. Experts can then translate that thinning into a percentage of wear. Simply put, thinning is used to diagnose the percentage of wear. It is a diagnostic test of wear, so thinning should not be used as an input variable. Other input variables, like the number of miles the vehicle was driven and how often the transmission fluid was changed, would be appropriate predictors since they are not diagnostic tests of wear.
Predicting one measure of Y with another measure of Y
You sometimes have multiple measures of an outcome construct in a dataset. When this happens, you should not use one measure of the outcome variable to predict another measure of the outcome variable. This type of mistake is referred to as circular prediction. It is the inappropriate use of an outcome measure to predict a different measure of the outcome variable.
For example, assume you are trying to create a model to predict business income. In the data, you hopefully have input variables that can help predict business income. However, you might also have multiple measures of income such as net income, earnings before interest and taxes (EBIT), gross income, and earnings per share (EPS). Why is this a problem? These are all just different measures of the construct income. Net income and EBIT are so highly correlated that if you predict net income from EBIT or EBIT from net income, your model will be extremely predictive. But this is inappropriate because you are predicting an answer with the answer. EPS is a measure of income derived from net income: net income is divided by the number of shares of stock to produce EPS. Therefore, it would be inappropriate to predict net income from EPS or EPS from net income. You should not predict income with another measure of income.
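A small numeric illustration shows why predicting one income measure from another is circular. The figures below are made up for a hypothetical company, and the share count is held constant as a simplifying assumption. Under that assumption EPS is just net income rescaled, so the two series move together exactly.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up net income for one hypothetical company over five years.
net_income = [1_200_000.0, 800_000.0, 1_500_000.0, 2_100_000.0, 1_700_000.0]
shares_outstanding = 500_000  # assumed constant for the illustration

# EPS is derived from net income: net income divided by number of shares.
eps = [income / shares_outstanding for income in net_income]

r = pearson(eps, net_income)
print(f"correlation between EPS and net income: {r:.3f}")
```

A model that "predicts" net income from EPS here would look nearly perfect, but it would only be recovering the division that produced EPS in the first place.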
Values not known at initiation
The values of some variables result from the same attributes that contribute to the outcome. These values are not known until the value of the outcome variable is known, so they should not be used to predict the outcome. For example, assume one is trying to predict, at the time a video is posted, the number of YouTube likes it will receive. Some attributes would be known at the time a video is posted, such as who the author of the video is, the number of subscribers the author has, and the theme of the video. These would be considered valid candidate predictors for the stated purpose because they are known when the video is first posted. But other values are not known until later, like the number of comments posted in response to a video. The number of likes and the number of comments both depend on factors such as video quality and the number of subscribers. Since the number of comments is not known when the video is posted, it should not be used to predict the number of likes.
This discussion emphasizes how important it is to think through what constructs are represented by measures in the dataset. Are there multiple measures of the outcome variable in the data? Does the data contain diagnostic tests of the outcome variable? Are there variables that result from the same causes as the outcome variable and that are not known until the outcome variable is known? You must answer these questions to know which variables are appropriate as inputs and which ones are inappropriate. Constructing a conceptual map of the data columns can help you get clear on what the data columns represent. This can help you avoid building inappropriate models and drawing invalid inferences from a prediction model.
If you have both appropriate and inappropriate variables, your main modeling should be done using the appropriate input variables. You have the option of creating a secondary model that includes inappropriate variables. For the diabetes example shown in this chapter, you would evaluate the appropriate input variables as potential predictors and get the best model possible. Then, you could build a secondary model that also includes the glucose and insulin levels, but you would need to disclose that the secondary model includes input variables that are diagnostic tests of diabetes and that, because of this, the model is artificially deterministic.
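The rule about values not known at initiation can be applied mechanically once each candidate column is tagged with whether it is known when the prediction is made. The sketch below uses the video example; the column names and the `known_at_posting` flag are hypothetical labels chosen for illustration, not fields from any real dataset or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    known_at_posting: bool  # is the value available when the video is posted?


# Candidate inputs for predicting the number of likes (names are illustrative).
CANDIDATES = [
    Column("author", known_at_posting=True),
    Column("subscriber_count", known_at_posting=True),
    Column("theme", known_at_posting=True),
    # Comments accumulate after posting, alongside likes.
    Column("comment_count", known_at_posting=False),
]


def usable_predictors(columns: list[Column]) -> list[str]:
    """Keep only columns whose values exist at the time of prediction."""
    return [col.name for col in columns if col.known_at_posting]


print(usable_predictors(CANDIDATES))
```

The tagging step is where the domain thinking happens; the filter itself only enforces the decision.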