Roles of Variables and Data Types

Roles Variables Play

A variable is any characteristic, number, or quantity that can be identified, measured, or counted. Age, gender, business income and expenses, country of birth, capital expenditure, class grades, eye color, and vehicle type are examples of variables. It is called a variable because its value may differ across instances in a population and may change over time.

In data science, the two main roles that variables play in terms of predictive modeling are referred to as outcome variables and input variables. The outcome variable is what is being predicted. Another common name for outcome variable is dependent variable.

Input variables are variables that are being evaluated to determine whether they are useful predictors. Those that provide predictive value can be referred to as predictor variables. Another common name for input variables is independent variables.

Figure 2.5: Roles of Variables

We can use the home data table (see near the bottom of this section) as an example to identify the outcome and input variables. Let’s say you are trying to predict the price of a house by using the bathrooms, bedrooms, propertyType, and qFeet variables. These four variables are the input variables. They will be evaluated to determine whether they have predictive value that can be used to help predict the outcome variable, price.
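As a minimal sketch of this split, the snippet below separates a small table into input variables and an outcome variable. The rows and their values are invented for illustration and are not taken from the home data table; only the column names follow the text above.

```python
# Hypothetical rows; the values are invented for illustration only.
homes = [
    {"price": 250000, "bathrooms": 2, "bedrooms": 3, "propertyType": "Condo", "qFeet": 1400},
    {"price": 410000, "bathrooms": 3, "bedrooms": 4, "propertyType": "House", "qFeet": 2300},
]

outcome_name = "price"
input_names = ["bathrooms", "bedrooms", "propertyType", "qFeet"]

# Separate each row into its input values and its outcome value.
inputs = [{name: row[name] for name in input_names} for row in homes]
outcomes = [row[outcome_name] for row in homes]

print(inputs[0])
print(outcomes)
```

Changing `outcome_name` and `input_names` is all it takes to give the same columns different roles in another analysis.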

The role the variable plays determines whether it is a predictor or outcome variable. In one analysis, a variable might be an outcome variable. In another analysis, it might be a predictor variable.

Data Types of Variables

Each variable is represented by a data type. Different modeling methods are designed to handle only outcome variables or input variables of certain data types.

Regarding outcome variables, some modeling methods can predict categories. Other methods can only predict continuous or discrete numeric values. Still others can predict both categories and numeric outcomes.

Data type is also important to input variables. Some modeling methods can handle inputs of multiple data types. Others cannot. Sometimes we must convert variables from one data type to another to be able to include them as inputs for certain algorithms. For example, Naive Bayes (NB), in its basic categorical form, works only with categorical input variables. Numeric inputs must be binned into categories for the method to work. Multiple linear regression (MLR) is primarily designed to handle numeric input variables. For the MLR method to be able to handle categorical inputs, such as male/female, the categorical inputs have to be translated to numeric dummy variables.
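The two conversions mentioned above can be sketched in a few lines. This is only an illustration: the function names `bin_age` and `dummy_code`, the age cut points, and the sample values are all assumptions made for this example. Omitting one reference level when dummy coding is a common convention for regression, not something the text prescribes.

```python
def bin_age(age):
    """Bin a numeric age into a category label (cut points are arbitrary)."""
    if age < 30:
        return "under_30"
    if age < 60:
        return "30_to_59"
    return "60_plus"


def dummy_code(values, reference):
    """Turn a categorical column into 0/1 dummy columns, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Numeric -> categorical (for example, for a categorical-only method)
ages = [22, 45, 67]
print([bin_age(a) for a in ages])  # ['under_30', '30_to_59', '60_plus']

# Categorical -> numeric dummy variable (for example, for regression)
sexes = ["male", "female", "female"]
print(dummy_code(sexes, reference="male"))  # {'female': [0, 1, 1]}
```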

The following data types are the ones you are most likely to encounter:

Figure 2.6: Data Types of Variables

Numeric Variables

Numeric variables have values that describe a measurable quantity as a number, like "how many" or "how much." Therefore, numeric variables are quantitative variables.

Continuous Variables

A continuous variable is a numeric variable where values are in a range of real numbers. The value given to an observation for a continuous variable can include values as small as the instrument of measurement allows. Examples of continuous variables include height, time, age, and temperature.

Discrete Variables

A discrete variable is a numeric variable represented by a distinct set of whole values, that is, integers. Examples of discrete variables include the number of registered cars, number of business locations, and number of children in a family, all of which are measured as whole units (e.g., 1, 2, 3).

Likert scales are integer values along a continuum; thus, they are discrete numeric variables. They are commonly found in a variety of minimum and maximum value ranges such as one through five or one through seven.

Figure 2.7: Likert scale

Values within discrete variables can be used as categorical variables but are also commonly used to approximate continuous variables. As an example of using these as a category, a business may want to look at its most satisfied or dissatisfied customers. Or a business may want to examine customers with 2 cars. Thus, the number represents a category of people who have the same satisfaction level or the same number of cars.

It is also common for discrete variables like Likert scales to be used as numbers along a continuum, where it can be assumed that the same distance exists between each level of the scale. That is, the same distance exists between a 1 and a 2 as exists between a 2 and a 3, and so forth. This approximation of continuous variables means that it can be appropriate to calculate averages. For example, it is common to calculate average customer satisfaction. Likewise, it is common to calculate the average number of cars that a typical family has. As such, the value might be 2.5 cars.
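Both uses of a discrete variable can be shown with a few lines of standard-library Python. The responses and car counts below are invented for illustration.

```python
from collections import Counter
from statistics import mean

satisfaction = [5, 4, 2, 5, 1, 4]  # hypothetical 1-5 Likert responses
cars = [2, 3, 2, 3]                # hypothetical cars per family

# Used as categories: count how many respondents fall in each level.
counts = Counter(satisfaction)
print(counts[5])  # 2 respondents gave the top rating

# Used as an approximation of continuous values: compute averages.
print(mean(satisfaction))  # 3.5
print(mean(cars))          # 2.5
```

The average of 2.5 cars is not a value any single family can have, which is exactly what treating a discrete variable as approximately continuous implies.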

Categorical Variables

Categorical variables contain a finite number of categories or distinct groups, which might not have a logical order. Examples of categorical predictors include gender, material type, and payment method. In Figure 2.8 below, the columns highlighted in green are categorical variables.

Categorical variables may be further described as ordinal or nominal:

Ordinal Variables

An ordinal variable is a categorical variable for which the possible values are ordered. Ordinal variables can be considered "in between" categorical and quantitative variables. The categories associated with ordinal variables can be ranked higher or lower than one another, but they do not establish a consistent numeric difference between each category. For example, someone could be asked to rank their preferences among types of food, with the ordered preferences recorded as integers. With ordinal variables, you cannot assume that the same distance exists between the ordered values. For example, a person's favorite food might be preferred only slightly more than their second favorite food, but there may be a large difference between their second and third favorite foods. For this reason, it is not appropriate to calculate the mean of ordinal variables.

Another example of an ordinal variable is education level. Education level might be categorized as:

  1. Elementary school

  2. High school graduate

  3. Some college

  4. College graduate

  5. Graduate degree

In this example, as with other ordinal variables, the quantitative differences between the categories are uneven, even though the differences between the labels are the same (e.g., the difference between 1 and 2 is several years of education, whereas the difference between 2 and 3 could be anything from part of a year to several years).
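As a rough sketch, the education levels above can be stored as an ordered list so that comparisons use position rather than arithmetic. The `rank` helper and the respondent values are assumptions made for this example. Using the middle (median) level as a summary is a common alternative to the mean for ordinal data, though the text above does not prescribe it.

```python
EDUCATION_LEVELS = [
    "Elementary school",
    "High school graduate",
    "Some college",
    "College graduate",
    "Graduate degree",
]


def rank(level):
    """Return the 1-based position of a level in the ordering."""
    return EDUCATION_LEVELS.index(level) + 1


respondents = ["Some college", "Graduate degree", "High school graduate"]

# Ordering comparisons are meaningful for ordinal data.
print(rank("College graduate") > rank("Some college"))  # True

# Sorting by rank and taking the middle value gives a median level.
ordered = sorted(respondents, key=rank)
print(ordered[len(ordered) // 2])  # Some college
```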

Nominal Variables

A nominal variable is a categorical variable whose values cannot be organized in a logical sequence. Examples of nominal categorical variables include sex, business type, eye color, religion, and brand.

In the image below, the green columns are nominal categorical variables. The number of bedrooms and bathrooms are discrete numeric variables. The remaining numeric variables are continuous.

Figure 2.8: Highlighted variables of homes dataset

Categories Pretending to Be Numbers

Data mining algorithms process numbers and non-numbers differently. Thus, a number that is actually a category should be converted to textual values so that it will be processed correctly by data mining software. Otherwise, incorrect results will be obtained.

Sometimes a number is used as a name for a category. An example of this can be found with departments represented as [1, 2, 3, 4, 5], where each department is represented by a specific number. In effect, the number is acting as a name rather than a quantity or measure.

Ask yourself these questions if you think a number may be a name of something instead of an actual count or measure of something:

  1. Is this number a count or measure of something? If so, it is probably a valid number and should remain a number.

  2. Are these numbers acting as names of items within a category? For example, consider an attribute named "diabetes type" that contains integer values as [1, 2]. These numbers are really names for "Type 1" and "Type 2" diabetes. Moreover, a person with Type 2 diabetes does not have twice the amount of diabetes as a person with Type 1 diabetes. They are just different forms of diabetes.

  3. Is it illogical to do math on these numbers? For example, it is illogical to take the average of the diabetes numbers (names) because averaging names does not produce a value that has meaning. Likewise, averaging department numbers does not make logical sense.

If your review indicates that the values within a data column are names rather than valid numbers, you should convert the numbers into a textual form so the data mining software can recognize the values as categories. For example, department numbers that are actually department names could be converted from [1, 2, 3, 4, 5] into [Dept1, Dept2, Dept3, Dept4, Dept5]. Alternatively, some data mining tools (including JMP) will default numbers to a numeric data type but will let you designate the numeric values as a category (nominal variable).
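The department conversion described above might look like the following in Python. The sample codes are invented, and the `Dept` prefix simply mirrors the example in the text.

```python
# Hypothetical department codes as they might appear in a data column.
department_codes = [1, 2, 3, 4, 5, 2]

# Prefix each code so the values are text labels, not numbers.
department_names = [f"Dept{code}" for code in department_codes]

print(department_names)
# ['Dept1', 'Dept2', 'Dept3', 'Dept4', 'Dept5', 'Dept2']
```

Once the values are text, software can no longer average or otherwise do arithmetic on them by accident.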