9.1 Linear Regression

The analyses we have performed so far only estimate the relationship between pairs of variables. [Side note: even though there are several categories of education, we were still only examining two variables: education and income.] However, that is only a start. Variables don't exist "in a vacuum", or in isolation from other variables. There are typically many variables involved in explaining a phenomenon. In other words, we need to move beyond bivariate statistics to multivariate statistics. For this chapter, we are going to continue using the same Bike Buyers data from the prior chapter. If you don't have the workbook anymore, you can download it again here:

As we move beyond pairs of variables, it's also time to designate variables as dependent (y) versus independent (x). In particular, we want to designate one (and only one) variable as the dependent (a.k.a. "y" or "label") variable, which is the variable that we want to explain or predict. Next, we need to designate one or more variables as the independent (a.k.a. "x" or "feature") variables, which are the variables we will use to explain or predict the dependent variable.

For example, the dependent Y variable is typically something that is valuable to predict, like whether or not someone will purchase one of our bikes (see the variable PurchaseBike: 0 = no, 1 = yes in the data set provided). There will only be one dependent variable in our examples. The independent X variable(s) are those which would theoretically predict the dependent variable. That is the entire purpose of all of the other variables in the Bike Buyers data set. For example, people with more income may be more likely to purchase bikes. People who live closer to work may be more likely to purchase bikes (for commuting). Rather than examine how one independent variable at a time relates to the dependent variable (e.g. a Pearson correlation coefficient), we want to know what the combined effect of all independent variables together is on the dependent variable. To accomplish this, we need to move beyond the Pearson correlation coefficient (r) to the coefficient of determination (R²).

Coefficient of Determination

You may have noticed in the scatter plots above that we also calculated a statistic called R² in addition to r. This is called the coefficient of determination, which is a key output of fitting a line of best fit to a scatter plot. It is interpreted as the proportion of the variance in the dependent (y) variable that is predictable from the independent (x) variable(s). R² is not limited to a single pair of variables; it can also be calculated between a Y variable and a set of X variables.

For example, let's conceptualize a correlation as the amount of variance in one variable that overlaps with another variable. For this purpose, we want to ignore whether the relationship is positive or negative. We simply want to know how much of a Y variable can be explained, or predicted, by an X variable. See the diagram below:

If we square the correlation coefficient r = .14 for that relationship (see the correlation table created earlier), we get an R² value of about 0.02. In other words, about 2% of the variance in PurchaseBike can be explained by variance in Education, which is represented (although a bit exaggerated) by the overlap between the two circles above. However, we have collected many variables which might overlap with, or explain, PurchaseBike. We have also included CommuteDistance in the figure below:
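The squaring step can be checked with a few lines of Python. This is a minimal sketch and not part of the Excel workflow used in this chapter; the `education` and `purchase` lists are made-up values for illustration, not rows from the Bike Buyers workbook. It also shows that, for a single X variable, squaring r gives the same number as the R² of the line of best fit.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def r_squared_from_line(x, y):
    """R² of the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Hypothetical toy data (NOT taken from the Bike Buyers workbook).
education = [1, 2, 2, 3, 3, 4, 4, 5]   # ordinal education level
purchase = [0, 0, 1, 0, 1, 0, 1, 1]    # 0 = no, 1 = yes

r = pearson_r(education, purchase)
print("r squared:          ", round(r ** 2, 4))
print("R² of the best line:", round(r_squared_from_line(education, purchase), 4))

# The value quoted in the text: r = .14 squared is roughly 0.02.
print("0.14 squared:", round(0.14 ** 2, 2))
```

Squaring also removes the sign, so r = .14 and r = -.14 describe the same amount of overlap.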

There are two things to learn from this image. First, R² can represent not only the overlap of a single X variable with a Y variable, but also the total combined overlap of all X variables with a Y variable. Imagine adding a circle to that diagram for every factor we have measured in the Bike Buyers data set. The total overlap of all variables with PurchaseBike is the R² value we are interested in.

Second, notice that CommuteDistance is correlated not only with PurchaseBike, but also with the other independent variable, Education. In addition, part of the overlap between Education and PurchaseBike is also shared with CommuteDistance. So is the true relationship between Education and PurchaseBike best represented by the correlation coefficient between those two variables? No, it's better to analyze the effects of a set of X variables at once in order to see what individual effect each independent variable has on a dependent variable after removing, or "controlling for," the effects of all other variables.

In the figure below, the true effect of Education is represented by only the portion that doesn't overlap with all other independent variables. So how do we measure just that portion that is due only to education? That is one of the purposes of multiple regression.
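For the two-predictor case, the overlap idea can be worked out directly from the three pairwise correlations. The sketch below uses the standard formulas for R² with two predictors and for the squared semipartial correlation (the share explained by one predictor beyond the other). Only the Education value of .14 comes from the text; the two CommuteDistance correlations are hypothetical numbers chosen for illustration, and this is not the method used in the chapter's videos.

```python
# Pairwise correlations among the three circles in the diagram.
r_edu_buy = 0.14    # Education vs. PurchaseBike (value quoted in the text)
r_com_buy = -0.20   # CommuteDistance vs. PurchaseBike (hypothetical)
r_edu_com = -0.30   # Education vs. CommuteDistance (hypothetical)

def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R² of y regressed on two predictors, from pairwise correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def unique_share(r_y1, r_y2, r_12):
    """Share of y's variance explained by predictor 1 beyond predictor 2
    (the squared semipartial correlation)."""
    return (r_y1 - r_y2 * r_12) ** 2 / (1 - r_12 ** 2)

total = r_squared_two_predictors(r_edu_buy, r_com_buy, r_edu_com)
edu_alone = r_edu_buy ** 2
edu_unique = unique_share(r_edu_buy, r_com_buy, r_edu_com)
com_unique = unique_share(r_com_buy, r_edu_buy, r_edu_com)

print("Total overlap with PurchaseBike (R²):", round(total, 4))
print("Education alone (r squared):         ", round(edu_alone, 4))
print("Education after controlling:         ", round(edu_unique, 4))
print("CommuteDistance after controlling:   ", round(com_unique, 4))
```

With these assumed correlations, the part of PurchaseBike explained only by Education is smaller than the simple r squared, because some of Education's overlap is shared with CommuteDistance.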

Linear Regression

Regression is a powerful statistical analysis that allows you to measure the relationship between a dependent (output) variable and not just one, but a set of independent (input) variables. As a result, the effect of each independent variable is estimated while controlling for the effects of the other independent variables.

In linear regression, data is modeled using linear predictor functions (think about drawing a straight, or "linear", line through the data; as one variable goes up, the other variable goes up or down in a "linear" equation: Y = mx + b). This allows unknown model parameters to be estimated from the data. As a result, multiple linear regression is a great first step toward predicting unknown future "Y" values based on a set of known existing "X" values. Using the data set below (same as the previous chapter), follow along with the video tutorial to see how a basic prediction calculator can be produced in Excel.
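With more than one X variable, the line equation gains one slope per variable: Y = b0 + b1·x1 + b2·x2, and so on. The following is a minimal pure-Python sketch of how those parameters can be estimated by ordinary least squares and then used as a small prediction calculator. It is not the Excel procedure from the video; the helper names (`solve`, `fit`, `predict`, `r_squared`) and the eight data rows are hypothetical and chosen only for illustration.

```python
def solve(a, b):
    """Solve the linear system a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit(rows, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y.
    Returns [intercept, slope_1, slope_2, ...]."""
    X = [[1.0] + list(r) for r in rows]  # the leading 1 gives the intercept
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, row):
    """Apply Y = b0 + b1*x1 + b2*x2 + ... to one set of X values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

def r_squared(coefs, rows, y):
    """Proportion of the variance in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - predict(coefs, r)) ** 2 for r, t in zip(rows, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

# Hypothetical rows: (income in $10,000s, commute distance in miles).
rows = [(3, 10), (4, 8), (5, 9), (6, 4), (7, 5), (8, 2), (9, 3), (10, 1)]
y = [0, 0, 0, 1, 0, 1, 1, 1]  # PurchaseBike: 0 = no, 1 = yes

coefs = fit(rows, y)
r2 = r_squared(coefs, rows, y)
print("intercept and slopes:", [round(c, 4) for c in coefs])
print("R²:", round(r2, 4))

# "Prediction calculator": plug in known X values for a new person.
print("predicted PurchaseBike:", round(predict(coefs, (7.5, 3)), 4))
```

Because PurchaseBike is coded 0/1, a straight-line model like this can produce predictions slightly below 0 or above 1; the output is best read as a rough score rather than an exact probability.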

As you can probably tell if you followed along with the video above, multiple regression-based prediction calculators are somewhat complex, but they are also extremely powerful tools that are rarely used in practice by the "average employee," simply because they don't understand them or don't realize how easy they are to create in Excel.

If you feel like you understand the basic idea of multiple (meaning more than one independent variable) linear regression, then great! Feel free to skip this next video. However, if you are having trouble understanding how a regression analysis works and you'd like to understand more of the underlying mathematical concepts, then you may want to watch the video below:

You may recall from the video above that the regression coefficients cannot be compared directly because the input data (i.e. X variables) are all on different scales. For example, although Income was a highly significant variable explaining whether or not someone purchased a bike, its regression coefficient (B) was much smaller than any other because the scale of income was in the tens of thousands. This is also why I used the p-value to compare the effectiveness of each input variable rather than their actual coefficients.

Well, there is an easy way to alter the scale of the input variables so that they are all comparable. That is accomplished by scaling the data, which adjusts the input values to vary along the same range and makes the regression coefficients comparable. There are many forms of scaling (e.g. LogNormal, Tanh, Logistic, MinMax, Zscore). I'll give you the most common example below: z-score. A z-score is defined as the number of standard deviations from the mean that an input value lies. It's calculated as ((input value - sample mean) / standard deviation). Follow along with the video below to convert the input values to z-scores and then reexamine the coefficients and effect size of the same model as the one you produced in the prior video:
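The z-score formula and its effect on coefficients can be sketched in Python. To keep the example short, it fits one predictor at a time with the simple slope formula rather than the full multiple regression used in the video, and the `income`, `commute`, and `purchase` lists are hypothetical values, not rows from the workbook.

```python
from statistics import mean, stdev

def z_scores(values):
    """(input value - sample mean) / sample standard deviation, for each value."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Hypothetical toy data (NOT taken from the Bike Buyers workbook).
income = [30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000]  # dollars
commute = [10, 8, 9, 4, 5, 2, 3, 1]                                  # miles
purchase = [0, 0, 0, 1, 0, 1, 1, 1]                                  # 0 = no, 1 = yes

# Raw slopes are on very different scales (per dollar vs. per mile).
raw_income_slope = slope(income, purchase)
raw_commute_slope = slope(commute, purchase)

# After standardizing the inputs, each slope is "per one standard deviation".
std_income_slope = slope(z_scores(income), purchase)
std_commute_slope = slope(z_scores(commute), purchase)

print("raw slopes:         ", raw_income_slope, raw_commute_slope)
print("standardized slopes:", round(std_income_slope, 4), round(std_commute_slope, 4))
```

Standardizing an input multiplies its slope by that input's standard deviation, which is why a tiny per-dollar coefficient for income becomes comparable in size to the others.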

Finally, now that the inputs have been standardized, we refer to the regression coefficient as a lowercase beta (β).

Review

You've now learned several new statistics from this chapter and the prior chapter. Let's summarize what all of them are for. See the table below:

Including Categorical Variables

You may be wondering by now why we have variables in the Bike Buyers data set like "Region" and "Occupation" when we can't analyze them in a regression model. Well, guess what? We can. We just have to make some modifications to them. Let's also re-analyze "Education" and treat it as a categorical variable to see if we get any better results than when we converted it to an ordinal variable (i.e. partial high school = 1, high school = 2, etc.). Watch the video below to learn how to create dummy codes to analyze categorical variables in a regression model:
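As a rough sketch of what dummy coding produces, the Python function below turns one categorical column into a set of 0/1 columns. The function name `dummy_code` and the example region labels are hypothetical and used only for illustration; the video builds the equivalent columns in Excel.

```python
def dummy_code(values, reference=None):
    """Convert a list of category labels into 0/1 dummy columns.

    One category (the reference) gets no column of its own: a row with
    all zeros means "belongs to the reference category".
    """
    categories = sorted(set(values))
    if reference is None:
        reference = categories[0]  # default: first label alphabetically
    kept = [c for c in categories if c != reference]
    return {c: [1 if v == c else 0 for v in values] for c in kept}

# Hypothetical labels for a "Region" column.
region = ["Europe", "Pacific", "North America", "Europe"]

dummies = dummy_code(region)
for name, column in dummies.items():
    print(name, column)
```

A variable with k categories needs only k - 1 dummy columns. If all k columns were included alongside the intercept, they would always sum to 1 and the regression could not separate their effects. Each dummy coefficient is then interpreted relative to the reference category.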