9.7 Tools: Regression Analysis
Another tool that you can follow in examining how variables within a process might be related is regression analysis. Regression analysis assists you in defining the relationship between one independent and one dependent variable. For instance, the pepperoni roll is a snack popular in West Virginia and some nearby Appalachian Mountains such as Western Pennsylvania, Western Maryland, and Appalachian Ohio. It is all over the place in West Virginia, specifically, and is arguably considered the food most closely associated with the state.
Now, let’s say you are consulting for a local food processing organization producing specialty pepperoni rolls. The leadership team is interested in learning about whether the temperature in their state-of-the-art new oven has a relationship to the roll’s flavor and how closely they might be related. It would be best to have both variables in a numerical format to work with regression.
The most straightforward tool used in regression analysis is the scatter diagram. Recall that a scatter diagram is a plot of one independent variable versus a dependent variable. You can use the scatter diagram to examine cause and effect relationships. Your assumption, in this case, is that the independent variable causes a change in the dependent variable. A hint here is that you need to vary X over a sufficiently large range to detect correlation better. Before we go much further, it is crucial to keep in mind that if you want to predict Y for an X value that lies outside of the range you tested, the prognosis is highly questionable, and you should test it properly.
To perform a regression analysis regarding the relationship between the temperatures in the new oven and the roll’s flavor, you can bake rolls at varying temperatures and have an organization expert rate their flavor numerically. A rate of 1 could indicate undesirable taste and a rate of 10 a perfect flavor. Let’s further assume that you collected the data displayed in Table 9.4.
New Oven Temperature | Flavorfulness |
---|---|
202 | 1 |
227 | 1.2 |
252 | 2.2 |
277 | 2.4 |
302 | 4 |
327 | 5 |
352 | 5.2 |
377 | 6.2 |
402 | 8 |
426 | 8.9 |
452 | 9.5 |
477 | 10 |
Using the data shown in Table 9.4, if the team creates a scatter diagram with a fit line, which you can perform using basic Excel, they will produce the graph shown in Figure 9.5 below. Through a simple inspection of the chart, you can conclude that a relationship between the input (oven temperature) and output (flavorfulness) variables exist. The data plots (flavor) are located tightly around the line, which indicates that as temperature rises, so does the flavorfulness of the pepperoni rolls in this example.
Linear Models
A linear model is simply an expression of a type of association between two variables, X and Y. A linear relationship means that a given size in X produces a proportionate change in Y. You can think of a linear regression model as a data plot that graphs the linear relationship between the independent and dependent variables. You typically use it to show the strength of the relationship and the dispersion of the results to explain the behavior of the dependent variable.
Let’s return to our previous example, where you want to test the strength of the relationship between the temperatures in the new oven and the pepperoni roll’s flavor. Again, you would take the independent variable, oven temperature, and relate it to the dependent variable, taste, to assess if there was a relationship. An important point for you to keep in mind: The lower the variability in the data, the stronger the relationship and the tighter the fit to the regression line.
-
Click Data Analysis on the Analysis tab.
-
From the Data Analysis popup, choose Regression.
-
Under Input, select the ranges for both independent and dependent variables.
-
Check the Labels checkbox to include variables labels appearing in row 1.
-
Click OK.
Let's now take a look at the output shown in Figure 9.5. Multiple R is the Correlation Coefficient that measures the strength of a linear relationship between two variables. The correlation coefficient can be any value between -1 and 1, and its absolute value indicates the relationship strength. The larger the absolute value, the stronger the relationship. In this example, multiple R is approximately 0.99, suggesting a solid connection, i.e., a solid relationship between the new state-of-the-art new oven temperature and the peperoni roll flavor.
The R2 value, also known as the coefficient of determination, measures the variation in the dependent variable explained by the independent variable or how well the regression model fits the data. The R2 value ranges from 0 to 1, and a higher value indicates a better fit and a stronger relationship. The p-value, or probability value, ranges from 0 to 1 and means if the test is significant. In contrast to the R2 value, a smaller p-value is favorable as it indicates a correlation between the dependent and independent variables. Adjusted R2 is the R2 adjusted for the number of the independent variable in the model. You will want to use this value instead of R2 for multiple regression analysis.
Standard Error is another goodness-of-fit measure that shows the precision of your regression analysis—the smaller the number, the more confident you can be about your regression equation. While R2 represents the percentage of the variance of the dependent variable that the model explains, Standard Error is an absolute measure that shows the average distance that the data points fall from the regression line.
You will rarely use the ANOVA part for simple linear regression analysis in Excel, but you should have a close look at the last component. The Significance F value gives you an idea of how reliable (statistically significant) your results are. If Significance F is less than 0.05 (5%), your model is OK. If it is more than 0.05, you might have to search for an alternative and more appropriate independent variable that can better fit the model.
You can chart a regression in Excel by highlighting the data and charting it as a scatter plot or by simply clicking “Line Fit Plots” in the Regression Data analysis pop out (step 3, described earlier). To add a regression line and the linear equation, follow the steps below:
-
Click on the graph in Excel.
-
Select Design and select Add Chart Element.
-
Select Trendline, More Trendline Options.
-
Select the option for the predicted trend line and click OK.
-
Click on the box to display the equation and the R2 value in the chart.
In simple linear regression, you use a single input variable (X) to predict a single output variable (Y). As you can see in Figure 9.7, the output includes an equation in the following form:
Where
B0 is the intercept point on the y-axis.
B1 is the constant that tells you how and how much the X variable affects the output.
E is the amount of error.
Let’s now interpret that equation. You generated the graph to portray how the new oven temperature affects the flavor of the pepperoni rolls. The form of the simple regression equation in this case is
Using the statistical output,
This outcome means that each unit increase in temperature increases flavorfulness by 0.0357 units, and that, on average, flavorfulness is -6.834 units in the absence of temperature. If you would like to predict flavor at a temperature of, say, 400°, you can simply use your equation:
Multiple regression would follow the same principles, but you would be studying the effect of multiple Xs (independent or predictor variables) on one output (Y) variable. A crucial cautionary point here is that you need to perform a residuals analysis to verify the validity of the regression. A residual plot is a type of plot that displays the fitted values against the residual values for a regression model. You use this type of plot to assess whether a linear regression model is appropriate for a given dataset and to check for heteroscedasticity of residuals. The Excel outputs of your regression analysis include a Residual Outputs Table, shown in Figure 9.8, which you can use to produce a residual scatter plot following the same procedures described earlier.
Figure 9.9 shows the residual plot for our example. In examining the plot, you want to see no patterns and about the same variance. Looking at the graph, the values are reasonably spread out. In our case, the sample size is small so it has become a bit more difficult to identify patterns.
Click here to view a list of available activities.