Logistic Regression Using Excel Solver
Principles of Logistic Regression
Probabilities and the Odds Ratio
Before we describe the logistic regression learning procedure, we will define some terms.
The probability of an event is the number of outcomes in which the event occurs divided by the total number of possible outcomes. For example, if a jar contains 5 marbles, 3 white and 2 black, the probability of randomly choosing a white marble, Pw, is 3/5, or 0.6, sometimes expressed as 60%. The probability of choosing a black marble, Pb, is 2/5, or 0.4. Note that the probability of choosing a black marble is 1 minus the probability of choosing a white marble, or Pb = 1 - Pw. They are complementary probabilities.
The odds ratio is the ratio of two complementary probabilities. If you flip a fair coin, the probability that it will come up heads is 50% and tails is 50%. The odds ratio is 0.5/0.5, or 1 to 1, that it will come up heads. The probability of getting two consecutive heads (if the events are independent) is the product of the individual probabilities, or 0.5*0.5 = 0.25. The probability of not getting two consecutive heads is 1 - 0.25 = 0.75. The odds ratio of the probabilities would be 1:3; that is, the odds are 1 to 3 in favor of two consecutive heads, or, equivalently, 3 to 1 against.
The formula for odds is Odds = P/(1-P), where P is the probability of an event. So if the probability of an event is 0.75, then the odds are 0.75/(1-0.75) = 0.75/0.25 = 3, or 3 to 1, and the odds ratio is 3/1, or 3.
Solving the above equation for P gives us the formula to go from odds back to probabilities: P = Odds/(1+Odds). So if the odds are 3, then P = 3/(1+3) = 3/4 = 0.75; the probability is 0.75.
Figure 17.1 shows that while the probability ranges from 0 to 1 (or infinitely close to 1), the odds range from infinitely close to zero up to infinity. This solves the problem of predicting probabilities larger than 100%: odds have no upper limit, yet they can always be converted back to a probability that does not exceed 100%.
Even if we use odds instead of probabilities, we still have the problem of the lower limit being constrained by zero. To solve this, we take the log of the odds, which has an infinite lower limit and an infinite upper limit yet can still be converted back to a probability bounded between 0 and 1. The log is negative for odds between zero and one and positive for odds greater than one. Figure 17.2 illustrates these characteristics of odds and the log of odds.
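These conversions are easy to sketch in code. Below is a minimal Python sketch (the function names are ours, not from any library) showing that probabilities, odds, and log odds convert back and forth without loss:

```python
import math

def prob_to_odds(p):
    # Odds = P / (1 - P)
    return p / (1 - p)

def odds_to_prob(odds):
    # P = Odds / (1 + Odds)
    return odds / (1 + odds)

def prob_to_log_odds(p):
    # The log odds is the natural log of the odds.
    return math.log(prob_to_odds(p))

def log_odds_to_prob(log_odds):
    # Exponentiate to recover the odds, then convert odds to probability.
    return odds_to_prob(math.exp(log_odds))

print(prob_to_odds(0.75))         # 3.0, i.e., odds of 3 to 1
print(odds_to_prob(3.0))          # 0.75
print(prob_to_log_odds(0.4))      # about -0.405 (negative because the odds are below 1)
print(log_odds_to_prob(-0.405))   # about 0.4
```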
Solving the Logistic Equation
Based on what we have learned about logistic functions and logs, we can formulate the equation that we want to solve. Let’s formulate a logistic equation with one independent variable.
Now we will go through several algebraic transformations of the equation:
- p = 1/(1 + e^-(mx+b))  (start)
- p + p*e^-(mx+b) = 1  (multiply both sides by 1 + e^-(mx+b))
- p*e^-(mx+b) = 1 - p  (subtract p from both sides)
- e^(mx+b) = p/(1-p)  (divide both sides by p and invert both sides) (Note: p/(1-p) is also the definition of Odds.)
- mx + b = ln(p/(1-p)) = ln(Odds)  (take the logarithm of both sides)
Rearranging, we formulate the equation that we want to solve. Notice that through this transformation, by taking the logarithm of both sides, we have converted the logistic or sigmoid equation into a linear equation whose vertical axis is Ln(Odds) instead of probability. The standard terminology for Ln(p/(1-p)) is Logit(p).
As we observed earlier, this equation and its graph handle very large numbers, both positive and negative: as p approaches 1, Ln(Odds) approaches positive infinity, and as p approaches 0, it approaches negative infinity.
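As a quick check of the derivation, the following Python sketch (function names are ours) shows that the sigmoid function and Logit(p) are inverses of one another, which is exactly the transformation used above:

```python
import math

def sigmoid(z):
    # p = 1 / (1 + e^-(z)), where z stands for mx + b
    return 1 / (1 + math.exp(-z))

def logit(p):
    # Logit(p) = ln(p / (1 - p)) = Ln(Odds)
    return math.log(p / (1 - p))

z = 2.0              # a sample value of mx + b
p = sigmoid(z)       # about 0.881
print(p, logit(p))   # logit(p) recovers 2.0
```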
Logistic Regression Using Excel
We now have a process to perform logistic regression in Excel:
- Convert probabilities to odds.
- Convert odds to log odds (also called the logit).
- Find the intercept and coefficients of the feature variables that maximize the likelihood of the observed data on the log-odds scale.
- Convert the predicted log odds to odds, and then convert the odds to probabilities.
Regular linear regression uses ordinary least squares to yield an equation with an intercept and coefficient-variable pairs through a set of matrix equations. Logistic regression gives a similar-looking equation, but the engine is maximum likelihood estimation. We define the likelihood as the probability of observing the data points that we did, given the distribution, or sigmoid curve. We want to find the "maximum likelihood," which is the logistic curve that best fits the observed data points.
This method tries out different values for the intercept and coefficients until it finds values closest to the actual observed probabilities. This is not done by a set of equations like linear regression, but by an iterative process that tries out different values for the coefficients until the likelihood is maximized. This is what Solver does in Excel—you specify an outcome, and Solver determines the input values needed to get the outcome. We want coefficients that predict a probability close to 1 for “Yes” and a probability close to 0 for “No.”
We use the log likelihood (in a similar way that we use R2 in linear regression) as the measure of the goodness of our model. The closer the log likelihood is to 0 (i.e., the closer the likelihood is to 1), the closer the product of all the predicted probabilities is to 1 (a perfect model), and the closer we are to the maximum likelihood estimate.
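Excel's Solver is one way to run this search; any numerical optimizer will do the same job. As a hedged illustration, here is a minimal Python sketch using scipy.optimize.minimize to find the intercept and coefficient that maximize the log likelihood (by minimizing its negative). The x and y arrays are made-up stand-ins, not data from the examples that follow:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: one feature and 0/1 outcomes.
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
y = np.array([0, 0, 1, 0, 1, 1])

def negative_log_likelihood(params):
    b, m = params
    p = 1 / (1 + np.exp(-(m * x + b)))        # predicted probability of a 1
    likelihood = np.where(y == 1, p, 1 - p)   # probability of the observed outcome
    return -np.sum(np.log(likelihood))        # Solver maximizes; here we minimize the negative

result = minimize(negative_log_likelihood, x0=[1.0, 0.0])  # start: intercept 1, coefficient 0
b, m = result.x
print(b, m)  # fitted intercept and coefficient
```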
Election Results Example
Figure 17.3 illustrates a simple example of election results for city councilmen. We have 22 data points giving the expenditures in a local election and whether the candidate won or lost. The expenditures are in thousands of dollars, and the election results are 1 for a win and 0 for a loss. The data is sorted by expenditure, but, of course, that is not necessary for plotting.
Remember, the equation that we want to solve is Logit(p) = Ln(Odds) = b + mx.
We want to find the coefficient, m, and the intercept, b, for the curve with the maximum likelihood, that is, the curve that best fits the data points. Since we will be using the Solver, we set up two cells as changing cells, one for m and one for b. We initialize those cells with a value of one for the intercept and a value of zero for m. We name these cells betaIntercept and betaExpenditure.
The next step is to create a column that calculates Logit(p). From the equation above (i.e., Logit(p) = Ln(Odds) = b + mx), Logit(p) is equal to the betaIntercept cell plus betaExpenditure times the value of the Expenditure. Figure 17.4 illustrates these two steps. Column D contains the calculation for Logit(p). The starting values are all ones because the coefficient starts at zero.
Next we work with odds and probabilities by exponentiating the logarithm. Convert the logit score to odds with the formula Odds = EXP(Logit(p)). We get a log by using the Excel function LN(); its complement, which reverses it, is EXP(). This is illustrated in Figure 17.5.
Next we calculate both the probability of a Win and the probability of a Loss. A Win uses the formula ProbabilityWin = Odds/(1+Odds). A loss uses the formula ProbabilityLoss = 1-ProbabilityWin. We enter that information in columns F and G. Finally, make another column that chooses either Win or Loss based on the information in Column C. We use the Excel IF function in column H. Figure 17.6 shows these three columns.
Next, we move back to logarithms by taking the log of the likelihood. In column I, we take the log of column H to get the log likelihood. Figure 17.7 shows this step. Also in Figure 17.7, we create a sum of all of the log likelihoods.
The Sum of the LogLikelihoods is the value we want to maximize. Maximizing the sum by changing the intercept and coefficient will give the best-fit curve for the data. Remember that column D contains the equation for Logit(p) = b + mx; as Solver changes the coefficients, different versions of the curve are generated. The solution is the Sum of LogLikelihoods closest to zero: since we are working with probabilities between zero and one, the logarithms are all negative, so the maximum value is the smallest negative number.
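Columns D through I amount to the following per-row calculation. This is a minimal Python sketch of the same steps using a few made-up rows and the starting coefficients (b = 1, m = 0), not the actual 22-record dataset:

```python
import math

# Hypothetical rows: (expenditure in thousands, 1 = win, 0 = loss)
rows = [(20, 0), (35, 0), (42, 1), (55, 1)]
b, m = 1.0, 0.0  # starting values, as in the worksheet

total_log_likelihood = 0.0
for expenditure, won in rows:
    logit = b + m * expenditure             # column D: Logit(p) = b + m*x
    odds = math.exp(logit)                  # column E: Odds = EXP(Logit(p))
    p_win = odds / (1 + odds)               # column F: probability of a win
    p_loss = 1 - p_win                      # column G: probability of a loss
    likelihood = p_win if won else p_loss   # column H: probability of the observed outcome
    total_log_likelihood += math.log(likelihood)  # column I, summed in cell I3

print(total_log_likelihood)  # the value Solver maximizes by changing b and m
```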
Figure 17.8 shows the Solver window. Select the Sum cell, I3, as the cell to maximize. Select the coefficient cells C3:D3 as the changing cells. The solving method should be GRG Nonlinear. Clear the check box for Make Unconstrained Variables Non-Negative. Click the Options button and choose Central Derivatives. Then run the Solver.
Figure 17.9 shows the results of the Solver. The intercept is -9.88 and the coefficient is 0.23. The solved equation is Ln(Odds) = -9.88 + 0.23*Expenditures.
Finally, let's overlay the curve on the original data to observe how well it fits. To plot the curve, we use the equation p = 1/(1 + e^-(-9.88 + 0.23x)). We chose values of x every five thousand dollars from 0 to 100. Figure 17.10 shows both the original data points and the curve of the equation overlaying the data points.
Looking at the data points, we see that there is quite a bit of overlap in the data around the 35 to 40 thousand dollar expenditures. Obviously, there is no way a curve can perfectly explain that kind of data anomaly. As a prediction model, however, we would predict a win (1) for any point where the predicted probability is greater than 0.5 and a loss (0) where it is less than 0.5. There are still a couple of data points that are outliers in the original data. For most of the data points, however, the curve provides a good fit.
For councilmen running for office, this curve shows that it would be much safer to spend in the 50 thousand dollar range than in the 35 to 40 thousand dollar range.
For example, let’s say a councilman planned to spend 48 thousand dollars. What is the probability of winning the election? Note: This is the same process as we did in columns D, E, and F of the Election Results table.
- Ln(Odds) = -9.88 + 0.23 * 48 = 1.16
- Odds = EXP(1.16) = 3.19
- Probability = Odds/(1 + Odds) = 3.19/4.19 = 76% likelihood of winning
Conversely, we can also ask the question: “According to the model, how much should a candidate spend to have a 90% likelihood of winning?”
- Odds = p/(1-p) = 0.90/(1-0.90) = 9.0
- Ln(Odds) = Ln(9) = 2.197
- Amount = (2.197 + 9.88)/0.23 = 52.51 thousand dollars (solving 2.197 = -9.88 + 0.23*Amount)
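Both of these hand calculations can be checked with a few lines of Python. The intercept and coefficient are the fitted values from the example; small differences are rounding:

```python
import math

b, m = -9.88, 0.23  # fitted intercept and coefficient from the example

# Probability of winning when spending 48 (thousand dollars)
log_odds = b + m * 48        # about 1.16
odds = math.exp(log_odds)    # about 3.19
print(odds / (1 + odds))     # about 0.76

# Spending needed for a 90% chance of winning
target_odds = 0.90 / (1 - 0.90)           # 9.0
target_log_odds = math.log(target_odds)   # about 2.197
print((target_log_odds - b) / m)          # about 52.5 thousand dollars
```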
Comparing Models
Likelihood of Donating Example
In this section, we will look at a more complex model. Earlier it was noted that the mx term could be m1x1 + m2x2 + m3x3 ... For this example, we use a dataset comprising 40 records, each with one target (Donate) and three feature variables (Income, Age, and Membership). Donate is the dependent variable (yes or no), and Income, Age, and Membership are the independent variables.
One thing we note about this example is that Income and Age are continuous variables, as we used in the previous example. However, Membership is a discrete or categorical variable with values of "not member" or "member," which will be coded as 1 and 2, respectively. We will discover that logistic regression handles both types of variables.
Figure 17.11 is the starting configuration for the logistic regression. In row 3, we provide the start coefficients for the equations b + m1x1 + m2x2 + m3x3, with b = 1 and m1, m2, m3 all zero. Let’s review the columns. Where we give formulas in the table below, they are for row 6, the top data row in the range.
Column | Contents | Explanation |
---|---|---|
A | Zeros and Ones | Dependent variable. 1 = Donated, 0 = Not Donated. |
B, C, D | Income, Age, and Membership values | Independent variables. Membership: 1 = Not member, 2 = Member |
E | =$B$3+$C$3*B6+$D$3*C6+$E$3*D6 | This is the equation for b + m1x1 + m2x2 + m3x3 with the values of x1, x2, x3 from columns B, C, and D. This equals Logit(p), which equals Ln(Odds). |
F | =EXP(E6) | Taking the exponent of a log yields the value, = Odds. |
G | =F6/(1+F6) | Converting the odds to probability for dependent value = 1, or a “Yes” likelihood. |
H | =1-G6 | The probability for value = 0, or a “No” likelihood. |
I | =IF(A6=1,G6,H6) | Choosing either column G or H depending on the value of column A (1 = yes, 0 = no). |
J | =LN(I6) | The log of the likelihood from column I. |
As before, in cell J3 we take the sum of all the log likelihoods for the 40 records, row 6 through row 45, =SUM(J6:J45).
To calculate the best fit for the Ln(Odds) equation, we use the Solver. The cell to maximize is the Sum in J3. The changing cells are the coefficient cells in B3, C3, D3, and E3. Use the GRG Nonlinear solving method. Uncheck the “Make Unconstrained Variables Non-Negative” box. In the Options window, choose the Central derivative for the GRG Nonlinear method. These options are shown in Figure 17.12.
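For readers who want to cross-check a Solver solution outside Excel, the sketch below fits the same kind of three-variable logistic model with Python's statsmodels library. The data are simulated stand-ins, not the actual 40-record dataset, so the printed coefficients will not match the worksheet's solution; the point is that the same maximum likelihood fit can be reproduced with a single library call:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated stand-in data: Income in thousands, Age in years,
# Membership coded 1 = not member, 2 = member.
n = 40
income = rng.uniform(30, 160, n)
age = rng.uniform(25, 70, n)
member = rng.integers(1, 3, n)

# Generate Donate (1/0) from an assumed mild logistic relationship plus noise.
true_logit = -6.0 + 0.03 * income + 0.04 * age + 1.0 * member
donate = (rng.uniform(size=n) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = sm.add_constant(np.column_stack([income, age, member]))  # intercept column plus features
result = sm.Logit(donate, X).fit()
print(result.params)  # intercept and coefficients, analogous to cells B3:E3
print(result.llf)     # maximized sum of log likelihoods, analogous to cell J3
```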
Figure 17.13 shows the final regression model, with the sum maximized and with the intercept and coefficients calculated.
The logistic regression equation is:
Logit to Donate = -21.366 + 0.0955*Income + 0.125*Age + 2.735*Member
This Logit score can be converted to odds and then to probability of a person’s tendency to donate.
So, for example, suppose a person has an income of 110 (thousand), is 58 years old, and is a member (coded as 2). What is their probability of donating?
- Logit to Donate = -21.366 + 0.0955*110 + 0.125*58 + 2.735*2 = 1.88
- Odds = EXP(1.88) = 6.56
- Probability they will donate = 6.56/(1 + 6.56) = 86.7%
Model Comparisons
We now have to ask the question: “Is this solution a good model?”
Compare with Intercept-Only Model
To compare models, we build a model with the intercept only, then one with the intercept and the first variable (Income), then one with the intercept and first and second variables (Income and Age) to compare with the model with all of the variables. In this example, we will compare these four models to see which performs the best.
To create these other three models, we simply copy the worksheet of the start configuration to three other worksheets. To run these other models, we simply change the changing cells in the Solver.
- For the Intercept-only model, use B3 as the changing cell.
- For the Intercept and Income model, use B3 and C3 as the changing cells.
- For the Intercept, Income, and Age model, use B3, C3, and D3 as the changing cells.
Running the Solver for each of the three other models, we get the following results. Notice that only the coefficients of the changing cells have non-zero values, as shown in Figure 17.14.
Because logistic regression uses maximum likelihood estimation, the statistical test of the incremental improvement between two models is a chi-square test. However, chi-square, because it is a sum of squared z-scores, must always be positive. So, we need to convert our negative Sum of Log Likelihood scores to positive scores and then calculate the chi-square tests.
One way to keep track of these model variations is to make a table like the one below, which has the Sum of the Log Likelihoods (LL) of each model entered. Next, we convert the negative Sum of Log Likelihood scores to positive scores. There are several ways to do this; the usual way is to multiply the Sum of Log Likelihoods by -2. Some statisticians call this value "deviance," which refers to the amount of deviation between the predicted and actual probabilities. Figure 17.15 shows these two steps, with the Log Likelihood of each model and that value times -2.
Next we calculate the difference, or change, between the intercept-only model and each of the other models to see how much better each model predicts than the intercept-only model. We will use these values to test whether the reduction in deviance is statistically significant. We also enter the degrees of freedom into the table; the degrees of freedom equal the number of variables added to the intercept-only model (three for the full model). Figure 17.16 shows these two additional columns.
Finally, we calculate the chi-square test of each model against the intercept-only model. For the full model against the intercept-only model, the null hypothesis is that the variables Income, Age, and Member do not improve our prediction of donation behavior beyond knowledge of the simple base rate: 16 of the 40 records in our data set donated, or 16/40 = 40%. In other words, any change in the results from the intercept-only model is purely by chance. We normally test this against a level of 0.05: if the p-value is less than 5%, we reject the null hypothesis. The chi-square test gives the probability of finding a deviance value that differs from the intercept-only value by as much as it does if the model really makes no difference. Or, in other words, how likely is it that the deviance difference between the two models is just sampling error? Figure 17.17 shows the chi-square test. We use the Excel function CHISQ.DIST.RT to test the right tail.
From the table above, we see that in only about 1.5 of 100,000 samples would we get a difference in deviance as large as 25 with three degrees of freedom if there were no effect in the population. So we reject the null hypothesis and conclude that the variables do make a difference in predicting a person's tendency to donate.
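The same test can be reproduced outside Excel; scipy's chi2.sf is the equivalent of CHISQ.DIST.RT. The deviance change and degrees of freedom below are the values from the example tables:

```python
from scipy.stats import chi2

# Change in deviance between the intercept-only model and the full model,
# where deviance = -2 * (sum of log likelihoods).
deviance_change = 25.06
df = 3  # three variables added to the intercept-only model

p_value = chi2.sf(deviance_change, df)  # right-tail probability, like CHISQ.DIST.RT
print(p_value)  # roughly 1.5e-05, i.e., about 1.5 in 100,000
```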
Model Comparisons Against Other Models
In the comparison above, we compared all of the models against the intercept-only and found that all of the models were statistically significant. There is still the question of individual feature variable influence as they are added to the models. Or in other words, what is the impact of adding Income, then Age, then Member? Do some variables not make a difference?
For example, in the table above we see that the change in deviance from the Int and Income model to the Int, Income, Age model is small (14.88 to 15.98) compared to the difference when we add Member to the model (25.06). So, let’s run the chi-square to compare the Int and Income model to the Int, Income, Age model.
Figure 17.18 compares these two models. We changed the degrees of freedom in the table because now Int and Income is the constrained model or base model, and model Int, Income, Age adds Age as the one new variable that can vary from the base model. We see that the small change in deviance (1.1) would be expected 29% of the time by chance (which is above the usual cut-off rules of .05 or .01), so we would not reject the null hypothesis. We would conclude that adding Age to the model makes no statistically significant difference. We could drop Age from the full model.
Let’s run the full model but without including Age, and compare the results of the chi-square test, testing Intercept and Income together (without Age) and then testing Intercept, Income, and Member together (without Age). Figure 17.19 shows the model results and the chi-square test.
We can see that our chi-square p-value became even smaller, indicating that a difference this large would be even rarer by chance under the new Intercept, Income, Member model. The Sum of the LogLikelihoods was only reduced a small amount (from -14.38 to -14.98) compared with the original model that included Age, but the model is simpler and more trustworthy. So, we might conclude that this is our best model:
The Logit to Donate = -13.87 + 0.09*Income + 2.64*Member
Model Goodness Measures
The goodness of a model can be defined in many ways. The chi-square test model validation process we performed above is a good way to test for the goodness of your model (at least from the viewpoint of statistical significance).
There have also been many attempts at a logistic regression version of the R2 used in linear regression, called pseudo R2. It does not work as well, is not as accurate, and has many different versions. A popular version is McFadden's R2, calculated by the formula: McFadden R2 = 1 - (LL of the fitted model / LL of the intercept-only model), where LL is the sum of the log likelihoods.
In our example, it would be 1 - (-14.98/-26.92) = 0.443. Figure 17.20 shows the pseudo R2 for the two models without Age from the previous section.
The interpretation is not the same as R2 (which explains variability). For the McFadden R2, the interpretation is the improvement from the null model to the fitted model, or how much the model parameters improve on the prediction of the null model. McFadden wrote that the pseudo R2 “values tend to be considerably lower than those of the R2. For example, values of 0.2 to 0.4 for pseudo R2 represent EXCELLENT fit.” This means our pseudo R2 of .443 is good. But, pseudo R2 is really more of a tool to compare models. We see that the pseudo R2 of the Int, Income, Member model of .443 is higher than that of the Int and Income model of 0.276.
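The calculation is simple enough to check directly. The log likelihoods below are the values reported in the example:

```python
# McFadden pseudo R-squared = 1 - (LL of the fitted model / LL of the intercept-only model)
ll_null = -26.92     # intercept-only model
ll_model = -14.98    # Intercept, Income, Member model

pseudo_r2 = 1 - (ll_model / ll_null)
print(pseudo_r2)  # roughly 0.44, matching the 0.443 in Figure 17.20 up to rounding
```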
Lastly, most experts agree that the best test of any predictive model is cross-validation, where we train the model on part of the data and test it on another part of the data.