13.2 Logistic Regression Using JADE
Probabilities and the Odds Ratio
Download the file from the link below to follow along with the text example or video and to practice on your own.
Before we describe the logistic regression learning procedure, we will define some terms.
The probability of an event occurring is the ratio of choosing an outcome divided by the total possibilities of outcomes. For example, if there is a jar containing 5 marbles, with 3 being white and 2 being black, the probability of randomly choosing a white marble, Pw, is 3/5 or 0.6 or sometimes expressed as 60%. The probability of choosing a black marble, Pb, is 2/5 or 0.4. Note that the probability of choosing a black is 1 minus the probability of choosing a white, or Pb = 1-Pw. They are complementary probabilities.
The odds ratio is the ratio of two complementary probabilities. If you flip a fair coin, the probability that it will come up heads is 50% and tails is 50%. The odds ratio is 0.5/0.5 or 1 to 1 that it will come up heads. The probability of getting two consecutive heads (if the events are independent) is the product of the individual probabilities, or 0.5*0.5 = 0.25. The probability of not getting two consecutive heads is 1-0.25 = 0.75. The odds ratio of the probabilities would be 1:3; the odds are 1 to 3 in favor of two consecutive heads, or also 3 against 1.
The formula for odds is Odds = P/(1-P) where P is the probability of an event. So if the probability of an event is 0.75, then the odds are 0.75/(1-0.75) = 0.75/0.25 = 3 or 3 to 1, and the odds ratio is 3/1 or 3.
Solving the above equation for P gives us the formula to go from odds back to probabilities. This formula is P = Odds/(1+Odds). So if the odds are 3, then 3/(1+3) = ¾ = 0.75, the probability is 0.75.
Figure 13.3 shows that while the probability ranges from 0 to 1 (or infinitely close to 1), the odds range from infinitely close to zero to infinity. This handles the problem of predicting probabilities larger than 100% by using odds that can handle an upper limit of infinity yet be converted to a probability not going above 100%.
Even if we use odds instead of probabilities, we still have the problem of the lower limit constrained by zero. To solve this, we take the log of the odds, which has the property of an infinite lower limit and an infinite upper limit that can still be converted to a probability bounded between 0 and 1. Logs are negative for positive odds less than zero. Figure 13.4 illustrates these characteristics of odds and the log of odds.
Solving the Logistic Equation
Based on what we have learned about logistic functions and logs, we can formulate the equation that we want to solve. Let’s formulate a logistic equation with one independent variable.
To solve this equation, we go through several algebraic transformations and arrive at this equation:
Once we have calculated the Odds Ratio, we can convert that to predicted probability using the previously defined equation, P = O/(1+O).
We will demonstrate the steps to convert the sigmoid equation to a linear equation. These steps are basic algebraic transformations of the equation:
-
p = 1/(1+e-(mx+b)) (start)
-
p+pe-(mx+b) = 1 (multiply both sides by 1+e-(mx+b))
-
pe-(mx+b) = 1-p (subtract p from both sides)
-
e(mx+b) = p/1-p = Odds (divide by p and invert both sides)
-
mx + b = ln(p/(1-p) = ln(Odds) (take the logarithm of both sides)
Regular linear regression uses ordinary least-squares to yield an equation with an intercept and coefficient-variable pairs through a set of matrix equations. Logistic regression gives a similar-looking equation, but the engine is maximum likelihood estimation. In other words, we use the term likelihood as the measure of the predicted probability.
We define the likelihood as the probability that the distribution, or sigmoid curve, fits the data points that were observed. We want to find the “maximum likelihood,” which is the logistic curve that is the best fit of observed data points.
Logistic Regression Using Excel and JADE
In this section we will show a simple example of a logistic equation. This example contains only one independent variable. We use this example because it is easy to plot it on a two-dimensional graph, which makes it easier to visualize the properties and shape of the sigmoid or logistic equation.
We will use the Javascript Automation Development Environment (JADE) add-in to solve the logistic or sigmoid equation. JADE calculates the intercept, i.e., the b value, and the coefficient, i.e, the m value. JADE will also provide us statistics to evaluate the goodness of fit as determined by the chi-square value, p-values, and standard error values for the intercept and coefficient.
Election Results Example
Figure 13.5 illustrates a simple example of election results for city councilors. In this example, we have 40 data points of the expenditures in a local election and whether the candidate won or lost the election. The expenditures are in thousands of dollars, and the election results are 1 for a win and 0 for a loss. The data is sorted by expenditure of money, but, of course, that is not necessary for plotting nor for calculating the values.
We begin the logistic regression process for this data by opening the JADE tool. Figure 13.6 illustrates how this can be done. Click on the Automation Development icon in the Home ribbon. The JADE tool opens up in a side panel. We want to use the Excel tools. Click on that hotlink. NOTE: Depending on the scaling parameters that are set on your computer, the JADE tool may be scaled too large or too small. To change the scaling, press Ctrl- to reduce the size or Ctrl+ to increase the size of JADE. First, be sure the side panel is in focus.
Clicking on the Excel Tools opens up the next window as shown in Figure 13.7. In this panel, click on the Analytics hotlink to get to Logistic Regression.
Another way to get to the Logistic Regression tool is to click on the Introduction title bar. Figure 13.8 illustrates the drop-down menu that displays from the Introduction title bar. Click on the Logistic Regression hotlink to open the Logistic Regression control panel.
We finally get to the Logistic Regression panel as shown in Figure 13.9. You can read about the tool and load an example dataset by clicking on the Example heading bar. We will not show this in the text, but you should try it out. It is self-explanatory as it loads up the data and allows you to try the tool out. The example that we will do here in the text will also teach you how the tool works.
Referring back to Figure 13.9, the first column shows the amount of money the candidate spent on the campaign. We will call this the independent variable. The independent variables are also referred to as Predictors. The second column, which shows whether the candidate won or lost, is the dependent variable. The dependent variable is also referred to as the Outcome.
As can be seen in Figure 13.9 there are three parts to using the tool. Remember, our goal is to determine the values b and m, which are the intercept and the coefficient on the sigmoid equation that best fits the data. It is normal procedure to use part of the data as the training data to calculate the intercept and coefficient, and a smaller part of the data to verify the results. The second step is to run the model that will do the actual calculations for the intercept and coefficient. The third step is to make predictions on the outcomes on separate data that we want to test.
Step 1. Splitting the Data
The first step, as shown in Figure 13.10, is to split the data into Training Data and Verification Data. The top input box, under Step 1. Identify Data is the data sheet that contains the data. The second box is the Data Range. The Data Range is all of the data to be used, both the independent variables and the dependent variables. The easiest way to fill in the box is to select the top left cell of the data, in our case cell A1, and then click the icon to the right of the box. The icons are circled in the figure. In our case, the Training Dataset is set to use 80% of the data, which leaves 20% as the verification data.
In this example, the Data Sheet is StartJADE. The Data Range is A1:B41, which contains all columns and all rows. The split is 80% Training Data. Once the boxes in Step 1 are filled in, then click on the Split Data button. The tool adds a new column to the data showing which data will be used for training (TRUE) and which data will be used for verification (FALSE).
Step 2. Running the Calculations Model
After the data is split, the next step is to do the calculations. Figure 13.10 shows this process. We have to tell JADE which columns are the dependent variable, called the Outcome, and which columns are the independent variables, called the Predictors. We simply fill in both boxes with the appropriate data. Again, an easy way is to select the data, and then click on the crosshatch icon to the right of the boxes. In our case, the Outcome is the column of ones and zeros to indicate win or lose, which is B1:B41. The Predictor is the independent variable. It is the amount expended, as given in cells A1:A41. Notice that JADE identifies and lists the Predictor columns. In this example, that column name is Expenditure (000). We will use this information in a later step.
Once those boxes are completed, we click the Run Model button to perform the calculations. JADE calculates the intercept and coefficient and displays the results along with some statistical analysis of the results. JADE creates a new worksheet for the results. Figure 13.11 illustrates the results for our example.
Figure 13.11 illustrates the form of the results. The Heading Bar, in blue, lists the worksheet and the column for the independent variables, which is the range A1:A41. It also displays the Training Percent, which in our case is 80%.
The next box gives the statistics for the Overall Model Fit. In our case, this model has a Chi Square of 27.537. The Degrees of Freedom is 1. Remember, Degrees of Freedom equals n-1, where n is the number of independent items. In this model, n = 2, intercept and one coefficient. The p-value for the overall model is smaller than five decimal points, so it shows up as zero. Also, this model predicted the other 20% of the data points with 100% accuracy.
The Confusion Matrix uses the 20% of the data that was not used for training and verifies the result. The Confusion Matrix shows which data points give an incorrect result, either predicting a loss for an actual win, or a win when it was a loss. Twenty percent of 40 is 8 data points. In this instance, the model performed very well, predicting 4 wins and 4 losses accurately.
The two boxes with zeroes on the diagonal are related to the statistical concept of type I and type II errors. A type I error accepts a value when it is fact false (predicting a 1 when it is actually a 0). The top right corner box is type I errors. Type II errors reject a value when it should be accepted. The bottom left corner box is type I errors (predicting a 0 when it is actually a 1).
The final box shows the actual values that were calculated. In this instance, the intercept value is -7.432 and the coefficient is 0.157. Hence, the sigmoid equation is:
The p-values for each of these values are also very good. Both of them are smaller than 0.01.
One of the reasons we used a single variable model was so that we could show it on a two-dimensional graph. Figure 13.12 illustrates the original data with the sigmoid curve plotted on the same graph. You can see that there are a couple of outliers that would not have been predicted correctly. But, the data that was used to verify the results did not have any erroneous predictions.
Step 3. Making Predictions on New Data
The final ability of the tool is to apply the calculated model to new data that needs to be tested. Figure 13.13 illustrates ten data points with different values of expenditures.
Referring back to Figure 13.13, Step 3 requires the name of the Data Sheet for the new data and the column name of the independent variable. JADE then calculates three columns. The first column is the Odds Ratio. As shown in the sidebar box, the Odds Ratio calculation is e(7.425 + 0.157 * expenditure). The Odds Ratio calculation is done using the log of the probabilities that best fits the data, as was explained earlier. The next column is the likelihood or probability of a win. Remember, the equation for the probability is P = Odds/(1+Odds). This likelihood information can be useful to observe what the probability of a win might be.
The third column is the actual prediction. If the likelihood of success is greater than 50%, then it is assigned a win. If less than 50%, it is assigned a loss.
Looking at the data, we see that the second person spent $65,000 on his/her campaign, which gives a 94% likelihood of winning, and was assigned a 1 or win. However, the last person in the table spent $44,000, which yielded a likelihood of winning of 37%, and thus was assigned a loss.