Introduction to Logistic Regression

What is logistic regression? It is a statistical analysis method used to predict a binary outcome, such as Yes or No, based on prior observations in a training data set. A logistic regression model predicts a dependent variable by analyzing its relationship to one or more independent variables. Because logistic regression is a predictive model, it is an important tool in machine learning: it allows algorithms to classify incoming data based on historical data.

While linear regression is the most commonly used method for estimating a continuous target variable, regular linear regression is not suitable when the problem has a binary target variable (where the outcome is 1 or 0: yes or no, spam or ham, buy or not buy, fraud or not fraud, and so on). In linear regression we used the method of Least Squares to calculate a trendline, and that trendline allowed us to predict new values of the dependent variable given values of the independent variable.

However, when the dependent variable takes only the values 0 or 1 (yes/no, pass/fail, cancer/no cancer, win/lose), a linear regression model cannot provide a good fit. Figure 13.1 illustrates a graph of 0/1 y-values for x from -10 to +10 with a linear regression line. It is obvious that linear regression is not appropriate for this type of problem.

Figure 13.1: Least Squares Regression

In this situation, fitting a linear regression violates several of the assumptions on which it is based, including normally distributed error terms, homoscedasticity (equal spread of the differences between predicted and observed values), and a linear relationship between x and the mean of y. It also creates the problem of an unconstrained range for the predicted values: the model can produce negative probabilities or probabilities over 100%.
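To make the unconstrained-range problem concrete, here is a minimal sketch in Python/NumPy (not part of the Excel workflow in this chapter) that fits an ordinary least-squares line to 0/1 outcomes like those in Figure 13.1; the toy x and y values are illustrative assumptions, not data from the chapter.

    import numpy as np

    # Illustrative 0/1 data: x runs from -10 to +10, y is 0 for negative x, 1 otherwise.
    x = np.arange(-10, 11, dtype=float)
    y = (x >= 0).astype(float)

    # Least-squares slope (m) and intercept (b), as in ordinary linear regression.
    m, b = np.polyfit(x, y, deg=1)

    predictions = m * x + b
    print(predictions.min(), predictions.max())  # dips below 0 and rises above 1

The fitted line's predictions fall below 0 at the left end of the range and exceed 1 at the right end, which is exactly the poor behavior Figure 13.1 shows.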

Logistic regression is a statistical analysis method designed for exactly this kind of binary outcome, allowing a straightforward decision between the two alternatives. A much better fit to 0/1 data is the logistic function (also called the sigmoid function or inverse logit function).

Note: The term “logistic” refers to the fact that the model works with logarithms; it does NOT derive from the words “logic” or “logical.”

A sigmoid or logistic function can be expressed as y = 1/(1 + e^-(mx + b)). Note that the logistic function contains the regression equation (mx + b), but as an exponent. Thus, this model works with logarithms to calculate the intercept (b) and coefficients (m) of the mx + b equation. Also note that the mx term may in reality involve multiple input variables (x1, x2, x3, …), each with its own coefficient, just as we observed with multiple linear regression.
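The equation is easy to express as a small function. The following Python sketch (the names sigmoid, m, b, and x are just illustrative) evaluates 1/(1 + e^-(mx + b)); the comments note how inverting the equation produces a logarithm, the log-odds or “logit,” which is the sense in which the model works with logarithms.

    import math

    def sigmoid(x, m, b):
        """Return 1 / (1 + e^-(m*x + b)), which always lies between 0 and 1."""
        return 1.0 / (1.0 + math.exp(-(m * x + b)))

    # With several inputs, m*x becomes m1*x1 + m2*x2 + m3*x3 + ..., just as in
    # multiple linear regression.

    # Inverting the equation gives ln(y / (1 - y)) = m*x + b: the log-odds (logit).

    for value in (-10, 0, 10):
        print(value, round(sigmoid(value, m=1, b=0), 4))  # about 0.0, 0.5, and 1.0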

Figure 13.2 illustrates a logistic function that is a much better fit for a binary classification. The figure plots values of x from -10 to +10 against the sigmoid function 1/(1 + e^-x). Except for a few values around x = 0, the logistic function yields values close to 0 or 1, and it is constrained to values greater than 0 and less than 1. So, no matter how large x becomes in either the positive or negative direction, the output is bounded between 0 and 1.

Figure 13.2: Logistic Function

The objective of logistic regression is to find the sigmoid curve that best fits the sample data, which means finding the values of the intercept and the coefficients that yield the closest fit to the data points. There is no pre-programmed way to perform logistic regression in Excel; unlike regular regression, it is not an option in the Data Analysis tools. However, there are several ways, ranging from simple to more complex, to do it manually in Excel, and we will use a combination approach in this chapter.
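For readers who want to see the idea outside Excel, the sketch below shows one way to search for the best intercept and coefficient: repeatedly nudging m and b in the direction that improves the fit (gradient descent on the log-loss). This is only an illustration under assumed settings (the toy data, learning rate, and iteration count are arbitrary choices), not the procedure this chapter builds in Excel.

    import numpy as np

    def fit_logistic(x, y, learning_rate=0.1, iterations=5000):
        """Search for the m and b whose sigmoid curve best fits the 0/1 data."""
        m, b = 0.0, 0.0
        for _ in range(iterations):
            p = 1.0 / (1.0 + np.exp(-(m * x + b)))   # current sigmoid predictions
            # Gradients of the average log-loss with respect to m and b.
            grad_m = np.mean((p - y) * x)
            grad_b = np.mean(p - y)
            m -= learning_rate * grad_m
            b -= learning_rate * grad_b
        return m, b

    # Small illustrative data set with some overlap between the 0s and 1s.
    x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
    y = np.array([ 0.0,  0.0,  1.0, 0.0, 1.0, 1.0, 1.0])
    m, b = fit_logistic(x, y)
    print(m, b)   # fitted coefficient and intercept of the best sigmoid curve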