10.1 Introduction to Logistic Regression
In this chapter we will discuss a classification method that is computationally fast and efficient, even on large datasets. Similar to linear regression, logistic regression involves the creation of a model based on specific predictor variables that are related to the response variable (Shmueli, Bruce, and Patel, 2016). Also like linear regression, these predictor variables can be either numeric or categorical. So what's new? Logistic regression is used when, instead of trying to predict a continuous numeric variable, we are trying to predict which instances belong to specific classes or categories. In other words, the outcome variable is categorical. The table below summarizes the predictive models that we have examined so far in this course.
Logistic regression is used to predict categorical variables, or in other words, to classify a response as belonging to the correct class. For example, say it's Valentine's Day and you receive an anonymous note with a message written at the top, "Will you be my valentine?" accompanied by three options: Yes, No, and Maybe.
Each of these options is a class. If the note had only two options, Yes and No, then you would refer to the situation as binary classification, or classification into only two classes. For this chapter, we are going to deal with an outcome variable that is binary (a categorical outcome variable that has two values, such as "yes" and "no") rather than a continuous numeric outcome variable. Logistic regression can also be applied to ordinal outcomes, that is, categorical outcomes with more than two ordered categories, such as those found in many surveys. However, that is beyond the scope of this course.
What is the logistic curve? What is the base of the natural logarithm? Why do statisticians prefer logistic regression to ordinary linear regression when the output variable is binary? How are probabilities, odds, and logits related? What is a logit? What is an odds ratio? How is logistic regression similar to, and different from, linear regression? How is the β weight in logistic regression for a categorical variable related to the odds ratio of its constituent categories?
This chapter is challenging because there are many new concepts in it. Studying this may bring back feelings that you had early in the course, when there were many new concepts each week.
The applications of logistic regression are many. A few examples include classifying business ventures as profitable or unprofitable and predicting whether a loan will be paid in full or defaulted upon. A further benefit of logistic regression is that it makes it possible to identify the factors that differentiate two classes, such as the attributes that distinguish graduate students from undergraduate students.
It is customary to code a binary dependent variable (DV) as either 1 or 0. For example, we might code a successfully kicked field goal as 1 and a missed field goal as 0; yes as 1 and no as 0; or admitted as 1 and rejected as 0.
If we code like this, then the mean of the distribution is equal to the proportion of 1s in the distribution. For example, if there are 100 people in the distribution and 30 of them are coded 1, then the mean of the distribution is .30, which is the proportion of 1s. The mean of the distribution is also the probability of drawing a person labeled as 1 at random from the distribution.
That is, if we grab a person at random from our sample of 100 just described, the probability that the person will be a 1 is .30. Therefore, proportion and probability of 1 are the same in such cases. The mean of a binary distribution so coded is denoted as P, the proportion of 1s. The proportion of zeros is (1-P), which is sometimes denoted as Q. The variance of such a distribution is PQ, and the standard deviation is Sqrt(PQ). Why can't all stats be this easy?
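As a quick check, here is a minimal sketch of these facts using the 30-out-of-100 example above. The use of NumPy here is an illustration of the arithmetic, not part of the chapter's material.

```python
# A minimal sketch: for a 0/1-coded variable, the mean is the proportion of 1s (P),
# the variance is PQ, and the standard deviation is Sqrt(PQ).
import numpy as np

outcomes = np.array([1] * 30 + [0] * 70)  # 30 people coded 1, 70 coded 0

P = outcomes.mean()     # proportion of 1s -> 0.30
Q = 1 - P               # proportion of 0s -> 0.70
print(P, Q)             # 0.3 0.7
print(P * Q)            # variance PQ: 0.21
print(np.sqrt(P * Q))   # standard deviation Sqrt(PQ): ~0.458
print(outcomes.var())   # population variance of the data, which matches PQ: 0.21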
Suppose we want to predict whether an adult is male or female (Male = 1, Female = 0) using height in inches as the predictor variable. If we plot the relation between the two variables, the plot might look something like this (a code sketch that generates a similar plot follows the list below):
Points to notice about the graph (these data are fictional) are as follows:
- The regression line is a rolling average, just as in linear regression. The Y-axis is P, the proportion of 1s at any given value of height. As height increases, the probability of someone being male increases.
- The regression line is nonlinear.
- None of the observations (the raw data points) actually fall on the regression line; they all fall on zero or one. (See graph.)
- As the probability of being male increases, the probability of being female decreases, because the probability of being female equals 1 minus the probability of being male, and vice versa. (See graph.)
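To make the picture concrete, here is a minimal sketch that generates fictional height data (the means, spreads, and sample sizes are assumptions, not values taken from the graph) and fits a logistic curve using scikit-learn's LogisticRegression. It reproduces the features noted above: every raw point sits at 0 or 1, while the fitted curve traces the probability of being male at each height.

```python
# A sketch with fictional data, echoing the male/female height example.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
height_f = rng.normal(64, 3, 100)            # assumed female heights (inches)
height_m = rng.normal(70, 3, 100)            # assumed male heights (inches)
X = np.concatenate([height_f, height_m]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)          # Female = 0, Male = 1

model = LogisticRegression().fit(X, y)

# The raw observations fall only on 0 or 1; the fitted S-shaped curve gives
# P, the probability of being male, at each height.
grid = np.linspace(X.min(), X.max(), 200).reshape(-1, 1)
plt.scatter(X, y, alpha=0.3, label="observations (all 0 or 1)")
plt.plot(grid, model.predict_proba(grid)[:, 1], label="P(male | height)")
plt.xlabel("Height (inches)")
plt.ylabel("P (proportion of 1s)")
plt.legend()
plt.show()
```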
Another way to characterize the relationship between made and missed field goals would be with odds, that is, the ratio of successes to failures. In this example, there are 30 successes to 70 failures, which simplifies to 3 to 7 odds of success. We will work with both probabilities and odds later in this chapter, because understanding both, and how they are related, is essential.
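As a small illustration, the sketch below converts between probability and odds for the 30-out-of-100 example. The helper functions prob_to_odds and odds_to_prob are hypothetical names introduced here for clarity, not functions from any library.

```python
# Converting between probability and odds for the field goal example.
def prob_to_odds(p):
    """Odds of success = P / (1 - P)."""
    return p / (1 - p)

def odds_to_prob(odds):
    """Probability of success = odds / (1 + odds)."""
    return odds / (1 + odds)

p_success = 0.30
odds = prob_to_odds(p_success)   # 0.30 / 0.70 = 3/7, about 0.43
print(odds)                      # i.e., 3 to 7 odds of success
print(odds_to_prob(3 / 7))       # back to the probability: 0.30
```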