Bivariate Statistics

Correlation

Data analysts see many datasets throughout their careers. Often, the datasets they are asked to examine include data they are not familiar with. In order to properly analyze the data, analysts first need to spend time gaining an understanding of the data.

Upon receiving a dataset, one of the first actions an analyst should take is to check if there is any association between the variables included in the dataset. In particular, it is important to check for an association between the variable you are predicting, and the explanatory variables.

The Pearson correlation coefficient, r, is the statistic that measures the strength of the linear relationship, or association, between two numerical variables. The value of the correlation coefficient ranges from -1 to 1.

Figure 20.1: Pearson Correlation Coefficient (r)

A value of -1 indicates perfect correlation between the two numeric variables in a negative (downward) direction. A value of 0 indicates no linear correlation. A value of 1 indicates perfect correlation between the two variables in a positive (upward) direction. The image below shows perfect positive correlation (r = 1) on the left, and perfect negative correlation (r = -1) on the right.
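To make this concrete, r can be computed directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The sketch below (the helper name pearson_r is ours, not from the text) confirms the perfect-correlation cases described above:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Perfectly linear data in a positive and a negative direction
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))   # ≈ 1.0  (perfect positive)
print(pearson_r(x, [10, 8, 6, 4, 2]))   # ≈ -1.0 (perfect negative)
```

In practice you would use a library routine such as `numpy.corrcoef` or `scipy.stats.pearsonr` rather than writing this by hand, but the arithmetic is the same.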

Figure 20.2: Perfect Correlations

The image below, created by Denis Boigelot, shows a range of plots and the corresponding correlation coefficients. These show that r can "recognize" linear associations but does not recognize other systematic patterns of association between two numeric variables.

Figure 20.3: Patterns and associated Pearson correlation values

Further insight comes from considering Anscombe's Quartet, where all four data sets have the same Pearson correlation value of +0.816, the same R-squared (R2) value of 0.665, and the same line of best fit: y = 3 + 0.5x. Even though these examples all follow the general pattern "as x increases, y tends to increase," each does so in a very different way.

Figure 20.4: Anscombe's quartet
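The quartet's shared statistics are easy to verify. The sketch below uses Anscombe's published data values and a hand-rolled correlation helper (the name pearson_r is ours); all four sets come out at r ≈ 0.816 despite looking completely different when plotted:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

# Anscombe's quartet: four (x, y) data sets with near-identical statistics
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = [
    (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
     [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
]
for i, (x, y) in enumerate(quartet, start=1):
    print(f"Set {i}: r = {pearson_r(x, y):.3f}")   # r = 0.816 for every set
```

The identical summary statistics are exactly why the quartet is the standard argument for plotting data before trusting r.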

Blind spots in the R2 statistic do not just occur when there is a positive correlation and positive slope in the data. They also occur when there is a negative correlation and a negative slope.

It is important to recognize that these blind spots happen in r, R2, and linear equations for lines of best fit. Each of these statistics provides valuable information, but none tells the whole story. They are no substitute for creating scatter plots to visualize the data, because such plots reveal patterns that a single calculated statistic cannot capture.

Slope

Slope is a statistic that measures the degree of positive (upward) or negative (downward) slant in the relationship between two numerical variables.

In the linear equation y = mx + b, m is the slope and is equal to rise/run.

Figure 20.5: Slope Formula
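As a minimal illustration (the function name rise_over_run is ours), the slope of the line through two points is just the rise divided by the run:

```python
def rise_over_run(x1, y1, x2, y2):
    """Slope m of the line through (x1, y1) and (x2, y2)."""
    return (y2 - y1) / (x2 - x1)

# From (0, 1) to (4, 9): rise = 8, run = 4, so m = 2
print(rise_over_run(0, 1, 4, 9))    # 2.0
print(rise_over_run(0, 0, 2, -6))   # -3.0 (downward slant)
```

For a least-squares line of best fit, the slope can also be written as m = r · (s_y / s_x), which ties it back to the correlation coefficient from the previous section.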

Coefficient of Determination

The coefficient of determination (denoted by R2) is a key output of regression analysis, that is, of analyses in which numbers are predicted by other numbers. Note that multiple linear regression and simple linear regression are just two of many prediction methods that use numeric input variables to predict numeric output variables; the term regression here refers to all methods that predict a numeric output variable from one or more numeric input variables. R2 is interpreted as the proportion of the variance in the outcome (dependent) variable that is predictable from the input (independent) variable(s).

With simple linear regression, where there is only one numeric predictor, and when an intercept is included in the model, the coefficient of determination is equal to the square of the Pearson correlation r between the x and y scores. If additional regressors (predictors) are included, R2 is the square of the coefficient of multiple correlation. In both such cases, the coefficient of determination ranges from 0 to 1.
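The equivalence between R2 and r squared in the one-predictor, with-intercept case can be checked with a short sketch (the helper names are ours): fit the least-squares line, then compute R2 as one minus the ratio of residual to total sum of squares.

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def r_squared(x, y):
    """R2 for the simple linear model y = b0 + b1*x fit by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    b0 = my - b1 * mx
    ss_res = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(r_squared(x, y))          # ≈ 0.64
print(pearson_r(x, y) ** 2)     # ≈ 0.64 -- the same value
```

Note that this equivalence does not hold in general for multiple regression or for models fit without an intercept, where R2 must be computed from the sums of squares directly.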

An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable from the input variables using the model.

An R2 of 0 means that the dependent variable cannot be predicted from the independent variable using a model.

An R2 of 1 means the dependent variable can be predicted perfectly (without errors) from the independent variable using a model.

An R2 of 0.10 means that 10 percent of the variance in Y is predictable; an R2 of 0.20 means that 20 percent is predictable; and so on.