Purpose of Statistics in Data Mining

As a data analyst, it is your responsibility to recognize patterns in data and to create models that describe your data well, so that you can then use those models to make inferences about the future. For example, if you were given a dataset describing the selling price of trucks in a given year, you could create a model to predict the selling price of trucks in the following year.

Before we can make valid inferences about the future, our data needs to meet some qualifications. We use basic statistics and data visualizations to confirm that the data meets these qualifications.

After building our models, we again use statistics and visualizations to check the performance or accuracy of our models.
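As a rough illustration of checking a model after it is built, the sketch below compares predicted values against actual values. The two measures used here, root mean squared error (RMSE) and R-squared, are assumptions chosen for the example, not necessarily the ones this class will use, and the truck prices are made-up numbers.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: the typical size of a prediction error."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

def r_squared(actual, predicted):
    """Share of the variation in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical selling prices (thousands of dollars) and a model's predictions.
actual_prices = [40.0, 36.0, 33.0, 29.0, 27.0]
predicted_prices = [39.0, 37.0, 33.0, 30.0, 26.0]

model_rmse = rmse(actual_prices, predicted_prices)
model_r2 = r_squared(actual_prices, predicted_prices)
print(f"RMSE: {model_rmse:.3f}, R-squared: {model_r2:.3f}")
```

A lower RMSE and an R-squared closer to 1 both suggest that the predictions track the actual values more closely.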

We can organize the statistics we will address in this class into three primary groups.

  1. Statistics used before building models
  2. Statistics used after building models, but before making inferences
  3. Statistics used after building models, to interpret model performance

Note that some of these statistics may also be used at other steps of the data analysis process.

Statistics used before building models include descriptive statistics such as Variance and Standard Deviation, as well as statistics that help determine whether a distribution is a Normal Distribution. Correlation and Slope are bivariate statistics that describe the relationship between two variables.
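The statistics named above can be computed with a few lines of Python. This is a minimal sketch using only the standard library. The truck ages and prices are hypothetical values invented for illustration, and the helper names `describe` and `correlation_and_slope` are not part of any library.

```python
import statistics

# Hypothetical data: truck age (years) and selling price (thousands of dollars).
ages = [1, 2, 3, 4, 5]
prices = [40.0, 36.0, 33.0, 29.0, 27.0]

def describe(values):
    """Return the mean, sample variance, and sample standard deviation."""
    return (
        statistics.mean(values),
        statistics.variance(values),  # sample variance (n - 1 denominator)
        statistics.stdev(values),     # square root of the sample variance
    )

def correlation_and_slope(x, y):
    """Return Pearson's correlation and the least-squares slope of y on x."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    r = sxy / (sxx * syy) ** 0.5  # strength and direction of the linear relationship
    slope = sxy / sxx             # change in y per one-unit change in x
    return r, slope

price_mean, price_variance, price_stdev = describe(prices)
r, slope = correlation_and_slope(ages, prices)
print(f"variance={price_variance}, stdev={price_stdev:.3f}")
print(f"correlation={r:.3f}, slope={slope:.2f}")
```

In this made-up data the correlation is close to -1 and the slope is negative, meaning that price falls fairly steadily as age increases.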