2.1 CRISP-DM: Data Mining Process
Methodologies are frameworks for performing tasks; they help ensure that every important step is covered. The methodology we will use in this class is CRISP-DM, the Cross-Industry Standard Process for Data Mining. CRISP-DM is currently the leading framework used and taught for data mining. Although the framework is standard, the specific tools, software, and statistical techniques vary greatly. See the figure below for a depiction of the phases involved:
Phases of CRISP-DM
We will briefly explain each phase below (with portions drawn from Cross-industry standard process for data mining).
Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary plan designed to achieve the objectives. A decision model, especially one built using the Decision Model and Notation (DMN) standard, can be used to document this phase.
Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities that familiarize the analyst with the data: identifying data quality problems, discovering first insights, and detecting interesting subsets that suggest hypotheses about hidden information.
This phase is important because it helps the analyst understand the data and makes sure the quality of the data is adequate to represent relationships and serve as a reliable foundation from which to derive valid inferences. The vast majority of your time will, and should, be spent on understanding the data. If you don't understand the data, then you are taking large risks when it comes time to perform modeling and make business decisions based on those models.
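As a minimal illustration of this kind of exploration, the sketch below uses pandas to take a first look at a dataset. The file name and column name (customers.csv, region) are hypothetical stand-ins for a real project's data.

```python
import pandas as pd

# Load the raw data (the file name is a hypothetical example).
df = pd.read_csv("customers.csv")

# Column names, types, and non-null counts in one view.
df.info()

# Summary statistics for numeric columns help spot suspicious ranges.
print(df.describe())

# Missing values per column: a basic data quality check.
print(df.isna().sum())

# Frequencies of a categorical column can reveal interesting subsets.
print(df["region"].value_counts())
```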
Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data.
The necessary steps depend on the data to be prepared. Common steps include:

- Combining data from multiple data sources
- Giving appropriate names to data columns
- Data cleaning and fixing data errors
- Investigating outliers
- Transforming data so it can be used by data analysis tools
- Partitioning data
Typically, data scientists iterate on these preparation steps until they are confident in their results. Then, the cleaning and preparation are automated in an ETL (extract, transform, load) process.
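To make these steps concrete, here is a brief sketch in pandas and scikit-learn that touches each item in the list above. The file names, column names, and cleaning rules are assumptions made for the example, not a prescription.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Combine data from multiple sources on a shared key (names are hypothetical).
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")
df = customers.merge(orders, on="customer_id", how="inner")

# Give an appropriate name to a cryptic column.
df = df.rename(columns={"amt": "order_amount"})

# Clean data: here we assume negative amounts are recording errors.
df = df[df["order_amount"] >= 0]

# Flag outliers for investigation using a simple IQR rule.
q1, q3 = df["order_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
df["outlier"] = ~df["order_amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Transform a categorical column into indicators usable by modeling tools.
df = pd.get_dummies(df, columns=["region"])

# Partition the data into training and test sets.
train, test = train_test_split(df, test_size=0.2, random_state=42)
```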
Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, several techniques exist for the same data mining problem type, and each algorithm works better on some problems than others, so multiple methods should be investigated. Some techniques also have specific requirements on the form of the data, so stepping back to the data preparation phase is often needed.
This phase is where much of this course will focus. We will learn many different forms of modeling and alternative calculations for each modeling type.
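As an example of trying multiple techniques on the same problem, the sketch below fits three common classifiers with scikit-learn and compares their accuracy on a held-out set. The built-in breast cancer dataset stands in for a prepared project dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A public dataset stands in for the project's prepared data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Several techniques for the same problem type; each has its own
# requirements (e.g., k-NN is sensitive to unscaled features).
models = {
    "logistic regression": LogisticRegression(max_iter=5000),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "k-nearest neighbors": KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```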
Evaluation
At this stage in the project, you have built a model (or models) that appears to have high quality from a data analysis perspective. Before proceeding to final deployment, it is important to evaluate the quality of the various models more thoroughly. There are a handful of model evaluation methods that are typically used, which we will examine later in this course; these let you see which models perform best.
Review the steps taken to construct the model to be certain it properly achieves the business objectives. Determine if all important business issues have been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
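One widely used evaluation method we will cover is k-fold cross-validation, sketched below with scikit-learn. It estimates model quality more thoroughly than a single train/test split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold cross-validation: train and score on five different splits,
# then summarize, rather than trusting one lucky (or unlucky) split.
scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```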
Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that is useful to the customer. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data scoring (e.g., segment allocation) or data mining process. In many cases it will be the customer, not the data analyst, who carries out the deployment steps. Even if the analyst deploys the model, it is important for the customer to understand the actions that must be carried out to actually make use of the created models.
In practice, we integrate the model's results into information systems so that they produce the decision-supporting information required. The system continues to collect model-related information so that additional model tuning can be performed in the future. This data is used to tune future iterations of the model, completing the loop required for future machine learning to occur.
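As a minimal sketch of that loop, the example below persists a trained model with joblib so a production system can load it and score new records. The training step here is a placeholder for whichever model was chosen during evaluation.

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# Placeholder training step: in practice, this is the model chosen earlier.
X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000).fit(X, y)

# Persist the model so an information system can reuse it.
joblib.dump(model, "model.joblib")

# Later, inside the production system: load the model and score new data.
deployed = joblib.load("model.joblib")
print(deployed.predict(X[:5]))  # new records would arrive here in practice

# Scored records and their eventual outcomes would be logged here,
# providing the data used to tune future iterations of the model.
```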
Perhaps the most important takeaway from CRISP-DM is that data mining is an iterative process, often with no clear "finish" point. Rather, data scientists must continually evaluate and re-evaluate their models as new data is gathered and new algorithms emerge. At some point, however, a decision must be made about the best model so that deployment can begin. Data scientists must learn to "satisfice," so that the pursuit of perfection doesn't end up costing more than the benefits of incremental improvements to the models.