1.3 The Data Mining Process
Although data mining's roots can be traced back to the late 1980s, the field was still in its infancy for most of the 1990s. Data mining was still being defined and refined, and it was largely a loose conglomeration of data models, analysis algorithms, and ad hoc outputs. In the mid-1990s, several sizeable companies, including auto maker Daimler-Benz, insurance provider OHRA, hardware and software manufacturer NCR Corporation, and statistical software maker SPSS Inc., began working together to formalize and standardize an approach to data mining. The result of their work, first published in 1999, was CRISP-DM, the Cross-Industry Standard Process for Data Mining. Although the participants in the creation of CRISP-DM certainly had vested interests in certain software and hardware, the process was designed independently of the tools used for data mining. It was written to be conceptual in nature, something that could be applied independent of any particular tool or kind of data. The process consists of six steps or phases, as illustrated in Figure 1.1. Note that depending on outcomes at some steps in the process, we may actually go back to a prior step to rework some of our previous efforts or assumptions.
CRISP-DM Step 1: Business (Organizational) Understanding
The first step in CRISP-DM is Business Understanding, or what will be referred to in this text as Organizational Understanding, since organizations of all kinds, not just businesses, can use data mining to answer questions and solve problems. This step is crucial to a successful data mining outcome, yet it is often overlooked as people try to dive right into mining their data. This is natural, of course; we are often anxious to generate some interesting output because we want to find answers. But you wouldn't begin building a car without first defining what you want the vehicle to look like and do, and without first designing what you are going to build. Consider these oft-quoted lines from Lewis Carroll's Alice's Adventures in Wonderland:
"Would you tell me, please, which way I ought to go from here?" [said Alice.]
"That depends a good deal on where you want to get to," said the Cat.
"I don't much care where—" said Alice.
"Then it doesn't matter which way you go," said the Cat.
"—so long as I get SOMEWHERE," Alice added as an explanation.
"Oh, you're sure to do that," said the Cat, "if you only walk long enough."1
Such would be the case if you were to build models without first understanding your organizational question and the data you have to answer it. You can mine data all day long and into the night, but if you don't know what you want to know, if you haven't defined any questions to answer, or if you don't understand the relationship between your data and the questions you want to answer, then the efforts of your data mining are less likely to be fruitful. Start with high-level ideas: What is making my customers complain so much? How can I increase my per-unit profit margin? How can I anticipate and fix manufacturing flaws in order to avoid shipping a defective product? These are Organizational Understanding types of questions. Define what you want to know. From here, you can begin to develop the more specific questions you want to answer, and this will enable you to proceed to . . .
CRISP-DM Step 2: Data Understanding
As with Organizational Understanding, Data Understanding is a preparatory activity, and sometimes, its value is lost on people. Don't let its value be lost on you! Years ago, when workers did not have their own computers (or multiple computers) sitting on their desk (or lap, or in their pocket), data were centralized. If you needed information from a company's data store, you could request a report from someone who could query that information from a central database (or fetch it from a company filing cabinet) and provide the results to you. The inventions of the personal computer, workstation, laptop, tablet computer, and smartphone have each triggered moves away from data centralization. As hard drives became simultaneously larger and cheaper, and as software became increasingly accessible and easier to use, data began to disperse across the enterprise. Over time, valuable data stores became strewn across hundreds and even thousands of devices, sequestered in marketing managers' spreadsheets, customer support databases, and human resources file systems, just to name a few places.
As you can imagine, this has created a multifaceted data problem. Marketing may have wonderful data that could be a valuable asset to senior management, but senior management may not be aware of the data's existence, either because of territorialism on the part of the marketing department or because the marketing folks simply haven't thought to tell the executives about the data they've gathered. The same could be said of the information sharing, or lack thereof, between almost any two business units in an organization. In the language of corporations, the term "silos" is often invoked to describe the separation of units to the point where interdepartmental sharing and communication are poor or almost nonexistent. It is unlikely that effective organizational data mining can occur when employees do not know what data they have (or could have) at their disposal or where those data are currently located. In Chapter 2, we will take a closer look at some mechanisms that organizations use to try to bring all their data into a common location. These include databases, data marts, and data warehouses.
Simply centralizing data is not enough, however. There are many questions that may arise once an organization's data have been corralled. Where did the data come from? Who collected them and was there a standard method of collection? What do the various columns and rows of data mean? How old are the data? Are there acronyms or abbreviations that are unknown or unclear? Information about the source, composition, age, etc. of your data is often referred to as metadata. In order to compile metadata on your data sets, you may need to do some research in the Data Understanding phase of your data mining activities. Sometimes you will need to meet with subject matter experts in various departments to unravel where certain data came from, how they were collected, and how they have been coded and stored. It is critically important that you verify the accuracy and reliability of the data as well. The old adage "Something is better than nothing" does not apply in data mining. Inaccurate or incomplete data could be worse than nothing in a data mining activity, because decisions based upon incomplete or wrong data are likely to be incomplete or wrong decisions. Once you have gathered, identified, and understood your data assets, then you may engage in . . .
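To make these Data Understanding activities concrete, below is a minimal first-pass data profile sketched in R, one of the tools this book uses. The file name customer_data.csv and its contents are hypothetical stand-ins for whatever data set you have corralled; the point is the kind of questions each command helps answer.

```r
# A hypothetical first-pass data profile: what is in this data set,
# how is it coded, and how complete is it?
customers <- read.csv("customer_data.csv", stringsAsFactors = FALSE)

str(customers)             # column names, data types, and sample values
summary(customers)         # value ranges: do the ages, dates, and amounts make sense?
head(customers, 10)        # eyeball the first ten rows for coding surprises

colSums(is.na(customers))  # how many values are missing in each column?
```

None of this replaces sitting down with the subject matter experts who collected the data, but it does tell you which questions to ask them.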
CRISP-DM Step 3: Data Preparation
Data come in many shapes and formats. Some data are numeric, some are in paragraphs of text, and others are in picture form such as charts, graphs, and maps. Some data are anecdotal or narrative, such as comments on a customer satisfaction survey or the transcript of a witness's testimony. Data that aren't in rows or columns of numbers shouldn't be dismissed, though; sometimes non-traditional formats can be the most information-rich. Such data are often referred to as unstructured data, and they are becoming increasingly prevalent. We'll talk in this book about approaches to formatting data, beginning in Chapter 2. Although rows and columns will be one of our most common layouts, we'll also get into text mining in a later chapter, where paragraphs and documents can be fed into RapidMiner or R and then analyzed for patterns as well.
Data Preparation involves a number of activities. These may include joining two or more data sets together, reducing data sets to only those variables that are interesting in a given data mining exercise, scrubbing data clean of anomalies such as outlier observations or missing data, or reformatting data for consistency purposes. For example, you may have seen a spreadsheet or database that holds phone numbers in many different formats:
(555) 555-1234     555/555-1234
555-555-1234       555.555.1234
555 555 1234       5555551234
Each of the entries above shows the same phone number stored in a different format. A data mining exercise is most likely to yield good, useful results when the underlying data are as consistent as possible. Data Preparation helps improve your chances of a successful outcome when you begin . . .
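To illustrate, here is a minimal sketch in R of how that phone number cleanup might be scripted. The data frame and column name are hypothetical; the general technique is to strip every non-digit character and then rewrite the bare digits into one agreed-upon format.

```r
# Hypothetical customer records with inconsistently formatted phone numbers
customers <- data.frame(
  phone = c("(555) 555-1234", "555/555-1234", "555-555-1234",
            "555.555.1234", "555 555 1234", "5555551234"),
  stringsAsFactors = FALSE
)

digits <- gsub("[^0-9]", "", customers$phone)   # strip everything but digits

# Rewrite the ten bare digits into a single consistent format: 555-555-1234
customers$phone <- sub("^([0-9]{3})([0-9]{3})([0-9]{4})$",
                       "\\1-\\2-\\3", digits)

print(customers$phone)  # all six entries now read "555-555-1234"
```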
CRISP-DM Step 4: Modeling
A model, in data mining, is a computerized representation of real-world observations. Models are the application of algorithms to seek out, identify, and display any patterns or messages in your data. There are two basic types of models in data mining: those that classify and those that predict.
As you can see in Figure 1.2, there is some overlap between types of models in data mining. For example, this book will teach you about decision trees. A decision tree is a predictive model used to determine which attributes of a given data set are the strongest indicators of a given outcome. The outcome is expressed as the likelihood that an observation will fall into one of several categories. Thus, decision trees are predictive in nature, but they also help us to classify our data. This will probably make more sense when we get to the chapter on decision trees, but for now, it's important just to understand that models help us to classify and predict based on patterns the models find in our data.
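For a first taste of what a model looks like in code, here is a minimal decision tree sketch in R using the rpart package and R's built-in iris data set. Both are illustrative stand-ins; your own model would be trained on the data you prepared in Step 3, to answer the question you defined in Step 1.

```r
library(rpart)  # recursive partitioning: classification and regression trees

# Fit a tree that classifies iris species from the four measured attributes
tree_model <- rpart(Species ~ ., data = iris, method = "class")

print(tree_model)  # shows which attributes are the strongest indicators

# For each observation, the likelihood of falling into each category
head(predict(tree_model, iris, type = "prob"))
```

Notice how the output blends classification (which category?) with prediction (how likely?), just as described above.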
Models may be simple or complex. They may contain only a single process, or they may contain subprocesses. Regardless of their layout, models are where data mining moves from preparation and understanding to development and interpretation. We will build a number of example models in this text. Once a model has been built, it is time for . . .
CRISP-DM Step 5: Evaluation
All analyses of data have the potential for false positives: that is, an expectation based on the data that is not actually true. However, even if a model doesn't yield false positives, the model may not find any interesting patterns in your data. This may be because the model isn't set up well to find the patterns, because you are using the wrong technique, or because there simply isn't anything interesting in your data for the model to find. The Evaluation phase of CRISP-DM is there specifically to help you determine how valuable your model is, what it has found (if anything), and what you might want to do with the results. If you refer back to the CRISP-DM diagram in Figure 1.1, you will see that sometimes the Evaluation phase takes you back to Business Understanding without going on to the Deployment step. You may have learned something, but not something you want to use in your day-to-day operations. In some instances, your evaluation insights may prompt you to start your analytical process over, this time with an enhanced Business Understanding and perhaps with better Data Understanding and plans to improve Data Preparation.
Evaluation can be accomplished using a number of techniques, both mathematical and logical in nature. This book will examine techniques for cross-validation and testing for false positives using RapidMiner. For some models, the power or strength indicated by certain test statistics will also be discussed. Beyond these measures, however, model evaluation must also include a human aspect. As individuals gain experience and expertise in their field, they will have operational knowledge that may not be measurable in a mathematical sense but is nonetheless indispensable in determining the value of a data mining model. This human element will also be discussed throughout the book. Using both data-driven and experiential evaluation techniques to determine a model's usefulness, we can then decide how to move on to . . .
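As one example of a mathematical evaluation technique, here is a minimal sketch of 10-fold cross-validation written in plain R. (The book's own cross-validation examples use RapidMiner; this sketch simply reuses the illustrative rpart tree and iris data from the Modeling step.) Each observation is held out exactly once, so the averaged accuracy estimates how the model performs on data it has never seen.

```r
library(rpart)

set.seed(42)  # make the random fold assignment reproducible
k <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))

accuracy <- numeric(k)
for (i in 1:k) {
  train <- iris[folds != i, ]  # train on nine folds...
  test  <- iris[folds == i, ]  # ...and test on the held-out tenth
  model <- rpart(Species ~ ., data = train, method = "class")
  preds <- predict(model, test, type = "class")
  accuracy[i] <- mean(preds == test$Species)
}

mean(accuracy)  # average accuracy on unseen data across the ten folds
```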
CRISP-DM Step 6: Deployment
If you have successfully identified your questions, prepared data that can answer those questions, and created a model that passes the test of being interesting and useful, then you have arrived at the point of actually using your results. This is the Deployment stage, and it is a happy and busy time for a data miner. Activities in this phase include setting up automation of your model, meeting with consumers of your model's outputs (these may be inside or outside your organization), integrating with existing management or operational information systems, feeding new learning from model use back into the model to improve its accuracy and performance (sometimes called tuning your model), and monitoring and measuring the outcomes of model use.

Be prepared for a bit of distrust of your model at first; you may even face pushback from groups who feel their jobs are threatened by this new tool or who do not trust the reliability or accuracy of the outputs. But don't let this discourage you! Remember that CBS did not trust the initial predictions of the UNIVAC, one of the first commercial computer systems, when the network used it on election night to predict the eventual outcome of the 1952 U.S. presidential election. With only 5% of the votes counted, UNIVAC predicted Dwight D. Eisenhower would defeat Adlai Stevenson in a landslide, something no pollster or election expert considered likely or even possible; most "experts" expected a narrow Stevenson victory in what everyone agreed would be a close contest. It was only late that night, when human vote counts confirmed that Eisenhower had run away with the election, that CBS went on the air to acknowledge first, that Eisenhower had won, and second, that UNIVAC had predicted this very outcome several hours earlier, but the broadcast network's leadership had refused to trust the computer's prediction. UNIVAC was further vindicated later, when its prediction was found to be within 1% of the eventual vote tally. New technology is often unsettling to people, and it is sometimes hard to trust what computers show. Be humble, patient, and specific as you explain how a new data mining model works, what the results mean, and how the results can be used. Be open to questions and constructive critiques of your work; these may actually help you to improve and further fine-tune your model.
While the UNIVAC example illustrates the power and utility of predictive computer modeling (despite inherent mistrust), it should not be construed as a reason for blind trust either. In the days of UNIVAC, the biggest problem was the newness of the technology. The machine was doing something no one really expected or could explain, and because few people understood how the computer worked, it was hard to trust it. Today we face a different but equally troubling problem: computers have become ubiquitous, and too often we don't question whether their results are accurate and meaningful. In order for data mining models to be effectively deployed, we must strike a balance. By clearly communicating a model's function and utility to stakeholders, thoroughly testing and proving the model, and then planning for and monitoring its implementation, you can introduce a data mining model effectively into the organizational flow. Failure to carefully and effectively manage deployment, however, can sink even the best and most effective models.