Introduction

Despite increasing mentions in news media and other venues over the past two decades, data mining activity continues to be largely invisible to the world. Most of the time, we never even notice that it's happening unless something goes wrong. For example, in March of 2018, news broke worldwide about a massive-scale data analytics project involving a company called Cambridge Analytica and its use of tens of millions of data records it acquired through Facebook. Suddenly, people were aware of data mining, or at least of one specific project that used it, and some people became quite passionate about it. Yet Cambridge Analytica, which filed for bankruptcy in May of 2018 in the wake of its data-related scandal, is hardly alone in using data and analytics. Most companies now use some form of data mining, whether to better understand customers, to improve sales and marketing, or to measure their own performance internally.

Whenever we sign up for a grocery store shopping card, make a purchase using a credit or debit card, respond to a survey, play a game or use an app on a smartphone, or just generally peruse the World Wide Web, we are creating data. Have you ever stopped to think about why companies like Google, Facebook, and Twitter offer so many services for free, and how they can afford to do so? In exchange for not having to pay, we give those organizations the ability to collect data on our behaviors. Most of our online and app-based browsing, shopping, and social media posting behaviors are now recorded as data. These data are stored in large sets on powerful computers owned by the companies we deal with every day. Lying within those data sets are patterns: indicators of our interests, our habits, and our preferences and tendencies. Data mining allows people to locate and interpret those patterns, helping them make better-informed decisions and better serve their customers. Those may seem like wonderful benefits, and they are, but there are also concerns about the practice of data mining. Privacy monitoring groups are particularly vocal about organizations that amass vast quantities of data, some of which can be very personal in nature. Although ethics is not the primary focus of this book, we dedicate a chapter at the end to the topic, and it is important for you to be cognizant of the privacy concerns that data mining may raise. You should ensure that you conduct your own data mining activities in ways that respect people more than process and protect the rights of individuals represented in your data sets.

The intent of this book is to introduce you to concepts and practices common in data mining. It is intended primarily for students and business professionals who are interested in using information systems and technologies to solve organizational problems by mining data, but who may not have a formal background or education in computer science. Although data mining is the fusion of applied statistics, logic, artificial intelligence, machine learning, and data management systems, you are not required to have a strong background in these fields to use this book. Introductory college-level courses in statistics and databases would be helpful, but are not absolutely necessary. Care has been taken to explain the concepts and techniques you will need in order to learn how to mine data.

Each chapter in this book explains a data mining concept, a technique, or a group of techniques. You should understand that the book is not designed to be an instruction manual or tutorial for the tools we will use (primarily RapidMiner and R). These software packages are capable of many types of data analysis, and this text is not intended to cover all of their capabilities. Rather, it is intended to illustrate how these tools can be used to perform certain kinds of data manipulation and mining. The book is also not exhaustive: it includes a variety of common data mining techniques, but RapidMiner and R are capable of many data mining tasks that are not covered here.

The chapters all follow a common format. First, each chapter presents a scenario referred to as Context and Perspective. This section will help you gain a real-world idea of a certain kind of problem that data mining can help solve. It is intended to help you think of ways that the data mining technique in that chapter can be applied to organizational problems you might face. Following Context and Perspective, a set of Learning Objectives is offered. Each chapter is designed to teach you something new about data mining, and because the objectives are listed at the beginning of the chapter, you will have a better idea of what you should expect to learn by reading it. Following the Learning Objectives, each chapter's step-by-step example will enable you to work through an actual data mining task. For most chapters, these examples follow the phases of a data mining methodology called CRISP-DM, which will be explained shortly. Finally, after the main concepts have been delivered, each chapter concludes with a Chapter Summary, a set of Review Questions to help reinforce the main points of the chapter, and one or more Exercises to allow you to try your hand at applying what was taught in the chapter.