Terminology

Let's begin by defining some key terms and concepts that you may have learned. We will then use them to create an outline of this course. If you read and watched the introduction (see 0.2) of this book, then you're already familiar with the concepts of a core data scientist and an applied data analyst. However, you may not be as familiar with the concepts of data mining, data analytics, artificial intelligence, business intelligence, and machine learning. Let's begin with three of those terms. Refer to the image below:

Data mining (DM) is a process. As described in the image, it includes the collection, aggregation, and (often) the visualization of large amounts of data. Data mining is an all-encompassing topic because most other data-based topics fit within the larger concept of data mining, from exploring for trends to predicting a very specific outcome. The term "business intelligence" is often used synonymously with DM because it refers to the descriptive analysis of large datasets. Common off-the-shelf tools for DM include Tableau, Microsoft Power BI, and others. More customized solutions that require integration projects include Adobe Analytics and DOMO (a local start-up). Other technologies for collecting data (e.g. SQL), cleaning it, and aggregating it (e.g. Hadoop, MapReduce) are a necessary precursor to the more complex data tools. These technologies are discussed in the next chapter.

Artificial intelligence (AI) refers to programming logic—not the hardware itself, but the software within—that replicates human logic, reasoning, and decision-making. Typically, this begins with some data mining that helps us understand human behavior. But creating the AI program doesn't always require that prior data be stored, drawn from, or updated. Sometimes AI is based on an optimization formula with an objective and constraints (like the Solver add-in in Microsoft Excel). That is why part of the oval above representing AI lies outside of the data mining concept.
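To make the "objective and constraints" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the products, the profit figures, and the resource limits are invented, and the exhaustive search stands in for the more sophisticated methods a tool like Solver would use. Notice that no historical data is stored or consulted.

```python
# Hypothetical products: each chair earns 20, each table earns 30.
PROFIT = {"chairs": 20, "tables": 30}

def is_feasible(chairs: int, tables: int) -> bool:
    """Constraints: 40 labor hours and 15 units of wood are available."""
    labor_ok = 2 * chairs + 4 * tables <= 40
    wood_ok = chairs + tables <= 15
    return labor_ok and wood_ok

def profit(chairs: int, tables: int) -> int:
    """Objective: the total profit we want to maximize."""
    return PROFIT["chairs"] * chairs + PROFIT["tables"] * tables

# Try every small whole-number plan and keep the best feasible one.
candidates = [(c, t) for c in range(16) for t in range(11) if is_feasible(c, t)]
best_plan = max(candidates, key=lambda plan: profit(*plan))
print(best_plan, profit(*best_plan))
```

The program reaches a decision (how many of each product to make) purely from rules and arithmetic, which is why this kind of AI can sit outside of data mining.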

Machine learning (ML) is a subset of both DM and AI because it relies on a dataset to come to a reasonable decision. However, ML is a unique subset of AI because of how it comes to that decision. In particular, ML uses prior data to "train" the weights assigned to future data inputs, which are then combined into a score to predict a more customized outcome. The term "data analytics" is often associated with machine learning because it implies taking data beyond description alone into more predictive capabilities. ML is a form of data analytics where the predictive output is both integrated into an information system and updated automatically over time with new data. Common tools for ML, and for implementing ML in live apps, include cloud-based tools like Amazon Web Services Machine Learning and Azure Machine Learning Studio. ML can also be "direct coded" in languages like Python and R.

Let's put these concepts together in an example. Let's say that a company wants to understand who is purchasing their products and begin predicting which products customers will want. This process begins by storing all of their data in a centralized location. The company then builds or purchases a tool that visualizes the data and generates summary statistics. Through trends in the bar charts and scatterplots, the company begins to notice that customers' gender and age seem to relate to purchase decisions. At this point, the company is performing data mining only. Next, the company decides to implement a dynamic home page on their website that displays a different set of product specials or sales when the visitor's age and gender are known. This is similar to a sales associate directing a walk-in customer to a particular section of the store. At this point, the company is implementing rudimentary AI. Eventually, the company creates a predictive model that takes into account a customer's past purchase history and generates predictive weights that indicate how important each prior purchase is toward predicting their next purchase. Now, when the customer opens the company's web page, they are given product suggestions that are unique not only to their gender and age, but also to their purchase history. Furthermore, these predictive weights are updated after each purchase decision that every customer makes. At this point, the company is implementing machine learning.
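The "rudimentary AI" stage of this example can be sketched as a handful of fixed rules. The function name, the age cutoff, and the section labels below are all hypothetical. The point is that a person wrote the rules after looking at the charts, and nothing in the code changes as new purchases arrive.

```python
from typing import Optional

def pick_specials(age: Optional[int], gender: Optional[str]) -> str:
    """Choose a home-page section using fixed, hand-written rules."""
    if age is None or gender is None:
        # Unknown visitor: fall back to the same page everyone sees.
        return "general-specials"
    age_band = "under-35" if age < 35 else "35-plus"
    return f"{gender}-{age_band}-specials"

print(pick_specials(28, "female"))
print(pick_specials(None, None))
```

Moving from this stage to machine learning means replacing the hand-written rules with weights that are estimated from purchase data and updated automatically.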

Data Analytics versus Machine Learning

Where does data analytics fit with the concepts of DM, AI, and ML? Data analytics, like machine learning, is the process of applying predictive algorithms to the set of variables representing the "causes" (e.g. gender, age, purchase history) of some "effect" (e.g. purchase decisions). These analytics also generate a set of weights that indicate how effective each "cause" variable is in explaining the effect (e.g. gender = 0.10, age = 0.12, purchase history = 0.67). The term "business analytics" simply refers to data analytics in the business context. As opposed to machine learning, however, the predictive analyses performed as part of data analytics are often initially performed "offline," meaning that data is exported from a live data source and used to generate predictive analyses like those referred to in the prior section. At some point, that data becomes stale. In other words, as time passes from when the data was originally exported, the data becomes a poorer representation of how current customers are behaving. This is important because consumer preferences change as products, services, regulations, trends, and even cultures change.

Why would anyone export data (thus making it "offline") to perform analyses? Basically, initial data analysis is often exploratory, meaning that we are exploring the causes of a desired effect. For example, when a consumer clicks on a product on a website, the website will often try to improve its sales by showing related products that the consumer could purchase. However, there are many statistical algorithms that can be used to perform this type of product recommendation. Which one is best? We won't know for sure until we analyze sales data using each candidate algorithm. Once we determine the best prediction formula, we can implement the formula in our website code. Imagine that Amazon is performing this analysis; they complete 1.6 million consumer orders per day. That kind of volume requires a lot of computing power to analyze. Instead of analyzing live data, it makes more sense to export a smaller random sample of data for offline analysis.
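Here is a rough sketch of that offline workflow in Python. The order log is simulated, and the two candidate rules ("same_as_viewed" and "always_tape") are invented stand-ins for real recommendation algorithms. What matters is the shape of the work: draw a random sample, score every candidate on it, and keep the winner.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable
PRODUCTS = ["boxes", "tape", "labels", "markers"]

def simulate_order():
    """Made-up behavior: half of all follow-up purchases are tape."""
    viewed = rng.choice(PRODUCTS)
    bought = "tape" if rng.random() < 0.5 else rng.choice(PRODUCTS)
    return viewed, bought

# Stand-in for the live data source.
live_orders = [simulate_order() for _ in range(100_000)]

# Export a small random sample instead of analyzing every live order.
offline_sample = rng.sample(live_orders, k=1_000)

# Two hypothetical candidate recommendation rules.
candidates = {
    "same_as_viewed": lambda viewed: viewed,
    "always_tape": lambda viewed: "tape",
}

def hit_rate(recommend, orders):
    """Share of orders where the recommendation matched the purchase."""
    hits = sum(1 for viewed, bought in orders if recommend(viewed) == bought)
    return hits / len(orders)

scores = {name: hit_rate(rule, offline_sample) for name, rule in candidates.items()}
best_rule = max(scores, key=scores.get)
print(scores, best_rule)
```

Scoring 1,000 sampled orders takes a fraction of the computing power needed for the full log, and with a truly random sample the ranking of the candidates should usually be the same.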

Once that best analysis is determined, it's time to implement actual machine learning. Machine learning is accomplished when an analytical model is integrated into an information system and continually (and automatically) "retrained" with the latest consumer data and behaviors. When that happens, the machine "learns" the new behaviors of consumers automatically as those behaviors are captured in an information system. See the image below:

Figure 1.2: Machine Learning Process
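The idea of continual, automatic retraining can be illustrated with a toy model that adjusts its weights after every new observation. This is only a sketch under simplifying assumptions: the class name, the learning rate, and the data streams are invented, and real systems often retrain in scheduled batches rather than one record at a time.

```python
class OnlineLinearModel:
    """A linear scorer whose weights are nudged after every new observation."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.learning_rate = learning_rate

    def predict(self, features):
        return sum(w * x for w, x in zip(self.weights, features))

    def update(self, features, actual):
        """Take one gradient step that shrinks the squared prediction error."""
        error = self.predict(features) - actual
        self.weights = [
            w - self.learning_rate * error * x
            for w, x in zip(self.weights, features)
        ]

def replay(model, observations, repeats=200):
    """Feed (features, outcome) pairs to the model one at a time."""
    for _ in range(repeats):
        for features, actual in observations:
            model.update(features, actual)

model = OnlineLinearModel(n_features=2)

# Made-up early behavior: outcome = 2 * x1 + 1 * x2
replay(model, [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)])
weights_before = list(model.weights)

# Behavior changes: outcome = 1 * x1 + 2 * x2. The same model keeps updating.
replay(model, [([1.0, 0.0], 1.0), ([0.0, 1.0], 2.0), ([1.0, 1.0], 3.0)])
weights_after = list(model.weights)

print([round(w, 2) for w in weights_before], [round(w, 2) for w in weights_after])
```

No one edits the model by hand between the two phases. The weights follow the new behavior simply because new observations keep flowing in, which is the sense in which the machine "learns."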

We will dive into greater detail on this process in later chapters. For now, the figure visually outlines the steps that data scientists go through to generate machine learning pipelines. It begins by identifying a relevant business problem or opportunity to address with data. Then, we proceed by extracting the data we need to generate some type of machine learning feedback or prediction. Next, we clean and prepare the data to get it into an ideal format for fast and efficient statistical processing. Then, we segregate the data into records used for "training" a predictive model versus those used to test the model. Next, we apply a statistical formula to generate a set of weights for each type of data indicating how important each is in predicting some outcome. You might think of this in terms of the classic function from high school algebra: f(x) = m1x1 + m2x2 + m3x3 + ... + mnxn + b. Then, we evaluate the quality of this predictive model and iteratively refine it until we get the best predictive accuracy possible. Once the best model is determined, we deploy the prediction in some way (e.g. through a website or mobile app). Finally, we observe the results over time to determine how accurate our predictions are. Eventually, our models "drift," meaning performance may get worse, and we start the process again by importing new data. The visualization below indicates the consumer's perspective of the ML process, i.e. that of the person who receives the prediction and uses it to make a decision:

Figure 1.3: Consumer perspective of Machine Learning
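Before walking through the consumer's view, it may help to see the split, train, and evaluate steps of the machine learning process in miniature. The sketch below uses made-up data and a single input, so the algebra function reduces to f(x) = mx + b. The helper names and the choice of ordinary least squares are assumptions made for illustration; later chapters cover the real tools.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for one input: returns slope m and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

def mean_abs_error(m, b, xs, ys):
    """Average distance between predictions and actual outcomes."""
    return sum(abs((m * x + b) - y) for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up, already-cleaned data: y is roughly 3x + 5 plus a little noise.
rows = [(x, 3 * x + 5 + rng.uniform(-1, 1)) for x in range(50)]

# Segregate the data: 40 rows for training, 10 held back for testing.
rng.shuffle(rows)
train, test = rows[:40], rows[40:]

# Train: estimate the weight (m) and the intercept (b) from the training rows.
m, b = fit_line([x for x, _ in train], [y for _, y in train])

# Evaluate: measure the error on rows the model has never seen.
error = mean_abs_error(m, b, [x for x, _ in test], [y for _, y in test])
print(round(m, 2), round(b, 2), round(error, 2))
```

The estimated m and b should land close to the 3 and 5 used to generate the data. The test error is the number we would watch over time to detect drift.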

To understand the consumer perspective, consider the Amazon case again where a consumer is looking to buy some moving boxes: 1) a consumer, "Homer," visits Amazon.com in search of boxes, 2) Amazon recommends alternative boxes that are often purchased by other customers viewing the box he is currently viewing, 3) Homer makes the decision to purchase a particular box, which is then recorded in Amazon's operational database, 4) this data is analyzed/processed using a predictive statistical algorithm (referred to as "modeling"), 5) the results of this statistical model are then used to generate new recommendations based on Homer's last decision, and 6) these recommendations are given back to Homer and other customers when they come back to Amazon.
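This feedback loop can be sketched with simple counting. The class below is a hypothetical illustration and not a description of how Amazon's recommendations actually work: it records which item was bought after which item was viewed, and recommends the most frequent ones.

```python
from collections import Counter, defaultdict

class CoPurchaseRecommender:
    """Recommend the items most often bought by people viewing a given item."""

    def __init__(self):
        # Maps a viewed item to a tally of what was eventually purchased.
        self.bought_after_viewing = defaultdict(Counter)

    def record_purchase(self, viewed, bought):
        """Store one decision, as the operational database would."""
        self.bought_after_viewing[viewed][bought] += 1

    def recommend(self, viewed, k=2):
        """Return up to k items, most frequently purchased first."""
        tally = self.bought_after_viewing[viewed]
        return [item for item, _ in tally.most_common(k)]

recommender = CoPurchaseRecommender()

# Earlier customers who viewed the small box.
recommender.record_purchase("small box", "medium box")
recommender.record_purchase("small box", "medium box")
recommender.record_purchase("small box", "large box")
before = recommender.recommend("small box")

# Homer views the small box but buys the large box, twice.
recommender.record_purchase("small box", "large box")
recommender.record_purchase("small box", "large box")
after = recommender.recommend("small box")

print(before, after)
```

Homer's purchases change what the next visitor sees: the large box moves ahead of the medium box once it has been bought more often.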

Descriptive versus Predictive

Another way to understand the progression from data mining to machine learning is by determining whether the analyses performed are descriptive or predictive. Descriptive techniques are the procedures designed for reporting, analyzing, and monitoring data in ways that describe the past and immediate present state of the business processes that the data are produced from. Predictive techniques are the procedures designed to predict the most likely future outcomes including performance, states, preferences, and much more based on historical data.
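The difference shows up even on a tiny dataset. In the sketch below, the monthly sales figures are made up, and the forecast uses a deliberately naive rule (last value plus the average month-over-month change) chosen only to contrast with the descriptive summaries.

```python
import statistics

monthly_sales = [100, 110, 125, 130, 150]  # made-up history

# Descriptive: summarize what has already happened.
average_sales = statistics.mean(monthly_sales)
best_month = max(monthly_sales)

# Predictive: estimate what is most likely to happen next.
changes = [later - earlier for earlier, later in zip(monthly_sales, monthly_sales[1:])]
forecast_next_month = monthly_sales[-1] + statistics.mean(changes)

print(average_sales, best_month, forecast_next_month)
```

The first two numbers can be checked against the records, because they describe the past. The forecast can only be judged once next month's sales arrive.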

As indicated in the figure above, the methods of analysis get simultaneously more valuable and more complicated as they progress toward predictive techniques. We will touch on DM in this course, but our main focus will be on ML, which means that we want to automate the data analytics process rather than simply perform the relevant analyses.