Data Mining Technology Stack

Now that you're at least familiar with some of the key terms involved in the field of business analytics, let's review a high-level version of the technology stack required to make data mining possible. Designing, maintaining, and optimizing the data infrastructure for data collection, management, transformation, and access is the role of the data engineer or data engineering team.

Figure 1.3: Flow of Data

Let's explain this stack starting from the bottom. Note that the arrows in the diagram depict the flow of data from one technology to another.

Data Sources

Data sources are where the data come from. Much of the data used in data mining projects comes from operational databases. Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (a.k.a. normalized) relational database. However, there are many other sources of good data that can be used to create business intelligence to aid decision making.
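
To make the idea of a normalized operational database concrete, here is a minimal sketch in Python using SQLite (the table and column names are hypothetical, not from any particular system). Each customer's details are stored exactly once and referenced by key, so nothing is duplicated across the sales rows.

```python
import sqlite3

# A minimal sketch of a normalized operational schema. Customer details
# live in one table and are referenced by key from the sales table, so
# no customer information is repeated across transactions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT,
    email       TEXT
);
CREATE TABLE sale (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    product     TEXT,
    amount      REAL,
    sold_at     TEXT
);
""")
```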

Another important data source is the research data generated by consulting firms or by a company's internal research and development. For example, a company may conduct a survey on its website, or it may hire a firm to call its customers and ask them about their satisfaction with the company's products. Similarly, the company likely has supply chain partners who are incentivized to see it succeed. For example, a company's suppliers want the company to keep buying parts and products from them. So, they will often offer the company access to some of their data, which may be useful. Next, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian). They offer consumer credit information for sale. The United States Postal Service sells access to address information. A company can use these data sources to fill in the gaps in its own consumer data collection.

ETL

It is not a good idea to perform complex analyses on the operational database. Operational databases need to be fast so that the business can run quickly. They also need to be accurate. As a result, you can't have people querying an operational database for ad-hoc or non-core business process reasons and possibly messing up the data. Therefore, we copy the data from the operational databases. This is called the extract-transform-load (ETL) process. In the extract phase, the data are copied from the operational database(s) and other data sources. In the transform stage, some portion of that data is summarized in cases where the transaction level of detail is not needed for analysis. For example, rather than store each individual sale, the data may be summarized to daily sales by product line, salesperson, store location, and so forth. Lastly, in the load stage, the data are inserted into a new database.
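
Here is a simplified sketch of that ETL flow, assuming hypothetical database files and column names. It extracts transaction-level rows from the operational database, summarizes them to daily sales in the same query, and loads the results into a separate analytical database.

```python
import sqlite3

# Extract: read transaction-level rows from the operational database,
# summarizing them (Transform) to daily sales by product line,
# salesperson, and store in the same query.
ops = sqlite3.connect("operational.db")
rows = ops.execute("""
    SELECT date(sold_at) AS sale_date, product_line, salesperson, store,
           SUM(amount)   AS daily_sales
    FROM sale
    GROUP BY sale_date, product_line, salesperson, store
""").fetchall()

# Load: insert the summarized rows into the analytical database.
dw = sqlite3.connect("warehouse.db")
dw.execute("""CREATE TABLE IF NOT EXISTS daily_sales
              (sale_date TEXT, product_line TEXT, salesperson TEXT,
               store TEXT, daily_sales REAL)""")
dw.executemany("INSERT INTO daily_sales VALUES (?, ?, ?, ?, ?)", rows)
dw.commit()
```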

The transform step in ETL also includes significant data cleaning. Data cleaning is the process of improving the quality of the data. For example, does the data need to be numeric rather than text? Can the data be summarized to improve speed? Does the data need to be transformed to better approximate a normal distribution?

Another important concern with ETL is how often it should occur. Should you ETL every single sale as it happens in real time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends. Is the value of having more current data greater than the cost of running ETL batches more frequently? If so, then you should probably run them more often. In other words, that decision affects the timeliness of your data. So, let's say you decide to ETL every morning at 4 a.m., which is when your website traffic is at its lowest. In the past 24 hours, let's say you've had 2,500 new customers register, but few of them filled out all of their profile information. Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added to your data warehouse, not your operational customer DB)? Yes, you can. That would reduce the missing values problem and possibly fix accuracy problems with their contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean your data.
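
Here is one hedged sketch of what such a cleaning-and-enrichment step might look like. The lookup_address function is a hypothetical stand-in for a paid external service (such as a USPS address API), not a real library call.

```python
# A sketch of one cleaning step: filling in missing customer addresses
# during the transform stage. `lookup_address` is a placeholder for a
# paid third-party lookup service; a real API and its fields would differ.
def lookup_address(name: str, email: str) -> str | None:
    ...  # call the external address service here; return None if no match

def fill_missing_addresses(customers: list[dict]) -> list[dict]:
    for c in customers:
        if not c.get("address"):                 # missing value detected
            found = lookup_address(c["name"], c["email"])
            if found:
                c["address"] = found             # enrich the warehouse copy
    return customers
```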

Data Lakes

Data lakes are emerging as both an addition and an alternative to the traditional data warehouse. A data lake is a rough or raw copy of data from all data sources in one combined location that is still distinct from the original data sources. In other words, very little, if any, transformation is required when data is copied from the original sources to a data lake.
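
A data lake load can therefore be as simple as copying source extracts, unchanged, into one shared location. The sketch below assumes hypothetical file paths and CSV extracts.

```python
import shutil
from pathlib import Path

# A minimal data lake load: source extracts are copied as-is, with no
# summarizing or reshaping, into one combined location that is separate
# from the original operational systems.
LAKE = Path("/data/lake/raw")
LAKE.mkdir(parents=True, exist_ok=True)
for source_file in Path("/data/exports").glob("*.csv"):
    shutil.copy2(source_file, LAKE / source_file.name)  # raw copy, no transform
```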

What has led to the emergence of data lakes? The primary reason for their development is that presentation tools have advanced enough to handle data in almost any format. Gone are the days when Excel was the only tool we could use to analyze data. For example, to create a time series forecast in Excel, the data (rows) need to be in periodic format, meaning that each row represents a period of time (e.g., Jan sales, Feb sales, Mar sales, Apr sales). That's not how the data are stored in relational databases. Data are stored as individual transactions in each row. However, Tableau doesn't care what format the data takes. Tableau can take that raw transaction-level format and still produce a sales forecast and visualization. Therefore, why incur the cost and time of ETLing the data into periods? As a result, many organizations are opting to simply use data lakes if all of their tools can handle the raw data format, because they are less costly to implement and still provide a location (i.e., an analytical database) to keep data meant for analysis that is separate from the operational database.
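
For illustration, here is a sketch of the reshaping step that Excel would require (column names are hypothetical). Transaction-level rows are rolled up into one row per period, which is exactly the work a tool like Tableau can skip by ingesting the raw transactions directly.

```python
import pandas as pd

# Roll transaction-level rows up into one row per month (periodic format).
tx = pd.DataFrame({
    "sold_at": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-11"]),
    "amount":  [120.0, 80.0, 200.0],
})
monthly = tx.resample("MS", on="sold_at")["amount"].sum()
print(monthly)   # one row per period: Jan 200.0, Feb 200.0
```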

In summary, organizations are implementing their analytical databases, generally speaking, in one of three ways: (1) structured data warehouse only, (2) data lake only, or (3) initial data lake with an ETL process to copy the data into a structured data warehouse.

Data Warehouse

The data warehouse (a.k.a. analytical database) is where we load the recently transformed data. Data warehouses are also relational databases of a sort. However, the same rules of normalization (complete normalization) no longer apply. Data warehouses have their own rules and schemas (e.g., the star schema) that are optimized for ad-hoc data analyses.
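
As an illustration, here is a minimal star-schema sketch (all names are hypothetical): a central fact table of measures surrounded by descriptive dimension tables that analysts can slice and aggregate by. Compare this to the fully normalized operational schema shown earlier.

```python
import sqlite3

# A minimal star schema: one fact table of measures (units_sold, revenue)
# joined by keys to dimension tables describing date, product, and store.
dw = sqlite3.connect("warehouse.db")
dw.executescript("""
CREATE TABLE IF NOT EXISTS dim_date    (date_key INTEGER PRIMARY KEY,
                                        day TEXT, month TEXT, year INTEGER);
CREATE TABLE IF NOT EXISTS dim_product (product_key INTEGER PRIMARY KEY,
                                        product TEXT, product_line TEXT);
CREATE TABLE IF NOT EXISTS dim_store   (store_key INTEGER PRIMARY KEY,
                                        store TEXT, region TEXT);
CREATE TABLE IF NOT EXISTS fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    store_key   INTEGER REFERENCES dim_store(store_key),
    units_sold  INTEGER,
    revenue     REAL
);
""")
```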

Many industry experts believe the days of the data warehouse are numbered. Hardware has progressed to the point where the entire data warehouse can exist in memory (think massive amounts of RAM). This means that there is no physical model of the data warehouse. The ETL process simply loads the analytical data into the memory of servers that are always on. Essentially, the data is pushed from slow secondary storage (hard drives) to fast primary storage (RAM). On the one hand, this is more efficient because it doesn't require a second database with its own copy of the data. On the other hand, it's riskier because servers can crash and need rebooting (thus erasing the temporary RAM and requiring that the entire analytical database be re-created). However, that may be an acceptable risk since analytical data is not necessary for running core business processes, and today's data centers have advanced redundancies that prevent or limit catastrophic server failures.

Presentation to End User

Unless you are in the IT department, you will likely never interact directly with any of those technologies other than getting read access (via SQL SELECT statements; see the supplementary chapter on structured query language) to the databases. Rather, you will most likely use a suite of tools designed to help you analyze the data. These tools can range from broad applications (Microsoft Excel) to more specific applications for data visualization (Power BI and Tableau) and data prediction (Microsoft ML Studio and AWS Machine Learning). Data mining refers to this process of analyzing large amounts of data (a.k.a. big data) to create new information that is useful in making unstructured business decisions and solving unstructured business problems.
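
For example, a read-only analyst query against the hypothetical star schema sketched earlier might look like this:

```python
import sqlite3

# Read-only access: a SQL SELECT that totals revenue by product line
# from the analytical database (table names from the earlier sketch).
dw = sqlite3.connect("warehouse.db")
for row in dw.execute("""
    SELECT p.product_line, SUM(f.revenue) AS total_revenue
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.product_line
    ORDER BY total_revenue DESC
"""):
    print(row)
```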

Data mining analyses can be broadly categorized into the two types mentioned above that distinguish business analytics from business intelligence. That is, data analysis can take a descriptive or predictive approach. Because these applications and techniques will be the focus of this BI module, we will describe them in more detail below.