1.4 Data Mining Technology Stack
Now that you're familiar with some of the key terms in the field of business analytics, let's review a high-level version of the technology stack required to make data mining possible.
Let's explain this stack below starting from the bottom. NOTE: the arrows in the diagram depict the flow of data from one technology to another.
Data sources are where data comes from. Much of the data used in data mining projects comes from operational databases (which were the focus of Section 1.3). Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (a.k.a. normalized) relational database. However, there are many other sources of good data that can be used to create business intelligence to aid decision making.
Another important source of data is the research a company hires consulting firms to perform or that it performs itself. For example, a company may administer a survey on its website, or it may hire a firm to call its customers and ask them about their satisfaction with the company's products. Similarly, the company likely has supply chain partners who are incentivized to see it succeed. For example, a company's suppliers want it to keep buying parts and products from them. So, they will often offer the company access to some of their data, which may be useful. Next, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian). They offer consumer credit information for sale. The United States Postal Service sells access to address information. A company can use these data sources to fill in the gaps in its own consumer data collection.
It is not a good idea to perform complex analyses on the operational database. Operational DBs need to be fast so that the business can run quickly. They also need to be accurate. As a result, you can't have people querying an operational database for ad-hoc or non-core business process reasons and possibly corrupting the data. Therefore, we copy the data out of the operational databases. This is called the extract-transform-load (ETL) process. In the extract phase, the data is copied from the operational database(s) and other data sources. In the transform stage, the data is summarized. We do not want to load the data as-is into another database because that level of detail is not needed for analysis. For example, rather than store each individual sale, we will store daily sales by product line, salesperson, store location, and so forth. Lastly, in the load stage, the data is inserted into a new database.
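To make the transform stage concrete, here is a minimal sketch in plain Python (the sales rows and column layout are invented for illustration) of how individual transactions might be rolled up into daily summary rows before the load stage:

```python
from collections import defaultdict

# Hypothetical transaction rows extracted from the operational DB:
# (date, product_line, store, amount)
transactions = [
    ("2024-03-01", "Bikes", "Store A", 250.00),
    ("2024-03-01", "Bikes", "Store A", 300.00),
    ("2024-03-01", "Helmets", "Store B", 45.00),
    ("2024-03-02", "Bikes", "Store A", 275.00),
]

# Transform: roll individual sales up to one total per day, product line, and store
daily_sales = defaultdict(float)
for date, product_line, store, amount in transactions:
    daily_sales[(date, product_line, store)] += amount

# The load stage would insert these summary rows into the warehouse;
# here we just display them.
for key, total in sorted(daily_sales.items()):
    print(key, total)
```

Notice that the two individual "Bikes" sales on March 1 collapse into a single summary row, which is exactly the loss of detail the transform stage trades for faster analysis.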
The transform step in ETL also includes significant data cleaning. Data cleaning is the process of improving the quality of the data. For example, does the data need to be numeric versus text? Can it be summarized to improve speed? Does it need to be transformed to approximate a normal distribution? Another important concern with ETL is how often it should occur. Should you ETL every single sale as it happens in real time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends: is the value of having more up-to-date data greater than the cost of running the ETL batch more frequently? If so, then you should run it more often. In other words, that decision affects the timeliness of your data. So, let's say you decide to ETL every morning at 4 a.m., which is when your website traffic is at its lowest. In the past 24 hours, let's say you've had 2,500 new customers register. However, few of them filled out all of their profile information. Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added into your data warehouse, not your operational customer DB)? Yes, you can. That would improve the missing-values problem and possibly any accuracy problems with their contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean your data.
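As a sketch of that cleaning idea, the snippet below fills in missing ZIP codes during the transform step. The address lookup is a hypothetical stand-in for a purchased third-party database such as the USPS address data; the customer records are invented:

```python
# Hypothetical third-party address data, modeled as a simple lookup table
address_db = {"742 Evergreen Terrace": "62704"}

# Customer records copied from the operational DB; some profiles are incomplete
customers = [
    {"name": "Ann", "street": "742 Evergreen Terrace", "zip": None},
    {"name": "Bob", "street": "1 Main St", "zip": "10001"},
]

# Cleaning pass: fill a missing ZIP from the purchased data when possible,
# leaving it missing if the lookup fails
for customer in customers:
    if customer["zip"] is None:
        customer["zip"] = address_db.get(customer["street"])
```

The cleaned records go into the data warehouse only; the operational customer database is left untouched, as the paragraph above notes.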
Data lakes are emerging as both an addition and an alternative to the traditional data warehouse. A data lake is a rough or raw copy of data from all data sources in one combined location that is still distinct from the original data sources. In other words, very little, if any, transformation is required when data is copied from the original sources to a data lake.
What has led to the emergence of data lakes? The primary reason is that presentation tools have advanced enough to handle data in almost any format. Gone are the days when Excel was the only tool we could use to analyze data. For example, to create a time series forecast in Excel, the data (rows) needs to be in periodic format, meaning that each row must represent a period of time (e.g., Jan sales, Feb sales, Mar sales, Apr sales). That's not how the data is stored in relational databases; it's stored as individual transactions in each row. However, Tableau doesn't care what format the data is in. It can take that raw transaction-level format and still produce a sales forecast and visualization. So why waste the cost and time to ETL the data into periods? As a result, many organizations are opting to simply use data lakes if all of their tools can handle the raw data format, because a data lake is less costly to implement and still provides a location (i.e., an analytical database) to analyze data away from the operational database.
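The reshaping that a tool like Excel would require can be sketched in a few lines: raw transaction-level rows are grouped into one row per month. The dates and amounts here are invented for illustration:

```python
from collections import defaultdict

# Raw transaction-level rows, as stored in a relational DB or data lake
transactions = [
    ("2024-01-05", 100.0),
    ("2024-01-20", 150.0),
    ("2024-02-03", 200.0),
]

# Reshape into periodic format: one row per month with that period's total
monthly = defaultdict(float)
for date, amount in transactions:
    monthly[date[:7]] += amount   # key by "YYYY-MM"

periodic_rows = sorted(monthly.items())
# → [("2024-01", 250.0), ("2024-02", 200.0)]
```

A tool that accepts the raw rows directly lets you skip this step entirely, which is exactly the cost savings the data lake argument rests on.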
In summary, organizations are implementing their analytical databases, generally speaking, in one of three ways: 1) structured data warehouse only, 2) data lake only, or 3) an initial data lake with an ETL process to copy the data into a structured data warehouse.
The data warehouse (a.k.a. analytical database) is where we load the recently transformed data. Data warehouses are also relational databases of a sort. However, the rules of complete normalization no longer apply. Data warehouses have their own rules and schemas (e.g., the star schema) that are optimized for ad-hoc data analyses.
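To illustrate why a star schema suits ad-hoc analysis, here is a toy sketch (all table and column names are hypothetical) of a central fact table of sales measures surrounded by small dimension tables, plus a "query" that joins them to total revenue by region:

```python
# Dimension tables: small lookup tables keyed by surrogate IDs
dim_product = {1: {"line": "Bikes"}, 2: {"line": "Helmets"}}
dim_store = {10: {"region": "West"}, 11: {"region": "East"}}

# Fact table: one row per sales measurement, holding only keys and measures
fact_sales = [
    {"product_id": 1, "store_id": 10, "units": 3, "revenue": 825.0},
    {"product_id": 2, "store_id": 11, "units": 5, "revenue": 225.0},
    {"product_id": 1, "store_id": 11, "units": 2, "revenue": 550.0},
]

# Ad-hoc question: total revenue by region — a single join of fact to dimension
revenue_by_region = {}
for row in fact_sales:
    region = dim_store[row["store_id"]]["region"]
    revenue_by_region[region] = revenue_by_region.get(region, 0.0) + row["revenue"]
```

Because every analytical question is answered by joining the fact table to one or two small dimensions, queries stay simple and fast, which is the design goal a fully normalized operational schema does not share.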
Many industry experts believe the days of the data warehouse are numbered. Hardware has progressed to the point where the entire data warehouse can reside in memory (think massive amounts of RAM). This means that there is no physical model of the data warehouse; the ETL process simply loads the analytical data into the memory of servers that are always on. Essentially, the data is pushed from slow secondary storage (hard drives) to fast primary storage (RAM). On the one hand, this is more efficient because it doesn't require a second database with its own copy of the data. On the other hand, it's riskier because servers can crash and need rebooting (erasing the contents of RAM and requiring that the entire analytical database be re-created). However, that may be an acceptable risk, since analytical data is not necessary for running core business processes and today's data centers have advanced redundancies that prevent or limit catastrophic server failures.
Presentation to End User
Unless you are in the IT department, you will likely never interact directly with any of these technologies other than getting read access (via SQL SELECT statements; see the supplementary chapter on structured query language) to the databases. Rather, you will most likely use a suite of tools designed to help you analyze the data. These tools range from broad applications (e.g., Microsoft Excel) to more specific applications for data visualization (e.g., Power BI and Tableau) and data prediction (e.g., Azure Machine Learning Studio and AWS Machine Learning). Data mining refers to this process of analyzing large amounts of data (a.k.a. big data) to create new information that is useful in making unstructured business decisions and solving unstructured business problems.
Data mining analyses can be broadly categorized into the two types mentioned above that distinguish business analytics from business intelligence. That is, data analysis can take a descriptive or predictive approach. Because these applications and techniques will be the focus of this BI module, we'll describe them in more detail below.