1.3 Data Mining Technology Stack
Now that you’re at least familiar with some of the key terms involved in the field of business analytics, let’s review a high-level version of the technology stack required to make data mining possible. Designing, maintaining, and optimizing the data infrastructure stack for data collection, management, transformation, and access is the role of the data engineer or data engineering team.
Let’s explain this stack starting from the bottom.
Data Sources
Data sources are where data come from. Much of the data used in data mining projects come from the operational databases that were the focus of the prior module. Operational databases keep track of every piece of data necessary for our business processes in a non-redundant (or normalized) relational database. However, there are many other sources of good data that can be used to create business intelligence to aid decision-making.
Another important source of data is the research performed by a company or its hired consulting firm. For example, we may administer a survey on our company website; or, we may hire a firm to call our customers and ask them about their satisfaction with our products. Similarly, we likely have supply chain partners who are invested in seeing us succeed. For example, our suppliers want us to keep buying parts and products from them. So, they will often offer us access to some of their data, which may be useful. Next, it is very common to purchase access to third-party databases. Consider the major credit bureaus (Equifax, TransUnion, and Experian). These bureaus offer consumer credit information for sale. The United States Postal Service sells access to address information. A company can use these data sources to supplement its own consumer data collection.
However, keep in mind that there are many valuable sources of data that are either (1) not legally accessible or (2) not ethically acceptable to use. For example, many companies are able to scrape consumer data without the user's knowledge or consent. Some countries have laws against these practices (see the EU Privacy Regulation), whereas others either don't regulate them or don't enforce the regulations they have. Collecting data this way may be legal in those places, but it is not ethical. Similarly, some companies are quite adept at collecting information about their competitors (beyond the publicly available information on a competitor's website). Hacking a competitor's system for data is clearly illegal, but there are also legal, yet unethical, methods of collecting competitor data.
ETL
It is not a good idea to perform complex analyses on the operational database. Operational databases need to be fast and accurate so that the business can run smoothly. So, you can't have people querying an operational database for ad hoc purposes or reasons not related to core business processes and possibly corrupting the data. Therefore, we need to copy the data from the operational databases. This is called the extract-transform-load (ETL) process. Extract means that the data are copied from the operational database(s) and other data sources. Transform means that the data are not transferred as-is into the new database; they are modified along the way. Because we don't need the same level of detail for our analyses, we will typically summarize the data. For example, rather than store each individual sale, we will store the daily sales by product line, salesperson, store location, and so on. Lastly, load means that the data are then inserted into a new database.
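To make the three steps concrete, here is a minimal sketch of an ETL job in Python using pandas and SQLite; the database files, table names, and column names are hypothetical stand-ins for whatever your operational systems actually use.

```python
# A minimal ETL sketch (database files, tables, and columns are hypothetical).
import sqlite3
import pandas as pd

# Extract: copy the raw sales rows out of the operational database.
with sqlite3.connect("operational.db") as op_conn:
    sales = pd.read_sql_query(
        "SELECT sale_date, product_line, store_id, amount FROM sales", op_conn
    )

# Transform: summarize individual sales into daily totals per product line and store.
daily_sales = (
    sales.groupby(["sale_date", "product_line", "store_id"], as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "daily_sales"})
)

# Load: insert the summarized rows into the analytical database (data warehouse).
with sqlite3.connect("warehouse.db") as dw_conn:
    daily_sales.to_sql("daily_sales_fact", dw_conn, if_exists="append", index=False)
```

In practice, a job scheduler would run a script like this on whatever cadence the business decides is worth the computing cost, a decision discussed next.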
The transform step in ETL also includes significant data cleaning. Data cleaning is the process of improving the quality of the data. For example, do the data need to be converted from text to numeric? Can the data be summarized to improve the speed of the database? Do the data need to be transformed to better approximate a normal distribution? Another important concern with ETL is how often it should occur. Should you ETL every single sale as it happens in real time? Probably not, because that would require significant extra computing power. So, should you ETL once a month? Well, it depends. Is the value of having more current data greater than the cost of running the ETL batch more frequently? If so, then you should probably run it more often. In other words, that decision affects the timeliness of your data. So, let's say you decide to ETL every morning at 4 a.m., which is when your website traffic is at its lowest. In the past twenty-four hours, let's say you had 2,500 new customers register. However, few of them filled out all of their profile information. Can you design your ETL process to query the United States Postal Service database and fill in any missing data (to be added into your data warehouse, not your operational customer database, which the customers can see)? Yes, you can. That would reduce the missing-values problem and possibly correct any accuracy problems in the new customers' contact information. In fact, there are many other databases you can pay to access and program into your ETL process to improve or clean the data.
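As an illustration, the sketch below shows what transform-step cleaning and enrichment might look like in pandas. The column names are invented, and lookup_postal_code is a hypothetical placeholder for a call to a paid address database, not any real API.

```python
# A sketch of data cleaning during the transform step (column names and the
# address-lookup function are hypothetical stand-ins).
import pandas as pd

def lookup_postal_code(street, city, state):
    """Placeholder for a call to a paid address-data service; returns nothing here."""
    return None

def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    cleaned = raw.copy()

    # Make sure numeric fields really are numeric (bad values become NaN for review).
    cleaned["annual_spend"] = pd.to_numeric(cleaned["annual_spend"], errors="coerce")

    # Standardize text fields so grouping and matching behave consistently.
    cleaned["state"] = cleaned["state"].str.strip().str.upper()

    # Enrich: fill in missing postal codes from a third-party source.
    missing = cleaned["postal_code"].isna()
    cleaned.loc[missing, "postal_code"] = cleaned[missing].apply(
        lambda row: lookup_postal_code(row["street"], row["city"], row["state"]), axis=1
    )
    return cleaned
```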
Data Warehouse
The data warehouse (or analytical database) is where we load the recently transformed data. Data warehouses are also relational databases of a sort. However, the same rules of normalization (complete normalization) no longer apply. Data warehouses have their own rules and schemas (e.g., the star schema) that are optimized for ad hoc data analyses.
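To see why the star schema suits ad hoc analysis, here is a toy example in pandas: one fact table of daily sales surrounded by small dimension tables, joined and summarized to answer a question. All of the table contents are made up for illustration.

```python
# A toy star schema: a fact table joined to dimension tables for an ad hoc question.
import pandas as pd

sales_fact = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "product_key": [1, 2, 1],
    "store_key": [10, 10, 11],
    "daily_sales": [1250.00, 980.50, 1100.25],
})

product_dim = pd.DataFrame({
    "product_key": [1, 2],
    "product_line": ["Bikes", "Accessories"],
})

store_dim = pd.DataFrame({
    "store_key": [10, 11],
    "region": ["West", "East"],
})

# Ad hoc question: what are total sales by product line and region?
report = (
    sales_fact.merge(product_dim, on="product_key")
    .merge(store_dim, on="store_key")
    .groupby(["product_line", "region"], as_index=False)["daily_sales"]
    .sum()
)
print(report)
```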
Many industry experts believe the days of the data warehouse are numbered. Hardware has progressed to the point where the entire data warehouse can exist in memory (think massive amounts of RAM). This means that there is no separate physical copy of the data warehouse on disk. The ETL process simply loads the analytical data into the memory of servers that are always on. Essentially, this means that the data are being moved from slow secondary storage (hard drives) to fast primary storage (RAM). On the one hand, this is more efficient because it doesn't require a second database with a copy of the data. On the other hand, it's riskier because servers can crash and need rebooting (thus erasing the temporary contents of RAM and requiring that the entire analytical database be recreated). However, that may be an acceptable risk since analytical data are not necessary for running core business processes, and today's data centers have advanced redundancies that prevent or limit catastrophic server failures.
Data Lakes
Data lakes are emerging as both an addition and an alternative to the traditional data warehouse. A data lake is a rough or raw copy of data from all data sources in one combined location that is still distinct from the original data sources. In other words, very little, if any, transformation (i.e., the "T" in ETL) is performed when data are copied from the original sources to a data lake.
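Here is a minimal sketch of what "loading" a data lake can look like, assuming the sources are simple export files: the raw files are copied as-is into a dated folder, with no cleaning or summarizing. The file paths are hypothetical.

```python
# Landing raw files in a data lake: copy source files as-is, no transformation.
import shutil
from datetime import date
from pathlib import Path

sources = [Path("exports/crm_customers.csv"), Path("exports/web_clickstream.json")]
lake_root = Path("datalake/raw")

# One folder per load date keeps each raw snapshot separate.
landing_zone = lake_root / date.today().isoformat()
landing_zone.mkdir(parents=True, exist_ok=True)

for src in sources:
    # No cleaning, no summarizing: the lake keeps the raw copy.
    shutil.copy2(src, landing_zone / src.name)
```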
What has led to the emergence of data lakes? Primarily, the presentation tools have advanced significantly and can now handle data in almost any format. Gone are the days when Excel was the only tool we could use to analyze data. For example, to create a time series forecast in Excel, the data (rows) need to be in periodic format, meaning that each row must represent a period of time (e.g., Jan sales, Feb sales, Mar sales, Apr sales). That's not how the data are stored in relational databases, where each row is an individual transaction. However, Tableau doesn't care what format the data are in. It can take that raw transaction-level format and still produce a sales forecast and visualization. So, why waste the time to ETL the data into periods? Many organizations are opting to use data lakes if all of their tools can handle the raw data format because (1) a data lake is less costly to implement and (2) it still provides a separate location (i.e., an analytical database) that keeps analytical work away from the operational database.
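The sketch below mimics what a tool like Tableau does behind the scenes: it takes invented transaction-level rows, rolls them up into monthly periods on the fly, and extends a simple linear trend one period ahead. It illustrates the idea only; it is not the forecasting method any particular tool actually uses.

```python
# From raw transactions to a periodic forecast (all transaction data invented).
import numpy as np
import pandas as pd

transactions = pd.DataFrame({
    "sale_ts": pd.to_datetime([
        "2024-01-03", "2024-01-20", "2024-02-11", "2024-02-25",
        "2024-03-05", "2024-03-28", "2024-04-14", "2024-04-30",
    ]),
    "amount": [200, 150, 220, 180, 260, 240, 300, 280],
})

# Summarize raw transactions into monthly periods on the fly.
monthly = transactions.set_index("sale_ts")["amount"].resample("MS").sum()

# Fit a simple linear trend and extend it one month ahead.
x = np.arange(len(monthly))
slope, intercept = np.polyfit(x, monthly.to_numpy(), 1)
next_month_forecast = slope * len(monthly) + intercept

print(monthly)
print(f"Naive forecast for next month: {next_month_forecast:.2f}")
```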
In summary, organizations are implementing their analytical databases, generally speaking, in one of three ways: (1) structured data warehouse only, (2) data lake only, or (3) an initial data lake with an ETL process to copy the data into a structured data warehouse.
Presentation to End User
Unless you are in the IT department, you will likely never interact directly with any of the technologies mentioned above other than to get read access (via SQL SELECT statements) to the databases. Rather, you will use a suite of tools designed to help you analyze the data. These tools can range from broad applications like Microsoft Excel to more specific applications for data visualization (e.g., Power BI and Tableau) and data prediction (e.g., Azure Machine Learning Studio and AWS Machine Learning). Data mining refers to this process of analyzing large amounts of data (or big data) to create new information that is useful in making unstructured business decisions and solving unstructured business problems.
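For example, the read access mentioned above often amounts to something like the following: a SQL SELECT issued from Python against the analytical database. The database file and table below are hypothetical and simply reuse the names from the earlier ETL sketch.

```python
# Read-only access to the analytical database via a SQL SELECT (names hypothetical).
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    top_lines = pd.read_sql_query(
        """
        SELECT product_line, SUM(daily_sales) AS total_sales
        FROM daily_sales_fact
        GROUP BY product_line
        ORDER BY total_sales DESC
        """,
        conn,
    )

print(top_lines)
```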
However, to become truly fluent as a data scientist, it is becoming increasingly clear that you can't rely only on tools like Excel, Power BI, and Tableau. Even more advanced tools like Azure ML Studio and Amazon ML, which allow non-programmers to create real machine learning environments, still do not give you the full range of capabilities afforded by today's techniques. Furthermore, to get a job as a data analyst or data scientist today, you need to be able to perform data analytics and create machine learning capabilities by writing code and building programs that automate the data analytics process.
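To give a small taste of what "writing code" means here, the sketch below trains a simple predictive model with scikit-learn on synthetic customer data. Every number in it is invented, and scikit-learn is just one common library choice.

```python
# A small predictive model built in code rather than a point-and-click tool
# (synthetic data; scikit-learn is one common choice of library).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic customer features: [monthly_visits, avg_order_value].
X = rng.normal(loc=[5, 60], scale=[2, 15], size=(500, 2))
# Synthetic label ("will the customer buy again?") loosely driven by the features.
y = (0.4 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 1, 500) > 5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Holdout accuracy: {accuracy_score(y_test, model.predict(X_test)):.2f}")
```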