Data Mining and Data Warehouses

What is Data Mining?

One of the most interesting growth areas of database usage is in the area of Online Analytical Processing, or OLAP. Powerful new data mining technologies have opened up a whole new world of OLAP capabilities. As technology advances and data storage capabilities increase in magnitude, companies are capturing and saving billions and trillions of daily transactions. In the past, due to storage limitations, day-to-day transaction data would be dumped and either discarded or stored in long-term tape storage. Today, however, with the massive amount of online storage available, companies are saving terabytes of raw data. For example, Walmart captures 20 million point-of-sale transactions every day. Raw data does not in and of itself provide benefit to an organization. However, new data mining techniques are enabling organizations to extract meaningful information from this raw data.

So what is data mining? Data mining is a technique used to analyze and extract information from massive databases. Data mining tools enable organizations to learn from the daily behaviors of their customers which then allows them to predict future behaviors and future trends. Data mining is like metal mining—it is mining for veins of precious ore (information) in the massive hillside of raw earth (raw data). The example that we are probably most familiar with is those companies that mine our emails and present us with advertising based on the content of the emailed message.

One incredible benefit of advanced data mining tools is that they can sometimes find unexpected relationships or hidden patterns that enable better tactical and strategic decisions. Hence, data mining is used both to predict trends and behaviors in existing products and services and also to find new relationships or new patterns that can be exploited.

Data mining is now used by a wide range of industries, including retail, finance, health care, and manufacturing. Some more specific uses of data mining techniques include the following:

  • Market analysis: identify buying patterns of customers and product connections

  • Customer analysis: identify which customers are repeat buyers and which are single shot

  • Fraud detection: identify transaction patterns that indicate fraudulent types of behavior

  • Direct marketing: identify customer profiles that respond to direct marketing programs

  • Trend analysis: identify what products are increasing, which are decreasing, and what the changing characteristics of current customers are

What is a Data Warehouse?

Even though daily transaction systems generate much of the raw data for data mining, the objectives and database structure supporting OLTP processing are quite different from what is required for OLAP capabilities. The solution is the creation of a new type of database called a data warehouse.

A data warehouse is simply a database whose primary objectives are analyzing, querying, and reporting. Data warehouses are often made up of several data marts. A data mart is a smaller form of a data warehouse that focuses on a specific type of data. There are two important elements in a data warehouse: (1) a flexible and loosely structured database that allows complex queries, and (2) a set of analytical tools to analyze, query, and report on interesting patterns within the data.

The database structure for a data warehouse is often quite different from an operational normalized database. The data in a data warehouse may not be normalized. It may contain both detailed and summary data. The database is also continually growing as more data is extracted from daily operations or updated from outside vendors.

Figure 1.8 shows the process for creating a data warehouse. Direct operational data for the primary business activities as well as indirect operational data from auxiliary activities are both extracted from the operational database. Externally purchased data is also loaded into the data warehouse as needed. The process of uploading data to the warehouse is called Extract, Transform, Load (ETL). The transformation process is used to reformat and restructure input data to the required warehouse formats.

Figure 1.8: Process to Build Information for a Data Warehouse

The data mining is actually done by tools that search for and find interesting information and patterns. These search tools can be as simple as a DBMS query capability or as a sophisticated as high-end pattern recognition programs or even artificial intelligence programs. Current research is continually developing new techniques and approaches to mining the data.

Externally Purchased Data

Many business entities also require information that is external to the business to include it in the data warehouse. Information about the economy, business trends, market activity, and competitor activities is often critical to long-term success. This type of information is much more difficult to obtain. There is a plethora of economic and financial facts, news articles, and analyst commentaries published every day in countless forums. How could a business ever find all the relevant facts and then organize and structure those facts so that they provide meaningful market information? How can these businesses then store that information in a database so that it can be used to make executive decisions? Even though this problem is not purely a database problem, database technology is certainly a large part of the solution. Not only can details and facts be captured and stored in a database, but database analysis will enable businesses to discover new types of information by analyzing the data.

The complexity of this problem is such that most businesses do not have the resources or the expertise to do a thorough job of finding and collecting the external information that may be helpful. Fortunately, there are commercial services that focus on this business need that will find, collect, organize, record, and distribute this kind of economic, financial, and market information. Dow Jones provides such a service on their website called Factiva. This Dow Jones service collects data facts from 31,000 different sources in 200 countries. Much of this information is not available on the web and must be extracted from original documents. The Business Information System provided by Dow Jones consists of a set of tools to do focused research and data extraction based on the information stored in this massive database. Without sophisticated database technology, this service would be impossible to provide.