Cause and Effect

"Information is the oil of the 21st century, and analytics is the combustion engine." —Peter Sondergaard, senior vice president, Gartner Research

What causes one organization to be successful and another to fail? This question drives the academic discipline of business management, and the answer can be informed by a great deal of research. But ultimately, there is not a scripted answer to that question that can be definitively taught in a course. In other words, it is an "unstructured" problem that we continually strive to solve, but never completely succeed in answering.

Therefore, business management (and particularly the subdiscipline of information systems) is actually a very creative discipline. There are no rules other than "do not break the law" and "be ethical." Within those limits, organizations have wide latitude to develop creative ways of achieving above-average gains in a market. There are typically two methods that managers employ to test creative ideas: (1) draw from their past experience (and that of their co-workers and employees) to base new ideas upon, or (2) gather data about the business's performance and use it to determine cause and effect.

Which of these two methods is the more valuable to study? Well, the past experience of a smart business manager is nothing to ignore. Some people have made incredible lives for themselves from one good idea. But if they don't adapt and change, that one good idea may never produce good results again. The past experience of just one person is very limited, and on its own is not enough to solve future problems. Even the experience of two, three, or a thousand employees is quite limited. In addition, a person's experiences are interpreted (or often misinterpreted) through the narrow lens of personal biases, beliefs, values, and desires. Beliefs can be distorted very easily; the truth can be stretched until it is more false than true.

The second method, gathering and examining data, is a much more productive way to make wise business decisions. Information drawn from accurate and timely data doesn't lie. It isn't biased. It doesn't care about gender, race, religion, or politics. It simply represents an objective view of the facts. Therefore, most of the best business decisions are based on accurate and timely data instead of on past experience. Perhaps most importantly, data allow for the establishment of cause and effect. Accurately explaining the causes for each effect is how theory is formed and true knowledge is discovered.

The effect that we are interested in is business success, which is simple enough to see. The cause of success is less obvious. This is where we have to be careful. Data allow us to measure hypothesized causes and desired successes. However, the data cannot determine the true cause of each effect. They only lend support to a theorized cause-and-effect relationship. Consider the following chart depicting accurate data:

Does organic food cause autism? Probably not. In fact, this ridiculous chart was made as an example of how terribly data can be misinterpreted when it is misapplied. Why is there such a strong relationship in the figure above between organic food sales and autism? There are many possible reasons. Perhaps something else caused both (e.g., media awareness). Perhaps it is a random coincidence. Regardless, strange correlations like this happen all of the time when dealing with secondary data. As you may remember from prior statistics courses, the only way to truly establish causality is with a randomized experiment with treatments. (If you have time to kill, enjoy these often hilarious examples.)
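To make the "something else caused both" explanation concrete, here is a minimal sketch in Python using simulated numbers, not real data. It assumes a hypothetical hidden driver called awareness that rises over time and pushes up two otherwise unrelated series. The names organic_sales and diagnoses, the slopes, the noise levels, and the helper functions are all invented for illustration. The sketch compares the raw correlation between the two series with the correlation that remains after the shared driver has been removed from each of them.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a straight-line fit on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [y - (mean_y + slope * (x - mean_x)) for x, y in zip(xs, ys)]


def simulate(n_periods=500, seed=42):
    """Generate two series that share a driver but do not affect each other."""
    rng = random.Random(seed)
    # Hypothetical hidden driver that grows steadily over time.
    awareness = [t / 10 for t in range(n_periods)]
    # Each series depends only on the driver plus its own independent noise.
    organic_sales = [2.0 * a + rng.gauss(0, 1) for a in awareness]
    diagnoses = [3.0 * a + rng.gauss(0, 1) for a in awareness]
    return awareness, organic_sales, diagnoses


awareness, organic_sales, diagnoses = simulate()

# Correlation between the two series as observed.
raw_r = pearson(organic_sales, diagnoses)

# Correlation after the shared driver is removed from both series.
adjusted_r = pearson(residuals(organic_sales, awareness),
                     residuals(diagnoses, awareness))

print(f"raw correlation:      {raw_r:.3f}")
print(f"adjusted correlation: {adjusted_r:.3f}")
```

In this simulation the raw correlation should come out close to 1, while the adjusted correlation should sit near 0, even though neither series has any influence on the other. With real secondary data the hidden driver is usually not measured, which is why a strong correlation alone cannot settle the question of cause.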

This discussion brings up two types of data used in analyses: primary and secondary. Primary data is data generated from experiments or surveys specifically for analytics and cause-effect analysis. Academic researchers and A/B testers generate primary data. However, data analytics is more often based on secondary data, which is data that was generated previously for one purpose but which the data scientist will use for another. Even though we can't establish causality with secondary data, we will still imply causality as long as we can come up with a theory to explain the causal relationship. Therefore, it is still very worthwhile to determine which variables are highly correlated with positive outcomes. Technically, identifying a high correlation is not the same as establishing a cause, but for the purposes of this textbook, we will imply that we are establishing cause and effect.

In summary, data can be valuable, but it can also be dangerous if "the wrong hands" interpret it. The "right hands" to interpret data are those who understand data and help to create the theories that explain organizational success. However, teaching theories and data interpretation is not the purpose of this book. Rather, the focus will be on teaching you the proper techniques to analyze and make use of data in order to build valuable theory and, more importantly, use that knowledge in a machine learning environment. It's up to you (on your own or in future courses) to learn the theories that (1) identify the relevant variables, and (2) explain the relationships among those variables that establish cause and effect.