1.7 Describing Data
Tables and Graphs
A table or graph can be an extremely useful way to present your findings visually to both novice and advanced audiences. Which type of graph you use depends on what information you want to show.
As illustrated below, there are many different types of graphs, each of which is used to visually summarize data:
Please take a few minutes to visit the website typesofgraphs.com and learn more about the different types of graphs. Throughout this course, we will show you how to make various graphs like these.
Tables are sometimes more straightforward. Here is an example produced with skimr, an R package that can be used in RStudio:
Here the summary is numerical. Throughout this course, we will show you how to make and use various types of tables to help you make good business decisions.
Using Numbers to Describe Data
Mean
2 + 3 + 2 + 5 = 12
12/4 = 3
mean = 3
You may remember learning about means, or averages, in your math classes. The mean is one of the more popular measures of central tendency; the others are the median and the mode, which we will describe shortly.
The basic premise of central tendency is that we want to take a series of numbers and see if they can be summarized as a single number that still has meaning and utility. For example, the mean might be used to compare salaries or hours worked between groups or to describe a set of data. If the average starting salary for a company is $40,000, the mean has been calculated at $40,000, and as a new hire we might reasonably expect to make somewhere around that amount. While some people make more or less than $40,000, we would probably not be surprised to be offered a few thousand dollars more or less than that, so the mean serves as a commonly used benchmark.
2, 5, 16, 4, 36
1, 1, 89, 375, 3, 6, 42
684, 273, 608, 374
2, 6, 7, 2, 6, 9, 1, 4, 4, 6
The mean is calculated by adding up all of the values and dividing by how many values there are.
As a formula, this can be written in a few ways. Here we show both a calculation formula (with x1, x2, and x3 written out explicitly) as well as a more compact formula that uses sigma (Σ) notation.
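As a sketch of the two forms described here, assuming the standard definition of the mean (x-bar for the mean, x_i for the i-th value, and n for the number of values):

```latex
\[
\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}
\qquad \text{or, more compactly,} \qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
```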
x with a line above it is called x-bar and is a symbol for the mean. The Greek letter sigma (Σ) simply means to add. In this case, we might say that if we had four numbers to average, the first number (i = 1) would be x1, the second (i = 2) x2, the third (i = 3) x3, and the fourth (i = 4) x4 for a total of four numbers. We usually represent the number of measurements or cases with n, so in this example n = 4. The sigma formula says that we should add every number in our list together from the first one (i = 1) to the fourth one (i = 4) and then divide by the total number n.
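The same calculation can be sketched in a few lines of Python. This is only an illustration using the four numbers from the worked example (2, 3, 2, 5); the variable names are our own, and `statistics.mean` is part of Python's standard library.

```python
from statistics import mean

values = [2, 3, 2, 5]  # the four numbers from the worked example

# Explicit form: add every value, then divide by how many there are (n).
n = len(values)
manual_mean = sum(values) / n

# Library form: statistics.mean carries out the same calculation.
library_mean = mean(values)

print("n =", n)
print("mean by hand:", manual_mean)
print("mean from the statistics module:", library_mean)
```

Both approaches should agree with the hand calculation of 12 / 4.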
One limitation of the mean is that it can be heavily influenced by extreme values. Imagine you have three salespeople: one achieves a million dollars in sales for the month, and the other two have no sales at all. Looked at as a team, their average monthly sales would be $333,333.33, a figure that is misleadingly low for the first salesperson and misleadingly high for the other two!
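A short Python sketch makes the distortion concrete. It uses the three sales figures from the example above; the variable names are illustrative only.

```python
# Monthly sales for the three salespeople in the example.
monthly_sales = [1_000_000, 0, 0]

# The team mean: total sales divided by the number of salespeople.
team_mean = sum(monthly_sales) / len(monthly_sales)
print(f"team mean: ${team_mean:,.2f}")

# How far each individual is from the team mean.
for sales in monthly_sales:
    gap = sales - team_mean
    print(f"sales of ${sales:,} differ from the mean by ${gap:,.2f}")
```

No individual's sales are anywhere near the team mean, which is why a single extreme value can make the mean a poor summary.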
Median
Next we have the median. Once the numbers are sorted from smallest to largest, the median of a set with an odd number of values is simply the middle value. For example, in the set {1, 2, 100}, the median is 2. In a set with an even number of values, the median is the mean of the middle two values. For example, in the set {1, 2, 3, 100}, the median is the average of 2 and 3, or 2.5. Because it looks only at the middle of the sorted list, the median is less swayed by extreme scores, but it can still give a somewhat misleading picture of the data (is 2 really the best descriptor of the set {1, 2, 100}?). When there is an even number of data points, it also shares some of the mean's problems, because it is itself an average of the two middle values.
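The two cases can be sketched in Python as follows. The helper `manual_median` is a hypothetical function written for this illustration; `statistics.median` is the standard-library equivalent and is shown for comparison.

```python
from statistics import median


def manual_median(values):
    """Return the median: sort first, then take the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: the single middle value.
        return ordered[middle]
    # Even number of values: the mean of the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_set = [100, 1, 2]       # deliberately unsorted
even_set = [1, 2, 3, 100]

median_odd = manual_median(odd_set)
median_even = manual_median(even_set)

print(median_odd, median(odd_set))
print(median_even, median(even_set))
```

Note that the data must be sorted before the middle value is taken; picking the middle of an unsorted list would give the wrong answer for `odd_set`.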
Mode
Last would be the mode. This measure of central tendency is calculated simply by taking whatever number, if any, occurs most often. While this may be useful if you need to know the most common score, its relevance is so limited that it is often ignored.
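As a brief sketch, the mode can be found in Python with `statistics.multimode` (available in Python 3.8 and later), which returns every value tied for most frequent. The first list reuses the last of the number sets listed earlier in this section; the second is a made-up list chosen to show a tie.

```python
from statistics import multimode

scores = [2, 6, 7, 2, 6, 9, 1, 4, 4, 6]
tied_scores = [1, 1, 2, 2, 3]  # hypothetical data with two equally common values

# multimode returns a list, because more than one value can share the top count.
modes = multimode(scores)
tied_modes = multimode(tied_scores)

print("mode(s) of scores:", modes)
print("mode(s) of tied_scores:", tied_modes)
```

When several values share the highest count, there is no single mode, which is one reason the measure is of limited use.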
Standard Deviation
Each of these measures of central tendency has its limitations, and none of them describes how spread out scores are. Imagine a salesperson who month after month has sales that range between $9,000 and $11,000, with an average of $10,000. Now imagine a different salesperson who sells between $5,000 and $15,000 but who also averages $10,000. Is there a way to measure consistency as well? There is, and it is called the standard deviation. It measures how much scores typically deviate from the mean. Standard deviation is one of the most important concepts in statistics. Standard deviations are found using the following equations:
Sample Standard Deviation
Population Standard Deviation
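Assuming the standard definitions, the two equations could be written as shown in this sketch. The symbols s (sample standard deviation), σ (population standard deviation), and μ (population mean) are notational assumptions; x-bar is the sample mean and n is the number of measurements.

```latex
\[
s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}
\qquad \text{(sample)}
\qquad\qquad
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}
\qquad \text{(population)}
\]
```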
Here is an example that uses the standard deviation curve. Imagine a market researcher is analyzing the results of a recent customer survey. He wants an idea of how varied his customers' opinions of his latest product are on a scale of 1–10, where 10 means they are completely satisfied. If his sample results show an average of 8 with a standard deviation of 0.5, and the ratings are roughly normally distributed, we can estimate that about 68.3% of his customer population falls between 7.5 and 8.5 (one standard deviation above and below the mean) and that about 95.4% falls between 7 and 9 (two standard deviations; see the chart above). We will explore these concepts more later in the course, but being able to take sample data and describe the entire population is very useful.
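The percentages in the survey example can be checked with a short Python sketch. It assumes the ratings follow a normal distribution with a mean of 8 and a standard deviation of 0.5, which is a modeling assumption rather than something the survey itself guarantees. `NormalDist` is part of the standard-library `statistics` module.

```python
from statistics import NormalDist

# Assumed normal model of the satisfaction ratings (mean 8, SD 0.5).
ratings = NormalDist(mu=8, sigma=0.5)

# Share of the distribution within one and two standard deviations of the mean.
within_one_sd = ratings.cdf(8.5) - ratings.cdf(7.5)
within_two_sd = ratings.cdf(9.0) - ratings.cdf(7.0)

print(f"between 7.5 and 8.5: {within_one_sd:.1%}")
print(f"between 7 and 9:     {within_two_sd:.1%}")
```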
You might note that with the population standard deviation, we divide by n, the number of measurements in our population, while for the sample, we use n - 1. Dividing a sample's squared deviations by n tends to underestimate the spread of the population, because the deviations are measured from the sample's own mean rather than from the true population mean. Dividing by n - 1 corrects for this: the sample variance (the standard deviation squared) computed with n - 1 equals the population variance on average. For an example, see this standard deviation demonstration from Khan Academy.
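To see both ideas in code, the sketch below compares the two salespeople described earlier in this section. The six monthly figures for each person are invented for illustration; they were chosen only so that both average $10,000. `statistics.stdev` divides by n - 1 (sample) and `statistics.pstdev` divides by n (population).

```python
from statistics import mean, pstdev, stdev

# Hypothetical monthly sales, made up so that both salespeople average $10,000.
steady = [9_000, 10_000, 11_000, 10_000, 9_500, 10_500]
variable = [5_000, 15_000, 8_000, 12_000, 6_000, 14_000]

results = {}
for name, sales in [("steady", steady), ("variable", variable)]:
    results[name] = {
        "mean": mean(sales),
        "sample_sd": stdev(sales),       # divides by n - 1
        "population_sd": pstdev(sales),  # divides by n
    }

for name, stats in results.items():
    print(
        f"{name:>8}: mean = {stats['mean']:,.0f}, "
        f"sample SD = {stats['sample_sd']:,.1f}, "
        f"population SD = {stats['population_sd']:,.1f}"
    )
```

The means are identical, but the standard deviations differ sharply, and for each salesperson the n - 1 version is slightly larger than the n version.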
Degrees of Freedom
Standard deviation is related to the concept of degrees of freedom. For example, if we have 20 people in a study who represent the sample of a larger population, we would use n - 1 or 19 degrees of freedom in our calculations, instead of just using n = 20.