Video Notes
In this dual-purpose series, we’re studying basic statistics and practicing working with R.
Today’s topic is standard deviation.
What is standard deviation
Standard deviation is a measure of how spread out numbers are in a particular data set. Specifically, it measures how much the numbers differ from the mean.
When a data set has a high standard deviation value, it means there’s a lot of variability in the data. A lower standard deviation value indicates there’s less variability.
To contextualize this, imagine you are looking at sets of student test scores. If one set of scores has a high standard deviation, it tells you there’s a lot of variability in the students’ performance. On the other hand, a lower standard deviation would indicate the students were more consistent in their scores.
Here are two example data sets (encoded as R vectors) to visualize this:
scores_a <- c(98, 89, 97, 89, 88)
scores_b <- c(25, 40, 100, 77, 60)
FYI
A vector in R is the most basic data structure; it stores a collection of elements of the same data type. It can contain numbers, characters, or logical values. In the above code we create numerical vectors using the built-in R function c (concatenate).
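As a quick illustration of the "same data type" rule, here is a minimal sketch; the variable names (nums, words, flags, mixed) are made up for this example:

```r
# Numeric, character, and logical vectors built with c()
nums  <- c(98, 89, 97)
words <- c("high", "low")
flags <- c(TRUE, FALSE, TRUE)

class(nums)   # "numeric"
class(words)  # "character"
class(flags)  # "logical"

# Mixing types forces every element into a single type (here, character)
mixed <- c(1, "a", TRUE)
class(mixed)  # "character"
mixed         # "1" "a" "TRUE"
```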
Skimming through these numbers, we can see there’s more variability in scores_b, so we would expect the standard deviation value for this data set to be higher than the standard deviation for scores_a.
As proof of this, let’s have R calculate the standard deviation (using the built-in sd function) and see how it reflects the variability.
scores_a <- c(98, 89, 97, 89, 88)
print(sd(scores_a)) # 4.868265
scores_b <- c(25, 40, 100, 77, 60)
print(sd(scores_b)) # 29.63613
As expected, scores_b (29.64) has a higher standard deviation than scores_a (4.87).
Calculating standard deviation
Now that we have a big-picture understanding of standard deviation, let’s “pop the hood” and see how it is calculated. We will show both a manual calculation and a calculation using R functionality.
Of course we could skip both these approaches and go straight to R’s built-in sd function (as demonstrated above) but the goal is to get a deeper understanding of standard deviation and also programming in R.
Manual calculation
There are two formulas for standard deviation, depending on whether your data represents a population or a sample. The formulas are nearly identical, but the sample standard deviation divides by n - 1 instead of N.
This adjustment, known as Bessel’s correction, compensates for the bias that arises when estimating a population’s variability from a sample. Because the deviations are measured from the sample mean rather than the true population mean, a sample tends to underestimate the variability in the population. Dividing by n - 1 instead of N yields a slightly larger standard deviation, which makes it a better estimate of the population standard deviation.
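To see the size of this adjustment on our first set of scores, here is a rough sketch that computes both versions side by side. The names population_sd and sample_sd are illustrative variables, not built-in R functions:

```r
scores_a <- c(98, 89, 97, 89, 88)
n <- length(scores_a)
sum_of_squares <- sum((scores_a - mean(scores_a))^2)

# Population formula: divide by N
population_sd <- sqrt(sum_of_squares / n)

# Sample formula (Bessel's correction): divide by n - 1
sample_sd <- sqrt(sum_of_squares / (n - 1))

print(population_sd)  # roughly 4.35
print(sample_sd)      # roughly 4.87, the same value sd() returns
```

With only five scores the difference is noticeable; as n grows, the two results move closer together.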
For our example, we’ll use the formula for a sample, as this is the formula R uses in the sd function.
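For reference, the two formulas are conventionally written as shown below. The symbols are the standard textbook ones and are an assumption here: σ and s for the population and sample standard deviation, μ and x̄ for the population and sample mean, N and n for the number of values.

```latex
% Population standard deviation
\[
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
\]

% Sample standard deviation (divides by n - 1)
\[
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]
```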
Applying the sample formula to these test scores:
98, 89, 97, 89, 88
Start by calculating the mean:
(98 + 89 + 97 + 89 + 88) / 5 = 92.2
Next, we need to figure out how much each score varies from the mean by subtracting the mean from each score:
98 - 92.2 = 5.8
89 - 92.2 = -3.2
97 - 92.2 = 4.8
89 - 92.2 = -3.2
88 - 92.2 = -4.2
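If we were to sum up the above negative and positive numbers, we’d end up with 0, because the differences above and below the mean cancel each other out.
To keep them from cancelling, square each of the resulting differences from the above step: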
5.8² = 33.64
(-3.2)² = 10.24
4.8² = 23.04
(-3.2)² = 10.24
(-4.2)² = 17.64
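If you check these squares in the R console, keep in mind that R applies the exponent before the minus sign, so a negative number needs parentheses to be squared as a whole. A small sketch (deviation is just an illustrative variable name):

```r
-3.2^2     # -10.24: evaluated as -(3.2^2)
(-3.2)^2   # 10.24: the negative number itself is squared

# When the negative value is stored in a variable, no parentheses are needed
deviation <- -3.2
deviation^2  # 10.24
```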
Add up these squared values to yield what is referred to as the sum of squares:
33.64 + 10.24 + 23.04 + 10.24 + 17.64 = 94.8
Divide the sum of squares by n - 1 (where n is the number of scores) to yield the variance:
(94.8 / 4) = 23.7
Finally, calculate the square root of the variance to yield the standard deviation:
√23.7 ≈ 4.868265
We know we did everything correctly because this standard deviation value of 4.87 matches up with what we calculated using R’s sd function:
scores_a <- c(98, 89, 97, 89, 88)
sd(scores_a) # 4.868265
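The intermediate variance can be checked the same way. R's built-in var function also uses the n - 1 (sample) formula, so its square root should line up with sd:

```r
scores_a <- c(98, 89, 97, 89, 88)

var(scores_a)        # 23.7
sqrt(var(scores_a))  # 4.868265, the same value sd(scores_a) returns
```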
With R functions
Let’s repeat the above procedures with the assistance of built-in R functions:
# Define our hypothetical scores
scores <- c(98, 89, 97, 89, 88)
# Calculate the mean (named mean_score so we don't mask R's mean function)
mean_score <- mean(scores)
# Subtract the mean from each score
diff_from_mean <- scores - mean_score
# Square the resulting values
squared_diff_from_mean <- diff_from_mean^2
# Sum the squares
sum_of_squares <- sum(squared_diff_from_mean)
# Calculate the variance by dividing the sum of squares by n - 1
variance <- sum_of_squares / (length(scores) - 1)
# Calculate the SD by taking the square root of the variance
standard_deviation <- sqrt(variance)
print(standard_deviation) # 4.868265
Once again, we learn that our standard deviation is 4.868265.
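If you'd like to reuse these steps on other data sets, one option is to wrap them in a function. This is only a sketch: manual_sd is a hypothetical helper name, not something R provides.

```r
# manual_sd is a hypothetical helper that mirrors the steps above
manual_sd <- function(x) {
  deviations <- x - mean(x)                   # difference from the mean
  sqrt(sum(deviations^2) / (length(x) - 1))   # sample formula (n - 1)
}

scores <- c(98, 89, 97, 89, 88)
print(manual_sd(scores))                  # 4.868265
all.equal(manual_sd(scores), sd(scores))  # TRUE
```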
Practice
Repeat the above steps using our second set of hypothetical test scores:
scores_b <- c(25, 40, 100, 77, 60)
Complete this process by doing both a manual calculation as well as a calculation with the assistance of built-in R functions (mean, sum, length, sqrt).
Make sure your resulting value matches the standard deviation as calculated by R's built-in sd function.
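When comparing a hand-calculated, rounded result against sd, an exact == comparison will usually fail because sd keeps more digits than you typed. One possible way to check, assuming you stored your rounded answer in a variable (my_sd is a made-up name):

```r
scores_b <- c(25, 40, 100, 77, 60)

# my_sd is a hypothetical variable holding a hand-calculated, rounded result
my_sd <- 29.63613

# Exact comparison fails because sd() returns more digits than we typed
my_sd == sd(scores_b)  # FALSE

# Comparing within a small tolerance succeeds
isTRUE(all.equal(my_sd, sd(scores_b), tolerance = 1e-6))  # TRUE
```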