Standard Deviation - Statistics and R Simplified

Video Notes

In this dual-purpose series, we’re studying basic statistics and practicing working with R.

Today’s topic is standard deviation.

What is standard deviation

Standard deviation is a measure of how spread out numbers are in a particular data set. Specifically, it measures how much the numbers differ from the mean.

When a data set has a high standard deviation value, it means there’s a lot of variability in the data. A lower standard deviation value indicates there’s less variability.

To contextualize this, imagine you are looking at sets of student test scores. If one set of scores has a high standard deviation, it tells you there’s a lot of variability in the students’ performance. On the other hand, a lower standard deviation would indicate the students were more consistent in their scores.

Here are two example data sets (encoded as R vectors) to visualize this:

scores_a <- c(98, 89, 97, 89, 88)
scores_b <- c(25, 40, 100, 77, 60)

A vector in R is the most basic data structure. It stores a collection of elements of the same data type, such as numbers, characters, or logical values. In the above code, we create numeric vectors using the built-in R function c (short for combine).
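As a small illustration of the paragraph above, the sketch below builds one vector of each kind with c and checks its type with the built-in class function. The variable names and values are made up for this example.

```r
# Hypothetical example vectors (names and values are illustrative only)
score_values <- c(98, 89, 97)            # numbers
student_names <- c("Ana", "Ben", "Cy")   # characters
passed <- c(TRUE, FALSE, TRUE)           # logical values

print(class(score_values))   # "numeric"
print(class(student_names))  # "character"
print(class(passed))         # "logical"

# Mixing types forces R to convert every element to one common type
mixed <- c(1, "a")
print(class(mixed))          # "character"
```

The last two lines show why the "same data type" rule matters: R does not reject the mixed input, it silently converts the number to a character string.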

Skimming through these numbers, we can see there’s more variability in scores_b, so we would expect the standard deviation value for this data set to be higher than the standard deviation for scores_a.

As proof of this, let’s have R calculate the standard deviation (using the built-in sd function) and see how it reflects the variability.

scores_a <- c(98, 89, 97, 89, 88)
print(sd(scores_a)) # 4.868265

scores_b <- c(25, 40, 100, 77, 60)
print(sd(scores_b)) # 29.63613

As expected, scores_b (29.64) had a higher standard deviation than scores_a (4.87).
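The two-decimal figures quoted here are rounded versions of what sd returns. A minimal sketch of that rounding, using R's built-in round function:

```r
scores_a <- c(98, 89, 97, 89, 88)
scores_b <- c(25, 40, 100, 77, 60)

# Round each standard deviation to two decimal places
print(round(sd(scores_a), 2)) # 4.87
print(round(sd(scores_b), 2)) # 29.64

# Confirm the comparison programmatically
print(sd(scores_b) > sd(scores_a)) # TRUE
```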

Calculating standard deviation

Now that we have a big-picture understanding of standard deviation, let’s “pop the hood” and understand how it is calculated. We will show both a manual calculation and a calculation using R functionality.

Of course, we could skip both of these approaches and go straight to R’s built-in sd function (as demonstrated above), but the goal is to get a deeper understanding of both standard deviation and programming in R.

Manual calculation

There are two formulas for standard deviation, depending on whether your data represents a population or a sample. The formulas are nearly identical, but the sample standard deviation divides by n - 1 instead of N.
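For reference, the two formulas are commonly written as shown below. The notation is the conventional one and is an assumption here, since these notes do not define symbols: σ and μ stand for the population standard deviation and mean, s and x̄ for the sample standard deviation and mean, and N and n for the number of values.

```latex
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}
\qquad
s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
```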

This adjustment, known as Bessel’s correction, corrects for the bias that occurs when estimating the population variance from a sample. Because the differences are measured from the sample mean rather than the true population mean, a sample tends to underestimate the true variability in a population. Using n - 1 instead of N results in a slightly larger standard deviation, making it a less biased estimate of the population standard deviation.
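To see the effect of the divisor, here is a minimal sketch that computes the standard deviation both ways for the first set of scores. The variable names population_sd and sample_sd are made up for this example.

```r
scores <- c(98, 89, 97, 89, 88)
n <- length(scores)

# Sum of squared differences from the mean
sum_of_squares <- sum((scores - mean(scores))^2)

# Population formula: divide by the number of values
population_sd <- sqrt(sum_of_squares / n)

# Sample formula: divide by one less than the number of values
sample_sd <- sqrt(sum_of_squares / (n - 1))

print(population_sd)             # roughly 4.35
print(sample_sd)                 # 4.868265
print(sample_sd > population_sd) # TRUE
```

With only five scores the gap is noticeable. As n grows, dividing by n - 1 versus n makes less and less difference.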

For our example, we’ll use the formula for a sample, as this is the formula R uses in the sd function.

Applying this formula to these test scores:

98, 89, 97, 89, 88

Start by calculating the mean:

(98 + 89 + 97 + 89 + 88) / 5 = 92.2

Next, we need to figure out how much each score varies from the mean by subtracting the mean from each score:

98 - 92.2 = 5.8
89 - 92.2 = -3.2
97 - 92.2 = 4.8
89 - 92.2 = -3.2
88 - 92.2 = -4.2

If we were to sum up the above negative and positive numbers, we’d end up with 0, because they cancel each other out. To avoid this, square each of the differences from the above step:

(5.8)² = 33.64
(-3.2)² = 10.24
(4.8)² = 23.04
(-3.2)² = 10.24
(-4.2)² = 17.64
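Both the cancellation and the squaring step can be checked quickly in R. This is only a sketch: the name diffs is made up here, and the sum is compared against a small tolerance because decimal values such as 92.2 are not stored exactly.

```r
scores <- c(98, 89, 97, 89, 88)
diffs <- scores - mean(scores)

# The differences cancel out; the sum is 0 apart from tiny
# floating-point rounding error
print(abs(sum(diffs)) < 1e-9) # TRUE

# Squaring makes every value non-negative, so nothing cancels
print(diffs^2) # 33.64 10.24 23.04 10.24 17.64

# Caution: R applies ^ before the minus sign
print(-3.2^2)   # -10.24
print((-3.2)^2) # 10.24
```

The last two lines are why parentheses matter when squaring a negative number by hand in R. Squaring a whole vector, as in diffs^2, does not have this problem.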

Add up these squared values to yield what is referred to as the sum of squares:

33.64 + 10.24 + 23.04 + 10.24 + 17.64 = 94.8

Divide the sum of squares by n-1 (where n is number of scores) to yield the variance:

(94.8 / 4) = 23.7

Finally, calculate the square root of the variance to yield the standard deviation:

√23.7 ≈ 4.868265

We know we did everything correctly because this standard deviation value of 4.87 matches up with what we calculated using R’s sd function:

scores_a <- c(98, 89, 97, 89, 88)
sd(scores_a) # 4.868265
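The intermediate variance can be checked the same way. R's built-in var function also uses the sample formula (dividing by n - 1), so its square root should match sd:

```r
scores_a <- c(98, 89, 97, 89, 88)

# var returns the sample variance
print(var(scores_a))       # 23.7

# Its square root is the sample standard deviation
print(sqrt(var(scores_a))) # 4.868265
```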

With R functions

Let’s repeat the above procedures with the assistance of built-in R functions:

# Define our hypothetical scores
scores <- c(98, 89, 97, 89, 88)

# Calculate the mean (named mean_score so it does not mask R's mean function)
mean_score <- mean(scores)

# Subtract the mean from each score
diff_from_mean <- scores - mean_score

# Square the resulting values
squared_diff_from_mean <- diff_from_mean^2

# Sum the squares
sum_of_squares <- sum(squared_diff_from_mean)

# Calculate the variance by dividing the sum of squares by n-1
variance <- sum_of_squares / (length(scores) - 1)

# Calculate the SD by taking the square root of the variance
standard_deviation <- sqrt(variance)
print(standard_deviation) # 4.868265

Once again, we learn that our standard deviation is 4.868265.
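If you want to reuse these steps on other data, one option is to wrap them in a function. The sketch below is one possible way to do that. The name manual_sd is hypothetical, and it assumes a numeric vector with at least two values.

```r
# manual_sd is a hypothetical helper name used only for this sketch
manual_sd <- function(values) {
  diff_from_mean <- values - mean(values)
  sum_of_squares <- sum(diff_from_mean^2)
  variance <- sum_of_squares / (length(values) - 1)
  sqrt(variance)
}

scores <- c(98, 89, 97, 89, 88)
print(manual_sd(scores)) # 4.868265

# all.equal compares numbers with a small tolerance for rounding error
print(all.equal(manual_sd(scores), sd(scores))) # TRUE
```

Comparing with all.equal rather than == avoids false mismatches caused by tiny floating-point differences.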

Practice

Repeat the above steps using our second set of hypothetical test scores:

scores_b <- c(25, 40, 100, 77, 60)

Complete this process by doing both a manual calculation and a calculation with the assistance of built-in R functions (mean, sum, length, sqrt).

Make sure your resulting value matches the standard deviation as calculated by R's built-in sd function.
