Video Notes
A dataset may be written in long or wide format.
Consider an example dataset that records employees and their sales for each month.
In long format, we see repeated rows for Employees to encode their sales for each month. These repeated rows make the data “longer”, hence the name long format.
In wide format, we would instead see a single row for each Employee and the addition of columns for each Month. These additional columns make the data “wider”, hence the name wide format.
Summary:
- long format ➡️ more rows
- wide format ➡️ more columns
Wide format is more human-friendly - it’s easy for us to quickly glance at the data and make comparisons.
Long format is more programming-friendly - it’s better for grouping, summarization, visualization, and statistical analysis.
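As a minimal sketch of why long format suits grouping and summarization, the base R snippet below averages sales per employee with a single formula. The data frame is rebuilt inline so the example runs on its own, and the name avg_sales is only illustrative.

```r
# Employee sales example in long format: one row per Employee/Month pair
data_long <- data.frame(
  Employee = rep(c("Alice", "Bob", "Charlie"), each = 3),
  Month    = rep(c("January", "February", "March"), times = 3),
  Sales    = c(5000, 5200, 5100, 4800, 4700, 4900, 5300, 5400, 5500)
)

# Group by Employee and average the single Sales column
avg_sales <- aggregate(Sales ~ Employee, data = data_long, FUN = mean)
print(avg_sales)
```

With the same data in wide format, the months would sit in separate columns, so this summary would have to name each month column explicitly.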
Because both wide and long format have their advantages, it’s useful to know how to translate between the two, a process referred to as "reshaping".
Example data
Example data in long format:
data_long <- read.csv(
  text = "
Employee,Month,Sales
Alice,January,5000
Alice,February,5200
Alice,March,5100
Bob,January,4800
Bob,February,4700
Bob,March,4900
Charlie,January,5300
Charlie,February,5400
Charlie,March,5500",
  stringsAsFactors = FALSE
)
Example data in wide format:
data_wide <- read.csv(
  text = "
Employee,January,February,March
Alice,5000,5200,5100
Bob,4800,4700,4900
Charlie,5300,5400,5500",
  stringsAsFactors = FALSE
)
Reshaping data
To reshape data, we can use the pivot_longer and pivot_wider functions from the tidyr package. In the video, I demonstrate these functions with the example data above.
Convert from long to wide format:
library(tidyr)
data_wide <- data_long %>%
  pivot_wider(
    names_from = "Month",
    values_from = "Sales"
  )
- tidyr function used: pivot_wider
- arguments:
  - names_from - The column that contains the values that will become the new column names.
  - values_from - The column that contains the data that will be placed inside the new columns.
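To make the roles of names_from and values_from concrete, here is a small self-contained sketch that rebuilds the long-format example inline and inspects the shape of the result. It assumes tidyr is installed.

```r
library(tidyr)

# Long-format input: 3 employees x 3 months = 9 rows
data_long <- data.frame(
  Employee = rep(c("Alice", "Bob", "Charlie"), each = 3),
  Month    = rep(c("January", "February", "March"), times = 3),
  Sales    = c(5000, 5200, 5100, 4800, 4700, 4900, 5300, 5400, 5500)
)

data_wide <- data_long %>%
  pivot_wider(
    names_from = "Month",   # distinct Month values become column names
    values_from = "Sales"   # Sales values fill those new columns
  )

dim(data_wide)    # one row per Employee, one column per Month plus Employee
names(data_wide)
```

The nine input rows collapse to three, one per Employee, and each distinct value of Month becomes its own column.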
Convert from wide to long format:
library(tidyr)
data_long <- data_wide %>%
  pivot_longer(
    cols = c("January", "February", "March"),
    names_to = "Month",
    values_to = "Sales"
  )
- tidyr function used: pivot_longer
- arguments:
  - cols - The columns that will be stacked into rows, turning their names into values in the names_to column and their data into the values_to column.
  - names_to - The name of the new column that will store the original column names from cols.
  - values_to - The name of the new column that will store the values from the original columns in cols.
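One way to check your understanding of these arguments is a round trip: reshape wide to long, then back to wide, and confirm that nothing was lost. The sketch below rebuilds the wide-format example inline and assumes tidyr is installed. The name round_trip is only illustrative.

```r
library(tidyr)

# Wide-format input: one row per employee, one column per month
data_wide <- data.frame(
  Employee = c("Alice", "Bob", "Charlie"),
  January  = c(5000, 4800, 5300),
  February = c(5200, 4700, 5400),
  March    = c(5100, 4900, 5500)
)

data_long <- data_wide %>%
  pivot_longer(
    cols = c("January", "February", "March"),
    names_to = "Month",
    values_to = "Sales"
  )

nrow(data_long)  # 3 employees x 3 month columns stacked into rows

# Reshape back to wide and compare values with the original
round_trip <- data_long %>%
  pivot_wider(names_from = "Month", values_from = "Sales")

all.equal(as.data.frame(round_trip), data_wide, check.attributes = FALSE)
```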
FYI
Instead of listing the columns to reshape (which could get cumbersome with many columns), you can indicate which columns not to reshape using the negative selection operator (-):
data_long <- data_wide %>%
  pivot_longer(
    cols = -Employee,
    names_to = "Month",
    values_to = "Sales"
  )
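As a quick sanity check, the sketch below compares the two ways of specifying cols on the same wide-format example. It assumes tidyr is installed, and the names long_explicit and long_negative are only illustrative.

```r
library(tidyr)

data_wide <- data.frame(
  Employee = c("Alice", "Bob", "Charlie"),
  January  = c(5000, 4800, 5300),
  February = c(5200, 4700, 5400),
  March    = c(5100, 4900, 5500)
)

# Explicitly list the columns to stack
long_explicit <- data_wide %>%
  pivot_longer(
    cols = c("January", "February", "March"),
    names_to = "Month",
    values_to = "Sales"
  )

# Stack every column except Employee
long_negative <- data_wide %>%
  pivot_longer(
    cols = -Employee,
    names_to = "Month",
    values_to = "Sales"
  )

# Both selections pick the same three columns, so the results should match
identical(long_explicit, long_negative)
```

The negative form keeps working unchanged if more month columns are added later, whereas the explicit list would need to be updated.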