In my guide Import and Clean Data we looked at the process for importing and cleaning up a practice data file on employee satisfaction (download here).
In this guide, we will do some exploratory analysis to better understand this data. This will entail generating basic descriptive statistics and visualizations, identifying outliers, and finally looking for any correlations amongst our variables.
If you’re just joining me, you can get started with the following code that will import and process the data:
# On import, treat empty strings as NA
employee_survey <- read.csv('https://gist.githubusercontent.com/susanBuck/d69ac8413ce3c47bd25661a79eb44ffd/raw/e501c2c4471f3c38f68ff8003f4218f365d06f34/employee_survey.csv', na.strings = "")
# Convert categorical variables to Factors:
# - dept: Departments are categories with no natural order.
# - salary: Salary levels have a natural order (low < medium < high).
employee_survey$dept <- factor(employee_survey$dept)
employee_survey$salary <- factor(employee_survey$salary, levels=c("low", "medium", "high"), ordered = TRUE)
# This data contains 788 rows with missing values.
# There are no other NA values in the rest of the file,
# so we’ll use na.omit to drop those rows.
employee_survey <- na.omit(employee_survey)
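Before moving on, it’s worth a quick sanity check that the cleanup behaved as expected. A minimal sketch using base R:
# Confirm there are no remaining NA values after na.omit
sum(is.na(employee_survey))
# Confirm the structure, including the factor conversions for dept and salary
str(employee_survey)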
Step one in understanding our data is to summarize it. For this, we’ll use the skimr package, which offers a more detailed summary compared to base R functions like summary:
library(skimr)
library(dplyr) # provides the pipe (%>%) and select()
employee_survey %>% select(-Emp.ID) %>% skim()
Output:
── Data Summary ────────────────────────
Values
Name Piped data
Number of rows 14999
Number of columns 9
_______________________
Column type frequency:
factor 2
numeric 7
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique top_counts
1 dept 0 1 FALSE 10 sal: 4140, tec: 2720, sup: 2229, IT: 1227
2 salary 0 1 TRUE 3 low: 7316, med: 6446, hig: 1237
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 satisfaction_level 0 1 0.613 0.249 0.09 0.44 0.64 0.82 1 ▃▅▇▇▇
2 last_evaluation 0 1 0.716 0.171 0.36 0.56 0.72 0.87 1 ▂▇▆▇▇
3 number_project 0 1 3.80 1.23 2 3 4 5 7 ▇▆▃▂▁
4 average_montly_hours 0 1 201. 49.9 96 156 200 245 310 ▃▇▆▇▂
5 time_spend_company 0 1 3.50 1.46 2 3 3 4 10 ▇▃▁▁▁
6 Work_accident 0 1 0.145 0.352 0 0 0 0 1 ▇▁▁▁▂
7 promotion_last_5years 0 1 0.0213 0.144 0 0 0 0 1 ▇▁▁▁▁
There are 9 columns total: 2 are factors and 7 are numeric.
There is no missing data (n_missing = 0 for all variables), which checks out, since handling missing values was part of our Import and Clean Data steps.
Below are some notes and observations about the different variables...
skim_variable n_missing complete_rate ordered n_unique top_counts
1 dept 0 1 FALSE 10 sal: 4140, tec: 2720, sup: 2229, IT: 1227
There are 10 departments. There appears to be a large drop from the top count (sales: 4140) to the 4th (IT: 1227), suggesting this variable is imbalanced. Let’s visualize the department frequency via a bar chart to dig deeper:
library(ggplot2)
employee_survey %>%
ggplot(aes(x = dept)) +
geom_bar(fill = "skyblue") +
labs(
title = "Distribution of Employees by Department",
x = "Department",
y = "Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Based on this bar chart, there does in fact appear to be an imbalance. This is something to keep in mind, as it might affect how we model department-level trends.
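To put exact numbers on that imbalance, we can also tabulate the department frequencies directly. This is an optional check using dplyr’s count (loaded earlier):
# Tabulate department frequencies, largest first
employee_survey %>%
  count(dept, sort = TRUE)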
skim_variable n_missing complete_rate ordered n_unique top_counts
2 salary 0 1 TRUE 3 low: 7316, med: 6446, hig: 1237
Most employees earn low or medium salaries, and the high salary group is small (roughly 8%).
Again, we can create a bar chart to visualize this:
employee_survey %>%
ggplot(aes(x = salary)) +
geom_bar(fill = "limegreen") +
labs(
title = "Distribution of Employees by Salary Level",
x = "Salary",
y = "Count"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
When analyzing differences by salary level, we should remember the group sizes are unequal.
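If you want the exact proportions behind that rough 8% figure, a quick base R calculation works. A minimal sketch:
# Proportion of employees at each salary level
round(prop.table(table(employee_survey$salary)), 3)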
── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
1 satisfaction_level 0 1 0.613 0.249 0.09 0.44 0.64 0.82 1 ▃▅▇▇▇
2 last_evaluation 0 1 0.716 0.171 0.36 0.56 0.72 0.87 1 ▂▇▆▇▇
3 number_project 0 1 3.80 1.23 2 3 4 5 7 ▇▆▃▂▁
4 average_montly_hours 0 1 201. 49.9 96 156 200 245 310 ▃▇▆▇▂
5 time_spend_company 0 1 3.50 1.46 2 3 3 4 10 ▇▃▁▁▁
6 Work_accident 0 1 0.145 0.352 0 0 0 0 1 ▇▁▁▁▂
7 promotion_last_5years 0 1 0.0213 0.144 0 0 0 0 1 ▇▁▁▁▁
For each of the numeric variables, we get the number of missing values (n_missing), the complete rate, the mean and standard deviation (sd), and the percentiles from the minimum (p0) through the quartiles (p25, p50, p75) to the maximum (p100).
Helpfully, skimr also provides a mini histogram to get a broad sense of the distribution of these variables.
For example, we can see that satisfaction_level ▃▅▇▇▇ is broadly left-skewed with a long tail to the left, indicating many values are high and a few are low.
Digging deeper into this, we can plot a more detailed histogram of satisfaction_level:
employee_survey %>%
ggplot(aes(x = satisfaction_level)) +
geom_histogram(fill = "turquoise") +
labs(
title = "Distribution of Satisfaction Level",
x = "satisfaction_level",
y = "Count"
) +
theme_minimal()
This reveals the shape of our data in more detail, and we can see it’s bimodal, with two main peaks: one in the low satisfaction range (around 0.1) and one in the higher range (roughly 0.7 to 0.8).
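As an optional cross-check of that bimodal shape, a density plot smooths over the histogram’s binning choice. A minimal sketch using the same ggplot2 tools:
# Density plot as an alternative view of the same distribution
employee_survey %>%
  ggplot(aes(x = satisfaction_level)) +
  geom_density(fill = "turquoise", alpha = 0.5) +
  labs(
    title = "Density of Satisfaction Level",
    x = "satisfaction_level",
    y = "Density"
  ) +
  theme_minimal()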
Before we go any further, we should identify any outliers in our data.
To do this visually, let’s create a box plot for each numerical variable. This will visualize the median, quartiles, whiskers, and any outliers.
We will be able to present multiple box plots at once using ggplot2’s faceting feature. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.
library(tidyr) # provides pivot_longer()
employee_survey %>%
select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID) %>%
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Value") %>%
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "lightblue") +
facet_wrap( ~ Variable, scales = "free") +
labs(title = "Box Plots of Numeric Variables") +
theme_minimal()
Here’s a breakdown of each step:
Step 1
select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID)
Selects all numeric columns with more than 2 unique values (to exclude the binary columns promotion_last_5years and Work_accident). Also excludes Emp.ID.
Step 2
pivot_longer(cols = everything(),
names_to = "Variable",
values_to = "Value")
Uses pivot_longer to reshape the data from wide to long format, which is the format ggplot2’s facet_wrap function needs. I.e., it takes multiple columns and stacks them into two columns, one holding the variable names and one holding the values, as the small example below illustrates.
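To make the reshape concrete, here is a tiny, hypothetical example (the data frame wide is made up for illustration and is not part of the survey data):
# Toy wide-format data: one column per variable
wide <- data.frame(a = c(1, 2), b = c(10, 20))
# Long format: variable names stacked in one column, values in another
wide %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value")
#> # A tibble: 4 × 2
#>   Variable Value
#>   <chr>    <dbl>
#> 1 a            1
#> 2 b           10
#> 3 a            2
#> 4 b           20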
Step 3
ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "lightblue") +
Initiates a ggplot box plot, mapping Variable to the x axis and Value to the y axis.
Step 4
facet_wrap( ~ Variable, scales = "free") +
Uses ggplot2’s facet_wrap function to wrap the plot into a series of panels, creating one plot per unique value of the Variable column. The argument scales = "free" lets each individual subplot (facet) have its own axis scales to accommodate different ranges.
Step 5
labs(title = "Box Plots of Numeric Variables") +
theme_minimal()
Adds a title and applies the minimal theme.
Results:
From this visualization, we can see that the variable time_spend_company contains outliers. Let’s create an individual box plot to “zoom” in on that variable:
# Box plot for time spent at company
ggplot(employee_survey, aes(x = "", y = time_spend_company)) +
geom_boxplot(fill = "lightblue") +
labs(
title = "Time Spent at Company",
x = NULL,
y = "Years at Company"
) +
theme_minimal()
To double check our outlier identification using box plots, let’s programmatically calculate and identify outliers.
To start, we define a function to identify outliers using the 1.5 × IQR (Interquartile Range) rule:
# Function to identify outliers
find_outliers <- function(column) {
# Finds the first (Q1) and third (Q3) quartiles of the column
Q1 <- quantile(column, 0.25, na.rm = TRUE)
Q3 <- quantile(column, 0.75, na.rm = TRUE)
# The IQR (Interquartile Range) is the middle 50% of the data
IQR_value <- Q3 - Q1
# Any value below the lower bound or above the upper bound is considered an outlier
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Filters the column and returns just the values that are outliers
return(column[column < lower_bound | column > upper_bound])
}
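As a quick illustration, the function can be applied to a single column; with time_spend_company it returns the individual values falling outside the 1.5 × IQR bounds:
# Example: outliers in a single column
time_outliers <- find_outliers(employee_survey$time_spend_company)
length(time_outliers) # how many outlying values there are
unique(time_outliers) # which distinct values they are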
Then we apply this function to our numeric, non-binary columns:
library(purrr) # provides map() and map_int()
library(tibble) # provides enframe()
# Count outliers per numeric variable
outlier_counts <- employee_survey %>%
select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID) %>%
map(find_outliers) %>%
map_int(length) %>%
enframe(name = "variable", value = "num_outliers")
Results:
> outlier_counts
# A tibble: 5 × 2
variable num_outliers
<chr> <int>
1 satisfaction_level 0
2 last_evaluation 0
3 number_project 0
4 average_montly_hours 0
5 time_spend_company 1282
Logically, the next step after identifying outliers is deciding whether it’s appropriate to filter them out. Making that decision is beyond the scope of this guide, but here’s the code if you do want to filter them out:
# Get outlier values for time_spend_company
outliers <- employee_survey %>%
pull(time_spend_company) %>%
find_outliers()
# Create copy of data frame with time_spend_company outliers removed
employee_survey_without_outliers <- employee_survey %>%
filter(!time_spend_company %in% outliers)
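If you do filter, a quick comparison of row counts confirms how many rows were dropped; it should line up with the 1282 outliers counted above:
# Number of rows removed by the filter
nrow(employee_survey) - nrow(employee_survey_without_outliers)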
Now that we have a broad understanding of the values for our individual variables, let’s see if there are any correlations amongst them.
To do this, we can calculate correlation values using R’s built-in cor function, and generate a heat map of the results using the corrplot package:
library(corrplot)
# Pearson’s Correlation Matrix
# Compute the Pearson correlation between numeric variables (excluding Emp.ID)
# Then visualize the upper triangle of the correlation matrix with a heatmap via corrplot()
employee_survey %>%
select(where(~ is.numeric(.) && n_distinct(.) > 2), -Emp.ID) %>%
cor(use = "complete.obs") %>%
corrplot(
method = "circle",
type = "upper",
addCoef.col = "black", # numeric correlation labels
number.cex = 0.7, # size of correlation numbers
tl.cex = 0.8, # axis label size
tl.col = "black", # axis label color
cl.cex = 0.8 # color legend (gradient bar) label size
)
Results:
Interpretation:
There’s a moderate positive correlation between:
number_project and average_montly_hours (r = 0.42)
Note: average_montly_hours is not a typo; that is how the column is labeled in the original data.
There’s a weak positive correlation between:
There’s a weak negative correlation between:
There’s no meaningful correlation between:
Let’s “zoom in” on the strongest correlation identified in the above step, number_project & average_montly_hours (r = 0.42), and visualize it via a scatter plot and linear regression line:
ggplot(employee_survey, aes(x = number_project, y = average_montly_hours)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE) +
labs(title = "Number of Projects & Average Monthly Hours")
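To confirm the exact coefficient for this pair on its own, you can compute it directly with cor; the result should be approximately 0.42, matching the heat map:
# Pearson correlation between number_project and average_montly_hours
cor(employee_survey$number_project, employee_survey$average_montly_hours)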
This guide helped you explore and visualize a dataset using some core data analysis techniques in R. If you’re ready to dig deeper, here are some next steps to consider: