
Exploring and Visualizing Data (R Simplified)

Video Notes

In my guide Import and Clean Data, we looked at the process for importing and cleaning up a practice data file on employee satisfaction.

In this guide, we will do some exploratory analysis to better understand this data. This will entail generating basic descriptive statistics and visualizations, identifying outliers, and finally looking for any correlations amongst our variables.

If you’re just joining me, you can get started with the following code that will import and process the data:

# On import, treat empty strings as NA
employee_survey <- read.csv('https://gist.githubusercontent.com/susanBuck/d69ac8413ce3c47bd25661a79eb44ffd/raw/e501c2c4471f3c38f68ff8003f4218f365d06f34/employee_survey.csv', na.strings = "")

# Convert categorical variables to Factors:
#  - dept: Departments are categories with no natural order.
#  - salary: Salary levels have a natural order (low < medium < high).
employee_survey$dept <- factor(employee_survey$dept)
employee_survey$salary <- factor(employee_survey$salary, levels=c("low", "medium", "high"), ordered = TRUE)

# This data contains 788 rows with missing values.
# There are no other NA issues in the rest of the file,
# so we’ll use na.omit to drop the rows with missing values.
employee_survey <- na.omit(employee_survey)
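Before moving on, it’s worth a quick sanity check that the cleaning worked. A small sketch, assuming the import code above has run (the expected row count comes from the skim summary later in this guide):

```r
# Confirm no NA values remain, and check the remaining row count
sum(is.na(employee_survey))  # expect 0 after na.omit()
nrow(employee_survey)        # expect 14999
```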

Summarize the Data

Step one in understanding our data is to summarize it. For this, we’ll use the skimr package, which offers a more detailed summary compared to base R functions like summary:

library(skimr)
library(dplyr)  # provides select() and the %>% pipe

employee_survey %>% select(-Emp.ID) %>% skim()

Output:

── Data Summary ────────────────────────
                           Values    
Name                       Piped data
Number of rows             14999     
Number of columns          9         
_______________________              
Column type frequency:               
  factor                   2         
  numeric                  7         
________________________             
Group variables            None      

── Variable type: factor ───────────────────────────────────────────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique top_counts                               
1 dept                  0             1 FALSE         10 sal: 4140, tec: 2720, sup: 2229, IT: 1227
2 salary                0             1 TRUE           3 low: 7316, med: 6446, hig: 1237          

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────
  skim_variable         n_missing complete_rate     mean     sd    p0    p25    p50    p75 p100 hist 
1 satisfaction_level            0             1   0.613   0.249  0.09   0.44   0.64   0.82    1 ▃▅▇▇▇
2 last_evaluation               0             1   0.716   0.171  0.36   0.56   0.72   0.87    1 ▂▇▆▇▇
3 number_project                0             1   3.80    1.23   2      3      4      5       7 ▇▆▃▂▁
4 average_montly_hours          0             1 201.     49.9   96    156    200    245     310 ▃▇▆▇▂
5 time_spend_company            0             1   3.50    1.46   2      3      3      4      10 ▇▃▁▁▁
6 Work_accident                 0             1   0.145   0.352  0      0      0      0       1 ▇▁▁▁▂
7 promotion_last_5years         0             1   0.0213  0.144  0      0      0      0       1 ▇▁▁▁▁

There are 9 columns total: 2 are factors and 7 are numeric.

There is no missing data (n_missing = 0 for all variables), which checks out, as part of our Import and Clean Data steps was to handle missing values.

Below are some notes and observations about the different variables...

dept (factor)

  skim_variable n_missing complete_rate ordered n_unique top_counts
1 dept                  0             1 FALSE         10 sal: 4140, tec: 2720, sup: 2229, IT: 1227

There are 10 departments. There is a large drop from the top count (sales: 4140) to the 4th-largest (IT: 1227), suggesting this variable is imbalanced. Let’s visualize the department frequency via a bar chart to dig deeper:

library(ggplot2)

employee_survey %>%
  ggplot(aes(x = dept)) +
  geom_bar(fill = "skyblue") +
  labs(
    title = "Distribution of Employees by Department",
    x = "Department",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Bar chart of departments

Based on this bar chart, there does in fact appear to be an imbalance. Something to keep in mind, as this might affect how we model department-level trends.

salary (factor)

  skim_variable n_missing complete_rate ordered n_unique top_counts
2 salary                0             1 TRUE           3 low: 7316, med: 6446, hig: 1237

Most employees earn low or medium salaries; the high-salary group is comparatively small (roughly 8% of employees).

Again, we can create a bar chart to visualize this:

employee_survey %>%
  ggplot(aes(x = salary)) +
  geom_bar(fill = "limegreen") +
  labs(
    title = "Distribution of Employees by Salary Level",
    x = "Salary",
    y = "Count"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Bar chart of salary levels

When analyzing differences by salary level, we should remember the group sizes are unequal.
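To put a number on that imbalance, we can compute the share of employees at each salary level with base R’s table and prop.table. A quick sketch; the proportions follow from the counts in the skim output above:

```r
# Share of employees at each salary level
round(prop.table(table(employee_survey$salary)), 2)
# low ≈ 0.49, medium ≈ 0.43, high ≈ 0.08
```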

Numeric Variables

── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────
  skim_variable         n_missing complete_rate     mean     sd    p0    p25    p50    p75 p100 hist 
1 satisfaction_level            0             1   0.613   0.249  0.09   0.44   0.64   0.82    1 ▃▅▇▇▇
2 last_evaluation               0             1   0.716   0.171  0.36   0.56   0.72   0.87    1 ▂▇▆▇▇
3 number_project                0             1   3.80    1.23   2      3      4      5       7 ▇▆▃▂▁
4 average_montly_hours          0             1 201.     49.9   96    156    200    245     310 ▃▇▆▇▂
5 time_spend_company            0             1   3.50    1.46   2      3      3      4      10 ▇▃▁▁▁
6 Work_accident                 0             1   0.145   0.352  0      0      0      0       1 ▇▁▁▁▂
7 promotion_last_5years         0             1   0.0213  0.144  0      0      0      0       1 ▇▁▁▁▁

For each of the numeric variables, we get the number of missing values (n_missing), the completeness rate, the mean and standard deviation, and a set of percentiles (p0 = minimum, p25, p50 = median, p75, p100 = maximum).

Helpfully, skimr also provides a mini histogram to give a broad sense of each variable’s distribution.

For example, we can see that satisfaction_level ▃▅▇▇▇ is broadly left-skewed with a long tail to the left, indicating many values are high and a few are low.

Digging deeper into this, we can plot a more detailed histogram of satisfaction_level:

employee_survey %>% 
  ggplot(aes(x = satisfaction_level)) +
  geom_histogram(bins = 30, fill = "turquoise") +  # bins = 30 (the default) silences the bin-count message
  labs(
    title = "Distribution of Satisfaction Level",
    x = "satisfaction_level",
    y = "Count"
  ) +
  theme_minimal()
Histogram for satisfaction level

This reveals the shape of our data in more detail, and we can see it’s bimodal, with two main peaks: one in the low satisfaction range (~0.1) and one in the higher range (~0.7 to 0.8).
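To double-check that bimodal shape without the influence of histogram bin widths, a kernel density plot is another option. A sketch using ggplot2’s geom_density, with the same column as above:

```r
employee_survey %>%
  ggplot(aes(x = satisfaction_level)) +
  geom_density(fill = "turquoise", alpha = 0.5) +
  labs(
    title = "Density of Satisfaction Level",
    x = "satisfaction_level",
    y = "Density"
  ) +
  theme_minimal()
```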

Visualize Outliers

Before we go any further, we should identify any outliers in our data.

To do this visually, let’s create a box plot for each numerical variable. This will visualize the median, quartiles, whiskers, and any outliers.

We will be able to present multiple box plots at once using ggplot2’s faceting feature. Faceting creates tables of graphics by splitting the data into subsets and displaying the same graph for each subset.

library(tidyr)  # provides pivot_longer()

employee_survey %>%
  select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID) %>%
  pivot_longer(cols = everything(),
               names_to = "Variable",
               values_to = "Value") %>%
  ggplot(aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "lightblue") +
  facet_wrap( ~ Variable, scales = "free") + 
  labs(title = "Box Plots of Numeric Variables") +
  theme_minimal()

Here’s a breakdown of each step:

Step 1

select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID)

Selects all numeric columns with more than 2 unique values (to exclude the binary columns promotion_last_5years and Work_accident). Also excludes Emp.ID.

Step 2

pivot_longer(cols = everything(),
             names_to = "Variable",
             values_to = "Value")

Uses pivot_longer to reshape the data from wide to long format, which is the format ggplot2’s facet_wrap function needs. That is, it takes multiple columns and stacks them into two columns:

  1. Variable -> the original column name (e.g. satisfaction_level, last_evaluation, number_project, etc.)
  2. Value -> the actual numeric values from those columns

Step 3

ggplot(aes(x = Variable, y = Value)) +
geom_boxplot(fill = "lightblue") +

Initiates a ggplot box plot, mapping Variable to the x axis and Value to the y axis.

Step 4

facet_wrap( ~ Variable, scales = "free") + 

Uses ggplot2’s facet_wrap function to wrap the plot into a series of panels, creating one plot per unique value of the Variable column. The argument scales = "free" gives each individual subplot (facet) its own axis scales to accommodate the different value ranges.

Step 5

labs(title = "Box Plots of Numeric Variables") +
  theme_minimal()

Adds a title and applies the minimal theme.

Results:

Box plots of numeric variables

From this visualization, we can see that the variable time_spend_company contains outliers. Let’s create an individual box plot to “zoom” in on that variable:

# Box plot for time spent at company
ggplot(employee_survey, aes(x = "", y = time_spend_company)) +
  geom_boxplot(fill = "lightblue") +
  labs(
    title = "Time Spent at Company",
    x = NULL,
    y = "Years at Company"
  ) +
  theme_minimal()

Programmatically Identify Outliers

To double check our outlier identification using box plots, let’s programmatically calculate and identify outliers.

To start, we define a function to identify outliers using the 1.5 × IQR (Interquartile Range) rule:

# Function to identify outliers
find_outliers <- function(column) {
  # Finds the first (Q1) and third (Q3) quartiles of the column
  Q1 <- quantile(column, 0.25, na.rm = TRUE)
  Q3 <- quantile(column, 0.75, na.rm = TRUE)

  # The IQR (Interquartile Range) is the middle 50% of the data
  IQR_value <- Q3 - Q1 

  # Any value below the lower bound or above the upper bound is considered an outlier
  lower_bound <- Q1 - 1.5 * IQR_value
  upper_bound <- Q3 + 1.5 * IQR_value

  # Filters the column and returns just the values that are outliers
  return(column[column < lower_bound | column > upper_bound])
}
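Before applying it to the real data, we can try find_outliers on a small made-up vector where the answer is easy to verify by hand: for c(1:10, 100), Q1 = 3.5 and Q3 = 8.5, so the bounds are -4 and 16 and only 100 falls outside them.

```r
find_outliers(c(1:10, 100))
# [1] 100
```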

Then we apply this function to our numeric, non-binary columns:

library(purrr)   # provides map() and map_int()
library(tibble)  # provides enframe()

# Count outliers per numeric variable
outlier_counts <- employee_survey %>%
  select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID) %>%
  map(find_outliers) %>%
  map_int(length) %>%
  enframe(name = "variable", value = "num_outliers")

Results:

> outlier_counts
# A tibble: 5 × 2
  variable             num_outliers
  <chr>                       <int>
1 satisfaction_level              0
2 last_evaluation                 0
3 number_project                  0
4 average_montly_hours            0
5 time_spend_company           1282

Removing Outliers

Logically, the next step after identifying outliers is deciding whether it’s appropriate to filter them out. Making that decision is beyond the scope of this guide, but here’s the code if you do want to remove them:

# Get outlier values for time_spend_company
outliers <- employee_survey %>%
  pull(time_spend_company) %>%
  find_outliers()

# Create copy of data frame with time_spend_company outliers removed
employee_survey_without_outliers <- employee_survey %>%
  filter(!time_spend_company %in% outliers)
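As a check, the number of rows removed should match the outlier count computed earlier (1282 for time_spend_company). A quick sketch, assuming both data frames above exist:

```r
# Rows dropped by the filter; should equal the 1282 outliers found above
nrow(employee_survey) - nrow(employee_survey_without_outliers)
```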

Correlations

Now that we have a broad understanding of the values for our individual variables, let’s see if there are any correlations amongst them.

To do this, we can calculate correlation values using R’s built in cor method, and generate a heat map of the results using the corrplot package:

library(corrplot)

# Pearson’s Correlation Matrix
# Compute the Pearson correlation between numeric variables (excluding Emp.ID)
# Then visualize the upper triangle of the correlation matrix with a heatmap via corrplot()
employee_survey %>%
  select(where(~ is.numeric(.x) && n_distinct(.x) > 2), -Emp.ID) %>%
  cor(use = "complete.obs") %>%
  corrplot(
    method = "circle",
    type = "upper",
    addCoef.col = "black",  # numeric correlation labels
    number.cex = 0.7,       # size of correlation numbers
    tl.cex = 0.8,           # axis label size
    tl.col = "black",       # axis label color
    cl.cex = 0.8            # color legend (gradient bar) label size
  )

Results: Correlation matrix showing relationships between satisfaction_level, last_evaluation, number_project, average_monthly_hours, and time_spend_company. Strongest positive correlation is between number_project and average_monthly_hours (r = 0.42). Satisfaction_level has weak negative correlations with number_project (r = -0.14) and time_spend_company (r = -0.10).

Interpretation:

There’s a moderate positive correlation between number_project and average_montly_hours (r = 0.42): employees assigned more projects tend to log more hours.

Note: average_montly_hours is not a typo; that is how the column is labeled in the original data.

There are weak negative correlations between satisfaction_level and number_project (r = -0.14), and between satisfaction_level and time_spend_company (r = -0.10).

The remaining variable pairs show no meaningful correlation.
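If you want to read a single coefficient directly rather than off the heat map, cor() also works on a pair of vectors. A quick sketch; per the matrix above, this pair should come out at around 0.42:

```r
# Correlation for one pair of variables
cor(employee_survey$number_project, employee_survey$average_montly_hours)
# ≈ 0.42, matching the heat map
```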

Visualize linear regression

Let’s “zoom in” on the strongest correlation identified in the above step, number_project & average_montly_hours (r = 0.42), and visualize it via a scatter plot and linear regression line:

ggplot(employee_survey, aes(x = number_project, y = average_montly_hours)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(title = "Number of Projects & Average Monthly Hours")
Scatter plot of average_montly_hours and number_project

What next?

This guide helped you explore and visualize a dataset using some core data analysis techniques in R. If you’re ready to dig deeper, natural next steps include modeling satisfaction_level as a function of the other variables, testing whether the differences across dept and salary groups are statistically significant, and deciding how to handle the time_spend_company outliers identified above.
