Video Notes
In this guide, we’ll work through the process of importing and preparing data with R in RStudio. By the end, you’ll have a cleaned dataset ready for analysis.
In subsequent guides, we’ll look at next steps with this data, including analysis and visualization.
Below is a summary of key points from this guide, including the complete code used.
Example data
The example data set comes from Kaggle.com and can be found here: Employee Satisfaction Survey Data.
➡️ Here’s a direct link to the CSV data file from this demo....
From this page, choose Download ZIP, then extract the contents on your computer. Within its contents is the data file we’ll use, employee_survey.csv.
Setup RStudio Project
- In RStudio, create a new project called employee-satisfaction; you can place this directory in any location you’d like on your computer. In my example, I’ll place it on my Desktop directory.
- Within the resulting folder, create two subdirectories: data and scripts.
- Move the employee_survey.csv file into the data subdirectory.
- Create a new file called start.R in the scripts subdirectory.
- In the console, run the command getwd() to check your current working directory and confirm it’s set to the employee-satisfaction directory created for your project. For example, my directory is /Users/Susan/Desktop/employee-satisfaction.
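Before importing, it can help to confirm that the project layout matches the steps above. The following is a minimal sketch, not part of the original script: check_project_layout is a hypothetical helper name, and it assumes the folder and file names used in this guide.

```r
# Hypothetical helper (not part of the original script): report which of the
# expected project folders and files exist under `root`.
check_project_layout <- function(root = getwd()) {
  paths <- c(
    data_dir    = file.path(root, "data"),
    scripts_dir = file.path(root, "scripts"),
    csv_file    = file.path(root, "data", "employee_survey.csv"),
    script_file = file.path(root, "scripts", "start.R")
  )
  # file.exists() returns TRUE for both files and directories;
  # vapply() keeps the names so the result is easy to read
  vapply(paths, file.exists, logical(1))
}

# Run from the console with the project open; every entry should be TRUE
check_project_layout()
```

Any FALSE entry points to a step that was skipped, or to a working directory that is not the project root.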
# Import our data file; treat empty strings as NA
employee_survey <- read.csv('data/employee_survey.csv', na.strings = "")
# Inspect the structure of the data frame created from the above step
str(employee_survey)
# See how many unique values we have in `dept` and `salary`, our two chr columns
unique(employee_survey$dept)
unique(employee_survey$salary)
# Because there’s a limited set of character strings being used repeatedly
# throughout `dept` and `salary`, it makes sense to convert them to Factors,
# which are used in R to encode categorical variables.
employee_survey$dept <- factor(employee_survey$dept)
# `salary` has a natural order of "low", "medium", "high" so we'll encode that as part of our Factor conversion
employee_survey$salary <- factor(employee_survey$salary, levels=c("low", "medium", "high"), ordered = TRUE)
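# Footnote: Instead of doing this chr → factor conversion individually on these columns, we could have added the argument stringsAsFactors = TRUE to the read.csv function above. That would create unordered factors with alphabetically sorted levels, so `salary` would still need its order set explicitly.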
# After completing the above chr → factor conversion, re-run the structure function to check the results
str(employee_survey)
# Next, let's handle NA values in our data...
# This gives us a count of NA values per column, showing most of our columns
# have 788 rows of "NA" data
colSums(is.na(employee_survey))
# Let's narrow down where these "NA" values are...
# The following code generates a logical matrix of TRUE/FALSE values that
# corresponds to each field in our data where:
# TRUE = NA value present
# FALSE = NA value *not* present
View(is.na(employee_survey))
# Using the logical matrix generated by is.na, we can pull out any rows where
# there was at least 1 NA value (because the count of TRUEs will be greater than 0)
View(employee_survey[rowSums(is.na(employee_survey)) > 0,])
# Inspecting the above output, and cross checking against our raw .csv file, we
# observed there were 788 lines of blank data in our file that looked like this:
# ,,,,,,,,,
# ,,,,,,,,,
# ,,,,,,,,,
# ,,,,,,,,,
# ,,,,,,,,,
# ,,,,,,,,,
# Use na.omit to remove any lines with an "NA" value, cleaning up those missing rows:
employee_survey <- na.omit(employee_survey)
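To see the NA-handling steps on a small scale, here is a sketch that uses a toy CSV (made-up values, not the survey data) containing two blank lines like the ones described above. It checks that the rows flagged by rowSums(is.na(...)) are entirely blank before na.omit drops them. The names csv_lines, toy, and toy_clean are illustrative only.

```r
# Toy CSV content (hypothetical values) with two blank ",," lines
csv_lines <- c(
  "id,dept,salary",
  "1,sales,low",
  ",,",
  "3,hr,high",
  ",,"
)

# Read from text instead of a file; treat empty strings as NA
toy <- read.csv(text = csv_lines, na.strings = "")

# Count of NA values per column
colSums(is.na(toy))

# Number of NA values in each row
na_per_row <- rowSums(is.na(toy))

# Rows that are entirely blank have as many NAs as there are columns
fully_blank <- na_per_row == ncol(toy)
sum(fully_blank)

# Rows with some, but not all, values missing; na.omit would drop these too
partially_missing <- na_per_row > 0 & !fully_blank
sum(partially_missing)

# Drop every row containing at least one NA
toy_clean <- na.omit(toy)
nrow(toy_clean)
```

Running the same fully_blank and partially_missing checks on employee_survey before calling na.omit is one way to confirm that only the blank rows are being removed.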
Final tidied code
Below is the final, optimized version of our script. We’ve removed exploratory steps and kept only the essential data cleaning operations.
# On import, treat empty strings as NA
employee_survey <- read.csv('data/employee_survey.csv', na.strings = "")
# Convert categorical variables to Factors:
# - dept: Departments are categories with no natural order.
# - salary: Salary levels have a natural order (low < medium < high).
employee_survey$dept <- factor(employee_survey$dept)
employee_survey$salary <- factor(employee_survey$salary, levels=c("low", "medium", "high"), ordered = TRUE)
# This data contains 788 missing rows.
# There are no other NA values through the rest of the file,
# so we’ll use na.omit to clean up the missing rows.
employee_survey <- na.omit(employee_survey)
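The practical effect of ordered = TRUE is easiest to see on a small vector. This sketch uses made-up salary values rather than the survey file, and the names toy_salary, unordered, and ordered_salary are illustrative only.

```r
# Toy salary values (hypothetical, not taken from the survey file)
toy_salary <- c("medium", "low", "high", "low")

# Unordered factor: levels default to sorted (alphabetical) order
unordered <- factor(toy_salary)
levels(unordered)

# Ordered factor with explicit levels, as in the script above
ordered_salary <- factor(toy_salary,
                         levels = c("low", "medium", "high"),
                         ordered = TRUE)
levels(ordered_salary)

# Comparisons such as >= are only meaningful for ordered factors
ordered_salary >= "medium"

# Sorting follows the level order rather than the alphabet
sort(ordered_salary)

# table() reports counts in level order
table(ordered_salary)
```

With the ordered version, filters such as employee_survey$salary >= "medium" follow the low < medium < high ordering instead of alphabetical order.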
Summary of R functions used
- read.csv - Reads a CSV file into an R data frame
- getwd - Returns the current working directory
- str - Displays the structure of an R object
- unique - Returns distinct values from a vector, data frame, or array
- factor - Encodes a vector as a factor (categorical variable)
- c - Combines values into a vector
- is.na - Returns a logical vector or matrix that is TRUE for NA values and FALSE otherwise
- colSums - Returns the sum of each column of a data frame or matrix
- na.omit - Removes rows with any NA values from a data frame