Introduction to tidyverse

One of the most influential and widely used packages in R is tidyverse. It bundles several other packages that are key for data manipulation, e.g. dplyr, ggplot2, stringr, readr, and tidyr. Installing tidyverse is essentially a shortcut for installing all of these packages at once.

install.packages("tidyverse")
library(tidyverse)

The tidyverse bundle offers an alternative to base R syntax for working with data. Both approaches have strengths and weaknesses. This session was originally written with a focus on tidyverse. Since you will primarily learn base R in QM, I will focus on teaching you base R as well, except in cases where tidyverse is much more efficient. I will leave the tidyverse code in the file for you to check out later. In the knitted notebook, you can easily switch between tabs showing base R and tidyverse syntax.

Data processing

Before we can work with data, we need to load it into R. Data comes in many file types; the most common ones, especially in social science research, are listed below. Tidyverse includes packages for several of them, e.g. readxl for Excel files and haven for Stata files.

# Option 1: readxl (part of the tidyverse)
install.packages("readxl")    # install pkg once
library(readxl)               # load pkg every time you open R(-project)
df <- read_excel("data.xlsx") # load data

# Option 2: xlsx
install.packages("xlsx")         # install pkg once
library(xlsx)                    # load pkg every time you open R(-project)
df <- read.xlsx("data.xlsx", 1)  # load first sheet of Excel file; read.xlsx2() is faster for large data

File types

File type | File extension | Command
Excel | .xlsx, .xls | readxl::read_excel(), xlsx::read.xlsx()
CSV | .csv | read.csv() (German CSV: read.csv2())
Stata | .dta | haven::read_dta()
R data | .RData, .rds | load() for .RData, readRDS() for .rds
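The two R-native formats behave differently when read back in. The following is a minimal sketch using only base R; the data frame and the temporary file paths are made up for illustration.

```r
# Example data (hypothetical values, for illustration only)
df <- data.frame(id = 1:3, score = c(2.5, 3.0, 4.5))

# .rds stores a single object; readRDS() returns it, so you pick the name
rds_path <- tempfile(fileext = ".rds")
saveRDS(df, rds_path)
df_from_rds <- readRDS(rds_path)

# .RData stores one or more objects together with their names;
# load() restores them into the workspace under those original names
rdata_path <- tempfile(fileext = ".RData")
save(df, file = rdata_path)
rm(df)            # remove the object to show that load() brings it back
load(rdata_path)  # recreates an object called "df"

all.equal(df, df_from_rds)
```

In short, readRDS() needs an assignment (`<-`), whereas load() does not.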

CSV is short for comma-separated values. By default, read.csv() treats commas "," as separators between data entries, whereas read.csv2() treats semicolons ";" as separators and commas as decimal marks. An alternative to read.csv2() is to declare the separator explicitly using read.csv(file, sep = ";"). For further information, consult the documentation in R using the command ?read.csv.
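A small sketch of the separator behaviour, using base R only. The file is a temporary one written on the fly, and its contents are invented for illustration.

```r
# Write a small semicolon-separated file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("id;score", "1;10", "2;20"), tmp)

# Both calls split the entries at the semicolons
df_csv2 <- read.csv2(tmp)
df_sep  <- read.csv(tmp, sep = ";")

# Without sep = ";", read.csv() looks for commas, finds none,
# and puts each whole line into a single column
df_wrong <- read.csv(tmp)

ncol(df_csv2)   # two columns: id and score
ncol(df_wrong)  # one column
```

If a freshly loaded data set has only one column, a wrong separator is a likely cause.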

Hint: The prefix readxl:: before read_excel() names the namespace the command comes from, i.e. the package (readxl) that contains the function read_excel().
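The same notation works for any installed package, and it lets you call a function without loading the package via library() first. As a sketch, the example below uses the tools package, which ships with R; the file name is a made-up example and does not need to exist.

```r
# tools is installed with R but not attached by default,
# so file_ext() is only reachable through the tools:: prefix
ext <- tools::file_ext("data.xlsx")
ext
```

The prefix is also useful when two loaded packages define a function with the same name, because it makes explicit which one is meant.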

If you use APIs or databases, you may also work with .json files, which allow for more complex structures (including nesting) than plain text stored as .csv.

Pipelines

There are two distinct ways to execute successive commands on the same object: the original base R logic and pipelines.

Base R method

The original base R method is to place the command to be executed first in the center and then wrap the other commands around it, like the layers of an onion. The code is read in the opposite order to that in which the functions are applied: mean() is written last but applied first, and sqrt() is written first but applied last.

q <- c(6,3,8)
sqrt(exp(mean(q)))  #first you get the mean, then exponential, then sqrt

Pipes

There are two types of pipes: base R pipes, |>, and magrittr pipes, %>%. Pipes in R were originally introduced by the magrittr package (included in dplyr and thus in tidyverse too). In version 4.1.0, R introduced its own native pipe. With pipes, the code is written (and read) in its order of execution: mean() is executed first, followed by exp(), and sqrt() is executed last.

# magrittr pipe (available after library(tidyverse))
q %>%         # send q into the mean function
  mean() %>%  # send the result of the mean into the exp function
  exp()  %>%  # send the result into the sqrt function
  sqrt()

# native pipe (R >= 4.1.0)
q |>          # send q into the mean function
  mean() |>   # send the result of the mean into the exp function
  exp()  |>   # send the result into the sqrt function
  sqrt()

Whether to use the original R coding logic or pipes is mostly a matter of taste. Base R code without pipes becomes harder to read the more functions are nested inside one another.

As a rule of thumb, tidyverse functions are better suited for piping (with %>%). Base R pipes are slightly faster (more efficient) than magrittr pipes. In practice, the magrittr pipes included in tidyverse are popular and commonly used, whereas base R pipes have so far seen less adoption.

When piping, the object or value on the left-hand side (in our example, q) is passed as the first argument to the function on the right-hand side (here, mean()) by default. The resulting value or object is then passed through the next pipe to the next function (here, exp()), and so on. To change this default behaviour, use a placeholder: _ for |> (available since R 4.2.0, and only for named arguments) and . for %>%.
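A minimal sketch of the default behaviour and of the placeholder, using only the native pipe so that no package is needed. It assumes R 4.2.0 or newer for the _ placeholder; the variable names are chosen for illustration.

```r
q <- c(6, 3, 8)

# Default: the left-hand side becomes the first argument,
# so the piped and the nested version compute the same value
nested <- sqrt(exp(mean(q)))
piped  <- q |> mean() |> exp() |> sqrt()
all.equal(nested, piped)

# Placeholder: send the left-hand side to a different, named argument.
# Here 2 is passed to digits instead of x, i.e. round(x = pi, digits = 2)
digits <- 2
rounded <- digits |> round(x = pi, digits = _)
rounded

# With magrittr loaded, the equivalent would be:
# digits %>% round(pi, digits = .)
```

Without the placeholder, `digits |> round(pi)` would instead be evaluated as round(2, pi), which rounds the number 2 rather than pi.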

Cheatsheets

Below you find a selection of cheatsheets specifically for tidyverse:
