In this post, we show two ways of converting datasets between long and wide formats: (1) using functions spread()
and gather()
from package tidyr
; (2) using functions dcast()
and melt()
from package data.table
.
tidyr
library(tidyr)
library(dplyr)
Dataset long
is currently in the long format. We will reshape it to the wide format with spread()
.
long
## # A tibble: 8 x 5
## id semester course gpa participation
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 Fall2019 A 3.81 1
## 2 1 Fall2019 B 3.82 2
## 3 1 Spring2019 A 3.76 3
## 4 1 Spring2019 B 3.77 4
## 5 2 Fall2019 A 3.56 4
## 6 2 Fall2019 B 3.55 3
## 7 2 Spring2019 A 3.45 2
## 8 2 Spring2019 B 3.47 1
spread()
converts data from long to wide, spreading a key-value pair across multiple columns. The trick is to set the right key
variable and the value
variable.
Below we add two new variables that are gpa
of course
A and B. Here, key
is course
and value
is from gpa
.
wide1 <- long %>% spread(key = course, value = gpa)
wide1
## # A tibble: 8 x 5
## id semester participation A B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
We can rename the two new columns if we want to.
names(wide1)[4:5] <- c("gpa_A", "gpa_B")
wide1
## # A tibble: 8 x 5
## id semester participation gpa_A gpa_B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
Recently, tidyr
has updated spread()
to pivot_wider()
, which provides more intuitive syntax and more options. spread()
is not going away but will not be under active development.
wide2 <- long %>% pivot_wider(names_from = semester, values_from = participation)
names(wide2)[4:5] <- c("participation_Fall2019", "participation_Spring2019")
wide2
## # A tibble: 8 x 5
## id course gpa participation_Fall2019 participation_Spring2019
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 A 3.81 1 NA
## 2 1 B 3.82 2 NA
## 3 1 A 3.76 NA 3
## 4 1 B 3.77 NA 4
## 5 2 A 3.56 4 NA
## 6 2 B 3.55 3 NA
## 7 2 A 3.45 NA 2
## 8 2 B 3.47 NA 1
If there are multiple variables whose values will be paired with the key variable, we can specify them with c()
in the values_from
argument.
wide3 <- long %>% pivot_wider(names_from = course, values_from = c(gpa, participation))
wide3
## # A tibble: 4 x 6
## id semester gpa_A gpa_B participation_A participation_B
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Fall2019 3.81 3.82 1 2
## 2 1 Spring2019 3.76 3.77 3 4
## 3 2 Fall2019 3.56 3.55 4 3
## 4 2 Spring2019 3.45 3.47 2 1
gather()
converts data from wide to long format for demo.
We will use the dataset wide1
in the wide format generated above.
wide1
## # A tibble: 8 x 5
## id semester participation gpa_A gpa_B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
Below we convert the dataset wide1
back to the long format with gather()
. Here the key
variable to be created is course
, and the value
variable to be created is gpa
. We remove the rows with NA
in the value column.
wide1 %>% gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T)
## # A tibble: 8 x 5
## id semester participation course gpa
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Fall2019 1 gpa_A 3.81
## 2 1 Spring2019 3 gpa_A 3.76
## 3 2 Fall2019 4 gpa_A 3.56
## 4 2 Spring2019 2 gpa_A 3.45
## 5 1 Fall2019 2 gpa_B 3.82
## 6 1 Spring2019 4 gpa_B 3.77
## 7 2 Fall2019 3 gpa_B 3.55
## 8 2 Spring2019 1 gpa_B 3.47
A bit more postprocessing on variable course
to separate the course names A
and B
from the prefix gpa_
.
wide1 %>%
gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T) %>%
separate(col = course, into = c("gpa_", "course"), sep = 4) %>%
select(-gpa_)
## # A tibble: 8 x 5
## id semester participation course gpa
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Fall2019 1 A 3.81
## 2 1 Spring2019 3 A 3.76
## 3 2 Fall2019 4 A 3.56
## 4 2 Spring2019 2 A 3.45
## 5 1 Fall2019 2 B 3.82
## 6 1 Spring2019 4 B 3.77
## 7 2 Fall2019 3 B 3.55
## 8 2 Spring2019 1 B 3.47
Recently, tidyr
has updated gather()
to pivot_longer()
. Like pivot_wider()
, it provides more intuitive syntax and more options. gather()
is not going away but will not be under active development.
wide2
## # A tibble: 8 x 5
## id course gpa participation_Fall2019 participation_Spring2019
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 A 3.81 1 NA
## 2 1 B 3.82 2 NA
## 3 1 A 3.76 NA 3
## 4 1 B 3.77 NA 4
## 5 2 A 3.56 4 NA
## 6 2 B 3.55 3 NA
## 7 2 A 3.45 NA 2
## 8 2 B 3.47 NA 1
Using pivot_longer()
, reshaping and postprocessing of column values can be done within itself. With arguments of names_prefix
, names_sep
or names_pattern
, we no longer need separate()
or extract()
.
wide2 %>%
pivot_longer(names_to = "semester",
values_to = "participation",
cols = participation_Fall2019 : participation_Spring2019,
values_drop_na = T,
names_prefix = "participation_")
## # A tibble: 8 x 5
## id course gpa semester participation
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 A 3.81 Fall2019 1
## 2 1 B 3.82 Fall2019 2
## 3 1 A 3.76 Spring2019 3
## 4 1 B 3.77 Spring2019 4
## 5 2 A 3.56 Fall2019 4
## 6 2 B 3.55 Fall2019 3
## 7 2 A 3.45 Spring2019 2
## 8 2 B 3.47 Spring2019 1
data.table
Package data.table
also offers ways to convert the data between long and wide format.
library(data.table)
dcast()
reshapes datasets from long to wide formats.
What we do below is the same as long %>% spread(key = course, value = gpa)
.
dcast(long, id + semester + participation ~ course, value.var = "gpa")
## id semester participation A B
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
melt()
reshapes data from wide to long. Instead of working on a key-value pair, it uses id
and measure
.
The example below is the same as wide1 %>% gather(key = course, value = gpa, -id, -semester, -participation)
.
melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"))
## id semester participation variable value
## 1 1 Fall2019 1 gpa_A 3.81
## 2 1 Fall2019 2 gpa_A NA
## 3 1 Spring2019 3 gpa_A 3.76
## 4 1 Spring2019 4 gpa_A NA
## 5 2 Fall2019 3 gpa_A NA
## 6 2 Fall2019 4 gpa_A 3.56
## 7 2 Spring2019 1 gpa_A NA
## 8 2 Spring2019 2 gpa_A 3.45
## 9 1 Fall2019 1 gpa_B NA
## 10 1 Fall2019 2 gpa_B 3.82
## 11 1 Spring2019 3 gpa_B NA
## 12 1 Spring2019 4 gpa_B 3.77
## 13 2 Fall2019 3 gpa_B 3.55
## 14 2 Fall2019 4 gpa_B NA
## 15 2 Spring2019 1 gpa_B 3.47
## 16 2 Spring2019 2 gpa_B NA
We can also remove the rows with NA
by setting na.rm = TRUE
.
melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"), na.rm = TRUE)
## id semester participation variable value
## 1 1 Fall2019 1 gpa_A 3.81
## 3 1 Spring2019 3 gpa_A 3.76
## 6 2 Fall2019 4 gpa_A 3.56
## 8 2 Spring2019 2 gpa_A 3.45
## 10 1 Fall2019 2 gpa_B 3.82
## 12 1 Spring2019 4 gpa_B 3.77
## 13 2 Fall2019 3 gpa_B 3.55
## 15 2 Spring2019 1 gpa_B 3.47