In this post, we show two ways of converting datasets between long and wide formats: (1) using functions spread() and gather() from package tidyr; (2) using functions dcast() and melt() from package data.table.
tidyrlibrary(tidyr)
library(dplyr)
Dataset long is currently in the long format. We will reshape it to the wide format with spread().
long
## # A tibble: 8 x 5
## id semester course gpa participation
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 Fall2019 A 3.81 1
## 2 1 Fall2019 B 3.82 2
## 3 1 Spring2019 A 3.76 3
## 4 1 Spring2019 B 3.77 4
## 5 2 Fall2019 A 3.56 4
## 6 2 Fall2019 B 3.55 3
## 7 2 Spring2019 A 3.45 2
## 8 2 Spring2019 B 3.47 1
spread() converts data from long to wide, spreading a key-value pair across multiple columns. The trick is to set the right key variable and the value variable.
Below we add two new variables that are gpa of course A and B. Here, key is course and value is from gpa.
wide1 <- long %>% spread(key = course, value = gpa)
wide1
## # A tibble: 8 x 5
## id semester participation A B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
We can rename the two new columns if we want to.
names(wide1)[4:5] <- c("gpa_A", "gpa_B")
wide1
## # A tibble: 8 x 5
## id semester participation gpa_A gpa_B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
Recently, tidyr has updated spread() to pivot_wider(), which provides more intuitive syntax and more options. spread() is not going away but will not be under active development.
wide2 <- long %>% pivot_wider(names_from = semester, values_from = participation)
names(wide2)[4:5] <- c("participation_Fall2019", "participation_Spring2019")
wide2
## # A tibble: 8 x 5
## id course gpa participation_Fall2019 participation_Spring2019
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 A 3.81 1 NA
## 2 1 B 3.82 2 NA
## 3 1 A 3.76 NA 3
## 4 1 B 3.77 NA 4
## 5 2 A 3.56 4 NA
## 6 2 B 3.55 3 NA
## 7 2 A 3.45 NA 2
## 8 2 B 3.47 NA 1
If there are multiple variables whose values will be paired with the key variable, we can specify them with c() in the values_from argument.
wide3 <- long %>% pivot_wider(names_from = course, values_from = c(gpa, participation))
wide3
## # A tibble: 4 x 6
## id semester gpa_A gpa_B participation_A participation_B
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 Fall2019 3.81 3.82 1 2
## 2 1 Spring2019 3.76 3.77 3 4
## 3 2 Fall2019 3.56 3.55 4 3
## 4 2 Spring2019 3.45 3.47 2 1
gather() converts data from wide to long format for demo.
We will use the dataset wide1 in the wide format generated above.
wide1
## # A tibble: 8 x 5
## id semester participation gpa_A gpa_B
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
Below we convert the dataset wide1 back to the long format with gather(). Here the key variable to be created is course, and the value variable to be created is gpa. We remove the rows with NA in the value column.
wide1 %>% gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T)
## # A tibble: 8 x 5
## id semester participation course gpa
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Fall2019 1 gpa_A 3.81
## 2 1 Spring2019 3 gpa_A 3.76
## 3 2 Fall2019 4 gpa_A 3.56
## 4 2 Spring2019 2 gpa_A 3.45
## 5 1 Fall2019 2 gpa_B 3.82
## 6 1 Spring2019 4 gpa_B 3.77
## 7 2 Fall2019 3 gpa_B 3.55
## 8 2 Spring2019 1 gpa_B 3.47
A bit more postprocessing on variable course to separate the course names A and B from the prefix gpa_.
wide1 %>%
gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T) %>%
separate(col = course, into = c("gpa_", "course"), sep = 4) %>%
select(-gpa_)
## # A tibble: 8 x 5
## id semester participation course gpa
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 Fall2019 1 A 3.81
## 2 1 Spring2019 3 A 3.76
## 3 2 Fall2019 4 A 3.56
## 4 2 Spring2019 2 A 3.45
## 5 1 Fall2019 2 B 3.82
## 6 1 Spring2019 4 B 3.77
## 7 2 Fall2019 3 B 3.55
## 8 2 Spring2019 1 B 3.47
Recently, tidyr has updated gather() to pivot_longer(). Like pivot_wider(), it provides more intuitive syntax and more options. gather() is not going away but will not be under active development.
wide2
## # A tibble: 8 x 5
## id course gpa participation_Fall2019 participation_Spring2019
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 A 3.81 1 NA
## 2 1 B 3.82 2 NA
## 3 1 A 3.76 NA 3
## 4 1 B 3.77 NA 4
## 5 2 A 3.56 4 NA
## 6 2 B 3.55 3 NA
## 7 2 A 3.45 NA 2
## 8 2 B 3.47 NA 1
Using pivot_longer(), reshaping and postprocessing of column values can be done within itself. With arguments of names_prefix, names_sep or names_pattern, we no longer need separate() or extract().
wide2 %>%
pivot_longer(names_to = "semester",
values_to = "participation",
cols = participation_Fall2019 : participation_Spring2019,
values_drop_na = T,
names_prefix = "participation_")
## # A tibble: 8 x 5
## id course gpa semester participation
## <dbl> <chr> <dbl> <chr> <dbl>
## 1 1 A 3.81 Fall2019 1
## 2 1 B 3.82 Fall2019 2
## 3 1 A 3.76 Spring2019 3
## 4 1 B 3.77 Spring2019 4
## 5 2 A 3.56 Fall2019 4
## 6 2 B 3.55 Fall2019 3
## 7 2 A 3.45 Spring2019 2
## 8 2 B 3.47 Spring2019 1
data.tablePackage data.table also offers ways to convert the data between long and wide format.
library(data.table)
dcast() reshapes datasets from long to wide formats.
What we do below is the same as long %>% spread(key = course, value = gpa).
dcast(long, id + semester + participation ~ course, value.var = "gpa")
## id semester participation A B
## 1 1 Fall2019 1 3.81 NA
## 2 1 Fall2019 2 NA 3.82
## 3 1 Spring2019 3 3.76 NA
## 4 1 Spring2019 4 NA 3.77
## 5 2 Fall2019 3 NA 3.55
## 6 2 Fall2019 4 3.56 NA
## 7 2 Spring2019 1 NA 3.47
## 8 2 Spring2019 2 3.45 NA
melt() reshapes data from wide to long. Instead of working on a key-value pair, it uses id and measure.
The example below is the same as wide1 %>% gather(key = course, value = gpa, -id, -semester, -participation).
melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"))
## id semester participation variable value
## 1 1 Fall2019 1 gpa_A 3.81
## 2 1 Fall2019 2 gpa_A NA
## 3 1 Spring2019 3 gpa_A 3.76
## 4 1 Spring2019 4 gpa_A NA
## 5 2 Fall2019 3 gpa_A NA
## 6 2 Fall2019 4 gpa_A 3.56
## 7 2 Spring2019 1 gpa_A NA
## 8 2 Spring2019 2 gpa_A 3.45
## 9 1 Fall2019 1 gpa_B NA
## 10 1 Fall2019 2 gpa_B 3.82
## 11 1 Spring2019 3 gpa_B NA
## 12 1 Spring2019 4 gpa_B 3.77
## 13 2 Fall2019 3 gpa_B 3.55
## 14 2 Fall2019 4 gpa_B NA
## 15 2 Spring2019 1 gpa_B 3.47
## 16 2 Spring2019 2 gpa_B NA
We can also remove the rows with NA by setting na.rm = TRUE.
melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"), na.rm = TRUE)
## id semester participation variable value
## 1 1 Fall2019 1 gpa_A 3.81
## 3 1 Spring2019 3 gpa_A 3.76
## 6 2 Fall2019 4 gpa_A 3.56
## 8 2 Spring2019 2 gpa_A 3.45
## 10 1 Fall2019 2 gpa_B 3.82
## 12 1 Spring2019 4 gpa_B 3.77
## 13 2 Fall2019 3 gpa_B 3.55
## 15 2 Spring2019 1 gpa_B 3.47