Reshaping Data between Long and Wide Formats

In this post, we show two ways of converting datasets between long and wide formats: (1) using functions spread() and gather() from package tidyr; (2) using functions dcast() and melt() from package data.table.

`tidyr`

library(tidyr)
library(dplyr)

long to wide

Dataset long is currently in the long format. We will reshape it to the wide format with spread().

long

## # A tibble: 8 x 5
##      id semester   course   gpa participation
##   <dbl> <chr>      <chr>  <dbl>         <dbl>
## 1     1 Fall2019   A       3.81             1
## 2     1 Fall2019   B       3.82             2
## 3     1 Spring2019 A       3.76             3
## 4     1 Spring2019 B       3.77             4
## 5     2 Fall2019   A       3.56             4
## 6     2 Fall2019   B       3.55             3
## 7     2 Spring2019 A       3.45             2
## 8     2 Spring2019 B       3.47             1

spread() converts data from long to wide, spreading a key-value pair across multiple columns. The trick is to set the right key variable and the value variable.

Below we add two new variables that are gpa of course A and B. Here, key is course and value is from gpa.

wide1 <- long %>% spread(key = course, value = gpa)
wide1

## # A tibble: 8 x 5
##      id semester   participation     A     B
##   <dbl> <chr>              <dbl> <dbl> <dbl>
## 1     1 Fall2019               1  3.81 NA   
## 2     1 Fall2019               2 NA     3.82
## 3     1 Spring2019             3  3.76 NA   
## 4     1 Spring2019             4 NA     3.77
## 5     2 Fall2019               3 NA     3.55
## 6     2 Fall2019               4  3.56 NA   
## 7     2 Spring2019             1 NA     3.47
## 8     2 Spring2019             2  3.45 NA

We can rename the two new columns if we want to.

names(wide1)[4:5] <- c("gpa_A", "gpa_B")
wide1

## # A tibble: 8 x 5
##      id semester   participation gpa_A gpa_B
##   <dbl> <chr>              <dbl> <dbl> <dbl>
## 1     1 Fall2019               1  3.81 NA   
## 2     1 Fall2019               2 NA     3.82
## 3     1 Spring2019             3  3.76 NA   
## 4     1 Spring2019             4 NA     3.77
## 5     2 Fall2019               3 NA     3.55
## 6     2 Fall2019               4  3.56 NA   
## 7     2 Spring2019             1 NA     3.47
## 8     2 Spring2019             2  3.45 NA

Recently, tidyr has updated spread() to pivot_wider(), which provides more intuitive syntax and more options. spread() is not going away but will not be under active development.

wide2 <- long %>% pivot_wider(names_from = semester, values_from = participation)
names(wide2)[4:5] <- c("participation_Fall2019", "participation_Spring2019")
wide2

## # A tibble: 8 x 5
##      id course   gpa participation_Fall2019 participation_Spring2019
##   <dbl> <chr>  <dbl>                  <dbl>                    <dbl>
## 1     1 A       3.81                      1                       NA
## 2     1 B       3.82                      2                       NA
## 3     1 A       3.76                     NA                        3
## 4     1 B       3.77                     NA                        4
## 5     2 A       3.56                      4                       NA
## 6     2 B       3.55                      3                       NA
## 7     2 A       3.45                     NA                        2
## 8     2 B       3.47                     NA                        1

If there are multiple variables whose values will be paired with the key variable, we can specify them with c() in the values_from argument.

wide3 <- long %>% pivot_wider(names_from = course, values_from = c(gpa, participation))
wide3

## # A tibble: 4 x 6
##      id semester   gpa_A gpa_B participation_A participation_B
##   <dbl> <chr>      <dbl> <dbl>           <dbl>           <dbl>
## 1     1 Fall2019    3.81  3.82               1               2
## 2     1 Spring2019  3.76  3.77               3               4
## 3     2 Fall2019    3.56  3.55               4               3
## 4     2 Spring2019  3.45  3.47               2               1

wide to long

gather() converts data from wide to long format for demo.

We will use the dataset wide1 in the wide format generated above.

wide1

## # A tibble: 8 x 5
##      id semester   participation gpa_A gpa_B
##   <dbl> <chr>              <dbl> <dbl> <dbl>
## 1     1 Fall2019               1  3.81 NA   
## 2     1 Fall2019               2 NA     3.82
## 3     1 Spring2019             3  3.76 NA   
## 4     1 Spring2019             4 NA     3.77
## 5     2 Fall2019               3 NA     3.55
## 6     2 Fall2019               4  3.56 NA   
## 7     2 Spring2019             1 NA     3.47
## 8     2 Spring2019             2  3.45 NA

Below we convert the dataset wide1 back to the long format with gather(). Here the key variable to be created is course, and the value variable to be created is gpa. We remove the rows with NA in the value column.

wide1 %>% gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T)

## # A tibble: 8 x 5
##      id semester   participation course   gpa
##   <dbl> <chr>              <dbl> <chr>  <dbl>
## 1     1 Fall2019               1 gpa_A   3.81
## 2     1 Spring2019             3 gpa_A   3.76
## 3     2 Fall2019               4 gpa_A   3.56
## 4     2 Spring2019             2 gpa_A   3.45
## 5     1 Fall2019               2 gpa_B   3.82
## 6     1 Spring2019             4 gpa_B   3.77
## 7     2 Fall2019               3 gpa_B   3.55
## 8     2 Spring2019             1 gpa_B   3.47

A bit more postprocessing on variable course to separate the course names A and B from the prefix gpa_.

wide1 %>% 
  gather(key = "course", value = "gpa", gpa_A, gpa_B, na.rm = T) %>%
  separate(col = course, into = c("gpa_", "course"), sep = 4) %>%
  select(-gpa_)

## # A tibble: 8 x 5
##      id semester   participation course   gpa
##   <dbl> <chr>              <dbl> <chr>  <dbl>
## 1     1 Fall2019               1 A       3.81
## 2     1 Spring2019             3 A       3.76
## 3     2 Fall2019               4 A       3.56
## 4     2 Spring2019             2 A       3.45
## 5     1 Fall2019               2 B       3.82
## 6     1 Spring2019             4 B       3.77
## 7     2 Fall2019               3 B       3.55
## 8     2 Spring2019             1 B       3.47

Recently, tidyr has updated gather() to pivot_longer(). Like pivot_wider(), it provides more intuitive syntax and more options. gather() is not going away but will not be under active development.

wide2

## # A tibble: 8 x 5
##      id course   gpa participation_Fall2019 participation_Spring2019
##   <dbl> <chr>  <dbl>                  <dbl>                    <dbl>
## 1     1 A       3.81                      1                       NA
## 2     1 B       3.82                      2                       NA
## 3     1 A       3.76                     NA                        3
## 4     1 B       3.77                     NA                        4
## 5     2 A       3.56                      4                       NA
## 6     2 B       3.55                      3                       NA
## 7     2 A       3.45                     NA                        2
## 8     2 B       3.47                     NA                        1

Using pivot_longer(), reshaping and postprocessing of column values can be done within itself. With arguments of names_prefix, names_sep or names_pattern, we no longer need separate() or extract().

wide2 %>% 
  pivot_longer(names_to = "semester", 
               values_to = "participation", 
               cols = participation_Fall2019 : participation_Spring2019,
               values_drop_na = T,
               names_prefix = "participation_")

## # A tibble: 8 x 5
##      id course   gpa semester   participation
##   <dbl> <chr>  <dbl> <chr>              <dbl>
## 1     1 A       3.81 Fall2019               1
## 2     1 B       3.82 Fall2019               2
## 3     1 A       3.76 Spring2019             3
## 4     1 B       3.77 Spring2019             4
## 5     2 A       3.56 Fall2019               4
## 6     2 B       3.55 Fall2019               3
## 7     2 A       3.45 Spring2019             2
## 8     2 B       3.47 Spring2019             1

`data.table`

Package data.table also offers ways to convert the data between long and wide format.

library(data.table)

long to wide

dcast() reshapes datasets from long to wide formats.

What we do below is the same as long %>% spread(key = course, value = gpa).

dcast(long, id + semester + participation ~ course, value.var = "gpa")

##   id   semester participation    A    B
## 1  1   Fall2019             1 3.81   NA
## 2  1   Fall2019             2   NA 3.82
## 3  1 Spring2019             3 3.76   NA
## 4  1 Spring2019             4   NA 3.77
## 5  2   Fall2019             3   NA 3.55
## 6  2   Fall2019             4 3.56   NA
## 7  2 Spring2019             1   NA 3.47
## 8  2 Spring2019             2 3.45   NA

wide to long

melt() reshapes data from wide to long. Instead of working on a key-value pair, it uses id and measure.

The example below is the same as wide1 %>% gather(key = course, value = gpa, -id, -semester, -participation).

melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"))

##    id   semester participation variable value
## 1   1   Fall2019             1    gpa_A  3.81
## 2   1   Fall2019             2    gpa_A    NA
## 3   1 Spring2019             3    gpa_A  3.76
## 4   1 Spring2019             4    gpa_A    NA
## 5   2   Fall2019             3    gpa_A    NA
## 6   2   Fall2019             4    gpa_A  3.56
## 7   2 Spring2019             1    gpa_A    NA
## 8   2 Spring2019             2    gpa_A  3.45
## 9   1   Fall2019             1    gpa_B    NA
## 10  1   Fall2019             2    gpa_B  3.82
## 11  1 Spring2019             3    gpa_B    NA
## 12  1 Spring2019             4    gpa_B  3.77
## 13  2   Fall2019             3    gpa_B  3.55
## 14  2   Fall2019             4    gpa_B    NA
## 15  2 Spring2019             1    gpa_B  3.47
## 16  2 Spring2019             2    gpa_B    NA

We can also remove the rows with NA by setting na.rm = TRUE.

melt(wide1, id = c("id", "semester", "participation"), measure = c("gpa_A", "gpa_B"), na.rm = TRUE)

##    id   semester participation variable value
## 1   1   Fall2019             1    gpa_A  3.81
## 3   1 Spring2019             3    gpa_A  3.76
## 6   2   Fall2019             4    gpa_A  3.56
## 8   2 Spring2019             2    gpa_A  3.45
## 10  1   Fall2019             2    gpa_B  3.82
## 12  1 Spring2019             4    gpa_B  3.77
## 13  2   Fall2019             3    gpa_B  3.55
## 15  2 Spring2019             1    gpa_B  3.47

Reshaping Data between Long and Wide Formats

Yun Dai

04/2020

`tidyr`

long to wide

wide to long

`data.table`

long to wide

wide to long