Spreading Rows in Hierarchical Data

Filling in values

expand() and complete() from package tidyr add rows of observations by filling in all combinations of the specified variables. They add all possible combinations even when some combinations do not exist in the data frame.

library(tidyr)
library(dplyr)

Let’s use a sample dataset to illustrate their usage.

file8

##   id course placement attendance
## 1  1      A         1          3
## 2  1      B         2          4
## 3  2      A         3          2
## 4  2      C         1          3
## 5  3      B         2          4
## 6  3      C         3          5

expand() returns a new data frame with all combinations of the specified variables id and course without the other variables.

file8 %>% expand(id, course) %>% arrange(id)

## # A tibble: 9 x 2
##      id course
##   <dbl> <chr> 
## 1     1 A     
## 2     1 B     
## 3     1 C     
## 4     2 A     
## 5     2 B     
## 6     2 C     
## 7     3 A     
## 8     3 B     
## 9     3 C

complete(), on the other hand, returns a new data frame with all combinations of id and course, and keeps all variables from the original data frame.

file8 %>% complete(id, course) %>% arrange(id)

## # A tibble: 9 x 4
##      id course placement attendance
##   <dbl> <chr>      <dbl>      <dbl>
## 1     1 A              1          3
## 2     1 B              2          4
## 3     1 C             NA         NA
## 4     2 A              3          2
## 5     2 B             NA         NA
## 6     2 C              1          3
## 7     3 A             NA         NA
## 8     3 B              2          4
## 9     3 C              3          5

expand(nesting()) and complete(nesting()) add only the combinations of existing values of id and course to the new data frame.

file8 %>% expand(nesting(id, course)) %>% arrange(id)

## # A tibble: 6 x 2
##      id course
##   <dbl> <chr> 
## 1     1 A     
## 2     1 B     
## 3     2 A     
## 4     2 C     
## 5     3 B     
## 6     3 C

file8 %>% complete(nesting(id, course), placement) %>% arrange(id)

## # A tibble: 18 x 4
##       id course placement attendance
##    <dbl> <chr>      <dbl>      <dbl>
##  1     1 A              1          3
##  2     1 A              2         NA
##  3     1 A              3         NA
##  4     1 B              1         NA
##  5     1 B              2          4
##  6     1 B              3         NA
##  7     2 A              1         NA
##  8     2 A              2         NA
##  9     2 A              3          2
## 10     2 C              1          3
## 11     2 C              2         NA
## 12     2 C              3         NA
## 13     3 B              1         NA
## 14     3 B              2          4
## 15     3 B              3         NA
## 16     3 C              1         NA
## 17     3 C              2         NA
## 18     3 C              3          5

Replicating rows

rep() from base R replicates rows by the specified number.

case 1

Let’s say we want to create copies of each case by the value of attendance. As a result, each id will have a record for each attendance of each course. Later, we can, for instance, add course performance data to each attendance record of every course.

file8[rep(row.names(file8), file8$attendance), ]

##     id course placement attendance
## 1    1      A         1          3
## 1.1  1      A         1          3
## 1.2  1      A         1          3
## 2    1      B         2          4
## 2.1  1      B         2          4
## 2.2  1      B         2          4
## 2.3  1      B         2          4
## 3    2      A         3          2
## 3.1  2      A         3          2
## 4    2      C         1          3
## 4.1  2      C         1          3
## 4.2  2      C         1          3
## 5    3      B         2          4
## 5.1  3      B         2          4
## 5.2  3      B         2          4
## 5.3  3      B         2          4
## 6    3      C         3          5
## 6.1  3      C         3          5
## 6.2  3      C         3          5
## 6.3  3      C         3          5
## 6.4  3      C         3          5

case 2

We have a dataset with variables of companies, sectors, years and dates, analyst scores and ranks, analyst ratings on companies etc. Below is a subset of the data for demo.

head(file9)

## # A tibble: 6 x 3
##   group  rank score
##   <int> <dbl> <dbl>
## 1   735     4  6.77
## 2   736     4  6.77
## 3   734     4  8.43
## 4   735     4  8.43
## 5   824     2 16.5 
## 6   828     3  9.27

One thing we want to do with the data is to estimate a model by rank: rank = 1 versus 2, 2 versus 3, 3 versus the first runner-up (rank = 4), and the last runner-up versus the first non-runner-up. In the runner-up group, there could be more than one analyst, whose score values could be different.

summary(file9$rank)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   4.000   4.111   4.000  34.000

However, there could be cases where we have only one analyst as the runner-up in each group, while we need at least two analysts to make the comparisons respectively with rank 3 and the first non-runner-up. Therefore, we need to expand those single runners-up. We will later recode the duplicated runners-up to rank 5, and the first non-runners-up to rank 6. Ultimately, we will compare rank 1 versus 2, 2 versus 3, 3 versus 4, and 5 versus 6. This may all sound very confusing right now; let’s see it step by step.

First, expand the runners-up if rank = 4 have only one score value, meaning there is only one person as the runner-up in a group. There are 182 such cases marked by expd = 2.

file9 <- file9 %>% 
  group_by(group, rank) %>% 
  arrange(group, rank, score) %>% 
  mutate(expd = ifelse(first(score) == last(score) & rank == 4, 2, 1))

table(file9$expd)

## 
##    1    2 
## 7711  182

Next, replicate the number of rows by the value of expd. The cases with only one runner-up (expd = 2) will be replicated, and the rest (expd = 1) will remain untouched. The data is then sorted.

file9 <- file9[rep(row.names(file9), file9$expd), ]
file9 <- arrange_all(file9)

head(file9, 10)

## # A tibble: 10 x 4
## # Groups:   group, rank [7]
##    group  rank score  expd
##    <int> <dbl> <dbl> <dbl>
##  1     1     2  14.3     1
##  2     2     3  16.5     1
##  3     2     4  16.4     2
##  4     2     4  16.4     2
##  5     3     1  26.2     1
##  6     3     2  24.8     1
##  7     3     3  16.5     1
##  8     3     3  20.0     1
##  9     3     4  16.4     1
## 10     3     4  18.2     1

Then we create a new variable rank2, where rank2 = rank if rank <= 3. For the newly created replicated runners-up marked by expd = 2, we assign one case to be rank 4 and another to be rank 5 by their positions (odd/even rows).

file9 <- file9 %>% 
  mutate(rank2 = case_when(
    rank <= 3 ~ rank,
    rank > 3 & row_number() %% 2 == 0 & expd == 2 ~ 4,
    rank > 3 & row_number() %% 2 == 1 & expd == 2 ~ 5))

head(file9, 10)

## # A tibble: 10 x 5
## # Groups:   group, rank [7]
##    group  rank score  expd rank2
##    <int> <dbl> <dbl> <dbl> <dbl>
##  1     1     2  14.3     1     2
##  2     2     3  16.5     1     3
##  3     2     4  16.4     2     5
##  4     2     4  16.4     2     4
##  5     3     1  26.2     1     1
##  6     3     2  24.8     1     2
##  7     3     3  16.5     1     3
##  8     3     3  20.0     1     3
##  9     3     4  16.4     1    NA
## 10     3     4  18.2     1    NA

Now we return to all runners-up to mark the first runners-up and the last runners-up in each group.

file9 <- file9 %>% 
  group_by(group, rank) %>% 
  mutate(max = ifelse(rank == 4 & expd != 2, max(score), NA),
         min = ifelse(rank == 4 & expd != 2, min(score), NA),
         rank2 = ifelse(rank == 4 & score == max & expd != 2, 4, rank2), 
         rank2 = ifelse(rank == 4 & score == min & expd != 2, 5, rank2)) %>% 
  arrange(group, rank, rank2, score)

head(file9, 15)

## # A tibble: 15 x 7
## # Groups:   group, rank [11]
##    group  rank score  expd rank2   max   min
##    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     1     2  14.3     1     2  NA    NA  
##  2     2     3  16.5     1     3  NA    NA  
##  3     2     4  16.4     2     4  NA    NA  
##  4     2     4  16.4     2     5  NA    NA  
##  5     3     1  26.2     1     1  NA    NA  
##  6     3     2  24.8     1     2  NA    NA  
##  7     3     3  16.5     1     3  NA    NA  
##  8     3     3  20.0     1     3  NA    NA  
##  9     3     4  18.2     1     4  18.2  16.4
## 10     3     4  16.4     1     5  18.2  16.4
## 11     4     1  26.2     1     1  NA    NA  
## 12     4     2  24.8     1     2  NA    NA  
## 13     4     3  20.0     1     3  NA    NA  
## 14     4     4  18.2     2     4  NA    NA  
## 15     4     4  18.2     2     5  NA    NA

Spreading Rows in Hierarchical Data

Yun Dai

04/2019

Filling in values

Replicating rows

case 1

case 2