expand()
and complete()
from package tidyr
add rows of observations by filling in all combinations of the specified variables. They add all possible combinations even when some combinations do not exist in the data frame.
library(tidyr)
library(dplyr)
Let’s use a sample dataset to illustrate their usage.
file8
## id course placement attendance
## 1 1 A 1 3
## 2 1 B 2 4
## 3 2 A 3 2
## 4 2 C 1 3
## 5 3 B 2 4
## 6 3 C 3 5
expand()
returns a new data frame with all combinations of the specified variables id
and course
without the other variables.
file8 %>% expand(id, course) %>% arrange(id)
## # A tibble: 9 x 2
## id course
## <dbl> <chr>
## 1 1 A
## 2 1 B
## 3 1 C
## 4 2 A
## 5 2 B
## 6 2 C
## 7 3 A
## 8 3 B
## 9 3 C
complete()
, on the other hand, returns a new data frame with all combinations of id
and course
, and keeps all variables from the original data frame.
file8 %>% complete(id, course) %>% arrange(id)
## # A tibble: 9 x 4
## id course placement attendance
## <dbl> <chr> <dbl> <dbl>
## 1 1 A 1 3
## 2 1 B 2 4
## 3 1 C NA NA
## 4 2 A 3 2
## 5 2 B NA NA
## 6 2 C 1 3
## 7 3 A NA NA
## 8 3 B 2 4
## 9 3 C 3 5
expand(nesting())
and complete(nesting())
add only the combinations of existing values of id
and course
to the new data frame.
file8 %>% expand(nesting(id, course)) %>% arrange(id)
## # A tibble: 6 x 2
## id course
## <dbl> <chr>
## 1 1 A
## 2 1 B
## 3 2 A
## 4 2 C
## 5 3 B
## 6 3 C
file8 %>% complete(nesting(id, course), placement) %>% arrange(id)
## # A tibble: 18 x 4
## id course placement attendance
## <dbl> <chr> <dbl> <dbl>
## 1 1 A 1 3
## 2 1 A 2 NA
## 3 1 A 3 NA
## 4 1 B 1 NA
## 5 1 B 2 4
## 6 1 B 3 NA
## 7 2 A 1 NA
## 8 2 A 2 NA
## 9 2 A 3 2
## 10 2 C 1 3
## 11 2 C 2 NA
## 12 2 C 3 NA
## 13 3 B 1 NA
## 14 3 B 2 4
## 15 3 B 3 NA
## 16 3 C 1 NA
## 17 3 C 2 NA
## 18 3 C 3 5
rep()
from base R replicates rows by the specified number.
Let’s say we want to create copies of each case by the value of attendance
. As a result, each id
will have a record for each attendance
of each course
. Later, we can, for instance, add course performance data to each attendance
record of every course
.
file8[rep(row.names(file8), file8$attendance), ]
## id course placement attendance
## 1 1 A 1 3
## 1.1 1 A 1 3
## 1.2 1 A 1 3
## 2 1 B 2 4
## 2.1 1 B 2 4
## 2.2 1 B 2 4
## 2.3 1 B 2 4
## 3 2 A 3 2
## 3.1 2 A 3 2
## 4 2 C 1 3
## 4.1 2 C 1 3
## 4.2 2 C 1 3
## 5 3 B 2 4
## 5.1 3 B 2 4
## 5.2 3 B 2 4
## 5.3 3 B 2 4
## 6 3 C 3 5
## 6.1 3 C 3 5
## 6.2 3 C 3 5
## 6.3 3 C 3 5
## 6.4 3 C 3 5
We have a dataset with variables of companies, sectors, years and dates, analyst scores and ranks, analyst ratings on companies etc. Below is a subset of the data for demo.
head(file9)
## # A tibble: 6 x 3
## group rank score
## <int> <dbl> <dbl>
## 1 735 4 6.77
## 2 736 4 6.77
## 3 734 4 8.43
## 4 735 4 8.43
## 5 824 2 16.5
## 6 828 3 9.27
One thing we want to do with the data is to estimate a model by rank
: rank
= 1 versus 2, 2 versus 3, 3 versus the first runner-up (rank
= 4), and the last runner-up versus the first non-runner-up. In the runner-up group, there could be more than one analyst, whose score
values could be different.
summary(file9$rank)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 4.000 4.111 4.000 34.000
However, there could be cases where we have only one analyst as the runner-up in each group
, while we need at least two analysts to make the comparisons respectively with rank 3 and the first non-runner-up. Therefore, we need to expand those single runners-up. We will later recode the duplicated runners-up to rank 5, and the first non-runners-up to rank 6. Ultimately, we will compare rank 1 versus 2, 2 versus 3, 3 versus 4, and 5 versus 6. This may all sound very confusing right now; let’s see it step by step.
First, expand the runners-up if rank
= 4 have only one score
value, meaning there is only one person as the runner-up in a group
. There are 182 such cases marked by expd
= 2.
file9 <- file9 %>%
group_by(group, rank) %>%
arrange(group, rank, score) %>%
mutate(expd = ifelse(first(score) == last(score) & rank == 4, 2, 1))
table(file9$expd)
##
## 1 2
## 7711 182
Next, replicate the number of rows by the value of expd
. The cases with only one runner-up (expd
= 2) will be replicated, and the rest (expd
= 1) will remain untouched. The data is then sorted.
file9 <- file9[rep(row.names(file9), file9$expd), ]
file9 <- arrange_all(file9)
head(file9, 10)
## # A tibble: 10 x 4
## # Groups: group, rank [7]
## group rank score expd
## <int> <dbl> <dbl> <dbl>
## 1 1 2 14.3 1
## 2 2 3 16.5 1
## 3 2 4 16.4 2
## 4 2 4 16.4 2
## 5 3 1 26.2 1
## 6 3 2 24.8 1
## 7 3 3 16.5 1
## 8 3 3 20.0 1
## 9 3 4 16.4 1
## 10 3 4 18.2 1
Then we create a new variable rank2
, where rank2
= rank
if rank
<= 3. For the newly created replicated runners-up marked by expd
= 2, we assign one case to be rank 4 and another to be rank 5 by their positions (odd/even rows).
file9 <- file9 %>%
mutate(rank2 = case_when(
rank <= 3 ~ rank,
rank > 3 & row_number() %% 2 == 0 & expd == 2 ~ 4,
rank > 3 & row_number() %% 2 == 1 & expd == 2 ~ 5))
head(file9, 10)
## # A tibble: 10 x 5
## # Groups: group, rank [7]
## group rank score expd rank2
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 14.3 1 2
## 2 2 3 16.5 1 3
## 3 2 4 16.4 2 5
## 4 2 4 16.4 2 4
## 5 3 1 26.2 1 1
## 6 3 2 24.8 1 2
## 7 3 3 16.5 1 3
## 8 3 3 20.0 1 3
## 9 3 4 16.4 1 NA
## 10 3 4 18.2 1 NA
Now we return to all runners-up to mark the first runners-up and the last runners-up in each group
.
file9 <- file9 %>%
group_by(group, rank) %>%
mutate(max = ifelse(rank == 4 & expd != 2, max(score), NA),
min = ifelse(rank == 4 & expd != 2, min(score), NA),
rank2 = ifelse(rank == 4 & score == max & expd != 2, 4, rank2),
rank2 = ifelse(rank == 4 & score == min & expd != 2, 5, rank2)) %>%
arrange(group, rank, rank2, score)
head(file9, 15)
## # A tibble: 15 x 7
## # Groups: group, rank [11]
## group rank score expd rank2 max min
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 14.3 1 2 NA NA
## 2 2 3 16.5 1 3 NA NA
## 3 2 4 16.4 2 4 NA NA
## 4 2 4 16.4 2 5 NA NA
## 5 3 1 26.2 1 1 NA NA
## 6 3 2 24.8 1 2 NA NA
## 7 3 3 16.5 1 3 NA NA
## 8 3 3 20.0 1 3 NA NA
## 9 3 4 18.2 1 4 18.2 16.4
## 10 3 4 16.4 1 5 18.2 16.4
## 11 4 1 26.2 1 1 NA NA
## 12 4 2 24.8 1 2 NA NA
## 13 4 3 20.0 1 3 NA NA
## 14 4 4 18.2 2 4 NA NA
## 15 4 4 18.2 2 5 NA NA