Combining Data Frames

Combining data frames, either by rows or columns, is a common need in data frame manipulation. Base R and the package dplyr have offered ways to handle those tasks.

library(dplyr)

Binding by rows

Let’s use two sample datasets to show how to append one data frames to another by rows.

file1

##   id year family   role rating
## 1  1 1995      1  child      A
## 2  2 1995      1  child      B
## 3  3 1995      1 mother      B
## 4  1 1995      2 father      C
## 5  2 1995      2 mother      B
## 6  3 1995      2  child      A

file2

##   id year family   role rating
## 1  1 2005      1  child      A
## 2  2 2005      1  child      A
## 3  3 2005      1 mother      B
## 4  1 2005      2 father      B
## 5  2 2005      2 mother      C
## 6  3 2005      2  child      B

The task is to append the file2 to the file1 by rows. This can be achieved with dplyr::bind_rows().

bind_rows(file1, file2)

##    id year family   role rating
## 1   1 1995      1  child      A
## 2   2 1995      1  child      B
## 3   3 1995      1 mother      B
## 4   1 1995      2 father      C
## 5   2 1995      2 mother      B
## 6   3 1995      2  child      A
## 7   1 2005      1  child      A
## 8   2 2005      1  child      A
## 9   3 2005      1 mother      B
## 10  1 2005      2 father      B
## 11  2 2005      2 mother      C
## 12  3 2005      2  child      B

bind_rows() also allows us to mark the source file by specifying the .id argument.

bind_rows(file1, file2, .id = "source")

##    source id year family   role rating
## 1       1  1 1995      1  child      A
## 2       1  2 1995      1  child      B
## 3       1  3 1995      1 mother      B
## 4       1  1 1995      2 father      C
## 5       1  2 1995      2 mother      B
## 6       1  3 1995      2  child      A
## 7       2  1 2005      1  child      A
## 8       2  2 2005      1  child      A
## 9       2  3 2005      1 mother      B
## 10      2  1 2005      2 father      B
## 11      2  2 2005      2 mother      C
## 12      2  3 2005      2  child      B

Merging data frames

merge() merges two data frames by common columns or row names.

Here we have file1 and file5 to merge, which share the common column family.

file1

##   id year family   role rating
## 1  1 1995      1  child      A
## 2  2 1995      1  child      B
## 3  3 1995      1 mother      B
## 4  1 1995      2 father      C
## 5  2 1995      2 mother      B
## 6  3 1995      2  child      A

file5

##   family area SES
## 1      1    N   1
## 2      2    S   2
## 3      3    N   3

by argument sets the identification column.

merge(file1, file5, by = "family")

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

all argument evaluates if unmatched cases from either data frame should be kept or dropped. Here with all = TRUE we have kept all observations from both files.

Compare this case below with the case above. Without all = TRUE in the above case, unmatched cases are dropped.

merge(file1, file5, by = "family", all = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Rows in file5 with no match in file1 have NA values in the columns from file1.

If we set all.y to be TRUE, all non-missing information from file5 (y) is kept, while information from file1 (x) is missing.

merge(file1, file5, by = "family", all.y = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2
## 7      3 NA   NA   <NA>   <NA>    N   3

Compare setting all.y = TRUE with setting all.x = TRUE, where only complete cases from file5 has been kept.

merge(file1, file5, by = "family", all.x = TRUE)

##   family id year   role rating area SES
## 1      1  1 1995  child      A    N   1
## 2      1  2 1995  child      B    N   1
## 3      1  3 1995 mother      B    N   1
## 4      2  1 1995 father      C    S   2
## 5      2  2 1995 mother      B    S   2
## 6      2  3 1995  child      A    S   2

We can also merge data frames by multiple common columns.

file3

##   id family   role
## 1  1      1  child
## 2  2      1  child
## 3  3      1 mother
## 4  1      2 father
## 5  2      2 mother
## 6  3      2  child

file4

##   id year family rating
## 1  1 1995      1      A
## 2  2 1995      1      B
## 3  3 1995      1      B
## 4  1 1995      2      C
## 5  2 1995      2      B
## 6  3 1995      2      A

To merge these two files:

merge(file3, file4, by = c("id", "family"))

##   id family   role year rating
## 1  1      1  child 1995      A
## 2  1      2 father 1995      C
## 3  2      1  child 1995      B
## 4  2      2 mother 1995      B
## 5  3      1 mother 1995      B
## 6  3      2  child 1995      A

Join operations

inner_join(), left_join(), right_join(), and full_join() from dplyr join data frames together in different ways.

The sample data frames we have are file6 and file7. Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.

file6

##   industry company rank
## 1        A       a    1
## 2        A       b    2
## 3        A       c    3
## 4        B       l    4
## 5        B       m    5
## 6        B       n    6
## 7        C       o    7

file7

##   industry year
## 1        A 1995
## 2        A 2005
## 3        B 1995
## 4        B 2005
## 5        D 1995
## 6        D 2005

inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.

file6 %>% inner_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005

left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

file6 %>% left_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005
## 13        C       o    7   NA

right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

file6 %>% right_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       b    2 1995
## 3         A       c    3 1995
## 4         A       a    1 2005
## 5         A       b    2 2005
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       m    5 1995
## 9         B       n    6 1995
## 10        B       l    4 2005
## 11        B       m    5 2005
## 12        B       n    6 2005
## 13        D    <NA>   NA 1995
## 14        D    <NA>   NA 2005

full_join() returns all rows and all columns from both x and y. Where there are unmatched values, it returns NA for the missing one.

file6 %>% full_join(file7, by = "industry")

##    industry company rank year
## 1         A       a    1 1995
## 2         A       a    1 2005
## 3         A       b    2 1995
## 4         A       b    2 2005
## 5         A       c    3 1995
## 6         A       c    3 2005
## 7         B       l    4 1995
## 8         B       l    4 2005
## 9         B       m    5 1995
## 10        B       m    5 2005
## 11        B       n    6 1995
## 12        B       n    6 2005
## 13        C       o    7   NA
## 14        D    <NA>   NA 1995
## 15        D    <NA>   NA 2005

Combining Data Frames

Yun Dai

04/2019

Binding by rows

Merging data frames

Join operations