Combining data frames, either by rows or columns, is a common need in data frame manipulation. Base R and the package dplyr have offered ways to handle those tasks.
library(dplyr)
Let’s use two sample datasets to show how to append one data frames to another by rows.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file2
## id year family role rating
## 1 1 2005 1 child A
## 2 2 2005 1 child A
## 3 3 2005 1 mother B
## 4 1 2005 2 father B
## 5 2 2005 2 mother C
## 6 3 2005 2 child B
The task is to append the file2 to the file1 by rows. This can be achieved with dplyr::bind_rows().
bind_rows(file1, file2)
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## 7 1 2005 1 child A
## 8 2 2005 1 child A
## 9 3 2005 1 mother B
## 10 1 2005 2 father B
## 11 2 2005 2 mother C
## 12 3 2005 2 child B
bind_rows() also allows us to mark the source file by specifying the .id argument.
bind_rows(file1, file2, .id = "source")
## source id year family role rating
## 1 1 1 1995 1 child A
## 2 1 2 1995 1 child B
## 3 1 3 1995 1 mother B
## 4 1 1 1995 2 father C
## 5 1 2 1995 2 mother B
## 6 1 3 1995 2 child A
## 7 2 1 2005 1 child A
## 8 2 2 2005 1 child A
## 9 2 3 2005 1 mother B
## 10 2 1 2005 2 father B
## 11 2 2 2005 2 mother C
## 12 2 3 2005 2 child B
merge() merges two data frames by common columns or row names.
Here we have file1 and file5 to merge, which share the common column family.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file5
## family area SES
## 1 1 N 1
## 2 2 S 2
## 3 3 N 3
by argument sets the identification column.
merge(file1, file5, by = "family")
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
all argument evaluates if unmatched cases from either data frame should be kept or dropped. Here with all = TRUE we have kept all observations from both files.
Compare this case below with the case above. Without all = TRUE in the above case, unmatched cases are dropped.
merge(file1, file5, by = "family", all = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Rows in file5 with no match in file1 have NA values in the columns from file1.
If we set all.y to be TRUE, all non-missing information from file5 (y) is kept, while information from file1 (x) is missing.
merge(file1, file5, by = "family", all.y = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Compare setting all.y = TRUE with setting all.x = TRUE, where only complete cases from file5 has been kept.
merge(file1, file5, by = "family", all.x = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
We can also merge data frames by multiple common columns.
file3
## id family role
## 1 1 1 child
## 2 2 1 child
## 3 3 1 mother
## 4 1 2 father
## 5 2 2 mother
## 6 3 2 child
file4
## id year family rating
## 1 1 1995 1 A
## 2 2 1995 1 B
## 3 3 1995 1 B
## 4 1 1995 2 C
## 5 2 1995 2 B
## 6 3 1995 2 A
To merge these two files:
merge(file3, file4, by = c("id", "family"))
## id family role year rating
## 1 1 1 child 1995 A
## 2 1 2 father 1995 C
## 3 2 1 child 1995 B
## 4 2 2 mother 1995 B
## 5 3 1 mother 1995 B
## 6 3 2 child 1995 A
inner_join(), left_join(), right_join(), and full_join() from dplyr join data frames together in different ways.
The sample data frames we have are file6 and file7. Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.
file6
## industry company rank
## 1 A a 1
## 2 A b 2
## 3 A c 3
## 4 B l 4
## 5 B m 5
## 6 B n 6
## 7 C o 7
file7
## industry year
## 1 A 1995
## 2 A 2005
## 3 B 1995
## 4 B 2005
## 5 D 1995
## 6 D 2005
inner_join() returns all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
file6 %>% inner_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
left_join() returns all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
file6 %>% left_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
right_join() returns all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
file6 %>% right_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A b 2 1995
## 3 A c 3 1995
## 4 A a 1 2005
## 5 A b 2 2005
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B m 5 1995
## 9 B n 6 1995
## 10 B l 4 2005
## 11 B m 5 2005
## 12 B n 6 2005
## 13 D <NA> NA 1995
## 14 D <NA> NA 2005
full_join() returns all rows and all columns from both x and y. Where there are unmatched values, it returns NA for the missing one.
file6 %>% full_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
## 14 D <NA> NA 1995
## 15 D <NA> NA 2005