Combining data frames, either by rows or columns, is a common need in data frame manipulation. Base R and the package dplyr
have offered ways to handle those tasks.
library(dplyr)
Let’s use two sample datasets to show how to append one data frames to another by rows.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file2
## id year family role rating
## 1 1 2005 1 child A
## 2 2 2005 1 child A
## 3 3 2005 1 mother B
## 4 1 2005 2 father B
## 5 2 2005 2 mother C
## 6 3 2005 2 child B
The task is to append the file2
to the file1
by rows. This can be achieved with dplyr::bind_rows()
.
bind_rows(file1, file2)
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
## 7 1 2005 1 child A
## 8 2 2005 1 child A
## 9 3 2005 1 mother B
## 10 1 2005 2 father B
## 11 2 2005 2 mother C
## 12 3 2005 2 child B
bind_rows()
also allows us to mark the source file by specifying the .id
argument.
bind_rows(file1, file2, .id = "source")
## source id year family role rating
## 1 1 1 1995 1 child A
## 2 1 2 1995 1 child B
## 3 1 3 1995 1 mother B
## 4 1 1 1995 2 father C
## 5 1 2 1995 2 mother B
## 6 1 3 1995 2 child A
## 7 2 1 2005 1 child A
## 8 2 2 2005 1 child A
## 9 2 3 2005 1 mother B
## 10 2 1 2005 2 father B
## 11 2 2 2005 2 mother C
## 12 2 3 2005 2 child B
merge()
merges two data frames by common columns or row names.
Here we have file1
and file5
to merge, which share the common column family
.
file1
## id year family role rating
## 1 1 1995 1 child A
## 2 2 1995 1 child B
## 3 3 1995 1 mother B
## 4 1 1995 2 father C
## 5 2 1995 2 mother B
## 6 3 1995 2 child A
file5
## family area SES
## 1 1 N 1
## 2 2 S 2
## 3 3 N 3
by
argument sets the identification column.
merge(file1, file5, by = "family")
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
all
argument evaluates if unmatched cases from either data frame should be kept or dropped. Here with all = TRUE
we have kept all observations from both files.
Compare this case below with the case above. Without all = TRUE
in the above case, unmatched cases are dropped.
merge(file1, file5, by = "family", all = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Rows in file5
with no match in file1
have NA
values in the columns from file1
.
If we set all.y
to be TRUE
, all non-missing information from file5
(y
) is kept, while information from file1
(x
) is missing.
merge(file1, file5, by = "family", all.y = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
## 7 3 NA NA <NA> <NA> N 3
Compare setting all.y = TRUE
with setting all.x = TRUE
, where only complete cases from file5
has been kept.
merge(file1, file5, by = "family", all.x = TRUE)
## family id year role rating area SES
## 1 1 1 1995 child A N 1
## 2 1 2 1995 child B N 1
## 3 1 3 1995 mother B N 1
## 4 2 1 1995 father C S 2
## 5 2 2 1995 mother B S 2
## 6 2 3 1995 child A S 2
We can also merge data frames by multiple common columns.
file3
## id family role
## 1 1 1 child
## 2 2 1 child
## 3 3 1 mother
## 4 1 2 father
## 5 2 2 mother
## 6 3 2 child
file4
## id year family rating
## 1 1 1995 1 A
## 2 2 1995 1 B
## 3 3 1995 1 B
## 4 1 1995 2 C
## 5 2 1995 2 B
## 6 3 1995 2 A
To merge these two files:
merge(file3, file4, by = c("id", "family"))
## id family role year rating
## 1 1 1 child 1995 A
## 2 1 2 father 1995 C
## 3 2 1 child 1995 B
## 4 2 2 mother 1995 B
## 5 3 1 mother 1995 B
## 6 3 2 child 1995 A
inner_join()
, left_join()
, right_join()
, and full_join()
from dplyr
join data frames together in different ways.
The sample data frames we have are file6
and file7
. Note that these two data frames cannot be merged one-to-one or one-to-many, but need to be joined many-to-many.
file6
## industry company rank
## 1 A a 1
## 2 A b 2
## 3 A c 3
## 4 B l 4
## 5 B m 5
## 6 B n 6
## 7 C o 7
file7
## industry year
## 1 A 1995
## 2 A 2005
## 3 B 1995
## 4 B 2005
## 5 D 1995
## 6 D 2005
inner_join()
returns all rows from x
where there are matching values in y
, and all columns from x
and y
. If there are multiple matches between x
and y
, all combinations of the matches are returned.
file6 %>% inner_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
left_join()
returns all rows from x
, and all columns from x
and y
. Rows in x
with no match in y
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.
file6 %>% left_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
right_join()
returns all rows from y
, and all columns from x
and y
. Rows in y
with no match in x
will have NA
values in the new columns. If there are multiple matches between x
and y
, all combinations of the matches are returned.
file6 %>% right_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A b 2 1995
## 3 A c 3 1995
## 4 A a 1 2005
## 5 A b 2 2005
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B m 5 1995
## 9 B n 6 1995
## 10 B l 4 2005
## 11 B m 5 2005
## 12 B n 6 2005
## 13 D <NA> NA 1995
## 14 D <NA> NA 2005
full_join()
returns all rows and all columns from both x
and y
. Where there are unmatched values, it returns NA
for the missing one.
file6 %>% full_join(file7, by = "industry")
## industry company rank year
## 1 A a 1 1995
## 2 A a 1 2005
## 3 A b 2 1995
## 4 A b 2 2005
## 5 A c 3 1995
## 6 A c 3 2005
## 7 B l 4 1995
## 8 B l 4 2005
## 9 B m 5 1995
## 10 B m 5 2005
## 11 B n 6 1995
## 12 B n 6 2005
## 13 C o 7 NA
## 14 D <NA> NA 1995
## 15 D <NA> NA 2005