In this post, we discuss a few tricks for working with duplicates, including how to find and mark duplicates, find positions of duplicates, and remove duplicates.
The sample dataset we will be using is shown below. In this dataset, Name
and Location
both contain duplicates.
## # A tibble: 8 x 3
## Timestamp Name Location
## <chr> <chr> <chr>
## 1 2020/1/29 9:02 John Doe Shanghai
## 2 2020/2/25 14:12 John Doe New York
## 3 2020/1/28 22:52 Richard Roe Beijing
## 4 2020/1/29 1:31 Jane Doe Shanghai
## 5 2020/2/3 3:04 Jane Doe Paris
## 6 2020/3/3 2:22 Jane Doe Shanghai
## 7 2020/3/20 0:00 Jane Doe Shanghai
## 8 2020/2/17 10:14 Joe Public Abu Dhabi
Using duplicated()
, we will get a logical vector indicating which values are duplicates. By default, the first value will be marked as the unique value (FALSE
) and the rest as duplicates (TRUE
).
duplicated(location$Name)
## [1] FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
We may create a new column to mark the duplicates.
location$dup <- duplicated(location$Name)
## # A tibble: 8 x 4
## Timestamp Name Location dup
## <chr> <chr> <chr> <lgl>
## 1 2020/1/29 9:02 John Doe Shanghai FALSE
## 2 2020/2/25 14:12 John Doe New York TRUE
## 3 2020/1/28 22:52 Richard Roe Beijing FALSE
## 4 2020/1/29 1:31 Jane Doe Shanghai FALSE
## 5 2020/2/3 3:04 Jane Doe Paris TRUE
## 6 2020/3/3 2:22 Jane Doe Shanghai TRUE
## 7 2020/3/20 0:00 Jane Doe Shanghai TRUE
## 8 2020/2/17 10:14 Joe Public Abu Dhabi FALSE
In cases where we want to mark all identical elements as “duplicates”, we can specify the fromLast
argument to be TRUE
. When fromLast
is TRUE
, duplication would be considered from the reverse side.
location$dup2 <- duplicated(location$Name) | duplicated(location$Name, fromLast=TRUE)
## # A tibble: 8 x 5
## Timestamp Name Location dup dup2
## <chr> <chr> <chr> <lgl> <lgl>
## 1 2020/1/29 9:02 John Doe Shanghai FALSE TRUE
## 2 2020/2/25 14:12 John Doe New York TRUE TRUE
## 3 2020/1/28 22:52 Richard Roe Beijing FALSE FALSE
## 4 2020/1/29 1:31 Jane Doe Shanghai FALSE TRUE
## 5 2020/2/3 3:04 Jane Doe Paris TRUE TRUE
## 6 2020/3/3 2:22 Jane Doe Shanghai TRUE TRUE
## 7 2020/3/20 0:00 Jane Doe Shanghai TRUE TRUE
## 8 2020/2/17 10:14 Joe Public Abu Dhabi FALSE FALSE
which()
gives us the TRUE indices of a logical object.
Using which()
with duplicated()
, we get the positions of the duplicates.
which(duplicated(location$Name))
## [1] 2 5 6 7
dplyr::distinct()
keeps the unique values of selected columns.
If we do not specify .keep_all = TRUE
, only variable Name
will be kept and the others will be dropped.
library(dplyr)
location %>% distinct(Name, .keep_all = TRUE)
## # A tibble: 4 x 5
## Timestamp Name Location dup dup2
## <chr> <chr> <chr> <lgl> <lgl>
## 1 2020/1/29 9:02 John Doe Shanghai FALSE TRUE
## 2 2020/1/28 22:52 Richard Roe Beijing FALSE FALSE
## 3 2020/1/29 1:31 Jane Doe Shanghai FALSE TRUE
## 4 2020/2/17 10:14 Joe Public Abu Dhabi FALSE FALSE
We can include multiple variables in distinct()
so that the combination of the group would be unique.
location %>% distinct(Name, Location, .keep_all = TRUE)
## # A tibble: 6 x 5
## Timestamp Name Location dup dup2
## <chr> <chr> <chr> <lgl> <lgl>
## 1 2020/1/29 9:02 John Doe Shanghai FALSE TRUE
## 2 2020/2/25 14:12 John Doe New York TRUE TRUE
## 3 2020/1/28 22:52 Richard Roe Beijing FALSE FALSE
## 4 2020/1/29 1:31 Jane Doe Shanghai FALSE TRUE
## 5 2020/2/3 3:04 Jane Doe Paris TRUE TRUE
## 6 2020/2/17 10:14 Joe Public Abu Dhabi FALSE FALSE