Duplicates

In this post, we discuss a few tricks for working with duplicates, including how to find and mark duplicates, find positions of duplicates, and remove duplicates.

The sample dataset we will be using is shown below. In this dataset, Name and Location both contain duplicates.

## # A tibble: 8 x 3
##   Timestamp       Name        Location 
##   <chr>           <chr>       <chr>    
## 1 2020/1/29 9:02  John Doe    Shanghai 
## 2 2020/2/25 14:12 John Doe    New York 
## 3 2020/1/28 22:52 Richard Roe Beijing  
## 4 2020/1/29 1:31  Jane Doe    Shanghai 
## 5 2020/2/3 3:04   Jane Doe    Paris    
## 6 2020/3/3 2:22   Jane Doe    Shanghai 
## 7 2020/3/20 0:00  Jane Doe    Shanghai 
## 8 2020/2/17 10:14 Joe Public  Abu Dhabi

Finding duplicates

Using duplicated(), we will get a logical vector indicating which values are duplicates. By default, the first value will be marked as the unique value (FALSE) and the rest as duplicates (TRUE).

duplicated(location$Name)

## [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE

We may create a new column to mark the duplicates.

location$dup <- duplicated(location$Name)

## # A tibble: 8 x 4
##   Timestamp       Name        Location  dup  
##   <chr>           <chr>       <chr>     <lgl>
## 1 2020/1/29 9:02  John Doe    Shanghai  FALSE
## 2 2020/2/25 14:12 John Doe    New York  TRUE 
## 3 2020/1/28 22:52 Richard Roe Beijing   FALSE
## 4 2020/1/29 1:31  Jane Doe    Shanghai  FALSE
## 5 2020/2/3 3:04   Jane Doe    Paris     TRUE 
## 6 2020/3/3 2:22   Jane Doe    Shanghai  TRUE 
## 7 2020/3/20 0:00  Jane Doe    Shanghai  TRUE 
## 8 2020/2/17 10:14 Joe Public  Abu Dhabi FALSE

In cases where we want to mark all identical elements as “duplicates”, we can specify the fromLast argument to be TRUE. When fromLast is TRUE, duplication would be considered from the reverse side.

location$dup2 <- duplicated(location$Name) | duplicated(location$Name, fromLast=TRUE)

## # A tibble: 8 x 5
##   Timestamp       Name        Location  dup   dup2 
##   <chr>           <chr>       <chr>     <lgl> <lgl>
## 1 2020/1/29 9:02  John Doe    Shanghai  FALSE TRUE 
## 2 2020/2/25 14:12 John Doe    New York  TRUE  TRUE 
## 3 2020/1/28 22:52 Richard Roe Beijing   FALSE FALSE
## 4 2020/1/29 1:31  Jane Doe    Shanghai  FALSE TRUE 
## 5 2020/2/3 3:04   Jane Doe    Paris     TRUE  TRUE 
## 6 2020/3/3 2:22   Jane Doe    Shanghai  TRUE  TRUE 
## 7 2020/3/20 0:00  Jane Doe    Shanghai  TRUE  TRUE 
## 8 2020/2/17 10:14 Joe Public  Abu Dhabi FALSE FALSE

Finding positions of duplicates

which() gives us the TRUE indices of a logical object.

Using which() with duplicated(), we get the positions of the duplicates.

which(duplicated(location$Name))

## [1] 2 5 6 7

Removing duplicates

dplyr::distinct() keeps the unique values of selected columns.

If we do not specify .keep_all = TRUE, only variable Name will be kept and the others will be dropped.

library(dplyr)

location %>% distinct(Name, .keep_all = TRUE)

## # A tibble: 4 x 5
##   Timestamp       Name        Location  dup   dup2 
##   <chr>           <chr>       <chr>     <lgl> <lgl>
## 1 2020/1/29 9:02  John Doe    Shanghai  FALSE TRUE 
## 2 2020/1/28 22:52 Richard Roe Beijing   FALSE FALSE
## 3 2020/1/29 1:31  Jane Doe    Shanghai  FALSE TRUE 
## 4 2020/2/17 10:14 Joe Public  Abu Dhabi FALSE FALSE

We can include multiple variables in distinct() so that the combination of the group would be unique.

location %>% distinct(Name, Location, .keep_all = TRUE)

## # A tibble: 6 x 5
##   Timestamp       Name        Location  dup   dup2 
##   <chr>           <chr>       <chr>     <lgl> <lgl>
## 1 2020/1/29 9:02  John Doe    Shanghai  FALSE TRUE 
## 2 2020/2/25 14:12 John Doe    New York  TRUE  TRUE 
## 3 2020/1/28 22:52 Richard Roe Beijing   FALSE FALSE
## 4 2020/1/29 1:31  Jane Doe    Shanghai  FALSE TRUE 
## 5 2020/2/3 3:04   Jane Doe    Paris     TRUE  TRUE 
## 6 2020/2/17 10:14 Joe Public  Abu Dhabi FALSE FALSE

Duplicates

Yun Dai

05/2020

Finding duplicates

Finding positions of duplicates

Removing duplicates