In this post, we will go through the tasks we most frequently meet with in data transformation. We will take a very practical approach, at times even seemingly trivial, leaving the conceptual part aside for a moment.

We are going to show examples along the way with the aid of package dplyr as well as base R functions. dplyr provides fast and intuitive ways to manipulate date frames. This article offers a quick start of dplyr on data frame manipulation essentials.

Before we start, let’s load the package dplyr.

library(dplyr)

A side note on dplyr pipe operator %>% before we proceed.

The dplyr pipe operator %>% originally comes from package magrittr. In dplyr, %>% chains functions together, passing the output of the former function to the input of the next function. This way we don’t have to nest a lot of functions, which also makes the code more readable.

For instance, in the example below, step by step each new operation is performed on the output from the previous one. We first keep the rows where the variable resource has no missing cases; we then group the data by date and resource on the filtered subset; we next add a new column to count observations of the newly formed group; following that we rename the new variable counting the observations, and so forth.

session %>% 
  filter(!is.na(resource)) %>%
  group_by(date, resource) %>%
  add_tally(interval) %>%
  rename(sum_day = n)

Recoding variables


variable transformation

Variable transformation is the replacement of a variable by a new value or a function of that variable. The purposes can be multiple: convenience, reducing skewness, producing linear or additive relationships, and more. Standardisation and logarithmic transformation are common types that are familiar to us.

Below we are not discussing why we are performing a task, but simply showing how with a few examples for specific tasks.


<- to assign new values

One way we often use to recode variables is to use the assignment sign <-.

Let’s use the sample dataset storms from dplyr to show how we may recode the missing values.

head(dplyr::storms)
## # A tibble: 6 x 13
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

As we can see, there are many missing values in storms$ts_diameter.

summary(storms$ts_diameter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   69.05  138.09  166.76  241.66 1001.18    6528

We decide that we want to recode all NAs to 9999.

storm <- storms

storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   224.4  9999.0  6578.8  9999.0  9999.0

We can of couse recode the values 9999 back to NA.

storm$ts_diameter[storm$ts_diameter == 9999] <- NA
summary(storm$ts_diameter)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   69.05  138.09  166.76  241.66 1001.18    6528

recode() to replace values

recode() replaces numeric values based on their position, and replaces character values by their name. .default argument, if supplied, gives all cases not matched a new value. .missing argument replaces the missing values by the specified new value.

Recoding character values:

iris$species <- recode(iris$Species, setosa = "a",  versicolor = "b", virginica = "c")
table(iris$species)
## 
##  a  b  c 
## 50 50 50

Recoding numeric values:

mtcars$cyl2 <- recode(mtcars$cyl, `4` = 1, `6` = 2, `8` = 3)
table(mtcars$cyl2)
## 
##  1  2  3 
## 11  7 14

case_when() to recode on subsets

case_when() allows us to vectorize multiple if and else if statements.

For instance, below we reorganized the majors by grouping them to a higher category.

survey$new_major <- case_when(
  survey$major %in% c("Biology", "Chemistry", "Physics", "Neural Science") ~ "Science",
  survey$major %in% c("Computer Systems Engineering", "Electrical and Systems Engineering") ~ "CS & Engineering",
  survey$major %in% c("Business and Marketing", "Business and Finance", "Economics") ~ "Business, Finance & Economics",
  survey$major %in% "Mathematics" ~ "Mathematics",
  survey$major %in% NA ~ "Undefined"
)

mutate() to add new variables at the same time

mutate() adds new variables while recoding them.

mt <- mtcars %>% mutate(newvar = mpg * 2)
head(mt)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2 newvar
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2   42.0
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2   42.0
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    1   45.6
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2   42.8
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3   37.4
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    2   36.2

mutate() can operate on multiple variables.

mt <- mtcars %>% mutate(newvar1 = mpg * 2, newvar2 = cumsum(mpg))
head(mt)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2 newvar1 newvar2
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2    42.0    21.0
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2    42.0    42.0
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    1    45.6    64.8
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2    42.8    86.2
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3    37.4   104.9
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    2    36.2   123.0

mutate(ifelse()) to recode with conditions

mutate() can be used with functions. For instance, mutate() with ifelse() recodes variables conditionally.

Let’s say we want to create a new column, copying cases whose values are smaller than 3.

mutate(rank2 = ifelse(rank <= 3, rank, NA))

In another example, let’s say we have a dataset subsample.

subsample
##    Q1 Q2
## 1  NA  1
## 2   5 NA
## 3  NA  1
## 4  NA NA
## 5  NA NA
## 6  NA  1
## 7  NA NA
## 8  NA NA
## 9  NA  1
## 10  1 NA

When processing the data, we want to combine the information from two columns. The rule is that we only need the non-missing case from the two columns.

subsample %>% mutate(
  newvar = ifelse(is.na(Q1), Q2, NA),
  newvar = ifelse(is.na(Q2), Q1, newvar)
)
##    Q1 Q2 newvar
## 1  NA  1      1
## 2   5 NA      5
## 3  NA  1      1
## 4  NA NA     NA
## 5  NA NA     NA
## 6  NA  1      1
## 7  NA NA     NA
## 8  NA NA     NA
## 9  NA  1      1
## 10  1 NA      1

mutate(case_when())

case_when() can be nested within mutate().

mutate(rank2 = case_when(
    rank <= 3 ~ rank,
    rank > 3 & row_number() %% 2 == 0 & expd == 1 ~ 4,
    rank > 3 & row_number() %% 2 == 1 & expd == 1 ~ 5))

Adding / dropping columns and rows

There are several ways to add and drop columns and rows in a data frame. First, let’s look at the base R methods using the sample dataset mtcars.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    2

Adding a column:

mt <- mtcars
mt$new_col <- 2 * mt$cyl
head(mt)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    2
##                   new_col
## Mazda RX4              12
## Mazda RX4 Wag          12
## Datsun 710              8
## Hornet 4 Drive         12
## Hornet Sportabout      16
## Valiant                12

Removing a column:

mt$new_col <- NULL
head(mt)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4    2
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4    2
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1    2
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2    3
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1    2

Removing a row:

mt <- mtcars[-(1:5), ]
head(mt)
##             mpg cyl  disp  hp drat   wt  qsec vs am gear carb cyl2
## Valiant    18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1    2
## Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4    3
## Merc 240D  24.4   4 146.7  62 3.69 3.19 20.00  1  0    4    2    1
## Merc 230   22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2    1
## Merc 280   19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4    2
## Merc 280C  17.8   6 167.6 123 3.92 3.44 18.90  1  0    4    4    2

select()

dplyr offers several functions that handle adding and dropping columns and rows efficiently.

select() keeps the selected variables.

mt <- mtcars %>% select(mpg, cyl) 
head(mt)
##                    mpg cyl
## Mazda RX4         21.0   6
## Mazda RX4 Wag     21.0   6
## Datsun 710        22.8   4
## Hornet 4 Drive    21.4   6
## Hornet Sportabout 18.7   8
## Valiant           18.1   6

We can rename the variables while we select the variables.

mt <- mtcars %>% select(mpg2 = mpg, cyl2 = cyl) 
head(mt)
##                   mpg2 cyl2
## Mazda RX4         21.0    6
## Mazda RX4 Wag     21.0    6
## Datsun 710        22.8    4
## Hornet 4 Drive    21.4    6
## Hornet Sportabout 18.7    8
## Valiant           18.1    6

filter()

filter() filters rows, keeping rows where given conditions are true.

Here we use the sample dataset iris.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species species
## 1          5.1         3.5          1.4         0.2  setosa       a
## 2          4.9         3.0          1.4         0.2  setosa       a
## 3          4.7         3.2          1.3         0.2  setosa       a
## 4          4.6         3.1          1.5         0.2  setosa       a
## 5          5.0         3.6          1.4         0.2  setosa       a
## 6          5.4         3.9          1.7         0.4  setosa       a

Below we filter the rows to get a subset where Species is all versicolor and Petal.Width is smaller than 1.2

filter(iris, Species == "versicolor" & Petal.Width < 1.2)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species species
## 1           4.9         2.4          3.3         1.0 versicolor       b
## 2           5.0         2.0          3.5         1.0 versicolor       b
## 3           6.0         2.2          4.0         1.0 versicolor       b
## 4           5.8         2.7          4.1         1.0 versicolor       b
## 5           5.6         2.5          3.9         1.1 versicolor       b
## 6           5.7         2.6          3.5         1.0 versicolor       b
## 7           5.5         2.4          3.8         1.1 versicolor       b
## 8           5.5         2.4          3.7         1.0 versicolor       b
## 9           5.0         2.3          3.3         1.0 versicolor       b
## 10          5.1         2.5          3.0         1.1 versicolor       b

This would be the same as

iris %>% filter(Species == "versicolor" & Petal.Width < 1.2)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species species
## 1           4.9         2.4          3.3         1.0 versicolor       b
## 2           5.0         2.0          3.5         1.0 versicolor       b
## 3           6.0         2.2          4.0         1.0 versicolor       b
## 4           5.8         2.7          4.1         1.0 versicolor       b
## 5           5.6         2.5          3.9         1.1 versicolor       b
## 6           5.7         2.6          3.5         1.0 versicolor       b
## 7           5.5         2.4          3.8         1.1 versicolor       b
## 8           5.5         2.4          3.7         1.0 versicolor       b
## 9           5.0         2.3          3.3         1.0 versicolor       b
## 10          5.1         2.5          3.0         1.1 versicolor       b

Renaming variables


names()

In base R, we can rename variables using function names().

mt <- mtcars
names(mt)[1] <- "var"
names(mt)
##  [1] "var"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb" "cyl2"

Renaming more than one variable:

names(mt)[1:3] <- c("a", "b", "c")
names(mt)
##  [1] "a"    "b"    "c"    "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb" "cyl2"

Renaming all:

names(mt)[1:11] <- letters[1:11]
names(mt)
##  [1] "a"    "b"    "c"    "d"    "e"    "f"    "g"    "h"    "i"    "j"   
## [11] "k"    "cyl2"

rename()

We can also use rename() from dplyr, which renames the variables without dropping variales. If we use select() to rename variables, the unmentioned variables will be dropped.

mt <- mtcars %>% rename(mpg2 = mpg, cyl2 = cyl) 
head(mt)
##                   mpg2 cyl2 disp  hp drat    wt  qsec vs am gear carb cyl2
## Mazda RX4         21.0    6  160 110 3.90 2.620 16.46  0  1    4    4    2
## Mazda RX4 Wag     21.0    6  160 110 3.90 2.875 17.02  0  1    4    4    2
## Datsun 710        22.8    4  108  93 3.85 2.320 18.61  1  1    4    1    1
## Hornet 4 Drive    21.4    6  258 110 3.08 3.215 19.44  1  0    3    1    2
## Hornet Sportabout 18.7    8  360 175 3.15 3.440 17.02  0  0    3    2    3
## Valiant           18.1    6  225 105 2.76 3.460 20.22  1  0    3    1    2

Compare it with how select() works:

mt <- mtcars %>% select(mpg2 = mpg, cyl2 = cyl) 
head(mt)
##                   mpg2 cyl2
## Mazda RX4         21.0    6
## Mazda RX4 Wag     21.0    6
## Datsun 710        22.8    4
## Hornet 4 Drive    21.4    6
## Hornet Sportabout 18.7    8
## Valiant           18.1    6

Arranging rows

arrange() from package dplyr sorts variables in ascending order, and desc() sorts variables in descending order.

This example below sorts mtcars by variable mpg in ascending order.

df <- arrange(mtcars, mpg)
head(df)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4    3
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4    3
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4    3
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4    3
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4    3
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8    3

arrange() can sort multiple variables in the order specified. The example below sorts df first by mpg, then cyl, and finally disp.

df <- arrange(mtcars, mpg, cyl, disp)
head(df)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## 1 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4    3
## 2 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4    3
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4    3
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4    3
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4    3
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8    3

This example sorts df first by mpg and cyl in ascending order, and then disp in descending order.

df <- arrange(mtcars, mpg, cyl, -disp)
head(df)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4    3
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4    3
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4    3
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4    3
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4    3
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8    3

Same as above, except that we use function desc() to sort rows in descending order.

df <- arrange(mtcars, mpg, cyl, desc(disp))
head(df)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb cyl2
## 1 10.4   8  472 205 2.93 5.250 17.98  0  0    3    4    3
## 2 10.4   8  460 215 3.00 5.424 17.82  0  0    3    4    3
## 3 13.3   8  350 245 3.73 3.840 15.41  0  0    3    4    3
## 4 14.3   8  360 245 3.21 3.570 15.84  0  0    3    4    3
## 5 14.7   8  440 230 3.23 5.345 17.42  0  0    3    4    3
## 6 15.0   8  301 335 3.54 3.570 14.60  0  1    5    8    3

Reordering columns

We can move columns with select(var, everything()).

new_col <- mtcars %>% select(carb, gear, everything())
head(new_col)
##                   carb gear  mpg cyl disp  hp drat    wt  qsec vs am cyl2
## Mazda RX4            4    4 21.0   6  160 110 3.90 2.620 16.46  0  1    2
## Mazda RX4 Wag        4    4 21.0   6  160 110 3.90 2.875 17.02  0  1    2
## Datsun 710           1    4 22.8   4  108  93 3.85 2.320 18.61  1  1    1
## Hornet 4 Drive       1    3 21.4   6  258 110 3.08 3.215 19.44  1  0    2
## Hornet Sportabout    2    3 18.7   8  360 175 3.15 3.440 17.02  0  0    3
## Valiant              1    3 18.1   6  225 105 2.76 3.460 20.22  1  0    2

Or move a column the other way:

new_col <- mtcars %>% select(-mpg, everything())
head(new_col)
##                   cyl disp  hp drat    wt  qsec vs am gear carb cyl2  mpg
## Mazda RX4           6  160 110 3.90 2.620 16.46  0  1    4    4    2 21.0
## Mazda RX4 Wag       6  160 110 3.90 2.875 17.02  0  1    4    4    2 21.0
## Datsun 710          4  108  93 3.85 2.320 18.61  1  1    4    1    1 22.8
## Hornet 4 Drive      6  258 110 3.08 3.215 19.44  1  0    3    1    2 21.4
## Hornet Sportabout   8  360 175 3.15 3.440 17.02  0  0    3    2    3 18.7
## Valiant             6  225 105 2.76 3.460 20.22  1  0    3    1    2 18.1

Grouping

group_by() from dplyr groups a data frame by the specified variable. ungroup() removes the grouping.

storm <- dplyr::storms %>% group_by(status)
head(storm)
## # A tibble: 6 x 13
## # Groups:   status [1]
##   name   year month   day  hour   lat  long status category  wind pressure
##   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <int>    <int>
## 1 Amy    1975     6    27     0  27.5 -79   tropi… -1          25     1013
## 2 Amy    1975     6    27     6  28.5 -79   tropi… -1          25     1013
## 3 Amy    1975     6    27    12  29.5 -79   tropi… -1          25     1013
## 4 Amy    1975     6    27    18  30.5 -79   tropi… -1          25     1013
## 5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1          25     1012
## 6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1          25     1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>

group_by() is most useful to use with summarise(), filter(), mutate(), and arrange().

storm <- dplyr::storms %>% filter(!is.na(ts_diameter)) %>% group_by(year) %>% summarise(mean = mean(ts_diameter))
head(storm)
## # A tibble: 6 x 2
##    year  mean
##   <dbl> <dbl>
## 1  2004  164.
## 2  2005  165.
## 3  2006  136.
## 4  2007  128.
## 5  2008  175.
## 6  2009  174.

Removing duplicates

distinct() from dplyr selects unique rows from a data frame.

distinct(mtcars)

In the example below, duplicates are removed and all other variables are kept where cyl has no duplicates. Only the first cases of duplicated cyl have been kept.

distinct(mtcars, cyl, .keep_all = TRUE)
##    mpg cyl disp  hp drat   wt  qsec vs am gear carb cyl2
## 1 21.0   6  160 110 3.90 2.62 16.46  0  1    4    4    2
## 2 22.8   4  108  93 3.85 2.32 18.61  1  1    4    1    1
## 3 18.7   8  360 175 3.15 3.44 17.02  0  0    3    2    3

Without argument .keep_all, the other variables will be dropped.

distinct(mtcars, cyl)
##   cyl
## 1   6
## 2   4
## 3   8

distinct() allows for multiple variables to be included.

distinct(mtcars, cyl, gear, .keep_all = TRUE)
##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb cyl2
## 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4    2
## 2 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    1
## 3 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    2
## 4 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    3
## 5 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    1
## 6 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2    1
## 7 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4    3
## 8 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6    2

unique() from base R can also extract unique elements, but distinct() offers more flexibility.