In this post, we will go through the tasks we most frequently meet with in data transformation. We will take a very practical approach, at times even seemingly trivial, leaving the conceptual part aside for a moment.
We are going to show examples along the way with the aid of package dplyr
as well as base R functions. dplyr
provides fast and intuitive ways to manipulate date frames. This article offers a quick start of dplyr
on data frame manipulation essentials.
Before we start, let’s load the package dplyr
.
library(dplyr)
A side note on dplyr
pipe operator %>%
before we proceed.
The dplyr
pipe operator %>%
originally comes from package magrittr
. In dplyr
, %>%
chains functions together, passing the output of the former function to the input of the next function. This way we don’t have to nest a lot of functions, which also makes the code more readable.
For instance, in the example below, step by step each new operation is performed on the output from the previous one. We first keep the rows where the variable resource
has no missing cases; we then group the data by date
and resource
on the filtered subset; we next add a new column to count observations of the newly formed group; following that we rename the new variable counting the observations, and so forth.
session %>%
filter(!is.na(resource)) %>%
group_by(date, resource) %>%
add_tally(interval) %>%
rename(sum_day = n)
Variable transformation is the replacement of a variable by a new value or a function of that variable. The purposes can be multiple: convenience, reducing skewness, producing linear or additive relationships, and more. Standardisation and logarithmic transformation are common types that are familiar to us.
Below we are not discussing why we are performing a task, but simply showing how with a few examples for specific tasks.
<-
to assign new valuesOne way we often use to recode variables is to use the assignment sign <-
.
Let’s use the sample dataset storms
from dplyr
to show how we may recode the missing values.
head(dplyr::storms)
## # A tibble: 6 x 13
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
As we can see, there are many missing values in storms$ts_diameter
.
summary(storms$ts_diameter)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 69.05 138.09 166.76 241.66 1001.18 6528
We decide that we want to recode all NA
s to 9999.
storm <- storms
storm$ts_diameter[is.na(storm$ts_diameter)] <- 9999
summary(storm$ts_diameter)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 224.4 9999.0 6578.8 9999.0 9999.0
We can of couse recode the values 9999 back to NA
.
storm$ts_diameter[storm$ts_diameter == 9999] <- NA
summary(storm$ts_diameter)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 69.05 138.09 166.76 241.66 1001.18 6528
recode()
to replace valuesrecode()
replaces numeric values based on their position, and replaces character values by their name. .default
argument, if supplied, gives all cases not matched a new value. .missing
argument replaces the missing values by the specified new value.
Recoding character values:
iris$species <- recode(iris$Species, setosa = "a", versicolor = "b", virginica = "c")
table(iris$species)
##
## a b c
## 50 50 50
Recoding numeric values:
mtcars$cyl2 <- recode(mtcars$cyl, `4` = 1, `6` = 2, `8` = 3)
table(mtcars$cyl2)
##
## 1 2 3
## 11 7 14
case_when()
to recode on subsetscase_when()
allows us to vectorize multiple if
and else if
statements.
For instance, below we reorganized the majors by grouping them to a higher category.
survey$new_major <- case_when(
survey$major %in% c("Biology", "Chemistry", "Physics", "Neural Science") ~ "Science",
survey$major %in% c("Computer Systems Engineering", "Electrical and Systems Engineering") ~ "CS & Engineering",
survey$major %in% c("Business and Marketing", "Business and Finance", "Economics") ~ "Business, Finance & Economics",
survey$major %in% "Mathematics" ~ "Mathematics",
survey$major %in% NA ~ "Undefined"
)
mutate()
to add new variables at the same timemutate()
adds new variables while recoding them.
mt <- mtcars %>% mutate(newvar = mpg * 2)
head(mt)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2 newvar
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2 42.0
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2 42.0
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 45.6
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2 42.8
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3 37.4
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2 36.2
mutate()
can operate on multiple variables.
mt <- mtcars %>% mutate(newvar1 = mpg * 2, newvar2 = cumsum(mpg))
head(mt)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2 newvar1 newvar2
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2 42.0 21.0
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2 42.0 42.0
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1 45.6 64.8
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2 42.8 86.2
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3 37.4 104.9
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2 36.2 123.0
mutate(ifelse())
to recode with conditionsmutate()
can be used with functions. For instance, mutate()
with ifelse()
recodes variables conditionally.
Let’s say we want to create a new column, copying cases whose values are smaller than 3.
mutate(rank2 = ifelse(rank <= 3, rank, NA))
In another example, let’s say we have a dataset subsample
.
subsample
## Q1 Q2
## 1 NA 1
## 2 5 NA
## 3 NA 1
## 4 NA NA
## 5 NA NA
## 6 NA 1
## 7 NA NA
## 8 NA NA
## 9 NA 1
## 10 1 NA
When processing the data, we want to combine the information from two columns. The rule is that we only need the non-missing case from the two columns.
subsample %>% mutate(
newvar = ifelse(is.na(Q1), Q2, NA),
newvar = ifelse(is.na(Q2), Q1, newvar)
)
## Q1 Q2 newvar
## 1 NA 1 1
## 2 5 NA 5
## 3 NA 1 1
## 4 NA NA NA
## 5 NA NA NA
## 6 NA 1 1
## 7 NA NA NA
## 8 NA NA NA
## 9 NA 1 1
## 10 1 NA 1
mutate(case_when())
case_when()
can be nested within mutate()
.
mutate(rank2 = case_when(
rank <= 3 ~ rank,
rank > 3 & row_number() %% 2 == 0 & expd == 1 ~ 4,
rank > 3 & row_number() %% 2 == 1 & expd == 1 ~ 5))
There are several ways to add and drop columns and rows in a data frame. First, let’s look at the base R methods using the sample dataset mtcars
.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2
Adding a column:
mt <- mtcars
mt$new_col <- 2 * mt$cyl
head(mt)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2
## new_col
## Mazda RX4 12
## Mazda RX4 Wag 12
## Datsun 710 8
## Hornet 4 Drive 12
## Hornet Sportabout 16
## Valiant 12
Removing a column:
mt$new_col <- NULL
head(mt)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2
Removing a row:
mt <- mtcars[-(1:5), ]
head(mt)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## Valiant 18.1 6 225.0 105 2.76 3.46 20.22 1 0 3 1 2
## Duster 360 14.3 8 360.0 245 3.21 3.57 15.84 0 0 3 4 3
## Merc 240D 24.4 4 146.7 62 3.69 3.19 20.00 1 0 4 2 1
## Merc 230 22.8 4 140.8 95 3.92 3.15 22.90 1 0 4 2 1
## Merc 280 19.2 6 167.6 123 3.92 3.44 18.30 1 0 4 4 2
## Merc 280C 17.8 6 167.6 123 3.92 3.44 18.90 1 0 4 4 2
select()
dplyr
offers several functions that handle adding and dropping columns and rows efficiently.
select()
keeps the selected variables.
mt <- mtcars %>% select(mpg, cyl)
head(mt)
## mpg cyl
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
We can rename the variables while we select the variables.
mt <- mtcars %>% select(mpg2 = mpg, cyl2 = cyl)
head(mt)
## mpg2 cyl2
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
filter()
filter()
filters rows, keeping rows where given conditions are true.
Here we use the sample dataset iris
.
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species species
## 1 5.1 3.5 1.4 0.2 setosa a
## 2 4.9 3.0 1.4 0.2 setosa a
## 3 4.7 3.2 1.3 0.2 setosa a
## 4 4.6 3.1 1.5 0.2 setosa a
## 5 5.0 3.6 1.4 0.2 setosa a
## 6 5.4 3.9 1.7 0.4 setosa a
Below we filter the rows to get a subset where Species
is all versicolor
and Petal.Width
is smaller than 1.2
filter(iris, Species == "versicolor" & Petal.Width < 1.2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species species
## 1 4.9 2.4 3.3 1.0 versicolor b
## 2 5.0 2.0 3.5 1.0 versicolor b
## 3 6.0 2.2 4.0 1.0 versicolor b
## 4 5.8 2.7 4.1 1.0 versicolor b
## 5 5.6 2.5 3.9 1.1 versicolor b
## 6 5.7 2.6 3.5 1.0 versicolor b
## 7 5.5 2.4 3.8 1.1 versicolor b
## 8 5.5 2.4 3.7 1.0 versicolor b
## 9 5.0 2.3 3.3 1.0 versicolor b
## 10 5.1 2.5 3.0 1.1 versicolor b
This would be the same as
iris %>% filter(Species == "versicolor" & Petal.Width < 1.2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species species
## 1 4.9 2.4 3.3 1.0 versicolor b
## 2 5.0 2.0 3.5 1.0 versicolor b
## 3 6.0 2.2 4.0 1.0 versicolor b
## 4 5.8 2.7 4.1 1.0 versicolor b
## 5 5.6 2.5 3.9 1.1 versicolor b
## 6 5.7 2.6 3.5 1.0 versicolor b
## 7 5.5 2.4 3.8 1.1 versicolor b
## 8 5.5 2.4 3.7 1.0 versicolor b
## 9 5.0 2.3 3.3 1.0 versicolor b
## 10 5.1 2.5 3.0 1.1 versicolor b
names()
In base R, we can rename variables using function names()
.
mt <- mtcars
names(mt)[1] <- "var"
names(mt)
## [1] "var" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb" "cyl2"
Renaming more than one variable:
names(mt)[1:3] <- c("a", "b", "c")
names(mt)
## [1] "a" "b" "c" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb" "cyl2"
Renaming all:
names(mt)[1:11] <- letters[1:11]
names(mt)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## [11] "k" "cyl2"
rename()
We can also use rename()
from dplyr
, which renames the variables without dropping variales. If we use select()
to rename variables, the unmentioned variables will be dropped.
mt <- mtcars %>% rename(mpg2 = mpg, cyl2 = cyl)
head(mt)
## mpg2 cyl2 disp hp drat wt qsec vs am gear carb cyl2
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 2
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 2
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 2
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 3
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2
Compare it with how select()
works:
mt <- mtcars %>% select(mpg2 = mpg, cyl2 = cyl)
head(mt)
## mpg2 cyl2
## Mazda RX4 21.0 6
## Mazda RX4 Wag 21.0 6
## Datsun 710 22.8 4
## Hornet 4 Drive 21.4 6
## Hornet Sportabout 18.7 8
## Valiant 18.1 6
arrange()
from package dplyr
sorts variables in ascending order, and desc()
sorts variables in descending order.
This example below sorts mtcars
by variable mpg
in ascending order.
df <- arrange(mtcars, mpg)
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 3
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 3
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
## 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
## 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 3
## 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 3
arrange()
can sort multiple variables in the order specified. The example below sorts df
first by mpg
, then cyl
, and finally disp
.
df <- arrange(mtcars, mpg, cyl, disp)
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 3
## 2 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 3
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
## 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
## 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 3
## 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 3
This example sorts df
first by mpg
and cyl
in ascending order, and then disp
in descending order.
df <- arrange(mtcars, mpg, cyl, -disp)
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 3
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 3
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
## 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
## 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 3
## 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 3
Same as above, except that we use function desc()
to sort rows in descending order.
df <- arrange(mtcars, mpg, cyl, desc(disp))
head(df)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 3
## 2 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 3
## 3 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 3
## 4 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 3
## 5 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 3
## 6 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8 3
We can move columns with select(var, everything())
.
new_col <- mtcars %>% select(carb, gear, everything())
head(new_col)
## carb gear mpg cyl disp hp drat wt qsec vs am cyl2
## Mazda RX4 4 4 21.0 6 160 110 3.90 2.620 16.46 0 1 2
## Mazda RX4 Wag 4 4 21.0 6 160 110 3.90 2.875 17.02 0 1 2
## Datsun 710 1 4 22.8 4 108 93 3.85 2.320 18.61 1 1 1
## Hornet 4 Drive 1 3 21.4 6 258 110 3.08 3.215 19.44 1 0 2
## Hornet Sportabout 2 3 18.7 8 360 175 3.15 3.440 17.02 0 0 3
## Valiant 1 3 18.1 6 225 105 2.76 3.460 20.22 1 0 2
Or move a column the other way:
new_col <- mtcars %>% select(-mpg, everything())
head(new_col)
## cyl disp hp drat wt qsec vs am gear carb cyl2 mpg
## Mazda RX4 6 160 110 3.90 2.620 16.46 0 1 4 4 2 21.0
## Mazda RX4 Wag 6 160 110 3.90 2.875 17.02 0 1 4 4 2 21.0
## Datsun 710 4 108 93 3.85 2.320 18.61 1 1 4 1 1 22.8
## Hornet 4 Drive 6 258 110 3.08 3.215 19.44 1 0 3 1 2 21.4
## Hornet Sportabout 8 360 175 3.15 3.440 17.02 0 0 3 2 3 18.7
## Valiant 6 225 105 2.76 3.460 20.22 1 0 3 1 2 18.1
group_by()
from dplyr
groups a data frame by the specified variable. ungroup()
removes the grouping.
storm <- dplyr::storms %>% group_by(status)
head(storm)
## # A tibble: 6 x 13
## # Groups: status [1]
## name year month day hour lat long status category wind pressure
## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int>
## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013
## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013
## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013
## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013
## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012
## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012
## # ... with 2 more variables: ts_diameter <dbl>, hu_diameter <dbl>
group_by()
is most useful to use with summarise()
, filter()
, mutate()
, and arrange()
.
storm <- dplyr::storms %>% filter(!is.na(ts_diameter)) %>% group_by(year) %>% summarise(mean = mean(ts_diameter))
head(storm)
## # A tibble: 6 x 2
## year mean
## <dbl> <dbl>
## 1 2004 164.
## 2 2005 165.
## 3 2006 136.
## 4 2007 128.
## 5 2008 175.
## 6 2009 174.
distinct()
from dplyr
selects unique rows from a data frame.
distinct(mtcars)
In the example below, duplicates are removed and all other variables are kept where cyl
has no duplicates. Only the first cases of duplicated cyl
have been kept.
distinct(mtcars, cyl, .keep_all = TRUE)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 21.0 6 160 110 3.90 2.62 16.46 0 1 4 4 2
## 2 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1 1
## 3 18.7 8 360 175 3.15 3.44 17.02 0 0 3 2 3
Without argument .keep_all
, the other variables will be dropped.
distinct(mtcars, cyl)
## cyl
## 1 6
## 2 4
## 3 8
distinct()
allows for multiple variables to be included.
distinct(mtcars, cyl, gear, .keep_all = TRUE)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl2
## 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 2
## 2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 1
## 3 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 2
## 4 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 3
## 5 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 1
## 6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 1
## 7 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 3
## 8 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 2
unique()
from base R can also extract unique elements, but distinct()
offers more flexibility.