In this post, we demonstrate two ways of aggregating and summarizing data: with package dplyr
and with package data.table
.
First, let’s load the sample dataset session
, which has recorded how much time a user has spent on each resource type on which device type on an online course platform.
head(session)
## # A tibble: 6 x 5
## id date resource device interval
## <int> <dttm> <fct> <fct> <dbl>
## 1 1 2016-01-24 00:00:00 Enrollment desktop 0
## 2 1 2016-01-24 00:00:00 Enrollment desktop 0.198
## 3 1 2016-01-24 00:00:00 Enrollment desktop 0.224
## 4 1 2016-01-24 00:00:00 Enrollment desktop 11.2
## 5 1 2016-01-24 00:00:00 Enrollment desktop 0.229
## 6 1 2016-01-24 00:00:00 Enrollment desktop 3.77
summary(session$resource)
## Cover Do Enrollment Exam Listen Proposal
## 8850 3586 288 102 311 30
## Read Watch Welcome NA's
## 2663 36592 3452 154
summary(session$device)
## desktop mobile
## 54385 1643
summary(session$interval)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0926 0.4026 25.5089 2.2425 1793.1289
dplyr
functionsThere are several functions in dplyr
that aggregate data for summaries: tally()
, count()
, and summarise()
. Below we explain how each one works with examples.
library(dplyr)
tally()
tally()
calls n()
to count the number of cases or calls sum()
to calculate the total amount.
tally(n())
Below we count the number of cases by the resource type. Note that n()
does not take arguments.
session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(n())
## # A tibble: 9 x 2
## resource n
## <fct> <int>
## 1 Cover 8850
## 2 Do 3586
## 3 Enrollment 288
## 4 Exam 102
## 5 Listen 311
## 6 Proposal 30
## 7 Read 2663
## 8 Watch 36592
## 9 Welcome 3452
tally(sum())
tally(sum())
calculates the sum by the specified group.
Below we sum the time spent on each resource type in minutes.
session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(sum(interval) / 60)
## # A tibble: 9 x 2
## resource n
## <fct> <dbl>
## 1 Cover 2302.
## 2 Do 3841.
## 3 Enrollment 9.54
## 4 Exam 102.
## 5 Listen 131.
## 6 Proposal 3.34
## 7 Read 2553.
## 8 Watch 12978.
## 9 Welcome 1857.
add_tally()
add_tally()
is short-hand for mutate()
and adds a column named n
to the data frame.
Below column n
is the sum of the time spent on each resource
type by date
and resource
.
session %>% filter(!is.na(resource)) %>% group_by(date, resource) %>% add_tally(sum(interval) / 60)
## # A tibble: 55,874 x 6
## # Groups: date, resource [346]
## id date resource device interval n
## <int> <dttm> <fct> <fct> <dbl> <dbl>
## 1 1 2016-01-24 00:00:00 Enrollment desktop 0 4.12
## 2 1 2016-01-24 00:00:00 Enrollment desktop 0.198 4.12
## 3 1 2016-01-24 00:00:00 Enrollment desktop 0.224 4.12
## 4 1 2016-01-24 00:00:00 Enrollment desktop 11.2 4.12
## 5 1 2016-01-24 00:00:00 Enrollment desktop 0.229 4.12
## 6 1 2016-01-24 00:00:00 Enrollment desktop 3.77 4.12
## 7 1 2016-01-24 00:00:00 Enrollment desktop 0.454 4.12
## 8 1 2016-01-24 00:00:00 Enrollment desktop 1.77 4.12
## 9 1 2016-01-24 00:00:00 Enrollment desktop 0.0379 4.12
## 10 1 2016-01-24 00:00:00 Watch desktop 0.185 31.6
## # … with 55,864 more rows
count()
count()
is similar to tally()
, but calls group_by()
before and ungroup()
after. If the data is already grouped, count()
adds an additional group that is removed afterwards.
Here, using count()
to get the number of cases by each resource type returns the same result as with tally()
above. With count(resource)
, data is grouped by resource
.
session %>% filter(!is.na(resource)) %>% count(resource)
## # A tibble: 9 x 2
## resource n
## <fct> <int>
## 1 Cover 8850
## 2 Do 3586
## 3 Enrollment 288
## 4 Exam 102
## 5 Listen 311
## 6 Proposal 30
## 7 Read 2663
## 8 Watch 36592
## 9 Welcome 3452
In the example below, we count the number of cases grouped by both id
and resource
.
session %>% filter(!is.na(resource)) %>% count(id, resource)
## # A tibble: 285 x 3
## id resource n
## <int> <fct> <int>
## 1 1 Do 54
## 2 1 Enrollment 9
## 3 1 Listen 8
## 4 1 Read 28
## 5 1 Watch 358
## 6 1 Welcome 35
## 7 2 Cover 110
## 8 2 Do 42
## 9 2 Enrollment 9
## 10 2 Listen 9
## # … with 275 more rows
add_count()
add_count()
is a short-hand for group_by()
+ add_tally()
.
Below column n
is the count of resource
by date
and resource
.
session %>% filter(!is.na(resource)) %>% add_count(date, resource)
## # A tibble: 55,874 x 6
## id date resource device interval n
## <int> <dttm> <fct> <fct> <dbl> <int>
## 1 1 2016-01-24 00:00:00 Enrollment desktop 0 94
## 2 1 2016-01-24 00:00:00 Enrollment desktop 0.198 94
## 3 1 2016-01-24 00:00:00 Enrollment desktop 0.224 94
## 4 1 2016-01-24 00:00:00 Enrollment desktop 11.2 94
## 5 1 2016-01-24 00:00:00 Enrollment desktop 0.229 94
## 6 1 2016-01-24 00:00:00 Enrollment desktop 3.77 94
## 7 1 2016-01-24 00:00:00 Enrollment desktop 0.454 94
## 8 1 2016-01-24 00:00:00 Enrollment desktop 1.77 94
## 9 1 2016-01-24 00:00:00 Enrollment desktop 0.0379 94
## 10 1 2016-01-24 00:00:00 Watch desktop 0.185 17
## # … with 55,864 more rows
summarise()
summarise()
reduces multiple values down to a single value. The summary functions include measures for center (mean()
, median()
), spread (sd()
, IQR()
, mad()
), range (min()
, max()
, quantile()
), position (first()
, last()
, nth()
), count (n()
, n_distinct()
) and logical evaluations (any()
, all()
) of data.
Together with group_by()
, summarise()
returns summaries by the specified group. Without grouping, summarise()
returns a single row.
Below we calculate the mean of the total time (in minutes) spent on each resource
by date
and resource
.
session %>%
filter(!is.na(resource)) %>%
group_by(date, resource) %>%
tally(sum(interval)) %>%
group_by(resource) %>%
summarise(mean(n) / 60)
## # A tibble: 9 x 2
## resource `mean(n)/60`
## <fct> <dbl>
## 1 Cover 36.5
## 2 Do 61.0
## 3 Enrollment 0.954
## 4 Exam 11.3
## 5 Listen 8.75
## 6 Proposal 1.11
## 7 Read 41.9
## 8 Watch 224.
## 9 Welcome 29.0
Without grouping data, below we get a single row, the sum of the time spent on all resources.
session %>%
filter(!is.na(resource)) %>%
tally(sum(interval / 60))
## # A tibble: 1 x 1
## n
## <dbl>
## 1 23777.
data.table
functionsdata.table
offers efficient methods to aggregate and summarize data with the data.table
syntax DT[i, j, by]
. i
indicates on which rows; j
indicates what to do; by
indicates being grouped by what.
library(data.table)
session <- data.table(session)
Below we first count the number of cases by the resource type.
.N
is a data.table
special-symbol. It is an integer containing the number of rows in the group, and is renamed to N
in the result. .N
may be useful when the column names are not known in advance.
session[!is.na(resource), .N, by = resource]
## resource N
## 1: Enrollment 288
## 2: Watch 36592
## 3: Welcome 3452
## 4: Read 2663
## 5: Do 3586
## 6: Listen 311
## 7: Cover 8850
## 8: Exam 102
## 9: Proposal 30
It is equivalent to the dplyr
way of
session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(n())
Below we sum the time spent on each resource type in minutes.
session[!is.na(resource), sum(interval)/60, by = resource]
## resource V1
## 1: Enrollment 9.536455
## 2: Watch 12978.392194
## 3: Welcome 1857.182209
## 4: Read 2552.938363
## 5: Do 3841.031722
## 6: Listen 131.176714
## 7: Cover 2301.627118
## 8: Exam 102.057999
## 9: Proposal 3.340203
It is equivalent to the dplyr
way of
session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(sum(interval) / 60)
Like with dplyr::add_tally()
, we can create a new column with :=
in the data table that contains the summaries instead of aggregating the data table.
Below column sum
is the sum of the time spent on each resource
type by date
and resource
.
Note that the data.table
method keeps the missing cases of resource
in the result, while the dplyr
way excludes the rows where resource
is NA
.
session[!is.na(resource), sum := sum(interval)/60, by = c("date", "resource")]
session
## id date resource device interval sum
## 1: 1 2016-01-24 Enrollment desktop 0.0000000 4.1187860
## 2: 1 2016-01-24 Enrollment desktop 0.1976628 4.1187860
## 3: 1 2016-01-24 Enrollment desktop 0.2244470 4.1187860
## 4: 1 2016-01-24 Enrollment desktop 11.2180660 4.1187860
## 5: 1 2016-01-24 Enrollment desktop 0.2291820 4.1187860
## ---
## 56024: 37 2016-04-02 Cover desktop 0.7414949 0.5713272
## 56025: 37 2016-04-02 Proposal desktop 0.1176212 0.2384181
## 56026: 37 2016-04-02 Proposal desktop 3.6968849 0.2384181
## 56027: 37 2016-04-02 Proposal desktop 0.4945791 0.2384181
## 56028: 37 2016-04-02 <NA> desktop 0.1196060 NA
It is equivalent to the dplyr
way of
session %>% filter(!is.na(resource)) %>% group_by(date, resource) %>% add_tally(sum(interval) / 60)
Below we calculate the mean of the total time (in minutes) spent on each resource
by date
and resource
.
s <- session[!is.na(resource), sum(interval), by = c("date", "resource")]
s[, mean(V1)/60, by = resource]
## resource V1
## 1: Enrollment 0.9536455
## 2: Watch 223.7653827
## 3: Welcome 29.0184720
## 4: Read 41.8514486
## 5: Do 60.9687575
## 6: Listen 8.7451143
## 7: Cover 36.5337638
## 8: Exam 11.3397777
## 9: Proposal 1.1134009
It is equivalent to the dplyr
way of
session %>%
filter(!is.na(resource)) %>%
group_by(date, resource) %>%
tally(sum(interval)) %>%
group_by(resource) %>%
summarise(mean(n)/60)