Aggregating Data to Get Summaries

In this post, we demonstrate two ways of aggregating and summarizing data: with package dplyr and with package data.table.

First, let’s load the sample dataset session, which has recorded how much time a user has spent on each resource type on which device type on an online course platform.

head(session)

## # A tibble: 6 x 5
##      id date                resource   device  interval
##   <int> <dttm>              <fct>      <fct>      <dbl>
## 1     1 2016-01-24 00:00:00 Enrollment desktop    0    
## 2     1 2016-01-24 00:00:00 Enrollment desktop    0.198
## 3     1 2016-01-24 00:00:00 Enrollment desktop    0.224
## 4     1 2016-01-24 00:00:00 Enrollment desktop   11.2  
## 5     1 2016-01-24 00:00:00 Enrollment desktop    0.229
## 6     1 2016-01-24 00:00:00 Enrollment desktop    3.77

summary(session$resource)

##      Cover         Do Enrollment       Exam     Listen   Proposal 
##       8850       3586        288        102        311         30 
##       Read      Watch    Welcome       NA's 
##       2663      36592       3452        154

summary(session$device)

## desktop  mobile 
##   54385    1643

summary(session$interval)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0926    0.4026   25.5089    2.2425 1793.1289

`dplyr` functions

There are several functions in dplyr that aggregate data for summaries: tally(), count(), and summarise(). Below we explain how each one works with examples.

library(dplyr)

`tally()`

tally() calls n() to count the number of cases or calls sum() to calculate the total amount.

`tally(n())`

Below we count the number of cases by the resource type. Note that n() does not take arguments.

session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(n())

## # A tibble: 9 x 2
##   resource       n
##   <fct>      <int>
## 1 Cover       8850
## 2 Do          3586
## 3 Enrollment   288
## 4 Exam         102
## 5 Listen       311
## 6 Proposal      30
## 7 Read        2663
## 8 Watch      36592
## 9 Welcome     3452

`tally(sum())`

tally(sum()) calculates the sum by the specified group.

Below we sum the time spent on each resource type in minutes.

session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(sum(interval) / 60)

## # A tibble: 9 x 2
##   resource          n
##   <fct>         <dbl>
## 1 Cover       2302.  
## 2 Do          3841.  
## 3 Enrollment     9.54
## 4 Exam         102.  
## 5 Listen       131.  
## 6 Proposal       3.34
## 7 Read        2553.  
## 8 Watch      12978.  
## 9 Welcome     1857.

`add_tally()`

add_tally() is short-hand for mutate() and adds a column named n to the data frame.

Below column n is the sum of the time spent on each resource type by date and resource.

session %>% filter(!is.na(resource)) %>% group_by(date, resource) %>% add_tally(sum(interval) / 60)

## # A tibble: 55,874 x 6
## # Groups:   date, resource [346]
##       id date                resource   device  interval     n
##    <int> <dttm>              <fct>      <fct>      <dbl> <dbl>
##  1     1 2016-01-24 00:00:00 Enrollment desktop   0       4.12
##  2     1 2016-01-24 00:00:00 Enrollment desktop   0.198   4.12
##  3     1 2016-01-24 00:00:00 Enrollment desktop   0.224   4.12
##  4     1 2016-01-24 00:00:00 Enrollment desktop  11.2     4.12
##  5     1 2016-01-24 00:00:00 Enrollment desktop   0.229   4.12
##  6     1 2016-01-24 00:00:00 Enrollment desktop   3.77    4.12
##  7     1 2016-01-24 00:00:00 Enrollment desktop   0.454   4.12
##  8     1 2016-01-24 00:00:00 Enrollment desktop   1.77    4.12
##  9     1 2016-01-24 00:00:00 Enrollment desktop   0.0379  4.12
## 10     1 2016-01-24 00:00:00 Watch      desktop   0.185  31.6 
## # … with 55,864 more rows

`count()`

count() is similar to tally(), but calls group_by() before and ungroup() after. If the data is already grouped, count() adds an additional group that is removed afterwards.

Here, using count() to get the number of cases by each resource type returns the same result as with tally() above. With count(resource), data is grouped by resource.

session %>% filter(!is.na(resource)) %>% count(resource)

## # A tibble: 9 x 2
##   resource       n
##   <fct>      <int>
## 1 Cover       8850
## 2 Do          3586
## 3 Enrollment   288
## 4 Exam         102
## 5 Listen       311
## 6 Proposal      30
## 7 Read        2663
## 8 Watch      36592
## 9 Welcome     3452

In the example below, we count the number of cases grouped by both id and resource.

session %>% filter(!is.na(resource)) %>% count(id, resource)

## # A tibble: 285 x 3
##       id resource       n
##    <int> <fct>      <int>
##  1     1 Do            54
##  2     1 Enrollment     9
##  3     1 Listen         8
##  4     1 Read          28
##  5     1 Watch        358
##  6     1 Welcome       35
##  7     2 Cover        110
##  8     2 Do            42
##  9     2 Enrollment     9
## 10     2 Listen         9
## # … with 275 more rows

`add_count()`

add_count() is a short-hand for group_by() + add_tally().

Below column n is the count of resource by date and resource.

session %>% filter(!is.na(resource)) %>% add_count(date, resource)

## # A tibble: 55,874 x 6
##       id date                resource   device  interval     n
##    <int> <dttm>              <fct>      <fct>      <dbl> <int>
##  1     1 2016-01-24 00:00:00 Enrollment desktop   0         94
##  2     1 2016-01-24 00:00:00 Enrollment desktop   0.198     94
##  3     1 2016-01-24 00:00:00 Enrollment desktop   0.224     94
##  4     1 2016-01-24 00:00:00 Enrollment desktop  11.2       94
##  5     1 2016-01-24 00:00:00 Enrollment desktop   0.229     94
##  6     1 2016-01-24 00:00:00 Enrollment desktop   3.77      94
##  7     1 2016-01-24 00:00:00 Enrollment desktop   0.454     94
##  8     1 2016-01-24 00:00:00 Enrollment desktop   1.77      94
##  9     1 2016-01-24 00:00:00 Enrollment desktop   0.0379    94
## 10     1 2016-01-24 00:00:00 Watch      desktop   0.185     17
## # … with 55,864 more rows

`summarise()`

summarise() reduces multiple values down to a single value. The summary functions include measures for center (mean(), median()), spread (sd(), IQR(), mad()), range (min(), max(), quantile()), position (first(), last(), nth()), count (n(), n_distinct()) and logical evaluations (any(), all()) of data.

Together with group_by(), summarise() returns summaries by the specified group. Without grouping, summarise() returns a single row.

Below we calculate the mean of the total time (in minutes) spent on each resource by date and resource.

session %>% 
  filter(!is.na(resource)) %>% 
  group_by(date, resource) %>% 
  tally(sum(interval)) %>% 
  group_by(resource) %>%
  summarise(mean(n) / 60)

## # A tibble: 9 x 2
##   resource   `mean(n)/60`
##   <fct>             <dbl>
## 1 Cover            36.5  
## 2 Do               61.0  
## 3 Enrollment        0.954
## 4 Exam             11.3  
## 5 Listen            8.75 
## 6 Proposal          1.11 
## 7 Read             41.9  
## 8 Watch           224.   
## 9 Welcome          29.0

Without grouping data, below we get a single row, the sum of the time spent on all resources.

session %>% 
  filter(!is.na(resource)) %>% 
  tally(sum(interval / 60))

## # A tibble: 1 x 1
##        n
##    <dbl>
## 1 23777.

`data.table` functions

data.table offers efficient methods to aggregate and summarize data with the data.table syntax DT[i, j, by]. i indicates on which rows; j indicates what to do; by indicates being grouped by what.

library(data.table)

session <- data.table(session)

count

Below we first count the number of cases by the resource type.

.N is a data.table special-symbol. It is an integer containing the number of rows in the group, and is renamed to N in the result. .N may be useful when the column names are not known in advance.

session[!is.na(resource), .N, by = resource]

##      resource     N
## 1: Enrollment   288
## 2:      Watch 36592
## 3:    Welcome  3452
## 4:       Read  2663
## 5:         Do  3586
## 6:     Listen   311
## 7:      Cover  8850
## 8:       Exam   102
## 9:   Proposal    30

It is equivalent to the dplyr way of

session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(n())

sum

Below we sum the time spent on each resource type in minutes.

session[!is.na(resource), sum(interval)/60, by = resource]

##      resource           V1
## 1: Enrollment     9.536455
## 2:      Watch 12978.392194
## 3:    Welcome  1857.182209
## 4:       Read  2552.938363
## 5:         Do  3841.031722
## 6:     Listen   131.176714
## 7:      Cover  2301.627118
## 8:       Exam   102.057999
## 9:   Proposal     3.340203

It is equivalent to the dplyr way of

session %>% filter(!is.na(resource)) %>% group_by(resource) %>% tally(sum(interval) / 60)

adding a column

Like with dplyr::add_tally(), we can create a new column with := in the data table that contains the summaries instead of aggregating the data table.

Below column sum is the sum of the time spent on each resource type by date and resource.

Note that the data.table method keeps the missing cases of resource in the result, while the dplyr way excludes the rows where resource is NA.

session[!is.na(resource), sum := sum(interval)/60, by = c("date", "resource")]
session

##        id       date   resource  device   interval       sum
##     1:  1 2016-01-24 Enrollment desktop  0.0000000 4.1187860
##     2:  1 2016-01-24 Enrollment desktop  0.1976628 4.1187860
##     3:  1 2016-01-24 Enrollment desktop  0.2244470 4.1187860
##     4:  1 2016-01-24 Enrollment desktop 11.2180660 4.1187860
##     5:  1 2016-01-24 Enrollment desktop  0.2291820 4.1187860
##    ---                                                      
## 56024: 37 2016-04-02      Cover desktop  0.7414949 0.5713272
## 56025: 37 2016-04-02   Proposal desktop  0.1176212 0.2384181
## 56026: 37 2016-04-02   Proposal desktop  3.6968849 0.2384181
## 56027: 37 2016-04-02   Proposal desktop  0.4945791 0.2384181
## 56028: 37 2016-04-02       <NA> desktop  0.1196060        NA

It is equivalent to the dplyr way of

session %>% filter(!is.na(resource)) %>% group_by(date, resource) %>% add_tally(sum(interval) / 60)

Below we calculate the mean of the total time (in minutes) spent on each resource by date and resource.

s <- session[!is.na(resource), sum(interval), by = c("date", "resource")]
s[, mean(V1)/60, by = resource]

##      resource          V1
## 1: Enrollment   0.9536455
## 2:      Watch 223.7653827
## 3:    Welcome  29.0184720
## 4:       Read  41.8514486
## 5:         Do  60.9687575
## 6:     Listen   8.7451143
## 7:      Cover  36.5337638
## 8:       Exam  11.3397777
## 9:   Proposal   1.1134009

It is equivalent to the dplyr way of

session %>% 
  filter(!is.na(resource)) %>% 
  group_by(date, resource) %>% 
  tally(sum(interval)) %>% 
  group_by(resource) %>% 
  summarise(mean(n)/60)

Aggregating Data to Get Summaries

Yun Dai

07/2019

`dplyr` functions

`tally()`

`tally(n())`

`tally(sum())`

`add_tally()`

`count()`

`add_count()`

`summarise()`

`data.table` functions

count

sum

adding a column

Aggregating Data to Get Summaries

Yun Dai

07/2019

dplyr functions

tally()

tally(n())

tally(sum())

add_tally()

count()

add_count()

summarise()

data.table functions

count

sum

adding a column

`dplyr` functions

`tally()`

`tally(n())`

`tally(sum())`

`add_tally()`

`count()`

`add_count()`

`summarise()`

`data.table` functions