Filling in missing values with previous or next value of selected columns can be useful, when values are recorded only when they change and otherwise not recorded.

For instance, below we have a sample dataset that is part of a tracking log (transformed from JSON format), which recorded user behaviors of an online course site. User interactions with the course site are logged when events data are emitted by the server, the browser, or the mobile device.

head(session, 10)
##               datetime       event id
## 1  2016-02-21 09:14:49        <NA>  5
## 2  2016-02-21 09:14:48        <NA>  5
## 3  2016-03-10 19:52:23        <NA> 16
## 4  2016-03-10 19:46:06        <NA> 16
## 5  2016-03-10 19:51:33  seek_video 16
## 6  2016-03-10 19:46:06 pause_video 16
## 7  2016-03-10 19:48:41  seek_video 16
## 8  2016-03-10 19:46:06  play_video 16
## 9  2016-03-10 19:56:00        <NA> 16
## 10 2016-03-10 19:43:38        <NA> 16

What we want to do is fill in the missing values of event for each id using the previous value of event.

We show how to do this with tidyr::fill() and zoo::na.locf() below.


tidyr::fill()

In fill(), .direction is the direction in which to fill missing values. We can change the direction to “down” (the default), “up”, “downup” (first down and then up) or “updown” (first up and then down).

library(tidyr)
library(dplyr)

session2 <- session %>%
  mutate(event2 = event) %>%
  group_by(id) %>%
  fill(event2, .direction = "down")
head(session2, 20)
## # A tibble: 20 x 4
## # Groups:   id [2]
##    datetime            event                 id event2            
##    <dttm>              <fct>              <int> <fct>             
##  1 2016-02-21 09:14:49 <NA>                   5 <NA>              
##  2 2016-02-21 09:14:48 <NA>                   5 <NA>              
##  3 2016-03-10 19:52:23 <NA>                  16 <NA>              
##  4 2016-03-10 19:46:06 <NA>                  16 <NA>              
##  5 2016-03-10 19:51:33 seek_video            16 seek_video        
##  6 2016-03-10 19:46:06 pause_video           16 pause_video       
##  7 2016-03-10 19:48:41 seek_video            16 seek_video        
##  8 2016-03-10 19:46:06 play_video            16 play_video        
##  9 2016-03-10 19:56:00 <NA>                  16 play_video        
## 10 2016-03-10 19:43:38 <NA>                  16 play_video        
## 11 2016-03-10 19:46:51 play_video            16 play_video        
## 12 2016-03-10 19:51:28 speed_change_video    16 speed_change_video
## 13 2016-03-10 19:55:46 seq_goto              16 seq_goto          
## 14 2016-03-10 19:48:10 <NA>                  16 seq_goto          
## 15 2016-03-10 19:51:36 play_video            16 play_video        
## 16 2016-03-10 19:51:30 play_video            16 play_video        
## 17 2016-03-10 19:57:32 play_video            16 play_video        
## 18 2016-03-10 19:55:49 load_video            16 load_video        
## 19 2016-03-10 19:58:07 <NA>                  16 load_video        
## 20 2016-03-10 19:46:46 load_video            16 load_video

zoo::na.locf()

Package zoo deals with time series data efficiently.

Below we use its function na.locf0(), which is like na.locf() except it defaults to na.rm = FALSE and requires a single vector argument. na.locf() is a generic function for replacing each NA with the most recent non-NA prior to it.

library(zoo)

session3 <- session %>% 
  group_by(id) %>%
  mutate(event2 = na.locf0(event, fromLast = FALSE))

In na.locf0(), fromLast sets whether observations are to be carried backward or forward. fromLast = FALSE fills the value with the previous entry. fromLast = TRUE carries the next observation backward.

The result is shown below.

head(session3, 20)
## # A tibble: 20 x 4
## # Groups:   id [2]
##    datetime            event                 id event2            
##    <dttm>              <fct>              <int> <fct>             
##  1 2016-02-21 09:14:49 <NA>                   5 <NA>              
##  2 2016-02-21 09:14:48 <NA>                   5 <NA>              
##  3 2016-03-10 19:52:23 <NA>                  16 <NA>              
##  4 2016-03-10 19:46:06 <NA>                  16 <NA>              
##  5 2016-03-10 19:51:33 seek_video            16 seek_video        
##  6 2016-03-10 19:46:06 pause_video           16 pause_video       
##  7 2016-03-10 19:48:41 seek_video            16 seek_video        
##  8 2016-03-10 19:46:06 play_video            16 play_video        
##  9 2016-03-10 19:56:00 <NA>                  16 play_video        
## 10 2016-03-10 19:43:38 <NA>                  16 play_video        
## 11 2016-03-10 19:46:51 play_video            16 play_video        
## 12 2016-03-10 19:51:28 speed_change_video    16 speed_change_video
## 13 2016-03-10 19:55:46 seq_goto              16 seq_goto          
## 14 2016-03-10 19:48:10 <NA>                  16 seq_goto          
## 15 2016-03-10 19:51:36 play_video            16 play_video        
## 16 2016-03-10 19:51:30 play_video            16 play_video        
## 17 2016-03-10 19:57:32 play_video            16 play_video        
## 18 2016-03-10 19:55:49 load_video            16 load_video        
## 19 2016-03-10 19:58:07 <NA>                  16 load_video        
## 20 2016-03-10 19:46:46 load_video            16 load_video