We have some tracking logs from an online platform that recorded user behaviors with timestamps (in UTC), IP addresses, server events, pages visited, and device information, among other fields. Ultimately, we want to visualize the local peak hours of online access around the globe. The first step is to transform each UTC timestamp into local time, which requires matching a geolocation to each timestamp. The question, therefore, is how to get the geolocations.

This is when the IP addresses come to the rescue. One solution is to use the function ip_api() from the package rgeolocate, which queries ip-api.com and returns the time zone along with the city, country, latitude, longitude, and other fields. With the time zone, we can then use the function with_tz() from the package lubridate to convert the UTC time to local time.
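
As a quick illustration of the second step, here is a minimal sketch that converts the first timestamp from the sample below to New York time, the zone these NYU addresses turn out to be in later in this post:

utc <- lubridate::ymd_hms("2016-01-24 10:51:53", tz = "UTC")
lubridate::with_tz(utc, tzone = "America/New_York")
## [1] "2016-01-24 05:51:53 EST"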

Below is a subset of the data that we are going to use for this post.

head(local, 3)
##   id                             time            ip
## 1  1 2016-01-24T10:51:53.798520+00:00 216.165.95.65
## 2  1 2016-01-24T10:51:53.996183+00:00 216.165.95.65
## 3  1 2016-01-24T10:51:54.220630+00:00 216.165.95.65

As usual, let’s first load all the packages that we are going to use for this task.

library(dplyr)
library(rgeolocate)
library(lubridate)
library(jsonlite)
library(ggplot2)

Converting timestamps to date-time objects

First, we convert the timestamp column time from character to Date and date-time (POSIXct) objects.

# Parse the leading date and the full date-time from the ISO 8601 timestamp strings.
local$date <- as.Date(local$time, format = "%Y-%m-%d")
local$datetime <- as.POSIXct(local$time, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
# Suppress the printing of fractional seconds.
options(digits.secs = 0)
head(local)
##   id                             time            ip       date
## 1  1 2016-01-24T10:51:53.798520+00:00 216.165.95.65 2016-01-24
## 2  1 2016-01-24T10:51:53.996183+00:00 216.165.95.65 2016-01-24
## 3  1 2016-01-24T10:51:54.220630+00:00 216.165.95.65 2016-01-24
## 4  1 2016-01-24T10:52:05.438696+00:00 216.165.95.65 2016-01-24
## 5  1 2016-01-24T10:52:05.667878+00:00 216.165.95.65 2016-01-24
## 6  1 2016-01-24T10:52:09.439209+00:00 216.165.95.65 2016-01-24
##              datetime
## 1 2016-01-24 10:51:53
## 2 2016-01-24 10:51:53
## 3 2016-01-24 10:51:54
## 4 2016-01-24 10:52:05
## 5 2016-01-24 10:52:05
## 6 2016-01-24 10:52:09

Geolocating IP addresses

The next step is to call the API with ip_api() from the package rgeolocate. ip_api() takes a vector of IP addresses, geolocates them via ip-api.com, and returns a data frame of geolocation information.
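
As a quick sanity check, a single address can be geolocated first (a sketch using the NYU address from the sample above; the returned columns match the str() output shown below).

one <- ip_api("216.165.95.65")
one$timezone  # "America/New_York" for this address, per the output below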

To speed up the calculations, we call the API only on the unique IP addresses and later merge the results back into the original data frame.

# Look up each unique IP address only once.
ipinfo <- distinct(local, ip)
ipinfo$loc <- ip_api(ipinfo$ip)

The output is a nested data frame with two variables, ip and loc. The geolocation fields are stored in a data frame nested within loc.

str(ipinfo)
## 'data.frame':    86 obs. of  2 variables:
##  $ ip : Factor w/ 315 levels "","100.35.165.153",..: 171 148 168 158 183 147 276 62 160 142 ...
##  $ loc:'data.frame': 86 obs. of  13 variables:
##   ..$ as_code     : chr  "AS12 New York University" "AS4058 CITIC Telecom International CPC Limited" "AS12 New York University" "AS21928 T-Mobile USA, Inc." ...
##   ..$ city_name   : chr  "New York" "Hong Kong" "New York" "Brooklyn" ...
##   ..$ country_name: chr  "United States" "Hong Kong" "United States" "United States" ...
##   ..$ country_code: chr  "US" "HK" "US" "US" ...
##   ..$ isp         : chr  "New York University" "CITIC Telecom International CPC Limited" "New York University" "T-Mobile USA, Inc." ...
##   ..$ latitude    : chr  "40.7359" "22.2877" "40.7359" "40.6764" ...
##   ..$ longitude   : chr  "-73.9904" "114.212" "-73.9904" "-73.8753" ...
##   ..$ organisation: chr  "New York University" "Shanghai NEW York University" "New York University" "T-Mobile USA, Inc." ...
##   ..$ region_code : chr  "NY" "HEA" "NY" "NY" ...
##   ..$ region_name : chr  "New York" "Eastern" "New York" "New York" ...
##   ..$ timezone    : chr  "America/New_York" "Asia/Hong_Kong" "America/New_York" "America/New_York" ...
##   ..$ zip_code    : chr  "10003" NA "10003" "11208" ...
##   ..$ status      : chr  "success" "success" "success" "success" ...
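
The nesting simply comes from assigning a whole data frame to a single column; a toy illustration (made-up values, not the real lookups):

toy <- data.frame(ip = c("a", "b"))
toy$loc <- data.frame(timezone = c("America/New_York", "Asia/Hong_Kong"),
                      status   = c("success", "success"))
str(toy)  # loc shows up as a data frame nested inside toy, just like ipinfo above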

To flatten this nested structure into a regular data frame, we use the function flatten() from the package jsonlite.

ipinfo <- flatten(ipinfo)
head(ipinfo, 3)
##              ip                                    loc.as_code
## 1 216.165.95.65                       AS12 New York University
## 2 202.66.60.166 AS4058 CITIC Telecom International CPC Limited
## 3  216.165.95.5                       AS12 New York University
##   loc.city_name loc.country_name loc.country_code
## 1      New York    United States               US
## 2     Hong Kong        Hong Kong               HK
## 3      New York    United States               US
##                                   loc.isp loc.latitude loc.longitude
## 1                     New York University      40.7359      -73.9904
## 2 CITIC Telecom International CPC Limited      22.2877       114.212
## 3                     New York University      40.7359      -73.9904
##               loc.organisation loc.region_code loc.region_name
## 1          New York University              NY        New York
## 2 Shanghai NEW York University             HEA         Eastern
## 3          New York University              NY        New York
##       loc.timezone loc.zip_code loc.status
## 1 America/New_York        10003    success
## 2   Asia/Hong_Kong         <NA>    success
## 3 America/New_York        10003    success

We then keep only the variables we need: the IP address and its time zone.

ipinfo <- ipinfo %>% select(ip, loc.timezone) %>% distinct()

We then merge the data frame ipinfo back into local.

local <- merge(local, ipinfo, by = "ip", all = TRUE)

Some more cleaning needs to be done: the original time string is dropped and the time zone variable is renamed.

local <- local %>% 
  select(-time) %>%
  rename(timezone = loc.timezone)

Getting the local times

With the time zone at hand, we can use the function with_tz() from lubridate to get the local times. Because with_tz() takes a single time zone at a time, we apply it row by row with rowwise().

local <- local %>% rowwise() %>% mutate(localtime = with_tz(datetime, timezone))
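
One caveat worth keeping in mind: a POSIXct column can only carry a single time zone attribute, so when rows span many zones it can be safer to keep the local clock time as plain text. A sketch of that alternative (not what is used in the rest of this post):

local_chr <- local %>%
  rowwise() %>%
  mutate(localtime_chr = format(with_tz(datetime, timezone), "%Y-%m-%d %H:%M:%S")) %>%
  ungroup()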

It is straightforward to also get the dates, hours, and minutes thereafter.

local$localdate <- as.Date(local$localtime)
local$localhour <- strptime(local$localtime, format = "%Y-%m-%d %H:%M:%S")$hour
local$localmin  <- as.numeric(strftime(local$localtime, format = "%M"))
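
Since lubridate is already loaded, its accessors give the same hour and minute values; a short equivalent sketch:

hour(local$localtime)    # same values as localhour above
minute(local$localtime)  # same values as localmin above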

Here we go.

head(local)
## # A tibble: 6 x 9
##   ip       id date       datetime            timezone localtime          
##   <fct> <int> <date>     <dttm>              <chr>    <dttm>             
## 1 101.…    13 2016-04-01 2016-04-01 22:20:56 Austral… 2016-04-02 05:20:56
## 2 101.…    13 2016-04-01 2016-04-01 06:37:44 Austral… 2016-04-01 13:37:44
## 3 101.…    13 2016-04-01 2016-04-01 06:37:42 Austral… 2016-04-01 13:37:42
## 4 101.…    13 2016-04-01 2016-04-01 06:37:43 Austral… 2016-04-01 13:37:43
## 5 101.…    13 2016-04-01 2016-04-01 22:20:52 Austral… 2016-04-02 05:20:52
## 6 101.…    13 2016-04-01 2016-04-01 06:37:35 Austral… 2016-04-01 13:37:35
## # ... with 3 more variables: localdate <date>, localhour <int>,
## #   localmin <dbl>

For the peak-hour analysis, we round each local time to the nearest hour: if more than 30 minutes have passed, we round up to the next hour; otherwise we round down.

local <- local %>% mutate(localhour2 = ifelse(localmin > 30, localhour + 1, localhour)) 

We also recode the rows whose rounded hour is 24, moving the localdate to the next day and the hour back to 0, so that localhour2 does not contain both 0 and 24.

local$localdate[local$localhour2 == 24] <- local$localdate[local$localhour2 == 24] + 1
local$localhour2[local$localhour2 == 24] <- 0

head(local)
## # A tibble: 6 x 10
##   ip       id date       datetime            timezone localtime          
##   <fct> <int> <date>     <dttm>              <chr>    <dttm>             
## 1 101.…    13 2016-04-01 2016-04-01 22:20:56 Austral… 2016-04-02 05:20:56
## 2 101.…    13 2016-04-01 2016-04-01 06:37:44 Austral… 2016-04-01 13:37:44
## 3 101.…    13 2016-04-01 2016-04-01 06:37:42 Austral… 2016-04-01 13:37:42
## 4 101.…    13 2016-04-01 2016-04-01 06:37:43 Austral… 2016-04-01 13:37:43
## 5 101.…    13 2016-04-01 2016-04-01 22:20:52 Austral… 2016-04-02 05:20:52
## 6 101.…    13 2016-04-01 2016-04-01 06:37:35 Austral… 2016-04-01 13:37:35
## # ... with 4 more variables: localdate <date>, localhour <int>,
## #   localmin <dbl>, localhour2 <dbl>

Plotting peak hours

To plot the peak hours, we first calculate how many users were online at each hour for each day.

local2 <- local %>% 
  distinct(localdate, id, localhour2) %>%   # one row per user per hour per day
  group_by(localdate, localhour2) %>% 
  add_count(localhour2) %>%                 # n = number of users online in that hour
  arrange(localdate, localhour2) %>%
  select(localdate, localhour2, n) %>%
  distinct()

head(local2)
## # A tibble: 6 x 3
## # Groups:   localdate, localhour2 [6]
##   localdate  localhour2     n
##   <date>          <dbl> <int>
## 1 2016-01-24          0     1
## 2 2016-01-24          6     2
## 3 2016-01-24         18     6
## 4 2016-01-24         20     1
## 5 2016-01-24         21     2
## 6 2016-01-24         23     2
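
The same counts can be computed a bit more compactly with count(), which tallies rows per group (a sketch that assumes, as above, one row per user per hour after distinct()):

local2_alt <- local %>%
  ungroup() %>%                              # drop the rowwise grouping from earlier
  distinct(localdate, id, localhour2) %>%    # one row per user per hour per day
  count(localdate, localhour2)               # n = users online in that hour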

After that, we are ready to visualize the global peak hours with a heat map.

ggplot(local2, aes(localdate, localhour2)) + 
  geom_tile(aes(fill = n), colour = "white") + 
  scale_fill_gradient(low = "#f3da4c", high = "blue", breaks = seq(0, 13, 2)) +
  scale_x_date(breaks = seq(as.Date("2016-01-26"), as.Date("2016-04-02"), "7 days"), 
               date_labels = "%b %d") + 
  scale_y_continuous(breaks = seq(0, 23, 1)) +
  labs(x = "Date", y = "Hour", fill = "Freq of Online Access") +
  ggtitle("Peak Hour of Studying Online (Local Time)")