We have some tracking logs from an online platform that recorded user behaviors with timestamps (UTC time), IP addresses, server events, pages visited, and device information, among others. Ultimately, we want to visualize the local peak hours of online access globally. The first step is to transform the UTC timestamps to local times by matching a geolocation to each timestamp. The question, therefore, is how to get the geolocations.
This is when the IP addresses come to the rescue. One solution is to use the function ip_api() from the package rgeolocate to obtain the time zone, together with the city, country, latitude, longitude, and more, via ip-api.com. With the time zone, we can then use the function with_tz() from the package lubridate to convert the UTC time to local time.
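The idea can be sketched in two steps. Step 1 queries ip-api.com for an IP's time zone (commented out below to avoid a network call); step 2 shifts a UTC timestamp into that zone with with_tz(). The IP and time zone here are the ones that appear in the data later in this post:

```r
library(lubridate)

# Step 1 (network call, shown for illustration only):
# ip_api("216.165.95.65")$timezone   # returns "America/New_York"

# Step 2: with_tz() keeps the instant, changes the display zone
utc <- as.POSIXct("2016-01-24 10:51:53", tz = "UTC")
with_tz(utc, "America/New_York")
## [1] "2016-01-24 05:51:53 EST"
```

Note that with_tz() does not change the underlying moment in time, only the time zone in which it is displayed.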
Below is a subset of the data that we are going to use for this post.
head(local, 3)
## id time ip
## 1 1 2016-01-24T10:51:53.798520+00:00 216.165.95.65
## 2 1 2016-01-24T10:51:53.996183+00:00 216.165.95.65
## 3 1 2016-01-24T10:51:54.220630+00:00 216.165.95.65
As usual, let’s first load all the packages that we are going to use for this task.
library(dplyr)
library(rgeolocate)
library(lubridate)
library(jsonlite)
library(ggplot2)
First, we convert the timestamp variable time from character to Date and date-time (POSIXct) objects.
# extract the date and the full date-time, both in UTC
local$date <- as.Date(local$time, format = "%Y-%m-%d", tz = "UTC")
local$datetime <- as.POSIXct(local$time, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
options(digits.secs = 0)  # hide fractional seconds when printing
head(local)
## id time ip date
## 1 1 2016-01-24T10:51:53.798520+00:00 216.165.95.65 2016-01-24
## 2 1 2016-01-24T10:51:53.996183+00:00 216.165.95.65 2016-01-24
## 3 1 2016-01-24T10:51:54.220630+00:00 216.165.95.65 2016-01-24
## 4 1 2016-01-24T10:52:05.438696+00:00 216.165.95.65 2016-01-24
## 5 1 2016-01-24T10:52:05.667878+00:00 216.165.95.65 2016-01-24
## 6 1 2016-01-24T10:52:09.439209+00:00 216.165.95.65 2016-01-24
## datetime
## 1 2016-01-24 10:51:53
## 2 2016-01-24 10:51:53
## 3 2016-01-24 10:51:54
## 4 2016-01-24 10:52:05
## 5 2016-01-24 10:52:05
## 6 2016-01-24 10:52:09
The next step is to call the API with ip_api() from the package rgeolocate, which returns a data frame of geolocation information. ip_api() takes a vector of IP addresses and geolocates them via ip-api.com. To make the calculation faster, we call the API on the unique IP addresses only and later merge the results back into the original data frame.
ipinfo <- distinct(local, ip)
ipinfo$loc <- ip_api(ipinfo$ip)
The output is a nested data frame with two variables, ip and loc. The geolocations are stored in a data frame nested within loc.
str(ipinfo)
## 'data.frame': 86 obs. of 2 variables:
## $ ip : Factor w/ 315 levels "","100.35.165.153",..: 171 148 168 158 183 147 276 62 160 142 ...
## $ loc:'data.frame': 86 obs. of 13 variables:
## ..$ as_code : chr "AS12 New York University" "AS4058 CITIC Telecom International CPC Limited" "AS12 New York University" "AS21928 T-Mobile USA, Inc." ...
## ..$ city_name : chr "New York" "Hong Kong" "New York" "Brooklyn" ...
## ..$ country_name: chr "United States" "Hong Kong" "United States" "United States" ...
## ..$ country_code: chr "US" "HK" "US" "US" ...
## ..$ isp : chr "New York University" "CITIC Telecom International CPC Limited" "New York University" "T-Mobile USA, Inc." ...
## ..$ latitude : chr "40.7359" "22.2877" "40.7359" "40.6764" ...
## ..$ longitude : chr "-73.9904" "114.212" "-73.9904" "-73.8753" ...
## ..$ organisation: chr "New York University" "Shanghai NEW York University" "New York University" "T-Mobile USA, Inc." ...
## ..$ region_code : chr "NY" "HEA" "NY" "NY" ...
## ..$ region_name : chr "New York" "Eastern" "New York" "New York" ...
## ..$ timezone : chr "America/New_York" "Asia/Hong_Kong" "America/New_York" "America/New_York" ...
## ..$ zip_code : chr "10003" NA "10003" "11208" ...
## ..$ status : chr "success" "success" "success" "success" ...
To flatten the nested structure into a regular data frame, we use the function flatten() from the package jsonlite.
ipinfo <- flatten(ipinfo)
head(ipinfo, 3)
## ip loc.as_code
## 1 216.165.95.65 AS12 New York University
## 2 202.66.60.166 AS4058 CITIC Telecom International CPC Limited
## 3 216.165.95.5 AS12 New York University
## loc.city_name loc.country_name loc.country_code
## 1 New York United States US
## 2 Hong Kong Hong Kong HK
## 3 New York United States US
## loc.isp loc.latitude loc.longitude
## 1 New York University 40.7359 -73.9904
## 2 CITIC Telecom International CPC Limited 22.2877 114.212
## 3 New York University 40.7359 -73.9904
## loc.organisation loc.region_code loc.region_name
## 1 New York University NY New York
## 2 Shanghai NEW York University HEA Eastern
## 3 New York University NY New York
## loc.timezone loc.zip_code loc.status
## 1 America/New_York 10003 success
## 2 Asia/Hong_Kong <NA> success
## 3 America/New_York 10003 success
We then keep only the variables we need: the IP address and its time zone.
ipinfo <- ipinfo %>% select(ip, loc.timezone) %>% distinct()
We then merge the data frame ipinfo back into local.
local <- merge(local, ipinfo, by = "ip", all = TRUE)
A bit more cleaning is needed: the raw character variable time is dropped, and the time zone variable is renamed.
local <- local %>%
select(-time) %>%
rename(timezone = loc.timezone)
With the time zone at hand, we can use the function with_tz() from lubridate to get the local times.
# rowwise() is needed because the time zone differs from row to row
local <- local %>% rowwise() %>% mutate(localtime = with_tz(datetime, timezone))
It is straightforward to also get the dates, hours, and minutes thereafter.
local$localdate <- as.Date(local$localtime)
local$localhour <- strptime(local$localtime, format = "%Y-%m-%d %H:%M:%S")$hour
local$localmin <- as.numeric(strftime(local$localtime, format = "%M"))
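As an aside, since lubridate is already loaded, its hour() and minute() accessors are a more direct way to pull out these components; a small sketch with a made-up timestamp:

```r
library(lubridate)

x <- as.POSIXct("2016-04-01 13:37:44", tz = "UTC")
hour(x)
## [1] 13
minute(x)
## [1] 37
```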
Here we go.
head(local)
## # A tibble: 6 x 9
## ip id date datetime timezone localtime
## <fct> <int> <date> <dttm> <chr> <dttm>
## 1 101.… 13 2016-04-01 2016-04-01 22:20:56 Austral… 2016-04-02 05:20:56
## 2 101.… 13 2016-04-01 2016-04-01 06:37:44 Austral… 2016-04-01 13:37:44
## 3 101.… 13 2016-04-01 2016-04-01 06:37:42 Austral… 2016-04-01 13:37:42
## 4 101.… 13 2016-04-01 2016-04-01 06:37:43 Austral… 2016-04-01 13:37:43
## 5 101.… 13 2016-04-01 2016-04-01 22:20:52 Austral… 2016-04-02 05:20:52
## 6 101.… 13 2016-04-01 2016-04-01 06:37:35 Austral… 2016-04-01 13:37:35
## # ... with 3 more variables: localdate <date>, localhour <int>,
## # localmin <dbl>
For peak hours, we round each local time to the nearest hour: times more than 30 minutes past the hour are rounded up.
local <- local %>% mutate(localhour2 = ifelse(localmin > 30, localhour + 1, localhour))
We also recode observations whose rounded hour is 24 to hour 0 of the next day, so that localhour2 does not contain both 0 and 24.
local$localdate[local$localhour2 == 24] <- local$localdate[local$localhour2 == 24] + 1
local$localhour2[local$localhour2 == 24] <- 0
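An alternative worth noting: lubridate's round_date() rounds to the nearest hour and rolls the date over past midnight in one step. It is not identical to the ifelse() above, since a time exactly 30 minutes past the hour is rounded up by round_date() but down by the localmin > 30 test, so it is offered only as a sketch:

```r
library(lubridate)

# 23:40 rounds up to midnight, and the date rolls over automatically
x <- as.POSIXct("2016-04-01 23:40:00", tz = "UTC")
rounded <- round_date(x, unit = "hour")
hour(rounded)
## [1] 0
as.Date(rounded)
## [1] "2016-04-02"
```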
head(local)
## # A tibble: 6 x 10
## ip id date datetime timezone localtime
## <fct> <int> <date> <dttm> <chr> <dttm>
## 1 101.… 13 2016-04-01 2016-04-01 22:20:56 Austral… 2016-04-02 05:20:56
## 2 101.… 13 2016-04-01 2016-04-01 06:37:44 Austral… 2016-04-01 13:37:44
## 3 101.… 13 2016-04-01 2016-04-01 06:37:42 Austral… 2016-04-01 13:37:42
## 4 101.… 13 2016-04-01 2016-04-01 06:37:43 Austral… 2016-04-01 13:37:43
## 5 101.… 13 2016-04-01 2016-04-01 22:20:52 Austral… 2016-04-02 05:20:52
## 6 101.… 13 2016-04-01 2016-04-01 06:37:35 Austral… 2016-04-01 13:37:35
## # ... with 4 more variables: localdate <date>, localhour <int>,
## # localmin <dbl>, localhour2 <dbl>
To plot the peak hours, we first calculate how many users were online at each hour for each day.
local2 <- local %>%
distinct(localdate, id, localhour2) %>%
group_by(localdate, localhour2) %>%
add_count(localhour2) %>%
arrange(localdate, localhour2) %>%
select(localdate, localhour2, n) %>%
distinct()
head(local2)
## # A tibble: 6 x 3
## # Groups: localdate, localhour2 [6]
## localdate localhour2 n
## <date> <dbl> <int>
## 1 2016-01-24 0 1
## 2 2016-01-24 6 2
## 3 2016-01-24 18 6
## 4 2016-01-24 20 1
## 5 2016-01-24 21 2
## 6 2016-01-24 23 2
After that, we are ready to visualize the global peak hours with a heat map.
ggplot(local2, aes(localdate, localhour2)) +
  geom_tile(aes(fill = n), colour = "white") +
  scale_fill_gradient(low = "#f3da4c", high = "blue", breaks = seq(0, 13, 2)) +
  scale_x_date(breaks = seq(as.Date("2016-01-26"), as.Date("2016-04-02"), "7 days"),
               date_labels = "%b %d") +
  scale_y_continuous(breaks = seq(0, 23, 1)) +
  labs(x = "Date", y = "Hour", fill = "Freq of Online Access") +
  ggtitle("Peak Hour of Studying Online (Local Time)")