# Writing Window Functions

###### PUBLISHED ON MAY 6, 2020

I was recently working on exploring some time series data for a Kaggle competition and found myself wanting to calculate the rolling average of some sales. I don’t often work with time series data, so I had to look up functions to use to calculate rolling averages (n.b. that if you don’t know what a rolling average is, read here), and I was surprised that dplyr doesn’t have one built in. It turns out there are several packages that do have rolling aggregate (average, standard deviation, etc) functions, such as the RcppRoll package and the zoo package. But I also thought it provided a nice opportunity to practice writing some of my own rolling aggregate functions, which I’m going to walk through here.

# Setup

First, I’m going to load packages. For this, I’m only using {tidyverse} (and within tidyverse, mostly {purrr} for iteration) and {RcppRoll} as a ground-truth to test my functions. I’m also going to use the {glue} package later on, but that’s less central and I’ll load it when I need it.

knitr::opts_chunk$set(echo = TRUE, warning = FALSE) library(tidyverse) ## -- Attaching packages ---------------------------------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 -- ## v ggplot2 3.3.0 v purrr 0.3.4 ## v tibble 3.0.1 v dplyr 0.8.5 ## v tidyr 1.0.2 v stringr 1.4.0 ## v readr 1.3.1 v forcats 0.5.0 ## -- Conflicts ------------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() -- ## x dplyr::filter() masks stats::filter() ## x dplyr::lag() masks stats::lag() library(RcppRoll) Next, I’m going to set up a minimal tibble to use for calculations. This will have an day column and a val column. The val column is the one I’m going to be doing calculations on, and the day column is going to serve as an index for the rolling average. set.seed(0408) df <- tibble( day = c(1:250), val = rnorm(250, mean = 5, sd = 1) ) df ## # A tibble: 250 x 2 ## day val ## <int> <dbl> ## 1 1 2.64 ## 2 2 5.30 ## 3 3 4.29 ## 4 4 5.76 ## 5 5 3.75 ## 6 6 4.89 ## 7 7 4.50 ## 8 8 3.84 ## 9 9 4.74 ## 10 10 6.41 ## # ... with 240 more rows # Step 1: Testing Iteration So, my process for building this function is going to be to create something very basic with few variables first and then gradually abstract this out to make a more responsive function. Eventually, I’ll get to a point where the rolling aggregation function will be general enough to allow for the specification of arbitrary aggregate functions and windows. The first step, then, is just to test the logic of the calculation I need to create to calculate rolling averages. I’ll do this by assuming a 28 day window (we’ll be able to change the window later), create a “truth” to test against using RcppRoll’s roll_mean() function, and then iterate using map(). truth <- roll_mean(df$val, n = 28, align = "right")

test <- map_dbl(
c(28:length(df$val)), #this represents the days I want to calculate the average for. I'm starting on day 28 (because I want a 28-day rolling average, and the first #time I'll have 28 days of data is on day 28) and going through the last day function(a) { mean(df$val[(a - 27):a], na.rm = FALSE)
} #this specifies what I'm doing -- taking the mean of the 'val' column for each 28 day window (day 1-28, day 2-29, etc). If I don't subtract 1 window value when I subset, I'll actually get 29 days.
)

all.equal(truth, test) #this tests to see that the vectors are equal.
## [1] TRUE

# Step 2: Building Out Functions

Great, so the logic of the calculation works. Now, let’s extend it a little bit to create a function where I can specify the variable I want to use as well as the window I want to take the rolling average over.

ee_roll_mean <- function(x, window) {
map_dbl(
c(window:length(x)),
function(a) {
mean(x[(a - window+1):a], na.rm = FALSE)
}
)
}

test_2 <- ee_roll_mean(df$val, 28) all.equal(test_2, truth) ## [1] TRUE It works when we set the window value to 28, but let’s also test that it works when we use a different window just to be safe. truth_win8 <- roll_mean(df$val, n = 8, align = "right")

test_win8 <- ee_roll_mean(df$val, window = 8) all.equal(truth_win8, test_win8) ## [1] TRUE This works well for taking the rolling average – we can specify the values we want to take the average over as well as the window for that average. But there are other functions we might be interested in getting rolling aggregates for as well. For instance, we might want to know the minimum or standard deviation of a value during some windows of time. Rather than write separate functions to do this, we can just extend our previous function to allow us to supply whichever aggregation function we want. ee_roll_func <- function(x, window, fn = mean) { map_dbl( c(window:length(x)), function(a) { fn(x[(a - window+1):a], na.rm = FALSE) } ) } test_3 <- ee_roll_func(df$val, window = 8, fn = sd)

#testing against the RcppRoll function that does the same thing
truth_3 <- roll_sd(dfval, n = 8, align = "right") all.equal(test_3, truth_3) ## [1] TRUE # Step 3: Pad the Output One thing I’m noticing when looking at the output of each of these functions is that the length of the output vectors differ depending on the value we pass to the window argument. length(test) ## [1] 223 length(test_win8) ## [1] 243 I’m also noticing that these outputs are shorter than the length of the input vector (which is length 250). This makes sense because the function can’t take, for example, the 28 day average before the 28th day, and so the length of the output vector will be 27 elements shorter than the length of the input vector. This isn’t so great if we want to add the results of this function back into our original df, though, because all of the vectors in a df need to be the same length. One solution is to “pad” our output vector with the appropriate amount of NA values so that it is the same length as the input vector and can therefore get added as a column in our df. So let’s do that. ee_roll_func_padded <- function(x, window, fn = mean) { map_dbl( c(window:length(x)), function(a) { fn(x[(a - window+1):a], na.rm = FALSE) } ) %>% append(rep(NA_real_, times = window-1), values = .) #this will pad the front with a number of NAs equal to the window value minus 1 } test_pad1 <- ee_roll_func_padded(dfval, window = 8) #note that if we don't supply a function, it will use the mean

##   [1]       NA       NA       NA       NA       NA       NA       NA 4.372225
##   [9] 4.634703 4.773530 4.751241 4.837210 4.835834 4.947405 5.023067 5.159392
##  [17] 5.259393 4.897024 5.154236 4.748580 4.790054 4.403228 4.522648 4.519479
##  [25] 4.480582 4.687750 4.701154 4.851093 4.652568 4.847791 4.811578 4.864686
##  [33] 4.672642 4.530416 4.582749 4.682431 4.717240 4.746443 4.652665 4.466197
##  [41] 4.611190 4.706513 4.568209 4.517622 4.872942 5.065789 5.186852 5.390533
##  [49] 5.395041 5.507209 5.403271 5.174482 5.179670 5.038712 5.020135 4.838939
##  [57] 4.875701 4.755078 4.865224 5.176775 5.202352 5.000563 4.797047 4.894503
##  [65] 4.810376 5.004196 4.977340 4.848640 4.753013 4.961929 5.142875 5.096611
##  [73] 5.248953 5.181127 4.941060 4.842180 4.693671 4.603321 4.722901 4.707204
##  [81] 4.667018 4.490093 4.642128 4.688560 4.940980 5.010917 4.865457 5.077085
##  [89] 4.943111 5.104771 5.225281 5.405689 5.459406 5.772019 5.873998 5.653444
##  [97] 5.727537 5.800159 5.719428 5.649400 5.519840 5.130266 4.799206 5.049435
## [105] 4.941485 4.868625 4.976469 5.154863 5.039641 5.037770 5.202060 4.829763
## [113] 5.054458 5.091318 5.113392 5.056769 4.999436 5.110106 5.070160 5.305183
## [121] 5.148242 5.163269 5.116071 5.209866 5.295613 5.295760 5.642222 5.797642
## [129] 5.800138 5.454873 5.221126 5.037245 5.077385 5.216140 5.121762 4.768109
## [137] 4.833714 5.100003 5.221173 5.314504 5.166415 4.883192 4.762374 4.661057
## [145] 4.620171 4.638887 4.789642 4.625148 4.791990 5.013448 4.746997 5.084247
## [153] 4.989471 4.899552 4.728081 4.728852 4.656302 4.596832 4.789755 4.571342
## [161] 4.750549 4.828835 4.946644 4.904696 4.951820 4.962249 4.952014 5.015733
## [169] 4.920095 4.695109 4.624958 4.687815 5.038474 5.314062 5.471601 5.659262
## [177] 5.667469 5.904322 5.968823 6.073087 5.663232 5.407968 5.177870 5.237016
## [185] 5.445955 5.679831 5.614257 5.233444 5.227926 5.097925 5.119121 4.940067
## [193] 4.803742 4.593282 4.749424 5.008870 4.902099 5.014811 5.048332 5.111487
## [201] 5.059727 4.972699 4.866232 4.952064 4.924344 5.077133 5.166955 5.172722
## [209] 5.304330 5.370433 5.299762 5.238768 5.450415 5.399515 5.197358 5.101200
## [217] 5.005289 5.243733 5.194603 5.205039 5.192346 5.082026 5.030877 5.072784
## [225] 5.032299 4.637538 4.781121 4.812846 4.758887 4.541770 4.712547 4.636478
## [233] 4.876790 5.177345 4.831910 4.870811 5.106333 5.162062 4.990127 5.058875
## [241] 4.603333 4.441803 4.618171 4.585108 4.444892 4.505732 4.827083 4.840013
## [249] 5.098275 5.081742

Notice that when we call test_pad1 we get a vector with several NA values appended to the front. And when we look at the length of each of these vectors, we can see that they’re length 250

## [1] 250
## [1] 250

# Step 4: Use Functions to Add Columns to Data

Now that we have a function that reliably outputs a vector the same length as the columns in our dataframe, we can use it in conjunction with other tidyverse operations to add columns to our dataframe.

df %>%
mutate(roll_avg = ee_roll_func_padded(val, window = 8, fn = mean))
## # A tibble: 250 x 3
##      day   val roll_avg
##    <int> <dbl>    <dbl>
##  1     1  2.64    NA
##  2     2  5.30    NA
##  3     3  4.29    NA
##  4     4  5.76    NA
##  5     5  3.75    NA
##  6     6  4.89    NA
##  7     7  4.50    NA
##  8     8  3.84     4.37
##  9     9  4.74     4.63
## 10    10  6.41     4.77
## # ... with 240 more rows

Finally, what if we wanted to get the rolling mean, standard deviation, min, and max all as new columns in our dataframe using the function we created. Our function allows us to pass in whichever aggregation function we want to use (well, probably not any function), so we can use pmap() from {purrr} to iterate over multiple functions and, in combination with the {glue} package, also set meaningful names for the new variables.

I’ll set up a dataframe called params that has the names of the new variables and the corresponding functions, then I’ll loop over these names and functions to create new columns in our original dataframe. I’m not going to go over all of the code here, but if you’re curious, it might be helpful to look at the documentation for {glue}, {purrr}, and possibly {rlang} (for the := operator).

library(glue)
##
## Attaching package: 'glue'
## The following object is masked from 'package:dplyr':
##
##     collapse
params <- tibble(
names = c("roll_avg", "roll_sd", "roll_min", "roll_max"),
fn = lst(mean, sd, min, max)
)

params %>%
pmap_dfc(~df %>%
transmute("{.x}" := ee_roll_func_padded(val, window = 8, fn = .y))) %>%
bind_cols(df, .)
## # A tibble: 250 x 6
##      day   val roll_avg roll_sd roll_min roll_max
##    <int> <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
##  1     1  2.64    NA     NA        NA       NA
##  2     2  5.30    NA     NA        NA       NA
##  3     3  4.29    NA     NA        NA       NA
##  4     4  5.76    NA     NA        NA       NA
##  5     5  3.75    NA     NA        NA       NA
##  6     6  4.89    NA     NA        NA       NA
##  7     7  4.50    NA     NA        NA       NA
##  8     8  3.84     4.37   0.982     2.64     5.76
##  9     9  4.74     4.63   0.691     3.75     5.76
## 10    10  6.41     4.77   0.918     3.75     6.41
## # ... with 240 more rows

This gives us, for each 8-day window (e.g. day 1-8, day 2-9, etc) an average, standard deviation, minimum, and maximum of the val column.

# Wrapping Up

As sort of a final note, this activity was meant to be both an exercise for me in working through some programming using window functions as well as a walkthrough/tutorial for others interested in writing functions. That said, when I dive back into the Kaggle data I mentioned earlier, I’ll use the functions from the {RcppRoll} package rather than my own. These are optimized to run quickly because they use C++ code and they’re going to be more efficient than anything I just wrote. This doesn’t matter much when we use a little 250 observation dataframe for demonstration, but it will make a difference working with several thousand observations at once.