I was unhappy with the time dplyr and data.table were taking to create a new variable on my data.frame and decide to compare methods.
To my surprise, reassigning the results of dplyr::mutate() to a new data.frame seems to be faster than not doing so.
Why is this happening?
library(data.table)
library(tidyverse)
dt <- fread(".... data.csv") #load 200MB datafile
dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b-a
c <- Sys.time()
dt_dplyr <- dt2 %>%
mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c
e <- Sys.time()
dt3 %>%
mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e
datatabletook = 17sec
dplyrtook = 47sec
dplyr_reassign_took = 17sec