I was unhappy with how long dplyr and data.table were taking to create a new variable on my data.frame, so I decided to compare methods.

To my surprise, assigning the result of dplyr::mutate() to a new data.frame seems to be faster than leaving it unassigned.

Why is this happening?

library(data.table)
library(tidyverse)


dt <- fread(".... data.csv") #load 200MB datafile

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)

a <- Sys.time()
dt1[, MONTH := month(as.Date(DATE))]
b <- Sys.time(); datatabletook <- b-a

c <- Sys.time()
dt_dplyr <- dt2 %>%
  mutate(MONTH = month(as.Date(DATE)))
d <- Sys.time(); dplyr_reassign_took <- d - c 

e <- Sys.time()
dt3 %>%
  mutate(MONTH = month(as.Date(DATE)))
f <- Sys.time(); dplyrtook <- f - e

datatabletook        = 17sec
dplyrtook            = 47sec
dplyr_reassign_took  = 17sec
Dan
  • `dt1 <- dt` does not create a new object called `dt1`; it just makes another pointer to `dt`. Try `dt1 <- copy(dt)`. – Frank Feb 16 '17 at 15:29
  • @Frank -- edited the question with your suggestion. Same/similar results. – Dan Feb 16 '17 at 16:10
  • Ok, interesting/unexpected result. I guess it would be helpful to have a reproducible example. – Frank Feb 16 '17 at 16:14
  • Perhaps it's not the assignment that's the problem, but the printing to the console that takes the extra time? Without a repro it's a bit of a guessing game. Also, `dt` is a `data.table`, so are there some conversions going on? – Axeman Feb 16 '17 at 16:23
  • To check if it's the printing, you could wrap in `system.time()` instead of doing the double `Sys.time()` thing. – Frank Feb 16 '17 at 17:26
  • @Frank: wrapped in `system.time()` and they all take 17sec. Do you want to add the answer? `dt1 <- copy(dt); dt2 <- copy(dt); dt3 <- copy(dt); system.time(dt1[, MONTH := month(as.Date(DATEWEATHER))]); system.time(dt_dplyr <- dt2 %>% mutate(MONTH = month(as.Date(DATEWEATHER)))); system.time(dt3 %>% mutate(MONTH = month(as.Date(DATEWEATHER))))` – Dan Feb 16 '17 at 20:58

1 Answer

There are a couple of ways to benchmark with base R:

.t0 <- Sys.time()
    ...
.t1 <- Sys.time()
.t1 - .t0

# or

system.time({
    ...
})

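With the Sys.time way, you're sending each line to the console separately, so any line that returns a visible value gets printed, as @Axeman suggested. In the question, the unassigned mutate() call prints its whole result, while the assigned version and data.table's := (which returns invisibly) print nothing. With system.time({...}), there is only one return value (the last result inside the braces) and system.time will suppress it from printing.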

If the printing is costly enough but is not part of what you want to measure, it can make a difference.
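
As a rough illustration of that printing cost, the sketch below times the same computation once with the result assigned and once with the result printed. The data frame `df` and its columns are made up for the example, and `capture.output()` is used only to keep the printed rows off the console; the actual timings will vary by machine.

```r
# Hypothetical data; the size and column names are illustrative only.
set.seed(1)
df <- data.frame(x = seq_len(1e5), y = rnorm(1e5))

# Result is assigned, so nothing is printed.
t_assigned <- system.time(
  res <- transform(df, z = x * 2)
)

# Result is explicitly printed. capture.output() swallows the text,
# but the cost of formatting the rows is still part of the timing.
t_printed <- system.time(
  out <- capture.output(print(transform(df, z = x * 2)))
)

t_assigned[["elapsed"]]
t_printed[["elapsed"]]
```

Comparing the two elapsed values gives a feel for how much of a Sys.time-style measurement could be printing rather than computation.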


There are good reasons to prefer system.time over Sys.time for benchmarking; from @MattDowle's comment:

i) it does a gc first excluded from the timing to isolate from random gc's and

ii) it includes user and sys time as well as elapsed wall clock time.

The Sys.time() way will be affected by reading your email in Chrome or using Excel while the test runs, the system.time() way won't so long as you use the user and sys parts of the result.
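
To make the user/sys versus elapsed distinction concrete, here is a small sketch of pulling the individual components out of a system.time() result. The workload (sorting random numbers) is an arbitrary stand-in, and `gcFirst = TRUE` is spelled out only for clarity, since it is already the default.

```r
# Arbitrary stand-in workload; replace with the code you want to time.
timing <- system.time({
  x <- sort(runif(1e6))
}, gcFirst = TRUE)

timing  # shows user, system and elapsed times

# CPU time used by this R process (user + system).
cpu_time <- timing[["user.self"]] + timing[["sys.self"]]

# Wall-clock time, which also reflects whatever else the machine was doing.
wall_time <- timing[["elapsed"]]
```

If `wall_time` is much larger than `cpu_time`, the difference may come from other activity on the machine or from waiting on I/O rather than from the code itself.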

Frank
  • `system.time({...})` is much better too because i) it does a gc first excluded from the timing to isolate from random gc's and ii) it includes `user` and `sys` time as well as `elapsed` wall clock time. The `Sys.time()` way will be affected by reading your email in Chrome or using Excel while the test runs, the `system.time()` way won't so long as you use the `user` and `sys` parts of the result. – Matt Dowle Feb 16 '17 at 21:24