
I have a database of tweets which I am currently downloading. I want to assign a factor to each tweet based on its timestamp. However, this problem turned out to be more challenging than I expected.

My example looks like this:

library(tidyverse)
library(lubridate)

creating boundaries:

start_time<-ymd_hms("2017-03-09 9:30:00", tz="EST")
end_time<-start_time+days()
start_time<-as.numeric(start_time)
end_time<-as.numeric(end_time)

creating the first table. This table represents the table of tweets. On my PC, one day amounts to around 1M tweets, with around 1700 different timestamps:

example_times<-sample(start_time:end_time, 15)
example_table<-as.data.frame(rep(example_times, 200))
example_table$var<-as.character(as.roman(1:dim(example_table)[1]))
colnames(example_table)<-c("unix_ts", "text")
example_table$unix_ts<-as.POSIXct(example_table$unix_ts, origin=origin)

creating the second table, from which I take the times and the factor that should be assigned to each tweet. At the moment I have only two classes, but I would like to create more in the future:

breaks<-c(1489069800, 1489071600, 1489073400, 1489075200, 1489077000, 1489078800, 
          1489080600, 1489082400, 1489084200, 1489086000, 1489087800, 1489089600, 
          1489091400, 1489093200, 1489156200)
classes<-c('DOWN', 'UP', 'UP', 'UP', 'UP', 'DOWN', 'UP', 'UP', 'UP', 'DOWN', 'DOWN', 'DOWN', 'UP', 'DOWN', 'UP')
key<-data.frame(breaks, classes, stringsAsFactors = FALSE)
key$breaks<-as.POSIXct(breaks, origin = origin)
key <- key %>% mutate(intrvl = interval(lag(breaks), breaks))

my attempt to solve this problem looks like this:

assign_group<-function(unix_time){
    result<-key %>% 
        filter(unix_time %within% key$intrvl) %>%
        select(classes) %>%
        unlist
    names(result)<-NULL
    return(result)
}
sapply(example_table$unix_ts, assign_group)

this example is small, and this solution works quite fast here; however, it is unmanageable on a dataset of 1M tweets. And even though the dataset is big, there are only about 1500 distinct timestamps that I need to classify using assign_group. Could you please suggest a faster solution?

johnnyheineken
    Are you looking for `?cut.POSIXt`? – alistaire Mar 16 '17 at 22:29
  • @alistaire Hi, thanks, this might be the solution. However, I can't make it work. This function (probably) requires the number of labels to match the number of breaks, but I have only two levels. `Error in cut.default(unclass(x), unclass(breaks), labels = labels, right = right, : lengths of 'breaks' and 'labels' differ` Both breaks and my labels have the same length, yet I get this error. – johnnyheineken Mar 17 '17 at 08:47
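For reference (a sketch, not from the original thread): `cut()` builds one interval between each pair of consecutive breaks, so it expects `length(breaks) - 1` labels, which is what the error above is complaining about:

```r
# Sketch: cut() builds one interval between each pair of consecutive
# breaks, so it needs length(breaks) - 1 labels (toy break values,
# shortened from the question's vector).
brk <- as.POSIXct(c(1489069800, 1489071600, 1489073400),
                  origin = "1970-01-01", tz = "EST")
ts  <- as.POSIXct(c(1489070000, 1489072000),
                  origin = "1970-01-01", tz = "EST")
cut(ts, breaks = brk, labels = c("DOWN", "UP"))  # one label per interval
```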

1 Answer


Looks like your use of dplyr is causing some problems. Instead, try the following:

First, remove the first row from key (if you can); the NA-NA interval seems useless. Via `key <- key[-1, ]`

Then rewrite your assign_group function as:

assign_group <- function(unix_time) {
  key[unix_time %within% key$intrvl, "classes"]
}

I love dplyr, but base R is probably a better and faster option in this case.

Finally, sapply tends to be pretty slow (see this post). Go with other functions like map_* from purrr (which you get with library(tidyverse)). E.g., you could try `map_chr(example_table$unix_ts, assign_group)`, or, to add the factor as a new column to your data frame, `mutate(example_table, ts_factor = as.factor(map_chr(unix_ts, assign_group)))`.
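As a further speed-up (a sketch, not part of the original answer), the lookup can be vectorized entirely with base R's `findInterval()`, since classes[i] labels the interval (breaks[i-1], breaks[i]]:

```r
# Sketch: vectorized class lookup with findInterval() instead of a
# row-wise %within% filter. Assumes `breaks` is sorted ascending and
# that classes[i] labels the interval (breaks[i-1], breaks[i]].
breaks  <- c(1489069800, 1489071600, 1489073400)  # toy breaks, shortened
classes <- c("DOWN", "UP", "DOWN")                # toy classes, one per break
ts      <- c(1489070000, 1489072000)              # numeric (unix) timestamps

# findInterval(..., left.open = TRUE) returns i with breaks[i] < t <= breaks[i+1];
# the +1 shifts to the matching row of `classes`. Timestamps at or before the
# first break map to classes[1]; timestamps after the last break give NA.
idx <- findInterval(ts, breaks, left.open = TRUE) + 1
classes[idx]  # "UP" "DOWN"
```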

Simon Jackson
  • Hi, thank you for your answer. First - it was impossible to remove the first row in the original, as it was a tibble and that produced an error. I therefore transformed it into a data.frame and had no problem. Second - thanks for the function. I had tried something similar, but abandoned the idea along the way. However, this one works flawlessly. – johnnyheineken Mar 17 '17 at 08:22
  • Third: I tried both sapply and map_chr; however, the time spent on 100,000 tweets is almost the same. `system.time(vec<-sapply(test_day$unix_ts[1:100000], assign_group))` user 23.01 system 0.00 elapsed 23.33 `system.time(vec<-map_chr(test_day$unix_ts[1:100000], assign_group))` user 22.63 system 0.00 elapsed 22.99. Overall I believe there must be some fast and easy solution, since I should be able to take the whole subset of identical timestamps and assign the same factor to each of them. However, I am not sure how to tell this to R. – johnnyheineken Mar 17 '17 at 08:26
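The subset idea in the comment above can be sketched like this: classify each unique timestamp once, then broadcast the result back with `match()`. Here `classify` is a hypothetical stand-in for the thread's `assign_group`:

```r
# Sketch of the idea above: classify each *unique* timestamp once, then
# broadcast the result with match() -- ~1700 classifications instead of ~1M.
classify <- function(t) if (t %% 2 == 0) "UP" else "DOWN"  # stand-in for assign_group
ts <- c(10, 11, 10, 12, 11, 10)       # many repeated timestamps, few unique ones

uniq       <- unique(ts)
uniq_class <- vapply(uniq, classify, character(1))
result     <- uniq_class[match(ts, uniq)]
result  # "UP" "DOWN" "UP" "UP" "DOWN" "UP"
```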