
I need to generate a new column in my DataFrame with random timestamps at a granularity of seconds. The DataFrame contains 10,000 rows. The starting timestamp should be 1516364153. I tried to solve the problem as follows:

df.withColumn("timestamp",lit(1516364153 + scala.util.Random.nextInt(2000)))

However, all timestamps are equal to some specific value, for example 1516364282, instead of taking many different values. There might be some duplicates, but why are all values the same? It looks like only one random number was generated and then propagated over the whole column.

How can I solve this problem?

Markus

2 Answers


Just use rand:

import org.apache.spark.sql.functions.{lit, rand}

df.withColumn("timestamp", (lit(1516364153) + rand() * 2000).cast("long"))
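This works because `rand` is evaluated per row by Spark, unlike the driver-side `scala.util.Random.nextInt` call, and the `cast("long")` truncates the doubles back to integral timestamps. If you also need the column to be reproducible across runs, `rand` accepts a seed; a sketch, assuming `df` and a SparkSession are already in scope (the seed value 42 is arbitrary):

```scala
import org.apache.spark.sql.functions.{floor, lit, rand}

// rand(seed) still draws a different value for every row; the seed
// only makes the whole column reproducible across runs.
val withTs = df.withColumn(
  "timestamp",
  (lit(1516364153) + floor(rand(42) * 2000)).cast("long")
)
```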
Alper t. Turker
  • Thank you for this code snippet, which might provide some limited, immediate help. A [proper explanation](https://meta.stackexchange.com/q/114762) would greatly improve its long-term value by showing why this is a good solution to the problem and would make it more useful to future readers with other, similar questions. Please [edit](https://meta.stackoverflow.com/posts/360251/edit) your answer to add some explanation, including the assumptions you’ve made. [ref](https://meta.stackoverflow.com/a/360251/8371915) – Alper t. Turker Feb 10 '18 at 10:57
  • When I execute this code I get values `1.5163641530446012E9`. – Markus Feb 10 '18 at 22:19
  • @Markus, did you try `df.withColumn("timestamp",lit(1516364153) + scala.util.Random.nextInt(2000))`? – Ramesh Maharjan Feb 10 '18 at 23:47
  • @RameshMaharjan: No, actually it generates the same values again. – Markus Feb 11 '18 at 13:47

As stated in this answer:

The reason why the random number is always the same, may be that it is created and initialized with a seed before the data is partitioned.

So one possible solution would be to use a UDF:

import org.apache.spark.sql.functions.udf

// The UDF body runs once per row, so each row draws its own random offset
val randomTimestamp = udf((s: Int) => s + scala.util.Random.nextInt(2000))

And then use it in the withColumn method:

df.withColumn("timestamp", randomTimestamp(lit(1516364153)))
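One caveat worth mentioning: Spark's optimizer assumes UDFs are deterministic, so a random UDF can in principle be re-executed or collapsed during query planning. On Spark 2.3 and later you can mark it as nondeterministic explicitly; a sketch, assuming the same `df`:

```scala
import org.apache.spark.sql.functions.{lit, udf}

// asNondeterministic (Spark 2.3+) tells the optimizer not to assume
// repeated invocations return the same value.
val randomTimestamp = udf((s: Int) => s + scala.util.Random.nextInt(2000))
  .asNondeterministic()

df.withColumn("timestamp", randomTimestamp(lit(1516364153)))
```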

I made a quick test in the spark-shell:

Original DataFrame:

+-----+-----+
| word|value|
+-----+-----+
|hello|    1|
|hello|    2|
|hello|    3|
+-----+-----+

Output:

+-----+-----+----------+
| word|value| timestamp|
+-----+-----+----------+
|hello|    1|1516364348|
|hello|    2|1516364263|
|hello|    3|1516365083|
+-----+-----+----------+
SCouto
  • This could be the answer but RNG cannot be used like this [About how to add a new column to an existing DataFrame with random values in Scala](https://stackoverflow.com/q/42367464/8371915) – Alper t. Turker Feb 10 '18 at 10:56