
I need to generate a new column in my DataFrame with random timestamps at a granularity of seconds. The DataFrame contains 10,000 rows. The starting timestamp should be 1516364153. I tried to solve the problem as follows:

df.withColumn("timestamp",lit(1516364153 + scala.util.Random.nextInt(2000)))

However, all timestamps are equal to some specific value, for example 1516364282, instead of taking many different values. There might be some duplicates, but why are all values the same? It looks like only one random number was generated and then propagated over the whole column.

How can I solve this problem?

Markus

2 Answers


Just use rand:

import org.apache.spark.sql.functions.{lit, rand}

df.withColumn("timestamp", (lit(1516364153) + rand() * 2000).cast("long"))
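This works because `rand` is evaluated per row by Spark, unlike the driver-side `scala.util.Random.nextInt` call, and the `cast("long")` truncates the doubles back to integral timestamps. If you also need the column to be reproducible across runs, `rand` accepts a seed; a sketch, assuming `df` and a SparkSession are already in scope (the seed value 42 is arbitrary):

```scala
import org.apache.spark.sql.functions.{floor, lit, rand}

// rand(seed) still draws a different value for every row; the seed
// only makes the whole column reproducible across runs.
val withTs = df.withColumn(
  "timestamp",
  (lit(1516364153) + floor(rand(42) * 2000)).cast("long")
)
```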
Alper t. Turker
  • Thank you for this code snippet, which might provide some limited, immediate help. A [proper explanation](https://meta.stackexchange.com/q/114762) would greatly improve its long-term value by showing why this is a good solution to the problem and would make it more useful to future readers with other, similar questions. Please [edit](https://meta.stackoverflow.com/posts/360251/edit) your answer to add some explanation, including the assumptions you’ve made. [ref](https://meta.stackoverflow.com/a/360251/8371915) – Alper t. Turker Feb 10 '18 at 10:57
  • When I execute this code I get values `1.5163641530446012E9`. – Markus Feb 10 '18 at 22:19
  • @Markus, did you try `df.withColumn("timestamp",lit(1516364153) + scala.util.Random.nextInt(2000))`? – Ramesh Maharjan Feb 10 '18 at 23:47
  • @RameshMaharjan: No, actually it generates the same values again. – Markus Feb 11 '18 at 13:47

As stated in this answer:

The reason why the random number is always the same, may be that it is created and initialized with a seed before the data is partitioned.

So one possible solution would be to use a UDF:

import org.apache.spark.sql.functions.udf

// The UDF body runs once per row, so each row draws its own random offset
val randomTimestamp = udf((s: Int) => s + scala.util.Random.nextInt(2000))

And then use it in the withColumn method:

df.withColumn("timestamp", randomTimestamp(lit(1516364153)))
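One caveat worth mentioning: Spark's optimizer assumes UDFs are deterministic, so a random UDF can in principle be re-executed or collapsed during query planning. On Spark 2.3 and later you can mark it as nondeterministic explicitly; a sketch, assuming the same `df`:

```scala
import org.apache.spark.sql.functions.{lit, udf}

// asNondeterministic (Spark 2.3+) tells the optimizer not to assume
// repeated invocations return the same value.
val randomTimestamp = udf((s: Int) => s + scala.util.Random.nextInt(2000))
  .asNondeterministic()

df.withColumn("timestamp", randomTimestamp(lit(1516364153)))
```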

I made a quick test in the spark-shell:

Original DataFrame:

+-----+-----+
| word|value|
+-----+-----+
|hello|    1|
|hello|    2|
|hello|    3|
+-----+-----+

Output:

+-----+-----+----------+
| word|value| timestamp|
+-----+-----+----------+
|hello|    1|1516364348|
|hello|    2|1516364263|
|hello|    3|1516365083|
+-----+-----+----------+
SCouto
  • This could be the answer but RNG cannot be used like this [About how to add a new column to an existing DataFrame with random values in Scala](https://stackoverflow.com/q/42367464/8371915) – Alper t. Turker Feb 10 '18 at 10:56