
I'm computing a column in my DataFrame with a UDF that generates random values, and then I want to reference that column from another column, but the UDF keeps getting re-evaluated.

Can I somehow prevent the computed column from being recalculated?

Simplified example:

import scala.util.Random
val r = new Random
val func: () => Int = () => r.nextInt(100)
import org.apache.spark.sql.functions.udf
val udfFunc = udf(func)
import spark.implicits._
val df = Seq(1).toDF("value")
df.withColumn("udfRandom", udfFunc()).withColumn("sameUdfRandom", $"udfRandom").show
+-----+---------+-------------+
|value|udfRandom|sameUdfRandom|
+-----+---------+-------------+
|    1|       51|           76|
+-----+---------+-------------+

Other solutions I can think of:

  • compute every value that depends on the UDF's random value inside the UDF itself, return everything as one struct, and later split the struct into separate columns somehow (see the first sketch below)
  • pass the output of Spark's built-in rand() function into the UDF as an argument, so the randomness is injected from outside and the UDF itself stays deterministic (see the second sketch below)
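
Roughly what I have in mind, building on the df and r from the example above (names like RandomPair, derived and seed are just placeholders, and I haven't verified that either version actually avoids the re-evaluation):

import org.apache.spark.sql.functions.{rand, udf}

// Idea 1: compute the random value and everything derived from it
// inside the UDF, and return them together as one struct column.
case class RandomPair(base: Int, derived: Int)
val pairUdf = udf { () =>
  val n = r.nextInt(100)
  RandomPair(n, n * 2)   // placeholder derived value
}
val withStruct = df
  .withColumn("pair", pairUdf())
  .select($"value", $"pair.base".as("udfRandom"), $"pair.derived".as("derived"))

// Idea 2: let Spark's rand() provide the randomness and keep the UDF
// deterministic, so it only maps its input to the range I need.
val scaleUdf = udf { (x: Double) => (x * 100).toInt }
val withRand = df
  .withColumn("seed", rand())
  .withColumn("udfRandom", scaleUdf($"seed"))
  .withColumn("sameUdfRandom", $"udfRandom")

For the struct version I'm not sure whether extracting two fields out of the struct still triggers two UDF calls, which is part of why I'm asking.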
