Spark SQL - single evaluation of non deterministic UDF

Question

I am generating some test data using UDFs in Spark SQL. I have one field, field_b, that uses random number generation in combination with another field, field_a. A third field, field_c, is the value of field_b divided by 100.

i.e.

select
  field_a
  randomUDF(field_a) as field_b
from
  my_table

I do this first, then use a second select (as I can't refer to the generated field) to form the third field, like so:

select
  field_a
  field_b
  divisionUDF(field_b) as field_c
from
  my_table

My problem is that it doesn't calculate the value of field_b; it keeps the reference to the function. This means that the randomly generated part differs and field_c is not field_b/100

Is there a way I can force it to evaluate field_b once and hold the value (short of writing to disk)? Even better, if it could be done in a single select statement (I know I could use a sub-query) that would be great to know.

score 3 · Answer 1 · answered Mar 24 '16 at 14:03

3

Kind of hack-ish, but you could try making randomUDF semi-deterministic. Make it give out the same answer twice in a row. Basically, add a variable prevResult that starts as null. Then switch on prevResult == null. If null, randomly generate the answer and set prevResult. If non-null, return prevResult and set prevResult to null.

That way you could use the function twice in the same select statement -- the first time it gives the random result, the second time it gives the same result.

answered Mar 24 '16 at 14:03

David Griffin

13,677
5
47
65

Or similarly, return an array from randomUDF, with the value and its double, and then split the array into fields in the next SELECT? Haven't tried it though. – max Jul 04 '16 at 04:20
@max yes [you can return a tuple from a UDF](https://www.balabit.com/blog/spark-scala-dataset-tutorial#User-defined-function), which you can later split out. Also, depending on the circumstances, you may be okay with a completely deterministic method, [one that derives psuedo-random values based on a hash of the other fields or upstream process](https://stackoverflow.com/a/42961574/1080804). – ecoe May 17 '18 at 18:42

Spark SQL - single evaluation of non deterministic UDF

1 Answers1