
We have a UUID UDF:

    import java.util.UUID
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.sql.functions.udf
    val idgen = new AtomicLong()   // counter backing the generated IDs
    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
    spark.udf.register("idgen", idUdf)

The issue is that each action (count, show, write) re-runs the UDF, so every action sees a different value for the same row:

    df.count()             // generates a UUID for each row
    df.show()              // regenerates a UUID for each row
    df.write.parquet(path) // .. you get the picture ..

What approaches might be taken to retain a single UUID result for a given row? The first thought is to invoke a remote key-value store keyed by some unique combination of other stable fields within each row. That is of course expensive, both due to the lookup per row and the configuration and maintenance of the remote KV store. Are there other mechanisms to achieve stability for these unique ID columns?
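For reference, one such mechanism, sketched below, is to derive the ID from the same stable fields locally by hashing them, so no remote store is needed. The column names `userId` and `eventTime` are hypothetical stand-ins for whatever fields are actually stable in the data:

    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    // Deterministic ID: same stable inputs always hash to the same value,
    // so count, show, and write all observe identical IDs for a row.
    val withStableId = df.withColumn(
      "row_id",
      sha2(concat_ws("|", col("userId"), col("eventTime")), 256)
    )

The trade-off is that the ID is only as unique as the chosen fields: two rows with identical inputs produce identical IDs.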

WestCoastProjects
  • If the other questions treat on similar topics, it is entirely not evident from their titles. This question identifies a specific well defined scenario along with precise keywords for search. Well at least googlers can get to the other answers (if useful) via a couple of hops proxied by this question. – WestCoastProjects Aug 24 '23 at 04:47

1 Answer


Just define your UDF as nondeterministic by calling:

    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
      .asNondeterministic()

This tells Catalyst the UDF must not be re-evaluated or duplicated during optimization, so it runs just once per row within a plan. Note that each separate action still re-executes the plan; cache or checkpoint the DataFrame if the same IDs must survive across count, show, and write, as sketched below.
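A minimal sketch combining the nondeterministic UDF with caching, assuming the `df` and `path` from the question:

    import java.util.UUID
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.sql.functions.udf

    val idgen = new AtomicLong()
    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
      .asNondeterministic()

    // cache() materializes the generated IDs on the first action, so later
    // actions reuse them instead of re-running the UDF.
    val withId = df.withColumn("id", idUdf()).cache()

    withId.count()                 // computes and caches the IDs
    withId.show()                  // same IDs
    withId.write.parquet(path)     // same IDs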

Tom Lous