
We have a UUID UDF:

    import java.util.UUID
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.sql.functions.udf
    val idgen = new AtomicLong()   // counter backing the generated IDs
    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
    spark.udf.register("idgen", idUdf)

The issue is that each action (count, show, write) re-runs the UDF, so every action sees a different value for the same row:

    df.count()             // generates a UUID for each row
    df.show()              // regenerates a UUID for each row
    df.write.parquet(path) // .. you get the picture ..

What approaches might be taken to retain a single UUID result for a given row? The first thought is to invoke a remote key-value store keyed by some unique combination of other stable fields within each row. That is of course expensive, both due to the lookup per row and the configuration and maintenance of the remote KV store. Are there other mechanisms to achieve stability for these unique ID columns?
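For reference, one such mechanism, sketched below, is to derive the ID from the same stable fields locally by hashing them, so no remote store is needed. The column names `userId` and `eventTime` are hypothetical stand-ins for whatever fields are actually stable in the data:

    import org.apache.spark.sql.functions.{col, concat_ws, sha2}

    // Deterministic ID: same stable inputs always hash to the same value,
    // so count, show, and write all observe identical IDs for a row.
    val withStableId = df.withColumn(
      "row_id",
      sha2(concat_ws("|", col("userId"), col("eventTime")), 256)
    )

The trade-off is that the ID is only as unique as the chosen fields: two rows with identical inputs produce identical IDs.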

WestCoastProjects
  • If the other questions treat on similar topics, it is entirely not evident from their titles. This question identifies a specific well defined scenario along with precise keywords for search. Well at least googlers can get to the other answers (if useful) via a couple of hops proxied by this question. – WestCoastProjects Aug 24 '23 at 04:47

1 Answer


Just define your UDF as nondeterministic by calling:

    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
      .asNondeterministic()

This tells Catalyst the UDF must not be re-evaluated or duplicated during optimization, so it runs just once per row within a plan. Note that each separate action still re-executes the plan; cache or checkpoint the DataFrame if the same IDs must survive across count, show, and write, as sketched below.
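A minimal sketch combining the nondeterministic UDF with caching, assuming the `df` and `path` from the question:

    import java.util.UUID
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.sql.functions.udf

    val idgen = new AtomicLong()
    val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
      .asNondeterministic()

    // cache() materializes the generated IDs on the first action, so later
    // actions reuse them instead of re-running the UDF.
    val withId = df.withColumn("id", idUdf()).cache()

    withId.count()                 // computes and caches the IDs
    withId.show()                  // same IDs
    withId.write.parquet(path)     // same IDs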

Tom Lous