We have a UUID UDF:
import java.util.UUID
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.sql.functions.udf

val idgen = new AtomicLong(0) // per-JVM counter prefixed to each UUID
val idUdf = udf(() => idgen.incrementAndGet.toString + "_" + UUID.randomUUID)
spark.udf.register("idgen", idUdf)
The issue is that each action re-evaluates the UDF from the DataFrame's lineage, so count, show, and write all see a different generated value for the same row:
df.count() // generates a UUID for each row
df.show() // regenerates a UUID for each row
df.write.parquet(path) // .. you get the picture ..
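
To make the behaviour explicit, here is a minimal reproduction, assuming a throwaway DataFrame built from spark.range; the ID column is recomputed on every action:

val df2 = spark.range(3).withColumn("uid", idUdf())
df2.show() // one set of IDs
df2.show() // a different set for the same three rows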
What approaches might be taken to retain a single UUID result for a given row? The first thought would be to look up a remote key-value store, keyed by some unique combination of other stable fields within each row. That is of course expensive, both due to the lookup per row and due to the configuration and maintenance of the remote KV store. Are there other mechanisms to achieve stability for these unique ID columns?
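
For concreteness, a rough sketch of that KV-store idea. FakeKvStore, k1, and k2 are hypothetical stand-ins: a real deployment would replace the executor-local map with a networked client (Redis, DynamoDB, etc.), since a per-JVM map is only stable within one executor, and that network round trip per row is exactly where the cost comes in:

import java.util.UUID
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.sql.functions.{col, concat_ws, udf}

// Stand-in for the remote store; only illustrates the get-or-set semantics
object FakeKvStore {
  private val map = new ConcurrentHashMap[String, String]()
  def getOrSet(key: String): String =
    map.computeIfAbsent(key, _ => UUID.randomUUID.toString)
}

// Key each row by a concatenation of stable fields; one lookup per row
val stableIdUdf = udf { key: String => FakeKvStore.getOrSet(key) }
val withId = df.withColumn("uid", stableIdUdf(concat_ws("|", col("k1"), col("k2"))))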