
I have a Scala case class as

final case class TestStruct(
    num_1: Long,
    num_2: Long
)
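
For reference, this is the Spark SQL schema I expect the struct to surface as on the PySpark side (my assumption being that Scala's Long maps to Spark's LongType):

from pyspark.sql.types import LongType, StructField, StructType

# Schema I expect get_struct to return (assumption: Scala Long -> LongType)
expected_schema = StructType([
    StructField("num_1", LongType()),
    StructField("num_2", LongType()),
])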

I have a method that converts a string to this struct, and I want to use it in PySpark. I define the following class:

package com.path.test

// other imports including Gson, etc.
import org.apache.spark.sql.api.java.UDF1

class TestJob extends UDF1[String, TestStruct] {
  def call(someString: String): TestStruct = {
    // code to get TestStruct from someString
  }
}

Then I register it in PySpark using

spark.udf.registerJavaFunction("get_struct", "com.path.test.TestJob")

But when I run `df = spark.sql("select get_struct(string_col) from db.tb")`, it returns an empty struct. Even `df.printSchema()` just shows `struct (nullable = true)` and not its fields (`num_1` and `num_2`).
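
Putting the PySpark side together, this is roughly the minimal reproduction (assuming `spark` is an active SparkSession, the jar containing `com.path.test.TestJob` is on the classpath, and `db.tb` has a string column `string_col`):

# Register the Scala UDF by its fully qualified class name.
spark.udf.registerJavaFunction("get_struct", "com.path.test.TestJob")

# Call it through Spark SQL.
df = spark.sql("select get_struct(string_col) from db.tb")

# This only reports `struct (nullable = true)` for the UDF column,
# without the num_1 / num_2 fields I expect.
df.printSchema()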

Other people have shown success with integers (links below), but not with struct / case class data types:

spark-how-to-map-python-with-scala-or-java-user-defined-functions

using-scala-classes-as-udf-with-pyspark

Any help would be appreciated.

ahoosh
  • Try using `java.lang.Long`, perhaps. Scala's `Long` is not the same as Java's boxed `Long`. – user Nov 30 '20 at 22:53
  • The real case class has other fields as well, including `Map[String, Int]` for some fields and other case classes for other fields. Should everything be Java data types, and not Scala? – ahoosh Nov 30 '20 at 22:57
  • I think so, yes (but don't take my word for it, try out a minimal example first). – user Nov 30 '20 at 22:58

0 Answers