
I have a Scala case class as

final case class TestStruct(
    num_1: Long,
    num_2: Long
)
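
For reference, this is the Spark SQL schema I expect the struct to surface as on the PySpark side (my assumption being that Scala's Long maps to Spark's LongType):

from pyspark.sql.types import LongType, StructField, StructType

# Schema I expect get_struct to return (assumption: Scala Long -> LongType)
expected_schema = StructType([
    StructField("num_1", LongType()),
    StructField("num_2", LongType()),
])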

I have a method that converts a string to this struct, and I want to use it in PySpark. I define the following class:

package com.path.test

// other imports including Gson, etc.
import org.apache.spark.sql.api.java.UDF1

class TestJob extends UDF1[String, TestStruct] {
  def call(someString: String): TestStruct = {
    // code to get TestStruct from someString
  }
}

Then I register it in PySpark using

spark.udf.registerJavaFunction("get_struct", "com.path.test.TestJob")

But when I run `df = spark.sql("select get_struct(string_col) from db.tb")`, it returns an empty struct. Even `df.printSchema()` just shows `struct (nullable = true)` and not its fields (`num_1` and `num_2`).
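
Putting the PySpark side together, this is roughly the minimal reproduction (assuming `spark` is an active SparkSession, the jar containing `com.path.test.TestJob` is on the classpath, and `db.tb` has a string column `string_col`):

# Register the Scala UDF by its fully qualified class name.
spark.udf.registerJavaFunction("get_struct", "com.path.test.TestJob")

# Call it through Spark SQL.
df = spark.sql("select get_struct(string_col) from db.tb")

# This only reports `struct (nullable = true)` for the UDF column,
# without the num_1 / num_2 fields I expect.
df.printSchema()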

Other people have shown success with integers (links below), but not with struct / case class data types:

spark-how-to-map-python-with-scala-or-java-user-defined-functions

using-scala-classes-as-udf-with-pyspark

Any help would be appreciated.

ahoosh
  • Try using `java.lang.Long`, perhaps. Scala's `Long` is not the same as Java's boxed `Long`. – user Nov 30 '20 at 22:53
  • The real case class has other fields as well, including `Map[String, Int]` for some fields and other case classes for other fields. Should everything be Java data types, and not Scala? – ahoosh Nov 30 '20 at 22:57
  • I think so, yes (but don't take my word for it, try out a minimal example first). – user Nov 30 '20 at 22:58

0 Answers