Below is my Java UDF code:

package com.udf;

import org.apache.spark.sql.api.java.UDF1;

public class SparkUDF implements UDF1<String, String> {
    @Override
    public String call(String arg) throws Exception {
        if (validateString(arg))
            return arg;
        return "INVALID";
    }

    public static boolean validateString(String arg) {
        // Use short-circuit || so a null arg never reaches arg.length().
        if (arg == null || arg.length() != 11)
            return false;
        return true;
    }
}
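
A note on the null check in validateString: Java's single `|` evaluates both operands even when the first is already true, so a null argument would still reach `arg.length()` and throw a NullPointerException, whereas `||` stops at the first true operand. The following is a minimal standalone sketch of the same validation without any Spark dependency. The class name ValidateStringDemo and the sample inputs are made up for illustration and are not part of the original code.

```java
// Standalone sketch (no Spark on the classpath needed).
// ValidateStringDemo is a hypothetical name used only for this illustration.
class ValidateStringDemo {

    // Null-safe check: `||` short-circuits, so arg.length() is never
    // evaluated when arg is null.
    static boolean validateString(String arg) {
        return !(arg == null || arg.length() != 11);
    }

    // Mirrors the body of SparkUDF.call from the question.
    static String call(String arg) {
        return validateString(arg) ? arg : "INVALID";
    }

    public static void main(String[] args) {
        System.out.println(call("12345678901")); // 11 characters, returned unchanged
        System.out.println(call("short"));       // wrong length, prints INVALID
        System.out.println(call(null));          // null, prints INVALID without throwing
    }
}
```

With this form, a NULL value in the name column should come back as "INVALID" rather than failing the query.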

I build the JAR containing this class as SparkUdf-1.0-SNAPSHOT.jar.

I have a Hive table named sample and want to run the SQL below in the Spark shell:

> select UDF(name) from sample ;

I start spark-shell with the command below:

spark-shell --jars SparkUdf-1.0-SNAPSHOT.jar

Can anyone tell me how to register the UDF in the Spark shell so that I can use it in Spark SQL?

shiv
https://stackoverflow.com/questions/52212709/register-udf-from-external-java-jar-class-in-pyspark may offer some insights. – thebluephantom Feb 19 '19 at 22:06

2 Answers

After some more searching, I found the answer.

The steps are below:

spark-shell --jars SparkUdf-1.0-SNAPSHOT.jar

scala> import com.udf.SparkUDF
scala> import org.apache.spark.sql.types.StringType

scala> spark.udf.register("myfunc", new SparkUDF(),StringType)

scala> val sql1 = """ select myfunc(name) from sample """

scala> spark.sql(sql1).show();

You will get the results.

shiv

If you are testing the UDF from a Jupyter Notebook and your UDF JAR is on S3:

Step 1: Load your UDF JAR into Jupyter Notebook:

%%configure -f 
{ 
    "conf": { 
        "spark.jars": "s3://s3-path/your-udf.jar" 
    } 
} 

Step 2: Register the Scala-based (or Java-based) UDF in PySpark:

spark.udf.registerJavaFunction("myudf", "<udf.package>.<UDFClass>") 

Step 3: Invoke the UDF from Spark SQL:

df = spark.read.parquet("s3://s3-path-to-test-data/ts_date=2021-04-27") 
df.createOrReplaceTempView('stable') 

spark.sql("select *, myudf(arg1,arg2) as result from stable ").show(5,False) 
maths