
I have a few very, very simple functions in Python that I would like to use as UDFs in Spark SQL. It seems easy to register and use them from Python, but I would like to use them from Java/Scala when using JavaSQLContext or SQLContext. I noticed that in Spark 1.2.1 there is a function registerPython, but it is neither clear to me how to use it nor whether I should ...

Any ideas on how to do this? I think it might have gotten easier in 1.3.0, but I'm limited to 1.2.1.

EDIT: As I'm no longer working on this, I'm now interested in knowing how to do it in any Spark version.

kkonrad

2 Answers

1
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StringType

sc = SparkContext()
sqlContext = SQLContext(sc)

def dummy_function(parameter_key):
    return "abc"

# Register the function under a SQL-visible name, with an explicit return type
sqlContext.udf.register("dummy_function", dummy_function, StringType())

This is how we can define a function and register it for use in any Spark SQL query.

0

Given that the current implementation of Spark UDFs (as of the 2.3.1 documentation) doesn't include any Python UDF registration functionality (Scala and Java only), I'd recommend leveraging Jython to call your Python functions.

You'll be able to define a Java class with methods that call Jython to run your Python functions, then register those methods as UDFs within your SQL context; a sketch of this pattern follows below. While this is more roundabout than registering Python code as a UDF directly, it has the benefit of complying with current patterns and keeping the context switch more maintainable.
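
As a rough illustration, here is a minimal sketch of that pattern, assuming a Spark 2.x SparkSession and Jython 2.7 on the classpath; the class name, the embedded dummy_function source, and the test query are invented for this example rather than taken from any particular codebase:

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import org.python.core.Py;
import org.python.core.PyObject;
import org.python.util.PythonInterpreter;

public class JythonUdfExample {

    // Source of the (hypothetical) simple Python function we want to reuse
    private static final String PY_SRC =
        "def dummy_function(parameter_key):\n" +
        "    return 'abc'\n";

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("jython-udf-sketch")
            .master("local[*]")  // for local testing only
            .getOrCreate();

        // Register a Java UDF whose body delegates to the Python function via Jython
        spark.udf().register("dummy_function", (UDF1<String, String>) key -> {
            // A fresh interpreter per call keeps the lambda serializable;
            // caching one interpreter per executor would be the obvious optimization
            PythonInterpreter interp = new PythonInterpreter();
            try {
                interp.exec(PY_SRC);
                PyObject fn = interp.get("dummy_function");
                return fn.__call__(Py.java2py(key)).asString();
            } finally {
                interp.cleanup();
            }
        }, DataTypes.StringType);

        spark.sql("SELECT dummy_function('anything') AS result").show();
        spark.stop();
    }
}

Instantiating the interpreter inside the UDF body keeps the registered lambda itself serializable, which matters because Spark ships the UDF to each executor (see the comment on serializability below).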

bsplosion
  • Also keep in mind that your code will fail at runtime if you use anything non-serializable inside your function: every object your UDF references must be serializable so the JVM can ship it between nodes during computation. – Rony Aug 20 '18 at 12:56
  • Any update on this? Is Jython still the best option for this? – Khushbu Jun 06 '19 at 02:59
  • While it's still possible to use Jython for this, compatibility is limited to Python 2.7 and its development seems to have stalled for the most part (the last significant news was from [April 2015](https://hg.python.org/jython/file/412a8f9445f7/NEWS)). Given the OP's statement that the Python functions are very simple, it seems it'd be better to refactor them into Java and declare that code in UDFs. – bsplosion Jun 06 '19 at 13:20
  • Is there any new solution available for this problem? I have some ML Python code to be executed in Java Spark SQL. – Surabhi Mundra Mar 16 '20 at 08:38
  • @SurabhiMundra there's no new solution that I'm aware of - Jython still works, but since it is now sitting on a Python 2-only implementation, it's not broadly compatible with recent versions of Spark. [See this other answer for more info about updates](https://stackoverflow.com/a/33433803/2738164). – bsplosion Mar 16 '20 at 15:20