
I'm writing a PySpark job and I've run into some performance issues. Basically, all it does is read events from Kafka and log the transformations made. The thing is, the transformation is calculated by a method on an object, and that object is pretty heavy, as it contains a graph and an inner cache which gets updated automatically. So when I write the following piece of code:

analyzer = ShortTextAnalyzer(root_dir)
logger.info("Start analyzing the documents from kafka")
ssc.union(*streams) \
    .filter(lambda x: x[1] is not None) \
    .foreachRDD(lambda rdd: rdd.foreach(
        lambda record: analyzer.analyze_short_text_event(record[1])))

It serializes my analyzer, which takes a lot of time because of the graph, and since a copy is shipped to the executors, the cache is only relevant for that specific RDD.

If the job were written in Scala, I could have written an object (a singleton), which would exist once in every executor, and then my analyzer wouldn't have to be serialized each time.

Is there a way to do this in Python? To have my object created once per executor and thereby avoid the serialization process?

Thanks in advance :)

UPDATE: I've read the post How to run a function on all Spark workers before processing data in PySpark?, but the answers there talk about sharing files or broadcasting variables. My object can't be broadcast because it isn't read-only. It updates its inner cache constantly, and that's why I want a single instance of it on every executor (to avoid the need to serialize).


1 Answer


What I ended up doing to avoid my object being serialized was turning my class into a static class: only class variables and class methods. That way each executor imports the class once (along with its relevant variables), and no serialization is needed.
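
Roughly, the pattern looks like the sketch below. This is a minimal illustration, not my actual code: build_graph, do_analysis and the exact signatures are hypothetical placeholders, and the module has to be importable on the executors (e.g. shipped with --py-files). The class-level state lives once per executor Python worker process (assuming worker reuse, which is the default) and is shared by every task that runs there.

# short_text_analyzer.py -- hypothetical sketch of the static-class approach
class ShortTextAnalyzer(object):
    # Class-level state: created when the module is first imported on an
    # executor, then shared by all tasks in that worker process.
    _graph = None
    _cache = {}

    @classmethod
    def analyze_short_text_event(cls, root_dir, event):
        # Lazily build the heavy graph the first time this executor is used.
        if cls._graph is None:
            cls._graph = build_graph(root_dir)  # hypothetical helper
        # Use and update the in-process cache; nothing heavy is pickled
        # with the closure, only a reference to this class.
        if event not in cls._cache:
            cls._cache[event] = do_analysis(cls._graph, event)  # hypothetical helper
        return cls._cache[event]

The streaming code then refers only to the class, so the closure Spark pickles contains just a reference to the module-level class (plus the small root_dir string) instead of a heavy instance:

ssc.union(*streams) \
    .filter(lambda x: x[1] is not None) \
    .foreachRDD(lambda rdd: rdd.foreach(
        lambda record: ShortTextAnalyzer.analyze_short_text_event(root_dir, record[1])))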
