
I am having trouble finding the size of my broadcast variables. This matters for my project because I am pushing the memory limits of the cluster, which runs on YARN. In the application manager I can see the memory usage for the individual executors and for the driver, but I believe those figures cover only the persisted RDDs.

– Jan van der Vegt
  • Broadcasted data is just a plain Python object. It doesn't occupy any special space AFAIK. You should be able to simply estimate its local size (`sys.getsizeof` should be enough for local objects) and multiply it by the number of executors. – zero323 Feb 12 '16 at 15:53
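A minimal sketch of zero323's suggestion, assuming a flat list payload and a hypothetical num_executors value read from your cluster configuration. Note that sys.getsizeof counts only the container itself, so the sketch adds one level of element sizes; a deeply nested object would need a recursive walk:

import sys

data = list(range(int(10 * 1e6)))  # stand-in for your broadcast payload

# sys.getsizeof reports only the list's own pointer array, not the
# elements it references, so add the element sizes for a flat list.
shallow = sys.getsizeof(data)
deep = shallow + sum(sys.getsizeof(x) for x in data)

num_executors = 10  # assumption: take this from your cluster config
print("local estimate: %d bytes, cluster-wide: ~%d bytes"
      % (deep, deep * num_executors))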

1 Answer


Spark uses pickle to serialize/deserialize broadcast variables. One thing you may want to try is checking the size of the pickle dump, e.g.:

>>> import cPickle as pickle  # on Python 3, use the built-in pickle module
>>> data = list(range(int(10 * 1e6)))  # or whatever your broadcast variable is
>>> len(pickle.dumps(data))
98888896  # size of the serialized broadcast variable, in bytes

As for broadcast variables affecting the memory limits of your cluster, a previous question of mine has some helpful tips from zero323.
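If you want a pre-flight check before actually shipping the data, here is a minimal sketch, assuming a running SparkContext named sc and the same stand-in payload as above (Spark may pickle with a different protocol internally, so treat the figure as an approximation):

import pickle  # cPickle on Python 2

from pyspark import SparkContext

sc = SparkContext(appName="broadcast-size-check")  # or reuse an existing context
data = list(range(int(10 * 1e6)))  # stand-in for your broadcast payload

# Measure the serialized size before handing the object to Spark; this
# is roughly what each executor will have to hold in memory.
print("will ship roughly %d bytes per executor" % len(pickle.dumps(data)))

broadcast = sc.broadcast(data)  # standard PySpark broadcast API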

– captaincapsaicin
  • I thought that PySpark serializes things in Java? Are you sure this is exactly how PySpark serializes data, @captaincapsaicin? – makansij Dec 07 '17 at 23:12
  • https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py indicates to me that PySpark uses a wrapper around pickle to serialize. – captaincapsaicin Feb 09 '18 at 19:41
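If you would rather get the estimate from PySpark's own serialization path than from a hand-rolled pickle call, here is a minimal sketch, assuming pyspark.serializers.PickleSerializer is importable (the class name and pickle protocol have shifted across Spark versions, so check your release):

from pyspark.serializers import PickleSerializer

data = list(range(int(10 * 1e6)))  # stand-in for your broadcast payload

# dumps() pickles the object the way Spark's Python serializer does,
# so its length should track the broadcast payload size closely.
print(len(PickleSerializer().dumps(data)))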