I have a DataFrame that is created through hiveContext by executing a Hive SQL query. The queried data then has to be pushed to different datastores.
The DataFrame has thousands of partitions because of the SQL that I am executing.
To push the data to the datastores, I use mapPartitions(), obtain a connection inside each partition, and push the data.
The load on the data destination is very high because of the number of partitions. I can coalesce() the DataFrame down to a required number of partitions based on its size.
The amount of data generated by the SQL is not the same in all cases. In some cases it may be a few hundred records, and in others it may reach a few million. Hence I need a dynamic way to decide the number of partitions to pass to coalesce().
After googling, I found that SizeEstimator.estimate() can be used to estimate the size of a DataFrame, and that the result can then be divided by some target size to get the number of partitions. But looking at the implementation of SizeEstimator.estimate in Spark's repo showed me that it has been implemented from a single-JVM standpoint and should be used for objects like broadcast variables, but not for RDDs/DataFrames, which are distributed across JVMs.
Can anyone suggest how to resolve my issue? Please also let me know if my understanding is wrong.
Can we use SizeEstimator.estimate to estimate the size of an RDD/DataFrame?
No, we can't use it to estimate the size of an RDD or DataFrame; it will report different sizes.
If you have the data as a Parquet file on disk, you can use the size of that file as the estimate and decide the number of partitions based on it.
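As a rough illustration of deciding the partition count from a size on disk, here is a minimal sketch in plain Python. The function name `partitions_for_size`, the 128 MiB target per partition, and the cap of 200 partitions are all assumptions made for this example; they are not Spark defaults and would need tuning for the actual datastore.

```python
def partitions_for_size(total_bytes,
                        target_bytes_per_partition=128 * 1024 * 1024,
                        min_partitions=1,
                        max_partitions=200):
    """Return a partition count for data of `total_bytes` bytes.

    Hypothetical helper: the target size and the bounds are assumed
    values for illustration, not settings taken from Spark.
    """
    if total_bytes < 0:
        raise ValueError("total_bytes must be non-negative")
    if target_bytes_per_partition <= 0:
        raise ValueError("target_bytes_per_partition must be positive")
    # Integer ceiling division: the number of target-sized chunks
    # needed to hold all of the data.
    wanted = (total_bytes + target_bytes_per_partition - 1) // target_bytes_per_partition
    # Clamp so tiny results still get one partition and huge results
    # cannot open more connections than the destination can handle.
    return max(min_partitions, min(max_partitions, wanted))
```

The resulting number could then be passed to coalesce() before the mapPartitions() step. Note that coalesce() only reduces the partition count, so a value larger than the current number of partitions has no effect.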
"Spark's repo showed me that it has been implemented from a single-JVM standpoint and should be used for objects like broadcast variables, but not for RDDs/DataFrames, which are distributed across JVMs."
This is correct.
See the test cases in Spark's SizeEstimatorSuite.scala to understand it better.