Parallelism when reading Oracle data using Spark 1.6.2 JDBC

We have around 40 million records in the table.

How do I choose the values of lowerBound, upperBound, and numPartitions when reading data from Oracle with Spark 1.6.2?

I have a partition key. We have a DataStax cluster: 3 nodes, 18 cores each, 27 GB of memory each.

Thanks in advance.

Chandra asked Oct 26 '25


1 Answer

The simplest heuristic is to calculate the min and max of the partition column's values. In SQL (via plain JDBC) you can run:

select min(column) as lowerBound, max(column) as upperBound from table;

Then set lowerBound and upperBound to the results of that query.
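To make this concrete, here is a minimal sketch of the heuristic end to end, assuming a numeric column named id, a table named some_table, and a placeholder connection URL (all of these names are illustrative, and the Oracle JDBC driver must be on the classpath):

import java.sql.DriverManager

val url = "jdbc:oracle:thin:@//host:1521/service" // placeholder Oracle URL
val conn = DriverManager.getConnection(url, "user", "password")
val rs = conn.createStatement()
  .executeQuery("select min(id) as lowerBound, max(id) as upperBound from some_table")
rs.next()
val lower = rs.getLong("lowerBound")
val upper = rs.getLong("upperBound")
conn.close()

val df = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "some_table")
  .option("partitionColumn", "id") // must be a numeric column in Spark 1.6
  .option("lowerBound", lower.toString)
  .option("upperBound", upper.toString)
  .option("numPartitions", "54") // e.g. executors * cores, see below
  .load()

Spark splits the [lowerBound, upperBound] range into numPartitions stripes and issues one WHERE-bounded query per partition, so all four options must be set together.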

But it's not always so simple. Sometimes the column has a very skewed distribution: SomeFunnyID may, for example, have 100 values in the range [0..100] and then 100,000,000 values above 100,000. Then your reads will be very skewed and, because of that, slow.

In such cases I recommend setting the dbtable parameter to a subquery such as select mod(s.someFunnyID, partitionCount) as partition_key, s.* from table s, and then setting lowerBound to 0 and upperBound to partitionCount:

val partitionCount = 54 // here choose partition count, e.g. executors * cores
val df = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:1521/service") // your Oracle URL
    .option("dbtable",
      s"(select mod(s.someFunnyID, $partitionCount) as partition_key, s.* from some_table s) t")
    .option("partitionColumn", "partition_key")
    .option("lowerBound", "0")
    .option("upperBound", partitionCount.toString)
    .option("numPartitions", partitionCount.toString)
    .load()

For partitionCount, I have only one simple heuristic: number of executors * executor cores.
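As a sketch of that heuristic, assuming static allocation with spark.executor.instances and spark.executor.cores set (the fallback defaults below just mirror the cluster described in the question):

val conf = sc.getConf
val executors = conf.getInt("spark.executor.instances", 3)     // 3 nodes
val coresPerExecutor = conf.getInt("spark.executor.cores", 18) // 18 cores each
val partitionCount = executors * coresPerExecutor              // 54 here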

Be aware that you must benchmark your configuration each time to see whether it's correct in your case.

T. Gawęda answered Oct 29 '25


