Parallelism when reading Oracle data using Spark 1.6.2 JDBC

We have around 40 million records in the table.

How do I choose the values of lowerBound, upperBound, and numPartitions when reading data from Oracle with Spark 1.6.2?

I have a partition key. We have a DataStax cluster: 3 nodes, 18 cores each, 27 GB of memory each.

Thanks in advance.

Chandra asked Oct 26 '25


1 Answer

The simplest heuristic is to calculate the min and max of the partition column's values. In SQL (via plain JDBC) you can run:

select min(column) as lowerBound, max(column) as upperBound from table;

Then set lowerBound and upperBound to the results of that query.
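To make this concrete, here is a minimal sketch of the heuristic end to end, assuming a numeric column named id, a table named some_table, and a placeholder connection URL (all of these names are illustrative, and the Oracle JDBC driver must be on the classpath):

import java.sql.DriverManager

val url = "jdbc:oracle:thin:@//host:1521/service" // placeholder Oracle URL
val conn = DriverManager.getConnection(url, "user", "password")
val rs = conn.createStatement()
  .executeQuery("select min(id) as lowerBound, max(id) as upperBound from some_table")
rs.next()
val lower = rs.getLong("lowerBound")
val upper = rs.getLong("upperBound")
conn.close()

val df = sqlContext.read.format("jdbc")
  .option("url", url)
  .option("dbtable", "some_table")
  .option("partitionColumn", "id") // must be a numeric column in Spark 1.6
  .option("lowerBound", lower.toString)
  .option("upperBound", upper.toString)
  .option("numPartitions", "54") // e.g. executors * cores, see below
  .load()

Spark splits the [lowerBound, upperBound] range into numPartitions stripes and issues one WHERE-bounded query per partition, so all four options must be set together.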

But it's not always so simple. Sometimes the column has a very skewed distribution: SomeFunnyID may, for example, have 100 values in the range [0..100] and then 100,000,000 values above 100,000. Then your reads will be very skewed and, because of that, slow.

In such cases I recommend setting the dbtable parameter to a subquery such as select mod(s.someFunnyID, partitionCount) as partition_key, s.* from table s, and then setting lowerBound to 0 and upperBound to partitionCount:

val partitionCount = 54 // here choose partition count, e.g. executors * cores
val df = sqlContext.read
    .format("jdbc")
    .option("url", "jdbc:oracle:thin:@//host:1521/service") // your Oracle URL
    .option("dbtable",
      s"(select mod(s.someFunnyID, $partitionCount) as partition_key, s.* from some_table s) t")
    .option("partitionColumn", "partition_key")
    .option("lowerBound", "0")
    .option("upperBound", partitionCount.toString)
    .option("numPartitions", partitionCount.toString)
    .load()

For partitionCount, I have only one simple heuristic: number of executors * executor cores.
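As a sketch of that heuristic, assuming static allocation with spark.executor.instances and spark.executor.cores set (the fallback defaults below just mirror the cluster described in the question):

val conf = sc.getConf
val executors = conf.getInt("spark.executor.instances", 3)     // 3 nodes
val coresPerExecutor = conf.getInt("spark.executor.cores", 18) // 18 cores each
val partitionCount = executors * coresPerExecutor              // 54 here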

Be aware that you must benchmark your configuration each time to see whether it's correct in your case.

T. Gawęda answered Oct 29 '25


