While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters such as partitionColumn, lowerBound, upperBound, and numPartitions. I have gone through the Spark documentation but wasn't able to understand it.
Can anyone explain the meaning of these parameters?
It is simple:
partitionColumn is the column used to determine the partitions.
lowerBound and upperBound determine the range of partitionColumn values over which the partition stride is computed. They are not used to filter rows: values below lowerBound end up in the first partition and values above upperBound in the last one, so the complete dataset still corresponds to the following query:
    SELECT * FROM table
numPartitions determines the number of partitions to be created. The range between lowerBound and upperBound is divided into numPartitions partitions, each with a stride equal to:
    upperBound / numPartitions - lowerBound / numPartitions
For example, if:
    lowerBound: 0
    upperBound: 1000
    numPartitions: 10
The stride is equal to 100 and the partitions correspond to the following queries:
    SELECT * FROM table WHERE partitionColumn < 100 OR partitionColumn IS NULL
    SELECT * FROM table WHERE partitionColumn >= 100 AND partitionColumn < 200
    ...
    SELECT * FROM table WHERE partitionColumn >= 900
The ranges are half-open, so no row is read twice, and the first and last partitions are unbounded, so rows outside the lowerBound/upperBound range are still fetched.
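To make the stride arithmetic concrete, here is a minimal Python sketch that generates one WHERE clause per partition. It is not Spark's code: the function name partition_predicates is hypothetical, it assumes non-negative integer bounds, and it follows my understanding of how Spark's JDBC source builds half-open ranges with an unbounded first and last partition. The exact clauses may differ between Spark versions.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE clause per partition (illustrative, not Spark's own code)."""
    if num_partitions < 2 or upper_bound - lower_bound < num_partitions:
        raise ValueError("need at least 2 partitions and a range at least that wide")

    # Stride as described above, using integer division.
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last one no upper limit.
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None

        if lower_clause is None:
            # NULL values are assumed to go to the first partition.
            predicates.append(f"{upper_clause} OR {column} IS NULL")
        elif upper_clause is None:
            predicates.append(lower_clause)
        else:
            predicates.append(f"{lower_clause} AND {upper_clause}")
    return predicates


# Usage with the values from the example: 0, 1000, 10.
for predicate in partition_predicates("partitionColumn", 0, 1000, 10):
    print("SELECT * FROM table WHERE " + predicate)
```

Because the first and last clauses are open-ended, badly chosen bounds do not drop rows; they only make the outer partitions larger than the rest.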
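For completeness, this is roughly how the four parameters could be passed to a JDBC read from PySpark. Treat it as a sketch under assumptions: the helper names, the connection URL, the table name, and the column name are all hypothetical placeholders, credentials are left out, and read_partitioned is only defined here, not called, since it needs a running SparkSession and the SQL Server JDBC driver on the classpath.

```python
def jdbc_partition_options(url, table, partition_column,
                           lower_bound, upper_bound, num_partitions):
    """Build the option map for a partitioned JDBC read (hypothetical helper)."""
    return {
        "url": url,
        "dbtable": table,
        "partitionColumn": partition_column,
        # Option values are passed as strings.
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
        "numPartitions": str(num_partitions),
    }


def read_partitioned(spark, options):
    """Load the table through the JDBC source; `spark` is an existing SparkSession."""
    return spark.read.format("jdbc").options(**options).load()


options = jdbc_partition_options(
    url="jdbc:sqlserver://localhost:1433;databaseName=example_db",  # placeholder URL
    table="dbo.example_table",                                      # placeholder table
    partition_column="id",                                          # placeholder column
    lower_bound=0,
    upper_bound=1000,
    num_partitions=10,
)
print(options)
```

The four partitioning options have to be supplied together; numPartitions also limits how many JDBC connections are opened at the same time.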