Given that the HashPartitioner docs say:
[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.
Say I want to partition DeviceData by its kind.
case class DeviceData(kind: String, time: Long, data: String)
Would it be correct to partition an RDD[DeviceData] by overwriting the deviceData.hashCode() method and use only the hashcode of kind?
But given that HashPartitioner takes a number of partitions parameter I am confused as to whether I need to know the number of kinds in advance and what happens if there are more kinds than partitions?
Is it correct that if I write partitioned data to disk it will stay partitioned when read?
My goal is to call
deviceDataRdd.foreachPartition(d: Iterator[DeviceData] => ...)
And have only DeviceData's of the same kind value in the iterator.
How about just doing a groupByKey using kind. Or another PairRDDFunctions method.
You make it seem to me that you don't really care about the partitioning, just that you get all of a specific kind in one processing flow?
The pair functions allow this:
rdd.keyBy(_.kind).partitionBy(new HashPartitioner(PARTITIONS))
.foreachPartition(...)
However, you can probably be a little safer with something more like:
rdd.keyBy(_.kind).reduceByKey(....)
or mapValues or a number of the other pair functions that guarantee you get the pieces as a whole
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With