 

How to see the contents of each partition in an RDD in pyspark?

Tags:

rdd

pyspark

I want to learn a little more about how pyspark partitions data. I need a function such that:

a = sc.parallelize(range(10), 5)
show_partitions(a)

#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]] (or however it partitions)
Asked by Bovard, Nov 08 '25 02:11
1 Answer

The `glom` method is what you are looking for. From the pyspark docs:

> `glom(self)`: Return an RDD created by coalescing all elements within each partition into a list.

a = sc.parallelize(range(10), 5)
a.glom().collect()
#output:[[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
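For intuition about why the output comes out that way: when `parallelize` is given a plain sequence, Spark splits it into contiguous slices of near-equal size. A rough pure-Python sketch of that slicing (the real logic lives in Spark's `ParallelCollectionRDD`; the helper name `slice_partitions` here is just for illustration):

```python
def slice_partitions(seq, num_slices):
    """Split seq into num_slices contiguous, near-equal slices,
    mimicking how Spark's parallelize distributes a local collection."""
    n = len(seq)
    return [seq[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]

slice_partitions(list(range(10)), 5)
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

Note that `glom` pulls every partition's contents to the driver when combined with `collect()`, so only use it this way on small RDDs.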
Answered by Bovard, Nov 11 '25 12:11


