
pyspark RDD expand a row to multiple rows

I have the following RDD in PySpark, and I believe it should be really simple to do, but I haven't been able to figure it out:

information = [ (10, 'sentence number one'),
                (17, 'longer sentence number two') ]

rdd = sc.parallelize(information)

I need to apply a transformation that turns that RDD into this:

[ ('sentence', 10),
  ('number', 10),
  ('one', 10),
  ('longer', 17),
  ('sentence', 17),
  ('number', 17),
  ('two', 17) ]

Basically, expand a sentence value into multiple rows, with the words as keys.

I would like to avoid SQL.

asked Aug 30 '25 16:08 by Franch

1 Answer

Use flatMap:

# Emit one (word, key) pair per word; flatMap flattens the per-row lists
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])

Example:

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
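The lambda passed to flatMap is plain Python, so you can sanity-check the expansion without a SparkContext: map each row to its list of pairs and flatten with itertools.chain, which is exactly what flatMap does per partition. A minimal sketch:

```python
from itertools import chain

information = [(10, 'sentence number one'),
               (17, 'longer sentence number two')]

# Same function as in the flatMap call above
expand = lambda x: [(w, x[0]) for w in x[1].split()]

# chain.from_iterable flattens the list-of-lists, mimicking flatMap
result = list(chain.from_iterable(expand(row) for row in information))
# [('sentence', 10), ('number', 10), ('one', 10),
#  ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
```

If you'd rather keep the key in place and swap afterwards, `rdd.flatMapValues(lambda s: s.split())` expands to `(key, word)` pairs, and a follow-up `.map(lambda kv: (kv[1], kv[0]))` gives the same `(word, key)` result.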
answered Sep 02 '25 06:09 by Psidom