
pyspark RDD expand a row to multiple rows

I have the following RDD in PySpark, and I believe it should be really simple to do, but I haven't been able to figure it out:

information = [ (10, 'sentence number one'),
                (17, 'longer sentence number two') ]

rdd = sc.parallelize(information)

I need to apply a transformation that turns that RDD into this:

[ ('sentence', 10),
  ('number', 10),
  ('one', 10),
  ('longer', 17),
  ('sentence', 17),
  ('number', 17),
  ('two', 17) ]

Basically, expand a sentence value into multiple rows, with the words as keys.

I would like to avoid SQL.

asked Aug 30 '25 16:08 by Franch

1 Answer

Use flatMap:

# Emit one (word, key) pair per word; flatMap flattens the per-row lists
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])

Example:

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
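The lambda passed to flatMap is plain Python, so you can sanity-check the expansion without a SparkContext: map each row to its list of pairs and flatten with itertools.chain, which is exactly what flatMap does per partition. A minimal sketch:

```python
from itertools import chain

information = [(10, 'sentence number one'),
               (17, 'longer sentence number two')]

# Same function as in the flatMap call above
expand = lambda x: [(w, x[0]) for w in x[1].split()]

# chain.from_iterable flattens the list-of-lists, mimicking flatMap
result = list(chain.from_iterable(expand(row) for row in information))
# [('sentence', 10), ('number', 10), ('one', 10),
#  ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
```

If you'd rather keep the key in place and swap afterwards, `rdd.flatMapValues(lambda s: s.split())` expands to `(key, word)` pairs, and a follow-up `.map(lambda kv: (kv[1], kv[0]))` gives the same `(word, key)` result.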
answered Sep 02 '25 06:09 by Psidom