 

PySpark: creating new RDD from existing LabeledPointsRDD but modifying the label

Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only modify the labels for each row?

As an example, assume I have an existing RDD called RDD that contains LabeledPoints, as follows:

from pyspark.mllib.regression import LabeledPoint

RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0])])

This represents a take(5) of the RDD.

I want to simply create a new RDD from this one but I want to subtract 10 from each label.

When I try this, it fails miserably:

myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)

Please help me by also pointing out what is wrong with my reasoning in the above attempt.

asked Dec 14 '25 08:12 by Monty


1 Answer

What is wrong with your reasoning in the above attempt?

First, let's take a look at the whole map call:

 map(lambda x: x[0].label - 10, x[1].features)

Right now it is interpreted as map called with the function lambda x: x[0].label - 10 plus an additional argument, x[1].features. Since x is not defined outside the lambda, evaluating x[1].features raises a NameError before map ever runs, which is why the attempt fails. Let's start by returning a tuple:

map(lambda x: (x[0].label - 10, x[1].features))

The function passed to map receives a single point at a time, so indexing doesn't make sense; you should simply access label and features:

 map(lambda x: (x.label - 10, x.features))
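
For illustration only (assuming the three-point example RDD from the question; the name pairs is just illustrative), collecting this intermediate version yields plain (label, features) tuples rather than LabeledPoints:

pairs = RDD.map(lambda x: (x.label - 10, x.features))
pairs.collect()
# roughly: [(-9.0, DenseVector([1.0, 2.0, 3.0])),
#           (-8.0, DenseVector([3.0, 4.0, 5.0])),
#           (-6.0, DenseVector([6.0, 7.0, 8.0]))]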

Finally, you have to create a new LabeledPoint:

map(lambda x: LabeledPoint(x.label - 10, x.features))
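
Putting it all together, here is a minimal end-to-end sketch (assuming an existing SparkContext named sc and the variable names used in the question):

from pyspark.mllib.regression import LabeledPoint

# original RDD of LabeledPoints (the question's example data)
RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0])])

# new RDD with 10 subtracted from every label; features are left unchanged
myRDD = RDD.map(lambda x: LabeledPoint(x.label - 10, x.features))

myRDD.collect()
# roughly: [LabeledPoint(-9.0, [1.0,2.0,3.0]),
#           LabeledPoint(-8.0, [3.0,4.0,5.0]),
#           LabeledPoint(-6.0, [6.0,7.0,8.0])]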
answered Dec 16 '25 22:12 by zero323


