 

PySpark: creating new RDD from existing LabeledPointsRDD but modifying the label

Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only modify the labels for each row?

As an example, assume I have an existing RDD called RDD that contains LabeledPoints, as follows:

from pyspark.mllib.regression import LabeledPoint

RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0])])

This represents a take(5) of the RDD.

I want to simply create a new RDD from this one but I want to subtract 10 from each label.

When I try this, it fails miserably:

myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)

Please help me by also pointing out what is wrong with my reasoning in the above attempt.

asked Dec 14 '25 08:12 by Monty


1 Answer

What is wrong with your reasoning in the above attempt?

First, let's take a look at the whole map call:

 map(lambda x: x[0].label - 10, x[1].features)

Right now it is interpreted as map called with the function lambda x: x[0].label - 10 plus an additional argument, x[1].features. Since x is not defined outside the lambda, evaluating x[1].features raises a NameError before map ever runs, which is why the attempt fails. Let's start by returning a tuple:

map(lambda x: (x[0].label - 10, x[1].features))

The function passed to map receives a single point at a time, so indexing doesn't make sense; you should simply access label and features:

 map(lambda x: (x.label - 10, x.features))
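
For illustration only (assuming the three-point example RDD from the question; the name pairs is just illustrative), collecting this intermediate version yields plain (label, features) tuples rather than LabeledPoints:

pairs = RDD.map(lambda x: (x.label - 10, x.features))
pairs.collect()
# roughly: [(-9.0, DenseVector([1.0, 2.0, 3.0])),
#           (-8.0, DenseVector([3.0, 4.0, 5.0])),
#           (-6.0, DenseVector([6.0, 7.0, 8.0]))]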

Finally, you have to create a new LabeledPoint:

map(lambda x: LabeledPoint(x.label - 10, x.features))
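
Putting it all together, here is a minimal end-to-end sketch (assuming an existing SparkContext named sc and the variable names used in the question):

from pyspark.mllib.regression import LabeledPoint

# original RDD of LabeledPoints (the question's example data)
RDD = sc.parallelize([
    LabeledPoint(1, [1.0, 2.0, 3.0]),
    LabeledPoint(2, [3.0, 4.0, 5.0]),
    LabeledPoint(4, [6.0, 7.0, 8.0])])

# new RDD with 10 subtracted from every label; features are left unchanged
myRDD = RDD.map(lambda x: LabeledPoint(x.label - 10, x.features))

myRDD.collect()
# roughly: [LabeledPoint(-9.0, [1.0,2.0,3.0]),
#           LabeledPoint(-8.0, [3.0,4.0,5.0]),
#           LabeledPoint(-6.0, [6.0,7.0,8.0])]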
answered Dec 16 '25 22:12 by zero323


