Is there a quick way to create a new RDD from an existing RDD that contains LabeledPoints, but only modify the labels for each row?
As an example, assume I have an RDD called RDD that contains LabeledPoints as follows:
RDD = sc.parallelize([
LabeledPoint(1, [1.0, 2.0, 3.0]),
LabeledPoint(2, [3.0, 4.0, 5.0]),
LabeledPoint(4, [6.0, 7.0, 8.0])])
This represents a take(5) of the RDD.
I want to simply create a new RDD from this one but I want to subtract 10 from each label.
When I try this, it fails miserably:
myRDD = RDD.map(lambda x: x[0].label - 10, x[1].features)
Please help me by also pointing out what is wrong with my reasoning in above attempt.
What is wrong with your reasoning in the above attempt?
First, let's take a look at the whole map call:
map(lambda x: x[0].label - 10, x[1].features)
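Right now it is interpreted as a call to map with the function lambda x: x[0].label - 10 and an additional argument x[1].features. Since x exists only inside the lambda, evaluating that second argument typically fails with a NameError. Let's start with returning a tuple: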
map(lambda x: (x[0].label - 10, x[1].features))
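To make the parsing difference concrete, here is a minimal plain-Python sketch that needs no Spark. The helper count_args and the value y are hypothetical names introduced only for this illustration; they are not part of PySpark.

```python
def count_args(*args):
    """Return how many positional arguments the call received."""
    return len(args)

# Hypothetical value, defined so the second argument below can be evaluated.
y = (5, 7)

# Without parentheses the comma ends the lambda: two arguments are passed.
without_parens = count_args(lambda x: x[0] - 10, y[1])

# With parentheses the lambda body is one tuple: a single argument is passed.
with_parens = count_args(lambda x: (x[0] - 10, x[1]))

print(without_parens, with_parens)
```

The first call receives a function plus a separate value, which mirrors what happens in the original attempt; the second receives only the function.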
The function passed to map receives a single point at a time, so indexing doesn't make sense. You should simply access label and features:
map(lambda x: (x.label - 10, x.features))
Finally you have to create a new point:
map(lambda x: LabeledPoint(x.label - 10, x.features))
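The final lambda can be checked without a cluster. The sketch below is only an approximation: the LabeledPoint dataclass is a stand-in for pyspark.mllib.regression.LabeledPoint (assuming only its label and features attributes matter here), and Python's builtin map over a list stands in for RDD.map.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledPoint:
    """Stand-in for PySpark's LabeledPoint (assumption: label + features only)."""
    label: float
    features: List[float]


points = [
    LabeledPoint(1.0, [1.0, 2.0, 3.0]),
    LabeledPoint(2.0, [3.0, 4.0, 5.0]),
    LabeledPoint(4.0, [6.0, 7.0, 8.0]),
]

# Same lambda as in the final step; each call receives one point, not a row
# to index into, and returns a new point with the shifted label.
shifted = list(map(lambda x: LabeledPoint(x.label - 10, x.features), points))

for p in shifted:
    print(p.label, p.features)
```

With a real RDD, the same lambda would be passed to RDD.map, and the features would be carried over unchanged while each label is reduced by 10.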