I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a DataSet? Note the newer spark.ml apis require inputs in the Dataset format.
Convert Using createDataFrame Method The SparkSession object has a utility method for creating a DataFrame – createDataFrame. This method can take an RDD and create a DataFrame from it. The createDataFrame is an overloaded method, and we can call the method by passing the RDD alone or with a schema.
Creating DataFrame with schema To use createDataFrame() to create a DataFrame with schema we need to create a Schema first and then convert RDD to RDD of type Row. Pass Row[RDD] and schema to createDataFrame to create DataFrame.
Here is an answer that traverses an extra step - the DataFrame. We use the SQLContext to create a DataFrame and then create a DataSet using the desired object type - in this case a LabeledPoint:
val sqlContext = new SQLContext(sc)
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]
Update Ever heard of a SparkSession ? (neither had I until now..)
So apparently the SparkSession is the Preferred Way (TM) in Spark 2.0.0 and moving forward. Here is the updated code for the new (spark) world order:
Spark 2.0.0+ approaches
Notice in both of the below approaches (simpler one of which credit @zero323) we have accomplished an important savings as compared to the SQLContext approach: no longer is it necessary to first create a DataFrame.
val sparkSession = SparkSession.builder().getOrCreate()
val pointsTrainDf = sparkSession.createDataset(training)
val model = new LogisticRegression()
.train(pointsTrainDs.as[LabeledPoint])
Second way for Spark 2.0.0+ Credit to @zero323
val spark: org.apache.spark.sql.SparkSession = ???
import spark.implicits._
val trainDs = training.toDS()
Traditional Spark 1.X and earlier approach
val sqlContext = new SQLContext(sc) // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training**.toDS()**
See also: How to store custom objects in Dataset? by the esteemed @zero323 .
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With