I am new to Spark and trying to figure out a better way to achieve the following scenario. There is a database table containing three fields: Category, Amount, Quantity. First I pull all the distinct categories from the database.
import org.apache.spark.rdd.RDD

val categories: RDD[String] = df.select(CATEGORY).distinct().rdd.map(r => r(0).toString)
Now, for each category, I want to execute a pipeline, which essentially creates a DataFrame for that category and applies some machine learning to it.
categories.foreach(executePipeline)
def executePipeline(category: String): Unit = {
  val dfCategory = sqlCtxt.read.jdbc(JDBC_URL,
    s"(SELECT * FROM TABLE_NAME WHERE CATEGORY = '$category') AS t",
    new java.util.Properties())
dfCategory.show()
}
Is it possible to do something like this? Or is there a better alternative?
// You can't call sqlCtxt from inside an RDD operation such as foreach,
// because the SQLContext exists only on the driver. Instead, you could
// get all your data with a single query and convert it to an RDD
val data = sqlCtxt.read.jdbc(JDBC_URL, "TABLE_NAME", new java.util.Properties()).rdd
// then group the data by category
val groupedData = data.groupBy(row => row.getAs[String]("category"))
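// (note: groupBy shuffles all rows with the same category to a single
// executor, so each category's rows must fit in that executor's memory)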
// then you get an RDD[(String, Iterable[org.apache.spark.sql.Row])]
// and you can iterate over it and execute your pipeline
// (use foreach here, not map: map is lazy and would never run the pipeline)
groupedData.foreach { case (categoryName, items) =>
  // executePipeline(categoryName, items)
}
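If your pipeline needs an actual DataFrame per category (for example for Spark ML), note that items above is only an Iterable[Row]. A minimal sketch of an alternative, assuming the category list is small, the column is named "category", and executePipeline can take a DataFrame (all assumptions of mine): collect the distinct category names to the driver and filter a cached DataFrame once per category.
// Load the whole table once and cache it, so each per-category filter
// does not hit the database again
val fullDf = sqlCtxt.read.jdbc(JDBC_URL, "TABLE_NAME", new java.util.Properties()).cache()

// Collect the (presumably small) set of category names to the driver;
// the loop below then runs on the driver, where sqlCtxt and fullDf are valid
val categoryNames = fullDf.select("category").distinct().collect().map(_.getString(0))

categoryNames.foreach { category =>
  val dfCategory = fullDf.filter(fullDf("category") === category)
  dfCategory.show()
  // executePipeline(category, dfCategory)  // hypothetical signature taking a DataFrame
}
This keeps all DataFrame creation on the driver while still reading the table only once from the database.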