My application is built on MongoDB. One collection in the database holds a massive volume of data, so I have opted for Apache Spark to retrieve it and generate analytical data through calculation. I have configured the Spark Connector for MongoDB to communicate with MongoDB. I need to query the MongoDB collection using PySpark and build a DataFrame from the result set of the MongoDB query. Please suggest an appropriate solution.
You can load the data directly into a dataframe like so:
# Create the dataframe
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/mydb.mycoll").load()
# Filter the data via the DataFrame API
over_thirty = df.filter(df.age > 30)
# Filter via SQL
df.registerTempTable("people")
over_thirty = sqlContext.sql("SELECT name, age FROM people WHERE age > 30")
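Either way, the DataFrame is evaluated lazily; nothing is read from MongoDB until an action is called. For example:
# Trigger execution and inspect the filtered result
over_thirty.show()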
For more information see the Mongo Spark connector Python API section or the introduction.py example. Filters and SQL predicates are translated and passed down to the connector, so the data is filtered in MongoDB itself before being sent to the Spark cluster.
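If you are starting from a bare PySpark shell, here is a minimal sketch of how the session might be configured so the connector is available; the artifact version below is an assumption, so match it to your Spark and Scala versions:
# Launch PySpark with the connector on the classpath (the version here is
# an assumption - pick the artifact matching your Spark/Scala versions):
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.2

from pyspark.sql import SparkSession

# spark.mongodb.input.uri is the connector 2.x default input setting; the
# host, database, and collection are the same placeholders used above.
spark = SparkSession.builder \
    .appName("mongo-analytics") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.mycoll") \
    .getOrCreate()

# With a default uri configured, no explicit option("uri", ...) is needed.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()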
You can also provide your own aggregation pipeline to apply to the collection before the results are returned to Spark:
dfr = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/mydb.mycoll") \
    .option("pipeline", "[{ $match: { name: { $exists: true } } }]")
df = dfr.load()
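Since the goal is analytical calculation, here is a rough sketch of a computation over the loaded DataFrame; the city field used for grouping is a hypothetical addition to the schema in the examples above:
from pyspark.sql import functions as F

# Hypothetical analytic: average age and headcount per city, computed in
# Spark after MongoDB has applied the $match stage from the pipeline.
# The "city" field is an assumption about the document schema.
stats = df.groupBy("city").agg(
    F.avg("age").alias("avg_age"),
    F.count("*").alias("num_people"))
stats.show()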