My application is built on MongoDB. One collection in the database holds a massive volume of data, so I have opted for Apache Spark to retrieve it and generate analytical data through calculation. I have configured the Spark Connector for MongoDB to communicate with MongoDB. I need to query the MongoDB collection using PySpark and build a DataFrame from the result set of the MongoDB query. Please suggest an appropriate solution.
You can load the data directly into a dataframe like so:
# Create the dataframe
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").option("uri", "mongodb://127.0.0.1/mydb.mycoll").load()
# Filter the data via the DataFrame API
over_thirty = df.filter(df.age > 30)
# Filter via SQL
df.registerTempTable("people")
over_thirty = sqlContext.sql("SELECT name, age FROM people WHERE age > 30")
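Either way, the DataFrame is evaluated lazily; nothing is read from MongoDB until an action is called. For example:
# Trigger execution and inspect the filtered result
over_thirty.show()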
For more information see the Mongo Spark connector Python API section or the introduction.py example. Filters and SQL predicates are translated and passed down to the connector, so the data is filtered in MongoDB itself before being sent to the Spark cluster.
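If you are starting from a bare PySpark shell, here is a minimal sketch of how the session might be configured so the connector is available; the artifact version below is an assumption, so match it to your Spark and Scala versions:
# Launch PySpark with the connector on the classpath (the version here is
# an assumption - pick the artifact matching your Spark/Scala versions):
#   pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.2

from pyspark.sql import SparkSession

# spark.mongodb.input.uri is the connector 2.x default input setting; the
# host, database, and collection are the same placeholders used above.
spark = SparkSession.builder \
    .appName("mongo-analytics") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.mycoll") \
    .getOrCreate()

# With a default uri configured, no explicit option("uri", ...) is needed.
df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()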
You can also provide your own aggregation pipeline to apply to the collection before the results are returned to Spark:
dfr = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource") \
    .option("uri", "mongodb://127.0.0.1/mydb.mycoll") \
    .option("pipeline", "[{ $match: { name: { $exists: true } } }]")
df = dfr.load()
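Since the goal is analytical calculation, here is a rough sketch of a computation over the loaded DataFrame; the city field used for grouping is a hypothetical addition to the schema in the examples above:
from pyspark.sql import functions as F

# Hypothetical analytic: average age and headcount per city, computed in
# Spark after MongoDB has applied the $match stage from the pipeline.
# The "city" field is an assumption about the document schema.
stats = df.groupBy("city").agg(
    F.avg("age").alias("avg_age"),
    F.count("*").alias("num_people"))
stats.show()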