I'm creating a Spark Structured streaming application which is going to be calculating data received from Kafka every 10 seconds.
To be able to do some of the calculations, I need to look up some information about sensors and placement in a Cassandra Database
I'm a little stuck at wrapping my head around how to keep the Cassandra data available throughout the cluster, and somehow update data from time to time, in-case we have done some changes to database table.
Currently, I'm querying the database as soon as I start the Spark locally using the Datastax Spark-Cassandra-connector
val cassandraSensorDf = spark
  .read
  .cassandraFormat("specifications", "sensors")
  .load
From here on I can use this cassandraSensorDs by joining it with my Structured Streaming Dataset.
.join(
   cassandraSensorDs ,
   sensorStateDf("plantKey") <=> cassandraSensorDf ("cassandraPlantKey")
)
How do I do additional queries to update this Cassandra data while having Structured Streaming Running? And how can I make the queried data available in a cluster setting?
Using broadcast variables, you may write a wrapper to fetch data from Cassandra periodically and update a broadcast variable. Do a map-side join on the stream with the broadcast variable. I have not tested this approach and I think this might as well be an overkill depending on your use case(throughput).
How can I update a broadcast variable in spark streaming?
Another approach is to query Cassandra for every item in your stream, to optimise on the connections you should make sure that you use connection pooling and create only one connection for a JVM/partition. This approach is simpler you don't have to worry about warming the Cassandra data periodically.
spark-streaming and connection pool implementation
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With