I want to write RDD[String] to Amazon S3 in Spark Streaming using Scala. These are basically JSON strings. Not sure how to do it more efficiently.
I found this post, in which the library spark-s3 is used. The idea is to create SparkContext and then SQLContext. After this the author of the post does something like this:
myDstream.foreachRDD { rdd =>
rdd.toDF().write
.format("com.knoldus.spark.s3")
.option("accessKey","s3_access_key")
.option("secretKey","s3_secret_key")
.option("bucket","bucket_name")
.option("fileType","json")
.save("sample.json")
}
What are another options besides spark-s3? Is it possible to append the file on S3 with the streaming data?
Files on S3 cannot be appended. An "append" means in S3 to replace the existing object with a new object that contains the additional data.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With