I am running a Spark job and I need to read from an HDFS table which is in, let's say, HadoopCluster-1. I then want to write the aggregated DataFrame into a table that is present in another cluster, HadoopCluster-2. What would be the best way to do this?
Is this a good way to achieve my goal?
If you want to read a Hive table from cluster1 as a DataFrame and, after transforming it, write it as a Hive table in cluster2, you can try the approach below.
First, start the HiveServer2 and metastore services on each cluster that Spark will reach over JDBC:
hive --service hiveserver2
hive --service metastore
Make sure Hive is properly configured with a username/password. You can leave both empty, but then you will get an error on connecting; you can resolve that by referring to this link.
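For reference, here is a minimal hive-site.xml sketch that lets HiveServer2 accept connections without authentication (NONE is one valid value of this property; NOSASL, LDAP, and KERBEROS are among the alternatives):
<!-- hive-site.xml (sketch): disable HiveServer2 authentication -->
<property>
  <name>hive.server2.authentication</name>
  <value>NONE</value>
</property>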
Now read the Hive table from cluster1 as a Spark DataFrame and write it to the Hive table on cluster2 after the transformation.
// Spark (Scala) code
val sourceJdbcMap = Map(
  "url"      -> "jdbc:hive2://<source_host>:<port>", // default HiveServer2 port is 10000
  "driver"   -> "org.apache.hive.jdbc.HiveDriver",
  "user"     -> "<username>",
  "password" -> "<password>",
  "dbtable"  -> "<source_table>")

val targetJdbcMap = Map(
  "url"      -> "jdbc:hive2://<target_host>:<port>", // default HiveServer2 port is 10000
  "driver"   -> "org.apache.hive.jdbc.HiveDriver",
  "user"     -> "<username>",
  "password" -> "<password>",
  "dbtable"  -> "<target_table>")

// Read the cluster1 table into a DataFrame over JDBC
val sourceDF = spark.read.format("jdbc").options(sourceJdbcMap).load()

// Placeholder: replace with your actual transformation
val transformedDF = sourceDF // transformation goes here...

// Write the result to the cluster2 table; "append" avoids the default
// error-if-exists behavior, since the target table already exists
transformedDF.write.format("jdbc").options(targetJdbcMap).mode("append").save()
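As a minimal sketch of the transformation step, assuming a hypothetical source table with category and amount columns (the question only says "aggregate", so these names are placeholders):
// Hypothetical aggregation; "category" and "amount" are made-up column names
import org.apache.spark.sql.functions._

val transformedDF = sourceDF
  .groupBy("category")
  .agg(sum("amount").alias("total_amount"))
Also make sure the Hive JDBC driver is on Spark's classpath, for example by passing the hive-jdbc standalone jar to spark-submit via --jars.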