I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.
Currently this invalidation is done after my Spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run these -- they trigger a refresh of the Impala-specific metadata cache. (In your case, you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE that rebuilds the list of all partitions and all their files from scratch.)
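For reference, here is what those statements look like as the strings you would eventually submit over JDBC -- a minimal sketch with made-up table and partition names; note that the PARTITION form requires Impala 2.7 or later:

```scala
// Hypothetical table/partition names, for illustration only.

// Per-table refresh: reloads the file metadata for one existing table (cheap).
val refreshTable = "REFRESH somedb.sometable"

// Per-partition refresh (Impala 2.7+): cheaper still, when you know exactly
// which partition your Spark job just wrote into.
val refreshPartition = "REFRESH somedb.sometable PARTITION (day_col='2016-06-13')"

// Full invalidation: discards and lazily rebuilds ALL cached metadata for the
// table (partitions, files, blocks). Overkill for an NRT pipeline unless the
// table was just created or its structure changed.
val invalidate = "INVALIDATE METADATA somedb.sometable"
```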
You could use the Spark SQLContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:
- download the latest Cloudera JDBC driver for Impala
- install it on the server where you run your Spark job
- list all the JARs in your *.*.extraClassPath properties
- develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way

Hopefully Google will find some examples of JDBC/Scala code, such as the sketch below.
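To make that last step concrete, here is a minimal sketch of such JDBC/Scala code -- the host name, port, driver class, and table name are all assumptions you must adapt (the driver class in particular depends on which Cloudera JDBC driver version you installed):

```scala
import java.sql.DriverManager

object ImpalaRefresh {
  def main(args: Array[String]): Unit = {
    // Hypothetical Impala daemon host; 21050 is the default HiveServer2-protocol
    // port that the JDBC driver talks to.
    val url = "jdbc:impala://impala-daemon.example.com:21050/somedb"

    // Register the driver -- the class name shown is for the JDBC 4.1 flavour of
    // the Cloudera driver; adjust it to match the JARs on your extraClassPath.
    Class.forName("com.cloudera.impala.jdbc41.Driver")

    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        // The cheap per-table refresh discussed above, not a full INVALIDATE.
        stmt.execute("REFRESH somedb.sometable")
      } finally {
        stmt.close()
      }
    } finally {
      conn.close()
    }
  }
}
```

Note that nothing here touches the SparkContext: it is plain JDBC run from the driver, so you would simply call it at the end of your Spark job, once the writes have finished.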