I'm working on an NRT (near-real-time) solution that requires me to frequently update the metadata on an Impala table.
Currently this invalidation is done after my Spark code has run. I would like to speed things up by doing this refresh/invalidate directly from my Spark code.
What would be the most efficient approach?
REFRESH and INVALIDATE METADATA commands are specific to Impala.
You must be connected to an Impala daemon to be able to run these -- they trigger a refresh of the Impala-specific metadata cache. (In your case, you probably just need a REFRESH of the list of files in each partition, not a wholesale INVALIDATE that rebuilds the list of all partitions and all their files from scratch.)
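For reference, here is what those statements look like as the strings you would eventually submit over JDBC -- a minimal sketch with made-up table and partition names; note that the PARTITION form requires Impala 2.7 or later:

```scala
// Hypothetical table/partition names, for illustration only.

// Per-table refresh: reloads the file metadata for one existing table (cheap).
val refreshTable = "REFRESH somedb.sometable"

// Per-partition refresh (Impala 2.7+): cheaper still, when you know exactly
// which partition your Spark job just wrote into.
val refreshPartition = "REFRESH somedb.sometable PARTITION (day_col='2016-06-13')"

// Full invalidation: discards and lazily rebuilds ALL cached metadata for the
// table (partitions, files, blocks). Overkill for an NRT pipeline unless the
// table was just created or its structure changed.
val invalidate = "INVALIDATE METADATA somedb.sometable"
```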
You could use the Spark SQLContext to connect to Impala via JDBC and read data -- but not run arbitrary commands. Damn. So you are back to the basics:
- download the latest Cloudera JDBC driver for Impala
- install it on the server where you run your Spark job
- list all the JARs in your *.*.extraClassPath properties
- develop some Scala code to open a JDBC session against an Impala daemon and run arbitrary commands (such as REFRESH somedb.sometable) -- the hard way

Hopefully Google will find some examples of JDBC/Scala code, such as the sketch below.
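To make that last step concrete, here is a minimal sketch of such JDBC/Scala code -- the host name, port, driver class, and table name are all assumptions you must adapt (the driver class in particular depends on which Cloudera JDBC driver version you installed):

```scala
import java.sql.DriverManager

object ImpalaRefresh {
  def main(args: Array[String]): Unit = {
    // Hypothetical Impala daemon host; 21050 is the default HiveServer2-protocol
    // port that the JDBC driver talks to.
    val url = "jdbc:impala://impala-daemon.example.com:21050/somedb"

    // Register the driver -- the class name shown is for the JDBC 4.1 flavour of
    // the Cloudera driver; adjust it to match the JARs on your extraClassPath.
    Class.forName("com.cloudera.impala.jdbc41.Driver")

    val conn = DriverManager.getConnection(url)
    try {
      val stmt = conn.createStatement()
      try {
        // The cheap per-table refresh discussed above, not a full INVALIDATE.
        stmt.execute("REFRESH somedb.sometable")
      } finally {
        stmt.close()
      }
    } finally {
      conn.close()
    }
  }
}
```

Note that nothing here touches the SparkContext: it is plain JDBC run from the driver, so you would simply call it at the end of your Spark job, once the writes have finished.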