I'm trying to query a Kerberized Hive cluster with SQLAlchemy. I'm able to submit queries using pyhs2, which confirms that it's possible to connect to and query Hive when authenticated by Kerberos:
import pyhs2

with pyhs2.connect(host='hadoop01.woolford.io',
                   port=10500,
                   authMechanism='KERBEROS') as conn:
    with conn.cursor() as cur:
        cur.execute('SELECT * FROM default.mytable')
        records = cur.fetchall()
        # etc ...
I notice that Airbnb's Airflow uses SQLAlchemy and can connect to Kerberized Hive, so I imagine it's possible to do something like this:
engine = create_engine('hive://hadoop01.woolford.io:10500/default', connect_args={'?': '?'})
connection = engine.connect()
connection.execute("SELECT * FROM default.mytable")
# etc ...
I'm not sure what parameters should be set in the connect_args dictionary. Can you see what needs to be added to make this work (e.g. Kerberos service name, realm, etc.)?
Under the hood, SQLAlchemy uses PyHive to connect to Hive. The current version of PyHive, v0.2.1, doesn't support Kerberos.
I notice that someone from Yahoo created a pull request that provides support for Kerberos. This PR has not yet been merged/released, so I copied the code from the PR into /usr/lib/python2.7/site-packages/pyhive/hive.py on the Superset server and created a connection like this:
engine = create_engine('hive://hadoop01:10500', connect_args={'auth': 'KERBEROS', 'kerberos_service_name': 'hive'})
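With that patched hive.py in place, the engine behaves like any other SQLAlchemy engine. A minimal sketch of running a query end to end (the host, port, and table name are the placeholders from the question, and this assumes a valid Kerberos ticket is already in the credential cache):

from sqlalchemy import create_engine

# Assumes the Kerberos-patched pyhive/hive.py and a valid ticket (e.g. from kinit)
engine = create_engine('hive://hadoop01:10500/default',
                       connect_args={'auth': 'KERBEROS', 'kerberos_service_name': 'hive'})

connection = engine.connect()
result = connection.execute('SELECT * FROM default.mytable')
for row in result:
    print(row)
connection.close()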
Hopefully, the maintainer of PyHive will merge/release the support for Kerberos.
Install these libraries (PyHive plus its SASL dependencies; typically pyhive, thrift, sasl, and thrift_sasl), get your Kerberos ticket (e.g. with kinit), and then:
engine = create_engine('hive://HOST:10500/DB_NAME',
                       connect_args={'auth': 'KERBEROS', 'kerberos_service_name': 'hive'})
PS: the /DB_NAME part of the URL is optional.
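Once the ticket is in place, the engine works with anything that accepts a SQLAlchemy engine or connection. As a hedged sketch of one common pattern, here is a query result pulled into a pandas DataFrame (HOST, DB_NAME, and mytable are placeholders):

import pandas as pd
from sqlalchemy import create_engine

# Assumes a valid Kerberos ticket (e.g. from kinit user@REALM) is in the cache
engine = create_engine('hive://HOST:10500/DB_NAME',
                       connect_args={'auth': 'KERBEROS', 'kerberos_service_name': 'hive'})

# pandas.read_sql accepts any SQLAlchemy engine
df = pd.read_sql('SELECT * FROM mytable LIMIT 10', engine)
print(df.head())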