I am working with AWS Glue and PySpark ETL scripts, and want to use auxiliary libraries such as google_cloud_bigquery as a part of my PySpark scripts.
The documentation states this should be possible. This previous Stack Overflow discussion, especially one comment in one of the answers seems to provide additional proof. However, how to do it is unclear to me.
So the goal is to turn the pip installed packages into one or more zip files, in order to be able to just host the packages on S3 and point to them like so: 
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
How that should be done is not clearly stated anywhere I've looked.
i.e. how do I pip install a package and then turn it into a zip file that I can upload to S3 so PySpark can use it with such an S3 URL?
By using the command pip download I have been able to fetch the libs, but they are not .zip files by default but instead either .whl files or .tar.gz
..so not sure what to do to turn them into zip files that AWS Glue can digest. Maybe with .tar.gz I could just tar -xf them and then zip them back up, but how about whl files?
So, after going through the materials I sourced in the comments over the past 48 hours, here's how I solved the issue.
Note: I use Python2.7 because that's what AWS Glue seems to ship with.
By following the instructions in E. Kampf's blog post "Best Practices Writing Production-Grade PySpark Jobs" and this stack overflow answer, and some tweaking due to random errors along the way, I did the following:
mkdir ziplib && cd ziplib
Create a requirements.txt file with names of packages on each row.
Create a folder in it called deps:
mkdir deps
virtualenv -p python2.7 .
bin/pip2.7 install -r requirements.txt --install-option --install-lib="/absolute/path/to/.../ziplib/deps"
cd deps && zip -r ../deps.zip . && cd ..
..and so now I have a zip file which if I put onto AWS S3 and point it to from PySpark on AWS Glue, it seems to work.
HOWEVER... what I haven't been able to solve is that since some packages, such as the Google Cloud Python client libs, use what is known as Implicit Namespace Packages (PEP-420), they don't have the __init__.py files usually present in modules, and thus the import statements don't work. I'm at a loss here.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With