Run a .py file from a Google Cloud Dataproc Python notebook

I have my files placed in Dataproc storage as follows:

  • inotebook.ipynb
  • dependencies
    • test1.py
    • test2.py
    • __init__.py

I am currently working in the inotebook.ipynb file and need to use the functions in test1.py and test2.py. Locally, I can run !python ....py and use the available functions (or create a package and install it). Are any of these options available in a Google Cloud Dataproc notebook?
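
For reference, this is roughly the local pattern I mean (a minimal sketch; the function name is only a placeholder):

# works locally because dependencies/ contains an __init__.py and sits next to the notebook
from dependencies import test1, test2

test1.my_function()  # placeholder name; any function defined in test1.py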

I tried the suggestions from the links below and none of them worked:

  • Dataproc import python module stored in google cloud storage (gcs) bucket
  • Is there a way to import and run functions from saved .py files in a Jupyter notebook running on a Google Cloud Platform dataproc cluster?

Is there any way to install a custom package, or to somehow run .py files from the same sub-directory as my notebook file, on Dataproc?

asked May 08 '26 by Azi M

1 Answer

Unfortunately, using custom packages stored in GCS is still a limitation of Dataproc. However, I was able to make the workaround mentioned in those links work with a few changes: I defined a prefix so that the listing points at the correct files under the directory, then looped through the returned blobs to download the files to the local Dataproc cluster before executing the lines of code that use them. See the code below:

GCS bucket structure:

my-bucket
└── notebooks
    └── jupyter
        ├── gcs_test.ipynb
        └── dependencies
            ├── hi_gcs.py
            └── hello_gcs.py

hi_gcs.py:

def say_hi(name):
    return "Hi {}!".format(name)

hello_gcs.py:

def say_hello(name):
    return "Hello {}!".format(name)

gcs_test.ipynb:

from google.cloud import storage

def get_module():

    client = storage.Client()
    bucket = client.get_bucket('my-bucket')
    # the prefix points at the directory that holds the python files
    blobs = list(client.list_blobs(bucket, prefix='notebooks/jupyter/dependencies/'))
    for blob in blobs[1:]:  # skip the 1st element since it is the top directory itself
        name = blob.name.split('/')[-1]  # get the filename only
        blob.download_to_filename(name)  # download the python files to the local dataproc cluster
    
def use_my_module(val):
    get_module()
    import hi_gcs 
    import hello_gcs 

    print(hello_gcs.say_hello(val))
    print(hi_gcs.say_hi(val))

use_my_module('User 1')

Output:

Hello User 1!
Hi User 1!
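
As a follow-up sketch (the download directory below is a hypothetical example): if the files are downloaded somewhere other than the notebook's working directory, that location can be added to sys.path before importing, which mirrors the package-style usage asked about in the question:

import sys

get_module()  # download hi_gcs.py / hello_gcs.py from GCS as shown above

# hypothetical example: if the files had been saved under ./deps instead of
# the working directory, put that folder on the import path first
sys.path.insert(0, './deps')

import hi_gcs
print(hi_gcs.say_hi('User 2'))  # -> Hi User 2!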
answered May 09 '26 by Ricco D


