
Transfer data from an S3 bucket to a GCP bucket using temporary credentials

I would like to download a public dataset from the NIMH Data Archive. After creating an account on their website and accepting their Data Usage Agreement, I can download a CSV file that contains the paths to all the files in the dataset I am interested in. Each path is of the form s3://NDAR_Central_1/....

1 Download on my personal computer

In the NDA GitHub repository, the nda-tools Python library provides code to download those files to my own computer. Say I want to download the following file:

s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz

Given my username (USRNAME) and password (PASSWD) (the ones I used to create my account on the NIMH Data Archive), the following code allows me to download this file to TARGET_PATH on my personal computer:

from NDATools.clientscripts.downloadcmd import configure
from NDATools.Download import Download

config = configure(username=USRNAME, password=PASSWD)
s3Download = Download(TARGET_PATH, config)

target_fnames = ['s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz']

s3Download.get_links('paths', target_fnames, filters=None)
s3Download.get_tokens()
s3Download.start_workers(False, None, 1)

Under the hood, the get_tokens method of s3Download uses USRNAME and PASSWD to generate a temporary access key, secret key, and security token. Then, the start_workers method uses the boto3 and s3transfer Python libraries to download the selected file.
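For illustration, here is a minimal sketch of what a download with temporary credentials might look like when written directly against boto3. It is not the nda-tools implementation; the function names split_s3_url and download_with_temporary_credentials are hypothetical, and the three credential arguments are assumed to be the values produced by the token generator.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/key' into (bucket, key)."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not an S3 object URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def download_with_temporary_credentials(url, target_path,
                                        access_key, secret_key, session_token):
    """Download one S3 object using a temporary access key, secret key and token."""
    # Imported lazily so that split_s3_url can be used without boto3 installed.
    import boto3

    bucket, key = split_s3_url(url)
    client = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,  # required for temporary credentials
    )
    client.download_file(bucket, key, target_path)
```

The session token is passed alongside the key pair; temporary credentials are not valid without it.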

Everything works fine!

2 Download to a GCP bucket

Now, say I created a project on GCP and would like to directly download this file to a GCP bucket.

Ideally, I would like to do something like:

gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

To do this, I execute the following Python code in the Cloud Shell (by running python3):

from NDATools.TokenGenerator import NDATokenGenerator
data_api_url = 'https://nda.nih.gov/DataManager/dataManager'
generator = NDATokenGenerator(data_api_url)
token = generator.generate_token(USRNAME, PASSWD)

This gives me the access key, the secret key, and the session token. In the following:

  • ACCESS_KEY refers to the value of token.access_key,
  • SECRET_KEY refers to the value of token.secret_key,
  • SECURITY_TOKEN refers to the value of token.session.

Then, I set these credentials as environment variables in the Cloud Shell:

export AWS_ACCESS_KEY_ID="[copy-paste ACCESS_KEY here]"
export AWS_SECRET_ACCESS_KEY="[copy-paste SECRET_KEY here]"
export AWS_SECURITY_TOKEN="[copy-paste SECURITY_TOKEN here]"
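To avoid copy-pasting, the export lines could be generated from the token object itself. The sketch below assumes only that the token exposes the access_key, secret_key and session attributes mentioned above; export_lines is a hypothetical helper, and the stand-in token carries made-up values.

```python
import shlex
from types import SimpleNamespace


def export_lines(token):
    """Return shell 'export' statements for the three temporary credentials."""
    variables = {
        "AWS_ACCESS_KEY_ID": token.access_key,
        "AWS_SECRET_ACCESS_KEY": token.secret_key,
        "AWS_SECURITY_TOKEN": token.session,
    }
    # shlex.quote protects values that contain characters special to the shell.
    return [f"export {name}={shlex.quote(value)}" for name, value in variables.items()]


# Stand-in token with made-up values; a real one would come from generate_token().
fake_token = SimpleNamespace(access_key="AKIAEXAMPLE", secret_key="s3cr3t", session="tok")
print("\n".join(export_lines(fake_token)))
```

Note that there is no whitespace around the = sign, which the shell requires in an export statement.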

Finally, I also set up the .boto configuration file in my home directory. It looks like this:

[Credentials]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
aws_session_token = $AWS_SECURITY_TOKEN
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
use-sigv4=True
host=s3.us-east-1.amazonaws.com
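One thing worth checking is whether the $AWS_... placeholders in this file are ever expanded. The .boto file is an INI-style file, so the values are presumably read literally; that is an assumption, not something verified against gsutil. A minimal sketch that writes the actual credential values instead is shown below. It reuses the option names from the file above without confirming that gsutil honors each of them, and render_boto_config is a hypothetical helper.

```python
import configparser
import io


def render_boto_config(access_key, secret_key, session_token):
    """Render the .boto contents with literal credential values."""
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep option names exactly as written
    config["Credentials"] = {
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "aws_session_token": session_token,  # option name copied from the file above
    }
    config["s3"] = {
        "calling_format": "boto.s3.connection.OrdinaryCallingFormat",
        "use-sigv4": "True",
        "host": "s3.us-east-1.amazonaws.com",
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()
```

The returned text could then be written to the .boto file in the home directory.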

When I run the following command:

gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

I end up with:

AccessDeniedException: 403 AccessDenied

The full traceback is below:

Non-MD5 etag ("a21a0b2eba27a0a32a26a6b30f3cb060-6") present for key <Key: NDAR_Central_1,submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz>, data integrity checks are not possible.
Copying s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz [Content-Type=application/x-gzip]...
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/daisy_chain_wrapper.py", line 213, in PerformDownload
    decryption_tuple=self.decryption_tuple)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 353, in GetObjectMedia
    decryption_tuple=decryption_tuple)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 590, in GetObjectMedia
    generation=generation)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
    raise translated_exception  # pylint: disable=raising-bad-type
AccessDeniedException: AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>

AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>

I would like to be able to download this file directly from an S3 bucket to my GCP bucket (without having to create a VM, set up Python, and run the code above, which works). Why do the temporary credentials work on my computer but not in GCP Cloud Shell?

The complete log of the debug command

gsutil -DD cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

can be found here.

pitchounet asked Dec 05 '25 21:12


1 Answer

The procedure you are trying to implement is called a "Transfer Job".

To transfer a file from an Amazon S3 bucket to a Cloud Storage bucket:

A. Click the hamburger menu in the top-left corner

B. Go to Storage > Transfer

C. Click Create Transfer

  1. Under Select source, select Amazon S3 bucket.

  2. In the Amazon S3 bucket text box, specify the source Amazon S3 bucket name. The bucket name is the name as it appears in the AWS Management Console.

  3. In the respective text boxes, enter the Access key ID and Secret key associated with the Amazon S3 bucket.

  4. To specify a subset of files in your source, click Specify file filters beneath the bucket field. You can include or exclude files based on file name prefix and file age.

  5. Under Select destination, choose a sink bucket or create a new one.

    • To choose an existing bucket, enter the name of the bucket (without the prefix gs://), or click Browse and browse to it.
    • To transfer files to a new bucket, click Browse and then click the New bucket icon.
  6. Enable overwrite/delete options if needed.

    By default, your transfer job only overwrites an object when the source version is different from the sink version. No other objects are overwritten or deleted. Enable additional overwrite/delete options under Transfer options.

  7. Under Configure transfer, schedule your transfer job to Run now (one time) or Run daily at the local time you specify.

  8. Click Create.

Before setting up the Transfer Job, please make sure your account has the necessary roles and the required permissions described here.

Also keep in mind that the Storage Transfer Service is currently available only for certain Amazon S3 regions, which are listed under the AMAZON S3 tab of the "Setting up a transfer job" page.

Transfer jobs can also be created programmatically. More information here.
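As a rough sketch of the programmatic route, the steps above map onto a transfer job description like the one below, submitted with the google-cloud-storage-transfer Python client. The helper names, the project ID and the prefix are hypothetical, and the key pair is assumed to be a long-lived one.

```python
from datetime import date


def build_transfer_job(project_id, source_bucket, sink_bucket,
                       access_key_id, secret_access_key,
                       include_prefixes=(), run_date=None):
    """Describe a one-time S3 -> Cloud Storage transfer job as a plain dict."""
    run_date = run_date or date.today()
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    spec = {
        "aws_s3_data_source": {
            "bucket_name": source_bucket,
            "aws_access_key": {
                "access_key_id": access_key_id,
                "secret_access_key": secret_access_key,
            },
        },
        "gcs_data_sink": {"bucket_name": sink_bucket},
    }
    if include_prefixes:
        # Equivalent of "Specify file filters" in the console.
        spec["object_conditions"] = {"include_prefixes": list(include_prefixes)}
    return {
        "project_id": project_id,
        # Same start and end date: the job runs once.
        "schedule": {"schedule_start_date": day, "schedule_end_date": day},
        "transfer_spec": spec,
    }


def submit_transfer_job(job):
    """Create the job; needs google-cloud-storage-transfer and GCP credentials."""
    from google.cloud import storage_transfer

    client = storage_transfer.StorageTransferServiceClient()
    enabled = dict(job, status=storage_transfer.TransferJob.Status.ENABLED)
    request = storage_transfer.CreateTransferJobRequest({"transfer_job": enabled})
    return client.create_transfer_job(request)
```

For example, build_transfer_job("my-project", "NDAR_Central_1", "my-bucket", ACCESS_KEY, SECRET_KEY, include_prefixes=["submission_13364/"]) would describe a job limited to one submission folder.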

Let me know if this was helpful.

EDIT

Neither the Storage Transfer Service nor the gsutil command currently supports "Temporary Security Credentials", even though AWS supports them. One workaround is to modify the source code of the gsutil command.
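Another possible workaround, sketched below under assumptions, is to skip gsutil and do the copy in a short Python script: download the object with boto3 (which accepts a session token) into a temporary directory, then upload it with the google-cloud-storage client. The function names are hypothetical, both libraries are assumed to be installed, and the Google client is assumed to pick up default credentials for the environment it runs in.

```python
import os
import posixpath
import tempfile


def destination_blob_name(s3_key, prefix=""):
    """Name the Cloud Storage object after the last component of the S3 key."""
    return posixpath.join(prefix, posixpath.basename(s3_key))


def copy_s3_object_to_gcs(s3_bucket, s3_key, gcs_bucket,
                          access_key, secret_key, session_token, prefix=""):
    """Copy one object from S3 to Cloud Storage via a local temporary file."""
    # Lazy imports: only needed when a copy is actually performed.
    import boto3
    from google.cloud import storage

    s3 = boto3.client(
        "s3",
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )
    with tempfile.TemporaryDirectory() as workdir:
        local_path = os.path.join(workdir, posixpath.basename(s3_key))
        s3.download_file(s3_bucket, s3_key, local_path)
        blob = storage.Client().bucket(gcs_bucket).blob(
            destination_blob_name(s3_key, prefix))
        blob.upload_from_filename(local_path)
    return f"gs://{gcs_bucket}/{blob.name}"
```

The file passes through local disk, so this suits files that fit in the available storage; the temporary directory is removed once the upload finishes.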

I also filed a feature request on your behalf. I suggest you star it to receive updates on its progress.

tzovourn answered Dec 07 '25 11:12