
How to unzip files from s3 and save it back on s3

I have some .zip files in a bucket on S3 which I need to unzip and save back to the bucket without using the local file system.

I know S3 is static storage, but can I unzip files on S3 itself by giving the S3 bucket path?

I have the following questions.

  1. Can I pass a bucket/folder path to FileOutputStream(bucketPath) so it unzips the file directly there?

    BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(filePath));

  2. S3Object.putObject() also accepts an InputStream as a parameter. Can I convert a ZipEntry into an InputStream directly and pass it as a parameter along with metadata?

  3. I need to use EMR to perform all operations (the local file system will not come into the picture). Can I read a zip file from S3, unzip the files using EMR, and save them on S3?
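Regarding question 2: a ZipInputStream itself serves as the InputStream for whichever entry getNextEntry() last returned, so each entry can be buffered and handed to putObject with its length set in ObjectMetadata. Below is a minimal sketch of that buffering step, using an in-memory zip in place of the real S3 stream; the commented-out upload lines assume the AWS SDK v1 client (s3Client, bucketName) from the code further down and are not part of the runnable sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipToStream {

    // Buffer one zip entry fully so its length is known up front
    // (S3 putObject wants Content-Length in the metadata; without it
    // the SDK buffers the whole stream in memory anyway).
    static byte[] readEntry(ZipInputStream zin) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = zin.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Build a small zip in memory to stand in for
        // s3object.getObjectContent() from the question.
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(zipped)) {
            zout.putNextEntry(new ZipEntry("sandip_1.graphml"));
            zout.write("<graphml/>".getBytes("UTF-8"));
            zout.closeEntry();
        }

        ZipInputStream zin =
                new ZipInputStream(new ByteArrayInputStream(zipped.toByteArray()));
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            byte[] content = readEntry(zin);
            // Instead of printing, the upload would look like this
            // (assumed AWS SDK v1 names, not runnable here):
            // ObjectMetadata meta = new ObjectMetadata();
            // meta.setContentLength(content.length);
            // s3Client.putObject(bucketName, entry.getName(),
            //         new ByteArrayInputStream(content), meta);
            System.out.println(entry.getName() + " " + content.length);
        }
    }
}
```

The key point is that nothing here touches the local file system: the entry's bytes go straight from the zip stream into a byte array and could go from there into an S3 PUT.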

Here is my code.

S3Object s3object = s3Client.getObject(new GetObjectRequest(bucketName, objName)); // sandip.zip

ZipInputStream in = new ZipInputStream(s3object.getObjectContent());
ZipEntry entry = in.getNextEntry(); // sandip_1.graphml
try {
    while (entry != null) {
        s3Client.putObject(bucketName, entry.getName(), new File(entry.getName()));
    }
}
catch (IOException e) {
    e.printStackTrace();
}

My current code throws the following exception.

Exception in thread "main" com.amazonaws.AmazonClientException: Unable to calculate MD5 hash: sandip_1.graphml (The system cannot find the file specified)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1319)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1273)
at com.example.testaws.test2.createAdjListZipFiles(Unknown Source)
at com.example.testaws.test1.main(test1.java:33)
Caused by: java.io.FileNotFoundException: sandip_1.graphml (The system cannot find the file specified)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(Unknown Source)
at java.io.FileInputStream.<init>(Unknown Source)
at com.amazonaws.util.Md5Utils.computeMD5Hash(Md5Utils.java:97)
at com.amazonaws.util.Md5Utils.md5AsBase64(Md5Utils.java:104)
at com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1316)
... 3 more

Please give me a hint or a reference.

Sandip Armal Patil asked Sep 14 '25

1 Answer

First, you are right about one thing: S3 is static storage, so you can't make any file-level changes directly on S3. You will have to download the files, transform them as required, and upload them back.

Second, you can definitely use EMR for this. It will, in fact, make your life very easy. Try this out:

  • Create an EMR cluster with Hive installed.

  • Create a Hive table somewhat like this: create external table x (record string) location 's3://blah';

  • Create another table, called y, just like above, with one addition: 'stored as textfile'

  • Now do an 'insert overwrite table y select record from x'.

Here, Hive will automatically detect that the input file is gzipped. After that, all you are doing is instructing Hive to store the same data back in the same S3 location, but as a text file.
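The steps above can be sketched in HiveQL roughly as follows; the table names x and y and the S3 paths are placeholders to be replaced with your own.

```sql
-- External table over the compressed input; Hive decompresses
-- gzipped input files transparently when reading.
CREATE EXTERNAL TABLE x (record STRING)
LOCATION 's3://blah/input/';

-- Same single-column schema, stored as plain text at the target path.
CREATE EXTERNAL TABLE y (record STRING)
STORED AS TEXTFILE
LOCATION 's3://blah/output/';

-- Rewrite the data, uncompressed, into y's S3 location.
INSERT OVERWRITE TABLE y SELECT record FROM x;
```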

P.S. I am unable to post exact code or correct formatting because I am answering this on the go, but I hope you get the general idea. This will definitely work, as I have done it a few times.

ketan vijayvargiya answered Sep 15 '25