How to list all files in a directory and its subdirectories in hadoop hdfs

Tags: java, hadoop, hdfs

I have a folder in HDFS which has two subfolders, each of which has about 30 subfolders, each of which, finally, contains XML files. I want to list all the XML files, giving only the main folder's path. Locally I can do this with Apache commons-io's FileUtils.listFiles(). I have tried this

FileStatus[] status = fs.listStatus( new Path( args[ 0 ] ) );

but it only lists the first two subfolders and doesn't go any deeper. Is there any way to do this in Hadoop?
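For reference, the local listing that works is something like this (a minimal sketch, assuming commons-io is on the classpath; the directory path is just a placeholder):

    import java.io.File;
    import java.util.Collection;
    import org.apache.commons.io.FileUtils;

    // Recursively collect all .xml files under a local directory
    Collection<File> localXml = FileUtils.listFiles(
            new File("/local/path/to/folder"),  // placeholder path
            new String[] { "xml" },             // extensions to match
            true);                              // recurse into subdirectories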

asked by nik686

2 Answers

If you are using the Hadoop 2.x API, there is a more elegant solution:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    FileSystem fs = FileSystem.get(conf);

    // the second boolean parameter sets the recursion to true
    RemoteIterator<LocatedFileStatus> fileStatusListIterator =
            fs.listFiles(new Path("path/to/lib"), true);
    while (fileStatusListIterator.hasNext()) {
        LocatedFileStatus fileStatus = fileStatusListIterator.next();
        // do stuff with the file, e.g. ...
        job.addFileToClassPath(fileStatus.getPath());
    }
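Applied to the question, the same iterator lets you keep just the XML files; a minimal sketch reusing the fs handle from above:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.fs.LocatedFileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.RemoteIterator;

    // Recursively collect only the .xml file paths under the root folder
    List<Path> xmlFiles = new ArrayList<>();
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path(args[0]), true);
    while (it.hasNext()) {
        Path p = it.next().getPath();
        if (p.getName().endsWith(".xml")) {
            xmlFiles.add(p);
        }
    }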
answered by Prasoon Joshi


You'll need to use the FileSystem object and perform some logic on the resultant FileStatus objects to manually recurse into the subdirectories.
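A minimal sketch of that manual recursion (the helper name collectXmlFiles is mine, not a Hadoop API):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Walk the tree rooted at dir, adding every .xml file path to result
    static void collectXmlFiles(FileSystem fs, Path dir, List<Path> result) throws IOException {
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
                collectXmlFiles(fs, status.getPath(), result);  // descend into subdirectory
            } else if (status.getPath().getName().endsWith(".xml")) {
                result.add(status.getPath());
            }
        }
    }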

You can also apply a PathFilter to return only the XML files, using the listStatus(Path, PathFilter) method:
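For example (a sketch; args[0] as in the question):

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.PathFilter;

    // List only the .xml entries directly under the given directory
    FileStatus[] xmlOnly = fs.listStatus(new Path(args[0]), new PathFilter() {
        @Override
        public boolean accept(Path path) {
            return path.getName().endsWith(".xml");
        }
    });

Note that the filter is applied to directory names as well, so when recursing you still want to descend into every directory regardless of the filter and only filter the files themselves.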

The Hadoop FsShell class has examples of this for the hadoop fs -lsr command, which is a recursive ls; see the source, around line 590 (the recursive step is triggered on line 635).

answered by Chris White