I have 10M+ photos saved on the local file system. Now I want to go through each of them and analyze the binary of the photo to see if it's a dog. I basically want to do the analysis on a clustered Hadoop environment. The problem is, how should I design the input for the map method? Let's say that, in the map method, 
new FaceDetection(photoInputStream).isDog() is all the underlying logic for the analysis. 
Specifically, 
Should I upload all of the photos to HDFS? Assume yes,
how can I use them in the map method? 
Is it OK to make the input to the map a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right way to load a file from HDFS during the execution of the map method?)
It seems I'm missing some knowledge of the basic principles of Hadoop, MapReduce and HDFS, but could you please point me in the right direction on the questions above? Thanks!
MapReduce facilitates concurrent processing by splitting large datasets (up to petabytes) into smaller chunks and processing them in parallel on commodity servers in a Hadoop cluster. In the end, it aggregates the results from multiple servers and returns a consolidated output to the application.
MapReduce is Hadoop's native batch processing engine. A MapReduce job splits a large dataset into independent chunks and organizes them into key-value pairs for parallel processing. The map and reduce functions receive not just values, but (key, value) pairs.
MapReduce is suitable for batch computation over large quantities of data that can be processed in parallel; iterative algorithms are possible, but each pass typically runs as a separate job. It represents a data flow rather than a procedure. It has also been used for large-scale graph analysis, with computing PageRank over web documents as a commonly cited example.
Apache Hadoop is an ecosystem that provides a reliable, scalable environment for distributed computing. MapReduce is a module of this project: a programming model used to process huge datasets that sit on HDFS (the Hadoop Distributed File System).
how can I use them in the map method?
The major problem is that each image is going to be its own file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to consider pre-serializing the files into SequenceFiles (one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data in one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
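To put rough numbers on the difference, here is a back-of-the-envelope sketch comparing one map task per file with one map task per block-sized split. The 1 TB total and the 128 MB split size are assumptions for illustration, not measurements, and `SplitEstimate` is a hypothetical helper name.

```java
// Rough estimate: one map task per small file vs. one per block-sized split.
final class SplitEstimate {
    // Ceiling division for positive longs.
    static long splitsFor(long totalBytes, long splitBytes) {
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long files = 10_000_000L;              // photo count from the question
        long totalBytes = 1_000_000_000_000L;  // ASSUMPTION: 1 TB of images in total
        long splitBytes = 128L * 1024 * 1024;  // ASSUMPTION: 128 MB splits
        System.out.println("one task per file:  " + files);
        System.out.println("one task per split: " + splitsFor(totalBytes, splitBytes));
    }
}
```

Under these assumptions the job drops from ten million map tasks to a few thousand.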
Basically, the way this works is, you will have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
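The packing idea can be sketched in plain Java. Note that this is a stand-in and not the Hadoop SequenceFile API: it only shows the shape of the data (file name as key, raw image bytes as value, many records in one container). `PhotoPacker` and its record layout are hypothetical; in a real job you would write the same pairs through Hadoop's SequenceFile writer instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java stand-in for a key-value container (NOT the SequenceFile format).
final class PhotoPacker {
    // Write each photo as one record: UTF name, int length, raw bytes.
    static byte[] pack(Map<String, byte[]> photos) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> photo : photos.entrySet()) {
                out.writeUTF(photo.getKey());
                out.writeInt(photo.getValue().length);
                out.write(photo.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read records back until the container is exhausted.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> photos = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] data = new byte[in.readInt()];
                in.readFully(data);
                photos.put(name, data);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return photos;
    }
}
```

A mapper then receives one (name, bytes) pair per call, so the bytes can be wrapped in a ByteArrayInputStream and handed to something like new FaceDetection(photoInputStream).isDog().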
Is it OK to make the input to the map a text file containing all of the photo paths (in HDFS), one per line, and in the map method load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right way to load a file from HDFS during the execution of the map method?)
This is not OK if you have any sort of reasonable cluster (which you should, if you are considering Hadoop for this) and you actually want to use the power of Hadoop. Your MapReduce job will fire off and load the files, but the mappers will be running data-local to the text files, not the images! So, basically, you are going to be moving the image files across the network, since the JobTracker is not placing tasks where the images are. This will incur a significant amount of network overhead. If you have 1TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (fewer than a handful of nodes).
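As a rough illustration of "a lot of them", here is a simple model of how many image reads would be remote. It assumes replicas are spread uniformly over the nodes and that tasks land on nodes independently of where the images live, which is a simplification; the replication factor of 3 is the usual HDFS default, and `LocalityEstimate` is a hypothetical name.

```java
// Back-of-the-envelope model of non-local reads when tasks ignore image location.
final class LocalityEstimate {
    // Expected fraction of reads with no replica on the reading node,
    // ASSUMING uniform replica placement and location-blind task placement.
    static double remoteFraction(int nodes, int replication) {
        if (nodes <= replication) {
            return 0.0; // every node can hold a replica of every block
        }
        return 1.0 - (double) replication / nodes;
    }

    public static void main(String[] args) {
        int replication = 3; // usual HDFS default
        for (int nodes : new int[] {3, 5, 20, 100}) {
            System.out.printf("%d nodes: %.0f%% of reads remote%n",
                    nodes, 100 * remoteFraction(nodes, replication));
        }
    }
}
```

Under this model a 3-node cluster reads everything locally, while on 20 nodes about 85% of the image bytes cross the network.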
If you do want to do this, you can use the FileSystem API to read the files (you want the open method).
I have 10M+ photos saved on the local file system.
Assuming it takes one second to put each file into the sequence file, converting 10M individual files will take ~115 days (10,000,000 s at 86,400 s per day). Even with parallel processing on a single machine, I don't see much improvement, because disk read/write will be a bottleneck when reading the photo files and writing the sequence file. Check this Cloudera article on the small files problem. It also references a script that converts a tar file into a sequence file, along with how long the conversion took.
Basically, the photos have to be processed in a distributed way to convert them into sequence files. Back to Hadoop :)
According to Hadoop: The Definitive Guide:
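The arithmetic behind that estimate, as a small sketch. The one-second-per-file cost is the assumption stated above; the worker count and the perfectly linear scaling are additional assumptions for illustration only, and `ConversionTime` is a hypothetical name.

```java
// Wall-clock estimate for converting small files into a sequence file.
final class ConversionTime {
    static final double SECONDS_PER_DAY = 86_400.0;

    // ASSUMPTION: work divides evenly across workers with no shared bottleneck.
    static double days(long files, double secondsPerFile, int workers) {
        return files * secondsPerFile / workers / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long files = 10_000_000L;
        System.out.printf("1 worker:    %.1f days%n", days(files, 1.0, 1));
        System.out.printf("100 workers: %.2f days%n", days(files, 1.0, 100));
    }
}
```

A single process needs about 115.7 days at this rate, while 100 independent workers would need a bit over a day if nothing else became the bottleneck.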
As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory.
So, directly loading 10M files will require around 3,000 MB of memory just for storing the namespace on the NameNode. And that is before even counting the cost of streaming the photos across nodes during the execution of the job.
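The rule of thumb works out as follows. This sketch counts one object per file plus one per block and ignores directories; `NameNodeMemory` is a hypothetical helper name, and 150 bytes per object is only the approximation quoted above.

```java
// NameNode heap estimate from the ~150 bytes per namespace object rule of thumb.
final class NameNodeMemory {
    static final long BYTES_PER_OBJECT = 150L; // rule of thumb, not a measurement

    // Each file contributes one file object plus one object per block.
    static long bytes(long files, long blocksPerFile) {
        return files * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        System.out.println("1M files:  " + bytes(1_000_000L, 1) / 1_000_000 + " MB");
        System.out.println("10M files: " + bytes(10_000_000L, 1) / 1_000_000 + " MB");
    }
}
```

One million single-block files come to 300 MB, matching the quote, and ten million come to 3,000 MB.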
There should be a better way of solving this problem.
Another approach is to load the files as-is into HDFS and use CombineFileInputFormat, which combines the small files into an input split and considers data locality while calculating the input splits. The advantage of this approach is that the files can be loaded into HDFS as-is without any conversion, and there is also not much data shuffling across nodes.