TextInputFormat vs. non-UTF-8 encoding

I have to write a MapReduce batch job (using the org.apache.hadoop.mapreduce.* API) to process text files with the following properties:

  • ISO-8859-1 encoding.
  • CSV-like format.
  • The field separator is the byte 0xef.

I use a TextInputFormat since I want to perform the field splitting myself. However, it seems that TextInputFormat can only handle UTF-8 encoded files.
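For reference, the job is wired up roughly like this (a minimal Hadoop 2.x-style sketch; the class name, job name, and paths are placeholders, not my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Placeholder driver: only the TextInputFormat wiring matters here.
public class LatinOneCsvJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "iso-8859-1 csv");
        job.setJarByClass(LatinOneCsvJob.class);
        // TextInputFormat splits the input into lines; the mapper
        // (shown in the edit below) does the field splitting itself.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}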

According to MAPREDUCE-232 there has been a pending patch since 2008, but I have not been able to find a workaround. What are my options? Converting the files to UTF-8 beforehand is not an option.

Edit: While reading the Hadoop source code I figured out a possible workaround. LineReader and friends only deal with bytes: they never convert bytes into a String, they only match hard-coded end-of-line separators and fill a byte buffer. Since ISO-8859-1 and UTF-8 share the same byte sequence for \n, it is thus possible to use:

import java.io.IOException;
import com.google.common.base.Charsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TextInputFormat emits LongWritable byte offsets as keys.
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
        // Decode the raw bytes with the real charset instead of
        // Text.toString(), which assumes UTF-8.
        String data = new String(value.getBytes(),
                                 0, value.getLength(),
                                 Charsets.ISO_8859_1);
        // [...]
    }
}
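For completeness, the field splitting I have in mind would then look roughly like this (the 0xEF separator decodes to '\u00EF' under ISO-8859-1; the output key/value choice is only an illustration, not part of the actual job):

// Hypothetical body for the "[...]" part above: split the re-decoded
// line on the 0xEF separator, which ISO-8859-1 maps to '\u00EF'.
String[] fields = data.split("\u00EF", -1);
for (int i = 0; i < fields.length; i++) {
    // Example output only: column index as key, field value as value.
    context.write(new Text(Integer.toString(i)), new Text(fields[i]));
}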

Is this solution acceptable?

asked Dec 20 '25 by Clément MATHIEU

1 Answer

I don't have any particular experience with TextInputFormat, but if what you say is true (the underlying code only looks for the single-byte value of \n), then converting those bytes to a String with your example code is perfectly legitimate.

UPDATE:

Your concern about relying on implementation details is valid. However, here are some points in your favor:

  1. The "bug fix" has been open since 2008 and was rejected because it didn't handle all encodings correctly (i.e., this is a hard problem that needs more work to fix correctly).
  2. The Text class works explicitly with UTF-8 encoding; that would be tough to change later without breaking the whole world.
  3. Following on from point 2: since your target encoding has a newline byte sequence compatible with UTF-8, as long as you can always get back the original raw bytes, you should be fine (see the round-trip sketch below).
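To illustrate point 3, here is a minimal, self-contained sketch (not from the original question) showing that decoding ISO-8859-1 bytes into a String and re-encoding them gives back the exact original bytes, which is what makes the mapper-side conversion safe:

import java.util.Arrays;
import com.google.common.base.Charsets;

// Standalone check: ISO-8859-1 maps every byte value to exactly one
// character, so decode + encode is a lossless round trip.
public class RoundTripCheck {
    public static void main(String[] args) {
        byte[] original = {
            (byte) 0x41,   // 'A'
            (byte) 0xEF,   // the field separator
            (byte) 0xE9,   // 'é' in ISO-8859-1
            (byte) 0x0A    // '\n', same byte as in UTF-8
        };
        String decoded = new String(original, Charsets.ISO_8859_1);
        byte[] roundTripped = decoded.getBytes(Charsets.ISO_8859_1);
        // Prints true: no bytes are lost or replaced, unlike decoding
        // the same data as UTF-8 (which would insert U+FFFD).
        System.out.println(Arrays.equals(original, roundTripped));
    }
}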
answered Dec 23 '25 by jtahlborn


