TextInputFormat vs. non-UTF-8 encoding

I have to write a MapReduce batch job (using the org.apache.hadoop.mapreduce.* API) to process text files with the following properties:

  • ISO-8859-1 encoding.
  • CSV-like format.
  • The field separator is the byte 0xef.

I use a TextInputFormat since I want to perform the field splitting myself. However, it seems that TextInputFormat can only handle UTF-8 encoded files.
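For reference, the job is wired up roughly like this (a minimal Hadoop 2.x-style sketch; the class name, job name, and paths are placeholders, not my actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Placeholder driver: only the TextInputFormat wiring matters here.
public class LatinOneCsvJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "iso-8859-1 csv");
        job.setJarByClass(LatinOneCsvJob.class);
        // TextInputFormat splits the input into lines; the mapper
        // (shown in the edit below) does the field splitting itself.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}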

According to MAPREDUCE-232 there has been a pending patch since 2008, but I have not been able to find a workaround. What are my options? Converting the files to UTF-8 beforehand is not an option.

Edit: While reading the Hadoop source code I figured out a possible workaround. LineReader and friends only deal with bytes: they never convert bytes into a String, they only match hard-coded end-of-line separators and fill a byte buffer. Since ISO-8859-1 and UTF-8 share the same byte sequence for \n, it is thus possible to use:

import java.io.IOException;
import com.google.common.base.Charsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TextInputFormat emits LongWritable byte offsets as keys.
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    public void map(LongWritable key, Text value, Context context)
                   throws IOException, InterruptedException {
        // Decode the raw bytes with the real charset instead of
        // Text.toString(), which assumes UTF-8.
        String data = new String(value.getBytes(),
                                 0, value.getLength(),
                                 Charsets.ISO_8859_1);
        // [...]
    }
}
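For completeness, the field splitting I have in mind would then look roughly like this (the 0xEF separator decodes to '\u00EF' under ISO-8859-1; the output key/value choice is only an illustration, not part of the actual job):

// Hypothetical body for the "[...]" part above: split the re-decoded
// line on the 0xEF separator, which ISO-8859-1 maps to '\u00EF'.
String[] fields = data.split("\u00EF", -1);
for (int i = 0; i < fields.length; i++) {
    // Example output only: column index as key, field value as value.
    context.write(new Text(Integer.toString(i)), new Text(fields[i]));
}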

Is this solution acceptable?

asked Dec 20 '25 by Clément MATHIEU

1 Answer

I don't have any particular experience with TextInputFormat, but if what you say is true (the underlying code only looks for the single-byte value of \n), then converting those bytes to a String with your example code is perfectly legitimate.

UPDATE:

Your concern about relying on implementation details is valid. However, here are some points in your favor:

  1. The "bug fix" has been open since 2008 and was rejected because it didn't handle all encodings correctly (i.e., this is a hard problem that needs more work to fix correctly).
  2. The Text class works explicitly with UTF-8 encoding; that would be tough to change later without breaking the whole world.
  3. Following on from point 2: since your target encoding has a newline byte sequence compatible with UTF-8, as long as you can always get back the original raw bytes, you should be fine (see the round-trip sketch below).
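To illustrate point 3, here is a minimal, self-contained sketch (not from the original question) showing that decoding ISO-8859-1 bytes into a String and re-encoding them gives back the exact original bytes, which is what makes the mapper-side conversion safe:

import java.util.Arrays;
import com.google.common.base.Charsets;

// Standalone check: ISO-8859-1 maps every byte value to exactly one
// character, so decode + encode is a lossless round trip.
public class RoundTripCheck {
    public static void main(String[] args) {
        byte[] original = {
            (byte) 0x41,   // 'A'
            (byte) 0xEF,   // the field separator
            (byte) 0xE9,   // 'é' in ISO-8859-1
            (byte) 0x0A    // '\n', same byte as in UTF-8
        };
        String decoded = new String(original, Charsets.ISO_8859_1);
        byte[] roundTripped = decoded.getBytes(Charsets.ISO_8859_1);
        // Prints true: no bytes are lost or replaced, unlike decoding
        // the same data as UTF-8 (which would insert U+FFFD).
        System.out.println(Arrays.equals(original, roundTripped));
    }
}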
answered Dec 23 '25 by jtahlborn


