I have to write a map reduce batch (using the org.apache.hadoop.mapreduce.* API) to process text files that are ISO-8859-1 encoded rather than UTF-8. I use a TextInputFormat since I want to perform the field splitting myself. However, it seems that TextInputFormat is only able to handle UTF-8 encoded files.
According to MAPREDUCE-232 there is a pending patch since 2008, but I have not been able to find a workaround. What are my options? Converting the files to UTF-8 beforehand is not an option.
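For completeness, here is roughly how the job is wired up (a minimal sketch only; class names and paths are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJob {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "iso-8859-1 batch");
        job.setJarByClass(MyJob.class);

        // Lines come in unparsed; all field splitting happens in the mapper.
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(MyMapper.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}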
Edit: While reading the Hadoop source code I figured out a possible workaround: LineReader & friends only deal with bytes. They never convert the bytes into a String; they merely match hard-coded end-of-line separators and fill a byte buffer. Since ISO-8859-1 and UTF-8 share the same byte sequence for \n, it should therefore be possible to use:
import java.io.IOException;

import com.google.common.base.Charsets; // Guava; java.nio.charset.StandardCharsets works as well
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Reinterpret the raw line bytes as ISO-8859-1 instead of letting Text decode them as UTF-8.
        String data = new String(value.getBytes(), 0, value.getLength(),
                Charsets.ISO_8859_1);
        // [...]
    }
}
Is this solution acceptable?
I don't have any particular experience with TextInputFormat, but if what you say is true (the underlying code is only looking for the single byte value of \n), then converting those bytes to a String using your example code would be perfectly legitimate.
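If you want to convince yourself outside of a running job, the byte round trip can be checked in isolation. A quick sketch (I am using Java 7's StandardCharsets here; the Guava Charsets constant from your snippet behaves the same way):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.io.Text;

public class RoundTripCheck {
    public static void main(String[] args) {
        // Bytes exactly as a LineReader would hand them over for one ISO-8859-1 line.
        byte[] raw = "café à Noël".getBytes(StandardCharsets.ISO_8859_1);

        // Text stores the bytes verbatim; set() does not re-encode or validate them.
        Text value = new Text();
        value.set(raw, 0, raw.length);

        // Decoding them as UTF-8 (what toString() does) mangles the accented characters...
        System.out.println(value.toString());

        // ...while decoding the very same bytes as ISO-8859-1 restores the original line.
        String data = new String(value.getBytes(), 0, value.getLength(),
                StandardCharsets.ISO_8859_1);
        System.out.println(data);
    }
}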
UPDATE:
Your concern about relying on implementation details is valid; however, here are some points in your favor:
The Text class works explicitly with UTF-8 encoding; that would be tough to change later without breaking the whole world.
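To make that point concrete: the Text API takes no charset parameter anywhere, so as soon as a Text is set from a Java String the bytes it stores are UTF-8. A standalone sketch just to show the stored bytes:

import org.apache.hadoop.io.Text;

public class TextIsUtf8 {
    public static void main(String[] args) {
        // Text(String) / set(String) always encode through UTF-8.
        Text t = new Text("Noël");

        // Prints "4e 6f c3 ab 6c": 'ë' becomes the two UTF-8 bytes 0xC3 0xAB,
        // not the single byte 0xEB it would be in ISO-8859-1.
        for (int i = 0; i < t.getLength(); i++) {
            System.out.printf("%02x ", t.getBytes()[i]);
        }
        System.out.println();
    }
}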