My CSV files have a header on the first line. Loading them into Pig creates a mess in any subsequent function (like SUM). For now, I first apply a filter to the loaded data to remove the rows containing the headers:
affaires = load 'affaires.csv' using PigStorage(',') as (NU_AFFA:chararray, date:chararray) ;
affaires = filter affaires by date matches '../../..';
This feels like a clumsy method, and I am wondering whether there is a way to tell Pig not to load the first line of the CSV, such as an "as_header" boolean parameter to the load function. I don't see one in the docs. What would be the best practice? How do you usually deal with this?
The CSVExcelStorage loader supports skipping the header row, so use CSVExcelStorage instead of PigStorage. Download piggybank.jar and try this option.
Example:
input.csv
Name,Age,Location
a,10,chennai
b,20,banglore
Pig script (with the SKIP_INPUT_HEADER option):
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER');
DUMP A;
Output:
(a,10,chennai)
(b,20,banglore)
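Since the original problem was that the header row broke functions like SUM, here is a minimal sketch of the same load with a schema attached, followed by a sum over the Age column. The field names (name, age, location) and the aliases G and total are assumptions made for illustration; the jar path is the one used in the script above.

```pig
REGISTER '/tmp/piggybank.jar';

-- Same loader call as above, with an assumed schema so age is typed as int
A = LOAD 'input.csv'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (name:chararray, age:int, location:chararray);

-- Put every row into a single group so SUM runs over the whole column
G = GROUP A ALL;
total = FOREACH G GENERATE SUM(A.age);

DUMP total;
```

With the header skipped at load time, no filter step is needed before the aggregation.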
Reference:
http://pig.apache.org/docs/r0.13.0/api/org/apache/pig/piggybank/storage/CSVExcelStorage.html