I read in a CSV file that contains fields with numbers like this: "3". Can I convert these fields from "3" to 3 with Pig Latin? I need this in order to use the SUM() function.
Thanks for your help!
What about just removing the " with REPLACE?
For example:
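-- Load each line as a string, e.g. "3" including the quotes.
data =
    LOAD 'data.txt' AS (num:CHARARRAY);
-- Strip the double quotes, then cast the remaining digits to an int.
numbers =
    FOREACH data
    GENERATE
        (INT) REPLACE(num, '\\"', '') AS num;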
Then you can GROUP and SUM.
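The GROUP and SUM step might look like the following sketch. It assumes the relation numbers from the example above and refers to its only field by position ($0); the aliases grouped and total are made up for illustration.

```pig
-- Put every row of numbers into a single group so SUM sees all values.
grouped = GROUP numbers ALL;
-- $0 is the first (and only) field of numbers, referenced by position.
total = FOREACH grouped GENERATE SUM(numbers.$0);
DUMP total;
```

If you need one sum per key instead of a grand total, GROUP numbers BY the key field rather than ALL.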
One advantage of this approach is that you can cast the returned string directly to a number (no need to deal with bags). REGEX_EXTRACT could do the same.
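A rough sketch of the REGEX_EXTRACT variant, assuming every field is a non-negative integer wrapped in double quotes (the pattern is my assumption and would need adjusting for signs or decimals):

```pig
data = LOAD 'data.txt' AS (num:chararray);
-- Capture group 1 holds the digits between the quotes.
-- Rows that do not match the pattern yield null.
numbers = FOREACH data GENERATE (int) REGEX_EXTRACT(num, '"(\\d+)"', 1) AS num;
```

Like REPLACE, this returns a plain chararray, so the cast can be applied directly.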
The TOKENIZE function splits a string on characters it treats as word separators, one of which is the double quote. So if you tokenize "3", the only token left in the resulting bag should be just 3.
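A sketch of the TOKENIZE route, assuming each field holds exactly one quoted number. Since TOKENIZE returns a bag, the sketch uses FLATTEN to get back to one plain value per row; the aliases tokens and token are made up for illustration.

```pig
data = LOAD 'data.txt' AS (num:chararray);
-- TOKENIZE returns a bag of tokens; FLATTEN turns each token into its own row.
tokens = FOREACH data GENERATE FLATTEN(TOKENIZE(num)) AS token;
-- The token is still a chararray, so cast it before summing.
numbers = FOREACH tokens GENERATE (int) token AS num;
```

Note that a field containing spaces or commas would produce several rows here, because those characters are separators as well.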
You could write a UDF that strips the surrounding quotes, or use JacobM's approach.
Either way, you should then cast the chararray '3' to an int: (int)$1 or (int)myvalue. That way you can use SUM.
http://pig.apache.org/docs/r0.5.0/piglatin_reference.html#Cast+Operators