I've sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In second row (id=101), log column has newline characters making 3 lines out of one line. I've enabled ":set list" option in vim editor to show newline ($) and endofline (^M) characters.
To handle newline characters AWS Suggested OpenCSVSerde here.
I tried using OPENCSVSerde serialisation with escapeChar=\\, quoteChar=\", seperatorChar=
,
Nonetheless, it is showing data as 5 rows where as I need three rows.
When I query in Athena, id=101 is showing only first line and rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or example on how to handle multiline characters in a csv file column?
I'm exploring custom classifiers but no luck yet.
According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html opencsvserde does not support line breaks.
I see that you are trying to put some kind of log there. Your options are:
Cleanup the log not to include the line breaks. Or,
use regexserde, which is not useful if your log format keeps changing. Or,
If both are not an option you can change ur format from csv to parquet or something else, where there are no line break issues
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With