Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

parsing csv file that has newline characters in one of columns in AWS Athena/ AWS Glue catalog

I've sample data like below:

id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$

In second row (id=101), log column has newline characters making 3 lines out of one line. I've enabled ":set list" option in vim editor to show newline ($) and endofline (^M) characters.

To handle newline characters AWS Suggested OpenCSVSerde here. I tried using OPENCSVSerde serialisation with escapeChar=\\, quoteChar=\", seperatorChar=, Nonetheless, it is showing data as 5 rows where as I need three rows. When I query in Athena, id=101 is showing only first line and rest is missing:

id,log,code,sequence
101,sample- 4/52

Any tips or example on how to handle multiline characters in a csv file column?

I'm exploring custom classifiers but no luck yet.

like image 762
Ash Avatar asked Oct 16 '25 19:10

Ash


1 Answers

According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html opencsvserde does not support line breaks.

I see that you are trying to put some kind of log there. Your options are:

  • Cleanup the log not to include the line breaks. Or,

  • use regexserde, which is not useful if your log format keeps changing. Or,

  • If both are not an option you can change ur format from csv to parquet or something else, where there are no line break issues

like image 137
Nelson Sequiera Avatar answered Oct 19 '25 13:10

Nelson Sequiera



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!