parsing csv file that has newline characters in one of columns in AWS Athena/ AWS Glue catalog

Question

I've sample data like below:

id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$

In second row (id=101), log column has newline characters making 3 lines out of one line. I've enabled ":set list" option in vim editor to show newline ($) and endofline (^M) characters.

To handle newline characters AWS Suggested OpenCSVSerde here. I tried using OPENCSVSerde serialisation with escapeChar=\, quoteChar=\", seperatorChar=, Nonetheless, it is showing data as 5 rows where as I need three rows. When I query in Athena, id=101 is showing only first line and rest is missing:

id,log,code,sequence
101,sample- 4/52

Any tips or example on how to handle multiline characters in a csv file column?

I'm exploring custom classifiers but no luck yet.

Nelson Sequiera · Accepted Answer

According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html opencsvserde does not support line breaks.

I see that you are trying to put some kind of log there. Your options are:

Cleanup the log not to include the line breaks. Or,
use regexserde, which is not useful if your log format keeps changing. Or,
If both are not an option you can change ur format from csv to parquet or something else, where there are no line break issues

parsing csv file that has newline characters in one of columns in AWS Athena/ AWS Glue catalog

Tags:

amazon-web-services

amazon-athena

aws-glue-data-catalog

Ash

1 Answers

Nelson Sequiera

Recent Activity

Donate For Us

parsing csv file that has newline characters in one of columns in AWS Athena/ AWS Glue catalog

Tags:

amazon-web-services

amazon-athena

aws-glue-data-catalog

Ash

1 Answers

Nelson Sequiera

Related questions

Recent Activity

Donate For Us