
Parsing CSV file with multiline fields and escaped double quotes

Tags: c++, csv

What is the best way to parse a CSV file with multiline fields and escaped quotes?

For example, this CSV

First field of first row,"This field is multiline

but that's OK because it's enclosed in double qoutes, and this
is an escaped "" double qoute" but this one "" is not
   "This is second field of second row, but it is not multiline
because it doesn't start 
with an immediate double quote"

looks in Excel like this:

[Image: screenshot of the sample CSV opened in Excel; not preserved]

Do I just preserve the state somehow, with a flag saying whether the field I'm reading now started with a quote, etc.? Also, what happens when there are double quotes that are not at the start of the field, i.e. , "ABC", or ,"item "" item" ""? Are those considered invalid? And are "" quotes treated as escaped when they are not inside a quoted field? Excel doesn't seem to treat them that way.

Are there any other corner cases I might have missed?

sashoalm asked Oct 20 '25 04:10


1 Answer

Your sample data has a single double quote just after 'double qoute' (sic), which terminates the quoted part of the field that began with a double quote, but you still have to keep reading until the next comma or the end of the line. This is a malformed multi-line field, and you're seeing what Excel does with it. The (excellent) book The Practice of Programming describes this; it includes a CSV parsing library in C and a reimplementation in C++, and it discusses this level of detail. (The original link to the book is dead; there is an Internet Archive copy and a live page at Princeton University.) There's also a standard for CSV, RFC 4180 "Common Format and MIME Type for Comma-Separated Values", and Wikipedia covers the subject too.
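Because a quoted field may span several lines, a program that reads its input one line at a time with fgetws() cannot tell from a single line whether it holds a whole record. Here is a minimal sketch of one way to check. It assumes the same rules the code in this answer follows: a double quote only opens a quoted field when it is the very first character of the field, and "" inside a quoted field is an escaped quote. The name csvRecordIsComplete() is hypothetical and is not part of the code from the other answer.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

// Hypothetical helper (not from the original code).
// Returns true if the text in buf ends outside a quoted field, i.e. the
// caller does not need to read another line to finish the current record.
// sep is the field separator, for example a comma or a semicolon.
bool csvRecordIsComplete(const wchar_t *buf, wchar_t sep)
{
    assert(buf != NULL);

    bool at_field_start = true;   // next character begins a new field
    bool in_quotes = false;       // currently inside a quoted field

    for (const wchar_t *p = buf; *p != L'\0'; p++) {
        if (in_quotes) {
            if (*p == L'"') {
                if (p[1] == L'"')
                    p++;                // "" is an escaped quote; skip both
                else
                    in_quotes = false;  // lone quote closes the quoted part
            }
        } else if (at_field_start && *p == L'"') {
            // A quote only opens a quoted field as the first character
            in_quotes = true;
            at_field_start = false;
        } else {
            // A separator or line ending means the next character starts a field
            at_field_start = (*p == sep || *p == L'\n' || *p == L'\r');
        }
    }
    return !in_quotes;
}
```

A caller could keep appending lines to a buffer until csvRecordIsComplete(buffer, L',') returns true, and only then pass the buffer on for field splitting.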


The other answer contains some sample code that is still under test. It is pretty resilient within its limitations. Here it is modified into an SSCCE (Short, Self-Contained, Correct Example).

#include <stdbool.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

extern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);

// Returns a pointer to the start of the next field,
// or zero if this is the last field in the CSV
// p is the start position of the field
// sep is the separator used, i.e. comma or semicolon
// newline says whether the field ends with a newline or with a comma
const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)
{
    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote
            p = wcschr(p, L'"');
            // If we don't find it or it's the last symbol
            // then this is the last field
            if (!p || !p[1])
                return 0;
            // Check for "", it is an escaped double-quote
            if (p[1] != '"')
                break;
            // Skip the escaped double-quote
            p += 2;
        }
    }

    // Find next newline or comma.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);

    // If no newline or separator, this is the last field.
    if (!p)
        return 0;

    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');

    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;

    return p;
}

static void dissect(const wchar_t *line)
{
    const wchar_t *start = line;
    const wchar_t *next;
    bool     eol;
    wprintf(L"Input: %zu [%.*ls]\n", wcslen(line), (int)(wcslen(line) - 1), line);
    while ((next = nextCsvField(start, L',', &eol)) != 0)
    {
        wprintf(L"Field: [%.*ls] (eol = %d)\n", (int)(next - start - eol), start, eol);
        start = next;
    }
}

static const wchar_t multiline[] =
   L"First field of first row,\"This field is multiline\n"
    "\n"
    "but that's OK because it's enclosed in double quotes, and this\n"
    "is an escaped \"\" double quote\" but this one \"\" is not\n"
    "   \"This is second field of second row, but it is not multiline\n"
    "   because it doesn't start \n"
    "   with an immediate double quote\"\n"
    ;

int main(void)
{
    wchar_t line[1024];

    while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))
        dissect(line);
    dissect(multiline);

    return 0;
}

Example output

$ cat csv.data
a,bb, c ,d""e,f
1,"2","",,"""",4
$ ./wcsv < csv.data
Input: 16 [a,bb, c ,d""e,f]
Field: [a,] (eol = 0)
Field: [bb,] (eol = 0)
Field: [ c ,] (eol = 0)
Field: [d""e,] (eol = 0)
Field: [f] (eol = 1)
Input: 17 [1,"2","",,"""",4]
Field: [1,] (eol = 0)
Field: ["2",] (eol = 0)
Field: ["",] (eol = 0)
Field: [,] (eol = 0)
Field: ["""",] (eol = 0)
Field: [4] (eol = 1)
Input: 296 [First field of first row,"This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not
   "This is second field of second row, but it is not multiline
   because it doesn't start 
   with an immediate double quote"]
Field: [First field of first row,] (eol = 0)
Field: ["This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not] (eol = 1)
Field: [   "This is second field of second row,] (eol = 0)
Field: [ but it is not multiline] (eol = 1)
Field: [   because it doesn't start ] (eol = 1)
Field: [   with an immediate double quote"] (eol = 1)
$

I said 'within its limitations'; what are its limitations?

Primarily, it isolates the raw field rather than the converted field. The raw field therefore has to be processed to produce the 'real' value: the enclosing double quotes are stripped off, and each internal doubled double quote is replaced by a single double quote. Converting a raw field to the real value repeats quite a lot of the logic in the nextCsvField() function. The inputs are the start of the field and the end of the field (a pointer to the separator character that terminates it). Here's a second SSCCE with an extra function csvFieldData(), and the dissect() function shown above revised to call it. The format of the dissected output is slightly different so that it looks better:

#include <stdbool.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

extern const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline);

// Returns a pointer to the start of the next field,
// or zero if this is the last field in the CSV
// p is the start position of the field
// sep is the separator used, i.e. comma or semicolon
// newline says whether the field ends with a newline or with a comma
const wchar_t *nextCsvField(const wchar_t *p, wchar_t sep, bool *newline)
{
    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote
            p = wcschr(p, L'"');
            // If we don't find it or it's the last symbol
            // then this is the last field
            if (!p || !p[1])
                return 0;
            // Check for "", it is an escaped double-quote
            if (p[1] != '"')
                break;
            // Skip the escaped double-quote
            p += 2;
        }
    }

    // Find next newline or comma.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);

    // If no newline or separator, this is the last field.
    if (!p)
        return 0;

    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');

    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;

    return p;
}

static wchar_t *csvFieldData(const wchar_t *fld_s, const wchar_t *fld_e, wchar_t *buffer, size_t buflen)
{
    wchar_t *dst = buffer;
    wchar_t *end = buffer + buflen - 1;
    const wchar_t *src = fld_s;

    if (*src == L'"')
    {
        const wchar_t *p = src + 1;
        while (p < fld_e && dst < end)
        {
            if (p[0] == L'"' && p+1 < fld_e && p[1] == L'"')
            {
                *dst++ = p[0];
                p += 2;
            }
            else if (p[0] == L'"')
            {
                p++;
                break;
            }
            else
                *dst++ = *p++;
        }
        src = p;
    }
    while (src < fld_e && dst < end)
        *dst++ = *src++;
    if (dst >= end)
        return 0;
    *dst = L'\0';
    return(buffer);
}

static void dissect(const wchar_t *line)
{
    const wchar_t *start = line;
    const wchar_t *next;
    bool     eol;
    wprintf(L"Input %3zu: [%.*ls]\n", wcslen(line), (int)(wcslen(line) - 1), line);
    while ((next = nextCsvField(start, L',', &eol)) != 0)
    {
        wchar_t buffer[1024];
        wprintf(L"Raw Field: [%.*ls] (eol = %d)\n", (int)(next - start - eol), start, eol);
        if (csvFieldData(start, next-1, buffer, sizeof(buffer)/sizeof(buffer[0])) != 0)
            wprintf(L"Field %3zu: [%ls]\n", wcslen(buffer), buffer);
        start = next;
    }
}

static const wchar_t multiline[] =
   L"First field of first row,\"This field is multiline\n"
    "\n"
    "but that's OK because it's enclosed in double quotes, and this\n"
    "is an escaped \"\" double quote\" but this one \"\" is not\n"
    "   \"This is second field of second row, but it is not multiline\n"
    "   because it doesn't start \n"
    "   with an immediate double quote\"\n"
    ;

int main(void)
{
    wchar_t line[1024];

    while (fgetws(line, sizeof(line)/sizeof(line[0]), stdin))
        dissect(line);
    dissect(multiline);

    return 0;
}

Example output

$ ./wcsv < csv.data
Input  16: [a,bb, c ,d""e,f]
Raw Field: [a,] (eol = 0)
Field   1: [a]
Raw Field: [bb,] (eol = 0)
Field   2: [bb]
Raw Field: [ c ,] (eol = 0)
Field   3: [ c ]
Raw Field: [d""e,] (eol = 0)
Field   4: [d""e]
Raw Field: [f] (eol = 1)
Field   1: [f]
Input  17: [1,"2","",,"""",4]
Raw Field: [1,] (eol = 0)
Field   1: [1]
Raw Field: ["2",] (eol = 0)
Field   1: [2]
Raw Field: ["",] (eol = 0)
Field   0: []
Raw Field: [,] (eol = 0)
Field   0: []
Raw Field: ["""",] (eol = 0)
Field   1: ["]
Raw Field: [4] (eol = 1)
Field   1: [4]
Input 296: [First field of first row,"This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not
   "This is second field of second row, but it is not multiline
   because it doesn't start 
   with an immediate double quote"]
Raw Field: [First field of first row,] (eol = 0)
Field  24: [First field of first row]
Raw Field: ["This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped "" double quote" but this one "" is not] (eol = 1)
Field 139: [This field is multiline

but that's OK because it's enclosed in double quotes, and this
is an escaped " double quote but this one "" is not]
Raw Field: [   "This is second field of second row,] (eol = 0)
Field  38: [   "This is second field of second row]
Raw Field: [ but it is not multiline] (eol = 1)
Field  24: [ but it is not multiline]
Raw Field: [   because it doesn't start ] (eol = 1)
Field  28: [   because it doesn't start ]
Raw Field: [   with an immediate double quote"] (eol = 1)
Field  34: [   with an immediate double quote"]
$ 

I've not tested with \r\n (or plain \r) line endings.
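One thing to watch for with "\r\n" endings: nextCsvField() advances past both characters, so an end-of-field pointer computed as next-1 lands on the '\n' and the '\r' ends up inside the field. A minimal sketch of a possible adjustment follows. The helper name csvFieldEnd() is hypothetical, and the code assumes that start, next and eol are the values used in dissect() above.

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

// Hypothetical helper (not from the original code).
// start is the beginning of a field, next is the non-null pointer that
// nextCsvField() returned for it, and eol is the newline flag it set.
// Returns a pointer to the first character of the field's terminator,
// which is one past the last character of the raw field.
const wchar_t *csvFieldEnd(const wchar_t *start, const wchar_t *next, bool eol)
{
    assert(start != NULL && next != NULL && next > start);

    const wchar_t *end = next - 1;   // last character of the terminator
    // For a "\r\n" terminator, step back once more so the '\r' is excluded
    if (eol && end > start && end[0] == L'\n' && end[-1] == L'\r')
        end--;
    return end;
}
```

In dissect(), the call would then become csvFieldData(start, csvFieldEnd(start, next, eol), ...), and the raw field length would be csvFieldEnd(start, next, eol) - start. Treat this as a starting point for testing rather than a verified fix.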

Jonathan Leffler answered Oct 21 '25 19:10