Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sed - remove quotes within quotes in large csv files

Tags:

regex

csv

sed

I am using stream editor sed to convert a large set of text files data (400MB) into a csv format.

I have come very close to finish, but the outstanding problem are quotes within quotes, on a data like this:

1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"

The desired output is:

1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

I have searched around for help, but I am not getting too close to solution, I have tried the following seds with regex patterns:

sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt

These are from the below questions, but do not seem to be working for sed:

Related question for perl

Related question for SISS

The original files are *.txt and I am trying to edit them in place with sed.

like image 733
nol Avatar asked Oct 22 '22 01:10

nol


2 Answers

Here's one way using GNU awk and the FPAT variable:

gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file

Results:

1,word1,"description for word1","another text","text contains double
quotes some more text" 2,word2,"description for word2","another
text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"

Explanation:

Using FPAT, a field is defined as either "anything that is not a comma," or "a double quote, anything that is not a double quote, and a closing double quote". Then on every line of input, loop through each field and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.

like image 131
Steve Avatar answered Nov 15 '22 10:11

Steve


sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE

This looks over the strings of the type "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". If it found something, it repeats, to be sure that it eliminates all nested strings at a depth > 2.

It also assures that none of STRx contains comma.

like image 45
alinsoar Avatar answered Nov 15 '22 08:11

alinsoar