I am using the stream editor sed to convert a large set of text files (400 MB) into CSV format.
I am very close to finishing, but the outstanding problem is quotes within quotes, in data like this:
1,word1,"description for word1","another text",""text contains "double quotes" some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for "word3"","another text","more text and more"
The desired output is:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
I have searched around for help, but I have not gotten close to a solution. I have tried the following sed commands with regex patterns:
sed -i 's/(?<!^\s*|,)""(?!,""|\s*$)//g' *.txt
sed -i 's/(?<=[^,])"(?=[^,])//g' *.txt
These come from the questions below, but they do not seem to work with sed:
Related question for perl
Related question for SISS
The original files are *.txt and I am trying to edit them in place with sed.
Here's one way using GNU awk and the FPAT variable:
gawk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS=","; N="\"" } { for (i=1;i<=NF;i++) if ($i ~ /^\".*\"$/) { gsub(/\"/,"", $i); $i=N $i N } }1' file
Results:
1,word1,"description for word1","another text","text contains double quotes some more text"
2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"
3,word3,"description for word3","another text","more text and more"
Explanation:
Using FPAT, a field is defined as either "anything that is not a comma," or "a double quote, anything that is not a double quote, and a closing double quote". Then on every line of input, loop through each field and if the field starts and ends with a double quote, remove all quotes from the field. Finally, add double quotes surrounding the field.
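To see how FPAT splits a line before any quotes are removed, the fields can be printed one per line. This is only a sketch: the helper name show_fields is made up for illustration, and it assumes gawk is installed (FPAT is a gawk extension).

```shell
# Hypothetical helper: print each field that gawk finds with the same FPAT,
# one per line, prefixed with its field number. Reads stdin.
show_fields() {
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" }
        { for (i = 1; i <= NF; i++) printf "%d: %s\n", i, $i }'
}

# Second sample line from the question: the last field contains commas.
printf '%s\n' '2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"' \
  | show_fields
```

The last field should come out whole, as 5: "text may not contain double quotes, but may contain commas ,", because the quoted alternative gives the longer match at that position.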
sed -e ':r s:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":; tr' FILE
This looks for strings of the form "STR1 "STR2" STR3 " and converts them to "STR1 STR2 STR3". If it finds one, it repeats, to make sure that it eliminates all nested strings at a depth > 2.
It also ensures that none of the STRx parts contains a comma.
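The loop can be tried on two of the sample lines from the question. This is a minimal sketch: the function name strip_nested_quotes is made up for illustration, and the command is split into separate -e expressions, which I am assuming is the more portable spelling across sed implementations; the substitution itself is unchanged.

```shell
# Hypothetical helper wrapping the sed loop; reads stdin, writes stdout.
strip_nested_quotes() {
  sed -e ':r' \
      -e 's:["]\([^",]*\)["]\([^",]*\)["]\([^",]*\)["]:"\1\2\3":' \
      -e 'tr'
}

# Third sample line: "word3" is quoted inside an already quoted field.
printf '%s\n' '3,word3,"description for "word3"","another text","more text and more"' \
  | strip_nested_quotes

# Second sample line: commas inside the last field, no nested quotes.
printf '%s\n' '2,word2,"description for word2","another text","text may not contain double quotes, but may contain commas ,"' \
  | strip_nested_quotes
```

The first line should come out as 3,word3,"description for word3","another text","more text and more", and the second should pass through unchanged. Because each STRx must be comma-free, a nested quoted string that itself contains a comma would not be matched by this pattern.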