I'm trying to read a CSV file into a DataFrame using readtable(). There is an unfortunate issue with the CSV file in that if the last x columns of a given row are blank, instead of generating that number of commas, it just ends the line. For example, I can have:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Notice how in the third line, there is only one entry. Ideally, I would like readtable to fill the values for Col2, Col3, and Col4 with NA, NA, and NA; however, because of the lack of commas and therefore lack of empty strings, readtable() simply sees this as a row that doesn't match the number of columns. If I run readtable() in Julia with the sample CSV above, I get the error "Saw 2 Rows, 2 columns, and 5 fields, * Line 1 has 6 columns". If I add in 3 commas after item5, then it works.
Is there any way around this, or do I have to fix the CSV file?
If the CSV parsing doesn't need much quote handling, it is easy to write a special-purpose parser that copes with missing trailing columns, like so:
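using DataFrames
using DataStructures   # provides OrderedDict

function bespokeread(s)
    # The first line holds the column names
    headers = split(strip(readline(s)), ',')
    ncols = length(headers)
    data = [String[] for i = 1:ncols]   # one vector per column
    while !eof(s)
        newline = split(strip(readline(s)), ',')
        # Pad short rows with empty strings for the missing trailing columns
        length(newline) < ncols && append!(newline, ["" for i = 1:ncols-length(newline)])
        for i = 1:ncols
            push!(data[i], newline[i])
        end
    end
    return DataFrame(; OrderedDict(Symbol(headers[i]) => data[i] for i = 1:ncols)...)
end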
Then, with f an open IO stream on a file containing:
Col1,Col2,Col3,Col4
item1,item2,,item4
item5
Would give:
julia> df = bespokeread(f)
2×4 DataFrames.DataFrame
│ Row │ Col1    │ Col2    │ Col3 │ Col4    │
├─────┼─────────┼─────────┼──────┼─────────┤
│ 1   │ "item1" │ "item2" │ ""   │ "item4" │
│ 2   │ "item5" │ ""      │ ""   │ ""      │
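If you would rather keep using an off-the-shelf reader, another option is to do in code what the question does by hand, namely append the missing commas before parsing. Below is a minimal sketch of that idea using only Base Julia. The name pad_short_rows is hypothetical (it is not part of DataFrames or any package), and the sketch assumes no quoted fields that contain the delimiter or a newline.

```julia
# Hypothetical helper, not part of any package: pad short rows with trailing
# delimiters so every row has as many fields as the header row.
# Assumption: no quoted fields containing the delimiter or a newline.
function pad_short_rows(text::AbstractString; delim::Char=',')
    lines = split(chomp(text), '\n')
    # Number of fields in the header = number of delimiters + 1
    ncols = count(c -> c == delim, lines[1]) + 1
    padded = map(lines) do raw
        line = String(rstrip(raw, '\r'))          # tolerate Windows line endings
        nfields = count(c -> c == delim, line) + 1
        missing_fields = ncols - nfields
        # Append one delimiter per missing trailing field; leave full rows alone
        missing_fields > 0 ? line * repeat(string(delim), missing_fields) : line
    end
    return join(padded, '\n') * "\n"
end

csv = "Col1,Col2,Col3,Col4\nitem1,item2,,item4\nitem5\n"
fixed = pad_short_rows(csv)
print(fixed)
```

The padded text could then be written back to disk or wrapped in an IOBuffer and handed to whichever CSV reader you use. Rows that already have the full number of fields pass through unchanged.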