I'm effectively trying to read the following file into a DataFrame and write it out again, unchanged.
F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""
I would have expected pandas.read_csv to interpret all fields in the above as strings (and thus dtype: object), except for F3.
However, F4 is being treated as a numeric field and given dtype: int64. This means that writing the data back out with quoting=csv.QUOTE_NONNUMERIC loses the quotes around F4.
pd.read_csv(test, quotechar='"', sep=',')
Out[90]:
F1 F2 F3 F4 F5
0 blah blah 123 123 123.0
1 blue blue 456 456 NaN
pd.read_csv("test.txt", quotechar='"', sep=',').dtypes
Out[91]:
F1 object
F2 object
F3 int64
F4 int64
F5 float64
dtype: object
This also creates a further problem: if any of the F5 values are an empty string (i.e. ""), F5 is instead treated as float64, with a NaN value for the empty strings. Those values will be written back as "", but all the actual values will now have a .0 appended, as they're considered floats.
I've played around with various options for the quoting parameter. quoting=csv.QUOTE_NONE is the only one which treats F4 and F5 as strings, but then I obviously end up with the actual quotation marks embedded in the data. I would have expected quoting=csv.QUOTE_NONNUMERIC to do what I want, but it treats all of F3, F4 and F5 as float64.
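To make that last point concrete, here is a small sketch of the read-side behaviour, using an in-memory copy of the file above in place of test.txt (the QUOTE_NONNUMERIC dialect tells the parser to convert every unquoted field to float, and pandas' usual inference then floats the quoted numeric strings as well):

```python
import csv
import io

import pandas as pd

data = '''F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""
'''

df = pd.read_csv(io.StringIO(data), quoting=csv.QUOTE_NONNUMERIC)
print(df.dtypes)  # F3 is now float64 (and, per the question, F4 and F5 as well)
```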
I feel like I'm maybe missing something fundamental here about how dtypes are determined?
The only workaround I can think of is to read the header and first data line of every file, parse it for quoted fields, and build an explicit dtype dict on the fly (or read the file first with quoting=csv.QUOTE_NONE and do the same thing based on which columns contain quote marks), then read the file again with the explicit dtypes. This seems like a long-winded (and slow!) alternative, so I'm hoping someone can point out something I've missed in the read_csv documentation. Thanks.
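For what it's worth, that detection pass can be sketched like this (a rough illustration only, assuming the quoting convention of the first data row holds for the rest of the file; an in-memory buffer stands in for the real file):

```python
import csv
import io

import pandas as pd

data = '''F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""
'''

buf = io.StringIO(data)
# Read the header normally, then the first data row with QUOTE_NONE
# so the raw quote characters are preserved in the field text.
header = next(csv.reader(buf))
first_row = next(csv.reader(buf, quoting=csv.QUOTE_NONE))

# Any field whose raw text starts with a quote was a quoted (string) field.
dtypes = {name: str for name, raw in zip(header, first_row)
          if raw.startswith('"')}

# Second pass: read the whole file with the explicit dtypes.
df = pd.read_csv(io.StringIO(data), dtype=dtypes)
print(df.dtypes)  # F1, F2, F4, F5 are object; F3 stays int64
```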
Extra detail in case anything is relevant: I pass header=nnn when I call read_csv.
What you want to achieve is unfortunately not possible.
The CSV quotes define the fields, not the content nor the type.
When the parser sees "123", "12,3" or "ab,c", it uses the quotes to keep the comma as part of the field, not to decide whether the content is numeric or anything else. The quotes themselves are not part of the content.
As @user2357112supportsMonica nicely pointed out, this is important, for example, for handling European decimal separators.
So, back to your question: "123" and 123 are exactly the same in terms of content, and you cannot extract any type information from them.
The CSV format has no notion of type, and the standard library csv module only returns strings (after optionally stripping the quotes). pandas' read_csv processes the CSV file the same way: after splitting into fields, it tries to guess a type for each column. So the quotes are normally removed around numeric data to allow it to be converted to a numeric type.
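You can see this with the standard library alone: both of the rows below parse to identical string fields, because the quotes only delimit, they carry no type information.

```python
import csv
import io

# One row quotes the number, the other quotes the text: the parsed
# fields are identical either way.
rows = list(csv.reader(io.StringIO('"123",abc\n123,"abc"\n')))
print(rows)  # [['123', 'abc'], ['123', 'abc']]
```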
But you can declare the expected types:
df = pd.read_csv(test, dtype={'F4': str, 'F5': str})
As expected, df is then:
F1 F2 F3 F4 F5
0 blah blah 123 123 123
1 blue blue 456 456 NaN
and df.dtypes:
F1 object
F2 object
F3 int64
F4 object
F5 object
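With those dtypes in place, the round trip should also come out the way you want: writing back with quoting=csv.QUOTE_NONNUMERIC quotes the object columns (F4, F5) while leaving the genuinely numeric F3 unquoted. A sketch, again with an in-memory buffer standing in for the real file:

```python
import csv
import io

import pandas as pd

data = '''F1,F2,F3,F4,F5
"blah","blah",123,"123","123"
"blue","blue",456,"456",""
'''

df = pd.read_csv(io.StringIO(data), dtype={'F4': str, 'F5': str})

# Object columns are quoted on the way out; int64 F3 is not.
out = df.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)
print(out)
```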