In Polars, how can one specify a single dtype for all columns in read_csv?
According to the docs, the schema_overrides argument to read_csv can take either a mapping (dict) in the form of {'column_name': dtype}, or a list of dtypes, one for each column.
However, it is not clear how to specify "I want all columns to be a single dtype".
If, for example, you wanted all columns to be String and you knew the total number of columns, you could do:
pl.read_csv('sample.csv', schema_overrides=[pl.String]*number_of_columns)
However, this doesn't work if you don't know the total number of columns. In Pandas, you could do something like:
pd.read_csv('sample.csv', dtype=str)
But this doesn't work in Polars.
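(One workaround I considered, sketched here as an assumption rather than something from the docs: read only the header first to learn the column count, then build the dtype list from it. This assumes sample.csv exists and relies only on read_csv's documented n_rows parameter.)

import polars as pl

# Pass 1: read only the header (no data rows) to discover the columns.
header = pl.read_csv('sample.csv', n_rows=0)
number_of_columns = len(header.columns)

# Pass 2: read the full file with one dtype per column.
df = pl.read_csv('sample.csv', schema_overrides=[pl.String] * number_of_columns)

But a two-pass read is clumsy, and it still doesn't answer how to do this directly.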
Reading all columns of a CSV as any type other than pl.String will likely fail or leave you with a lot of null values, since not every value can be parsed as that type. We can use expressions to declare how we want to deal with those null values.
If you read a CSV with infer_schema_length=0, Polars skips schema inference and reads all columns as pl.String, since String is a supertype of all Polars types.
Once everything is read as String, we can use expressions to cast all columns.
# cast every column to Int32; strict=False yields null where parsing fails
(pl.read_csv("test.csv", infer_schema_length=0)
 .with_columns(pl.all().cast(pl.Int32, strict=False)))
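For instance, building on the same pattern (still assuming a file test.csv), the nulls produced by the non-strict cast can then be filled with a default value:

(pl.read_csv("test.csv", infer_schema_length=0)
 .with_columns(
     pl.all()
     .cast(pl.Int32, strict=False)  # unparseable values become null
     .fill_null(0)                  # then replace those nulls with 0
 ))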
Update: infer_schema=False was added in 1.2.0 as a more user-friendly name for this feature.
pl.read_csv("test.csv", infer_schema=False) # read all as pl.String
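As a quick check (again assuming a test.csv), every column comes back as String:

import polars as pl

df = pl.read_csv("test.csv", infer_schema=False)  # skip schema inference entirely
print(df.dtypes)  # every entry is String, regardless of the file's contents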