
How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv, and I always needed to pass encoding="latin" for it to read without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the file's encoding, but it still works without issue (which is great, but I don't understand why, or whether it only auto-handles certain encodings). The only options I'm setting are delimiter="|" (with ParseOptions) and auto_dict_encode=True (with ConvertOptions).

How is pyarrow handling different encoding types?

matthewmturner asked Sep 11 '25 00:09


2 Answers

pyarrow currently has no functionality to deal with different encodings, and assumes UTF-8 for string/text data.
The reason it doesn't raise an error is that pyarrow reads any column containing non-UTF-8 data as a "binary" type column instead of a "string" type column.

A small example:

# Write a small file encoded as Latin-1 (not UTF-8)
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])

Reading with pyarrow gives string for the first column (which contains only ASCII characters and is therefore also valid UTF-8), but reads the second column as binary:

>>> from pyarrow import csv 
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary

With pandas you do get an error by default, because pandas has no binary data type and tries to read all text columns as Python strings, decoding them as UTF-8:

>>> import pandas as pd
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte

>>> pd.read_csv("test.csv", encoding="latin")

  col1 col2
0    u    ù
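
The byte 0xf9 named in the error message is how Latin-1 encodes "ù". As a minimal sketch using only the Python standard library (the variable names are illustrative), the following shows why that byte cannot be decoded as UTF-8 but decodes fine as Latin-1, which is the distinction behind both the binary column in pyarrow and the error in pandas:

```python
# "ù" is a single byte in Latin-1.
raw = "ù".encode("latin-1")
print(raw)  # b'\xf9'

# The same byte is not valid UTF-8: 0xf9 is not a legal start byte.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print(valid_utf8)  # False

# Decoding with the encoding the file was written in recovers the text.
decoded = raw.decode("latin-1")
print(decoded)  # ù
```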
joris answered Sep 13 '25 19:09


It's now possible to specify the encoding with pyarrow.csv.read_csv. According to the pyarrow docs for read_csv:

The encoding can be changed using the ReadOptions class.

A minimal example follows:

from pyarrow import csv

# Decode the file as Latin-1 instead of the default UTF-8
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', read_options=options)

From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.

daviewales answered Sep 13 '25 20:09