
Which encoding is in use by csv.DictReader when reading csv?

I have a CSV file saved as UTF-8.

It contains non-ASCII characters (umlauts).

I am reading the file using:

csv.DictReader(<file>,delimiter=<delimiter>).

My questions are:

  1. In which encoding is the file being read?
  2. I noticed that in order to treat the strings as UTF-8 text I need to call:

    str.decode('utf-8')
    

    Is there a better approach than reading the file in one encoding and then converting to another, i.e. UTF-8?

[Python version: 2.7]

Maoritzio asked Jan 25 '26 08:01


1 Answer

In Python 2.7, the csv module does not apply any decoding. It expects a file opened in binary mode and returns byte strings (str), so the file is not read "in" any encoding; the bytes are passed through unchanged.
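To make the byte string versus text distinction concrete, here is a small sketch. The sample value is made up for illustration (a surname containing a u-umlaut, encoded as UTF-8); it stands in for one cell as csv would return it.

```python
# Illustrative value only: "Mueller" spelled with a u-umlaut, UTF-8 encoded.
raw = b"M\xc3\xbcller"       # undecoded bytes, as the 2.7 csv module returns them
text = raw.decode("utf-8")   # the explicit decoding step from the question

print(len(raw))    # 7: the umlaut occupies two bytes in UTF-8
print(len(text))   # 6: one code point per character after decoding
```

Until the decode step runs, string operations such as len() or slicing work on bytes, not on characters.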

Use unicodecsv (https://github.com/jdunck/python-unicodecsv), which decodes on the fly.

Use it like this:

import unicodecsv

with open("myfile.csv", 'rb') as my_file:  # binary mode: unicodecsv does the decoding
    r = unicodecsv.DictReader(my_file, encoding='utf-8')

Each row that r yields is a dict of unicode strings. It's important that the source file is opened in binary mode, and that r is consumed inside the with block, while the file is still open.
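If adding a dependency is not an option, the same effect can be approximated with the standard csv module by decoding each row by hand. This is only a sketch under the assumption that the data is UTF-8; decode_row is a hypothetical helper name, not something provided by csv or unicodecsv.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of row with every byte-string key and value decoded.

    decode_row is a hypothetical helper for illustration, not a csv API.
    """
    def to_text(item):
        # On Python 2.7, bytes is an alias for str, which is what csv returns.
        # Anything else (e.g. None for a missing column) is passed through.
        return item.decode(encoding) if isinstance(item, bytes) else item

    return {to_text(key): to_text(value) for key, value in row.items()}


# Intended use on Python 2.7 (file opened in binary mode):
#
#     import csv
#     with open("myfile.csv", "rb") as my_file:
#         for row in csv.DictReader(my_file):
#             row = decode_row(row)
```

The trade-off is an extra pass over every row, which is essentially what unicodecsv does for you internally.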

Alastair McCormack answered Jan 27 '26 20:01