Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compress a CSV file written to a StringIO Buffer in Python3

I'm parsing text from pdf files into rows of ordered char metadata; I need to serialize these files to cloud storage, which is all working fine, however due to their size I'd also like to gzip these files but I've run into some issues there.

Here is my code:

import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']

output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)

stored_format = zlib.compress(output_buffer)

This reads each row into the io.StringIO Buffer successfully, but gzip/zlib seem to only work with bytes-like objects like io.BytesIO so the last line errors; I cannot create read a csv into a BytesIO Buffer because DictWriter/Writer error unless io.StringIO() is used.

Thank you for your help!

like image 992
Quantumpencil Avatar asked Sep 02 '25 04:09

Quantumpencil


1 Answers

I figured this out and wanted to show my answer for anyone who runs into this:

The issue is that zlib.compress is expecting a Bytes-like object; this actually doesn't mean either StringIO or BytesIO as both of these are "file-like" objects which implment read() and your normal unix file handles.

All you have to do to fix this is use StringIO() to write the csv file to and then call get the string from the StringIO() object and encode it into a bytestring; it can then be compressed by zlib.

import io
import csv
import zlib

# This data file is sent over Flask
page_position_data = pdf_parse_page_layouts(data_file)
field_order = ['char', 'position', 'page']

output_buffer = io.StringIO()
writer = csv.DictWriter(output_buffer, field_order)
writer.writeheader()
for page, rows in page_position_data.items():
    for text_char_data_row in rows:
        writer.writerow(text_char_data_row)

encoded = output_buffer.getvalue().encode()
stored_format = zlib.compress(encoded)
like image 169
Quantumpencil Avatar answered Sep 05 '25 01:09

Quantumpencil