Efficiently extract sheet names and column names from a large .xlsx with Python 3

What are the Python 3 options to efficiently (in terms of performance and memory) extract the sheet names and, for a given sheet, the column names from a very large .xlsx file?

I've tried using pandas:

For sheet names using pd.ExcelFile:

    import pandas as pd
    xl = pd.ExcelFile(filename)
    xl.sheet_names

For column names using pd.ExcelFile:

    xl = pd.ExcelFile(filename)
    df = xl.parse(sheetname, nrows=2, **kwargs)
    df.columns

For column names using pd.read_excel, with and without nrows (pandas 0.23 or later):

    df = pd.read_excel(io=filename, sheet_name=sheetname, nrows=2)
    df.columns

However, both pd.ExcelFile and pd.read_excel seem to read the entire .xlsx into memory and are therefore slow.

Thanks a lot!

asked Sep 03 '25 04:09 by elke

2 Answers

Here is the easiest way I can share with you:

    # read the sheet file
    import pandas as pd
    my_sheets = pd.ExcelFile('sheet_filename.xlsx')
    my_sheets.sheet_names

answered Sep 05 '25 00:09 by Jade Cacho

According to this SO question, pandas does not support reading Excel files in chunks (see this issue on GitHub), and using nrows will still read the whole file into memory first.

Possible solutions:

  • Convert the sheet to CSV and read that in chunks.
  • Use something other than pandas. See this page for a list of alternative libraries.
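
As one possible illustration of the second option, here is a minimal sketch that uses only the standard library. It relies on the fact that an .xlsx file is a zip archive: the sheet names live in xl/workbook.xml, and the first row of a sheet can be streamed from its XML part, stopping before the rest is decompressed. The helper names (sheet_names, first_row, _shared_strings) are made up for this example. The sketch assumes a workbook with the usual (transitional) OOXML namespaces, assumes the header is the first row stored in the sheet, and returns only the cells that are actually present in that row.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefixes used by ordinary (transitional) .xlsx files.
M = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"
REL_ID = "{http://schemas.openxmlformats.org/officeDocument/2006/relationships}id"
PKG_REL = "{http://schemas.openxmlformats.org/package/2006/relationships}Relationship"


def sheet_names(path):
    """Return the sheet names by parsing only xl/workbook.xml."""
    with zipfile.ZipFile(path) as zf:
        root = ET.fromstring(zf.read("xl/workbook.xml"))
    return [s.get("name") for s in root.iter(M + "sheet")]


def _shared_strings(zf, wanted):
    """Stream xl/sharedStrings.xml, keeping only the indices in `wanted`."""
    found = {}
    if not wanted or "xl/sharedStrings.xml" not in zf.namelist():
        return found
    last = max(wanted)
    index = 0
    with zf.open("xl/sharedStrings.xml") as fh:
        for _, elem in ET.iterparse(fh):  # yields "end" events only
            if elem.tag != M + "si":
                continue
            if index in wanted:
                # A string item is either one <t> or several rich-text runs.
                found[index] = "".join(t.text or "" for t in elem.iter(M + "t"))
            if index == last:
                break  # nothing further is needed from the table
            index += 1
            elem.clear()
    return found


def first_row(path, sheet):
    """Return the values of the first stored row of `sheet` as strings."""
    with zipfile.ZipFile(path) as zf:
        wb = ET.fromstring(zf.read("xl/workbook.xml"))
        rels = ET.fromstring(zf.read("xl/_rels/workbook.xml.rels"))
        targets = {r.get("Id"): r.get("Target") for r in rels.iter(PKG_REL)}
        rid = next(s.get(REL_ID) for s in wb.iter(M + "sheet")
                   if s.get("name") == sheet)
        target = targets[rid]
        # Targets are normally relative to xl/, but may be absolute.
        member = target.lstrip("/") if target.startswith("/") else "xl/" + target

        cells = []  # (cell type, raw value) pairs
        with zf.open(member) as fh:
            for _, elem in ET.iterparse(fh):
                if elem.tag != M + "row":
                    continue
                for c in elem.iter(M + "c"):
                    if c.get("t") == "inlineStr":
                        text = "".join(t.text or "" for t in c.iter(M + "t"))
                        cells.append(("inline", text))
                    else:
                        v = c.find(M + "v")
                        cells.append((c.get("t"), v.text if v is not None else None))
                break  # stop after the first row; the rest is never read

        wanted = {int(v) for t, v in cells if t == "s" and v is not None}
        strings = _shared_strings(zf, wanted)

    return [strings[int(v)] if t == "s" and v is not None else v
            for t, v in cells]
```

With a hypothetical file, sheet_names("big.xlsx") lists the sheets and first_row("big.xlsx", "Sheet1") returns the header cells. Because the worksheet part is read sequentially and abandoned after the first row, the cost should depend mostly on where the header sits rather than on the size of the sheet, though the shared-string table may still have to be scanned up to the highest index the header uses.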

answered Sep 05 '25 00:09 by Qusai Alothman