 

How to load multiple files with custom process for each of them?

I have several CSV files with the same structure:

  • data_product_1.csv
  • data_product_2.csv
  • data_product_3.csv
  • etc.

It is clear to me that to obtain a dataframe with all the data concatenated together with Polars, I can do something like:

import polars as pl
df = pl.read_csv("data_*.csv")

What I would like to do is to add an extra column to the final dataframe containing the name of the product, e.g.

data        value  product_code
2000-01-01  1      product_1
2000-01-02  2      product_1
2000-01-01  3      product_2
2000-01-02  4      product_2
2000-01-01  5      product_3

I'm aware I can load the files one by one, add the extra column, and concatenate them afterwards, but I was wondering if I'm missing some other way to take advantage of Polars' performance here.

asked Oct 15 '25 by Roberto Landi
1 Answer

It seems you want the filename added as a column, e.g. as DuckDB does with filename = true:

duckdb.sql("""
from read_csv_auto('data_*.csv', filename = true)
""")
┌────────────┬───────┬────────────────────┐
│    data    │ value │      filename      │
│    date    │ int64 │      varchar       │
├────────────┼───────┼────────────────────┤
│ 2000-01-01 │     1 │ data_product_1.csv │
│ 2000-01-02 │     2 │ data_product_1.csv │
│ 2000-01-01 │     3 │ data_product_2.csv │
│ 2000-01-02 │     4 │ data_product_2.csv │
│ 2000-01-01 │     4 │ data_product_3.csv │
│ 2000-01-02 │     5 │ data_product_3.csv │
└────────────┴───────┴────────────────────┘
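If you want that result back in Polars rather than as a DuckDB relation, the relation can be converted directly (a small sketch; assumes a recent duckdb build with Polars installed):

# run the same glob query and convert the DuckDB relation to a Polars DataFrame
import duckdb

df = duckdb.sql(
    "from read_csv_auto('data_*.csv', filename = true)"
).pl()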

This has been requested a few times but is yet to be added to Polars: https://github.com/pola-rs/polars/issues/9096

You can replace read_csv with scan_csv, which delays reading the file and returns a LazyFrame instead.

The frames can be combined with concat, which (by default) evaluates the LazyFrames in parallel when the result is collected.

import polars as pl
from pathlib import Path

# build one LazyFrame per file, tagging each row with its source filename
csvs = [
    pl.scan_csv(f).with_columns(product_code=pl.lit(f.name))
    for f in Path().glob("data_*.csv")
]

# the inputs are read in parallel when collected
df = pl.concat(csvs).collect()
shape: (6, 3)
┌────────────┬───────┬────────────────────┐
│ data       ┆ value ┆ product_code       │
│ ---        ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ str                │
╞════════════╪═══════╪════════════════════╡
│ 2000-01-01 ┆ 1     ┆ data_product_1.csv │
│ 2000-01-02 ┆ 2     ┆ data_product_1.csv │
│ 2000-01-01 ┆ 3     ┆ data_product_2.csv │
│ 2000-01-02 ┆ 4     ┆ data_product_2.csv │
│ 2000-01-01 ┆ 4     ┆ data_product_3.csv │
│ 2000-01-02 ┆ 5     ┆ data_product_3.csv │
└────────────┴───────┴────────────────────┘
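If you want product_1 rather than the full filename in the column, you can derive it from the path stem in plain Python before passing it to pl.lit (a small sketch; the "data_" prefix is assumed from the filenames in the question):

import polars as pl
from pathlib import Path

# strip the "data_" prefix and ".csv" suffix so the column holds
# "product_1", "product_2", ... instead of the raw filename
csvs = [
    pl.scan_csv(f).with_columns(
        product_code=pl.lit(f.stem.removeprefix("data_"))
    )
    for f in Path().glob("data_*.csv")
]

df = pl.concat(csvs).collect()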
answered Oct 18 '25 by jqurious
