 

How to load multiple files with custom process for each of them?

I have several CSV files with the same structure:

  • data_product_1.csv
  • data_product_2.csv
  • data_product_3.csv
  • etc.

It is clear to me that to obtain a dataframe with all the data concatenated together with Polars, I can do something like:

import polars as pl
df = pl.read_csv("data_*.csv")

What I would like to do is to add an extra column to the final dataframe containing the name of the product, e.g.

data        value  product_code
2000-01-01  1      product_1
2000-01-02  2      product_1
2000-01-01  3      product_2
2000-01-02  4      product_2
2000-01-01  5      product_3

I'm aware I can load the files one by one, add the extra column, and concatenate them afterwards, but I was wondering if I'm missing some other way to take advantage of Polars' performance here.

asked Oct 15 '25 by Roberto Landi
1 Answer

It seems you want the filename added as a column, e.g. as DuckDB does with filename = true:

duckdb.sql("""
from read_csv_auto('data_*.csv', filename = true)
""")
┌────────────┬───────┬────────────────────┐
│    data    │ value │      filename      │
│    date    │ int64 │      varchar       │
├────────────┼───────┼────────────────────┤
│ 2000-01-01 │     1 │ data_product_1.csv │
│ 2000-01-02 │     2 │ data_product_1.csv │
│ 2000-01-01 │     3 │ data_product_2.csv │
│ 2000-01-02 │     4 │ data_product_2.csv │
│ 2000-01-01 │     4 │ data_product_3.csv │
│ 2000-01-02 │     5 │ data_product_3.csv │
└────────────┴───────┴────────────────────┘
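If you want that result back in Polars rather than as a DuckDB relation, the relation can be converted directly (a small sketch; assumes a recent duckdb build with Polars installed):

# run the same glob query and convert the DuckDB relation to a Polars DataFrame
import duckdb

df = duckdb.sql(
    "from read_csv_auto('data_*.csv', filename = true)"
).pl()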

This has been requested a few times but is yet to be added to Polars: https://github.com/pola-rs/polars/issues/9096

You can replace read_csv with scan_csv, which delays reading the file and returns a LazyFrame instead.

The frames can be combined with concat, which (by default) evaluates the LazyFrames in parallel when the result is collected.

import polars as pl
from pathlib import Path

# build one LazyFrame per file, tagging each row with its source filename
csvs = [
    pl.scan_csv(f).with_columns(product_code=pl.lit(f.name))
    for f in Path().glob("data_*.csv")
]

# the inputs are read in parallel when collected
df = pl.concat(csvs).collect()
shape: (6, 3)
┌────────────┬───────┬────────────────────┐
│ data       ┆ value ┆ product_code       │
│ ---        ┆ ---   ┆ ---                │
│ str        ┆ i64   ┆ str                │
╞════════════╪═══════╪════════════════════╡
│ 2000-01-01 ┆ 1     ┆ data_product_1.csv │
│ 2000-01-02 ┆ 2     ┆ data_product_1.csv │
│ 2000-01-01 ┆ 3     ┆ data_product_2.csv │
│ 2000-01-02 ┆ 4     ┆ data_product_2.csv │
│ 2000-01-01 ┆ 4     ┆ data_product_3.csv │
│ 2000-01-02 ┆ 5     ┆ data_product_3.csv │
└────────────┴───────┴────────────────────┘
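If you want product_1 rather than the full filename in the column, you can derive it from the path stem in plain Python before passing it to pl.lit (a small sketch; the "data_" prefix is assumed from the filenames in the question):

import polars as pl
from pathlib import Path

# strip the "data_" prefix and ".csv" suffix so the column holds
# "product_1", "product_2", ... instead of the raw filename
csvs = [
    pl.scan_csv(f).with_columns(
        product_code=pl.lit(f.stem.removeprefix("data_"))
    )
    for f in Path().glob("data_*.csv")
]

df = pl.concat(csvs).collect()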
answered Oct 18 '25 by jqurious
