
pandas csv to object list is slow

I have a data file like the following (simplified, I have more columns):

timestamp frame_idx gaze_pos_x gaze_pos_y gaze_dir_x gaze_dir_y gaze_dir_z
0 2269.17 45 893.314 500.136 0.165454 -0.0222454 0.985967
1 2274.17 45 896.61 502.564 0.176397 -0.0098666 0.98427
2 2279.17 46 900.592 499.049 0.189087 -0.018215 0.981791
3 2284.17 46 906.321 478.184 0.18891 -0.0307506 0.981513
4 2289.17 46 893.465 502.793 0.175493 -0.0210113 0.984257
5 2294.17 46 898.629 497.182 0.190142 -0.0151722 0.981639
6 2299.3 46 893.554 496.782 0.183007 -0.0150504 0.982996
7 2304.3 46 905.338 482.343 0.188236 -0.0249608 0.981807
8 2309.3 46 897.44 495.476 0.187434 -0.0199951 0.982074
9 2424.3 48 893.358 495.474 0.171512 -0.0198278 0.984982

And an object like this (again simplified):

class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D

where gaze2D is a numpy array containing [gaze_pos_x, gaze_pos_y] and gaze_dir3D is a numpy array containing [gaze_dir_x, gaze_dir_y, gaze_dir_z].
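As a quick illustration, a single Gaze object built by hand from the first data row would look like this (the class definition is repeated only to keep the snippet self-contained):

```python
import numpy as np

class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D

# values taken from the first data row above
g = Gaze(2269.17, 45,
         np.array([893.314, 500.136]),
         np.array([0.165454, -0.0222454, 0.985967]))
print(g.gaze2D.shape, g.gaze_dir3D.shape)  # (2,) (3,)
```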

I want to efficiently load in the data file and make one Gaze object per row. I have implemented the below, but this is very slow:

import pandas as pd
from collections import defaultdict

def readDataFromFile(fileName):
    gazes   = []
    data    = pd.read_csv(str(fileName), delimiter='\t', index_col=False,
                          dtype=defaultdict(lambda: float, frame_idx=int))
    allCols = tuple([c for c in data.columns if col in c]
                    for col in ('gaze_pos', 'gaze_dir'))
    # allCols -> ([gaze_pos_x, gaze_pos_y],[gaze_dir_x, gaze_dir_y, gaze_dir_z]), a list can be empty if a set of columns is missing (gaze_dir is optional)

    # run through all rows
    for _, row in data.iterrows():
        frame_idx = int(row['frame_idx'])  # must cast to int as pd.Series seems to lose typing of dataframe.... :s
        ts        = row['timestamp']

        # get all values (None if columns not present)
        # again need to cast to float despite all items in the series being a float, because the dtype of the series is object... :s
        args = tuple(row[c].astype('float').to_numpy() if c else None for c in allCols)
        gazes.append(Gaze(ts, frame_idx, *args))
    return gazes

As said, this is very slow: the row iteration takes forever, making it prohibitively slow for my use case. Is there a more efficient way of doing this? A similar read-in function using csv.DictReader is a little faster, but still far too slow.
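For reference, a common intermediate speed-up, short of full vectorization, is to swap `iterrows()` for `itertuples()`, which skips building a `pd.Series` per row. A minimal sketch with hardcoded column names (not the full optional-column handling above), using two inlined rows of the sample data:

```python
import io
import numpy as np
import pandas as pd

# two rows of the sample data, inline so the sketch is self-contained
data = pd.read_csv(io.StringIO(
    "timestamp\tframe_idx\tgaze_pos_x\tgaze_pos_y\n"
    "2269.17\t45\t893.314\t500.136\n"
    "2274.17\t45\t896.61\t502.564\n"), delimiter='\t')

rows = []
# itertuples() yields lightweight namedtuples instead of pd.Series objects
for row in data.itertuples(index=False):
    rows.append((row.timestamp, int(row.frame_idx),
                 np.array([row.gaze_pos_x, row.gaze_pos_y])))
```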

asked Nov 14 '25 by Diederick C. Niehorster

1 Answer

Posting the question (and being implored to vectorize) gave me new inspiration. Here's a fast solution:

import pandas as pd
from collections import defaultdict

def readDataFromFile(fileName):
    df = pd.read_csv(str(fileName), delimiter='\t', index_col=False,
                     dtype=defaultdict(lambda: float, frame_idx=int))

    # group columns into numpy arrays, insert None if missing
    cols = ('gaze_pos','gaze_dir')
    allCols = tuple([c for c in df.columns if col in c] for col in cols)
    for c,ac in zip(cols,allCols):
        if ac:
            df[c] = [x for x in df[ac].values]  # make list of numpy arrays
        else:
            df[c] = None

    # clean up so we can assign into gaze objects directly
    lookup = {'timestamp':'ts'} | dict(zip(cols, ['gaze2D','gaze_dir3D']))  # dict union needs Python 3.9+
    df = df.drop(columns=[c for c in df.columns if c not in lookup.keys() and c!='frame_idx'])
    df = df.rename(columns=lookup)

    # make the gaze objects
    gazes = [Gaze(**kwargs) for kwargs in df.to_dict(orient='records')]

    return gazes
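To sanity-check the grouping trick at the heart of this (packing several float columns into one object column holding a per-row numpy array), here's a standalone two-row example; note the full function above also relies on dict union (`|`), which needs Python 3.9+:

```python
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(
    "timestamp\tframe_idx\tgaze_pos_x\tgaze_pos_y\n"
    "2269.17\t45\t893.314\t500.136\n"
    "2274.17\t45\t896.61\t502.564\n"), delimiter='\t')

pos_cols = [c for c in df.columns if 'gaze_pos' in c]
df['gaze2D'] = list(df[pos_cols].to_numpy())   # one np.ndarray per row
records = df[['timestamp', 'frame_idx', 'gaze2D']].to_dict(orient='records')
```

Each dict in `records` can then be splatted straight into a constructor, which is exactly what the list comprehension over `Gaze(**kwargs)` does.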
answered Nov 17 '25 by Diederick C. Niehorster

