
pandas csv to object list is slow

I have a data file like the following (simplified, I have more columns):

timestamp frame_idx gaze_pos_x gaze_pos_y gaze_dir_x gaze_dir_y gaze_dir_z
0 2269.17 45 893.314 500.136 0.165454 -0.0222454 0.985967
1 2274.17 45 896.61 502.564 0.176397 -0.0098666 0.98427
2 2279.17 46 900.592 499.049 0.189087 -0.018215 0.981791
3 2284.17 46 906.321 478.184 0.18891 -0.0307506 0.981513
4 2289.17 46 893.465 502.793 0.175493 -0.0210113 0.984257
5 2294.17 46 898.629 497.182 0.190142 -0.0151722 0.981639
6 2299.3 46 893.554 496.782 0.183007 -0.0150504 0.982996
7 2304.3 46 905.338 482.343 0.188236 -0.0249608 0.981807
8 2309.3 46 897.44 495.476 0.187434 -0.0199951 0.982074
9 2424.3 48 893.358 495.474 0.171512 -0.0198278 0.984982

And an object like this (again simplified):

class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D

where gaze2D is a numpy array containing [gaze_pos_x, gaze_pos_y] and gaze_dir3D is a numpy array containing [gaze_dir_x, gaze_dir_y, gaze_dir_z].
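As a quick illustration, a single Gaze object built by hand from the first data row would look like this (the class definition is repeated only to keep the snippet self-contained):

```python
import numpy as np

class Gaze:
    def __init__(self, ts, frame_idx, gaze2D, gaze_dir3D=None):
        self.ts = ts
        self.frame_idx = frame_idx
        self.gaze2D = gaze2D
        self.gaze_dir3D = gaze_dir3D

# values taken from the first data row above
g = Gaze(2269.17, 45,
         np.array([893.314, 500.136]),
         np.array([0.165454, -0.0222454, 0.985967]))
print(g.gaze2D.shape, g.gaze_dir3D.shape)  # (2,) (3,)
```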

I want to efficiently load in the data file and make one Gaze object per row. I have implemented the below, but this is very slow:

import pandas as pd
from collections import defaultdict

def readDataFromFile(fileName):
    gazes   = []
    data    = pd.read_csv(str(fileName), delimiter='\t', index_col=False,
                          dtype=defaultdict(lambda: float, frame_idx=int))
    allCols = tuple([c for c in data.columns if col in c]
                    for col in ('gaze_pos', 'gaze_dir'))
    # allCols -> ([gaze_pos_x, gaze_pos_y],[gaze_dir_x, gaze_dir_y, gaze_dir_z]), a list can be empty if a set of columns is missing (gaze_dir is optional)

    # run through all rows
    for _, row in data.iterrows():
        frame_idx = int(row['frame_idx'])  # must cast to int as pd.Series seems to lose typing of dataframe.... :s
        ts        = row['timestamp']

        # get all values (None if columns not present)
        # again need to cast to float despite all items in the series being a float, because the dtype of the series is object... :s
        args = tuple(row[c].astype('float').to_numpy() if c else None for c in allCols)
        gazes.append(Gaze(ts, frame_idx, *args))
    return gazes

As said, this is very slow: the row iteration takes forever, making it prohibitively slow for my use case. Is there a more efficient way of doing this? A similar read-in function using csv.DictReader is a little faster, but still far too slow.
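For reference, a common intermediate speed-up, short of full vectorization, is to swap `iterrows()` for `itertuples()`, which skips building a `pd.Series` per row. A minimal sketch with hardcoded column names (not the full optional-column handling above), using two inlined rows of the sample data:

```python
import io
import numpy as np
import pandas as pd

# two rows of the sample data, inline so the sketch is self-contained
data = pd.read_csv(io.StringIO(
    "timestamp\tframe_idx\tgaze_pos_x\tgaze_pos_y\n"
    "2269.17\t45\t893.314\t500.136\n"
    "2274.17\t45\t896.61\t502.564\n"), delimiter='\t')

rows = []
# itertuples() yields lightweight namedtuples instead of pd.Series objects
for row in data.itertuples(index=False):
    rows.append((row.timestamp, int(row.frame_idx),
                 np.array([row.gaze_pos_x, row.gaze_pos_y])))
```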

asked Nov 14 '25 by Diederick C. Niehorster

1 Answer

Posting the question (and being implored to vectorize) gave me new inspiration. Here's a fast solution:

import pandas as pd
from collections import defaultdict

def readDataFromFile(fileName):
    df = pd.read_csv(str(fileName), delimiter='\t', index_col=False,
                     dtype=defaultdict(lambda: float, frame_idx=int))

    # group columns into numpy arrays, insert None if missing
    cols = ('gaze_pos','gaze_dir')
    allCols = tuple([c for c in df.columns if col in c] for col in cols)
    for c,ac in zip(cols,allCols):
        if ac:
            df[c] = [x for x in df[ac].values]  # make list of numpy arrays
        else:
            df[c] = None

    # clean up so we can assign into gaze objects directly
    lookup = {'timestamp':'ts'} | dict(zip(cols, ['gaze2D','gaze_dir3D']))  # dict union needs Python 3.9+
    df = df.drop(columns=[c for c in df.columns if c not in lookup.keys() and c!='frame_idx'])
    df = df.rename(columns=lookup)

    # make the gaze objects
    gazes = [Gaze(**kwargs) for kwargs in df.to_dict(orient='records')]

    return gazes
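To sanity-check the grouping trick at the heart of this (packing several float columns into one object column holding a per-row numpy array), here's a standalone two-row example; note the full function above also relies on dict union (`|`), which needs Python 3.9+:

```python
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO(
    "timestamp\tframe_idx\tgaze_pos_x\tgaze_pos_y\n"
    "2269.17\t45\t893.314\t500.136\n"
    "2274.17\t45\t896.61\t502.564\n"), delimiter='\t')

pos_cols = [c for c in df.columns if 'gaze_pos' in c]
df['gaze2D'] = list(df[pos_cols].to_numpy())   # one np.ndarray per row
records = df[['timestamp', 'frame_idx', 'gaze2D']].to_dict(orient='records')
```

Each dict in `records` can then be splatted straight into a constructor, which is exactly what the list comprehension over `Gaze(**kwargs)` does.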
answered Nov 17 '25 by Diederick C. Niehorster

