Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Building a dataframe in C++

I am trying to build a DataFrame in C++. I'm facing some problems, such as dealing with variable data type.

I am thinking in a DataFrame inspired by Pandas DataFrame (from python). So my design idea is:

  1. Build an object 'Series' which is a vector of a fixed data type.
  2. Build an object 'DataFrame' which will store a list of Series (this list can be variable).

The item 1. is just a regular vector. So, for instance, the user would call

Series.fill({1,2,3,4}) and it would store the vector {1,2,3,4} in some attribute of Series, say Series.data.

Problem 1. How I would make a class that understands {1,2,3,4} as a vector of 4 integers. Is it possible?

The next problem is:

About 2., I can see the DataFrame as a matrix of n columns and m rows, but the columns can have different data types.

I tried to design this as a vector of n pointers, where each pointer would point to a vector of dimension m with different data types.

I tried to do something like

vector<void*> columns(10)

and fill it with something like

columns[0] = (int*) malloc(8*sizeof(int))

But this does not work, if I try to fill the vector, like

(*columns[0])[0] = 5;

I get an error

::value_type {aka void*}’ is not a pointer-to-object type (int *) (*a[0])[0] = 5;

How can I do it properly? I still have other questions like, how would I append an undetermined number of Series into a DataFrame, but for now, just building a matrix with columns with different data types is a great start.

I know that I must keep track of the types of pointers inside my void vector but I can create a parallel list with all data types and make this an attribute of my class DataFrame.

like image 661
Lucas Avatar asked Nov 01 '25 09:11

Lucas


1 Answers

Building a heterogeneous container (which a dataframe is supposed to be) in C++ is more complex than you think, because C++ is statically typed. It means you have to know all the types at compile time. Your approach uses a vector of pointers (there are a few variations of this approach, which I am not going into). This approach is very inefficient, because pointers are pointing to all over the memory and trashing your cache locality. I do not recommend even attempting to implement such a dataframe because there is really no point to it.

Look at this implementation of DataFrame in C++: https://github.com/hosseinmoein/DataFrame. You might be able to just use it as is. Or get insight from it how to implement a true heterogeneous DataFrame. It uses a collection of static vectors in a hash table to implement a true heterogeneous container. It also uses contiguous memory space, so it avoids the pointer effect.

like image 172
hmoein Avatar answered Nov 04 '25 01:11

hmoein