
Polars: how to turn a column of type list[list[...]] into a numpy ndarray

I know I can turn a normal polars series into a numpy array via .to_numpy().

import polars as pl

s = pl.Series("a", [1,2,3])

s.to_numpy()
# array([1, 2, 3])

However, that does not work with a list type. What would be the way to turn such a construct into a 2-D array?

More generally, is there a way to turn a series of list[list[whatever]] into a 3-D array, and so on?

s = pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]])

s.to_numpy()  
# exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)

Desired output would be:

array([[1, 1, 1],
       [1, 2, 3],
       [1, 0, 1]])

Or one step further

s = pl.Series("a", [[[1,1],[1,2]],[[1,1],[1,1]]])

s.to_numpy()  
# exceptions.ComputeError: 'to_numpy' not supported for dtype: List(Int64)

Desired output would be:

array([[[1, 1],
        [1, 2]],

       [[1, 1],
        [1, 1]]])
asked Sep 14 '25 21:09 by J.N.


1 Answer

You could explode the series and then reshape the resulting numpy array. Given that the ComputeError says to_numpy is unsupported for the List dtype, that is probably the only way in polars right now. A List column can hold lists of different lengths from row to row, which would break any conversion to a rectangular array, so it makes sense that this is not supported.

That said, if you know the lists have the same length in every row, the operation can be written generally for any depth of list nesting. It just involves recording the series length after each explode and then deriving the dimensions from those lengths:

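from itertools import pairwise  # Python 3.10+

import polars as pl

def multidimensional_to_numpy(s):
    # Track the series length before and after each explode.
    dimensions = [1, len(s)]
    while s.dtype == pl.List:
        s = s.explode()
        dimensions.append(len(s))
    # The ratio of consecutive lengths is the size of each axis.
    dimensions = [after // before for before, after in pairwise(dimensions)]
    # Only valid when every list at a given nesting level has the same length.
    return s.to_numpy().reshape(dimensions)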

multidimensional_to_numpy(pl.Series("a", [1,2,3]))

array([1, 2, 3], dtype=int64)
multidimensional_to_numpy(pl.Series("a", [[1,1,1],[1,2,3],[1,0,1]]))

array([[1, 1, 1],
       [1, 2, 3],
       [1, 0, 1]], dtype=int64)
multidimensional_to_numpy(pl.Series("a", [[[1,1],[1,2]], [[1,1],[1,1]]]))

array([[[1, 1],
        [1, 2]],

       [[1, 1],
        [1, 1]]], dtype=int64)

Note: with the soon-to-be-released Array dtype, which guarantees same-length arrays throughout the column (and with the current arr namespace being renamed to list), this answer could be improved in due time (maybe direct to_numpy support there?). In particular, the dimension calculation above should simplify to tracking dtype.width for each inner Array dtype.

answered Sep 17 '25 12:09 by Wayoshi