 

pandas and category replacement

I'm trying to reduce the size of ~300 CSV files (about a billion rows) by replacing lengthy fields with shorter, categorical values.

I'm making use of pandas, and I've iterated through each of the files to build an array containing all of the unique values I'm trying to replace. I can't just use pandas.factorize on each file individually, because I need (for example) '3001958145' to map to the same value in file1.csv as in file244.csv. I've created an array of the replacement values simply by building another array of incremented integers.
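
Roughly, that accumulation step looked like the following sketch (the glob pattern stands in for my real file list):

import glob
import numpy as np
import pandas as pd

# Sketch: gather every unique value across all ~300 files.
alluniques = np.array([], dtype='int64')
for fn in glob.glob('*.csv'):
    col = pd.read_csv(fn, usecols=['col1'])['col1']
    alluniques = np.union1d(alluniques, col.unique())
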

In [1]: toreplace = data['col1'].unique()
Out[1]: array([1000339602, 1000339606, 1000339626, ..., 3001958145, 3001958397,
   3001958547], dtype=int64)

In [2]: replacewith = range(len(toreplace))
Out[2]: [0, 1, 2,...]

Now, how do I go about efficiently swapping in my 'replacewith' variable for each corresponding 'toreplace' value for each of the files I need to iterate through?

Given how capable pandas is at dealing with categories, I assume there has to be a method out there that can accomplish this that I simply can't find. The function I wrote works (it takes a pandas.factorize result rather than the arrangement I described above), but it relies on the replace function and iterates through the series, so it's quite slow.

import numpy as np

def powerreplace(pdseries, factorized):
    # Slow: one full-series replace() scan per unique value.
    uniques = pdseries.unique()
    for i, unique in enumerate(uniques):
        print '%i/%i' % (i, len(uniques))
        pdseries.replace(to_replace=unique,
                         value=np.where(factorized[1] == unique)[0][0],
                         inplace=True)
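
For comparison, I could build the mapping once as a dict and apply it in a single pass with Series.map (a sketch using the toreplace/replacewith arrays above), but I suspect pandas' categorical support offers something more idiomatic:

# Sketch: one vectorized pass instead of one replace() per unique value.
mapping = dict(zip(toreplace, replacewith))
data['col1'] = data['col1'].map(mapping)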

Can anyone recommend a better way to go about doing this?

asked by andrewgcross


1 Answer

This requires at least pandas 0.15.0 (though the .astype syntax is a bit friendlier in 0.16.0, so it's better to use that). Here are the docs for categoricals.

Imports

In [101]: import pandas as pd
In [102]: import string
In [103]: import numpy as np    
In [104]: np.random.seed(1234)
In [105]: pd.set_option('max_rows',10)

Create a sample set of unique values

In [106]: uniques = np.array(list(string.ascii_letters))
In [107]: len(uniques)
Out[107]: 52

Create some data

In [109]: df1 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques)/2+5,size=1000000))})

In [110]: df1.head()
Out[110]: 
   A
0  p
1  t
2  g
3  v
4  m

In [111]: df1.A.nunique()
Out[111]: 31

In [112]: df2 = pd.DataFrame({'A' : uniques.take(np.random.randint(0,len(uniques),size=1000000))})

In [113]: df2.head()
Out[113]: 
   A
0  I
1  j
2  b
3  A
4  m
In [114]: df2.A.nunique()
Out[114]: 52

So we now have two frames that we want to categorize; the first frame happens to have fewer than the full set of categories. This is on purpose; you don't have to know the complete set upfront.

Convert each A column to a Categorical B column

In [116]: df1['B'] = df1['A'].astype('category')

In [118]: i = df1['B'].cat.categories

In [124]: i
Out[124]: Index([u'A', u'B', u'C', u'D', u'E', u'a', u'b', u'c', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], dtype='object')

If we are iteratively processing these frames, we use the first one's categories to start. For each successive frame, we append its symmetric difference with the existing set. This keeps the categories in the same order, so factorizing yields the same numbering scheme.

In [119]: cats = i.tolist() + i.sym_diff(df2['A'].astype('category').cat.categories).tolist()

We have now recovered the original set

In [120]: (np.array(sorted(cats)) == sorted(uniques)).all()
Out[120]: True

Set the next frame's B column to be a Categorical, BUT specify the categories explicitly, so that when it is factorized the same values are used

In [121]: df2['B'] = df2['A'].astype('category',categories=cats)

To prove it, we select the codes (the factorized map) from each frame. The codes match; df2 has one additional code, as 'Z' appears in the second frame but not the first.

In [122]: df1[df1['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[122]: array([30,  0,  5])

In [123]: df2[df2['B'].isin(['A','a','z','Z'])].B.cat.codes.unique()
Out[123]: array([ 0, 30,  5, 51])
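
Putting the pieces together over many files, the whole loop might look like this sketch (the filenames and the 'A' column are assumptions for illustration):

# Sketch: produce consistent codes across ~300 files.
cats = []
for fn in ['file1.csv', 'file2.csv']:  # ..., 'file300.csv'
    df = pd.read_csv(fn)
    new = df['A'].astype('category').cat.categories
    cats = cats + pd.Index(cats).sym_diff(new).tolist()
    df['B'] = df['A'].astype('category', categories=cats)
    # df['B'].cat.codes now uses the same numbering in every file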

You can then simply store the codes in lieu of the object-dtyped data.
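
For instance, a sketch of that round trip, rebuilding the original values from the stored codes with Categorical.from_codes:

# Sketch: keep only the integer codes; reconstruct values on demand.
codes = df2['B'].cat.codes                        # small integer dtype
recovered = pd.Categorical.from_codes(codes.values, cats)
(np.asarray(recovered) == df2['A'].values).all()  # True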

Note that it is actually quite efficient to serialize these to HDF5, as Categoricals are natively stored; see here.
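
For example (a sketch; this assumes PyTables is installed and pandas >= 0.15.2, where HDFStore gained native categorical support):

# Sketch: Categoricals round-trip through HDF5 using format='table'.
df2[['B']].to_hdf('categories.h5', 'df2', format='table', mode='w')
df2_back = pd.read_hdf('categories.h5', 'df2')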

Note that this is a pretty memory-efficient way of storing the data. Looking at the memory usage below: the object-dtype figure in [154] counts only the pointers, so the true cost is actually MUCH higher the longer the strings get, since the string values themselves live on the heap. The figure in [155], by contrast, is ALL of the memory used, roughly an 8x saving here even for single-character strings.

In [153]: df2.dtypes
Out[153]: 
A      object
B    category
dtype: object

In [154]: df2.A.to_frame().memory_usage()
Out[154]: 
A    8000000
dtype: int64

In [155]: df2.B.to_frame().memory_usage()
Out[155]: 
B    1000416
dtype: int64
answered by Jeff