If I read just a piece of the CSV, I get the following data structure:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2015-11-01 00:00:00, 4980770) to (2016-06-01 00:00:00, 8850573)
Data columns (total 5 columns):
CHANNEL          100000 non-null category
MCC              92660 non-null category
DOMESTIC_FLAG    100000 non-null category
AMOUNT           100000 non-null float32
CNT              100000 non-null uint8
dtypes: category(3), float32(1), uint8(1)
memory usage: 1.9+ MB
If I read the whole CSV and concatenate the blocks as above, I get the following structure:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 30345312 entries, (2015-11-01 00:00:00, 4980770) to (2015-08-01 00:00:00, 88838)
Data columns (total 5 columns):
CHANNEL          object
MCC              float64
DOMESTIC_FLAG    category
AMOUNT           float32
CNT              uint8
dtypes: category(1), float32(1), float64(1), object(1), uint8(1)
memory usage: 784.6+ MB
Why are the categorical variables changed to object / float64? How can I avoid this type change, especially the float64?
This is the concatenation code:
df = pd.concat([process(chunk) for chunk in reader])
The process function just does some cleaning and type assignments.
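One way to make the concatenation keep the categorical dtypes is to have each chunk cast to the *same* fixed `CategoricalDtype`, so every block carries identical categories. A minimal sketch of such a `process` function (the column names match the question, but the category lists here are hypothetical placeholders; in practice they would come from the full data dictionary, not from any single chunk):

```python
import pandas as pd

# Hypothetical fixed category lists -- substitute the real code lists
# for CHANNEL and MCC from your data dictionary.
CHANNEL_DTYPE = pd.CategoricalDtype(["web", "pos", "atm"])
MCC_DTYPE = pd.CategoricalDtype(["5411", "5812", "5999"])

def process(chunk):
    # Cast every chunk to the SAME categorical dtype, so pd.concat
    # sees identical categories in all blocks and keeps the dtype.
    chunk["CHANNEL"] = chunk["CHANNEL"].astype(CHANNEL_DTYPE)
    chunk["MCC"] = chunk["MCC"].astype(MCC_DTYPE)
    return chunk
```

With identical dtypes in every block, `pd.concat([process(chunk) for chunk in reader])` preserves the category columns instead of falling back to object / float64.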
Consider the following sample DataFrames:
In [93]: df1
Out[93]:
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
In [94]: df2
Out[94]:
   A  B
0  b  b
1  c  c
2  d  d
3  e  e
In [95]: df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A    4 non-null object
B    4 non-null category
dtypes: category(1), object(1)
memory usage: 140.0+ bytes
In [96]: df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
A    4 non-null object
B    4 non-null category
dtypes: category(1), object(1)
memory usage: 148.0+ bytes
NOTE: these two DFs have different categories:
In [97]: df1.B.cat.categories
Out[97]: Index(['a', 'b', 'c'], dtype='object')
In [98]: df2.B.cat.categories
Out[98]: Index(['b', 'c', 'd', 'e'], dtype='object')
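For reproducibility, the two frames above can be built like this (a sketch; column B is categorical in both, but each frame only sees its own values, so the inferred categories differ):

```python
import pandas as pd

# B is categorical in both frames, but the category sets differ
# because each frame infers categories from its own values only.
df1 = pd.DataFrame({"A": list("abca"), "B": pd.Categorical(list("abca"))})
df2 = pd.DataFrame({"A": list("bcde"), "B": pd.Categorical(list("bcde"))})
```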
When we concatenate them, Pandas won't merge the categories; it will create an object column instead:
In [99]: m = pd.concat([df1, df2])
In [100]: m.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8 entries, 0 to 3
Data columns (total 2 columns):
A    8 non-null object
B    8 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
But if we concatenate two DFs with the same categories, everything works as expected:
In [102]: m = pd.concat([df1.sample(frac=.5), df1.sample(frac=.5)])
In [103]: m
Out[103]:
   A  B
3  a  a
0  a  a
3  a  a
2  c  c
In [104]: m.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4 entries, 3 to 2
Data columns (total 2 columns):
A    4 non-null object
B    4 non-null category
dtypes: category(1), object(1)
memory usage: 92.0+ bytes
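When all the frames are already in memory, another option is to align their categories to the union before concatenating, using `pandas.api.types.union_categoricals`. A sketch (the helper name `concat_keep_categories` is mine, not a pandas API):

```python
import pandas as pd
from pandas.api.types import union_categoricals

def concat_keep_categories(frames, col):
    # Build the union of all categories seen across the frames,
    # recode each frame's column against that union, then concat.
    union = union_categoricals([f[col] for f in frames])
    for f in frames:
        f[col] = pd.Categorical(f[col], categories=union.categories)
    return pd.concat(frames)
```

Because every frame now shares one category set, the concatenated column stays categorical instead of being upcast to object.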