
How to resolve a MemoryError caused by get_dummies

I am using Python and I have a dataset with around 1 million records and around 50 columns.

Some of these columns have many distinct values (for example, the IssueCode column can have 7000 different codes, and another column, SolutionCode, can have 1000 codes).

I am trying to build a predictive model.

Therefore I have to convert the data using get_dummies,

but this has been causing a MemoryError:

File "C:\Users\am\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 963, in _get_dummies_1d dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)

MemoryError

I tried another approach: keeping the columns without one-hot encoding.

Now I get this error when I try to build the model:

ValueError: could not convert string to float: 'ABC'

I checked this solution:

get_dummies python memory error

I converted all the columns to int8 but I still get the same error:

df = pd.concat([df.drop('IssueCode', axis=1), pd.get_dummies(df['IssueCode'], prefix='IssueCode_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('SolutionCode', axis=1), pd.get_dummies(df['SolutionCode'], prefix='SolutionCode_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col1', axis=1), pd.get_dummies(df['Col1'], prefix='Col1_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col2', axis=1), pd.get_dummies(df['Col2'], prefix='Col2_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col3', axis=1), pd.get_dummies(df['Col3'], prefix='Col3_').astype(np.int8)], axis=1)

I cannot use get_dummies because of the memory error, and I cannot skip get_dummies because of the string-to-float error.

How can I solve this?

Here is my code:

from sklearn.model_selection import cross_val_predict
import pymssql
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import datetime
import random
from sklearn.ensemble import RandomForestRegressor


pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)


TaskTime = 900
RunTime = 120
sFolder = "/mnt/c/temp/"

def Lead0(value): 
        return "0" + str(value) if value < 10 else str(value)

dNow = datetime.datetime.now()
sNow = Lead0(dNow.year) + Lead0(dNow.month) + Lead0(dNow.day) + Lead0(dNow.hour) + Lead0(dNow.minute) + Lead0(dNow.second) 

print(sNow)

conn = pymssql.connect(server="MyServer", database="MyDB", port="1433", user="***", password="*****")
df = pd.read_sql("SELECT   *  FROM MyTable where MyDate between '1 jul 2018' and '30 jun 2019'", conn)
conn.close()


#df = pd.get_dummies(df)
#When I uncomment this I get Memory Error


mdl = RandomForestRegressor(n_estimators = 500) 

y_pred = cross_val_predict(mdl, X, y, cv=5)
#This is causing the string-to-float ValueError


1 Answer

The first thing you may want to do is specify appropriate data types for the DataFrame columns to reduce the memory usage of the loaded DataFrame (cf. https://www.dataquest.io/blog/pandas-big-data/).
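
As a rough sketch of that idea (IssueCode and SolutionCode are the columns named in the question; the "Duration" column and the data sizes are just hypothetical stand-ins), casting high-cardinality string columns to the category dtype and downcasting numeric columns usually cuts memory usage substantially:

import numpy as np
import pandas as pd

# Small stand-in for the real table; IssueCode/SolutionCode come from the
# question, "Duration" is a made-up numeric column for illustration.
df = pd.DataFrame({
    "IssueCode": np.random.choice([f"I{i}" for i in range(7000)], size=100_000),
    "SolutionCode": np.random.choice([f"S{i}" for i in range(1000)], size=100_000),
    "Duration": np.random.randint(0, 1_000, size=100_000),
})

print(df.memory_usage(deep=True).sum())   # bytes before

# High-cardinality string columns -> category dtype (each code stored once).
for col in ["IssueCode", "SolutionCode"]:
    df[col] = df[col].astype("category")

# Numeric columns -> smallest integer type that can hold the values.
df["Duration"] = pd.to_numeric(df["Duration"], downcast="integer")

print(df.memory_usage(deep=True).sum())   # bytes after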

As for one-hot encoding, one direct solution to the memory issue is to use sparse data types rather than regular data types (see the pandas docs for more details). This can be achieved with something like this:

df = pd.get_dummies(df, columns=["IssueCode", "SolutionCode", "Col1", "Col2", "Col3"],
                    sparse=True)

I am not sure whether pandas' sparse representation works well with sklearn though. If it does not work, you could try using sklearn's OneHotEncoder, which also offers sparse representation by default.
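
If you go the scikit-learn route, a minimal sketch could look like the following. The categorical column names are taken from the question, the remaining columns are assumed to be numeric, and the "Target" column name is a placeholder, not something from the original code:

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

categorical_cols = ["IssueCode", "SolutionCode", "Col1", "Col2", "Col3"]

preprocess = ColumnTransformer(
    transformers=[
        # OneHotEncoder returns a scipy sparse matrix by default, so the
        # thousands of dummy columns are never materialised densely.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",  # assumes the remaining columns are numeric
)

mdl = Pipeline(steps=[
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=500)),
])

# X = df.drop(columns=["Target"]); y = df["Target"]   # "Target" is a placeholder
# y_pred = cross_val_predict(mdl, X, y, cv=5)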

There are also other encoding techniques for categorical features that reduce the dimensionality (and the memory usage) but require more work, e.g. merging rare values of a categorical feature into larger groups, as sketched below.
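
As a rough illustration of that last idea (the IssueCode column and the cutoff of 100 occurrences are just examples, and the column is assumed to still be a plain string column), infrequent codes can be collapsed into a single "Other" bucket before encoding:

# Count how often each code appears in the question's dataframe.
counts = df["IssueCode"].value_counts()

# Codes seen fewer than 100 times get lumped together as "Other",
# which sharply reduces the number of dummy columns produced later.
rare_codes = counts[counts < 100].index
df["IssueCode"] = df["IssueCode"].mask(df["IssueCode"].isin(rare_codes), "Other")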



