I am using Python and I have a dataset with around 1 million records and around 50 columns.
Some of these columns are categorical with many distinct values (for example, the IssueCode column can have 7000 different codes and the SolutionCode column can have 1000 codes).
I am trying to build a predictive model.
Therefore I have to convert the data using get_dummies.
But this has been causing a MemoryError:
File "C:\Users\am\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 963, in _get_dummies_1d dummy_mat = np.eye(number_of_cols, dtype=dtype).take(codes, axis=0)
MemoryError
I tried another approach and kept the columns without one-hot encoding, but now I get this error when I try to build the model:
ValueError: could not convert string to float: 'ABC'
I checked this solution:
get_dummies python memory error
I converted all the dummy columns to int8, but I still get the same error:
df = pd.concat([df.drop('IssueCode', axis=1), pd.get_dummies(df['IssueCode'], prefix='IssueCode_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('SolutionCode', axis=1), pd.get_dummies(df['SolutionCode'], prefix='SolutionCode_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col1', axis=1), pd.get_dummies(df['Col1'], prefix='Col1_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col2', axis=1), pd.get_dummies(df['Col2'], prefix='Col2_').astype(np.int8)], axis=1)
df = pd.concat([df.drop('Col3', axis=1), pd.get_dummies(df['Col3'], prefix='Col3_').astype(np.int8)], axis=1)
I cannot use get_dummies because of the MemoryError, and I cannot skip get_dummies because of the string-to-float error.
How can I solve this?
Here is my code:
from sklearn.model_selection import cross_val_predict
import pymssql
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
import datetime
import random
from sklearn.ensemble import RandomForestRegressor
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
TaskTime = 900
RunTime = 120
sFolder = "/mnt/c/temp/"
def Lead0(value):
    return "0" + str(value) if value < 10 else str(value)
dNow = datetime.datetime.now()
sNow = Lead0(dNow.year) + Lead0(dNow.month) + Lead0(dNow.day) + Lead0(dNow.hour) + Lead0(dNow.minute) + Lead0(dNow.second)
print(sNow)
conn = pymssql.connect(server="MyServer", database="MyDB", port="1433", user="***", password="*****")
df = pd.read_sql("SELECT * FROM MyTable where MyDate between '1 jul 2018' and '30 jun 2019'", conn)
conn.close()
#df = pd.get_dummies(df)
#When I uncomment this line I get the MemoryError
mdl = RandomForestRegressor(n_estimators=500)
# X and y are the feature matrix and target built from df (the split is not shown here)
y_pred = cross_val_predict(mdl, X, y, cv=5)
#This line raises: ValueError: could not convert string to float
The first thing you may want to do is to specify appropriate data types for data frame columns to reduce the memory usage of the loaded dataframe (cf. https://www.dataquest.io/blog/pandas-big-data/).
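For example (a minimal sketch, assuming the column names from your question), converting the high-cardinality string columns to the category dtype usually shrinks the frame considerably before any encoding is done:
for col in ["IssueCode", "SolutionCode", "Col1", "Col2", "Col3"]:
    df[col] = df[col].astype("category")
# check the effect on memory
print(df.memory_usage(deep=True).sum() / 1024 ** 2, "MB")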
As for one-hot encoding, one direct solution to the memory issue is to use sparse data types rather than regular ones (see the docs for more details). This can be achieved with something like this:
df = pd.get_dummies(df, columns=["IssueCode", "SolutionCode", "Col1", "Col2", "Col3"],
                    sparse=True)
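With sparse=True, the dummy columns are stored as pandas sparse columns, so only the non-zero entries take up memory; comparing df.memory_usage(deep=True) before and after the call shows the difference.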
I am not sure whether pandas' sparse representation works well with sklearn though. If it does not work, you could try using sklearn's OneHotEncoder, which also offers sparse representation by default.
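As a rough sketch of that alternative (the target column name "MyTarget" is a placeholder, and it assumes the remaining columns are already numeric), you could encode only the categorical columns inside a pipeline so the sparse matrix is passed straight to the model:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

cat_cols = ["IssueCode", "SolutionCode", "Col1", "Col2", "Col3"]

# OneHotEncoder returns a sparse matrix by default; handle_unknown="ignore"
# avoids errors when a CV fold contains codes unseen during fitting
pre = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), cat_cols)],
    remainder="passthrough",  # keep the other (numeric) columns unchanged
)

pipe = Pipeline([("pre", pre), ("rf", RandomForestRegressor(n_estimators=500))])

X = df.drop(columns=["MyTarget"])  # "MyTarget" is a placeholder target name
y = df["MyTarget"]
y_pred = cross_val_predict(pipe, X, y, cv=5)
Random forests in sklearn accept sparse input directly, so the dummy matrix is never densified.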
There are also other encoding techniques for categorical features that reduce the dimensionality (and the memory usage) but require more work, e.g. merging values of categorical features into larger groups, as sketched below.
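For instance, a rough sketch of such a technique (run on the raw string columns, before any category conversion or dummy encoding; the 1% threshold is arbitrary) is to collapse infrequent codes into a single "OTHER" bucket:
threshold = 0.01 * len(df)
for col in ["IssueCode", "SolutionCode"]:
    counts = df[col].value_counts()
    rare = counts[counts < threshold].index
    # replace codes that appear in less than 1% of the rows with "OTHER"
    df[col] = df[col].where(~df[col].isin(rare), "OTHER")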