I have a large Pandas dataframe that I've imported from a SQL database. The whole process takes several hours. As I work on the data, inevitably the dataframe gets altered and I regularly want to go back to 'a known good dataset' and re-run various functions. Instead of importing the data from the database, I want to save the data at various points of the analysis process as CSV files which can then be used to restore the data as required; CSV is the format of choice because, for some reason, I haven't had much luck with pickling the dataframes. Simply importing the CSV data using pd.read_csv() alters the datatypes of the columns. As a result, I want to create a dictionary of dtypes which can be used to restore the data types when importing the CSV back into a dataframe.
As an example, a simple dataframe may be defined as shown below:
df = pd.DataFrame({'A':[1,2,3,4,5],'B':['a','b','c','d','e'],'C':[1.2,3.4,5.6,7.8,9.0]},index=[0,2,4,6,8])
which looks like:
   A  B    C
0  1  a  1.2
2  2  b  3.4
4  3  c  5.6
6  4  d  7.8
8  5  e  9.0
A dictionary of dtypes can be created using:
dtypesDict = df.dtypes.to_dict()
which produces:
{'B': dtype('O'), 'C': dtype('float64'), 'A': dtype('int64')}
If I try to paste this output into my code as a hard-coded dictionary, so that it can be used to set the data types of the columns imported using pd.read_csv(), it fails as follows:
dtypesDict = {'B': dtype('O'), 'C': dtype('float64'), 'A': dtype('int64')}
NameError: name 'dtype' is not defined
However, defining the dictionary as:
dtypesDict = {'B': 'O', 'C': 'float64', 'A': 'int64'}
allows the CSV file to be imported with no problems.
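The NameError arises because a NumPy dtype prints its repr without a module prefix: dtype here is numpy.dtype, which is not in scope by default. One way to make the pasted literal valid as-is, sketched here on the assumption that NumPy is importable alongside pandas, is to import that name directly:

```python
from numpy import dtype  # brings the name used in the printed repr into scope

# The pasted output of df.dtypes.to_dict() now evaluates without a NameError
dtypesDict = {'B': dtype('O'), 'C': dtype('float64'), 'A': dtype('int64')}

print(dtypesDict['A'])
```

This only covers NumPy dtypes; plain string names, as in the dictionary above, are usually the more portable thing to paste.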
I thought a dictionary comprehension was the way to go but I can't make that work:
dtypesDict = {k:bit_in_brackets_of_v for k,v in df.dtypes.to_dict().items()}
How can I automatically produce a dictionary in the correct format, one that can be hard-coded using a simple cut-and-paste process and will allow the dtypes of columns imported from a CSV to be set correctly?
You can use dict with zip, getting the name of each column's dtype via dtype.name:
dict(zip(list(df),[df[x].dtype.name for x in df]))
Out[6]: {'A': 'int64', 'B': 'object', 'C': 'float64'}
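A dictionary comprehension over df.dtypes should give the same result, which is closer to what the question was attempting. The sketch below also runs the full round trip, assuming an in-memory buffer in place of a real CSV file and index_col=0 to restore the saved index; the names dtypes_dict, buf and restored are illustrative and not from the original post.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {'A': [1, 2, 3, 4, 5],
     'B': ['a', 'b', 'c', 'd', 'e'],
     'C': [1.2, 3.4, 5.6, 7.8, 9.0]},
    index=[0, 2, 4, 6, 8],
)

# Map each column name to the name of its dtype, as a plain string
dtypes_dict = {col: dt.name for col, dt in df.dtypes.items()}
print(dtypes_dict)  # plain strings, so the output can be pasted into code

# Write the frame to CSV (a StringIO buffer stands in for a file on disk)
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)

# Read it back, restoring the index from the first column and
# applying the saved dtype names to the remaining columns
restored = pd.read_csv(buf, index_col=0, dtype=dtypes_dict)
print(restored.dtypes)
```

Date columns are a separate case: they generally need to go through the parse_dates argument of pd.read_csv() rather than the dtype dictionary.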