In my database DDL (SQL Server) I have columns that have default values:
my_database_field varchar(255) NOT NULL
DEFAULT 'NVT'
However, when I try to insert data into this table using DataFrame.to_sql() I get the following error:
Cannot insert the value NULL into column 'my_database_field', table 'MYDB.dbo.my_table'; column does not allow nulls. INSERT fails.
Here is the code:
with engine.begin() as conn:
    dataframe.to_sql(table_name, conn, index=False, if_exists='append')
The column 'my_database_field' has NaN values:
data['my_database_field']
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
4154 NaN
4155 NaN
4156 NaN
4157 NaN
4158 NaN
This seems to be because the to_sql method replaces NaN values with NULL (which is probably how it should be).
I want the option to insert NaN values and have the database use the DEFAULT value defined in the schema, instead of inserting NULL directly, when using the to_sql method.
I've tried replacing all NaNs in the dataframe with empty strings, but this suffers from the same problem: the empty strings are propagated to the database and that's what appears, not the default value specified in the DDL.
The only way I can get it to work is to not use the pandas dataframe.to_sql method at all, but instead load the data from a CSV file and write it line by line into the database using basic SQLAlchemy SQL Expressions.
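For reference, that line-by-line approach looks roughly like this (a sketch rather than my exact code; the CSV file name, the table reflection, and the pd.notna() check are just for illustration):

import pandas as pd
from sqlalchemy import MetaData, Table, insert

# reflect the existing table so SQLAlchemy knows its columns
metadata = MetaData()
my_table = Table('my_table', metadata, autoload_with=engine)

df = pd.read_csv('data.csv')  # illustrative file name

with engine.begin() as conn:
    for record in df.to_dict(orient='records'):
        # keep only the non-NaN values so the other columns are
        # omitted from the INSERT and the DDL defaults apply
        values = {k: v for k, v in record.items() if pd.notna(v)}
        conn.execute(insert(my_table).values(**values))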
I've opened an issue in the pandas github: https://github.com/pandas-dev/pandas/issues/42408
Edit: As per @Larnu's comment, the key bit of information I was missing was that:
A DEFAULT value is only used when you don't specify the column in the INSERT. If you (or the application) explicitly supply a NULL value, then a NULL value will be inserted. DEFAULT isn't a "use this value when NULL is supplied" property; it's a "use this value when the column is omitted" property.
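To illustrate the difference (a minimal example against the same table, assuming an engine as above; my_id and the literal values are made up):

from sqlalchemy import text

with engine.begin() as conn:
    # column omitted from the INSERT -> SQL Server fills in the DEFAULT 'NVT'
    conn.execute(text(
        "INSERT INTO my_table (my_id) VALUES ('row1')"))
    # column supplied explicitly as NULL -> reproduces the error above,
    # because the column is NOT NULL
    conn.execute(text(
        "INSERT INTO my_table (my_id, my_database_field) "
        "VALUES ('row2', NULL)"))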
The problem now becomes how to elegantly handle inserting a large dataframe where some columns contain NaNs. I'd like to avoid inserting the data row by row, checking each row for NaN values as I go.
Edit 2: I solved this problem by using the ORM as follows:
import logging

import numpy as np
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class MyModel(Base):
    __tablename__ = 'my_table'

    my_id = Column(String, primary_key=True)
    my_integer_field = Column(Integer)
    my_database_field = Column(String, server_default='NVT')

    @staticmethod
    def from_dict(input_dict: dict):
        for k, v in input_dict.items():
            logging.debug(f"Key: {k} - Value: {v}")
            if type(v) is str:
                continue
            try:
                if np.isnan(v):
                    input_dict[k] = None
            except TypeError as e:
                logging.debug(e)
                logging.debug(f"Could not test if: {type(v)}:{v} was NaN")
        return __class__(**input_dict)
As you can see, I have added a custom static method called from_dict which builds my model from a dictionary. For each value I check whether it is np.nan and, if it is, I set it to None instead. This seems to make the ORM omit that field from the INSERT under the hood and therefore lets the database use the default value. The np.isnan() method only works for numbers, not strings, so for strings I just catch the exception, log some info and move on.
For string values it seems I must also set 'server_default' in the model itself. If anyone knows more about what's going on here, please feel free to share.
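For completeness, inserting the models might look something like this (a sketch, not my exact code; it assumes an engine bound to the database and the dataframe from above):

from sqlalchemy.orm import Session

# build one model per dataframe row, with NaNs converted to None
records = [MyModel.from_dict(row) for row in df.to_dict(orient='records')]

with Session(engine) as session:
    session.add_all(records)
    session.commit()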
As @Larnu mentions in his comment, the default value will only be used when that column is not stated in the INSERT statement. Pandas sees that column name in the dataframe and includes it in the INSERT, so you need to split up your data and remove the columns where you want defaults.
With small data, you can iterate through the rows, call dropna on each row and then to_sql the result, as in the sketch below.
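For example (a rough sketch; it assumes df, table_name and conn exist as in the question):

# each row becomes a one-row dataframe, NaN columns are dropped, and only
# the remaining columns are appended, so the database fills in its
# DEFAULTs for the omitted ones
for _, row in df.iterrows():
    row.dropna().to_frame().T.to_sql(
        table_name, conn, index=False, if_exists='append')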
For larger datasets you could split up your data so that you first upload only the rows with no NaNs. Then create a set for every combination of columns; for the rows where all of those columns are null, drop those columns and upload only the columns that do have data. For example, with a dataframe with columns A and B: upload all rows where both values are not null, then upload a dataframe of column A only for the rows where B is null, and vice versa.
import numpy as np
import pandas as pd
import itertools

# create a random dataframe of integers
df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)),
                  columns=['A', 'B', 'C', 'D'])
# turn ~20% of the values into NaNs
df = df.mask(np.random.random(df.shape) < .2)

# upload the rows with no nulls first
df.dropna().to_sql(table_name, conn, index=False, if_exists='append')

# get the column names into a list
cols = list(df.columns)
# create the powerset of column names
p_set = itertools.chain.from_iterable(
    itertools.combinations(cols, r) for r in range(1, len(cols)))

# iterate through all combinations of column names
for null_cols in p_set:
    # determine the columns where we DO want values
    not_null_cols = list(set(cols) - set(null_cols))
    # filter the original dataframe to rows where ALL the null_cols are
    # null AND ALL the other cols have a non-null value
    sub_df = df[df[list(null_cols)].isnull().all(1) &
                df[list(not_null_cols)].notnull().all(1)]
    # if this dataframe has rows, send it to the database
    if len(sub_df) > 0:
        # sub_df still has all columns at this point; only keep the
        # ones with values
        sub_df = sub_df[not_null_cols]
        sub_df.to_sql(table_name, conn, index=False, if_exists='append')