I have a pandas dataframe with roughly 56 columns and 120,000 rows.
I would like to validate only some of the columns, not all of them.
I followed the article at https://tmiguelt.github.io/PandasSchema/
When I wrote something like the function below, it failed with the error
"Invalid number of columns. The schema specifies 2, but the data frame has 56"
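    import numpy as np
    import pandas_schema
    from pandas_schema import Column
    from pandas_schema.validation import CustomElementValidation

    def DoValidation(self, df):
        # Element-wise check: fails for cells that are NaN
        null_validation = [CustomElementValidation(lambda d: d is not np.nan, 'this field cannot be null')]
        # The schema lists only the two columns that should be validated
        schema = pandas_schema.Schema([Column('ItemId', null_validation),
                                       Column('ItemName', null_validation)])
        errors = schema.validate(df)
        if len(errors) > 0:
            for error in errors:
                print(error)
            return False
        return True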
Am I doing something wrong?
What is the correct way to validate specific columns in a dataframe?
Note: I have to implement different types of validation (decimal, length, null checks, etc.) on different columns, not just the null check shown in the function above.
As Yuki Ho mentioned in his answer, by default the schema has to define as many columns as your dataframe contains.
But you can also pass the columns parameter to schema.validate() to specify which columns to check. Combined with schema.get_column_names(), this restricts validation to the columns the schema defines, which avoids your issue:
    schema.validate(df, columns=schema.get_column_names())