
Testing pandas dataframe with unittest framework

I'm trying to write unit tests for CSV files using Python's unittest framework. I want to check things such as whether the column names match and whether the values in each column are valid. I know there are more convenient libraries for this, like datatest and pytest, but I can only use unittest in my project.

I guess I'm using the wrong unittest.TestCase methods and passing the data in the wrong format. Please advise on a better way to do this.

db.csv example:

  TIMESTAMP   TYPE   VALUE YEAR  FILE   SHEET
0 02-09-2018  Index   45   2018  tq.xls A01
1 13-05-2018  Index   21   2018  tq.xls A01
2 22-01-2019  Index   9    2019  aq.xls B02

Here is the code:

import pandas as pd
import unittest

class DFTests(unittest.TestCase):

    def setUp(self):
        test_file_name =  'db.csv'
        try:
            data = pd.read_csv(test_file_name,
                sep = ',',
                header = 0)
        except IOError:
            print('cannot open file')
        self.fixture = data

    #Check column names
    def test_columns(self):
        self.assertEqual(
            self.fixture.columns,
            {'TIMESTAMP', 'TYPE', 'VALUE','YEAR','FILE','SHEET'},
        )

    #Check timestamp format
    def test_timestamp(self):
        self.assertRaisesRegex(
            self.fixture['TIMESTAMP'],
            r'\d{2}-\d{2}-\d{4}'
        )

    #Check year values
    def test_year_values(self):
        self.assertIn(
            self.fixture['YEAR'],
            {2018, 2019, 2020},
        )


if __name__ == '__main__':
    unittest.main()

Errors:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
TypeError: assertRaisesRegex() arg 1 must be an exception type or tuple of exception types
TypeError: 'Series' objects are mutable, thus they cannot be hashed

Any help is appreciated.

Asked by 干猕猴桃, Feb 02 '26 01:02


1 Answer

You need to assert on each value in the column rather than passing the whole Series to a single assertion. Try something like this:

import pandas as pd
import unittest

# Expected header names; assumes the header in db.csv has no padding around names.
# If the file is separated by ", ", pass skipinitialspace=True to read_csv.
colnames = ["TIMESTAMP", "TYPE", "VALUE", "YEAR", "FILE", "SHEET"]
years = {2018, 2019, 2020}


class DfTests(unittest.TestCase):
    def setUp(self):
        # Let a missing file raise, so the test errors out right away instead of
        # failing later with an AttributeError on self.fixture.
        self.fixture = pd.read_csv("db.csv", sep=",")

    def test_colnames(self):
        self.assertListEqual(list(self.fixture.columns), colnames)

    def test_timestamp_format(self):
        # assertRegex works on one string at a time, so check every row
        for ts in self.fixture["TIMESTAMP"]:
            with self.subTest(timestamp=ts):
                self.assertRegex(ts, r"^\d{2}-\d{2}-\d{4}$")

    def test_years(self):
        # all() collapses the per-row checks into a single boolean
        self.assertTrue(all(y in years for y in self.fixture["YEAR"]))


if __name__ == "__main__":
    unittest.main()
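
If you would rather not loop in Python, the same per-row checks can be pushed down into pandas and reduced with .all(). The following is only a sketch: CSV_TEXT is a made-up in-memory stand-in for db.csv so that the example runs on its own, the class name is hypothetical, and Series.str.fullmatch assumes pandas 1.1 or newer.

```python
import io
import unittest

import pandas as pd

# Hypothetical in-memory stand-in for db.csv, so the sketch is self-contained.
CSV_TEXT = """TIMESTAMP,TYPE,VALUE,YEAR,FILE,SHEET
02-09-2018,Index,45,2018,tq.xls,A01
13-05-2018,Index,21,2018,tq.xls,A01
22-01-2019,Index,9,2019,aq.xls,B02
"""


class VectorizedDfTests(unittest.TestCase):
    def setUp(self):
        self.fixture = pd.read_csv(io.StringIO(CSV_TEXT))

    def test_timestamp_format(self):
        ts = self.fixture["TIMESTAMP"]
        # Boolean Series: True where the whole string matches dd-dd-dddd.
        # na=False makes missing values count as a mismatch.
        ok = ts.str.fullmatch(r"\d{2}-\d{2}-\d{4}", na=False)
        self.assertTrue(ok.all(), f"bad timestamps: {list(ts[~ok])}")

    def test_years(self):
        year = self.fixture["YEAR"]
        # isin gives one boolean per row; .all() reduces them to a single value.
        ok = year.isin([2018, 2019, 2020])
        self.assertTrue(ok.all(), f"unexpected years: {list(year[~ok])}")
```

Reducing with .all() also avoids the "truth value of an array is ambiguous" error from the question, because the assertion receives one boolean instead of a whole Series. The failure message lists the offending values, which the plain all(...) version does not.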

Also, bear in mind that pandas ships built-in testing functions in the pandas.testing module, such as assert_frame_equal, assert_series_equal and assert_index_equal. Beyond unittest, great_expectations is probably the best tool for unit-testing dataframes and for general data validation.
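
As a rough illustration of the pandas.testing helpers inside a unittest.TestCase: they raise AssertionError on a mismatch, so unittest reports them as ordinary test failures. CSV_TEXT and the class name below are assumptions made up for this sketch, not part of the original project.

```python
import io
import unittest

import pandas as pd
import pandas.testing as pdt

# Hypothetical in-memory stand-in for db.csv, so the sketch is self-contained.
CSV_TEXT = """TIMESTAMP,TYPE,VALUE,YEAR,FILE,SHEET
02-09-2018,Index,45,2018,tq.xls,A01
13-05-2018,Index,21,2018,tq.xls,A01
22-01-2019,Index,9,2019,aq.xls,B02
"""


class PandasTestingDfTests(unittest.TestCase):
    def setUp(self):
        self.fixture = pd.read_csv(io.StringIO(CSV_TEXT))

    def test_columns(self):
        # Compares labels and their order; raises AssertionError on mismatch.
        expected = pd.Index(["TIMESTAMP", "TYPE", "VALUE", "YEAR", "FILE", "SHEET"])
        pdt.assert_index_equal(self.fixture.columns, expected)

    def test_year_column(self):
        # Compares values, dtype, index and name of the whole column at once.
        expected = pd.Series([2018, 2018, 2019], name="YEAR")
        pdt.assert_series_equal(self.fixture["YEAR"], expected)
```

These helpers fit best when you know the exact expected contents; for rule-style checks such as "every year is in an allowed set", the per-row assertions are a better match.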

Answered by anddt, Feb 04 '26 15:02


