Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Determining columns which produce unique row for primary set keys

I am looking to find a way of getting columns names from dataframes which combination produce unique rows.

Let's say I have a df which looks like this:

id      name        location    cost        supplier            code
1       A           Paris       10          Google              100
1       B           London      10          Facebook            100
2       A           Rome        20          Google              100
2       B           Paris       20          Google              105
1       C           London      10          Facebook            200

Then I would expect to get id, name because having these columns I can identify every row. I have checked this question but it simply returns combinations of all columns. Since that question was asked 3 years ago, I am wondering if pandas have a method of automatically checking which columns produce unique rows.

like image 795
Jonas Palačionis Avatar asked Oct 27 '25 01:10

Jonas Palačionis


1 Answers

Out of interest I looked into this. There are multiple valid primary key combinations for this data set. semantic a primary key is defined as a combination of attribute that uniquely identifies a row

I have used set_index() instead of drop_duplicates() to test the semantic.

Clearly you could add to semantic, such as minimum number of columns or maximum number of columns.

Updated

Have coded change to semantic, cost of testing is significant. Have flexibility to optimise by defining minimum and maximum number of columns to be considered in candidate primary key.

import io
import itertools
import pandas as pd

df = pd.read_csv(io.StringIO("""id      name        location    cost        supplier            code
1       A           Paris       10          Google              100
1       B           London      10          Facebook            100
2       A           Rome        20          Google              100
2       B           Paris       20          Google              105
1       C           London      10          Facebook            200"""),sep="\s+",)

MINCOLS = len(df.columns) // 3
MAXCOLS = len(df.columns) - 1

valid = []
for combi in itertools.chain.from_iterable(
    itertools.combinations(df.columns, r)
    for r in range(max(MINCOLS, 1), min(MAXCOLS, len(df.columns)))
):
    try:
        df.set_index(list(combi), verify_integrity=True)
        valid.append(list(combi))
    except ValueError:
        pass
valid

output

[['id', 'name'],
 ['name', 'location'],
 ['name', 'cost'],
 ['location', 'code'],
 ['id', 'name', 'location'],
 ['id', 'name', 'cost'],
 ['id', 'name', 'supplier'],
 ['id', 'name', 'code'],
 ['id', 'location', 'code'],
 ['id', 'supplier', 'code'],
 ['name', 'location', 'cost'],
 ['name', 'location', 'supplier'],
 ['name', 'location', 'code'],
 ['name', 'cost', 'supplier'],
 ['name', 'cost', 'code'],
 ['location', 'cost', 'code'],
 ['location', 'supplier', 'code'],
 ['cost', 'supplier', 'code'],
 ['id', 'name', 'location', 'cost'],
 ['id', 'name', 'location', 'supplier'],
 ['id', 'name', 'location', 'code'],
 ['id', 'name', 'cost', 'supplier'],
 ['id', 'name', 'cost', 'code'],
 ['id', 'name', 'supplier', 'code'],
 ['id', 'location', 'cost', 'code'],
 ['id', 'location', 'supplier', 'code'],
 ['id', 'cost', 'supplier', 'code'],
 ['name', 'location', 'cost', 'supplier'],
 ['name', 'location', 'cost', 'code'],
 ['name', 'location', 'supplier', 'code'],
 ['name', 'cost', 'supplier', 'code'],
 ['location', 'cost', 'supplier', 'code']]
like image 189
Rob Raymond Avatar answered Oct 29 '25 17:10

Rob Raymond



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!