Determining columns which produce unique row for primary set keys

Question

I am looking to find a way of getting columns names from dataframes which combination produce unique rows.

Let's say I have a df which looks like this:

id      name        location    cost        supplier            code
1       A           Paris       10          Google              100
1       B           London      10          Facebook            100
2       A           Rome        20          Google              100
2       B           Paris       20          Google              105
1       C           London      10          Facebook            200

Then I would expect to get id, name because having these columns I can identify every row. I have checked this question but it simply returns combinations of all columns. Since that question was asked 3 years ago, I am wondering if pandas have a method of automatically checking which columns produce unique rows.

Rob Raymond · Accepted Answer

Out of interest I looked into this. There are multiple valid primary key combinations for this data set. semantic a primary key is defined as a combination of attribute that uniquely identifies a row

I have used set_index() instead of drop_duplicates() to test the semantic.

Clearly you could add to semantic, such as minimum number of columns or maximum number of columns.

Updated

Have coded change to semantic, cost of testing is significant. Have flexibility to optimise by defining minimum and maximum number of columns to be considered in candidate primary key.

import io
import itertools
import pandas as pd

df = pd.read_csv(io.StringIO("""id      name        location    cost        supplier            code
1       A           Paris       10          Google              100
1       B           London      10          Facebook            100
2       A           Rome        20          Google              100
2       B           Paris       20          Google              105
1       C           London      10          Facebook            200"""),sep="\s+",)

MINCOLS = len(df.columns) // 3
MAXCOLS = len(df.columns) - 1

valid = []
for combi in itertools.chain.from_iterable(
    itertools.combinations(df.columns, r)
    for r in range(max(MINCOLS, 1), min(MAXCOLS, len(df.columns)))
):
    try:
        df.set_index(list(combi), verify_integrity=True)
        valid.append(list(combi))
    except ValueError:
        pass
valid

output

[['id', 'name'],
 ['name', 'location'],
 ['name', 'cost'],
 ['location', 'code'],
 ['id', 'name', 'location'],
 ['id', 'name', 'cost'],
 ['id', 'name', 'supplier'],
 ['id', 'name', 'code'],
 ['id', 'location', 'code'],
 ['id', 'supplier', 'code'],
 ['name', 'location', 'cost'],
 ['name', 'location', 'supplier'],
 ['name', 'location', 'code'],
 ['name', 'cost', 'supplier'],
 ['name', 'cost', 'code'],
 ['location', 'cost', 'code'],
 ['location', 'supplier', 'code'],
 ['cost', 'supplier', 'code'],
 ['id', 'name', 'location', 'cost'],
 ['id', 'name', 'location', 'supplier'],
 ['id', 'name', 'location', 'code'],
 ['id', 'name', 'cost', 'supplier'],
 ['id', 'name', 'cost', 'code'],
 ['id', 'name', 'supplier', 'code'],
 ['id', 'location', 'cost', 'code'],
 ['id', 'location', 'supplier', 'code'],
 ['id', 'cost', 'supplier', 'code'],
 ['name', 'location', 'cost', 'supplier'],
 ['name', 'location', 'cost', 'code'],
 ['name', 'location', 'supplier', 'code'],
 ['name', 'cost', 'supplier', 'code'],
 ['location', 'cost', 'supplier', 'code']]

Determining columns which produce unique row for primary set keys

Tags:

python

algorithm

pandas

Jonas Palačionis

1 Answers

Updated

output

Rob Raymond

Recent Activity

Donate For Us

Determining columns which produce unique row for primary set keys

Tags:

python

algorithm

pandas

Jonas Palačionis

1 Answers

Updated

output

Rob Raymond

Related questions

Recent Activity

Donate For Us