I am looking to find a way of getting columns names from dataframes which combination produce unique rows.
Let's say I have a df which looks like this:
id name location cost supplier code
1 A Paris 10 Google 100
1 B London 10 Facebook 100
2 A Rome 20 Google 100
2 B Paris 20 Google 105
1 C London 10 Facebook 200
Then I would expect to get id, name because having these columns I can identify every row. I have checked this question but it simply returns combinations of all columns. Since that question was asked 3 years ago, I am wondering if pandas have a method of automatically checking which columns produce unique rows.
Out of interest I looked into this. There are multiple valid primary key combinations for this data set. semantic a primary key is defined as a combination of attribute that uniquely identifies a row
I have used set_index() instead of drop_duplicates() to test the semantic.
Clearly you could add to semantic, such as minimum number of columns or maximum number of columns.
Have coded change to semantic, cost of testing is significant. Have flexibility to optimise by defining minimum and maximum number of columns to be considered in candidate primary key.
import io
import itertools
import pandas as pd
df = pd.read_csv(io.StringIO("""id name location cost supplier code
1 A Paris 10 Google 100
1 B London 10 Facebook 100
2 A Rome 20 Google 100
2 B Paris 20 Google 105
1 C London 10 Facebook 200"""),sep="\s+",)
MINCOLS = len(df.columns) // 3
MAXCOLS = len(df.columns) - 1
valid = []
for combi in itertools.chain.from_iterable(
itertools.combinations(df.columns, r)
for r in range(max(MINCOLS, 1), min(MAXCOLS, len(df.columns)))
):
try:
df.set_index(list(combi), verify_integrity=True)
valid.append(list(combi))
except ValueError:
pass
valid
[['id', 'name'],
['name', 'location'],
['name', 'cost'],
['location', 'code'],
['id', 'name', 'location'],
['id', 'name', 'cost'],
['id', 'name', 'supplier'],
['id', 'name', 'code'],
['id', 'location', 'code'],
['id', 'supplier', 'code'],
['name', 'location', 'cost'],
['name', 'location', 'supplier'],
['name', 'location', 'code'],
['name', 'cost', 'supplier'],
['name', 'cost', 'code'],
['location', 'cost', 'code'],
['location', 'supplier', 'code'],
['cost', 'supplier', 'code'],
['id', 'name', 'location', 'cost'],
['id', 'name', 'location', 'supplier'],
['id', 'name', 'location', 'code'],
['id', 'name', 'cost', 'supplier'],
['id', 'name', 'cost', 'code'],
['id', 'name', 'supplier', 'code'],
['id', 'location', 'cost', 'code'],
['id', 'location', 'supplier', 'code'],
['id', 'cost', 'supplier', 'code'],
['name', 'location', 'cost', 'supplier'],
['name', 'location', 'cost', 'code'],
['name', 'location', 'supplier', 'code'],
['name', 'cost', 'supplier', 'code'],
['location', 'cost', 'supplier', 'code']]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With