
Count frequency of combinations of elements in python

Tags: python, pandas

I have the following df:

[image: the input DataFrame]

What I want to do is count the frequency of combinations of elements. For example:

  • umbrella appears 8 times in the whole df
  • detergent appears 5 times
  • (beer, diaper) appear 2 times
  • (beer, milk) appear 2 times
  • (umbrella, milk, beer) appear 2 times

and so on. In other words, I need to generate something like this: [image: the expected output]

I want to count the frequencies of all single items and item combinations, and keep only those with frequency >= n, where n is any positive integer. For this example, let's say n is in {1, 2, 3, 4}.

I've been trying to use the following code:

# candidate itemsets
records = []

# number of rows in the DataFrame (called data here)
num_records = len(data)

# generate a list of lists of products that were bought together (convert df to list of lists)
for i in range(0, num_records):
    records.append([str(data.values[i,j]) for j in range(0, len(data.columns))])
    
# clean list (delete NaN values)
records = [[x for x in y if str(x) != 'nan'] for y in records]

OUTPUT:
[['detergent'],
 ['bread', 'water'],
 ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
 ['detergent', 'beer', 'umbrella', 'milk'],
 ['cheese', 'detergent', 'diaper', 'umbrella'],
 ['umbrella', 'water', 'beer'],
 ['umbrella', 'water'],
 ['water', 'umbrella'],
 ['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
 ['umbrella', 'cheese', 'detergent', 'water', 'beer']]

and then:

setOfItems = []
newListOfItems = []
for item in records:
    if item in setOfItems:
        continue
    setOfItems.append(item)
    temp = list(item)
    occurence = records.count(item)
    temp.append(occurence)
    newListOfItems.append(temp)

OUTPUT:

['detergent', 1]
['bread', 'water', 1]
['bread', 'umbrella', 'milk', 'diaper', 'beer', 1]
['detergent', 'beer', 'umbrella', 'milk', 1]
['cheese', 'detergent', 'diaper', 'umbrella', 1]
['umbrella', 'water', 'beer', 1]
['umbrella', 'water', 1]
['water', 'umbrella', 1]
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella', 1]
['umbrella', 'cheese', 'detergent', 'water', 'beer', 1]

As you can see, this only counts the frequency of each whole row (from image 1), whereas my expected output is the one shown in the second image.

brenda asked Oct 14 '25 05:10


1 Answer

Interesting problem! I use itertools.combinations() to generate all possible combinations of the items in each row, and collections.Counter() to count how often each combination appears:

import numpy as np
import pandas as pd
import itertools
from collections import Counter

# create sample data
df = pd.DataFrame([
    ['detergent', np.nan],
    ['bread', 'water', None],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water']
])

def get_all_combinations_without_nan_or_None(row):
    # remove nan, None and duplicate values; sort so that the same items
    # always give the same tuple, whatever their order in the row
    items = sorted({value for value in row if isinstance(value, str)})
    
    # generate all possible combinations of the values in a row
    all_combinations = []
    for i in range(1, len(items)+1):
        result = list(itertools.combinations(items, i))
        all_combinations.extend(result)
        
    return all_combinations
    
# get all possible combinations of values in each row
all_rows = df.apply(get_all_combinations_without_nan_or_None, axis=1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))

# use Counter to count how many there are of each combination
count_combinations = Counter(all_rows_flatten)
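
The question also asks to keep only the combinations with frequency >= n. A minimal sketch of that last step could look like the following. The helper name filter_by_min_count is my own (hypothetical), and example_counts is a small hand-written stand-in for count_combinations, using a few counts from the question's full data set:

```python
from collections import Counter

def filter_by_min_count(counts, n):
    """Keep only the combinations that occur at least n times."""
    return {combo: freq for combo, freq in counts.items() if freq >= n}

# Stand-in for count_combinations: a few counts from the question's data
example_counts = Counter({
    ('umbrella',): 8,
    ('detergent',): 5,
    ('beer', 'diaper'): 2,
    ('bread', 'water'): 1,
})

# Keep single items and combinations that appear at least twice
frequent = filter_by_min_count(example_counts, 2)
print(frequent)
# {('umbrella',): 8, ('detergent',): 5, ('beer', 'diaper'): 2}
```

The same call works on count_combinations directly, since a Counter is a dict subclass.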

Docs on collections.Counter():
https://docs.python.org/2/library/collections.html#collections.Counter

Docs on itertools.combinations():
https://docs.python.org/2/library/itertools.html#itertools.combinations
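
If you already have the cleaned list of lists from the question (records), the same idea can be sketched without pandas. This is only an illustration, not the only way to do it: count_itemsets is a hypothetical helper name, and each row is sorted so that ['umbrella', 'water'] and ['water', 'umbrella'] count as the same combination:

```python
import itertools
from collections import Counter

# The cleaned list of lists shown in the question
records = [
    ['detergent'],
    ['bread', 'water'],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['detergent', 'beer', 'umbrella', 'milk'],
    ['cheese', 'detergent', 'diaper', 'umbrella'],
    ['umbrella', 'water', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
    ['umbrella', 'cheese', 'detergent', 'water', 'beer'],
]

def count_itemsets(rows):
    """Count in how many rows each combination of items appears."""
    counts = Counter()
    for row in rows:
        # Deduplicate and sort so every combination has one canonical tuple
        items = sorted(set(row))
        for size in range(1, len(items) + 1):
            counts.update(itertools.combinations(items, size))
    return counts

itemset_counts = count_itemsets(records)
print(itemset_counts[('umbrella',)])                # 8
print(itemset_counts[('detergent',)])               # 5
print(itemset_counts[('beer', 'diaper')])           # 2
print(itemset_counts[('beer', 'milk', 'umbrella')]) # 2
```

Keep in mind that a row with k distinct items produces 2**k - 1 combinations, so this approach only suits rows with a modest number of items.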

Sander van den Oord answered Oct 16 '25 21:10


