I have the following df:
What I want to do is count the frequency of each combination of elements. For example:
and so on. In other words, I need to generate something like this:
Count the frequencies of all single and combined items, and keep only those items (single or combined) with frequency >= n, where n is any positive integer. For this example, let's say n takes values in {1, 2, 3, 4}.
I've been trying to use the following code:
# candidates itemsets
records = []
# generates a list of lists of products that were bought together (convert df to list of lists)
num_records = len(data)  # assumed: one record per DataFrame row
for i in range(num_records):
    records.append([str(data.values[i, j]) for j in range(len(data.columns))])
# clean list (delete NaN values)
records = [[x for x in y if str(x) != 'nan'] for y in records]
OUTPUT:
[['detergent'],
['bread', 'water'],
['bread', 'umbrella', 'milk', 'diaper', 'beer'],
['detergent', 'beer', 'umbrella', 'milk'],
['cheese', 'detergent', 'diaper', 'umbrella'],
['umbrella', 'water', 'beer'],
['umbrella', 'water'],
['water', 'umbrella'],
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella'],
['umbrella', 'cheese', 'detergent', 'water', 'beer']]
and then:
setOfItems = []
newListOfItems = []
for item in records:
    if item in setOfItems:
        continue
    setOfItems.append(item)
    temp = list(item)
    occurence = records.count(item)
    temp.append(occurence)
    newListOfItems.append(temp)
OUTPUT:
['detergent', 1]
['bread', 'water', 1]
['bread', 'umbrella', 'milk', 'diaper', 'beer', 1]
['detergent', 'beer', 'umbrella', 'milk', 1]
['cheese', 'detergent', 'diaper', 'umbrella', 1]
['umbrella', 'water', 'beer', 1]
['umbrella', 'water', 1]
['water', 'umbrella', 1]
['diaper', 'water', 'cheese', 'beer', 'detergent', 'umbrella', 1]
['umbrella', 'cheese', 'detergent', 'water', 'beer', 1]
As you can see, it only counts the frequency of each whole row (from image 1); my expected output, however, is the one that appears in the second image.
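To illustrate why the attempt above only sees whole rows: list.count() compares each row for exact equality, order included, so a pair is never found inside a longer row or in a reordered one. A minimal sketch, assuming three rows copied from the output above; the names whole_row_count and subset_count are made up for this illustration:

```python
# Three rows copied from the cleaned records shown in the question.
records = [
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water', 'beer'],
]

# list.count() matches only rows equal to the given list, in the same order.
whole_row_count = records.count(['umbrella', 'water'])

# Treating each row as a set counts every row that contains the pair.
pair = {'umbrella', 'water'}
subset_count = sum(pair <= set(row) for row in records)

print(whole_row_count)  # 1
print(subset_count)     # 3
```

Counting an itemset therefore needs a subset test, or an enumeration of each row's combinations, rather than whole-row equality.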
Interesting problem! I am using itertools.combinations() to generate all possible combinations and collections.Counter() to count how often each combination appears:
import numpy as np
import pandas as pd
import itertools
from collections import Counter
# create sample data
df = pd.DataFrame([
    ['detergent', np.nan],
    ['bread', 'water', None],
    ['bread', 'umbrella', 'milk', 'diaper', 'beer'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
    ['umbrella', 'water']
])
def get_all_combinations_without_nan_or_None(row):
    # remove nan, None and duplicate values; sort so that the same items
    # always produce the same tuple, whatever their order in the row
    set_without_nan = sorted({value for value in row if isinstance(value, str)})
    # generate all possible combinations of the values in a row
    all_combinations = []
    for i in range(1, len(set_without_nan) + 1):
        result = list(itertools.combinations(set_without_nan, i))
        all_combinations.extend(result)
    return all_combinations
# get all possible combinations of values in a row
all_rows = df.apply(get_all_combinations_without_nan_or_None, axis=1).values
all_rows_flatten = list(itertools.chain.from_iterable(all_rows))
# use Counter to count how many there are of each combination
count_combinations = Counter(all_rows_flatten)
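The question also asks to keep only the itemsets whose frequency is at least n, a step the code above stops short of. A minimal sketch of that filter, working on a plain list of rows instead of a DataFrame; the helper names count_itemsets and keep_frequent and the four sample rows are assumptions made for this illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample: four rows in the style of the question's records.
records = [
    ['detergent'],
    ['bread', 'water'],
    ['umbrella', 'water'],
    ['water', 'umbrella'],
]

def count_itemsets(rows):
    """Count every non-empty combination of distinct items per row."""
    counts = Counter()
    for row in rows:
        # Sort so that the same items always yield the same tuple.
        items = sorted(set(row))
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    return counts

def keep_frequent(counts, n):
    """Keep only the itemsets that occur at least n times."""
    return {itemset: freq for itemset, freq in counts.items() if freq >= n}

counts = count_itemsets(records)
frequent = keep_frequent(counts, 2)
print(frequent)
# {('water',): 3, ('umbrella',): 2, ('umbrella', 'water'): 2}
```

Note that a row with k distinct items yields 2^k - 1 combinations, so this brute-force enumeration is only practical for short rows.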
Docs on collections.Counter(): https://docs.python.org/2/library/collections.html#collections.Counter
Docs on itertools.combinations(): https://docs.python.org/2/library/itertools.html#itertools.combinations