
Find frequency of subsets amongst multiple sets

Tags: python, set, subset

I have a list of skills as follows:

skills = ['Listening', 'Written_Expression','Clerical',
         'Night_Vision', 'Accounting']

I have a separate list of sets, each of which contains the skills related to a particular job:

job_skills = [
    {'Listening', 'Written_Expression', 'Clerical', 'Night_Vision'},
    {'Chemistry', 'Written_Expression', 'Clerical', 'Listening'},
    .
    .
]

I want to count the frequency with which each combination of 2 unique skills is a subset of a set in job_skills and return a list of lists/sets with the combinations and frequencies as follows:

skill_pairs = [{'Listening', 'Written_Expression', 2},
              {'Listening', 'Clerical', 2},
              .
              .
              {'Night_Vision', 'Accounting', 0}]

At the moment I'm doing the following:

skill_combos = []
for idx, i in enumerate(skills):
    for j in skills[idx+1:]:
        temp = []
        for job in job_skills:
            # True when both skills appear in this job's skill set
            temp.append({i, j}.issubset(job))
        skill_combos.append([i, j, sum(temp)])

This gets the job done, but it's slow given that I have approximately half a million skill combinations. Is there a faster way of doing this, ideally without three nested loops?

Thanks

Lonewoolf asked Dec 19 '25 22:12


1 Answer

You only need to count the combinations that are actually present in some job; every other combination has a count of zero. For example:

from collections import Counter
from itertools import combinations

job_skills = [{'Listening', 'Written_Expression', 'Clerical', 'Night_Vision'},
              {'Chemistry', 'Written_Expression', 'Clerical', 'Listening'}]


# Sort each set first: set iteration order is arbitrary, so without sorting
# the same pair could show up as both (a, b) and (b, a) and be counted separately.
counts = Counter(combo for skill_set in job_skills
                 for combo in combinations(sorted(skill_set), 2))

for key, value in counts.items():
    print(key, value)

Output

('Clerical', 'Listening') 2
('Clerical', 'Night_Vision') 1
('Clerical', 'Written_Expression') 2
('Listening', 'Night_Vision') 1
('Listening', 'Written_Expression') 2
('Night_Vision', 'Written_Expression') 1
('Chemistry', 'Clerical') 1
('Chemistry', 'Listening') 1
('Chemistry', 'Written_Expression') 1

See itertools.combinations and collections.Counter. A Counter already returns 0 for keys it has not seen; if you prefer a defaultdict that returns 0 for the missing combinations, copy counts into one:

from collections import defaultdict

total = defaultdict(int)
total.update(counts)
print(total[('Night_Vision', 'Accounting')])

Output

0
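
To get the list of pairs and frequencies that the question asks for, including the zero-count pairs, one option is to give every pair a single canonical key by sorting, and then look up each pair drawn from skills. The following is a minimal sketch under those assumptions, using the two example job sets; count_skill_pairs is a hypothetical helper name, not something from the question or from a library.

```python
from collections import Counter
from itertools import combinations

skills = ['Listening', 'Written_Expression', 'Clerical',
          'Night_Vision', 'Accounting']

job_skills = [{'Listening', 'Written_Expression', 'Clerical', 'Night_Vision'},
              {'Chemistry', 'Written_Expression', 'Clerical', 'Listening'}]


def count_skill_pairs(skills, job_skills):
    """Hypothetical helper: return [skill_a, skill_b, count] for every pair in skills."""
    wanted = set(skills)
    counts = Counter()
    for job in job_skills:
        # Ignore skills we are not interested in, and sort the rest so that
        # each pair is always stored under the same (a, b) key.
        relevant = sorted(wanted & job)
        counts.update(combinations(relevant, 2))
    # Looking up an unseen key in a Counter yields 0, so absent pairs need no special case.
    return [[a, b, counts[tuple(sorted((a, b)))]]
            for a, b in combinations(skills, 2)]


skill_pairs = count_skill_pairs(skills, job_skills)
for row in skill_pairs:
    print(row)
```

Filtering each job down to the skills of interest before generating pairs keeps the work proportional to the pairs that actually occur in each job, rather than testing every possible pair against every job.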

Dani Mesejo answered Dec 21 '25 12:12