
How to Speed Up Apriori to Generate Only Association Rules Whose Consequent (Right-Hand Side) Is a Value From One Column of the Data Set?

I have a CSV file with 600,000 rows and 15 columns, "Col1, Col2 ... Col15". I want to generate association rules whose right-hand side contains only values from Col15. I am using the Apriori implementation from here

It filters the itemsets by minimum support this way:

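# Excerpt from inside a larger function (the surrounding code is not shown).
oneCSet = returnItemsWithMinSupport(itemSet,
                                    transactionList,
                                    minSupport,
                                    freqSet)
print("reached line 80")  # debug marker
currentLSet = oneCSet
k = 2
# Level-wise loop: build size-k candidates from the frequent (k-1)-itemsets.
while currentLSet:
    print(k)  # debug: current itemset size
    largeSet[k-1] = currentLSet
    currentLSet = joinSet(currentLSet, k)
    currentCSet = returnItemsWithMinSupport(currentLSet,
                                            transactionList,
                                            minSupport,
                                            freqSet)
    currentLSet = currentCSet
    k = k + 1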

from collections import defaultdict

def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
    """Calculate the support for each item in itemSet and return the subset
    of itemSet whose elements satisfy the minimum support."""
    _itemSet = set()
    localSet = defaultdict(int)

    # Every candidate is checked against every transaction, so one call
    # costs O(len(itemSet) * len(transactionList)) subset tests.
    for item in itemSet:
        for transaction in transactionList:
            if item.issubset(transaction):
                freqSet[item] += 1
                localSet[item] += 1
    print("Done half")  # debug marker
    for item, count in localSet.items():
        support = float(count) / len(transactionList)

        if support >= minSupport:
            _itemSet.add(item)

    return _itemSet

But with this many rows it takes a long time. Since I want the RHS constrained to values from one specific column (Col15), can I make the implementation faster by somehow cutting down the number of frequent itemsets? Another option is to filter the rules at the end, but that would have the same time complexity. Or is there some other implementation or library that would speed things up?

Dreams asked Oct 28 '25 05:10


1 Answer

  1. Split your data set by the value in column 15, which will be your right-hand side (RHS). If that column has 5 distinct values, you now have 5 data sets. Drop the last column from each, since it is constant within a partition.

  2. Compute frequent itemsets (not association rules) on the remaining columns only, by running Apriori on each subset (faster!). You will still need a much better implementation than the random GitHub version you linked; it only needs to do frequent itemset mining (FIM), not rule generation.

  3. Combine each frequent itemset (FIS) with its partition key into an association rule (FIS -> RHS), and evaluate it like any association rule with your preferred metric.

This is a lot faster because it never generates frequent itemsets that span multiple Col15 keys. Within each partition, all remaining data is relevant to your objective. Plus, it works with unmodified Apriori frequent itemset generation.
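The three steps above can be sketched roughly as follows, in Python 3. This is a minimal illustration, not the linked library's code: the function names `frequent_itemsets` and `rules_for_target`, the input shape (a list of `(lhs_items, rhs_value)` pairs) and the choice of confidence as the metric are all assumptions made for the example. The small level-wise miner is only a stand-in, and any faster frequent itemset miner could be substituted for it.

```python
from collections import defaultdict
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Plain level-wise Apriori (stand-in miner).

    transactions: list of frozensets. Returns {frozenset: absolute count}
    for every itemset contained in at least min_count transactions.
    """
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(current)
    k = 2
    while current:
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k - 1))}
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        result.update(current)
        k += 1
    return result


def rules_for_target(rows, min_support, min_confidence):
    """rows: list of (lhs_items, rhs_value) pairs (hypothetical input shape).

    Returns a list of (lhs_itemset, rhs_value, support, confidence).
    """
    n = len(rows)
    all_lhs = [frozenset(lhs) for lhs, _ in rows]

    # Step 1: partition by the RHS column; the RHS is dropped from each row.
    partitions = defaultdict(list)
    for lhs, (_, rhs) in zip(all_lhs, rows):
        partitions[rhs].append(lhs)

    # Rule support is count(LHS and RHS) / n, so the threshold is taken
    # relative to the full data set, not to the partition size.
    min_count = min_support * n
    rules = []
    for rhs, transactions in partitions.items():
        # Step 2: frequent itemsets only, mined inside one partition.
        for itemset, count in frequent_itemsets(transactions, min_count).items():
            # Step 3: confidence needs the LHS count over ALL rows, because
            # the itemset may also occur (infrequently) in other partitions.
            lhs_total = sum(1 for t in all_lhs if itemset <= t)
            confidence = count / lhs_total
            if confidence >= min_confidence:
                rules.append((itemset, rhs, count / n, confidence))
    return rules
```

With real CSV data, each item would probably need to carry its column name (for example the string `"Col3=x"`) so that equal values in different columns are not confused. Note that the count inside a partition already equals the count of the full rule, which is why no itemset containing a Col15 value ever has to be generated.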

Has QUIT--Anony-Mousse answered Oct 30 '25 22:10


