I have a CSV file with 600,000 rows and 15 columns (Col1, Col2, ..., Col15). I want to generate association rules where the right-hand side contains only values from Col15. I am using the apriori implementation from here.
It computes the support for each itemset and filters against minSupport this way:
    # Find the frequent 1-itemsets first.
    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)
    print("reached line 80")
    currentLSet = oneCSet
    k = 2
    # Grow itemsets level by level until no k-itemset meets minSupport.
    while currentLSet != set():
        print(k)
        largeSet[k - 1] = currentLSet
        # Self-join the frequent (k-1)-itemsets into k-candidates, then
        # keep only the candidates that meet minSupport.
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1
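For reference, the joinSet helper called above builds the size-k candidate itemsets by self-joining the current frequent set; in that implementation it is presumably along these lines:

    def joinSet(itemSet, length):
        """Join the set with itself to form candidate itemsets of size `length`."""
        return set(i.union(j) for i in itemSet for j in itemSet
                   if len(i.union(j)) == length)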
    from collections import defaultdict

    def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
        """Calculates the support for items in the itemSet and returns a subset
        of the itemSet each of whose elements satisfies the minimum support."""
        _itemSet = set()
        localSet = defaultdict(int)
        # Count, for every candidate itemset, the transactions containing it.
        for item in itemSet:
            for transaction in transactionList:
                if item.issubset(transaction):
                    freqSet[item] += 1
                    localSet[item] += 1
        print("Done half")
        # Keep only the itemsets whose relative support reaches minSupport.
        for item, count in localSet.items():
            support = float(count) / len(transactionList)
            if support >= minSupport:
                _itemSet.add(item)
        return _itemSet
But with this many rows it takes a very long time. Since I want the RHS constrained to values from a single column (Col15), can I make the implementation faster by somehow cutting down the frequent itemsets? One alternative is to filter the rules at the end, but that has the same time complexity. Or is there some other implementation/library that would speed things up?
Split your data set based on the value in column 15, which will be your right-hand side (RHS). If you have 5 distinct values in that column, you get 5 data sets. Remove the last column from each; it is now constant.
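A minimal sketch of this partitioning step, assuming the data sits in a pandas DataFrame with columns named Col1 ... Col15 (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")  # 600,000 rows, columns Col1 ... Col15

    # One partition per distinct Col15 value; Col15 itself is dropped,
    # since it is constant within each partition.
    partitions = {
        rhs_value: group.drop(columns="Col15")
        for rhs_value, group in df.groupby("Col15")
    }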
Compute frequent itemsets (not association rules) on the other columns only, by running Apriori on each subset (faster!). You will still want a much better implementation than that random GitHub version you linked, but it only needs FIM (frequent itemset mining), not rule generation!
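For example, with mlxtend's vectorized apriori as the FIM implementation (one option among several; the min_support value here is only a placeholder):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    def mine_partitions(partitions, min_support=0.01):
        """Return one frequent-itemset DataFrame per Col15 value."""
        fis_by_rhs = {}
        for rhs_value, part in partitions.items():
            # Encode each row as a transaction of "column=value" items.
            transactions = [[f"{col}={val}" for col, val in row.items()]
                            for row in part.to_dict(orient="records")]
            te = TransactionEncoder()
            onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                                  columns=te.columns_)
            # apriori returns columns 'support' and 'itemsets' (frozensets).
            fis_by_rhs[rhs_value] = apriori(onehot, min_support=min_support,
                                            use_colnames=True)
        return fis_by_rhs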
Compose each frequent itemset with its partition key into an association rule (FIS -> RHS) and evaluate it like any association rule with your preferred metric.
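A sketch of that composition step, continuing the hypothetical helpers above. Note that a rule's confidence needs the antecedent's count in the full data set, not just in its own partition:

    def compose_rules(fis_by_rhs, partitions):
        """Turn each per-partition FIS into a rule FIS -> Col15=rhs_value."""
        # Rebuild the full data as transactions for the antecedent counts
        # (the confidence denominator); a slow linear scan, fine for a sketch.
        all_tx = [set(f"{c}={v}" for c, v in row.items())
                  for part in partitions.values()
                  for row in part.to_dict(orient="records")]
        n_total = len(all_tx)
        rules = []
        for rhs_value, fis_df in fis_by_rhs.items():
            n_part = len(partitions[rhs_value])
            for _, row in fis_df.iterrows():
                antecedent = set(row["itemsets"])
                joint = row["support"] * n_part              # count(FIS and RHS)
                body = sum(antecedent <= t for t in all_tx)  # count(FIS)
                rules.append((antecedent, rhs_value,
                              joint / n_total,               # rule support
                              joint / body))                 # confidence
        return rules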
This is a lot faster because it never generates frequent itemsets that span multiple Col15 keys, and within each partition all remaining data is relevant to your objective. Plus, it works with unmodified Apriori FIM generation.