I have a CSV file with 600,000 rows and 15 columns (Col1, Col2, ..., Col15). I want to generate association rules where the right-hand side contains only values from Col15. I am using the apriori implementation from here.
It computes the support for each itemset and filters against minSupport this way:
    # Find the frequent 1-itemsets first.
    oneCSet = returnItemsWithMinSupport(itemSet,
                                        transactionList,
                                        minSupport,
                                        freqSet)
    print("reached line 80")
    currentLSet = oneCSet
    k = 2
    # Grow itemsets level by level until no k-itemset meets minSupport.
    while currentLSet != set():
        print(k)
        largeSet[k - 1] = currentLSet
        # Self-join the frequent (k-1)-itemsets into k-candidates, then
        # keep only the candidates that meet minSupport.
        currentLSet = joinSet(currentLSet, k)
        currentCSet = returnItemsWithMinSupport(currentLSet,
                                                transactionList,
                                                minSupport,
                                                freqSet)
        currentLSet = currentCSet
        k = k + 1
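For reference, the joinSet helper called above builds the size-k candidate itemsets by self-joining the current frequent set; in that implementation it is presumably along these lines:

    def joinSet(itemSet, length):
        """Join the set with itself to form candidate itemsets of size `length`."""
        return set(i.union(j) for i in itemSet for j in itemSet
                   if len(i.union(j)) == length)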
    from collections import defaultdict

    def returnItemsWithMinSupport(itemSet, transactionList, minSupport, freqSet):
        """Calculates the support for items in the itemSet and returns a subset
        of the itemSet each of whose elements satisfies the minimum support."""
        _itemSet = set()
        localSet = defaultdict(int)
        # Count, for every candidate itemset, the transactions containing it.
        for item in itemSet:
            for transaction in transactionList:
                if item.issubset(transaction):
                    freqSet[item] += 1
                    localSet[item] += 1
        print("Done half")
        # Keep only the itemsets whose relative support reaches minSupport.
        for item, count in localSet.items():
            support = float(count) / len(transactionList)
            if support >= minSupport:
                _itemSet.add(item)
        return _itemSet
But with this many rows it takes a very long time. Since I want the RHS constrained to values from a single column (Col15), can I make the implementation faster by somehow cutting down the frequent itemsets? One alternative is to filter the rules at the end, but that has the same time complexity. Or is there some other implementation/library that would speed things up?
Split your data set based on the value in column 15, which will be your right-hand side (RHS). If you have 5 distinct values in that column, you get 5 data sets. Remove the last column from each; it is now constant.
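A minimal sketch of this partitioning step, assuming the data sits in a pandas DataFrame with columns named Col1 ... Col15 (the file name is hypothetical):

    import pandas as pd

    df = pd.read_csv("data.csv")  # 600,000 rows, columns Col1 ... Col15

    # One partition per distinct Col15 value; Col15 itself is dropped,
    # since it is constant within each partition.
    partitions = {
        rhs_value: group.drop(columns="Col15")
        for rhs_value, group in df.groupby("Col15")
    }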
Compute frequent itemsets (not association rules) on the other columns only, by running Apriori on each subset (faster!). You will still want a much better implementation than that random GitHub version you linked, but it only needs FIM (frequent itemset mining), not rule generation!
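For example, with mlxtend's vectorized apriori as the FIM implementation (one option among several; the min_support value here is only a placeholder):

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori

    def mine_partitions(partitions, min_support=0.01):
        """Return one frequent-itemset DataFrame per Col15 value."""
        fis_by_rhs = {}
        for rhs_value, part in partitions.items():
            # Encode each row as a transaction of "column=value" items.
            transactions = [[f"{col}={val}" for col, val in row.items()]
                            for row in part.to_dict(orient="records")]
            te = TransactionEncoder()
            onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                                  columns=te.columns_)
            # apriori returns columns 'support' and 'itemsets' (frozensets).
            fis_by_rhs[rhs_value] = apriori(onehot, min_support=min_support,
                                            use_colnames=True)
        return fis_by_rhs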
Compose each frequent itemset with its partition key into an association rule (FIS -> RHS) and evaluate it like any association rule with your preferred metric.
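A sketch of that composition step, continuing the hypothetical helpers above. Note that a rule's confidence needs the antecedent's count in the full data set, not just in its own partition:

    def compose_rules(fis_by_rhs, partitions):
        """Turn each per-partition FIS into a rule FIS -> Col15=rhs_value."""
        # Rebuild the full data as transactions for the antecedent counts
        # (the confidence denominator); a slow linear scan, fine for a sketch.
        all_tx = [set(f"{c}={v}" for c, v in row.items())
                  for part in partitions.values()
                  for row in part.to_dict(orient="records")]
        n_total = len(all_tx)
        rules = []
        for rhs_value, fis_df in fis_by_rhs.items():
            n_part = len(partitions[rhs_value])
            for _, row in fis_df.iterrows():
                antecedent = set(row["itemsets"])
                joint = row["support"] * n_part              # count(FIS and RHS)
                body = sum(antecedent <= t for t in all_tx)  # count(FIS)
                rules.append((antecedent, rhs_value,
                              joint / n_total,               # rule support
                              joint / body))                 # confidence
        return rules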
This is a lot faster because it never generates frequent itemsets that span multiple Col15 keys, and within each partition all remaining data is relevant to your objective. Plus, it works with unmodified Apriori FIM generation.