== Algorithms ==
Many algorithms for generating association rules have been proposed. Some well-known algorithms are [[Apriori algorithm|Apriori]], Eclat and FP-Growth, but they only do half the job, since they are algorithms for mining frequent itemsets. A further step needs to be performed afterwards to generate rules from the frequent itemsets found in a database.

=== Apriori algorithm ===
Apriori was proposed by R. Agrawal and R. Srikant in 1994 for frequent item set mining and association rule learning. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.

[[File:APriori.png|thumb|357x357px|The control flow diagram for the Apriori algorithm]]

'''Overview:''' [[Apriori algorithm|Apriori]] uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as ''candidate generation''), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses [[breadth-first search]] and a [[Hash tree (persistent data structure)|Hash tree]] structure to count candidate item sets efficiently. It generates candidate item sets of length <math>k</math> from item sets of length <math>k-1</math>. Then it prunes the candidates which have an infrequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent <math>k</math>-length item sets. After that, it scans the transaction database to determine frequent item sets among the candidates.

'''Example:''' Assume that each row is a cancer sample with a certain combination of mutations labeled by a character in the alphabet. For example, a row could contain {a, c}, which means it is affected by mutation 'a' and mutation 'c'.

{| class="wikitable"
|+Input Set
!{a, b}
!{c, d}
!{a, d}
!{a, e}
!{b, d}
!{a, b, d}
!{a, c, d}
!{a, b, c, d}
|}

Now we will generate the frequent item sets by counting the number of occurrences of each character. This is also known as finding the support values. Then we will prune the item sets by picking a minimum support threshold. For this pass of the algorithm we will pick 3.

{| class="wikitable"
|+Support Values
!a
!b
!c
!d
|-
|6
|4
|3
|6
|}

Since all support values are three or above, there is no pruning. The frequent item sets are {a}, {b}, {c}, and {d}. After this we will repeat the process by counting pairs of mutations in the input set.

{| class="wikitable"
|+Support Values
!{a, b}
!{a, c}
!{a, d}
!{b, c}
!{b, d}
!{c, d}
|-
|3
|2
|4
|1
|3
|3
|}

Now we will raise our minimum support value to 4, so only {a, d} remains after pruning. We will then use the frequent item sets to make combinations of triplets and repeat the process by counting occurrences of triplets of mutations in the input set.

{| class="wikitable"
|+Support Values
!{a, c, d}
|-
|2
|}

Since we only have one item, the next set of combinations of quadruplets is empty, so the algorithm stops.
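The candidate-generation, pruning and counting loop described above can be sketched in a few lines of Python. The sketch below is illustrative rather than an optimized implementation (it uses plain sets instead of a hash tree, and the function and variable names are arbitrary), and it keeps a single minimum support of 3 throughout, rather than tightening the threshold between passes as in the walkthrough above.

<syntaxhighlight lang="python">
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count individual items and keep those meeting the threshold.
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: drop candidates with an infrequent (k-1)-subset (downward closure).
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Counting: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

# The input set from the example above, mined with a fixed minimum support of 3.
db = [{'a', 'b'}, {'c', 'd'}, {'a', 'd'}, {'a', 'e'}, {'b', 'd'},
      {'a', 'b', 'd'}, {'a', 'c', 'd'}, {'a', 'b', 'c', 'd'}]
for itemset, support in sorted(apriori(db, 3).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(set(itemset), support)
</syntaxhighlight>

Run on the input set above, the sketch reports the frequent itemsets {a}, {b}, {c}, {d}, {a, b}, {a, d}, {b, d} and {c, d} together with their supports.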
'''Advantages and Limitations:'''

Apriori has some limitations. Candidate generation can result in large candidate sets. For example, <math>10^4</math> frequent 1-itemsets will generate around <math>10^7</math> candidate 2-itemsets. The algorithm also needs to scan the database frequently, specifically <math>n+1</math> scans, where <math>n</math> is the length of the longest pattern. Apriori is slower than the Eclat algorithm. However, Apriori performs well compared to Eclat when the dataset is large, because in the Eclat algorithm the tid-lists of a very large dataset become too large to fit in memory. FP-growth outperforms both Apriori and Eclat. This is due to the FP-growth algorithm not requiring candidate generation or testing, using a compact data structure, and performing only one database scan.<ref>{{cite arXiv|last=Heaton|first=Jeff|date=2017-01-30|title=Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms|class=cs.DB|eprint=1701.09042}}</ref>

=== Eclat algorithm ===
Eclat<ref name="eclat" /> (alt. ECLAT, standing for Equivalence Class Transformation) is a [[backtracking]] algorithm which traverses the frequent itemset lattice graph in a [[depth-first search]] (DFS) fashion. Whereas the [[breadth-first search]] (BFS) traversal used in the Apriori algorithm ends up checking every subset of an itemset before checking the itemset itself, DFS traversal checks larger itemsets first and can save on checking the support of some of their subsets by virtue of the downward-closure property. Furthermore, it will almost certainly use less memory, as DFS has a lower space complexity than BFS.

To illustrate this, let there be a frequent itemset {a, b, c}. A DFS may check the nodes in the frequent itemset lattice in the following order: {a} → {a, b} → {a, b, c}, at which point it is known that {b}, {c}, {a, c} and {b, c} all satisfy the support constraint by the downward-closure property. BFS would explore each subset of {a, b, c} before finally checking it. As the size of an itemset increases, the number of its subsets undergoes [[combinatorial explosion]].

Eclat is suitable for both sequential and parallel execution, with locality-enhancing properties.<ref>{{cite report |citeseerx=10.1.1.42.3283 |hdl=1802/501 |title=New Algorithms for Fast Discovery of Association Rules |pages=283–286 |year=1997 |first1=Mohammed Javeed |last1=Zaki |first2=Srinivasan |last2=Parthasarathy |first3=Mitsunori |last3=Ogihara |first4=Wei |last4=Li }}</ref><ref>{{cite journal|title=Parallel Algorithms for Discovery of Association Rules|doi=10.1023/A:1009773317876 |year=1997 |last1=Zaki |first1=Mohammed J. |journal=Data Mining and Knowledge Discovery |volume=1 |issue=4 |pages=343–373 |last2=Parthasarathy |first2=Srinivasan |last3=Ogihara |first3=Mitsunori |last4=Li |first4=Wei |s2cid=10038675 }}</ref>
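As a rough illustration of this vertical, depth-first strategy, the following Python sketch (with assumed names, not the authors' implementation) converts the transactions into one tid-list per item and grows itemsets recursively by intersecting the tid-lists of their possible extensions; the length of an intersected tid-list is the support of the corresponding itemset.

<syntaxhighlight lang="python">
def eclat(transactions, min_support):
    """Mine frequent itemsets from vertical tid-lists using depth-first search."""
    # Build the vertical representation: item -> set of transaction ids.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def grow(prefix, extensions):
        # `extensions` holds (item, tid-list) pairs that can extend `prefix`.
        for i, (item, tids) in enumerate(extensions):
            support = len(tids)
            if support < min_support:
                continue
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = support
            # Intersect tid-lists to project the database onto `itemset`, then
            # descend; only items after `item` are considered, so each itemset
            # is generated exactly once.
            suffix = [(other, tids & other_tids)
                      for other, other_tids in extensions[i + 1:]]
            grow(itemset, suffix)

    grow(set(), sorted(tidlists.items()))
    return frequent

# The input set from the Apriori example, mined with minimum support 3.
db = [{'a', 'b'}, {'c', 'd'}, {'a', 'd'}, {'a', 'e'}, {'b', 'd'},
      {'a', 'b', 'd'}, {'a', 'c', 'd'}, {'a', 'b', 'c', 'd'}]
print(eclat(db, 3))
</syntaxhighlight>

On the mutation data from the Apriori example it finds the same frequent itemsets, but each support is obtained from a set intersection rather than a fresh scan of the database.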
=== FP-growth algorithm ===
FP stands for frequent pattern.<ref>{{cite book|last1=Han|title=Proceedings of the 2000 ACM SIGMOD international conference on Management of data |chapter=Mining frequent patterns without candidate generation |date=2000|volume=SIGMOD '00|pages=1–12|doi=10.1145/342009.335372|isbn=978-1581132175|citeseerx=10.1.1.40.4436|s2cid=6059661}}</ref>

In the first pass, the algorithm counts the occurrences of items (attribute-value pairs) in the dataset of transactions and stores these counts in a 'header table'. In the second pass, it builds the FP-tree structure by inserting transactions into a [[trie]].

Items in each transaction have to be sorted by descending order of their frequency in the dataset before being inserted, so that the tree can be processed quickly. Items in each transaction that do not meet the minimum support requirement are discarded. If many transactions share their most frequent items, the FP-tree provides high compression close to the tree root.

Recursive processing of this compressed version of the main dataset grows frequent item sets directly, instead of generating candidate items and testing them against the entire database (as in the Apriori algorithm).

Growth begins from the bottom of the header table, i.e. the item with the smallest support, by finding all sorted transactions that end in that item. Call this item <math>I</math>.

A new conditional tree is created, which is the original FP-tree projected onto <math>I</math>. The supports of all nodes in the projected tree are re-counted, with each node getting the sum of its children's counts. Nodes (and hence subtrees) that do not meet the minimum support are pruned. Recursive growth ends when no individual items conditional on <math>I</math> meet the minimum support threshold. The resulting paths from the root to <math>I</math> will be frequent itemsets. After this step, processing continues with the next least-supported header item of the original FP-tree.

Once the recursive process has completed, all frequent item sets will have been found, and association rule creation begins.<ref>Witten, Frank, Hall: Data mining practical machine learning tools and techniques, 3rd edition{{page needed|date=January 2019}}</ref>
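A compact Python sketch of this pattern-growth process is given below, under the same illustrative assumptions as the earlier examples (it is not the reference implementation; the header table is kept as a simple item-to-nodes mapping, and conditional pattern bases are represented as (path, count) pairs). The first pass counts item supports, the second pass builds the frequency-sorted trie, and frequent itemsets are then grown recursively from each header item, least frequent first.

<syntaxhighlight lang="python">
from collections import defaultdict

class Node:
    """A node of the FP-tree: a prefix trie over frequency-sorted transactions."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(transactions, min_support):
    """Build an FP-tree from (itemset, count) pairs.

    Returns the root, a header table mapping each frequent item to its
    tree nodes, and the raw item supports counted in the first pass.
    """
    # First pass: count item supports.
    support = defaultdict(int)
    for itemset, count in transactions:
        for item in itemset:
            support[item] += count
    frequent = {item for item, s in support.items() if s >= min_support}
    root, header = Node(None, None), defaultdict(list)
    # Second pass: insert each transaction with its frequent items sorted
    # by descending support (ties broken alphabetically for determinism).
    for itemset, count in transactions:
        items = sorted((i for i in itemset if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return root, header, support

def fp_growth(transactions, min_support, prefix=frozenset()):
    """Yield (frequent itemset, support) pairs by recursive pattern growth."""
    root, header, support = build_tree(transactions, min_support)
    # Grow from the least-supported header item upwards.
    for item in sorted(header, key=lambda i: (support[i], i)):
        item_support = sum(node.count for node in header[item])
        itemset = prefix | {item}
        yield itemset, item_support
        # Conditional pattern base: the prefix path above every node of `item`,
        # carried with that node's count.
        conditional = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Recurse on the projected dataset conditional on `itemset`.
        yield from fp_growth(conditional, min_support, itemset)

# The mutation data from the Apriori example, with minimum support 3.
db = [({'a','b'}, 1), ({'c','d'}, 1), ({'a','d'}, 1), ({'a','e'}, 1),
      ({'b','d'}, 1), ({'a','b','d'}, 1), ({'a','c','d'}, 1), ({'a','b','c','d'}, 1)]
for itemset, s in fp_growth(db, 3):
    print(set(itemset), s)
</syntaxhighlight>

Transactions are carried as (itemset, count) pairs so that the conditional pattern bases built during recursion keep the accumulated counts from the tree; on the example data the output matches the itemsets found by the Apriori and Eclat sketches.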
=== Others ===

==== ASSOC ====
The ASSOC procedure<ref>{{cite book |last=Hájek |first=Petr |author2=Havránek, Tomáš |title=Mechanizing Hypothesis Formation: Mathematical Foundations for a General Theory |publisher=Springer-Verlag |year=1978 |isbn=978-3-540-08738-0 |url=http://www.cs.cas.cz/hajek/guhabook/ }}</ref> is a GUHA method which mines for generalized association rules using fast [[bitstring]] operations. The association rules mined by this method are more general than those output by Apriori: for example, "items" can be connected with both conjunction and disjunction, and the relation between the antecedent and consequent of the rule is not restricted to setting minimum support and confidence as in Apriori; an arbitrary combination of supported interest measures can be used.

==== OPUS search ====
OPUS is an efficient algorithm for rule discovery that, in contrast to most alternatives, does not require either monotone or anti-monotone constraints such as minimum support.<ref name=OPUS>Webb, Geoffrey I. (1995); ''OPUS: An Efficient Admissible Algorithm for Unordered Search'', Journal of Artificial Intelligence Research 3, Menlo Park, CA: AAAI Press, pp. 431-465 [http://webarchive.loc.gov/all/20011118141304/http://www.cs.washington.edu/research/jair/abstracts/webb95a.html online access]</ref> Initially used to find rules for a fixed consequent,<ref name="OPUS" /><ref name="Bayardo">{{Cite journal |doi=10.1023/A:1009895914772 |last1=Bayardo |first1=Roberto J. Jr. |last2=Agrawal |first2=Rakesh |last3=Gunopulos |first3=Dimitrios |year=2000 |title=Constraint-based rule mining in large, dense databases |journal=Data Mining and Knowledge Discovery |volume=4 |issue=2 |pages=217–240 |s2cid=5120441 }}</ref> it has subsequently been extended to find rules with any item as a consequent.<ref name="webb">{{cite book |doi=10.1145/347090.347112 |chapter=Efficient search for association rules |title=Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining - KDD '00 |pages=99–107 |year=2000 |last1=Webb |first1=Geoffrey I. |isbn=978-1581132335 |citeseerx=10.1.1.33.1309 |s2cid=5444097 }}</ref> OPUS search is the core technology in the popular Magnum Opus association discovery system.