=== Apriori algorithm ===
Apriori was proposed by R. Agrawal and R. Srikant in 1994 for frequent item set mining and association rule learning. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often. The algorithm is named Apriori because it uses prior knowledge of frequent item set properties.

[[File:APriori.png|thumb|357x357px|The control flow diagram for the Apriori algorithm]]
'''Overview:''' [[Apriori algorithm|Apriori]] uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as ''candidate generation''), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. Apriori uses [[breadth-first search]] and a [[Hash tree (persistent data structure)|hash tree]] structure to count candidate item sets efficiently. It generates candidate item sets of length <math>k</math> from frequent item sets of length <math>k-1</math>, then prunes every candidate that has an infrequent subpattern. By the downward closure lemma, the remaining candidate set still contains all frequent item sets of length <math>k</math>. Finally, it scans the transaction database to determine which of the candidates are frequent.
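The level-wise loop described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name <code>apriori</code>, its parameters, and the shopping-basket data are assumptions made for the example, and candidates are counted by a plain scan of the transactions rather than with a hash tree.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent item set (frozenset) to its support count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: n for s, n in counts.items() if n >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-item sets whose union has k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: drop candidates with an infrequent (k-1)-subset (downward closure).
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Counting: one scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical shopping baskets, used only to exercise the sketch.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(baskets, min_support=2)
print(len(result))  # 5
```

In this run the candidate {bread, milk, butter} is pruned without being counted, because its subset {milk, butter} is not frequent.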

'''Example:''' Assume that each row is a cancer sample with a certain combination of mutations, each labeled by a letter of the alphabet. For example, a row containing {a, c} means that the sample is affected by mutation 'a' and mutation 'c'.
{| class="wikitable"
|+Input Set
!{a, b}
!{c, d}
!{a, d}
!{a, e}
!{b, d}
!{a, b, d}
!{a, c, d}
!{a, b, c, d}
|}
We first count the number of occurrences of each individual mutation in the input set; these counts are the ''support values''. We then prune the items whose support falls below a chosen minimum support threshold. For this pass of the algorithm we pick a threshold of 3.
{| class="wikitable"
|+Support Values
!a
!b
!c
!d
|-
|6
|4
|3
|6
|}
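Mutation 'e' occurs in only one sample, so it falls below the threshold and is pruned (it is omitted from the table above). All the other support values are three or above, so no further pruning takes place, and the frequent item sets are {a}, {b}, {c}, and {d}. We then repeat the process by counting pairs of mutations in the input set.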
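The single-item counts can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable names are assumptions, and the transactions are copied from the input set above. The count for 'e', which does not appear in the table, comes out as 1.

```python
from collections import Counter

# The eight samples from the input set above.
transactions = [
    {"a", "b"}, {"c", "d"}, {"a", "d"}, {"a", "e"},
    {"b", "d"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "b", "c", "d"},
]
min_support = 3

# Support of an item = number of transactions that contain it.
item_support = Counter(item for t in transactions for item in t)

# Keep only the items that meet the minimum support threshold.
frequent_items = {item: n for item, n in item_support.items() if n >= min_support}

print(sorted(frequent_items.items()))  # [('a', 6), ('b', 4), ('c', 3), ('d', 6)]
print(item_support["e"])               # 1
```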
{| class="wikitable"
|+Support Values
!{a, b}
!{a, c}
!{a, d}
!{b, c}
!{b, d}
!{c, d}
|-
|3
|2
|4
|1
|3
|3
|}
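As a hedged sketch, the pair counts in the table above can be checked in Python by enumerating every pair of the frequent single items and counting the transactions that contain both. The variable names are illustrative assumptions; the data is the input set from the example.

```python
from itertools import combinations

# The eight samples from the input set above.
transactions = [
    {"a", "b"}, {"c", "d"}, {"a", "d"}, {"a", "e"},
    {"b", "d"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "b", "c", "d"},
]
frequent_items = ["a", "b", "c", "d"]

# Support of a pair = number of transactions containing both of its items.
pair_support = {
    pair: sum(1 for t in transactions if set(pair) <= t)
    for pair in combinations(frequent_items, 2)
}

for pair, n in pair_support.items():
    print(pair, n)
# ('a', 'b') 3
# ('a', 'c') 2
# ('a', 'd') 4
# ('b', 'c') 1
# ('b', 'd') 3
# ('c', 'd') 3
```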
Next we raise the minimum support threshold to 4, so only {a, d} remains after pruning. The frequent item sets are then used to form candidate triplets, and the process is repeated by counting the occurrences of each triplet of mutations in the input set.
{| class="wikitable"
|+Support Values
!{a, c, d}
|-
|2
|}
Since only one triplet remains, no quadruplet candidates can be formed from it, so the algorithm stops.

'''Advantages and Limitations:'''

Apriori has some limitations. Candidate generation can result in very large candidate sets: for example, 10<sup>4</sup> frequent 1-item sets generate on the order of 10<sup>7</sup> candidate 2-item sets. The algorithm also needs to scan the database repeatedly, specifically ''n'' + 1 times, where ''n'' is the length of the longest pattern. Apriori is generally slower than the Eclat algorithm. However, Apriori performs well compared to Eclat when the dataset is large, because Eclat's tid-lists can then become too large to fit in memory. FP-growth outperforms both Apriori and Eclat, because it avoids candidate generation and testing, uses a compact data structure, and scans the database only twice.<ref>{{cite arXiv|last=Heaton|first=Jeff|date=2017-01-30|title=Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms|class=cs.DB|eprint=1701.09042}}</ref>
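The size of the candidate set in the example above follows from counting unordered pairs. As a quick check, assuming every pair of frequent items is generated as a candidate, the number of candidate pairs for 10<sup>4</sup> frequent items is the binomial coefficient "10<sup>4</sup> choose 2":

```python
from math import comb

n_frequent_items = 10**4

# Every unordered pair of frequent items becomes a candidate 2-item set.
n_candidate_pairs = comb(n_frequent_items, 2)

print(n_candidate_pairs)  # 49995000
```

This is roughly 5 × 10<sup>7</sup>, which is why the number of candidates grows so quickly with the number of frequent items.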