== Process ==
Association rules are made by searching data for frequent if-then patterns and by using the criteria Support and Confidence to identify the most important relationships. Support is the evidence of how frequently an itemset appears in the given data, while Confidence indicates how often the if-then statement is found to be true. A third criterion, called Lift, compares the actual Confidence with the expected Confidence: it expresses how many times more often the if-then statement is found to be true than would be expected if the antecedent and the consequent were statistically independent.
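The following minimal sketch illustrates these three measures on a small, invented set of market-basket transactions; the transaction data, function names, and values are hypothetical and serve only to make the definitions concrete.

<syntaxhighlight lang="python">
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Observed confidence divided by the confidence expected under independence."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# Rule {bread} -> {butter}: support 0.6, confidence 0.75, lift 0.9375.
print(support({"bread", "butter"}, transactions))
print(confidence({"bread"}, {"butter"}, transactions))
print(lift({"bread"}, {"butter"}, transactions))
</syntaxhighlight>

A lift below 1, as in this toy example, means the two items co-occur slightly less often than independence would predict.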
Association rules are calculated from itemsets, which consist of two or more items. If rules were built by analyzing all possible itemsets in the data, there would be so many rules that they would carry little meaning. For that reason, association rules are typically built from itemsets that are well represented in the data.

There are many different data mining techniques for producing particular analytics and results, for example classification analysis, clustering analysis, and regression analysis.<ref>{{Cite web|date=2021-11-08|title=Data Mining Techniques: Top 5 to Consider|url=https://www.precisely.com/blog/datagovernance/top-5-data-mining-techniques|access-date=2021-12-10|website=Precisely|language=en-US}}</ref> Which technique to use depends on what you are looking for in your data. Association rules are primarily used to analyze and predict customer behavior. Classification analysis is most likely to be used for questioning, making decisions, and predicting behavior.<ref name=":2">{{Cite web|title=16 Data Mining Techniques: The Complete List - Talend|url=https://www.talend.com/resources/data-mining-techniques/|access-date=2021-12-10|website=Talend - A Leader in Data Integration & Data Integrity|language=en}}</ref> Clustering analysis is primarily used when no assumptions are made about the likely relationships within the data.<ref name=":2"/> Regression analysis is used to predict the value of a continuous dependent variable from a number of independent variables.<ref name=":2"/>

'''Benefits'''

There are many benefits of using association rules, such as finding patterns that help in understanding the correlations and co-occurrences between data sets. A good real-world example is medicine, where association rules are used to help diagnose patients. When diagnosing a patient there are many variables to consider, since many diseases share similar symptoms. Using association rules, doctors can determine the conditional probability of an illness by comparing symptom relationships from past cases.<ref>{{Cite web|title=What are Association Rules in Data Mining (Association Rule Mining)?|url=https://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining|access-date=2021-12-10|website=SearchBusinessAnalytics|language=en}}</ref>

'''Downsides'''

However, association rules also have downsides, such as finding the appropriate parameter and threshold settings for the mining algorithm, and the large number of discovered rules: a large rule set does not guarantee that the rules are relevant, and it can also cause the algorithm to perform poorly. Implemented algorithms sometimes contain too many variables and parameters, which can make the results hard to understand for someone without a good grasp of data mining.<ref>{{Cite web|title=Drawbacks and solutions of applying association rule mining in learning management systems|url=https://www.researchgate.net/publication/289657906|access-date=2021-12-10|website=ResearchGate|language=en}}</ref>

'''Thresholds'''

[[File:FrequentItems.png|thumb|Frequent itemset lattice, where the color of the box indicates how many transactions contain the combination of items. Note that lower levels of the lattice can contain at most the minimum number of their parents' items; e.g. {ac} can have only at most <math>\min(a,c)</math> items. This is called the ''downward-closure property''.<ref name="mining" />]]

When using association rules, you are most likely to use only Support and Confidence, which means a user-specified minimum support and a user-specified minimum confidence must be satisfied at the same time. Association rule generation is usually split into two separate steps (a sketch follows Table 1):
# A minimum Support threshold is applied to find all the frequent itemsets in the database.
# A minimum Confidence threshold is applied to the frequent itemsets in order to form rules.

{| class="wikitable"
|+Table 1. Example of thresholds for Support and Confidence. The left-hand columns show the original unordered data; the right-hand columns show the same data ordered by the thresholds.
! scope="col" | Items
! scope="col" | Support
! scope="col" | Confidence
| rowspan="5" style="border: none; background: none;" |
! scope="col" | Items
! scope="col" | Support
! scope="col" | Confidence
|-
| Item A || 30% || 50% || Item C || 45% || 55%
|-
| Item B || 15% || 25% || Item A || 30% || 50%
|-
| Item C || 45% || 55% || Item D || 35% || 40%
|-
| Item D || 35% || 40% || Item B || 15% || 25%
|}

'''The Support threshold is 30%; the Confidence threshold is 50%.''' Item C is listed first because it exceeds both thresholds. Item A is second because its values meet both thresholds exactly. Item D meets the Support threshold but not the Confidence threshold, and Item B meets neither, which is why it is last.
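As a minimal sketch of this two-step filtering, the snippet below keeps only the items from Table 1 that meet both user-chosen thresholds; the dictionary layout and variable names are invented for the example.

<syntaxhighlight lang="python">
# (Support, Confidence) per item, in percent, as in Table 1.
items = {
    "Item A": (30, 50),
    "Item B": (15, 25),
    "Item C": (45, 55),
    "Item D": (35, 40),
}
MIN_SUPPORT, MIN_CONFIDENCE = 30, 50  # user-specified thresholds

# Step 1: keep itemsets that meet the minimum Support threshold.
frequent = {name: sc for name, sc in items.items() if sc[0] >= MIN_SUPPORT}

# Step 2: of those, keep the ones that also meet the minimum Confidence threshold.
rules = {name: sc for name, sc in frequent.items() if sc[1] >= MIN_CONFIDENCE}

print(sorted(rules))  # ['Item A', 'Item C']
</syntaxhighlight>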
Finding all the frequent itemsets in a database is not an easy task, since it involves going through all the data to find all possible item combinations from all possible itemsets. The set of possible itemsets is the [[power set]] over {{mvar|I}}; excluding the empty set, which is not a valid itemset, it has size <math>2^n-1</math>. The size of the power set thus grows exponentially in the number of items {{mvar|n}} in {{mvar|I}}. An efficient search is possible by using the '''''downward-closure property''''' of support<ref name="mining" /><ref>{{cite book|last1=Tan|first1=Pang-Ning|last2=Michael|first2=Steinbach|last3=Kumar|first3=Vipin|title=Introduction to Data Mining|publisher=[[Addison-Wesley]]|year=2005|isbn=978-0-321-32136-7|chapter=Chapter 6. Association Analysis: Basic Concepts and Algorithms|chapter-url=http://www-users.cs.umn.edu/~kumar/dmbook/ch6.pdf}}</ref> (also called ''anti-monotonicity''<ref name="pei">{{cite book|last1=Jian Pei|last2=Jiawei Han|last3=Lakshmanan|first3=L.V.S.|title=Proceedings 17th International Conference on Data Engineering|year=2001|isbn=978-0-7695-1001-9|pages=433–442|chapter=Mining frequent itemsets with convertible constraints|citeseerx=10.1.1.205.2150|doi=10.1109/ICDE.2001.914856|s2cid=1080975}}</ref>): every subset of a frequent itemset is itself frequent, so a frequent itemset can have no infrequent subsets. Exploiting this property, efficient algorithms (e.g., Apriori<ref name="apriori">Agrawal, Rakesh; and Srikant, Ramakrishnan; [http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf ''Fast algorithms for mining association rules in large databases''] {{Webarchive|url=https://web.archive.org/web/20150225213708/http://rakesh.agrawal-family.com/papers/vldb94apriori.pdf|date=2015-02-25}}, in Bocca, Jorge B.; Jarke, Matthias; and Zaniolo, Carlo; editors, ''Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile, September 1994'', pages 487-499</ref> and Eclat<ref name="eclat">{{Cite journal|last1=Zaki|first1=M. J.|year=2000|title=Scalable algorithms for association mining|journal=IEEE Transactions on Knowledge and Data Engineering|volume=12|issue=3|pages=372–390|citeseerx=10.1.1.79.9448|doi=10.1109/69.846291}}</ref>) can find all frequent itemsets.
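A minimal sketch of an Apriori-style level-wise search that exploits this property follows: candidates with any infrequent subset at the previous level are pruned before their support is counted. The function and variable names are illustrative, not taken from any particular library.

<syntaxhighlight lang="python">
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset mining using the downward-closure property.

    transactions: list of sets of items; min_support: minimum fraction of
    transactions an itemset must appear in to count as frequent.
    """
    n = len(transactions)
    def support(s):
        return sum(s <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(level)

    k = 2
    while level:
        prev = set(level)
        # Generate k-item candidates by joining (k-1)-item frequent sets, then
        # prune any candidate with an infrequent (k-1)-subset (downward closure).
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = [c for c in candidates if support(c) >= min_support]
        frequent += level
        k += 1
    return frequent

# Reusing the toy transactions from the first sketch:
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter"},
    {"bread", "butter", "jam"},
]
print(apriori(transactions, min_support=0.4))
</syntaxhighlight>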