== Approaches ==
{{Anchor|Algorithm types}} [[File:Supervised_and_unsupervised_learning.png|thumb|upright=1.3|In supervised learning, the training data is labelled with the expected answers, while in [[unsupervised learning]], the model identifies patterns or structures in unlabelled data.]] Machine learning approaches are traditionally divided into three broad categories, which correspond to learning paradigms, depending on the nature of the "signal" or "feedback" available to the learning system:
* [[Supervised learning]]: The computer is presented with example inputs and their desired outputs, given by a "teacher", and the goal is to learn a general rule that [[Map (mathematics)|maps]] inputs to outputs.
* [[Unsupervised learning]]: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end ([[feature learning]]).
* [[Reinforcement learning]]: A computer program interacts with a dynamic environment in which it must perform a certain goal (such as [[Autonomous car|driving a vehicle]] or playing a game against an opponent). As it navigates its problem space, the program is provided feedback analogous to rewards, which it tries to maximise.<ref name="bishop2006"/>
Although each algorithm has advantages and limitations, no single algorithm works for all problems.<ref>{{cite journal |last1=Jordan |first1=M. I. |last2=Mitchell |first2=T. M. |title=Machine learning: Trends, perspectives, and prospects |journal=Science |date=17 July 2015 |volume=349 |issue=6245 |pages=255–260 |doi=10.1126/science.aaa8415|pmid=26185243 |bibcode=2015Sci...349..255J |s2cid=677218 }}</ref><ref>{{cite book |last1=El Naqa |first1=Issam |last2=Murphy |first2=Martin J. |title=Machine Learning in Radiation Oncology |chapter=What is Machine Learning? 
|date=2015 |pages=3–11 |doi=10.1007/978-3-319-18305-3_1|isbn=978-3-319-18304-6 |s2cid=178586107 }}</ref><ref>{{cite journal |last1=Okolie |first1=Jude A. |last2=Savage |first2=Shauna |last3=Ogbaga |first3=Chukwuma C. |last4=Gunes |first4=Burcu |title=Assessing the potential of machine learning methods to study the removal of pharmaceuticals from wastewater using biochar or activated carbon |journal=Total Environment Research Themes |date=June 2022 |volume=1–2 |pages=100001 |doi=10.1016/j.totert.2022.100001|s2cid=249022386 |doi-access=free |bibcode=2022TERT....100001O }}</ref>

=== Supervised learning ===
{{Main|Supervised learning}}
[[File:Svm max sep hyperplane with margin.png|thumb|A [[support-vector machine]] is a supervised learning model that divides the data into regions separated by a [[linear classifier|linear boundary]]. Here, the linear boundary divides the black circles from the white.]]
Supervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs.<ref>{{cite book |last1=Russell |first1=Stuart J. |last2=Norvig |first2=Peter |title=Artificial Intelligence: A Modern Approach |date=2010 |publisher=Prentice Hall |isbn=9780136042594 |edition=Third|title-link=Artificial Intelligence: A Modern Approach }}</ref> The data, known as [[training data]], consists of a set of training examples. Each training example has one or more inputs and the desired output, also known as a supervisory signal. In the mathematical model, each training example is represented by an [[array data structure|array]] or vector, sometimes called a [[feature vector]], and the training data is represented by a [[Matrix (mathematics)|matrix]].
Through [[Mathematical optimization#Computational optimization techniques|iterative optimisation]] of an [[Loss function|objective function]], supervised learning algorithms learn a function that can be used to predict the output associated with new inputs.<ref>{{cite book |last1=Mohri |first1=Mehryar |last2=Rostamizadeh |first2=Afshin |last3=Talwalkar |first3=Ameet |title=Foundations of Machine Learning |date=2012 |publisher=The MIT Press |isbn=9780262018258}}</ref> An optimal function allows the algorithm to correctly determine the output for inputs that were not a part of the training data. An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task.<ref name="Mitchell-1997" /> Types of supervised-learning algorithms include [[active learning (machine learning)|active learning]], [[Statistical classification|classification]] and [[Regression analysis|regression]].<ref name="Alpaydin-2010">{{cite book|last=Alpaydin|first=Ethem|title=Introduction to Machine Learning|date=2010|publisher=MIT Press|isbn=978-0-262-01243-0|page=9|url=https://books.google.com/books?id=7f5bBAAAQBAJ|access-date=25 November 2018|archive-date=17 January 2023|archive-url=https://web.archive.org/web/20230117053338/https://books.google.com/books?id=7f5bBAAAQBAJ|url-status=live}}</ref> Classification algorithms are used when the outputs are restricted to a limited set of values, while regression algorithms are used when the outputs can take any numerical value within a range. For example, in a classification algorithm that filters emails, the input is an incoming email, and the output is the folder in which to file the email. 
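The iterative optimisation of an objective function described above can be sketched in a few lines of Python. This is only an illustration: the data, learning rate, and step count below are invented, and real systems use far richer models.

```python
# Minimal sketch of supervised learning as iterative optimisation of an
# objective function: fit y = w*x + b to labelled examples by gradient
# descent on the mean squared error. Data and hyperparameters are invented.

def fit_linear(xs, ys, lr=0.01, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # gradients of the mean-squared-error objective w.r.t. w and b
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# training data: inputs xs with supervisory signals ys (here y = 2x + 1)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_linear(xs, ys)   # the learned rule can then predict outputs for new inputs
```

The learned parameters approach w = 2 and b = 1, so the model "generalises" the training examples to inputs it has never seen.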
In contrast, regression is used for tasks such as predicting a person's height based on factors like age and genetics or forecasting future temperatures based on historical data.<ref>{{Cite web |title=Lecture 2 Notes: Supervised Learning |url=https://www.cs.cornell.edu/courses/cs4780/2022sp/notes/LectureNotes02.html |access-date=1 July 2024 |website=www.cs.cornell.edu}}</ref> [[Similarity learning]] is an area of supervised machine learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. It has applications in [[ranking]], [[recommender system|recommendation systems]], visual identity tracking, face verification, and speaker verification.

=== Unsupervised learning ===
{{Main|Unsupervised learning}}{{See also|Cluster analysis}}
Unsupervised learning algorithms find structures in data that has not been labelled, classified or categorised. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. Central applications of unsupervised machine learning include clustering, [[dimensionality reduction]],<ref name="Friedman-1998" /> and [[density estimation]].<ref name="JordanBishop2004">{{cite book |first1=Michael I. |last1=Jordan |first2=Christopher M. |last2=Bishop |chapter=Neural Networks |editor=Allen B. Tucker |title=Computer Science Handbook, Second Edition (Section VII: Intelligent Systems) |location=Boca Raton, Florida |publisher=Chapman & Hall/CRC Press LLC |year=2004 |isbn=978-1-58488-360-9 }}</ref> Cluster analysis is the assignment of a set of observations into subsets (called ''clusters'') so that observations within the same cluster are similar according to one or more predesignated criteria, while observations drawn from different clusters are dissimilar.
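Cluster analysis as just described can be sketched with the ''k''-means algorithm, a centroid-based method. The one-dimensional data points below are invented for illustration.

```python
import random

# Illustrative k-means sketch: observations in the same cluster are similar
# (close to the same centroid), while different clusters are well separated.

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7]   # two well-separated groups
centres = kmeans(data, k=2)                # converges near 1.0 and 10.0
```

Note that no labels are involved: the two groups emerge purely from the structure of the data, which is what distinguishes this from supervised learning.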
Different clustering techniques make different assumptions on the structure of the data, often defined by some ''similarity metric'' and evaluated, for example, by ''internal compactness'', or the similarity between members of the same cluster, and ''separation'', the difference between clusters. Other methods are based on ''estimated density'' and ''graph connectivity''. A special type of unsupervised learning called [[self-supervised learning]] involves training a model by generating the supervisory signal from the data itself.<ref>{{Cite conference|last1=Misra |first1=Ishan |last2=Maaten |first2=Laurens van der |date=2020 |title=Self-Supervised Learning of Pretext-Invariant Representations |url=https://openaccess.thecvf.com/content_CVPR_2020/html/Misra_Self-Supervised_Learning_of_Pretext-Invariant_Representations_CVPR_2020_paper.html |publisher=[[Institute of Electrical and Electronics Engineers|IEEE]] |pages=6707–6717 |conference=2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) |doi=10.1109/CVPR42600.2020.00674 |location=Seattle, WA, USA |arxiv=1912.01991 }}</ref><ref>{{Cite journal |last1=Jaiswal |first1=Ashish |last2=Babu |first2=Ashwin Ramesh |last3=Zadeh |first3=Mohammad Zaki |last4=Banerjee |first4=Debapriya |last5=Makedon |first5=Fillia |date=March 2021 |title=A Survey on Contrastive Self-Supervised Learning |journal=Technologies |language=en |volume=9 |issue=1 |pages=2 |doi=10.3390/technologies9010002 |doi-access=free |issn=2227-7080|arxiv=2011.00362 }}</ref>

=== Semi-supervised learning ===
{{Main|Semi-supervised learning}}
Semi-supervised learning falls between [[unsupervised learning]] (without any labelled training data) and [[supervised learning]] (with completely labelled training data).
Some of the training examples are missing training labels, yet many machine-learning researchers have found that unlabelled data, when used in conjunction with a small amount of labelled data, can produce a considerable improvement in learning accuracy. In [[Weak supervision|weakly supervised learning]], the training labels are noisy, limited, or imprecise; however, these labels are often cheaper to obtain, resulting in larger effective training sets.<ref>{{Cite web|url=https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html|title=Weak Supervision: The New Programming Paradigm for Machine Learning|author1=Alex Ratner|author2=Stephen Bach|author3=Paroma Varma|author4=Chris|others=referencing work by many other members of Hazy Research|website=hazyresearch.github.io|access-date=6 June 2019|archive-date=6 June 2019|archive-url=https://web.archive.org/web/20190606043931/https://hazyresearch.github.io/snorkel/blog/ws_blog_post.html}}</ref>

=== Reinforcement learning ===
{{Main|Reinforcement learning}}
[[File:Reinforcement learning diagram.svg|right|frameless]]
Reinforcement learning is an area of machine learning concerned with how [[software agent]]s ought to take [[Action selection|actions]] in an environment so as to maximise some notion of cumulative reward. Due to its generality, the field is studied in many other disciplines, such as [[game theory]], [[control theory]], [[operations research]], [[information theory]], [[simulation-based optimisation]], [[multi-agent system]]s, [[swarm intelligence]], [[statistics]] and [[genetic algorithm]]s. In reinforcement learning, the environment is typically represented as a [[Markov decision process]] (MDP).
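On a finite MDP, the reward-maximisation idea can be sketched with tabular Q-learning, one simple reinforcement learning method. The toy environment below (a four-state corridor with a reward at the right end) is invented for illustration and is not from the cited sources.

```python
import random

# Toy MDP: states 0..3 in a corridor; stepping right from state 2 into
# state 3 yields reward 1 and ends the episode. Tabular Q-learning with
# an epsilon-greedy policy. All hyperparameters are illustrative.

N_STATES = 4
ACTIONS = (-1, +1)                 # step left or step right
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.2

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0
    done = s2 == N_STATES - 1
    return s2, reward, done

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
rng = random.Random(0)
for _ in range(500):               # training episodes
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore occasionally, otherwise act greedily
        if rng.random() < EPS:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        best_next = 0.0 if done else max(Q[(s2, act)] for act in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2
```

After training, the greedy policy steps right from every state; note the agent was never told the transition model, only the rewards it experienced.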
Many reinforcement learning algorithms use [[dynamic programming]] techniques.<ref>{{Cite book|author1=van Otterlo, M.|author2=Wiering, M.|title=Reinforcement Learning |chapter=Reinforcement Learning and Markov Decision Processes |volume=12|pages=3–42 |year=2012 |doi=10.1007/978-3-642-27645-3_1|series=Adaptation, Learning, and Optimization|isbn=978-3-642-27644-6}}</ref> Reinforcement learning algorithms do not assume knowledge of an exact mathematical model of the MDP and are used when exact models are infeasible. Reinforcement learning algorithms are used in autonomous vehicles or in learning to play a game against a human opponent.

=== Dimensionality reduction ===
[[Dimensionality reduction]] is a process of reducing the number of random variables under consideration by obtaining a set of principal variables.<ref>{{cite journal|url=https://science.sciencemag.org/content/290/5500/2323|title=Nonlinear Dimensionality Reduction by Locally Linear Embedding|first1=Sam T.|last1=Roweis|first2=Lawrence K.|last2=Saul|date=22 December 2000|journal=Science|volume=290|issue=5500|pages=2323–2326|doi=10.1126/science.290.5500.2323|pmid=11125150|bibcode=2000Sci...290.2323R|s2cid=5987139|language=en|access-date=17 July 2023|archive-date=15 August 2021|archive-url=https://web.archive.org/web/20210815021528/https://science.sciencemag.org/content/290/5500/2323|url-status=live|url-access=subscription}}</ref> In other words, it is a process of reducing the dimension of the [[Feature (machine learning)|feature]] set, also called the "number of features". Most of the dimensionality reduction techniques can be considered as either feature elimination or [[Feature extraction|extraction]]. One of the popular methods of dimensionality reduction is [[principal component analysis]] (PCA). PCA involves changing higher-dimensional data (e.g., 3D) to a smaller space (e.g., 2D).
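A minimal sketch of PCA on two-dimensional data: centre the points, find the direction of largest variance (the first principal component) by power iteration on the covariance matrix, and project each point onto it. The data points are invented for illustration.

```python
# Reduce 2-D points to 1-D coordinates along the first principal component.

def pca_1d(points, iters=100):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # entries of the 2x2 covariance matrix
    cxx = sum(x * x for x, _ in centred) / n
    cxy = sum(x * y for x, y in centred) / n
    cyy = sum(y * y for _, y in centred) / n
    # power iteration converges to the leading eigenvector (the component)
    vx, vy = 1.0, 0.0
    for _ in range(iters):
        wx = cxx * vx + cxy * vy
        wy = cxy * vx + cyy * vy
        norm = (wx * wx + wy * wy) ** 0.5
        vx, vy = wx / norm, wy / norm
    # 1-D coordinates: projection of each centred point onto the component
    return (vx, vy), [x * vx + y * vy for x, y in centred]

data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1), (3.0, 2.9)]   # roughly along y = x
component, coords = pca_1d(data)
```

Because the points lie close to the line y = x, the learned component points along that diagonal, and the 1-D coordinates retain almost all of the original variance.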
The [[manifold hypothesis]] proposes that high-dimensional data sets lie along low-dimensional [[manifold]]s, and many dimensionality reduction techniques make this assumption, leading to the area of [[manifold learning]] and [[manifold regularisation]].

=== Other types ===
Other approaches have been developed which do not fit neatly into this three-fold categorisation, and sometimes more than one is used by the same machine learning system; examples include [[topic model]]ling and [[meta-learning (computer science)|meta-learning]].<ref>{{cite book |author=Pavel Brazdil |author2=Christophe Giraud Carrier |author3=Carlos Soares |author4=Ricardo Vilalta | title =Metalearning: Applications to Data Mining | year = 2009 | edition = Fourth | pages = 10–14, ''passim'' | publisher = [[Springer Science+Business Media]] |isbn = 978-3540732624 }}</ref>

==== Self-learning ====
Self-learning, as a machine learning paradigm, was introduced in 1982 along with a neural network capable of self-learning, named ''crossbar adaptive array'' (CAA).<ref>Bozinovski, S. (1982). "A self-learning system using secondary reinforcement". In Trappl, Robert (ed.). Cybernetics and Systems Research: Proceedings of the Sixth European Meeting on Cybernetics and Systems Research. North-Holland. pp. 397–402. {{ISBN|978-0-444-86488-8}}.</ref><ref>Bozinovski, S. (1999) "Crossbar Adaptive Array: The first connectionist network that solved the delayed reinforcement learning problem" In A. Dobnikar, N. Steele, D. Pearson, R. Albert (eds.) Artificial Neural Networks and Genetic Algorithms, Springer Verlag, p. 320-325, ISBN 3-211-83364-1 </ref> It gives a solution to the problem of learning without any external reward, by introducing emotion as an internal reward. Emotion is used as a state evaluation of a self-learning agent. The CAA self-learning algorithm computes, in a crossbar fashion, both decisions about actions and emotions (feelings) about consequence situations.
The system is driven by the interaction between cognition and emotion.<ref>Bozinovski, Stevo (2014) "Modeling mechanisms of cognition-emotion interaction in artificial neural networks, since 1981." Procedia Computer Science p. 255-263</ref> The self-learning algorithm updates a memory matrix W =||w(a,s)|| such that in each iteration it executes the following machine learning routine:
# in situation ''s'' perform action ''a''
# receive a consequence situation ''s'''
# compute emotion of being in the consequence situation ''v(s')''
# update crossbar memory ''w'(a,s) = w(a,s) + v(s')''
It is a system with only one input, the situation, and only one output, the action (or behaviour) ''a''. There is neither a separate reinforcement input nor an advice input from the environment. The backpropagated value (secondary reinforcement) is the emotion toward the consequence situation. The CAA exists in two environments: the behavioural environment, where it behaves, and the genetic environment, from which it initially (and only once) receives initial emotions about the situations it will encounter in the behavioural environment. After receiving the genome (species) vector from the genetic environment, the CAA learns a goal-seeking behaviour in an environment that contains both desirable and undesirable situations.<ref>Bozinovski, S. (2001) "Self-learning agents: A connectionist theory of emotion based on crossbar value judgment." Cybernetics and Systems 32(6) 637–667.</ref>

==== Feature learning ====
{{Main|Feature learning}}
Several learning algorithms aim at discovering better representations of the inputs provided during training.<ref name="pami">{{cite journal |author1=Y. Bengio |author2=A. Courville |author3=P.
Vincent |title=Representation Learning: A Review and New Perspectives |journal= IEEE Transactions on Pattern Analysis and Machine Intelligence|year=2013|doi=10.1109/tpami.2013.50 |pmid=23787338 |volume=35 |issue=8 |pages=1798–1828|arxiv=1206.5538 |s2cid=393948 }}</ref> Classic examples include [[principal component analysis]] and cluster analysis. Feature learning algorithms, also called representation learning algorithms, often attempt to preserve the information in their input but also transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions. This technique allows reconstruction of the inputs coming from the unknown data-generating distribution, while not being necessarily faithful to configurations that are implausible under that distribution. This replaces manual [[feature engineering]], and allows a machine to both learn the features and use them to perform a specific task. Feature learning can be either supervised or unsupervised. In supervised feature learning, features are learned using labelled input data. Examples include [[artificial neural network]]s, [[multilayer perceptron]]s, and supervised [[dictionary learning]]. In unsupervised feature learning, features are learned with unlabelled input data. Examples include dictionary learning, [[independent component analysis]], [[autoencoder]]s, [[matrix decomposition|matrix factorisation]]<ref>{{cite conference |author1=Nathan Srebro |author2=Jason D. M. Rennie |author3=Tommi S. Jaakkola |title=Maximum-Margin Matrix Factorization |conference=[[Conference on Neural Information Processing Systems|NIPS]] |year=2004}}</ref> and various forms of [[Cluster analysis|clustering]].<ref name="coates2011">{{cite conference |last1 = Coates |first1 = Adam |last2 = Lee |first2 = Honglak |last3 = Ng |first3 = Andrew Y. |title = An analysis of single-layer networks in unsupervised feature learning |conference = Int'l Conf. 
on AI and Statistics (AISTATS) |year = 2011 |url = http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CoatesNL11.pdf |access-date = 25 November 2018 |archive-url = https://web.archive.org/web/20170813153615/http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CoatesNL11.pdf |archive-date = 13 August 2017 }}</ref><ref>{{cite conference|last1 = Csurka|first1 = Gabriella|last2 = Dance|first2 = Christopher C.|last3 = Fan|first3 = Lixin|last4 = Willamowski|first4 = Jutta|last5 = Bray|first5 = Cédric|title = Visual categorization with bags of keypoints|conference = ECCV Workshop on Statistical Learning in Computer Vision|year = 2004|url = https://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/csurka-eccv-04.pdf|access-date = 29 August 2019|archive-date = 13 July 2019|archive-url = https://web.archive.org/web/20190713040210/http://www.cs.cmu.edu/~efros/courses/LBMV07/Papers/csurka-eccv-04.pdf|url-status = live}}</ref><ref name="jurafsky">{{cite book |title=Speech and Language Processing |author1=Daniel Jurafsky |author2=James H. Martin |publisher=Pearson Education International |year=2009 |pages=145–146}}</ref> [[Manifold learning]] algorithms attempt to do so under the constraint that the learned representation is low-dimensional. [[Sparse coding]] algorithms attempt to do so under the constraint that the learned representation is sparse, meaning that the mathematical model has many zeros. [[Multilinear subspace learning]] algorithms aim to learn low-dimensional representations directly from [[tensor]] representations for multidimensional data, without reshaping them into higher-dimensional vectors.<ref>{{cite journal |first1=Haiping |last1=Lu |first2=K.N. |last2=Plataniotis |first3=A.N. 
|last3=Venetsanopoulos |url=http://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf |title=A Survey of Multilinear Subspace Learning for Tensor Data |journal=Pattern Recognition |volume=44 |number=7 |pages=1540–1551 |year=2011 |doi=10.1016/j.patcog.2011.01.004 |bibcode=2011PatRe..44.1540L |access-date=4 September 2015 |archive-date=10 July 2019 |archive-url=https://web.archive.org/web/20190710225429/http://www.dsp.utoronto.ca/~haiping/Publication/SurveyMSL_PR2011.pdf |url-status=live }}</ref> [[Deep learning]] algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.<ref>{{cite book | title = Learning Deep Architectures for AI | author = Yoshua Bengio | publisher = Now Publishers Inc. | year = 2009 | isbn = 978-1-60198-294-0 | pages = 1–3 | url = https://books.google.com/books?id=cq5ewg7FniMC&pg=PA3 | author-link = Yoshua Bengio | access-date = 15 February 2016 | archive-date = 17 January 2023 | archive-url = https://web.archive.org/web/20230117053339/https://books.google.com/books?id=cq5ewg7FniMC&pg=PA3 | url-status = live }}</ref> Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensory data has not yielded to attempts to algorithmically define specific features. An alternative is to discover such features or representations through examination, without relying on explicit algorithms.
==== Sparse dictionary learning ====
{{Main|Sparse dictionary learning}}
Sparse dictionary learning is a feature learning method where a training example is represented as a linear combination of [[basis function]]s, and the representation is assumed to be a [[sparse matrix]]. The method is [[strongly NP-hard]] and difficult to solve approximately.<ref>{{cite journal |first=A. M. |last=Tillmann |title=On the Computational Intractability of Exact and Approximate Dictionary Learning |journal=IEEE Signal Processing Letters |volume=22 |issue=1 |year=2015 |pages=45–49 |doi=10.1109/LSP.2014.2345761|bibcode=2015ISPL...22...45T |arxiv=1405.6664 |s2cid=13342762 }}</ref> A popular [[heuristic]] method for sparse dictionary learning is the [[k-SVD|''k''-SVD]] algorithm. Sparse dictionary learning has been applied in several contexts. In classification, the problem is to determine the class to which a previously unseen example belongs. Given a dictionary built for each class, a new example is associated with the class whose dictionary best sparsely represents it. Sparse dictionary learning has also been applied in [[image de-noising]]. The key idea is that a clean image patch can be sparsely represented by an image dictionary, but the noise cannot.<ref>[[Michal Aharon|Aharon, M]], M Elad, and A Bruckstein. 2006. "[http://sites.fas.harvard.edu/~cs278/papers/ksvd.pdf K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation] {{Webarchive|url=https://web.archive.org/web/20181123142158/http://sites.fas.harvard.edu/~cs278/papers/ksvd.pdf |date=2018-11-23 }}."
Signal Processing, IEEE Transactions on 54 (11): 4311–4322</ref>

==== Anomaly detection ====
{{Main|Anomaly detection}}
In [[data mining]], anomaly detection, also known as outlier detection, is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data.<ref name="Zimek-2017">{{Citation|last1=Zimek|first1=Arthur|title=Outlier Detection|date=2017|encyclopedia=Encyclopedia of Database Systems|pages=1–5|publisher=Springer New York|language=en|doi=10.1007/978-1-4899-7993-3_80719-1|isbn=9781489979933|last2=Schubert|first2=Erich}}</ref> Typically, the anomalous items represent an issue such as [[bank fraud]], a structural defect, medical problems or errors in a text. Anomalies are referred to as [[outlier]]s, novelties, noise, deviations and exceptions.<ref>{{cite journal | last1 = Hodge | first1 = V. J. | last2 = Austin | first2 = J. | doi = 10.1007/s10462-004-4304-y | title = A Survey of Outlier Detection Methodologies | journal = Artificial Intelligence Review | volume = 22 | issue = 2 | pages = 85–126 | year = 2004 | url = http://eprints.whiterose.ac.uk/767/1/hodgevj4.pdf | citeseerx = 10.1.1.318.4023 | s2cid = 59941878 | access-date = 25 November 2018 | archive-date = 22 June 2015 | archive-url = https://web.archive.org/web/20150622042146/http://eprints.whiterose.ac.uk/767/1/hodgevj4.pdf | url-status = live }}</ref> In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts of activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object. Many outlier detection methods (in particular, unsupervised algorithms) will fail on such data unless aggregated appropriately.
Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.<ref>{{cite journal |first1=Paul |last1=Dokas |first2=Levent |last2=Ertoz |first3=Vipin |last3=Kumar |first4=Aleksandar |last4=Lazarevic |first5=Jaideep |last5=Srivastava |first6=Pang-Ning |last6=Tan |title=Data mining for network intrusion detection |year=2002 |journal=Proceedings NSF Workshop on Next Generation Data Mining |url=https://www-users.cse.umn.edu/~lazar027/MINDS/papers/nsf_ngdm_2002.pdf |access-date=26 March 2023 |archive-date=23 September 2015 |archive-url=https://web.archive.org/web/20150923211542/http://www.csee.umbc.edu/~kolari1/Mining/ngdm/dokas.pdf |url-status=live }}</ref> Three broad categories of anomaly detection techniques exist.<ref name="ChandolaSurvey">{{cite journal |last1=Chandola |first1=V. |last2=Banerjee |first2=A. |last3=Kumar |first3=V. |s2cid=207172599 |year=2009 |title=Anomaly detection: A survey|journal=[[ACM Computing Surveys]]|volume=41|issue=3|pages=1–58|doi=10.1145/1541880.1541882}}</ref> Unsupervised anomaly detection techniques detect anomalies in an unlabelled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit the least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labelled as "normal" and "abnormal" and involves training a classifier (the key difference from many other statistical classification problems is the inherently unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behaviour from a given normal training data set and then test the likelihood of a test instance to be generated by the model. 
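The unsupervised variant can be sketched with a simple z-score detector: assume most instances are normal and flag the values that fit the rest of the data least, here measured as distance from the mean in standard deviations. The sensor readings and threshold below are invented for illustration.

```python
# Flag values far from the bulk of the data (no labels needed).

def zscore_outliers(values, threshold=2.0):
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    # flag any value more than `threshold` standard deviations from the mean
    return [v for v in values if abs(v - mean) > threshold * std]

readings = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]   # one anomalous reading
anomalies = zscore_outliers(readings)
```

This works only under the stated assumption that anomalies are rare; as noted above, it would fail on bursty data unless the events were aggregated first.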
==== Robot learning ====
[[Robot learning]] is inspired by a multitude of machine learning methods, starting from supervised learning, reinforcement learning,<ref>{{cite journal|title=Learning efficient haptic shape exploration with a rigid tactile sensor array |year=2020 |doi=10.1371/journal.pone.0226880|arxiv=1902.07501 |pmid=31896135 |doi-access=free |last1=Fleer |first1=S. |last2=Moringen |first2=A. |last3=Klatzky |first3=R. L. |last4=Ritter |first4=H. |journal=PLOS ONE |volume=15 |issue=1 |pages=e0226880 |pmc=6940144 }}</ref><ref>{{Citation|last1=Moringen|first1=Alexandra|title=Attention-Based Robot Learning of Haptic Interaction|date=2020|work=Haptics: Science, Technology, Applications|volume=12272|pages=462–470|editor-last=Nisky|editor-first=Ilana|place=Cham|publisher=Springer International Publishing|language=en|doi=10.1007/978-3-030-58147-3_51|isbn=978-3-030-58146-6|last2=Fleer|first2=Sascha|last3=Walck|first3=Guillaume|last4=Ritter|first4=Helge|series=Lecture Notes in Computer Science |s2cid=220069113|editor2-last=Hartcher-O'Brien|editor2-first=Jess|editor3-last=Wiertlewski|editor3-first=Michaël|editor4-last=Smeets|editor4-first=Jeroen|doi-access=free}}</ref> and finally [[meta-learning (computer science)|meta-learning]] (e.g. MAML).

==== Association rules ====
{{Main|Association rule learning}}{{See also|Inductive logic programming}}
Association rule learning is a [[rule-based machine learning]] method for discovering relationships between variables in large databases.
It is intended to identify strong rules discovered in databases using some measure of "interestingness".<ref name="piatetsky">Piatetsky-Shapiro, Gregory (1991), ''Discovery, analysis, and presentation of strong rules'', in Piatetsky-Shapiro, Gregory; and Frawley, William J.; eds., ''Knowledge Discovery in Databases'', AAAI/MIT Press, Cambridge, MA.</ref> Rule-based machine learning is a general term for any machine learning method that identifies, learns, or evolves "rules" to store, manipulate or apply knowledge. The defining characteristic of a rule-based machine learning algorithm is the identification and utilisation of a set of relational rules that collectively represent the knowledge captured by the system. This is in contrast to other machine learning algorithms that commonly identify a singular model that can be universally applied to any instance in order to make a prediction.<ref>{{Cite journal|last1=Bassel|first1=George W.|last2=Glaab|first2=Enrico|last3=Marquez|first3=Julietta|last4=Holdsworth|first4=Michael J.|last5=Bacardit|first5=Jaume|date=1 September 2011|title=Functional Network Construction in Arabidopsis Using Rule-Based Machine Learning on Large-Scale Data Sets|journal=The Plant Cell|language=en|volume=23|issue=9|pages=3101–3116|doi=10.1105/tpc.111.088153|issn=1532-298X|pmc=3203449|pmid=21896882|bibcode=2011PlanC..23.3101B }}</ref> Rule-based machine learning approaches include [[learning classifier system]]s, association rule learning, and [[artificial immune system]]s. Based on the concept of strong rules, [[Rakesh Agrawal (computer scientist)|Rakesh Agrawal]], [[Tomasz Imieliński]] and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by [[point-of-sale]] (POS) systems in supermarkets.<ref name="mining">{{Cite book | last1 = Agrawal | first1 = R. | last2 = Imieliński | first2 = T. | last3 = Swami | first3 = A. 
| doi = 10.1145/170035.170072 | chapter = Mining association rules between sets of items in large databases | title = Proceedings of the 1993 ACM SIGMOD international conference on Management of data - SIGMOD '93 | pages = 207 | year = 1993 | isbn = 978-0897915922 | citeseerx = 10.1.1.40.6984 | s2cid = 490415 }}</ref> For example, the rule <math>\{\mathrm{onions, potatoes}\} \Rightarrow \{\mathrm{burger}\}</math> found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional [[pricing]] or [[product placement]]s. In addition to [[market basket analysis]], association rules are employed today in application areas including [[Web usage mining]], [[intrusion detection]], [[continuous production]], and [[bioinformatics]]. In contrast with [[sequence mining]], association rule learning typically does not consider the order of items either within a transaction or across transactions. [[Learning classifier system|Learning classifier systems]] (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a [[genetic algorithm]], with a learning component, performing either [[supervised learning]], [[reinforcement learning]], or [[unsupervised learning]]. 
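The supermarket rule above can be checked against transaction data by computing its support (how often the items occur together) and confidence (how often the consequent follows the antecedent). The toy transaction log below is invented for illustration.

```python
# Evaluate the association rule {onions, potatoes} => {burger}.

transactions = [
    {"onions", "potatoes", "burger"},
    {"onions", "potatoes", "burger", "beer"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"burger", "beer"},
]

def support(itemset):
    # fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent = {"onions", "potatoes"}
consequent = {"burger"}
rule_support = support(antecedent | consequent)   # 2 of 5 baskets
confidence = rule_support / support(antecedent)   # of baskets with onions and
                                                  # potatoes, the share with burger
```

A rule is considered "strong" when both measures exceed user-chosen thresholds; note that, as stated above, the order of items within a basket plays no role.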
They seek to identify a set of context-dependent rules that collectively store and apply knowledge in a [[piecewise]] manner in order to make predictions.<ref>{{Cite journal|last1=Urbanowicz|first1=Ryan J.|last2=Moore|first2=Jason H.|date=22 September 2009|title=Learning Classifier Systems: A Complete Introduction, Review, and Roadmap|journal=Journal of Artificial Evolution and Applications|language=en|volume=2009|pages=1–25|doi=10.1155/2009/736398|issn=1687-6229|doi-access=free}}</ref> [[Inductive logic programming]] (ILP) is an approach to rule learning using [[logic programming]] as a uniform representation for input examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that [[Entailment|entails]] all positive and no negative examples. [[Inductive programming]] is a related field that considers any kind of programming language for representing hypotheses (and not only logic programming), such as [[Functional programming|functional programs]]. Inductive logic programming is particularly useful in [[bioinformatics]] and [[natural language processing]]. [[Gordon Plotkin]] and [[Ehud Shapiro]] laid the initial theoretical foundation for inductive machine learning in a logical setting.<ref>Plotkin G.D. [https://www.era.lib.ed.ac.uk/bitstream/handle/1842/6656/Plotkin1972.pdf;sequence=1 Automatic Methods of Inductive Inference] {{Webarchive|url=https://web.archive.org/web/20171222051034/https://www.era.lib.ed.ac.uk/bitstream/handle/1842/6656/Plotkin1972.pdf;sequence=1 |date=22 December 2017 }}, PhD thesis, University of Edinburgh, 1970.</ref><ref>Shapiro, Ehud Y. 
[http://ftp.cs.yale.edu/publications/techreports/tr192.pdf Inductive inference of theories from facts] {{Webarchive|url=https://web.archive.org/web/20210821071609/http://ftp.cs.yale.edu/publications/techreports/tr192.pdf |date=21 August 2021 }}, Research Report 192, Yale University, Department of Computer Science, 1981. Reprinted in J.-L. Lassez, G. Plotkin (Eds.), Computational Logic, The MIT Press, Cambridge, MA, 1991, pp. 199–254.</ref><ref>Shapiro, Ehud Y. (1983). ''Algorithmic program debugging''. Cambridge, Mass: MIT Press. {{ISBN|0-262-19218-7}}</ref> Shapiro built the first implementation (Model Inference System) in 1981: a Prolog program that inductively inferred logic programs from positive and negative examples.<ref>Shapiro, Ehud Y. "[http://dl.acm.org/citation.cfm?id=1623364 The model inference system] {{Webarchive|url=https://web.archive.org/web/20230406011006/https://dl.acm.org/citation.cfm?id=1623364 |date=2023-04-06 }}." Proceedings of the 7th international joint conference on Artificial intelligence-Volume 2. Morgan Kaufmann Publishers Inc., 1981.</ref> The term ''inductive'' here refers to [[Inductive reasoning|philosophical]] induction, suggesting a theory to explain observed facts, rather than [[mathematical induction]], proving a property for all members of a well-ordered set.