Editing Boosting (machine learning)

{{Short description|Method in machine learning}}
{{Technical|date=September 2023}}<!-- A lot of technical jargon that less experienced ML users may not understand-->
{{Machine learning|Supervised learning}}
In [[machine learning]] (ML), '''boosting''' is an [[Ensemble learning|ensemble]] [[metaheuristic]] for primarily reducing [[Bias–variance tradeoff|bias (as opposed to variance)]].<ref>{{cite web|url=http://oz.berkeley.edu/~breiman/arcall96.pdf|archive-url=https://web.archive.org/web/20150119081741/http://oz.berkeley.edu/~breiman/arcall96.pdf|url-status=dead|archive-date=2015-01-19|title=BIAS, VARIANCE, AND ARCING CLASSIFIERS|last1=Leo Breiman|author-link=Leo Breiman|date=1996|publisher=TECHNICAL REPORT|quote=Arcing [Boosting] is more successful than bagging in variance reduction|access-date=19 January 2015}}</ref> It can also improve the [[Stability (learning theory)|stability]] and accuracy of ML [[Statistical classification|classification]] and [[Regression analysis|regression]] algorithms. Hence, it is prevalent in [[supervised learning]] for converting weak learners to strong learners.<ref>{{cite book |last=Zhou Zhi-Hua |author-link=Zhou Zhihua |date=2012 |title=Ensemble Methods: Foundations and Algorithms |publisher= Chapman and Hall/CRC |page=23 |isbn=978-1439830031 |quote=The term boosting refers to a family of algorithms that are able to convert weak learners to strong learners }}</ref>

The concept of boosting is based on the question posed by [[Michael Kearns (computer scientist)|Kearns]] and [[Leslie Valiant|Valiant]] (1988, 1989)<!--Please do not cite only one, because "Kearns and Valiant" is used as a convention to denote this question.-->:<ref name="Kearns88">Michael Kearns(1988); [http://www.cis.upenn.edu/~mkearns/papers/boostnote.pdf ''Thoughts on Hypothesis Boosting''], Unpublished manuscript (Machine Learning class project, December 1988)</ref><ref>{{cite book |last1=Michael Kearns |author-link=Michael Kearns (computer scientist) |last2=Leslie Valiant |title=Proceedings of the twenty-first annual ACM symposium on Theory of computing - STOC '89 |chapter=Cryptographic limitations on learning Boolean formulae and finite automata |author2-link=Leslie Valiant |date=1989 |publisher=ACM |volume=21 |pages=433–444 |doi=10.1145/73007.73049 |isbn= 978-0897913072|s2cid=536357 }}</ref> "Can a set of weak learners create a single strong learner?" A weak learner is defined as a [[Statistical classification|classifier]] that is only slightly correlated with the true classification. A strong learner is a classifier that is arbitrarily well-correlated with the true classification. [[Robert Schapire]] answered the question in the affirmative in a paper published in 1990.<ref name="Schapire90">{{cite journal |last=Schapire |first=Robert E. |year=1990 |title=The Strength of Weak Learnability |url=http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf |url-status=dead |journal=Machine Learning |volume=5 |issue=2 |pages=197–227 |citeseerx=10.1.1.20.723 |doi=10.1007/bf00116037 |s2cid=53304535 |archive-url=https://web.archive.org/web/20121010030839/http://www.cs.princeton.edu/~schapire/papers/strengthofweak.pdf |archive-date=2012-10-10 |access-date=2012-08-23}}</ref><!--Please do not cite only one, because "Kearns and Valiant" is used as a convention to denote this question.--> This has had significant ramifications in machine learning and [[statistics]], most notably leading to the development of boosting.<ref>{{cite journal |last = Leo Breiman |author-link = Leo Breiman |date = 1998|title = Arcing classifier (with discussion and a rejoinder by the author)|journal = Ann. Stat.|volume = 26|issue = 3|pages = 801–849|doi = 10.1214/aos/1024691079|quote = Schapire (1990) proved that boosting is possible. (Page 823)|doi-access = free}}</ref><!--{{citation needed|date=July 2014}} Could use secondary source to back up this claim. -->

Initially, the ''hypothesis boosting problem'' simply referred to the process of turning a weak learner into a strong learner.<ref name="Kearns88" /> Algorithms that achieve this quickly became known as "boosting". [[Yoav Freund|Freund]] and Schapire's arcing (Adapt[at]ive Resampling and Combining),<ref>Yoav Freund and Robert E. Schapire (1997); [https://www.cis.upenn.edu/~mkearns/teaching/COLT/adaboost.pdf ''A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting''], Journal of Computer and System Sciences, 55(1):119-139</ref> as a general technique, is more or less synonymous with boosting.<ref>Leo Breiman (1998); [http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1024691079 ''Arcing Classifier (with Discussion and a Rejoinder by the Author)''], Annals of Statistics, vol. 26, no. 3, pp. 801-849: "The concept of weak learning was introduced by Kearns and Valiant (1988<!-- Michael Kearns, Leslie G. Valiant (1988); ''Learning Boolean Formulae or Finite Automata is as Hard as Factoring'', Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, August 1988 -->, 1989<!-- Michael Kearns, Leslie G. Valiant (1989) ''Cryptographic Limitations on Learning Boolean Formulae and Finite Automata'', Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing (pp. 433-444). New York, NY: ACM Press, later republished in the Journal of the Association for Computing Machinery, 41(1):67–95, January 1994 -->), who left open the question of whether weak and strong learnability are equivalent. The question was termed the ''boosting problem'' since a solution 'boosts' the low accuracy of a weak learner to the high accuracy of a strong learner. Schapire (1990) proved that boosting is possible. A ''boosting algorithm'' is a method that takes a weak learner and converts it into a strong one. Freund and Schapire (1997) proved that an algorithm similar to arc-fs is boosting.</ref>

== Algorithms ==
While boosting is not algorithmically constrained, most boosting algorithms consist of iteratively learning weak classifiers with respect to a distribution and adding them to a final strong classifier. When they are added, they are weighted in a way that is related to the weak learners' accuracy.  After a weak learner is added, the data weights are readjusted, known as "re-[[weighting]]". Misclassified input data gain a higher weight and examples that are classified correctly lose weight.{{NoteTag|Some boosting-based classification algorithms actually decrease the weight of repeatedly misclassified examples; for example boost by majority and [[BrownBoost]].}} Thus, future weak learners focus more on the examples that previous weak learners misclassified.

[[File:Ensemble Boosting.svg|thumb|An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and weighted dataset]]

There are many boosting algorithms. The original ones, proposed by [[Robert Schapire]] (a [[Recursion (computer science)|recursive]] majority gate formulation),<ref name="Schapire90" /> and [[Yoav Freund]] (boost by majority),<ref name="Mason00">Llew Mason, Jonathan Baxter, Peter Bartlett, and Marcus Frean (2000); ''Boosting Algorithms as Gradient Descent'', in [[Sara Solla|S. A. Solla]], T. K. Leen, and K.-R. Muller, editors, ''Advances in Neural Information Processing Systems'' 12, pp.&nbsp;512-518, MIT Press</ref> were not [[Adaptive algorithm|adaptive]] and could not take full advantage of the weak learners. Schapire and Freund then developed [[AdaBoost]], an adaptive boosting algorithm that won the prestigious [[Gödel Prize]].

Only algorithms that are provable boosting algorithms in the [[probably approximately correct learning]] formulation can accurately be called ''boosting algorithms''.  Other algorithms that are similar in spirit{{Clarify|reason=What spirit?|date=October 2018}} to boosting algorithms are sometimes called "leveraging algorithms", although they are also sometimes incorrectly called boosting algorithms.<ref name="Mason00"/>

The main variation between many boosting algorithms is their method of [[weighting]] [[training data]] points and [[Hypothesis|hypotheses]]. [[AdaBoost]] is very popular and the most significant historically as it was the first algorithm that could adapt to the weak learners. It is often the basis of introductory coverage of boosting in university machine learning courses.<ref>{{Cite web|url=http://math.mit.edu/~rothvoss/18.304.3PM/Presentations/1-Eric-Boosting304FinalRpdf.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://math.mit.edu/~rothvoss/18.304.3PM/Presentations/1-Eric-Boosting304FinalRpdf.pdf |archive-date=2022-10-09 |url-status=live|title=Boosting (AdaBoost algorithm)|last=Emer|first=Eric|website=MIT|access-date=2018-10-10}}</ref> There are many more recent algorithms such as [[LPBoost]], TotalBoost, [[BrownBoost]], [[xgboost]], MadaBoost, [[LogitBoost]], and others. Many boosting algorithms fit into the AnyBoost framework,<ref name="Mason00"/> which shows that boosting performs [[gradient descent]] in a [[function space]] using a [[Convex function|convex]] [[Loss functions for classification|cost function]].

==Object categorization in computer vision==
{{Main|Object categorization from image search}}
Given images containing various known objects in the world, a classifier can be learned from them to automatically [[Statistical classification|classify]] the objects in future images.  Simple classifiers built based on some [[Feature (computer vision)|image feature]] of the object tend to be weak in categorization performance. Using boosting methods for object categorization is a way to unify the weak classifiers in a special way to boost the overall ability of categorization.{{Citation needed|date=January 2024}}

===Problem of object categorization===
[[Object categorization from image search|Object categorization]] is a typical task of [[computer vision]] that involves determining whether or not an image contains some specific category of object.  The idea is closely related with recognition, identification, and detection.  Appearance based object categorization typically contains [[feature extraction]], [[learning#3|learning]] a [[Classifier (mathematics)|classifier]], and applying the classifier to new examples.  There are many ways to represent a category of objects, e.g. from [[Shape analysis (digital geometry)|shape analysis]], [[bag of words model]]s, or local descriptors such as [[Scale-invariant feature transform|SIFT]], etc.  Examples of [[supervised learning|supervised classifiers]] are [[Naive Bayes classifier]]s, [[support vector machine]]s, [[mixtures of Gaussians]], and [[Artificial neural network|neural networks]].  However, research{{Which|date=October 2018}} has shown that object categories and their locations in images can be discovered in an [[Unsupervised learning|unsupervised manner]] as well.<ref>Sivic, Russell, Efros, Freeman & Zisserman,  "Discovering objects and their location in images", ICCV 2005</ref>

===Status quo for object categorization===
The recognition of object categories in images is a challenging problem in [[computer vision]], especially when the number of categories is large.  This is due to high intra class variability and the need for generalization across variations of objects within the same category. Objects within one category may look quite different. Even the same object may appear unalike under different viewpoint, [[Scaling (geometry)|scale]], and [[Illumination (image)|illumination]]. Background clutter and partial occlusion add difficulties to recognition as well.<ref>A. Opelt, A. Pinz, et al., "Generic Object Recognition with Boosting", IEEE Transactions on PAMI 2006</ref>  Humans are able to recognize thousands of object types, whereas most of the existing [[object recognition]] systems are trained to recognize only a few,{{How many|date=October 2018}} e.g. [[Face|human faces]], [[car]]s, simple objects, etc.<ref>M. Marszalek, "Semantic Hierarchies for Visual Object Recognition", 2007</ref>{{Update inline|date=October 2018|reason=The source is from 2007 and computer vision is a lot more advanced now|?=yes}}  Research has been very active on dealing with more categories and enabling incremental additions of new categories, and although the general problem remains unsolved, several multi-category objects detectors (for up to hundreds or thousands of categories<ref>{{Cite web|url=http://image-net.org/challenges/LSVRC/2017/|title=Large Scale Visual Recognition Challenge|date=December 2017}}</ref>) have been developed.  One means is by [[Feature (computer vision)|feature]] sharing and boosting.

===Boosting for binary categorization===
AdaBoost can be used for face detection as an example of [[binary categorization]]. The two categories are faces versus background. The general algorithm is as follows:
#Form a large set of simple features
#Initialize weights for training images
#For T rounds
##Normalize the weights
##For available features from the set, train a classifier using a single feature and evaluate the training error
##Choose the classifier with the lowest error
##Update the weights of the training images: increase if classified wrongly by this classifier, decrease if correctly
#Form the final strong classifier as the linear combination of the T classifiers (coefficient larger if training error is small)
After boosting, a classifier constructed from 200 features could yield a 95% detection rate under a <math>10^{-5}</math> [[Type I and type II errors|false positive rate]].<ref>P. Viola, M. Jones, "Robust Real-time Object Detection", 2001</ref>

Another application of boosting for binary categorization is a system that detects pedestrians using [[patterns]] of motion and appearance.<ref>{{cite conference|first1=P.|last1=Viola|first2=M.|last2=Jones|first3=D.|last3=Snow|title=Detecting Pedestrians Using Patterns of Motion and Appearance|conference=ICCV|year=2003|url=http://www.merl.com/publications/docs/TR2003-90.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.merl.com/publications/docs/TR2003-90.pdf |archive-date=2022-10-09 |url-status=live}}</ref> This work is the first to combine both motion information and appearance information as features to detect a walking person. It takes a similar approach to the [[Viola–Jones object detection framework|Viola-Jones object detection framework]].

===Boosting for multi-class categorization===
Compared with binary categorization, [[multi-class categorization]] looks for common features that can be shared across the categories at the same time.  They turn to be more generic [[Edge detection|edge]] like features. During learning, the detectors for each category can be trained jointly. Compared with training separately, it [[Generalization|generalizes]] better, needs less training data, and requires fewer features to achieve the same performance.
 
The main flow of the algorithm is similar to the binary case. What is different is that a measure of the joint training error shall be defined in advance. During each iteration the algorithm chooses a classifier of a single feature (features that can be shared by more categories shall be encouraged). This can be done via converting [[multi-class classification]] into a binary one (a set of categories versus the rest),<ref>A. Torralba, K. P. Murphy, et al., "Sharing visual features for multiclass and multiview object detection", IEEE Transactions on PAMI 2006</ref> or by introducing a penalty error from the categories that do not have the feature of the classifier.<ref>A. Opelt, et al., "Incremental learning of object detectors using a visual shape alphabet", CVPR 2006</ref>

In the paper "Sharing visual features for multiclass and multiview object detection", A. Torralba et al. used [[GentleBoost]] for boosting and showed that when training data is limited, learning via sharing features does a much better job than no sharing, given same boosting rounds. Also, for a given performance level, the total number of features required (and therefore the run time cost of the classifier) for the feature sharing detectors, is observed to scale approximately [[logarithm]]ically with the number of class, i.e., slower than [[linear]] growth in the non-sharing case. Similar results are shown in the paper "Incremental learning of object detectors using a visual shape alphabet", yet the authors used [[AdaBoost]] for boosting.

== Convex vs. non-convex boosting algorithms ==

Boosting algorithms can be based on [[Convex optimization|convex]] or non-convex optimization algorithms.  Convex algorithms, such as [[AdaBoost]] and [[LogitBoost]], can be "defeated" by random  noise such that they can't learn basic and learnable combinations of weak hypotheses.<ref>P. Long and R. Servedio.  25th International Conference on Machine Learning (ICML), 2008, pp. 608--615.</ref><ref name=long-criticism>{{cite journal |last1=Long |first1=Philip M. |last2=Servedio |first2=Rocco A. |date=March 2010 |title=Random classification noise defeats all convex potential boosters |url=https://www.cs.columbia.edu/~rocco/Public/mlj9.pdf |archive-url=https://ghostarchive.org/archive/20221009/https://www.cs.columbia.edu/~rocco/Public/mlj9.pdf |archive-date=2022-10-09 |url-status=live |journal=Machine Learning |volume=78 |issue=3 |pages=287–304 |doi=10.1007/s10994-009-5165-z |s2cid=53861 |access-date=2015-11-17|doi-access=free }}</ref> This limitation was pointed out by Long & Servedio in 2008.  However, by 2009, multiple authors demonstrated that  boosting algorithms based on non-convex optimization, such as [[BrownBoost]], can learn from noisy datasets and can specifically learn the underlying classifier of the Long–Servedio dataset.

== See also ==
{{Div col}}
* [[Random forest]]
* [[Alternating decision tree]]
* [[Bootstrap aggregating]] (bagging)
* [[Cascading classifiers|Cascading]]
* [[CoBoosting]]
* [[Logistic regression]]
* [[Principle of maximum entropy|Maximum entropy methods]]
* [[Gradient boosting]]
* [[Margin classifier]]s
* [[cross-validation (statistics)|Cross-validation]]
* [[List of datasets for machine learning research]]
{{div col end}}

==Implementations==
* [[scikit-learn]], an open source machine learning library for [[Python (programming language)|Python]]
* [[Orange (software)|Orange]], a free data mining software suite, module [http://docs.orange.biolab.si/reference/rst/Orange.ensemble.html Orange.ensemble]
* [[Weka (machine learning)|Weka]] is a machine learning set of tools that offers variate implementations of boosting algorithms like AdaBoost and LogitBoost
* [[R (programming language)|R]] package [https://cran.r-project.org/web/packages/gbm/index.html GBM] (Generalized Boosted Regression Models) implements extensions to Freund and Schapire's AdaBoost algorithm and Friedman's gradient boosting machine.
* [https://sourceforge.net/projects/jboost/ jboost]; AdaBoost, LogitBoost, RobustBoost, Boostexter and alternating decision trees
* R package [https://cran.r-project.org/web/packages/adabag/index.html adabag]: Applies Multiclass AdaBoost.M1, AdaBoost-SAMME and Bagging
* R package [https://cran.r-project.org/web/packages/xgboost/index.html xgboost]: An implementation of gradient boosting for linear and tree-based models.

== Notes ==
{{NoteFoot}}

== References ==
{{Reflist}}

==Further reading==
* {{cite journal |first1=Yoav |last1=Freund |first2=Robert E. |last2=Schapire |date=1997 |url=https://www.cse.ucsd.edu/~yfreund/papers/adaboost.pdf |title=A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting |journal=Journal of Computer and System Sciences |volume=55 |number=1 |pages=119–139|doi=10.1006/jcss.1997.1504 }}
* {{cite journal |first=Robert E. |last=Schapire |title=The strength of weak learnability |journal=Machine Learning |volume=5 |number=2 |pages=197–227 |date=1990 |doi=10.1007/BF00116037 |s2cid=6207294 |url=http://citeseer.ist.psu.edu/schapire90strength.html|doi-access=free }}
* {{cite journal |first1=Robert E. |last1=Schapire |first2=Yoram |last2=Singer |date=1999 |url=http://citeseer.ist.psu.edu/schapire99improved.html |title=Improved Boosting Algorithms Using Confidence-Rated Predictors |journal=Machine Learning |volume=37 |number=3 |pages=297–336|doi=10.1023/A:1007614523901 |s2cid=2329907 |doi-access=free }}
* {{cite journal | url = http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/colt08.pdf | title = On the margin explanation of boosting algorithm. | first = Zhihua | last = Zhou | journal = In: Proceedings of the 21st Annual Conference on Learning Theory (COLT'08) | pages = 479–490 | year = 2008 }}
* {{cite journal | url = http://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/aij13marginbound.pdf | title = On the doubt about margin explanation of boosting. | first = Zhihua | last = Zhou | journal = Artificial Intelligence |  volume = 203 | pages = 1–18 | year = 2013 | doi = 10.1016/j.artint.2013.07.002 | arxiv = 1009.3613 | s2cid = 2828847 }}

== External links ==
* Robert E. Schapire (2003); [http://www.cs.princeton.edu/courses/archive/spr08/cos424/readings/Schapire2003.pdf ''The Boosting Approach to Machine Learning: An Overview''], MSRI (Mathematical Sciences Research Institute) Workshop on Nonlinear Estimation and Classification
* [https://direct.mit.edu/books/oa-monograph/5342/BoostingFoundations-and-Algorithms Boosting: Foundations and Algorithms by Robert E. Schapire and Yoav Freund]

{{Authority control}}

[[Category:Classification algorithms]]
[[Category:Ensemble learning]]
[[Category:Learning in computer vision]]
[[Category:Object recognition and categorization]]