== Process of the algorithm ==

=== Key terms ===
There are three types of datasets in bootstrap aggregating: the '''original, bootstrap, and out-of-bag datasets'''. The sections below explain how the bootstrap and out-of-bag datasets are constructed; the original dataset is simply whatever data is given.

=== Creating the bootstrap dataset ===
The bootstrap dataset is made by randomly selecting objects from the original dataset, and '''it must be the same size as the original dataset'''. The difference is that the bootstrap dataset may contain duplicate objects. Here is a simple example, along with the illustration below:

[[File:Bootstrap Example 2.png|672x672px|Bootstrap Example]]

Suppose the '''original dataset''' is a '''group of 12 people''' named '''Emily, Jessie, George, Constantine, Lexi, Theodore, John, James, Rachel, Anthony, Ellie, and Jamal.''' Drawing names at random with replacement, say the '''bootstrap dataset''' contains '''James, Ellie, Constantine, Lexi, John, Constantine, Theodore, Constantine, Anthony, Lexi, Constantine, and Theodore.''' In this case, the bootstrap sample contains Constantine four times, and Lexi and Theodore twice each.

=== Creating the out-of-bag dataset ===
The out-of-bag dataset '''consists of the remaining people who were not selected for the bootstrap dataset.''' It can be calculated by taking the difference between the original and the bootstrap datasets. In this case, the remaining samples are '''Emily, Jessie, George, Rachel, and Jamal.''' Because both datasets are treated as sets, duplicate names in the bootstrap dataset are ignored when taking the difference. The illustration below shows how the calculation is done:

[[File:Complete Example 2.png|840x840px|Complete Example]]

=== Application ===
Creating the bootstrap and out-of-bag datasets is crucial, since they are used to test the accuracy of [[ensemble learning]] algorithms such as [[random forest]]. For example, a model that produces 50 trees from bootstrap/out-of-bag datasets will generally be more accurate than one that produces only 10 trees. Because the algorithm generates many trees, and therefore many bootstrap datasets, the chance that an object is left out of every bootstrap dataset is low. The next few sections describe how the random forest algorithm works in more detail.
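As an illustration, the dataset construction described above can be sketched in a few lines of Python. The names come from the example above; the random seed and the particular draw are purely illustrative:

<syntaxhighlight lang="python">
import random

# Original dataset: the group of 12 people from the example above.
original = ["Emily", "Jessie", "George", "Constantine", "Lexi", "Theodore",
            "John", "James", "Rachel", "Anthony", "Ellie", "Jamal"]

random.seed(0)  # fixed seed so the illustration is reproducible

# Bootstrap dataset: sample with replacement, same size as the original,
# so duplicate names are allowed.
bootstrap = [random.choice(original) for _ in range(len(original))]

# Out-of-bag dataset: everyone in the original who never appears in the
# bootstrap sample (a set difference, so duplicates are ignored).
out_of_bag = sorted(set(original) - set(bootstrap))

print(bootstrap)
print(out_of_bag)
</syntaxhighlight>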
=== Creation of decision trees ===
The next step of the algorithm involves the generation of [[decision tree]]s from the bootstrapped dataset. To achieve this, the process examines each gene/feature and determines, for each sample, whether the feature's presence or absence yields a positive or negative result. This information is then used to compute a [[confusion matrix]], which lists the true positives, false positives, true negatives, and false negatives of the feature when it is used as a classifier. The features are then ranked according to various [[Decision tree learning|classification metrics]] based on their confusion matrices. Some common metrics include the estimate of positive correctness (calculated by subtracting false positives from true positives), the measure of "goodness", and [[Information gain in decision trees|information gain]]. The top-ranked feature is then used to partition the samples into two sets: those that possess it and those that do not.

The diagram below shows a decision tree of depth two being used to classify data. For example, a data point that exhibits Feature 1 but not Feature 2 is given a "No", while a point that does not exhibit Feature 1 but does exhibit Feature 3 is given a "Yes".

[[File:Decision_Tree_Depth_2.png|Decision Tree Depth 2]]

This process is repeated recursively for successive levels of the tree until the desired depth is reached. At the very bottom of the tree, samples that test positive for the final feature are generally classified as positive, while those that lack the feature are classified as negative. These trees are then used as predictors to classify new data.

=== Random forests ===
The next part of the algorithm introduces yet another element of variability among the bootstrapped trees. In addition to each tree examining only a bootstrapped set of samples, only a small, constant number of the features is considered when ranking them as classifiers. This means that each tree knows only about the data pertaining to a small, constant number of features and a number of samples that is less than or equal to the size of the original dataset. Consequently, the trees are more likely to return a wider array of answers, derived from more diverse knowledge. The result is a [[random forest]], which possesses numerous benefits over a single decision tree generated without randomness.

In a random forest, each tree "votes" on whether or not to classify a sample as positive based on its features, and the sample is then classified according to the majority vote. An example is given in the diagram below, where four trees in a random forest vote on whether a patient with mutations A, B, F, and G has cancer. Since three of the four trees vote yes, the patient is classified as cancer positive.

[[File:Random Forest Diagram Extra Wide.png|center|frameless|1035x1035px]]
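The voting step itself amounts to taking the most common prediction across the trees. A minimal sketch, using hypothetical votes matching the diagram above:

<syntaxhighlight lang="python">
from collections import Counter

# Hypothetical votes from the four trees in the diagram above for one patient
# (True = "has cancer", False = "does not have cancer").
tree_votes = [True, True, False, True]

# Majority vote: the most common prediction across the trees wins.
prediction = Counter(tree_votes).most_common(1)[0][0]
print("cancer positive" if prediction else "cancer negative")  # -> cancer positive
</syntaxhighlight>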
Because of their properties, random forests are considered one of the most accurate data mining algorithms, are less likely to [[Overfitting|overfit]] their data, and run quickly and efficiently even on large datasets.<ref>{{Cite web|title=Random forests - classification description|url=https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm|access-date=2021-12-09|website=stat.berkeley.edu}}</ref> They are primarily useful for classification, as opposed to [[Regression analysis|regression]], which attempts to draw observed connections between statistical variables in a dataset. This makes random forests particularly useful in fields such as banking, healthcare, the stock market, and [[e-commerce]], where it is important to be able to predict future results based on past data.<ref name=":4">{{Cite web|title=Introduction to Random Forest in Machine Learning|url=https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/|access-date=2021-12-09|website=Engineering Education (EngEd) Program {{!}} Section}}</ref> One application is as a tool for predicting cancer based on genetic factors, as in the example above.

There are several important factors to consider when designing a random forest. If the trees are too deep, overfitting can still occur due to over-specificity. If the forest is too large, the algorithm may become less efficient due to increased runtime. Random forests also do not generally perform well when given sparse data with little variability.<ref name=":4" /> However, they still have numerous advantages over similar data classification algorithms such as [[neural network]]s, as they are much easier to interpret and generally require less data for training.{{citation needed|date=June 2024}} As an integral component of random forests, bootstrap aggregating is very important to classification algorithms, and provides a critical element of variability that allows for increased accuracy when analyzing new data, as discussed below.
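In practice, bagging, feature subsampling, and out-of-bag evaluation are usually provided by a library rather than implemented by hand. A brief illustration using scikit-learn's random forest; the synthetic dataset and the parameter values here are arbitrary and only meant to show where the concepts above appear:

<syntaxhighlight lang="python">
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, standing in for a real classification problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each tree is trained on a bootstrap sample of the rows and considers only a
# random subset of the features at each split; oob_score=True evaluates the
# forest on the out-of-bag samples.
forest = RandomForestClassifier(
    n_estimators=50,      # number of bootstrapped trees
    max_depth=5,          # limit depth to reduce overfitting
    max_features="sqrt",  # features considered at each split
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

print(forest.oob_score_)      # accuracy estimated from the out-of-bag samples
print(forest.predict(X[:1]))  # majority vote of the trees for one sample
</syntaxhighlight>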