Bootstrap aggregating
== Improving Random Forests and Bagging ==
While the techniques described above utilize [[random forest]]s and [[Bootstrapping|bagging]] (bootstrap aggregating), certain techniques can be used to improve their execution and voting time, their prediction accuracy, and their overall performance. The following are key steps in creating an efficient random forest:
# Specify the maximum depth of trees: instead of allowing the random forest to continue until all nodes are pure, cut it off at a certain depth to further reduce the chance of overfitting.
# Prune the dataset: an extremely large dataset may produce results that are less indicative of the data provided than a smaller set that more accurately represents what is being focused on.
#* Continue pruning the data at each node split rather than only in the original bagging process.
# Decide on accuracy or speed: depending on the desired results, increasing or decreasing the number of trees within the forest can help. Increasing the number of trees generally provides more accurate results, while decreasing it provides quicker results.
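The steps above can be sketched as a toy bagging ensemble in plain Python. This is a minimal illustration, not a prescribed implementation: the depth-one "stump" learner, the 1-D dataset, and all helper names below are this sketch's own assumptions, standing in for depth-capped trees, a chosen tree count, and a recorded random seed.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a sample of the same size as the data, with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """A toy depth-1 'tree': pick the single threshold on the one
    feature that best separates the labels in this bootstrap sample.
    (A stand-in for a tree whose maximum depth has been capped.)"""
    best = None
    for x, _ in sample:
        acc = sum((1 if xi > x else 0) == yi for xi, yi in sample) / len(sample)
        if best is None or acc > best[1]:
            best = (x, acc)
    best_thr = best[0]
    return lambda x: 1 if x > best_thr else 0

def bagged_ensemble(data, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample.
    The seed is fixed so the exact forest can be recreated later."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_trees)]

def predict(ensemble, x):
    """Majority vote over all trees in the forest."""
    return Counter(t(x) for t in ensemble).most_common(1)[0][0]

# Toy dataset: x in [0.0, 0.9], label 1 when x >= 0.5.
data = [(i / 10, int(i >= 5)) for i in range(10)]
forest = bagged_ensemble(data, n_trees=25, seed=42)
print(predict(forest, 0.9), predict(forest, 0.1))
```

Raising `n_trees` makes the vote more stable at the cost of training time, mirroring the accuracy-versus-speed trade-off in step 3 above.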
{| class="wikitable"
|+Pros and Cons of Random Forests and Bagging
!Pros
!Cons
|-
|There are fewer requirements for normalization and scaling, making random forests more convenient to use.<ref>{{Cite web|title=Random Forest Pros & Cons|url=https://holypython.com/rf/random-forest-pros-cons/|access-date=2021-11-26|website=HolyPython.com}}</ref>
|The algorithm may change significantly if there is a slight change to the data being bootstrapped and used within the forests.<ref>{{Cite web|last=K|first=Dhiraj|date=2020-11-22|title=Random Forest Algorithm Advantages and Disadvantages|url=https://dhirajkumarblog.medium.com/random-forest-algorithm-advantages-and-disadvantages-1ed22650c84f|access-date=2021-11-26|website=Medium}}</ref> In other words, random forests are heavily dependent on their datasets; changing these can drastically change the individual trees' structures.
|-
|Easy data preparation. Data is prepared by creating a bootstrap set and a certain number of decision trees to build a random forest that also utilizes feature selection, as mentioned in {{Slink||Random Forests}}.
|Random forests are more complex to implement than lone decision trees or other algorithms. They take extra steps for bagging and require recursion to produce an entire forest, which complicates implementation and demands much more computational power and resources.
|-
|Consisting of multiple [[decision tree]]s, forests can make predictions more accurately than single trees.
|Requires much more time to train than decision trees. A large forest can quickly slow a program down, because it has to traverse much more data even though each tree uses a smaller set of samples and features.
|-
|Works well with non-linear data. As most tree-based algorithms use linear splits, an ensemble of trees works better than a single tree on data with non-linear properties (i.e., most real-world distributions). This is a significant advantage because other data mining techniques, such as single decision trees, do not handle non-linear data as well.
|A single decision tree is much easier to interpret than a random forest. A single tree can be walked by hand (by a human), leading to a somewhat "explainable" understanding for the analyst of what the tree is actually doing. As the number of trees and the schemes for ensembling them into predictions grow, this review becomes much more difficult, if not impossible.
|-
|There is a lower risk of [[overfitting]], and random forests run efficiently even on large datasets.<ref>{{Cite web|last=Team|first=Towards AI|title=Why Choose Random Forest and Not Decision Trees – Towards AI – The World's Leading AI and Technology Publication|date=2 July 2020|url=https://towardsai.net/p/machine-learning/why-choose-random-forest-and-not-decision-trees|access-date=2021-11-26}}</ref> This results from the random forest's use of bagging in conjunction with random feature selection.
|Does not predict beyond the range of the training data. Because each bootstrap set considers only part of the data, the forest cannot extrapolate to values outside what it has seen.
|-
|The random forest classifier operates with high accuracy and speed.<ref>{{Cite web|title=Random Forest|url=https://corporatefinanceinstitute.com/resources/knowledge/other/random-forest/|access-date=2021-11-26|website=Corporate Finance Institute}}</ref> Each tree trains on a smaller bootstrap sample and feature subset, which keeps the individual trees fast to build.
|To recreate specific results, it is necessary to keep track of the exact random seed used to generate the bootstrap sets. This may be important when collecting data for research or within a data mining class. Random seeds are essential to random forests, but failing to record them makes it hard to support claims based on a forest's results.
|-
|Deals well with [[missing data]] and datasets with many outliers. Forests handle these by using [[Binning method|binning]], i.e. grouping values together to avoid values that are extremely far apart.
|
|}
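The last table row credits binning with taming outliers. As a hedged illustration (equal-frequency binning is one common scheme; the article does not specify which variant is used), grouping values by quantile places an extreme outlier into the same edge bin as the largest ordinary values, so it no longer sits "terribly far apart" from the rest:

```python
def quantile_bins(values, n_bins=4):
    """Build an equal-frequency binning function from the data:
    interior edges sit at the 1/4, 2/4, 3/4 quantile positions."""
    ordered = sorted(values)
    edges = [ordered[round(i * (len(ordered) - 1) / n_bins)]
             for i in range(1, n_bins)]
    def bin_of(x):
        # bin index = number of edges the value exceeds
        return sum(x > e for e in edges)
    return bin_of

values = [1, 2, 2, 3, 3, 4, 5, 1000]  # 1000 is an extreme outlier
bin_of = quantile_bins(values)
print(bin_of(5), bin_of(1000))  # both land in the same top bin
```

Because the bins are defined by ranks rather than raw magnitudes, the outlier 1000 is grouped with 5 instead of dominating the value range on its own.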