
== Developments and big data ==
{{More citations needed section|date=December 2016}}

Recent developments have had a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale and the ability to perform much more complex analyses using computational techniques. These changes stem from developments in areas such as [[DNA sequencing|sequencing]] technologies, [[bioinformatics]] and [[machine learning]] (see [[Machine learning in bioinformatics]]).

=== Use in high-throughput data ===

New biomedical technologies like [[DNA microarray|microarrays]], [[DNA sequencing|next-generation sequencers]] (for genomics) and [[mass spectrometers|mass spectrometry]] (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.<ref>{{cite journal|last1=Hayden|first1=Erika Check|title=Biostatistics: Revealing analysis|journal=Nature|date=8 February 2012|volume=482|issue=7384|pages=263–265|doi=10.1038/nj7384-263a|pmid=22329008|doi-access=free}}</ref> Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.<ref>{{cite journal|last1=Efron|first1=Bradley|title=Microarrays, Empirical Bayes and the Two-Groups Model|journal=Statistical Science|date=February 2008|volume=23|issue=1|pages=1–22|doi=10.1214/07-STS236|arxiv=0808.0572|s2cid=8417479}}</ref>
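
The scale of the problem can be illustrated with a small simulation. The sketch below is not taken from the cited sources: the array size, the share of truly changed genes and the threshold of 0.05 are hypothetical values chosen for illustration. It assumes that genes with no true change yield p-values that are uniformly distributed, and counts how many of them fall below the threshold by chance alone.

```python
import random

def count_false_positives(n_null_genes, alpha, seed=0):
    """Count unchanged genes whose p-value falls below alpha purely by chance."""
    rng = random.Random(seed)
    # Assumption: under the null hypothesis a p-value is uniform on [0, 1].
    return sum(rng.random() < alpha for _ in range(n_null_genes))

n_genes = 10_000            # hypothetical number of genes on the array
n_changed = n_genes // 20   # hypothetical: 5% are truly differentially expressed
alpha = 0.05                # per-gene significance threshold

n_null = n_genes - n_changed
expected = n_null * alpha   # expected number of chance findings
observed = count_false_positives(n_null, alpha)

print(f"unchanged genes: {n_null}")
print(f"expected false positives: {expected:.0f}, simulated: {observed}")
```

Under these assumptions the number of chance findings is of the same order as the number of truly changed genes, which is why results from such experiments cannot be read gene by gene without further statistical treatment.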

[[Multicollinearity]] often occurs in high-throughput biostatistical settings. Because the predictors (such as [[gene expression]] levels) are highly intercorrelated, the information carried by one predictor may already be contained in another. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis).

Classical statistical techniques like linear or [[logistic regression]] and [[linear discriminant analysis]] do not work well for high-dimensional data (i.e. when the number of observations ''n'' is smaller than the number of features or predictors ''p'': ''n'' < ''p''). In this setting, one can obtain quite high R<sup>2</sup> values on the data used to fit the model despite very low predictive power of the statistical model. These classical statistical techniques (especially [[least squares]] linear regression) were developed for low-dimensional data (i.e. where the number of observations ''n'' is much larger than the number of predictors ''p'': ''n'' >> ''p''). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R<sup>2</sup> of the validation test set, not those of the training set.
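
The gap between training and validation R<sup>2</sup> can be reproduced with simulated data. The following is a minimal sketch under stated assumptions, not a recommended analysis: the sizes (20 observations, 50 predictors) are hypothetical, the response is generated independently of every predictor, and the helper names <code>solve</code>, <code>r_squared</code>, <code>predict</code> and <code>sample</code> are invented for this example. It uses only the Python standard library; in practice a numerical library would be used instead.

```python
import random

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    mean = sum(y) / len(y)
    rss = sum((obs - fit) ** 2 for obs, fit in zip(y, y_hat))
    tss = sum((obs - mean) ** 2 for obs in y)
    return 1.0 - rss / tss

def predict(x_rows, beta):
    """Linear prediction without intercept."""
    return [sum(b * v for b, v in zip(beta, row)) for row in x_rows]

rng = random.Random(1)
n, p = 20, 50  # hypothetical sizes with n < p

def sample(n_rows):
    """Draw predictors and a response that is pure noise (no true relationship)."""
    x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n_rows)]
    y = [rng.gauss(0, 1) for _ in range(n_rows)]
    return x, y

x_train, y_train = sample(n)
x_test, y_test = sample(n)

# With p >= n the least-squares solution is not unique. One valid solution uses
# only the first n predictors and already reproduces the training data exactly.
beta = solve([row[:n] for row in x_train], y_train) + [0.0] * (p - n)

train_r2 = r_squared(y_train, predict(x_train, beta))
test_r2 = r_squared(y_test, predict(x_test, beta))
print(f"training R^2:   {train_r2:.3f}")
print(f"validation R^2: {test_r2:.3f}")
```

The training fit is essentially perfect even though the response is unrelated to the predictors, while the same coefficients perform poorly on the independent validation set.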

Often, it is useful to pool information from multiple predictors. For example, [[Gene Set Enrichment Analysis]] (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.<ref>{{cite journal|last1=Subramanian|first1=A.|last2=Tamayo|first2=P.|last3=Mootha|first3=V. K.|last4=Mukherjee|first4=S.|last5=Ebert|first5=B. L.|last6=Gillette|first6=M. A.|last7=Paulovich|first7=A.|last8=Pomeroy|first8=S. L.|last9=Golub|first9=T. R.|last10=Lander|first10=E. S.|last11=Mesirov|first11=J. P.|title=Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles|journal=Proceedings of the National Academy of Sciences|date=30 September 2005|volume=102|issue=43|pages=15545–15550|doi=10.1073/pnas.0506580102|pmid=16199517|pmc=1239896|bibcode=2005PNAS..10215545S|doi-access=free}}</ref> These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: it is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the [[JAK-STAT signaling pathway]]) using this approach.
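
The idea of scoring a whole gene set can be sketched with a running-sum statistic over a ranked gene list. The code below is a simplified, unweighted variant resembling a Kolmogorov–Smirnov statistic. It is not the full published GSEA procedure, which additionally weights genes by their correlation with the phenotype and assesses significance by permutation. The gene names and sets are hypothetical, and the ranked list is assumed to contain no duplicates.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum statistic: the largest deviation from zero.

    Walk down the ranked list, stepping up for genes in the set and down
    for genes outside it, so that the walk ends at zero.
    """
    members = set(gene_set) & set(ranked_genes)
    n_hits = len(members)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hits if gene in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical genes, ordered from most to least differentially expressed.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9", "g10"]
pathway_top = {"g1", "g2", "g3"}      # set concentrated at the top of the list
pathway_spread = {"g2", "g6", "g9"}   # set scattered across the list

print(enrichment_score(ranked, pathway_top))
print(enrichment_score(ranked, pathway_spread))
```

A set whose members cluster at one end of the ranking yields a score near 1 in absolute value, while a scattered set stays close to zero.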

=== Bioinformatics advances in databases, data mining, and biological interpretation ===

The development of [[biological database]]s enables the storage and management of biological data and can provide access to users around the world. Researchers use them to deposit data, to retrieve information and files (raw or processed) originating from other experiments, and to find indexed scientific articles, as in [[PubMed]]. Users can also search for a term of interest (a gene, a protein, a disease, an organism, and so on) and review all results related to that search. There are databases dedicated to [[Single-nucleotide polymorphism|SNPs]] ([[dbSNP]]), to knowledge about the characterization of genes and their pathways ([[KEGG]]), and to the description of gene function, classified by cellular component, molecular function and biological process ([[Gene ontology|Gene Ontology]]).<ref name=":4">{{cite journal|doi=10.1002/jcp.21218|pmid=17654500|title=Bioinformatics|journal=Journal of Cellular Physiology|volume=213|issue=2|pages=365–9|year=2007|last1=Moore|first1=Jason H|s2cid=221831488|doi-access=free}}</ref> In addition to databases that contain specific molecular information, there are broader ones that store information about an organism or group of organisms. An example of a database devoted to a single organism, but containing a large amount of data about it, is the ''[[Arabidopsis thaliana]]'' genetic and molecular database, TAIR.<ref>{{cite web|url=https://www.arabidopsis.org/|title=TAIR - Home Page|website=www.arabidopsis.org}}</ref> Phytozome,<ref>{{cite web|url=https://phytozome.jgi.doe.gov/pz/portal.html|title=Phytozome|website=phytozome.jgi.doe.gov}}</ref> in turn, stores the assemblies and annotation files of dozens of plant genomes and also provides visualization and analysis tools.
Some databases are also interconnected in order to exchange and share information. A major initiative of this kind is the [[International Nucleotide Sequence Database Collaboration]] (INSDC),<ref>{{cite web|url=http://www.insdc.org/|title=International Nucleotide Sequence Database Collaboration - INSDC|website=www.insdc.org}}</ref> which links data from DDBJ,<ref>{{cite web|url=https://www.ddbj.nig.ac.jp/index-e.html|title=Top|website=www.ddbj.nig.ac.jp|date=11 January 2024 }}</ref> EMBL-EBI,<ref>{{cite web|url=https://www.ebi.ac.uk/|title=The European Bioinformatics Institute < EMBL-EBI|website=www.ebi.ac.uk}}</ref> and NCBI.<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/|title=National Center for Biotechnology Information|publisher=U.S. National Library of Medicine|website=www.ncbi.nlm.nih.gov}}</ref>

The growing size and complexity of molecular datasets has led to the use of powerful statistical methods based on algorithms developed in computer science, particularly in the field of [[machine learning]]. Data mining and machine learning allow the detection of patterns in data with a complex structure, such as biological data, using methods of [[Supervised learning|supervised]] and [[unsupervised learning]], regression, detection of [[Cluster analysis|clusters]] and [[Association rule learning|association rule mining]], among others.<ref name=":4"/> For example, [[self-organizing map]]s and [[k-means clustering|''k''-means]] are clustering algorithms, while [[Artificial neural network|neural networks]] and [[support vector machine]]s are common machine learning models.
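
As an illustration of a clustering algorithm, the sketch below implements ''k''-means in its basic form (Lloyd's algorithm) using only the Python standard library. The data are simulated: two hypothetical groups of ten samples, each described by two expression values. The function names <code>sq_dist</code> and <code>k_means</code> and all parameter values are assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(pt, centroids[c]))
                  for pt in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Simulated samples: two hypothetical groups with different expression levels.
rng = random.Random(42)
group_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(10)]
group_b = [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(10)]
points = group_a + group_b

labels, centroids = k_means(points, k=2)
print(labels)
```

The algorithm receives no group labels, which is what makes it an unsupervised method. With groups as well separated as in this simulation, the two recovered clusters coincide with the two simulated groups.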

Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important for carrying out an experiment correctly, from planning through data generation and analysis to the biological interpretation of the results.<ref name=":4"/>

=== Use of computationally intensive methods ===

The advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods such as [[Bootstrapping (statistics)|bootstrapping]] and other [[Re-sampling (statistics)|re-sampling]] methods.
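
A minimal sketch of bootstrapping is given below: a percentile bootstrap confidence interval for a sample mean, computed by repeatedly re-sampling the observed data with replacement. The measurements are hypothetical, and the function name <code>bootstrap_ci</code>, the number of re-samples and the confidence level are assumptions chosen for illustration. Other bootstrap interval constructions exist and may be preferable in practice.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)
    n = len(sample)
    # Re-sample with replacement n_boot times and recompute the statistic.
    estimates = sorted(stat(rng.choices(sample, k=n)) for _ in range(n_boot))
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical measurements (for example, a biomarker level in ten subjects).
measurements = [4.1, 5.3, 4.8, 6.0, 5.1, 4.4, 5.9, 5.0, 4.7, 5.6]

low, high = bootstrap_ci(measurements)
print(f"mean = {statistics.mean(measurements):.2f}, "
      f"95% bootstrap interval = ({low:.2f}, {high:.2f})")
```

The method replaces an analytical formula for the sampling distribution with thousands of recomputations, which is why it became practical only once computing resources were cheap.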

[[Random forests]] have gained popularity as a method for performing [[statistical classification]]. Random forest techniques generate a panel of decision trees. Individual decision trees have the advantage that they can be drawn and interpreted, even with only a basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.{{citation needed|date=December 2016}}
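
The idea of a panel of trees that vote can be sketched as follows. This is a deliberately simplified illustration, not a full random forest: each "tree" is a single-split decision stump fitted to a bootstrap sample using one randomly chosen feature, whereas real implementations grow deeper trees and draw a random subset of features at every split. The data, the labels and all function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Depth-1 decision tree on one feature: choose the threshold with fewest errors."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue  # a split must leave observations on both sides
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, feature, threshold, left_label, right_label)
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    """Follow the single split of a stump."""
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

def fit_forest(rows, labels, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of rows
        feature = rng.randrange(n_features)         # random feature for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees in the panel."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two biomarker levels per patient.
rows = [(1.0, 1.2), (0.8, 0.9), (1.3, 1.1), (1.1, 0.7),
        (3.9, 4.2), (4.1, 3.8), (3.7, 4.0), (4.3, 4.4)]
labels = ["healthy"] * 4 + ["diseased"] * 4

forest = fit_forest(rows, labels)
print(predict_forest(forest, (0.5, 0.5)))
print(predict_forest(forest, (4.5, 4.5)))
```

Each stump on its own can be read as a single rule of the form "feature at or below threshold", which is the interpretability property attributed to decision trees above. The prediction of the panel as a whole comes from the vote.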