=== Use in high-throughput data ===

New biomedical technologies such as [[DNA microarray|microarrays]], [[DNA sequencing|next-generation sequencers]] (for genomics) and [[mass spectrometers|mass spectrometry]] (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.<ref>{{cite journal|last1=Hayden|first1=Erika Check|title=Biostatistics: Revealing analysis|journal=Nature|date=8 February 2012|volume=482|issue=7384|pages=263–265|doi=10.1038/nj7384-263a|pmid=22329008|doi-access=free}}</ref> Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray can measure the expression of many thousands of genes simultaneously in order to determine which of them are expressed differently in diseased cells compared to normal cells. However, only a fraction of the genes will be differentially expressed.<ref>{{cite journal|last1=Efron|first1=Bradley|title=Microarrays, Empirical Bayes and the Two-Groups Model|journal=Statistical Science|date=February 2008|volume=23|issue=1|pages=1–22|doi=10.1214/07-STS236|arxiv=0808.0572|s2cid=8417479}}</ref>
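
The cost of performing many tests simultaneously can be illustrated with a small simulation. The sketch below is not taken from the cited sources: it assumes 10,000 hypothetical genes, none of which is truly differentially expressed (so every p-value is uniform on [0, 1]), and compares an unadjusted 0.05 threshold with the Benjamini–Hochberg procedure, which is one common way of controlling the false discovery rate. The function name <code>benjamini_hochberg</code> and all numbers are illustrative.

```python
import random


def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Remember the largest rank k with p_(k) <= (k / m) * q.
        if p_values[idx] <= rank / m * q:
            cutoff = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff])


# Hypothetical experiment: 10,000 genes, all null, so p-values are uniform.
rng = random.Random(1)
null_p = [rng.random() for _ in range(10_000)]

naive_hits = sum(p < 0.05 for p in null_p)    # unadjusted threshold
bh_hits = len(benjamini_hochberg(null_p))     # FDR-adjusted
print("unadjusted 'discoveries':", naive_hits)
print("Benjamini-Hochberg discoveries:", bh_hits)
```

With an unadjusted threshold, roughly 5% of the null genes (about 500 here) are expected to be flagged by chance alone, whereas the adjusted procedure can never flag more genes than the unadjusted threshold does.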

[[Multicollinearity]] often occurs in high-throughput biostatistical settings. Because the predictors (such as [[gene expression]] levels) are highly intercorrelated, the information carried by one predictor may already be contained in another. For instance, as few as 5% of the predictors may account for 90% of the variability of the response. In such a case, one could apply dimension reduction (for example via principal component analysis). Classical statistical techniques such as linear or [[logistic regression]] and [[linear discriminant analysis]] do not work well for high-dimensional data, i.e. when the number of observations ''n'' is smaller than the number of features or predictors ''p'' (''n'' < ''p''). In this setting, one can obtain quite high R<sup>2</sup> values on the training data even though the statistical model has very low predictive power. These classical techniques (especially [[least squares]] linear regression) were developed for low-dimensional data, where the number of observations is much larger than the number of predictors (''n'' ≫ ''p''). For high-dimensional data, model fit should therefore be judged on an independent validation test set, using the residual sum of squares (RSS) and R<sup>2</sup> computed on that set rather than on the training set.
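
The gap between training and validation R<sup>2</sup> described above can be reproduced with a minimal sketch. It assumes simulated data in which the response is pure noise, unrelated to any of the ''p'' = 200 predictors, and only ''n'' = 20 training observations. Because ''n'' < ''p'', the least squares problem has infinitely many exact solutions; the sketch picks the minimum-norm one, β = X<sup>T</sup>(XX<sup>T</sup>)<sup>−1</sup>y. All function names and sample sizes are illustrative assumptions.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS."""
    mean = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    tss = sum((a - mean) ** 2 for a in y)
    return 1 - rss / tss


def simulate(n_train=20, n_test=200, p=200, seed=0):
    rng = random.Random(seed)

    def draw(n):
        x = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # response is unrelated to x
        return x, y

    x_tr, y_tr = draw(n_train)
    x_te, y_te = draw(n_test)

    # Minimum-norm least squares for n < p: beta = X^T (X X^T)^{-1} y.
    gram = [[sum(u * v for u, v in zip(r1, r2)) for r2 in x_tr] for r1 in x_tr]
    alpha = solve(gram, y_tr)
    beta = [sum(alpha[i] * x_tr[i][j] for i in range(n_train)) for j in range(p)]

    def predict(x):
        return [sum(b * v for b, v in zip(beta, row)) for row in x]

    return r_squared(y_tr, predict(x_tr)), r_squared(y_te, predict(x_te))


train_r2, test_r2 = simulate()
print(f"training R^2:   {train_r2:.3f}")
print(f"validation R^2: {test_r2:.3f}")
```

The model interpolates the training data, so the training R<sup>2</sup> is essentially 1 although no real relationship exists; only the validation R<sup>2</sup> reveals that the model has no predictive power.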

It is often useful to pool information from multiple predictors. For example, [[Gene Set Enrichment Analysis]] (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.<ref>{{cite journal|last1=Subramanian|first1=A.|last2=Tamayo|first2=P.|last3=Mootha|first3=V. K.|last4=Mukherjee|first4=S.|last5=Ebert|first5=B. L.|last6=Gillette|first6=M. A.|last7=Paulovich|first7=A.|last8=Pomeroy|first8=S. L.|last9=Golub|first9=T. R.|last10=Lander|first10=E. S.|last11=Mesirov|first11=J. P.|title=Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles|journal=Proceedings of the National Academy of Sciences|date=30 September 2005|volume=102|issue=43|pages=15545–15550|doi=10.1073/pnas.0506580102|pmid=16199517|pmc=1239896|bibcode=2005PNAS..10215545S|doi-access=free}}</ref> These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: a single gene is more likely to be falsely identified as perturbed than a whole pathway is. Furthermore, this approach makes it possible to integrate accumulated knowledge about biochemical pathways (such as the [[JAK-STAT signaling pathway]]).
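
The idea of testing a whole gene set rather than single genes can be sketched with a simple over-representation test. This is deliberately not the GSEA algorithm itself, which uses a ranked-list running-sum statistic with permutation-based significance; it is a simpler, related technique based on the [[hypergeometric distribution]]. It asks how likely it is to see at least ''k'' genes of a pathway among the genes called differentially expressed if those genes had been drawn at random. The function name and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, total_genes, set_size, n_selected):
    """P(overlap >= k) when n_selected genes are drawn at random from
    total_genes, of which set_size belong to the gene set of interest."""
    denominator = comb(total_genes, n_selected)
    numerator = sum(
        comb(set_size, i) * comb(total_genes - set_size, n_selected - i)
        for i in range(k, min(set_size, n_selected) + 1)
    )
    return numerator / denominator


# Hypothetical numbers: 10,000 genes measured, a pathway of 50 genes,
# 200 genes called differentially expressed, 10 of them in the pathway.
# By chance, only about 200 * 50 / 10,000 = 1 overlapping gene is expected.
p_value = hypergeom_upper_tail(10, 10_000, 50, 200)
print(f"enrichment p-value: {p_value:.2e}")
```

A small p-value suggests that the pathway as a whole is over-represented among the selected genes, a conclusion that depends less on any single gene than a gene-by-gene test would.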