=== Use in high-throughput data ===
New biomedical technologies such as [[DNA microarray|microarrays]], [[DNA sequencing|next-generation sequencers]] (for genomics) and [[mass spectrometers|mass spectrometry]] (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.<ref>{{cite journal|last1=Hayden|first1=Erika Check|title=Biostatistics: Revealing analysis|journal=Nature|date=8 February 2012|volume=482|issue=7384|pages=263–265|doi=10.1038/nj7384-263a|pmid=22329008|doi-access=free}}</ref> Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray can measure the expression of many thousands of genes simultaneously in order to determine which of them are expressed differently in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.<ref>{{cite journal|last1=Efron|first1=Bradley|title=Microarrays, Empirical Bayes and the Two-Groups Model|journal=Statistical Science|date=February 2008|volume=23|issue=1|pages=1–22|doi=10.1214/07-STS236|arxiv=0808.0572|s2cid=8417479}}</ref>

[[Multicollinearity]] often occurs in high-throughput biostatistical settings. Because the predictors (such as [[gene expression]] levels) are often highly intercorrelated, the information in one predictor may be largely contained in another. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply dimension reduction (for example via principal component analysis).

Classical statistical techniques like linear or [[logistic regression]] and [[linear discriminant analysis]] do not work well for high-dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). In this setting, one can obtain quite high R<sup>2</sup> values on the data used for fitting despite very low predictive power of the statistical model.
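
The gap between a high R<sup>2</sup> and low predictive power when n < p can be illustrated with a small simulation. The following is a minimal sketch, not a recommended analysis: the sample sizes, the random seed, the helper name <code>r_squared</code> and the use of NumPy are all assumptions made for illustration. The response is pure noise, so there is nothing to predict, yet least squares fits the training data almost perfectly.

```python
import numpy as np

def r_squared(y, y_pred):
    """Coefficient of determination: 1 - RSS / TSS."""
    rss = np.sum((y - y_pred) ** 2)          # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)      # total sum of squares
    return 1.0 - rss / tss

rng = np.random.default_rng(0)               # fixed seed (arbitrary choice)
n_train, n_test, p = 20, 200, 50             # fewer observations than predictors

# Predictors and response are independent noise: no real signal exists.
X_train = rng.normal(size=(n_train, p))
y_train = rng.normal(size=n_train)
X_test = rng.normal(size=(n_test, p))
y_test = rng.normal(size=n_test)

# Least-squares fit; with n < p this returns the minimum-norm solution,
# which reproduces the training responses essentially exactly.
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

r2_train = r_squared(y_train, X_train @ beta)
r2_test = r_squared(y_test, X_test @ beta)

print(f"R^2 on training data: {r2_train:.3f}")
print(f"R^2 on new data:      {r2_test:.3f}")
```

The training R<sup>2</sup> is close to 1 only because the model has more free coefficients than observations. On new data drawn from the same distribution the R<sup>2</sup> is far lower and can be negative, meaning the fitted model predicts worse than the mean of the response.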
These classical statistical techniques (esp. [[least squares]] linear regression) were developed for low-dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should evaluate the model on an independent validation test set and report the residual sum of squares (RSS) and R<sup>2</sup> computed on that set, not those of the training set.

Often, it is useful to pool information from multiple predictors together. For example, [[Gene Set Enrichment Analysis]] (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.<ref>{{cite journal|last1=Subramanian|first1=A.|last2=Tamayo|first2=P.|last3=Mootha|first3=V. K.|last4=Mukherjee|first4=S.|last5=Ebert|first5=B. L.|last6=Gillette|first6=M. A.|last7=Paulovich|first7=A.|last8=Pomeroy|first8=S. L.|last9=Golub|first9=T. R.|last10=Lander|first10=E. S.|last11=Mesirov|first11=J. P.|title=Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles|journal=Proceedings of the National Academy of Sciences|date=30 September 2005|volume=102|issue=43|pages=15545–15550|doi=10.1073/pnas.0506580102|pmid=16199517|pmc=1239896|bibcode=2005PNAS..10215545S|doi-access=free}}</ref> These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: it is more likely that a single gene is falsely found to be perturbed than that a whole pathway is. Furthermore, this approach makes it possible to integrate accumulated knowledge about biochemical pathways (like the [[JAK-STAT signaling pathway]]).
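
The idea of pooling evidence over a gene set can be sketched with a deliberately simplified test. This is not the GSEA algorithm, which uses a running-sum enrichment statistic and phenotype permutations. The sketch instead compares the mean per-gene score of one set against the means of randomly drawn sets of the same size. The simulated z-scores, the effect size of 1.5, the set size and the function name <code>gene_set_p_value</code> are all hypothetical choices made for illustration.

```python
import numpy as np

def gene_set_p_value(scores, set_idx, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the mean score of a gene set,
    compared with random gene sets of the same size (simplified; not GSEA)."""
    rng = np.random.default_rng(seed)
    set_idx = np.asarray(set_idx)
    observed = scores[set_idx].mean()
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Draw a random set of genes with the same size as the tested set.
        random_idx = rng.choice(scores.size, size=set_idx.size, replace=False)
        null[i] = scores[random_idx].mean()
    exceed = np.sum(np.abs(null) >= abs(observed))
    # Add-one correction keeps the p-value strictly above zero.
    return (exceed + 1) / (n_perm + 1)

rng = np.random.default_rng(1)                 # fixed seed (arbitrary choice)
n_genes, set_size = 1000, 25

# Simulated per-gene differential-expression z-scores (assumed data).
z = rng.normal(size=n_genes)
pathway = np.arange(set_size)                  # hypothetical pathway: first 25 genes
z[pathway] += 1.5                              # modest shift in every pathway gene

p_pathway = gene_set_p_value(z, pathway)
strongest_gene = np.max(np.abs(z[pathway]))

print(f"largest |z| of any single pathway gene: {strongest_gene:.2f}")
print(f"permutation p-value for the whole pathway: {p_pathway:.4f}")
```

A shift of this size is typically unremarkable for any single gene once thousands of genes are tested, but averaging over the set reduces the noise by roughly the square root of the set size, so the set-level signal stands out much more clearly.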