{{Short description|Application of statistical techniques to biological systems}}
{{redirect|Biometry|the automated recognition of people based on intrinsic physical or behavioural traits|Biometrics}}
{{for|the academic journal|Biostatistics (journal)}}
'''Biostatistics''' (also known as '''biometry''') is a branch of [[statistics]] that applies statistical methods to a wide range of topics in [[biology]]. It encompasses the design of biological [[experiment]]s, the collection and analysis of data from those experiments, and the interpretation of the results.

== History ==
=== Biostatistics and genetics ===
Biostatistical modeling forms an important part of numerous modern biological theories. [[Genetics]] studies have, since their beginning, used statistical concepts to understand observed experimental results. Some geneticists also advanced statistics itself by developing methods and tools. [[Gregor Mendel]] began the study of genetics by investigating segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. [[Francis Galton]] tried to expand Mendel's discoveries with human data and proposed a different model in which fractions of the heredity come from each ancestor, composing an infinite series. He called this the "[[Francis Galton|Law of Ancestral Heredity]]". His ideas were strongly opposed by [[William Bateson]], who followed Mendel's conclusion that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as [[Raphael Weldon]], [[Arthur Dukinfield Darbishire]] and [[Karl Pearson]], and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as [[Charles Davenport]] and [[Wilhelm Johannsen]].
Later, biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian [[Modern synthesis (20th century)|modern evolutionary synthesis]].

Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of [[population genetics]] and this synthesis all relied on statistics and developed its use in biology.
* [[Ronald Fisher]] worked alongside statistician Betty Allan, developing several basic statistical methods in support of his work on the crop experiments at [[Rothamsted Research]]. These were published in Fisher's books ''[[Statistical Methods for Research Workers]]'' (1925) and ''[[The Genetical Theory of Natural Selection]]'' (1930), as well as in Allan's scientific papers.<ref>{{Cite web |last=Centre for Transformative Innovation |first=Swinburne University of Technology |title=Allan, Frances Elizabeth (Betty) - Person - Encyclopedia of Australian Science and Innovation |url=https://www.eoas.info/biogs/P001468b.htm |access-date=2022-10-26 |website=www.eoas.info |language=en-gb}}</ref> Fisher went on to make many contributions to genetics and statistics, including [[ANOVA]], the [[p-value]] concept, [[Ronald Fisher|Fisher's exact test]] and [[Ronald Fisher|Fisher's equation]] for [[population dynamics]]. He is credited with the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability".<ref>{{cite journal|last1=Gunter|first1=Chris |title=Quantitative Genetics|journal=Nature|date=10 December 2008|volume=456|issue=7223 |pages=719|doi=10.1038/456719a|pmid=19079046 |bibcode=2008Natur.456..719G|doi-access=free}}</ref>
* [[Sewall G. Wright]] developed [[F-statistics|''F''-statistics]] and methods of computing them, and defined the [[inbreeding coefficient]].
* [[J. B. S. Haldane]]'s book, ''The Causes of Evolution'', reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of [[primordial soup]].

These and other biostatisticians, [[mathematical biology|mathematical biologists]], and statistically inclined geneticists helped bring together [[evolutionary biology]] and [[genetics]] into a consistent, coherent whole that could begin to be [[Statistics|quantitative]]ly modeled.

In parallel to this overall development, the pioneering work of [[D'Arcy Thompson]] in ''On Growth and Form'' also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not [[qualitative data|qualitatively]] apparent. One anecdote describes [[Thomas Hunt Morgan]] banning the [[Friden, Inc.|Friden calculator]] from his department at [[Caltech]], saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in [[placer mining]]."<ref>{{cite web|url=http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://www.tilsonfunds.com/MungerUCSBspeech.pdf |archive-date=2022-10-09 |url-status=live|title=Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs|author=Charles T. Munger|date=2003-10-03}}</ref>

== Research planning ==
Any research in [[life sciences]] is proposed to answer a [[scientific question]] we might have.
To answer this question with high certainty, we need [[Accuracy and precision|accurate]] results. A correct definition of the main [[hypothesis]] and of the research plan reduces errors when drawing conclusions about a phenomenon. The research plan might include the research question, the hypothesis to be tested, the [[experimental design]], [[data collection]] methods, [[data analysis]] perspectives and the costs involved. It is essential to conduct the study according to the three basic principles of experimental statistics: [[randomization]], [[Replication (statistics)|replication]], and local control.

=== Research question ===
The research question defines the objective of a study. The research is guided by the question, so it needs to be concise while focusing on interesting and novel topics that may advance science and knowledge in the field. To define how to ask the [[scientific question]], an exhaustive [[literature review]] may be necessary, so that the research adds value for the [[scientific community]].<ref name=":3">{{cite journal|last1=Nizamuddin|first1=Sarah L.|last2=Nizamuddin|first2=Junaid|last3=Mueller|first3=Ariel|last4=Ramakrishna|first4=Harish|last5=Shahul|first5=Sajid S.|title=Developing a Hypothesis and Statistical Planning|journal=Journal of Cardiothoracic and Vascular Anesthesia|date=October 2017|volume=31|issue=5|pages=1878–1882|doi=10.1053/j.jvca.2017.04.020|pmid=28778775}}</ref>

=== Hypothesis definition ===
Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a [[hypothesis]]. The main proposition is called the [[null hypothesis]] (H<sub>0</sub>) and is usually based on established knowledge about the topic or on an obvious occurrence of the phenomenon, supported by a thorough literature review. It can be regarded as the standard expected answer for the data under the situation in [[Experiment|test]].
In general, H<sub>0</sub> assumes no association between treatments. The [[alternative hypothesis]], on the other hand, is the negation of H<sub>0</sub>: it assumes some degree of association between the treatment and the outcome. In either case, the hypothesis is grounded in the research question and its expected and unexpected answers.<ref name=":3" />

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: which is the better diet? In this case, H<sub>0</sub> would be that there is no difference between the two diets in mice [[metabolism]] (H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub>) and the [[alternative hypothesis]] would be that the diets have different effects on the animals' metabolism (H<sub>1</sub>: μ<sub>1</sub> ≠ μ<sub>2</sub>).

The [[hypothesis]] is defined by the researcher, according to their interest in answering the main question. In addition, there can be more than one [[alternative hypothesis]]. An alternative can assume not only a difference between the observed parameters but also the direction of that difference (''i.e.'' higher or lower).

=== Sampling ===
Usually, a study aims to understand the effect of a phenomenon on a [[population]]. In [[biology]], a [[population]] is defined as all the [[individual]]s of a given [[species]] in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections that can be studied: a [[population]] may be not only the individuals but also the total of one specific component of their [[organism]]s, such as the whole [[genome]] or all the sperm [[cell (biology)|cells]] for animals, or the total leaf area for a plant.

It is usually not possible to take [[Measurement|measurements]] from all the elements of a [[population]]. Because of that, the [[Sampling (statistics)|sampling]] process is very important for [[statistical inference]].
[[Sampling (statistics)|Sampling]] is defined as randomly obtaining a representative part of the entire population in order to make subsequent inferences about that population. The [[Sample (statistics)|sample]] should therefore capture as much of the [[Statistical variability|variability]] across the population as possible.<ref name=":2">{{cite journal| doi= 10.1177/0115426507022006629| pmid= 18042950| title= Biostatistics Primer: Part I| journal= Nutrition in Clinical Practice| volume= 22| issue= 6| pages= 629–35| year= 2017| last1= Overholser| first1= Brian R| last2= Sowinski| first2= Kevin M}}</ref> The [[sample size]] is determined by several factors, ranging from the scope of the research to the resources available. In [[clinical research]], the trial type (non-[[inferiority]], [[Equivalence (measure theory)|equivalence]], or [[superior (hierarchy)|superior]]ity) is key in determining sample [[size]].<ref name=":3" />

=== Experimental design ===
[[Experimental designs]] uphold the basic principles of [[design of experiments|experimental statistics]]. There are three basic experimental designs for randomly allocating [[treatment group|treatments]] to all [[Quadrat|plots]] of the [[experiment]]: the [[completely randomized design]], the [[randomized block design]], and [[factorial designs]]. Treatments can be arranged in many ways inside the experiment. In [[agriculture]], the correct [[experimental design]] is the root of a good study, and the arrangement of [[treatment group|treatments]] within the study is essential because the [[environment (systems)|environment]] largely affects the [[Quadrat|plots]] ([[plants]], [[livestock]], [[microorganism]]s). The main arrangements can be found in the literature under the names of "[[lattice model (physics)|lattices]]", "incomplete blocks", "[[split plot]]", "augmented blocks", and many others. All of the designs might include [[Scientific control|control plots]], determined by the researcher, to provide an [[Estimation theory|error estimation]] during [[inference]].
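
The random allocation behind the first two designs can be sketched in a few lines of Python. This is only an illustrative sketch: the treatment labels, the seed, and the function names <code>completely_randomized</code> and <code>randomized_blocks</code> are hypothetical and chosen for this example.

```python
import random


def completely_randomized(treatments, replicates, rng):
    """Completely randomized design: every plot is assigned at random."""
    plots = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(plots)  # one shuffle over all plots of the experiment
    return plots


def randomized_blocks(treatments, n_blocks, rng):
    """Randomized block design: each block holds every treatment once,
    and the order is randomized independently within each block."""
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)
        layout.append(block)
    return layout


rng = random.Random(42)            # fixed seed (assumption) for reproducibility
treatments = ["A", "B", "C", "D"]  # hypothetical treatment labels

crd = completely_randomized(treatments, replicates=3, rng=rng)
rcbd = randomized_blocks(treatments, n_blocks=3, rng=rng)

print("CRD plots:", crd)
for i, block in enumerate(rcbd, start=1):
    print(f"Block {i}:", block)
```

In the blocked layout, randomization is restricted so that each block is a complete replicate, which is how local control is combined with randomization and replication.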

In [[clinical studies]], the [[sample (statistics)|sample]]s are usually smaller than in other biological studies, and in most cases the [[environment (systems)|environment]] effect can be controlled or measured. It is common to use [[Randomized controlled trial|randomized controlled clinical trials]], whose results are usually compared with those of [[observational study]] designs such as [[case–control]] or [[cohort (statistics)|cohort]] studies.<ref>{{cite journal|last1=Szczech|first1=Lynda Anne|last2=Coladonato|first2=Joseph A.|last3=Owen|first3=William F.|title=Key Concepts in Biostatistics: Using Statistics to Answer the Question "Is There a Difference?"|journal=Seminars in Dialysis|date=4 October 2002|volume=15|issue=5|pages=347–351|doi=10.1046/j.1525-139X.2002.00085.x|pmid=12358639|s2cid=30875225}}</ref>

=== Data collection ===
Data collection methods must be considered in research planning, because they strongly influence the sample size and the experimental design.

Data collection varies according to the type of data. For [[qualitative data]], collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using score criteria to categorize levels of occurrence.<ref>{{cite journal|last1=Sandelowski|first1 = Margarete|title=Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies|journal=Research in Nursing & Health |date=2000|volume=23|issue=3|pages=246–255|doi=10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H|pmid=10871540|citeseerx=10.1.1.472.7825|s2cid=10733556 }}</ref> For [[quantitative data]], collection is done by measuring numerical information using instruments.

In agriculture and biology studies, yield data and its components can be obtained by [[metric measure]]s. However, pest and disease injuries in plants are assessed by observation, using score scales for levels of damage.
In genetic studies especially, modern methods for data collection in the field and the laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments and make it possible to evaluate many plots in less time than a purely human-based method of data collection. Finally, all collected data of interest must be stored in an organized data frame for further analysis.

== Analysis and data interpretation ==
=== Descriptive tools ===
{{Main| Descriptive statistics}}
Data can be represented through [[Table (information)|tables]] or [[chart|graphical]] representations, such as line charts, bar charts, histograms, and scatter plots. Also, [[Central tendency|measures of central]] tendency and [[Statistical dispersion|variability]] can be very useful to give an overview of the data. Some examples follow.

==== Frequency tables ====
One type of table is the [[frequency]] table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:<ref>{{Cite web|url=https://www.sangakoo.com/en/unit/absolute-relative-cumulative-frequency-and-statistical-tables|title=Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics|last=Maths|first=Sangaku|website=www.sangakoo.com|language=en|access-date=2018-04-10}}</ref>

'''Absolute''': the number of times that a given value appears (''f<sub>i</sub>''); the absolute frequencies sum to the total number of observations ''N'':

<math display="block">N = f_1 + f_2 + f_3 + \cdots + f_n</math>

'''Relative''': obtained by dividing the absolute frequency by the total number of observations:

<math display="block">n_i = \frac{f_i}{N}</math>

The next example shows the number of genes in ten [[operon]]s of the same organism.
: {{math|1=Genes = {{mset|2,3,3,4,5,3,3,3,3,4}}}}

{| class="wikitable"
|+
!Number of genes
!Absolute frequency
!Relative frequency
|-
|1
|0
|0
|-
|2
|1
|0.1
|-
|3
|6
|0.6
|-
|4
|2
|0.2
|-
|5
|1
|0.1
|}

==== Line graph ====
[[File:Examples of descriptive tools.png|thumb| Figure A: '''Line graph example'''. The birth rate in Brazil (2010–2016);<ref name=":1">{{Cite web|url=http://tabnet.datasus.gov.br/cgi/deftohtm.exe?sinasc/cnv/nvuf.def|title=DATASUS: TabNet Win32 3.0: Nascidos vivos – Brasil|website=DATASUS: Tecnologia da Informação a Serviço do SUS}}</ref> Figure B: '''Bar chart example.''' The birth rate in [[Brazil]] for the December months from 2010 to 2016; Figure C: '''Example of Box Plot''': number of glycines in the proteome of eight different organisms (A-H); Figure D: '''Example of a scatter plot.''']]
[[Line graph]]s represent the variation of a value over another metric, such as time. In general, values are represented on the vertical axis, while the time variation is represented on the horizontal axis.<ref name=":0">{{Cite book|title=Introduction to Biostatistics. A Guide to Design, Analysis, and Discovery|last1=Forthofer|first1=Ronald N.|last2=Lee|first2=Eun Sul|publisher=Academic Press|year=1995|isbn=978-0-12-262270-0}}</ref>

==== Bar chart ====
A [[bar chart]] is a graph that shows categorical data as bars whose heights (vertical bars) or widths (horizontal bars) are proportional to the values they represent. Bar charts provide an image that could also be represented in a tabular format.<ref name=":0" />

In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016.<ref name=":1" /> The sharp fall in December 2016 reflects the effect of the [[Zika virus]] outbreak on the birth rate in Brazil.

==== Histograms ====
[[File:Example histogram.png|thumb|'''Example of a histogram.'''|350x350px]]
The [[histogram]] (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes.
It was first introduced by [[Karl Pearson]].<ref>{{Cite journal|last=Pearson|first=Karl|date=1895-01-01|title=X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material|journal=Phil. Trans. R. Soc. Lond. A|language=en|volume=186|pages=343–414|doi=10.1098/rsta.1895.0010|issn=0264-3820|bibcode=1895RSPTA.186..343P|doi-access=free}}</ref>

==== Scatter plot ====
A [[scatter plot]] is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, with the value of one variable determining the position on the horizontal axis and the value of another variable determining the position on the vertical axis.<ref>{{Cite book|title=Seeing through statistics|last=Utts|first=Jessica M.|date=2005|publisher=Thomson, Brooks/Cole|isbn=978-0534394028|edition= 3rd|location=Belmont, CA|oclc=56568530}}</ref> They are also called '''scatter graph''', '''scatter chart''', '''scattergram''', or '''scatter diagram'''.<ref>{{Cite book|title=Basic statistics|last=Jarrell|first=Stephen B.|date=1994|publisher=Wm. C. Brown Pub|isbn=978-0697215956|location=Dubuque, Iowa|oclc=30301196}}</ref>

==== Mean ====
{{Main| Mean}}
The [[arithmetic mean]] is the sum of a collection of values (<math>{x_1+x_2+x_3+\cdots +x_n}</math>) divided by the number of items in this collection (<math>{n}</math>).

: <math>\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}</math>

==== Median ====
{{Main| Median}}
The [[median]] is the value in the middle of an ordered dataset.
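
As a small illustration of the two definitions above, the following Python sketch computes the mean and the median of the values 2, 3, 3, 3, 3, 3, 4, 4, 11 with the standard-library <code>statistics</code> module. The variable names are chosen only for this example.

```python
from statistics import mean, median

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

# Arithmetic mean: sum of the values divided by how many there are (36 / 9).
mean_all = mean(values)

# Median: the middle value once the data are sorted (the 5th of 9 values).
median_all = median(values)

# Dropping the largest value (11) shows how a single extreme value
# shifts the mean while leaving the median unchanged.
mean_trimmed = mean(values[:-1])
median_trimmed = median(values[:-1])

print(mean_all, median_all)          # 4 3
print(mean_trimmed, median_trimmed)  # 3.125 3.0
```

The comparison suggests why the median is often preferred as a summary when a dataset contains extreme values.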

==== Mode ====
{{Main| Mode (statistics)}}
The [[mode (statistics)|mode]] is the value of a set of data that appears most often.<ref>{{Cite book|title=Econometrics|last=Gujarati|first=Damodar N.|publisher=McGraw-Hill Irwin|year=2006}}</ref>

{| class="wikitable"
|+ Comparison among mean, median and mode<br /> Values = { 2,3,3,3,3,3,4,4,11 }
!Type
!Example
!Result
|-
| align="center" |[[Arithmetic mean|Mean]]
| align="center" | ( 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11 ) / 9
| align="center" |'''4'''
|-
| align="center" |[[Median]]
| align="center" |2, 3, 3, 3, '''3''', 3, 4, 4, 11
| align="center" |'''3'''
|-
| align="center" |Mode
| align="center" |2, '''3, 3, 3, 3, 3''', 4, 4, 11
| align="center" |'''3'''
|}

==== Box plot ====
A [[box plot]] is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines (whiskers), and the box spans the interquartile range (IQR), from the 25th to the 75th percentile of the data. [[Outlier]]s may be plotted as circles.

==== Correlation coefficients ====
Although correlations between two different kinds of data can be suggested by graphs such as a scatter plot, it is necessary to validate them through numerical information. For this reason, [[correlation coefficient]]s are required. They provide a numerical value that reflects the strength of an association.<ref name=":0" />

==== Pearson correlation coefficient ====
[[File:Correlation coefficient.png|right|thumb|Scatter diagram that demonstrates the Pearson correlation for different values of ''ρ.'']]
The [[Pearson correlation coefficient]] is a measure of linear association between two variables, X and Y.
This coefficient, usually represented by ''ρ'' (rho) for the population and ''r'' for the sample, assumes values between −1 and 1, where ''ρ'' = 1 represents a perfect positive correlation, ''ρ'' = −1 represents a perfect negative correlation, and ''ρ'' = 0 means no linear correlation.<ref name=":0" />

=== Inferential statistics ===
{{Main| Statistical inference}}
Inferential statistics is used to make [[inference]]s<ref>{{Cite journal|title=Essentials of Biostatistics in Public Health & Essentials of Biostatistics Workbook: Statistical Computing Using Excel|journal=Australian and New Zealand Journal of Public Health|volume=33|issue=2|pages=196–197|doi=10.1111/j.1753-6405.2009.00372.x|issn=1326-0200|year=2009|doi-access=free |last1=Watson |first1=Lyndsey }}</ref> about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters that describe the population of interest, but since the data are limited, a representative sample must be used to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The [[Standard error|standard error of the mean]] is a measure of variability that is crucial for making inferences.<ref name=":2" />

* [[Statistical hypothesis testing|Hypothesis testing]]
Hypothesis testing is essential for making inferences about populations in order to answer research questions, as set out in the "Research planning" section. The cited authors define four steps:<ref name=":2"/>
# ''The hypothesis to be tested'': as stated earlier, we have to define a [[null hypothesis]] (H<sub>0</sub>), which is going to be tested, and an [[alternative hypothesis]]. Both must be defined before the experiment is implemented.
# ''Significance level and decision rule'': a decision rule depends on the [[significance level|level of significance]], or in other words, the acceptable error rate (α). It is easiest to think of it as defining a ''critical value'' that determines statistical significance when a [[test statistic]] is compared with it. So α also has to be defined before the experiment.
# ''Experiment and statistical analysis'': this is when the experiment is actually carried out following the appropriate [[Design of experiments|experimental design]], data are collected and the most suitable statistical tests are applied.
# ''Inference'': this is made when the [[null hypothesis]] is rejected or not rejected, based on the evidence that the comparison of the [[p-value]] with α brings. Failure to reject H<sub>0</sub> only means that there is not enough evidence to support its rejection, not that this hypothesis is true.

* [[Confidence intervals]]
A confidence interval is a range of values that is expected to contain the true parameter value with a given level of confidence. The first step is to obtain the best unbiased estimate of the population parameter. The upper limit of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value corresponding to the chosen confidence level. The lower limit is calculated similarly, but the product is subtracted instead of added.<ref name=":2" />

== Statistical considerations ==
=== Power and statistical error ===
When testing a hypothesis, there are two possible types of statistical error: [[Type I error]] and [[Type II error]].
* The type I error or [[False positives and false negatives|false positive]] is the incorrect rejection of a true null hypothesis.
* The type II error or [[False positives and false negatives|false negative]] is the failure to reject a false [[null hypothesis]].
The [[significance level]], denoted by α, is the type I error rate and should be chosen before performing the test. The type II error rate is denoted by β, and the [[Statistical power|statistical power of the test]] is 1 − β.
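
To make the relation between α, β and power concrete, the following Python sketch approximates the power of a two-sided one-sample ''z''-test with known standard deviation. The choice of test, the function name <code>z_test_power</code> and the numbers used are assumptions made for this illustration only.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Approximate power (1 - beta) of a two-sided one-sample z-test.

    effect: true difference between the population mean and the H0 value
    sigma:  known population standard deviation
    n:      sample size
    alpha:  significance level (type I error rate)
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value for alpha
    shift = effect * n ** 0.5 / sigma           # effect in standard-error units
    # Probability that the test statistic falls in either rejection region.
    upper = 1 - std_normal.cdf(z_crit - shift)
    lower = std_normal.cdf(-z_crit - shift)
    return upper + lower


# With no true effect, the rejection probability equals alpha (type I error).
print(round(z_test_power(effect=0.0, sigma=1.0, n=10), 3))

# With a true effect, power grows with the sample size.
for n in (8, 16, 32, 64):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

Under these assumptions, β is simply one minus the returned value, so the same function can be used to explore how sample size trades off against the type II error rate.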

=== p-value ===
The [[p-value]] is the probability of obtaining results as extreme as or more extreme than those observed, assuming the [[null hypothesis]] (H<sub>0</sub>) is true. It is also called the calculated probability. The p-value is commonly confused with the [[Statistical significance|significance level (α)]], but α is a predefined threshold for calling results significant. If p is less than α, the null hypothesis (H<sub>0</sub>) is rejected.<ref>{{cite journal|doi=10.1038/nature.2016.19503|pmid=26961635|title=Statisticians issue warning over misuse of P values|journal=Nature|volume=531|issue=7593|pages=151|year=2016|last1=Baker|first1=Monya|bibcode=2016Natur.531..151B|doi-access=free}}</ref>

=== Multiple testing ===
When many hypothesis tests are performed, the probability of obtaining at least one [[False positives and false negatives|false positive]] (the [[Family-wise error rate|familywise error rate]]) increases, and a strategy is needed to account for this. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The [[Bonferroni correction]] defines an acceptable global significance level, denoted by α*, and each of the m tests is individually compared with a value of α = α*/m. This ensures that the familywise error rate over all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative to the Bonferroni correction is to control the [[False discovery rate|false discovery rate (FDR)]]. The FDR is the expected proportion of the rejected [[Null hypothesis|null hypotheses]] (the so-called discoveries) that are false (incorrect rejections). The controlling procedure ensures that, for independent tests, the false discovery rate is at most q*. FDR control is thus less conservative than the Bonferroni correction and has more power, at the cost of more false positives.<ref>Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289–300 (1995).</ref>

=== Mis-specification and robustness checks ===
The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification.<ref>{{Cite web|url=https://www.statlect.com/glossary/null-hypothesis|title=Null hypothesis|website=www.statlect.com|access-date=2018-05-08}}</ref> Verifying whether the outcome of a statistical test changes when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

=== Model selection criteria ===
[[Model selection|Model selection criteria]] are used to select the model that best approximates the true model. [[Model selection|Akaike's Information Criterion (AIC)]] and the [[Model selection|Bayesian Information Criterion (BIC)]] are examples of asymptotically efficient criteria.

== Developments and big data ==
{{More citations needed section|date=December 2016}}
Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. These changes stem from developments in areas such as [[DNA sequencing|sequencing]] technologies, [[Bioinformatics]] and [[Machine learning]] ([[Machine learning in bioinformatics]]).
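
High-throughput data collection makes the corrections described in the "Multiple testing" section especially relevant, because thousands of tests may be run at once. The following Python sketch contrasts the Bonferroni correction with the Benjamini–Hochberg step-up procedure for FDR control. The p-values are invented for illustration, and the function names are hypothetical.

```python
def bonferroni_reject(p_values, alpha_star=0.05):
    """Reject H0 for each test whose p-value is at most alpha_star / m."""
    m = len(p_values)
    return [p <= alpha_star / m for p in p_values]


def benjamini_hochberg_reject(p_values, q_star=0.05):
    """Benjamini-Hochberg step-up procedure.

    Sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * q_star, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q_star:
            k_max = rank
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values from m = 10 tests (not real data).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
            0.060, 0.074, 0.205, 0.212, 0.216]

bonf = bonferroni_reject(p_values)
bh = benjamini_hochberg_reject(p_values)

print("Bonferroni rejections:", sum(bonf))
print("Benjamini-Hochberg rejections:", sum(bh))
```

On these invented values the FDR-controlling procedure rejects more hypotheses than the Bonferroni correction, which matches the trade-off between power and false positives described earlier.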

=== Use in high-throughput data ===
New biomedical technologies like [[DNA microarray|microarrays]], [[DNA sequencing|next-generation sequencers]] (for genomics) and [[mass spectrometers|mass spectrometry]] (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.<ref>{{cite journal|last1=Hayden|first1=Erika Check|title=Biostatistics: Revealing analysis|journal=Nature|date=8 February 2012|volume=482|issue=7384|pages=263–265|doi=10.1038/nj7384-263a|pmid=22329008|doi-access=free}}</ref> Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.<ref>{{cite journal|last1=Efron|first1=Bradley|title=Microarrays, Empirical Bayes and the Two-Groups Model|journal=Statistical Science|date=February 2008|volume=23|issue=1|pages=1–22|doi=10.1214/07-STS236|arxiv=0808.0572|s2cid=8417479}}</ref>

[[Multicollinearity]] often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as [[gene expression]] levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis).

Classical statistical techniques like linear or [[logistic regression]] and [[linear discriminant analysis]] do not work well for high dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R<sup>2</sup>-values despite very low predictive power of the statistical model.
These classical statistical techniques (esp. [[least squares]] linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R<sup>2</sup> of the validation test set, not those of the training set.

Often, it is useful to pool information from multiple predictors together. For example, [[Gene Set Enrichment Analysis]] (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.<ref>{{cite journal|last1=Subramanian|first1=A.|last2=Tamayo|first2=P.|last3=Mootha|first3=V. K.|last4=Mukherjee|first4=S.|last5=Ebert|first5=B. L.|last6=Gillette|first6=M. A.|last7=Paulovich|first7=A.|last8=Pomeroy|first8=S. L.|last9=Golub|first9=T. R.|last10=Lander|first10=E. S.|last11=Mesirov|first11=J. P.|title=Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles|journal=Proceedings of the National Academy of Sciences|date=30 September 2005|volume=102|issue=43|pages=15545–15550|doi=10.1073/pnas.0506580102|pmid=16199517|pmc=1239896|bibcode=2005PNAS..10215545S|doi-access=free}}</ref> These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: it is more likely that a single gene is found to be falsely perturbed than that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the [[JAK-STAT signaling pathway]]) using this approach.

=== Bioinformatics advances in databases, data mining, and biological interpretation ===
The development of [[biological database]]s enables storage and management of biological data with the possibility of ensuring access for users around the world.
These databases let researchers deposit data, retrieve information and files (raw or processed) originating from other experiments, and index scientific articles, as [[PubMed]] does. Users can also search for a term of interest (a gene, a protein, a disease, an organism, and so on) and review all results related to it. There are databases dedicated to [[Single-nucleotide polymorphism|SNPs]] ([[dbSNP]]), to knowledge about gene characterization and pathways ([[KEGG]]), and to the description of gene function, classified by cellular component, molecular function and biological process ([[Gene ontology|Gene Ontology]]).<ref name=":4">{{cite journal|doi=10.1002/jcp.21218|pmid=17654500|title=Bioinformatics|journal=Journal of Cellular Physiology|volume=213|issue=2|pages=365–9|year=2007|last1=Moore|first1=Jason H|s2cid=221831488|doi-access=free}}</ref> In addition to databases that contain specific molecular information, there are broader ones that store information about an organism or group of organisms. An example of a database devoted to a single organism, but containing much data about it, is TAIR, the ''[[Arabidopsis thaliana]]'' genetic and molecular database.<ref>{{cite web|url=https://www.arabidopsis.org/|title=TAIR - Home Page|website=www.arabidopsis.org}}</ref> Phytozome,<ref>{{cite web|url=https://phytozome.jgi.doe.gov/pz/portal.html|title=Phytozome|website=phytozome.jgi.doe.gov}}</ref> in turn, stores the assemblies and annotation files of dozens of plant genomes and also provides visualization and analysis tools.
Moreover, some databases are interconnected in order to exchange and share information. A major initiative is the [[International Nucleotide Sequence Database Collaboration]] (INSDC),<ref>{{cite web|url=http://www.insdc.org/|title=International Nucleotide Sequence Database Collaboration - INSDC|website=www.insdc.org}}</ref> which links data from DDBJ,<ref>{{cite web|url=https://www.ddbj.nig.ac.jp/index-e.html|title=Top|website=www.ddbj.nig.ac.jp|date=11 January 2024 }}</ref> EMBL-EBI,<ref>{{cite web|url=https://www.ebi.ac.uk/|title=The European Bioinformatics Institute < EMBL-EBI|website=www.ebi.ac.uk}}</ref> and NCBI.<ref>{{cite web|url=https://www.ncbi.nlm.nih.gov/|title=National Center for Biotechnology Information|publisher=U. S. National Library of Medicine – |website=www.ncbi.nlm.nih.gov}}</ref>

The growing size and complexity of molecular datasets has led to the use of powerful statistical methods based on algorithms from [[machine learning]]. Data mining and machine learning allow the detection of patterns in data with a complex structure, such as biological data, using methods of [[Supervised learning|supervised]] and [[unsupervised learning]], regression, detection of [[Cluster analysis|clusters]] and [[Association rule learning|association rule mining]], among others.<ref name=":4"/> For example, [[self-organizing map]]s and [[k-means clustering|''k''-means]] are clustering algorithms, while [[Artificial neural network|neural networks]] and [[support vector machine]]s are common machine learning models.
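As an illustration of the clustering methods mentioned above, the following is a minimal sketch of ''k''-means (Lloyd's algorithm) in Python with NumPy. It is an assumption-laden teaching example rather than a production implementation: the function name <code>k_means</code>, its parameters, the random initialization and the two simulated point clouds are all hypothetical choices made for this sketch.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centres."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centre, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centre to the mean of its points; keep it if the cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated simulated groups of 30 observations with 2 features each.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(30, 2))
X = np.vstack([group_a, group_b])

labels, centers = k_means(X, k=2)
print(labels)
```

Being unsupervised, the algorithm receives no group labels; it recovers the two groups only from the distances between observations. Which group is called cluster 0 or 1 depends on the initialization.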
Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important for performing an experiment correctly, from planning through data generation and analysis to the biological interpretation of the results.<ref name=":4"/>

=== Use of computationally intensive methods ===
The advent of modern computer technology and relatively cheap computing resources has enabled computer-intensive biostatistical methods such as [[Bootstrapping (statistics)|bootstrapping]] and other [[Re-sampling (statistics)|re-sampling]] methods. More recently, [[random forests]] have gained popularity as a method for performing [[statistical classification]]. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that they can be drawn and interpreted, even with a basic understanding of mathematics and statistics. Random forests have thus been used for clinical decision support systems.{{citation needed|date=December 2016}}

== Applications ==
{{Prose|section|date=March 2016}}

=== Public health ===
Biostatistics is applied across [[public health]], including [[epidemiology]], [[health services research]], [[nutrition]], [[environmental health]] and health care policy and management. In these [[medicine|medical]] contexts, it is important to consider the design and analysis of [[clinical trial]]s. One example is the assessment of the severity of a patient's condition together with a prognosis of the disease outcome. With new technologies and genetic knowledge, biostatistics is now also used in [[Systems medicine|systems medicine]], which aims at a more personalized medicine.
For this, data from different sources are integrated, including conventional patient data, clinico-pathological parameters, molecular and genetic data, as well as data generated by additional new omics technologies.<ref>{{cite journal|doi=10.1038/emm.2017.290|pmid=29497170|pmc=5898894|title=Whither systems medicine?|journal=Experimental & Molecular Medicine|volume=50|issue=3|pages=e453|year=2018|last1=Apweiler|first1=Rolf|display-authors=et al}}</ref>

=== Quantitative genetics ===
Quantitative genetics draws on [[population genetics]] and [[statistical genetics]] to link variation in [[genotype]] with variation in [[phenotype]]. In other words, the aim is to discover the genetic basis of a measurable, quantitative trait that is under polygenic control. A genome region that is responsible for a continuous trait is called a [[quantitative trait locus]] (QTL). The study of QTLs became feasible through the use of [[molecular marker]]s and the measurement of traits in populations, but mapping them requires a population obtained from an experimental cross, such as an F2 or [[recombinant inbred strain]]s/lines (RILs). To scan for QTL regions in a genome, a [[gene map]] based on linkage has to be built. Some of the best-known QTL mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.<ref>{{cite journal|doi=10.1007/s10709-004-2705-0|pmid=15881678|title=QTL mapping and the genetic basis of adaptation: Recent developments|journal=Genetica|volume=123|issue=1–2|pages=25–37|year=2005|last1=Zeng|first1=Zhao-Bang|s2cid=1094152}}</ref> However, QTL mapping resolution is limited by the amount of recombination assayed, which is a problem for species in which it is difficult to obtain large numbers of offspring.
Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when a panel of individuals represents a natural population.<ref>{{cite journal|doi=10.1186/1746-4811-9-29|pmid=23876160|pmc=3750305|title=The advantages and limitations of trait analysis with GWAS: A review|journal=Plant Methods|volume=9|pages=29|year=2013|last1=Korte|first1=Arthur|last2=Farlow|first2=Ashley |issue=1 |doi-access=free |bibcode=2013PlMet...9...29K }}</ref> For this reason, the [[genome-wide association study]] (GWAS) was proposed in order to identify QTLs based on [[linkage disequilibrium]], that is, the non-random association between traits and molecular markers. It was made possible by the development of high-throughput [[SNP genotyping]].<ref>{{cite journal|doi=10.3835/plantgenome2008.02.0089|title=Status and Prospects of Association Mapping in Plants|journal= The Plant Genome|volume=1|pages=5–20|year=2008|last1=Zhu|first1=Chengsong|last2=Gore|first2=Michael|last3=Buckler|first3=Edward S|last4=Yu|first4=Jianming|doi-access=free}}</ref>

In [[Animal breeding|animal]] and [[plant breeding]], the use of markers, mainly molecular ones, in [[Selective breeding|selection]] contributed to the development of [[marker-assisted selection]]. While QTL mapping is limited by its resolution, GWAS lacks power for rare variants of small effect that are also influenced by the environment. The concept of Genomic Selection (GS) therefore arose in order to use all molecular markers in selection and to allow the performance of selection candidates to be predicted.
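The idea of using all markers jointly for prediction can be sketched with ridge regression, which is one of several model families used for genomic prediction. The example below is a hedged illustration in Python with NumPy on simulated data, not the method of the cited review: the function name <code>ridge_marker_effects</code>, the 0/1/2 genotype coding, the population sizes and the penalty value are all assumptions made for this sketch.

```python
import numpy as np

def ridge_marker_effects(markers, phenotypes, lam):
    """Ridge estimate of marker effects: solve (M'M + lam*I) b = M'y."""
    n_markers = markers.shape[1]
    lhs = markers.T @ markers + lam * np.eye(n_markers)
    return np.linalg.solve(lhs, markers.T @ phenotypes)

rng = np.random.default_rng(42)
n_train, n_test, n_markers = 400, 100, 200

# Simulated genotypes coded as 0, 1 or 2 copies of an allele (hypothetical data).
genotypes = rng.integers(0, 3, size=(n_train + n_test, n_markers)).astype(float)
true_effects = rng.normal(scale=0.1, size=n_markers)
genetic_values = genotypes @ true_effects
# Add environmental noise with the same spread as the genetic values.
noise = rng.normal(scale=genetic_values.std(), size=n_train + n_test)
phenotypes = genetic_values + noise

train, test = slice(0, n_train), slice(n_train, None)

# Centre markers and phenotypes using training-set means only.
marker_means = genotypes[train].mean(axis=0)
phenotype_mean = phenotypes[train].mean()
effects = ridge_marker_effects(
    genotypes[train] - marker_means,
    phenotypes[train] - phenotype_mean,
    lam=100.0,  # penalty chosen arbitrarily for this sketch
)

# Predictions for individuals that are genotyped but treated as not phenotyped.
predicted = phenotype_mean + (genotypes[test] - marker_means) @ effects
accuracy = np.corrcoef(predicted, genetic_values[test])[0, 1]
print(f"correlation between predicted and simulated genetic values: {accuracy:.2f}")
```

The penalty shrinks all marker effects towards zero, which lets every marker contribute to the prediction without first selecting individual significant loci. In practice the penalty would be estimated from the data, for example by cross-validation, rather than fixed in advance.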
The proposal is to genotype and phenotype a training population and to develop a model that yields the genomic estimated breeding values (GEBVs) of individuals belonging to a population that has been genotyped but not phenotyped, called the testing population.<ref>{{cite journal|doi=10.1016/j.tplants.2017.08.011|pmid=28965742|title=Genomic Selection in Plant Breeding: Methods, Models, and Perspectives|journal=Trends in Plant Science|volume=22|issue=11|pages=961–975|year=2017|last1=Crossa|first1=José|last2=Pérez-Rodríguez|first2=Paulino|last3=Cuevas|first3=Jaime|last4=Montesinos-López|first4=Osval|last5=Jarquín|first5=Diego|last6=De Los Campos|first6=Gustavo|last7=Burgueño|first7=Juan|last8=González-Camacho|first8=Juan M|last9=Pérez-Elizalde|first9=Sergio|last10=Beyene|first10=Yoseph|last11=Dreisigacker|first11=Susanne|last12=Singh|first12=Ravi|last13=Zhang|first13=Xuecai|last14=Gowda|first14=Manje|last15=Roorkiwal|first15=Manish|last16=Rutkoski|first16=Jessica|last17=Varshney|first17=Rajeev K|bibcode=2017TPS....22..961C |url=http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-url=https://ghostarchive.org/archive/20221009/http://oar.icrisat.org/10280/1/Genomic%20Selection%20in%20Plant%20Breeding%20Methods%2C%20Models%2C%20and%20Perspectives.pdf |archive-date=2022-10-09 |url-status=live}}</ref> This kind of study can also include a validation population, following the concept of [[cross-validation (statistics)|cross-validation]], in which the phenotypes actually measured in this population are compared with the predicted phenotypes in order to check the accuracy of the model.

In summary, some points about the application of quantitative genetics are:
* It has been used in agriculture to improve crops ([[Plant breeding]]) and [[livestock]] ([[Animal breeding]]).
* In biomedical research, this work can assist in finding candidate [[gene]] [[allele]]s that can cause or influence predisposition to diseases in [[human genetics]].

=== Expression data ===
Studies of differential gene expression from [[RNA-Seq]] data, as with [[Real-time polymerase chain reaction|RT-qPCR]] and [[microarrays]], require the comparison of conditions. The goal is to identify genes that show a significant change in abundance between conditions. Experiments are therefore designed appropriately, with replicates for each condition or treatment, and with randomization and blocking when necessary. In RNA-Seq, the quantification of expression uses mapped reads that are summarized over some genetic unit, such as the [[exon]]s that are part of a gene sequence. Whereas [[microarray]] results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the [[Poisson distribution|Poisson]], but it underestimates the sampling error, leading to false positives. Currently, biological variation is accounted for by methods that estimate the dispersion parameter of a [[negative binomial distribution]].
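The difference between the two count models can be made concrete with a simulation. Under a Poisson model the variance equals the mean μ, whereas under a common parameterization of the negative binomial the variance is μ + αμ<sup>2</sup>, where α is the dispersion. The sketch below, in Python with NumPy, is illustrative only: the values of μ and α, the very large number of simulated replicates and the helper name <code>moment_dispersion</code> are assumptions, and the simple moment estimator shown is not the estimator used by dedicated RNA-Seq packages.

```python
import numpy as np

rng = np.random.default_rng(7)
mu, alpha, n = 100.0, 0.2, 20000  # mean count, dispersion, simulated replicates

# Technical (sampling) variation only: Poisson counts, variance = mu.
poisson_counts = rng.poisson(lam=mu, size=n)

# Biological variation: the underlying rate differs between replicates.
# A gamma-distributed rate with Poisson sampling gives negative binomial
# counts with mean mu and variance mu + alpha * mu**2.
rates = rng.gamma(shape=1.0 / alpha, scale=mu * alpha, size=n)
nb_counts = rng.poisson(lam=rates)

def moment_dispersion(counts):
    """Method-of-moments dispersion: (variance - mean) / mean**2."""
    mean = counts.mean()
    variance = counts.var(ddof=1)
    return (variance - mean) / mean ** 2

alpha_poisson = moment_dispersion(poisson_counts)
alpha_nb = moment_dispersion(nb_counts)
print(f"Poisson counts: variance {poisson_counts.var(ddof=1):.0f}, dispersion {alpha_poisson:.3f}")
print(f"NB counts:      variance {nb_counts.var(ddof=1):.0f}, dispersion {alpha_nb:.3f}")
```

A test that assumes Poisson variance for the second set of counts would treat most of the between-replicate variability as signal, which is how the false positives described above arise.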
[[Generalized linear model]]s are used to perform the tests for statistical significance and, as the number of genes is high, multiple testing correction has to be considered.<ref>{{cite journal| doi =10.1186/gb-2010-11-12-220| pmid =21176179| pmc =3046478| title =From RNA-seq reads to differential expression results| journal =Genome Biology| volume =11| issue =12| pages =220| year =2010| last1 =Oshlack| first1 =Alicia| last2 =Robinson| first2 =Mark D| last3 =Young| first3 =Matthew D| doi-access =free}}</ref> Other examples of analyses of [[genomics]] data come from microarray or [[proteomics]] experiments,<ref>{{cite book|title=Statistical Analysis of Gene Expression Microarray Data|author1=Helen Causton |author2=John Quackenbush |author3=Alvis Brazma |publisher=Wiley-Blackwell|year=2003}}</ref><ref>{{cite book|title=Microarray Gene Expression Data Analysis: A Beginner's Guide|author=Terry Speed|publisher=Chapman & Hall/CRC|year=2003}}</ref> often concerning diseases or disease stages.<ref>{{cite book|title=Medical Biostatistics for Complex Diseases|author1=Frank Emmert-Streib |author2=Matthias Dehmer |publisher=Wiley-Blackwell|year=2010|isbn= 978-3-527-32585-6}}</ref>

=== Other studies ===
* [[Ecology]], [[ecological forecasting]]
* Biological [[sequence analysis]]<ref>{{cite book|title=Statistical Methods in Bioinformatics: An Introduction|author1=Warren J. Ewens |author2=Gregory R. Grant |publisher=Springer|year=2004}}</ref>
* [[Systems biology]] for gene network inference or pathway analysis<ref>{{cite book|title=Applied Statistics for Network Biology: Methods in Systems Biology|author1=Matthias Dehmer |author2=Frank Emmert-Streib |author3=Armin Graber |author4=Armindo Salvador |publisher=Wiley-Blackwell|year=2011|isbn= 978-3-527-32750-8}}</ref>
* [[Clinical research]] and pharmaceutical development
* [[Population dynamics]], especially in regard to [[fisheries science]]
* [[Phylogenetics]] and [[evolution]]
* [[Pharmacodynamics]]
* [[Pharmacokinetics]]
* [[Neuroimaging]]

== Tools ==
Many tools can be used for the statistical analysis of biological data. Most of them are also useful in other areas of knowledge and cover a wide range of applications. Brief descriptions of some of them follow:
* [[ASReml]]: Software developed by VSNi<ref name="vsni">{{cite web|url=https://www.vsni.co.uk/|title=Home - VSN International|website=www.vsni.co.uk}}</ref> that can also be used in the R environment as a package. It is designed to estimate variance components under a general linear mixed model using [[restricted maximum likelihood]] (REML). Models with fixed and random effects, nested or crossed, are allowed. It also makes it possible to investigate different [[Covariance matrix|variance-covariance]] matrix structures.
* CycDesigN:<ref>{{cite web|url=https://www.vsni.co.uk/software/cycdesign/|title=CycDesigN - VSN International|website=www.vsni.co.uk}}</ref> A computer package developed by VSNi<ref name="vsni" /> that helps researchers create experimental designs and analyze data from a design belonging to one of the classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and [[Crossover study|crossover designs]]. It also includes less commonly used Latinized designs, such as the t-Latinized design.<ref>{{cite journal|last1=Piepho|first1=Hans-Peter|last2=Williams|first2=Emlyn R|last3=Michel|first3=Volker|year=2015|title=Beyond Latin Squares: A Brief Tour of Row-Column Designs|journal=Agronomy Journal|volume=107|issue=6|pages=2263|doi=10.2134/agronj15.0144|bibcode=2015AgrJ..107.2263P }}</ref>
* [[Orange (software)|Orange]]: A programming interface for high-level data processing, data mining and data visualization.
It includes tools for gene expression and genomics.<ref name=":4" />
* [[R (programming language)|R]]: An [[open source]] environment and programming language dedicated to statistical computing and graphics. It is an implementation of the [[S (programming language)|S]] language and is distributed through CRAN.<ref>{{cite web|url=https://cran.r-project.org/|title=The Comprehensive R Archive Network|website=cran.r-project.org}}</ref> In addition to its functions for reading data tables, computing descriptive statistics, and developing and evaluating models, its repository contains packages developed by researchers around the world. This allows functions to be written for the statistical analysis of data from specific applications.<ref>{{cite book|title=Biostatistics explored through R software: An overview|author=Renganathan V|year=2021|publisher=Vinaitheerthan Renganathan |isbn=9789354936586}}</ref> In bioinformatics, for example, there are packages in the main repository (CRAN) and in others, such as [[Bioconductor]]. It is also possible to use packages under development that are shared on hosting services such as [[GitHub]].
* [[SAS (software)|SAS]]: Data analysis software widely used in universities, services and industry. Developed by a company of the same name ([[SAS Institute]]), it uses the [[SAS language]] for programming.
* PLA 3.0:<ref>{{Cite web|url=https://www.bioassay.de/products/pla-30/|title=PLA 3.0|last=Stegmann|first=Dr Ralf|date=2019-07-01|website=PLA 3.0 – Software for Biostatistical Analysis|language=en|access-date=2019-07-02}}</ref> Biostatistical analysis software for regulated environments (e.g. drug testing) that supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
* [[Weka (machine learning)|Weka]]: [[Java (programming language)|Java]] software for [[machine learning]] and [[data mining]], including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for comparing algorithms. Weka can also be run from other programming languages such as Perl or R.<ref name=":4" />
* [[Python (programming language)|Python]]: image analysis, deep learning, machine learning
* [[SQL]] databases
* [[NoSQL]]
* [[NumPy]]: numerical Python
* [[SciPy]]
* [[SageMath]]
* [[LAPACK]]: linear algebra
* [[MATLAB]]
* [[Apache Hadoop]]
* [[Apache Spark]]
* [[Amazon Web Services]]

== Scope and training programs ==
Almost all educational programmes in biostatistics are at [[postgraduate]] level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as [[epidemiology]]. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on [[bioinformatics]] and [[computational biology]], whereas older departments, typically affiliated with schools of [[public health]], will have more traditional lines of research involving epidemiological studies and [[clinical trial]]s as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration.
In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research, which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry ([[quality control]]), business and [[economics]] and biological areas other than medicine.

== Specialized journals ==
{{See also|List of statistics journals#Biostatistics|l1=List of biostatistics journals}}
* Biostatistics<ref>{{cite web|url=https://academic.oup.com/biostatistics|title=Biostatistics - Oxford Academic|website=OUP Academic}}</ref>
* International Journal of Biostatistics<ref>{{Cite web|url=https://www.degruyter.com/view/j/ijb|title=The International Journal of Biostatistics}}</ref>
* Journal of Epidemiology and Biostatistics<ref>{{cite web|url=https://ncbiinsights.ncbi.nlm.nih.gov/2018/06/15/pubmed-journals-shut-down/|title=PubMed Journals will be shut down|date=15 June 2018}}</ref>
* Biostatistics and Public Health<ref>https://ebph.it/ Epidemiology</ref>
* Biometrics<ref>{{cite web|url=https://onlinelibrary.wiley.com/journal/15410420|title=Biometrics|website=onlinelibrary.wiley.com|doi=10.1111/(ISSN)1541-0420}}</ref>
* Biometrika<ref>{{cite web|url=https://academic.oup.com/biomet|title=Biometrika - Oxford Academic|website=OUP Academic}}</ref>
* Biometrical Journal<ref>{{cite web|url=https://onlinelibrary.wiley.com/journal/15214036|title=Biometrical Journal|website=onlinelibrary.wiley.com|doi=10.1002/(ISSN)1521-4036}}</ref>
* Communications in Biometry and Crop Science<ref>{{cite web|url=http://agrobiol.sggw.waw.pl/cbcs/|title=Communications in Biometry and Crop Science|website=agrobiol.sggw.waw.pl}}</ref>
* Statistical Applications in Genetics and Molecular Biology<ref>{{cite web|url=https://www.degruyter.com/view/j/sagmb|title=Statistical Applications in Genetics and Molecular Biology|date=1 May 2002|website=www.degruyter.com}}</ref>
* Statistical Methods in Medical Research<ref>{{cite web|url=https://journals.sagepub.com/home/smm|title=Statistical Methods in Medical Research|website=SAGE Journals|date=12 December 2024 }}</ref>
* Pharmaceutical Statistics<ref>{{cite web|url=https://onlinelibrary.wiley.com/journal/15391612|title=Pharmaceutical Statistics|website=onlinelibrary.wiley.com|doi=10.1002/(ISSN)1539-1612 }}</ref>
* Statistics in Medicine<ref>{{cite web|url=https://onlinelibrary.wiley.com/journal/10970258|title=Statistics in Medicine|website=onlinelibrary.wiley.com|doi=10.1002/(ISSN)1097-0258}}</ref>

== See also ==
* [[Bioinformatics]]
* [[Epidemiological method]]
* [[Epidemiology]]
* [[Group size measures]]
* [[Health indicator]]
* [[Mathematical and theoretical biology]]

== References ==
{{Reflist}}

== External links ==
* {{Commons category-inline}}
* [https://www.biometricsociety.org/ The International Biometric Society]
* [https://web.archive.org/web/20080827161431/http://www.biostatsresearch.com/repository/ The Collection of Biostatistics Research Archive]
* [http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf Guide to Biostatistics (MedPageToday.com)] {{Webarchive|url=https://web.archive.org/web/20120522144801/http://www.medpagetoday.com/lib/content/Medpage-Guide-to-Biostatistics.pdf |date=2012-05-22 }}
* [https://web.archive.org/web/20150402180351/http://www.biostat.katerynakon.in.ua/en/ Biomedical Statistics]

{{Statistics|applications}}
{{Biology-footer}}
{{Public health}}
{{Authority control}}

[[Category:Biostatistics| ]]
[[Category:Bioinformatics]]