== Analysis and data interpretation ==

=== Descriptive tools ===
{{Main| Descriptive statistics}}

Data can be represented through [[Table (information)|tables]] or [[chart|graphical]] representations, such as line charts, bar charts, histograms, and scatter plots. Measures of [[central tendency]] and [[Statistical dispersion|variability]] are also useful for summarizing the data. Some examples follow:

==== Frequency tables ====

One type of table is the [[frequency]] table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:<ref>{{Cite web|url=https://www.sangakoo.com/en/unit/absolute-relative-cumulative-frequency-and-statistical-tables|title=Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics|last=Maths|first=Sangaku|website=www.sangakoo.com|language=en|access-date=2018-04-10}}</ref>

'''Absolute''': represents the number of times that a given value appears; the absolute frequencies sum to the total number of observations, ''N'':

<math display="block">N = f_1 + f_2 + f_3 + \cdots + f_n</math>

'''Relative''': obtained by dividing the absolute frequency by the total number of observations;

<math display="block">n_i = \frac{f_i}{N}</math>

In the next example, we have the number of genes in ten [[operon]]s of the same organism.

: {{math|1=Genes = {{mset|2,3,3,4,5,3,3,3,3,4}}}}
{| class="wikitable"
|+
!Number of genes
!Absolute frequency
!Relative frequency
|-
|1
|0
|0
|-
|2
|1
|0.1
|-
|3
|6
|0.6
|-
|4
|2
|0.2
|-
|5
|1
|0.1
|}
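
The table above can be reproduced with a few lines of code. The following is a minimal sketch in Python; the function name <code>frequency_table</code> is chosen here for illustration and does not come from the cited source.

```python
from collections import Counter

# Number of genes in ten operons, taken from the example above.
genes = [2, 3, 3, 4, 5, 3, 3, 3, 3, 4]


def frequency_table(values):
    """Return {value: (absolute frequency, relative frequency)}."""
    counts = Counter(values)        # absolute frequencies f_i
    total = sum(counts.values())    # N = f_1 + f_2 + ... + f_n
    return {v: (f, f / total) for v, f in sorted(counts.items())}


table = frequency_table(genes)
for value, (absolute, relative) in table.items():
    print(value, absolute, relative)
```

Values that never occur (such as 1 in the table above) are simply absent from the result, and the relative frequencies always add up to 1.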

==== Line graph ====

[[File:Examples of descriptive tools.png|thumb| Figure A: '''Line graph example'''. The birth rate in Brazil (2010–2016);<ref name=":1">{{Cite web|url=http://tabnet.datasus.gov.br/cgi/deftohtm.exe?sinasc/cnv/nvuf.def|title=DATASUS: TabNet Win32 3.0: Nascidos vivos – Brasil|website=DATASUS: Tecnologia da Informação a Serviço do SUS}}</ref> Figure B: '''Bar chart example.''' The birth rate in [[Brazil]] for the December months from 2010 to 2016; Figure C: '''Example of Box Plot''': number of glycines in the proteome of eight different organisms (A-H); Figure D: '''Example of a scatter plot.''']]

[[Line graph]]s represent the variation of a value over another metric, such as time. In general, values are represented on the vertical axis, while time is represented on the horizontal axis.<ref name=":0">{{Cite book|title=Introduction to Biostatistics. A Guide to Design, Analysis, and Discovery|last1=Forthofer|first1=Ronald N.|last2=Lee|first2=Eun Sul|publisher=Academic Press|year=1995|isbn=978-0-12-262270-0}}</ref>

==== Bar chart ====

A [[bar chart]] is a graph that shows categorical data as bars whose heights (vertical bars) or widths (horizontal bars) are proportional to the values they represent. Bar charts provide an image that could also be represented in a tabular format.<ref name=":0" />

In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016.<ref name=":1" /> The sharp fall in December 2016 reflects the effect of the [[Zika virus]] outbreak on the birth rate in Brazil.

==== Histograms ====

[[File:Example histogram.png|thumb|'''Example of a histogram.'''|350x350px]]The [[histogram]] (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by [[Karl Pearson]].<ref>{{Cite journal|last=Pearson|first=Karl|date=1895-01-01|title=X. Contributions to the mathematical theory of evolution.—II. Skew variation in homogeneous material|journal=Phil. Trans. R. Soc. Lond. A|language=en|volume=186|pages=343–414|doi=10.1098/rsta.1895.0010|issn=0264-3820|bibcode=1895RSPTA.186..343P|doi-access=free}}</ref>
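
The tabulation behind a histogram with uniform classes can be sketched as follows. This is an illustrative example only: the function <code>histogram_counts</code>, the measurements, and the choice of four classes are all assumptions made here, not part of the cited source.

```python
def histogram_counts(values, low, high, n_bins):
    """Count values into n_bins equal-width classes covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for x in values:
        if not low <= x <= high:
            continue  # ignore values outside the chosen range
        # The last class is closed on the right so that x == high is counted.
        index = min(int((x - low) / width), n_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical measurements, invented for illustration only.
lengths = [1.2, 1.9, 2.4, 2.5, 3.1, 3.3, 3.8, 4.0, 4.6, 5.0]
counts = histogram_counts(lengths, low=1.0, high=5.0, n_bins=4)

# Print a simple text histogram, one row per class.
for i, count in enumerate(counts):
    print(f"class {i + 1}: " + "#" * count)
```

Each bar of the histogram then has a height proportional to the count of its class. Non-uniform classes would need explicit class boundaries instead of a single width.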

==== Scatter plot ====

A [[scatter plot]] is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. It shows the data as a set of points, where the value of one variable determines each point's position on the horizontal axis and the value of another variable determines its position on the vertical axis.<ref>{{Cite book|title=Seeing through statistics|last=Utts|first=Jessica M.|date=2005|publisher=Thomson, Brooks/Cole|isbn=978-0534394028|edition= 3rd|location=Belmont, CA|oclc=56568530}}</ref> It is also called a '''scatter graph''', '''scatter chart''', '''scattergram''', or '''scatter diagram'''.<ref>{{Cite book|title=Basic statistics|last=Jarrell|first=Stephen B.|date=1994|publisher=Wm. C. Brown Pub|isbn=978-0697215956|location=Dubuque, Iowa|oclc=30301196}}</ref>

==== Mean ====
{{Main| Mean}}

The [[arithmetic mean]] is the sum of a collection of values (<math>{x_1+x_2+x_3+\cdots +x_n}</math>) divided by the number of items of this collection (<math>{n}</math>).

: <math>\bar{x} = \frac{1}{n}\left (\sum_{i=1}^n{x_i}\right ) = \frac{x_1+x_2+\cdots +x_n}{n}</math>

==== Median ====
{{Main| Median}}

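The [[median]] is the value in the middle of a dataset once its values are sorted; when the number of values is even, it is the mean of the two middle values.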

==== Mode ====
{{Main| Mode (statistics)}}

The [[mode (statistics)|mode]] is the value of a set of data that appears most often.<ref>{{Cite book|title=Econometrics|last=Gujarati|first=Damodar N.|publisher=McGraw-Hill Irwin|year=2006}}</ref>
{| class="wikitable" 
|+  |Comparison among mean, median and mode<br />
Values = { 2,3,3,3,3,3,4,4,11 }
!Type
!Example
!Result
|-
| align="center"  |[[Arithmetic mean|Mean]]
| align="center" | ( 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11 ) / 9
| align="center" |'''4'''
|-
| align="center" |[[Median]]
| align="center" |2, 3, 3, 3, '''3''', 3, 4, 4, 11
| align="center" |'''3'''
|-
| align="center" |Mode
| align="center" |2, '''3, 3, 3, 3, 3''', 4, 4, 11
| align="center" |'''3'''
|}
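
The three measures in the comparison table can be checked with Python's standard <code>statistics</code> module. This is a small sketch using the same values as the table.

```python
import statistics

# Same values as in the comparison table above.
values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

The single large value (11) pulls the mean above the median and the mode, which is why the three measures can disagree on skewed data.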

==== Box plot ====

A [[box plot]] is a method for graphically depicting groups of numerical data. The maximum and minimum values are represented by the lines (whiskers), and the box spans the interquartile range (IQR), from the 25th to the 75th percentile of the data. [[Outlier]]s may be plotted as circles.
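
The quantities drawn in a box plot can be computed as in the sketch below. Two assumptions are made here that the text above does not fix: quartiles are computed with the "inclusive" method of Python's <code>statistics.quantiles</code>, and points farther than 1.5 × IQR from the box are flagged as outliers, which is a common convention rather than the only one. The function name <code>box_plot_summary</code> is hypothetical.

```python
import statistics

# Same values as in the mean/median/mode comparison above.
values = [2, 3, 3, 3, 3, 3, 4, 4, 11]


def box_plot_summary(data, whisker=1.5):
    """Return the quartiles, the IQR, and points beyond the whisker fences."""
    # Other quartile conventions can give slightly different cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr   # assumed 1.5 * IQR convention
    upper_fence = q3 + whisker * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "min": min(data),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(data),
        "iqr": iqr,
        "outliers": outliers,
    }


summary = box_plot_summary(values)
print(summary)
```

With these assumptions, the value 11 lies outside the fences and would be drawn as a separate point rather than as the end of a whisker.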

==== Correlation coefficients ====

Although correlations between two different kinds of data can be inferred from graphs, such as scatter plots, it is necessary to validate this through numerical information. For this reason, [[correlation coefficient]]s are required. They provide a numerical value that reflects the strength of an association.<ref name=":0" />

==== Pearson correlation coefficient ====

[[File:Correlation coefficient.png|right|thumb|Scatter diagram that demonstrates the Pearson correlation for different values of ''ρ.'']] The [[Pearson correlation coefficient]] is a measure of association between two variables, X and Y. This coefficient, usually represented by ''ρ'' (rho) for the population and ''r'' for the sample, assumes values between −1 and 1, where ''ρ'' = 1 represents a perfect positive correlation, ''ρ'' = −1 represents a perfect negative correlation, and ''ρ'' = 0 represents no linear correlation.<ref name=":0" />
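
As an illustration, the sample coefficient ''r'' can be computed directly from its usual definition: the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. The sketch below is written from that definition; the function name <code>pearson_r</code> and the small datasets are invented for the example.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data, for illustration only.
x = [-2, -1, 0, 1, 2]
r_positive = pearson_r(x, [2 * v + 1 for v in x])   # points on a rising line
r_negative = pearson_r(x, [-3 * v for v in x])      # points on a falling line
r_none = pearson_r(x, [v * v for v in x])           # y = x^2, symmetric about 0

print(r_positive, r_negative, r_none)
```

The last case shows why ''r'' = 0 only means no ''linear'' correlation: the second variable is completely determined by the first, yet the coefficient is zero.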

=== Inferential statistics ===
{{Main| Statistical inference}}

Inferential statistics is used to make [[inference]]s<ref>{{Cite journal|title=Essentials of Biostatistics in Public Health & Essentials of Biostatistics Workbook: Statistical Computing Using Excel|journal=Australian and New Zealand Journal of Public Health|volume=33|issue=2|pages=196–197|doi=10.1111/j.1753-6405.2009.00372.x|issn=1326-0200|year=2009|doi-access=free |last1=Watson |first1=Lyndsey }}</ref> about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters to describe the population of interest, but since the data is limited, it is necessary to make use of a representative sample in order to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The [[Standard error|standard error of the mean]] is a measure of variability that is crucial for making inferences.<ref name=":2" />
* [[Statistical hypothesis testing|Hypothesis testing]]

Hypothesis testing is essential for making inferences about populations in order to answer research questions, as set out in the "Research planning" section. The authors define four steps:<ref name=":2"/>

# ''The hypothesis to be tested'': as stated earlier, we have to work with the definition of a [[null hypothesis]] (H<sub>0</sub>), which is going to be tested, and an [[alternative hypothesis]]. Both must be defined before the experiment is carried out.
# ''Significance level and decision rule'': A decision rule depends on the [[significance level|level of significance]], or in other words, the acceptable error rate (α). In practice, α defines a ''critical value'' against which the [[test statistic]] is compared to determine statistical significance. α must therefore also be fixed before the experiment.
# ''Experiment and statistical analysis'': This is when the experiment is actually carried out following the appropriate [[Design of experiments|experimental design]], data is collected, and the most suitable statistical tests are applied.
# ''Inference'': This is made when the [[null hypothesis]] is rejected or not rejected, based on the evidence provided by comparing the [[p-value]] with α. Failure to reject H<sub>0</sub> means only that there is not enough evidence to support its rejection, not that the hypothesis is true.
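
The four steps above can be sketched with a simple test. The example below uses a two-sided one-sample ''z''-test, chosen here only because it needs nothing beyond Python's standard library; it assumes that the population standard deviation is known, which is rarely true in practice. The function name <code>z_test</code>, the measurements, and the values of μ<sub>0</sub>, σ and α are all hypothetical.

```python
import math
from statistics import NormalDist, mean


def z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test of H0: population mean == mu0.

    Assumes the population standard deviation `sigma` is known.
    Returns (z statistic, p-value, reject H0?).
    """
    standard_error = sigma / math.sqrt(len(sample))
    z = (mean(sample) - mu0) / standard_error
    # Probability of a statistic at least this extreme if H0 is true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value, p_value < alpha


# Step 1: hypotheses, fixed before the experiment (hypothetical values).
mu0 = 5.0      # H0: the population mean is 5.0; H1: it is not
# Step 2: significance level, also fixed in advance.
alpha = 0.05
# Step 3: experiment and analysis (measurements invented for illustration).
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.5, 5.2]
z, p_value, reject = z_test(sample, mu0, sigma=0.5, alpha=alpha)
# Step 4: inference.
if reject:
    print(f"z = {z:.2f}, p = {p_value:.3f}: reject H0 at alpha = {alpha}")
else:
    print(f"z = {z:.2f}, p = {p_value:.3f}: not enough evidence to reject H0")
```

Note that the second branch reports a lack of evidence against H<sub>0</sub>, not proof that H<sub>0</sub> is true.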
* [[Confidence intervals]]

A confidence interval is a range of values that is expected to contain the true parameter value at a given level of confidence. The first step is to obtain the best unbiased estimate of the population parameter. The upper value of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value corresponding to the confidence level. The lower value is calculated similarly, but with a subtraction instead of a sum.<ref name=":2" />
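
A minimal sketch of this calculation for a population mean is given below. It assumes a normal critical value (about 1.96 for 95% confidence); for small samples a critical value from [[Student's t-distribution]] is normally preferred. The function name <code>mean_confidence_interval</code> and the measurements are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, level=0.95):
    """Confidence interval for the mean, using a normal critical value."""
    estimate = mean(sample)                                # point estimate
    standard_error = stdev(sample) / math.sqrt(len(sample))
    # Critical value of the standard normal distribution for this level.
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = critical * standard_error
    return estimate - margin, estimate + margin


# Measurements invented for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.4, 5.7, 5.3, 5.5, 5.2]
lower, upper = mean_confidence_interval(sample, level=0.95)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Raising the confidence level widens the interval, because a larger critical value multiplies the same standard error.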