== Research planning ==

Any research in the [[life sciences]] is proposed to answer a [[scientific question]]. To answer this question with high certainty, [[Accuracy and precision|accurate]] results are needed. A correct definition of the main [[hypothesis]] and of the research plan reduces errors when drawing conclusions about a phenomenon. The research plan might include the research question, the hypothesis to be tested, the [[experimental design]], [[data collection]] methods, [[data analysis]] perspectives and the costs involved. It is essential to carry out the study based on the three basic principles of experimental statistics: [[randomization]], [[Replication (statistics)|replication]], and local control.

=== Research question ===

The research question defines the objective of a study. Because the research is guided by the question, it needs to be concise while focusing on interesting and novel topics that may improve science and knowledge in that field. To define how to ask the [[scientific question]], an exhaustive [[literature review]] might be necessary, so that the research adds value to the [[scientific community]].<ref name=":3">{{cite journal|last1=Nizamuddin|first1=Sarah L.|last2=Nizamuddin|first2=Junaid|last3=Mueller|first3=Ariel|last4=Ramakrishna|first4=Harish|last5=Shahul|first5=Sajid S.|title=Developing a Hypothesis and Statistical Planning|journal=Journal of Cardiothoracic and Vascular Anesthesia|date=October 2017|volume=31|issue=5|pages=1878–1882|doi=10.1053/j.jvca.2017.04.020|pmid=28778775}}</ref>

=== Hypothesis definition ===

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a [[hypothesis]]. The main proposal is called the [[null hypothesis]] (H<sub>0</sub>) and is usually based on established knowledge about the topic or an obvious occurrence of the phenomenon, supported by a thorough literature review. It can be regarded as the standard expected answer for the data under the situation in [[Experiment|test]]. In general, H<sub>0</sub> assumes no association between treatments. The [[alternative hypothesis]], on the other hand, is the negation of H<sub>0</sub>: it assumes some degree of association between the treatment and the outcome. In either case, the hypothesis is supported by the research question and its expected and unexpected answers.<ref name=":3" />

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H<sub>0</sub> would be that there is no difference between the two diets in mice [[metabolism]] (H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub>) and the [[alternative hypothesis]] would be that the diets have different effects on the animals' metabolism (H<sub>1</sub>: μ<sub>1</sub> ≠ μ<sub>2</sub>).
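The diet example can be illustrated with a minimal sketch of a two-sided permutation test, which is one of several ways to test H<sub>0</sub>: μ<sub>1</sub> = μ<sub>2</sub> and is used here only because it needs no external library. The measurements, the function name <code>permutation_p_value</code> and its parameters are hypothetical and chosen for illustration.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_p_value(group_1, group_2, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value for H0: mu_1 = mu_2 by permutation."""
    rng = random.Random(seed)
    observed = abs(mean(group_1) - mean(group_2))
    pooled = list(group_1) + list(group_2)
    n_1 = len(group_1)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the diet labels are exchangeable, so relabel at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_1]) - mean(pooled[n_1:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical metabolic measurements (arbitrary units), one value per mouse.
diet_1 = [20.1, 21.4, 19.8, 22.0, 20.7, 21.1]
diet_2 = [23.5, 24.1, 22.8, 25.0, 23.9, 24.4]

p_value = permutation_p_value(diet_1, diet_2)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value indicates that a difference in means as large as the observed one would rarely arise if the diets had the same effect, which counts as evidence against H<sub>0</sub>.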

The [[hypothesis]] is defined by the researcher, according to their interests in answering the main question. In addition, there can be more than one [[alternative hypothesis]]. It can assume not only a difference between the observed parameters, but also the direction of that difference (''i.e.'' higher or lower), for example H<sub>1</sub>: μ<sub>1</sub> > μ<sub>2</sub>.

=== Sampling ===

Usually, a study aims to understand the effect of a phenomenon on a [[population]]. In [[biology]], a [[population]] is defined as all the [[individual]]s of a given [[species]] in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections that can be studied: a [[population]] may be not only the individuals but also the total of one specific component of their [[organism]]s, such as the whole [[genome]] or all the sperm [[cell (biology)|cells]] of an animal, or the total leaf area of a plant.

It is usually not possible to take [[Measurement|measurements]] from all the elements of a [[population]]. Because of that, the [[Sampling (statistics)|sampling]] process is very important for [[statistical inference]]. [[Sampling (statistics)|Sampling]] is defined as randomly obtaining a representative part of the entire population in order to make subsequent inferences about that population. The [[Sample (statistics)|sample]] should therefore capture as much of the [[Statistical variability|variability]] across the population as possible.<ref name=":2">{{cite journal| doi= 10.1177/0115426507022006629| pmid= 18042950| title= Biostatistics Primer: Part I| journal= Nutrition in Clinical Practice| volume= 22| issue= 6| pages= 629–35| year= 2007| last1= Overholser| first1= Brian R| last2= Sowinski| first2= Kevin M}}</ref> The [[sample size]] is determined by several factors, from the scope of the research to the resources available. In [[clinical research]], the trial type (non-inferiority, [[Equivalence (measure theory)|equivalence]], or [[superior (hierarchy)|superior]]ity) is key in determining sample [[size]].<ref name=":3" />
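As an illustration of the two ideas above, the following sketch draws a simple random sample from a hypothetical list of animal identifiers and computes an approximate per-group sample size for comparing two means. The sample-size function uses the common normal-approximation formula n = 2σ<sup>2</sup>(z<sub>1−α/2</sub> + z<sub>1−β</sub>)<sup>2</sup> / δ<sup>2</sup>, which assumes a two-sided superiority-type comparison with equal variances; other trial types need different formulas. The function names, the population and the default values are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements, each subset of size n being equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of means.

    delta: smallest difference in means worth detecting (assumed by the user)
    sigma: common standard deviation of the outcome (assumed by the user)
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * (sigma * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical population: identifiers of 1000 animals.
population = list(range(1, 1001))
sample = simple_random_sample(population, 30, seed=42)
print("Sampled identifiers:", sorted(sample))

# Detecting a difference of half a standard deviation with 80% power.
print("Animals needed per group:", n_per_group(delta=0.5, sigma=1.0))
```

Because the formula relies on the normal distribution, it slightly underestimates the size needed when the analysis uses a ''t''-test with few observations.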

=== Experimental design ===
[[Experimental designs]] uphold the three basic principles of [[design of experiments|experimental statistics]] (randomization, replication, and local control). There are three basic experimental designs for randomly allocating [[treatment group|treatments]] to the [[Quadrat|plots]] of an [[experiment]]: the [[completely randomized design]], the [[randomized block design]], and [[factorial designs]]. Treatments can be arranged in many ways within the experiment. In [[agriculture]], the correct [[experimental design]] is the root of a good study, and the arrangement of [[treatment group|treatments]] within the study is essential because the [[environment (systems)|environment]] largely affects the [[Quadrat|plots]] ([[plants]], [[livestock]], [[microorganism]]s). The main arrangements can be found in the literature under the names of "[[lattice model (physics)|lattices]]", "incomplete blocks", "[[split plot]]", "augmented blocks", and many others. All of the designs might include [[Scientific control|control plots]], determined by the researcher, to provide an [[Estimation theory|error estimation]] during [[inference]].
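The difference between the first two designs can be sketched as follows. In a completely randomized design, every replicate of every treatment is shuffled over all plots; in a randomized block design, each block receives every treatment once and the shuffling happens within each block, which is one way of applying local control. The function names and the treatment labels are hypothetical.

```python
import random


def completely_randomized(treatments, replicates, seed=None):
    """Assign each treatment to `replicates` plots, shuffled over all plots."""
    rng = random.Random(seed)
    plots = [t for t in treatments for _ in range(replicates)]
    rng.shuffle(plots)
    return plots


def randomized_blocks(treatments, n_blocks, seed=None):
    """Give every block each treatment once, in an independent random order."""
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)  # randomization is restricted to this block
        layout.append(block)
    return layout


treatments = ["A", "B", "C"]  # hypothetical treatment labels

crd = completely_randomized(treatments, replicates=4, seed=1)
print("Completely randomized:", crd)

rcbd = randomized_blocks(treatments, n_blocks=4, seed=1)
for number, block in enumerate(rcbd, start=1):
    print(f"Block {number}:", block)
```

Both layouts contain four replicates of each treatment; only the restriction placed on the randomization differs.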

In [[clinical studies]], the [[sample (statistics)|sample]]s are usually smaller than in other biological studies, and in most cases, the [[environment (systems)|environment]] effect can be controlled or measured. It is common to use [[Randomized controlled trial|randomized controlled clinical trials]], where results are usually compared with [[observational study]] designs such as [[case–control]] or [[cohort (statistics)|cohort]].<ref>{{cite journal|last1=Szczech|first1=Lynda Anne|last2=Coladonato|first2=Joseph A.|last3=Owen|first3=William F.|title=Key Concepts in Biostatistics: Using Statistics to Answer the Question "Is There a Difference?"|journal=Seminars in Dialysis|date=4 October 2002|volume=15|issue=5|pages=347–351|doi=10.1046/j.1525-139X.2002.00085.x|pmid=12358639|s2cid=30875225}}</ref>

=== Data collection ===

Data collection methods must be considered in research planning, because they strongly influence the sample size and the experimental design.

Data collection varies according to the type of data. For [[qualitative data]], collection can be done with structured questionnaires or by observation, considering the presence or intensity of disease and using scoring criteria to categorize levels of occurrence.<ref>{{cite journal|last1=Sandelowski|first1 = Margarete|title=Combining Qualitative and Quantitative Sampling, Data Collection, and Analysis Techniques in Mixed-Method Studies|journal=Research in Nursing & Health |date=2000|volume=23|issue=3|pages=246–255|doi=10.1002/1098-240X(200006)23:3<246::AID-NUR9>3.0.CO;2-H|pmid=10871540|citeseerx=10.1.1.472.7825|s2cid=10733556 }}</ref> For [[quantitative data]], collection is done by measuring numerical information using instruments.

In agriculture and biology studies, yield data and its components can be obtained by [[metric measure]]s. However, pest and disease injuries in plants are assessed by observation, using score scales for levels of damage. In genetic studies especially, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow larger experiments and make it possible to evaluate many plots in less time than a purely human-based method of data collection.
Finally, all collected data of interest must be stored in an organized data frame for further analysis.
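One possible way to organize such a data frame is a "long" layout with one row per observation and one column per variable, stored as plain CSV text. The sketch below assumes hypothetical column names (<code>plot</code>, <code>block</code>, <code>treatment</code>, <code>trait</code>, <code>value</code>) and invented measurements; it is not a prescribed storage format.

```python
import csv
import io

# Hypothetical column names: one row per observation, one column per variable.
FIELDS = ["plot", "block", "treatment", "trait", "value"]


def to_csv(records):
    """Serialize a list of observation dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Read CSV text back into dictionaries, restoring numeric values."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for row in rows:
        row["value"] = float(row["value"])
    return rows


# Invented observations: a measured yield and a visual damage score per plot.
records = [
    {"plot": "P1", "block": "1", "treatment": "A", "trait": "yield_kg", "value": 4.2},
    {"plot": "P1", "block": "1", "treatment": "A", "trait": "damage_score", "value": 2.0},
    {"plot": "P2", "block": "1", "treatment": "B", "trait": "yield_kg", "value": 5.1},
    {"plot": "P2", "block": "1", "treatment": "B", "trait": "damage_score", "value": 1.0},
]

csv_text = to_csv(records)
print(csv_text)
```

Keeping measured and scored traits in the same layout allows both kinds of data to be filtered by the <code>trait</code> column during analysis.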