
== History ==
The term ''bioinformatics'' was coined by [[Paulien Hogeweg]] and [[Ben Hesper]] in 1970 to refer to the study of information processes in biotic systems.<ref>{{cite journal |last1=Ouzounis |first1=C. A. |last2=Valencia |first2=A. |date=2003 |title=Early bioinformatics: the birth of a discipline—a personal view |journal=Bioinformatics |volume=19 |issue=17 |pages=2176–2190 | pmid=14630646 | doi=10.1093/bioinformatics/btg309| doi-access=free}}</ref><ref name="Hogeweg2011">{{cite journal |vauthors=Hogeweg P |title=The Roots of Bioinformatics in Theoretical Biology |journal=PLOS Computational Biology |volume=7 |issue=3 |pages=e1002021 |date=2011 |pmid=21483479 |pmc=3068925 | doi=10.1371/journal.pcbi.1002021 | bibcode = 2011PLSCB...7E2021H | doi-access = free }}</ref><ref>{{Cite journal| vauthors = Hesper B, Hogeweg P |year=1970|title=BIO-INFORMATICA: een werkconcept |trans-title=BIO-INFORMATICS: a working concept |language=nl |journal=Het Kameleon|volume=1 |issue=6| pages=28–29}}</ref><ref>{{cite arXiv |vauthors=Hesper B, Hogeweg P |eprint=2111.11832v1 |title=Bio-informatics: a working concept. A translation of "Bio-informatica: een werkconcept" by B. Hesper and P. Hogeweg |date=2021 |class=q-bio.OT}}</ref><ref>{{cite journal |vauthors = Hogeweg P |title=Simulating the growth of cellular forms |journal=Simulation |volume=31 |issue=3 |pages=90–96 |year=1978 |doi=10.1177/003754977803100305 |s2cid=61206099 }}</ref> This definition placed bioinformatics as a field parallel to [[biochemistry]] (the study of chemical processes in biological systems).<ref name="Hogeweg2011" />

Bioinformatics and computational biology involve the analysis of biological data, particularly DNA, RNA, and protein sequences. The field experienced explosive growth starting in the mid-1990s, driven largely by the [[Human Genome Project]] and by rapid advances in DNA sequencing technology.{{cn|date=February 2025}}

Analyzing biological data to produce meaningful information involves writing and running software programs that use [[algorithm]]s from [[graph theory]], [[artificial intelligence]], [[soft computing]], [[data mining]], [[image processing]], and [[computer simulation]]. The algorithms in turn depend on theoretical foundations such as [[discrete mathematics]], [[control theory]], [[system theory]], [[information theory]], and [[statistics]].{{cn|date=May 2024}}

=== Sequences ===
[[File: Example DNA sequence.png|thumbnail|right|Sequences of genetic material are frequently used in bioinformatics and are easier to manage using computers than manually.]]
[[File:Muscle alignment view.png|thumb|369x369px|These are sequences being compared in a MUSCLE multiple sequence alignment (MSA). Each sequence name (leftmost column) is from various louse species, while the sequences themselves are in the second column.]]

Sequencing has become far faster and cheaper since the completion of the Human Genome Project: some labs can [[DNA sequencing|sequence]] over 100,000 billion bases each year, and a full genome can be sequenced for $1,000 or less.<ref>{{cite web | vauthors = Colby B | date = 2022 | work = Sequencing.com | title = Whole Genome Sequencing Cost | url = https://sequencing.com/education-center/whole-genome-sequencing/whole-genome-sequencing-cost | access-date = 8 April 2022 | archive-date = 15 March 2022 | archive-url = https://web.archive.org/web/20220315025036/https://sequencing.com/education-center/whole-genome-sequencing/whole-genome-sequencing-cost | url-status = live }}</ref>

Computers became essential in molecular biology when [[protein sequences]] became available after [[Frederick Sanger]] determined the sequence of [[insulin]] in the early 1950s.<ref name="Sanger1951">{{cite journal |vauthors=Sanger F, Tuppy H |title=The Amino-acid Sequence in the Phenylalanyl Chain of Insulin. I. The identification of lower peptides from partial hydrolysates |journal=Biochemical Journal |volume=49 |issue=4 |pages=463–81 |date=1951 |pmid=14886310 |doi=10.1042/bj0490463 |pmc=1197535 }}</ref><ref name="Sanger1953">{{cite journal |vauthors=Sanger F, Thompson EO |title=The Amino-acid Sequence in the Glycyl Chain of Insulin. I. The identification of lower peptides from partial hydrolysates |journal=Biochemical Journal |volume=53 |issue=3 |pages=353–66 |date=1953 |pmid=13032078 |doi=10.1042/bj0530353 |pmc=1198157 }}</ref> Comparing multiple sequences manually turned out to be impractical. [[Margaret Oakley Dayhoff]], a pioneer in the field,<ref>{{cite book | vauthors=Moody G |year=2004 |title=Digital Code of Life: How Bioinformatics is Revolutionizing Science, Medicine, and Business |publisher=John Wiley & Sons |location=Hoboken, NJ, USA |isbn=978-0-471-32788-2 |url-access=registration |url=https://archive.org/details/digitalcodeoflif0000mood }}</ref> compiled one of the first protein sequence databases, initially published as books,<ref name="Dayhoff1965">{{cite book |vauthors=Dayhoff MO, Eck RV, Chang MA, Sochard MR |date=1965 |title=ATLAS of PROTEIN SEQUENCE and STRUCTURE |publisher=National Biomedical Research Foundation |location=Silver Spring, MD, USA |url=https://ntrs.nasa.gov/api/citations/19660014530/downloads/19660014530.pdf |lccn=65-29342 }}</ref> and pioneered methods of sequence alignment and [[molecular evolution]].<ref name="pmid17775169">{{cite journal |vauthors=Eck RV, Dayhoff MO |title= Evolution of the Structure of Ferredoxin Based on Living Relics of Primitive Amino Acid Sequences | journal = Science | volume = 152 | issue = 3720 | pages = 363–6 | date = April 1966 | pmid = 17775169 | doi = 10.1126/science.152.3720.363 | s2cid = 23208558 | bibcode = 1966Sci...152..363E }}</ref> Another early contributor to bioinformatics was [[Elvin A. Kabat]], who pioneered biological sequence analysis in 1970 with his comprehensive volumes of antibody sequences, released online with Tai Te Wu between 1980 and 1991.<ref>{{cite journal | vauthors = Johnson G, Wu TT | title = Kabat database and its applications: 30 years after the first variability plot | journal = Nucleic Acids Research | volume = 28 | issue = 1 | pages = 214–8 | date = January 2000 | pmid = 10592229 | pmc = 102431 | doi = 10.1093/nar/28.1.214 }}</ref>

In the 1970s, new techniques for sequencing DNA were applied to the bacteriophages MS2 and ΦX174, and the extended nucleotide sequences were then parsed with informational and statistical algorithms. These studies illustrated that well-known features, such as the coding segments and the triplet code, are revealed in straightforward statistical analyses, and they provided proof of concept that bioinformatics would be insightful.<ref>{{cite journal | vauthors = Erickson JW, Altman GG |title=A Search for Patterns in the Nucleotide Sequence of the MS2 Genome |journal=Journal of Mathematical Biology |date=1979 |volume=7 |issue=3 |pages=219–230 |doi=10.1007/BF00275725 |s2cid=85199492 }}</ref><ref>{{cite journal | vauthors = Shulman MJ, Steinberg CM, Westmoreland N | title = The coding function of nucleotide sequences can be discerned by statistical analysis | journal = Journal of Theoretical Biology | volume = 88 | issue = 3 | pages = 409–20 | date = February 1981 | pmid = 6456380 | doi = 10.1016/0022-5193(81)90274-5 | bibcode = 1981JThBi..88..409S }}</ref>