Sequence database
{{see also|Protein structure database}}

In the field of [[bioinformatics]], a '''sequence database''' is a type of [[biological database]] composed of a large collection of [[Digital data|digital]] [[nucleic acid sequence]]s, [[protein sequence]]s, or other [[polymer]] sequences. The [[UniProt]] database is an example of a [[protein sequence]] database. As of 2013 it contained over 40 million sequences and was growing at an exponential rate.<ref>{{cite journal|last1=Cochrane|first1=G.|last2=Karsch-Mizrachi|first2=I.|last3=Nakamura|first3=Y.|title=The International Nucleotide Sequence Database Collaboration|journal=Nucleic Acids Research|date=23 November 2010|volume=39|issue=Database|pages=D15–D18|doi=10.1093/nar/gkq1150|pmid=21106499|pmc=3013722}}</ref> Historically, sequences were published in paper form, but as the number of sequences grew, this storage method became unsustainable.

== Search ==
Searching a sequence database involves comparing a query sequence against the genomic or protein sequences in the database and finding the database sequence that "best" matches the query (based on criteria that vary depending on the search method). The number of matches (hits) is used to formulate a score that measures the similarity between the query and the sequences in the database.<ref>{{cite book |last1=Sung |first1=Wing-Kin |title=Algorithms in bioinformatics : a practical introduction |date=2010 |publisher=Chapman & Hall/CRC Press |location=Boca Raton |isbn=9781420070330 |page=109 |url=https://www.comp.nus.edu.sg/~ksung/algo_in_bioinfo/}}</ref> The main goal is to have a good balance between the two criteria.

== History ==

=== 1950s ===
The need for sequence databases originated in 1950 when Frederick Sanger reported the primary structure of insulin. He won his second Nobel Prize for creating methods for sequencing nucleic acids, and his comparative approach prompted other protein biochemists to begin collecting amino acid sequences, marking the beginning of molecular databases.<ref name=":0">{{Citation |last=Hagen |first=Joel B. |title=The Origin and Early Reception of Sequence Databases |date=2011 |url=https://doi.org/10.1007/978-1-60761-987-1_4 |work=Data Mining in Proteomics: From Standards to Applications |pages=61–77 |editor-last=Hamacher |editor-first=Michael |series=Methods in Molecular Biology |volume=696 |place=Totowa, NJ |publisher=Humana Press |language=en |doi=10.1007/978-1-60761-987-1_4 |pmid=21063941 |isbn=978-1-60761-987-1 |access-date=2022-05-05 |editor2-last=Eisenacher |editor2-first=Martin |editor3-last=Stephan |editor3-first=Christian|url-access=subscription }}</ref>

=== 1960s ===
In 1965 Margaret Dayhoff and her team at the National Biomedical Research Foundation (NBRF) published the ''Atlas of Protein Sequence and Structure''. They put all known protein sequences in the ''Atlas'', even unpublished material. This can be seen as the first attempt to create a molecular database. They made use of the newly computerized (1964) Medical Literature Analysis and Retrieval System (MEDLARS) at the National Institutes of Health (NIH). The team used computers to store the data but had to type and proofread each sequence manually, which was costly in both time and money.<ref name=":0" />

In 1966 the team released the second edition of the ''Atlas'', double the size of the first. It contained about 1000 sequences, and the period came to be described as an "information explosion". The NBRF was at the cutting edge of using computers for medicine and biology at this time, and Dayhoff and her team used its mainframe computers to determine the amino acid sequences of protein molecules.
The number of discovered sequences continued to grow, allowing for a deeper comparative analysis of proteins than ever before. This led to many developments, such as probabilistic models of amino acid substitution, sequence alignment, and phylogenetic trees of the evolutionary relationships of proteins.<ref name=":0" />

=== 1970s ===
The entire sequencing process became fully automated.<ref name=":0" />

=== 1980s ===
The first nucleotide sequence database was created: the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Data Library, now known as the European Nucleotide Archive. The [[Human Genome Project]] began in 1988. The project's goal was to sequence and map all the genes in a human, which required the capability to create and use a large sequence database.<ref>{{Cite web |title=History < EMBL-EBI |url=https://www.ebi.ac.uk/history |access-date=2022-05-05 |website=www.ebi.ac.uk}}</ref>

=== Present day ===
Many sequence databases now exist, along with tools for using them and easy access to them. One of the largest is [[GenBank]], which contains over 2 billion sequences.<ref name=":0" />

=== Timeline ===
[[File:Sequence Database Timeline.png|thumb|661x661px|Timeline for the creation of sequence databases.|center]]

== Current issues ==

=== Storage & redundancy ===
Records in sequence databases are deposited from a wide range of sources, from individual researchers to large genome sequencing centers. As a result, the sequences themselves, and especially the biological annotations attached to these sequences, may vary in quality. There is much redundancy, as multiple labs may submit numerous sequences that are identical, or nearly identical, to others in the databases.<ref name="Sikic-2010">{{Cite journal | last1 = Sikic | first1 = K. | last2 = Carugo | first2 = O. | title = Protein sequence redundancy reduction: comparison of various method | journal = Bioinformation | volume = 5 | issue = 6 | pages = 234–9 | year = 2010 | doi = 10.6026/97320630005234| pmid = 21364823 | pmc=3055704}}</ref>

Many annotations of the sequences are based not on laboratory experiments, but on the results of sequence similarity searches against previously annotated sequences. Once a sequence has been annotated based on similarity to others and deposited in the database, it can itself become the basis for future annotations. This can lead to a ''transitive annotation problem'', because several such annotation transfers by sequence similarity may lie between a particular database record and actual [[wet lab]] experimental information.<ref name="Iliopoulos-2003">{{Cite journal | last1 = Iliopoulos | first1 = I. | last2 = Tsoka | first2 = S. | last3 = Andrade | first3 = MA. | last4 = Enright | first4 = AJ. | last5 = Carroll | first5 = M. | last6 = Poullet | first6 = P. | last7 = Promponas | first7 = V. | last8 = Liakopoulos | first8 = T. | last9 = Palaios | first9 = G. | last10 = Pasquier | first10 = C | last11 = Hamodrakas | first11 = S | last12 = Tamames | first12 = J | last13 = Yagnik | first13 = A. T. | last14 = Tramontano | first14 = A | last15 = Devos | first15 = D | last16 = Blaschke | first16 = C | last17 = Valencia | first17 = A | last18 = Brett | first18 = D | last19 = Martin | first19 = D | last20 = Leroy | first20 = C | last21 = Rigoutsos | first21 = I | last22 = Sander | first22 = C | last23 = Ouzounis | first23 = C. A. | title = Evaluation of annotation strategies using an entire genome sequence | journal = Bioinformatics | volume = 19 | issue = 6 | pages = 717–26 |date=April 2003 | doi = 10.1093/bioinformatics/btg077| pmid = 12691983 | display-authors = 8 | doi-access = free }}</ref> Therefore, care must be taken when interpreting the annotation data from sequence databases.
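
The redundancy described above can be illustrated with a small sketch. It is only a toy model, not the method of any particular database or tool: the record names, the example sequences, the <code>identity</code> and <code>reduce_redundancy</code> helpers, and the 0.8 threshold are all assumptions made for illustration, and identity is computed position by position on equal-length sequences rather than from a proper alignment.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions.

    Simplification (assumption): only equal-length sequences are compared;
    real redundancy-reduction tools align sequences first.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def reduce_redundancy(records, threshold=0.8):
    """Greedy filter: keep a record only if its identity to every
    already-kept record is below `threshold`."""
    kept = []
    for name, seq in records:
        if all(identity(seq, kept_seq) < threshold for _, kept_seq in kept):
            kept.append((name, seq))
    return kept


# Hypothetical submissions from four labs.
records = [
    ("lab_A", "MKTAYIAKQR"),
    ("lab_B", "MKTAYIAKQR"),  # identical to lab_A
    ("lab_C", "MKTAYIAKQK"),  # differs from lab_A at one position
    ("lab_D", "GSHMLEDPVA"),  # shares no position with lab_A
]

representatives = reduce_redundancy(records)
print([name for name, _ in representatives])
```

In this sketch the first submission in each group of similar sequences is kept as the representative, so the result depends on input order; other choices of representative are equally possible.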

=== Scoring methods ===
Most current database search algorithms rank alignments by a score, which usually depends on a particular scoring system.<ref name="Altschul-1994">{{cite journal |title=Issues in searching molecular sequence databases|last1=Altschul |first1=Stephen |last2=Boguski |first2=Mark |last3=Gish |first3=Warren |last4=Wootton |first4=John |journal=Nature Genetics |year=1994 |volume=6 |issue=2 |pages=119–129 |url=https://www.nature.com/articles/ng0294-119.pdf |publisher=Nature Publishing Group|doi=10.1038/ng0294-119 |pmid=8162065 |s2cid=270160 }}</ref> One way to address this issue is to make a variety of scoring systems available, so that one can be chosen to suit the specific problem.

=== Alignment statistics ===
A search algorithm typically produces an ordered list of matches, which may lack biological significance.<ref name="Altschul-1994" />

==See also==
* [[FASTA format]]
* [[Similarity Matrix of Proteins|SIMAP]]
* [[List of biological databases]]
* [[Bioinformatics]]

== References ==
{{Reflist|2}}

==External links==
* [http://www.ebi.ac.uk/Databases/ European Bioinformatics Institute databases]
* [https://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome NCBI completely sequenced genomes]
* [http://www.yeastgenome.org/ Stanford Saccharomyces Genome Database]
* [https://www.ncbi.nlm.nih.gov/protein Protein], the [[National Institute of Health|NIH]] protein database, a collection of sequences from several sources, including translations from annotated coding regions in [[GenBank]], [[RefSeq]] and [[Third Party Annotation|TPA]], as well as records from [[SwissProt]], [[Protein Information Resource|PIR]], PRF, and [[Protein Data Bank|PDB]]

{{Bioinformatics}}
{{Use dmy dates|date=April 2017}}

[[Category:Biotechnology databases]]