{{short description|Collection of open-source Python software tools for computational biology}}
{{Infobox software
| name = Biopython
| logo = Biopython logo.png
| author = Chapman B, Chang J<ref name="Chapman2000"/>
| released = {{Start date and age|2002|12|17}}
| latest release version = {{wikidata|property|reference|edit|P348}}
| latest release date = <!-- {{Start date and age|{{wikidata|qualifier|P348|P577}}}} -->
| programming language = [[Python (programming language)|Python]], [[C (programming language)|C]]
| platform = [[Cross platform]]
| genre = [[Bioinformatics]]
| license = [http://www.biopython.org/DIST/LICENSE Biopython License]
| website = {{Official URL}}
| repo = {{wikidata|property|reference|edit|P1324}}
}}

'''Biopython''' is an [[Open-source software|open-source]] collection of non-commercial [[Python (programming language)|Python]] tools for [[computational biology]] and [[bioinformatics]].<ref name="Chapman2000">{{Cite journal |last1=Chapman |first1=Brad |last2=Chang |first2=Jeff |title=Biopython: Python tools for computational biology |journal=ACM SIGBIO Newsletter |date=August 2000 |volume=20 |issue=2 |pages=15–19 |doi=10.1145/360262.360268 |s2cid=9417766 |doi-access=free }}</ref><ref name="Cock2009">{{cite journal |last1=Cock |first1=Peter JA |last2=Antao |first2=Tiago |last3=Chang |first3=Jeffery T |last4=Chapman |first4=Brad A |last5=Cox |first5=Cymon J |last6=Dalke |first6=Andrew |last7=Friedberg |first7=Iddo |last8=Hamelryck |first8=Thomas |last9=Kauff |first9=Frank |last10=Wilczynski |first10=Bartek |last11=de Hoon |first11=Michiel JL  |title = Biopython: freely available Python tools for computational molecular biology and bioinformatics |url= | journal = Bioinformatics| volume = 25| issue = 11| pages = 1422–3|date=20 March 2009| pmid = 19304878| pmc = 2682512| doi = 10.1093/bioinformatics/btp163}}</ref><ref name="lists">Refer to the Biopython website for other [http://biopython.org/wiki/Documentation#Papers papers describing Biopython], and a list of over one hundred [http://biopython.org/wiki/Publications publications using/citing Biopython].</ref> It contains classes to represent [[biological sequence]]s and [[Bioinformatics#Genome annotation|sequence annotations]], and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online [[Biological database|databases of biological information]], such as those at [[National Center for Biotechnology Information|NCBI]]. Separate modules extend Biopython's capabilities to [[sequence alignment]], [[protein structure]], [[population genetics]], [[phylogenetics]], [[sequence motif]]s, and [[machine learning]]. 
Biopython is one of a number of Bio* projects designed to reduce [[Duplicate code|code duplication]] in [[computational biology]].<ref name="Mangalam2002">{{cite journal| last=Mangalam | first=Harry |title=The Bio* toolkits—a brief overview | journal=Briefings in Bioinformatics |date=September 2002 | volume= 3 | issue= 3 | pages= 296–302 | pmid=12230038 | doi=10.1093/bib/3.3.296| doi-access=free }}</ref>

== History ==

Biopython development began in 1999 and it was first released in July 2000.<ref name="Chapman2004">{{Citation | first = Brad | last = Chapman |title=The Biopython Project: Philosophy, functionality and facts |url=http://www.biopython.org/DIST/docs/presentations/biopython_exelixis.pdf |date=11 March 2004 |accessdate=11 September 2014}}</ref> It was developed during a similar time frame and with analogous goals to other projects that added bioinformatics capabilities to their respective programming languages, including [[BioPerl]], [[BioRuby]] and [[BioJava]]. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date.<ref name="Contributors">{{Citation |title=List of Biopython contributors |url=http://biopython.org/SRC/biopython/CONTRIB |accessdate=11 September 2014 |url-status=dead |archiveurl=https://archive.today/20140911121354/http://biopython.org/SRC/biopython/CONTRIB |archivedate=11 September 2014 }}</ref> In 2007, a similar [[Python (programming language)|Python]] project, namely '''PyCogent''', was established.<ref>{{Cite journal
 | pmid = 17708774
| pmc = 2375001
| year = 2007
| last1 = Knight
| first1 = R
| title = PyCogent: a toolkit for making sense from sequence
| journal = Genome Biology
| volume = 8
| issue = 8
| pages = R171
| last2 = Maxwell
| first2 = P
| last3 = Birmingham
| first3 = A
| last4 = Carnes
| first4 = J
| last5 = Caporaso
| first5 = J. G.
| last6 = Easton
| first6 = B. C.
| last7 = Eaton
| first7 = M
| last8 = Hamady
| first8 = M
| last9 = Lindsay
| first9 = H
| last10 = Liu
| first10 = Z
| last11 = Lozupone
| first11 = C
| last12 = McDonald
| first12 = D
| last13 = Robeson
| first13 = M
| last14 = Sammut
| first14 = R
| last15 = Smit
| first15 = S
| last16 = Wakefield
| first16 = M. J.
| last17 = Widmann
| first17 = J
| last18 = Wikman
| first18 = S
| last19 = Wilson
| first19 = S
| last20 = Ying
| first20 = H
| last21 = Huttley
| first21 = G. A.
| doi = 10.1186/gb-2007-8-8-r171
| doi-access = free
}}</ref>

The initial scope of Biopython involved accessing, indexing and processing biological sequence files. While this is still a major focus, modules added in the following years have extended its functionality to cover additional areas of biology (see [[#Key features and examples|Key features and examples]]).

As of version 1.77, Biopython no longer supports Python 2.<ref name="Python27EoL">{{Citation | first = Chris | last = Daley |title=Biopython 1.77 released |url=https://www.open-bio.org/2020/05/25/biopython-1-77-released/ |accessdate=6 October 2021}}</ref>

== Design ==
Wherever possible, Biopython follows the conventions used by the Python programming language to make it easier for users familiar with Python. For example, <code>Seq</code> and <code>SeqRecord</code> objects can be manipulated via [[Array slicing#1991: Python|slicing]], in a manner similar to Python's strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl.<ref name="Chapman2004"/>

Biopython is able to read and write most common file formats for each of its functional areas. Its license is permissive and compatible with most other software licenses, which allows Biopython to be used in a variety of software projects.<ref name="lists"/>

== Key features and examples ==

=== Sequences ===
A core concept in Biopython is the biological sequence, and this is represented by the <code>Seq</code> class.<ref name="Tutorial">{{Citation |last1=Chang |first1=Jeff |last2=Chapman |first2=Brad |last3=Friedberg |first3=Iddo |last4=Hamelryck |first4=Thomas |last5=de Hoon |first5=Michiel |last6=Cock |first6=Peter |last7=Antao |first7=Tiago |last8=Talevich |first8=Eric |last9=Wilczynski |first9=Bartek  |title=Biopython Tutorial and Cookbook |url=http://biopython.org/DIST/docs/tutorial/Tutorial.html |date=29 May 2014 |accessdate=28 August 2014}}</ref>  A Biopython <code>Seq</code> object is similar to a Python string in many respects: it supports the Python slice notation, can be concatenated with other sequences and is immutable. In addition, it includes sequence-specific methods and specifies the particular biological alphabet used.

<syntaxhighlight lang="pycon">
>>> # This script creates a DNA sequence and performs some typical manipulations
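>>> # Alphabets require a Biopython release older than 1.78, which removed Bio.Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> dna_sequence = Seq("AGGCTTCTCGTA", IUPAC.unambiguous_dna)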
>>> dna_sequence
Seq('AGGCTTCTCGTA', IUPACUnambiguousDNA())
>>> dna_sequence[2:7]
Seq('GCTTC', IUPACUnambiguousDNA())
>>> dna_sequence.reverse_complement()
Seq('TACGAGAAGCCT', IUPACUnambiguousDNA())
>>> rna_sequence = dna_sequence.transcribe()
>>> rna_sequence
Seq('AGGCUUCUCGUA', IUPACUnambiguousRNA())
>>> rna_sequence.translate()
Seq('RLLV', IUPACProtein())
</syntaxhighlight>
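
The alphabet argument, and the alphabet shown in each <code>repr</code> above, apply to releases before Biopython 1.78, which removed the <code>Bio.Alphabet</code> module. As a minimal sketch, assuming a release without alphabets, the same manipulations might be written as follows. The results are compared as plain strings so that the example does not depend on the exact <code>repr</code> of any one release.

<syntaxhighlight lang="python">
from Bio.Seq import Seq

# Create the sequence without an alphabet argument
dna_sequence = Seq("AGGCTTCTCGTA")

# Slicing returns a new Seq, just as slicing a str returns a new str
fragment = dna_sequence[2:7]

# Sequence-specific methods also return Seq objects
reverse_complement = dna_sequence.reverse_complement()
rna_sequence = dna_sequence.transcribe()
protein_sequence = rna_sequence.translate()

print(fragment, reverse_complement, rna_sequence, protein_sequence)
</syntaxhighlight>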

=== Sequence annotation ===
The <code>SeqRecord</code> class describes sequences, along with information such as name, description and features in the form of <code>SeqFeature</code> objects.  Each <code>SeqFeature</code> object specifies the type of the feature and its location. Feature types can be ‘gene’, ‘CDS’ (coding sequence), ‘repeat_region’, ‘mobile_element’ or others, and the position of features in the sequence can be exact or approximate.

<syntaxhighlight lang="pycon">
>>> # This script loads an annotated sequence from file and views some of its contents.
>>> from Bio import SeqIO
>>> seq_record = SeqIO.read("pTC2.gb", "genbank")
>>> seq_record.name
'NC_019375'
>>> seq_record.description
'Providencia stuartii plasmid pTC2, complete sequence.'
>>> seq_record.features[14]
SeqFeature(FeatureLocation(ExactPosition(4516), ExactPosition(5336), strand=1), type='mobile_element')
>>> seq_record.seq
Seq("GGATTGAATATAACCGACGTGACTGTTACATTTAGGTGGCTAAACCCGTCAAGC...GCC", IUPACAmbiguousDNA())
</syntaxhighlight>
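
Annotated records can also be built directly rather than loaded from a file. The following is a minimal sketch. The sequence, identifiers and feature coordinates are invented for illustration and are not taken from pTC2.

<syntaxhighlight lang="python">
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical record: the sequence and names are made up for this example
record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGA"),
    id="example_1",
    name="example",
    description="Invented sequence for illustration",
)

# A feature on the forward strand covering positions 3 to 9 (0-based, end-exclusive)
forward = SeqFeature(FeatureLocation(3, 9, strand=1), type="repeat_region")
# The same span annotated on the reverse strand
reverse = SeqFeature(FeatureLocation(3, 9, strand=-1), type="mobile_element")
record.features.extend([forward, reverse])

for feature in record.features:
    # extract() returns the feature's sequence, reverse-complemented for strand -1
    print(feature.type, feature.location, feature.extract(record.seq))
</syntaxhighlight>

Because locations follow Python's slicing convention, the forward feature corresponds to <code>record.seq[3:9]</code>.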

=== Input and output ===
Biopython can read and write a number of common sequence formats, including [[FASTA format|FASTA]], [[FASTQ format|FASTQ]], GenBank, Clustal, PHYLIP and [[Nexus file|NEXUS]].  When reading files, descriptive information in the file is used to populate the attributes of Biopython objects, such as <code>SeqRecord</code>.  This allows records in one file format to be converted into others.

Very large sequence files can exceed a computer's memory resources, so Biopython provides various options for accessing records in large files.  Records can be loaded entirely into memory in Python data structures, such as lists or [[Associative array|dictionaries]], which provides fast access at the cost of memory usage.  Alternatively, records can be read from disk as needed, which is slower but requires less memory.

<syntaxhighlight lang="pycon">
>>> # This script loads a file containing multiple sequences and saves each one in a different format.
>>> from Bio import SeqIO
>>> genomes = SeqIO.parse("salmonella.gb", "genbank")
>>> for genome in genomes:
...     SeqIO.write(genome, genome.id + ".fasta", "fasta")
</syntaxhighlight>
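
The two access strategies for large files described above can be sketched as follows, using <code>SeqIO.to_dict</code> for in-memory access and <code>SeqIO.index</code> for on-demand access from disk. The file name and its three short records are invented so that the example is self-contained. A real use case would involve a far larger file.

<syntaxhighlight lang="python">
import os
import tempfile

from Bio import SeqIO

# Write a small FASTA file to work with; the records are invented for this example
fasta_text = ">seq1\nACGTACGT\n>seq2\nGGGCCC\n>seq3\nTTTTAAAA\n"
fasta_path = os.path.join(tempfile.mkdtemp(), "example.fasta")
with open(fasta_path, "w") as handle:
    handle.write(fasta_text)

# Option 1: parse every record into a dictionary held in memory
in_memory = SeqIO.to_dict(SeqIO.parse(fasta_path, "fasta"))

# Option 2: index the file, so each record is parsed from disk only when requested
on_disk = SeqIO.index(fasta_path, "fasta")

memory_sequence = str(in_memory["seq2"].seq)
disk_sequence = str(on_disk["seq2"].seq)
record_ids = sorted(on_disk)
on_disk.close()

print(record_ids, memory_sequence, disk_sequence)
</syntaxhighlight>

Both objects are looked up by record identifier in the same way, so code written against one can usually be switched to the other as file sizes grow.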

=== Accessing online databases ===

Through the Bio.Entrez module, users of Biopython can download biological data from NCBI databases.  Each of the functions provided by the [[Entrez]] search engine is available through functions in this module, including searching for and downloading records.

<syntaxhighlight lang="pycon">
>>> # This script downloads genomes from the NCBI Nucleotide database and saves them in a FASTA file.
>>> from Bio import Entrez
>>> from Bio import SeqIO
>>> Entrez.email = "my_email@example.com"
>>> records_to_download = ["FO834906.1", "FO203501.1"]
>>> # The with statement closes the output file once all records are written
>>> with open("all_records.fasta", "w") as output_file:
...     for record_id in records_to_download:
...         handle = Entrez.efetch(db="nucleotide", id=record_id, rettype="gb", retmode="text")
...         seq_record = SeqIO.read(handle, format="gb")
...         handle.close()
...         output_file.write(seq_record.format("fasta"))
</syntaxhighlight>

=== Phylogeny ===

[[File:Phylo.draw.png|thumb|300px|Figure 1: A rooted phylogenetic tree created by Bio.Phylo showing the relationship between different organisms' Apaf-1 homologs<ref name="Zmasek2007">{{cite journal |last1=Zmasek |first1=Christian M |last2=Zhang |first2=Qing |last3=Ye |first3=Yuzhen |last4=Godzik |first4=Adam |date=24 October 2007 |title=Surprising complexity of the ancestral apoptosis network |journal=Genome Biology |volume=8 |issue=10 |doi=10.1186/gb-2007-8-10-r226 |pages=R226 |pmid=17958905 |pmc=2246300 |doi-access=free }}</ref>]]

[[File:Phylo.draw graphviz.png|thumb|Figure 2: The same tree as above, drawn unrooted using Graphviz via Bio.Phylo]]

The Bio.Phylo module provides tools for working with and visualising [[phylogenetic tree]]s.  A variety of file formats are supported for reading and writing, including [[Newick format|Newick]], [[Nexus file|NEXUS]] and [[phyloXML]].  Common tree manipulations and traversals are supported via the <code>Tree</code> and <code>Clade</code> objects.  Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analysing branch features such as length or score.<ref name="Talevich2012">{{cite journal |last1=Talevich |first1=Eric |last2=Invergo |first2=Brandon M |last3=Cock |first3=Peter JA |last4=Chapman |first4=Brad A |date=21 August 2012 |title=Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython |url= |journal=BMC Bioinformatics |volume=13 |issue=209 |pages=209 |doi=10.1186/1471-2105-13-209 |pmid=22909249 |pmc=3468381 |doi-access=free }}</ref>

Rooted trees can be drawn in [[ASCII art|ASCII]] or using [[matplotlib]] (see Figure 1), and the [[Graphviz]] library can be used to create unrooted layouts (see Figure 2).
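
A minimal sketch of these operations is given below. The Newick string and its taxon names are invented for the example, and the tree is read from an in-memory string rather than from a file.

<syntaxhighlight lang="python">
from io import StringIO

from Bio import Phylo

# A small rooted tree in Newick format; the taxon names are invented
newick = "((A:1.0,B:2.0):1.0,(C:1.5,D:0.5):2.0);"
tree = Phylo.read(StringIO(newick), "newick")

# Traverse the terminal clades (leaves) of the tree
leaf_names = [clade.name for clade in tree.get_terminals()]

# Sum the branch lengths on the path between two leaves
clade_a = tree.find_any(name="A")
clade_c = tree.find_any(name="C")
distance_a_c = tree.distance(clade_a, clade_c)

# Draw the rooted tree as ASCII art on standard output
Phylo.draw_ascii(tree)

# Convert the tree to another supported format
converted = StringIO()
Phylo.write(tree, converted, "nexus")
</syntaxhighlight>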

=== Genome diagrams ===

[[File:PKPS77.png|thumb|300px|Figure 3: A diagram of the genes on the pKPS77 plasmid,<ref name="NC_023330.1">{{cite web |url=https://www.ncbi.nlm.nih.gov/nuccore/NC_023330.1 |title=Klebsiella pneumoniae strain KPS77 plasmid pKPS77, complete sequence |date= |website= |publisher=NCBI |accessdate=10 September 2014}}</ref> visualised using the GenomeDiagram module in Biopython]]

The GenomeDiagram module provides methods of visualising sequences within Biopython.<ref name="Pritchard2006">{{Cite journal |last1=Pritchard |first1=Leighton |last2=White |first2=Jennifer A |last3=Birch |first3=Paul RJ |last4=Toth |first4=Ian K |title=GenomeDiagram: a python package for the visualization of large-scale genomic data |journal=Bioinformatics |date=March 2006 |volume=22 |issue=5 |pages=616–617 |doi=10.1093/bioinformatics/btk021 |pmid=16377612|doi-access=free }}</ref>  Sequences can be drawn in a linear or circular form (see Figure 3), and many output formats are supported, including [[Portable Document Format|PDF]] and [[Portable Network Graphics|PNG]].  Diagrams are created by making tracks and then adding sequence features to those tracks.  By looping over a sequence's features and using their attributes to decide if and how they are added to the diagram's tracks, one can exercise much control over the appearance of the final diagram.  Cross-links can be drawn between different tracks, allowing one to compare multiple sequences in a single diagram.

=== Macromolecular structure ===

The Bio.PDB module can load molecular structures from [[Protein Data Bank (file format)|PDB]] and [[Crystallographic Information File|mmCIF]] files, and was added to Biopython in 2003.<ref name="Hamelryck2003">{{cite journal |last1=Hamelryck |first1=Thomas |last2=Manderick |first2=Bernard |date=10 May 2003 |title=PDB file parser and structure class implemented in Python |journal=Bioinformatics |volume=19 |issue=17 |doi=10.1093/bioinformatics/btg299 |pages=2308–2310|pmid=14630660 |doi-access=free }}</ref>  The <code>Structure</code> object is central to this module, and it organises macromolecular structure in a hierarchical fashion: <code>Structure</code> objects contain <code>Model</code> objects which contain <code>Chain</code> objects which contain <code>Residue</code> objects which contain <code>Atom</code> objects.  Disordered residues and atoms get their own classes, <code>DisorderedResidue</code> and <code>DisorderedAtom</code>, that describe their uncertain positions.

Using Bio.PDB, one can navigate through individual components of a macromolecular structure file, such as examining each atom in a protein.  Common analyses can be carried out, such as measuring distances or angles, comparing residues and calculating residue depth.
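
The hierarchy and a simple distance measurement can be sketched as follows. The three atoms are invented, with coordinates chosen so that the distance is easy to check by hand. The <code>pdb_atom_line</code> helper is hypothetical and exists only to lay out the fixed-width columns of a PDB <code>ATOM</code> record. Real structures would normally be read from a PDB or mmCIF file.

<syntaxhighlight lang="python">
from io import StringIO

from Bio.PDB import PDBParser


def pdb_atom_line(serial, name, resname, chain, resseq, x, y, z, element):
    """Hypothetical helper: format one fixed-column PDB ATOM record."""
    return (
        f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}          {element:>2s}"
    )


# Two invented residues in one chain
pdb_text = "\n".join(
    [
        pdb_atom_line(1, " N  ", "ALA", "A", 1, 0.0, 0.0, 0.0, "N"),
        pdb_atom_line(2, " CA ", "ALA", "A", 1, 3.0, 4.0, 0.0, "C"),
        pdb_atom_line(3, " N  ", "GLY", "A", 2, 3.0, 4.0, 12.0, "N"),
    ]
)

parser = PDBParser(QUIET=True)
structure = parser.get_structure("example", StringIO(pdb_text))

# Walk the hierarchy: Structure -> Model -> Chain -> Residue -> Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(model.id, chain.id, residue.get_resname(), atom.get_name())

# Subtracting two Atom objects gives the distance between them
alanine = structure[0]["A"][1]
distance = alanine["N"] - alanine["CA"]
print(distance)
</syntaxhighlight>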

=== Population genetics ===

The Bio.PopGen module adds support to Biopython for Genepop, a software package for statistical analysis of population genetics.<ref name="Rousset2008">{{cite journal |last=Rousset |first=François |date=January 2008 |title=GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux |journal=Molecular Ecology Resources |volume=8 |issue=1 |doi=10.1111/j.1471-8286.2007.01931.x |pmid=21585727 |pages=103–106|bibcode=2008MolER...8..103R |s2cid=25776992 }}</ref>  This allows for analyses of [[Hardy–Weinberg principle|Hardy–Weinberg equilibrium]], [[linkage disequilibrium]] and other features of a population's [[Allele frequency|allele frequencies]].

This module can also carry out population genetic simulations using [[coalescent theory]] with the fastsimcoal2 program.<ref name="Excoffier2011">{{cite journal |last1=Excoffier |first1=Laurent |last2=Foll |first2=Matthieu |date=1 March 2011 |title=fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios |journal=Bioinformatics |volume=27 |issue=9 |doi=10.1093/bioinformatics/btr124 |pages=1332–1334 |pmid=21398675|doi-access=free }}</ref>
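
As a worked illustration of what a Hardy–Weinberg analysis checks, the sketch below compares observed genotype counts at a single biallelic locus with the counts expected from the allele frequencies. It uses only the Python standard library and a simple chi-squared statistic. It is not the Bio.PopGen interface, and Genepop's own procedures may differ. The function name and the genotype counts are invented for the example.

<syntaxhighlight lang="python">
def hardy_weinberg_chi_squared(n_AA, n_Aa, n_aa):
    """Return (p, expected counts, chi-squared) for one biallelic locus.

    Assumes both alleles are present, so no expected count is zero.
    """
    total = n_AA + n_Aa + n_aa
    # Frequency of allele A: each AA carries two copies, each Aa carries one
    p = (2 * n_AA + n_Aa) / (2 * total)
    q = 1 - p
    # Genotype counts expected under Hardy-Weinberg proportions p^2 : 2pq : q^2
    expected = (p * p * total, 2 * p * q * total, q * q * total)
    observed = (n_AA, n_Aa, n_aa)
    chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, expected, chi_squared


# Counts that match the expected proportions exactly for p = 0.7
p, expected, chi_squared = hardy_weinberg_chi_squared(49, 42, 9)
print(p, expected, chi_squared)

# Same allele frequency, but with a deficit of heterozygotes
p_skewed, expected_skewed, chi_squared_skewed = hardy_weinberg_chi_squared(60, 20, 20)
print(p_skewed, expected_skewed, chi_squared_skewed)
</syntaxhighlight>

In the second call the allele frequency is unchanged, but the shortage of heterozygotes produces a large statistic, which is the kind of departure such tests are designed to detect.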

=== Wrappers for command line tools ===

Many of Biopython's modules contain command line wrappers for commonly used tools, allowing these tools to be used from within Biopython.  These wrappers include [[BLAST (biotechnology)|BLAST]], [[Clustal]], PhyML, [[EMBOSS]] and [[SAMtools]].  Users can subclass a generic wrapper class to add support for any other command line tool.
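
The subclassing pattern can be sketched with the standard library alone. The classes below are hypothetical and are not Biopython's wrapper classes. They only illustrate how a generic base class can turn keyword options into a command line while a subclass supplies the program name. The Python interpreter stands in for an external tool so that the example runs without installing anything else.

<syntaxhighlight lang="python">
import subprocess
import sys


class CommandLineWrapper:
    """Hypothetical generic wrapper; subclasses set the program to run."""

    program = None

    def __init__(self, **options):
        self.options = options

    def build_args(self):
        # Turn each keyword option into a "-name value" pair
        args = [self.program]
        for name, value in self.options.items():
            args.extend([f"-{name}", str(value)])
        return args

    def __call__(self):
        # Run the tool and return its standard output and standard error
        completed = subprocess.run(
            self.build_args(), capture_output=True, text=True, check=True
        )
        return completed.stdout, completed.stderr


class PythonOneLiner(CommandLineWrapper):
    """Wrap the current Python interpreter as a stand-in external tool."""

    program = sys.executable


# The "c" option becomes "-c <code>" on the command line
command = PythonOneLiner(c="print('AGGCTTCTCGTA'[::-1])")
stdout, stderr = command()
print(stdout.strip())
</syntaxhighlight>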

== See also ==
* [[Open Bioinformatics Foundation]]
* [[BioPerl]]
* [[BioRuby]]
* [[BioJS]]
* [[BioJava]]

==References==
{{Reflist}}

== External links ==
* {{Official website|https://biopython.org/}}
* [https://biopython.org/DIST/docs/tutorial/Tutorial.html Biopython Tutorial and Cookbook] ([https://biopython.org/DIST/docs/tutorial/Tutorial.pdf PDF])
* [https://github.com/biopython/biopython Biopython source code on GitHub]

[[Category:Articles with example Python (programming language) code]]
[[Category:Bioinformatics software]]
[[Category:Computational science]]
[[Category:Python (programming language) scientific libraries]]
[[Category:Free bioinformatics software]]