{{Infobox software
| name = BioJava
| logo = BioJava-logo-full.png
| author = Andreas Prlić
| developer = Amr ALHOSSARY, Andreas Prlic, Dmytro Guzenko, Hannes Brandstätter-Müller, Jose Manuel Duarte, Thomas Down, Michael L Heuer, Peter Troshin, JianJiong Gao, Aleix Lafita, Peter Rose, Spencer Bliven
| released = {{Start date and age|2002}}
| latest release version = 6.0.3
| latest release date = {{Start date and age|2021|12|19}}
| latest preview version = 
| latest preview date = 
| programming language = [[Java (programming language)|Java]]
| platform = [[Web browser]] with [[Java SE]]
| language = English
| genre = [[Bioinformatics]]
| license = [[GNU Lesser General Public License|Lesser GPL]] 2.1
| website = {{Official URL}}
| repo = {{URL|https://github.com/biojava}}
}}
'''BioJava''' is an [[open-source software]] project dedicated to providing [[Java (software platform)|Java]] tools for processing [[Biology|biological]] data.<ref name=BioJava>{{cite journal |vauthors=Prlić A, Yates A, Bliven SE |title=BioJava: an open-source framework for bioinformatics in 2012 |journal=Bioinformatics |volume=28 |issue=20 |pages=2693–5 |date=October 2012 |pmid=22877863 |pmc=3467744 |doi=10.1093/bioinformatics/bts494 |url=|display-authors=etal}}</ref><ref name="pmid18689808">{{cite journal|vauthors=Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K |title=BioJava: an open-source framework for bioinformatics. |journal=Bioinformatics |year= 2008 |volume= 24 |issue= 18 |pages= 2096–7 |pmid=18689808 |doi=10.1093/bioinformatics/btn397 |pmc=2530884 |display-authors=etal}}</ref><ref name="Mathura">VS Matha and P Kangueane, 2009, ''Bioinformatics: a concept-based introduction'', 2009. p26</ref> BioJava is a set of [[Library (computing)|library]] functions written in the programming language [[Java (programming language)|Java]] for manipulating sequences and protein structures. It also provides file parsers, [[Common Object Request Broker Architecture]] (CORBA) interoperability, [[Distributed Annotation System]] (DAS) support, access to [[AceDB]], dynamic programming, and simple statistical routines. BioJava supports a range of data, from DNA and protein sequences to 3D protein structures. The BioJava libraries are useful for automating many routine [[bioinformatics]] tasks, such as parsing a [[Protein Data Bank]] (PDB) file or interacting with Jmol.<ref name=Jmol /> This [[application programming interface]] (API) provides file parsers, data models and algorithms that facilitate working with the standard data formats and enable rapid application development and analysis.

Additional projects from BioJava include rcsb-sequenceviewer, biojava-http, biojava-spark, and rcsb-viewers.

==Features==
BioJava provides software modules for many of the typical tasks of bioinformatics programming. These include:

* Accessing [[Nucleotide sequence|nucleotide]] and [[Peptide sequence|peptide]] sequence data from local and remote [[Biological database|databases]]
* Transforming [[List of file formats#Biology|formats]] of database and file records
* Protein structure parsing and manipulation
* Manipulating individual sequences
* Searching for similar sequences
* Creating and manipulating [[sequence alignment]]s

==History and publications==
The BioJava project grew out of work by Thomas Down and Matthew Pocock to create an API to simplify development of Java-based bioinformatics tools. BioJava is an active open source project that has been developed over more than 12 years by more than 60 developers. BioJava is one of a number of Bio* projects designed to reduce code duplication.<ref name="pmid12230038">{{cite journal|author=Mangalam H|year=2002|title=The Bio* toolkits--a brief overview|journal=Briefings in Bioinformatics|volume=3|issue=3|pages=296–302|doi=10.1093/bib/3.3.296|pmid=12230038|doi-access=free}}</ref> Other Bio* projects include [[BioPython]],<ref>{{cite journal|display-authors=etal|vauthors=Cock PJ, Antao T, Chang JT|date=June 2009|title=Biopython: freely available Python tools for computational molecular biology and bioinformatics|url= |journal=Bioinformatics|volume=25|issue=11|pages=1422–3|doi=10.1093/bioinformatics/btp163|pmc=2682512|pmid=19304878}}</ref> [[BioPerl]],<ref>{{cite journal|display-authors=etal|vauthors=Stajich JE, Block D, Boulez K|date=October 2002|title=The Bioperl toolkit: Perl modules for the life sciences|journal=Genome Res.|volume=12|issue=10|pages=1611–8|doi=10.1101/gr.361602|pmc=187536|pmid=12368254}}</ref> [[BioRuby]],<ref>{{cite journal|vauthors=Goto N, Prins P, Nakao M, Bonnal R, Aerts J, Katayama T|date=October 2010|title=BioRuby: bioinformatics software for the Ruby programming language|url= |journal=Bioinformatics|volume=26|issue=20|pages=2617–9|doi=10.1093/bioinformatics/btq475|pmc=2951089|pmid=20739307}}</ref> and EMBOSS.<ref>{{cite journal|vauthors=Rice P, Longden I, Bleasby A|date=June 2000|title=EMBOSS: the European Molecular Biology Open Software Suite|journal=Trends Genet.|volume=16|issue=6|pages=276–7|doi=10.1016/S0168-9525(00)02024-2|pmid=10827456}}</ref>

In October 2012, a paper describing the state of BioJava at that time was published.<ref name="BioJava2">{{cite journal|display-authors=etal|vauthors=Prlić A, Yates A, Bliven SE|date=October 2012|title=BioJava: an open-source framework for bioinformatics in 2012|url= |journal=Bioinformatics|volume=28|issue=20|pages=2693–5|doi=10.1093/bioinformatics/bts494|pmc=3467744|pmid=22877863}}</ref> This paper detailed BioJava's modules, functionalities, and purpose.

As of November 2018 Google Scholar counts more than 130 citations.<ref>{{Cite web|url=https://scholar.google.com/scholar?cites=3048631375755320177&as_sdt=2005&sciodt=0,5&hl=en|title=Google Scholar|website=scholar.google.com|access-date=2018-11-22}}</ref>

The most recent paper on BioJava was written in February 2017.<ref>{{Cite journal|last1=Gao|first1=Jianjiong|last2=Prlić|first2=Andreas|last3=Bi|first3=Chunxiao|last4=Bluhm|first4=Wolfgang F.|last5=Dimitropoulos|first5=Dimitris|last6=Xu|first6=Dong|last7=Bourne|first7=Philip E.|last8=Rose|first8=Peter W.|date=2017-02-17|title=BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank|journal=Bioinformatics|language=en|volume=33|issue=13|pages=2047–2049|doi=10.1093/bioinformatics/btx101|pmid=28334105|pmc=5870676|issn=1367-4803}}</ref> This paper detailed a new tool named BioJava-ModFinder, which can be used to identify protein modifications and map them onto 3D structures in the Protein Data Bank ([[Protein Data Bank|PDB]]). The package was also integrated with the [[Rcsb|RCSB]] PDB web application, where it added protein modification annotations to the sequence diagram and structure display. More than 30,000 structures with protein modifications were identified using BioJava-ModFinder and can be found on the RCSB PDB website.

In 2008, BioJava's first application note was published.<ref name="pmid18689808"/> The project was migrated from its original CVS repository to [[GitHub]] in April 2013.<ref>{{cite web|url=http://biojava.org/wiki/Get_source#History|title=History|access-date=30 Jan 2015}}</ref> The original code base has been moved to a separate repository, BioJava-legacy, and is still maintained for minor changes and bug fixes.<ref>[http://www.biojava.org/docs/api1.8.2/ BioJava-legacy] {{webarchive|url=https://web.archive.org/web/20130109110621/http://www.biojava.org/docs/api1.8.2/|date=2013-01-09}}</ref>

Version 3 was released in December 2010. It was a major update to the prior versions. The aim of this release was to rewrite BioJava so that it could be modularized into small, reusable components. This allowed developers to contribute more easily and reduced dependencies. The new approach seen in BioJava 3 was modeled after the [[Apache Commons]].

Version 4 was released in January 2015. This version brought many new features and improvements to the packages biojava-core, biojava-structure, biojava-structure-gui, and biojava-phylo, as well as others. BioJava 4.2.0 was the first release to be available from Maven Central.

Version 5 was released in March 2018 and represented a major milestone for the project. BioJava 5.0.0 was the first release based on Java 8, which introduced [[Anonymous function|lambda]] functions and streaming API calls. There were also major changes to the biojava-structure module, and the previous data models for macromolecular structures were adapted to more closely represent the [[Crystallographic Information File|mmCIF]] data model. This was the first release in over two years. Other improvements included optimizations in the biojava-structure module to improve symmetry detection and added support for the MMTF format. General improvements included Javadoc updates, updated dependency versions, and the migration of all tests to JUnit 4. The release contained 1,170 commits from 19 contributors.

==Modules==
During 2014-2015, large parts of the original code base were rewritten. BioJava 3 is a clear departure from the version 1 series. It now consists of several independent modules built using an automation tool called [[Apache Maven]].<ref>{{cite web|last=Maven|first=Apache|title=Maven|url=http://maven.apache.org|publisher=Apache}}</ref> These modules provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detecting protein modifications, predicting disordered regions in proteins, and parsers for common file formats using a biologically meaningful data model. The original code has been moved into a separate BioJava legacy project, which is still available for backward compatibility.<ref>[http://www.biojava.org/docs/api1.8.2/ BioJava legacy project] {{webarchive|url=https://web.archive.org/web/20130109110621/http://www.biojava.org/docs/api1.8.2/ |date=2013-01-09 }}</ref>

BioJava 5 introduced new features to two modules, biojava-alignment and biojava-structure.

The following sections will describe several of the new modules and highlight some of the new features that are included in the latest version of BioJava.
[[File:BioJava 5 Module Layout.png|thumb|1292x1292px]]

===Core Module===
This module provides Java [[Class (computer programming)|classes]] to model [[amino acid]] or [[nucleotide]] sequences. The classes were designed so that the names are familiar and make sense to biologists and also provide a concrete representation of the steps in going from a gene sequence to a protein sequence for computer scientists and programmers.

A major change between the legacy BioJava project and BioJava3 lies in the way the framework has been designed to exploit then-new innovations in Java. A sequence is defined as a generic [[Interface (computing)#Software interfaces in object-oriented languages|interface]], allowing the rest of the modules to create any utility that operates on all sequences. Specific classes for common sequences such as DNA and proteins have been defined in order to improve usability for biologists. The translation engine builds on this work by allowing conversions between DNA, RNA and amino acid sequences. This engine can handle details such as choosing the codon table, converting start codons to methionine, trimming stop codons, specifying the reading frame and handling ambiguous sequences.
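
The kind of decisions a translation engine makes (reading frame, stop-codon handling, ambiguous bases) can be illustrated with a minimal, self-contained sketch. It is not BioJava's API: the class <code>TranslationSketch</code>, its method and its parameters are hypothetical names chosen for illustration, and only the standard genetic code is supported.

```java
import java.util.Locale;

final class TranslationSketch {
    // Bases in the conventional T, C, A, G order used by compact codon tables.
    private static final String BASES = "TCAG";
    // Standard genetic code; index = 16 * first + 4 * second + third base.
    private static final String STANDARD_TABLE =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /**
     * Translates a DNA or RNA string, starting at a 0-based frame offset.
     * Codons containing anything other than A, C, G, T or U become 'X'.
     * If trimStop is true, translation ends before the first stop codon.
     */
    public static String translate(String nucleotides, int frame, boolean trimStop) {
        String seq = nucleotides.toUpperCase(Locale.ROOT).replace('U', 'T');
        StringBuilder protein = new StringBuilder();
        for (int i = frame; i + 3 <= seq.length(); i += 3) {
            int first = BASES.indexOf(seq.charAt(i));
            int second = BASES.indexOf(seq.charAt(i + 1));
            int third = BASES.indexOf(seq.charAt(i + 2));
            char aminoAcid = (first < 0 || second < 0 || third < 0)
                    ? 'X' // ambiguous or unknown base in this codon
                    : STANDARD_TABLE.charAt(16 * first + 4 * second + third);
            if (aminoAcid == '*' && trimStop) {
                break;
            }
            protein.append(aminoAcid);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCCTAA", 0, true));
        System.out.println(translate("AUGGCCUAA", 0, false));
    }
}
```

A full engine would additionally let the caller select alternative codon tables and resolve ambiguity codes where every possible codon encodes the same amino acid; the sketch simply emits <code>X</code> in that case.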

Special attention has been paid to designing the storage of sequences to minimize space needs. Special design patterns such as the [[Proxy pattern]] allowed the developers to create the framework such that sequences can be stored in memory, fetched on demand from a web service such as UniProt, or read from a FASTA file as needed. The latter two approaches save memory by not loading sequence data until it is referenced in the application. This concept can be extended to handle very large genomic datasets, such as NCBI GenBank or a proprietary database.
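
The lazy-loading idea behind the proxy pattern can be sketched in a few lines of plain Java. The names below (<code>LazySequenceSketch</code>, <code>SequenceProxy</code>, <code>loadCount</code>) are hypothetical and do not correspond to BioJava classes; the loader stands in for whatever actually fetches the residues, such as a file reader or a web-service call.

```java
import java.util.function.Supplier;

final class LazySequenceSketch {
    /** Minimal sequence abstraction shared by eager and lazy implementations. */
    public interface Sequence {
        int length();
        String asString();
    }

    /** Proxy that defers loading until the residues are first requested. */
    public static final class SequenceProxy implements Sequence {
        private final Supplier<String> loader;
        private String residues; // stays null until first access
        private int loadCount;   // kept only to make the laziness observable

        public SequenceProxy(Supplier<String> loader) {
            this.loader = loader;
        }

        private String load() {
            if (residues == null) {
                residues = loader.get(); // e.g. read a FASTA file or query a service
                loadCount++;
            }
            return residues;
        }

        @Override
        public int length() {
            return load().length();
        }

        @Override
        public String asString() {
            return load();
        }

        public int loadCount() {
            return loadCount;
        }
    }

    public static void main(String[] args) {
        SequenceProxy proxy = new SequenceProxy(() -> "MKTAYIAKQR");
        System.out.println("loads before access: " + proxy.loadCount());
        System.out.println("length: " + proxy.length());
        System.out.println("loads after access: " + proxy.loadCount());
    }
}
```

Code that only holds references to many such proxies pays the memory cost of a sequence only once it is actually used. The sketch is not thread-safe; a production version would need synchronization or an eviction policy.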

===Protein structure modules===
[[File:This window shows two proteins with IDs "4hhb.A" and "4hhb.B" aligned against each other.png|framed|right|This window shows two proteins with IDs "4hhb.A" and "4hhb.B" aligned against each other. The code is given on the left side. This is produced using BioJava libraries which in turn uses Jmol viewer.<ref name=Jmol>Hanson, R.M. (2010) Jmol a paradigm shift in crystallographic visualization.</ref> The FATCAT<ref name=fatcat /> rigid algorithm is used here to do the alignment.]]
The protein structure modules provide tools to represent and manipulate 3D biomolecular structures. They focus on protein structure comparison.

The following algorithms have been implemented and included in BioJava.
* FATCAT algorithm for flexible and rigid body alignment.<ref name=fatcat>{{cite journal |vauthors=Ye Y, Godzik A |title=Flexible structure alignment by chaining aligned fragment pairs allowing twists |journal=Bioinformatics |volume=19 |issue=Suppl 2|pages=ii246–55 |date=October 2003 |pmid=14534198 |doi=10.1093/bioinformatics/btg1086|doi-access=free }}</ref>
* The standard Combinatorial Extension (CE) algorithm.<ref>{{cite journal |vauthors=Shindyalov IN, Bourne PE |title=Protein structure alignment by incremental combinatorial extension (CE) of the optimal path |journal=Protein Eng. |volume=11 |issue=9 |pages=739–47 |date=September 1998 |pmid=9796821 |doi=10.1093/protein/11.9.739|doi-access=free }}</ref>
* A new version of CE that can detect circular permutations in proteins.<ref>{{cite journal |vauthors=Bliven S, Prlić A |title=Circular permutation in proteins |journal=PLOS Comput. Biol. |volume=8 |issue=3 |pages=e1002445 |year=2012 |pmid=22496628 |pmc=3320104 |doi=10.1371/journal.pcbi.1002445 |bibcode=2012PLSCB...8E2445B |doi-access=free }}</ref>
These algorithms are used to provide the RCSB Protein Data Bank (PDB)<ref>{{cite journal |vauthors=Rose PW, Beran B, Bi C |title=The RCSB Protein Data Bank: redesigned web site and web services |journal=Nucleic Acids Res. |volume=39 |issue=Database issue |pages=D392–401 |date=January 2011 |pmid=21036868 |pmc=3013649 |doi=10.1093/nar/gkq1021 |url=|display-authors=etal}}</ref> Protein Comparison Tool as well as systematic comparisons of all proteins in the PDB on a weekly basis.<ref>{{cite journal |vauthors=Prlić A, Bliven S, Rose PW |title=Pre-calculated protein structure alignments at the RCSB PDB website |journal=Bioinformatics |volume=26 |issue=23 |pages=2983–5 |date=December 2010 |pmid=20937596 |pmc=3003546 |doi=10.1093/bioinformatics/btq572 |url=|display-authors=etal}}</ref>

Parsers for PDB<ref>{{cite journal |vauthors=Bernstein FC, Koetzle TF, Williams GJ |title=The Protein Data Bank: a computer-based archival file for macromolecular structures |journal=J. Mol. Biol. |volume=112 |issue=3 |pages=535–42 |date=May 1977 |pmid=875032 |doi=10.1016/s0022-2836(77)80200-3 |display-authors=etal}}</ref> and mmCIF<ref>Fitzgerald, P.M.D. et al. (2006) Macromolecular dictionary (mmCIF). In Hall, S.R.</ref> file formats allow the loading of structure data into a reusable data model. This feature is used by the SIFTS project to map between UniProt sequences and PDB structures.<ref>{{cite journal |vauthors=Velankar S, McNeil P, Mittard-Runte V |title=E-MSD: an integrated data resource for bioinformatics |journal=Nucleic Acids Res. |volume=33 |issue=Database issue |pages=D262–5 |date=January 2005 |pmid=15608192 |pmc=540012 |doi=10.1093/nar/gki058 |url=|display-authors=etal}}</ref> Information from the RCSB PDB can be dynamically fetched without the need to manually download data. For visualization, an interface to the 3D viewer Jmol is provided.<ref name=Jmol />
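
To give a sense of what a PDB parser has to do, the following sketch reads a single fixed-column <code>ATOM</code> or <code>HETATM</code> record of the legacy PDB format and computes the distance between two atoms. It is an illustration under stated assumptions, not BioJava's parser: the class and field names are hypothetical, and only the columns needed here are read.

```java
final class PdbAtomSketch {
    /** The handful of fields this sketch extracts from an ATOM/HETATM record. */
    public static final class Atom {
        public final String name;
        public final String residueName;
        public final char chainId;
        public final int residueNumber;
        public final double x, y, z;

        Atom(String name, String residueName, char chainId, int residueNumber,
             double x, double y, double z) {
            this.name = name;
            this.residueName = residueName;
            this.chainId = chainId;
            this.residueNumber = residueNumber;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Parses one record using the fixed column positions of the PDB format. */
    public static Atom parse(String line) {
        boolean isAtom = line.startsWith("ATOM") || line.startsWith("HETATM");
        if (!isAtom || line.length() < 54) {
            throw new IllegalArgumentException("Not a complete ATOM/HETATM record");
        }
        return new Atom(
                line.substring(12, 16).trim(),                     // columns 13-16: atom name
                line.substring(17, 20).trim(),                     // columns 18-20: residue name
                line.charAt(21),                                   // column 22: chain identifier
                Integer.parseInt(line.substring(22, 26).trim()),   // columns 23-26: residue number
                Double.parseDouble(line.substring(30, 38).trim()), // columns 31-38: x
                Double.parseDouble(line.substring(38, 46).trim()), // columns 39-46: y
                Double.parseDouble(line.substring(46, 54).trim())); // columns 47-54: z
    }

    /** Euclidean distance between two atoms, in the units of the file (angstroms). */
    public static double distance(Atom a, Atom b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    public static void main(String[] args) {
        Atom ca = parse("ATOM      1  CA  ALA A   1      11.104   6.134  -6.504");
        System.out.println(ca.name + " " + ca.residueName + " " + ca.chainId + ca.residueNumber);
    }
}
```

The mmCIF format is tokenized rather than column-based, so a parser for it looks quite different; the reusable data model mentioned above is what lets downstream code ignore that difference.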

===Genome and Sequencing modules===

This module is focused on the creation of gene sequence objects from the core module. This is realized by supporting the parsing of the following popular standard file formats generated by open source gene prediction applications:
* GTF files generated by GeneMark<ref>{{cite journal |vauthors=Besemer J, Borodovsky M |title=GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses |journal=Nucleic Acids Res. |volume=33 |issue=Web Server issue |pages=W451–4 |date=July 2005 |pmid=15980510 |pmc=1160247 |doi=10.1093/nar/gki487 |url=}}</ref>
* GFF2 files generated by GeneID<ref>{{cite book |vauthors=Blanco E, Abril JF |title=Bioinformatics for DNA Sequence Analysis |chapter=Computational Gene Annotation in New Genome Assemblies Using GeneID |volume=537 |pages=243–61 |year=2009 |pmid=19378148 |doi=10.1007/978-1-59745-251-9_12 |series=Methods in Molecular Biology |isbn=978-1-58829-910-9 }}</ref>
* GFF3 files generated by Glimmer<ref>{{cite journal |vauthors=Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL |title=Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering |journal=Nucleic Acids Res. |volume=40 |issue=1 |pages=e9 |date=January 2012 |pmid=22102569 |pmc=3245904 |doi=10.1093/nar/gkr1067 |url=}}</ref>
The gene sequence objects are then written out in GFF3 format and imported into GMOD.<ref>{{cite journal |vauthors=Stein LD, Mungall C, Shu S |title=The generic genome browser: a building block for a model organism system database |journal=Genome Res. |volume=12 |issue=10 |pages=1599–610 |date=October 2002 |pmid=12368253 |pmc=187535 |doi=10.1101/gr.403602 |display-authors=etal}}</ref>
These file formats are well defined, but what gets written in the file is very flexible.
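
The combination of a fixed layout and flexible content is visible in a single GFF3 line: nine tab-separated columns, the last of which holds free-form <code>key=value</code> attributes. The sketch below parses one such line; the class and field names are hypothetical and it is not the parser that BioJava ships.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Gff3LineSketch {
    /** A subset of the nine GFF3 columns, plus the parsed attribute column. */
    public static final class Feature {
        public final String seqId;
        public final String source;
        public final String type;
        public final int start; // 1-based, inclusive
        public final int end;   // 1-based, inclusive
        public final char strand;
        public final Map<String, String> attributes;

        Feature(String seqId, String source, String type, int start, int end,
                char strand, Map<String, String> attributes) {
            this.seqId = seqId;
            this.source = source;
            this.type = type;
            this.start = start;
            this.end = end;
            this.strand = strand;
            this.attributes = attributes;
        }

        public int length() {
            return end - start + 1;
        }
    }

    public static Feature parse(String line) {
        String[] cols = line.split("\t", -1);
        if (cols.length != 9) {
            throw new IllegalArgumentException("Expected 9 tab-separated columns");
        }
        // Column 9 is a semicolon-separated list of key=value pairs.
        Map<String, String> attributes = new LinkedHashMap<>();
        for (String pair : cols[8].split(";")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                attributes.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return new Feature(cols[0], cols[1], cols[2],
                Integer.parseInt(cols[3]), Integer.parseInt(cols[4]),
                cols[6].charAt(0), attributes);
    }

    public static void main(String[] args) {
        Feature gene = parse("chr1\tGlimmer\tgene\t100\t200\t.\t+\t.\tID=gene1;Name=abc");
        System.out.println(gene.type + " " + gene.attributes);
    }
}
```

The score and phase columns are ignored here, and URL-escaped characters in attribute values are left undecoded.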

A separate sequencing module provides input-output support for several common variants of the FASTQ file format produced by next-generation sequencers.<ref>{{cite journal |vauthors=Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM |title=The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants |journal=Nucleic Acids Res. |volume=38 |issue=6 |pages=1767–71 |date=April 2010 |pmid=20015970 |pmc=2847217 |doi=10.1093/nar/gkp1137 |url=}}</ref> Examples of how to use this module are available in the [http://biojava.org/wiki/BioJava:CookBook3:FASTQ BioJava Cookbook].
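
The FASTQ variants differ mainly in how quality scores are encoded as characters. The following sketch, which is independent of BioJava's sequencing module and uses hypothetical names, reads unwrapped four-line records and decodes Phred scores with a caller-supplied ASCII offset (33 for the Sanger variant, 64 for the Illumina 1.3+ variant).

```java
import java.util.ArrayList;
import java.util.List;

final class FastqSketch {
    public static final class Record {
        public final String id;
        public final String sequence;
        public final String quality;

        Record(String id, String sequence, String quality) {
            this.id = id;
            this.sequence = sequence;
            this.quality = quality;
        }
    }

    /** Parses FASTQ text, assuming each record occupies exactly four lines. */
    public static List<Record> parse(String text) {
        String[] lines = text.split("\r?\n");
        List<Record> records = new ArrayList<>();
        int i = 0;
        while (i < lines.length) {
            if (lines[i].isEmpty()) { // tolerate blank lines between records
                i++;
                continue;
            }
            if (i + 3 >= lines.length) {
                throw new IllegalArgumentException("Truncated record at line " + (i + 1));
            }
            String header = lines[i], sequence = lines[i + 1];
            String separator = lines[i + 2], quality = lines[i + 3];
            if (!header.startsWith("@") || !separator.startsWith("+")
                    || quality.length() != sequence.length()) {
                throw new IllegalArgumentException("Malformed record at line " + (i + 1));
            }
            records.add(new Record(header.substring(1), sequence, quality));
            i += 4;
        }
        return records;
    }

    /** Decodes a quality string into Phred scores by subtracting the ASCII offset. */
    public static int[] phredScores(String quality, int offset) {
        int[] scores = new int[quality.length()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = quality.charAt(i) - offset;
        }
        return scores;
    }

    public static void main(String[] args) {
        Record r = parse("@read1\nACGT\n+\nII!5\n").get(0);
        System.out.println(r.id + " " + java.util.Arrays.toString(phredScores(r.quality, 33)));
    }
}
```

The original Solexa variant uses a different quality scale rather than just a different offset, so it would need an extra conversion step that this sketch omits.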

===Alignment module===
This module contains several classes and methods that allow users to perform pairwise and multiple sequence alignment. Sequences can be aligned in either a single-threaded or a multi-threaded fashion. BioJava implements the [[Needleman–Wunsch algorithm|Needleman–Wunsch]]<ref>{{cite journal |vauthors=Needleman SB, Wunsch CD |title=A general method applicable to the search for similarities in the amino acid sequence of two proteins |journal=J. Mol. Biol. |volume=48 |issue=3 |pages=443–53 |date=March 1970 |pmid=5420325 |doi=10.1016/0022-2836(70)90057-4}}</ref> algorithm for optimal global alignments and the [[Smith–Waterman algorithm|Smith–Waterman]]<ref>{{cite journal |vauthors=Smith TF, Waterman MS |title=Identification of common molecular subsequences |journal=J. Mol. Biol. |volume=147 |issue=1 |pages=195–7 |date=March 1981 |pmid=7265238 |doi=10.1016/0022-2836(81)90087-5|citeseerx=10.1.1.63.2897 }}</ref> algorithm for local alignments.
The outputs of both local and global alignments are available in standard formats. In addition to these two algorithms, there is an implementation of the Guan–Uberbacher algorithm,<ref>{{cite journal |vauthors=Guan X, Uberbacher EC |title=Alignments of DNA and protein sequences containing frameshift errors |journal=Comput. Appl. Biosci. |volume=12 |issue=1 |pages=31–40 |date=February 1996 |pmid=8670617 |doi=10.1093/bioinformatics/12.1.31|doi-access=free }}</ref> which performs global sequence alignment very efficiently because it uses only linear memory.
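
For readers unfamiliar with global alignment, the following is a textbook Needleman–Wunsch sketch with a simple match/mismatch score and a linear gap penalty. It is not BioJava's implementation: the class name and scoring values are assumptions made for illustration, and it uses a full quadratic score matrix rather than the linear-memory approach of Guan–Uberbacher.

```java
final class NeedlemanWunschSketch {
    // Illustrative scoring scheme; real tools use substitution matrices and affine gaps.
    private static final int MATCH = 1;
    private static final int MISMATCH = -1;
    private static final int GAP = -2;

    /** Result of a global alignment: both gapped strings and the optimal score. */
    public static final class Alignment {
        public final String first;
        public final String second;
        public final int score;

        Alignment(String first, String second, int score) {
            this.first = first;
            this.second = second;
            this.score = score;
        }
    }

    private static int substitution(char a, char b) {
        return a == b ? MATCH : MISMATCH;
    }

    public static Alignment align(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] score = new int[n + 1][m + 1];
        // Aligning a prefix against the empty string costs one gap per residue.
        for (int i = 1; i <= n; i++) score[i][0] = i * GAP;
        for (int j = 1; j <= m; j++) score[0][j] = j * GAP;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diagonal = score[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1));
                int up = score[i - 1][j] + GAP;
                int left = score[i][j - 1] + GAP;
                score[i][j] = Math.max(diagonal, Math.max(up, left));
            }
        }
        // Trace back from the bottom-right corner, preferring diagonal moves.
        StringBuilder alignedA = new StringBuilder();
        StringBuilder alignedB = new StringBuilder();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0 && score[i][j]
                    == score[i - 1][j - 1] + substitution(a.charAt(i - 1), b.charAt(j - 1))) {
                alignedA.append(a.charAt(--i));
                alignedB.append(b.charAt(--j));
            } else if (i > 0 && score[i][j] == score[i - 1][j] + GAP) {
                alignedA.append(a.charAt(--i));
                alignedB.append('-');
            } else {
                alignedA.append('-');
                alignedB.append(b.charAt(--j));
            }
        }
        return new Alignment(alignedA.reverse().toString(),
                alignedB.reverse().toString(), score[n][m]);
    }

    public static void main(String[] args) {
        Alignment result = align("ACGT", "AGT");
        System.out.println(result.first);
        System.out.println(result.second);
        System.out.println("score = " + result.score);
    }
}
```

Smith–Waterman local alignment follows the same recurrence but clamps every cell at zero and starts the traceback from the highest-scoring cell instead of the corner.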

For '''[[Multiple Sequence Alignment]]''', any of the methods discussed above can be used to progressively perform a multiple sequence alignment.

===ModFinder module===
[[File:An example application using the ModFinder module and the protein structure module.png|framed|right|An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO).<ref>{{cite journal |vauthors=Chen K, Jung YS, Bonagura CA |title=Azotobacter vinelandii ferredoxin I: a sequence and structure comparison approach to alteration of [4Fe-4S]2+/+ reduction potential |journal=J. Biol. Chem. |volume=277 |issue=7 |pages=5603–10 |date=February 2002 |pmid=11704670 |doi=10.1074/jbc.M108916200 |display-authors=etal|doi-access=free }}</ref> Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/ lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display]]
The ModFinder module provides new methods to identify and classify protein modifications in protein 3D structures. Over 400 different types of protein modifications, such as [[phosphorylation]], [[glycosylation]], [[Disulfide|disulfide bonds]], and metal chelation, were collected and curated based on annotations in PSI-MOD,<ref>{{cite journal |vauthors=Montecchi-Palazzi L, Beavis R, Binz PA |title=The PSI-MOD community standard for representation of protein modification data |journal=Nat. Biotechnol. |volume=26 |issue=8 |pages=864–6 |date=August 2008 |pmid=18688235 |doi=10.1038/nbt0808-864 |s2cid=205270043 |display-authors=etal}}</ref> RESID<ref>{{cite journal |author=Garavelli JS |title=The RESID Database of Protein Modifications as a resource and annotation tool |journal=Proteomics |volume=4 |issue=6 |pages=1527–33 |date=June 2004 |pmid=15174122 |doi=10.1002/pmic.200300777 |s2cid=25712150 |doi-access=free }}</ref> and RCSB PDB.<ref>{{cite journal |vauthors=Berman HM, Westbrook J, Feng Z |title=The Protein Data Bank |journal=Nucleic Acids Res. |volume=28 |issue=1 |pages=235–42 |date=January 2000 |pmid=10592235 |pmc=102472 |doi=10.1093/nar/28.1.235|display-authors=etal}}</ref> The module also provides an API for detecting pre-, co-, and post-translational protein modifications within protein structures. It can also identify phosphorylation and print all pre-loaded modifications from a structure.

===Amino acid properties module===
This module attempts to provide accurate physicochemical properties of proteins.
The properties that can be calculated using this module are as follows:
*[[Molecular mass]]
*[[Mass attenuation coefficient|Extinction coefficient]]
*[[Instability index]]
*Aliphatic index
*Grand average of hydropathy
*[[Isoelectric point]]
*Amino acid composition

The precise molecular weights for common isotopically labelled amino acids are included in this module. There also exists flexibility to define new amino acid molecules with their molecular weights using simple [[XML]] configuration files. This can be useful where the precise mass is of high importance such as [[mass spectrometry]] experiments.
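
Several of the properties listed above are simple functions of amino acid composition. The sketch below computes three of them using the published definitions: composition as mole fractions, the grand average of hydropathy (mean Kyte–Doolittle value), and the aliphatic index (Ikai's formula, 100 × (''x''<sub>Ala</sub> + 2.9 ''x''<sub>Val</sub> + 3.9 (''x''<sub>Ile</sub> + ''x''<sub>Leu</sub>)) for mole fractions ''x''). The class and method names are hypothetical and are not BioJava's API.

```java
import java.util.HashMap;
import java.util.Map;

final class PeptidePropertiesSketch {
    // Kyte-Doolittle hydropathy values for the 20 standard amino acids.
    private static final Map<Character, Double> HYDROPATHY = new HashMap<>();
    static {
        String residues = "ARNDCQEGHILKMFPSTWYV";
        double[] values = {1.8, -4.5, -3.5, -3.5, 2.5, -3.5, -3.5, -0.4, -3.2, 4.5,
                           3.8, -3.9, 1.9, 2.8, -1.6, -0.8, -0.7, -0.9, -1.3, 4.2};
        for (int i = 0; i < values.length; i++) {
            HYDROPATHY.put(residues.charAt(i), values[i]);
        }
    }

    /** Mole fraction of each residue present in the sequence. */
    public static Map<Character, Double> composition(String sequence) {
        Map<Character, Double> fractions = new HashMap<>();
        for (char residue : sequence.toCharArray()) {
            fractions.merge(residue, 1.0 / sequence.length(), Double::sum);
        }
        return fractions;
    }

    /** Grand average of hydropathy: mean hydropathy value over all residues. */
    public static double gravy(String sequence) {
        double sum = 0.0;
        for (char residue : sequence.toCharArray()) {
            Double value = HYDROPATHY.get(residue);
            if (value == null) {
                throw new IllegalArgumentException("Non-standard residue: " + residue);
            }
            sum += value;
        }
        return sum / sequence.length();
    }

    /** Aliphatic index from the mole fractions of Ala, Val, Ile and Leu. */
    public static double aliphaticIndex(String sequence) {
        Map<Character, Double> x = composition(sequence);
        double ala = x.getOrDefault('A', 0.0);
        double val = x.getOrDefault('V', 0.0);
        double ileLeu = x.getOrDefault('I', 0.0) + x.getOrDefault('L', 0.0);
        return 100.0 * (ala + 2.9 * val + 3.9 * ileLeu);
    }

    public static void main(String[] args) {
        String peptide = "MKTAYIAKQRQISFVKSHFSRQ";
        System.out.println("GRAVY: " + gravy(peptide));
        System.out.println("Aliphatic index: " + aliphaticIndex(peptide));
    }
}
```

The sketch expects a non-empty, upper-case sequence of standard residues; molecular mass and isoelectric point need per-residue mass and pK tables and are left out.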

===Protein disorder module===
The goal of this module is to provide users with ways to find disordered regions in protein molecules. BioJava includes a Java implementation of the [https://archive.today/20130415142236/http://bioinformatics.oxfordjournals.org/content/21/16/3369.full RONN] predictor. BioJava 3.0.5 makes use of Java's support for multithreading to improve performance by up to 3.2 times,<ref>{{cite journal |vauthors=Yang ZR, Thomson R, McNeil P, Esnouf RM |title=RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins |journal=Bioinformatics |volume=21 |issue=16 |pages=3369–76 |date=August 2005 |pmid=15947016 |doi=10.1093/bioinformatics/bti534 |doi-access=free }}</ref> on a modern quad-core machine, as compared to the legacy C implementation.

There are two ways to use this module:
*Using library function calls
*Using command line

Some features of this module include:

* Calculating the probability of disorder for every residue in a sequence
* Calculating the probability of disorder for every residue in the sequence for all proteins from a FASTA input file
* Getting the disordered regions of the protein for a single protein sequence or for all the proteins from a FASTA input file

===Web service access module===
In line with the growing popularity of web-based tools in bioinformatics, the web service module allows bioinformatics services to be accessed using [[Representational state transfer|REST]] protocols. Currently, two services are implemented: NCBI Blast through the Blast URLAPI (previously known as QBlast) and the HMMER web service.<ref>{{cite journal |vauthors=Finn RD, Clements J, Eddy SR |title=HMMER web server: interactive sequence similarity searching |journal=Nucleic Acids Res. |volume=39 |issue=Web Server issue |pages=W29–37 |date=July 2011 |pmid=21593126 |pmc=3125773 |doi=10.1093/nar/gkr367 |url=}}</ref>

== Comparisons with other alternatives ==
The need for customized software in the field of [[bioinformatics]] has been addressed by several groups and individuals. Similar to BioJava, [[open-source software]] projects such as [[BioPerl]], [[BioPython]], and [[BioRuby]] all provide tool-kits with multiple functionality that make it easier to create customized pipelines or analysis.

As the names suggest, the projects mentioned above use different programming languages. Because all of these APIs offer similar tools, the choice between them largely comes down to the language. For programmers who are experienced in only one of these languages, the choice is straightforward. For a well-rounded bioinformaticist who knows all of these languages and wants to choose the best one for a job, a software review of the Bio* tool-kits offers the following guidelines.<ref name="pmid12230038"/>

In general, for small programs (<500 lines) that will be used by only an individual or small group, it is hard to beat [[Perl]] and [[BioPerl]]. These constraints probably cover the needs of 90 per cent of personal bioinformatics programming.

For beginners, and for writing larger programs in the Bio domain, especially those to be shared and supported by others, [[Python (programming language)|Python’s]] clarity and brevity make it very attractive.

For those who might be leaning towards a career in bioinformatics and who want to learn only one language, [[Java (programming language)|Java]] has the widest general programming support, very good support in the Bio domain with BioJava, and is now the de facto language of business (the new COBOL, for better or worse).

Apart from these Bio* projects, there is another project called STRAP, which uses Java and pursues similar goals. The STRAP toolbox, like BioJava, is a Java toolkit for the design of bioinformatics programs and scripts. The similarities and differences between BioJava and STRAP are as follows:

'''Similarities'''
* Both provide comprehensive collections of methods for protein sequences.
* Both are used by Java programmers to code bioinformatics algorithms.
* Both separate implementations and definitions by using Java interfaces.
* Both are open source projects.
* Both can read and write many sequence file formats.

'''Differences'''
* BioJava is applicable to nucleotide and peptide sequences and can be applied to entire genomes. STRAP cannot cope with single sequences as long as an entire chromosome. Instead, STRAP manipulates peptide sequences and 3D structures of the size of single proteins. Nevertheless, it can hold a high number of sequences and structures in memory. STRAP is designed for protein sequences but can read coding nucleotide files, which are then translated to peptide sequences.
* STRAP is very fast since the graphical user interface must be highly responsive. BioJava is used where speed is less critical.
* BioJava is well designed in terms of type safety, ontology and object design. BioJava uses objects for sequences, annotations and sequence positions. Even single amino acids or nucleotides are object references. To enhance speed, STRAP avoids frequent object instantiations and invocation of non-final object-methods.
**In BioJava, peptide sequences and nucleotide sequences are lists of symbols. The symbols can be retrieved one after the other with an iterator, or sub-sequences can be obtained. The advantages are that the entire sequence does not necessarily reside in memory and that programs are less susceptible to programming errors. ''Symbol'' objects are immutable elements of an alphabet. In STRAP, however, simple byte arrays are used for sequences and float arrays for coordinates. Besides speed, the low memory consumption is an important advantage of basic data types. Classes in STRAP expose internal data. Therefore, programmers might commit programming errors such as manipulating byte arrays directly instead of using the setter methods. Another disadvantage is that STRAP performs no checks on whether the characters in sequences are valid with respect to an underlying alphabet.
**In BioJava sequence positions are realized by the class ''Location''. Discontiguous ''Location'' objects are composed of several contiguous ''RangeLocation'' objects or ''PointLocation'' objects. For the class ''StrapProtein'' however, single residue positions are indicated by integer numbers between 0 and ''countResidues()-1''. Multiple positions are given by boolean arrays. True at a given index means selected whereas false means not selected.
* BioJava throws exceptions when methods are invoked with invalid parameters. STRAP avoids the time-consuming creation of ''Throwable'' objects. Instead, errors in methods are indicated by the return values NaN, -1 or null. From a program-design standpoint, however, ''Throwable'' objects are cleaner.
* In BioJava a ''Sequence'' object is either a peptide sequence or a nucleotide sequence. A ''StrapProtein'' can hold both at the same time if a coding nucleotide sequence was read and translated into protein. Both the nucleotide sequence and the peptide sequence are contained in the same ''StrapProtein'' object. The coding or non-coding regions can be changed and the peptide sequence alters accordingly.
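
The two ways of representing residue positions described above, a boolean array versus a list of contiguous range objects, carry the same information and can be converted into each other. The following sketch shows the conversion in both directions; <code>ResidueSelectionSketch</code> and <code>Range</code> are hypothetical names and do not reproduce the actual ''Location'' or ''StrapProtein'' classes.

```java
import java.util.ArrayList;
import java.util.List;

final class ResidueSelectionSketch {
    /** Contiguous, inclusive, 0-based range of residue indices. */
    public static final class Range {
        public final int start;
        public final int end;

        public Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ".." + end + "]";
        }
    }

    /** Collapses a boolean selection mask into the contiguous ranges it contains. */
    public static List<Range> toRanges(boolean[] selected) {
        List<Range> ranges = new ArrayList<>();
        int start = -1; // -1 means "not currently inside a selected run"
        for (int i = 0; i < selected.length; i++) {
            if (selected[i] && start < 0) {
                start = i;
            } else if (!selected[i] && start >= 0) {
                ranges.add(new Range(start, i - 1));
                start = -1;
            }
        }
        if (start >= 0) {
            ranges.add(new Range(start, selected.length - 1));
        }
        return ranges;
    }

    /** Expands ranges back into a mask of the given length. */
    public static boolean[] toMask(List<Range> ranges, int length) {
        boolean[] mask = new boolean[length];
        for (Range range : ranges) {
            for (int i = range.start; i <= range.end; i++) {
                mask[i] = true;
            }
        }
        return mask;
    }

    public static void main(String[] args) {
        boolean[] mask = {false, true, true, false, true};
        System.out.println(toRanges(mask));
    }
}
```

The mask costs one element per residue regardless of how much is selected but allows constant-time lookups without allocating objects, while the range list stays small for a few long selections; this mirrors the speed-versus-design trade-off discussed in the list above.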

== Projects using BioJava ==

The following projects make use of BioJava.
* Metabolic Pathway Builder: Software suite dedicated to the exploration of connections among genes, proteins, reactions and metabolic pathways
* [http://www.dengueinfo.org/ DengueInfo] {{Webarchive|url=https://web.archive.org/web/20061208145541/http://www.dengueinfo.org/ |date=2006-12-08 }}: a Dengue genome information portal that uses BioJava in the middleware and talks to a biosql database.
* [https://biojava.org/wiki/Dazzle%3AEnsembl Dazzle]: A BioJava based DAS server.
* [[BioSense]]: A [[Plug-in (computing)|plug-in]] for the InforSense Suite, an analytics software platform by IDBS that utilizes BioJava.
* [[Bioclipse]]: A free, open source, workbench for chemo- and bioinformatics with powerful editing and visualizing abilities for molecules, sequences, proteins, spectra, etc.
* [http://www.geneinfo.eu/prompt/ PROMPT]: A free, open source framework and application for the comparison and mapping of protein sets. Uses BioJava for handling most input data formats.
* [[Cytoscape]]: An open source bioinformatics software platform to visualize molecular interaction networks.
* [https://www.bioinformatics.org/forums/forum.php?forum_id=3671 BioWeka]: An open source biological data mining application.
* [https://www.geneious.com/ Geneious]: A molecular biology toolkit.
* [https://www.ncbi.nlm.nih.gov/staff/slottad/MassSieve/ MassSieve]: An open source application to analyze mass spec proteomics data.
* [https://bip.weizmann.ac.il/toolbox/structure/seq_align.htm STRAP]: A tool for multiple sequence alignment and sequence-based structure alignment.
* [https://www.jstacs.de/index.php/Main_Page Jstacs]: A Java framework for statistical analysis and classification of biological sequences
* [https://ml.jku.at/software/LSTM_protein/ jLSTM]: "Long Short-Term Memory" for protein classification
* [http://raphaelbauer.github.io/lajolla/ LaJolla]: An open source [[structural alignment]] tool for RNA and proteins using an index structure for fast alignment of thousands of structures; includes an easy-to-use command line interface.
* [http://www.genbeans.org/ GenBeans]: A rich client platform for bioinformatics primarily focused on molecular biology and sequence analysis.
* [http://jensembl.sourceforge.net/ JEnsembl]: A version-aware Java API to Ensembl data systems.<ref>{{cite journal|vauthors=Paterson T, Law A|date=November 2012|title=JEnsembl: a version-aware Java API to Ensembl data systems|url= |journal=Bioinformatics|volume=28|issue=21|pages=2724–31|doi=10.1093/bioinformatics/bts525|pmc=3476335|pmid=22945789}}</ref>
* [https://www.kimlab.org/software/musi MUSI]: An integrated system to identify multiple specificity from very large peptide or nucleic acid data sets.<ref>{{cite journal|display-authors=etal|vauthors=Kim T, Tyndel MS, Huang H|date=March 2012|title=MUSI: an integrated system for identifying multiple specificity from very large peptide or nucleic acid data sets|url= |journal=Nucleic Acids Res.|volume=40|issue=6|pages=e47|doi=10.1093/nar/gkr1294|pmc=3315295|pmid=22210894}}</ref>
* [https://www.mdpi.com/2218-273X/10/3/461 Bioshell]: A utility library for structural bioinformatics<ref>{{cite journal|vauthors=Gront D, Kolinski A|date=February 2008|title=Utility library for structural bioinformatics|journal=Bioinformatics|volume=24|issue=4|pages=584–5|doi=10.1093/bioinformatics/btm627|pmid=18227118|doi-access=free}}</ref>

== See also ==
* [[Open Bioinformatics Foundation]]
* [[BioPerl]], [[Biopython]], [[BioRuby]], [[BioJS]]
* [[Bioclipse]]
* [[Comparison of software for molecular mechanics modeling]]

==References==
{{Reflist|2}}

==External links==
* {{Official website}}

[[Category:Bioinformatics software]]
[[Category:Java platform software]]
[[Category:Free bioinformatics software]]