{{short description|File format for DNA or protein sequences}}

{{Infobox file format
| name          = FASTA format
| icon          = 
| iconcaption   = 
| icon_size     = 
| screenshot    = 
| screenshot_size = 
| caption       = 
| _noextcode    = 
| extensions    = .fasta, .fas, .fa, .fna, .ffn, .faa, .mpfa, .frn
| _nomimecode   = 
| mime          = {{code|text/x-fasta}}
| type_code     = 
| uniform_type  = no
| conforms_to   = 
| magic         = 
| developer     = [[David J. Lipman]]<br />[[William Pearson (scientist)|William R. Pearson]]<ref name=rapid>{{cite journal | vauthors = Lipman DJ, Pearson WR | title = Rapid and sensitive protein similarity searches | journal = Science | volume = 227 | issue = 4693 | pages = 1435–41 | date = March 1985 | pmid = 2983426 | doi = 10.1126/science.2983426 | bibcode = 1985Sci...227.1435L }} {{closed access}}</ref><ref name=improved>{{cite journal | vauthors = Pearson WR, Lipman DJ | title = Improved tools for biological sequence comparison | journal = Proceedings of the National Academy of Sciences of the United States of America | volume = 85 | issue = 8 | pages = 2444–8 | date = April 1988 | pmid = 3162770 | pmc = 280013 | doi = 10.1073/pnas.85.8.2444 | bibcode = 1988PNAS...85.2444P | doi-access = free }}</ref>
| released      = 1985
| latest_release_version = 
| latest_release_date = <!-- {{start date and age|YYYY|mm|dd|df=yes/no}} -->
| genre         = [[Bioinformatics]]
| container_for = 
| contained_by  = 
| extended_from = [[ASCII]] for [[FASTA]]
| extended_to   = [[FASTQ format]]<ref name=fastq/>
| standard      = <!-- or: | standards = -->
| free          = 
| url           = {{URL|https://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml}}
}}

In [[bioinformatics]] and [[biochemistry]], the '''FASTA format''' is a text-based [[File format|format]] for representing either [[nucleotide sequence]]s or amino acid (protein) sequences, in which nucleotides or [[amino acid]]s are represented using single-letter codes.

The format allows for sequence names and comments to precede the sequences. It originated from the [[FASTA]] software package and has since become a near-universal standard in [[bioinformatics]].<ref>{{Cite web |title=What is FASTA format? |url=https://zhanggroup.org/FASTA/ |url-status=live |archive-url=https://web.archive.org/web/20221204183844/https://zhanggroup.org/FASTA/ |archive-date=2022-12-04 |access-date=2022-12-04 |website=Zhang Lab}}</ref>

The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and [[scripting language]]s.

== Overview ==
A sequence record begins with a greater-than character (">") followed by a description of the sequence, all on a single line. The lines immediately following the description line hold the sequence itself, with one letter per amino acid or nucleotide, and are typically no more than 80 characters long.

For example:

<syntaxhighlight lang="text">
>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*
</syntaxhighlight>

=== Original format ===
The original FASTA/[[William Pearson (scientist)|Pearson]] format is described in the documentation for the [[FASTA]] suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc, or fastaVN.me—where VN is the Version Number).

In the original format, a sequence was represented as a series of lines, each of which was no longer than 120 characters and usually did not exceed 80 characters. This was probably to allow for the preallocation of fixed line sizes in software: at the time most users relied on [[Digital Equipment Corporation]] (DEC) [[VT220]] (or compatible) terminals which could display 80 or 132 characters per line.<ref>{{Cite web |last=Landsteiner |first=mass:werk, Norbert |date=2019-02-20 |title=(Now Go Bang!) Raster CRT Typography (According to DEC) |url=https://www.masswerk.at/nowgobang/2019/dec-crt-typography |access-date=2024-03-15 |website=Now Go Bang! — mass:werk / Blog |language=en}}</ref><ref>{{Cite web |title=VT220 Built-in Glyphs |url=https://www.vt100.net/dec/vt220/glyphs |access-date=2024-03-15 |website=VT100}}</ref> Most people preferred the larger font of the 80-character mode, so it became recommended practice to use 80 characters or fewer (often 70) per line in FASTA files. Also, the width of a standard printed page is 70 to 80 characters (depending on the font). Hence, 80 characters became the norm.<ref>{{Cite web |title=Why is 80 characters the 'standard' limit for code width? |url=https://softwareengineering.stackexchange.com/questions/148677/why-is-80-characters-the-standard-limit-for-code-width |access-date=2024-03-15 |website=Software Engineering Stack Exchange |language=en}}</ref>

The first line in a FASTA file started with either a ">" (greater-than) symbol or, less frequently, a ";"<ref>{{Cite web |date=2023-08-01 |title=FASTA Database Format |url=https://www.loc.gov/preservation/digital/formats/fdd/fdd000622.shtml |access-date=2024-03-15 |website=www.loc.gov}}</ref> (semicolon), and was taken as a comment. Subsequent lines starting with a semicolon would be ignored by software. Since the only comment used was the first, it quickly came to hold a summary description of the sequence, often starting with a unique library accession number. Over time it has become commonplace to always use ">" for the first line and not to use ";" comments (which would otherwise be ignored).

Following the initial line (used for a unique description of the sequence) was the sequence itself, written as a string of standard one-letter codes. Anything other than a valid character would be ignored (including spaces, tabs, and asterisks). It was also common to end the sequence with an "*" (asterisk) character (by analogy with its use in PIR-formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence. Below are a few sample sequences:
<syntaxhighlight lang="text">
;LCBO - Prolactin precursor - Bovine
; a sample sequence in FASTA format
MDSKGSSQKGSRLLLLLVVSNLLLCQGVVSTPVCPNGPGNCQVSLRDLFDRAVMVSHYIHDLSS
EMFNEFDKRYAQGKGFITMALNSCHTSSLPTPEDKEQAQQTHHEVLMSLILGLLRSWNDPLYHL
VTEVRGMKGAPDAILSRAIEIEEENKRLLEGMEMIFGQVIPGAKETEPYPVWSGLPSLQTKDED
ARYSAFYNLLHCLRRDSSKIDTYLKLLNCRIIYNNNC*

>MCHU - Calmodulin - Human, rabbit, bovine, rat, and chicken
MADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMINEVDADGNGTID
FPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAELRHVMTNLGEKLTDEEVDEMIREA
DIDGDGQVNYEEFVQMMTAK*

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
</syntaxhighlight>
A multiple-sequence FASTA format, or multi-FASTA format, would be obtained by concatenating several single-sequence FASTA files in one file. This does not imply a contradiction with the format as only the first line in a FASTA file may start with a ";" or ">", forcing all subsequent sequences to start with a ">" in order to be taken as separate sequences (and further forcing the exclusive reservation of ">" for the sequence definition line). Thus, the examples above would be a multi-FASTA file if taken together.

Modern bioinformatics programs that rely on the FASTA format expect the sequence headers to be preceded by ">". The sequence is generally represented as "interleaved", or on multiple lines as in the above example, but may also be "sequential", or on a single line. Running different bioinformatics programs may require conversions between "sequential" and "interleaved" FASTA formats.
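
The conversion between "interleaved" and "sequential" layouts can be illustrated with a short Python sketch. It is not taken from any particular tool: the function names <code>read_fasta</code> and <code>format_fasta</code> are hypothetical, and the code assumes the modern convention in which every record starts with ">" and any line starting with ";" is skipped as a comment.

```python
import io
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs, joining wrapped sequence lines."""
    description = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # assumption: blank lines and ";" comments are skipped
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 0) -> str:
    """Write records back out.

    width=0 puts each sequence on one line ("sequential");
    a positive width wraps sequences at that length ("interleaved").
    """
    out: List[str] = []
    for description, sequence in records:
        out.append(">" + description)
        if width > 0:
            out.extend(sequence[i:i + width]
                       for i in range(0, len(sequence), width))
        else:
            out.append(sequence)
    return "\n".join(out) + "\n"


interleaved = (
    ">SEQUENCE_1\nMTEITAAMVK\nELRESTGAGM\n"
    ">SEQUENCE_2\nSATVSEINSE\nTDFV\n"
)
records = list(read_fasta(io.StringIO(interleaved)))
sequential = format_fasta(records)          # one sequence line per record
rewrapped = format_fasta(records, width=10)  # wrapped at 10 characters
```

Because the parser joins all sequence lines of a record, the same reader accepts both layouts; only the writer needs to know the desired line width.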

==Description line==

The description line (defline) or header/identifier line, which begins with ">", gives a name and/or a unique identifier for the sequence, and may also contain additional information. In a deprecated practice, the header line sometimes contained more than one header, separated by a ^A (Control-A) character. In the original [[William Pearson (scientist)|Pearson]] FASTA format, one or more comments, distinguished by a semicolon at the beginning of the line, may occur after the header. Some databases and bioinformatics applications do not recognize these comments and instead follow [https://www.ncbi.nlm.nih.gov/blast/fasta.shtml the NCBI FASTA specification]. An example of a multiple sequence FASTA file follows:

<syntaxhighlight lang="text">
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
</syntaxhighlight>

=== NCBI identifiers ===
The [[National Center for Biotechnology Information|NCBI]] defined a standard for the unique identifier used for the sequence (SeqID) in the header line.  This allows a sequence that was obtained from a database to be labelled with a reference to its database record.  The database identifier format is understood by the NCBI tools like <code>makeblastdb</code> and <code>table2asn</code>. The following list describes the NCBI FASTA defined format for sequence identifiers.<ref>{{cite book |title=NCBI C++ Toolkit Book |publisher=National Center for Biotechnology Information |url=https://ncbi.github.io/cxx-toolkit/pages/ch_demo#ch_demo.id1_fetch.html_ref_fasta |access-date=2018-12-19}}</ref>

{| class="wikitable sortable" style="border: 1px solid black; margin-bottom: 10px;"
|-
! Type
! Format(s)
! Example(s)
|-
| local (i.e. no database reference)
| <code>lcl&#124;''integer''</code><br />
<code>lcl&#124;''string''</code>
| <code>lcl&#124;123</code><br />
<code>lcl&#124;hmm271</code>
|-
| GenInfo backbone seqid
| <code>bbs&#124;''integer''</code>
| <code>bbs&#124;123</code>
|-
| GenInfo backbone moltype
| <code>bbm&#124;''integer''</code>
| <code>bbm&#124;123</code>
|-
| GenInfo import ID
| <code>gim&#124;''integer''</code>
| <code>gim&#124;123</code>
|-
| [https://www.ncbi.nlm.nih.gov/Genbank/index.html GenBank]
| <code>gb&#124;''accession''&#124;''locus''</code>
| <code>gb&#124;M73307&#124;AGMA13GT</code>
|-
| [http://www.embl-heidelberg.de EMBL]
| <code>emb&#124;''accession''&#124;''locus''</code>
| <code>emb&#124;CAM43271.1&#124;</code>
|-
| [https://web.archive.org/web/20140312021627/http://pir.georgetown.edu/ PIR]
| <code>pir&#124;''accession''&#124;''name''</code>
| <code>pir&#124;&#124;G36364</code>
|-
| [http://www.ebi.ac.uk/swissprot SWISS-PROT]
| <code>sp&#124;''accession''&#124;''name''</code>
| <code>sp&#124;P01013&#124;OVAX_CHICK</code>
|-
| patent
| <code>pat&#124;''country''&#124;''patent''&#124;''sequence-number''</code>
| <code>pat&#124;US&#124;RE33188&#124;1</code>
|-
| pre-grant patent
| <code>pgp&#124;''country''&#124;''application-number''&#124;''sequence-number''</code>
| <code>pgp&#124;EP&#124;0238993&#124;7</code>
|-
| [https://www.ncbi.nlm.nih.gov/projects/RefSeq RefSeq]
| <code>ref&#124;''accession''&#124;''name''</code>
| <code>ref&#124;NM_010450.1&#124;</code>
|-
| general database reference<br />(a reference to a database that's not in this list)
| <code>gnl&#124;''database''&#124;''integer''</code><br />
<code>gnl&#124;''database''&#124;''string''</code>
| <code>gnl&#124;taxon&#124;9606</code><br />
<code>gnl&#124;PID&#124;e1632</code>
|-
| GenInfo integrated database
| <code>gi&#124;''integer''</code>
| <code>gi&#124;21434723</code>
|-
| [http://www.ddbj.nig.ac.jp DDBJ]
| <code>dbj&#124;''accession''&#124;''locus''</code>
| <code>dbj&#124;BAC85684.1&#124;</code>
|-
| [http://www.prf.or.jp PRF]
| <code>prf&#124;''accession''&#124;''name''</code>
| <code>prf&#124;&#124;0806162C</code>
|-
| [https://web.archive.org/web/20080828002005/http://www.rcsb.org./pdb PDB]
| <code>pdb&#124;''entry''&#124;''chain''</code>
| <code>pdb&#124;1I4L&#124;D</code>
|-
| third-party [https://www.ncbi.nlm.nih.gov/Genbank/index.html GenBank]
| <code>tpg&#124;''accession''&#124;''name''</code>
| <code>tpg&#124;BK003456&#124;</code>
|-
| third-party [http://www.embl-heidelberg.de EMBL]
| <code>tpe&#124;''accession''&#124;''name''</code>
| <code>tpe&#124;BN000123&#124;</code>
|-
| third-party [http://www.ddbj.nig.ac.jp DDBJ]
| <code>tpd&#124;''accession''&#124;''name''</code>
| <code>tpd&#124;FAA00017&#124;</code>
|-
| TrEMBL
| <code>tr&#124;''accession''&#124;''name''</code>
| <code>tr&#124;Q90RT2&#124;Q90RT2_9HIV1</code>
|}

The vertical bars ("|") in the above list are not separators in the sense of the [[Backus–Naur form]] but are part of the format. Multiple identifiers can be concatenated, also separated by vertical bars.
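
As an illustration of how bar-separated identifiers can be taken apart, the following Python sketch splits a description line into its identifier and free-text parts and then breaks the identifier into tagged fields. It is only a sketch: the names <code>FIELD_COUNTS</code>, <code>split_defline</code> and <code>parse_seqids</code> are hypothetical, the number of fields per tag is read off the table above rather than from NCBI's own parser, and the use of the first whitespace-delimited token as the identifier is a common convention rather than part of the format.

```python
from typing import Dict, List, Tuple

# Number of fields that follow each tag, as listed in the table above.
FIELD_COUNTS: Dict[str, int] = {
    "lcl": 1, "bbs": 1, "bbm": 1, "gim": 1, "gi": 1,
    "gb": 2, "emb": 2, "pir": 2, "sp": 2, "ref": 2, "gnl": 2,
    "dbj": 2, "prf": 2, "pdb": 2, "tpg": 2, "tpe": 2, "tpd": 2, "tr": 2,
    "pat": 3, "pgp": 3,
}


def split_defline(defline: str) -> Tuple[str, str]:
    """Split a description line into (identifier, free-text description)."""
    text = defline[1:] if defline.startswith(">") else defline
    parts = text.strip().split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def parse_seqids(identifier: str) -> List[Tuple[str, Tuple[str, ...]]]:
    """Break a bar-separated identifier into (tag, fields) pairs.

    Raises ValueError for a tag that is not in FIELD_COUNTS, which
    includes plain identifiers that do not use the bar syntax at all.
    """
    tokens = identifier.split("|")
    result: List[Tuple[str, Tuple[str, ...]]] = []
    i = 0
    while i < len(tokens):
        tag = tokens[i]
        if tag == "":          # left over from a trailing "|"
            i += 1
            continue
        if tag not in FIELD_COUNTS:
            raise ValueError(f"unknown identifier tag: {tag!r}")
        count = FIELD_COUNTS[tag]
        fields = tokens[i + 1:i + 1 + count]
        fields += [""] * (count - len(fields))  # pad omitted trailing fields
        result.append((tag, tuple(fields)))
        i += 1 + count
    return result


identifier, description = split_defline(
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
)
seqids = parse_seqids(identifier)
```

Empty fields are kept as empty strings, so an identifier such as <code>pir&#124;&#124;G36364</code> yields an empty accession and the name <code>G36364</code>.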

==Sequence representation==

Following the header line, the actual sequence is represented. Sequences may be [[primary structure|protein sequences]] or [[nucleic acid]] sequences, and they can contain gaps or alignment characters (see [[sequence alignment]]). Sequences are expected to be represented in the standard [[International Union of Biochemistry and Molecular Biology|IUB]]/[[International Union of Pure and Applied Chemistry|IUPAC]] [[amino acid]] and [[nucleic acid]] codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap character; and in amino acid sequences, U and * are acceptable letters (see below). Numerical digits are not allowed but are used in some databases to indicate the position in the sequence. The nucleic acid codes supported are:<ref>
{{cite web |author=Tao Tao |date=2011-08-24 |title=Single Letter Codes for Nucleotides |url=https://www.ncbi.nlm.nih.gov/staff/tao/tools/tool_lettercode.html |url-status=dead |archive-url=https://web.archive.org/web/20120914234405/http://www.ncbi.nlm.nih.gov/staff/tao/tools/tool_lettercode.html |archive-date=2012-09-14 |access-date=2012-03-15 |work=[NCBI Learning Center] |publisher=[[National Center for Biotechnology Information]]}}</ref><ref>{{cite web |url=http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html |title=IUPAC code table |publisher=NIAS DNA Bank |url-status=dead |archive-url=https://web.archive.org/web/20110811073845/http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html |archive-date=2011-08-11 }}</ref><ref>{{cite web |title=anysymbol |url=https://mafft.cbrc.jp/alignment/software/anysymbol.html |website=MAFFT - a multiple sequence alignment program}}</ref>
{| class="wikitable sortable" style="border:solid 1px black;"
! Nucleic Acid Code
! Meaning
! Mnemonic
|-
| A
| A
| [[adenine|'''A'''denine]]
|-
| C
| C
| [[cytosine|'''C'''ytosine]]
|-
| G
| G
| [[guanine|'''G'''uanine]]
|-
| T
| T
| [[thymine|'''T'''hymine]]
|-
| U
| U
| [[uracil|'''U'''racil]]
|-
| (i)
| i
| [[inosine|'''i'''nosine]] (non-standard)
|-
| R
| A or G (I)
| [[purine|pu'''R'''ine]]
|-
| Y
| C, T or U
| [[pyrimidine|p'''Y'''rimidines]]
|-
| K
| G, T or U
| bases which are [[ketone|'''K'''etones]]
|-
| M
| A or C
| bases with [[amino|a'''M'''ino groups]]
|-
| S
| C or G
| '''S'''trong interaction
|-
| W
| A, T or U
| '''W'''eak interaction
|-
| B
| not A (i.e. C, G, T or U)
| '''B''' comes after A
|-
| D
| not C (i.e. A, G, T or U)
| '''D''' comes after C
|-
| H
| not G (i.e. A, C, T or U)
| '''H''' comes after G
|-
| V
| neither T nor U (i.e. A, C or G)
| '''V''' comes after U
|-
| N
| A, C, G, T or U
| '''N'''ucleic acid
|-
| -
| gap of indeterminate length
|
|}
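
The ambiguity codes in the table can be expressed as a lookup from each code to the set of bases it stands for. The Python sketch below is illustrative only: the names <code>NUCLEIC_CODES</code>, <code>is_valid_nucleotide_sequence</code> and <code>matches</code> are hypothetical, the non-standard inosine code is left out, and lower-case input is mapped to upper case as described above.

```python
from typing import Dict

# Each code maps to the bases it can represent, following the table above.
NUCLEIC_CODES: Dict[str, str] = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CTU", "K": "GTU", "M": "AC",
    "S": "CG", "W": "ATU",
    "B": "CGTU", "D": "AGTU", "H": "ACTU", "V": "ACG",
    "N": "ACGTU",
}


def is_valid_nucleotide_sequence(sequence: str) -> bool:
    """True if every character is a listed code or the gap character "-"."""
    return all(ch in NUCLEIC_CODES or ch == "-" for ch in sequence.upper())


def matches(pattern: str, sequence: str) -> bool:
    """True if each base of `sequence` is allowed by the code at the
    same position in `pattern` (both must have the same length)."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in NUCLEIC_CODES.get(code, "")
        for code, base in zip(pattern.upper(), sequence.upper())
    )
```

For example, <code>matches("RY", "AC")</code> holds because R covers A and Y covers C, whereas <code>matches("RY", "CA")</code> does not.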

The amino acid codes supported (22 amino acids, 4 ambiguity codes, and 2 special characters) are:
{| class="wikitable sortable" style="border:solid 1px black;"
! Amino Acid Code
! Meaning
|-
| A
| [[Alanine]]
|-
| B
| [[Aspartic acid]] (D) or [[Asparagine]] (N)
|-
| C
| [[Cysteine]]
|-
| D
| [[Aspartic acid]]
|-
| E
| [[Glutamic acid]]
|-
| F
| [[Phenylalanine]]
|-
| G
| [[Glycine]]
|-
| H
| [[Histidine]]
|-
| I
| [[Isoleucine]]
|-
| J
| [[Leucine]] (L) or [[Isoleucine]] (I)
|-
| K
| [[Lysine]]
|-
| L
| [[Leucine]]
|-
| M
| [[Methionine]]/[[Start codon]]
|-
| N
| [[Asparagine]]
|-
| O
| [[Pyrrolysine]] (rare)
|-
| P
| [[Proline]]
|-
| Q
| [[Glutamine]]
|-
| R
| [[Arginine]]
|-
| S
| [[Serine]]
|-
| T
| [[Threonine]]
|-
| U
| [[Selenocysteine]] (rare)
|-
| V
| [[Valine]]
|-
| W
| [[Tryptophan]]
|-
| Y
| [[Tyrosine]]
|-
| Z
| [[Glutamic acid]] (E) or [[Glutamine]] (Q)
|-
| X
| any
|-
| *
| translation stop
|-
| -
| gap of indeterminate length
|}
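
The rules for sequence lines (lower case mapped to upper case, digits used only as position markers, and a fixed set of accepted codes) can be combined into a small clean-up step. The following Python sketch shows one possible approach for protein sequences; the names <code>AMINO_ACID_CODES</code> and <code>normalize_protein</code> are hypothetical, and reporting unrecognised characters instead of silently dropping them is a design choice of this sketch, not a requirement of the format.

```python
import string
from typing import List, Tuple

# All codes from the table above: 26 letters plus "*" (stop) and "-" (gap).
AMINO_ACID_CODES = set(string.ascii_uppercase) | {"*", "-"}


def normalize_protein(raw: str) -> Tuple[str, List[str]]:
    """Return (cleaned sequence, list of rejected characters).

    Whitespace and digits are dropped, letters are upper-cased, and any
    other character that is not a listed code is collected for reporting.
    """
    cleaned: List[str] = []
    rejected: List[str] = []
    for ch in raw:
        if ch.isspace() or ch.isdigit():
            continue  # layout and position numbers carry no sequence data
        upper = ch.upper()
        if upper in AMINO_ACID_CODES:
            cleaned.append(upper)
        else:
            rejected.append(ch)
    return "".join(cleaned), rejected
```

A call such as <code>normalize_protein("1 madqlteeq 10\nIAEF*")</code> returns the sequence <code>MADQLTEEQIAEF*</code> and an empty list of rejected characters.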

==FASTA file==

===Filename extension===
There is no standard [[filename extension]] for a text file containing FASTA-formatted sequences. The table below lists commonly used extensions and their meanings.

{| class="wikitable sortable" style="border:solid 1px black;"
! Extension
! Meaning
! Notes
|-
|fasta, fas, fa<ref>{{cite web |url=http://www.jalview.org/help/html/io/fileformats.html |title=Alignment Fileformats |date=22 May 2019 |access-date=22 May 2019}}</ref>
| generic FASTA
| Any generic FASTA file
|-
|fna
|FASTA nucleic acid
|Used generically to specify nucleic acids
|-
|ffn
|FASTA nucleotide of gene regions
|Contains coding regions for a genome
|-
|faa
|FASTA amino acid
|Contains amino acid sequences
|-
|mpfa
|FASTA amino acids
|Contains multiple protein sequences
|-
|frn
|FASTA [[non-coding RNA]]
|Contains non-coding RNA regions for a genome, e.g. tRNA, rRNA
|}

===Compression===
Compressing FASTA files effectively calls for a specialized compressor that handles both channels of information: identifiers and sequence. For improved compression results, the data are usually divided into two streams, which are compressed under an assumption of independence. For example, the algorithm MFCompress<ref name="MFCompress">{{cite journal | vauthors = Pinho AJ, Pratas D | title = MFCompress: a compression tool for FASTA and multi-FASTA data | journal = Bioinformatics | volume = 30 | issue = 1 | pages = 117–8 | date = January 2014 | pmid = 24132931 | pmc = 3866555 | doi = 10.1093/bioinformatics/btt594 }}</ref> performs lossless compression of these files using context modelling and arithmetic encoding. Genozip,<ref name="Genozip">{{Cite journal |last1=Lan |first1=Divon |last2=Tobler |first2=Ray |last3=Souilmi |first3=Yassine |last4=Llamas |first4=Bastien |date=2021-02-15 |title=Genozip: a universal extensible genomic data compressor |url=https://doi.org/10.1093/bioinformatics/btab102 |journal=Bioinformatics |volume=37 |issue=16 |pages=2225–2230 |doi=10.1093/bioinformatics/btab102 |issn=1367-4803 |pmc=8388020 |pmid=33585897}}</ref> a software package for compressing genomic files, uses an extensible context-based model. Benchmarks of FASTA file compression algorithms have been reported by Hosseini et al. in 2016,<ref name="Morteza">{{Cite journal |last1=Hosseini |first1=Morteza |last2=Pratas |first2=Diogo |last3=Pinho |first3=Armando J. |date=2016 |title=A Survey on Data Compression Methods for Biological Sequences |journal=Information |language=en |volume=7 |issue=4 |pages=56 |doi=10.3390/info7040056 |issn=2078-2489 |doi-access=free }}</ref> and Kryukov et al. in 2020.<ref name="SCB">{{cite journal | vauthors = Kryukov K, Ueda MT, Nakagawa S, Imanishi T | title = Sequence Compression Benchmark (SCB) database—A comprehensive evaluation of reference-free compressors for FASTA-formatted sequences | journal = GigaScience | volume = 9 | issue = 7 | pages = giaa072 | date = July 2020 | pmid = 32627830 | pmc = 7336184 | doi = 10.1093/gigascience/giaa072 }}</ref>
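
The two-stream idea can be sketched in Python as follows. This is only a toy illustration of separating identifiers from sequence data: it uses the general-purpose <code>zlib</code> module rather than the context modelling and arithmetic coding of the tools named above, the function names are hypothetical, and it assumes at least one record, ASCII text without embedded newlines, and sequences already joined into single strings (the original line wrapping is not preserved).

```python
import zlib
from typing import List, Tuple


def compress_two_streams(records: List[Tuple[str, str]]) -> Tuple[bytes, bytes]:
    """Compress headers and sequences as two independent streams."""
    headers = "\n".join(header for header, _ in records).encode("ascii")
    sequences = "\n".join(seq for _, seq in records).encode("ascii")
    return zlib.compress(headers, 9), zlib.compress(sequences, 9)


def decompress_two_streams(header_blob: bytes,
                           sequence_blob: bytes) -> List[Tuple[str, str]]:
    """Rebuild the (header, sequence) pairs from the two streams."""
    headers = zlib.decompress(header_blob).decode("ascii").split("\n")
    sequences = zlib.decompress(sequence_blob).decode("ascii").split("\n")
    return list(zip(headers, sequences))


records = [
    ("SEQUENCE_1", "MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG"),
    ("SEQUENCE_2", "SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI"),
]
header_blob, sequence_blob = compress_two_streams(records)
restored = decompress_two_streams(header_blob, sequence_blob)
```

Keeping the streams apart lets each compressor model see only one kind of data, which is the motivation for the split described above.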

===Encryption===
The encryption of FASTA files can be performed with various tools, including Cryfa and Genozip. Cryfa uses AES encryption and also enables data compression.<ref name="CRYFA1">{{cite book | vauthors = Pratas D, Hosseini M, Pinho A | chapter= Cryfa: a tool to compact and encrypt FASTA files |title=11th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB) |publisher=Springer| volume = 616|date=2017|pages=305–312|doi=10.1007/978-3-319-60816-7_37| series = Advances in Intelligent Systems and Computing| isbn = 978-3-319-60815-0}}</ref><ref name="CRYFA2">{{Cite journal |last1=Hosseini |first1=Morteza |last2=Pratas |first2=Diogo |last3=Pinho |first3=Armando J |date=2019-01-01 |editor-last=Berger |editor-first=Bonnie |title=Cryfa: a secure encryption tool for genomic data |url=https://academic.oup.com/bioinformatics/article/35/1/146/5055587 |journal=Bioinformatics |language=en |volume=35 |issue=1 |pages=146–148 |doi=10.1093/bioinformatics/bty645 |issn=1367-4803 |pmc=6298042 |pmid=30020420}}</ref> Similarly, Genozip can encrypt FASTA files with AES-256 during compression.<ref name="Genozip" />

=={{anchor|Extended Format}}Extensions==
[[FASTQ format]] is a form of FASTA format extended to indicate information related to sequencing. It was created at the [[Sanger Centre]] in Cambridge.<ref name=fastq>{{cite journal | vauthors = Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM | title = The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants | journal = Nucleic Acids Research | volume = 38 | issue = 6 | pages = 1767–71 | date = April 2010 | pmid = 20015970 | pmc = 2847217 | doi = 10.1093/nar/gkp1137 }}</ref>

A2M/A3M are a family of FASTA-derived formats used for [[sequence alignment]]s. In A2M/A3M sequences, lowercase characters are taken to mean insertions, which are then indicated in the other sequences as the dot ("{{tt|.}}") character. The dots can be discarded for compactness without loss of information. As with typical FASTA files used in alignments, the gap ("{{tt|-}}") is taken to mean exactly one position.<ref>{{cite web |title=Description of A2M alignment format |url=https://compbio.soe.ucsc.edu/a2m-desc.html |url-status=dead |archive-url=https://web.archive.org/web/20220815104407/https://compbio.soe.ucsc.edu/a2m-desc.html |archive-date=2022-08-15 |work=[[SAMtools]]}}</ref> A3M is similar to A2M, with the added rule that gaps aligned to insertions can also be discarded.<ref>{{cite web |title=soedinglab/hh-suite: reformat.pl |url=https://github.com/soedinglab/hh-suite/blob/master/scripts/reformat.pl |website=GitHub |date=20 November 2022 |language=en}}</ref>
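
The column conventions described above can be shown with a minimal Python sketch. The function names <code>drop_dots</code> and <code>match_columns</code> are hypothetical, and the sketch assumes that every row already follows the A2M conventions (upper-case letters and "-" for aligned positions, lower-case letters and "." for insertions).

```python
def drop_dots(row: str) -> str:
    """Remove the "." padding characters from one aligned row."""
    return row.replace(".", "")


def match_columns(row: str) -> str:
    """Keep only the aligned positions: upper-case letters and "-" gaps.

    Lower-case insertions and "." padding are removed, so every row of
    the same alignment comes out with the same length.
    """
    return "".join(ch for ch in row if ch.isupper() or ch == "-")


rows = ["AC.GT", "ACaG-"]
compact = [drop_dots(row) for row in rows]      # dots removed, letters kept
aligned = [match_columns(row) for row in rows]  # insert columns removed
```

In this example the second row has an inserted residue <code>a</code>, which the first row pads with a dot; removing the insert columns leaves two rows of four aligned positions each.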

==Working with FASTA files==

Many community-written scripts are available for manipulating FASTA files. Online toolboxes, such as FaBox<ref name="FaBox">{{Cite journal |last=Villesen |first=P. |date=2007 |title=FaBox: an online toolbox for fasta sequences |url=https://onlinelibrary.wiley.com/doi/10.1111/j.1471-8286.2007.01821.x |journal=Molecular Ecology Notes |language=en |volume=7 |issue=6 |pages=965–968 |doi=10.1111/j.1471-8286.2007.01821.x |issn=1471-8278|url-access=subscription }}</ref> or the FASTX-Toolkit within Galaxy servers, are also available.<ref name=Galaxyserver>{{cite journal | vauthors = Blankenberg D, Von Kuster G, Bouvier E, Baker D, Afgan E, Stoler N, ((Galaxy Team)), Taylor J, Nekrutenko A | title = Dissemination of scientific software with Galaxy ToolShed | journal = Genome Biology | volume = 15 | issue = 2 | pages = 403 | date = 2014 | doi = 10.1186/gb4161 | pmid = 25001293 | pmc = 4038738 | doi-access = free }}</ref> These can be used to separate sequence headers/identifiers, rename them, shorten them, or extract sequences of interest from large FASTA files based on a list of wanted identifiers (among other functions). A tree-based approach to sorting multi-FASTA files, TREE2FASTA,<ref name=tree2fasta>{{cite journal | vauthors = Sauvage T, Plouviez S, Schmidt WE, Fredericq S | title = TREE2FASTA: a flexible Perl script for batch extraction of FASTA sequences from exploratory phylogenetic trees | journal = BMC Research Notes | volume = 11 | pages = 403 | issue = 1 | date = March 2018 | doi = 10.1186/s13104-018-3268-y | pmid = 29506565 | pmc = 5838971 | doi-access = free }}</ref> is also available; it relies on the coloring and/or annotation of sequences of interest in the FigTree viewer.
Additionally, the [[Bioconductor]] ''Biostrings'' package can be used to read and manipulate FASTA files in [[R (programming language)|R]].<ref>{{cite journal| url=https://bioconductor.org/packages/release/bioc/html/Biostrings.html | title=''Biostrings: Efficient manipulation of biological strings''. | last1=Pagès | first1=H | last2 = Aboyoun | first2=P | last3=Gentleman | first3=R | last4=DebRoy | first4=S | date=2018 | website = Bioconductor.org | publisher = R package version 2.48.0| doi=10.18129/B9.bioc.Biostrings }}</ref>
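
Extracting sequences of interest by identifier, one of the tasks mentioned above, needs only a single pass over the file. The following Python sketch is not the method of any of the tools named here: the function name <code>filter_fasta</code> is hypothetical, and it assumes that the identifier is the first whitespace-delimited token after ">".

```python
from typing import Iterable, Iterator, Set


def filter_fasta(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield the lines of every record whose identifier is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Decide once per record, then apply to its sequence lines.
            keep = bool(fields) and fields[0] in wanted
        if keep:
            yield line


lines = [
    ">SEQUENCE_1 first example\n", "MTEITAAMVK\n", "ELRESTGAGM\n",
    ">SEQUENCE_2 second example\n", "SATVSEINSE\n",
]
selected = list(filter_fasta(lines, {"SEQUENCE_2"}))
```

Because the function works on any iterable of lines, it can be applied directly to an open file object, so large FASTA files do not have to be loaded into memory.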

Several online format converters exist to rapidly reformat multi-FASTA files to different formats (e.g. NEXUS, PHYLIP) for use with different phylogenetic programs, such as the converter available on phylogeny.fr.<ref name=phylodotfr>{{cite journal | vauthors = Dereeper A, Guignon V, Blanc G, Audic S, Buffet S, Chevenet F, Dufayard JF, Guindon S, Lefort V, Lescot M, Claverie JM, Gascuel O | title = Phylogeny.fr: robust phylogenetic analysis for the non-specialist | journal = Nucleic Acids Research | volume = 36 | issue = Web Server issue | pages = W465–9 | date = July 2008 | doi = 10.1093/nar/gkn180 | pmid = 18424797 | pmc = 2447785 }}</ref>

==See also==
* The [[FASTQ format]], used to represent DNA sequencer reads along with quality scores.
* The [[SAM (file format)|SAM]] and [[CRAM (file format)|CRAM]] formats, used to represent genome sequencer reads that have been aligned to genome sequences.
* The GVF format (Genome Variation Format), an extension based on the [[GFF3]] format.

== References ==
{{Reflist|35em}}

== External links ==
*[http://www.bioconductor.org Bioconductor]
*[http://hannonlab.cshl.edu/fastx_toolkit/ FASTX-Toolkit]
*[http://tree.bio.ed.ac.uk/software/figtree/ FigTree viewer]
*[http://phylogeny.lirmm.fr/phylo_cgi/data_converter.cgi Phylogeny.fr]
*[http://bioinformatics.ua.pt/gto GTO]

{{Bioinformatics}}

[[Category:Bioinformatics]]
[[Category:Biological sequence format]]