
==Representations==

Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol marks a column in which the aligned residues are identical; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this typically means assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the [[conservation (genetics)|conservation]] of a given amino acid substitution. For multiple sequence alignments, the last row is often the [[consensus sequence]] determined by the alignment; the consensus sequence is also often represented in graphical format with a [[sequence logo]] in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.<ref name=Schneider>{{cite journal| journal=Nucleic Acids Res | volume=18 | pages=6097–6100 | year=1990 |author1=Schneider TD |author2=Stephens RM | title=Sequence logos: a new way to display consensus sequences |pmid=2172928 |pmc=332411 |url=|doi=10.1093/nar/18.20.6097| issue=20}}</ref>
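As a rough illustration of the text representation described above, the following Python sketch derives an identity marker row and a simple majority consensus from aligned rows. The function names and the three example rows are hypothetical, and only the asterisk (identity) symbol is produced; the colon and period symbols would additionally require a table of substitution groups, which is not modelled here.

```python
from collections import Counter

def conservation_line(rows):
    """Return a marker row with '*' under every fully identical, gap-free column."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    marks = []
    for column in zip(*rows):
        residues = set(column)
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical else " ")
    return "".join(marks)

def consensus(rows):
    """Return the most common character of each column.

    Ties go to the character seen first in the column; a gap can win
    if it is the most common character.
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

# Hypothetical aligned rows, written one sequence per row.
rows = [
    "GTCGTAGAATA",
    "GTCGTAG--TA",
    "GACGTAGAATA",
]
for row in rows:
    print(row)
print(conservation_line(rows))
print(consensus(rows))
```

A majority vote is only one of several ways to define a consensus; a sequence logo conveys more, because it shows how strongly each column is conserved rather than only the winning character.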

Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow only a limited number of input and output formats, such as [[FASTA format]] and [[GenBank]] format, and the output is not easily editable. Several conversion programs that provide graphical and/or command-line interfaces are available {{Dead link|date=August 2009}}, such as [https://web.archive.org/web/20071024223546/http://bioweb.pasteur.fr/seqanal/interfaces/readseq.html READSEQ] and [[EMBOSS]]. Several programming packages also provide this conversion functionality, such as [[BioPython]], [[BioRuby]] and [[BioPerl]]. [[SAM (file format)|SAM/BAM files]] use the CIGAR (Compact Idiosyncratic Gapped Alignment Report) string format to represent the alignment of a sequence to a reference by encoding a sequence of events (e.g. match/mismatch, insertions, deletions).<ref>{{Cite web|url=https://samtools.github.io/hts-specs/SAMv1.pdf|title=Sequence Alignment/Map Format Specification}}</ref>

===CIGAR Format===
Ref.  :        GTCGTAGAATA <br />
[[Read (biology)|Read]]:             CACGTAG--TA <br />
CIGAR: 2S5M2D2M <br />
where: <br />
2S = 2 soft-clipped bases (these could be mismatches, or the read could be longer than the matched sequence) <br />
5M = 5 matches or mismatches <br />
2D = 2 deletions (reference bases absent from the read) <br />
2M = 2 matches or mismatches
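The example above can be reproduced with a short Python sketch that splits a CIGAR string into its operations and writes the read out in reference coordinates. The function names are hypothetical, and the sketch simply drops soft-clipped and inserted bases when rendering, which is one possible choice rather than a prescribed behaviour.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def gapped_read(read, cigar):
    """Render the read against the reference: deletions become '-',
    soft-clipped and inserted bases are left out."""
    out = []
    pos = 0  # current position in the (ungapped) read
    for n, op in parse_cigar(cigar):
        if op in "M=X":        # aligned bases: copy them
            out.append(read[pos:pos + n])
            pos += n
        elif op in "DN":       # reference bases with no read base
            out.append("-" * n)
        elif op in "SI":       # read bases with no reference position
            pos += n
        # H and P consume neither sequence, so nothing to do
    return "".join(out)

print(parse_cigar("2S5M2D2M"))
print(gapped_read("CACGTAGTA", "2S5M2D2M"))
```

Here the ungapped read `CACGTAGTA` yields `CGTAG--TA`, which lines up with the reference starting at its third base.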

The original CIGAR format from the [https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate exonerate alignment program] used the M character for both matches and mismatches and so did not distinguish between them.

The SAMv1 specification defines newer CIGAR codes. In most cases it is preferable to use the '=' and 'X' characters to denote matches and mismatches respectively, rather than the older 'M' character, which is ambiguous.
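Because 'M' hides whether the bases agree, turning it into '=' and 'X' requires the read and reference sequences themselves. The following Python sketch shows one way this could be done; `refine_cigar` and its `ref_start` parameter (a 0-based offset of the first aligned reference base) are hypothetical names, and the sketch does not merge a new '=' run with an adjacent '=' run already present in the input.

```python
import re
from itertools import groupby

def refine_cigar(cigar, read, ref, ref_start=0):
    """Rewrite every 'M' run as '=' and 'X' runs by comparing bases."""
    out = []
    q, r = 0, ref_start  # positions in read and reference
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op == "M":
            flags = [
                "=" if read[q + i].upper() == ref[r + i].upper() else "X"
                for i in range(n)
            ]
            # Collapse consecutive equal flags into run-length operations.
            out.extend(f"{len(list(run))}{flag}" for flag, run in groupby(flags))
        else:
            out.append(f"{n}{op}")
        if op in "MIS=X":   # operations that consume the query
            q += n
        if op in "MDN=X":   # operations that consume the reference
            r += n
    return "".join(out)

ref = "GTCGTAGAATA"
print(refine_cigar("2S5M2D2M", "CACGTAGTA", ref, ref_start=2))
print(refine_cigar("2S5M2D2M", "CACGAAGTA", ref, ref_start=2))
```

The first call uses the read from the example in this section; the second uses a made-up read with one substituted base to show an 'X' run.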

{| class="wikitable"
! CIGAR Code
! BAM Integer
! Description
! Consumes query
! Consumes reference
|-
| M||0||alignment match (can be a sequence match or mismatch)||yes||yes
|-
| I||1||insertion to the reference||yes||no
|-
| D||2||deletion from the reference||no||yes
|-
| N||3||skipped region from the reference||no||yes
|-
| S||4||soft clipping (clipped sequences present in SEQ)||yes||no
|-
| H||5||hard clipping (clipped sequences NOT present in SEQ)||no||no
|-
| P||6||padding (silent deletion from padded reference)||no||no
|-
| =||7||sequence match||yes||yes
|-
| X||8||sequence mismatch||yes||yes
|}
* "Consumes query" and "consumes reference" indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
* H can only be present as the first and/or last operation.
* S may only have H operations between them and the ends of the CIGAR string.
* For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
* The sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.
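
The "consumes" columns and the rules listed above can be checked mechanically. The Python sketch below computes how many query and reference bases a CIGAR string spans and reports violations of the H, S and SEQ-length rules; the function names and message texts are hypothetical, and the check covers only the rules listed in this section.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")      # "Consumes query" is yes
CONSUMES_REFERENCE = set("MDN=X")  # "Consumes reference" is yes

def cigar_lengths(cigar):
    """Return (query bases consumed, reference bases consumed)."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    query = sum(n for n, op in ops if op in CONSUMES_QUERY)
    reference = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return query, reference

def check_cigar(cigar, seq):
    """Return a list of rule violations (empty if none were found)."""
    codes = "".join(op for _, op in CIGAR_OP.findall(cigar))
    problems = []
    if "H" in codes[1:-1]:
        problems.append("H must be the first and/or last operation")
    core = codes.strip("H")  # ignore hard clips at either end
    if "S" in core[1:-1]:
        problems.append("only H may separate S from the ends of the string")
    if cigar_lengths(cigar)[0] != len(seq):
        problems.append("M/I/S/=/X lengths do not sum to len(SEQ)")
    return problems

print(cigar_lengths("2S5M2D2M"))
print(check_cigar("2S5M2D2M", "CACGTAGTA"))
print(check_cigar("5M2S3M", "A" * 10))
```

For `2S5M2D2M`, both the query and the reference span work out to 9 bases: the soft clip counts toward the read only and the deletion counts toward the reference only.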