
== Overview of nucleotide sequence analysis (DNA & RNA) ==

Nucleotide sequence analysis identifies functional elements such as protein binding sites, uncovers genetic variations such as SNPs, characterizes gene expression patterns, and helps explain the genetic basis of traits. It also helps clarify the mechanisms behind processes such as replication and transcription. Some of the tasks involved are outlined below.

=== Quality control and preprocessing === 
Quality control assesses the quality of the reads produced by a sequencing platform (e.g. [[Illumina, Inc.|Illumina]]). It is the first step in sequence analysis and limits wrong conclusions caused by poor-quality data. The tools used at this stage depend on the sequencing platform. For instance, FastQC checks the quality of short reads (including RNA sequences), NanoPlot or pycoQC are used for [[third-generation sequencing|long-read sequences]] (e.g. Nanopore reads), and MultiQC aggregates FastQC results into a single web page report.<ref>{{cite web |last1=Batut |first1=Bérénice |last2=Doyle |first2=Maria |last3=Cormier |first3=Alexandre |last4=Bretaudeau |first4=Anthony |last5=Leroi |first5=Laura |last6=Corre |first6=Erwan |last7=Robin |first7=Stéphanie |last8=nil |first8=gallantries |last9=Hyde |first9=Cameron |title=Quality Control (Galaxy Training Materials) |url=https://training.galaxyproject.org/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html |website=Galaxy Training! |date=3 November 2023 |access-date=26 April 2024}}</ref><ref name=galaxy1>{{cite journal |last1=Hiltemann |first1=Saskia |last2=Rasche |first2=Helena |last3=Gladman |first3=Simon | display-authors = 2 |title=Galaxy Training: A Powerful Framework for Teaching! |journal=PLOS Computational Biology |date=January 2023 |volume=19 |issue=1 |pages=e1010752 |doi=10.1371/journal.pcbi.1010752 |doi-access=free |pmid=36622853 |pmc=9829167 |bibcode=2023PLSCB..19E0752H }}</ref><ref name=galaxy2>{{cite journal |last1=Batut |first1=Bérénice |last2=Hiltemann |first2=Saskia | display-authors = 1 |title=Community-Driven Data Analysis Training for Biology |journal=Cell Systems |date=2018 |volume=6 |issue=6 |pages=752–758.e1 |doi=10.1016/j.cels.2018.05.012 |pmid=29953864 |pmc=6296361 |url=https://doi.org/10.1016%2Fj.cels.2018.05.012}}</ref>

Quality control provides information such as read lengths, [[GC-content|GC content]], presence of adapter sequences (for short reads), and a quality score, which is often expressed on a [[phred quality score|PHRED scale]].<ref name=sequence_analysis>{{cite journal |last1=Prijibelski |first1=Andrey B. |last2=Korobeynikov |first2=Anton I. |last3=Lapidus |first3=Alla L. |title=Sequence Analysis |journal=Encyclopedia of Bioinformatics and Computational Biology |date=September 2018 |volume=3 |pages=292–322 |doi=10.1016/B978-0-12-809633-8.20106-4 |isbn=978-0-12-811432-2 |url=https://doi.org/10.1016/B978-0-12-809633-8.20106-4|url-access=subscription }}</ref> If adapters or other artifacts from PCR amplification are present in the reads (particularly short reads), they are removed using software such as Trimmomatic<ref>{{cite journal |last1=Bolger |first1=Anthony M. |last2=Lohse |first2=Marc |last3=Usadel |first3=Bjoern |title=Trimmomatic: a flexible trimmer for Illumina sequence data |journal=Bioinformatics |date=April 2014 |volume=30 |issue=15 |pages=2114–2120 |doi=10.1093/bioinformatics/btu170 |pmid=24695404 |pmc=4103590 |url=https://doi.org/10.1093/bioinformatics/btu170}}</ref> or Cutadapt.<ref>{{cite journal |last1=Marcel |first1=Martin |title=Cutadapt removes adapter sequences from high-throughput sequencing reads |journal=EMBnet.journal |date=2011 |volume=17 |page=10 |doi=10.14806/ej.17.1.200 |url=https://doi.org/10.14806/ej.17.1.200}}</ref>
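
As a minimal illustration of two of these metrics, the PHRED relationship ''Q'' = −10 log<sub>10</sub> ''P'' and the GC content of a read can be sketched in Python. This is not how FastQC or any tool named above is implemented; the function names and the example read are hypothetical, and a Phred+33 ASCII encoding of the quality string is assumed.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into PHRED scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


def error_probability(q: float) -> float:
    """Convert a PHRED score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a sequence."""
    sequence = sequence.upper()
    if not sequence:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Hypothetical read: nine high-quality bases followed by one low-quality base.
read_sequence = "GATTACAGGC"
read_quality = "IIIIIIIII#"

scores = phred_scores(read_quality)
mean_quality = sum(scores) / len(scores)
print(scores)
print(f"mean Q = {mean_quality:.1f}, GC = {gc_content(read_sequence):.2f}")
print(f"P(error) at Q30 = {error_probability(30):.4f}")
```

Under these assumptions, a score of 30 corresponds to an estimated error rate of 1 in 1,000 base calls, which is why low-scoring bases at read ends are common candidates for trimming.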

=== Read alignment ===

At this step, sequencing reads whose quality has been improved are mapped to a [[reference genome]] using alignment tools like BWA<ref>{{cite journal |last1=Li |first1=Heng |last2=Durbin |first2=Richard |title=Fast and accurate short read alignment with Burrows–Wheeler transform |journal=Bioinformatics |date=July 2009 |volume=25 |issue=14 |pages=1754–1760 |doi=10.1093/bioinformatics/btp324 |pmid=19451168 |url=https://doi.org/10.1093/bioinformatics/btp324}}</ref> for short DNA sequence reads, minimap<ref>{{cite journal |last1=Li |first1=Heng |title=Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences |journal=Bioinformatics |date=March 2016 |volume=32 |issue=14 |pages=2103–2110 |doi=10.1093/bioinformatics/btw152 |pmid=27153593 |pmc=4937194 |url=https://doi.org/10.1093/bioinformatics/btw152}}</ref> for long-read DNA sequences, and STAR<ref>{{cite journal |last1=Dobin |first1=Alexander |last2=Davis |first2=Carrie A. |last3=Schlesinger |first3=Felix | display-authors = 2 |title=STAR: ultrafast universal RNA-seq aligner |journal=Bioinformatics |date=October 2012 |volume=29 |issue=1 |pages=15–21 |doi=10.1093/bioinformatics/bts635 |pmid=23104886 |url=https://doi.org/10.1093/bioinformatics/bts635|pmc=3530905 }}</ref> for RNA sequence reads. The purpose of mapping is to find the origin of any given read based on the reference sequence. It is also important for detecting variations and for [[molecular phylogenetics|phylogenetic studies]].
The output of this step, the aligned reads, is stored in the SAM format, which contains information about the reference genome as well as the individual reads. The binary [[Binary Alignment Map|BAM]] format is often preferred because it uses much less disk space.<ref name=sequence_analysis/>
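
To make the structure of an alignment record concrete, the following Python sketch parses the mandatory tab-separated columns of a single SAM line. The <code>SamRecord</code> class, the function names, and the example read are hypothetical; in practice, libraries and tools such as SAMtools are used instead of hand-written parsers.

```python
from dataclasses import dataclass


@dataclass
class SamRecord:
    """A subset of the 11 mandatory SAM alignment columns."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; header lines (starting with '@') are not accepted."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@") or len(fields) < 11:
        raise ValueError("not a SAM alignment line")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Hypothetical alignment: a 10-base read mapped to the reverse strand of chr1.
example = "read1\t16\tchr1\t100\t60\t10M\t*\t0\t0\tGATTACAGGC\tIIIIIIIIII"
record = parse_sam_line(example)
print(record.rname, record.pos, record.cigar, record.is_reverse)
```

The flag column packs several yes/no properties into one integer, so individual properties are read with bitwise tests, as in <code>is_unmapped</code> and <code>is_reverse</code>.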

'''Note''': This is different from sequence alignment which compares two or more whole sequences (or sequence regions) to quantify similarity or differences or to identify an unknown sequence (as discussed below).

'''The following analysis steps are specific to DNA sequences:'''

=== Variant calling === 

Identifying variants is a popular aspect of sequence analysis as variants often contain information of biological significance, such as explaining the mechanism of drug resistance in an infectious disease. These variants could be single nucleotide variants (SNVs), small insertions/deletions (indels), and large [[structural variation|structural variants]]. The read alignments are sorted using [[SAMtools]], after which variant callers such as GATK<ref>{{cite journal |last1=McKenna |first1=Aaron |last2=Hanna |first2=Matthew |last3=Banks |first3=Eric | display-authors = 2 |title=The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data |journal=Genome Research |date=July 2010 |volume=20 |issue=9 |pages=1297–1303 |doi=10.1101/gr.107524.110 |pmid=20644199 |pmc=2928508 |url=https://doi.org/10.1101/gr.107524.110}}</ref> are used to identify differences compared to the reference sequence. 

The choice of variant calling tool depends heavily on the sequencing technology used, so GATK is often used when working with short reads, while long read sequences require tools like DeepVariant<ref>{{cite journal |last1=Poplin |first1=R |last2=Chang |first2=PC |last3=Alexander |first3=D |display-authors=2 |title=A universal SNP and small-indel variant caller using deep neural networks |journal=Nature Biotechnology |date=September 2018 |volume=36 |issue=10 |pages=983–987 |doi=10.1038/nbt.4235 |pmid=30247488 |url=https://doi.org/10.1038/nbt.4235}}</ref> and Sniffles.<ref>{{cite journal |last1=Sedlazeck |first1=F.J. |last2=Rescheneder |first2=P |last3=Smolka |first3=M |display-authors=2 |title=Accurate detection of complex structural variations using single-molecule sequencing |journal=Nature Methods |date=April 2018 |volume=15 |issue=6 |pages=461–468 |doi=10.1038/s41592-018-0001-7 |pmid=29713083 |pmc=5990442 |url=https://doi.org/10.1038/s41592-018-0001-7}}</ref> Tools may also differ based on organism (prokaryotes or eukaryotes), source of sequence data (cancer vs [[metagenomics|metagenomic]]), and variant type of interest (SNVs or structural variants). The output of variant calling is typically in [[Variant Call Format|vcf format]], and can be filtered using allele frequencies, quality scores, or other factors based on the research question at hand.<ref name=sequence_analysis/>
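
The filtering step can be illustrated with a short Python sketch that keeps only records passing a quality threshold and an allele frequency threshold. The thresholds, the function names, and the three example records are hypothetical, and the sketch assumes the caller wrote an <code>AF</code> key in the INFO column, which not every variant caller does.

```python
def parse_info(info: str) -> dict[str, str]:
    """Split a VCF INFO column into key/value pairs (flags map to an empty string)."""
    fields = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        if key and key != ".":
            fields[key] = value
    return fields


def filter_variants(lines, min_qual=30.0, min_af=0.05):
    """Return (chrom, pos, ref, alt) for records passing both thresholds."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, the column header, and blank lines
        chrom, pos, _id, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
        if qual == "." or float(qual) < min_qual:
            continue
        af_field = parse_info(info).get("AF")
        if af_field is None:
            continue  # assumption: records without AF are excluded
        # Multi-allelic records list one AF per ALT allele; keep if any passes.
        if max(float(value) for value in af_field.split(",")) < min_af:
            continue
        kept.append((chrom, int(pos), ref, alt))
    return kept


# Hypothetical records: only the first passes both thresholds.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t101\t.\tA\tG\t50\tPASS\tDP=30;AF=0.45",
    "chr1\t250\t.\tC\tT\t12\tPASS\tDP=8;AF=0.50",
    "chr2\t77\t.\tG\tGA\t99\tPASS\tDP=40;AF=0.01",
]
passing = filter_variants(vcf_lines)
print(passing)
```

Appropriate thresholds depend on the caller, the sequencing depth, and the research question, so the values above should be read only as placeholders.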

=== Variant annotation ===

This step adds context to the variant data using curated information from peer-reviewed papers and publicly available databases like gnomAD and [[Ensembl genome database project|Ensembl]]. Variants can be annotated with information about genomic features, functional consequences, regulatory elements, and population frequencies using tools like ANNOVAR or SnpEff,<ref>{{cite journal |last1=Cingolani |first1=P |last2=Platts |first2=A |last3=Wang |first3=L |display-authors=2 |title=A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff |journal=Fly |date=April 2012 |volume=6 |issue=2 |pages=80–92 |doi=10.4161/fly.19695 |pmid=22728672 |pmc=3679285 |url=https://doi.org/10.4161/fly.19695}}</ref> or custom scripts and pipelines. The output from this step is an annotation file in BED or plain-text format.<ref name=sequence_analysis/>
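
The simplest form of annotation, attaching the names of overlapping genomic features to each variant, can be sketched in Python as follows. The feature list, gene names, and function name are hypothetical, and dedicated tools such as ANNOVAR or SnpEff do considerably more (for example, predicting functional consequences). The sketch assumes features use BED-style coordinates (0-based, end-exclusive) while variant positions are 1-based, as in VCF.

```python
def annotate(variants, features):
    """Label each (chrom, pos) variant with the names of overlapping features.

    variants: iterable of (chrom, pos) with 1-based positions.
    features: iterable of (chrom, start, end, name) in BED-style coordinates.
    """
    annotated = []
    for chrom, pos in variants:
        hits = [
            name
            for feature_chrom, start, end, name in features
            # Convert the 1-based position to 0-based before the interval test.
            if feature_chrom == chrom and start <= pos - 1 < end
        ]
        annotated.append((chrom, pos, hits or ["intergenic"]))
    return annotated


# Hypothetical features and variants.
features = [
    ("chr1", 100, 200, "geneA"),
    ("chr1", 150, 300, "geneB"),
]
variants = [("chr1", 101), ("chr1", 200), ("chr1", 301), ("chr2", 50)]

for row in annotate(variants, features):
    print(row)
```

A linear scan over all features is adequate only for small inputs; annotation of whole genomes typically relies on indexed or sorted interval structures.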

=== Visualization and interpretation === 

Genomic data, such as read alignments, coverage plots, and variant calls, can be visualized using [[genome browser]]s like IGV (Integrative Genomics Viewer) or UCSC Genome Browser. Interpretation of the results is done in the context of the biological question or hypothesis under investigation. The output can be a graphical representation of data in the forms of Circos plots, volcano plots, etc., or other forms of report describing the observations.<ref name=sequence_analysis/>

DNA sequence analysis could also involve statistical modeling to infer relationships and epigenetic analysis, like identifying differential methylation regions using a tool like DSS.

'''The following steps are specific to RNA sequences:'''
 
=== Gene expression analysis === 

Mapped RNA sequences are analyzed to estimate gene expression levels using quantification tools such as HTSeq,<ref>{{cite journal |last1=Anders |first1=Simon |last2=Pyl |first2=Paul Theodore |last3=Huber |first3=Wolfgang |title=HTSeq—a Python framework to work with high-throughput sequencing data |journal=Bioinformatics |date=January 2015 |volume=31 |issue=2 |pages=166–169 |doi=10.1093/bioinformatics/btu638 |pmid=25260700 |url=https://doi.org/10.1093/bioinformatics/btu638|pmc=4287950 }}</ref> and identify differentially expressed genes (DEGs) between experimental conditions using statistical methods like [[DESeq2]].<ref>{{cite journal |last1=Love |first1=M.I. |last2=Huber |first2=W. |last3=Anders |first3=S. |title=Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 |journal=Genome Biology |date=December 2014 |volume=15 |issue=12 |page=550 |doi=10.1186/s13059-014-0550-8 |doi-access=free |pmid=25516281 |pmc=4302049 }}</ref> This is carried out to compare the expression levels of genes or isoforms between or across different samples, and infer biological relevance.<ref name=sequence_analysis/>
The output of gene expression analysis is typically a table in which rows correspond to gene IDs or names and columns to samples, with values representing expression levels, together with standard errors and p-values. The results in the table can be further visualized using volcano plots and heatmaps, where colors represent the estimated expression level. Packages like ggplot2 in R and Matplotlib in Python are often used to create the visuals. The table can also be annotated using a reference annotation file, usually in [[gene transfer format|GTF or GFF]] format, to provide more context about the genes, such as the chromosome name, strand, and start and end positions, and to aid result interpretation.<ref name=sequence_analysis/><ref name=galaxy1/><ref name=galaxy2/><ref name=galaxy3>{{cite web |last1=Batut |first1=Bérénice |last2=Freeberg |first2=Mallory |last3=Heydarian |first3=Mo |display-authors=2 |title=Reference-based RNA-Seq data analysis (Galaxy Training Materials) |url=https://training.galaxyproject.org/training-material/topics/transcriptomics/tutorials/ref-based/tutorial.html |website=Galaxy Training! |date=17 March 2024 |access-date=26 April 2024}}</ref>
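
The coordinates of a volcano plot can be derived from such a table with a few lines of Python. In this sketch each gene is placed at (log<sub>2</sub> fold change, −log<sub>10</sub> adjusted p-value) and labelled as up-regulated, down-regulated, or not significant. The cut-offs, the function names, and the four example rows are illustrative assumptions and do not come from DESeq2 or from any real data set.

```python
import math


def classify(log2_fc: float, padj: float, fc_cutoff: float = 1.0, alpha: float = 0.05) -> str:
    """Label a gene using a fold-change cut-off and a significance level."""
    if padj < alpha and log2_fc >= fc_cutoff:
        return "up"
    if padj < alpha and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"  # not significant


def volcano_points(rows):
    """Return (gene, x, y, label) with x = log2 fold change and y = -log10(padj)."""
    points = []
    for gene, log2_fc, padj in rows:
        # Guard against log10(0) when a p-value underflows to zero.
        y = -math.log10(max(padj, 1e-300))
        points.append((gene, log2_fc, y, classify(log2_fc, padj)))
    return points


# Made-up results table: (gene, log2 fold change, adjusted p-value).
results = [
    ("geneA", 2.5, 0.001),
    ("geneB", -1.8, 0.01),
    ("geneC", 0.3, 0.2),
    ("geneD", 3.0, 0.2),
]

points = volcano_points(results)
for point in points:
    print(point)
```

The resulting tuples can be passed to a plotting library such as Matplotlib, with the label used to choose the color of each point. Note that a large fold change alone (as for the hypothetical <code>geneD</code>) does not make a gene significant under these cut-offs.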

=== Functional enrichment analysis === 

Functional enrichment analysis identifies biological processes, pathways, and functional impacts associated with differentially expressed genes obtained from the previous step. It uses tools like GOSeq<ref>{{cite journal |last1=Young |first1=M.D. |last2=Wakefield |first2=M.J |last3=Smythe |first3=G.K |display-authors=2 |title=Gene ontology analysis for RNA-seq: accounting for selection bias |journal=Genome Biology |date=February 2010 |volume=11 |issue=2 |pages=R14 |doi=10.1186/gb-2010-11-2-r14 |doi-access=free |pmid=20132535 |hdl=11343/56416 |hdl-access=free }}</ref> and Pathview.<ref>{{cite journal |last1=Luo |first1=Weijun |last2=Brouwer |first2=Cory |title=Pathview: an R/Bioconductor package for pathway-based data integration and visualization |journal=Bioinformatics |date=June 2013 |volume=29 |issue=14 |pages=1830–1831 |doi=10.1093/bioinformatics/btt285 |pmid=23740750 |pmc=3702256 |url=https://doi.org/10.1093/bioinformatics/btt285}}</ref> This creates a table with information about which pathways and molecular processes are associated with the differentially expressed genes, which genes are down- or upregulated, and which [[gene ontology]] terms are recurrent or over-represented.<ref name=sequence_analysis/><ref name=galaxy1/><ref name=galaxy2/><ref name=galaxy3/>
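
Over-representation of a term is commonly assessed with a hypergeometric (one-sided Fisher) test, which the Python sketch below implements from first principles. This is a plain hypergeometric test and not the GOSeq method, which additionally corrects for gene-length bias; the function name and the gene counts are hypothetical.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, selected: int, overlap: int) -> float:
    """Probability of seeing at least `overlap` annotated genes by chance.

    population: total number of genes considered (N)
    annotated:  genes in the population carrying the term (K)
    selected:   differentially expressed genes (n)
    overlap:    differentially expressed genes carrying the term (k)
    """
    upper = min(annotated, selected)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, selected)


# Hypothetical counts: 20 genes in total, 5 carry the term,
# 4 are differentially expressed, and 3 of those 4 carry the term.
p_value = enrichment_p_value(population=20, annotated=5, selected=4, overlap=3)
print(f"p = {p_value:.4f}")
```

Because many terms are usually tested at once, the resulting p-values are normally adjusted for multiple testing before terms are reported as over-represented.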

RNA sequence analysis explores gene expression dynamics and regulatory mechanisms underlying biological processes and diseases. Interpretation of images and tables is carried out within the context of the hypotheses being investigated.

''See also: [[Transcriptomic technologies]].''