==== Example: Using evolutionary information to guide structure prediction ====
The KH-99 algorithm by Knudsen and Hein forms the basis of the Pfold approach to predicting RNA secondary structure.<ref name="Knudsen 2003" /> In this approach, the parameterization requires evolutionary history information derived from an alignment tree, in addition to the probabilities of columns and mutations. The grammar probabilities are estimated from a training dataset.

===== Estimate column probabilities for paired and unpaired bases =====
In a structural alignment, the probabilities of the unpaired-base columns and of the paired-base columns are independent of other columns. Counting bases in single-base positions and in paired positions gives the frequencies of bases in loops and stems.
For a basepair of {{mvar|X}} and {{mvar|Y}}, an occurrence of <math>XY</math> is also counted as an occurrence of <math>YX</math>. Identical basepairs such as <math>XX</math> are therefore counted twice.
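
The counting rule can be illustrated with a minimal Python sketch. The input format is an assumption made for the example and is not taken from the Pfold software: the alignment is a list of equal-length strings, and <code>pairs</code> is a hypothetical dictionary mapping each paired column to its partner column.

```python
from collections import Counter

BASES = "ACGU"

def column_frequencies(alignment, pairs):
    """Estimate base frequencies in unpaired and paired columns.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    pairs: dict mapping column i to its partner column j (hypothetical format).
    """
    paired_cols = set(pairs) | set(pairs.values())
    single = Counter()
    double = Counter()
    for seq in alignment:
        # Unpaired (loop) positions: count each observed base once.
        for i, base in enumerate(seq):
            if i not in paired_cols and base in BASES:
                single[base] += 1
        # Paired (stem) positions: XY is also counted as YX,
        # so an identical pair XX is counted twice.
        for i, j in pairs.items():
            x, y = seq[i], seq[j]
            if x in BASES and y in BASES:
                double[x + y] += 1
                double[y + x] += 1
    n_single = sum(single.values())
    n_double = sum(double.values())
    p_single = {b: single[b] / n_single for b in BASES}
    p_double = {pair: n / n_double for pair, n in double.items()}
    return p_single, p_double

# Two sequences whose first and last columns form a basepair.
p_single, p_double = column_frequencies(["GAAAC", "GAUAC"], {0: 4})
```

The sketch assumes that at least one unpaired base and one basepair are observed; an implementation for real data would need to handle empty counts.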

===== Calculate mutation rates for paired and unpaired bases =====
Overall mutation rates are estimated by pairing the sequences in all possible ways. To recover plausible mutations, a sequence identity threshold is used so that only similar sequences are compared. This approach uses an 85% identity threshold between the paired sequences.
First, differences at single-base positions between the sequences of each pair are counted, excluding gapped columns: if the same position in the two sequences has different bases {{mvar|X, Y}}, the count of the difference is incremented for each sequence.

 {{nowrap|if <math>X\ne Y</math>}}
                {{nowrap|<math>C_{\text{XY}} \leftarrow C_{\text{XY}} + 1</math> (first sequence of the pair)}}
                {{nowrap|<math>C_{\text{YX}} \leftarrow C_{\text{YX}} + 1</math> (second sequence of the pair)}}

 {{nowrap|Calculate mutation rates.}}
                {{nowrap|Let  <math>r_{\text{XY}}= </math> the rate of mutation of base X to base Y <math>= \frac {K~C_{\text{XY}}} {P_{X}P_{s}}</math>}}
                {{nowrap|Let  <math>r_{\text{XX}}= </math> the negative of the rate of mutation of X to other bases <math>= - \sum_{Y \ne X} r_{\text{XY}}</math>}}
                {{nowrap|<math>P_{s} =</math> the probability that the base is not paired.}}
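
The counting and rate calculation can be sketched in Python as follows. Several details are assumptions made for the example: sequence identity is measured over ungapped columns, <code>p_base</code> is taken to be the frequency of each base in unpaired positions, and the constant {{mvar|K}}, which the formula above does not define, is left as a caller-supplied scaling factor.

```python
from itertools import combinations

BASES = "ACGU"

def identity(a, b):
    """Fraction of ungapped aligned positions at which two sequences agree."""
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in cols) / len(cols) if cols else 0.0

def count_differences(alignment, unpaired_cols, threshold=0.85):
    """Count base differences C[X][Y] over all sufficiently similar pairs."""
    counts = {x: {y: 0 for y in BASES if y != x} for x in BASES}
    for a, b in combinations(alignment, 2):
        if identity(a, b) < threshold:
            continue  # too dissimilar to give plausible mutations
        for i in unpaired_cols:
            x, y = a[i], b[i]
            # Gapped columns are skipped; a difference is counted
            # once for each sequence of the pair.
            if x in BASES and y in BASES and x != y:
                counts[x][y] += 1
                counts[y][x] += 1
    return counts

def rate_matrix(counts, p_base, p_s, k=1.0):
    """r_XY = K * C_XY / (P_X * P_s); r_XX = -(sum of the other rates in the row)."""
    rates = {}
    for x in BASES:
        rates[x] = {y: k * counts[x][y] / (p_base[x] * p_s)
                    for y in BASES if y != x}
        rates[x][x] = -sum(rates[x].values())
    return rates
```

Each row of the resulting matrix sums to zero, as expected of a rate matrix.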

For unpaired bases a 4 × 4 mutation rate matrix is used that satisfies the condition that the mutation flow from X to Y is reversible:<ref name="Tavaré 1986" />
:                <math>P_X r_{\text{XY}} = P_Y r_{\text{YX}}</math>
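Under the counting scheme described above this condition holds by construction, as a short check shows. Every observed difference increments both counts, so <math>C_{\text{XY}} = C_{\text{YX}}</math>, and substituting the rate definition gives
:                <math>P_X r_{\text{XY}} = P_X \frac{K~C_{\text{XY}}}{P_X P_s} = \frac{K~C_{\text{XY}}}{P_s} = \frac{K~C_{\text{YX}}}{P_s} = P_Y r_{\text{YX}}</math>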
For basepairs a 16 × 16 rate distribution matrix is similarly generated.<ref name="Muse 1995" /><ref name="Schöniger  1994" />
The PCFG is used to predict the prior probability distribution of the structure, whereas posterior probabilities are estimated by the inside-outside algorithm, and the most likely structure is found by the CYK algorithm.<ref name="Knudsen 2003" />

===== Estimate alignment probabilities =====
After the column prior probabilities have been calculated, the alignment probability is estimated by summing over all possible secondary structures. Each column {{mvar|C}} of an alignment {{mvar|D}} of length {{mvar|l}}, where <math>D=(C_1,~C_2, \ldots, C_l )</math>, can be scored under a secondary structure <math>\sigma</math> with respect to the alignment tree {{mvar|T}} and the mutational model {{mvar|M}}. The prior distribution given by the PCFG is <math>P(\sigma|M)</math>. The phylogenetic tree, {{mvar|T}}, can be calculated from the model by maximum likelihood estimation. Note that gaps are treated as unknown bases and that the summation can be done through [[dynamic programming]].<ref name="Baker 1979" />
:       <math>P(D|T,M)</math>
:               <math>=\sum_{\sigma} P (D, \sigma |T,M)</math>
:               <math>=\sum_{\sigma} P(D|\sigma, T, M) P(\sigma|T,M)</math>
:               <math>=\sum_{\sigma} P(D|\sigma,T,M) P(\sigma|M)</math>
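
How a single unpaired column might be scored on a tree, with gaps treated as unknown bases, can be sketched with the standard pruning recursion for tree likelihoods. This is an illustration under stated assumptions rather than the Pfold implementation: the Jukes–Cantor model stands in for the estimated rate matrix, the base frequencies are uniform, and the tree encoding (nested lists of <code>(child, branch_length)</code> tuples with leaf names as strings) is hypothetical.

```python
import math

BASES = "ACGU"

def jc_transition(t):
    """Jukes-Cantor transition probabilities for branch length t.

    A stand-in for the transition matrix of the estimated rate matrix.
    """
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return {x: {y: (same if x == y else diff) for y in BASES} for x in BASES}

def partial_likelihood(node, column):
    """Likelihood of the data below `node` for each base at `node`.

    node: a leaf name (str) or a list of (child, branch_length) tuples.
    column: dict mapping leaf name to its base, with '-' for a gap.
    """
    if isinstance(node, str):
        base = column[node]
        # A gap is an unknown base: it is compatible with every base.
        return {x: 1.0 if base not in BASES or base == x else 0.0
                for x in BASES}
    result = {x: 1.0 for x in BASES}
    for child, t in node:
        p = jc_transition(t)
        below = partial_likelihood(child, column)
        for x in BASES:
            result[x] *= sum(p[x][y] * below[y] for y in BASES)
    return result

def column_likelihood(tree, column):
    """Sum over the root base, weighted by uniform base frequencies."""
    root = partial_likelihood(tree, column)
    return sum(0.25 * root[x] for x in BASES)

# Hypothetical three-leaf tree; leaf "b" has a gap in this column.
tree = [("a", 0.1), ([("b", 0.2), ("c", 0.3)], 0.05)]
likelihood = column_likelihood(tree, {"a": "A", "b": "-", "c": "A"})
```

Because a gapped leaf contributes a factor of one for every base, a column consisting only of gaps has likelihood 1 in this sketch. A paired column would be handled analogously with 16 states instead of 4.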

===== Assign production probabilities to each rule in the grammar =====
Each rule in the grammar is assigned a production probability derived from the structures of the training dataset. These prior probabilities weight the predictions and thereby affect their accuracy.<ref name="Knudsen 1999" /><ref name="Lari and Young 1990" /><ref name="Lari and Young 1991" /> The probability of each rule reflects how often that rule is used in the training dataset. These probabilities are written in parentheses in the grammar formalism, and the probabilities of the alternative productions for each nonterminal sum to 100%.<ref name="Knudsen 2003" /> For instance:

:              <math> S \to LS (80\%) |L (20\%)</math>
:              <math>L \to s (70\%) | dFd (30\%)</math>
:              <math>F \to dFd (60.4\%)| LS (39.6\%)</math>
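
The way these probabilities combine can be illustrated with a small Python sketch that multiplies the probabilities of the rules used in one leftmost derivation. The data layout and the function names are hypothetical; only the rule probabilities are taken from the example grammar above.

```python
import math

# Production probabilities copied from the example grammar above.
GRAMMAR = {
    "S": {("L", "S"): 0.80, ("L",): 0.20},
    "L": {("s",): 0.70, ("d", "F", "d"): 0.30},
    "F": {("d", "F", "d"): 0.604, ("L", "S"): 0.396},
}

def check_normalized(grammar):
    """Check that the alternatives of every nonterminal sum to 1."""
    return all(math.isclose(sum(alts.values()), 1.0)
               for alts in grammar.values())

def leftmost_derivation(grammar, start, choices):
    """Apply each chosen right-hand side to the leftmost nonterminal.

    Returns the derived terminal string and the product of the
    probabilities of the rules used.
    """
    form = [start]
    prob = 1.0
    for rhs in choices:
        idx = next(i for i, sym in enumerate(form) if sym in grammar)
        prob *= grammar[form[idx]][rhs]
        form[idx:idx + 1] = rhs
    if any(sym in grammar for sym in form):
        raise ValueError("derivation is incomplete")
    return "".join(form), prob

# One unpaired base followed by a basepair enclosing two unpaired bases.
choices = [("L", "S"), ("s",), ("L",), ("d", "F", "d"),
           ("L", "S"), ("s",), ("L",), ("s",)]
string, prob = leftmost_derivation(GRAMMAR, "S", choices)  # string == "sdssd"
```

The resulting product is the prior probability that the grammar assigns to this particular derivation, before any sequence data are taken into account.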

===== Predict the structure likelihood =====
Given the prior alignment frequencies of the data, the most likely structure from the ensemble predicted by the grammar can then be computed by maximizing <math>P(\sigma|D,T,M)</math> through the CYK algorithm. The structure with the highest predicted number of correct predictions is reported as the consensus structure.<ref name="Knudsen 2003" />

: <math>\sigma_{MAP}= \underset{\sigma}{\arg\max}\, P(D| \sigma,T^{ML}, M) P(\sigma|M)</math>
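
The maximization can be illustrated on a toy example in which the candidate structures are enumerated explicitly. This is only a sketch: the CYK algorithm performs the same maximization by dynamic programming without enumerating structures, and the dot-bracket strings and scores below are hypothetical values chosen for the example.

```python
import math

def map_structure(candidates):
    """Pick the structure maximizing P(D | sigma, T, M) * P(sigma | M).

    candidates: dict mapping a structure to a pair
    (log-likelihood, log-prior); the values are hypothetical inputs.
    Returns the best structure and the posterior over the candidates.
    """
    joint = {s: loglik + logprior
             for s, (loglik, logprior) in candidates.items()}
    best = max(joint, key=joint.get)
    # Log-sum-exp of the joint scores gives log P(D | T, M)
    # restricted to the enumerated candidates.
    m = max(joint.values())
    log_marginal = m + math.log(sum(math.exp(v - m) for v in joint.values()))
    posterior = {s: math.exp(v - log_marginal) for s, v in joint.items()}
    return best, posterior

candidates = {
    "(...)": (math.log(0.02), math.log(0.3)),  # better fit, lower prior
    ".....": (math.log(0.01), math.log(0.7)),  # worse fit, higher prior
}
best, posterior = map_structure(candidates)
```

In this example the unpaired structure is selected even though its likelihood is lower, because the prior from the grammar outweighs the difference.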

===== Pfold improvements on the KH-99 algorithm =====
PCFG-based approaches should be scalable and sufficiently general, and any trade-off of speed for accuracy needs to be as small as possible. Pfold addresses the limitations of the KH-99 algorithm with respect to scalability, gaps, speed and accuracy.<ref name="Knudsen 2003" />
*In Pfold gaps are treated as unknown. In this sense the probability of a gapped column equals that of an ungapped one. 
*In Pfold the tree {{mvar|T}} is calculated prior to structure prediction through neighbor joining and not by maximum likelihood through the PCFG grammar. Only the branch lengths are adjusted to maximum likelihood estimates. 
*An assumption of Pfold is that all sequences have the same structure. Applying a sequence identity threshold and allowing a 1% probability that any nucleotide becomes another limit the performance deterioration due to alignment errors.