Probabilistic context-free grammar
==== Example: Using evolutionary information to guide structure prediction ====
The KH-99 algorithm by Knudsen and Hein forms the basis of the Pfold approach to predicting RNA secondary structure.<ref name="Knudsen 2003" /> In this approach the parameterization requires evolutionary history information derived from an alignment tree, in addition to the probabilities of columns and mutations. The grammar probabilities are estimated from a training dataset.

===== Estimate column probabilities for paired and unpaired bases =====
In a structural alignment, the probabilities of the unpaired-base columns and of the paired-base columns are independent of other columns. Counting bases in single-base positions and in paired positions gives the frequencies of bases in loops and stems. For a basepair {{mvar|X}} and {{mvar|Y}}, an occurrence of <math>XY</math> is also counted as an occurrence of <math>YX</math>. Identical basepairs such as <math>XX</math> are counted twice.

===== Calculate mutation rates for paired and unpaired bases =====
Overall mutation rates are estimated by pairing the sequences in all possible ways. To recover plausible mutations, a sequence identity threshold is applied so that only similar sequences are compared; this approach uses a threshold of 85% identity between the paired sequences. First, differences at single-base positions (excluding gapped columns) are counted between the sequences of each pair: if the same position holds different bases {{mvar|X}} and {{mvar|Y}} in the two sequences, the difference count is incremented for each sequence.
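
The pairwise counting just described can be sketched in Python as follows. This is only an illustration and not the Pfold implementation: the function names, the <code>unpaired_columns</code> argument, and the way identity is measured (the fraction of matching bases over columns in which neither sequence has a gap) are assumptions made for the example.

```python
from itertools import combinations

BASES = "ACGU"

def identity(a, b):
    """Fraction of identical bases over columns where neither sequence is gapped.
    ASSUMPTION: the text does not say how identity is measured."""
    cols = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in cols) / len(cols) if cols else 0.0

def count_differences(alignment, unpaired_columns, threshold=0.85):
    """Count single-base differences C[X][Y] over all sufficiently similar pairs.

    alignment        -- list of equal-length aligned sequences ('-' marks a gap)
    unpaired_columns -- indices of the single-base (unpaired) columns (hypothetical input)
    threshold        -- minimum identity for a pair to be compared (85% in the text)
    """
    counts = {x: {y: 0 for y in BASES} for x in BASES}
    for a, b in combinations(alignment, 2):
        if identity(a, b) < threshold:
            continue  # too dissimilar: observed differences may hide multiple mutations
        for i in unpaired_columns:
            x, y = a[i], b[i]
            if x not in BASES or y not in BASES or x == y:
                continue  # skip gaps, unknown symbols, and identical bases
            counts[x][y] += 1  # difference seen from the first sequence
            counts[y][x] += 1  # same difference seen from the second sequence
    return counts
```

Because every difference is counted in both directions, the resulting count matrix is symmetric.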
: {{nowrap|For each compared position where <math>X\ne Y</math>:}}
:: {{nowrap|<math> C_{\text{XY}} +1</math> (first sequence of the pair)}}
:: {{nowrap|<math> C_{\text{YX}} +1</math> (second sequence of the pair)}}
: {{nowrap|Calculate mutation rates:}}
:: {{nowrap|Let <math>r_{\text{XY}}= </math> rate of mutation of base X to base Y <math>= \frac {K~C_{\text{XY}}} {P_{X}P_{s}}</math>}}
:: {{nowrap|Let <math>r_{\text{XX}}= </math> the negative of the rate of mutation of X to other bases <math>= - \sum_{Y \ne X} r_{\text{XY}}</math>}}
:: {{nowrap|<math>P_{s} =</math> the probability that the base is not paired.}}

For unpaired bases a 4 × 4 mutation rate matrix is used that satisfies the condition that the mutation flow from X to Y is reversible:<ref name="Tavaré 1986" />
: <math>P_{X} r_{\text{XY}} = P_{Y} r_{\text{YX}}</math>

For basepairs a 16 × 16 rate distribution matrix is generated in the same way.<ref name="Muse 1995" /><ref name="Schöniger 1994" /> The PCFG is used to predict the prior probability distribution of the structure, whereas posterior probabilities are estimated by the inside-outside algorithm and the most likely structure is found by the CYK algorithm.<ref name="Knudsen 2003" />

===== Estimate alignment probabilities =====
After the column prior probabilities have been calculated, the alignment probability is estimated by summing over all possible secondary structures. Any column {{mvar|C}} in a secondary structure <math>\sigma</math> for a sequence {{mvar|D}} of length {{mvar|l}}, such that <math>D=(C_1,~C_2, \ldots, C_l )</math>, can be scored with respect to the alignment tree {{mvar|T}} and the mutational model {{mvar|M}}. The prior distribution given by the PCFG is <math>P(\sigma|M)</math>. The phylogenetic tree {{mvar|T}} can be calculated from the model by maximum likelihood estimation.
Note that gaps are treated as unknown bases, and the summation can be done through [[dynamic programming]].<ref name="Baker 1979" />
: <math>P(D|T,M)</math>
: <math>=\sum_{\sigma} P (D, \sigma |T,M)</math>
: <math>=\sum_{\sigma} P(D|\sigma, T, M) P(\sigma|T,M)</math>
: <math>=\sum_{\sigma} P(D|\sigma,T,M) P(\sigma|M)</math>

===== Assign production probabilities to each rule in the grammar =====
Each rule in the grammar is assigned a production probability derived from the structures of the training dataset. These prior probabilities affect the accuracy of the predictions.<ref name="Knudsen 1999" /><ref name="Lari and Young 1990" /><ref name="Lari and Young 1991" /> The number of times each rule is used depends on how often the corresponding grammar feature is observed in the training dataset. The probabilities are written in parentheses in the grammar formalism, and the alternatives for each nonterminal sum to 100%.<ref name="Knudsen 2003" /> For instance:
: <math>S \to LS\ (80\%) \mid L\ (20\%)</math>
: <math>L \to s\ (70\%) \mid dFd\ (30\%)</math>
: <math>F \to dFd\ (60.4\%) \mid LS\ (39.6\%)</math>

===== Predict the structure likelihood =====
Given the prior alignment frequencies of the data, the most likely structure from the ensemble predicted by the grammar can then be computed by maximizing <math>P(\sigma|D,T,M)</math> through the CYK algorithm. The structure with the highest predicted number of correct predictions is reported as the consensus structure.<ref name="Knudsen 2003" />
: <math>\sigma_{MAP}= \arg\max_{\sigma} P(D| \sigma,T^{ML}, M) P(\sigma|M)</math>

===== Pfold improvements on the KH-99 algorithm =====
PCFG-based approaches should be scalable and sufficiently general, and any trade-off between speed and accuracy should be kept as small as possible. Pfold addresses the limitations of the KH-99 algorithm with respect to scalability, gaps, speed and accuracy.<ref name="Knudsen 2003" />
*In Pfold gaps are treated as unknown. In this sense the probability of a gapped column equals that of an ungapped one.
*In Pfold the tree {{mvar|T}} is calculated prior to structure prediction through neighbor joining and not by maximum likelihood through the PCFG. Only the branch lengths are adjusted to maximum likelihood estimates.
*An assumption of Pfold is that all sequences have the same structure. A sequence identity threshold, together with allowing a 1% probability that any nucleotide becomes another, limits the performance deterioration due to alignment errors.
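
To make the rule probabilities and the CYK step from the preceding sections concrete, the following Python sketch finds the most probable derivation of a single sequence under the illustrative grammar given above. It is a simplified illustration and not the Pfold implementation. In particular, the emission probabilities (<code>single_prob</code>, <code>pair_prob</code>) are toy values assumed for this example, whereas Pfold scores alignment columns with the phylogenetic model; all function names are hypothetical.

```python
from functools import lru_cache

# Rule probabilities from the illustrative grammar in the text.
P_S_LS, P_S_L = 0.80, 0.20     # S -> LS | L
P_L_S, P_L_DFD = 0.70, 0.30    # L -> s  | dFd
P_F_DFD, P_F_LS = 0.604, 0.396 # F -> dFd | LS

# ASSUMPTION: toy emission model, not the column probabilities used by Pfold.
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}

def single_prob(x):
    """Probability of emitting an unpaired base (uniform toy value)."""
    return 0.25

def pair_prob(x, y):
    """Probability of emitting a basepair (uniform over canonical pairs)."""
    return 1.0 / 6.0 if x + y in CANONICAL else 0.0

def most_likely_structure(seq):
    """Return (probability, dot-bracket string) of the best parse of seq from S."""

    @lru_cache(maxsize=None)
    def best(symbol, i, j):
        # Best derivation of seq[i:j] from `symbol`; probability 0.0 means none.
        candidates = [(0.0, "")]

        def stem(rule_prob):
            # X -> d F d: pair the two end bases around an F-derived interior.
            if j - i >= 4:
                p_in, s_in = best("F", i + 1, j - 1)
                p = rule_prob * pair_prob(seq[i], seq[j - 1]) * p_in
                candidates.append((p, "(" + s_in + ")"))

        def bifurcation(rule_prob):
            # X -> L S: try every split point k between i and j.
            for k in range(i + 1, j):
                p_l, s_l = best("L", i, k)
                p_s, s_s = best("S", k, j)
                candidates.append((rule_prob * p_l * p_s, s_l + s_s))

        if symbol == "S":
            bifurcation(P_S_LS)
            p_l, s_l = best("L", i, j)
            candidates.append((P_S_L * p_l, s_l))
        elif symbol == "L":
            if j - i == 1:
                candidates.append((P_L_S * single_prob(seq[i]), "."))
            stem(P_L_DFD)
        else:  # symbol == "F"
            stem(P_F_DFD)
            bifurcation(P_F_LS)
        return max(candidates, key=lambda c: c[0])

    return best("S", 0, len(seq))

prob, structure = most_likely_structure("GGGAAAACCC")
print(structure)  # (((....)))
```

The memoized recursion fills the same table as a bottom-up CYK pass, keeping the maximum instead of the sum for each nonterminal and span. Replacing <code>max</code> with a sum over candidates would give the inside probability instead. The recursion depth grows with sequence length, so this form is suitable only for short sequences.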