Editing Probabilistic context-free grammar (section)

==Grammar construction==
Context-free grammars are represented as a set of rules inspired from attempts to model natural languages.<ref name="Chomsky_1956" /><ref name="Chomsky_1959" /><ref name="Chomsky_1957" /> The rules are absolute and have a typical syntax representation known as [[Backus–Naur form]]. The production rules consist of terminal '''<math>\left \{a, b \right\}</math>''' and non-terminal '''{{mvar|S}}''' symbols and a blank '''<math>\epsilon</math>''' may also be used as an end point.  In the production rules of CFG and PCFG the left side has only one nonterminal whereas the right side can be any string of terminal or nonterminals. In PCFG nulls are excluded.<ref name="Durbin 1998" /> An example of a grammar:
                                     
:                                     <math>
                                        S \to aS, S \to bS, S \to \epsilon
     </math>                               
This grammar can be shortened using the '|' ('or') character into:

:                                     <math>    S\to aS | bS |\epsilon </math>

Terminals in a grammar are words and through the grammar rules a non-terminal symbol is transformed into a string of either terminals and/or non-terminals. The above grammar is read as "beginning from a non-terminal {{mvar|S}} the emission can generate either {{mvar|a}} or {{mvar|b}} or <math>\epsilon</math>".
Its derivation is: 
:                                      <math>S \Rightarrow aS\Rightarrow abS\Rightarrow abbS \Rightarrow abb</math>

[[Ambiguous grammar]] may result in ambiguous parsing if applied on [[homograph]]s since the same word sequence can have more than one interpretation. [[pun|Pun sentences]] such as the newspaper headline "Iraqi Head Seeks Arms" are an example of ambiguous parses.

One strategy of dealing with ambiguous parses (originating with grammarians as early as [[Pāṇini]]) is to add yet more rules, or prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become difficult to manage. Another difficulty is overgeneration, where unlicensed structures are also generated.

Probabilistic grammars circumvent these problems by ranking various productions on frequency weights, resulting in a "most likely" (winner-take-all) interpretation.  As usage patterns are altered in [[Historical linguistics|diachronic]] shifts, these probabilistic rules can be re-learned, thus updating the grammar.

Assigning probability to production rules makes a PCFG. These probabilities are informed by observing distributions on a training set of similar composition to the language to be modeled. On most samples of broad language, probabilistic grammars where probabilities are estimated from data typically outperform hand-crafted grammars. CFGs when contrasted with PCFGs are not applicable to RNA structure prediction because while they incorporate sequence-structure relationship they lack the scoring metrics that reveal a sequence structural potential <ref name="Dowell 2004" />