Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
TeX
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
===Hyphenation and justification=== In comparison with manual typesetting, the problem of [[Justification (typesetting)|justification]] is easy to solve with a digital system such as TeX, which, provided that good points for line breaking have been defined, can automatically spread the spaces between words to fill in the line. The problem is thus to find the set of breakpoints that will give the most visually pleasing result. Many line-breaking algorithms use a [[greedy algorithm|''first-fit'' approach]], where the breakpoints for each line are determined one after the other, and no breakpoint is changed after it has been chosen.<ref>{{Citation | first = Michael P | last = Barnett | title = Computer Typesetting: Experiments and Prospects | place = [[Cambridge, Massachusetts|Cambridge]], [[Massachusetts|MA]] | publisher = [[MIT Press]] | date = 1965}}</ref> Such a system is not able to define a breakpoint depending on the effect that it will have on the following lines. In comparison, the ''total-fit'' line-breaking algorithm used by TeX and developed by Donald Knuth and Michael Plass <ref>url=http://svn.tug.org/interviews/plass.html</ref> considers ''all'' the possible breakpoints in a paragraph, and finds the combination of line breaks that will produce the most globally pleasing arrangement. Formally, the algorithm defines a value called ''badness'' associated with each possible line break; the badness is increased if the spaces on the line must stretch or shrink too much to make the line the correct width. Penalties are added if a breakpoint is particularly undesirable: for example, if a word must be [[hyphen]]ated, if two lines in a row are hyphenated, or if a very loose line is immediately followed by a very tight line. The algorithm will then find the breakpoints that will minimize the sum of squares of the badness (including penalties) of the resulting lines. If the paragraph contains <math>n</math> possible breakpoints, the number of situations that must be evaluated naively is <math>2^n</math>. However, by using the method of [[dynamic programming]], the complexity of the algorithm can be brought down to <math>O(n^2)</math> (see [[Big O notation]]). Further simplifications (for example, not testing extremely unlikely breakpoints such as a hyphenation in the first word of a paragraph, or very overfull lines) lead to an efficient algorithm whose running time is <math>O(n w)</math>, where <math>w</math> is the width of a line. A similar algorithm is used to determine the best way to break paragraphs across two pages, in order to avoid [[Widow (typesetting)|widows]] or [[Orphan (typesetting)|orphans]] (lines that appear alone on a page while the rest of the paragraph is on the following or preceding page). However, in general, a thesis by Michael Plass shows how the page-breaking problem can be [[NP-complete]] because of the added complication of placing figures.{{Sfn | Knuth | Plass | 1981}} TeX's line-breaking algorithm has been adopted by several other programs, such as [[Adobe InDesign]] (a [[desktop publishing]] [[Computer application|application]])<ref>{{Citation | publisher = [[Advogato]] | url = http://www.advogato.org/article/28.html | type = interview | title = Donald E. Knuth| journal = TUGboat | volume = 21 | date = 2000 | pages = 103β10 | access-date = 26 December 2005 | archive-url = https://web.archive.org/web/20090122043044/http://www.advogato.org/article/28.html | archive-date = 22 January 2009 | url-status = dead }}</ref> and the [[GNU]] [[fmt (Unix)|fmt]] [[Unix]] [[command line]] utility.<ref>{{Citation|publisher=GNU Project |chapter-url=https://www.gnu.org/software/coreutils/manual/html_node/fmt-invocation.html#fmt-invocation |title=Core GNU utilities (GNU coreutils) manual |chapter=4.1 fmt: Reformat paragraph text |date=2016 }}</ref> If no suitable line break can be found for a line, the system will try to hyphenate a word. The original version of TeX used a hyphenation algorithm based on a set of rules for the removal of prefixes and suffixes of words, and for deciding if it should insert a break between the two consonants in a pattern of the form [[vowel]]β[[consonant]]β[[consonant]]β[[vowel]] (which is possible most of the time).{{Sfn | Liang | 1983 | p = 3}} TeX82 introduced a new hyphenation algorithm, designed by [[Frank Liang]] in 1983, to assign priorities to breakpoints in letter groups. A list of hyphenation patterns is first generated automatically from a corpus of hyphenated words (a list of 50,000 words). If TeX must find the acceptable hyphenation positions in the word ''encyclopedia'', for example, it will consider all the subwords of the extended word ''.encyclopedia.'', where ''.'' is a special marker to indicate the beginning or end of the word. The list of subwords includes all the subwords of length 1 (''.'', ''e'', ''n'', ''c'', ''y'', etc.), of length 2 (''.e'', ''en'', ''nc'', etc.), etc., up to the subword of length 14, which is the word itself, including the markers. TeX will then look into its list of hyphenation patterns, and find subwords for which it has calculated the desirability of hyphenation at each position. In the case of our word, 11 such patterns can be matched, namely <sub>1</sub>c<sub>4</sub>l<sub>4</sub>, <sub>1</sub>cy, <sub>1</sub>d<sub>4</sub>i<sub>3</sub>a, <sub>4</sub>edi, e<sub>3</sub>dia, <sub>2</sub>i<sub>1</sub>a, ope<sub>5</sub>d, <sub>2</sub>p<sub>2</sub>ed, <sub>3</sub>pedi, pedia<sub>4</sub>, y<sub>1</sub>c. For each position in the word, TeX will calculate the ''maximum value'' obtained among all matching patterns, yielding en<sub>1</sub>cy<sub>1</sub>c<sub>4</sub>l<sub>4</sub>o<sub>3</sub>p<sub>4</sub>e<sub>5</sub>d<sub>4</sub>i<sub>3</sub>a<sub>4</sub>. Finally, the acceptable positions are those indicated by an [[even and odd numbers|odd]] number, yielding the acceptable hyphenations ''en-cy-clo-pe-di-a''. This system based on subwords allows the definition of very general patterns (such as <sub>2</sub>i<sub>1</sub>a), with low indicative numbers (either odd or even), which can then be superseded by more specific patterns (such as <sub>1</sub>d<sub>4</sub>i<sub>3</sub>a) if necessary. These patterns find about 90% of the hyphens in the original dictionary; more importantly, they do not insert any spurious hyphen. In addition, a list of exceptions (words for which the patterns do not predict the correct hyphenation) are included with the Plain TeX format; additional ones can be specified by the user.{{Sfn | Liang | 1983}}{{Rp | needed = yes | date = March 2013}}<ref>{{Citation | title = The TeXbook | chapter = Appendix H: Hyphenation | pages = 449β55}}</ref>
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)