== History of the concept ==
The document-term matrix emerged in the earliest years of the computerization of text. The increasing capacity for storing documents created the problem of retrieving a given document in an efficient manner. While the work of classifying and indexing had previously been accomplished by hand, researchers explored the possibility of doing it automatically using word frequency information.

One of the first published document-term matrices appeared in [[Harold Borko]]'s 1962 article "The construction of an empirically based mathematically derived classification system" (page 282; see also his 1965 article<ref>{{Cite journal|last=Borko|first=Harold|date=1965|title=A Factor Analytically Derived Classification System for Psychological Reports|url=http://dx.doi.org/10.2466/pms.1965.20.2.393|journal=Perceptual and Motor Skills|volume=20|issue=2|pages=393–406|doi=10.2466/pms.1965.20.2.393|pmid=14279310|s2cid=34230652|issn=0031-5125|url-access=subscription}}</ref>). Borko references two computer programs: "FEAT", which stood for "Frequency of Every Allowable Term", written by John C. Olney of the System Development Corporation, and the Descriptor Word Index Program, written by [[Eileen Stone]], also of the System Development Corporation:

<blockquote>Having selected the documents which were to make up the experimental library, the next step consisted of keypunching the entire body of text preparatory to computer processing. The program used for this analysis was FEAT (Frequency of Every Allowable Term). It was written by John C. Olney of the System Development Corporation and is designed to perform frequency and summary counts of individual words and of word pairs. The output of this program is an alphabetical listing, by frequency of occurrence, of all word types which appeared in the text. Certain function words such as and, the, at, a, etc., were placed in a "forbidden word list" table, and the frequency of these words was recorded in a separate listing... A special computer program, called the Descriptor Word Index Program, was written to provide this information and to prepare a document-term matrix in a form suitable for in-put to the Factor Analysis Program. The Descriptor Word Index program was prepared by Eileen Stone of the System Development Corporation.<ref>{{Cite book|last=Borko|first=Harold|title=Proceedings of the May 1-3, 1962, spring joint computer conference on - AIEE-IRE '62 (Spring) |chapter=The construction of an empirically based mathematically derived classification system |date=1962|pages=279–289|location=New York, New York, USA|publisher=ACM Press|doi=10.1145/1460833.1460865|isbn=9781450378758|s2cid=6483337|doi-access=free}}</ref></blockquote>

Shortly thereafter, [[Gerard Salton]] published "Some hierarchical models for automatic document retrieval" in 1963, which also included a visual depiction of a document-term matrix.<ref name=":0">{{Cite journal|last=Salton|first=Gerard|date=July 1963|title=Some hierarchical models for automatic document retrieval|url=http://dx.doi.org/10.1002/asi.5090140307|journal=American Documentation|volume=14|issue=3|pages=213–222|doi=10.1002/asi.5090140307|issn=0096-946X|url-access=subscription}}</ref> Salton was at Harvard University at the time, and his work was supported by the Air Force Cambridge Research Laboratories and Sylvania Electric Products, Inc.
In this paper, Salton introduces the document-term matrix by comparison to a kind of term-context matrix used to measure similarities between words:

<blockquote>If it is desired to generate document associations or document clusters instead of word associations, the same procedures can be used with slight modifications. Instead of starting with a word-sentence matrix ''C'',... it is now convenient to construct a word-document matrix ''F'', listing frequency of occurrence of word W<sub>i</sub> in Document D<sub>j</sub>... Document similarities can now be computed as before by comparing pairs of rows and by obtaining similarity coefficients based on the frequency of co-occurrences of the content words included in the given document. This procedure produces a document-document similarity matrix which can in turn be used for the generation of document clusters...<ref name=":0" /></blockquote>

In addition to Borko and Salton, F. W. Lancaster published a comprehensive review of automated indexing and retrieval in 1964. Although the review was published while he worked at Herner and Company in Washington, D.C., the paper was written while he was "employed in research work at Aslib, on the Aslib Cranfield Project."<ref>{{Cite journal|last=Lancaster|first=F. W.|date=1964-01-01|title=Mechanized Document Control: A Review of Some Recent Research|url=https://doi.org/10.1108/eb049960|journal=ASLIB Proceedings|volume=16|issue=4|pages=132–152|doi=10.1108/eb049960|issn=0001-253X|url-access=subscription}}</ref> Lancaster credits Borko with the document-term matrix:

<blockquote>Harold Borko, of the System Development Corporation, has carried this operation a little further. A significant group of clue words is chosen from the vocabulary of an experimental collection. These are arranged in a document/term matrix to show the frequency of occurrence of each term in each document.... A correlation coefficient for each word pair is then computed, based on their co-occurrence in the document set. The resulting term/term matrix... is then factor analysed and a series of factors are isolated. These factors, when interpreted and named on the basis of the terms with high loadings which appear in each of the factors, become the classes of an empirical classification. The terms with high loadings in each factor are the clue words or predictors of the categories.</blockquote>
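The operations these early papers describe correspond to what is now routine practice: counting term frequencies per document after removing function words (the role of Borko's "forbidden word list"), arranging the counts in a matrix, and deriving document-document or term-term association matrices from it. The sketch below is not taken from any of the cited works; it is a minimal modern illustration in Python with NumPy, using hypothetical example documents and cosine similarity as an assumed document similarity coefficient (the exact coefficients used by Salton and Borko differed).

<syntaxhighlight lang="python">
# Minimal illustration (not from the cited papers) of: building a
# word-document count matrix F, comparing documents via their term-count
# vectors, and correlating term pairs across documents as a precursor to
# factor analysis. Example documents and stop-word list are hypothetical.
import numpy as np

documents = [
    "information retrieval of text documents",
    "automatic classification of text",
    "retrieval and classification of documents",
]

# Stop words play the role of Borko's "forbidden word list".
stop_words = {"of", "and", "the", "a", "at"}

# Tokenize and build the vocabulary of allowable terms.
tokenized = [[w for w in doc.split() if w not in stop_words] for doc in documents]
vocabulary = sorted({w for doc in tokenized for w in doc})

# F[i, j] = frequency of word i in document j (a word-document matrix).
F = np.zeros((len(vocabulary), len(documents)), dtype=int)
for j, doc in enumerate(tokenized):
    for w in doc:
        F[vocabulary.index(w), j] += 1

# Document-document similarity: cosine similarity between the documents'
# term-count vectors (the columns of F).
doc_vectors = F.T.astype(float)
unit = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
doc_similarity = unit @ unit.T

# Term-term correlation matrix across documents, the kind of matrix
# Lancaster describes Borko feeding into factor analysis.
term_correlation = np.corrcoef(F)

print("Vocabulary:", vocabulary)
print("Word-document matrix F:\n", F)
print("Document-document similarity:\n", np.round(doc_similarity, 2))
print("Term-term correlation:\n", np.round(term_correlation, 2))
</syntaxhighlight>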