Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Document-term matrix
(section)
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
==General concept== When creating a data-set of [[term (language)|terms]] that appear in a corpus of [[document]]s, the document-term matrix contains rows corresponding to the documents and columns corresponding to the terms. Each ''ij'' cell, then, is the number of times word ''j'' occurs in document ''i''. As such, each row is a vector of term counts that represents the content of the document corresponding to that row. For instance if one has the following two (short) documents: *D1 = "I like databases" *D2 = "I dislike databases", then the document-term matrix would be: {| border="1" cellspacing="0" ! ||I||like||dislike||databases |- align=center |'''D1'''||1||1||0||1 |- align=center |'''D2'''||1||0||1||1 |} which shows which documents contain which terms and how many times they appear. Note that, unlike representing a document as just a token-count list, the document-term matrix includes all terms in the corpus (i.e. the corpus vocabulary), which is why there are zero-counts for terms in the corpus which do not also occur in a specific document. For this reason, document-term matrices are usually stored in a sparse matrix format. As a result of the power-law distribution of tokens in nearly every corpus (see [[Zipf's law]]), it is common to weight the counts. This can be as simple as dividing counts by the total number of tokens in a document (called relative frequency or proportions), dividing by the maximum frequency in each document (called prop max), or taking the log of frequencies (called log count). If one desires to weight the words most unique to an individual document as compared to the corpus as a whole, it is common to use [[tf-idf]], which divides the term frequency by the term's document frequency.
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)