Open main menu
Home
Random
Recent changes
Special pages
Community portal
Preferences
About Wikipedia
Disclaimers
Incubator escapee wiki
Search
User menu
Talk
Dark mode
Contributions
Create account
Log in
Editing
Document retrieval
Warning:
You are not logged in. Your IP address will be publicly visible if you make any edits. If you
log in
or
create an account
, your edits will be attributed to your username, along with other benefits.
Anti-spam check. Do
not
fill this in!
'''Document retrieval''' is defined as the matching of some stated user query against a set of [[free-text]] records. These records could be any type of mainly [[natural language|unstructured text]], such as [[newspaper article]]s, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words. Document retrieval is sometimes referred to as, or as a branch of, '''text retrieval'''. Text retrieval is a branch of [[information retrieval]] where the information is stored primarily in the form of [[natural language|text]]. Text databases became decentralized thanks to the [[personal computer]]. Text retrieval is a critical area of study today, since it is the fundamental basis of all [[internet]] [[search engine]]s. ==Description== Document retrieval systems find information to given criteria by matching text records (''documents'') against user queries, as opposed to [[expert system]]s that answer questions by [[Inference|inferring]] over a logical [[knowledge base|knowledge database]]. A document retrieval system consists of a database of documents, a [[classification algorithm]] to build a full text index, and a user interface to access the database. A document retrieval system has two main tasks: # Find relevant documents to user queries # Evaluate the matching results and sort them according to relevance, using algorithms such as [[PageRank]]. Internet [[search engines]] are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using [[statistical]] or [[natural language processing]] techniques. ==Variations== There are two main classes of indexing schemata for document retrieval systems: ''form based'' (or ''word based''), and ''content based'' indexing. The document classification scheme (or [[Search engine indexing|indexing algorithm]]) in use determines the nature of the document retrieval system. ===Form based=== Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A [[suffix tree]] algorithm is an example for form based indexing. ===Content based=== The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an [[inverted index]] algorithm. A ''signature file'' is a technique that creates a ''quick and dirty'' filter, for example a [[Bloom filter]], that will keep all the documents that match to the query and ''hopefully'' a few ones that do not. The way this is done is by creating for each file a signature, typically a hash coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to [[inverted file]]s in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat the inverted files in certain environments. ==Example: PubMed== The [[PubMed]]<ref>{{cite journal |vauthors=Kim W, Aronson AR, Wilbur WJ |title=Automatic MeSH term assignment and quality assessment |journal=Proc AMIA Symp |pages=319β23 |year=2001 |pmid=11825203 |pmc=2243528 }} </ref> form interface features the "related articles" search which works through a comparison of words from the documents' title, abstract, and [[Medical Subject Headings|MeSH]] terms using a word-weighted algorithm.<ref>{{cite book|url=https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Computation_of_Related_Citati|title=Computation of Related Citations|publisher=National Center for Biotechnology Information (US)|date=2019-02-06}}</ref><ref>{{cite journal|journal=BMC Bioinformatics|date=Oct 30, 2007|volume=8|pages=423|pmid=17971238|title=PubMed related articles: a probabilistic topic-based model for content similarity|author=Lin J1, Wilbur WJ|doi=10.1186/1471-2105-8-423|pmc=2212667 |doi-access=free }}</ref> == See also == * [[Compound term processing]] * [[Document classification]] * [[Enterprise search]] * [[Evaluation measures (information retrieval)]] * [[Full text search]] * [[Information retrieval]] * [[Latent semantic indexing]] * [[Search engine]] == References == {{Reflist}} ==Further reading== * {{cite journal|first1=Christos|last1=Faloutsos|first2=Stavros|last2=Christodoulakis|title=Signature files: An access method for documents and its analytical performance evaluation|journal=ACM Transactions on Information Systems|volume=2|issue=4|year=1984|pages=267β288|doi=10.1145/2275.357411|s2cid=8120705|doi-access=free}} * {{cite journal|author1=Justin Zobel |author2=Alistair Moffat |author3=Kotagiri Ramamohanarao |title=Inverted files versus signature files for text indexing|journal=ACM Transactions on Database Systems|volume=23|issue=4|year=1998|pages= 453β490|url=https://www.cs.columbia.edu/~gravano/Qual/Papers/19%20-%20Inverted%20files%20versus%20signature%20files%20for%20text%20indexing.pdf|doi=10.1145/296854.277632|citeseerx=10.1.1.54.8753 |s2cid=7293918 }} * {{cite journal|author1=Ben Carterette |author2=Fazli Can |title=Comparing inverted files and signature files for searching a large lexicon|journal=Information Processing and Management|volume= 41|issue=3|year=2005|pages= 613β633|url=http://www.users.miamioh.edu/canf/papers/ipm04b.pdf|doi=10.1016/j.ipm.2003.12.003}} == External links == {{Commons category}} * [http://cir.dcs.uni-pannon.hu/cikkek/FINAL_DOMINICH.pdf Formal Foundation of Information Retrieval], Buckinghamshire Chilterns University College [[Category:Information retrieval genres]] [[Category:Electronic documents]] [[Category:Substring indices]] [[Category:Search engine software]]
Edit summary
(Briefly describe your changes)
By publishing changes, you agree to the
Terms of Use
, and you irrevocably agree to release your contribution under the
CC BY-SA 4.0 License
and the
GFDL
. You agree that a hyperlink or URL is sufficient attribution under the Creative Commons license.
Cancel
Editing help
(opens in new window)
Pages transcluded onto the current version of this page
(
help
)
:
Template:Cite book
(
edit
)
Template:Cite journal
(
edit
)
Template:Commons category
(
edit
)
Template:Reflist
(
edit
)