Editing Document retrieval

'''Document retrieval''' is defined as the matching of some stated user query against a set of [[free-text]] records. These records could be any type of mainly [[natural language|unstructured text]], such as [[newspaper article]]s, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Document retrieval is sometimes referred to as, or as a branch of, '''text retrieval'''. Text retrieval is a branch of [[information retrieval]] where the information is stored primarily in the form of [[natural language|text]]. Text databases became decentralized thanks to the [[personal computer]]. Text retrieval is a critical area of study today, since it is the fundamental basis of all [[internet]] [[search engine]]s.

==Description==
Document retrieval systems find information to given criteria by matching text records (''documents'') against user queries, as opposed to [[expert system]]s that answer questions by [[Inference|inferring]] over a logical [[knowledge base|knowledge database]]. A document retrieval system consists of a database of documents, a [[classification algorithm]] to build a full text index, and a user interface to access the database.

A document retrieval system has two main tasks:
# Find relevant documents to user queries
# Evaluate the matching results and sort them according to relevance, using algorithms such as [[PageRank]].

Internet [[search engines]] are classical applications of document retrieval. The vast majority of retrieval systems currently in use range from simple Boolean systems through to systems using [[statistical]] or [[natural language processing]] techniques.

==Variations==
There are two main classes of indexing schemata for document retrieval systems: ''form based'' (or ''word based''), and ''content based'' indexing. The document classification scheme (or [[Search engine indexing|indexing algorithm]]) in use determines the nature of the document retrieval system.

===Form based===
Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language, the system could for example be used to process large sets of chemical representations in molecular biology. A [[suffix tree]] algorithm is an example for form based indexing.

===Content based===
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an [[inverted index]] algorithm.

A ''signature file'' is a technique that creates a ''quick and dirty'' filter, for example a [[Bloom filter]], that will keep all the documents that match to the query and ''hopefully'' a few ones that do not. The way this is done is by creating for each file a signature, typically a hash coded version. One method is superimposed coding. A post-processing step is done to discard the false alarms. Since in most cases this structure is inferior to [[inverted file]]s in terms of speed, size and functionality, it is not used widely. However, with proper parameters it can beat the inverted files in certain environments.

==Example: PubMed==
The [[PubMed]]<ref>{{cite journal |vauthors=Kim W, Aronson AR, Wilbur WJ |title=Automatic MeSH term assignment and quality assessment |journal=Proc AMIA Symp |pages=319–23 |year=2001 |pmid=11825203 |pmc=2243528 }}
</ref> form interface features the "related articles" search which works through a comparison of words from the documents' title, abstract, and [[Medical Subject Headings|MeSH]] terms using a word-weighted algorithm.<ref>{{cite book|url=https://www.ncbi.nlm.nih.gov/books/NBK3827/#pubmedhelp.Computation_of_Related_Citati|title=Computation of Related Citations|publisher=National Center for Biotechnology Information (US)|date=2019-02-06}}</ref><ref>{{cite journal|journal=BMC Bioinformatics|date=Oct 30, 2007|volume=8|pages=423|pmid=17971238|title=PubMed related articles: a probabilistic topic-based model for content similarity|author=Lin J1, Wilbur WJ|doi=10.1186/1471-2105-8-423|pmc=2212667 |doi-access=free }}</ref>

== See also ==

* [[Compound term processing]]
* [[Document classification]]
* [[Enterprise search]]
* [[Evaluation measures (information retrieval)]]
* [[Full text search]]
* [[Information retrieval]]
* [[Latent semantic indexing]]
* [[Search engine]]

== References ==
{{Reflist}}

==Further reading==
* {{cite journal|first1=Christos|last1=Faloutsos|first2=Stavros|last2=Christodoulakis|title=Signature files: An access method for documents and its analytical performance evaluation|journal=ACM Transactions on Information Systems|volume=2|issue=4|year=1984|pages=267–288|doi=10.1145/2275.357411|s2cid=8120705|doi-access=free}}
* {{cite journal|author1=Justin Zobel |author2=Alistair Moffat |author3=Kotagiri Ramamohanarao |title=Inverted files versus signature files for text indexing|journal=ACM Transactions on Database Systems|volume=23|issue=4|year=1998|pages= 453–490|url=https://www.cs.columbia.edu/~gravano/Qual/Papers/19%20-%20Inverted%20files%20versus%20signature%20files%20for%20text%20indexing.pdf|doi=10.1145/296854.277632|citeseerx=10.1.1.54.8753 |s2cid=7293918 }}
* {{cite journal|author1=Ben Carterette |author2=Fazli Can |title=Comparing inverted files and signature files for searching a large lexicon|journal=Information Processing and Management|volume= 41|issue=3|year=2005|pages= 613–633|url=http://www.users.miamioh.edu/canf/papers/ipm04b.pdf|doi=10.1016/j.ipm.2003.12.003}}

== External links ==
{{Commons category}}
* [http://cir.dcs.uni-pannon.hu/cikkek/FINAL_DOMINICH.pdf Formal Foundation of Information Retrieval], Buckinghamshire Chilterns University College


[[Category:Information retrieval genres]]
[[Category:Electronic documents]]
[[Category:Substring indices]]
[[Category:Search engine software]]