Text corpus
{{Short description|Digital collections of natural language data}}
{{Use dmy dates|date=July 2022}}
In [[linguistics]] and [[natural language processing]], a '''corpus''' ({{plural form}}: '''corpora''') or '''text corpus''' is a dataset consisting of natively digital and older, digitized [[language resource]]s, either annotated or unannotated.

Annotated corpora have been used in [[corpus linguistics]] for statistical [[statistical hypothesis testing|hypothesis testing]], checking occurrences, and validating linguistic rules within a specific language territory.

== Overview ==
A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus'').

To make corpora more useful for linguistic research, they are often subjected to a process known as [[annotation]]. One example of annotation is [[part-of-speech tagging]], or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of ''tags''. Another example is indicating the [[Lemma (morphology)|lemma]] (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, [[interlinear gloss]]ing is used to make the annotation bilingual.

Some corpora have further ''structured'' levels of analysis applied. In particular, smaller corpora may be fully [[Parsing|parsed]]. Such corpora are usually called [[treebank]]s or [[Treebank|parsed corpora]]. Because it is difficult to ensure that an entire corpus is completely and consistently annotated, these corpora are usually smaller, containing around one to three million words. Other levels of structured linguistic analysis are possible, including annotations for [[Morphology (linguistics)|morphology]], [[semantics]] and [[pragmatics]].

== Applications ==
Corpora are the main knowledge base in [[corpus linguistics]].
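The annotation layers described in the overview (part-of-speech tags and lemmas) can be pictured as extra fields attached to each token. The following is only a minimal sketch under stated assumptions: the <code>Token</code> class, the toy sentence, and the tag names <code>DET</code>, <code>NOUN</code> and <code>VERB</code> are hypothetical illustrations, not the format of any particular corpus or tagset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One annotated token (hypothetical structure, for illustration only)."""
    form: str   # the word as it appears in the text
    lemma: str  # base (dictionary) form of the word
    pos: str    # part-of-speech tag from an illustrative tagset


# A hand-annotated toy sentence; a real corpus would hold many such sentences.
sentence = [
    Token("The", "the", "DET"),
    Token("cats", "cat", "NOUN"),
    Token("chased", "chase", "VERB"),
    Token("the", "the", "DET"),
    Token("mice", "mouse", "NOUN"),
]


def tag_counts(tokens):
    """Count how often each part-of-speech tag occurs."""
    return Counter(token.pos for token in tokens)


def lemmas(tokens):
    """Return the lemma of every token, in text order."""
    return [token.lemma for token in tokens]


print(tag_counts(sentence))
print(lemmas(sentence))
```

Keeping the surface form, lemma and tag side by side means that a query such as "all forms of ''mouse'' used as a noun" becomes a simple filter over the tokens.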
Other notable areas of application include:
* [[Language technology]], [[natural language processing]], [[computational linguistics]]
** The analysis and processing of various types of corpora are also the subject of much work in [[computational linguistics]], [[speech recognition]] and [[machine translation]], where they are often used to create [[hidden Markov model]]s for part-of-speech tagging and other purposes. Corpora and [[frequency list]]s derived from them are useful for [[language teaching]]. Corpora can be considered a type of [[foreign language writing aid]]: exposure to authentic texts in corpora gives non-native language users contextualised grammatical knowledge, which helps learners grasp how sentences are formed in the target language and so write effectively.<ref name="Yoon">Yoon, H., & Hirvela, A. (2004). [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.1073.2322&rep=rep1&type=pdf ESL Student Attitudes toward Corpus Use in L2 Writing]. ''Journal of Second Language Writing, 13''(4), 257–283. Retrieved 21 March 2012.</ref>
* [[Machine translation]]
** Multilingual corpora that have been specially formatted for side-by-side comparison are called ''aligned parallel corpora''. There are two main types of [[parallel corpora]] that contain texts in two languages. In a ''translation corpus'', the texts in one language are translations of texts in the other language. In a ''comparable corpus'', the texts are of the same kind and cover the same content, but they are not translations of each other.<ref>{{cite book | last1 = Wołk | first1 = K. | last2 = Marasek | first2 = K. | title = New Perspectives in Information Systems and Technologies, Volume 1 | chapter = Real-Time Statistical Speech Translation | series = Advances in Intelligent Systems and Computing | date = 7 April 2014 | publisher = Springer | volume = 275 | pages = 107–114 | doi = 10.1007/978-3-319-05951-8_11 | arxiv = 1509.09090 | issn = 2194-5357 | isbn = 978-3-319-05950-1| s2cid = 15361632}}</ref> To exploit a parallel text, some kind of text alignment that identifies equivalent text segments (phrases or sentences) is a prerequisite for analysis. [[Machine translation]] algorithms for translating between two languages are often trained using parallel fragments comprising a first-language corpus and a second-language corpus, which is an element-for-element translation of the first-language corpus.<ref>{{cite conference |last1=Wolk |first1=Krzysztof |last2=Marasek |first2=Krzysztof |editor1-last=Král |editor1-first=Pavel |editor2-last=Matoušek |editor2-first=Václav |arxiv=1509.08639 |contribution=Tuned and GPU-accelerated parallel data mining from comparable corpora |doi=10.1007/978-3-319-24033-6_4 |pages=32–40 |publisher=Springer |series=Lecture Notes in Computer Science |title=Text, Speech, and Dialogue – 18th International Conference, TSD 2015, Plzeň, Czech Republic, September 14–17, 2015, Proceedings |volume=9302 |year=2015|isbn=978-3-319-24032-9 }}</ref>
* [[Philology|Philologies]]
** Text corpora are also used in the study of [[historical document]]s, for example in attempts to [[decipherment|decipher]] ancient scripts, or in [[Biblical scholarship]]. Some archaeological corpora cover such a short period that they provide a snapshot in time. One of the shortest corpora in time may be the [[Amarna letters]] texts ([[1350 BC]]), which span 15–30 years. The ''corpus'' of an ancient city (for example the "[[Kültepe]] Texts" of Turkey) may go through a series of corpora, determined by their find site dates.
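Two ideas from the applications above, frequency lists derived from a corpus and aligned parallel corpora, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names <code>frequency_list</code> and <code>align_one_to_one</code> and the toy sentences are hypothetical, tokenisation is a naive regular expression, and pairing sentences by position assumes a strict sentence-for-sentence translation, which real alignment tools do not rely on.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Return (word, count) pairs, most frequent first, ties in alphabetical order."""
    counts = Counter()
    for text in texts:
        # Naive tokenisation: lower-case the text and keep runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def align_one_to_one(source_sentences, target_sentences):
    """Pair sentences by position; assumes a strict 1:1 translation."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("1:1 alignment needs equally many sentences on both sides")
    return list(zip(source_sentences, target_sentences))


# Toy translation corpus: each German sentence translates the English one.
english = ["The cat sleeps.", "The dog barks."]
german = ["Die Katze schläft.", "Der Hund bellt."]

print(frequency_list(english)[:3])
for source, target in align_one_to_one(english, german):
    print(source, "|", target)
```

In a comparable corpus the positional pairing would not hold, because the texts are not translations of each other; equivalent segments would first have to be found by some alignment or mining method.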
== Some notable text corpora ==
{{Main article|List of text corpora}}

== See also ==
* [[Concordance (publishing)|Concordance]]
* [[Corpus linguistics]]
* [[Culturomics]]
* [[Distributional–relational database]]
* [[Linguistic Data Consortium]]
* [[Natural language processing]]
* [[Natural Language Toolkit]]
* [[Parallel text]]
* [[Speech corpus]]
* [[Translation memory]]
* [[Treebank]]
* [[Zipf's law]]

== References ==
{{Reflist}}

== External links ==
* [http://www.clres.com/corp.html ACL SIGLEX Resource Links: Text Corpora] {{Webarchive|url=https://web.archive.org/web/20130813141813/http://www.clres.com/corp.html |date=2013-08-13 }}
* [https://archive.today/20121222193153/http://www.ahds.ac.uk/linguistic-corpora Developing Linguistic Corpora: a Guide to Good Practice]
* [http://corpus.byu.edu/ Free samples (not free), web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese]
* [http://ucnk.korpus.cz/intercorp/?lang=en Intercorp] Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
* [https://the.sketchengine.co.uk/open/ Sketch Engine: Open corpora with free access]
* [http://www.tscorpus.com/ TS Corpus – A Turkish Corpus freely available for academic research.]
* [http://www.tnc.org.tr/ Turkish National Corpus – A general-purpose corpus for contemporary Turkish]
* [https://digital.lib.hkbu.edu.hk/corpus/index.php Corpus of Political Speeches], Free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
* [https://ruscorpora.ru/en/ Russian National Corpus]
<!-- ===========================({{NoMoreLinks}})===============================
     | PLEASE BE CAUTIOUS IN ADDING MORE LINKS TO THIS ARTICLE. WIKIPEDIA IS     |
     | NOT A COLLECTION OF LINKS.                                                |
     |                                                                           |
     | Excessive or inappropriate links WILL BE DELETED.                         |
     | See [[Wikipedia:External links]] and [[Wikipedia:Spam]] for details.      |
     |                                                                           |
     | If there are already plentiful links, please propose additions or         |
     | replacements on this article's discussion page. Or submit your link       |
     | to the appropriate category at Curlie (www.curlie.org)                    |
     | and link back to that category using the {{dmoz}} template.               |
     ===========================({{NoMoreLinks}})=============================== -->

{{Natural Language Processing}}

[[Category:Discourse analysis]]
[[Category:Corpus linguistics]]
[[Category:Computational linguistics]]
[[Category:Works based on multiple works]]
[[Category:Test items]]<!-- broad sense -->

[[lt:Tekstynas]]