== Text analysis processes ==
Subtasks—components of a larger text-analytics effort—typically include:

* [[Dimensionality reduction]] is an important technique for pre-processing data. In text analysis it includes methods such as [[stemming]], which map inflected words to a common root form and so reduce the size of the text data.{{citation needed|date=October 2022}}
* [[Information retrieval]] or identification of a [[text corpus|corpus]] is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a [[file system]], [[database]], or content [[corpus manager]], for analysis.
* Although some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive [[natural language processing]], such as [[part of speech tagging]], syntactic [[parsing]], and other types of linguistic analysis.<ref>{{Cite thesis|title=Exploração de informações contextuais para enriquecimento semântico em representações de textos|url=http://www.teses.usp.br/teses/disponiveis/55/55134/tde-03012019-103253/|publisher=Universidade de São Paulo|date=2018-11-14|place=São Carlos|degree=Mestrado em Ciências de Computação e Matemática Computacional|doi=10.11606/d.55.2019.tde-03012019-103253|language=pt|first=João|last=Antunes|doi-access=free}}</ref>
* [[Named entity recognition]] is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.
* Disambiguation, the use of [[context (language use)|contextual]] clues, may be required to decide whether, for instance, "Ford" refers to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.<ref>{{Cite journal|last1=Moro|first1=Andrea|last2=Raganato|first2=Alessandro|last3=Navigli|first3=Roberto|date=December 2014|title=Entity Linking meets Word Sense Disambiguation: a Unified Approach|journal=Transactions of the Association for Computational Linguistics|volume=2|pages=231–244|doi=10.1162/tacl_a_00179|issn=2307-387X|doi-access=free}}</ref>
* Recognition of pattern-identified entities: features such as telephone numbers, e-mail addresses, and quantities (with units) can be discerned via [[regular expression]]s or other [[Pattern matching|pattern matches]].
* [[Document clustering]]: identification of sets of similar text documents.<ref>{{Cite journal|last1=Chang|first1=Wui Lee|last2=Tay|first2=Kai Meng|last3=Lim|first3=Chee Peng|date=2017-02-06|title=A New Evolving Tree-Based Model with Local Re-learning for Document Clustering and Visualization|journal=Neural Processing Letters|volume=46|issue=2|pages=379–409|doi=10.1007/s11063-017-9597-3|s2cid=9100902|issn=1370-4621}}</ref>
* [[Coreference]] resolution: identification of [[noun phrase]]s and other terms that refer to the same object.
* Extraction of relationships, facts and events: identification of associations among entities and other information in texts.
* [[Sentiment analysis]]: discerning subjective material and extracting information about attitudes, such as sentiment, opinion, mood, and emotion. This is done at the entity, concept, or topic level and aims to distinguish opinion holders and objects.<ref>{{cite journal |last1=Benchimol |first1=Jonathan |last2=Kazinnik |first2=Sophia |last3=Saadon |first3=Yossi |date=2022 |title=Text mining methodologies with R: An application to central bank texts |url=https://paperswithcode.com/paper/text-mining-methodologies-with-r-an |journal=Machine Learning with Applications |volume=8 |pages=100286 |doi=10.1016/j.mlwa.2022.100286|s2cid=243798160 |doi-access=free }}</ref>
* Quantitative text analysis: a set of techniques stemming from the social sciences in which either a human judge or a computer extracts semantic or grammatical relationships between words in order to determine the meaning or stylistic patterns of a text, usually a casual personal one, for purposes such as [[psychological profiling]].<ref>{{cite book|doi=10.1037/11383-011 |title=Handbook of multimethod measurement in psychology |year=2006 |last1=Mehl |first1=Matthias R. |isbn=978-1-59147-318-3 |page=141|chapter=Quantitative Text Analysis }}</ref>
* Pre-processing usually involves tasks such as tokenization, filtering and stemming.
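
As a rough illustration of two of the subtasks listed above, pre-processing (tokenization, filtering and stemming) and the recognition of pattern-identified entities, a minimal Python sketch might look like the following. It is not taken from the cited sources: the stop-word list, the suffix-stripping rule in <code>naive_stem</code>, and the two regular expressions are simplified assumptions chosen for brevity, and all function names are hypothetical.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real systems use larger ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "at", "or"}

# Simplified patterns for illustration; they do not cover every valid format.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
QUANTITY_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s?(kg|km|cm|mm|g|m)\b")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words (filtering), then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def find_pattern_entities(text):
    """Return e-mail addresses and (value, unit) quantities found by regex."""
    return {
        "emails": EMAIL_RE.findall(text),
        "quantities": QUANTITY_RE.findall(text),
    }
```

Under these assumptions, <code>preprocess("The miners are mining texts")</code> yields <code>["miner", "min", "text"]</code>. Reducing "mining" to "min" shows how crude suffix stripping can over-stem, which is one reason practical systems tend to rely on established stemming or lemmatization algorithms instead.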