Text mining
== Applications ==
Text mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for [[e-discovery]], for example. Governments and military groups use text mining for [[national security]] and intelligence purposes. Scientific researchers incorporate text mining approaches into efforts to organize large sets of text data (i.e., addressing the problem of [[unstructured data]]), to determine ideas communicated through text (e.g., [[sentiment analysis]] in [[social media]]<ref>{{Cite journal|last1=Pang|first1=Bo|last2=Lee|first2=Lillian|author2-link=Lillian Lee (computer scientist)|date=2008|title=Opinion Mining and Sentiment Analysis|journal=Foundations and Trends in Information Retrieval|volume=2|issue=1–2|pages=1–135|doi=10.1561/1500000011|issn=1554-0669|citeseerx=10.1.1.147.2755|s2cid=207178694 }}</ref><ref>{{Cite journal|last1=Paltoglou|first1=Georgios|last2=Thelwall|first2=Mike|date=2012-09-01|title=Twitter, MySpace, Digg: Unsupervised Sentiment Analysis in Social Media|journal=ACM Transactions on Intelligent Systems and Technology |volume=3|issue=4|pages=66|doi=10.1145/2337542.2337551|s2cid=16600444|issn=2157-6904}}</ref><ref>{{Cite web|url=http://alt.qcri.org/semeval2017/task4/|title=Sentiment Analysis in Twitter < SemEval-2017 Task 4|website=alt.qcri.org|access-date=2018-10-02}}</ref>) and to support [[scientific discovery]] in fields such as the [[life sciences]] and [[bioinformatics]]. In business, applications are used to support [[competitive intelligence]] and automated [[ad serving|ad placement]], among numerous other activities.

=== Security applications ===
Many text mining software packages are marketed for [[security appliance|security applications]], especially monitoring and analysis of online plain text sources such as [[Internet news]], [[blog]]s, etc. for [[national security]] purposes.<ref>{{cite book |doi=10.1007/978-3-540-88181-0_7 |title=Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS'08 |series=Advances in Soft Computing |year=2009 |last1=Zanasi |first1=Alessandro |isbn=978-3-540-88180-3 |volume=53 |page=53|chapter=Virtual Weapons for Real Wars: Text Mining for National Security }}</ref> It is also involved in the study of text [[encryption]]/[[decryption]].

=== Biomedical applications ===
{{Main|Biomedical text mining}}
[[File:Text mining protocol.png|alt=A flowchart of a text mining protocol.|thumb|An example of a text mining protocol used in a study of protein-protein complexes, or [[protein docking]].<ref>{{Cite journal|last1=Badal|first1=Varsha D.|last2=Kundrotas|first2=Petras J.|last3=Vakser|first3=Ilya A.|date=2015-12-09|title=Text Mining for Protein Docking|journal=PLOS Computational Biology|volume=11|issue=12|pages=e1004630|doi=10.1371/journal.pcbi.1004630|issn=1553-7358|pmc=4674139|pmid=26650466|bibcode=2015PLSCB..11E4630B |doi-access=free }}</ref>]]
A range of text mining applications in the biomedical literature has been described,<ref>{{cite journal |doi=10.1371/journal.pcbi.0040020 |title=Getting Started in Text Mining |year=2008 |last1=Cohen |first1=K. 
Bretonnel |last2=Hunter |first2=Lawrence |journal=PLOS Computational Biology |volume=4 |pages=e20 |pmid=18225946 |issue=1 |pmc=2217579|bibcode=2008PLSCB...4...20C |doi-access=free }}</ref> including computational approaches to assist with studies in [[protein docking]],<ref>{{cite journal |doi=10.1371/journal.pcbi.1004630 |title=Text mining for protein docking|journal=PLOS Computational Biology|volume=11|issue=12|pages=e1004630|pmid=26650466 |pmc=4674139|year=2015|last1=Badal|first1=V. D|last2=Kundrotas|first2=P. J|last3=Vakser|first3=I. A|bibcode=2015PLSCB..11E4630B |doi-access=free }}</ref> [[protein interactions]],<ref>{{Cite journal|last1=Papanikolaou|first1=Nikolas|last2=Pavlopoulos|first2=Georgios A.|last3=Theodosiou|first3=Theodosios|last4=Iliopoulos|first4=Ioannis|date=2015|title=Protein–protein interaction predictions using text mining methods|journal=Methods|volume=74|pages=47–53|doi=10.1016/j.ymeth.2014.10.026|pmid=25448298|issn=1046-2023}}</ref><ref>{{Cite journal|last1=Szklarczyk|first1=Damian|last2=Morris|first2=John H|last3=Cook|first3=Helen|last4=Kuhn|first4=Michael|last5=Wyder|first5=Stefan|last6=Simonovic|first6=Milan|last7=Santos|first7=Alberto|last8=Doncheva|first8=Nadezhda T|last9=Roth|first9=Alexander|date=2016-10-18|title=The STRING database in 2017: quality-controlled protein–protein association networks, made broadly accessible|journal=Nucleic Acids Research|volume=45|issue=D1|pages=D362–D368|doi=10.1093/nar/gkw937|issn=0305-1048|pmc=5210637|pmid=27924014}}</ref> and protein-disease associations.<ref>{{Cite journal|last1=Liem|first1=David A.|last2=Murali|first2=Sanjana|last3=Sigdel|first3=Dibakar|last4=Shi|first4=Yu|last5=Wang|first5=Xuan|last6=Shen|first6=Jiaming|last7=Choi|first7=Howard|last8=Caufield|first8=John H.|last9=Wang|first9=Wei|last10=Ping|first10=Peipei|last11=Han|first11=Jiawei|date=2018-10-01|title=Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease|journal=American 
Journal of Physiology. Heart and Circulatory Physiology|volume=315|issue=4|pages=H910–H924|doi=10.1152/ajpheart.00175.2018|issn=1522-1539|pmid=29775406|pmc=6230912}}</ref> In addition, text mining can facilitate clinical studies and precision medicine by drawing on large patient textual datasets in the clinical field, datasets of demographic information in population studies, and adverse event reports. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests.<ref>{{cite journal |last1=Van Le |first1=D |last2=Montgomery |first2=J |last3=Kirkby |first3=KC |last4=Scanlan |first4=J |title=Risk Prediction using Natural Language Processing of Electronic Mental Health Records in an Inpatient Forensic Psychiatry Setting. |journal=Journal of Biomedical Informatics |volume=86 |pages=49–58 |date=10 August 2018 |doi=10.1016/j.jbi.2018.08.007 |pmid=30118855|doi-access=free }}</ref> One online text mining application in the biomedical literature is [[PubGene]], a publicly accessible [[search engine]] that combines biomedical text mining with network visualization.<ref>{{cite journal |doi=10.1038/ng0501-21 |title=A literature network of human genes for high-throughput analysis of gene expression |year=2001 |last1=Jenssen |first1=Tor-Kristian |last2=Lægreid |first2=Astrid |last3=Komorowski |first3=Jan |last4=Hovig |first4=Eivind |journal=Nature Genetics |volume=28 |pages=21–8 |pmid=11326270 |issue=1|s2cid=8889284 }}</ref><ref>{{cite journal |doi=10.1038/ng0501-9 |title=Linking microarray data to the literature |year=2001 |last1=Masys |first1=Daniel R. |journal=Nature Genetics |volume=28 |pages=9–10 |pmid=11326264 |issue=1|s2cid=52848745 }}</ref> [[GoPubMed]] is a knowledge-based search engine for biomedical texts. 
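
One simple way to turn biomedical text into a network that can be visualized is to count how often two entities are mentioned in the same document. The following is a minimal sketch of that idea, not the pipeline of PubGene or any other tool named here; the gene vocabulary, the toy abstracts, and the function name <code>cooccurrence_edges</code> are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical gene vocabulary and toy abstracts, for illustration only.
GENES = {"TP53", "MDM2", "BRCA1", "EGFR"}

ABSTRACTS = [
    "TP53 is negatively regulated by MDM2 in many tumours.",
    "Loss of BRCA1 and TP53 was observed in the cohort.",
    "MDM2 amplification correlates with TP53 status.",
    "EGFR signalling was unaffected.",
]


def cooccurrence_edges(abstracts, genes):
    """Count how many abstracts mention each unordered pair of genes."""
    edges = Counter()
    for text in abstracts:
        # Naive tokenisation: split on whitespace and strip punctuation.
        tokens = {tok.strip(".,;:()") for tok in text.split()}
        mentioned = sorted(tokens & genes)
        # Every pair of genes found in the same abstract gets one count.
        for pair in combinations(mentioned, 2):
            edges[pair] += 1
    return edges


edges = cooccurrence_edges(ABSTRACTS, GENES)
for (gene_a, gene_b), weight in sorted(edges.items()):
    print(gene_a, gene_b, weight)
```

The resulting weighted pairs can be read as the edges of a graph whose nodes are genes. A real system would need proper entity recognition and synonym handling rather than exact string matching.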
Text mining techniques also enable the extraction of previously unknown knowledge from unstructured documents in the clinical domain.<ref>{{Cite journal|last=Renganathan|first=Vinaitheerthan|date=2017|title=Text Mining in Biomedical Domain with Emphasis on Document Clustering|journal=Healthcare Informatics Research|volume=23|issue=3|pages=141–146|doi=10.4258/hir.2017.23.3.141|pmid=28875048|pmc=5572517|issn=2093-3681}}</ref>

=== Software applications ===
Text mining methods and software are also being researched and developed by major firms, including [[IBM]] and [[Microsoft]], to further automate the mining and analysis processes, and by firms working in search and indexing more generally as a way to improve their results. Within the public sector, much effort has been concentrated on creating software for tracking and monitoring [[Information Awareness Office|terrorist activities]].<ref>[http://yatsko.zohosites.com/texor-a-chat-mining-program.html] {{webarchive|url=https://web.archive.org/web/20131004224652/http://yatsko.zohosites.com/texor-a-chat-mining-program.html|date=October 4, 2013}}</ref> For study purposes, [[Weka (machine learning)|Weka software]] is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called [[Natural Language Toolkit|NLTK]] for more general purposes. For more advanced programmers, there is also the [[Gensim]] library, which focuses on word embedding-based text representations.

=== Online media applications ===
Text mining is being used by large media companies, such as the [[Tribune Company]], to clarify information and to provide readers with better search experiences, which in turn increases site "stickiness" and revenue. Additionally, on the back end, editors benefit from being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.

=== Business and marketing applications ===
Text analytics is being used in business, particularly in marketing, for example in [[customer relationship management]].<ref name="Text Analytics: The Why Behind the Score">{{cite web|url=http://www.medallia.com/text-analytics/ |title=Text Analytics |publisher=Medallia |access-date=2015-02-23}}</ref> Coussement and Van den Poel (2008)<ref name="10.1016/j.im.2008.01.005">{{cite journal |doi=10.1016/j.im.2008.01.005 |url=http://econpapers.repec.org/RePEc:rug:rugwps:08/502 |title=Integrating the voice of customers through call center emails into a decision support system for churn prediction |year=2008 |last1=Coussement |first1=Kristof |last2=Van Den Poel |first2=Dirk |journal=Information & Management |volume=45 |issue=3 |pages=164–74|citeseerx=10.1.1.113.3238 }}</ref><ref>{{cite journal |doi=10.1016/j.dss.2007.10.010 |url=http://econpapers.repec.org/RePEc:rug:rugwps:07/481 |title=Improving customer complaint management by automatic email classification using linguistic style features as predictors |year=2008 |last1=Coussement |first1=Kristof |last2=Van Den Poel |first2=Dirk |journal=Decision Support Systems |volume=44 |issue=4 |pages=870–82}}</ref> apply it to improve [[predictive analytics]] models for customer churn ([[customer attrition]]).<ref name="10.1016/j.im.2008.01.005" /> Text mining is also being applied to the prediction of stock returns.<ref name="Galvez2017">{{cite journal | title=Assessing the usefulness of online message board mining in automatic stock prediction systems |author1=Ramiro H. 
Gálvez |author2=Agustín Gravano | journal=Journal of Computational Science | volume=19 | pages=1877–7503 | year=2017 | doi=10.1016/j.jocs.2017.01.001| hdl=11336/60065 | hdl-access=free }}</ref>

=== Sentiment analysis ===
[[Sentiment analysis]] may involve analysing reviews of products such as movies, books, or hotels to estimate how favorable a review is toward the product.<ref>{{cite book |doi=10.3115/1118693.1118704 |title=Proceedings of the ACL-02 conference on Empirical methods in natural language processing |year=2002 |last1=Pang |first1=Bo |last2=Lee |first2=Lillian |last3=Vaithyanathan |first3=Shivakumar |volume=10 |pages=79–86|chapter=Thumbs up? |s2cid=7105713 }}</ref> Such an analysis may need a labeled data set or labeling of the [[affect (psychology)|affectivity]] of words. Resources for affectivity of words and concepts have been made for [[WordNet]]<ref>{{cite journal |author1=Alessandro Valitutti |author2=Carlo Strapparava |author3=Oliviero Stock | title = Developing Affective Lexical Resources | journal = PsychNology Journal | year = 2005 | issue = 1 | pages = 61–83 | url = http://www.psychnology.org/File/PSYCHNOLOGY_JOURNAL_2_1_VALITUTTI.pdf | volume = 2 }}</ref> and [[ConceptNet]],<ref name="camnet">{{cite conference | author = Erik Cambria |author2=Robert Speer |author3=Catherine Havasi |author4=Amir Hussain | title = SenticNet: a Publicly Available Semantic Resource for Opinion Mining | book-title = Proceedings of AAAI CSK | year = 2010 | pages = 14–18 | url = http://www.aaai.org/ocs/index.php/FSS/FSS10/paper/download/2216/2617.pdf }}</ref> respectively. 
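
As an illustration of how a labeling of the affectivity of words can be used to score a review, the following minimal sketch sums per-word polarity values from a small lexicon. The lexicon, its scores, the negator list, and the function name <code>review_score</code> are hypothetical assumptions; they are not taken from the WordNet- or ConceptNet-based resources cited above, and practical systems are considerably more elaborate.

```python
# Hypothetical affect lexicon: word -> polarity score (assumed values).
AFFECT_LEXICON = {
    "excellent": 2, "good": 1, "enjoyable": 1,
    "bad": -1, "boring": -1, "terrible": -2,
}
# Hypothetical list of words that flip the polarity of the next word.
NEGATORS = {"not", "never", "no"}


def review_score(text):
    """Sum word polarities; a negator flips only the word right after it."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if word in NEGATORS:
            negate = True
            continue
        polarity = AFFECT_LEXICON.get(word, 0)
        score += -polarity if negate else polarity
        # Negation is reset after one word, whether or not it was in the lexicon.
        negate = False
    return score


positive = review_score("An excellent and enjoyable film.")
negative = review_score("The plot was not good, and the ending was terrible.")
print(positive, negative)
```

A positive total suggests a favorable review and a negative total an unfavorable one. The alternative mentioned above, training a classifier on a labeled data set, avoids hand-assigned scores at the cost of needing annotated examples.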
Text has been used to detect emotions in the related area of affective computing.<ref>{{cite journal |doi=10.1109/T-AFFC.2010.1 |title=Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications |year=2010 |last1=Calvo |first1=Rafael A |last2=d'Mello |first2=Sidney |journal=IEEE Transactions on Affective Computing |volume=1 |issue=1 |pages=18–37|s2cid=753606 }}</ref> Text-based approaches to affective computing have been used on multiple corpora such as student evaluations, children's stories and news stories.

=== Scientific literature mining and academic applications ===
The issue of text mining is of importance to publishers who hold large [[database]]s of information needing [[index (database)|indexing]] for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within the written text. Therefore, initiatives have been taken such as [[Nature (journal)|Nature's]] proposal for an Open Text Mining Interface (OTMI) and the [[National Institutes of Health]]'s common Journal Publishing [[Document Type Definition]] (DTD) that would provide semantic cues to machines to answer specific queries contained within the text without removing publisher barriers to public access.

Academic institutions have also become involved in the text mining initiative:
* The [[National Centre for Text Mining]] (NaCTeM) is the first publicly funded text mining centre in the world. NaCTeM is operated by the [[University of Manchester]]<ref>{{cite web|url=http://www.manchester.ac.uk |title=The University of Manchester |publisher=Manchester.ac.uk |access-date=2015-02-23}}</ref> in close collaboration with the Tsujii Lab,<ref>{{cite web |url=http://www-tsujii.is.s.u-tokyo.ac.jp/index.html |title=Tsujii Laboratory |publisher=Tsujii.is.s.u-tokyo.ac.jp |access-date=2015-02-23 |archive-date=2012-03-07 |archive-url=https://web.archive.org/web/20120307231425/http://www-tsujii.is.s.u-tokyo.ac.jp/index.html |url-status=dead }}</ref> [[University of Tokyo]].<ref>{{cite web|url=http://www.u-tokyo.ac.jp/index_e.html |title=The University of Tokyo |publisher=UTokyo |access-date=2015-02-23}}</ref> NaCTeM provides customised tools and research facilities and offers advice to the academic community. It is funded by the [[Joint Information Systems Committee]] (JISC) and two of the UK [[research council (United Kingdom)|research councils]] ([[EPSRC]] & [[BBSRC]]). With an initial focus on text mining in the [[biology|biological]] and [[biomedical]] sciences, research has since expanded into the areas of [[social sciences]].
* In the United States, the [[UC Berkeley School of Information|School of Information]] at [[University of California, Berkeley]] is developing a program called BioText to assist [[biology]] researchers in text mining and analysis.
* The [[Text Analysis Portal for Research]] (TAPoR), currently housed at the [[University of Alberta]], is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.

==== Methods for scientific literature mining ====
Computational methods have been developed to assist with information retrieval from scientific literature. 
Published approaches include methods for searching,<ref>{{Cite book|last1=Shen|first1=Jiaming|last2=Xiao|first2=Jinfeng|last3=He|first3=Xinwei|last4=Shang|first4=Jingbo|last5=Sinha|first5=Saurabh|last6=Han|first6=Jiawei|date=2018-06-27|title=Entity Set Search of Scientific Literature: An Unsupervised Ranking Approach|publisher=ACM|pages=565–574|doi=10.1145/3209978.3210055|isbn=978-1-4503-5657-2|s2cid=13748283}}</ref> determining novelty,<ref>{{Cite journal|last1=Walter|first1=Lothar|last2=Radauer|first2=Alfred|last3=Moehrle|first3=Martin G.|date=2017-02-06|title=The beauty of brimstone butterfly: novelty of patents identified by near environment analysis based on text mining|journal=Scientometrics|volume=111|issue=1|pages=103–115|doi=10.1007/s11192-017-2267-4|s2cid=11174676|issn=0138-9130}}</ref> and clarifying [[homonym]]s<ref>{{Cite journal|last1=Roll|first1=Uri|last2=Correia|first2=Ricardo A.|last3=Berger-Tal|first3=Oded|date=2018-03-10|title=Using machine learning to disentangle homonyms in large text corpora|journal=Conservation Biology|volume=32|issue=3|pages=716–724|doi=10.1111/cobi.13044|pmid=29086438|bibcode=2018ConBi..32..716R |s2cid=3783779|issn=0888-8892}}</ref> among technical reports.

=== Digital humanities and computational sociology ===
The automatic analysis of vast textual corpora has created the possibility for scholars to analyze millions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, [[machine translation]], topic [[categorization]], and machine learning. 
[[File:Tripletsnew2012.png|thumb|right|Narrative network of US Elections 2012<ref name="ReferenceA">Automated analysis of the US presidential elections using Big Data and network analysis; S Sudhahar, GA Veltri, N Cristianini; Big Data & Society 2 (1), 1-28, 2015</ref>]] The automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data. The resulting networks, which can contain thousands of nodes, are then analyzed by using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes.<ref>Network analysis of narrative content in large corpora; S Sudhahar, G De Fazio, R Franzosi, N Cristianini; Natural Language Engineering, 1-32, 2013</ref> This automates the approach introduced by quantitative narrative analysis,<ref>Quantitative Narrative Analysis; Roberto Franzosi; Emory University © 2010</ref> whereby [[subject-verb-object]] triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.<ref name="ReferenceA" /> [[Content analysis]] has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a "[[big data]]" revolution to take place in that field, with studies in social media and newspaper content that include millions of news items. 
[[Gender bias]], [[readability]], content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents.<ref>{{Cite journal|last1=Lansdall-Welfare|first1=Thomas|last2=Sudhahar|first2=Saatviga|last3=Thompson|first3=James|last4=Lewis|first4=Justin|last5=Team|first5=FindMyPast Newspaper|last6=Cristianini|first6=Nello|date=2017-01-09|title=Content analysis of 150 years of British periodicals|journal=Proceedings of the National Academy of Sciences|volume=114|issue=4|pages=E457–E465|doi=10.1073/pnas.1606380114|issn=0027-8424|pmid=28069962|pmc=5278459|bibcode=2017PNAS..114E.457L |doi-access=free}}</ref><ref>I. Flaounas, M. Turchi, O. Ali, N. Fyson, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, The Structure of EU Mediasphere, PLoS ONE, Vol. 5(12), pp. e14243, 2010.</ref><ref>Nowcasting Events from the Social Web with Statistical Learning V Lampos, N Cristianini; ACM Transactions on Intelligent Systems and Technology (TIST) 3 (4), 72</ref><ref>NOAM: news outlets analysis and monitoring system; I Flaounas, O Ali, M Turchi, T Snowsill, F Nicart, T De Bie, N Cristianini Proc. of the 2011 ACM SIGMOD international conference on Management of data</ref><ref>Automatic discovery of patterns in media content, N Cristianini, Combinatorial Pattern Matching, 2-13, 2011</ref> The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al.<ref>I. Flaounas, O. Ali, T. Lansdall-Welfare, T. De Bie, N. Mosdell, J. Lewis, N. Cristianini, RESEARCH METHODS IN THE AGE OF DIGITAL JOURNALISM, Digital Journalism, Routledge, 2012</ref> showing how different topics have different gender biases and levels of readability; the possibility to detect mood patterns in a vast population by analyzing Twitter content was demonstrated as well.<ref>Circadian Mood Variations in Twitter Content; Fabon Dzogang, Stafford Lightman, Nello Cristianini. 
Brain and Neuroscience Advances, 1, 2398212817744501.</ref><ref>Effects of the Recession on Public Mood in the UK; T Lansdall-Welfare, V Lampos, N Cristianini; Mining Social Network Dynamics (MSND) session on Social Media Applications</ref>
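
The step from subject-verb-object triplets to network data described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited studies: the triplets use placeholder actors, the function names <code>build_network</code> and <code>degree_centrality</code> are hypothetical, and degree centrality stands in for the wider range of network-theory measures mentioned above.

```python
from collections import defaultdict

# Hypothetical subject-verb-object triplets, as a parser might emit them.
TRIPLETS = [
    ("A", "criticises", "B"),
    ("B", "attacks", "A"),
    ("A", "supports", "C"),
    ("D", "endorses", "B"),
]


def build_network(triplets):
    """Map each actor to the set of actors it is linked to by any action."""
    neighbours = defaultdict(set)
    for subject, _verb, obj in triplets:
        if subject != obj:
            # The graph is undirected: the link is stored on both ends.
            neighbours[subject].add(obj)
            neighbours[obj].add(subject)
    return neighbours


def degree_centrality(neighbours):
    """Degree of each node divided by the maximum possible degree (n - 1)."""
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


network = build_network(TRIPLETS)
centrality = degree_centrality(network)
for actor in sorted(centrality):
    print(actor, round(centrality[actor], 3))
```

Here the verb is discarded and only the actor-to-actor link is kept; retaining the verb as an edge label would allow the kind of action to be analyzed as well.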