Stop word
{{Confuse|Safeword}}
{{short description|Common word that search engines avoid indexing to save time and space}}
'''Stop words''' are the words in a '''stop list''' (or '''''stoplist''''' or '''''negative dictionary''''') which are filtered out ("stopped") before or after [[Natural language processing|processing of natural language]] data (i.e. text) because they are deemed to have little semantic value or are otherwise insignificant for the task at hand.<ref>{{Cite book | last1 = Rajaraman | first1 = A. | last2 = Ullman | first2 = J. D. | doi = 10.1017/CBO9781139058452.002 | chapter = Data Mining | title = Mining of Massive Datasets | pages = 1–17| year = 2011 | isbn = 9781139058452 | chapter-url = http://i.stanford.edu/~ullman/mmds/ch1.pdf}}</ref> There is no single universal list of stop words used by all natural language processing (NLP) tools, nor any agreed-upon rules for identifying stop words, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".<ref>{{Cite book|last=Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze|title=Introduction to Information Retrieval|publisher=Cambridge University Press|year=2008|pages=27}}</ref>

== History of stop words ==
A predecessor concept was used in creating some [[Bible concordance|concordance]]s.
For example, the first [[Bible concordance#Hebrew|Hebrew concordance]], [[Isaac Nathan ben Kalonymus]]'s {{lang|he-Latn|Me’ir Nativ}}, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions which are similar to common modern stop words.<ref>{{cite journal |last1=Weinberg |first1=Bella Hass |date=2004 |title=Predecessors of scientific indexing structures in the domain of religion |url=https://www.asis.org/History/11-weinberg.pdf |url-status=dead |journal=Second Conference on the History and Heritage of Scientific and Technical Information Systems |pages=126–134 |archive-url=https://web.archive.org/web/20160103065355/https://www.asis.org/History/11-weinberg.pdf |archive-date=3 Jan 2016 |access-date=17 February 2016}}</ref>

[[Hans Peter Luhn]], one of the pioneers in [[information retrieval]], is credited with coining the phrase and using the concept when introducing his [[Key Word in Context]] [[automatic indexing]] process.<ref>{{Cite journal|title = Keyword-in-Context Index for Technical Literature (KWIC Index)|last = Luhn|first = H. P.|journal = American Documentation|volume = 11|issue = 4|publisher = International Business Machines Corp.|year = 1959|location = Yorktown Heights, NY|pages = 288–295|doi = 10.1002/asi.5090110403}}</ref> The phrase "stop word", which is not in Luhn's 1959 presentation, and the associated terms "stop list" and "stoplist" appear in the literature shortly afterward.<ref>{{cite journal|last1=Flood|first1=Barbara J.|title=Historical note: The Start of a Stop List at Biological Abstracts|journal=Journal of the American Society for Information Science|date=1999|volume=50|issue=12|page=1066|doi=10.1002/(SICI)1097-4571(1999)50:12<1066::AID-ASI5>3.0.CO;2-A}}</ref>

Although it is commonly assumed that stop lists include only the most frequent words in a language, it was C.J. Van Rijsbergen who proposed the first standardized list which was not based on word frequency information.
The "Van list" included 250 English words. Martin Porter's word stemming program developed in the 1980s built on the Van list, and the Porter list is now commonly used as a default stop list in a variety of software applications.

In 1990, Christopher Fox proposed the first general stop list based on empirical word frequency information derived from the [[Brown Corpus]]:
<blockquote>This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.<ref>{{Cite journal|last=Fox|first=Christopher|date=1989-09-01|title=A stop list for general text|url=https://doi.org/10.1145/378881.378888|journal=ACM SIGIR Forum|volume=24|issue=1–2|pages=19–21|doi=10.1145/378881.378888|s2cid=20240000 |issn=0163-5840|url-access=subscription}}</ref></blockquote>

In [[Search engine optimization|SEO]] terminology, stop words are the most common words that many search engines used to avoid for the purposes of saving space and time in processing of large data during [[Web crawler|crawling]] or [[Search engine indexing|indexing]]. For some [[search engine]]s, these are some of the most common, short [[function word]]s, such as ''the'', ''is'', ''at'', ''which'', and ''on''.
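The filtering of such function words can be sketched as follows. This is only a minimal illustration, assuming naive whitespace tokenization and a stop list made up of the five example words given above; the names <code>STOP_WORDS</code> and <code>remove_stop_words</code> are hypothetical and do not come from any particular search engine or library.

```python
# Illustrative stop list: only the five function words named in the text.
# Real stop lists differ from tool to tool; this one is an assumption.
STOP_WORDS = frozenset({"the", "is", "at", "which", "on"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lowercased tokens of `text` that are not stop words."""
    tokens = text.lower().split()  # naive whitespace tokenization
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("Which book is on the table"))  # ['book', 'table']
print(remove_stop_words("The Who"))                     # ['who']
```

With this sketch, a query consisting mostly of function words keeps very few of its tokens.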
In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "[[The Who]]", "[[The The]]", or "[[Take That]]". Other search engines remove some of the most common words, including [[lexical word]]s such as "want", from a query in order to improve performance.<ref>[http://blog.stackoverflow.com/2008/12/podcast-32 Stackoverflow]: "One of our major performance optimizations for the "related questions" query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster".</ref>

In recent years the SEO best practices around stop words have evolved along with the fields of [[machine learning]] and NLP. In February 2021, John Mueller, Webmaster Trends Analyst at [[Google]], tweeted "I wouldn't worry about stop words at all; write naturally. Search engines look at much, much more than individual words.
'[[To be or not to be]]' just is a collection of stop words, but stop words alone don't do it any justice."<ref>{{Cite web |title=Google: Stop Worrying About Stop Words Just Write Naturally |url=https://www.seroundtable.com/google-on-stop-words-30935.html |access-date=2022-07-15 |website=seroundtable.com|date=16 February 2021 }}</ref><ref>{{Cite web |last=Mueller |first=John |date=Feb 6, 2021 |title=John Mueller on stop words in 2021: "I wouldn't worry about stop words at all" |url=https://twitter.com/JohnMu/status/1358165707784077312 |access-date=July 15, 2022 |website=Twitter}}</ref>

== See also ==
* [[Concept mining]]
* [[Filler (linguistics)]]
* [[Index (search engine)]]
* [[Information extraction]]
* [[Query expansion]]
* [[Stemming]]
* [[Text mining]]

== References ==
{{Reflist|2}}

== External links ==
* [https://dev.mysql.com/doc/refman/5.5/en/fulltext-stopwords.html Full-Text Stopwords in MySQL]
* [https://www.textfixer.com/resources/common-english-words.txt English Stop Words (CSV)]
* [https://e-padi.com/stop-words-indonesia-query-php-array.htm Stop Words Indonesia Query PHP Array]
* [https://tcpip.wtf/en/deutsche-stopwords.htm German Stop Words], [https://archive.today/20130210071035/http://aniol-consulting.de/uebersicht-deutscher-stop-words/ German Stop Words and phrases], another list of [https://web.archive.org/web/20100308021834/http://www.ranks.nl/stopwords/german.html German stop words]
* [[:pl:Wikipedia:Stopwords|Polish Stop Words]]
* [https://code.google.com/p/stop-words/ Collection of stop words in 29 languages] ([https://web.archive.org/web/*/http://tonyb.sk/_my/ir/stop-words-collection-2014-02-24.zip archive])
* [http://www.techie-knowledge.co.in/2018/07/stop-words-in-hindi-language.html List of Hindi Stop Words]

{{Natural Language Processing}}
{{SearchEngineOptimization}}

[[Category:Information retrieval techniques]]