===Politeness policy===
Crawlers can retrieve data much more quickly and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. If a single crawler performs multiple requests per second and/or downloads large files, a server can have a hard time keeping up with requests from multiple crawlers.

As noted by Koster, the use of Web crawlers is useful for a number of tasks, but comes with a price for the general community.<ref>Koster, M. (1995). Robots in the web: threat or treat? ''ConneXions'', 9(4).</ref> The costs of using Web crawlers include:
* network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time;
* server overload, especially if the frequency of accesses to a given server is too high;
* poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle; and
* personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the [[Robots Exclusion Standard|robots exclusion protocol]], also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers.<ref>Koster, M. (1996). [http://www.robotstxt.org/wc/exclusion.html A standard for robot exclusion] {{Webarchive|url=https://web.archive.org/web/20071107021800/http://www.robotstxt.org/wc/exclusion.html |date=7 November 2007 }}.</ref> This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. More recently, commercial search engines such as [[Google.com|Google]], [[Ask.com|Ask Jeeves]], [[Bing (search engine)|MSN]] and [[Yahoo! Search]] have been able to use an extra "Crawl-delay:" parameter in the [[robots.txt]] file to indicate the number of seconds to delay between requests.
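For illustration, the following is a minimal sketch of how a crawler might consult a site's robots.txt file, including the non-standard "Crawl-delay" directive, using Python's standard-library <code>urllib.robotparser</code>; the host name and user-agent string are placeholders rather than values from any particular crawler.

<syntaxhighlight lang="python">
import urllib.robotparser

# Fetch and parse the site's robots.txt (example.com is a placeholder host).
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

user_agent = "ExampleBot"  # placeholder user-agent string

# Only fetch a page if the robots exclusion rules allow it.
allowed = parser.can_fetch(user_agent, "https://example.com/some/page.html")

# Crawl-delay, in seconds, if the site specifies one; None otherwise.
delay = parser.crawl_delay(user_agent)
</syntaxhighlight>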
The first proposed interval between successive pageloads was 60 seconds.<ref>Koster, M. (1993). [http://www.robotstxt.org/wc/guidelines.html Guidelines for robots writers] {{Webarchive|url=https://web.archive.org/web/20050422045839/http://www.robotstxt.org/wc/guidelines.html |date=22 April 2005 }}.</ref> However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download that entire Web site alone; also, only a fraction of the resources from that Web server would be used. Cho uses 10 seconds as an interval for accesses,<ref name=cho2003/> and the WIRE crawler uses 15 seconds as the default.<ref name=baeza2002>Baeza-Yates, R. and Castillo, C. (2002). [http://www.chato.cl/papers/baeza02balancing.pdf Balancing volume, quality and freshness in Web crawling]. In Soft Computing Systems – Design, Management and Applications, pages 565–572, Santiago, Chile. IOS Press Amsterdam.</ref> The Mercator web crawler follows an adaptive politeness policy: if it took ''t'' seconds to download a document from a given server, the crawler waits for 10''t'' seconds before downloading the next page (see the sketch at the end of this section).<ref>{{cite journal |author1=Heydon, Allan |author2=Najork, Marc |title=Mercator: A Scalable, Extensible Web Crawler |date=1999-06-26 |url=http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf |access-date=2009-03-22 |url-status=dead |archive-url=https://web.archive.org/web/20060219085958/http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf |archive-date=19 February 2006}}</ref> Dill ''et al.'' use 1 second.<ref>{{cite journal |last1 = Dill |first1 = S. |last2 = Kumar |first2 = R. |last3 = Mccurley |first3 = K. S. |last4 = Rajagopalan |first4 = S. |last5 = Sivakumar |first5 = D. |last6 = Tomkins |first6 = A. |year = 2002 |title = Self-similarity in the web |url = http://www.mccurley.org/papers/fractal.pdf |journal = ACM Transactions on Internet Technology |volume = 2 |issue = 3 |pages = 205–223 |doi = 10.1145/572326.572328 |s2cid = 6416041 }}</ref>

For those using Web crawlers for research purposes, a more detailed cost-benefit analysis is needed, and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.<ref>{{Cite journal| author1 = M. Thelwall | author2 = D. Stuart | year = 2006 | url = http://www.scit.wlv.ac.uk/%7Ecm1993/papers/Web_Crawling_Ethics_preprint.doc | title = Web crawling ethics revisited: Cost, privacy and denial of service | volume = 57 | issue = 13 | pages = 1771–1779 | journal = Journal of the American Society for Information Science and Technology | doi = 10.1002/asi.20388 }}</ref>

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes. It is worth noting that even a very polite crawler that takes every safeguard to avoid overloading Web servers may still draw complaints from Web server administrators. [[Sergey Brin]] and [[Larry Page]] noted in 1998, "... running a crawler which connects to more than half a million servers ... generates a fair amount of e-mail and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen."<ref name=brin1998>{{cite journal|url=http://infolab.stanford.edu/~backrub/google.html|doi=10.1016/s0169-7552(98)00110-x|title=The anatomy of a large-scale hypertextual Web search engine|journal=Computer Networks and ISDN Systems|volume=30|issue=1–7|pages=107–117|year=1998|last1=Brin|first1=Sergey|last2=Page|first2=Lawrence|s2cid=7587743 }}</ref>
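As an illustration of the adaptive politeness policy described above, the following is a minimal sketch, not code from Mercator itself; the function name <code>polite_fetch</code>, the use of <code>urllib.request</code>, and the fixed factor of 10 are assumptions made for the example.

<syntaxhighlight lang="python">
import time
import urllib.request

def polite_fetch(urls, factor=10.0):
    """Fetch URLs from a single server, waiting factor * t seconds after a
    download that took t seconds (Mercator-style adaptive politeness)."""
    pages = []
    for url in urls:
        start = time.monotonic()
        with urllib.request.urlopen(url) as response:
            pages.append(response.read())
        elapsed = time.monotonic() - start
        # Wait 10 * t seconds before the next request to the same server.
        time.sleep(factor * elapsed)
    return pages
</syntaxhighlight>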