== Overview ==
A Web crawler starts with a list of [[Uniform Resource Locator|URLs]] to visit. Those first URLs are called the ''seeds''. As the crawler visits these URLs, by communicating with [[web server]]s that respond to those URLs, it identifies all the [[hyperlink]]s in the retrieved web pages and adds them to the list of URLs to visit, called the ''[[crawl frontier]]''. URLs from the frontier are [[Recursion|recursively]] visited according to a set of policies. If the crawler is performing archiving of [[website]]s (or [[web archiving]]), it copies and saves the information as it goes. The archives are usually stored in such a way that they can be viewed, read and navigated as if they were on the live web, but are preserved as 'snapshots'.<ref name="GoogleBooks-9237380">{{cite book |url=https://www.springer.com/gp/book/9783540233381 |title=Web Archiving |isbn=978-3-540-46332-0 |date=15 February 2007 |publisher=Springer |access-date=24 April 2014 |page=1 |first=Julien |last=Masanès}}</ref>

The archive is known as the ''repository'' and is designed to store and manage the collection of [[web page]]s. The [[Repository (version control)|repository]] only stores [[HTML]] pages, and these pages are stored as distinct files. A repository is similar to any other system that stores data, like a modern-day database. The only difference is that a repository does not need all the functionality offered by a database system. The repository stores the most recent version of the web page retrieved by the crawler.{{Cn|date=February 2023}}

The large volume of the Web implies that the crawler can only download a limited number of web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that pages may already have been updated or even deleted by the time the crawler gets to them.

The number of possible URLs generated by server-side software has also made it difficult for web crawlers to avoid retrieving [[duplicate content]]. Endless combinations of [[HTTP]] GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer four options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of [[thumbnail]] size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This [[mathematical combination]] creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards ''et al.'' noted, "Given that the [[Bandwidth (computing)|bandwidth]] for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained."<ref name=edwards2001>{{Cite book |author=Edwards, J.; McCurley, K. S.; and Tomlin, J. A. |title=Proceedings of the 10th international conference on World Wide Web |chapter=An adaptive model for optimizing performance of an incremental web crawler |pages=106–113 |year=2001 |doi=10.1145/371920.371960 |url=http://www10.org/cdrom/papers/210/index.html |isbn=978-1581133486 |citeseerx=10.1.1.1018.1506 |s2cid=10316730 |access-date=25 January 2007 |archive-date=25 June 2014 |archive-url=https://web.archive.org/web/20140625233510/http://www10.org/cdrom/papers/210/index.html |url-status=dead}}</ref> A crawler must carefully choose at each step which pages to visit next.
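
The seed-and-frontier process described above can be illustrated with a minimal sketch. The following Python example is illustrative only and is not drawn from any particular crawler: the function names, the breadth-first queue, and the in-memory repository are assumptions, and a real crawler would additionally honor [[robots.txt]], apply politeness delays, and follow the selection and prioritization policies mentioned above.

<syntaxhighlight lang="python">
# Minimal illustrative crawler: seed URLs, a crawl frontier, and recursive
# link extraction. This is a sketch, not a production crawler: it ignores
# robots.txt, politeness delays, and download prioritization.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # the crawl frontier (FIFO queue = breadth-first)
    visited = set()           # URLs already fetched
    repository = {}           # most recent copy of each retrieved page

    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except Exception:
            continue              # unreachable or unreadable page; skip it
        repository[url] = html    # save a snapshot of the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)       # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)       # grow the frontier
    return repository


if __name__ == "__main__":
    pages = crawl(["https://example.org/"], max_pages=5)
    print(len(pages), "pages fetched")
</syntaxhighlight>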
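The combinatorial explosion in the photo-gallery example above can be counted directly: four sort orders, three thumbnail sizes, two file formats, and a two-state toggle for user-provided content give 4 × 3 × 2 × 2 = 48 distinct URLs for the same underlying content. A small sketch (the parameter names and the gallery URL are hypothetical; only the counts come from the example above) enumerates them:

<syntaxhighlight lang="python">
# Counting the URL variants in the photo-gallery example above.
# The parameter names (sort, thumb, format, hide) and the host are
# hypothetical; only the option counts come from the article text.
from itertools import product

sort_orders = ["name", "date", "size", "rating"]    # 4 ways to sort images
thumb_sizes = ["small", "medium", "large"]          # 3 thumbnail sizes
file_formats = ["jpg", "png"]                       # 2 file formats
hide_user_content = ["0", "1"]                      # toggle for user content

urls = [
    f"https://gallery.example/?sort={s}&thumb={t}&format={f}&hide={h}"
    for s, t, f, h in product(sort_orders, thumb_sizes,
                              file_formats, hide_user_content)
]
print(len(urls))   # 4 * 3 * 2 * 2 = 48 URLs for the same set of content
</syntaxhighlight>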