Editing Web crawler (section)

===Re-visit policy===
The Web has a very dynamic nature, and crawling a fraction of the Web can take weeks or months. By the time a Web crawler has finished its crawl, many events could have happened, including creations, updates, and deletions.

From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions are freshness and age.<ref>{{Cite conference | publisher = ACM | doi = 10.1145/342009.335391 | isbn = 1-58113-217-4 | pages = 117–128 | author1 = Junghoo Cho | author2 = Hector Garcia-Molina | title = Synchronizing a database to improve freshness | book-title = Proceedings of the 2000 ACM SIGMOD international conference on Management of data | location = Dallas, Texas, United States | access-date = 2009-03-23 | year = 2000 | url = http://www.cs.brown.edu/courses/cs227/2002/cache/Cho.pdf}}</ref>

'''Freshness''': This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page ''p'' in the repository at time ''t'' is defined as:

: <math>
F_p(t) = \begin{cases} 1 & {\rm
if}~p~{\rm~is~equal~to~the~local~copy~at~time}~t\\ 0 & {\rm otherwise} \end{cases}
</math>

'''Age''': This is a measure that indicates how outdated the local copy is. The age of a page ''p'' in the repository, at time ''t'' is defined as:

: <math>
A_p(t) = \begin{cases} 0 & {\rm if}~p~{\rm~is~not~modified~at~time}~t\\ t - {\rm modification~time~of}~p &
{\rm otherwise} \end{cases}
</math>

[[Edward G. Coffman, Jr.|Coffman]] ''et al.'' worked with a definition of the objective of a Web crawler that is equivalent to freshness, but use a different wording: they propose that a crawler must minimize the fraction of time pages remain outdated. They also noted that the problem of Web crawling can be modeled as a multiple-queue, single-server polling system, on which the Web crawler is the server and the Web sites are the queues. Page modifications are the arrival of the customers, and switch-over times are the interval between page accesses to a single Web site. Under this model, mean waiting time for a customer in the polling system is equivalent to the average age for the Web crawler.<ref name="coffman">{{Cite journal | doi = 10.1002/(SICI)1099-1425(199806)1:1<15::AID-JOS3>3.0.CO;2-K | volume = 1 | issue = 1 | pages = 15–29 | author1 = E. G. Coffman Jr | author2 = Zhen Liu | author3 = Richard R. Weber | title = Optimal robot scheduling for Web search engines | journal = Journal of Scheduling | year = 1998 | citeseerx = 10.1.1.36.6087 }}</ref>

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are.

[[File:Web Crawling Freshness Age.png|thumb|Evolution of Freshness and Age in a web crawler]]
Two simple re-visiting policies were studied by Cho and Garcia-Molina:<ref name=cho2003>{{Cite journal |doi = 10.1145/958942.958945|title = Effective page refresh policies for Web crawlers|journal = ACM Transactions on Database Systems|volume = 28|issue = 4|pages = 390–426|year = 2003|last1 = Cho|first1 = Junghoo|last2 = Garcia-Molina|first2 = Hector|s2cid = 147958}}</ref>
* Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
* Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

In both cases, the repeated crawling order of pages can be done either in a random or a fixed order.

Cho and Garcia-Molina proved the surprising result that, in terms of average freshness, the uniform policy outperforms the proportional policy in both a simulated Web and a real Web crawl. Intuitively, the reasoning is that, as web crawlers have a limit to how many pages they can crawl in a given time frame, (1) they will allocate too many new crawls to rapidly changing pages at the expense of less frequently updating pages, and (2) the freshness of rapidly changing pages lasts for shorter period than that of less frequently changing pages. In other words, a proportional policy allocates more resources to crawling frequently updating pages, but experiences less overall freshness time from them.

To improve freshness, the crawler should penalize the elements that change too often.<ref name="cho2003a">{{Cite journal | doi = 10.1145/857166.857170 | volume = 3 | issue = 3 | pages = 256–290 | author1 = Junghoo Cho | author2 = Hector Garcia-Molina | title = Estimating frequency of change | journal =  ACM Transactions on Internet Technology| year = 2003 | citeseerx = 10.1.1.59.5877 | s2cid = 9362566 }}</ref> The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In both cases, the optimal is closer to the uniform policy than to the proportional policy: as [[Edward G. Coffman, Jr.|Coffman]] ''et al.'' note, "in order to minimize the expected obsolescence time, the accesses to any particular page should be kept as evenly spaced as possible".<ref name="coffman" /> Explicit formulas for the re-visit policy are not attainable in general, but they are obtained numerically, as they depend on the distribution of page changes. Cho and Garcia-Molina show that the exponential distribution is a good fit for describing page changes,<ref name="cho2003a" /> while [[Panos Ipeirotis|Ipeirotis]] ''et al.'' show how to use statistical tools to discover parameters that affect this distribution.<ref>Ipeirotis, P., Ntoulas, A., Cho, J., Gravano, L. (2005) [http://pages.stern.nyu.edu/~panos/publications/icde2005.pdf Modeling and managing content changes in text databases] {{Webarchive|url=https://web.archive.org/web/20050905013119/http://pages.stern.nyu.edu/~panos/publications/icde2005.pdf |date=5 September 2005 }}. In Proceedings of the 21st IEEE International Conference on Data Engineering, pages 606-617, April 2005, Tokyo.</ref> The re-visiting policies considered here regard all pages as homogeneous in terms of quality ("all pages on the Web are worth the same"), something that is not a realistic scenario, so further information about the Web page quality should be included to achieve a better crawling policy.