====Focused crawling====
{{Main|Focused crawler}}
The importance of a page for a crawler can also be expressed as a function of the similarity of a page to a given query. Web crawlers that attempt to download pages that are similar to each other are called '''focused crawlers''' or '''topical crawlers'''. The concepts of topical and focused crawling were first introduced by [[Filippo Menczer]]<ref>Menczer, F. (1997). [http://informatics.indiana.edu/fil/Papers/ICML.ps ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery] {{Webarchive|url=https://web.archive.org/web/20121221113620/http://informatics.indiana.edu/fil/Papers/ICML.ps |date=21 December 2012 }}. In D. Fisher, ed., Machine Learning: Proceedings of the 14th International Conference (ICML97). Morgan Kaufmann</ref><ref>Menczer, F. and Belew, R.K. (1998). [http://informatics.indiana.edu/fil/Papers/AA98.ps Adaptive Information Agents in Distributed Textual Environments] {{Webarchive|url=https://web.archive.org/web/20121221113630/http://informatics.indiana.edu/fil/Papers/AA98.ps |date=21 December 2012 }}. In K. Sycara and M. Wooldridge (eds.) Proc. 2nd Intl. Conf. on Autonomous Agents (Agents '98). ACM Press</ref> and by Soumen Chakrabarti ''et al.''<ref>{{cite journal|url=http://www.fxpal.com/people/vdberg/pubs/www8/www1999f.pdf|archive-url=https://web.archive.org/web/20040317210216/http://www.fxpal.com/people/vdberg/pubs/www8/www1999f.pdf|url-status=dead|archive-date=2004-03-17|doi=10.1016/s1389-1286(99)00052-3|title=Focused crawling: A new approach to topic-specific Web resource discovery|journal=Computer Networks|volume=31|issue=11–16|pages=1623–1640|year=1999|last1=Chakrabarti|first1=Soumen|last2=Van Den Berg|first2=Martin|last3=Dom|first3=Byron}}</ref>

The main problem in focused crawling is that, in the context of a Web crawler, we would like to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton<ref name=pinkerton1994>Pinkerton, B. (1994). [https://web.archive.org/web/20010904075500/http://archive.ncsa.uiuc.edu/SDG/IT94/Proceedings/Searching/pinkerton/WebCrawler.html Finding what people want: Experiences with the WebCrawler]. In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.</ref> in one of the first web crawlers of the early days of the Web. Diligenti ''et al.''<ref>Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). [http://clgiles.ist.psu.edu/papers/VLDB-2000-focused-crawling.pdf Focused crawling using context graphs]. In Proceedings of 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.</ref> propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not yet been visited. The performance of a focused crawler depends mostly on the richness of links within the specific topic being searched, and focused crawling usually relies on a general Web search engine to provide starting points.
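A minimal sketch can make the anchor-text heuristic concrete. In the following Python example, the overlap-based scoring function, the frontier layout, and all parameter values are illustrative assumptions, not taken from the cited papers; it keeps a priority queue of unvisited URLs ordered by how well each link's anchor text matches the driving query:

<syntaxhighlight lang="python">
# Minimal focused-crawler sketch: prioritise unvisited links by the overlap
# between their anchor text and the driving query (the Pinkerton-style
# heuristic above). Scoring and parameters are illustrative assumptions.
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []   # list of (href, anchor_text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def anchor_score(anchor_text, query_terms):
    """Fraction of query terms that appear in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & query_terms) / len(query_terms)


def focused_crawl(seed, query, max_pages=20):
    query_terms = set(query.lower().split())
    frontier = [(-1.0, seed)]        # max-heap via negated scores
    seen = {seed}
    fetched = []
    while frontier and len(fetched) < max_pages:
        _score, url = heapq.heappop(frontier)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                 # skip unreachable pages
        fetched.append(url)
        parser = AnchorExtractor()
        parser.feed(html)
        for href, text in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                heapq.heappush(frontier, (-anchor_score(text, query_terms), absolute))
    return fetched
</syntaxhighlight>

In practice, anchor text is usually combined with other cues, such as words in the URL and the text surrounding the link.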
=====Academic focused crawler=====
Examples of [[focused crawlers]] are academic crawlers, which crawl free-access, academic-related documents, such as ''citeseerxbot'', the crawler of the [[CiteSeer]]<sup>X</sup> search engine. Other academic search engines include [[Google Scholar]] and [[Microsoft Academic Search]].

Because most academic papers are published in [[PDF]] format, such a crawler is particularly interested in crawling PDF, [[PostScript]] and [[Microsoft Word]] files, including their [[Zipped file|zipped]] formats. Because of this, general open-source crawlers, such as [[Heritrix]], must be customized to filter out other [[MIME types]], or a [[middleware]] is used to extract these documents and import them into the focused crawl database and repository.<ref>{{Cite book | doi=10.1145/2389936.2389949| chapter=Web crawler middleware for search engine digital libraries| title=Proceedings of the twelfth international workshop on Web information and data management - WIDM '12| pages=57| year=2012| last1=Wu| first1=Jian| last2=Teregowda| first2=Pradeep| last3=Khabsa| first3=Madian| last4=Carman| first4=Stephen| last5=Jordan| first5=Douglas| last6=San Pedro Wandelmer| first6=Jose| last7=Lu| first7=Xin| last8=Mitra| first8=Prasenjit| last9=Giles| first9=C. Lee| isbn=9781450317207| s2cid=18513666}}</ref> Identifying whether these documents are academic or not is challenging and can add significant overhead to the crawling process, so it is performed after crawling, using [[machine learning]] or [[regular expression]] algorithms. Academic documents are usually obtained from the home pages of faculties and students or from the publication pages of research institutes. Because academic documents make up only a small fraction of all web pages, good seed selection is important in boosting the efficiency of these web crawlers.<ref>{{Cite book |doi = 10.1145/2380718.2380762|chapter = The evolution of a crawling strategy for an academic document search engine|title = Proceedings of the 3rd Annual ACM Web Science Conference on - Web ''Sci'' '12|pages = 340–343|year = 2012|last1 = Wu|first1 = Jian|last2 = Teregowda|first2 = Pradeep|last3 = Ramírez|first3 = Juan Pablo Fernández|last4 = Mitra|first4 = Prasenjit|last5 = Zheng|first5 = Shuyi|last6 = Giles|first6 = C. Lee|isbn = 9781450312288|s2cid = 16718130}}</ref> Other academic crawlers may also download plain text and [[HTML]] files that contain [[metadata]] of academic papers, such as titles, authors, and abstracts. This increases the overall number of papers collected, but a significant fraction may not provide free PDF downloads.
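The two filtering stages described above, a crawl-time MIME-type filter and a post-crawl academic-document check, can be sketched briefly. In the following Python fragment, the MIME-type list, regular-expression patterns, and match threshold are illustrative assumptions, not the actual rules of ''citeseerxbot'' or of any particular [[Heritrix]] configuration:

<syntaxhighlight lang="python">
# Sketch of the two stages above: (1) keep only document MIME types while
# crawling, (2) flag likely academic papers afterwards with regular
# expressions. Lists, patterns and threshold are illustrative assumptions.
import re

# MIME types an academic crawler typically keeps (assumed list).
ACADEMIC_MIME_TYPES = {
    "application/pdf",
    "application/postscript",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/zip",
}


def keep_during_crawl(content_type):
    """Crawl-time filter: drop responses with uninteresting MIME types."""
    return content_type.split(";")[0].strip().lower() in ACADEMIC_MIME_TYPES


# Post-crawl regex heuristics for "does this look like an academic paper?".
PATTERNS = [
    re.compile(r"\babstract\b", re.I),
    re.compile(r"\breferences\b|\bbibliography\b", re.I),
    re.compile(r"\bdoi\s*:\s*10\.\d{4,}", re.I),
]


def looks_academic(extracted_text, min_hits=2):
    """Flag a document when enough academic markers match (assumed threshold)."""
    return sum(bool(p.search(extracted_text)) for p in PATTERNS) >= min_hits
</syntaxhighlight>

A machine-learning classifier would replace <code>looks_academic</code> with a model trained on labelled documents, at the cost of the extra overhead noted above.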
=====Semantic focused crawler=====
Another type of focused crawler is the semantic focused crawler, which makes use of [[domain ontology|domain ontologies]] to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization purposes.<ref>{{cite book|chapter-url=https://www.researchgate.net/publication/44241179|doi=10.1007/978-3-642-02457-3_74|chapter=State of the Art in Semantic Focused Crawlers|title=Computational Science and Its Applications – ICCSA 2009|volume=5593|pages=910–924|series=Lecture Notes in Computer Science|year=2009|last1=Dong|first1=Hai|last2=Hussain|first2=Farookh Khadeer|last3=Chang|first3=Elizabeth|isbn=978-3-642-02456-6|hdl=20.500.11937/48288}}</ref> In addition, ontologies can be automatically updated in the crawling process. Dong et al.<ref>{{cite journal|url=https://www.researchgate.net/publication/264620349|doi=10.1002/cpe.2980|title=SOF: A semi-supervised ontology-learning-based focused crawler|journal=Concurrency and Computation: Practice and Experience|volume=25|issue=12|pages=1755–1770|year=2013|last1=Dong|first1=Hai|last2=Hussain|first2=Farookh Khadeer|s2cid=205690364}}</ref> introduced such an ontology-learning-based crawler, using a [[support-vector machine]] to update the content of ontological concepts when crawling Web pages.
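A toy example suggests how ontology-based page scoring might work. In the Python sketch below, the miniature ontology, its term lists, and the depth-based weighting are illustrative assumptions; the crawler of Dong et al. instead learns the content of concepts with a support-vector machine:

<syntaxhighlight lang="python">
# Toy sketch of semantic focused crawling: a hand-built mini-ontology maps
# concepts to indicative terms, and a page is scored by which concepts it
# matches. Ontology, terms and weighting are illustrative assumptions.

# concept -> (parent concept or None, indicative terms)
ONTOLOGY = {
    "information retrieval": (None, {"retrieval", "query", "ranking"}),
    "web crawling": ("information retrieval", {"crawler", "frontier", "spider"}),
    "focused crawling": ("web crawling", {"focused", "topical", "relevance"}),
}


def concept_depth(concept):
    """Depth in the ontology tree; more specific concepts score higher."""
    depth = 0
    parent = ONTOLOGY[concept][0]
    while parent is not None:
        depth += 1
        parent = ONTOLOGY[parent][0]
    return depth


def semantic_score(page_text):
    """Sum of (1 + depth) over concepts whose terms appear in the page."""
    words = set(page_text.lower().split())
    score = 0
    matched = []
    for concept, (_, terms) in ONTOLOGY.items():
        if terms & words:
            score += 1 + concept_depth(concept)
            matched.append(concept)
    return score, matched


# Pages matching specific concepts outrank pages with only generic terms.
print(semantic_score("a focused crawler estimates relevance before download"))
</syntaxhighlight>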