=====Academic focused crawler=====
An example of focused crawlers are academic crawlers, which crawl free-access academic-related documents, such as the ''citeseerxbot'', the crawler of the [[CiteSeer]]<sup>X</sup> search engine. Other academic search engines include [[Google Scholar]] and [[Microsoft Academic Search]]. Because most academic papers are published in [[PDF]] format, such crawlers are particularly interested in crawling PDF, [[PostScript]], and [[Microsoft Word]] files, including their [[Zipped file|zipped]] formats. Because of this, general open-source crawlers, such as [[Heritrix]], must be customized to filter out other [[MIME types]], or [[middleware]] is used to extract these documents and import them into the focused-crawl database and repository.<ref>{{Cite book | doi=10.1145/2389936.2389949| chapter=Web crawler middleware for search engine digital libraries| title=Proceedings of the twelfth international workshop on Web information and data management - WIDM '12| pages=57| year=2012| last1=Wu| first1=Jian| last2=Teregowda| first2=Pradeep| last3=Khabsa| first3=Madian| last4=Carman| first4=Stephen| last5=Jordan| first5=Douglas| last6=San Pedro Wandelmer| first6=Jose| last7=Lu| first7=Xin| last8=Mitra| first8=Prasenjit| last9=Giles| first9=C. Lee| isbn=9781450317207| s2cid=18513666}}</ref> Identifying whether these documents are academic or not is challenging and can add significant overhead to the crawling process, so it is performed as a post-crawling step using [[machine learning]] or [[regular expression]] algorithms. These academic documents are usually obtained from the home pages of faculties and students or from the publication pages of research institutes. Because academic documents make up only a small fraction of all web pages, good seed selection is important in boosting the efficiency of these web crawlers.<ref>{{Cite book |doi = 10.1145/2380718.2380762|chapter = The evolution of a crawling strategy for an academic document search engine|title = Proceedings of the 3rd Annual ACM Web Science Conference on - Web ''Sci'' '12|pages = 340–343|year = 2012|last1 = Wu|first1 = Jian|last2 = Teregowda|first2 = Pradeep|last3 = Ramírez|first3 = Juan Pablo Fernández|last4 = Mitra|first4 = Prasenjit|last5 = Zheng|first5 = Shuyi|last6 = Giles|first6 = C. Lee|isbn = 9781450312288|s2cid = 16718130}}</ref> Other academic crawlers may download plain text and [[HTML]] files that contain the [[metadata]] of academic papers, such as titles, authors, and abstracts. This increases the overall number of papers collected, but a significant fraction may not provide free PDF downloads.
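
A minimal sketch of such a MIME-type filter, written in Python, might look as follows; the function name <code>keep_document</code> and the exact list of accepted types are illustrative assumptions, not the interface of Heritrix or any particular crawler:

<syntaxhighlight lang="python">
# Sketch of a MIME-type filter for an academic focused crawler.
# The accepted-type list and function name are illustrative only.

# MIME types the crawler keeps: PDF, PostScript, Word, and zipped archives.
ACADEMIC_MIME_TYPES = {
    "application/pdf",
    "application/postscript",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/zip",
    "application/gzip",
}

def keep_document(content_type: str) -> bool:
    """Return True if a fetched response should enter the focused-crawl repository."""
    # Content-Type headers may carry parameters, e.g. "application/pdf; charset=binary",
    # so strip everything after the first semicolon before comparing.
    mime = content_type.split(";")[0].strip().lower()
    return mime in ACADEMIC_MIME_TYPES
</syntaxhighlight>

For example, <code>keep_document("application/pdf; charset=binary")</code> returns <code>True</code>, while an HTML response is filtered out.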
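
Likewise, a simple regular-expression heuristic for the post-crawl classification step could be sketched as below; the cue patterns and the two-cue threshold are illustrative assumptions, not the method used by the cited systems:

<syntaxhighlight lang="python">
import re

# Hypothetical regular-expression heuristic for post-crawl classification:
# flag a document as academic if its extracted text shows the structural
# cues of a paper (an abstract, a reference section, a DOI).
ACADEMIC_CUES = [
    re.compile(r"\babstract\b", re.IGNORECASE),
    re.compile(r"\breferences\b|\bbibliography\b", re.IGNORECASE),
    re.compile(r"\bdoi\s*:?\s*10\.\d{4,9}/\S+", re.IGNORECASE),
]

def looks_academic(text: str, min_cues: int = 2) -> bool:
    """Count how many cues match; require at least min_cues of them."""
    hits = sum(1 for pattern in ACADEMIC_CUES if pattern.search(text))
    return hits >= min_cues
</syntaxhighlight>

A machine-learning classifier would replace the hand-written cue list with features learned from labeled examples, at the cost of needing training data.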