==Content types==
While it is not always possible to directly discover a specific web server's content so that it may be indexed, a site potentially can be accessed indirectly (due to [[Vulnerability (computing)|computer vulnerabilities]]).

To discover content on the web, search engines use [[web crawler]]s that follow hyperlinks through known protocol virtual [[port (computer networking)|port numbers]]. This technique is ideal for discovering content on the surface web but is often ineffective at finding deep web content. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries due to the indeterminate number of queries that are possible.<ref name=":0" /> It has been noted that this can be partially overcome by providing links to query results, but this could unintentionally inflate the popularity of a deep web site.

[[DeepPeep]], [[Intute]], Aleph Open Search, [[Deep Web Technologies]], [[Scirus]], and [[Ahmia.fi]] are a few search engines that have accessed the deep web. Intute ran out of funding and is now a temporary static archive as of July 2011.<ref>{{cite web | url=http://www.intute.ac.uk/faq.html | title=Intute FAQ, dead link | access-date=October 13, 2012}}</ref> Scirus retired near the end of January 2013.<ref>{{cite web|title=Elsevier to Retire Popular Science Search Engine|url=http://library.bldrdoc.gov/newsarc/201312.html|website=library.bldrdoc.gov|date=December 2013|access-date=June 22, 2015|quote=by end of January 2014, Elsevier will be discontinuing Scirus, its free science search engine. Scirus has been a wide-ranging research tool, with over 575 million items indexed for searching, including webpages, pre-print articles, patents, and repositories.|archive-url=https://web.archive.org/web/20150623002452/http://library.bldrdoc.gov/newsarc/201312.html|archive-date=June 23, 2015|url-status=dead}}</ref>

Researchers have been exploring how the deep web can be crawled in an automatic fashion, including content that can be accessed only by special software such as [[Tor (anonymity network)|Tor]]. In 2001, Sriram Raghavan and Hector Garcia-Molina (Stanford Computer Science Department, Stanford University)<ref name=raghavan2000>{{cite web | author = Sriram Raghavan | first2 = Hector | last2 = Garcia-Molina | title = Crawling the Hidden Web | publisher = Stanford Digital Libraries Technical Report | year = 2000 | url = http://ilpubs.stanford.edu:8090/456/1/2000-36.pdf | access-date = December 27, 2008 | archive-date = May 8, 2018 | archive-url = https://web.archive.org/web/20180508094122/http://ilpubs.stanford.edu:8090/456/1/2000-36.pdf | url-status = dead }}</ref><ref>{{cite conference |first=Sriram |last=Raghavan |author2=Garcia-Molina, Hector | year=2001 | title=Crawling the Hidden Web | book-title=Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) | pages=129–38 | url=http://www.dia.uniroma3.it/~vldbproc/017_129.pdf }}</ref> presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep web content.
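The general idea behind such form-querying crawlers can be illustrated with a brief sketch. The example below is not the architecture from the cited papers; the form URL, parameter name, and seed terms are assumptions chosen for illustration. It submits candidate terms to a search form and collects the links on the result pages, reaching content that a purely link-following crawler would miss.

<syntaxhighlight lang="python">
# Minimal sketch of a form-querying ("hidden-Web") crawler.
# The form URL, parameter name, and seed terms below are hypothetical.
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

FORM_URL = "https://example.org/search"        # assumed search form endpoint
QUERY_PARAM = "q"                              # assumed name of the text input
SEED_TERMS = ["genome", "archive", "census"]   # terms a user might supply

class LinkCollector(HTMLParser):
    """Collects href attributes from anchor tags in a result page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_form(form_url, param, terms):
    """Issue one GET request per candidate term and return the result links found."""
    discovered = set()
    for term in terms:
        url = form_url + "?" + urlencode({param: term})
        with urlopen(url) as response:
            page = response.read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(page)
        # Resolve relative links against the result page URL.
        discovered.update(urljoin(url, link) for link in parser.links)
    return discovered

if __name__ == "__main__":
    for link in sorted(crawl_form(FORM_URL, QUERY_PARAM, SEED_TERMS)):
        print(link)
</syntaxhighlight>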
Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of [[University of California, Los Angeles|UCLA]] created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms.<ref>{{cite web | first1 = Alexandros | last1 = Ntoulas | first2 = Petros | last2 = Zerfos | first3 = Junghoo | last3 = Cho | title = Downloading Hidden Web Content | publisher = [[UCLA]] Computer Science | year = 2005 | url = http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf | access-date = February 24, 2009}}</ref> Several form query languages (e.g., DEQUEL<ref>{{cite journal | first1 = Denis | last1 = Shestakov | first2 = Sourav S. | last2 = Bhowmick | first3 = Ee-Peng | last3 = Lim | title = DEQUE: Querying the Deep Web | journal = Data & Knowledge Engineering |volume=52 |issue=3 | pages = 273–311 | year = 2005 | doi = 10.1016/S0169-023X(04)00107-7 | url = http://www.inf.ufsc.br/~r.mello/deepWeb/querying/DKE2005-Sourav.pdf }}</ref>) have been proposed that, besides issuing a query, also allow extraction of structured data from result pages. Another effort is DeepPeep, a project of the [[University of Utah]] sponsored by the [[National Science Foundation]], which gathered hidden-web sources (web forms) in different domains based on novel focused crawler techniques.<ref>{{cite conference | first1 = Luciano | last1 = Barbosa | first2 = Juliana | last2 = Freire | author2-link = Juliana Freire | title = An Adaptive Crawler for Locating Hidden-Web Entry Points | conference = WWW Conference 2007 | year = 2007 | url = http://www.cs.utah.edu/~lbarbosa/publications/ache-www2007.pdf | access-date = March 20, 2009 | archive-date = June 5, 2011 | archive-url = https://web.archive.org/web/20110605082603/http://www.cs.utah.edu/~lbarbosa/publications/ache-www2007.pdf | url-status = dead }}</ref><ref>{{cite conference | first1 = Luciano | last1 = Barbosa | first2 = Juliana | last2 = Freire | author2-link = Juliana Freire | title = Searching for Hidden-Web Databases | conference = WebDB 2005 | year = 2005 | url = http://www.cs.utah.edu/~lbarbosa/publications/webdb2005.pdf | access-date = March 20, 2009 | archive-date = June 5, 2011 | archive-url = https://web.archive.org/web/20110605082629/http://www.cs.utah.edu/~lbarbosa/publications/webdb2005.pdf | url-status = dead }}</ref>

Commercial search engines have begun exploring alternative methods to crawl the deep web. The [[Sitemap Protocol]] (first developed and introduced by Google in 2005) and [[Open Archives Initiative Protocol for Metadata Harvesting|OAI-PMH]] are mechanisms that allow search engines and other interested parties to discover deep web resources on particular web servers. Both mechanisms allow web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not linked directly to the surface web.
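As a concrete illustration of the Sitemap mechanism, a server can advertise otherwise unlinked URLs in a sitemap file that crawlers fetch and parse. The sketch below is a minimal reader for such a file; the sitemap location is an assumed example, while the XML namespace follows the published Sitemap Protocol.

<syntaxhighlight lang="python">
# Sketch of deep web resource discovery via the Sitemap Protocol.
# The sitemap location is hypothetical; the XML format follows sitemaps.org.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://example.org/sitemap.xml"   # assumed location
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def read_sitemap(url):
    """Return the URLs a server advertises in its sitemap,
    even if no surface-web page links to them."""
    with urlopen(url) as response:
        tree = ET.parse(response)
    return [loc.text.strip()
            for loc in tree.iter(SITEMAP_NS + "loc")
            if loc.text]

if __name__ == "__main__":
    for advertised_url in read_sitemap(SITEMAP_URL):
        print(advertised_url)
</syntaxhighlight>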
Google's deep web surfacing system computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep web content.<ref>{{cite conference | first1 = Jayant | last1 = Madhavan | first2 = David | last2 = Ko | first3 = Łucja | last3 = Kot | first4 = Vignesh | last4 = Ganapathy | first5 = Alex | last5 = Rasmussen | first6 = Alon | last6 = Halevy | title = Google's Deep-Web Crawl | publisher = VLDB Endowment, ACM | conference = PVLDB '08, August 23–28, 2008, Auckland, New Zealand | year = 2008 | url = https://homes.cs.washington.edu/~alon/files/vldb08deepweb.pdf | access-date = April 17, 2009 | archive-date = September 16, 2012 | archive-url = https://web.archive.org/web/20120916104001/http://homes.cs.washington.edu/~alon/files/vldb08deepweb.pdf | url-status = dead }}</ref> In this system, the pre-computation of submissions is done using three algorithms:
# selecting input values for text search inputs that accept keywords,
# identifying inputs that accept only values of a specific type (e.g., date), and
# selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.

In 2008, to facilitate users of [[Tor (anonymity network)#Hidden services|Tor hidden services]] in their access and search of a hidden [[.onion]] suffix, [[Aaron Swartz]] designed [[Tor2web]], a proxy application able to provide access by means of common web browsers.<ref name=RELEASE>{{cite web|last=Swartz|first=Aaron|title=In Defense of Anonymity|url=http://www.aaronsw.com/weblog/tor2web|access-date=February 4, 2014}}</ref> Using this application, deep web links appear as a random sequence of letters followed by the .onion [[top-level domain]].
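The address form just described can be illustrated with a short sketch that validates an onion-service hostname (a base32 label of 16 or 56 characters for version 2 and version 3 services, respectively) and rewrites it for a Tor2web-style clearnet gateway. The gateway domain and rewriting convention below are assumptions for illustration, not those of Tor2web or any specific service.

<syntaxhighlight lang="python">
# Sketch of Tor2web-style address rewriting.
# The gateway domain and hostname pattern used for rewriting are hypothetical;
# only the .onion label format (16-character v2 or 56-character v3 base32
# labels) reflects the real addressing scheme.
import re
from urllib.parse import urlsplit, urlunsplit

GATEWAY_DOMAIN = "onion-gateway.example"   # assumed clearnet proxy domain
ONION_LABEL = re.compile(r"^[a-z2-7]{16}$|^[a-z2-7]{56}$")

def rewrite_for_gateway(onion_url):
    """Turn http(s)://<label>.onion/<path> into a URL served via the gateway."""
    parts = urlsplit(onion_url)
    host = parts.hostname or ""
    if not host.endswith(".onion"):
        raise ValueError("not an onion-service URL: " + onion_url)
    label = host[: -len(".onion")]
    if not ONION_LABEL.match(label):
        raise ValueError("malformed onion address: " + host)
    # One possible convention: keep the service's label and append the
    # gateway's own domain so an ordinary browser can resolve it.
    gateway_host = label + ".onion." + GATEWAY_DOMAIN
    return urlunsplit((parts.scheme, gateway_host, parts.path,
                       parts.query, parts.fragment))

if __name__ == "__main__":
    # Made-up 16-character v2-style label, used only as an example.
    print(rewrite_for_gateway("http://abcdefghij234567.onion/about/"))
</syntaxhighlight>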