==Content types==
While it is not always possible to directly discover a specific web server's content so that it may be indexed, a site potentially can be accessed indirectly (due to [[Vulnerability (computing)|computer vulnerabilities]]).

To discover content on the web, search engines use [[web crawler]]s that follow hyperlinks through known protocol virtual [[port (computer networking)|port numbers]]. This technique is ideal for discovering content on the surface web but is often ineffective at finding deep web content. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries due to the indeterminate number of queries that are possible.<ref name=":0" /> It has been noted that this can be partially overcome by providing links to query results, but this could unintentionally inflate the popularity of a deep web site.

[[DeepPeep]], [[Intute]], Aleph Open Search, [[Deep Web Technologies]], [[Scirus]], and [[Ahmia.fi]] are a few search engines that have accessed the deep web. Intute ran out of funding and is now a temporary static archive as of July 2011.<ref>{{cite web | url=http://www.intute.ac.uk/faq.html | title=Intute FAQ, dead link | access-date=October 13, 2012}}</ref> Scirus retired near the end of January 2013.<ref>{{cite web|title=Elsevier to Retire Popular Science Search Engine|url=http://library.bldrdoc.gov/newsarc/201312.html|website=library.bldrdoc.gov|date=December 2013|access-date=June 22, 2015|quote=by end of January 2014, Elsevier will be discontinuing Scirus, its free science search engine. Scirus has been a wide-ranging research tool, with over 575 million items indexed for searching, including webpages, pre-print articles, patents, and repositories.|archive-url=https://web.archive.org/web/20150623002452/http://library.bldrdoc.gov/newsarc/201312.html|archive-date=June 23, 2015|url-status=dead}}</ref>

Researchers have been exploring how the deep web can be crawled in an automatic fashion, including content that can be accessed only by special software such as [[Tor (anonymity network)|Tor]]. In 2001, Sriram Raghavan and Hector Garcia-Molina (Stanford Computer Science Department, Stanford University)<ref name=raghavan2000>{{cite web | author = Sriram Raghavan | first2 = Hector | last2 = Garcia-Molina | title = Crawling the Hidden Web | publisher = Stanford Digital Libraries Technical Report | year = 2000 | url = http://ilpubs.stanford.edu:8090/456/1/2000-36.pdf | access-date = December 27, 2008 | archive-date = May 8, 2018 | archive-url = https://web.archive.org/web/20180508094122/http://ilpubs.stanford.edu:8090/456/1/2000-36.pdf | url-status = dead }}</ref><ref>{{cite conference |first=Sriram |last=Raghavan |author2=Garcia-Molina, Hector | year=2001 | title=Crawling the Hidden Web | book-title=Proceedings of the 27th International Conference on Very Large Data Bases (VLDB) | pages=129–38 | url=http://www.dia.uniroma3.it/~vldbproc/017_129.pdf }}</ref> presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep web content.
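The general idea behind such form-querying crawlers can be illustrated with a brief sketch. The example below is not the architecture from the cited papers; the form URL, parameter name, and seed terms are assumptions chosen for illustration. It submits candidate terms to a search form and collects the links on the result pages, reaching content that a purely link-following crawler would miss.

<syntaxhighlight lang="python">
# Minimal sketch of a form-querying ("hidden-Web") crawler.
# The form URL, parameter name, and seed terms below are hypothetical.
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

FORM_URL = "https://example.org/search"        # assumed search form endpoint
QUERY_PARAM = "q"                              # assumed name of the text input
SEED_TERMS = ["genome", "archive", "census"]   # terms a user might supply

class LinkCollector(HTMLParser):
    """Collects href attributes from anchor tags in a result page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_form(form_url, param, terms):
    """Issue one GET request per candidate term and return the result links found."""
    discovered = set()
    for term in terms:
        url = form_url + "?" + urlencode({param: term})
        with urlopen(url) as response:
            page = response.read().decode("utf-8", errors="replace")
        parser = LinkCollector()
        parser.feed(page)
        # Resolve relative links against the result page URL.
        discovered.update(urljoin(url, link) for link in parser.links)
    return discovered

if __name__ == "__main__":
    for link in sorted(crawl_form(FORM_URL, QUERY_PARAM, SEED_TERMS)):
        print(link)
</syntaxhighlight>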
Alexandros Ntoulas, Petros Zerfos, and Junghoo Cho of [[University of California, Los Angeles|UCLA]] created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms.<ref>{{cite web | first1 = Alexandros | last1 = Ntoulas | first2 = Petros | last2 = Zerfos | first3 = Junghoo | last3 = Cho | title = Downloading Hidden Web Content | publisher = [[UCLA]] Computer Science | year = 2005 | url = http://oak.cs.ucla.edu/~cho/papers/ntoulas-hidden.pdf | access-date = February 24, 2009}}</ref> Several form query languages (e.g., DEQUEL<ref>{{cite journal | first1 = Denis | last1 = Shestakov | first2 = Sourav S. | last2 = Bhowmick | first3 = Ee-Peng | last3 = Lim | title = DEQUE: Querying the Deep Web | journal = Data & Knowledge Engineering |volume=52 |issue=3 | pages = 273–311 | year = 2005 | doi = 10.1016/S0169-023X(04)00107-7 | url = http://www.inf.ufsc.br/~r.mello/deepWeb/querying/DKE2005-Sourav.pdf }}</ref>) have been proposed that, besides issuing a query, also allow extraction of structured data from result pages. Another effort is DeepPeep, a project of the [[University of Utah]] sponsored by the [[National Science Foundation]], which gathered hidden-web sources (web forms) in different domains based on novel focused crawler techniques.<ref>{{cite conference | first1 = Luciano | last1 = Barbosa | first2 = Juliana | last2 = Freire | author2-link = Juliana Freire | title = An Adaptive Crawler for Locating Hidden-Web Entry Points | conference = WWW Conference 2007 | year = 2007 | url = http://www.cs.utah.edu/~lbarbosa/publications/ache-www2007.pdf | access-date = March 20, 2009 | archive-date = June 5, 2011 | archive-url = https://web.archive.org/web/20110605082603/http://www.cs.utah.edu/~lbarbosa/publications/ache-www2007.pdf | url-status = dead }}</ref><ref>{{cite conference | first1 = Luciano | last1 = Barbosa | first2 = Juliana | last2 = Freire | author2-link = Juliana Freire | title = Searching for Hidden-Web Databases | conference = WebDB 2005 | year = 2005 | url = http://www.cs.utah.edu/~lbarbosa/publications/webdb2005.pdf | access-date = March 20, 2009 | archive-date = June 5, 2011 | archive-url = https://web.archive.org/web/20110605082629/http://www.cs.utah.edu/~lbarbosa/publications/webdb2005.pdf | url-status = dead }}</ref>

Commercial search engines have begun exploring alternative methods to crawl the deep web. The [[Sitemap Protocol]] (first developed and introduced by Google in 2005) and [[Open Archives Initiative Protocol for Metadata Harvesting|OAI-PMH]] are mechanisms that allow search engines and other interested parties to discover deep web resources on particular web servers. Both mechanisms allow web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not linked directly to the surface web.
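As a concrete illustration of the Sitemap mechanism, a server can advertise otherwise unlinked URLs in a sitemap file that crawlers fetch and parse. The sketch below is a minimal reader for such a file; the sitemap location is an assumed example, while the XML namespace follows the published Sitemap Protocol.

<syntaxhighlight lang="python">
# Sketch of deep web resource discovery via the Sitemap Protocol.
# The sitemap location is hypothetical; the XML format follows sitemaps.org.
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_URL = "https://example.org/sitemap.xml"   # assumed location
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def read_sitemap(url):
    """Return the URLs a server advertises in its sitemap,
    even if no surface-web page links to them."""
    with urlopen(url) as response:
        tree = ET.parse(response)
    return [loc.text.strip()
            for loc in tree.iter(SITEMAP_NS + "loc")
            if loc.text]

if __name__ == "__main__":
    for advertised_url in read_sitemap(SITEMAP_URL):
        print(advertised_url)
</syntaxhighlight>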
Google's deep web surfacing system computes submissions for each HTML form and adds the resulting HTML pages into the Google search engine index. The surfaced results account for a thousand queries per second to deep web content.<ref>{{cite conference | first1 = Jayant | last1 = Madhavan | first2 = David | last2 = Ko | first3 = Łucja | last3 = Kot | first4 = Vignesh | last4 = Ganapathy | first5 = Alex | last5 = Rasmussen | first6 = Alon | last6 = Halevy | title = Google's Deep-Web Crawl | publisher = VLDB Endowment, ACM | conference = PVLDB '08, August 23–28, 2008, Auckland, New Zealand | year = 2008 | url = https://homes.cs.washington.edu/~alon/files/vldb08deepweb.pdf | access-date = April 17, 2009 | archive-date = September 16, 2012 | archive-url = https://web.archive.org/web/20120916104001/http://homes.cs.washington.edu/~alon/files/vldb08deepweb.pdf | url-status = dead }}</ref> In this system, the pre-computation of submissions is done using three algorithms:
# selecting input values for text search inputs that accept keywords,
# identifying inputs that accept only values of a specific type (e.g., date), and
# selecting a small number of input combinations that generate URLs suitable for inclusion into the Web search index.

In 2008, to facilitate users of [[Tor (anonymity network)#Hidden services|Tor hidden services]] in their access and search of a hidden [[.onion]] suffix, [[Aaron Swartz]] designed [[Tor2web]], a proxy application able to provide access by means of common web browsers.<ref name=RELEASE>{{cite web|last=Swartz|first=Aaron|title=In Defense of Anonymity|url=http://www.aaronsw.com/weblog/tor2web|access-date=February 4, 2014}}</ref> Using this application, deep web links appear as a random sequence of letters followed by the .onion [[top-level domain]].
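The address form just described can be illustrated with a short sketch that validates an onion-service hostname (a base32 label of 16 or 56 characters for version 2 and version 3 services, respectively) and rewrites it for a Tor2web-style clearnet gateway. The gateway domain and rewriting convention below are assumptions for illustration, not those of Tor2web or any specific service.

<syntaxhighlight lang="python">
# Sketch of Tor2web-style address rewriting.
# The gateway domain and hostname pattern used for rewriting are hypothetical;
# only the .onion label format (16-character v2 or 56-character v3 base32
# labels) reflects the real addressing scheme.
import re
from urllib.parse import urlsplit, urlunsplit

GATEWAY_DOMAIN = "onion-gateway.example"   # assumed clearnet proxy domain
ONION_LABEL = re.compile(r"^[a-z2-7]{16}$|^[a-z2-7]{56}$")

def rewrite_for_gateway(onion_url):
    """Turn http(s)://<label>.onion/<path> into a URL served via the gateway."""
    parts = urlsplit(onion_url)
    host = parts.hostname or ""
    if not host.endswith(".onion"):
        raise ValueError("not an onion-service URL: " + onion_url)
    label = host[: -len(".onion")]
    if not ONION_LABEL.match(label):
        raise ValueError("malformed onion address: " + host)
    # One possible convention: keep the service's label and append the
    # gateway's own domain so an ordinary browser can resolve it.
    gateway_host = label + ".onion." + GATEWAY_DOMAIN
    return urlunsplit((parts.scheme, gateway_host, parts.path,
                       parts.query, parts.fragment))

if __name__ == "__main__":
    # Made-up 16-character v2-style label, used only as an example.
    print(rewrite_for_gateway("http://abcdefghij234567.onion/about/"))
</syntaxhighlight>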