==World Wide Web applications==

IE has been the focus of the MUC conferences. The proliferation of the [[World Wide Web|Web]], however, intensified the need for IE systems that help people cope with the [[data deluge|enormous amount of data]] available online. Systems that perform IE from online text should be low-cost, flexible in development, and easily adaptable to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis designed for unstructured text does not exploit the HTML/[[XML]] tags and the layout formats available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using [[Wrapper (data mining)|wrappers]], which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task requiring a high level of expertise, so [[machine learning]] techniques, either [[Supervised learning|supervised]] or [[Unsupervised learning|unsupervised]], have been used to induce such rules automatically. A minimal hand-written wrapper is sketched below.

''Wrappers'' typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent effort on ''adaptive information extraction'' motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural-language knowledge and can thus also be applied to less structured texts.

A recent{{when|date=March 2017}} development is Visual Information Extraction,<ref>{{cite arXiv|eprint=1506.08454|title=WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction|first1=Vijil|last1=Chenthamarakshan|first2=Prasad M|last2=Deshpande|first3=Raghu|last3=Krishnapuram|first4=Ramakrishnan|last4=Varadarajan|first5=Knut|last5=Stolze|year=2015|class=cs.CL}}</ref><ref>{{cite CiteSeerX|citeseerx=10.1.1.21.8236|title=Visual Web Information Extraction with Lixto|first1=Robert|last1=Baumgartner|first2=Sergio|last2=Flesca|first3=Georg|last3=Gottlob|year=2001|pages=119–128}}</ref> which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered web page. This helps in extracting entities from complex web pages that exhibit a visual pattern but lack a discernible pattern in the HTML source code.
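The following is a minimal sketch of a hand-written wrapper of the kind used on highly structured pages such as product catalogs. The HTML snippet, CSS class names, and extracted fields are hypothetical, chosen only to illustrate the idea that such rules exploit the regular page template rather than linguistic analysis; a real wrapper would be keyed to a specific site's markup.

<syntaxhighlight lang="python">
import re

# Hypothetical fragment of a product-catalog page: every record is a
# table row with fixed cell classes (illustrative, not from a real site).
CATALOG_HTML = """
<table class="products">
  <tr><td class="name">USB cable</td><td class="price">$4.99</td></tr>
  <tr><td class="name">HDMI adapter</td><td class="price">$12.50</td></tr>
</table>
"""

# The wrapper rule: a pattern anchored to the page template's HTML tags,
# not to the linguistic content of the text.
ROW_RULE = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td class="price">\$(?P<price>[\d.]+)</td>'
)

records = [(m["name"], float(m["price"])) for m in ROW_RULE.finditer(CATALOG_HTML)]
print(records)  # [('USB cable', 4.99), ('HDMI adapter', 12.5)]
</syntaxhighlight>

Because the rule depends entirely on the page template, it achieves high accuracy on pages generated from that template but breaks as soon as the layout changes, which is why inducing such rules automatically with machine learning is attractive.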
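The spirit of a visual extraction rule can be shown schematically: instead of matching the HTML source, the rule operates on the bounding boxes that a browser's layout engine assigns to rendered text regions. The coordinates, labels, and the <code>right_of</code> helper below are invented for illustration and do not reproduce the rule algebra of the cited systems.

<syntaxhighlight lang="python">
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    """A rendered text region, as a browser's layout engine would report it."""
    text: str
    x: float  # left edge, in rendered pixels
    y: float  # top edge
    w: float  # width
    h: float  # height

def right_of(label: Region, candidates: List[Region],
             max_gap: float = 200.0) -> Optional[Region]:
    """Return the nearest region on the same visual line, to the label's right."""
    same_line = [
        r for r in candidates
        if abs(r.y - label.y) < label.h                    # vertically aligned
        and 0 < r.x - (label.x + label.w) < max_gap        # nearby, to the right
    ]
    return min(same_line, key=lambda r: r.x, default=None)

# Invented example: the price value sits visually next to its label even
# though no HTML pattern connects them.
regions = [
    Region("Price:", 40, 120, 50, 16),
    Region("$19.99", 100, 121, 55, 16),  # adjacent on the rendered page
    Region("$5.00", 400, 300, 45, 16),   # elsewhere on the page
]
label = regions[0]
match = right_of(label, regions[1:])
print(match.text if match else None)  # $19.99
</syntaxhighlight>

Such proximity-based rules capture layouts where the association between a label and its value is visible only after rendering, which is the case the HTML-level wrappers above cannot handle.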