Journal article

Automatically learning gazetteers from the deep web.

Abstract:
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain-specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach, AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
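The bootstrapping idea in the abstract can be illustrated with a minimal sketch. This is not AMBER's implementation: the function name `bootstrap_gazetteer`, the `min_support` threshold, and the representation of a page as a list of records keyed by field position (a stand-in for a DOM path) are all assumptions made for illustration. The sketch assumes records have already been segmented, and it only shows how repeated structure lets a few known terms vouch for unknown terms in the same position.

```python
from collections import Counter


def bootstrap_gazetteer(seed, pages, iterations=4, min_support=0.5):
    """Grow a term set from a seed using repeated record structure.

    pages: list of pages; each page is a list of records; each record is a
    dict mapping a field position (hypothetical stand-in for a DOM path)
    to the text found there.
    """
    gazetteer = set(seed)
    for _ in range(iterations):
        learned = set()
        for records in pages:
            if not records:
                continue
            # Per field position, count the records whose value is already known.
            hits = Counter(
                pos
                for rec in records
                for pos, text in rec.items()
                if text in gazetteer
            )
            for pos, count in hits.items():
                # If enough records agree, assume the position holds the
                # attribute and accept every value found there.
                if count / len(records) >= min_support:
                    learned.update(rec[pos] for rec in records if pos in rec)
        if learned <= gazetteer:
            break  # nothing new was learned; stop early
        gazetteer |= learned
    return gazetteer


# Illustrative data only: two result pages with a location and a price field.
pages = [
    [{"loc": "Oxford", "price": "500"}, {"loc": "Cambridge", "price": "600"}],
    [{"loc": "Cambridge", "price": "450"}, {"loc": "Leeds", "price": "700"}],
]
locations = bootstrap_gazetteer({"Oxford"}, pages)
```

In this toy run, "Cambridge" is learned from the first page in the first iteration, which in turn lets "Leeds" be learned from the second page in the next iteration. Price values are never added because no seed term ever appears in that position. A real system would also need to guard against noisy positions, which is why the abstract stresses that the approach depends on highly accurate extraction.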

Publisher copy:
10.1145/2187980.2188044


Publisher:
ACM
Journal:
WWW (Companion Volume)
Pages:
341-344
Publication date:
2012-01-01
Language:
English
Pubs id:
pubs:332652
UUID:
uuid:48482079-371d-46cd-8614-6a8e8c439d2c
Local pid:
pubs:332652
Source identifiers:
332652
Deposit date:
2012-12-19
