Conference item

Browserless web data extraction: challenges and opportunities

Abstract:
Most modern web scrapers use an embedded browser to render web pages and to simulate user actions. Such scrapers (or wrappers) are therefore expensive to execute, in terms of time and network traffic. In contrast, it is magnitudes more resource-efficient to use a “browserless” wrapper which directly accesses a web server through HTTP requests, and takes the desired data directly from the raw replies. However, creating and maintaining browserless wrappers of high precision requires specialists, and is prohibitively labor-intensive at scale. In this paper, we demonstrate the principal feasibility of automatically translating browser-based wrappers into “browserless” wrappers. We present the first algorithm and system performing such an automated translation on suitably restricted types of web sites. This system works in the vast majority of test cases and produces very fast and extremely resource-efficient wrappers. We discuss research challenges for extending our approach to a general method applicable to a yet larger number of cases.
Publication status:
Published
Peer review status:
Peer reviewed

Files:
Publisher copy:
10.1145/3178876.3186008

Authors


More by this author
Institution:
University of Oxford
Division:
MPLS
Department:
Computer Science
Role:
Author
More by this author
Institution:
University of Oxford
Division:
MPLS Division
Department:
Computer Science
Role:
Author
More by this author
Institution:
University of Oxford
Division:
MPLS Division
Department:
Computer Science
Role:
Author
More by this author
Institution:
University of Oxford
Division:
MPLS Division
Department:
Computer Science
Role:
Author


Publisher:
Association for Computing Machinery
Host title:
WWW '18 Proceedings of the 2018 World Wide Web Conference
Journal:
Web Conference
Pages:
1095-1104
Publication date:
2018-04-10
Acceptance date:
2017-12-22
DOI:
10.1145/3178876.3186008
ISBN:
9781450356398


Keywords:
Pubs id:
pubs:820391
UUID:
uuid:c72a7aac-b296-419c-bf5b-3d1661c305e4
Local pid:
pubs:820391
Source identifiers:
820391
Deposit date:
2018-02-20
