Conference item
Browserless web data extraction: challenges and opportunities
- Abstract:
- Most modern web scrapers use an embedded browser to render web pages and to simulate user actions. Such scrapers (or wrappers) are therefore expensive to execute, in terms of time and network traffic. In contrast, it is orders of magnitude more resource-efficient to use a “browserless” wrapper which directly accesses a web server through HTTP requests, and takes the desired data directly from the raw replies. However, creating and maintaining browserless wrappers of high precision requires specialists, and is prohibitively labor-intensive at scale. In this paper, we demonstrate the principal feasibility of automatically translating browser-based wrappers into “browserless” wrappers. We present the first algorithm and system performing such an automated translation on suitably restricted types of web sites. This system works in the vast majority of test cases and produces very fast and extremely resource-efficient wrappers. We discuss research challenges for extending our approach to a general method applicable to a yet larger number of cases.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Publisher copy:
- 10.1145/3178876.3186008
Authors
- Publisher:
- Association for Computing Machinery
- Host title:
- WWW '18 Proceedings of the 2018 World Wide Web Conference
- Journal:
- Web Conference
- Pages:
- 1095-1104
- Publication date:
- 2018-04-10
- Acceptance date:
- 2017-12-22
- DOI:
- ISBN:
- 9781450356398
- Pubs id:
- pubs:820391
- UUID:
- uuid:c72a7aac-b296-419c-bf5b-3d1661c305e4
- Local pid:
- pubs:820391
- Source identifiers:
- 820391
- Deposit date:
- 2018-02-20
Terms of use
- Copyright holder:
- IW3C2 (International World Wide Web Conference Committee)
- Copyright date:
- 2018
- Notes:
- Copyright © 2018 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC BY 4.0 License.
- Licence:
- CC Attribution (CC BY)