RED: Redundancy-Driven Data Extraction from Result Pages

Data-driven websites are mostly accessed through search interfaces. Such sites follow a common publishing pattern that, surprisingly, has not been fully exploited for unsupervised data extraction yet: the result of a search is presented as a paginated list of result records. Each result record contains the main attributes of one single object, and links to a page dedicated to the details of that object. We present red, an automatic approach and a prototype system for extracting data records from sites following this publishing pattern. red leverages the inherent redundancy between result records and corresponding detail pages to design an effective, yet fully-unsupervised and domain-independent method. It is able to extract from result pages all the attributes of the objects that appear both in the result records and in the corresponding detail pages. With respect to previous unsupervised methods, our method does not require any a priori domain-dependent knowledge (e.g., an ontology), and can achieve a significantly higher accuracy while automatically selecting only object attributes, a task that is out of the scope of traditional fully unsupervised approaches. With respect to previous supervised or semi-supervised methods, red can reach similar accuracy in many domains (e.g., job postings) without requiring supervision for each domain, let alone each website.


INTRODUCTION AND OVERVIEW
Data-driven websites across many application domains (e.g., real estate agencies, e-commerce) follow a common publishing pattern to provide access to the underlying data: A search interface to query specific types of objects (e.g., products, properties) yields the relevant objects as a list of result records, often paginated into result pages. Every result record contains key attributes about one single object, so that the user can get a first impression of the result before deciding to dig further into detailed information. Each result record includes a link to a detail page, containing more attributes [24].
To the best of our knowledge, no existing data extraction system has targeted the inherent intra-site data redundancy underlying this popular publishing pattern. Rather, most existing unsupervised approaches [2,8,9,29,30,34,35] can be applied either over a collection of result pages, or over a collection of detail pages. Other approaches rely on the same publishing pattern, but focus only on segmenting the result pages [28], or rely on the much weaker signals arising from aligning the labels of the fields on the search form directly against the labels of the data on the detail pages [37]; finally, there are several approaches that focus on the problem of finding redundancy [3,38] among several sites, but that problem quickly turns into the problem of integrating data coming from autonomous sources [3,5,6,23,38], which is well known not to have a simple solution [14].
This paper proposes red, the first data extraction method that leverages this intra-site redundancy to extract data records from result pages. Given a set of result pages and the corresponding set of detail pages from a website, red infers a collection of extraction rules (or simply rules). Each one is capable of extracting the values for a certain attribute, e.g., the price of a product, from all the records of a result page. These rules can be applied to every result page of the same website to extract all the relevant data.
We show that exploiting this publishing pattern on a site provides a low-hanging fruit opportunity for significantly improving the precision of fully unsupervised data extraction. Existing unsupervised approaches struggle to distinguish relevant data (i.e., object attributes) from noise. Moreover, they are based on a fragile trade-off between the expressiveness of the formalism used for describing the extraction rules and the efficiency of the learning task. Conversely, intra-site redundancy distinguishes, without any supervision, relevant data that appears both on result records and on detail pages from noise that only appears on either, and allows red to filter the latter out.

Figure 1 shows a (simplified) instance of a real estate website that serves as running example. It consists of a result page containing six records along with their corresponding detail pages. The result records and detail pages publish attributes such as Price, Location, number of Beds, and Type of properties. Colored annotations highlight overlapping values for attributes appearing within the records in the result page and the corresponding detail pages. For this example, red is capable of generating rules extracting the correct values of all attributes for all result records. Figure 2a shows the values extracted by the extraction rules when applied to the example result page in input.
red generates these rules in three sequential steps: (1) During the extraction rules generation step, red generates two sets of rules, namely the sets of result rules and detail rules: Figure 2b and Figure 2c show the values extracted by some of the generated rules (shown in Figure 5a and Figure 5b) on the result page and on the detail pages, respectively, for our running example (where r_i (d_i) denotes the i-th result (detail) rule generated). red's extraction rules generation algorithm (Section 3) selects the rules from a fragment of XPath [13] expressions that, as empirically verified over many real websites, includes correct extraction rules.
The main challenge in this step is generating a complete set of rules efficiently: at least one correct detail rule and one correct result rule is needed for every attribute. It turns out that the completeness of the rules generation algorithm can be achieved only at the expense of using a rather expressive family of extraction rules, which means generating many candidate rules, most of which actually select noise.
(2) In the second step, redundancy seeking, red aims to identify pairs of result/detail rules that are considered redundant, as they extract the same values for every object. In this quest for redundancy, red is faced with various challenges. First, in addition to neat redundancy between correct rules, there is inevitably a lot of noisy redundancy: Some values may occur multiple times across the records (typical if the search is narrow, e.g., only properties in London, or because attributes have a narrow domain, e.g., Beds). Moreover, some values may be repeated several times within the same record or detail page (the third detail page contains a second occurrence of 'Studio', dotted-blue underlined in Figure 1).
To complicate matters, certain attributes may be optional, i.e., they may not report a value for some result records, such as the Price attribute, which is missing for the third and sixth properties of our running example. Both rules r_0 and d_0 select Price values: d_0 is applied to each of the detail pages separately, and can therefore extract the price value if it occurs on the page, or just report that the value is missing (indicated by nil). Conversely, r_0 is applied once on the whole result page (and not on the individual records): it can only extract the price values found on the page, irrespective of which record they belong to. Therefore, taken alone, r_0 is not able to distinguish which records contain a value and which do not (in our example, the third and last records). That leaves r_0 and d_0 mis-"aligned", and without further consideration they would not be considered redundant. This problem has traditionally been tackled with complex and error-prone segmentation techniques for splitting a result page into records [19,28,29]. Instead, red devises an innovative soft-segmentation technique (Section 4.1) which, again, leverages the underlying intra-site redundancy: In particular, red exploits the presence of navigational links (named detail links) that point from each result record to the corresponding detail page. In Figure 1, for each record, one occurrence of the detail link is highlighted (by the departing orange solid arrow). All these detail link occurrences are captured by an extraction rule l, named link rule, whose generation is facilitated by the knowledge of the URLs of the detail pages given in input to red.
(3) In its final step, named noise removal, red addresses the issue of finding and discarding pairs of rules that exhibit some limited redundancy, but not enough to likely be correct extraction rules. red addresses this challenge by implementing a noise cleaning process (Section 5) that aims at separating the redundancy resulting from the result/detail publishing pattern from that arising by accident.
Finally, red outputs all the result extraction rules left after the previous steps, along with the special link rule, as shown in Figure 2a for the running example. It is worth noting that, while in this paper we focus on data extraction from result page records, red is perfectly capable of inferring valid extraction rules also for attributes published on the detail pages. Beyond space reasons, we focus on result rules because executing only such rules at a large scale is much cheaper: Once the rules have been generated, no more detail pages have to be crawled, and that is a significant saving, as there are many more detail pages than result pages.
The paper is organized around the main contributions of red:
• The first formal description of the typical result/detail publishing pattern and the definition of the Finding Result Attributes Problem (Section 2);
• A highly accurate, domain-independent, and fully unsupervised data extraction method that leverages the redundancy implied by the publishing pattern on real-life websites. It is presented in three steps: a rules generation algorithm to efficiently generate complete sets of extraction rules (Section 3); a technique to measure the redundancy of rules (Section 4), even in the presence of optional attributes (Section 4.1); and finally, a technique to separate and validate regular, attribute redundancy from noisy, accidental redundancy (Section 5);
• A proof-of-concept showing the feasibility of the approach, and an experimental evaluation (Section 6) showing that red can attain highly accurate results, scoring at rule level an F-measure F_r > 91%, and even better results at value level with F_v > 96%. We demonstrate its advantages against state-of-the-art large scale data extraction systems.

PROBLEM DESCRIPTION
The result/detail publishing pattern depicted in Figure 1 assumes that a website publishes data coming from an underlying abstract relation with attributes A containing all the available information about a set of objects O. In our example, the abstract relation contains one tuple for every published property, each with attributes such as the property's Price, Address, number of Beds, and Type.
Result pages and detail pages are generated by various scripts that retrieve data from the abstract relation, embed them into an HTML template, and publish them as web pages. This process performs a sequence of transformations: An initial selection (σ , to recycle a symbol from relational algebra) produces the set of objects to publish in each result page as encoded into HTML source code by the result page script λ. Every result page contains a set of result records, each produced by applying the result record script λ r to one published object. The corresponding detail page is the result of applying the detail script λ d on the very same object, and is linked from the corresponding result record.
According to the model depicted in Figure 1, even if the detail and result record scripts work on the same set of objects, they may end up publishing different sets of attributes. The detail pages include all the attributes of the abstract relation, whereas the result pages publish only a subset of these attributes selected by a projection operation π .
We now introduce the problem of recovering the values of the attributes published on the result pages. An extraction rule either locates one or several string values from a single page, or it produces a distinct special value nil to denote the absence of a value. For our purposes, extraction rules are specified by using XPath expressions [13] belonging to a simple but carefully designed fragment (as detailed in Section 3).
We distinguish two types of extraction rules, named detail and result rules after the type of pages they are meant to work on.
We call detail rule any extraction rule that when applied on a detail page, produces at most one value or nil. We use d to denote a detail extraction rule, and we use d(p) to denote the value it extracts from detail page p. Precisely, d(p) is either the XPath string-value obtained by applying the rule d on detail page p, or it is nil if the XPath expression returns an empty node-set [13].
A detail rule for an attribute A is said to be correct if it extracts its values from every detail page in the input and extracts nil for those pages that do not contain a value for A. We call noisy (or incorrect) a rule that is not correct: noisy rules mix values of several attributes or they extract part of the underlying HTML template. We distinguish the special case of a partially-correct rule for an attribute A, i.e., a rule extracting only correct values of attribute A except for some pages, on which it wrongly extracts nil. The value extracted by a detail rule is naturally associated with the object corresponding to the detail page it is applied to.
Similarly, we call result rule any extraction rule meant to work on result pages and we denote it by r . A result rule applied on a result page p produces zero, one, or several values denoted r (p). Precisely, r returns the string-value for each node in the node-set returned by evaluating its XPath expression on the result page p [13].

Result rules, differently from detail rules, cannot produce nil values at all, even when they return an empty node-set. Indeed, without knowing the exact boundaries of every result record, the values extracted by a result rule cannot be trivially associated with the record/object they belong to on the result page (Section 4.1 describes a solution to this problem).
Therefore, the notion of correctness of a result rule for an attribute A differs from that of a detail rule, even if they are associated with the same attribute: a result rule is correct if, applied on a result page, it produces exactly the ordered sequence of non-nil values of attribute A as they occur in the source of that page. By partially-correct result rule for an attribute A we mean a result rule extracting a strict subset of the correct values of A.
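As a concrete illustration of these rule-application semantics, the following sketch evaluates detail and result rules with Python's standard xml.etree.ElementTree. This is a simplified stand-in, not red's implementation: ElementTree supports only a restricted XPath subset, so rules here select elements (and we read the element's text) rather than text nodes, and all function names are ours.

```python
import xml.etree.ElementTree as ET

NIL = None  # distinct marker for "no value on this page"

def apply_detail_rule(rule, page):
    """Detail rule semantics: at most one value per detail page,
    or NIL when the rule selects an empty node-set."""
    nodes = ET.fromstring(page).findall(rule)
    return (nodes[0].text or "") if nodes else NIL

def apply_result_rule(rule, page):
    """Result rule semantics: zero or more values from one result page,
    returned in document order, never NIL."""
    return [(n.text or "") for n in ET.fromstring(page).findall(rule)]
```

For instance, applying a (hypothetical) price rule to a detail page lacking a price returns NIL, whereas applying it to a whole result page returns the flat list of all price occurrences, with no record boundaries.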
The problem that we aim to solve can be formulated as follows:

Problem (Finding Result Attributes). Given a set of result pages P_r and the corresponding set of detail pages P_d over the same objects, find the correct values (including nil values) for every attribute A of the abstract relation published in the result pages.
red tries to solve this problem by producing a result rule r_A for every attribute A in the result pages, together with an additional result rule, named link rule and denoted by l, extracting exactly one occurrence per result record of the link leading to the corresponding detail page: an output result rule r_A associated with an optional attribute A is suitably annotated either r_A^a or r_A^b to specify how the values extracted by r_A should be padded with missing nil values, i.e., respectively after or before occurrences of the detail links.

EXTRACTION RULES GENERATION
We introduce an extraction rules generation algorithm working on both result and detail pages. It is a single-parameter algorithm designed to output rules from an XPath fragment whose expressiveness can be easily and effectively tuned by setting the parameter. Its goals are two-fold: on one hand, it aims at being complete, i.e., it has to generate at least one correct rule for every attribute; on the other hand, it should not generate too many noisy rules, because their presence makes the Finding Result Attributes problem harder.
The rules generation algorithm includes two main steps: template analysis, and extraction rules enumeration.

Template Analysis
The template analysis aims at identifying those nodes in the DOM tree [26] of the input pages that are template nodes (e.g., the <li> node and the textual label 'Price:' in Figure 3a). red's template analysis algorithm is inspired by ExAlg [2], suitably adapted to the result/detail publishing pattern. Our analysis just aims at deciding, for every node in the pages, whether it should be considered as part of the template or not; conversely, ExAlg also solves the complex problem of finding a full description of the HTML template.
The analysis on a set of input pages P builds on the notion of occurrence-vector, i.e., a vector f of |P| integers indexed by the pages in P, so that f(p) reports how many occurrences of equivalent nodes are present in the DOM tree of page p. Two nodes are said to be equivalent if and only if two conditions hold: (i) they are either text nodes with the same value, or they are both element nodes associated with the same element name and attribute names; (ii) their respective parent nodes are equivalent (or they are both root nodes). Equivalent nodes with the same occurrence-vector are then grouped into the same equivalence class.
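Under simplifying assumptions, the occurrence-vector computation can be sketched as follows. Here two nodes are approximated as equivalent when they share the same tag-path from the root (and, for text nodes, the same value) — a coarser notion than the recursive definition above; all names and the data layout are illustrative, not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

def occurrence_vectors(pages):
    """For each (approximated) equivalence class of nodes, build the
    occurrence-vector f indexed by pages, where f[p] counts how many
    occurrences of the class appear on page p."""
    per_page = []
    for page in pages:
        counts = Counter()
        def visit(node, path):
            key = path + (node.tag,)
            counts[key] += 1                  # element class: tag-path from root
            if node.text and node.text.strip():
                counts[key + (node.text.strip(),)] += 1   # text class
            for child in node:
                visit(child, key)
        visit(ET.fromstring(page), ())
        per_page.append(counts)
    classes = defaultdict(lambda: [0] * len(pages))
    for i, counts in enumerate(per_page):
        for key, n in counts.items():
            classes[key][i] = n
    return classes   # class key -> occurrence-vector across the input pages
```

On two toy detail pages, a template label such as 'Price:' ends up in a class occurring once on every page, while each value text occurs on its own page only — exactly the statistical signal the template analysis exploits.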
As observed in [2], by considering a sufficiently large number of pages, the nodes that occur in large and frequently occurring classes form the scaffold of the underlying template. An inherent and significant limitation of this kind of statistical analysis is that the inferred equivalence classes easily become noisy as their support (total number of occurrences) or size (total number of nodes) decreases. Indeed, ExAlg's template inference process is extremely brittle in the presence of noisy equivalence classes and cannot deal with singleton collections of input pages.
As in the original proposal, we use thresholds on the minimum size and on the minimum support to prevent the algorithm from generating too many noisy classes. However, it is worth noticing that in our setting the template analysis is just an optimization aiming at reducing the number of generated extraction rules, and we have empirically observed that a single result page, usually containing up to tens of records, is already enough for the specific goals of our analysis.
The classes whose occurrence-vector exactly reports the number of objects published per page are called the root equivalence classes: their nodes occur exactly once in every detail page and as many times as the number of result records in every result page (information that is available as part of red's input). We classify as template nodes those occurring in the root equivalence classes or in any other equivalence class that appears less frequently than the root equivalence class in every page.

Example 3.1. The root equivalence classes for the running example are ϵ^r_0 and ϵ^d_0, reported in Figure 4; the resulting template nodes are depicted with a gray background in Figure 3. Notice that all the nodes of the root equivalence classes are template nodes. ϵ^d_2, ϵ^d_3, ϵ^d_4, and ϵ^d_5 are discarded because their support is too small: indeed, they are noisy classes listing values that occur more than once by chance. ϵ^r_3 is discarded because it occurs more frequently than the root equivalence class ϵ^r_0. Nodes in ϵ^r_1 and ϵ^r_2 are correctly classified as template nodes, whereas nodes in ϵ^r_4 and ϵ^r_5 are erroneously classified as template nodes. The wrong classification of a value as a template node will not prevent the generation of correct rules for the corresponding attribute, as long as other occurrences of the values of that attribute are correctly classified as target values.

Extraction Rules Enumeration
The rules generation algorithm builds on the output of the template analysis to enumerate result (respectively, detail) extraction rules associated with tree-paths starting from a uniquely identifiable template node within the result records (resp., detail page), named pivot, and reaching a non-template target value node.

We use as candidate pivot every template node that is either a text, or an element with an 'id' or 'class' attribute; every non-template text node is considered as a candidate target value. We then enumerate all the possible tree-paths leading from any pivot node to any target node by hopping over the tree using a set of predefined XPath step expressions. Namely, we consider expressions capable of moving to the parent element node, to one of the children (either text or element), and to the next/previous sibling element.
A couple of additional constraints on the tree-paths are enforced in order to reduce the number of enumerated tree-paths and to discard those associated with incorrect rules. First, the path length is bounded by a threshold δ. Second, the analysis considers only a small subset of the input pages, which is assumed unbiased (we use at most 3 detail pages and 1 result page). Finally, each tree-path is translated into an executable extraction rule by concatenating several XPath step sub-expressions.
The first XPath step has to match the pivot, and takes one of the following forms depending on its node type: //e[contains(., "v")] for a textual pivot v having e as parent node; //e[@id="v"] (resp., //e[@class="v"]), for an element pivot e having an id (resp., class) attribute valued v; just //e for any other element pivot.
Then follows a sequence of XPath step expressions, one for each of the hops composing the tree-path from the pivot to the target value: parent, child, following-sibling (abbreviated fs in Figure 5a), and preceding-sibling (ps); each step (except those on the parent axis) is also followed by an XPath positional predicate, e.g., '[4]', and by a node-test: the last step uses text() to target textual values, while all the other steps use '*' to select element nodes.
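The assembly of an executable rule from a pivot and a tree-path can be sketched as a small string builder. The encoding of pivots and hops below is our own, hypothetical representation; only the produced XPath syntax follows the construction described above.

```python
def build_rule(pivot, hops):
    """Assemble an XPath extraction rule from a pivot descriptor and a
    tree-path of hops.
    pivot: ('text', parent_tag, value) | ('id', tag, value)
         | ('class', tag, value)      | ('element', tag, None)
    hops:  sequence of ('parent', None) | ('child', pos)
         | ('fs', pos) | ('ps', pos); the last hop targets a text node."""
    kind, tag, value = pivot
    if kind == 'text':
        expr = f'//{tag}[contains(., "{value}")]'
    elif kind in ('id', 'class'):
        expr = f'//{tag}[@{kind}="{value}"]'
    else:
        expr = f'//{tag}'
    axis = {'child': 'child', 'fs': 'following-sibling', 'ps': 'preceding-sibling'}
    for i, (hop, pos) in enumerate(hops):
        if hop == 'parent':
            expr += '/parent::*'        # no positional predicate on the parent axis
            continue
        node_test = 'text()' if i == len(hops) - 1 else '*'
        expr += f'/{axis[hop]}::{node_test}[{pos}]'
    return expr
```

For the running example's rule r_6, the pivot 'at' with a following-sibling hop and a child hop yields //i[contains(., "at")]/following-sibling::*[1]/child::text()[1] (the fs/text() shorthand of Figure 5a written out in full).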
Example 3.2. Consider rule r_6 shown in Figure 5a: it is based on the 3-step tree-path 'at i strong W2U 0DF London' from the pivot node 'at' to the Location value. The XPath expression is obtained by combining: the expression selecting the pivot node (//i[contains(., 'at')]); the expression fs::*[1] to move to the next sibling element; and the final expression text()[1] to reach the target text value. A few tree-paths generating the result rules are shown directly in Figure 3 by means of colored dashed lines annotated with the rule and connecting the pivot to the target value across all the nodes in the path.

red groups the generated rules by extracted values. Within a group producing the same values, only the rule associated with the shortest tree-path is retained; ties are broken by selecting the rule generated earlier. In the running example, r_7 is removed as it extracts the same values as r_5. The rationale for preferring shorter paths is that template nodes far away from the target values are progressively less likely to generate rules that work reliably across all records and pages. More sophisticated criteria, such as those considering the robustness of the extraction rule [11,22], could be adopted, but this is beyond the scope of the present paper.
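The grouping step just described can be sketched as follows; the triple encoding of a rule is ours, and generation order is assumed to coincide with list order.

```python
def dedupe_rules(rules):
    """Among rules extracting identical value sequences, keep only the one
    with the shortest tree-path; ties go to the rule generated earliest.
    rules: list of (rule_id, path_length, extracted_values) in generation order."""
    best = {}
    for rid, plen, values in rules:
        key = tuple(values)
        if key not in best or plen < best[key][1]:
            best[key] = (rid, plen)
    return [rid for rid, _ in best.values()]
```

With this policy, a hypothetical r7 with a longer path than r5 but identical extracted values is dropped, mirroring the running example.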

REDUNDANCY SEEKING
red analyzes the redundancy between all pairs of generated result and detail rules for identifying the likely correct rules. From an abstract point of view, a pair of result/detail rules can be considered redundant when the two rules end up extracting values associated with the same attribute of the abstract relation for every object.
Concretely, given one such pair composed of one result rule and one detail rule, a pairwise comparison of the extracted values is not trivial, as the rules might extract a different number of values due to the presence of optional attributes. It is necessary to find the correct alignment of the rules, i.e., where nil should be inserted to indicate a missing value within the list of values extracted by the result rule. We introduce a novel soft segmentation technique to find the alignment between result rules and detail rules, thus avoiding the tricky problem of finding the exact boundaries of every result record [20,28]; then, we describe a score function for measuring the redundancy of a pair composed of one aligned result rule and one detail rule.

Soft Segmentation
The presence of optional attributes in the result pages can lead to the generation of rules that extract fewer values than the number of records. Soft segmentation is red's technique for finding the correct alignment of the extracted values (w.r.t. the result records).
A prerequisite of the soft segmentation technique is the availability of exactly one detail link per result record, i.e., the one pointing to the corresponding detail page. red leverages the knowledge of the detail pages, whose URLs are assumed given as part of the input, to locate the detail link occurrences within the result records. As all occurrences must be associated with the same "slot" of the underlying template, we have to disambiguate all the cases in which several copies of a detail link occur within the same result record, or optional detail links are present (e.g., in the first and third featured result records in Figure 1).
We apply the rules generation algorithm described in Section 3 by taking any of the link occurrences as target, and generate a set of link rules. Only the rules extracting the same number of links as the number of result records are considered, and in the presence of several candidates, only the rule generated earliest is retained as the link rule.
The soft segmentation technique finds the correct alignment of the values extracted by a result rule w.r.t. the detail link occurrences during a traversal of the DOM tree. If a result rule is correct and extracts exactly the same number of values as the number of result records, the extracted values perfectly interleave with the detail link occurrences, i.e., either the values always occur before the links or they always occur after them. Conversely, in the presence of result rules extracting fewer values, a correct alignment is needed. For dealing with these cases, nil values should be injected to reestablish the pattern that either every link precedes a value or vice-versa. At most two interleaving sequences of the values of a result rule, called alignments, are possible: given the result rule r, we use r^a (resp., r^b) to denote the alignment of the rule in which all nodes extracted by r are considered as occurring after (resp., before) the corresponding detail link in every result record.

The soft segmentation algorithm takes as input a result rule r, the link rule l, and a set of result pages. It visits every page by ordering links and values into a sequence of occurrences and checks whether it is possible to inject nil values into the sequence to make values (including nil) and links interleave perfectly. If so, it outputs at most two admissible alignments of the result rule, namely r^a and r^b (we simply write r where the alignments coincide). Table 1 shows the alignments for all the result rules in Figure 5a w.r.t. link rule l for our running example. For rule r_1, the presence of two consecutive links (l_5 and l_6) prevents values and links from interleaving perfectly; r_1 is therefore removed.
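A minimal sketch of the interleaving check, assuming the page has already been flattened into a document-order sequence of detail-link and value occurrences (this encoding, and the function names, are ours):

```python
NIL = None  # marker for a missing value in a record

def align(occurrences):
    """Soft segmentation sketch. `occurrences` is the document-order sequence
    of ('L', None) detail-link hits and ('V', value) result-rule hits.
    Returns the admissible alignments as a dict: 'a' (values after their
    link) and/or 'b' (values before their link), each with one value
    (or NIL) per record."""
    def try_after(seq):
        slots, current = [], None            # current = index of the open record
        for kind, val in seq:
            if kind == 'L':
                slots.append(NIL)            # a link opens a new record slot
                current = len(slots) - 1
            elif current is None or slots[current] is not NIL:
                return None                  # value before any link, or two values
            else:
                slots[current] = val
        return slots
    out = {}
    a = try_after(occurrences)
    if a is not None:
        out['a'] = a
    # 'before' alignment: reverse the sequence and reuse the same check
    b = try_after(list(reversed(occurrences)))
    if b is not None:
        out['b'] = b[::-1]
    return out
```

On a page where the second record lacks a value, the 'a' alignment injects a nil between the two consecutive links, while the 'b' alignment is rejected because a value would precede any link.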

Redundancy Score
Given a pair (r^*, d) composed of one aligned result rule (with * ∈ {a, b}) and one detail rule, and a set of objects O, the redundancy score of the pair is defined as the average pairwise distance between their values:

red(r^*, d) = (1/|O|) · Σ_{o ∈ O} dist(d(o), r^*(o))

where d(o) and r^*(o) denote the string extracted for object o by d and r^*, respectively, and dist is a Jensen-Shannon string distance modified for handling nil: if both values are nil, it returns 0; if either is nil, but not the other, it returns 1; in all other cases, it is a standard string distance. We observed that changes of format for the values of an attribute within the same site are very rare on real sites, as they would also confuse the end user; an exception in our running example (Figure 1) is the Price attribute, displayed as 1.5k in the result record but presented as 1,500 on the corresponding detail page. Hence, we observed only negligibly different results when adopting any other popular string distance function [7]. In practice, dealing with the substring cases is more important than the choice of the string distance function. Table 2 reports all the pairs scoring less than 0.4, ordered by their redundancy score, for our running example. Each pair is composed of an aligned result rule from Table 1 and a detail rule from Figure 5b. The column "Attributes" reports whether the rules of the pair are correct, noisy, partially correct, or mis-aligned.
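As an illustration (a sketch, not red's actual implementation: we use a Jensen-Shannon divergence over character frequencies as a stand-in for the paper's modified string distance), the nil handling and the averaging can be coded as:

```python
import math
from collections import Counter

def string_distance(v1, v2):
    """Nil-aware distance: 0 if both nil, 1 if exactly one is nil,
    otherwise a Jensen-Shannon divergence over character frequencies
    (a stand-in for the modified distance used by red)."""
    if v1 is None and v2 is None:
        return 0.0
    if v1 is None or v2 is None:
        return 1.0
    p = {c: n / len(v1) for c, n in Counter(v1).items()}
    q = {c: n / len(v2) for c, n in Counter(v2).items()}
    m = {c: (p.get(c, 0) + q.get(c, 0)) / 2 for c in set(p) | set(q)}
    def kl(a):  # Kullback-Leibler divergence w.r.t. the mixture m
        return sum(pa * math.log2(pa / m[c]) for c, pa in a.items())
    return (kl(p) + kl(q)) / 2        # in [0, 1] with log base 2

def redundancy_score(r_values, d_values):
    """Average pairwise distance between the values extracted by an
    aligned result rule and by a detail rule, over the same objects."""
    pairs = list(zip(r_values, d_values))
    return sum(string_distance(a, b) for a, b in pairs) / len(pairs)
```

A perfectly redundant pair scores 0, e.g., `redundancy_score(["flat", None], ["flat", None])`, while disjoint strings score 1.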

NOISE REMOVAL
red uses the redundancy score to filter out pairs of rules that are somewhat, but not sufficiently, redundant. Given a redundancy score threshold ρ, we consider as not redundant all the pairs of rules having a redundancy score greater than ρ. Unfortunately, many of the remaining pairs can still contain noisy rules: on the one hand, there can be an incorrect alignment of the values, due to the presence of too many similar values in the result records, e.g., (r^b_8, d_7) in Table 2; on the other hand, even noisy rules are sometimes incidentally similar, e.g., (r^a_8, d_5) and (r^a_5, d_4). This is especially true when the range of possible values of the compared attributes is rather limited; for example, (the number of) Rooms, Beds, and Baths in the real estate domain are all small positive integers. Generally speaking, the correct result rules cannot be trivially separated from the incorrect ones with any fixed value of the threshold. red processes the redundant pairs by ascending redundancy score to remove the noisy ones. First, the result rule validation analyzes whether a result rule has been correctly aligned and therefore properly fits the HTML template of the result pages. Then, the redundant noise removal leverages the redundancy scores to select, for each attribute, only the best pairs of rules, i.e., those having the lowest redundancy score.

Result Rules Validation
The result rules validation technique relies on the availability of some correctly-aligned result rules output by the previous processing steps. Besides the link rule, which is assumed to be correct, several other result rules may have only one alignment after the soft segmentation process, for example r^b_0 and r_2 in Figure 1. red leverages these already aligned, and therefore inherently validated, rules to validate other result rules.
A result rule r is validated against an already correctly-aligned rule g that is assumed correct. We can thereby infer two possible traversal sequences of their nodes, corresponding to two positional alignments: s_b, in which the extracted nodes of r are before those of g, and s_a, in which they are after those of g. Then, as described in Section 4.1, we obtain the real traversal sequence s by traversing the DOM tree. If s does not match either s_b or s_a, r is not a valid result rule: we thereby remove r and all the redundant pairs containing r. In our running example, neither of the sequences inferred for r^a_8 against r_9 matches the actual sequence in which these values occur on the page: ⟨W2U 0DF..., 3, W3U 1AF..., 4, 2, E15 2CD..., 3, E15 2XD..., 1, 2⟩. Therefore, r^a_8 fails to validate w.r.t. r_9 and is dropped.
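The validation check can be sketched as follows (hypothetical representation: per-record value lists for r and g, and the actual DOM-order sequence):

```python
def validate(r_vals, g_vals, actual):
    """A result rule r is valid w.r.t. an already-validated rule g only
    when the actual DOM-order sequence matches one of the two inferred
    traversals: r's node before g's in every record (s_b), or after (s_a)."""
    s_b = [x for rec in zip(r_vals, g_vals) for x in rec]  # r before g
    s_a = [x for rec in zip(g_vals, r_vals) for x in rec]  # r after g
    return actual in (s_b, s_a)

# r's values consistently follow g's in every record: valid.
assert validate([1, 2], ['a', 'b'], ['a', 1, 'b', 2])
# Mixed order across records: neither s_b nor s_a matches, so r is dropped.
assert not validate([1, 2], ['a', 'b'], ['a', 1, 2, 'b'])
```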

Removing Redundant Noise
Once all the result rules have been validated, the surviving pairs include correct pairs together with partially correct and noisy redundant pairs that are incidentally similar.
The key assumption for removing the noisy pairs in the first category is that correct pairs have a better redundancy score than pairs containing noisy rules. This assumption is exploited in the first step of the removeRedNoise algorithm shown in Listing 1 (lines 2-9), which deals with already validated result rules.
The second step of the algorithm (lines 10-17) prefers pairs of rules extracting fewer nil values: at this stage, all the noisy pairs have already been removed, so we can directly remove the result rules extracting a larger number of nils. Notice that, with the exception of the initial ordering (line 11 vs line 3), the two steps are performed by two almost identical loops (lines 2-9 vs lines 10-17). However, they cannot be merged, because the latter loop expects as input a list of pairs from which the noisy rules have already been removed, as produced by the first loop. After the two processing steps, the surviving pairs are the output result rules that red classifies as correct.
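Under the assumption that each pair carries its redundancy score, that `overlaps(...)` tells whether two result rules extract overlapping nodes, and that `nils[...]` counts the nil values extracted by a rule (all hypothetical names, not Listing 1's actual code), the two passes can be sketched as:

```python
def remove_red_noise(pairs, overlaps, nils):
    """pairs: (result_rule, detail_rule, score) triples.
    Pass 1 (cf. lines 2-9): by ascending score, drop a pair when an
    overlapping result rule with a strictly better score was already kept."""
    kept1 = []
    for r, d, s in sorted(pairs, key=lambda p: p[2]):
        if not any(overlaps(r, r2) and s2 < s for r2, _, s2 in kept1):
            kept1.append((r, d, s))
    # Pass 2 (cf. lines 10-17): among the surviving ties, prefer the
    # pair whose rules extract fewer nil values.
    kept2 = []
    for r, d, s in sorted(kept1, key=lambda p: nils[p[0]]):
        if not any(overlaps(r, r2) for r2, _, _ in kept2):
            kept2.append((r, d, s))
    return kept2

# Illustrative data mimicking Example 5.2 and the 0-scored tie:
pairs = [("r6", "d4", 0.0), ("r8a", "d5", 0.0), ("r5", "d4", 0.2)]
ovl = lambda a, b: {a, b} <= {"r5", "r6", "r8a"}  # all three overlap
nils = {"r6": 0, "r8a": 2, "r5": 0}
```

On this data, pass 1 drops (r5, d4) for its worse score and pass 2 drops (r8a, d5) for its nils, leaving only (r6, d4).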
Example 5.2. Consider again the two pairs involving d_4 in Example 4.4, i.e., (r_6, d_4) and (r_5, d_4). The first step of removeRedNoise (lines 2-9) removes the latter pair, i.e., the one with the worse redundancy score, since r_5 and r_6 extract overlapping nodes and only one of them can be correct.
Both pairs (r_6, d_4) and (r^a_8, d_5) are 0-scored, and r_6 and r^a_8 are two overlapping result rules: at most one of them can be correct. Both pairs reach the second loop (lines 10-17). The pair (r^a_8, d_5) is then removed because it contains the larger number of nil elements. Table 3 reports, ordered by redundancy score, the pairs surviving all the steps of the noise removal process. The pairs (r_2, d_1) and (r_2, d_6) are both 0-scored, and the correct rule r_2 for Type is included in the output for its redundancy with the Description, an attribute on the detail pages that often "includes" the Type as a substring. red's output on the running example is thus the set of aligned result rules r^b_0, r_2, r_6, and r_9, together with the link rule l.

EVALUATION
We inspect red's characteristics and performance through a series of experiments on real-world websites. We first evaluate red on a dataset derived from the one used by diadem [18], a state-of-the-art ontology-based data extraction system. We then apply red to another dataset comprising websites from a variety of application domains, a challenging setting for ontology-based data extraction systems, which require some supervision at the level of the application domain. We show that, in contrast, red can attain consistently high precision and recall over many domains.
For the comparison with fully unsupervised approaches besides diadem, we also considered depta [41], another unsupervised system that, to the best of our knowledge, deals with the setting most similar to ours: unsupervised extraction from result pages, even from a single result page.

Evaluation method
Datasets: We evaluate red on two datasets, named dia_ds and red_ds, including 130 websites in total. dia_ds consists of 100 still-reachable websites from the original diadem dataset, half in the real_estate domain and half in the used_cars domain. We use diadem to generate wrappers on these sites and to collect the detail pages needed by red for the dataset.
red_ds consists of 30 sites from 10 domains, obtained by randomly picking 3 sites for each domain from the Alexa Top 100 Global Sites. We exclude sites where the detail pages are not hosted on the site itself but rather refer to other sites (typically metasearch portals), as these cases do not fit red's assumption that detail pages share a common template. Each test case relates to a single website and consists of one result page and the corresponding set of detail pages. All the records listed in the result pages are obtained by following the default sorting criteria of the website. Both datasets are available at https://github.com/redwww/experiments.
Optional Attributes: To validate the need for dealing with optional attributes in result pages, we observed that a considerable number of sites in our datasets contain optional attributes: 20% of the sites in dia_ds (i.e., 20 sites) and 43% of those in red_ds (i.e., 13 sites). This illustrates the importance of approaches, such as the soft segmentation technique, that deal with optional attributes in result pages.
Metrics: For each attribute of the result pages in our datasets, we manually crafted the correct XPath expression extracting the correct data as a gold standard, based on which we computed the number of true positives (tp), false negatives (fn), and false positives (fp), the precision P = tp/(tp + fp), the recall R = tp/(tp + fn), and the F1-measure F = (2 · P · R)/(P + R).
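These definitions translate directly into code:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1-measure from raw counts,
    following the definitions above."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# e.g., 8 correct values extracted, 2 noisy ones, none missed:
p, r, f = prf(8, 2, 0)   # p = 0.8, r = 1.0
```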
These metrics have been computed both at the (macro) rule level and at the (micro) level of extracted values. At rule level, we consider a rule correct only if it extracts a set of values perfectly matching that extracted by the golden rule, without any missing value: a rule extracting even a few noisy values, or missing a single value, is treated the same as a completely wrong one. We also compute all the metrics at value level by counting the correct/noisy extracted values. Tables 4 and 5 report the main results for the two considered datasets: P, R, F stand for precision, recall, and F-measure, reported both at rule level and at value level.

Comparison with DIADEM
For a fair comparison between red and diadem, we limit the target attributes to those in the ontology used by diadem, but we also report red's performance on all available attributes. Table 4 presents the P/R/F results of red and diadem over the dia_ds dataset, both at rule level and at value level. Overall, red and diadem perform at roughly the same high precision, recall, and F-measure. Considering that red uses no domain ontology or other prior knowledge about the domain, this is quite remarkable. Moreover, red can extract all the redundant attributes published in the result pages (as shown in the all column), not just those covered by the diadem ontology, achieving similar results. Overall, red identifies 33% more attributes (about 800 attribute extraction rules) than diadem (about 600). The same close performance can be observed at site level: red extracts all attributes correctly on 73% of the websites, compared to 76% for diadem.

Failures & Limitations.
The most frequent errors of red and diadem can be grouped into two main categories.
Noise: diadem leverages its ontology to automatically annotate target pages, and its wrongly extracted values are mainly due to misleading annotations. In our experiments, 5 sites suffer from this problem, e.g., top-lettings.co.uk, where diadem extracts the template node Available:, which always occurs before the attribute AvailableDate, and wrongly takes it as a value of the attribute PropertyStatus. In contrast, red's most common type of noise are partially correct rules, which extract only a fraction of the correct values.
Missing rules: While diadem mostly misses attributes due to a lack of overlap with its background ontology, red mostly misses attributes on sites that do not conform to its assumptions, specifically that result attributes also occur on the detail page and that attributes have a large enough domain to show some variability between result records. On davidtompkins.co.uk, the Description values in the result pages do not occur in the detail pages. On cotswoldlettings.co.uk, the Status of a property is an optional attribute that can only take the value SOLD when present; red wrongly considers all these values as part of the template.

Multi-domain evaluation
We further evaluate red on red_ds, a dataset covering 10 domains (Coffee, Concerts, Florist, Jewelry, Job Search, Theatres, Books, Camera, Lighting, Cigars), with 3 randomly selected sites per application domain. For space reasons, Table 5 reports only the results for one example domain (Cigars), the quality metrics averaged over all the sites, computed both at rule and value level, and other aggregated results, namely: the total number of records (n) contained in the result pages, and the total number of correct (tp), missing (fn), and incorrect (fp) values/rules produced.
Overall, red attains high performance at both rule level (0.93 F-measure) and value level (0.96 F-measure). At the value level, the precision is remarkably high (0.99).
Among the 8 noisy rules in the output produced by red, only 3 are entirely incorrect, while all the others are partially correct rules. On the ncfjobs site, red seems to perform badly, with 4 noisy rules: 3 partially correct rules for Description, and 1 for JobTitle. All these noisy rules survived the noise removal step because red's rule generation algorithm was not complete, i.e., it could not find correct rules for these attributes.

Comparison with other unsupervised approaches.
For a comparison with state-of-the-art unsupervised data extraction systems, we ran depta [41] on the red_ds dataset. This system is also capable of extracting records from result pages and does not require multiple sample pages. Although the algorithm was proposed in 2006, we compared against a runnable version re-implemented in 2012. Since depta blindly outputs multiple tables without telling the user which one contains the target data, we manually picked the single table containing the published records and evaluated precision, recall, and F-measure at the column level, to obtain scores comparable with red's rule-level scores. This yields a negligible advantage for depta, as it is usually fairly easy to pick out the dominant table.
We evaluate depta's record identification step over the 14 of the 30 sites on which the prototype was able to produce a correctly-aligned table. Among these 14 sites, depta achieves a respectable precision of 0.70 and recall of 0.93.

Robustness to parameter setting
red depends on two key parameters: δ is the maximum allowed pivot-to-value distance used during rule generation; ρ is the threshold on the redundancy score above which a pair of result/detail rules is considered not redundant during the search for correct pairs. Figure 6a plots the performance of red on the red_ds dataset over several δ values (with fixed ρ = 0.3). While precision is only slightly affected, recall strongly depends on the maximum pivot-to-value distance: the larger the number of generated rules, the higher the probability of generating the correct ones. Unfortunately, by generating too many rules, red may introduce so many noisy rules that the probability of a noisy rule being erroneously considered correct increases, causing a loss in precision. In our experiments, δ = 6 proved to be the sweet spot.
The other key parameter is the maximum redundancy score, ρ: a pair of result/detail rules is considered not redundant if its redundancy score is above ρ. It significantly affects the noise removal step, whose task becomes harder and harder as noisy pairs are injected into red's processing pipeline. As shown in Figure 6b (with fixed δ = 6), precision starts at its maximum values and then gradually decreases as ρ grows. The noise removal process still manages to remove most of the noise at ρ = 0.4, but for larger values there is a loss of precision, as the greater number of noisy pairs makes the process less effective. Conversely, too small a value for ρ, i.e., too strict a criterion for considering a pair of rules redundant, makes red wrongly remove some correct but not perfectly redundant pairs, as evidenced by the recall loss in Figure 6b for ρ < 0.3. In our experimental evaluation, we therefore used the best empirically determined values for the two parameters, i.e., δ = 6 and ρ = 0.3.
Efficiency: Considering the number of input pages as a constant, red's efficiency can be conveniently analyzed in terms of the initial number of generated rules. On the red_ds dataset, it generates on average 1259.17 result rules and 4912.10 detail rules, which reduce to 205.50 and 289.03 distinct vectors, respectively, after the duplicate-removal step; on average, only 88.87 candidate pairs have a redundancy score smaller than ρ. red achieves an average running time of 72.6 seconds on a 64-bit Ubuntu system with an 8-core Intel i7 CPU at 3.40GHz, processing most sites in less than 30 seconds. red's running time is mostly spent executing the generated extraction rules, a step that could easily be parallelised.

RELATED WORK
Web data extraction approaches and tools have been proposed both by the research community and by industry [15,16]. Due to space limitations, we compare red with the most closely related work and systems, such as depta [41] and diadem [18], which tackle problems similar to ours. depta executes two steps on a single result page: (i) identifying data records, and (ii) aligning values belonging to the same attribute. The first step, an enhancement of the mdr algorithm [29], works on a tag tree based on the visual containment relations of different HTML elements; it exploits the similarity of the tag strings of nodes (and their descendants) to find the data records. A partial tree alignment method based on tree-edit distance is then applied in the second step. depta has two major drawbacks: first, it is highly sensitive even to small exceptions in the structure of the template (e.g., the records of some featured products may be presented differently from the others) and to optional nodes (e.g., the discount) in the repeated patterns; second, due to the lack of any cues for identifying target values, it cannot easily separate target values from template nodes and other noise in the input pages.
One of the state-of-the-art systems for domain-specific data extraction from result pages is diadem. We limit our discussion to its component most related to red, i.e., the one extracting data from result pages. diadem adopts automatic annotators to improve the data identification process, and requires a domain-dependent ontology which has to be designed and maintained by an expert. We instead exploit the highly precise intra-site redundancy to make red accurately target the relevant data without using any domain-dependent feature.
Unsupervised approaches like RoadRunner [8], ExAlg [2], FiVaTech [27], trinity [35], and ViDE [30] are all based on the observation that pages providing homogeneous data are generated from regular HTML templates, so they end up being mostly similar. By analyzing the differences among a set of pages sharing a common template, it is possible to reverse-engineer a model of that template, for example by means of regular expressions [2,9]. However, in practice, these systems are not used for large-scale extraction tasks, because their performance is highly dependent on how well the input pages satisfy the underlying assumptions, to the point that the output quality is unpredictable at scale [10]. In contrast, red does not infer a precise description of the template, and does not rely on one to find the data, as all these systems, including ExAlg, do.
The work in [11,12,22] addresses the XPath rule generation problem, but starts from a different input, i.e., a set of example nodes most of which are likely correct target nodes. These are usually provided by automatic annotators, which are not available for every domain.
The approaches in [5,23] and in [3] exploit content and/or schema redundancy and focus on template-based sites, as red does. However, they solve a completely different set of problems, adopting techniques based on the availability of data about the same objects across several autonomous sites. In other words, they focus on inter-site rather than intra-site redundancy. It is therefore crucially important for them to align different formats (e.g., Height expressed in inches on one site vs the same attribute expressed in centimeters on another), a problem that we show is negligible intra-site. Also, since both approaches work directly on detail pages, they do not address record segmentation at all.
The work in [28] tackles the problem of segmenting result pages into records by exploiting links to detail pages, exactly as red does. However, differently from red, the authors do not consider the problem of extracting data from the result records once segmented; rather, they describe two different approaches to the segmentation problem: one formulates the task as a constraint satisfaction problem (CSP), the other uses a probabilistic inference setting.