Unloading Unwanted Information: From Physical Websites to Personalized Web Views

Zehua Liu, Wee Keong Ng, Ee-Peng Lim, Yangfeng Huang
Centre for Advanced Information Systems

School of Computer Engineering
Nanyang Technological University

Singapore, 639798
{aszhliu, awkng, aseplim}@ntu.edu.sg

[email protected]

Feifei Li
Computer Science Department

Boston University
Boston MA 02215, USA

[email protected]

Abstract

With the explosion of information on the Web, information that is made available from websites is generally overwhelming for users surfing the sites. The majority of the users who are facing this information overloading problem are ordinary home users who do not have much technical knowledge. It is thus important to allow these users to easily create personalized views of websites such that they only see what they want in the way they prefer.

In this paper, we propose the concept of a personalized Web view to cater to this requirement. Underlying this concept is a data model that represents websites from the logical point of view and a declarative language that transforms logical views into personalized Web views. To empower ordinary users with the ability to build their own personalized Web views, we have designed and implemented a software system, known as WICCAP. This system includes a wizard to help users create data models that map physical websites into logical views. It also has an information extraction agent that allows users to instantiate their personalized Web views of the target websites by transforming the logical views previously defined. To increase the fun and flexibility of using this software, a flexible presentation toolkit has been designed to present the information in a manner that is programmable by the users.

1. Introduction

1.1. Motivation

Information available on the World Wide Web is massive and ever increasing. From the users' point of view, much of the information presented in a website is not completely of interest to them. There may be auxiliary and additional information, such as advertisements and other irrelevant content, that is not always needed. Sometimes, the core content of a website may not be of interest to the user either. For example, a user visiting the Amazon.com website1 may be interested only in fiction but not science or horror books. Using conventional methods of retrieving information—browsing and keyword searching—users are often inundated with too much (irrelevant) information. This gives rise to the information overloading problem.

While surfing a website, users usually have something in mind of what they are looking for; this constitutes their information view of the website. A tool that enables a user to specify his view of a website and that facilitates the construction of the view would help to alleviate the information overloading problem, as unwanted information (either auxiliary or uninteresting) would be "unloaded" by excluding it from Web views. As a significant portion of Web surfers are ordinary users who do not possess much technical knowledge, it is important that such tools be targeted at this group of users and be easy to use.

The problem of creating personalized Web views to reduce information overloading is not a new one. Some Web portals such as MyYahoo!2 address this issue by allowing users to specify the content and layout of their personal front page. Such self-design functionality may help to reduce the amount of inappropriate core content to some extent. However, it still leaves most of the auxiliary information, especially advertisements, untouched. Moreover, the majority of websites on the Web do not offer such services and functionalities.

Over the past few years, some Information Extraction (IE) systems (mainly in the form of wrappers) [1, 5, 7, 9, 13, 16] have been proposed to automatically extract target information from the Web and to transform the information into structured or semi-structured format for further processing. The original goals of these systems are to make the information accessible to software programs so that subsequent querying, restructuring and integration can be performed. Personalization and presentation capabilities are lacking in these systems, making them incomplete as solutions to the information overloading problem. For ordinary home users, these systems also tend to be difficult to use due to the lack of relevant technical knowledge and the unfamiliar paradigm of relational views and querying.

1 http://www.amazon.com
2 http://my.yahoo.com

1.2. Contributions

In this paper, we propose a novel approach that combines these two aspects of website personalization and information extraction. A software framework, called the WICCAP system, has been designed and implemented to enable ordinary users to create personalized Web views in a simple and flexible manner. The aim of WICCAP is to provide fast and flexible ways of customizing views of websites while keeping this task simple and easy to perform. To achieve this, (1) one or more global logical views of a target website are first identified and constructed; (2) based on these global views, different users create their own views that contain only the portions that they are interested in and specify how such views should be generated; (3) users also specify how and when their views should be visually shown to them. With the help of the tools provided by the WICCAP system, users are able to easily and quickly design their preferred views of websites.

The key contributions of the paper are summarized asfollows:

• We propose the concept of a personalized Web view to alleviate the information overloading problem by combining the mapping of physical website layout structures to logical views, the transformation of logical views into personalized views, and the presentation of transformed views in a customizable manner. The technical details of website mapping and view transformation are encapsulated by the underlying formalisms, with which the users need not be concerned.

• We design a system architecture that enables ordinary home users to create personalized Web views. It is a three-layered architecture that accomplishes the three steps outlined earlier. The key originality of this architecture, compared with previous approaches [4, 5, 13], is the explicit separation of the tasks of information modeling and information extraction. This allows ordinary users to extract accurately and exactly what they want, because different users can work on the sub-tasks that they do best: expert users specify how to extract information, and ordinary end users decide what to extract and how and when to present the extracted information.

• We present a data model for mapping websites' physical structures into logical views. The data model deals with information at the site level, as opposed to the page level of most other systems [7, 9, 12, 13, 16]. It represents websites from a logical point of view and completely decouples the view from the physical location of webpages.

• We propose a declarative language that is used internally by the extraction module to transform the logical views of websites into personalized Web views. The language provides basic constructs for specifying projection, selection and join operators. Although not as expressive as XQuery or XSLT, the language captures exactly what is required to transform the Web views. The constructs provided by the language have a one-to-one correspondence in the frontend GUI so that users can visually specify the transformation process without being concerned with the syntax of the language.

• We have implemented a rich set of tools to facilitate the various tasks required to create a personalized view of a website. Providing a collection of easy-to-use tools that reduce the technical requirement to the minimum is crucial for a system that targets ordinary home users. Each of the tools is designed to accomplish a specific goal that is in line with the individual steps for creating a personalized Web view. The tools may be used by a single user or by several users who collaboratively create the views.

1.3. Organization of the Paper

The remainder of the paper is organized as follows. Section 2 describes what personalized Web views are and the WICCAP architecture for supporting the creation of these views. Sections 3, 4 and 5 then present the different components in the WICCAP architecture in detail. Section 6 compares our work with related systems. Finally, Section 7 gives some concluding remarks.

2. System Overview

2.1. Personalized Web Views

A personalized Web view has three components: (1) global view(s) of websites; (2) customization of global views into a Web view; and (3) presentation style of customized view(s). In this paper, the term "personalized Web view" has a rather loose definition, depending on the extent to which the view is personalized. A partially personalized Web view may omit, for example, the parameters of post-processing operations (see Section 4.3) in the second component. In the rest of the paper, the terms "Web view" and "personalized Web view" will be used interchangeably to refer to both complete and partial views whenever there is no ambiguity. A personalized Web view may model only a portion of a website, so long as it is self-contained. In some cases, a Web view may be about just a single "data-rich" Web page where a large amount of data is contained within one page.

Among the three components, the third one is relatively simple to implement. Thus, we focus only on the representation of the first and second components. We have devised two formalisms to internally represent these two components. The WICCAP Data Model (WDM) (discussed in Section 3.1) is a logical data model that attempts to map the physical layout structure of a website to its implicit logical structure. It is well suited to represent the global views of websites, as it provides a higher-level abstraction of the target websites while maintaining the relationship with the original website so that extraction of information can occur later. The View Customization Language (VCL) (discussed in Section 4.5) is a declarative language comprising a series of operators that describe how to incrementally transform global views into personalized Web views. The basic constructs defined in VCL correspond to the parameters used in the second component and provide exactly the expressive power that is required to perform view transformation. Both WDM and VCL are used only internally in the WICCAP system to represent Web views and are shielded from the users by front-end GUI tools.
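To make the role of the view-transformation operators concrete, the following sketch models projection and selection over a logical view represented as nested dictionaries. It is purely illustrative: the function names, the data layout, and the field names are our own, not the actual VCL syntax or WDM structure.

```python
# Hypothetical sketch of projection and selection, the kind of operators a
# view-customization language such as VCL provides; names and structures are
# illustrative, not the actual WICCAP syntax.

def project(view, keep):
    """Keep only the named sub-elements of a logical view node."""
    return {k: v for k, v in view.items() if k in keep}

def select(items, predicate):
    """Keep only the items of a list-valued node that satisfy a predicate."""
    return [item for item in items if predicate(item)]

# A miniature logical view of a news section (toy data).
section = {
    "name": "Sports",
    "ads": ["banner1", "banner2"],
    "articles": [
        {"title": "World Cup final", "words": 900},
        {"title": "Local league roundup", "words": 300},
    ],
}

# Projection drops the auxiliary 'ads' element; selection keeps long articles.
view = project(section, keep={"name", "articles"})
view["articles"] = select(view["articles"], lambda a: a["words"] > 500)
```

Applied in sequence like this, each operator incrementally narrows the global view, which matches the incremental-transformation style described for VCL.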

To make it simple and manageable, the process of building personalized Web views is divided into three steps: modeling the website with its logical view, customizing the Web view based on the logical views, and defining the way that the view is delivered to the user. Most users need not be concerned with the first step because the global logical views of popular websites will be delivered together with the WICCAP tools. When the target website is not in the pre-built list, users may create their own using the view creation tool provided (see Section 3).

To begin creating a Web view, a global view of the website of interest is chosen and different parameters are supplied to customize the view. These parameters include scope, post-processing operations, and execution schedule (see Section 4). The scope of the Web view can be refined by selecting only relevant portions of the global view. To permit further customization, post-processing operations can also be applied to process and transform the information extracted according to the view. For example, a user may use keyword filtering to remove any information that contains the word "sex". The schedule of performing information extraction is also an important parameter, with which users specify when information should be extracted and how often the extraction should be performed. More than one global view, each of a different website, may also be chosen if the user wants to combine data from multiple sources using the consolidation operation.
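The keyword-filtering post-processing step mentioned above can be sketched as follows. The function and field names are hypothetical, not WICCAP APIs; the sketch only shows the idea of dropping extracted records whose text contains a banned word.

```python
# Illustrative sketch of keyword filtering as a post-processing operation;
# function and field names are hypothetical, not part of WICCAP.

def keyword_filter(records, banned_words):
    """Drop any extracted record whose text fields contain a banned word."""
    def clean(record):
        text = " ".join(str(v).lower() for v in record.values())
        return not any(word.lower() in text for word in banned_words)
    return [r for r in records if clean(r)]

extracted = [
    {"title": "Sports roundup", "summary": "Match results from the weekend"},
    {"title": "Adult content warning", "summary": "Contains the word sex"},
]
filtered = keyword_filter(extracted, banned_words=["sex"])
```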

As the last step, the Web view is further personalized by applying the parameters for the presentation of extracted information. The user selects a visual style (presentation template) to display the information in a way of his/her liking. As a user usually has several Web views that are active at the same time, these Web views can be scheduled to be shown at different time slots and at different intervals in a TV-program-like fashion.

Although there are quite a few parameters to configure to define a complete personalized Web view, most of them can be omitted, leaving them at their default values. Nonetheless, with the availability of the GUI tools in WICCAP, it is still easy for ordinary users to specify the parameters.

Example 1 [Usage Scenario] As a database researcher, Andy wishes to be kept informed when the Stanford database group has new publications. He creates a personalized Web view by picking the global view that models the corresponding site3. He does not refine the scope in detail, but he specifies that information should be updated weekly and that only new items are to be reported (post-processing operation). He also schedules the new items to be shown at 8:30 AM every Monday. Being a soccer lover, Andy also wants to read sports news every day. He creates another Web view with the global view of BBC News online4 and confines the scope to only the Sports section. He would like to have headline news shown to him every day at lunch time. To save some time, the titles of the news will be displayed in an auto-scrolling list with external links to the full articles. ¤

2.2. WICCAP Architecture

The WWW Information Collection, Collaging and Programming (WICCAP) system is designed to construct personalized Web views in a simple and efficient manner. A three-layer architecture (Figure 1) has been introduced to accommodate the three steps of creating a personalized Web view. The Mapping Wizard takes in information of a particular website and produces global logical view(s) of the website (or what we call mapping rules). More than one logical view of a website might be possible if we look at the website from different angles or with different purposes. With these logical views of a website, the Network Extraction Agent (NEAT) component allows users to customize these views based on their preferences and extracts and transforms the desired information from the website. Finally, the Wipap Presentation Toolkit allows users to apply different presentation styles and templates so that the information is presented in personalized ways.

3 http://dbpubs.stanford.edu:8090/aux/index-en.html
4 http://news.bbc.co.uk/

Figure 1. Architecture of the WICCAP System. [Figure omitted: it depicts the three layers between the WWW information sources and the user: the Mapping Wizard produces global logical views; the Network Extraction Agent applies scope, post-processing and schedule parameters to produce partially personalized Web views; and the WIPAP Presentation Toolkit applies presentation template and schedule parameters to produce the final personalized Web views.]

The separation of the whole process into three steps is important to the goal of empowering ordinary users to create their own Web views. In WICCAP, there are two groups of target users. The information modeling task (first step), especially the sub-task of specifying the extraction details (which requires some technical knowledge), is meant to be performed by expert users, whereas the view customization and presentation tasks (second and third steps) can be invoked by ordinary users whose main interest is to obtain information without worrying about the technical details. Without such explicit separation, the process may easily lead to the extraction of inaccurate information (due to lack of knowledge, when everything is done by ordinary users) or undesired information (due to misunderstanding of needs, when everything is done by expert users). The three-layer architecture enables ordinary users to use the system easily while still maintaining high accuracy of the extracted information and high efficiency of the logical view creation process.

Intermediate information such as global views and customized views is stored as XML documents together with XML schemas that define their formats. This allows any XML-enabled application to receive the extracted results easily. For instance, an information integration system could accept the global logical views of multiple websites (the output of the first layer) and integrate them to allow uniform access.
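As an illustration only, a stored global logical view might look like the fragment below. The element and attribute names here are hypothetical; the actual WDM schema is defined in [15] and is not reproduced in this paper.

```xml
<!-- Hypothetical fragment; the real WDM element names are defined in [15]. -->
<Section name="Sports">
  <Mapping link="http://news.bbc.co.uk/sport/" />
  <ArticleList>
    <Article>
      <Title />
      <Summary />
    </Article>
  </ArticleList>
</Section>
```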

3. Mapping Wizard

3.1. Logical Data Model

In most existing systems [1, 5, 10, 12], users directly operate on Web pages to specify the data to be extracted. This usually requires users to have technical knowledge of HTML and related technologies. For ordinary users, it would be much easier and less error-prone to provide them with a logical view representing the target website and to let them specify the data to be extracted based on this view, rather than on the original webpages. For this to be possible, such logical views must reflect most users' understanding of the website and represent most of the data that users would possibly like to extract (this is why they are called global logical views). A data model, called the WICCAP Data Model (WDM) [15], has been proposed to serve this purpose.

The role of the WICCAP Data Model is to relate information from a website in terms of commonly perceived logical structures, instead of physical file directory structures. The logical structure here refers to one's perception of the organization of the contents of a specific website. It is based on the observation that different pieces of information in a website are usually related to one another through a certain logical structure that is hidden behind the inter-linking of multiple Web pages. Such a hidden logical structure is usually apparent to most users when they look at a website.

For instance, when one thinks of a newspaper website, one generally thinks of a site that has a list of sections such as world news, local news, and sports news. Each section may have subsections and/or a list of news articles, each of which may have a number of items including the title, abstract or summary, the article itself, and perhaps other related articles. This hierarchy of information is the commonly perceived structure of a newspaper website that reflects most users' concept and understanding and covers most data that users would like to extract.
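The commonly perceived hierarchy described above can be sketched as a simple tree. This is a toy model for illustration only; the actual WDM definition is given in [15].

```python
# Toy model of the logical structure of a newspaper website; the real
# WICCAP Data Model (WDM) is defined in [15].

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def depth(self):
        """Depth of the logical hierarchy rooted at this node."""
        if not self.children:
            return 1
        return 1 + max(child.depth() for child in self.children)

newspaper = Node("Newspaper", [
    Node("World News", [Node("Article", [Node("Title"), Node("Abstract")])]),
    Node("Local News"),
    Node("Sports News"),
])
```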

To decouple the logical view from the physical file directory hierarchy and page layout of the target website while maintaining the mapping between the two, a special set of elements dedicated to mapping information has been defined in WDM. These elements form the basic constructs of the extraction rules used in WICCAP. A detailed discussion of the elements in WDM can be found in [15].

3.2. Visual Utility Tools

Although the global logical views are to be created by users who have some technical expertise, the process of manually creating additional views is still difficult and slow. The Mapping Wizard tool is provided to facilitate and automate this process. It is a supervised wrapper generation tool (in Web information extraction terminology) that interactively helps users to define what is to be extracted from the website, how to extract it, and how to organize it into a meaningful logical view.

In the Mapping Wizard, a set of utility tools has been implemented to address several bottlenecks that slow down the overall process (see [15] for a more detailed discussion). Among these tools, those that automatically derive the extraction rules (i.e., mapping elements in WDM) of each data item are the most useful, as they relieve users from manually studying HTML sources to figure out the rules. A few other tools that operate on multiple data items at the same time further improve the efficiency of the overall process. When used in conjunction, the utility tools reduce the time required to generate a global logical view for a given website.

All the utility tools are available through the GUI of the Mapping Wizard. Most tools operate on the HTML browser view of the website, making it easy to specify the target data. With the tools and the GUI, users need not be concerned with the underlying HTML source or the syntax of the WICCAP data model, as both are hidden from the users. Although the Mapping Wizard is meant to be used by expert users with relevant knowledge, it is still possible for ordinary users who are familiar with the utility tools to use the Mapping Wizard to create global logical views for their favorite websites when these are not found in the list of pre-built views. Experiments conducted [15] showed that, on average, it takes about an hour or two to produce a global logical view of a website, depending on the granularity of the information to be extracted.

3.3. Automated Extraction of Website Skeletons

The utility tools that have been incorporated into the Mapping Wizard are well suited for specifying the contents to be extracted within one page. Although a primitive tool for extracting a group of hyperlinks within a page is also provided, the modeling of the logical structure of an entire website, which is usually exhibited by the hyperlink structure, still remains relatively manual. To further improve the degree of automation of the overall logical view construction process, an algorithm has been developed to automatically discover the skeleton of a website, i.e., the underlying hyperlink structure that is used to organize the content pages in the website [14]. With the discovered skeleton, users can jump-start the logical view building process and directly proceed to work on the individual details of each page in the skeleton.

The algorithm works in a recursive manner by applying a two-step process to discover the navigation links in a page and then retrieving pages from those links to discover more navigation links. Given a page, the two-step process involves examining all hyperlinks within the page in groups to produce some candidates, and identifying, among all candidates, the best one as the group of navigation links that point to pages at the next level of the website structure. All the navigation links discovered and the pages retrieved during the discovery process form the skeleton of the website. Experiments conducted on real-life websites showed that the algorithm is able to achieve high recall with moderate precision [14].
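The recursive two-step process can be sketched as follows. This is a simplified illustration: the candidate grouping and scoring heuristics of the actual algorithm [14] are abstracted away behind the `group_links` and `score` parameters, which are assumptions of this sketch.

```python
# Simplified sketch of recursive website-skeleton discovery; the grouping
# and ranking heuristics of the real algorithm [14] are abstracted away.

def discover_skeleton(page, fetch, group_links, score, max_depth=3):
    """Return the navigation-link skeleton rooted at 'page'.

    fetch(url)        -> page object for that URL (assumed helper)
    group_links(page) -> candidate groups of hyperlinks on the page
    score(group)      -> likelihood that a group is the navigation links
    """
    if max_depth == 0:
        return {}
    candidates = group_links(page)
    if not candidates:
        return {}
    # Step 1: identify the best candidate group of navigation links.
    navigation = max(candidates, key=score)
    # Step 2: retrieve each linked page and recurse one level deeper.
    return {
        url: discover_skeleton(fetch(url), fetch, group_links, score,
                               max_depth - 1)
        for url in navigation
    }

# Toy demo: a three-page site whose home page links to two section pages.
pages = {"home": ["a", "b"], "a": [], "b": []}
skeleton = discover_skeleton(
    "home",
    fetch=lambda url: url,
    group_links=lambda page: [pages[page]] if pages[page] else [],
    score=len,
)
```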

In this particular context of a supervised wrapper generation system, high recall is preferable to high precision (with low recall), because it is much more difficult for users to find the correct navigation pages (in the case of low recall) than to remove incorrectly identified navigation pages from a list (in the case of high recall but low precision). Thus, this algorithm nicely complements the other utility tools because it addresses the main bottleneck left untackled by those tools, i.e., the near-manual process of specifying the hyperlink structure that constitutes the skeleton of a logical view. Therefore, although the algorithm has not yet been evaluated together with the other utility tools, we expect that incorporating it into the Mapping Wizard should significantly accelerate the overall process of creating global logical views of websites.

4. Network Extraction Agent

The Network Extraction Agent (NEAT) is the component that is responsible for helping users customize the parameters of the Web views and manage the extraction jobs that retrieve data from websites based on (partially) personalized Web views. It is intended to be used by ordinary users. The user interface has been designed to be simple, user-friendly and easy to use, so that ordinary users who do not possess much technical knowledge of information extraction are able to use it. Users are expected to use NEAT together with WIPAP (discussed in the next section) to produce the final personalized Web views.


Figure 2. Network Extraction Agent

Figure 2 shows a screenshot of the NEAT system. As it has been designed with ordinary users as the target user group, the GUI has been built to be as close to a typical Windows application as possible. As can be seen in the screenshot, the main GUI resembles that of Microsoft Windows Explorer, with a tree structure representing the hierarchical categories of the personalized Web views on the left-hand side and the list of active Web views under the highlighted category on the right-hand side. Parameters of each Web view can be configured in a popup dialog. When the extraction agent is constructing a personalized Web view (i.e., extracting data from the website), the progress is also displayed in a popup dialog, as shown in the middle of Figure 2. With this familiar interface and the wizard-guided operations, users will be able to quickly adapt to NEAT and use it to create personalized Web views effectively.

4.1. Customizing Web Views

A user creates a new Web view by first selecting a global view from a list of available global views of different websites. To define the scope parameter, NEAT presents the selected view in a tree-like structure to allow the user to specify which parts of the logical view are to be extracted. For example, from the logical view of the BBC News online website, the user may indicate that he or she is interested only in sports news by selecting the subtree rooted at the node titled Sports News. When extracting information, the agent extracts only those parts of the global view that have been selected.
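The scope parameter amounts to picking a subtree of the logical view, which can be sketched as a simple path lookup over a nested structure. The data layout and function name here are illustrative assumptions, not NEAT internals.

```python
# Illustrative sketch of the scope parameter: selecting a subtree of a
# nested logical view. Structures and names are hypothetical.

def select_subtree(view, path):
    """Return the subtree of a nested logical view at the given path."""
    node = view
    for name in path:
        node = node[name]
    return node

bbc_view = {"News": {"Sports News": {"Football": ["article1", "article2"]}}}
sports = select_subtree(bbc_view, ["News", "Sports News"])
```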

Merely extracting the information based on the refined view may not satisfy the user, since he or she may not be interested in all the information extracted. Several parameters for post-processing can be configured to place further constraints on the view. These parameters are applied to the information after it has been extracted by the agent.

Currently, NEAT has three post-processing parameters: filtering, incremental updating, and consolidation. The consolidation parameter will be discussed in Section 4.3.

The filtering parameter allows the partial views constructed to be filtered based on different conditions. It consists of a set of "condition-action" pairs, where each condition is an expression to be evaluated on the value of a certain data item in the view and each action is a transformation operator5 on some other items. The target items in the pairs are selected in the partial logical view refined by the scope parameter. Examples of filtering conditions include "Keep only news article items with the phrase "World Cup" in the title or description" and "Remove all product items from Dell that have a value of more than $1,000 in the price attribute".
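The evaluation of "condition-action" pairs can be sketched as below. Following the footnote that only a delete operator is currently defined, the sketch supports deletion only; all names are illustrative, not NEAT APIs.

```python
# Sketch of "condition-action" filtering; per the current implementation,
# the only defined action is deletion. All names are illustrative.

def apply_filters(items, rules):
    """rules: list of (condition, action) pairs; action 'delete' drops items."""
    result = []
    for item in items:
        deleted = False
        for condition, action in rules:
            if condition(item) and action == "delete":
                deleted = True
                break
        if not deleted:
            result.append(item)
    return result

products = [
    {"vendor": "Dell", "price": 1500},
    {"vendor": "Dell", "price": 800},
]
rules = [(lambda p: p["vendor"] == "Dell" and p["price"] > 1000, "delete")]
cheap = apply_filters(products, rules)
```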

Most users are used to receiving notifications of new emails. The same expectation applies here, where users may expect to be informed only of newly published papers in digital libraries or to see only "new" news articles in a newspaper website. This can be done by incremental updating. When this feature is set, the agent first extracts data according to the view and compares it with the previous extraction result to determine whether there is any new content. Only the new data will be included in the Web view generated. The whole process appears to the user as if the view is incrementally updated according to changes on the website. This feature is particularly useful for sites where there is a regularly updated list of records, e.g., news sites. It should be noted that this feature can also be applied recursively to nested lists (allowed in the definition of WDM), enabling incremental updating in a hierarchical fashion.
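The comparison step of incremental updating can be sketched as a simple set difference between the previous and current extraction results. Keying records on the title field is an assumption of this sketch; the function name is illustrative, not a NEAT API.

```python
# Sketch of incremental updating: compare the current extraction with the
# previous one and keep only new records. Keying on 'title' is an assumption.

def incremental_update(previous, current, key="title"):
    """Return only the records in 'current' not seen in 'previous'."""
    seen = {record[key] for record in previous}
    return [record for record in current if record[key] not in seen]

yesterday = [{"title": "Old story"}]
today = [{"title": "Old story"}, {"title": "Breaking news"}]
new_items = incremental_update(yesterday, today)
```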

Besides defining the scope and post-processing parameters, a user is also provided with options to specify when and how frequently the extraction job should be performed. This is similar to the scheduling function of most software, such as the Microsoft Task Scheduler. The schedule parameter is important because different sites update data at different rates. For example, a newspaper website may be updated once per day, while stock information might be updated every minute.

The parameters described above are obtained through a step-by-step wizard that guides users through specifying each parameter. The process of creating a personalized Web view consists of only point-and-click actions guided by the wizard. At the end of the process, the wizard automatically generates a set of rules written in a declarative language (see Section 4.5) that represents the operators required to transform the global views into the customized view. These rules can then be executed by the extraction agent to perform the actual transformation of views.

As mentioned earlier, all parameters are optional. In the extreme case, the user may simply click the "Next" button several times until reaching the "Finish" button; this produces a Web view identical to the original global view, one that requires the user to start the extraction manually. All the personalized Web views created are organized into a user-defined category hierarchy.

4.2. Information Extraction

The fundamental function of NEAT is to accurately and reliably extract the information specified in the Web views. It performs information extraction according to the extraction rules given in the global view. In NEAT, the process of constructing a Web view is called an extraction job. An extraction job can be started either manually by the user or automatically by the extraction engine according to the schedule parameter. All extraction jobs are managed in multi-threaded mode.

One important issue for a Web information extraction system is the resilience of the extraction rules in the face of changes to the website. In the WICCAP Data Model, the global logical view is a tree in which each node represents a data item to be extracted. The extraction rules associated with each node are independent of its siblings' rules but may rely on its parent's. This means that any change affecting the extraction of a data item can only (possibly) affect the children of that node. When the affected node is a leaf, the effect is minimal. However, if a major change causes the entire extraction process to return nothing several times in a row, the agent reports an error, which requires the user to verify the correctness of the global view.
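The benefit of parent-relative rules can be illustrated with a toy recursive extractor. This is a hedged sketch using regular expressions for brevity (WICCAP's actual rule language is richer): each node's pattern is matched only inside the region its parent matched, so a rule that fails silently prunes only its own subtree, leaving siblings intact.

```python
# Toy extractor showing why parent-relative rules localize damage.
import re

def extract(node, text):
    """node = (name, pattern, children); pattern is matched inside `text`."""
    name, pattern, children = node
    m = re.search(pattern, text, re.S)
    if m is None:
        return {name: None}          # this subtree fails; siblings unaffected
    region = m.group(1) if m.groups() else m.group(0)
    result = {name: region if not children else {}}
    for child in children:
        result[name].update(extract(child, region))
    return result

page = "<div id=news><h1>Headline</h1><p>Body text</p></div>"
tree = ("news", r"<div id=news>(.*?)</div>",
        [("title", r"<h1>(.*?)</h1>", []),
         ("body", r"<p>(.*?)</p>", [])])
print(extract(tree, page))  # {'news': {'title': 'Headline', 'body': 'Body text'}}
```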

The extracted information is stored in XML format. Users generally do not see it directly, because raw XML content is not very meaningful to ordinary users. This information is consumed by the next layer, WIPAP, which presents it to users as the final Web view. However, it could also be used directly by other applications for integration or other purposes.

4.3. Consolidation

Sometimes, information from multiple sources describes similar or related content. For example, if a user has personalized Web views for both the WashingtonPost and CNN websites, the headline news would likely cover similar events even though the titles of the news items might not match verbatim. In this case, it would be useful to combine similar items and present only one item to the user. The consolidation parameter can be applied to multiple similar Web views to integrate them into a single Web view by combining similar items in the source views. In effect, this achieves a kind of semi-automatic view integration: the schema matching of views is manually specified, while the similarity detection of items is done automatically.

A user selects several global views and obtains an (integrated) personalized Web view as the output. One of the global views is selected as the reference view. The items in each individual view to be compared are manually selected. The path to the selected item in the reference view, together with the item's label, is used in the output view. We restrict the elements for consolidation to those whose multiplicity is more than one (i.e., those of the list type). Items not in the subtrees rooted at the selected items go directly into the final view. For such consolidation to make sense, the source views are required to have a similar structure, which, in practice, is not too far-fetched, as sites of the same category (e.g., newspaper websites) are likely to have similar logical structures. In the worst case, where no similar items are detected, the final Web view becomes a blind combination of all sources without any integration.

Similarity detection is performed automatically using the Latent Semantic Indexing (LSI) [8] algorithm (with term weighting). In a typical scenario where we compare two lists of items from two sources (e.g., two sets of news articles), all items are fed to the LSI algorithm, which computes the similarity matrix based on a rank-k singular value decomposition (SVD) [8]. The resulting matrix is used by the Clique [2] algorithm to cluster the items. Only one item in each cluster is (randomly) chosen as the representative to be stored in the final view.
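The similarity-detection step can be sketched as follows. This is a hedged illustration only: it builds a raw term-frequency matrix (the actual system applies term weighting), and it replaces the Clique clustering with a simple greedy threshold grouping that keeps one representative per group.

```python
# Sketch of LSI-based similarity detection: term-document matrix,
# rank-k SVD, cosine similarities, then greedy grouping by threshold
# (a crude stand-in for the Clique clustering used in the paper).
import numpy as np

def lsi_similarity(docs, k=2):
    vocab = sorted({w for d in docs for w in d.lower().split()})
    A = np.array([[d.lower().split().count(w) for w in vocab] for d in docs], float)
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    X = U[:, :k] * s[:k]                      # items in k-dim LSI space
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T                            # cosine similarity matrix

def representatives(docs, sim, threshold=0.9):
    kept, out = [], []
    for i, d in enumerate(docs):
        if all(sim[i, j] < threshold for j in kept):
            kept.append(i)
            out.append(d)
    return out

docs = ["england win world cup match",
        "world cup match won by england",
        "stock markets fall sharply"]
sim = lsi_similarity(docs)
print(representatives(docs, sim))  # the two similar news items collapse to one
```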

4.4. Supporting HTML Form Query

The WICCAP Data Model defines a special group of elements to cater for the HTML Form and other related HTML tags. When the agent encounters such elements in the global logical view, it checks whether all required elements in the form have pre-set values. If not, a dialog box is dynamically constructed at runtime to allow the user to enter the corresponding values. All visible elements in the dynamic dialog are arranged in a manner that looks similar (if not identical) to the original page. The input values from the user are used to construct the complete query string that is posted to the remote Web server. The returned Web page is then processed like any other static Web page, and the extraction process continues.
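The query-string construction can be sketched as below. The URL and field names are placeholders, and the user prompt stands in for the dynamically constructed dialog; this is not WICCAP's actual implementation.

```python
# Sketch of form support: collect values for required form fields
# (prompting the user for any without pre-set values), encode them into
# a query string, and build a POST request to the form's action URL.
from urllib.parse import urlencode
from urllib.request import Request

def build_form_request(action_url, fields, preset, prompt=input):
    values = {}
    for name in fields:
        values[name] = preset.get(name) or prompt(f"{name}: ")
    data = urlencode(values).encode()
    return Request(action_url, data=data, method="POST")

req = build_form_request("http://example.org/search",
                         ["query", "section"],
                         {"section": "news", "query": "world cup"})
print(req.data)  # b'query=world+cup&section=news'
```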

Support for HTML forms is critical to flexible retrieval of pages because of the extensive use of forms on websites. Typical uses of forms include website search functions and user login authentication. Without any support for HTML forms, it would be difficult to retrieve information from these sites. Form support is also one of the characteristics that distinguish the WICCAP system from similar systems, as currently only a few related systems [4, 10, 16] take forms into consideration or directly incorporate them.


4.5. The View Customization Language

The parameters configured in NEAT are internally represented by a declarative language called the View Customization Language (VCL). All the basic constructs in VCL have corresponding GUI components in the step-by-step wizard that guides users through the view customization process. Users of NEAT never have to deal with VCL directly. After users specify all the parameters, NEAT internally generates a VCL program, consisting of a set of VCL statements, that instructs the extraction agent how to perform the extraction and post-processing to create the desired Web view.

All VCL programs follow the same structure (we give only an informal description of VCL in this section; the complete EBNF definition can be found in Appendix A):

( create webview <view_name> as
    ( <keep_stmt> | <delete_stmt> | <incupd_stmt>
    | <consolidate_stmt> | <update_stmt> ) )*
output <final_view_name>

where each create webview statement defines a (temporary) Web view based on the other Web view(s) defined earlier. The output statement at the end determines which Web view is used as the final output. An example of a VCL program is depicted in Figure 3. The view customization process starts from the views, called global virtual views, formed by the data extracted according to the global logical views of websites (e.g., BBCGlobalView in Figure 3). These views are virtual because they are never materialized; instead, they are used as intermediate views to be further refined. Five alternative statements are allowed when defining a Web view, corresponding to the scope parameter, the schedule parameter and the three post-processing parameters. The output Web view is incrementally defined by a series of temporary views, each refined by one parameter.

Scope The scope parameter is in fact a projection operator on the global virtual views, projecting only parts of the source view into the target view. In this case, simple XPath (http://www.w3.org/TR/xpath20/) expressions, as used in XQuery (http://www.w3.org/TR/xquery/) for projection, are not sufficient, because they return only the nodes matching the XPath expressions without their ancestors in the complete DOM tree. In contrast, the keep only statement in VCL is designed to project a portion of the XML DOM tree, keeping the entire path(s) from the root node to the node(s) selected by the specified XPath expression(s), as well as the subtree(s) rooted at these nodes. For example, the first create webview statement (lines 1–4) defines a new Web view by refining the scope of the view to only the first and fifth children under the root node of the global virtual



1:  create webview BBCViewWorldTechNews as
2:    keep only /Wiccap/Section[1]
3:    and /Wiccap/Section[5]
4:    from BBCGlobalView
5:  create webview BBCViewNoWar as
6:    delete /Wiccap/Section/Region/Record
7:    where ./Item[@Type="Title"] contains "iraq"
8:    delete /Wiccap/Section/Region/Record
9:    where ./Item[@Type="Description"] contains "iraq"
10:   from BBCViewWorldTechNews
11: create webview BBCViewIncUpd as
12:   incupd /Wiccap/Section/Region/Record
13:   using key ./Item[@Type="Title"]
14:   from BBCViewNoWar
15: create webview BBCViewSchedule as
16:   update start at 08:30 on 30/06/2003
17:   repeat every 1 day
18:   from BBCViewIncUpd
    ...
19: create webview CNNViewSchedule as
    ...
20: create webview NewsView as
21:   consolidate BBCGlobalView with CNNGlobalView
22:   merging /Wiccap/Section[@Name="World"]/Region/Record
23:   in BBCViewSchedule
24:   with /Wiccap/Section[@Name="World"]/Region/Record
25:   in CNNViewSchedule
26: output NewsView

Figure 3. A VCL Program for a Partial Web View

view BBCGlobalView (i.e., the World and Technology sections), resulting in a new Web view with only two child nodes (Section nodes) at the first level.

It is also possible to specify the portions that users do not want, instead of those that they want, using the keep all except statement. Its syntax is similar to that of keep only as shown in Figure 3, but its semantics is exactly the opposite.
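The semantics of keep only (retain the selected nodes, their subtrees, and the paths from the root down to them) can be sketched on an XML tree as follows. This illustration operates on pre-selected element sets rather than evaluating full XPath expressions.

```python
# Sketch of the `keep only` projection: retain the selected nodes with
# their subtrees plus the root-to-node paths; prune everything else.
import xml.etree.ElementTree as ET

def keep_only(root, selected):
    """`selected` is a set of element objects to keep (with their subtrees)."""
    def prune(elem):
        if elem in selected:
            return True                      # keep the whole subtree
        keep_children = [c for c in list(elem) if prune(c)]
        for c in list(elem):
            if c not in keep_children:
                elem.remove(c)
        return bool(keep_children)           # keep elem only if on a kept path
    prune(root)
    return root

doc = ET.fromstring(
    "<Wiccap><Section n='1'><Region/></Section>"
    "<Section n='2'/><Section n='3'/></Wiccap>")
targets = {doc.findall("Section")[0], doc.findall("Section")[2]}
keep_only(doc, targets)
print([s.get("n") for s in doc.findall("Section")])  # ['1', '3']
```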

Filtering As mentioned earlier, a filtering parameter consists of a set of "condition-action" pairs. Each such pair is defined by a delete <XPath> where <condition> statement. When the <condition> evaluates to true, the node(s) selected by the XPath expression are removed. This parameter corresponds to a selection operator where the nodes to be selected are specified by negation (i.e., by specifying those to be removed). Multiple delete statements indicate a disjunction of all the nodes to be filtered.

The condition is of the form

( not )? ( some | every )? <XPath> <op> <value>

where <op> ∈ {contains, =, >, <}. The XPath in the condition typically refers to some children or descendants of the


node selected by the XPath in the delete statement. However, the abbreviated syntax . and .. and absolute references (starting with "/") can also be used to access the node itself, its ancestors or arbitrary nodes. The example in Figure 3 (lines 6–10) shows a typical usage of the filtering parameter, where the user, being tired of news about the war, wants to remove all news articles (denoted by the Record nodes) whose title or description contains the word iraq.

Incremental Updating An incupd all statement means that all nodes in the view should be incrementally updated. Alternatively, selective updating can be specified with one or more incupd <XPath> statements. Optional using key statements can be added to indicate which child node(s) should be used to test the equality of two nodes. When no using key statement is present, the data values of the subtrees rooted at the nodes selected by the XPath are used.

Again, this parameter can be considered a selection operator that selects nodes based not on their values but on their freshness (i.e., whether they are new). In Figure 3 (lines 11–14), a Web view named BBCViewIncUpd is created based on the previously defined view BBCViewNoWar to incrementally update the news articles by comparing the titles of newly extracted articles with those of the existing ones.

Schedule This parameter is only instructional, in the sense that it does not perform any transformation of the Web view but simply instructs the extraction agent when (the start at statement) and how often (the repeat every statement) the Web view should be constructed. The repeat every statement is optional; its absence indicates that the construction of the Web view is to be performed only once. If no update statement is specified in a VCL program, the construction of the Web view defined by the program has to be initiated manually by the user.

Consolidation Unlike the above four statements, which are defined over a single Web view, the consolidate statement combines multiple Web views into one. The first Web view following the consolidate keyword is treated as the reference view, which is to be consolidated with one or more views specified after the with keyword. The merging statement indicates the nodes in each view that are to be supplied to the similarity detection algorithm. Assuming that another Web view named CNNViewSchedule, about CNN News, is defined in a manner similar to the BBC view, the last create webview statement in Figure 3 (lines 20–25) shows how the two views can be combined to provide a single world news section. Note that although this example is based on only two views, it is possible to have

more than two views to be consolidated. Multiple merging statements can also be used to combine several groups of (possibly similar) nodes.

Since nodes that are not specified in the merging statements go directly into the final view while only representatives of similar nodes are stored, this consolidation parameter can be roughly considered a join operator, with a different notion of the equality predicate for testing the nodes that connect different source views.

The Web view specified in the output statement is the final view constructed by the VCL program. Thus, strictly speaking, a VCL program is not closed: it takes in one or more trees or a forest (the global virtual views) and produces a single tree (the final view).

The order of the different create webview statements may be relevant to the correctness of the final view. Some orderings of parameters are not meaningful or even not valid, and care must be taken to ensure that such situations do not occur. Nevertheless, invalid programs can be avoided entirely by having the GUI wizard enforce the order in which the parameters can be specified (recall that all parameters are configured only through the step-by-step wizard).

If we consider the input virtual views and the output view as XML documents, the entire view customization process specified by a VCL program can be viewed as a transformation of XML documents, which could also be performed by XQuery or XSLT (http://www.w3.org/TR/xslt). Indeed, these two languages were the first two alternatives we looked at before deciding to design our own. Both XQuery and XSLT provide expressive constructs to transform or query XML trees and forests. However, for the purpose of Web view customization using the parameters, only limited projection and selection operators are required; XQuery and XSLT are over-expressive for this particular task. More importantly, the constructs and expressions provided in these two languages, although very powerful, are of too low a level for expressing the five parameters. For example, Figure 4 shows one possible definition of the filter parameter using XQuery. Comparing it with the six lines (lines 6–10) in Figure 3, it is obvious that XQuery is too bulky for the task of customizing Web views. In addition, we found it very difficult to express the consolidation parameter using either XQuery or XSLT. Finally, the incremental updating and schedule parameters require features that are not available in either language.

One final note about the VCL language is that the execution of a VCL program need not strictly follow the sequence of its create webview statements; it is possible to optimize the execution of a VCL program. For example, it is not necessary to extract all information from the website



define function webview($viewname) as node external
define function filter($n, $target) as node {
  if (count($n intersect $target) = 0)
  then element { local-name($n) } {
    for $a in $n/@* return $a,
    for $c in $n/* return filter($c, $target),
    $n/text()
  }
  else ()
}
let $base_view := webview("BBCViewWorldTechNews")
let $t := $base_view/Wiccap/Section/Region/Record
            [contains(Item[@Type="Title"], "iraq")]
          union
          $base_view/Wiccap/Section/Region/Record
            [contains(Item[@Type="Description"], "iraq")]
let $i := $base_view/Wiccap
return filter($i, $t)

Figure 4. An XQuery Implementation of the filter Parameter

according to the global view before applying the scope parameter. It is obviously more efficient to apply the scope parameter to trim the global view first and then extract only the information that falls within the specified scope. The evaluation of multiple delete statements may also be optimized by considering those operating on the same set of nodes together, instead of evaluating them separately.

5. WIPAP Presentation Toolkit

The (partial) Web views created by NEAT cannot be considered fully personalized if their content is not presented in a manner that matches users' preferences. The Web Information Player and Programmer (WIPAP) allows users to customize their views further by incorporating information about how to present the views (presentation templates) and when to present them (presentation schedule). It is also responsible for showing the views according to the defined parameters. With WIPAP, the views can be considered finalized personalized Web views ready to be consumed by users.

Setting the template parameter is as easy as selecting one from a list of pre-defined presentation templates. These templates control how the extracted information is visually presented. In the current implementation, Macromedia Flash is used to display information, since it is flexible and able to display content in a nice, animated and appealing fashion. The presentation templates are designed as Flash clips with ActionScript (http://www.actionscripts.org/), an embedded scripting language that can dynamically take in the data to be displayed.

A calendar-styled program wizard is provided to allow

users to conveniently specify when and for how long each Web view is presented on screen. This is very similar to adding a new meeting appointment to the Microsoft Outlook calendar. A Web view can be configured to recur at any specified interval. This presentation schedule is similar to a TV program schedule, making it appealing and intuitive to users. It should be pointed out that this presentation schedule is different from the execution schedule configured in NEAT: the former determines when and how often the extracted data should be presented, whereas the latter governs the extraction of data from the websites.

Note that because these two parameters are relatively simple, we consider the task of representing them a straightforward implementation issue and thus do not develop any formalism for them.

The WIPAP presentation toolkit is the most flexible layer in the WICCAP architecture. Although we have chosen to present the information using Flash, it is possible to have many different implementations of the third layer that use different presentation technologies, ranging from simple HTML Web pages and well-organized restructured websites to special-purpose client-side applications. It is possible to incorporate them all, so long as they allow the configuration of the two parameters discussed above.

Currently, only one Web view is presented at any point in time. In future implementations, presenting multiple views at the same time will be explored by finding efficient ways of laying out the different views. This should, to a certain extent, help achieve integration among multiple Web views.

Figure 5 is a snapshot of the WIPAP system at runtime. The design criteria for WIPAP's GUI are quite similar to those of NEAT: to make the application user-friendly and intuitive for ordinary users. The look of WIPAP is similar to that of Windows Media Player. On the left-hand side, a list of frequently used functions is made available as buttons. The right-hand side consists of the main panel showing the content of the active Web view (in the center) and the flattened skeleton of the view (on the right, below the "Back" button). At the top are two shortcuts that allow users to change the presentation template and the active Web view being shown. This screenshot, in fact, shows only one of the skins available in WIPAP; users can change the appearance of WIPAP by selecting a different skin.

6. Related Work

Figure 5. The WIPAP System

In relation to personalization systems, the personalized Web views in WICCAP offer more flexibility and functionality than the personalization services actively provided by the information-provider websites. Our approach also differs greatly from traditional third-party Web-based personalization tools in that ours is client-based while others are server-based or proxy-based. The customization of views occurs on the client's PC, and no cooperative action is required from the server side; nor does it require the existence of a Web server or proxy. Although server-side systems may allow the reuse of views, the Web views created in WICCAP can also be shared among users.

Web Montage [3] is a server-side system that automatically generates a start page with links and content by mining users' routine browsing patterns. It assumes that the Web access log is available to the system, which is usually not possible for uncooperative websites. WebViews [10] provides an interesting VCR-style recording component that remembers the path used to retrieve a certain Web page and allows Web views to be created to extract a fragment of that page. Having only one page per view limits its ability to create views of entire websites. Compared with these two systems, the personalized Web view in WICCAP has a much richer scope and offers more flexibility for customizing views; it has the post-processing parameters that both systems lack. The presentation components in these systems are limited to HTML Web pages. In contrast, the third layer in WICCAP extends the concept of personalization to information presentation with the template and scheduling parameters.

The Araneus project [4] proposed to manage Web data by first modeling the target websites using the Araneus Data Model (ADM) and then using two languages, Ulixes and Penelope, to define database views over the websites and restructure those views to produce another hypertext view. This procedure is more suitable for site administrators maintaining and manipulating hypertext views than for ordinary home users selectively receiving information. For the information overloading problem, projection and selection with a simple join for integration serve the purpose well. Thus, there is no need to convert the data model into relational views for restructuring and then turn them back into hypertext views; a direct transformation language like VCL is preferred. The proposed ADM, although it reflects some logical structure, still follows the physical structure of the target website quite closely, as it treats the page as a basic construct and explicitly uses hyperlinks in the data model. In contrast, the WICCAP Data Model completely decouples the physical website structure from the created logical views by hiding all physical concepts, such as pages and links, beneath the logical nodes in the views. An additional layer is also introduced in WICCAP to provide more flexible presentation of information.

WICCAP is also related to systems [1, 5, 7, 11, 12, 13, 16] in the area of Web information extraction, which are frequently referred to as wrappers or wrapper generation systems. These systems focus on the extraction of data and the creation of wrappers that perform the extraction. Most of them deal only with a single Web page or a set of Web pages with similar structure, whereas the WICCAP Data Model captures the logical structure at site level. Some important parameters, such as post-processing and information presentation, are lacking in these systems, making them incomplete as systems for end users. Nevertheless, these systems have focused quite extensively on automating the wrapper generation process and making it easy to use, as they assume a single (and often non-technical) user throughout the entire extraction process. The wrapper generation modules of some systems (e.g., XWrap [13], Lixto [5] and DEByE [12]) offer nice user interfaces that help users generate wrappers (they start from different angles but serve the same purpose). These modules could be used in place of the first layer in WICCAP to address the diverse needs of different users.

Apart from the comparison among similar systems, the topic of information extraction readily invites comparison with efforts in the Semantic Web initiative [6]. Information extraction in general, and WICCAP in particular, tries to address the information overloading problem from the information consumers' side. This is appropriate in the short term, because the information providers, mainly the websites, currently do not provide much help. In the long term, when information providers help to advocate the Semantic Web and add semantics (e.g., RDF (http://www.w3c.org/rdf/) and ontologies (http://www.semanticWeb.org/knowmarkup.html)) to the content that they provide, the Semantic Web will be more promising, as information consumers and providers will be able to cooperate seamlessly to make the best of the information. However, if the WICCAP Data Model is well adopted and information providers start to supply logical views for their websites, the WICCAP system may achieve an end-effect similar to that of the Semantic Web, although the capabilities of knowledge induction and evolution are still lacking and may have to be built into the application layer.

7. Conclusions

In this paper, we propose the concept of a personalized Web view to transform information from websites into Web views matching users' preferences, and we introduce a system that enables ordinary end users to create such Web views. The proposed WICCAP system allows users to perform this task in a few simple steps by setting various parameters to customize Web views. A collection of supporting utility tools has been implemented to automate and facilitate each of these steps. With personalized Web views, users are able to alleviate the information overloading syndrome: they can view information in their preferred manner.

We have used the WICCAP system to create personalized Web views for several websites from different genres, including online newspapers, online bookstores, digital libraries, product catalogs and even personal account information in our university library. Our preliminary experience with the system indicates that ordinary users are able to create their own Web views using the tools without much difficulty, and that they are generally satisfied with the alternate presentation styles offered by the presentation toolkit. An experiment on the first component, the Mapping Wizard, has been conducted and reported in [15]. However, we have yet to carry out a comprehensive evaluation of the overall WICCAP system, especially regarding how easy it is to use and to what extent it satisfies users compared with other means of accessing information. This is one of the directions we are targeting in the near future.

Other future work includes the maintenance of global logical views in the face of change. This may involve change detection at the source website and change propagation to the individual Web views that have been created. Support for the integration of views across multiple websites is another important area to work on.

References

[1] B. Adelberg. NoDoSE – A Tool for Semi-Automatically Extracting Semi-Structured Data from Text Documents. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1998), pages 283–294, Seattle, Washington, USA, June 2–4 1998.

[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD 1998), pages 94–105, Seattle, Washington, USA, June 2–4 1998.

[3] C. R. Anderson and E. Horvitz. Web Montage: A Dynamic Personalized Start Page. In Proceedings of the 11th World Wide Web Conference (WWW11), pages 704–712, Honolulu, Hawaii, USA, May 7–11 2002.

[4] P. Atzeni, G. Mecca, and P. Merialdo. To Weave the Web. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 97), pages 206–215, Athens, Greece, August 25–29 1997.

[5] R. Baumgartner, S. Flesca, and G. Gottlob. Visual Web Information Extraction with Lixto. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), pages 119–128, Roma, Italy, September 11–14 2001.

[6] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, May 2001.

[7] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), pages 109–118, Roma, Italy, September 11–14 2001.

[8] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[9] D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale, Y.-K. Ng, and R. D. Smith. Conceptual-model-based data extraction from multiple-record Web pages. Data & Knowledge Engineering, 31(3):227–251, 1999.

[10] J. Freire, B. Kumar, and D. Lieuwen. WebViews: Accessing Personalized Web Content and Services. In Proceedings of the 10th World Wide Web Conference (WWW10), pages 576–586, Hong Kong, China, May 1–5 2001.

[11] J. Hammer, H. García-Molina, J. Cho, A. Crespo, and R. Aranha. Extracting Semistructured Information from the Web. In Proceedings of the Workshop on Management of Semistructured Data, pages 18–25, Tucson, Arizona, USA, May 16 1997.

[12] A. H. F. Laender, B. A. Ribeiro-Neto, and A. S. da Silva. DEByE – Data Extraction By Example. Data & Knowledge Engineering, 40(2):121–154, 2002.

[13] L. Liu, C. Pu, and W. Han. XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. In Proceedings of the 16th International Conference on Data Engineering (ICDE 2000), pages 611–621, San Diego, California, USA, February 28 – March 3 2000.

[14] Z. Liu, W. K. Ng, and E.-P. Lim. An Automated Algorithm for Extracting Website Skeleton. In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA 2004), Jeju Island, Korea, March 17–19 2004.

[15] Z. Liu, W. K. Ng, E.-P. Lim, and F. Li. Towards Building Logical Views of Websites. To appear in Data & Knowledge Engineering, 2004.

[16] A. Sahuguet and F. Azavant. Building intelligent Web applications using lightweight wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001.


A EBNF Definition of VCL

This section gives the EBNF definition of VCL. The complete definition is given in Figure 6. The definitions of the terminals are borrowed from the W3C XQuery (http://www.w3.org/TR/xquery/) and XML Schema Part 2 (http://www.w3.org/TR/xmlschema-2/) specifications.

B Some Examples of VCL

In this section, we elaborate on the semantics of the different statements in VCL by giving some examples. Instances of the VCL language are called VCL programs. Each VCL program, as defined in Figure 6, contains a (possibly empty) list of create webview statements together with an output statement. There are five different types of create webview statements; the following subsections describe each of them in detail.

B.1 The keep Statement

Figure 7 shows an example of the keep statement that produces a Web view containing only, among all nodes in the original BBCGlobalView view, the third section under the root node and the second section of the first section under the root node (recall that each Web view is a tree). The new Web view defined by this statement still has the same root node as the original one, and the subtrees under the specified section nodes are kept.

Figure 8 shows two equivalent examples of the keep statement. The first one uses the all except keyword to indicate that the XPath expression that immediately follows refers to the portion of the tree that is to be removed. In the example, all sections after the fourth one are removed, which is equivalent to keeping the first four sections, as shown in the second example in Figure 8.
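The pruning semantics of keep can be sketched in ordinary code. The following is a minimal illustrative sketch, not the WICCAP implementation: it operates on a Python `ElementTree` tree, uses ElementTree's limited relative XPath syntax instead of VCL's absolute paths, and the helper name `keep_only` is our own.

```python
import xml.etree.ElementTree as ET

def keep_only(root, paths):
    """Keep only the subtrees matched by `paths`, plus the ancestors
    needed to reach them from the root; prune everything else."""
    # ElementTree nodes carry no parent pointer, so build a parent map first.
    parent = {c: p for p in root.iter() for c in p}
    keep = {root}
    for path in paths:
        for node in root.findall(path):
            keep.update(node.iter())       # the matched subtree is kept whole
            while node in parent:          # together with its ancestor chain
                node = parent[node]
                keep.add(node)
    for node in list(root.iter()):
        for child in list(node):
            if child not in keep:
                node.remove(child)
    return root
```

On a BBCGlobalView-like tree, `keep_only(view, ['./Section[3]', './Section[1]/Section[2]'])` would reproduce the effect of the statement in Figure 7: the root survives, as do the two named sections with their subtrees and the first section as the ancestor of the second path.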

B.2 The delete Statement

The examples in Figure 9 illustrate different usages of the delete statement to specify filtering parameters that transform the Web view. The first one is a typical usage of the delete statement: it specifies the node to be deleted and a condition based on the value of some child node of the node to be deleted. In this example, every Record node is deleted if it has a child node named Item with an attribute Type valued "Title" and the value of that child node contains the term "soccer".

The second example shows how numerical values can be used in the where condition. Since every product priced above 3000 is deleted, the new Web view produced by this statement only contains products with price not exceeding 3000.


The last example illustrates the flexibility introduced by using XPath expressions. The XPath ".//*" refers to the current node and all of its descendants. Thus, every Record node whose own value, or any of whose descendants' values, contains the term "soccer" is not retained in the new Web view BBCViewFilter2.
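The filtering behaviour of delete ... where ... contains can likewise be sketched. This is an illustrative sketch under the same assumptions as before (ElementTree trees, relative paths); the helper name `delete_where` is ours.

```python
import xml.etree.ElementTree as ET

def delete_where(root, target_path, cond_path, term):
    """Delete every node matched by `target_path` whose node(s) selected
    by `cond_path` (evaluated relative to the match) have text containing
    `term` -- a sketch of VCL's `delete ... where ... contains` clause."""
    parent = {c: p for p in root.iter() for c in p}
    for node in root.findall(target_path):
        if any(term in (n.text or '') for n in node.findall(cond_path)):
            parent[node].remove(node)
    return root
```

For instance, `delete_where(view, './/Record', "./Item[@Type='Title']", 'soccer')` mirrors the first delete clause of Figure 9.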

B.3 The incupd Statement

The incupd all statement in Figure 10 means that all nodes in BBCViewFilter1 should be incrementally updated when constructing the new Web view BBCViewIncUpd1. It should be pointed out that this statement is rarely used, as it is unlikely that all nodes in a Web view are constantly changing. Typically, only nodes of the list type are constantly updated.

The second example shows how to specify the nodes that should be incrementally updated. Each incupd keyword together with an XPath expression indicates one node, or one set of nodes (an XPath may select a forest), to be considered.

The third example shows how to specify a node to be used to test equality when determining which nodes are new and which are not. In this case, the Item node with a Type attribute valued "Title" is used. Without the using key statement, the node and all its descendants are used for the comparison.
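The effect of the using key clause amounts to a set difference over key values. A minimal illustrative sketch, again over ElementTree records; the helper name `new_records` and the snapshot lists are our own, not WICCAP's API:

```python
import xml.etree.ElementTree as ET

def new_records(old_nodes, fresh_nodes, key_path):
    """Return the freshly fetched nodes absent from the old snapshot,
    comparing records only by the text of the key node (the `using key`
    clause) rather than by the whole subtree."""
    old_keys = {n.findtext(key_path) for n in old_nodes}
    return [n for n in fresh_nodes if n.findtext(key_path) not in old_keys]
```

With `key_path` set to `"./Item[@Type='Title']"`, this mirrors the third example of Figure 10: two records with the same title are considered the same record even if their other children differ.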

B.4 The update Statement

The update statement's syntax is rather straightforward, as depicted in Figure 11. The start at statement specifies when to start constructing the Web view. If no repeat statement follows, the view is constructed only once. However, if a repeat statement is present, the view is continuously refreshed according to the specified frequency.
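The schedule implied by start at and repeat every is simply an arithmetic series of timestamps. A minimal sketch (the function `refresh_times` is our own illustrative name):

```python
from datetime import datetime, timedelta

def refresh_times(start, every=None, count=3):
    """Construction times implied by an update statement: build the view
    once at `start`; with a repeat clause (`every`), refresh at that
    interval.  Only the first `count` times are listed."""
    if every is None:                 # no repeat statement: construct once
        return [start]
    return [start + i * every for i in range(count)]

# The schedule of Figure 11: start at 08:30 on 30/06/2003, repeat every 1 day.
times = refresh_times(datetime(2003, 6, 30, 8, 30), timedelta(days=1))
```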

B.5 The consolidate Statement

The example in Figure 12 shows how two views can be combined using the consolidate statement. The view specified immediately after the consolidate keyword is considered the reference view. The merging statements specify the nodes from each view that are to be consolidated using the similarity detection algorithm.
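The merging step can be sketched as follows. Note that the similarity detection algorithm itself is not reproduced here; in this illustrative sketch it is a caller-supplied predicate, and the records are plain dictionaries rather than WICCAP nodes:

```python
def consolidate(ref_records, other_records, similar):
    """Merge records from a second view into the reference view's record
    list: a record that `similar()` judges to match an existing reference
    record is treated as the same item and dropped; the rest are appended.
    `similar` stands in for the paper's similarity detection algorithm."""
    merged = list(ref_records)
    for rec in other_records:
        if not any(similar(rec, r) for r in ref_records):
            merged.append(rec)
    return merged

# Toy similarity for illustration: two news records match if their titles
# share at least one word (case-insensitive).
def similar(a, b):
    return bool(set(a['title'].lower().split()) &
                set(b['title'].lower().split()))
```

In the spirit of Figure 12, records drawn from the "World" section of one view and the "World News" section of another would be consolidated into a single list with duplicate stories suppressed.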



Named Terminals

QViewName := [http://www.w3.org/TR/xquery/#prod-QName]
QValue    := [http://www.w3.org/TR/xquery/#prod-Literal]
QNumber   := [http://www.w3.org/TR/xquery/#prod-NumericLiteral]
QTime     := [http://www.w3.org/TR/xmlschema-2/#time]
QDate     := [http://www.w3.org/TR/xmlschema-2/#date]
PathExpr  := [http://www.w3.org/TR/xquery/#prod-PathExpr]

Non-Terminals

VCLProgram := ( CreateWebView )+ OutputWebView
CreateWebView := ‘create’ ‘webview’ QViewName ‘as’
    ( KeepStmt | DeleteStmt | IncupdStmt | UpdateStmt | ConsolidateStmt )
OutputWebView := ‘output’ QViewName
KeepStmt := ‘keep’ ( ‘only’ | ‘all except’ ) MultiPathExpr ‘from’ QViewName
MultiPathExpr := PathExpr ( ‘and’ PathExpr )*
DeleteStmt := ( ‘delete’ PathExpr ‘where’ DeleteCond )+ ‘from’ QViewName
DeleteCond := ( ‘not’ )? PathExpr DeleteCondOp QValue
DeleteCondOp := ( ‘contains’ | ‘=’ | ‘>’ | ‘<’ )
IncupdStmt := ( ‘incupd’ ‘all’ | ( ‘incupd’ IncupdPath )+ ) ‘from’ QViewName
IncupdPath := PathExpr ( ‘using’ ‘key’ MultiPathExpr )?
UpdateStmt := ‘update’ StartStmt RepeatStmt? ‘from’ QViewName
StartStmt := ‘start’ ‘at’ QTime ‘on’ QDate
RepeatStmt := ‘repeat’ ‘every’ QNumber RepeatUnit
RepeatUnit := ( ‘second’ | ‘minute’ | ‘hour’ | ‘day’ | ‘week’ | ‘month’ )
ConsolidateStmt := ‘consolidate’ QViewName ‘with’ QViewName ( ‘and’ QViewName )*
    ( MergingStmt WithStmt )+
MergingStmt := ‘merging’ PathExpr ‘in’ QViewName
WithStmt := ‘with’ PathExpr ‘in’ QViewName ( ‘and’ PathExpr ‘in’ QViewName )*

Figure 6. EBNF Grammar of VCL

create webview BBCViewScope1 as
keep only
    /Wiccap/Section[3]
    and /Wiccap/Section[1]/Section[2]
from BBCGlobalView

Figure 7. An Example of the keep Statement


create webview BBCViewScope2 as
keep all except /Wiccap/Section[position()>4]
from BBCGlobalView

create webview BBCViewScope2 as
keep only /Wiccap/Section[position()<=4]
from BBCGlobalView

Figure 8. Another Example of the keep Statement

create webview BBCViewFilter1 as
delete /Wiccap/Section/Region/Record
    where ./Item[@Type="Title"] contains "soccer"
delete /Wiccap/Section/Region/Record
    where ./Item[@Type="Description"] contains "soccer"
from BBCViewScope1

create webview DellViewFilter1 as
delete /Wiccap/Category/ProductList/Product
    where ./Price > 3000
from DellGlobalView

create webview BBCViewFilter2 as
delete /Wiccap/Section/Region/Record
    where .//* contains "soccer"
from BBCViewScope1

Figure 9. Examples of the delete Statement

create webview BBCViewIncUpd1 as
incupd all
from BBCViewFilter1

create webview BBCViewIncUpd2 as
incupd /Wiccap/Section[3]/Region/Record
incupd /Wiccap/Section[1]/Section[2]/Region/Record
from BBCViewScope1

create webview BBCViewIncUpd3 as
incupd /Wiccap/Section/Region/Record
    using key ./Item[@Type="Title"]
from BBCViewFilter1

Figure 10. Examples of the incupd Statement

create webview BBCViewSchedule1 as
update start at 08:30 on 30/06/2003
    repeat every 1 day
from BBCViewIncUpd2

Figure 11. An Example of the update Statement

create webview NewsView as
consolidate BBCGlobalView with CNNGlobalView
merging /Wiccap/Section[@Name="World"]/Region/Record in BBCGlobalView
    with /Wiccap/Section[@Name="World News"]/Region/Record in CNNGlobalView
merging /Wiccap/Section[@Name="Technology"]/Region/Record in BBCGlobalView
    with /Wiccap/Section[@Name="Science & Technology"]/Region/Record in CNNGlobalView

Figure 12. An Example of the consolidate Statement
