Web Services and Information Delivery for Diverse Environments
Juliana Freire    Bharat Kumar
Bell Laboratories, 600 Mountain Ave., Murray Hill, NJ 07974, USA
{juliana,bharat}@research.bell-labs.com
Abstract
There is a growing need for techniques that provide alternative means to access Web content and
services, be it the ability to browse the Web through a voice interface like the PhoneBrowser, or
through a wireless PDA or smart phone. The Web was designed for, and works well on, desktop
computers with large screens and good network connections. However, using the Web through a phone
or a small wireless device poses a number of challenges. In this paper, we discuss the issues
involved in making existing Web content and services available for diverse environments, and
describe PersonalClipper, a system that allows casual users to easily create customized (and
simplified) views of Web sites that are well-suited for different types of terminals.
1 Introduction
The ability to take information, entertainment and e-commerce on the go has a lot of promise. The
wireless data market is expected to grow enormously in the next few years. In the US alone,
Dataquest expects that the number of wireless data subscribers will explode from 3 million in 1998
to 36 million in 2003. Thus, very soon, millions of people will be able to access the Web and order
services and goods from wireless Internet devices. However, the existing Web infrastructure and
content were designed for desktop computers and are not well-suited for devices that have less
processing power and memory, small screens, and limited input devices. In addition, wireless data
networks provide less bandwidth, have high latency, and are not as stable as traditional (wired)
networks.
Consider for example accessing the Web from a personal digital assistant (PDA) such as the Palm
Pilot. Current wireless data services such as Omnisky [14] run over CDPD¹, whose throughput rates
vary from 5-6 kbps up to 12-13 kbps. With a screen size of 160x160 pixels on a 6x6 cm surface, it
can be very hard to browse through large pages with rich graphics. In addition, input facilities
are limited: even with Palm's Graffiti text input system, entering text can be very time consuming.
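To make the bandwidth limitation concrete, the following sketch estimates how long a page would take to arrive at the CDPD rates quoted above. The 50,000-byte page size is an assumed figure chosen for illustration, not a measurement from this paper, and the calculation ignores latency and protocol overhead.

```python
def transfer_seconds(page_bytes: int, kbps: float) -> float:
    """Seconds needed to move page_bytes over a link of kbps kilobits per second."""
    bits = page_bytes * 8
    return bits / (kbps * 1000)


# Assumed page size (HTML plus images); not a figure taken from the text.
PAGE_BYTES = 50_000

for rate_kbps in (5, 13):  # low and high ends of the quoted CDPD range
    seconds = transfer_seconds(PAGE_BYTES, rate_kbps)
    print(f"{rate_kbps} kbps: about {seconds:.0f} s")
```

Under these assumptions the page takes about 80 seconds at 5 kbps and about 31 seconds at 13 kbps, before any network latency is counted.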
¹ CDPD [7] is a wireless IP network that overlays the existing AMPS (analog) cellular infrastructure.
In order to address these limitations of bandwidth, screen real estate, and input facilities, there
are three different approaches/models currently in use:
• Re-engineering existing Web sites: content providers create different versions of their Web sites
  that provide content formatted for specific devices. For example, The New York Times has a
  Palm-friendly section at [URL illegible]; [site name illegible] provides a specialized interface
  for Web-enabled phones, as well as for the Palm VII [2]; and various other Web sites now have
  mobile phone-friendly versions (see [22] for a list of such sites).
• Creating specialized wrappers that export a different view of a Web page or service: third-party
  services such as [site name illegible] and [site name illegible] provide wrappers which export
  wireless-friendly clippings of a set of Web pages and services, such as stock quotes, traffic and
  weather information. These wrappers require no modifications to the underlying Web sites.
• Using proxies that filter and reformat Web content: proxies can be programmed to transform
  content according to the client's display size and capabilities. For example, ProxiWeb [18]
  transforms HTML pages and embedded figures into a format that can be displayed on a Palm Pilot.
But these approaches have drawbacks (Table 1 summarizes their features). From a content provider's
perspective, having to create and maintain multiple versions of a Web site to support different
devices is labor intensive and can be very expensive. The same is true of specialized Web
clippings, as programs (or wrappers) have to be created for each Web site and these need to be
updated every time the underlying site changes. From a user's point of view, both solutions are
restrictive: not all Web sites support all kinds of devices, and wrapper-based solutions do not
offer clippings for all content or services a user may need.
Proxy transcoders, on the other hand, perform on-the-fly content translation and thus are a good
general solution for allowing users to browse virtually any Web site. The kinds of translation done
by these proxies include reduction of image resolution, modification of HTML constructs that cannot
be effectively viewed on smaller screens (e.g., ProxiWeb rewrites pages that contain frames so that
they display the links corresponding to the frames), and translation from HTML to other languages
such as the Wireless Markup Language (WML) [25]. But since Web pages must be presented as
faithfully as possible, these general-purpose proxies do not perform any personalization: Web pages
are always displayed in their entirety. This is clearly not the ideal solution for somebody
accessing the Web through a cellular phone with a 3-line display. Besides, some features are hard
and sometimes impossible to translate. For example, existing browsers for the Palm Pilot do not
support JavaScript, and thus it is not possible to guarantee that pages with JavaScript will behave
correctly in these browsers. It is often the case that proxies are not able to transcode complex
pages.
In this paper, we describe the PersonalClipper system. PersonalClipper provides a platform that
allows end-users to easily create and maintain personalized clippings of Web sites. These Web
clippings are shortcuts to content and services a user (or a group of users) is interested in, such
as the CNN health headlines, weather information for a specific city, flight information from
Travelocity, or one's bank balance. By allowing users to create their own Web clippings, a service
can be offered that is personalized and not restricted to a set of supported Web sites, and users
can easily customize such clippings for specific devices.
                  multi-version sites   wrapped services   transcoding proxies   PersonalClipper
creation cost     high                  high               n/a                   low
maintenance       high                  high               n/a                   low
personalization   limited               limited            none                  high
coverage          low                   low                medium                high
Table 1: Summary of delivery techniques
The process of creating clippings is quite simple: it requires no programming expertise, and can be
done by casual Web users. Furthermore, PersonalClipper generates clippings that are robust to
certain changes to Web sites, and thus the need for maintenance is reduced. Unlike other systems
for creating personal portals (e.g., Portal-to-Go [15], ezlogin [9]), the PersonalClipper system
offers privacy: clipping creation (and retrieval) can be done from the user's machine, without the
intervention of a third-party server.
The structure of the paper is as follows. We start in Section 2 with a motivating example. In
Section 3 we describe the PersonalClipper system, its methodology and architecture. Section 4
describes how the PersonalClipper can be used to create views of pages and services that are
well-suited to different types of devices. Related work is discussed in Section 5. We conclude in
Section 6 with some future directions.
2 Motivating Example
Consider the following scenario. Juliana plans to attend the VLDB conference and she is looking for
flights from JFK to Cairo that leave JFK on September 9th and return from Cairo on September 16th.
She must take the following steps:
• Go to http://www.travelocity.com
• Choose the Find/Book a Flight option,
• Enter the login information,
• Choose the 9 Best Itineraries option,
• Specify details of itinerary.
This series of steps (depicted in Figure 1) produces a page with a list of alternative flights.
Now, if she wants to do this from her Palm Pilot through a wireless modem, there are some
problems:²
• many interactions are needed to access the flights page: Given the high latency and low bandwidth
  of wireless data services, performing all these steps through a wireless modem or on a cellular
  phone can be hard (especially if a significant amount of information needs to be input), very
  time consuming, and sometimes impossible (e.g., certain pages require JavaScript, which is not
  supported by micro browsers such as ProxiWeb [18] and AvantGo [5]).
• irrelevant information is downloaded: Most times, one is only interested in a subset of the
  information presented in a Web page. In this example, not only must the user download a series
  of intermediate pages, but she must also download the whole Flights page even though she might
  only need, say, the first three itineraries. Being able to access only the desired information is
  especially important in the wireless environment, where bandwidth is scarce and expensive, and
  screen space is limited.
² In fact, we were not able to access the Travelocity site using either the ProxiWeb [18] browser,
which was not able to retrieve even the initial page, or AvantGo [5], which does not support secure
connections.
Figure 1: Sequence of steps to retrieve flight itineraries from travelocity.com
http://webclipserver.com/jfreire?clipping=travelcairo&mode=pull
Figure 2: Retrieving flight itineraries from travelocity.com using the PersonalClipper server
If we consider the example above, the ideal would be to create a shortcut that gives one-click
access to the first three itineraries (as shown in Figure 2). In general, it would be useful if one
could easily create not only simple shortcuts, but also different views of Web sites that are
better suited to be accessed from different terminals. In the Palm Pilot scenario, it would be
useful to reduce the number of required interactions, and the amount of data input and transferred.
For example, one could create a clipping template for Travelocity that would automatically log in,
always fill in the departing city and preferred airline with default values, and require from the
user just the travel dates and destination.
3 The PersonalClipper
In this section we describe the PersonalClipper system and its architecture, and discuss the main
issues involved in creating and accessing clippings.
There are two steps involved in creating Web clippings: retrieving a Web page, and extracting
elements from a retrieved page. Given the growing trend of interactive Web sites that publish data
on demand, retrieving the information from the Web is becoming increasingly complicated. Many
sites, from online classified ads to banks, require users to fill in a sequence of forms and/or
follow a sequence of links to access a page they need, and often these hard-to-reach pages cannot
be bookmarked using the bookmark facilities implemented in popular browsers. In order to create
clippings of these pages, the process to access them must be automated. Also, as described in the
example of Section 2, once the desired page is retrieved, a user may want to specify individual
elements of the page she is interested in, so that irrelevant information is filtered out. A Web
clipping thus must encapsulate the actions required to retrieve a particular page, and the
specification of which elements should be extracted from that page.
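The two parts of a clipping described above can be pictured as a small data structure. The sketch below is only an illustration: the class names, field names, and the placeholder extraction expression are assumptions made for this example and are not the format used by the PersonalClipper.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Step:
    """One recorded browsing action."""
    kind: str                      # "url", "link", or "form"
    target: str                    # starting URL, link text, or form name
    inputs: Dict[str, str] = field(default_factory=dict)  # form field values


@dataclass
class Clipping:
    """Retrieval steps plus the expressions that select page fragments."""
    name: str
    steps: List[Step]
    extract: List[str]


travel = Clipping(
    name="travelcairo",
    steps=[
        Step("url", "http://www.travelocity.com"),
        Step("link", "Find/Book a Flight"),
        Step("form", "itinerary", {"from": "JFK", "to": "Cairo"}),
    ],
    extract=["//table[1]"],  # placeholder expression, for illustration only
)
```

Keeping the retrieval steps and the extraction expressions in one object mirrors the requirement that a clipping carry everything needed to reach a page and to cut out the relevant part of it.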
It is possible to automate the retrieval of pages by writing programs in Java or in more
specialized languages such as WebL [11]. One can also write Perl scripts to extract individual
fragments of Web pages. However, this approach is not feasible for casual Web users who are not
programmers. In addition, given the dynamic nature of the Web, maintaining these programs and
scripts can be very costly, as they might require modifications every time Web sites change.
The PersonalClipper addresses these problems by providing a VCR-style interface similar to the
WebVCR [3] to transparently record browsing steps, and a point-and-click interface to let users
select page fragments. Furthermore, the system uses techniques that enhance the robustness of
clippings, so that they work even if certain changes occur in the underlying Web sites.
After a clipping is created, it can be accessed through a PersonalClipper server, which may be
located at a user's machine, at a service provider, or inside an Intranet. As Figure 3 shows, the
PersonalClipper is a Web service that accepts requests from HTTP clients. A request to the
PersonalClipper server contains an identifier for a particular clipping³, which, when executed,
accesses a particular Web page, clips it, and returns the resulting (clipped) page to the
requesting client.
Figure 3: Accessing Web Clippings
The architecture of the PersonalClipper server is shown in Figure 4. The PersonalClipper server
consists of the following modules: 1) the clipping DB, which stores clipping specifications; 2) the
user profile manager, which performs user authentication for sensitive clippings stored on the
server (e.g., a clipping that retrieves a user's 401(k) balance); 3) the clipping scheduler, which
periodically executes clippings (if so specified by the clipping creator); 4) the cache manager,
which stores cached clippings; and 5) the clipping execution engine, which interacts with an HTML
parser, JavaScript interpreter, and HTML content extractor to execute specified clippings.
In what follows we give a more detailed description of clipping creation and execution. For ease of
presentation, we restrict our discussion to the scenario where the PersonalClipper server is hosted
as a Web-based service that a user can access using a Java-enabled Web browser.
3.1 Creating Web Clippings
Web clippings have two components: retrieval and extraction. As depicted in Figure 4, the
PersonalClipper provides applets for both tasks: the recording applet and the extraction applet.
In order to create a clipping, a user must first specify the page to be clipped. If the page
requires multiple steps to be retrieved and does not have a well-defined URL, the user can use the
recording applet to create the script to access the page. The recording applet is a variant of the
WebVCR [3]. It has a VCR-style interface to record browsing actions. When the user clicks the
record button on the applet, she is prompted to input the URL for the starting page, which is then
loaded into a browser window. From this point on, the applet monitors all user actions in that
browser window⁴. This monitoring is transparent to the user, who can simply navigate her way to the
final page as usual. When the user reaches the final page, the sequence of recorded actions (i.e.,
links traversed, forms filled along with the user inputs, and any other interactions with active
content on the page⁵) is saved.
³ As will be described in Section 3.2, requests may also include other parameters such as input
values for the clipping.
Figure 4: PersonalClipper Server Architecture
During the recording process, if the user is required to fill out forms, she can optionally specify
which field values are to be stored in the clipping specification itself, and which ones are to be
requested from the user every time the clipping is executed. This allows the user to create
parameterized clippings. For example, a clipping to retrieve stock quote information from
[site name illegible] can have as a parameter the ticker symbol, so that the user does not need to
create a separate clipping for each stock. Also, for security reasons, a user may choose not to
save certain kinds of information, such as passwords, inside a clipping, or to save it encrypted.
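The split between stored and requested field values can be illustrated with a small helper. This is a sketch under stated assumptions: the function name, the field names, and the rule that a value supplied at execution time overrides a stored one are choices made for the example, not behavior documented for the PersonalClipper.

```python
def resolve_inputs(fields, stored, supplied):
    """Combine values saved in the clipping with values given at execution time.

    fields   -- names of the form fields, in form order
    stored   -- values saved inside the clipping specification
    supplied -- values provided with the current request (the parameters)
    """
    values = {}
    missing = []
    for name in fields:
        if name in supplied:
            values[name] = supplied[name]
        elif name in stored:
            values[name] = stored[name]
        else:
            missing.append(name)
    if missing:
        raise KeyError("missing parameters: " + ", ".join(missing))
    return values


# One stock-quote clipping serves every ticker symbol.
quote_form = ["exchange", "ticker"]
saved = {"exchange": "NYSE"}
print(resolve_inputs(quote_form, saved, {"ticker": "LU"}))
```

A field such as a password that the user chose not to save simply stays out of the stored values, so the helper reports it as missing until the request supplies it.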
In contrast with the current practice of writing wrapper programs (e.g., using languages such as
WebL [11] or Java), the PersonalClipper system offers an alternative for quickly and easily
creating access wrappers/scripts that requires no programming: creating and updating these wrappers
is a simple process involving only the usual browsing actions.
Once the desired page is retrieved, the extraction applet can be used to specify the fragments of
the page that should be extracted. An interesting problem is how to identify these fragments. In
general, any extraction specification chosen needs to provide the ability to 1) address individual
or groups of arbitrary elements in a page, and 2) specify rules (that use the above addressing
scheme) to extract the relevant content from the page. We wanted a solution that was standard,
powerful, portable and efficient.
⁴ The applet adds JavaScript event handlers to all active elements on the page, and when an event
fires, it records the corresponding action. For more details on the monitoring process, the reader
is referred to [3].
⁵ Note however that this is currently restricted to handling JavaScript, and not arbitrary active
content such as applets and plugins on a page.
Our first choice was to use the DOM API [8] to specify extraction expressions. However, this API is
rather limited, e.g., it does not allow the retrieval of tables from an HTML document. We found
XPath [26] to be a better, more flexible addressing scheme than the DOM API. XPath views XML
documents as a tree and provides a mechanism for addressing any node in this tree. One drawback of
using XPath is that it requires pages to be well-formed. Since browsers are very forgiving in this
respect, many Web sites generate pages that are ill-formed (e.g., have overlapping tags, missing
end tags, etc.). Consequently, the PersonalClipper system must first clean up HTML pages (e.g.,
using tools such as HTML Tidy [10]) before users input XPath expressions over the document to
specify the desired content.
The XPath expressions below extract the first three itineraries (each itinerary is represented by
an HTML table) from the flight selection page of the Travelocity example of Section 2.
//html/body/center[2]/table[2] |
//html/body/center[2]/table[3] |
//html/body/center[2]/table[4]                                    (1)
//html/body/center[2]/table[position() < 5 and position() > 1]    (2)
//table[contains(string(), "price") and position() < 5]           (3)
These expressions can be rather complicated, and writing them can be an involved task. In addition,
there are multiple ways to specify a particular page element, and some may be preferable to others
(as explained in Section 3.3). To address these problems, we are currently designing a
point-and-click interface that lets users select portions of Web pages (as they see them in the Web
browser) and automatically generates extraction expressions. The point-and-click interface will
provide users with different levels of abstraction that correspond to a breadth-first search in the
portion of the document tree that is visible in the browser. For example, a user who is interested
in particular cells of a table must first select the table and then zoom into the table to select
the desired cells.
Figure 5 illustrates a Web clipping specification (simplified for exposition purposes) for the
Travelocity example described in Section 2. The first part of a clipping specification corresponds
to a sequence of browsing steps (i.e., <URL>, <LINK>, and <FORM>). The <EXTRACT> elements contain
the extraction specifications. Note that multiple fragments can be specified, and users may choose
to specify these fragments according to the terminal where they will be displayed. For example, if
the clipping is to be displayed on a Palm Pilot, the user could choose to extract the first 3
itineraries (the extraction tag with fragment name="first 3 itineraries"), whereas if the clipping
is to be displayed on a Web-enabled cellular phone with a 3-line display, a single itinerary may be
preferable (e.g., the extraction tag with fragment name="first itinerary").
Given the unpredictable behavior of the Web (network delays, unreachable sites, etc.), caching
plays an important role in a PersonalClipper server. Users can specify, for each clipping, if and
how often it should be executed and cached (e.g., weather information for a user's hometown should
be refreshed every 6 hours).
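A per-clipping refresh interval amounts to a cache whose entries expire. The sketch below shows one way this could look; the class and method names are invented for the example, and the injectable clock exists only so that expiry can be demonstrated without waiting.

```python
import time


class ClippingCache:
    """Stores clipped content together with the time at which it goes stale."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._entries = {}

    def put(self, name, content, ttl_seconds):
        """Cache content for a clipping, valid for ttl_seconds from now."""
        self._entries[name] = (content, self._clock() + ttl_seconds)

    def get(self, name):
        """Return cached content, or None if it is absent or has expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        content, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]   # stale: the clipping must be re-executed
            return None
        return content


SIX_HOURS = 6 * 60 * 60
now = [0.0]                           # fake clock for the demonstration
cache = ClippingCache(clock=lambda: now[0])
cache.put("weather", "sunny, 75F", SIX_HOURS)
fresh = cache.get("weather")          # served from the cache
now[0] = SIX_HOURS                    # six hours later
stale = cache.get("weather")          # expired, so None
```

A scheduler would react to the `None` result by executing the clipping again and storing the new content.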
3.2 Executing Clippings
After a clipping is specified, it can be saved and uploaded to a PersonalClipper server. Users may
then access clippings via URLs that uniquely identify them. Users may further specify additional
parameters such as input values for the clipping (e.g., the password to access a bank account); the
mode of operation (pull or push); whether the clipping should be cached; and how often it should be
refreshed.
In the pull mode, the URL invokes a CGI script at the server, which in turn executes the clipping
specification and immediately returns the clipped content to the requesting client. In the push
mode, the execution and delivery of the clipping are asynchronous, i.e., the clipping can be
returned to the client later, possibly through protocols other than HTTP (e.g., clippings could be
emailed to users). The push mode is preferable when back-end Web sites are slow or temporarily
unreachable, or when the end user cannot or does not want to keep a session open for too long⁶.
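The request URL shown with Figure 2 suggests how a server might pick apart the clipping identifier, the mode, and any input values. The sketch below is a guess at such a parser: treating the path as the user name, defaulting to pull mode, and the extra `depart` parameter are all assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def parse_request(url):
    """Split a clipping URL into user, clipping name, mode, and input values."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    clipping = params.pop("clipping")          # required identifier
    mode = params.pop("mode", "pull")          # assume pull when unspecified
    if mode not in ("pull", "push"):
        raise ValueError(f"unknown mode: {mode}")
    return {
        "user": parts.path.strip("/"),
        "clipping": clipping,
        "mode": mode,
        "inputs": params,                      # remaining values feed the clipping
    }


request = parse_request(
    "http://webclipserver.com/jfreire?clipping=travelcairo&mode=pull&depart=sep9"
)
print(request)
```

Everything left over after the reserved parameters are removed is passed on as input values, which matches the idea of parameterized clippings.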
Clipping execution proceeds as follows. The page corresponding to the starting URL is fetched and
parsed. The user actions are then executed in sequence, some of which might cause new Web pages to
be fetched. For example, link traversals are executed by fetching the corresponding URL; form
submissions are executed by first filling the form fields with the recorded user inputs, and then
submitting the form; and if there are any JavaScript event handlers on elements of the page the
user has interacted with, such actions are filtered through the JavaScript interpreter to ensure
that the same handlers fire during replay.
After the final page has been retrieved (and cleaned), the extraction expressions are evaluated to
extract the desired content. This can be done with an XSLT [27] processor such as XT⁷. The
extracted content is then returned to the client.
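The replay procedure can be summarized as a loop over recorded steps followed by one extraction call. The sketch below uses an in-memory stand-in for the Web so that it runs on its own; the step encoding, the `SITE` table, and the `fetch` function are hypothetical and do not reflect the engine's real interfaces, and JavaScript handling is left out.

```python
# Tiny stand-in for a Web site: URL -> page with its links and body text.
SITE = {
    "/home": {"links": {"Find a flight": "/search"}, "body": "welcome"},
    "/search": {"links": {}, "body": "search form"},
    "/results": {"links": {}, "body": "flight A\nflight B\nflight C\nflight D"},
}


def fetch(url, form_values=None):
    """Stand-in for an HTTP fetch; form_values are accepted but ignored here."""
    return SITE[url]


def execute_clipping(steps, fetch, extract):
    """Replay the recorded steps in order, then extract from the final page."""
    page = None
    for kind, arg in steps:
        if kind == "url":
            page = fetch(arg)                  # starting page
        elif kind == "link":
            page = fetch(page["links"][arg])   # follow a link by its text
        elif kind == "form":
            action, values = arg
            page = fetch(action, values)       # submit recorded inputs
        else:
            raise ValueError(f"unknown step kind: {kind}")
    return extract(page)


steps = [
    ("url", "/home"),
    ("link", "Find a flight"),
    ("form", ("/results", {"from": "JFK", "to": "Cairo"})),
]
first_three = execute_clipping(steps, fetch, lambda p: p["body"].splitlines()[:3])
print(first_three)
```

Only the extracted lines leave the function, which is the property that keeps traffic to the client small.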
Note that all processing (retrieval and extraction) is done at the PersonalClipper server. Only
select portions of Web pages are returned to the requesting client, effectively giving users
one-click access to desired content, and considerably reducing the communication between the client
and the PersonalClipper server. This feature is especially useful in wireless environments, where
users have to access the Web through high-latency and low-bandwidth connections.
Since Web pages may change between record and replay, the PersonalClipper uses techniques to ensure
that replaying a sequence of recorded actions will lead to the intended page, and that the correct
fragments are extracted, even when the underlying pages are modified.
⁶ Some wireless services, such as Sprint PCS, charge for usage time.
⁷ XT is available at http://www.jclark.com/xml/xt.html.
Figure 5: Web clipping for retrieving the itineraries from http://www.travelocity.com
3.3 Robustness Issues
Usually, changes to Web pages do not pose problems to a user browsing the Web, but they do present
a challenge to a system that performs automatic navigation. In a sequence of recorded browsing
actions, some links may contain embedded session ids, and forms may contain hidden elements that
change from one interaction to the next. Thus, for each user action during replay, the
PersonalClipper system must locate the correct object (link, form or button) to be operated on, and
this can be challenging in the presence of changes to Web pages (e.g., addition/removal of banner
ads).
To ensure that clippings execute properly and retrieve the intended page, enough information must
be saved for each action. For example, for a link traversal the PersonalClipper saves the DOM
address of the link, its text and its URL. During replay, if an exact match for the link cannot be
found in a page, heuristics are used to find the closest match for it. For a more detailed
discussion of the heuristics used, see [3]. Note however that if the page structure changes
radically, these heuristics may fail, in which case the clipping will need to be re-recorded.
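The actual matching heuristics are described in [3] and are not reproduced here. Purely as an illustration of the idea, the sketch below scores candidate links against the three saved properties (DOM address, text, URL); the weights, the threshold, and the use of `difflib` string similarity are assumptions of this example, not the system's method.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Ratio in [0, 1] of how alike two strings are."""
    return SequenceMatcher(None, a, b).ratio()


def score(recorded, candidate):
    """Weighted closeness of a candidate link to the recorded one."""
    same_address = 1.0 if recorded["address"] == candidate["address"] else 0.0
    return (0.5 * similarity(recorded["text"], candidate["text"])
            + 0.3 * similarity(recorded["url"], candidate["url"])
            + 0.2 * same_address)


def find_link(recorded, candidates, threshold=0.6):
    """Return the exact match if present, else the closest candidate, else None."""
    if recorded in candidates:
        return recorded
    best = max(candidates, key=lambda c: score(recorded, c), default=None)
    if best is None or score(recorded, best) < threshold:
        return None   # page changed too much: the clipping must be re-recorded
    return best


recorded = {"address": "body/a[3]", "text": "Find/Book a Flight",
            "url": "/flights?sid=111"}
page_links = [
    # A banner link now occupies the old position of the recorded link.
    {"address": "body/a[3]", "text": "Hot Deals", "url": "/deals?sid=222"},
    # The wanted link moved and carries a new session id.
    {"address": "body/a[4]", "text": "Find/Book a Flight", "url": "/flights?sid=222"},
]
match = find_link(recorded, page_links)
```

In this example the link text and most of the URL still agree, so the moved link outscores the banner that took over its old DOM address.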
Extraction expressions can also be made robust to changes to Web pages. For example, in the XPath
expression (1) above, if a new <table> tag is added to the document, the expression will no longer
retrieve the correct tables. Besides the index of the particular node to be extracted, the
specification may include extra information that helps the system identify certain elements if the
indices happen to change. For instance, the XPath expression (3) specifies tables with an index
less than 5 that contain the "price" string; this expression would still retrieve the correct
itineraries even if new <table> tags are added. Robustness can also be improved by adding
redundancy in the specification, for example, the path from the root of the document to the
element, and some contextual information (such as surrounding text) [16].
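The difference between purely positional and content-aware extraction can be tried out on a toy page. The sketch below uses Python's standard `xml.etree.ElementTree`, whose XPath subset does not cover `contains()` or position comparisons, so the content test is written as ordinary Python; the sample page is invented, and positions are counted in document order, which is a simplification of XPath's per-parent `position()`.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<table><tr><td>Itinerary 1</td><td>price: $900</td></tr></table>
<table><tr><td>Itinerary 2</td><td>price: $950</td></tr></table>
<table><tr><td>Itinerary 3</td><td>price: $990</td></tr></table>
</body></html>"""


def by_index(root, count=3):
    """Purely positional: the first `count` tables in the document."""
    return root.findall(".//table")[:count]


def by_content(root, keyword="price", limit=5):
    """Tables at a position below `limit` whose text contains `keyword`."""
    tables = root.findall(".//table")          # document order
    return [t for pos, t in enumerate(tables, start=1)
            if pos < limit and keyword in "".join(t.itertext())]


def labels(tables):
    """First cell of each table, used to show which tables were picked."""
    return [t.find(".//td").text for t in tables]


original = ET.fromstring(PAGE)
# The same page after the site inserts a banner table at the top.
changed = ET.fromstring(PAGE.replace(
    "<body>", "<body><table><tr><td>Banner ad</td></tr></table>"))

print(labels(by_index(changed)))     # positional selection picks up the banner
print(labels(by_content(changed)))   # content test still finds the itineraries
```

The positional version silently returns the banner in place of the third itinerary, while the content-aware version tolerates one inserted table before its position bound is exhausted.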
4 Delivering Clippings to Diverse Terminals
The PersonalClipper functions as a Web service, and as Figure 3 shows, the destination for the
clipped content can be any user-agent that understands HTTP (e.g., a browser on a user's desktop).
The PersonalClipper platform can thus be used to create personal portals like
[site name illegible] that put together Web clippings with information from various Web sites, and
that users can access from their Web browsers with a single click. The PersonalClipper can also be
used in conjunction with other gateways and transcoding proxies to provide content to devices that
do not handle HTTP/HTML; for example, it can be used together with a WAP gateway.
There are many benefits to using the PersonalClipper for delivering information to diverse
terminals. By offloading all processing and most network communication to a server, it fits well
with the thin-client architecture used for wireless devices. In addition, by customizing and
filtering content, it can significantly simplify Web pages, making the job of transcoding proxies a
lot easier. In this section, we examine in more detail some of the issues involved in using the
PersonalClipper in conjunction with transcoding proxies. For simplicity, we focus on WAP proxies.
The Wireless Application Protocol (WAP) is based on a 3-tier architecture where the central
component, the gateway, is responsible for encoding and decoding requests from wireless devices to
Web servers and vice-versa. As Figure 3 illustrates, as a user browses the Web through a
Web-enabled cellular phone, requests are sent to a WAP gateway. The WAP gateway decodes and
executes the requests (e.g., a URL fetch). When the requested document is retrieved from the Web,
it is translated into WML (Wireless Markup Language), appropriately encoded, and returned to the
phone. Since WAP gateways talk HTTP and HTML, it is straightforward to use any existing WAP gateway
together with a PersonalClipper server.
WAP provides a push framework [23] that can be used in conjunction with the PersonalClipper push
mode to provide batch/asynchronous content retrieval. The usage scenario is as follows. An end user
requests a clipping from a PersonalClipper server by specifying its URL and optionally a set of
parameters (e.g., the frequency of push). The PersonalClipper server would then act as a push
initiator, periodically retrieving and filtering the specified content, and pushing it to the
user's device via a push proxy gateway. Notification services could also be built using the push
mechanism. For example, rules could be added to the clipping specification that dictate under what
conditions the clipping should be pushed to the device. For devices that do not support a push
framework, different mechanisms may be used: specialized servers/gateways could be layered on top
of the PersonalClipper to send information to pagers or email addresses, or to convert content to
speech and send it to a voice mailbox.
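Notification rules of the kind mentioned above could be expressed as small predicates evaluated after each scheduled execution. The following sketch is hypothetical: the rule functions, the `run_cycle` helper, and the simulated fare values are invented for illustration, and delivery through a push proxy gateway is replaced by appending to a list.

```python
def on_change(previous, current):
    """Push whenever the clipped value differs from the last one seen."""
    return previous != current


def below(threshold):
    """Build a rule that pushes when a numeric value drops under threshold."""
    return lambda previous, current: current < threshold


def run_cycle(fetch_value, rule, state, push):
    """Execute the clipping once and push the result if the rule fires."""
    current = fetch_value()
    if rule(state.get("last"), current):
        push(current)
    state["last"] = current


# Simulated fares from three successive executions of a clipping.
fares = iter([120, 120, 95])
sent = []
state = {}
for _ in range(3):
    run_cycle(lambda: next(fares), below(100), state, sent.append)
print(sent)
```

Only the third execution satisfies the rule, so a single notification would reach the device.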
To enable secure e-commerce services, and to allow end users to access sensitive information (e.g.,
401(k) balance), there needs to be a mechanism to provide security between the device and the
back-end Web sites. Since the PersonalClipper server executes requests on the user's behalf, it is
not possible to establish an end-to-end secure connection. The next best scenario is to have two
secure connections: one between the device and the PersonalClipper server, and another between the
PersonalClipper server and the back-end Web site. For WAP devices, this requires WTLS (Wireless
Transport Layer Security) [24] to provide application-level security, rather than secure
connections between the user-agent and the WAP gateway only. In this scenario, for devices that
cannot handle HTML, the task of transcoding the request/response must be performed at the
PersonalClipper server, since a separate WAP gateway would not be able to access the encrypted data
flowing from the PersonalClipper server to the WAP device.
Note that tighter coupling between the PersonalClipper and transcoding proxies is possible. In this
scenario, the PersonalClipper could be used as a universal server that accepts requests from
various devices and returns content formatted according to the type of the device: multi-device
clippings could be created that specify how the clipping should be displayed on different devices.
A <DEVICE> tag could be added to clipping specifications (see Figure 5), for example:
<DEVICE …="…" model="…">
  <DISPLAY fragment="first itinerary"/>
</DEVICE>
The <DEVICE> tag may also contain information about general capabilities and characteristics of the
device, as suggested in the W3C CC/PP note [6].
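Choosing a fragment by device type reduces to a lookup. The sketch below illustrates this with a hypothetical device table; the device identifiers, the fallback choice, and the sample itineraries are invented for the example, while the fragment names follow the Palm Pilot and cellular phone cases discussed in Section 3.1.

```python
# Hypothetical device table: which fragment to return for each device type.
FRAGMENT_BY_DEVICE = {
    "palm": "first 3 itineraries",
    "wap-phone": "first itinerary",
}


def select_fragment(device_type, fragments, default="first itinerary"):
    """Pick the extracted fragment suited to the requesting device."""
    name = FRAGMENT_BY_DEVICE.get(device_type, default)
    return fragments[name]


# Made-up sample data standing in for content extracted from a flights page.
itineraries = ["JFK-CAI 9:00", "JFK-CAI 13:30", "JFK-CAI 21:15"]
fragments = {
    "first 3 itineraries": itineraries[:3],
    "first itinerary": itineraries[:1],
}
print(select_fragment("wap-phone", fragments))
```

Falling back to the smallest fragment for unknown devices is a conservative choice for this sketch, since a short result fits even a 3-line display.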
5 Related Work
The area of information delivery to heterogeneous devices has attracted a lot of attention
recently. In the domain of wireless devices such as PDAs and cellular phones, the Wireless
Application Protocol (WAP) initiative [25] is working on standard solutions to give wireless users
access to secure, reliable, stateful transaction services via resource-constrained portable
terminals [12]. The main objectives of the WAP Forum are: to bring Internet content and services to
digital cellular phones and other wireless terminals; to create a protocol that will work across
differing wireless network technologies; and to enable the creation of content and applications
that scale across a very wide range of bearer networks and device types. WAP thus has the potential
to enable transport-independent client/server communication sessions from portable devices over
wireless links. However, WAP also faces important challenges. WAP is based on a 3-tier architecture
where the central component, the gateway, is responsible for encoding and decoding requests from
wireless devices to Web servers and vice-versa. Given the growing complexity of Web sites (e.g.,
the presence of scripting languages, dynamic content, malformed content), transcoding can be very
hard, and in practice, many pages and services are just not amenable to transcoding and cannot be
accessed through WAP. By allowing users to easily customize services and filter out irrelevant
content and complex features, the PersonalClipper system greatly simplifies the transcoding
process, increasing the Web coverage for WAP devices.
The simplification of transcoding applies to domains other than WAP. The PhoneBrowser [17] provides
a programmable platform that gives the general population of Web page authors the means for
building Interactive Voice Response (IVR) systems without having to own any IVR equipment. Web IVR
applications are currently built using transcoders such as Spyglass' Prism [20]: content providers
write Prism scripts to transcode existing pages into versions that are more amenable to being read
out. The PersonalClipper system can be used as an alternative to automatically generate these
scripts without requiring users to write programs.
In the area of information integration, many systems and techniques have been proposed to wrap Web
sites. Most of the work in this area, though, focuses on extracting structure from semi-structured data
(e.g., [4, 1]). The extractor component of the PersonalClipper system is not concerned with
understanding the structure or discovering the schema of the underlying data, but with providing robust
mechanisms to identify high-level HTML or XML syntactic components (e.g., the first table after a
specific string).
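To make the idea of a purely syntactic locator concrete, the following is a minimal sketch of "the first table after a specific string". It is not the PersonalClipper extractor: the sample page, the marker text, and the function name first_table_after are hypothetical, and the sketch assumes the page has already been cleaned up into well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """<html><body>
<table><tr><td>Banner ad</td></tr></table>
<h2>Flight status</h2>
<table><tr><td>UA 101</td><td>On time</td></tr></table>
</body></html>"""


def first_table_after(root, marker):
    """Return the first <table> that starts after an element whose text contains marker.

    Only the text directly inside an element is inspected (not tail text),
    which keeps the sketch short; a real locator would need to cover both.
    """
    seen_marker = False
    for elem in root.iter():  # iter() visits elements in document order
        if seen_marker and elem.tag == "table":
            return elem
        if marker in (elem.text or ""):
            seen_marker = True
    return None


root = ET.fromstring(PAGE)
table = first_table_after(root, "Flight status")
cells = [td.text for td in table.iter("td")]
print(cells)
```

The locator makes no assumption about what the table means; it depends only on the marker string and on document order, so it keeps working if unrelated content before the marker changes.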
The first version of PersonalClipper uses XPath to address specific components to be extracted from
Web pages. Other languages could also be used for this purpose, for example WebL [11] or the scheme
used by W4F [19]. These languages provide good mechanisms to extract fragments from documents; in
some cases, they are easier to use than XPath. However, XPath is a widely accepted standard and there
are freely available tools to process XPath expressions.
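As an illustration of addressing a page component with a path expression, the sketch below uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The page, the id values, and the expression are assumptions made for this example; the expressions actually generated by PersonalClipper are not shown in this section.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page; the id attributes are invented for the example.
PAGE = """<html><body>
<table id="ads"><tr><td>Buy now</td></tr></table>
<table id="quotes">
<tr><td>LU</td><td>61.50</td></tr>
<tr><td>T</td><td>34.25</td></tr>
</table>
</body></html>"""

# Path expression selecting every row of the table whose id is "quotes".
QUOTE_ROWS = ".//table[@id='quotes']/tr"


def extract(page, xpath):
    """Parse the page and return the elements addressed by the path expression."""
    root = ET.fromstring(page)
    return root.findall(xpath)


rows = extract(PAGE, QUOTE_ROWS)
quotes = [[td.text for td in row.findall("td")] for row in rows]
print(quotes)
```

Because the expression names the component rather than its position on the page, the unrelated "ads" table can change or disappear without affecting the extracted rows.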
Recently, there has been a proliferation of personalization systems which offer services that range
from notifications about changes to certain Web pages (e.g., Mindit [13]) to the creation of personal
portals (e.g., Portal-to-Go [15], ezlogin [9], Yodlee [28]). These systems have some drawbacks, most
notably:
• Limited coverage: services offer clippings for a limited number of sites. For example, Yodlee2Go
[28] allows users to check flight info on Expedia, but it does not allow users to access Expedia's
rental car services (see footnote 8).
• Lack of privacy: in order to use these services, users are forced to go through third-party servers
that can observe all user interaction (passwords input as well as content retrieved).
The PersonalClipper system addresses both of these problems: it lets users create clippings for virtually
any Web site/page; and by placing a PersonalClipper server at a user's machine, it offers complete
privacy, since clippings can be created and accessed without the need to go through a third-party server.
It is worth pointing out that even though the main motivation for PersonalClipper is to provide
personalized Web clippings to end-users, the system can also be used by portal services to simplify the
creation and maintenance of specialized wrappers.
6 Discussion
The PersonalClipper provides a platform that lets end-users as well as content providers easily create and
maintain customized (and simplified) views of Web pages and services. These customized clippings are
easy to create (clipping creation requires no programming expertise); require low maintenance (they are
robust to certain changes in the underlying Web sites); are highly customizable; and may be created for
virtually any Web site.
When used as a Web service, the PersonalClipper server performs all processing (page retrieval and
content extraction) required to construct a clipping, and returns to the requesting client just the final
clipping. By reducing the number of interactions required to retrieve hard-to-reach pages, and allowing
users to customize clippings so that the need for data input is diminished and only the desired content is
retrieved, the PersonalClipper can be an integral part of a thin-client architecture for content delivery. It
can be especially useful in environments where users have to access the Web through high-latency and
low-bandwidth connections.
Furthermore, in conjunction with transcoding gateways, customized clippings can be used to provide
content to devices that do not handle HTTP/HTML; for example, they can be used together with a WAP
gateway. An important advantage of using the PersonalClipper in this scenario comes from the
customization and content filtering, which can significantly simplify Web pages, making the job of
transcoding proxies a lot easier.
Finally, it is worth pointing out that Web clippings contain useful information about the capabilities
of Web services, such as the attributes needed to retrieve a certain Web page. It would be
interesting to investigate if and how this information can be used to facilitate not only the discovery and
selection of specific services, but also the process of combining different services.
8 Note that even though ezlogin.com [9] offers the option for users to create their own Web clippings, they are not able to
create clippings of certain hard-to-reach pages, for example those that involve JavaScript actions.
Acknowledgements: The authors thank Jayant Haritsa for useful comments on the first draft of this
paper.
References
[1] B. Adelberg. NoDoSE - a tool for semi-automatically extracting structured and semi-structured data
from text documents. In Proc. SIGMOD, pages 283–294, 1998.
[2] Amazon anywhere. http://www.amazon.com/anywhere.
[3] V. Anupam, J. Freire, B. Kumar, and D. Lieuwen. Automating Web navigation with the WebVCR.
In Proc. of WWW, pages 503–517, 2000.
[4] N. Ashish and C.A. Knoblock. Wrapper generation for semi-structured internet sources. SIGMOD
Record, 26(4):8–15, 1997.
[5] Avantgo. http://www.avantgo.com.
[6] HTML Tidy.
[7] CDPD. http://www.wirelessdata.org/develop/cdpdspec.
[8] DOM. http://www.w3.org/TR/REC-DOM-Level-1.
[9] ezlogin. http://www.ezlogin.com.
[10] HTML Tidy. http://www.w3.org/People/Raggett/tidy.
[11] T. Kistler and H. Marais. WebL: a programming language for the Web. In Proc. of WWW, 1998.
http://www.research.digital.com/SRC/WebL/index.html.
[12] James Kobielus. Wireless application protocol. Technical Report v1, The Burton Group, April
2000.
[13] Mind-it.
[14] Omnisky. http://www.omnisky.com.
[15] Portal-to-go.
[16] T. Phelps and R. Wilensky. Robust intra-document locations. In Proc. of WWW, pages 105–118,
2000.
[17] Phonebrowser. http://phonebrowser.research.bell-labs.com/.
[18] ProxiWeb. http://www.proxiweb.com.
[19] Arnaud Sahuguet and Fabien Azavant. Building light-weight wrappers for legacy web data-sources
using W4F. In Proc. of VLDB, pages 738–741, 1999.
[20] Spyglass prism. http://www.spyglass.com.
[21] Voicexml. http://www.voicexml.org.
[22] http://www.sprintpcs.com/wireless/wwbrowsing providers.html.
[23] WAP push architectural overview. http://www.wapforum.org/, November 1999.
[24] Wireless transport layer security specification version 1.1. http://www.wapforum.org, November
1999.
[25] Wireless Application Protocol Forum. Wireless Application Protocol: The Complete Standard.
Wiley, 1999.
[26] XPath. http://www.w3.org/TR/xpath.
[27] XSLT. http://www.w3.org/TR/xslt.
[28] Yodlee2go. http://www.yodlee.com.