Web Mining: Assignment 7

Antonio Villacorta Benito [email protected]

March 15, 2012

1 Web Dynamics: definition and goals

Web dynamics studies how web information content, topology, and usage change, and what kinds of models and techniques scale up to the fast rate of web change [12]. Some key aspects of web dynamics are:

• Related to Information on the web

– Web data mining: content, structure and usage mining.

– Collaborative filtering: provide recommendations to a user based on preferences of other similar users.

– Web interaction: how to solve the navigation problem, whereby users lose track of the context when following a sequence of links and are unsure how to proceed in terms of satisfying their original goal.

– Web site design: in particular Web site usability and adaptive web sites.

• Related to Computation on the web

– Data intensive web applications

– Web-based workflow management systems

– Event-condition-action (ECA)

– Mobile agents

2 Web characteristics

Some important characteristics of the web are mentioned below:

• The topology of the web is a complex graph with a bow-tie structure, with a highly connected nucleus.

• Quality of web pages varies widely.


• There is a lot of redundancy in the web.

• There is a mix of structured data and unstructured text on the web.

• The Web consists of the surface web (browsable) and the deep web (accessible through queries).

• A lot of web content generation is dynamic.

• The web is a platform: plenty of web services are now available and accessible online.

• The web grows at an exponential rate.

• The web is becoming social: people create and “populate” the Web by socializing online, gradually moving parts of their lives from the physical world to the online world.

3 Zipf Law and Web Power Laws

Many web features can be described by a power law, meaning that the probability of attaining a certain value x is proportional to x^(−β) [1]. In other words:

p(x) = C x^(−β),   β > 0

β is called the exponent of the power law, and C is a constant. Zipf's law is a particularization of a power law with exponent β = 1.
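
As an illustration of these definitions (not part of the original assignment), the following Python sketch draws synthetic samples from a continuous power law by inverse-transform sampling and recovers the exponent with the maximum-likelihood estimator discussed by Newman [14]; the chosen exponent 2.1 matches the in-degree value reported in [16].

```python
import math
import random

def sample_power_law(n, beta, x_min=1.0, seed=42):
    """Draw n samples from p(x) proportional to x^(-beta), x >= x_min,
    using inverse-transform sampling."""
    rng = random.Random(seed)
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0)) for _ in range(n)]

def mle_exponent(xs, x_min=1.0):
    """Maximum-likelihood estimate of beta: 1 + n / sum(ln(x_i / x_min))."""
    return 1.0 + len(xs) / sum(math.log(x / x_min) for x in xs)

if __name__ == "__main__":
    data = sample_power_law(100_000, beta=2.1)             # 2.1: in-degree exponent from [16]
    print(f"estimated exponent: {mle_exponent(data):.2f}")  # should come out close to 2.1
```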

Power-law distributions occur in an extraordinarily diverse range of phenomena. In addition to city populations, the sizes of earthquakes, moon craters, solar flares, computer files and wars, the frequency of use of words in any human language, the frequency of occurrence of personal names in most cultures, the numbers of papers scientists write, the number of citations received by papers, the number of hits on web pages, the sales of books, music recordings and almost every other branded commodity, the numbers of species in biological taxonomy, people's annual incomes and a host of other variables all follow power-law distributions [14].

Unlike the more familiar Gaussian distribution, a power-law distribution has no ‘typical’ scale and is hence frequently called ‘scale-free’. A power law also gives a finite probability to very large elements, whereas the exponential tail in a Gaussian distribution makes elements much larger than the mean extremely unlikely.

Zipf's law is present in many features of the Internet: there are many small elements contained within the Web, but few large ones. A few sites consist of millions of pages, but millions of sites only contain a handful of pages. Few sites contain millions of links, but many sites have one or two. Millions of users flock to a few select sites, giving little attention to millions of others.

Links in the web, for example, are not randomly distributed; for one thing, the distribution of the number of links into a web page does not follow the Poisson distribution one would expect if every web page were to pick the destinations of its links uniformly at random. Rather, this distribution is a power law, in which the total number of web pages with in-degree x is proportional to x^(−β); the value of β typically reported by studies is 2.1 [16].

Three web features which follow a power law are discussed in detail below.


3.1 Growth model

The pervasiveness of Zipf distributions on the Internet can be explained by an intuitive growth model that incorporates three simple assumptions. Let us formulate the argument in terms of the number of web pages hosted on a website. Similar arguments can be applied just as easily to the number of visitors or links:

• Multiplicative growth: day-to-day fluctuations in site size are proportional to the size of the site, i.e. the number of pages added to or removed from the site is proportional to the number of pages already present.

• Sites can appear at different times: the number of web sites has been growing exponentially since the web's inception, which means that there are many more young sites than old ones. Once the age of the site is factored into the multiplicative growth process, P(n), the probability of finding a site of size n, is a power law, that is, it is proportional to n^(−β), with β > 1.

• Sites can grow at different rates: considering sites with a wide range of growth rates yields the same result: a power-law distribution in site size.

The existence of a power law in the growth of the web not only implies the lack of any length scale for the web, but also allows the expected number of sites of any given size to be determined without exhaustively crawling the web [8].
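
The argument can be checked numerically. The sketch below is illustrative only: the step count, birth rate and log-normal noise model are assumptions chosen for the simulation, not figures taken from [8]. It lets sites appear at an exponentially increasing rate, grows each site multiplicatively, and then estimates the exponent of the tail of the resulting size distribution.

```python
import math
import random

def simulate_sites(steps=300, growth=1.02, sigma=0.2, seed=0):
    """Multiplicative growth model: at every step (i) new sites appear at an
    exponentially increasing rate, and (ii) each existing site's size is
    multiplied by a random factor, so changes are proportional to current size."""
    rng = random.Random(seed)
    sizes = []
    for t in range(steps):
        sizes.extend([1.0] * max(1, int(growth ** t)))   # young sites outnumber old ones
        sizes = [s * math.exp(rng.gauss(0.0, sigma)) for s in sizes]
    return sizes

def tail_exponent(sizes, x_min=5.0):
    """MLE exponent of the tail (sizes >= x_min), same estimator as in the previous sketch."""
    tail = [s for s in sizes if s >= x_min]
    if not tail:
        raise ValueError("no sites above x_min; increase sigma or steps")
    return 1.0 + len(tail) / sum(math.log(s / x_min) for s in tail)

if __name__ == "__main__":
    sizes = simulate_sites()
    print(f"{len(sizes)} sites simulated, tail exponent estimate: {tail_exponent(sizes):.2f}")
```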

3.2 Caching

In order to quickly satisfy users' requests for web content, Internet Service Providers (ISPs) utilize caching, whereby frequently used files are copied and stored near users on the network. It is important to note, however, that the effectiveness of caching relies heavily on the existence of Zipf's law.

Caching has two advantages. First, since requests are served immediately from the cache, the response time can be significantly faster than contacting the origin server. Second, caching conserves bandwidth by avoiding redundant transfers along remote internet links. The benefits of caching are confirmed by its wide use by ISPs, which are able to reduce the amount of inter-ISP traffic they have to pay for. Caching by proxies benefits not only the ISPs and the users, but also the websites holding the original content: their content reaches the users more quickly and they avoid being overloaded by too many direct requests.

However, since any cache has a finite size, it is impossible for the cache to store all of the files users are requesting. Here Zipf's law comes into play: the cache need only store the most frequently requested files in order to satisfy a large fraction of users' requests.
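
A toy calculation illustrates this point. The sketch below is not code from [5]; it simply assumes that the i-th most popular page is requested with probability proportional to 1/i^β (β = 0.8, within the range reported in the traces discussed next) and computes the fraction of requests served by a cache holding only the k most popular pages.

```python
def zipf_hit_ratio(cache_size, num_pages, beta=0.8):
    """Fraction of requests served from a cache that stores the `cache_size`
    most popular pages, assuming the i-th most popular page is requested
    with probability proportional to 1 / i**beta."""
    weights = [1.0 / (i ** beta) for i in range(1, num_pages + 1)]
    return sum(weights[:cache_size]) / sum(weights)

if __name__ == "__main__":
    total_pages = 1_000_000                      # hypothetical population of distinct pages
    for fraction in (0.01, 0.10, 0.25, 0.40):
        cached = int(total_pages * fraction)
        ratio = zipf_hit_ratio(cached, total_pages)
        print(f"caching top {fraction:4.0%} of pages -> hit ratio {ratio:.0%}")
```

Under these assumptions, caching on the order of 25% to 40% of the pages already serves roughly 70% or more of the requests, consistent with the observations listed below.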

The distribution of web requests from a fixed group of users follows a Zipf-like distribution, proportional to n^(−β), very well. The value of β varies from trace to trace, ranging from 0.64 to 0.83. Other observations related to caching are [5]:


• The “10/90” rule (i.e. 90% of accesses go to 10% of items), evident in program execution, does not apply to web accesses seen by a proxy. The concentration of web accesses on “hot” documents depends on β, and it takes 25% to 40% of documents to draw 70% of web accesses.

• There is low statistical correlation between the frequency with which a document is accessed and its size, though the average size of cold documents (for example, those accessed less than 10 times) is larger than that of hot documents (for example, those accessed more than 10 times).

• The statistical correlation between a document's access frequency and its average modifications per request varies from trace to trace, but is generally quite low.

3.3 Networks

The World Wide Web is a network of interconnected webpages, and the Internet backbone is a physical network used to transmit data, including web pages, in the form of packets from one location to another [1].

The scale-free degree distribution of the Internet backbone implies that some nodes in the network maintain a large number of connections (proportional to the total size of the network), while most nodes have just one or two connections. This is a two-edged sword when it comes to the resilience of the network. It means that if a node fails at random, it is most likely one with very few connections, and its failure won't affect the performance of the network overall. However, if one were to specifically target just a few of the high-degree nodes, the network could be adversely affected. Because many routes pass through the high-degree nodes, their removal would require rerouting through longer and less optimal paths. Once a sufficient number of high-degree nodes are removed, the network itself can become fragmented, without a way to communicate from one location to another.
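
The following sketch illustrates this contrast. It is an illustration only: it uses the networkx library and a Barabási–Albert preferential-attachment graph as a stand-in for a real scale-free topology, and compares the size of the largest connected component after random failures versus a targeted attack on the highest-degree nodes.

```python
import random
import networkx as nx

def largest_component_fraction(graph):
    """Fraction of remaining nodes that belong to the largest connected component."""
    if graph.number_of_nodes() == 0:
        return 0.0
    largest = max(nx.connected_components(graph), key=len)
    return len(largest) / graph.number_of_nodes()

def remove_and_measure(graph, nodes_to_remove):
    """Copy the graph, delete the given nodes and measure the surviving giant component."""
    g = graph.copy()
    g.remove_nodes_from(nodes_to_remove)
    return largest_component_fraction(g)

if __name__ == "__main__":
    n, frac = 5000, 0.05                        # toy network, remove 5% of the nodes
    g = nx.barabasi_albert_graph(n, 2, seed=1)  # scale-free degree distribution
    k = int(n * frac)

    random_nodes = random.Random(1).sample(list(g.nodes), k)
    hubs = sorted(g.nodes, key=lambda v: g.degree[v], reverse=True)[:k]

    print("after random failures :", remove_and_measure(g, random_nodes))
    print("after targeted attack :", remove_and_measure(g, hubs))
```

The removed fraction is deliberately small; increasing it makes the contrast between random failures and targeted attacks much more pronounced.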

On a different level, one of the recent developments in the use of the Internet has been the emergence of peer-to-peer (P2P) networks. These networks are used by millions of users daily to exchange a variety of files directly with one another. Current peer-to-peer networks tend to be decentralized; that is, nodes connect directly to one another rather than to a central server. The distribution of the number of computers a computer is connected to is a Zipf distribution (recently it has shifted into a two-sided Zipf distribution, with a shallower exponent for the high-degree nodes and a steeper exponent for the low-degree ones).

Finally, it has been shown that scale-free networks are more susceptible to viruses than networks with a more even degree distribution. Namely, a virus spreading in a random network needs to surpass a threshold of infectiousness in order not to die out. However, if the network has a Zipf degree distribution, the virus can persist in the network indefinitely, regardless of its level of infectiousness.

4 Web size and growth trend

4.1 Past estimations

According to the results of previous surveys, the public Web, as of June 2002, contained 3,080,000 Web sites, or 35 percent of the Web as a whole [6]. In another study, Varian and Lyman estimate that in 2000 the surface Web accounted for between 25 and 50 terabytes of information, based on the assumption that the average size of a Web page is between 10 and 20 kilobytes. However, Varian and Lyman make no distinction between public and other types of Web sites. Combining their estimate with results from the 2000 Web Characterization Project survey, and assuming that Web sites of all types are, on average, the same size in terms of number of pages, 41 percent of the surface Web, or between 10 and 20 terabytes, belonged to the public Web in 2000.

What is probably most remarkable about the size of the Web is how rapidly it rose from relatively insignificant proportions to a scale at least comparable to that of research library collections. A widely cited estimate placed the size of the Web as a whole in 1996 at about 100,000 sites. Two years later, the Web Characterization Project's first annual survey estimated the size of the public Web alone to be nearly 1.5 million sites. By 2000, the public Web had expanded to 2.9 million sites, and two years later, in 2002, to over 3 million sites. In the five years spanning the series of Web Characterization Project surveys (1998-2002), the public Web more than doubled in size.

Deep Web content has some significant differences from surface Web content. Deep Web documents (13.7 KB mean size; 19.7 KB median size) are on average 27% smaller than surface Web documents [13]. Though individual deep Web sites have tremendous diversity in their number of records, ranging from tens or hundreds to hundreds of millions (a mean of 5.43 million records per site but with a median of only 4,950 records), these sites are on average much, much larger than surface sites. The rest of the source study [13] amplifies these findings.

The mean deep Web site has a Web-expressed (HTML-included basis) database size of 74.4 MB (median of 169 KB). Actual record counts and size estimates can be derived from one in seven deep Web sites.

On average, deep Web sites receive about half again as much monthly traffic as surface sites (123,000 pageviews per month vs. 85,000). The median deep Web site receives somewhat more than two times the traffic of a random surface Web site (843,000 monthly pageviews vs. 365,000). Deep Web sites are on average more highly linked to than surface sites by nearly a factor of two (6,200 links vs. 3,700 links), though the median deep Web site is less so (66 vs. 83 links). This suggests that well-known deep Web sites are highly popular, but that the typical deep Web site is not well known to the Internet search public.


4.2 Current estimation

WorldWideWebSize.com provides an estimation of the size of the web. The estimated minimal size of the indexed World Wide Web is based on estimations of the numbers of pages indexed by Google, Bing and Yahoo Search [22]. From the sum of these estimations, an estimated overlap between these search engines is subtracted. The overlap is an overestimation; hence, the total estimated size of the indexed World Wide Web is an underestimation.

The size of the index of a search engine is estimated on the basis of a method that combines word frequencies obtained from a large offline text collection (corpus) with search counts returned by the engines. Each day, 50 words are sent to each of the search engines. The number of webpages found for each word is recorded; together with the words' relative frequencies in the background corpus, multiple extrapolated estimations of the size of the engine's index are made, which are subsequently averaged. The 50 words have been selected evenly across logarithmic frequency intervals. The background corpus contains more than 1 million webpages and can be considered a representative sample of the World Wide Web.

If it is known that the word ‘the’ is present in 67.61% of all documents within the corpus, the total size of the engine's index can be extrapolated from the document count it reports for ‘the’. If Google says that it found ‘the’ in 14,100,000,000 webpages, the estimated size of Google's total index would be 23,633,010,000 pages.
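
A minimal sketch of this extrapolation step follows. The word list, corpus frequencies and reported page counts below are made-up placeholders rather than the figures actually used by WorldWideWebSize.com: each word's reported count is divided by its document frequency in the background corpus, and the per-word estimates are averaged.

```python
# Hypothetical background-corpus document frequencies and reported page counts.
corpus_document_frequency = {"the": 0.65, "of": 0.58, "and": 0.52}
reported_page_counts = {"the": 13_000_000_000, "of": 11_500_000_000, "and": 10_200_000_000}

def estimate_index_size(frequencies, counts):
    """Average the per-word extrapolations: reported count / corpus document frequency."""
    estimates = [counts[word] / freq for word, freq in frequencies.items() if word in counts]
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    size = estimate_index_size(corpus_document_frequency, reported_page_counts)
    print(f"estimated index size: {size:,.0f} pages")
```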

5 Public and Hidden Web

The surface or public Web can be interpreted as the portion of the Web that is accessible using traditional crawling technologies based on link-to-link traversal of Web content. This approach is used by most search engines in generating their indexes.

A large part of the Web is “hidden” behind search forms and can be indexed only by typing a set of keywords, or queries, into those forms. These pages are known as the hidden or deep Web, as search engines generally cannot index them and do not show them in their results. Searching the deep web is a difficult process because each source searched has a unique method of access. Hidden web crawlers must also provide input in the form of search queries [18].

Existing attempts to characterize the deep Web are based on the following methods originally applied to general Web surveys [19]:

• Overlap analysis: this technique involves pairwise comparisons of listings of deep web sites, where the overlap between each two sources is used to estimate the size of the deep Web (specifically, the total number of deep web sites). The critical requirement that the listings be independent from one another is unfeasible in practice, thus making the estimates produced by overlap analysis seriously biased. Additionally, the method is generally non-reproducible.

• Random sampling of IP addresses (rsIP for short): easily reproducible and requires no pre-built listings. The rsIP method estimates the total number of deep web sites by analyzing a sample of unique IP addresses randomly generated from the entire space of valid IPs and extrapolating the findings to the Web at large. Since the entire IP space is of finite size and every web site is hosted on one or several web servers, each with an IP address, analyzing an IP sample of adequate size can provide reliable estimates for the characteristics of the Web in question (a toy extrapolation sketch follows this list). The most serious drawback of rsIP is that it ignores virtual hosting, i.e., the fact that multiple web sites can share the same IP address.

• Host-IP clustering sampling: hostname aliases for a given web site are frequently mapped to the same IP address. This approach, given a hostname resolved to its IP address, identifies other hostnames potentially pointing to the same web content by checking the other hostnames mapped to this IP. Once the overall list of hosts is clustered by IPs, it applies a cluster sampling strategy, where an IP address is a primary sampling unit consisting of a cluster of secondary sampling units, the hosts. In cluster sampling, whenever a primary unit is included in a sample, every secondary unit within it is analyzed. This approach addresses drawbacks of previous deep web surveys, specifically by taking the virtual hosting factor into account.
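
As referenced above, the toy sketch below makes the rsIP extrapolation concrete. The probing step and the sample outcome are hypothetical: a real survey would contact each sampled address, detect web servers and check for search interfaces before counting a deep web site.

```python
import random

IPV4_SPACE = 2 ** 32    # size of the IPv4 address space (reserved ranges ignored here)

def random_ip(rng):
    """One IPv4 address drawn uniformly at random."""
    return ".".join(str(rng.randint(0, 255)) for _ in range(4))

def rsip_estimate(sample_size, deep_sites_found):
    """Extrapolate the number of deep web sites found in the sample to the whole IP space."""
    return deep_sites_found / sample_size * IPV4_SPACE

if __name__ == "__main__":
    rng = random.Random(7)
    print("example addresses to probe:", [random_ip(rng) for _ in range(3)])
    # Hypothetical survey outcome: 23 deep web sites detected among 1,000,000 probed IPs.
    print(f"estimated total deep web sites: {rsip_estimate(1_000_000, 23):,.0f}")
```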

6 Web Languages

Considering every variant based on its intended purpose, 15 different web language layers exist [17]. The layers and the languages belonging to each layer are detailed in Table 1.

7 Web Domains

A domain name is an identification string that defines a realm of administrative autonomy, authority, or control on the Internet. Domain names are formed by the rules and procedures of the Domain Name System (DNS) [10]. In general, a domain name represents an Internet Protocol (IP) resource, such as a personal computer used to access the Internet, a server computer hosting a web site, the web site itself, or any other service communicated via the Internet.

The DNS service has three major components [9]:

• The Domain Name Space: a specification for a tree-structured name space. Conceptually, each node and leaf of the domain name space tree names a set of information, and query operations are attempts to extract specific types of information from a particular set. A query names the domain name of interest and describes the type of resource information that is desired.

• Name Servers: server programs which hold information about the domain tree's structure and set information. A name server may cache structure or set information about any part of the domain tree, but in general a particular name server has complete information about a subset of the domain space, and pointers to other name servers that can be used to lead to information from any part of the domain tree. Name servers know the parts of the domain tree for which they have complete information; these parts are called zones, and a name server is an authority for these parts of the name space.


Table 1: Web Languages

Markup Languages: HTML, XHTML, XML, WML, MHTML, SGML

Syndication Languages: RSS, Atom, EventsML, GeoRSS, MRSS, NewsML, OPML, SportsML, XBEL

Metadata Languages: DCMI, META, Microformats, OWL, RDF, APML, FOAF, P3P, SIOC, XFN

Stylesheet and Transform Languages: CSS, XSL, DSSSL

Client-Side Scripting: AJAX, DOM Scripting, Flex (ActionScript), JavaScript, VBScript, E4X, ECMAScript, JScript, JScript.NET

Server-Side Scripting: ASP, ASP.NET, ColdFusion, JSP, Perl, PHP, Python, Ruby On Rails, Lasso, OpenLaszlo, Smalltalk, SMX, SSI, SSJS

Database Management Systems: MS-SQL, MySQL, Oracle, PostgreSQL, Derby, MongoDB, SQLite

Sandboxed Languages: ActiveX, Flash, Java, Shockwave, Silverlight

Server-Side/Web Server Settings: .htaccess, Robots.txt, Web.config

Rich Internet Applications: Air, Gears, JavaFX, Prism, Cappuccino, Curl, Titanium

Vector Modeling Languages: 3DMLW, Canvas (HTML5), SVG, VML, X3D, 3DML, 3DXML, SMIL, UML, VRML, XVRML

PostScript Format Languages: PDF, XPS, FlashPaper, OpenXML

Data Formatting Languages: DocBook, KML, MathML, OpenSearch, PAD, Sitemap, VoiceXML, DOAC, DOAP, GML, GraphML, InkML, OpenMath, SISR, SRGS, SSML, XMLTV

Document Schema Languages: DTD, XSD, DSD, RelaxNG, XML Schema


• Resolvers: programs that extract information from name servers in response to user requests. Resolvers must be able to access at least one name server and either use that name server's information to answer a query directly or pursue the query using referrals to other name servers. A resolver will typically be a system routine that is directly accessible to user programs; hence no protocol is necessary between the resolver and the user program. (A minimal resolution sketch follows this list.)
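
As a small illustration of the resolver's role, the sketch below uses only Python's standard socket module to ask the operating system's resolver, which in turn queries name servers on our behalf, for the IPv4 addresses associated with a hostname; the hostnames are arbitrary examples.

```python
import socket

def resolve(hostname):
    """Ask the system resolver (which queries name servers for us) for the
    IPv4 addresses currently associated with `hostname`."""
    try:
        infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
        # Each entry is (family, type, proto, canonname, sockaddr); sockaddr = (address, port).
        return sorted({sockaddr[0] for *_, sockaddr in infos})
    except socket.gaierror as err:
        return f"resolution failed: {err}"

if __name__ == "__main__":
    for name in ("www.icann.org", "example.com"):
        print(f"{name:>15} -> {resolve(name)}")
```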

The Internet Corporation for Assigned Names and Numbers (ICANN) manages the top-level development and architecture of the domain name space. It authorizes domain name registrars, through which domain names may be registered and reassigned.

Domain names are organized in subordinate levels (subdomains) of the DNS root domain, which is nameless. The first-level set of domain names are the top-level domains (TLDs). There are several types of TLDs within the DNS [3]:

• TLDs with two letters (such as .de, .es, and .jp) have been established for over 250 countries and external territories and are referred to as “country-code” TLDs or “ccTLDs”. They are delegated to designated managers, who operate the ccTLDs according to local policies that are adapted to best meet the economic, cultural, linguistic, and legal circumstances of the country or territory involved.

• Most TLDs with three or more characters are referred to as “generic” TLDs, or “gTLDs”. Seven gTLDs (.com, .edu, .gov, .int, .mil, .net, and .org) were created initially. Domain names may be registered in three of these (.com, .net, and .org) without restriction; the other four have limited purposes. In 2001, seven new gTLDs were introduced (.biz, .info, .name, .pro, .aero, .coop, and .museum). The inclusion of seven further TLDs (.asia, .cat, .jobs, .mobi, .tel, .travel and .xxx) is under evaluation by ICANN.

• In addition to gTLDs and ccTLDs, there is one special TLD, .arpa, which is used for technical infrastructure purposes. ICANN administers the .arpa TLD in cooperation with the Internet technical community under the guidance of the Internet Architecture Board.

Below these top-level domains in the DNS hierarchy are the second-level and third-level domain names that are typically open for reservation by end-users who wish to connect local area networks to the Internet, create other publicly accessible Internet resources or run web sites. The registration of these domain names is usually administered by domain name registrars.
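
To tie the levels together, here is a small sketch that splits a domain name into its labels and tags the TLD using the length-based convention described above; the classification is a simplification (it ignores internationalized and newer TLDs) and the sample names are arbitrary.

```python
def describe_domain(name):
    """Split a domain name into labels and classify its TLD: two letters -> ccTLD,
    .arpa -> infrastructure TLD, otherwise gTLD (a simplified rule of thumb)."""
    labels = name.lower().rstrip(".").split(".")
    tld = labels[-1]
    if tld == "arpa":
        kind = "infrastructure TLD"
    elif len(tld) == 2:
        kind = "ccTLD"
    else:
        kind = "gTLD"
    return {"labels": labels, "tld": tld, "kind": kind,
            "second_level": labels[-2] if len(labels) > 1 else None}

if __name__ == "__main__":
    for name in ("www.example.es", "icann.org", "in-addr.arpa"):
        print(name, "->", describe_domain(name))
```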

8 Web of Spain

An in-depth study of a large collection of Web pages has been performed in [4]. In September and October 2004, more than 16 million Web pages from about 300,000 Web sites of the Web of Spain were downloaded, and the characteristics of this collection were studied at three different granularity levels: Web pages, sites and domains. For each level, contents, technologies and links were analyzed. A summary of this report follows.


8.1 Web Page Characteristics

• URL Length: in the studied collection, the average length of a URL, including the protocol specification http://, the server name, path and parameters, is 67 characters, compared with the average of 50 characters in samples of the global Web.

• URL depth: the starting page of a Web site is at depth 1, pages with URLs of the form http://site/pag.html or http://site/dir/ are at depth 2, and so on. In Web sites with static pages, the physical depth measures the organization of pages in files and directories. The majority of pages fall between the third and fourth levels. (A small sketch computing URL depth follows this list.)

• Page titles: over 9% of the pages have no title, and 3% have a default title such as the Spanish equivalents of “Untitled document” or “New document 1”. More than 70% of the pages have a title that is not unique; on average, a title is shared by approximately 4 pages.

• Text of pages: page sizes follow a power-law distribution with parameter 2.25. Most of the pages have between 256 bytes and 4 KB of text, and the average is 2.9 KB.

• Language: Spanish is used by a little more than half of the pages, followed by English and Catalan. The fraction of pages written in the official languages of the country is around 62%. This is related to the presence of a large group of pages in English, including pages related to tourism and technical documentation about computing.

• Dynamic Pages: the most used application for building dynamic pages is PHP, followed closely by ASP. Other technologies that are used for many pages are Java (.jhtml and .jsp) and ColdFusion (.cfm).

• Non-HTML documents: plain text and Adobe PDF are the most used formats and comprise over 80% of the non-HTML documents.

• Links to Web Pages: both page in-degree (the number of links received by a page) and out-degree (the number of links going out from a page) follow power-law distributions, with parameters 2.11 and 2.84 respectively.

• Ranking of Pages: the distribution of PageRank scores also follows a power law, with parameter 1.96. For this parameter, a value of 2.1 has been observed in samples of the global Web.
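
As referenced in the URL depth item above, the sketch below computes the depth of a URL by counting its path segments, with the site's starting page at depth 1; the example URLs are arbitrary.

```python
from urllib.parse import urlparse

def url_depth(url):
    """Depth as defined above: http://site/ is at depth 1, while
    http://site/pag.html and http://site/dir/ are at depth 2, and so on."""
    path = urlparse(url).path
    segments = [part for part in path.split("/") if part]
    return 1 + len(segments)

if __name__ == "__main__":
    for url in ("http://site/", "http://site/pag.html",
                "http://site/dir/", "http://site/dir/sub/page.html"):
        print(f"depth {url_depth(url)}: {url}")
```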

8.2 Web Site Characteristics

• Number of Pages: an average of 52 pages per site is observed in the studied collection. The Web of Spain has relatively few large sites: just 27% of the sites have more than ten pages, 10% more than a hundred pages, and less than 1% more than a thousand pages. In about 60% of the sites the crawler found only one Web page, often because the site's navigation was JavaScript-based or its starting page required Flash. This leads to an estimate that around 30% of the sites use only JavaScript or only Flash for navigating from the start page, and are therefore difficult or impossible to index for most search engines.


• Size of the Pages in a Whole Web Site: the average size of a whole Web site (considering only the text) is approximately 146 KB.

• Internal Links: a link is considered internal if it points to another page inside the same Web site. The Web sites of Spain have on average 169 internal links. An average Web site has approximately 0.15 internal links per page, or an internal link every 6 or 7 pages.

• Links between Web Sites: discarding one-page sites, 63% of the sites do not receive any reference from another site in Spain, and 90% have no link to another site in Spain. Among the sites with the most in-links, mostly newspapers and universities are found.

• Link Structure among Web Sites: certain structural components of the Web can be distinguished by considering the connections between sites (a sketch computing the main components follows this list):

– MAIN, sites in the strongly connected component;

– OUT, sites that are reachable from MAIN, but have no links towards MAIN;

– IN, sites that can reach MAIN, but have no links from MAIN;

– ISLANDS, sites that are not connected to MAIN;

– TENDRILS, sites that only connect to IN or OUT, but following the links in the reverse direction;

– TUNNEL, a component joining OUT and IN without going through MAIN.

In the Web of Spain, there are very few one-page sites in the MAIN component, while in the ISLANDS component they account for approximately 50% of the sites. Another characteristic is that all of the sites in MAIN are under .es, while in other top-level domains the OUT component is the most common. Also, the ISLANDS are roughly evenly split between .es and .com.
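
As referenced above, the following sketch uses the networkx library on a tiny made-up site-level link graph to compute the MAIN, IN and OUT components; in this simplified version everything else (ISLANDS, TENDRILS and TUNNEL) ends up in a single remaining bucket.

```python
import networkx as nx

def bow_tie_components(graph):
    """MAIN = largest strongly connected component, IN = sites that can reach MAIN,
    OUT = sites reachable from MAIN, OTHER = everything else (ISLANDS, TENDRILS, TUNNEL)."""
    main = max(nx.strongly_connected_components(graph), key=len)
    seed = next(iter(main))
    out_part = (nx.descendants(graph, seed) | {seed}) - main
    in_part = (nx.ancestors(graph, seed) | {seed}) - main
    other = set(graph) - main - in_part - out_part
    return {"MAIN": main, "IN": in_part, "OUT": out_part, "OTHER": other}

if __name__ == "__main__":
    # Toy site-level link graph (letters stand for web sites).
    links = [("a", "b"), ("b", "c"), ("c", "a"),   # a, b, c form MAIN
             ("d", "a"),                           # d is IN
             ("c", "e"),                           # e is OUT
             ("f", "g")]                           # f, g are disconnected (ISLANDS)
    for part, nodes in bow_tie_components(nx.DiGraph(links)).items():
        print(part, sorted(nodes))
```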

8.3 Domain Characteristics

• IP Address and Hosting Provider: a DNS lookup of the IP address was performed for each of the studied sites, obtaining about 88% of the IP addresses; the rest no longer exist or were not reachable. Each address hosts on average five domains; however, the distribution is very skewed: there are four IP addresses with more than 1,000 domains each, and 16,565 IP addresses with only one domain.

• Web Server Software: the two dominant software brands are Apache and Microsoft IIS (Internet Information Server), in that order. The data suggests that the market share of Microsoft IIS is larger in the Web of Spain than in the global Web: according to Netcraft, the global proportion is 69% for Apache and 21% for Microsoft IIS, whereas in the Web of Spain it is 46% for Apache and 38% for Microsoft IIS. The most used operating system for Web servers in Spain is Windows (43%), followed closely by operating systems based on Unix (41%); this means that at least 15% of the servers based on Windows prefer Apache as a Web server.

• Number of Sites per Domain: on average, 2.55 sites per domain were found, but there are several very large domains. For instance, almost 30 domains with more than 1,000 sites each were discovered. On the other hand, about 92% of the domains have only one site.


• Number of Pages per Domain: there is an average of 133 pages per domain. The distribution of the number of pages per domain exhibits a power law with parameter 1.18. Only 26% of the domains have just one Web page.

• Total Size per Domain: the average size of a Web domain, considering only text, is approximately 373 KB. The distribution of the total size of pages per domain follows a power law with parameter 1.19. Most of the large domains in terms of text are universities, research centers and databases for academic use.

• First-level Domains: the largest top-level domains (.com, .org, .net, etc.) are the most used; although the servers are physically located in Spain, this does not always mean that they are under the .es top-level domain. Around 65% of the sites and 31% of the pages belong to the .com domain.

• External Top-level Domains: half of the external sites linked from the Web of Spain are located in the .com domain. The generic top-level domains .org, .info and .biz appear with much more frequency than expected from the number of host names in each of these domains.

9 Other research topics related to Web Dynamics

9.1 Stochastic Models for Web Dynamics

The web is continuing to grow at an exponential rate, in terms of both the amount and diversity of the information that it encompasses and the size of its user base. This growth poses a host of challenges to the research community: there is a need for mechanisms and algorithms for organizing and manipulating the information on the web in order to make the web tractable. In [7], H. Hama et al. from Osaka City University investigate the characteristics of the popularity of websites and the dynamical properties among competing websites by establishing a theoretical model and running simulations with it. They build a model for the number of visitors to the websites and the visiting time duration per visitor. Numerical and analytic calculations for website novelty and popularity measures are then carried out.

Three metrics are used to measure website popularity. The first is based on access frequency, defined as the number of visits to the site. The second is based on the passage time at which the number of visitors drops below a threshold level. The last metric is the duration of visits. In other words, they measure three different aspects of website popularity weighted by visit frequency: visiting time, the first passage time, and the distribution of decay. To account for the variation due to the change in the number of active users, the authors focus on the probability distribution instead of the absolute value of site measures.

Using this model, the conclusion is that most of the characteristics of the dynamics of the websites, such as popularity, novelty and relevancy, are governed by the distributions of the number of visitors to websites and the fluctuation in each individual website's growth. Thus the key ingredients in the dynamics of the websites are the following. First, there is a global interaction in terms of the stochastic force strength among websites, with which one can view the web environment as a competitive complex system. Second, the web dynamical system stays in non-equilibrium in the sense that the number of websites in the system is not fixed but increases exponentially.


9.2 Resonance on the Web

The Web is a dynamic, ever-changing collection of information accessed in a dynamic way. In [2], E. Adar et al. explore the relationship between Web page content change and people's revisitation of those pages. A relationship, or resonance, between revisitation behavior and the amount and type of changes on those pages is identified.

Revisiting Web pages is common, but people's reasons for revisiting can be diverse. While most content on the Web does not change, pages that are revisited change frequently. As revisitation is often motivated by monitoring intent, some relationship between change and revisitation is to be expected. Pages are often composed of many sub-pieces, each changing at a different rate. Web content change can be beneficial to the Web user looking for new information, but can also interfere with the re-finding of previously viewed content. Understanding what a person is interested in when revisiting a page can enable us to build systems that better satisfy those interests by, for example, highlighting changed content when the change is interesting, actively monitoring changes of particular interest, and providing cached information when changes interfere with re-finding.

The results of the analysis conducted can be used to improve the Web experience and have the following implications:

• Website implications: website designers would like to understand why users are returning to their pages. While some inference can be made from the links that are clicked on, there are many situations in which users revisit and do not click on anything. Another application is the optimization of “what's new” pages. The resonance between what is changing and the revisitation pattern may point at content that is of more interest. Thus, a website can identify the rate of change of pages or portions of a page, and highlight those that correspond to the peak revisitation rates.

• Web Browser design implications: just as a Website designer may create optimized “what's new” information for their pages, a client-side implementation may provide additional change analysis features to the user. The ability to expose and interact with meaningful change would be particularly useful within a client-side, Web browser context, where a user's history of interaction is known. Pages displayed in the browser could be annotated to provide the user with highlights of their changes. A browser could also act as a personal Web crawler, and pre-fetch pages that are likely to experience meaningful change.

• Search Engine implications: just as a browser can provide change-related features to an individual user, a search engine can achieve the same result on a much larger scale. Optimized crawling may lead to crawling strategies that understand what information on a page is interesting and should be tracked more or less aggressively for indexing. For example, change in advertising content or change that occurs as a result of touching a page should not inspire repeat crawls, while change to important content should. Furthermore, a document need not be re-indexed if it has not changed in a meaningful way, potentially saving server resources.


10 International conferences related to Web Dynamics

10.1 World Wide Web Conference (WWW)

The World Wide Web Conference is a yearly international conference on the topic of the future direction of the World Wide Web [11]. It began in 1994 at CERN and is organized by the International World Wide Web Conferences Steering Committee (IW3C2). The conference aims to provide the world a premier forum for discussion and debate about the evolution of the Web, the standardization of its associated technologies, and the impact of those technologies on society and culture. WWW2012 will focus on openness in web technologies, standards and practices.

10.2 International Conference on Internet and Web Applications and Services (ICIW)

ICIW 2012 [15] comprises five complementary tracks. They focus on Web technologies, the design and development of Web-based applications, and the interactions of these applications with other types of systems. Management aspects related to these applications and challenges in specialized domains are addressed as well. Evaluation techniques and standards positions on different aspects are part of the expected agenda. Internet and Web-based technologies have led to new frameworks, languages, mechanisms and protocols for Web application design and development. Interaction between web-based applications and classical applications requires special interfaces and exposes various performance parameters.

10.3 International Conference on Web Intelligence, Mining and Semantics (WIMS)

WIMS'12 [20] is intended to foster the dissemination of state-of-the-art research in the area of Web intelligence, Web mining, Web semantics and the fundamental interaction between them.

10.4 Web Information Systems and Technologies (WEBIST)

The purpose of the 8th International Conference on Web Information Systems and Technologies (WEBIST 2012) is to bring together researchers, engineers and practitioners interested in the technological advances and business applications of web-based information systems [21]. The conference has five main tracks, covering different aspects of Web Information Systems, including Internet Technology, Web Interfaces and Applications, Society, e-Communities, e-Business, Web Intelligence and Mobile Information Systems.


10.5 IEEE/WIC/ACM International Conference on Web Intelligence

Web Intelligence explores the fundamental roles, interactions and practical impacts of Artificial Intelligence engineering and Advanced Information Technology on the next generation of Web systems [17]. Here, AI engineering is a general term that refers to a new area slightly beyond traditional AI: brain informatics, human-level AI, intelligent agents, social network intelligence, and classical areas such as knowledge engineering, representation, planning, discovery and data mining are examples. Advanced Information Technology includes wireless networks, ubiquitous devices, social networks, and data/knowledge grids, as well as cloud computing and service-oriented architecture.

11 References

[1] Lada A. Adamic and Bernardo A. Huberman. “Zipf's law and the Internet”. In: Glottometrics 3 (2002), pp. 143–150.

[2] Eytan Adar, Jaime Teevan, and Susan T. Dumais. “Resonance on the web: web dynamics and revisitation patterns”. In: Proceedings of the 27th International Conference on Human Factors in Computing Systems. CHI '09. Boston, MA, USA: ACM, 2009, pp. 1381–1390.

[4] Ricardo Baeza-Yates, Carlos Castillo, and Vicente López. “Characteristics of the Web of Spain”. In: Cybermetrics 9.1 (2005).

[5] Lee Breslau et al. “Web caching and Zipf-like distributions: evidence and implications”. In: INFOCOM '99: Proceedings of the Eighteenth Annual Joint Conference of the IEEE Computer and Communications Societies. New York, NY, USA: IEEE, Mar. 1999, pp. 126–134.

[7] H. Hama et al. “A Stochastic Model for Popularity Measures in Web Dynamics”. In: Intelligent Information Hiding and Multimedia Signal Processing (IIH-MSP), 2010 Sixth International Conference on. Oct. 2010, pp. 676–679.

[8] Bernardo A. Huberman and Lada A. Adamic. “Growth dynamics of the World-Wide Web”. In: Nature 401.6749 (Sept. 1999), p. 131.

[12] Mark Levene and Alexandra Poulovassilis. “Web Dynamics”. In: Software Focus 2(2001), pp. 60–67.

[14] M. E. J. Newman. “Power laws, Pareto distributions and Zipf's law”. In: Contemporary Physics 46.5 (Sept. 2005), pp. 323–351. eprint: cond-mat/0412004.

[18] D. K. Sharma and A. K. Sharma. “Query Intensive Interface Information Extraction Protocol for deep web”. In: Intelligent Agent Multi-Agent Systems, 2009. IAMA 2009. International Conference on. 2009, pp. 1–5.

[19] D. Shestakov and T. Salakoski. “Host-IP Clustering Technique for Deep Web Characterization”. In: Web Conference (APWEB), 2010 12th International Asia-Pacific. 2010, pp. 378–380.


12 Links

[3] Internet Corporation for Assigned Names and Numbers (ICANN). Top-Level Domains (gTLDs). http://www.icann.org/en/tlds/. 1983.

[6] OCLC Online Computer Library Center. Trends in the Evolution of the Public Web. http://www.dlib.org/dlib/april03/lavoie/04lavoie.html. 2011.

[9] The Internet Engineering Task Force (IETF). RFC 882 - Domain Names: Concepts and Facilities. http://tools.ietf.org/html/rfc882. 1983.

[10] Wikimedia Foundation Inc. Wikipedia. http://en.wikipedia.org/wiki/Web_domain. 2011.

[11] Wikimedia Foundation Inc. Wikipedia. http://en.wikipedia.org/wiki/Terminology_extraction. 2011.

[13] University of Michigan. The Deep Web: Surfacing Hidden Value - Bergman, Michael K. http://quod.lib.umich.edu/cgi/t/text/text-idx?c=jep;view=text;rgn=main;idno=3336451.0007.104. 2001.

[15] University of Penn. Web 1T 5-gram, 10 European Languages. http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2009T25. 2006.

[16] Cambridge University Press. Introduction to Information Retrieval - The Web Graph. http://nlp.stanford.edu/IR-book/html/htmledition/the-web-graph-1.html. 2011.

[17] Six Revisions. Web Languages: Decoded. http://sixrevisions.com/web-technology/web-languages-decoded/. 2011.

[20] Software Engineering Department, University of Craiova, Romania. WIMS'12. http://software.ucv.ro/Wims12. 2011.

[21] University of Southern Denmark. International Symposium on Open Source Intelligence and Web Mining 2012. http://www.osint-wm.org/. 2011.

[22] Tilburg University. The Size of the World Wide Web. http://www.worldwidewebsize.com/. 2011.
