1
Web Data Analysis/Mining
2
Objective
  To introduce fundamental concepts of web mining
  To know the different categories of web mining
  To appreciate the use of web mining in web applications
Data Mining: computational process of discovering patterns (extract information) in large data sets
3
Web Mining
Aim: discover and extract information from the web
Difficulty in web information retrieval
  Web: largest repository of data
    Text, audio, image, video, animation, …
    HTML, XML, PDF, MP3, JPEG, MPEG, …
  Web is very dynamic
    Pages added, removed and changed hourly, daily or weekly
  Links between pages
  Keyword-oriented search is not effective
    Irrelevant documents are often returned
Apply data mining techniques to extract knowledge from web content, structure and usage
4
Web Mining
Discover useful information from the web hyperlink structure, page content and usage data
Web content mining
  Extract and integrate useful data, information and knowledge from web page contents
  Keywords in a page assist users in finding documents that meet a certain criterion (text mining)
Web structure mining
  Discover useful knowledge from the structure of hyperlinks
  Uniform resource locator (URL)
Web usage mining
  Study user behavior when navigating the web
  Discover user access patterns, e.g., sequence of clicks in sessions
5
Web Content Mining
6
Web Content Mining
Process of extracting knowledge from web contents
  Topic discovery, clustering of web documents, classification of web pages
  Search engines: gather information
Vector-based representation:
  Document + a set of attributes
  Term frequency: number of occurrences of a term / number of words in the document
7
Document Representation
Stop word removal
  Un-informative words such as “a”, “an”, “the”, “in”, “on”, “because”, “that”, “which”, …
Stemming: same word stem
  “tax”, “taxes”, “taxing” → “tax”
Attributes/Terms
         Web    Data   Mining   Tool    …
  Doc 1  0.03   0.04   0.1      0.001   …
  Doc 2  0      0.04   0.01     0       …
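The two preprocessing steps on this slide can be sketched in a few lines. This is a minimal illustration only, assuming whitespace tokenization and a toy suffix-stripping rule; `naive_stem` and `preprocess` are hypothetical helper names, and a real system would more likely use an established stemmer such as Porter's.

```python
# Stop words taken from the list on the slide.
STOP_WORDS = {"a", "an", "the", "in", "on", "because", "that", "which"}


def naive_stem(word):
    """Toy stemmer (an assumption for illustration): strip one common suffix."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = text.lower().split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The taxes on taxing a tax"))
```

With this rule, "tax", "taxes" and "taxing" all map to the same attribute, which is the effect the slide describes.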
8
Example
  document   text                       terms
  d1         web web graph              web, graph
  d2         graph web net graph net    graph, web, net
  d3         page web complex           page, web, complex
Term-frequency representation of vectors (a Boolean representation would keep only 0/1 for absence/presence):
  V  = [ web, graph, net, page, complex ]
  V1 = [2 1 0 0 0]/3
  V2 = [1 2 2 0 0]/5
  V3 = [1 0 0 1 1]/3
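The vectors for d1 to d3 can be reproduced with a short sketch. It assumes whitespace tokenization and the term order given in V; the function names `tf_vector` and `boolean_vector` are illustrative, not from the slides.

```python
VOCABULARY = ["web", "graph", "net", "page", "complex"]

DOCUMENTS = {
    "d1": "web web graph",
    "d2": "graph web net graph net",
    "d3": "page web complex",
}


def tf_vector(text):
    """Occurrences of each vocabulary term divided by the document length."""
    words = text.split()
    return [words.count(term) / len(words) for term in VOCABULARY]


def boolean_vector(text):
    """1 if the term occurs in the document, 0 otherwise."""
    words = set(text.split())
    return [1 if term in words else 0 for term in VOCABULARY]


tf_vectors = {name: tf_vector(text) for name, text in DOCUMENTS.items()}
boolean_vectors = {name: boolean_vector(text) for name, text in DOCUMENTS.items()}

for name in DOCUMENTS:
    print(name, tf_vectors[name], boolean_vectors[name])
```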
9
Document Representation
Weighting?
  A term that appears in many documents does not have discriminative power
  Give higher weight to terms that are rare
Collection frequency:
  Number of occurrences of a term in the collection
Document frequency:
  Number of documents in the collection that contain a particular term
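To make the difference between the two counts concrete, the sketch below computes both for the three example documents d1 to d3. It assumes whitespace tokenization; the variable names are illustrative.

```python
from collections import Counter

DOCUMENTS = {
    "d1": "web web graph",
    "d2": "graph web net graph net",
    "d3": "page web complex",
}

collection_frequency = Counter()  # total occurrences across the collection
document_frequency = Counter()    # number of documents containing the term

for text in DOCUMENTS.values():
    words = text.split()
    collection_frequency.update(words)
    document_frequency.update(set(words))  # count each term once per document

for term in sorted(collection_frequency):
    print(term, collection_frequency[term], document_frequency[term])
```

Here "web" occurs 4 times in 3 documents while "net" occurs 2 times in 1 document, so "net" is the rarer term by document frequency.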
10
Document Representation
Example:
  Document frequency is better than collection frequency (cf) for weighting
  Compare a term that occurs in only a few documents with a term that appears in most or all documents
11
Document Representation
Example:
  N = 806791 (total documents in the collection)
  $\mathrm{idf}_t = \log_{10}\left(N / \mathrm{df}_t\right)$
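A small sketch of inverse document frequency, assuming a base-10 logarithm (this base reproduces the idf values 2.3, 1.3, 2.0 and 3.0 used in the next example). The function name `idf` and its parameters are illustrative.

```python
import math


def idf(total_documents, document_frequency):
    """Inverse document frequency: log10(N / df)."""
    return math.log10(total_documents / document_frequency)


N = 806_791  # collection size from the slide

# A term found in every document gets weight 0; rarer terms get more weight.
print(idf(N, N))
print(idf(N, 1_000))
print(idf(N, 1))
```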
12
Example
N = 1 million documents
DF of auto, best, car and insurance: 5K, 50K, 10K and 1K
query: best car insurance
Document: auto: 1, best: 0, car: 1, insurance: 2
Score = 0 + 0.82 + 2.46 = 3.28
  term        df     idf    query   document   normalized   product
  auto        5K     2.3    ---     1          0.41         0
  best        50K    1.3    1.3     0          0            0
  car         10K    2.0    2.0     1          0.41         0.82
  insurance   1K     3.0    3.0     2          0.82         2.46
Euclidean length of the document vector: $\sqrt{1^2 + 0^2 + 1^2 + 2^2} = \sqrt{6} \approx 2.45$
13
Example
N = 1 million documents
DF of auto, best, car and insurance: 5K, 50K, 10K and 1K
query: best car insurance
Document 1: auto: 1, best: 1, car: 1, insurance: 0
Document 2: auto: 0, best: 0, car: 0, insurance: 2
Use Euclidean normalization to normalize tf
Scores = ??
14
Example
query: best car insurance
Document 1: auto: 1, best: 1, car: 1, insurance: 0
Document 2: auto: 0, best: 0, car: 0, insurance: 2
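One way to work the exercise is sketched below. It assumes the same scheme as the worked example: the query weight of a term is its idf (base-10 log), the document weight is tf divided by the Euclidean length of the tf vector, and the score is the sum of their products over the query terms. The names `idf` and `score` are illustrative.

```python
import math

N = 1_000_000
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}


def idf(term):
    return math.log10(N / DF[term])


def score(query_terms, doc_tf):
    """Sum over query terms of idf * Euclidean-normalized term frequency."""
    length = math.sqrt(sum(tf * tf for tf in doc_tf.values()))
    if length == 0:
        return 0.0
    return sum(idf(term) * doc_tf.get(term, 0) / length for term in query_terms)


query = ["best", "car", "insurance"]
doc1 = {"auto": 1, "best": 1, "car": 1, "insurance": 0}
doc2 = {"auto": 0, "best": 0, "car": 0, "insurance": 2}

scores = {"Document 1": score(query, doc1), "Document 2": score(query, doc2)}
for name, value in scores.items():
    print(name, round(value, 2))
```

Under these assumptions Document 2 ranks above Document 1 (roughly 3.0 versus 1.91), because "insurance" is the rarest query term. Without rounding the intermediate values, the earlier worked example comes out near 3.27 rather than 3.28.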
15
Applications
Summarization of consumer reviews
  Customers can post reviews of products at various places
    Web forums, discussion groups, blogs, …
  Need web content analysis to
    Help potential customers to “read” many reviews
    Help product manufacturers to do product “benchmarking”
16
Web Structure Mining
17
Web Structure
Discover structure information from the web
Structure info
  Hyperlinks: a location in a web page connects to another location
    Link structure is very important: a page is important if many important pages link to it
  Document structure: organize content in a tree-structured format
Web robots
  Software programs that automatically traverse the hyperlink structure of the web to locate and retrieve information
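The traversal performed by a web robot can be sketched as a breadth-first walk over links. To stay self-contained, this sketch crawls a small in-memory "site" instead of the real web; `crawl`, `fetch_links` and the page names are hypothetical, and a real robot would also fetch pages over HTTP and respect robots.txt.

```python
from collections import deque


def crawl(start, fetch_links, max_pages=100):
    """Visit pages breadth-first, following each link at most once."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site: page -> pages it links to.
SITE = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": []}

visited = crawl("/", lambda page: SITE.get(page, []))
print(visited)
```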
18
Web Structure Mining
Web Graph
  visualization
Graph
  Pages: nodes
  Hyperlinks: edges
19
Web Structure Mining
Page B: inlinks: 3, outlinks: 1
Use graph for ranking
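Inlink and outlink counts follow directly from an edge list. The graph below is hypothetical (the edges of the original figure are not available); it is chosen only so that page B has 3 inlinks and 1 outlink as stated on the slide.

```python
from collections import Counter

# Hypothetical edges (source page, target page).
EDGES = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "C")]

outlinks = Counter(source for source, _ in EDGES)
inlinks = Counter(target for _, target in EDGES)

pages = sorted({page for edge in EDGES for page in edge})

# A very simple ranking: order pages by number of inlinks.
ranking = sorted(pages, key=lambda page: inlinks[page], reverse=True)

print("B:", inlinks["B"], "inlinks,", outlinks["B"], "outlink")
print("Ranking by inlinks:", ranking)
```

Counting inlinks treats every link as equally valuable; PageRank refines this by weighting each link by the importance of the page it comes from.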
20
Web Graph
Web graph is highly dynamic
  Nodes and edges are added/deleted
  Content of existing nodes is subject to change
Study behaviour of users as they traverse the web graph
Study distribution of pages per site
21
Web Link Analysis
PageRank
  Assigns a numerical weighting to each page to measure its relative importance
  A hyperlink to a page counts as a vote of support
  A page that is linked to by many pages with high PageRank receives a high rank
  Represents the likelihood that a person randomly clicking on links will arrive at any particular page
22
Calculation of PageRank (optional)
23
Calculation of PageRank (optional)
Finding PageRank: find the eigenvector of B with an associated eigenvalue α
PageRank = normalized r1
24
Calculation of PageRank
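The matrix B from the slides is not reproduced here. As a stand-in, the sketch below uses the common damped power-iteration formulation of PageRank, which converges to the dominant eigenvector of the damped link matrix. The damping factor 0.85, the iteration count and the 4-page graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration; `links` maps each page to the pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: A, C and D link to B; B links to C.
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B"]}

ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

The ranks sum to 1, so each value can be read as the probability that a random surfer is on that page.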
25
Web Structure
Applications
  Find related pages/popular pages
  Populate categories in web directories
Assumptions:
  Can use hyperlink structure to identify “important” web sources for a broad topic
    Authorities: identify highly referenced pages on a topic
  Advertisement: pay high-ranking pages for advertising space
26
Web Usage Mining
27
Web Usage Mining
Extraction of usage patterns from web data to better understand the needs of web-based applications
  Data stored in server access logs and/or client-side cookies
  User characteristics, usage profiles
  Page attributes, content attributes, etc.
28
Web Usage Mining
Design marketing strategies across products
Electronic advertisements and coupons
  Based on user access patterns
  Determine the most relevant ads
Present dynamic information to users
  Based on interests/profiles
Web data and measurement issues
  Collected automatically via logging tools
    No manual supervision required
  Data can be skewed
    Presence of robots
29
[Figure: number of page requests]
30
31
Data Sources
Log file: a file that records and keeps track of HTTP request/response messages received by or sent from a web site
Examine logs
  Server level collection
    Server stores data regarding requests made by clients
    IP address, time of request, status code, size in bytes of the transaction
  Client level collection
    The client sends information about the user's behavior to a repository through JavaScript or through the browser
  Proxy level collection
    Contains requests from clients who pass through the proxy to certain web sites
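As an illustration of server level collection, the sketch below extracts the IP address, time, status code and size from one log entry. It assumes the server writes the widely used Common Log Format; the sample line, the pattern and the name `parse_log_line` are made up for this example, and real logs may use a different layout.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # "-" means no body was sent; treat it as 0 bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical log entry.
sample = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry["ip"], entry["time"], entry["path"], entry["status"], entry["size"])
```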
32
Log files
Proxy: generates requests on behalf of many users
[Figure: Client 1, Client 2 and Client 3, a proxy, Server 1 and Server 2, with GET A and GET B requests]
33
Server Logging
Web server
  Generates a log as part of processing client requests
  Each log entry: HTTP request handled by the server
    Extract part of the information
Questions to think about: caching
  Requests that are satisfied by a browser or proxy cache do not appear in the log
  Difficult to be absolutely certain about the popularity of resources
  Fact: popular resources are placed in caches (to reduce network load and latency)
34
Server Logging
Access patterns finding: associate requests with users
  Looking at IP address/hostname
Difficulties:
  Proxy: generates requests on behalf of many users
[Figure: Client 1, Client 2 and Client 3, a proxy, Server 1 and Server 2, with GET A and GET B requests]
35
Server Logging
Access patterns finding:
Difficulties:
  Shared machines
    One computer might not be assigned to a single person only!
  IP address assignment
    ISP: dynamic IP assignment
    A pool of available addresses: a different address may be assigned at each login
36
Proxy Logging
Proxy: creates logs as part of normal operation
Proxy log vs server log
  Request records to a wide range of web sites
  Includes info about requests that are satisfied by the proxy’s cache
Can be used to determine
  Popularity of web sites
  Effectiveness of caching
37
Client Logging
Logs at user agent
  Provides a detailed view of user browsing patterns
  Direct recording of page requests
  More reliable
    Time per session, page-view duration
38
Web log file
May want
  User identification
    Associate specific page requests to specific users
  Through IP
    Assume each IP address is a unique user
    Problem: not guaranteed; IP addresses are allocated dynamically and one IP can belong to several users
  Through session ID
    Cannot capture repeat visitors
  Through cookie (login ID)
    Actions by the same user during different sessions can be linked together
    Cookies can be deleted/disabled
Identifying individuals
  Useful to associate specific page requests to specific individual users
  Better to use cookies
    Information in the cookie can be accessed by the Web server to identify an individual user over time
    Actions by the same user during different sessions can be linked together
    Commercial websites use cookies extensively
    90% of users have cookies enabled permanently in their browsers.
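The difference between identifying users by IP address and by cookie can be seen on a tiny example. The records below are hypothetical: two users share one IP address (for instance behind a proxy), and one of them later returns from a different address. The name `group_pages` is illustrative.

```python
from collections import defaultdict

# Hypothetical log records: (ip_address, cookie_id, requested_page)
RECORDS = [
    ("203.0.113.5", "u1", "/home"),
    ("203.0.113.5", "u2", "/cart"),      # different user, same IP
    ("198.51.100.7", "u1", "/product"),  # same user, new IP
]


def group_pages(records, key_index):
    """Group requested pages by the chosen identifier column."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return dict(groups)


by_ip = group_pages(RECORDS, 0)
by_cookie = group_pages(RECORDS, 1)

print("By IP:    ", by_ip)
print("By cookie:", by_cookie)
```

Grouping by IP merges the two users on the shared address and splits u1 across two addresses, while grouping by cookie links u1's requests together.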
39 40
Example Data Analysis
Examine groups of server session data
  Examine customer behaviors
  Know how the site is being used
Data Mining: frequent items
  “Home Page” and “Shopping Cart page” are accessed together in 30% of the sessions
  … product and … product pages are accessed together in x% of the sessions
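Statements such as "accessed together in 30% of the sessions" come from counting page pairs per session. A minimal sketch follows, assuming each session is given as the set of pages visited; the session data and the name `pair_support` are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(sessions):
    """Fraction of sessions in which each pair of pages occurs together."""
    counts = Counter()
    for pages in sessions:
        # Sort so that each pair has one canonical order.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: count / len(sessions) for pair, count in counts.items()}


# Hypothetical sessions.
SESSIONS = [
    {"home", "cart", "product_a"},
    {"home", "product_a"},
    {"home", "cart"},
    {"product_a", "product_b"},
    {"home"},
]

support = pair_support(SESSIONS)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, f"{value:.0%}")
```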
41
Example: Amazon.com
Aim: personalized customer experience (a personalized store for every customer)
Use web mining techniques to improve the customer’s experience
  Instant/featured recommendations
  Browsing history
  “Wish” list (items you want to buy)
Data Mining problem:
  What products should we advertise to this person?
  If a person buys X, should we suggest Y?
42
Minimum support for the associations is 80 customers
Confidence: 37% of people who purchased A also purchased B
Lift: people who purchased A were 222 times more likely to purchase B compared to the general population
Product associations
  Lift   Confidence
  222    37%
  195    52%
  304    73%
  51     48%
Products shown: Volant Pants, Windstopper Alpine Hat, Tremblant 575 Vest Women’s
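Confidence and lift can be computed from four counts. The sketch below assumes the usual definitions, confidence = P(B | A) and lift = P(B | A) / P(B); the customer counts are hypothetical, and only the minimum support of 80 customers comes from the slide.

```python
MIN_SUPPORT = 80  # minimum number of customers buying both, from the slide


def association_metrics(total, count_a, count_b, count_both):
    """Return (meets_min_support, confidence, lift) for the rule A -> B."""
    confidence = count_both / count_a         # P(B | A)
    baseline = count_b / total                # P(B) in the general population
    lift = confidence / baseline
    return count_both >= MIN_SUPPORT, confidence, lift


# Hypothetical counts: 10,000 customers, 400 bought A, 200 bought B, 100 bought both.
meets_support, confidence, lift = association_metrics(10_000, 400, 200, 100)
print(meets_support, f"{confidence:.0%}", round(lift, 1))
```

Read the same way, a confidence of 37% with a lift of 222 implies that only about 0.17% of all customers bought B.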
43
Others
Customer Locations Relative to Retail Stores
Map of Canada with store locations.
Black dots show store locations.
Heavy purchasing areas away from retail stores can suggest new retail store locations
No stores in several hot areas: MEC is building a store in Montreal right now.
44
Summary
Web mining
  Web content
  Web structure
  Web usage
Incorporation of web mining in web application development