Web Data Analysis/Mining - The Department of Electronic and ...



1

Web Data Analysis/Mining

2

Objective
  To introduce fundamental concepts of web mining
  To know the different categories of web mining
  To appreciate the use of web mining in web applications

Data mining: the computational process of discovering patterns (extracting information) in large data sets

3

Web Mining
Aim: discover and extract information from the web

Difficulties in web information retrieval
  The web is the largest repository of data
    Text, audio, image, video, animation, …
    HTML, XML, PDF, MP3, JPEG, MPEG, …
  The web is very dynamic
    Pages are added, removed and changed hourly, daily or weekly
    Links between pages
  Keyword-oriented search is not effective
    Irrelevant documents are often returned

Apply data mining techniques to extract knowledge from web content, structure and usage

4

Web Mining
Discover useful information from the web hyperlink structure, page content and usage data

Web content mining
  Extract and integrate useful data, information and knowledge from web page contents
  Keywords in a page assist users in finding documents that meet a certain criterion (text mining)

Web structure mining
  Discover useful knowledge from the structure of hyperlinks
  Uniform Resource Locator (URL)

Web usage mining
  Study user behavior when navigating the web
  Discover user access patterns, e.g., the sequence of clicks in sessions

5

Web Content Mining

6

Web Content Mining
Process of extracting knowledge from web contents
  Topic discovery, clustering of web documents, classification of web pages
  Search engines: gather information

Vector-based representation:
  Document + a set of attributes
  Term frequency = number of occurrences of the term / number of words in the document

7

Document Representation
Stop word removal
  Uninformative words such as “a”, “an”, “the”, “in”, “on”, “because”, “that”, “which”, …
Stemming: reduce words to the same word stem
  “tax”, “taxes”, “taxing” → “tax”

Attributes/Terms

         Web    Data   Mining   Tool    …
Doc 1    0.03   0.04   0.1      0.001   …
Doc 2    0      0.04   0.01     0       …
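The two preprocessing steps on this slide can be sketched in a few lines of Python. This is only an illustration: the stop-word set is the short list quoted above, and the `stem` function is a naive suffix stripper written for this example (it is not the Porter stemmer or any library routine).

```python
# Stop words quoted on the slide; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "in", "on", "because", "that", "which"}

# Naive suffix stripping for illustration only (an assumption of this sketch,
# NOT the Porter stemmer).
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text, drop stop words, and stem the remaining words."""
    words = text.lower().split()
    return [stem(w) for w in words if w not in STOP_WORDS]


print(preprocess("the tax on taxes because taxing"))
```

All three surviving words reduce to the stem "tax", matching the slide's example.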

8

Example

document   text                       terms
d1         web web graph              web, graph
d2         graph web net graph net    graph, web, net
d3         page web complex           page, web, complex

Term-frequency representation of vectors (term counts divided by document length):

V  = [ web, graph, net, page, complex ]
V1 = [2 1 0 0 0]/3
V2 = [1 2 2 0 0]/5
V3 = [1 0 0 1 1]/3
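The vectors V1 to V3 can be reproduced with a short Python sketch. The function name `tf_vector` and the dictionary layout are choices made for this illustration; the vocabulary order and the document texts are taken from the example.

```python
from collections import Counter

# Vocabulary order as given on the slide.
VOCABULARY = ["web", "graph", "net", "page", "complex"]

DOCUMENTS = {
    "d1": "web web graph",
    "d2": "graph web net graph net",
    "d3": "page web complex",
}


def tf_vector(text, vocabulary):
    """Term frequency: occurrences of each term / number of words in the document."""
    words = text.split()
    counts = Counter(words)
    return [counts[term] / len(words) for term in vocabulary]


vectors = {name: tf_vector(text, VOCABULARY) for name, text in DOCUMENTS.items()}
for name, vec in vectors.items():
    print(name, [round(x, 2) for x in vec])
```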

9

Document Representation
Weighting?
  A term that appears in many documents does not have discriminative power
  Give higher weight to terms that are rare

Collection frequency: number of occurrences of a term in the collection

Document frequency: number of documents in the collection that contain a particular term

10

Document Representation
Example:
  Document frequency (df) is a better basis for weighting than collection frequency (cf)
  Compare a term that occurs in only a few documents with a term that appears in most or all documents

11

Document Representation
Example:
  N = 806791 (total number of documents in the collection)

  $\mathrm{idf}_t = \log_{10} \dfrac{N}{\mathrm{df}_t}$

12

Example
N = 1 million documents
DF of auto, best, car and insurance: 5K, 50K, 10K and 1K
Query: best car insurance
Document: auto: 1, best: 0, car: 1, insurance: 2

term        df    idf   query   document tf   normalized tf   product
auto        5K    2.3   ---     1             0.41            0
best        50K   1.3   1.3     0             0               0
car         10K   2.0   2.0     1             0.41            0.82
insurance   1K    3.0   3.0     2             0.82            2.46

Euclidean length of the document vector: $\sqrt{1^2 + 0^2 + 1^2 + 2^2} = \sqrt{6} \approx 2.45$

Score = 0 + 0.82 + 2.46 = 3.28
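The arithmetic of this example can be checked with a small Python sketch. It assumes a base-10 logarithm for idf (which reproduces the idf values 2.3, 1.3, 2.0 and 3.0 in the table), query weights equal to idf, and Euclidean normalization of the document term counts; the function names are chosen for this illustration.

```python
import math

N = 1_000_000  # documents in the collection
DF = {"auto": 5_000, "best": 50_000, "car": 10_000, "insurance": 1_000}


def idf(term):
    """Inverse document frequency with a base-10 logarithm."""
    return math.log10(N / DF[term])


def normalized_tf(tf):
    """Divide raw term counts by the Euclidean length of the count vector."""
    length = math.sqrt(sum(c * c for c in tf.values()))
    return {t: (c / length if length else 0.0) for t, c in tf.items()}


def score(query_terms, doc_tf):
    """Sum of (query idf weight) x (normalized document tf) over query terms."""
    doc_weights = normalized_tf(doc_tf)
    return sum(idf(t) * doc_weights.get(t, 0.0) for t in query_terms)


document = {"auto": 1, "best": 0, "car": 1, "insurance": 2}
result = score(["best", "car", "insurance"], document)
print(round(result, 2))
```

Without intermediate rounding the score comes out at about 3.27; the 3.28 in the table results from rounding the normalized values to 0.41 and 0.82 before multiplying.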

13

Example
N = 1 million documents
DF of auto, best, car and insurance: 5K, 50K, 10K and 1K
Query: best car insurance
Document 1: auto: 1, best: 1, car: 1, insurance: 0
Document 2: auto: 0, best: 0, car: 0, insurance: 2
Use Euclidean normalization to normalize tf

Scores = ??

14

Example
Query: best car insurance
Document 1: auto: 1, best: 1, car: 1, insurance: 0
Document 2: auto: 0, best: 0, car: 0, insurance: 2

15

Applications
Summarization of consumer reviews
  Customers can post reviews of products at various places
    Web forums, discussion groups, blogs, …
  Web content analysis is needed to
    Help potential customers to “read” many reviews
    Help product manufacturers to do product “benchmarking”

16

Web Structure Mining

17

Web Structure
Discover structure information from the web

Structure information
  Hyperlinks: a location in a web page that connects to another location
    Link structure is very important: a page is important if many important pages link to it
  Document structure: content organized in a tree-structured format

Web robots
  Software programs that automatically traverse the hyperlink structure of the web to locate and retrieve information
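The traversal performed by a web robot can be sketched as a breadth-first walk over links. To keep the example self-contained, the "web" here is a hypothetical in-memory dictionary and `fetch_links` is a stand-in for downloading a page and extracting its hyperlinks; a real robot would also need politeness rules, error handling and URL normalization.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs it links to.
WEB = {
    "/home": ["/products", "/about"],
    "/products": ["/home", "/cart"],
    "/about": [],
    "/cart": ["/home"],
}


def crawl(start, fetch_links):
    """Visit every page reachable from `start`, breadth first, each page once."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in fetch_links(page):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


order = crawl("/home", lambda page: WEB.get(page, []))
print(order)
```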

18

Web Structure Mining
Web graph visualization

Graph
  Pages: nodes
  Hyperlinks: edges

19

Web Structure Mining

Node B: inlinks: 3, outlinks: 1

Use graph for ranking
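Counting inlinks and outlinks is a matter of tallying edge endpoints. The edge list below is hypothetical (the original figure is not available); it is constructed so that node B has three inlinks and one outlink, as stated on the slide.

```python
from collections import Counter

# Hypothetical edges (source page -> target page); B has 3 inlinks, 1 outlink.
EDGES = [("A", "B"), ("C", "B"), ("D", "B"), ("B", "C"), ("A", "D")]


def degree_counts(edges):
    """Return (inlink counts, outlink counts) per node."""
    outlinks = Counter(source for source, _ in edges)
    inlinks = Counter(target for _, target in edges)
    return inlinks, outlinks


inlinks, outlinks = degree_counts(EDGES)
print("B inlinks:", inlinks["B"], "outlinks:", outlinks["B"])
```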

20

Web Graph
The web graph is highly dynamic
  Nodes and edges are added/deleted
  Content of existing nodes is subject to change

Study the behaviour of users as they traverse the web graph
Study the distribution of pages per site

21

Web Link Analysis
PageRank
  Assigns a numerical weight to each page as a measure of its relative importance
  A hyperlink to a page counts as a vote of support
  A page that is linked to by many pages with high PageRank receives a high rank
  Represents the likelihood that a person randomly clicking on links will arrive at any particular page

22

Calculation of PageRank (optional)

23

Calculation of PageRank (optional)

Finding PageRank amounts to finding an eigenvector of the matrix B with an associated eigenvalue α

PageRank: r1 (normalized)

24

Calculation of PageRank
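Since the derivation on these slides did not survive, here is a minimal sketch of one common way to compute PageRank: power iteration on the random-surfer model. The damping factor of 0.85, the iteration count and the four-page link graph are assumptions of this example, not values from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration; `links` maps each page to the pages it links to.

    Every link target must also appear as a key. The damping factor 0.85 is a
    conventional choice and an assumption of this sketch.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives an equal base share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: B is linked to by A, C and D.
LINKS = {"A": ["B"], "B": ["C"], "C": ["B"], "D": ["B", "C"]}
ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(value, 3))
```

The ranks sum to 1, so each value can be read as the probability that the random surfer is on that page; B, which collects the most votes, ranks highest in this graph.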

25

Web Structure
Applications
  Find related pages/popular pages
  Populate categories in web directories

Assumptions:
  The hyperlink structure can be used to identify “important” web sources for broad topics

Authorities: identify highly referenced pages on a topic

Advertisement: pay high-ranking pages for advertising space

26

Web Usage Mining

27

Web Usage Mining
Extraction of usage patterns from web data to better understand the needs of web-based applications
  Data stored in server access logs and/or client-side cookies
  User characteristics, usage profiles
  Page attributes, content attributes, etc.

28

Web Usage Mining
Design marketing strategies across products
Electronic advertisements and coupons
  Based on user access patterns
  Determine the most relevant ads
Present dynamic information to users
  Based on interests/profiles

Web data and measurement issues
  Collected automatically via logging tools
    No manual supervision required
  Data can be skewed
    Presence of robots

29

[Figure: number of page requests]

30

31

Data Sources
Log file: a file that records and keeps track of the HTTP request/response messages received by or sent from a web site
Examine logs

Server level collection
  The server stores data about the requests made by clients
  IP address, time of request, status code, size of the transaction in bytes

Client level collection
  Information about the client's behavior is sent to a repository through JavaScript or through the browser

Proxy level collection
  Contains the requests of clients who pass through the proxy to certain web sites
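As an illustration of server level collection, the sketch below extracts the fields listed above (IP address, time of request, status code, size in bytes) from one log entry. It assumes the widely used Common Log Format; the sample line, the regular expression and the function name are made up for this example, and real servers may be configured to log different fields.

```python
import re
from datetime import datetime

# Assumed layout: Common Log Format
#   host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Hypothetical log entry (documentation IP address range).
SAMPLE = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(SAMPLE)
print(entry["ip"], entry["time"].isoformat(), entry["status"], entry["size"])
```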

32

Log files

Proxy: generate requests on behalf of many users

[Figure: Client 1, Client 2 and Client 3 connect through a proxy to Server 1 and Server 2, which hold resources A and B; the request labels are GET A (three times) and GET B (twice)]

33

Server Logging
Web server
  Generates a log as part of processing client requests
  Each log entry: an HTTP request handled by the server
  Extract part of the information

Questions to think about:
  Caching:
    Requests that are satisfied by a browser or proxy cache do not appear in the log
    It is difficult to be absolutely certain about the popularity of resources
    Fact: popular resources are placed in a cache (reduces network load and latency)

34

Server Logging
Finding access patterns: associate requests with users
  Looking at IP address/hostname

Difficulties:
  Proxy: generates requests on behalf of many users

[Figure: proxy diagram repeated from slide 32, with Clients 1 to 3, a proxy, Server 1 and Server 2, and GET A / GET B requests]

35

Server Logging
Finding access patterns:

Difficulties:
  Shared machines
    One computer might not be assigned to a single person only!
  IP address assignment
    ISP: dynamic IP assignment
    A pool of available addresses: a different address may be assigned at each login

36

Proxy Logging
Proxy: creates logs as part of normal operation

Proxy log vs server log
  Records requests to a wide range of web sites
  Includes information about requests that are satisfied by the proxy’s cache

Can be used to determine
  Popularity of web sites
  Effectiveness of caching

37

Client Logging
Logs at the user agent
  Provides a detailed view of user browsing patterns
  Direct recording of page requests
  More reliable
    Time per session, page-view duration

38

Web log file
May want

User identification
  Associate specific page requests with specific users
  Through IP
    Assume each IP address is a unique user
    Problem: not guaranteed; IP addresses are allocated dynamically, and one IP can belong to several users
  Through session ID
    Cannot capture repeat visitors
  Through cookie (login ID)
    Actions by the same user during different sessions can be linked together
    Cookies can be deleted/disabled

Identifying individuals
  Useful to associate specific page requests with specific individual users
  Better to use cookies
    Information in the cookie can be accessed by the web server to identify an individual user over time
    Actions by the same user during different sessions can be linked together
    Commercial websites use cookies extensively
    90% of users have cookies enabled permanently in their browsers
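The identification strategies above can be combined in a simple sketch: use the cookie ID when a request carries one, fall back to the IP address otherwise, and split each user's requests into sessions after a period of inactivity. The 30-minute timeout, the record layout and the sample requests are assumptions made for this example.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity; an assumed, conventional value

# Hypothetical records: (timestamp in seconds, IP address, cookie ID or None, page)
REQUESTS = [
    (0, "192.0.2.1", "u1", "/home"),
    (120, "192.0.2.1", "u1", "/cart"),
    (200, "192.0.2.1", None, "/home"),  # same IP, but no cookie: another user?
    (4000, "192.0.2.7", "u1", "/home"),  # same cookie from a new IP, much later
]


def user_key(ip, cookie_id):
    """Prefer the cookie ID; fall back to the IP address when no cookie is present."""
    return f"cookie:{cookie_id}" if cookie_id else f"ip:{ip}"


def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests per user and start a new session after `timeout` of inactivity."""
    sessions = defaultdict(list)
    for timestamp, ip, cookie_id, page in sorted(requests, key=lambda r: r[0]):
        user_sessions = sessions[user_key(ip, cookie_id)]
        if user_sessions and timestamp - user_sessions[-1][-1][0] <= timeout:
            user_sessions[-1].append((timestamp, page))
        else:
            user_sessions.append([(timestamp, page)])
    return dict(sessions)


sessions = sessionize(REQUESTS)
for key, user_sessions in sessions.items():
    print(key, user_sessions)
```

The cookie links the first and last requests to the same user even though the IP address changed, which IP-based identification alone would miss.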

39 40

Example Data Analysis
Examine groups of server session data
  Examine customer behaviors
  Know how the site is being used

Data mining: frequent items
  “Home Page” and “Shopping Cart page” are accessed together in 30% of the sessions
  … product and … product pages are accessed together in x% of the sessions
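A statement such as "accessed together in 30% of the sessions" is the support of a page pair. The sketch below counts, for every pair of pages, the fraction of sessions that contain both. The ten sessions are invented so that the home and cart pages co-occur in 3 of 10 sessions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sessions, each reduced to the set of pages visited.
SESSIONS = [
    {"home", "cart"},
    {"home", "cart", "product"},
    {"home", "cart"},
    {"home", "product"},
    {"home"},
    {"product"},
    {"home"},
    {"product", "search"},
    {"search"},
    {"home", "search"},
]


def pair_support(sessions):
    """Fraction of sessions in which each pair of pages occurs together."""
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            counts[pair] += 1
    return {pair: count / len(sessions) for pair, count in counts.items()}


support = pair_support(SESSIONS)
print(support[("cart", "home")])
```

Pairs are stored with their page names in alphabetical order, so the home/cart pair is looked up as `("cart", "home")`.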

41

Example: Amazon.com
Aim: personalized customer experience (a personalized store for every customer)
Use web mining techniques to improve the customer’s experience
  Instant/featured recommendations
  Browsing history
  “Wish” list (things you want to buy)

Data mining problem:
  What products should we advertise to this person?
  If a person buys X, should we suggest Y?

42

Minimum support for the associations is 80 customers
Confidence: 37% of people who purchased A also purchased B
Lift: people who purchased A were 222 times more likely to purchase B compared to the general population

Lift   Confidence
222    37%
195    52%
304    73%
51     48%

[Product and Association columns are only partly recoverable; surviving product names: Volant Pants, Windstopper Alpine Hat, Tremblant 575 Vest Women’s]
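Support, confidence and lift as described on this slide can be computed directly from purchase records. In the sketch below, confidence is P(B | A) and lift is confidence divided by the share of all customers who bought B. The transactions and the function name are hypothetical; `min_support` is a customer count, as in the slide's threshold of 80.

```python
def association_metrics(transactions, a, b, min_support=1):
    """Support count, confidence and lift for the rule 'bought a => bought b'.

    Returns None when fewer than `min_support` customers bought both items.
    """
    n = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    if count_ab < min_support or count_a == 0 or count_b == 0:
        return None
    confidence = count_ab / count_a      # P(B | A)
    lift = confidence / (count_b / n)    # P(B | A) / P(B)
    return {"support_count": count_ab, "confidence": confidence, "lift": lift}


# Hypothetical purchases: 10 customers, A bought by 4, B by 5, both by 3.
TRANSACTIONS = (
    [{"A", "B"}] * 3 + [{"A"}] + [{"B"}] * 2 + [{"C"}] * 4
)

metrics = association_metrics(TRANSACTIONS, "A", "B")
print(metrics)
```

Here 75% of the customers who bought A also bought B, against 50% of all customers, giving a lift of 1.5. With a threshold of 80 customers, as on the slide, this tiny data set would not qualify and the function would return None.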

43

Others

Customer Locations Relative to Retail Stores

Map of Canada with store locations.

Black dots show store locations.

Heavy purchasing areas away from retail stores can suggest new retail store locations

No stores in several hot areas:

MEC is building a store in Montreal right now.

44

Summary
Web mining
  Web content
  Web structure
  Web usage

Incorporation of web mining in web application development