
Granular Modeling of Web Documents: Impact on Information Retrieval Systems

Elisabetta Fersini
University of Milano-Bicocca
Viale Sarca, 336
Milan, Italy
fersini@disco.unimib.it

Enza Messina
University of Milano-Bicocca
Viale Sarca, 336
Milan, Italy
messina@disco.unimib.it

Francesco Archetti
University of Milano-Bicocca
Viale Sarca, 336
Milan, Italy
archetti@disco.unimib.it

ABSTRACT
One of the most important tasks in Information Retrieval (IR) is related to web page information extraction and processing. It is a common approach to consider a web page as an atomic unit and to model its textual content as a "bag-of-words". However, this kind of representation does not reflect how people perceive a web page. A granular document representation, in terms of semantic objects, can help in identifying the semantic areas of a web page and using them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating semantic object importance and to enhance the performance of IR systems. In particular, we show that this new metric can be used not only for classification goals, in which instances are assumed to be independent and identically distributed, but also to gauge the strength of relationships between hypertextual documents and to exploit this information for improving page ranking performance.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval Applications]: Information Search and Retrieval; I.7.2 [Document and Text Processing]: Document Preparation—Hypertext/hypermedia

General Terms
Measurement, Experimentation

Keywords
Document Classification, Web Page Ranking, Visual Layout Analysis, Relational Granular Document Modeling

1. INTRODUCTION
Today's search engines and document classification techniques are mostly based on textual information and do not take into account the information related to visual layout features. However, in terms of human perception, it is always the case that people view a web page as a composition of semantic objects rather than as a single object.


Indeed, when a web page is presented to the user, spatial and visual cues help him or her to unconsciously divide the page into several semantic parts containing different types of information, for example a navigation bar, copyright notice, contact information and so on. The detection of this semantic content structure may improve the performance of web Information Retrieval (IR) tasks such as indexing, searching, and classification. The problem of identifying visual semantic parts has been addressed in [2] and [3].

In [2] the authors provide a mechanism to detect the visual layout structure of a web page through the identification of different page components such as headers, navigation bars, left and right menus, footers, and informative parts. The functional role of each component is defined through a set of heuristic rules derived by statistical analysis on a sample of web pages.
In [3] a Vision-based Page Segmentation (VIPS) algorithm has been proposed. The VIPS algorithm, based on the assumption that semantically related contents are usually grouped together, is able to divide the page into different regions, called blocks, using visual separators. The semantic content of a web page is mapped into a granular structure in which each node corresponds to a block. The VIPS algorithm makes full use of page layout features: first it extracts all the suitable blocks from the HTML DOM tree, then it finds separators between the extracted blocks, represented by horizontal or vertical lines that visually do not cross any block. With this information the web page semantic structure is constructed.

A web page d_i can therefore be represented as a set of "visual" blocks divided by different types of "visual" separators. Intuitively, each block denotes a semantic part of the web page and can be recursively considered as a web document. During IR tasks, visual blocks are usually considered as independent [16] [3].

In this paper we want to extend this representation in order to take into account embedded relationships among blocks. By using this extended representation we propose a new metric for evaluating the importance of each block. This importance evaluation exploits both implicit and explicit relationships between blocks belonging to a set of hypertextual documents, thus enhancing the performance of different IR tasks. In particular we show the effectiveness of this method for classification and page ranking purposes.

The outline of this paper is the following. In section 2 we outline the block relational model used for representing web documents. In sections 3 and 4 we present two different instantiations of this model aimed respectively at web page classification and search engine page ranking. In section 5 the dataset and the performance measures are described and the experimental results are discussed. Finally, in section 6 we derive conclusions and outline future developments.


2. A BLOCK RELATIONAL MODEL
A web page d_i can be represented as a couple (Θ_i, Φ_i), where Θ_i = {d_i1, d_i2, ..., d_in} is a finite set of disjoint "visual" blocks and Φ_i is the set of horizontal and vertical separators. Of course not all blocks d_il are equally informative, and we can take this into account by representing Θ_i = {(d_i1, r_i1), (d_i2, r_i2), ..., (d_in, r_in)}, where r_il represents the importance of the semantic content of d_il. In the literature, different block importance estimation procedures based on a granular representation have been proposed. The simplest one requires manual supervision for block labeling [16], while the other one [3] considers visual block importance strictly related to spatial location (blocks located in the middle of a web page are usually more important than blocks in the bottom right corner). The shortcoming of these approaches is that the importance of each block is evaluated following a set of rules, without taking into account the semantic content of the other blocks. In order to overcome this problem, we propose to compute r_il by evaluating the coherence of the textual content of d_il with respect to the rest of the document. Thus, this "internal" coherence depends implicitly on the "internal" coherence of the other blocks.

A more explicit kind of relationship is represented by hyperlinks. Using the granular representation, a link can be represented as the relationship between the block d*_ik originating the hyperlink and the destination document d_k, i.e. R(d*_ik, d_k). This means that we represent a link as a connection between a block and a document, and not simply between two documents.

Moreover, instead of simply considering a link as a 0-1 binary relation, we propose to evaluate the link "strength" by taking into account not only the internal coherence of d*_ik but also its external coherence, computed as its coherence with respect to d_k. We can interpret this "strength" as the "probability" of jumping from d_i to d_k by following R(d*_ik, d_k).

In section 3 we propose an algorithm for evaluating the block importance r_il aimed at improving web page classification accuracy, while in section 4 we propose a method for estimating the link strength R(d*_ik, d_k) for page ranking purposes.
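As a concrete illustration of this representation, the following minimal Python sketch models a page d_i as a pair (Θ_i, Φ_i) of blocks and separators, with a per-block importance r_il and weighted block-to-document links R(d*_ik, d_k). The class and field names are our own illustrative choices, not part of the original model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical data structures: a page d_i is a pair (Theta_i, Phi_i), each block
# d_il carries an importance score r_il, and a hyperlink is a weighted relation
# R(d*_ik, d_k) from a block of d_i to a destination document d_k.

@dataclass
class Block:
    block_id: str
    terms: List[str]                 # textual content of the block
    importance: float = 0.0          # r_il, to be estimated (section 3)

@dataclass
class Separator:
    orientation: str                 # "horizontal" or "vertical"

@dataclass
class WebPage:
    doc_id: str
    blocks: List[Block] = field(default_factory=list)          # Theta_i
    separators: List[Separator] = field(default_factory=list)  # Phi_i
    # links[(block_id, dest_doc_id)] = strength of R(d*_ik, d_k), in [0, 1]
    links: Dict[Tuple[str, str], float] = field(default_factory=dict)
```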

3. WEB PAGE CLASSIFICATION USING SEMANTIC IMAGE-BLOCKS

In this section we propose an algorithm for evaluating the block importance r_il in order to improve the efficacy of information retrieval systems in terms of web document classification accuracy. This algorithm is unsupervised and not constrained to any block spatial location. The proposed approach is based on the assumption that the images of a web page are the elements which mainly attract the attention of the user, and that the text contained in the visual block in which an image is located, called an image-block, should contain significant information about the page contents. Since not all images are equally important (for example banners or logos), we need to identify and evaluate the semantic contents of an image-block. In figure 1 it is possible to see some examples of important and unimportant image-blocks: image-blocks 1 and 2 (red frame) are considered meaningful for the document, while image-block 3 (green frame), which reflects noisy information, is considered not important. Image-blocks are identified by using the VIPS algorithm [3], and the terms strictly related to them are captured, evaluated and stored into the index.

3.1 Visual block importance estimation
We identify the most informative image-blocks and their most relevant terms by using the Inverse Term Importance metric described below.

Figure 1: Semantic image-blocks on a real web page. Image-blocks 1 and 2 (red frame) are considered important, while block 3 (green frame) is not significant.

Let I be the set of document terms, J be the set of HTML documents to be processed and K_j be the set of image-blocks belonging to document d_j. The Inverse Term Importance (ITI) metric is defined as:

ITI_{ikj} = \frac{\sigma_{ikj}}{\gamma_{ij} - \sigma_{ikj}}, \quad i \in I, \; j \in J, \; k \in K_j    (1)

where σ_ikj represents the number of occurrences of term i in image-block k belonging to document j, and γ_ij is the number of occurrences of term i in document j. Note that for the sake of simplicity we use k to denote block d_jk.

The ITI metric is based on the assumption that if a subset of terms belongs to an image-block k and these terms are well distributed along the entire document, then the image-block is likely to be related to the topic described in the web page and therefore it will be highly relevant. If these terms are not frequent in the rest of the document, we may deduce that either the terms are not significant for the document topic or we have a multi-topic document. In the latter case we do not have enough information to evaluate a set of terms distributed along different topics, all having high ITI values. An ITI_ikj near zero means that term i is informative for document j; increasing values of ITI_ikj imply decreasing importance of term i with respect to document j. In this way it is possible to identify a set of discriminant terms for the analyzed document, independently of their spatial location.
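A minimal sketch, in Python, of how the ITI values of equation (1) could be computed from raw term counts. The input format (a term list for the whole document plus a term list per image-block) and the handling of terms that never occur outside their block (assigned ITI = +∞, i.e. treated as uninformative) are our own assumptions.

```python
from collections import Counter
from typing import Dict, List

def compute_iti(doc_terms: List[str],
                image_blocks: Dict[str, List[str]]) -> Dict[str, Dict[str, float]]:
    """doc_terms: all terms of document j; image_blocks: terms per image-block k.
    Returns ITI_{ikj} = sigma_{ikj} / (gamma_{ij} - sigma_{ikj}) per block and term."""
    gamma = Counter(doc_terms)                          # gamma_{ij}
    iti: Dict[str, Dict[str, float]] = {}
    for k, terms in image_blocks.items():
        sigma = Counter(terms)                          # sigma_{ikj}
        iti[k] = {}
        for term, s in sigma.items():
            rest = gamma[term] - s                      # occurrences outside block k
            # assumption: a term that never occurs outside its block is treated
            # as uninformative for document j (ITI = +inf)
            iti[k][term] = s / rest if rest > 0 else float("inf")
    return iti

# toy document: "jazz" is spread over the page, "banner" occurs only in block2
doc = ["jazz", "festival", "milan", "jazz", "jazz", "festival", "tickets", "banner"]
blocks = {"block1": ["jazz", "festival"], "block2": ["banner"]}
print(compute_iti(doc, blocks))
```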

Given the proposed ITI metric, we need to handle two simple exceptions: (1) if a term is considered important but it is located in an insignificant block, its ITI value should be increased, taking into account the importance of the image-block in which it is located; (2) in the opposite case, if a term is considered unimportant but it is located in a significant block, its ITI value should be decreased.

In order to evaluate term importance more precisely, we need to evaluate each image-block importance and to smooth the ITI values with respect to the importance of the image-block in which each word is located, leading to the smoothed metric \widetilde{ITI}. The main idea of this smoothing procedure is that an image-block's importance depends not only on its terms, but also on its possible relations with other image-blocks. In fact, there might be cases in which a term t_i is considered important through its ITI value because it appears frequently in the document, but only in some unimportant image-blocks. If we do not take into account these unimportant image-blocks, the ITI values may decrease. We therefore propose an iterative procedure that eliminates unimportant image-blocks and computes the \widetilde{ITI} values using only the important ones.


At each iteration the importance r_jk is computed and filtered in order to eliminate incoherent blocks. This is done by discarding all blocks k with r_jk < η, where η is a threshold parameter that has been determined experimentally (for more details see section 5). Recalling that K_j is the set of all image-blocks contained in document d_j, the smoothing procedure can be synthesized by Algorithm 1:

Algorithm 1 Iterative Block Importance (set K_j, ITI threshold η)

1: Set h = 0, \bar{K}_j^0 = ∅, K_j^0 = K_j
2: r_{jk}^h = \left( \frac{1}{|K_j^h|} \sum_{i \in I} ITI_{ikj} \right)^{-1} for each k ∈ K_j^h
3: \bar{K}_j^h = \{ k ∈ K_j^h : r_{jk}^h < η \}
4: if \bar{K}_j^h ≠ ∅ then
5:     K_j^{h+1} = K_j^h \setminus \bar{K}_j^h and return to step 2
6: else
7:     \widetilde{ITI}_{ikj} = ITI_{ikj} for k ∈ K_j^h, and \widetilde{ITI}_{ikj} = +∞ for k ∈ K_j \setminus K_j^h
8: end if

Following this process (computed offline during the document indexing phase), we evaluate for each image-block k its importance r_jk, which represents how coherent an image-block is with respect to the document in which it is located. Since the r_jk coefficient is computed as the inverse of the arithmetic mean of the ITI coefficients, an image-block is considered important when its ITI values are near zero. If the image-block k is considered unimportant, we delete k and update the importance index of the remaining image-blocks. The process reported in Algorithm 1 is iterated until convergence is reached, i.e. K_j^{h+1} = K_j^h. This final set will contain those image-blocks that most probably are related to the document topic.

With respect to its computational complexity, this method can be synthesized by three nested iterative cycles: the first one is over all documents d_j, the second one is over the image-block set K_j in each document d_j, and the last one includes the computation of the ITI and r_jk coefficients for all the terms and blocks.

Let K* be the maximum number of image-blocks contained in a document and |W| be the number of terms in the vocabulary of the document collection D; then the time required by the Iterative Block Importance method is O(|W| · |K*| · |D|). Further visual block evaluation techniques have been investigated in [1].
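The sketch below implements Algorithm 1 under our reading of the listing: r_jk is the inverse of the averaged ITI values of block k, blocks with r_jk < η are discarded, and the surviving blocks keep their ITI values while the discarded ones receive +∞. Function and variable names are illustrative, and the degenerate case in which every block is discarded simply stops the iteration, a choice the paper does not specify.

```python
from typing import Dict

def iterative_block_importance(iti: Dict[str, Dict[str, float]],
                               eta: float) -> Dict[str, Dict[str, float]]:
    """iti[k][term] holds the ITI_{ikj} values of document j (see compute_iti above).
    Returns the smoothed ITI values, with +inf for all terms of discarded blocks."""
    kept = set(iti)                                    # K_j^0 = K_j
    while kept:
        # step 2: r_jk as the inverse of the ITI values of block k summed over its
        # terms and divided by the current number of blocks |K_j^h|
        r = {}
        for k in kept:
            mean = sum(iti[k].values()) / len(kept)
            r[k] = 1.0 / mean if mean > 0 else float("inf")
        # step 3: blocks falling below the threshold are considered incoherent
        discarded = {k for k in kept if r[k] < eta}
        if not discarded:                              # step 4: convergence reached
            break
        kept -= discarded                              # step 5: drop them and iterate
    # step 7: keep ITI for surviving blocks, assign +inf to the others
    return {k: (dict(iti[k]) if k in kept else {t: float("inf") for t in iti[k]})
            for k in iti}
```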

3.2 Term Scoring
Given a document j, we consider only those terms in I that are characterized by \widetilde{ITI}_{ikj}^{-1} > η. We then quantify the importance of a term using the Term Importance (TI) measure, defined as

TI(t_i, d_j) = \max_{k \in K_j} \left\{ \widetilde{ITI}_{ikj}^{-1} \right\}    (2)

The TI coefficients related to important terms are stored into the document index for the subsequent retrieval and classification phases.

In particular, when a user submits a query about a topic of interest, a set of documents Q is retrieved from D and processed in order to perform a classification task taking into account the Term Importance coefficients related to images. The document set Q is mapped into a matrix M = [m_ij], where each row of M represents a document d_j, following the Vector Space Model [4]:

\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{|V|j})    (3)

where V ⊆ W is the set of terms contained in the document set Q and w_ij is the weight of the i-th term in the j-th document.

Usually this weight is computed by using the TFxIDF approach,presented in [5], as follows:

w_{ij} = TF(t_i, d_j) \times IDF(t_i), \quad i = 1, \ldots, |V|, \; j = 1, \ldots, |Q|    (4)

where TF(t_i, d_j) is the Term Frequency, i.e. the number of occurrences of term t_i in d_j, and IDF(t_i) is the Inverse Document Frequency. The IDF(t_i) factor, which enhances terms that appear in few documents and downgrades terms occurring in many documents, is defined as

IDF(t_i) = \log \left( \frac{|Q|}{DF(t_i)} \right), \quad i = 1, \ldots, |V|    (5)

where DF(t_i) is the number of documents containing the i-th term.

In our approach the weight w_ij is defined by introducing a refined version of TF(t_i, d_j), called Image Weighted Term Frequency (IWTF), which takes into account important image-block terms that are likely to describe the document content. The IWTF is defined as follows:

IWTF(t_i, d_j) = TF(t_i, d_j) + TI(t_i, d_j)    (6)

The basic idea is to augment TF(t_i, d_j) with a "block sensitive" term scoring TI(t_i, d_j). The IDF factor, described in (5), is maintained in order to evaluate the usefulness of a word over the entire collection Q. The weight of term i in document j is then defined as:

w_{ij} = IWTF(t_i, d_j) \times IDF(t_i)    (7)

Note that if a document does not contain images the IWTF reducesto the traditional TF measure.
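A sketch of the resulting weighting scheme of equations (5)-(7): the term frequency is augmented by the block-sensitive TI score and multiplied by the IDF factor. The TI values are assumed to be precomputed (e.g. from the smoothed ITI values) and passed in as a per-document dictionary; the input format is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def iwtf_idf(docs: Dict[str, List[str]],
             ti: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """docs: doc_id -> list of terms; ti[doc_id][term] = TI(t_i, d_j), 0 if absent.
    Returns the weights w_ij = (TF(t_i, d_j) + TI(t_i, d_j)) * IDF(t_i)."""
    n_docs = len(docs)
    df = Counter()                       # DF(t_i): documents containing the term
    for terms in docs.values():
        df.update(set(terms))
    weights: Dict[str, Dict[str, float]] = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)              # TF(t_i, d_j)
        weights[doc_id] = {}
        for term, freq in tf.items():
            idf = math.log(n_docs / df[term])                  # equation (5)
            iwtf = freq + ti.get(doc_id, {}).get(term, 0.0)    # equation (6)
            weights[doc_id][term] = iwtf * idf                 # equation (7)
    return weights
```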

4. SEARCH ENGINE PAGE RANKING
In this section we specialize the block relational model in order to perform page ranking through the estimation of R(d*_ik, d_k) introduced in section 2. In the literature we can find several page ranking algorithms that assume a binary relational representation of links, i.e. R(d*_ik, d_k) between a block d*_ik and a document d_k assumes the value 0 or 1.

The most important contributions have been provided by [10] and [11]. The Hits algorithm proposed in [10] relies on query-time processing to deduce hub and authority pages: a good hub points to many good authorities, and a good authority is pointed to by many good hubs. The PageRank algorithm discussed in [11] pre-computes a ranking vector that provides a-priori "importance" estimates for all of the pages on the Web. This vector is computed once, offline, and is independent of the search query. At query time, these importance scores are used in conjunction with query-specific IR scores to rank the query results.

In the real world the assumption of a binary valued R(d*_ik, d_k) is not verified. Indeed, since a link from a web page A to a web page B can be viewed as a reference in terms of recommendation, links include highly semantic clues, even if we have to take into account some exceptions: (a) some web pages are simply lists of hyperlinks and contain no direct information themselves; (b) links could be noisy: some links lead to related documents, but others do not; (c) two web pages could be linked because they share a part of the same macro-topic, but do not describe the same argument; (d) a link between two web pages could be meaningful only for a sub-part of the involved pages. These remarks lead us to consider a hyperlink R(d*_ik, d_k) as a probabilistic link, instead of a boolean link, during the page ranking computation. In [24] the authors estimate R(d*_ik, d_k) by using spatial cues about the link.


In order to guarantee independence with respect to the spatial location of links (links can be followed by a user independently of their location), we instead evaluate R(d*_ik, d_k) by using semantic information. In this setting, in which R(d*_ik, d_k) can be represented by a probabilistic connection, we propose an evaluation metric named Jumping Probability (JP).

Intuitively, we can think of a probabilistic link as the HTML element that a web surfer chooses to follow during browsing: a user on a web page d_i of interest will be motivated to jump forward to d_k following an existing link R(d*_ik, d_k) if the semantics of d*_ik is coherent with both the origin d_i and the destination d_k. For this purpose we assign to R(d*_ik, d_k) a measure which represents the probability of jumping from d*_ik to d_k, estimated as the degree of textual coherence that the origin page d_i and the destination page d_k share through the block d*_ik.

For this purpose, we need (a) to identify which semantic portion of a web page contains a probabilistic link and (b) to evaluate its coherence w.r.t. both the origin and the destination pages. In particular, (a) aims at partitioning an HTML document, following the idea that it can be viewed not as an atomic unit but rather as a granular composition of blocks. The block relational model helps us to identify the semantic area (block) in which a hyperlink is embedded, allowing a subsequent estimation of its internal and external coherence. For this purpose we propose two procedures, namely Global Coherence Estimation and Link-Block Coherence Estimation, for the evaluation of the jumping probability.

The effectiveness of these procedures has been validated by using the Hits algorithm on a transition matrix characterized by the jumping probability, instead of the traditional adjacency matrix with 0 or 1 entries.

4.1 Global Coherence Estimation
The proposed link coherence estimation is based on the idea that a visual block that contains a hyperlink, called a link-block, can share its semantic content with the other blocks located in the origin and destination pages. A simple representation is depicted in figure 2, where the red frame indicates the link-block while the yellow frames indicate the remaining ones.

Figure 2: Representation of a link-block

Thanks to the visual layout analysis and the block relational model, the jumping probability (JP) can be computed by estimating how many blocks, belonging either to the origin or the destination page, are coherent w.r.t. the link-block located in the origin document.

Recalling that an origin page d_i and a destination page d_k can be represented by Θ_i = {d_i1, d_i2, ..., d_in} and Θ_k = {d_k1, d_k2, ..., d_kn} respectively, given a link-block d*_ik we define

\Theta_s = \{\Theta_k \cup \Theta_i\} \setminus \{d^*_{ik}\}    (8)

A first attempt at a coherence index is given by the Global Coherence between d*_ik ∈ Θ_i and d_sr ∈ Θ_s:

GC(d^*_{ik}, d_{sr}) = \frac{1}{|d^*_{ik}|} \sum_{j=1}^{|d^*_{ik}|} \frac{\sigma_{srj}}{\sigma_{ikj}}    (9)

where σ_ikj represents the number of occurrences of term t_j in d*_ik, σ_srj is the number of occurrences of t_j in d_sr ∈ Θ_s, and |d*_ik| is the number of terms belonging to d*_ik. If GC(d*_ik, d_sr) is greater than a given threshold ε, the block d_sr is considered coherent w.r.t. the link-block d*_ik. Since the Global Coherence is the arithmetic mean of the occurrences of the link-block terms t_j in d_sr relative to their occurrences in d*_ik, we consider a link-block d*_ik coherent to the block d_sr if GC(d*_ik, d_sr) ≥ ε = 0.5.

Now, let Θ*_s = {d_sr ∈ Θ_s : GC(d*_ik, d_sr) ≥ ε} be the set of blocks coherent with the link-block d*_ik. We can assign to d*_ik a link strength value:

R(d^*_{ik}, d_k) = \frac{|\Theta^*_s|}{|\Theta_s|}    (10)

Therefore the jumping probability from page d_i to page d_k can be computed as:

JP(i, k) = \alpha \sum_{d^*_{ik} \in \Theta_i} R(d^*_{ik}, d_k)    (11)

where α is a normalizing coefficient.

Unfortunately, JP(i, k) is strongly dependent on the block dimensions. This means that if the representation of the origin and destination pages is too fine-grained, we generate a great number of small blocks, thus producing a reduced set Θ*_s of blocks which are coherent with the link-block d*_ik. Consequently, the jumping probability computed as in (11) underestimates the number of coherent blocks w.r.t. their total number. In order to overcome this problem we propose the following alternative method.
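A sketch of the Global Coherence estimation of equations (9) and (10) for a single link-block, assuming blocks are given as flat term lists. The exclusion of the link-block by object identity and the treatment of empty inputs are our own choices, and the normalization α of equation (11) is left to the caller.

```python
from collections import Counter
from typing import List

def global_coherence(link_block: List[str], other_block: List[str]) -> float:
    """GC(d*_ik, d_sr), equation (9): mean over the link-block terms of the
    ratio sigma_srj / sigma_ikj between their occurrences in the two blocks."""
    if not link_block:
        return 0.0                      # assumption: empty link-block is incoherent
    sigma_ik = Counter(link_block)
    sigma_sr = Counter(other_block)
    return sum(sigma_sr[t] / sigma_ik[t] for t in link_block) / len(link_block)

def link_strength_gc(link_block: List[str],
                     origin_blocks: List[List[str]],
                     dest_blocks: List[List[str]],
                     eps: float = 0.5) -> float:
    """R(d*_ik, d_k), equation (10): fraction of the blocks of the origin and
    destination pages (excluding the link-block itself) coherent with it."""
    # the link-block is assumed to be one of the origin page's block objects,
    # so it is excluded by identity
    theta_s = [b for b in origin_blocks + dest_blocks if b is not link_block]
    if not theta_s:
        return 0.0
    coherent = [b for b in theta_s if global_coherence(link_block, b) >= eps]
    return len(coherent) / len(theta_s)
```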

4.2 Link-Block Coherence Estimation
An alternative jumping probability measurement can be obtained by estimating the coherence of the link-block w.r.t. the origin page (Internal Coherence) and the destination page (External Coherence), and combining the respective results.

Given an origin web page d_i = (Θ_i, Φ_i) belonging to a document collection Q, let Θ*_i ⊆ Θ_i be the set of visual blocks of d_i containing hyperlinks, i.e. Θ*_i = {d*_ik | ∃ a link (i, k) between d_i and d_k}. Given the set of terms T = {t_j ∈ d*_ik}, we define the Internal Coherence (IC) as:

IC(d^*_{ik}, d_i) = \begin{cases} 1 & \text{if } |\Theta^*_i| = 1 \\ \frac{1}{|d^*_{ik}|} \sum_{j=1}^{|d^*_{ik}|} \frac{\gamma_{ij} - \sigma_{ijk}}{\sigma_{ijk} \times \gamma_{ij}} & \text{otherwise} \end{cases}    (12)

where σ_ijk represents the number of occurrences of term t_j in d*_ik, γ_ij is the number of occurrences of term t_j in d_i, and |d*_ik| is the number of terms belonging to the hyperlink semantic area d*_ik.

The Internal Coherence, following the idea presented in [1], assumes that if a subset of terms belongs to a link-block d*_ik and these terms are well distributed along the entire document, then d*_ik is likely to be coherent with the document d_i. If these terms are not frequent in the rest of the document, we may deduce that either d*_ik is not related to the document topic or we have a multi-topic document.


On the other hand, if a web page d_i consists of a single visual block that corresponds to a hyperlink semantic area d*_ik, we can state that d*_ik is coherent with itself and consequently with d_i. This leads us to set IC = 1.

The External Coherence can be estimated by considering the average relative frequency of the terms t_j ∈ d*_ik, where the relative frequency of a term t_j is computed as the ratio between the frequency of t_j in d_k and the frequency of the most frequent term of d_k.

Given an origin web page d_i = (Θ_i, Φ_i) and a destination web page d_k belonging to a document collection Q, let d*_ik ∈ Θ*_i be the visual block containing an outgoing hyperlink from d_i to d_k. Given the set of terms T = {t_j ∈ d*_ik}, we define the External Coherence (EC) as:

EC(d^*_{ik}, d_k) = \frac{1}{|d^*_{ik}| \, \rho_k} \sum_{j=1}^{|d^*_{ik}|} \sigma_{ijk}    (13)

where σ_ijk represents the number of occurrences of term t_j in the visual block d*_ik, ρ_k is the number of occurrences of the most frequent term in d_k, and |d*_ik| is the number of terms belonging to the hyperlink semantic area d*_ik.

Following this simple estimation, if a subset of terms belongs to a link-block d*_ik and these terms are important in the destination document d_k, then we can say that d*_ik is likely to be coherent with d_k.

Since we have to take into account the coherence of a link w.r.t. both the origin and the destination page, we define its strength as the arithmetic mean of the Internal and External Coherence presented in (12) and (13) respectively:

R(d^*_{ik}, d_k) = \frac{IC(d^*_{ik}, d_i) + EC(d^*_{ik}, d_k)}{2}    (14)

Therefore the jumping probability from d_i to d_k can be obtained by substituting (14) into (11).
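A sketch of the Internal and External Coherence of equations (12)-(14), again over flat term lists. It assumes that the origin-page term list includes the text of the link-block (so that γ_ij ≥ σ_ijk) and that the number of link-blocks |Θ*_i| is supplied by the caller.

```python
from collections import Counter
from typing import List

def internal_coherence(link_block: List[str], origin_terms: List[str],
                       n_link_blocks: int) -> float:
    """IC(d*_ik, d_i), equation (12): 1 if the origin page has a single link-block,
    otherwise the mean of (gamma_ij - sigma_ijk) / (sigma_ijk * gamma_ij)."""
    if n_link_blocks == 1:
        return 1.0
    if not link_block:
        return 0.0                 # assumption: empty link-block contributes nothing
    gamma = Counter(origin_terms)  # occurrences over the whole origin page
    sigma = Counter(link_block)    # occurrences inside the link-block
    vals = [(gamma[t] - sigma[t]) / (sigma[t] * gamma[t])
            for t in link_block if gamma[t] > 0]
    return sum(vals) / len(link_block)

def external_coherence(link_block: List[str], dest_terms: List[str]) -> float:
    """EC(d*_ik, d_k), equation (13): averaged link-block term occurrences,
    normalized by the most frequent term of the destination page (rho_k)."""
    if not link_block or not dest_terms:
        return 0.0
    sigma = Counter(link_block)
    rho = Counter(dest_terms).most_common(1)[0][1]
    return sum(sigma[t] for t in link_block) / (len(link_block) * rho)

def link_strength_ic_ec(link_block: List[str], origin_terms: List[str],
                        dest_terms: List[str], n_link_blocks: int) -> float:
    """R(d*_ik, d_k), equation (14): arithmetic mean of IC and EC."""
    return 0.5 * (internal_coherence(link_block, origin_terms, n_link_blocks)
                  + external_coherence(link_block, dest_terms))
```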

The jumping probability JP(i, k), computed by plugging either (10) or (14) into (11), can be used to define the transition matrix J employed in the Hits ranking algorithm, as described in section 5.2.

5. EXPERIMENTAL RESULTS
In order to evaluate the impact that a granular web page representation could have on IR systems, we investigated the performance of the proposed metrics on a real dataset. We selected about 10000 web pages from popular sites listed in 5 categories of the Yahoo! Directory (http://dir.yahoo.com/). See table 1.

Category               # of documents   # of images
Art & Humanities       3280             31660
Science                2810             32771
Health                 1740             15449
Recreation & Sports    1250             8243
Society & Culture      960              7100

Table 1: Dataset features

The two proposed case studies have been evaluated following two different preprocessing activities and performance evaluation procedures.

5.1 Web Page Classification
Given the document collection Q synthesized by the dataset presented in table 1, every HTML page has been processed in order to extract image-blocks and to compute the TI values. Before submitting the indexed documents to the classification phase we performed a feature selection procedure based on the Term Frequency Variance index presented in [6] and given by:

q(t_j) = \sum_{i=1}^{n_1} f_{ij}^2 - \frac{1}{n_1} \left[ \sum_{i=1}^{n_1} f_{ij} \right]^2    (15)

where n_1 is the number of documents in Q containing t_j at least once and f_ij is the frequency of term t_j in document d_i. We conducted experiments by varying the vocabulary dimension T from 50 to 1000, selecting the set of terms with the highest Term Frequency Variance index.
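A small sketch of the Term Frequency Variance criterion of equation (15), used to rank candidate terms and keep the top ones; the input format (one term-frequency dictionary per document) is a hypothetical choice.

```python
from typing import Dict, List

def term_frequency_variance(doc_term_freqs: List[Dict[str, int]], term: str) -> float:
    """q(t_j), equation (15), computed over the n_1 documents containing the term."""
    freqs = [f[term] for f in doc_term_freqs if f.get(term, 0) > 0]
    n1 = len(freqs)
    if n1 == 0:
        return 0.0
    return sum(f * f for f in freqs) - (sum(freqs) ** 2) / n1

def select_vocabulary(doc_term_freqs: List[Dict[str, int]], size: int) -> List[str]:
    """Keep the `size` terms with the highest Term Frequency Variance index."""
    terms = {t for f in doc_term_freqs for t in f}
    return sorted(terms,
                  key=lambda t: term_frequency_variance(doc_term_freqs, t),
                  reverse=True)[:size]
```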

The proposed IWTFxIDF is compared to the traditional TFxIDF scoring function by evaluating the performance produced by the following learning algorithms: Naive Bayes [7], Decision Tree [14] (no pruning is performed), k-Nearest Neighbor [15] (with k = 5) and Support Vector Machines [8] (with complexity parameter C = 1.0 and a linear kernel). These benchmark algorithms are obtained by using the Weka API (http://www.cs.waikato.ac.nz/ml/weka/) [9].

The classification performance obtained by using the traditional and the proposed scoring functions is measured by the widely adopted F-Measure metric, which combines the Precision and Recall measures typical of Information Retrieval. Given a set of class labels C, corresponding to the categories in table 1, we compute Precision (P) and Recall (R) for the class label α ∈ C as:

P(α) = (# of documents successfully classified as belonging to α) / (# of documents classified as belonging to α)    (16)

R(α) = (# of documents successfully classified as belonging to α) / (# of documents belonging to α)    (17)

The F-Measure for each class α ∈ C is computed as the harmonic mean of Precision and Recall:

F(\alpha) = \frac{(1 + \beta^2) \cdot P(\alpha) \cdot R(\alpha)}{\beta^2 \cdot P(\alpha) + R(\alpha)}    (18)

where β measures the trade-off between Precision and Recall. In order to weight Precision and Recall equally, we balanced the F-Measure by using β = 1. The overall quality of the classification is given by a scalar F* computed as the weighted sum of the F-Measure values taken over all the classes α ∈ C:

F^* = \sum_{\alpha \in C} \frac{|\alpha|}{|Q|} F(\alpha)    (19)
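For completeness, a compact sketch of the per-class Precision, Recall and F-Measure of equations (16)-(18), aggregated into the weighted F* of equation (19) with β = 1 by default.

```python
from typing import List

def weighted_f_measure(true_labels: List[str], predicted: List[str],
                       beta: float = 1.0) -> float:
    """F* = sum over classes of (|alpha| / |Q|) * F(alpha), equations (16)-(19)."""
    total = len(true_labels)
    f_star = 0.0
    for alpha in set(true_labels):
        tp = sum(1 for t, p in zip(true_labels, predicted) if t == alpha and p == alpha)
        n_predicted = sum(1 for p in predicted if p == alpha)
        n_actual = sum(1 for t in true_labels if t == alpha)
        precision = tp / n_predicted if n_predicted else 0.0      # equation (16)
        recall = tp / n_actual if n_actual else 0.0               # equation (17)
        if precision + recall > 0:                                # equation (18)
            f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
        else:
            f = 0.0
        f_star += (n_actual / total) * f                          # equation (19)
    return f_star
```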

The performance, measured by the scalar F*, has been obtained for each learning algorithm using 10-fold cross validation as the testing method. In figure 3 we compare, for each learning algorithm, the F-Measure values obtained by varying the number of terms considered descriptive of the retrieved document collection Q.

During the experimental phase we also studied the performance of our scoring technique with different η values. In particular, during the indexing phase, we considered a term (or an image-block) as important for 0 < η ≤ 1. Although all of these threshold values produced better performance for IWTFxIDF than for TFxIDF, for conciseness we report only the performance obtained with the threshold η = 0.6, which produced the most stable results in terms of F-Measure.

The results show that our approach is able to identify importantimage-blocks contained in a web page in order to recognize thoseterms that are more informative and consequently to improve clas-sification accuracy of some traditional learning algorithms.

We believe that the main reason for the success of our approach is that images are among the most important elements presented in a web page.


Figure 3: Learning algorithms performance comparison (panels (a)-(d))

In order to support the intuition at the basis of our approach, we provide some simple statistics. Consider the set of documents used during the experimental phase: only 210 documents are plain text, while more than half of the documents contain between 1 and 5 images. Moreover, our approach identifies more than half of the image-blocks as related to the topic of the documents in which they are located, and these are therefore considered important.

Finally, given the identified important image-blocks, the terms belonging to them are re-weighted in order to increase their importance during the retrieval phase. In this way the feature selection policy is driven to select those attributes with better discriminative power.

5.2 Search Engine Page Ranking
The page ranking activity is performed on a sub-sample of documents belonging to the dataset presented in table 1. In particular, we use 10 sample queries, listed in table 2, to extract and rank a subset of documents.

Cubism            Protein
Freud             Religion
Computer Memory   Zeus
Milky Way         Gene
Monet             Surgery

Table 2: Queries used

Queries are submitted to the local index that contains the document dataset in order to retrieve a small set of relevant documents, called the Root Set and containing 200 web pages. Starting from this list, the Hits algorithm performs two main steps: (a) a sampling step to obtain an enlarged set of documents related to the Root Set and (b) a weight-propagation step to compute, by using the adjacency matrix, hub and authority coefficients for the pages in the enlarged Root Set. Step (a) is aimed at enlarging the Root Set with pointing and pointed-to pages in order to obtain a richer set of documents, called the Base Set, while step (b) aims at evaluating hub and authority weights through the following mutual reinforcement reasoning: if a web page is pointed to by many good hubs, then its authority weight is increased; if a web page points to many good authorities, then its hub weight is increased.

According to [10], given an adjacency matrix A related to a set of documents, the page ranks are calculated as an authority weight a and a hub weight h as follows:

a \leftarrow A^T h,    h \leftarrow A a    (20)

In our experiment, after the Root Set and the Base Set are constructed, we submit to the weight-propagation step a transition matrix J characterized by the jumping probability, instead of the traditional adjacency matrix A. This leads us to define a refined Hits algorithm, called wHits, in which the authority and hub weights are computed as:

a \leftarrow J^T h,    h \leftarrow J a    (21)

In the original Hits algorithm, the hub and authority weight computation converges to the principal eigenvectors of A^T A and A A^T, i.e. those corresponding to the dominant eigenvalue. In our wHits computation we need to manipulate J in order to ensure convergence to the principal eigenvectors of J^T J and J J^T. In particular, we "augment" the transition matrix J in order to make it stochastic, irreducible and aperiodic. To satisfy these requirements, we substituted all the zero entries with a small jumping probability and then normalized each row i so that \sum_k J_{ik} = 1.
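A sketch of the wHits weight propagation of equation (21) on the augmented transition matrix: zero entries are replaced by a small jumping probability, each row is normalized to sum to one, and the authority and hub vectors are updated (and renormalized, as is customary in Hits-style iterations) until approximate convergence. The minimum jumping probability, tolerance and iteration cap are our own illustrative choices.

```python
import numpy as np

def whits(jp: np.ndarray, min_jump: float = 1e-3,
          tol: float = 1e-4, max_iter: int = 100):
    """jp: square matrix of jumping probabilities JP(i, k), zero where no link.
    Returns the (authority, hub) score vectors computed as a <- J^T h, h <- J a."""
    # augment the matrix: replace zero entries with a small jumping probability ...
    j = np.where(jp > 0, jp, min_jump).astype(float)
    # ... and normalize each row so that sum_k J_ik = 1 (row-stochastic)
    j = j / j.sum(axis=1, keepdims=True)

    n = j.shape[0]
    a = np.ones(n) / n          # authority weights
    h = np.ones(n) / n          # hub weights
    for _ in range(max_iter):
        a_new = j.T @ h         # a <- J^T h
        h_new = j @ a_new       # h <- J a
        a_new /= np.linalg.norm(a_new)
        h_new /= np.linalg.norm(h_new)
        converged = (np.linalg.norm(a_new - a) < tol
                     and np.linalg.norm(h_new - h) < tol)
        a, h = a_new, h_new
        if converged:
            break
    return a, h
```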

During the experimental phase, we compared the performance obtained by our wHits algorithm using two variants, J_1 and J_2, of the augmented transition matrix J: the first one is built by using equation (10), while the second one is obtained by using equation (14).

To compare our wHits to the ordinary Hits, we conducted a user study. For each query we asked 50 volunteers for two kinds of judgments, both for wHits and Hits, in order to estimate the well-known


retrieval quality measures named Precision at top 10 (P@10) and Normalized Discounted Cumulative Gain at top 10 (NDCG@10). For the P@10 metric, as in [12] and [13], volunteers were asked to indicate the URLs which were "relevant" for the given queries. Given the volunteers' "relevance" judgments, we compute the precision of the top 10 URLs as follows:

P@10 = (# of relevant documents retrieved) / (# of retrieved documents)    (22)

In our experiment a URL has been considered relevant if at least 35 of the 50 volunteers selected it as relevant for a given query.

After this preliminary evaluation, volunteers were asked to associate a relevance degree with the top n URLs in order to compute the Normalized Discounted Cumulative Gain. This metric aims at measuring the "gain" of a user seeing the top n documents. Given a rank-ordered document list L = <l_1, ..., l_n> returned as the result for query q, and G[i] the relevance degree (0 = worst, ..., 5 = best) associated with the document at position i in the list L, the Normalized Discounted Cumulative Gain at the top 10 URLs is defined by [23] as follows:

NDCG@10 = Z \left[ G[1] + \sum_{i=2}^{n} \frac{G[i]}{\log_2 i} \right]    (23)

where Z is the normalizing constant chosen so that a perfect order-ing of the results for the query q will receive the score of one.
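A compact sketch of P@10 and NDCG@10 as defined in equations (22) and (23), where the normalization Z is obtained by dividing by the gain of a perfectly ordered result list, so that a perfect ordering scores one.

```python
import math
from typing import List

def precision_at_10(relevant: List[bool]) -> float:
    """P@10, equation (22): fraction of the top-10 retrieved URLs judged relevant."""
    top = relevant[:10]
    return sum(top) / len(top) if top else 0.0

def dcg(gains: List[float]) -> float:
    """G[1] + sum_{i>=2} G[i] / log2(i), with 1-based positions (equation 23)."""
    return sum(g if i == 1 else g / math.log2(i) for i, g in enumerate(gains, start=1))

def ndcg_at_10(gains: List[float]) -> float:
    """NDCG@10 = Z * DCG over the top 10, with Z such that a perfect ordering scores 1."""
    ideal = dcg(sorted(gains, reverse=True)[:10])
    return dcg(gains[:10]) / ideal if ideal > 0 else 0.0

# example: graded relevance judgments (0 = worst, ..., 5 = best) of the top 10 URLs
print(ndcg_at_10([5, 3, 0, 4, 2, 0, 1, 0, 0, 3]))
```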

P@10 and NDCG@10 of the ranking techniques for each testquery are shown respectively in figures 4 and 5.

Figure 4: P@10

The comparison highlights that, over all the queries, the best results are obtained by the wHits algorithm characterized by the J_2 matrix, which always outperforms the other two methods. A possible explanation of the poor performance of wHits-J_1 has been discussed in section 4.1.

An interesting remark concerns the performance comparison between the wHits-J_2 algorithm and the traditional Hits methodology. The results show that the jumping probability computed with (14), as a compromise between Internal and External Coherence, is able to capture the strength of the connection between two linked web pages. This means that the Internal Coherence and External Coherence metrics are well suited to estimating the relationship between the link-block and the textual content of the origin and destination pages.

Figure 5: NDCG@10

6. CONCLUSIONS AND FUTURE WORK
In this paper we present the impact that a block relational model could have on IR systems. In particular, we show that by using this representation we can define new metrics for block importance estimation, in order to distinguish the most important content of a document from the noisy one. This information can be used not only for improving classification but also for enhancing page ranking performance by measuring the strength of the relationship between hypertextual documents.

With respect to future work, a first contribution will regard the influence that semantic image interpretation can have during the term scoring phase and consequently during the classification task. Even though we did not include an image interpretation approach in our analysis, evaluating only the text surrounding an image, we plan to introduce additional models related to image understanding in order to improve the precision of term evaluation. [21] and [22] are interesting approaches able to analyze and understand image semantics through hierarchical models. In this direction we aim at combining image and textual analysis, in order to extract a more precise and meaningful semantics related to important image-blocks.
A second future contribution concerns the study of the proposed refined Hits algorithm on several benchmark datasets, for example [17] and [18], and the performance comparison with other similar approaches. An interesting comparative study could be done with respect to [19], [20] and [24].

7. REFERENCES
[1] Fersini, E., Messina, E. & Archetti, F. (2008). Enhancing web page classification through image-block importance analysis. Information Processing and Management, 44(4), pp. 1431-1447.
[2] Kovacevic, M., Diligenti, M., Gori, M. & Milutinovic, V. M. (2002). Recognition of common areas in a web page using visual information: a possible application in a page classification. In Proceedings of the 2002 IEEE International Conference on Data Mining, (pp. 250-257). Washington: IEEE Computer Society.
[3] Cai, D., Yu, S., Wen, J.-R. & Ma, W.-Y. Extracting content structure for web pages based on visual representation. In Zhou, X., Zhang, Y., Orlowska, M. E. (Eds.), Proceedings of the Pacific Web Conference, (pp. 406-417).
[4] Salton, G., Wong, A. & Yang, C. S. A vector space model for automatic indexing. Communications of the ACM, 18(11), pp. 613-620.
[5] Salton, G. & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5), pp. 513-523.
[6] Nicholas, C., Dhillon, I. & Kogan, J. (2003). Feature selection and document clustering. In Berry, M. W. (Ed.), A Comprehensive Survey of Text Mining. Springer-Verlag.
[7] John, G. H. & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. In Besnard, P., Hanks, S. (Eds.), Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, (pp. 338-345). San Francisco: Morgan Kaufmann.
[8] Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization. In Schölkopf, B., Burges, C. J. C. & Smola, A. J. (Eds.), Advances in Kernel Methods: Support Vector Learning, (pp. 185-208). Cambridge: MIT Press.
[9] Witten, I. H. & Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Francisco: Morgan Kaufmann.
[10] Kleinberg, J. (1998). Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, (pp. 668-677). New York: ACM Press.
[11] Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7), pp. 107-117.
[12] Haveliwala, T. H. (2002). Topic-sensitive PageRank. In Proceedings of the Eleventh International World Wide Web Conference, (pp. 517-526).
[13] Borodin, A., Roberts, G. O., Rosenthal, J. S. & Tsaparas, P. (2005). Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology, 5(1), pp. 231-297.
[14] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann.
[15] Aha, D. W., Kibler, D. & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6(1), pp. 37-66.
[16] Song, R., Liu, H., Wen, J.-R. & Ma, W.-Y. (2004). Learning block importance models for web pages. In Feldman, S. I., Uretsky, M., Najork, M., Wills, C. E. (Eds.), Proceedings of the 13th International Conference on World Wide Web, (pp. 203-211). New York: ACM Press.
[17] Hirai, J., Raghavan, S., Paepcke, A. & Garcia-Molina, H. (2000). WebBase: a repository of Web pages. In Proceedings of the 9th International World Wide Web Conference (WWW9), Amsterdam, May 2000.
[18] Richardson, M. & Domingos, P. (2002). The intelligent surfer: probabilistic combination of link and content information in PageRank. In Dietterich, T. G., Becker, S. & Ghahramani, Z. (Eds.), Advances in Neural Information Processing Systems 14, (pp. 1441-1448). Cambridge, MA: MIT Press.
[19] Lempel, R. & Moran, S. (2000). The stochastic approach for link-structure analysis (SALSA) and the TKC effect. In Proceedings of the 9th International World Wide Web Conference. http://citeseer.ist.psu.edu/lempel00stochastic.html
[20] Cohn, D. & Chang, H. (2000). Learning to probabilistically identify authoritative documents. In Proceedings of the 17th International Conference on Machine Learning, (pp. 167-174). Stanford University.
[21] Gao, Y., Fan, J., Xue, X. & Jain, R. (2006). Automatic image annotation by incorporating feature hierarchy and boosting to scale up SVM classifiers. In Nahrstedt, K., Turk, M., Rui, Y., Klas, W., Mayer-Patel, K. (Eds.), Proceedings of the 14th Annual ACM International Conference on Multimedia, (pp. 901-910). New York: ACM Press.
[22] Li, F. & Perona, P. (2005). A Bayesian hierarchical model for learning natural scene categories. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, (pp. 524-531). San Diego: IEEE Computer Society.
[23] Jarvelin, K. & Kekalainen, J. (2002). Cumulated gain-based evaluation of IR techniques. ACM Transactions on Information Systems, 20(4), pp. 422-446. New York: ACM Press.
[24] Cai, D., He, X., Wen, J.-R. & Ma, W.-Y. (2004). Block-level link analysis. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, (pp. 440-447). New York: ACM Press.
