A Web-Page Usage Prediction Scheme Using Weighted Suffix Trees


Christos Makris, Yiannis Panagis, Evangelos Theodoridis, and Athanasios Tsakalidis

Computer Engineering and Informatics Department, University of Patras, Rio 26500, Greece

{makri,panagis,theodori,tsak}@ceid.upatras.gr

Abstract. In this paper we consider the problem of web page usage prediction in a web site by modeling users' navigation history with weighted suffix trees. The resulting prediction of a user's navigational intentions can be exploited either in an on-line recommendation system in a web site or in a web-page caching system. The proposed method has the advantage that it demands a constant amount of computational effort per user action and consumes a relatively small amount of extra memory space. These features make the method ideal for an on-line working environment. Finally, we have evaluated the proposed scheme with experiments on various web-site log files and found that its prediction quality is good and, in many cases, outperforms existing approaches.

Key words: World wide web, web mining, online web page recommendation, weighted sequences

1 Introduction

The vast size of the World Wide Web (WWW) nowadays makes it the largest database that has ever existed. At the beginning of this decade it was estimated to contain over 350 million pages [4], while more recently it has been estimated that the part of the WWW indexed by a web search engine alone consists of at least 11.3 billion pages [15]. Every attempt to shape this huge volume of data, which follows a very loose schema, is difficult and extremely challenging. According to [11], applying data mining techniques to extract the useful information that lies implicitly within web data is an essential task. Web data may be either actual web data or data describing the activity of users. Actual web data consist of the web pages, the web-page structure, the linkage structure between the web pages, the surfing (navigational) behavior of the users, and the user profiles, including demographic and registration information about the users [11, 34].

Web data mining can be divided into three general categories: web content mining, web structure mining, and web usage mining [42]. Here we focus on the latter area of web data mining, which tries to exploit the navigational traces of the users in order to extract knowledge about their preferences and their behavior. The task of modeling and predicting a user's navigational behavior on a web site or web domain can be useful in many web applications, such as web caching [26, 36], web page recommendation [9, 5], web search engines [2, 29] and personalization [12]. Consult [14] for several other applications of web usage mining.

According to [14], most web usage mining techniques are based on association rules, sequential patterns and clustering. The approaches based on association rules build a set of rules of the form page1.html, page2.html → page3.html, which means that users who have visited page1 and page2 have also visited page3. Methods that use association rules can be found in [23, 24, 20, 38]. Sequential patterns maintain navigational patterns as sequences and try to discover the frequent subsequences that best describe the data. Sequences are used either to produce association rules or to produce tree structures or Markov chains that represent navigation patterns. The methods proposed in [23, 28, 27, 22, 10, 26] fall into this general category. Lately, in [43, 30, 16] the authors proposed the use of web access motifs and string matching techniques. The idea behind these methods is that string algorithmics seem to adapt quite well to the analysis of hypertext navigation. Finally, clustering techniques have been used to group similar user sessions according to a distance function, such as the longest common subsequence [1] or sequence alignment [19]. More complex methods have also been proposed in [32, 41, 31] in order to enhance the results of the clustering step.

In this paper we propose an efficient method for modeling the user navigation history. Our method is based on several ideas from the existing literature and on a string processing structure previously used in computational biology problems: the weighted suffix tree [21]. The resulting estimate of a user's navigational intentions can be exploited either in an on-line recommendation system in a web site or in a web-page caching system. The method proposed here has the advantage that it demands a constant amount of computational effort per user action and consumes a relatively small amount of extra memory. These features make our method ideal for an on-line working environment. The structure of the paper is as follows. In Section 2 we present some basic definitions and background information. Section 3 presents the approach proposed by this paper. In Section 4 we describe the methodology followed to evaluate our method and present experimental results, while in Section 5 we conclude and discuss future directions.

2 Definitions and Background

Let Σ be a finite alphabet which consists of a set of characters (or symbols). The cardinality of an alphabet, denoted by |Σ|, expresses the number of distinct characters in the alphabet. A string or word is a sequence of zero or more characters drawn from an alphabet. The set of all non-empty words over the alphabet Σ is denoted by Σ+. A word w of length n is represented by w[1..n] = w[1]w[2] . . . w[n], where w[i] ∈ Σ for 1 ≤ i ≤ n, and n = |w| is the length of w.

A subword u of length p is said to occur at position i in the word w if u = w[i..i + p − 1]. In other words, u is a substring of length p occurring at position i in word w. In the case that, for a given position of a word w, we consider the presence of a set of characters, each with a given probability of appearance, we have the concept of a weighted sequence X.

Definition 1. A weighted sequence/word X = X[1]X[2] . . . X[n] is a sequence of positions, where each position X[i] consists of a set of ordered pairs. Each pair has the form (σ, π(σ)), where π(σ) is the probability of having the character σ at position i. For every position X[i], 1 ≤ i ≤ n, the probabilities sum to one: ∑σ π(σ) = 1.
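
As an illustration of Definition 1, a minimal sketch in Python follows; the page symbols and probabilities are hypothetical and serve only to show the representation, with one set of (character, probability) pairs per position.

    # A weighted sequence as in Definition 1: one mapping of character -> probability
    # per position; the probabilities at every position sum to 1.
    weighted_sequence = [
        {"a": 1.0},                      # position 1: page 'a' seen in every session
        {"b": 0.6, "c": 0.4},            # position 2: 60% of sessions saw 'b', 40% saw 'c'
        {"d": 0.5, "e": 0.3, "f": 0.2},  # position 3
    ]
    assert all(abs(sum(pos.values()) - 1.0) < 1e-9 for pos in weighted_sequence)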

The suffix tree is a fundamental data structure supporting a wide variety of efficient string processing algorithms. In particular, the suffix tree is well known to allow efficient and simple solutions to many problems concerning the identification and location either of a set of patterns or of repeated substrings in a given sequence. The reader can find an extended literature on such applications in [35].


The weighted suffix tree can be considered as a generalization of the ordinary suffix tree that handles weighted sequences by incorporating the notion of probability of appearance for every suffix stored in a leaf. A construction of this structure, together with a set of applications, has been proposed in [21].

Definition 2. Let X be a weighted sequence. For every suffix starting at position i we define a list of possible weighted subwords such that the probability of appearance of each one of them is greater than 1/k. We denote each of them as Xi,j, where j is the rank of the subword in an arbitrary numbering. We define WST(X), the weighted suffix tree of a weighted sequence X, as the compressed trie of a portion of all the weighted subwords starting within each suffix Xi of X$, $ not in Σ, having a probability of appearance greater than 1/k. Let L(v) denote the path-label of node v in WST(X), which results from concatenating the edge labels along the path from the root to v. Leaf v of WST(X) is labeled with index i if ∃j > 0 such that L(v) = Xi,j[i..n] and π(Xi,j[i..n]) ≥ 1/k, where j > 0 denotes the j-th weighted subword starting at position i.
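
Under our reading of Definition 2, the probability of appearance of a weighted subword is the product of the per-position probabilities of its characters, and only subwords whose probability exceeds 1/k are kept. The following sketch is an illustration of this filtering over the representation shown above, not the construction algorithm of [21].

    def subword_probability(wseq, start, subword):
        """Probability of reading `subword` from position `start` (0-based),
        taken as the product of the per-position character probabilities."""
        prob = 1.0
        for offset, ch in enumerate(subword):
            prob *= wseq[start + offset].get(ch, 0.0)
        return prob

    def heavy_subwords(wseq, start, k):
        """Enumerate the subwords starting at `start` whose probability stays above 1/k,
        extending position by position and pruning branches below the threshold."""
        results, frontier, pos = [], [("", 1.0)], start
        while frontier and pos < len(wseq):
            extended = []
            for prefix, prob in frontier:
                for ch, p in wseq[pos].items():
                    if prob * p > 1.0 / k:
                        extended.append((prefix + ch, prob * p))
            results.extend(extended)
            frontier, pos = extended, pos + 1
        return results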

We suppose that the navigation history in a web site is maintained as web access sequences (WAS). Web access sequences are sequences of web pages that express each session (clicking sequence) of the users and can be generated from the log files of the web site; if we represent each web page as a symbol in an appropriate alphabet, then a WAS is essentially a string.

3 Prediction/Recommendation Model

In our model we assume that each navigation session is described by a WASi and stored for offline preprocessing. S is the set of all existing web access sequences that have taken place in the web site. A sketchy description of the operation of our method follows. In an offline manner, running either in an idle time period or in the background, the system processes the set S of navigation sessions in order to group them into clusters, each containing sequences very similar to each other. Each cluster Ci is then represented by a weighted sequence WSi and finally, using these weighted sequences, a generalized weighted suffix tree (gWST) over all the WSi is constructed. This structure is then maintained and used as a web page prediction/recommendation tool. Each of the preprocessing steps is described in more detail in the following sections.

3.1 WAS Maintenance

Each of the user navigation sessions is implicitly maintained in the log files of the web site's server. Either we program the web server to store each WAS in a separate repository, or we run an extraction process over the log files at the beginning of the preprocessing procedure. Each of the web site's pages is labeled with a unique symbol, and all these symbols form the alphabet of the web access sequences. Assume, for the sake of description, that there are N sequences forming a set S = {WAS1, WAS2, . . . , WASN}.
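
A minimal sketch of this labeling step follows; the sessions and page URLs are hypothetical, and the actual extraction of sessions from the log files is left to the log-analysis tooling discussed in the evaluation section.

    def sessions_to_was(sessions):
        """Assign a unique symbol to every page URL and turn each session into a string (WAS).
        `sessions` is a list of lists of page URLs, as produced by a sessionization tool."""
        alphabet = {}
        def symbol(page):
            if page not in alphabet:
                alphabet[page] = chr(ord("A") + len(alphabet))  # any unique labeling will do
            return alphabet[page]
        was_set = ["".join(symbol(p) for p in session) for session in sessions]
        return was_set, alphabet

    # Hypothetical usage:
    S, alphabet = sessions_to_was([
        ["/index.html", "/courses.html", "/courses/algo.html"],
        ["/index.html", "/staff.html"],
    ])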

3.2 WAS Clustering

At this step of the preprocessing procedure we construct the similarity matrix of S. This N × N matrix expresses the similarity of each pair (i, j) of sequences. As a metric of the distance/similarity between WASi and WASj we chose to use a hybrid metric that takes into account both the global alignment and the local alignment of the two sequences. More formally:

D(i, j) = (1 − p) ∗ LA(WASi, WASj) + p ∗ GA(WASi, WASj)   (1)

where LA(WASi, WASj) is the score of the local alignment of the sequences WASi and WASj, GA(WASi, WASj) is the score of their global alignment, and p is a parameter that expresses the importance that we give to the scores of the two different alignments.

There are several possible ways to define the value of p. One choice is to set p equal to 0.5, giving the same weight to both alignments. The other, and more appropriate, choice is to define the value of p relative to the ratio of the lengths of the two sequences. More formally, assuming without loss of generality that |WASj| ≥ |WASi|, we define:

p = |WASi| / |WASj|   (2)

The intuition behind this definition is that when the two sequences, and thus the navigational behaviors, have almost the same length, we should take the global alignment into account more than the local one. When the lengths of the two sequences are very different, p is close to 0 and the local alignment carries more weight than the global one. Our claim is that the common navigational preferences of the users depicted by two sequences are not captured only by aligning the sequences over their whole length; highly similar subsections of the sequences can capture common behavior too.
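
A minimal sketch of equations (1) and (2) follows; local_alignment and global_alignment stand for the alignment scoring routines discussed in the following paragraphs and are assumptions of this sketch.

    def hybrid_similarity(was_i, was_j, local_alignment, global_alignment):
        """Equations (1) and (2): the global alignment is weighted by p = |shorter| / |longer|
        and the local alignment by (1 - p)."""
        shorter, longer = sorted((was_i, was_j), key=len)
        p = len(shorter) / len(longer)
        return (1 - p) * local_alignment(was_i, was_j) + p * global_alignment(was_i, was_j)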

The score function for the alignment has to be appropriate for expressing the navigational properties of the sessions. A straightforward approach would prize the match of two characters/pages in the alignment with a positive constant score, while a mismatch or a match with a space would receive a negative constant score. However, a specific page of a web site may play dissimilar roles for different users, so a simple character/page match may mean nothing for the task of testing the similarity of two sequences. A web page may have no importance to a user; we call such pages Unimportant, for example web-site start pages, or pages visited by a wrong click and immediately followed by a back-step. Usually a user stays on unimportant pages for a very short period of time, a few seconds at most, and then moves on. A page is also unimportant for a user when it is recorded in the log file for an inordinately long period of time, usually above 25-30 minutes; this usually means that the user has changed web page by typing a new address or by clicking a banner. Another role a page can play is that of an index; we call these pages Hub. On these pages users try to locate the link to a desired page by reading the content roughly, staying there for no more than 5-6 minutes. Finally, there are the Content pages, with the content the user desires; users spend approximately 10 to 20 minutes on Content pages. Consequently, there is a need to store in each WAS the corresponding period of time along with each character/page. Techniques for extracting sessions and additional information from log files can be found in [39, 33, 7, 3]. As mentioned before, we initially form an alphabet from all the web pages, which is then at most tripled by characterizing each of the existing symbols as Unimportant, Hub or Content. A rough sketch of this characterization follows.
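
The duration thresholds below follow the rough figures quoted above and are otherwise assumptions of the sketch; borderline durations (for example, visits between 6 and 10 minutes) are resolved arbitrarily.

    def page_role(seconds):
        """Classify a visit by its duration, following the rough thresholds given in the text."""
        if seconds < 30 or seconds > 25 * 60:
            return "U"          # Unimportant: a few seconds, or an abandoned/forgotten page
        if seconds <= 6 * 60:
            return "H"          # Hub: the user scans the page looking for a link
        return "C"              # Content: the user actually reads the page

    def annotate_was(timed_session):
        """Triple the alphabet: each (page_symbol, duration_in_seconds) pair
        becomes a composite symbol page_symbol + role."""
        return "".join(symbol + page_role(duration) for symbol, duration in timed_session)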

For calculating the global and the local alignments, the classical dynamic programming approach [25] has been chosen, using recursive formulas for the alignments of the prefixes of WASi and WASj. In both the global and the local alignment of two sequences, alignments of symbols with spaces should not be gathered in consecutive positions. Consecutive alignments with spaces are defined as gaps. Very long gaps mean that in one WAS the user has followed a very different path (perhaps getting lost) in comparison with the other, so we would not like to favor such gaps. To abate this effect we apply affine gap penalties in the alignments. The objective is then to find an alignment that maximizes the quantity Wm ∗ (#matches) − Wms ∗ (#mismatches) − Wg ∗ (#gaps) − Ws ∗ (#spaces), where Wm = 1 denotes the weight of a match, Wms = 1 the weight of a mismatch, Wg the weight of starting a new gap and Ws the weight of an alignment with a space. We would like to favor small gaps, so we have chosen Wg equal to 1 while Ws = 5. The gap opening penalty is intentionally kept smaller than the space penalty in order to favor frequent gap openings of small length instead of longer and more infrequent gaps. The classic recursive equations can easily be adapted to incorporate affine gaps [17]. A minimal sketch of such an alignment is given below.
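
The following sketch implements the standard Gotoh-style recurrences for global alignment with affine gaps, not the authors' code; the default weights are the magnitudes quoted above written as signed score contributions, and the local-alignment variant follows the same recurrences with negative cell values clamped to zero.

    NEG = float("-inf")

    def global_affine(a, b, match=1, mismatch=-1, gap_open=-1, space=-5):
        """Needleman-Wunsch with affine gaps (Gotoh): a gap of length L contributes
        gap_open + L * space to the score that is being maximized."""
        n, m = len(a), len(b)
        M  = [[NEG] * (m + 1) for _ in range(n + 1)]   # a[i] aligned to b[j]
        Ix = [[NEG] * (m + 1) for _ in range(n + 1)]   # a[i] aligned to a space
        Iy = [[NEG] * (m + 1) for _ in range(n + 1)]   # b[j] aligned to a space
        M[0][0] = 0
        for i in range(1, n + 1):
            Ix[i][0] = gap_open + i * space
        for j in range(1, m + 1):
            Iy[0][j] = gap_open + j * space
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                s = match if a[i - 1] == b[j - 1] else mismatch
                M[i][j]  = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
                Ix[i][j] = max(M[i - 1][j] + gap_open + space, Ix[i - 1][j] + space)
                Iy[i][j] = max(M[i][j - 1] + gap_open + space, Iy[i][j - 1] + space)
        return max(M[n][m], Ix[n][m], Iy[n][m])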

Once the similarity matrix D has been constructed, we use it to group the sequences into clusters. The distance matrix can be viewed as N points in N-dimensional space. Running a clustering algorithm on these N points will produce clusters consisting of very similar web access sequences. D is given as input to the k-windows clustering algorithm [35], a well-known unsupervised partitional clustering algorithm. Our decision to use k-windows as the clustering method was driven by a variety of reasons, such as the enhanced quality of the produced clusters and its inherent parallel nature (more information can be found in [35]).

One of the drawbacks of the traditional implementation of the k-windows algorithm is that it uses a d-dimensional range tree [37] as its underlying data structure, which is very inefficient in high dimensions. In order to overcome this drawback we propose a variation of the k-windows algorithm that utilizes R-trees [18] as the underlying data structure. As an alternative to k-windows, the k-means clustering algorithm can also be used; a minimal sketch follows.
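
The sketch below substitutes k-means (via scikit-learn, an assumption of the sketch) for k-windows, as the text allows, treating each row of D as the coordinates of one session in N-dimensional space; it is an illustration rather than the R-tree-based variant proposed here.

    import numpy as np
    from sklearn.cluster import KMeans   # stand-in for k-windows; the text allows k-means

    def cluster_sessions(D, n_clusters):
        """Treat row i of the N x N matrix D as the coordinates of WAS_i and
        partition the N sessions into n_clusters groups of similar sequences."""
        labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(np.asarray(D))
        clusters = [[] for _ in range(n_clusters)]
        for i, c in enumerate(labels):
            clusters[c].append(i)         # indices of the sessions falling in each cluster
        return clusters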

3.3 WAS Cluster Representation

When the WAS clustering procedure is over, each of the clusters is expressed as a weighted sequence. The produced weighted sequences implicitly capture most of the navigational behavior observed in the corresponding cluster. Assuming that we examine a cluster with f web access sequences, the weighted sequence is produced by combining the web access sequences through their multiple sequence alignment. Although the calculation of a multiple sequence alignment with dynamic programming approaches is extremely time and space consuming, in some cases it may be affordable because the length of the sessions is usually short (5 to 10 pages is a very frequently observed session length), due to limited human recall [13] or to bounded web-site height [6].

As an alternative, one could use the approach of progressive or iterative pairwise alignment in order to produce the multiple sequence alignment [17]. Progressive or iterative algorithms are based on the idea that the solution can be computed by modifying an already suboptimal solution. The simplest approach is to initially align the pair of WAS with the maximum similarity, and then successively merge in the WAS with the smallest distance from any of the strings already in the multiple alignment. This method can be viewed as finding the multiple alignment consistent with a maximum spanning tree formed from the similarity data (matrix D). A more realistic variant of the above approach might choose the sequence with the maximum average similarity and incorporate it into the multiple sequence alignment by aligning it with the most similar sequence. In our case we have used the star-alignment heuristic presented in [17].

Finally, when the multiple sequence alignment of the cluster's sequences has been calculated, we construct the weighted sequence that represents the cluster: at each position and for each character/page we maintain the percentage of its appearances, without taking the space into account, as in the sketch below.
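
Assuming the multiple alignment of a cluster has already been computed (for example by the star-alignment heuristic) and spaces are written as '-', the weighted sequence can be derived column by column as follows.

    def weighted_sequence_from_msa(msa, space="-"):
        """msa: list of aligned WAS strings of equal length. Returns one mapping
        page -> relative frequency per alignment column, ignoring the space character."""
        weighted = []
        for column in zip(*msa):
            chars = [c for c in column if c != space]   # the space is not taken into account
            if chars:
                weighted.append({c: chars.count(c) / len(chars) for c in set(chars)})
        return weighted

    # e.g. weighted_sequence_from_msa(["abc-", "ab-d", "abcd"])
    #   -> [{'a': 1.0}, {'b': 1.0}, {'c': 1.0}, {'d': 1.0}]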

3.4 WST Utilization - Recommendation/Prediction Method

As soon as all weighted sequences have been produced, their corresponding generalized weighted suffix tree is constructed. For this construction the algorithm of [21] is used, tuned by a parameter 1/k. The computational time and the space utilized by this structure are linear in the sum of the lengths of all weighted sequences. The generalized weighted suffix tree represents all possible weighted subwords of all the weighted sequences whose probability of appearance is greater than 1/k. The experimental evaluation showed that a proper value for the parameter 1/k lies between 0.01 and 0.1.

The weighted suffix tree implicitly captures the most important navigational experience at a small computational and memory price. It works as a prediction model without explicitly storing strings and probabilities of occurrence. It also inherits all the virtues of suffix trees (a well studied data structure): linear space, linear construction time and many well designed practical implementations.

The recommendation/prediction algorithm works as follows. When a new user arrives in the system, he is assigned to the root of the generalized weighted suffix tree (gWST). As the user navigates through the web site, selecting web pages by clicking links, his position in the gWST is advanced too. Assume that he is currently at an internal node u and is ready to perform his next step. The system proposes to him the web pages indicated by the outgoing edges of u. Each of the outgoing edges is labeled by a substring and starts with a different character, and all these different characters/web pages are proposed to the user. Proposing all descendants is not realistic, since their number can grow beyond 100 (we refer to this approach as unbounded in the evaluation section, just for comparison's sake). If the number of proposed web pages is relatively large, above 20, a selection of at most 20 web pages is presented (or prefetched) to the user. Those pages are selected either randomly (bounded-random) or by picking the "heaviest" ones (bounded-weight). The weight of each choice is defined as the number of times that the corresponding node has been visited by a user before, and it is implemented by a simple counter and a priority queue at each node.

In the extreme case that under an internal node there is no edge starting with the character that corresponds to the user's choice, the user is assigned to the root again and the tree navigation restarts according to the following actions of the user. A minimal sketch of this walk is given below.
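
The sketch illustrates the walk over a simplified node structure (one child per first character of an edge label, with a visit counter per node) together with the bounded-weight selection; the actual structure is the generalized weighted suffix tree of [21], which is not reproduced here.

    class Node:
        def __init__(self):
            self.children = {}   # first character of an outgoing edge label -> child Node
            self.visits = 0      # how many users have passed through this node before

    class Recommender:
        def __init__(self, root, bound=20):
            self.root, self.bound, self.current = root, bound, root

        def recommend(self):
            """Bounded-weight selection: propose at most `bound` next pages,
            those whose nodes have been visited most often."""
            ranked = sorted(self.current.children.items(),
                            key=lambda kv: kv[1].visits, reverse=True)
            return [page for page, _ in ranked[:self.bound]]

        def observe(self, page):
            """Advance along the edge chosen by the user, or restart from the root
            when no outgoing edge starts with that page."""
            nxt = self.current.children.get(page)
            if nxt is None:
                self.current = self.root
            else:
                nxt.visits += 1
                self.current = nxt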

4 Evaluation

For the evaluation of the proposed system, various experiments have been performed. Through this process we want to study and estimate the memory demands of the proposed system, the preprocessing effort to construct the structure, and the success of the predictions/recommendations that the system makes.

We have performed several experiments, choosing different values for the parameters of the system in each one. All experiments have been performed on an AMD64 dual core at 2 GHz with 1 GB of main memory under Windows XP. The programs were coded in Perl and C. The experiments simulate the user behavior by using a web-server log file: 3-10% of the information in the log file has been used to construct the data structure (gWST) of the system, while the rest has been used to evaluate the recommendations.

For the purposes of this experiment we have utilized the web server's log file of our department (www.ceid.upatras.gr). This log file corresponds to a two-month period and has a size of approximately 260 MB. For this set of experiments we have also used a data set that is well known from many studies in this area: the log file of the NASA web server over the months of July and August 1995 [45]. For extracting the web access sequences from the log files we have extended the Follow web-log analysis tool [44], version 1.5, so that it has the properties described in Section 3.2. The initial tool and the extension were both written in Perl. The preprocessing of the log files produced a set of web access sequences. A portion of these (3-10%), selected randomly, has been utilized to capture the user behavior of the web site by clustering the sequences and constructing the weighted suffix tree, while the rest has been used as an evaluation set. For each WAS, we perform the movements of the user described by the sequence, and the system behaves as described in Section 3.4. In order to evaluate the performance of our method we have used two known metrics, as in [31, 8] (a minimal sketch of both is given after the list):

– Hit-Ratio Metric: For each WAS we maintain an evaluation score. For each character of the sequence (step of the user in the site) we increase the score by one if the next step (page of the web site) has been proposed by the system, while if the user performs a step not in the set recommended by the WST (Section 3.4) we keep the score unchanged. Finally, for each WAS, we calculate a weighted score defined as the ratio of the previous score over the length of the sequence (which is the best possible score).

– Click-Soon-Ratio Metric: For each WAS we maintain an evaluation score. For each character of the sequence (step of the user in the site) we increase the score by one if at least one of the recommendations/predictions is requested by the user before the end of the sequence (that is, while the user session is still active). If none of the recommendations is used while the session is active, we keep the score unchanged. Finally, for each WAS, we calculate a weighted score defined as the ratio of the previous score over the length of the sequence (which is again the best possible score).
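
The sketch below computes the two per-WAS scores, assuming a recommender object with the recommend/observe interface sketched in Section 3.4; both scores are normalized by the length of the sequence.

    def evaluate_was(was, recommender):
        """Return (hit_ratio, click_soon_ratio) for a single web access sequence."""
        if not was:
            return 0.0, 0.0
        hits = click_soon = 0
        for step, page in enumerate(was):
            recommended = set(recommender.recommend())
            if page in recommended:
                hits += 1                        # the very next step was predicted
            if recommended & set(was[step:]):
                click_soon += 1                  # some prediction is requested later in the session
            recommender.observe(page)
        return hits / len(was), click_soon / len(was)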

Fig. 1. Experiments on NASA log-file.

Figure 1 recapitulates the settings and the results of the experiments on the NASA log file. A small portion (2%) of the data set has been used to train our model in both experiments. In the first one the 1/k building parameter of the weighted suffix tree is fine-tuned, while in the second one the number of initial clusters is varied. At a first glance we observe that the weighted scheme for the selection of the recommendations is, in both metrics, always much better than the random selection, but slightly worse than the impractical unbounded proposing. Decreasing 1/k increases the effectiveness of the model in both metrics, at the cost of memory. Suitable values for 1/k seem to be below 0.1. The number of clusters also affects the capability of the model, with lower final memory consumption, but in a milder way than the 1/k parameter. Finally, in the third experiment we have varied the size of the training data set from 2% to 10%. It seems that with larger training sets and small values of 1/k the model achieves even better prediction with the weighted scheme in both metrics (84.12% and 97.29%, respectively). With such high hit-ratio scores this model is capable of serving as an online recommendation system, and with those click-soon-ratio scores as a web-page prefetching system. Comparing our experimental results with those of [16], which is one of the latest techniques competitive to ours and the latest comparison work on the topic, we can claim that our method outperforms most of them. More specifically:

– our model seems to have similar, and sometimes slightly better, performance in the click-soon metric (the best scores in [16] vary from 85-94%);

– in the hit-ratio metric our method achieves clearly better performance (the best scores in [16] vary from 50-60%);

in both cases by utilizing a much smaller training set, which suggests that our model has a stronger predictive capability.

The results of the experiments on the other data set, the log file of www.ceid.upatras.gr, are quite similar to the above and again slightly better, using the same portion of information for training purposes. This comes from the fact that this web site is smaller, with far fewer navigational patterns, which are captured better by our model. This data set consists of 27805 web access sequences, while over 55000 sequences were produced from the NASA log file. Due to lack of space, the data table of these results is not presented.

Concerning memory utilization, the memory consumption of the main data structure is quite reasonable and does not exceed a few MBytes, while the preprocessing procedure is more demanding due to the computation of the distance matrix D, whose size is quadratic in the size of the training set.

5 Conclusions and Open Issues

In this paper we have proposed a technique for predicting web page usage patterns by modeling users' navigation history using string processing techniques, and we have validated experimentally the superiority of the proposed technique. Future work includes different ways of modeling web user access patterns, the choice of different clustering techniques, the investigation of different metrics, exhaustive experimental comparisons with similar techniques, and exploitation of the trade-off between the degree of the weighted suffix tree (which is directly affected by the threshold) and the performance of our scheme.

References

1. Banerjee A., J. Ghosh, Clickstream clustering using weighted longest common subsequences, in: Proc. of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001.
2. Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In Proceedings of the Seventh Int. Conf. on World Wide Web, 107-117.
3. Catledge L. and J. Pitkow. Characterizing browsing behaviors on the WWW. Computer Networks and ISDN Systems, pp. 1065-1073, 1995.
4. Chakrabarti, S., van der Berg, M., and Dom, B. 1999. Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of the 8th Int. World Wide Web Conf. (WWW8).
5. Chen, M., LaPaugh, A. S., and Singh, J. P. 2002. Predicting category accesses for a user in a structured information space. In Proc. of the 25th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2002.
6. Christopoulou E., J. Garofalakis, C. Makris, Y. Panagis, A. Psaras-Chatzigeorgiou, E. Sakkopoulos, A. Tsakalidis. Techniques and Metrics for Improving Website Structure. J. Web Eng. 2(1-2): 90-114, 2003.
7. Cooley R., B. Mobasher and J. Srivastava. Data preparation for mining WWW browsing patterns. Knowledge and Information Systems, 1999.
8. Cosley, D., Lawrence, S., and Pennock, D. M. 2002. REFEREE: An open framework for practical testing of recommender systems using ResearchIndex. In Very Large Database (VLDB '02) Conference, China, 35-46.
9. Dean, J. and Henzinger, M. R. 1999. Finding related pages in the World Wide Web. In Proceedings of the Eighth International Conference on World Wide Web (Toronto, Canada). P. H. Enslow, Ed. Elsevier North-Holland, New York, NY, 1467-1479.
10. Deshpande M. and Karypis, G. 2004. Selective Markov models for predicting Web page accesses. ACM Trans. Inter. Tech. 4, 2 (May 2004), 163-184.
11. Dunham M., Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.
12. Eirinaki M. and M. Vazirgianis, "Web Mining for Web Personalization", ACM Transactions on Internet Technology (TOIT), volume 3, issue 1, 2003.
13. Eysenck M. and M. Keane. Cognitive Psychology. Lawrence Erlbaum Associates, 1990.
14. Facca, F. M. and Lanzi, P. L. 2005. Mining interesting knowledge from weblogs: a survey. Data Knowl. Eng. 53, 3 (Jun. 2005), 225-241.
15. Gulli, A. and Signorini, A. 2005. The indexable web is more than 11.5 billion pages. In Special Interest Tracks and Posters of the 14th International Conference on World Wide Web (Chiba, Japan, May 10-14, 2005).
16. Gunduz, U. and Ozsu, M. T. 2003. A Web page prediction model based on click-stream tree representation of user behavior. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03.
17. Gusfield D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
18. Guttman, A. R-trees: a dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data, 1984.
19. Hay B., G. Wets, K. Vanhoof, Clustering navigation patterns on a website using a sequence alignment method. In: Intelligent Techniques for Web Personalization, IJCAI 2001, 17th Int. Joint Conf. on Artificial Intelligence, August 4, 2001, Seattle, WA, USA, pp. 1-6.
20. Huang X., N. Cercone, A. An, Comparison of interestingness functions for learning web usage patterns, in: Proceedings of the Eleventh International Conference on Information and Knowledge Management, ACM Press, 2002, pp. 617-620.
21. Iliopoulos C. S., C. Makris, Y. Panagis, K. Perdikuri, E. Theodoridis, and A. Tsakalidis. The weighted suffix tree: an efficient data structure for handling molecular weighted sequences and its applications. Fundamenta Informaticae, 71(2-3):259-277, 2006.
22. Jespersen S. E., J. Thorhauge, T. B. Pedersen, A hybrid approach to web usage mining, in: Proceedings of the 4th Int. Conf. on Data Warehousing and Knowledge Discovery, Springer-Verlag, 2002, pp. 73-82.
23. Joshi K. P., A. Joshi and Y. Yesha, On using a warehouse to analyze web logs, Distributed and Parallel Databases 13 (2003) (2), pp. 161-180.
24. Nanopoulos A., D. Katsaros, Y. Manolopoulos, Exploiting web log mining for web cache enhancement, in: R. Kohavi, B. Masand, M. Spiliopoulou, J. Srivastava (Eds.), WEBKDD 2001 - Mining Web Log Data Across All Customers Touch Points, Third Int. Workshop, San Francisco, CA, USA, August 26, 2001.
25. Needleman S. and C. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48(3):443-453, 1970.
26. Padmanabhan, V. N. and Mogul, J. C. 1996. Using predictive prefetching to improve World Wide Web latency. SIGCOMM Comput. Commun. Rev. 26, 3 (Jul. 1996).
27. Pei J., J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, Mining sequential patterns by pattern-growth: the PrefixSpan approach, IEEE Transactions on Knowledge and Data Engineering 16, 11 (Nov. 2004), 1424-1440.
28. Pei J., J. Han, B. Mortazavi-Asl, H. Zhu, Mining access patterns efficiently from web logs, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2000, pp. 396-407.
29. Qiu, F. and Cho, J. 2006. Automatic identification of user interest for personalized search. In Proceedings of the 15th Int. Conf. on World Wide Web, WWW '06, 727-736.
30. Ruddle, R. A. (in press). Using string-matching to analyze hypertext navigation. Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HT '06).
31. Rangarajan S. K., V. V. Phoha, K. S. Balagani, R. R. Selmic, S. S. Iyengar, "Adaptive Neural Network Clustering of Web Users," Computer, vol. 37, no. 4, pp. 34-40, Apr. 2004.
32. Shahabi C. and Y.-S. Chen, Improving user profiles for e-commerce by genetic algorithms, E-Commerce and Intelligent Methods, Studies in Fuzziness and Soft Computing 105 (2002) (8).
33. Smith D. and J. Pricer. Sessionizing clickstream data. Teradataview, pages 8-9, 2000.
34. Spiliopoulou, M. 2000. Web usage mining for Web site evaluation. Commun. ACM 43, 8 (2000), 127-134.
35. Tasoulis D. K. and M. N. Vrahatis. Novel approaches to unsupervised clustering through the k-windows algorithm. Knowledge Mining, Series: Studies in Fuzziness and Soft Computing, Vol. 185, pages 51-78, 2005.
36. Schechter, S., Krishnan, M., and Smith, M. D. 1998. Using path profiles to predict HTTP requests. In Proceedings of the Seventh International Conference on World Wide Web (Brisbane, Australia). P. H. Enslow and A. Ellis, Eds. Elsevier Science Publishers B. V., Amsterdam, The Netherlands, 457-467.
37. Willard, D. New data structures for orthogonal range queries. SIAM J. on Computing 14, 232-253, 1985.
38. Wong S. S. C., S. Pal, Mining fuzzy association rules for web access case adaptation, in: Workshop on Soft Computing in Case-Based Reasoning, International Conference on Case-Based Reasoning (ICCBR '01), 2001.
39. Xiao Y. and M. Dunham. Efficient mining of traversal patterns. Data and Knowledge Engineering, 39(2):191-214, November 2001.
40. Xie Y., V. V. Phoha, Web user clustering from access log using belief function, in: Proc. of the First Inter. Conf. on Knowledge Capture (K-CAP 2001), ACM Press, 2001, pp. 202-208.
41. Ypma A., T. Heskes, Clustering web surfers with mixtures of hidden Markov models, in: Proceedings of the 14th Belgian-Dutch Conference on AI (BNAIC '02), 2002.
42. Zaiane, O. R. 1999. Resource and Knowledge Discovery from the Internet and Multimedia Repositories. PhD Thesis.
43. Zhao Q., Bhowmick, S. S., and Gruenwald, L. 2005. WAM-Miner: in the search of web access motifs from historical web log data. In Proc. of the 14th ACM CIKM '05, 421-428.
44. http://www.mnot.net/follow/
45. http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html