


Chapter 3
Web Pattern Extraction and Storage

Víctor L. Rebolledo, Gastón L'Huillier and Juan D. Velásquez

Abstract Web data provides information and knowledge to improve a web site's content and structure. Indeed, it eventually contains knowledge which suggests changes that make a web site more efficient and effective at attracting and retaining visitors. Making use of a Data Webhouse or a web analytics solution, it is possible to store statistical information concerning the behaviour of users in a web site. Likewise, by applying web mining algorithms, interesting patterns can be discovered, interpreted and transformed into useful knowledge. On the other hand, web data includes large quantities of irrelevant content, so complex data preprocessing must be applied in order to model and understand visitor browsing behaviour. Nevertheless, there are many ways to pre-process web data and model the browsing behaviour, hence different patterns can be obtained depending on which model is used. In this sense, a knowledge representation is necessary to store and manipulate web patterns. Generally, different patterns are discovered by using distinct web mining techniques on web data with dissimilar treatments. Consequently, pattern meta-data are relevant to manipulating the discovered knowledge. In this chapter, topics like feature selection, web mining techniques, model characterisation and pattern management will be covered in order to build a repository that stores patterns' meta-data: specifically, a Pattern Webhouse that facilitates knowledge management in the web environment.

Víctor L. Rebolledo
Department of Industrial Engineering, University of Chile, República 701, Santiago, Chile, e-mail: [email protected]

Gastón L'Huillier
Department of Industrial Engineering, University of Chile, República 701, Santiago, Chile, e-mail: [email protected]

Juan D. Velásquez
Department of Industrial Engineering, University of Chile, República 701, Santiago, Chile, e-mail: [email protected]



3.1 Introduction

According to the Web Intelligence Consortium (WIC)1, "Web Intelligence (WI) has been recognised as a new direction for scientific research and development to explore the fundamental roles as well as practical impacts of Artificial Intelligence (AI)2 and advanced Information Technology (IT)3 on the next generation of Web-empowered products, systems, services, and activities".

In other words, WI seeks ways to evolve from a Web of data to a Web of knowledge and wisdom. As shown in Figure 3.1, web data sources must be understood, particularly web server and browser interactions. So new ways of pre-processing web data need to be discovered in order to extract information, which must be stored in repositories or Data Webhouses [29]. Nevertheless, information is not enough to make intelligent decisions, therefore knowledge is required [54].

Fig. 3.1 Web Intelligence Overview

Thanks to Web Mining techniques, web user browsing behaviour can be modelled and studied. Indeed, mining web data enables one to find useful and unknown knowledge through patterns, which must be correctly interpreted [10].

1 It is an international, non-profit organization dedicated to advancing world-wide scientific research and industrial development in the field of Web Intelligence (WI); wi-consortium.org
2 Knowledge representation, planning, knowledge discovery and data mining, intelligent agents, and social network intelligence.
3 Wireless networks, ubiquitous devices, social networks, wisdom Web, and data/knowledge grids.


However, web data needs to be pre-processed before applying web mining algorithms, whose results depend on the treatment applied to the data sources [35]. Moreover, web data is always changing: users change their browsing behaviour, and web site content and structure are modified to attract new users. In this sense, the discovered knowledge may become obsolete in a short period of time, therefore a Knowledge Base is required to store and manipulate relevant patterns [56].

Finally, knowledge must be managed, for which there are different approaches. One of them is knowledge representation based on ontologies written in XML-based languages. This representation is the basis of the Semantic Web4, whose purpose is to enable the Web to resolve requests from people and machines through artificial systems.

Web Intelligence systems will not be limited to extracting information from web data; they will also extract, store and manage knowledge. In fact, they will be able to understand visitors and, consequently, make "intelligent" decisions and recommendations based on their interactions.

3.1.1 From Data to Knowledge

The terms data, information, and knowledge are often used interchangeably, due to incorrect interpretations of their meanings. According to Davenport & Prusak in their book Working Knowledge [11]:

• Data are primary elements of information that are irrelevant for decision making. In other words, they can be seen as a discrete group of values that say nothing about the nature of their context and, furthermore, do not provide any guidance. For example, this could be a list of phone numbers.

• Information corresponds to a group of processed data that have a meaning, relevance, purpose and context, and are therefore useful for decision making. In order to obtain information, data must be put into context, categorised, processed, corrected, edited and condensed so that they make sense to the decision maker.

• Knowledge is composed of a mixture of experience, values, information and know-how, which serves as a framework for the incorporation of new experiences and information. Knowledge originates in and is applied by the expert's mind5. In organisations, it is frequently witnessed not only in documents or data warehouses, but also in organisational routines, processes and standards.

4 It is an evolving development of the Web in which the meaning (semantics) of information and services is defined.
5 The experts' minds are the source of knowledge and they are able to apply it.


To obtain knowledge from information, one must compare it with other elements to predict consequences, thus looking for underlying connections or searching for an expert's interpretation.

When users are browsing, they visit different pages that contain data in various formats: text, images, multimedia files and so on [56]. For example, if a user looks for information about a certain topic by making use of search engines, the results must be put into context [11]. Firstly, the web user reads the content of the pages, obtaining information related to the topic [39]. As the user learns more about the topic, he may pose new questions, resulting in new queries to the search engines, which provide new pages for the user to read. Eventually, they will be able to obtain knowledge about the topic. However, the process can be lengthy and tedious, because the user only obtains knowledge once they have read every page provided by the search engine.

The previous process is repeated every time a user wants to obtain information or knowledge through the Web [39]. For some people this process is simpler than for others; however, achieving a high degree of efficiency demands the use of search engines for a certain period of time before becoming an experienced user [57]. In other words, users must adapt to the Web to obtain what they want. In that sense, the knowledge contained in the Web must be represented in order to reduce the cost of obtaining it.

3.1.2 About Knowledge Representation

The human mind's mechanism for storing and retrieving knowledge is fairly transparent to us. For example, when an "orange" was memorised it was examined first, thought about for some time and perhaps eaten. During this process, the orange's essential qualities were stored: size, colour, smell, taste, texture, etc. Consequently, when the word "orange" is heard, our senses are activated from within [3].

Unfortunately, computers are not able to build this kind of representation by themselves. Instead of gathering knowledge, computers must rely on human beings to place knowledge directly into their memories. While programming enables us to determine how the computer performs, we must decide on ways to represent information, knowledge, and inference techniques within it.

As a research field, Knowledge Representation (KR) is the study of how knowledge of the world can be represented and what kinds of reasoning can be carried out with it. In other words, it is "how an entity sees the world", understands a situation and prepares a satisfactory action. In short, it is the ability to create new perceptions from old ones.

According to Davis et al. in their paper "What is Knowledge Representation", the notion can best be understood in terms of five distinct roles it plays [13, 54]:


• KR is fundamentally a surrogate or substitute for the elements that compose the external world, used to enable an entity to determine the consequences of applying an action, i.e., by reasoning about the world rather than taking action in it.

• KR should answer the question: in what terms should I think about the world? So, KR is a set of ontological commitments from which the choice of representation must be made. All knowledge representations are approximations of reality. In this sense, KR is a method, consisting of criteria such as "what we want to perceive" acting as a filter. For example, a general representation of voice production is to think in terms of a semi-periodic source (glottal excitation), vocal tract, lip radiation and the integration of these components. However, if the focus is on vocal tract dynamics only, it can be explained as "the glottal excitation wave hits the walls of the vocal tract, generating a composite wave", which can be understood as a set of ontological commitments.

• KR is a fragmentary theory of intelligent reasoning, expressed in terms of three components:

– The representation's fundamental conception of intelligent reasoning.
– The set of inferences the representation sanctions.
– The set of inferences the representation recommends.

For example, in classic economic theory, consumers make intelligent decisions based on information about product characteristics and prices. When the price changes, so does the intention to purchase.

• KR is a medium for pragmatically efficient computation. As thinking machines are in essence computational processes, the challenge for efficient programming is to properly represent the world. Independent of the language used, it is necessary to correctly represent the problem and create manageable and efficient code, i.e., code that does not require redundant and heavy data processing.

• KR is a medium of human expression. Knowledge is expressed through a medium such as spoken language, written text, the arts, etc. A communication interface is necessary if a human being is to interact with an intelligent system.

In order to fulfil the five aforementioned roles, KR is a multidisciplinary subject that applies theories and techniques from three other fields:

1. Logic provides the formal structure and rules of inference.
2. Ontology defines the kinds of things that exist in the application domain.
3. Computation supports the applications that distinguish knowledge representation from a simple concept.

This combination represents the main initiative to evolve from a Web of data to a Web of knowledge. Indeed, this is the final purpose of the Semantic Web, an initiative led by the World Wide Web Consortium (W3C) that aims to define standards in order to add descriptive formal content to web pages. This content will be invisible to users and will enable machines to understand and manipulate the content of each page in order to provide intelligent services. In this sense, there are diverse trade-offs when one attempts to represent knowledge. Many times the representational adequacy and fidelity, given by Logic and Ontology, are opposed to the computational cost. In dynamic environments with stochastic information these trade-offs are even more pronounced.

On the other hand, knowledge can also be seen as patterns discovered from databases by using data mining techniques. Indeed, patterns can be defined as knowledge artefacts, providing a compact and semantically rich representation of a huge quantity of heterogeneous raw data. However, due to the specific characteristics of patterns (heterogeneous and voluminous), ad hoc systems are required for pattern management in order to model, store, retrieve, and manipulate patterns in an effective and efficient way.

For pattern management, different approaches can be found:

• The design of a Data Mining Query Language for mining different kinds of knowledge in relational databases. The idea is to provide a standardised language, analogous to SQL, for extracting patterns: if SQL was successful in extracting information from databases, a new language could resolve the knowledge extraction issue. Examples of such languages are DMQL [23], MINE RULE [40], and MineSQL (MSQL) [26].

• Commercial initiatives such as PMML6, an XML notation that represents and describes data mining and statistical models, as well as some of the operations required for cleaning and transforming data prior to modelling. The idea is to provide enough infrastructure for one application to be able to produce a model (the PMML producer) and another to consume it (the PMML consumer) simply by reading the PMML file [21].

• Building a Pattern Base Management System (PBMS) [7]. The idea is to formally define the logical foundations for the global setting of pattern management through a model that covers data, patterns, and their intermediate mappings. Moreover, it provides a formalism for pattern specification along with safety restrictions, and introduces predicates for comparing patterns and query operators. This system was finally implemented as a prototype called PSYCHO [8, 50].

Taking a different approach, this chapter focuses on characterising web mining models with the purpose of building a repository that facilitates the manipulation of the knowledge contained in the discovered patterns. The idea is to store the patterns and the meta-data associated with them making use of a Data Warehouse architecture. Unlike Data Webhouses7, which store statistical information about web user behaviour, a Knowledge Repository is proposed which allows us to evaluate different models and web mining studies in order to compare patterns, thereby discovering relevant knowledge from the diverse and changing web data.

This chapter covers the most successful data mining algorithms applied to web data, feature selection and extraction criteria, and model evaluation measures used to characterise web mining studies depending on the technique used.

6 The Predictive Model Markup Language (PMML) is an XML standard being developed by the Data Mining Group (www.dmg.org).
7 Application of the Data Warehouse architecture to store information about web data.


3.1.3 General Terms and Definition of Terms

In the following, the general notation and definitions of the terms used in this chapter are presented.

• X: Feature space for the N objects present in a dataset, defined by X = \{X_i\}_{i=1}^{N}, where each object X_i is described by n features, X_i = \{x_{i,j}\}_{j=1}^{n}.

• Y: Dependent feature for supervised learning problems, where for each object i \in \{1, \ldots, N\} its value is given by Y = \{y_i\}_{i=1}^{N}.

• f(\cdot): Theoretical function f(x), x \in X, with f : X \to Y, which takes values from the feature space and maps them into the space of the dependent feature. This function is the theoretical representation of the best approximated function for a given data mining algorithm.

• h(\cdot): Hypothesis function h(x), x \in X, with h : X \to Y, with the same behaviour as f. This function represents the empirical estimated function for a given data mining algorithm.

• M: Generic data mining model.

All algorithms and procedures presented throughout this chapter are based on this notation, unless otherwise stated.

3.2 Feature Selection for Web data

In web mining applications, the number of variables used to characterise Web-objects8 can reach vectors of hundreds or even thousands of features. This characterisation can be influenced by traditional data mining techniques, where feature preprocessing, selection and extraction are evaluated for a given Web-object database. The Web is a feature-rich environment, which introduces into web mining the curse of dimensionality over the evaluation of different pattern extraction algorithms, for which, given the application domain (supervised, unsupervised or incremental learning), the performance and evaluation time of the technique are directly related.

In terms of feature selection and extraction, understood as the construction and selection of useful features to build a good predictor, several potential benefits arise far beyond the regular machine learning goal of improving the complexity and training capabilities of algorithms. Data visualisation and understanding, from an analyst's perspective, is a key factor in determining which algorithms should be considered for the web-data preprocessing. In this context, features relevant for data interpretation could be sub-optimal in terms of building a predictor, as redundant information could be needed for a correct interpretation of a given pattern.

8 A Web-object refers to any data generated over the Web.


Furthermore, in order to determine the set of useful features, the criterion is commonly associated with reducing training time and defying the curse of dimensionality for performance improvements [4, 22, 32].

As a given set of training data can be labelled or unlabelled, both supervised and unsupervised methods for feature selection can be used [62, 68]. While supervised feature selection uses an evaluation criterion defined with respect to the target label, unsupervised feature selection exploits information and patterns extracted from the data itself to determine the relevant set of features.

In this section, Web data pre-processing is reviewed in terms of feature selection and extraction. Firstly, some of the main feature selection techniques are discussed. Afterwards, some of the main feature extraction methods for web mining are explored. The preceding steps for obtaining the correct characterisation of web data are covered in other chapters of this book for different domain tasks, such as web-content mining, web-structure mining and web-usage mining.

3.2.1 Feature Selection Techniques

In terms of feature selection, simple measures such as correlation criteria or information theoretic criteria are well known in many data mining applications. Furthermore, meta-algorithms such as Random Sampling, Forward Selection and Backward Elimination are used together with evaluation measures to determine the most representative set of features for a given web mining problem. In the following parts, the main points of these methods are presented, as well as their most common usage in web mining applications.

3.2.1.1 Correlation Criteria for Feature Selection

The Pearson correlation coefficient has been widely used in data mining applications [22]; roughly, it represents the relationship between a given feature and the label, or dependent feature, that is to be predicted.

Although this measure is widely used by practitioners in several web mining applications, it can only detect linear dependencies between a given feature and the dependent variable.
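For reference, the correlation-based ranking criterion for the i-th feature is usually written as follows; this is the standard textbook expression rather than a formula stated explicitly in this chapter:

R(i) = \frac{\operatorname{cov}(x_i, Y)}{\sqrt{\operatorname{var}(x_i)\,\operatorname{var}(Y)}}

Features are then ranked by |R(i)| (or R(i)^2) and those with the highest scores are kept.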

3.2.1.2 Information Theoretic Feature Selection

Information theoretic criteria, such as information gain or mutual information, have been used for feature selection and extraction [17, 51], where the amount of information shared between each feature and the dependent variable is used to rank the features in terms of relevance. As presented in equation 3.1, the mutual information, which can be considered a criterion of dependency between the i-th feature x_i and the target variable y, is determined by their probability densities p(x_i) and p(y) respectively.

I(i) = \int_{x_i} \int_{y} p(x_i, y) \log \frac{p(x_i, y)}{p(x_i)\, p(y)} \, dx_i \, dy \qquad (3.1)

In most web mining applications, web data is represented by discrete values. Given this, continuous probability densities are unlikely to be estimated directly; instead, probabilities are determined by frequency tables and counting techniques, as presented in the following expression:

I(i) = \sum_{x \in X_i} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)} \qquad (3.2)

Using this evaluation criterion, the optimal set of features can be determined by ordering the features by decreasing mutual information (or information gain) and selecting those whose score exceeds a given threshold.
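As an illustration, the following is a minimal sketch of this ranking procedure for discrete features, estimating equation 3.2 from frequency counts. The function names, the numpy-array interface and the thresholding policy are assumptions of the example, not taken from the chapter.

```python
import numpy as np
from collections import Counter

def mutual_information(feature, labels):
    """Estimate I(x_i; y) for a discrete feature via frequency counts (equation 3.2)."""
    n = len(labels)
    count_x = Counter(feature)
    count_y = Counter(labels)
    count_xy = Counter(zip(feature, labels))
    mi = 0.0
    for (x, y), c_xy in count_xy.items():
        # P(x,y) / (P(x) P(y)) expressed with raw counts: c_xy * n / (c_x * c_y)
        mi += (c_xy / n) * np.log(c_xy * n / (count_x[x] * count_y[y]))
    return mi

def rank_features(X, y, threshold=0.0):
    """Order features by decreasing mutual information and keep those above a threshold."""
    scores = [(j, mutual_information(X[:, j], y)) for j in range(X.shape[1])]
    scores.sort(key=lambda item: item[1], reverse=True)
    return [j for j, score in scores if score > threshold]
```

For instance, `selected = rank_features(X, y, threshold=0.01)` would keep only the features whose estimated mutual information with the label exceeds 0.01.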

3.2.1.3 Random Sampling

A simple method for feature selection is to choose features by uniform random sub-sampling, without repetition, from the whole set of features. This method, known as Random Sampling, has been used when a given dataset is characterised by a high dimensional set of features, where a proportion of the features provides enough information about the underlying pattern to be determined. However, if this is not the case, poor results may be obtained, as the sub-sampling does not generate a sufficient set of relevant features.
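A minimal sketch of this sub-sampling step is given below; the proportion parameter and the numpy-based interface are illustrative assumptions.

```python
import numpy as np

def random_feature_subsample(X, proportion=0.1, seed=None):
    """Uniformly sample a fraction of the feature columns without replacement."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    k = max(1, int(proportion * n_features))       # number of features to keep
    selected = rng.choice(n_features, size=k, replace=False)
    return X[:, selected], selected
```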

3.2.1.4 Forward Selection and Backward Elimination

Both the forward selection and backward elimination meta-algorithms are well known for feature selection in different applications. On the one hand, forward selection starts from an empty set of features and relevant ones are incrementally added. On the other hand, backward elimination starts from the whole set of attributes and deletes non-relevant features using a given elimination criterion. These methods are used in the scope of wrapper methods [32, 38] and embedded methods [22] for feature selection, as sketched after the following list:

• Wrapper methods: This type of method requires no prior knowledge of the machine learning algorithm M used; at each step i of the backward elimination or forward selection process, a subset of features is used for the evaluation of M. Then, the algorithm's performance is evaluated and, finally, from the overall evaluation, the most relevant set of features is determined as the one that maximises the algorithm's performance.


• Embedded methods: Here, the key component is a machine learning algorithm that itself produces a feature ranking; the same forward selection or backward elimination procedure described previously then determines the final set of relevant features.
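The sketch below illustrates a wrapper-style forward selection, under the assumption that `train_and_score` is a user-supplied callback which trains the chosen algorithm M on a candidate feature subset and returns a validation score (for example, cross-validated accuracy); all names are illustrative.

```python
import numpy as np

def forward_selection(X, y, train_and_score, max_features=None):
    """Wrapper-style forward selection: greedily add the feature whose inclusion
    most improves the score returned by the model evaluation callback."""
    n_features = X.shape[1]
    selected, best_score = [], -np.inf
    max_features = max_features or n_features
    while len(selected) < max_features:
        candidates = {}
        for j in range(n_features):
            if j in selected:
                continue
            candidates[j] = train_and_score(X[:, selected + [j]], y)
        best_j = max(candidates, key=candidates.get)
        if candidates[best_j] <= best_score:
            break                          # no remaining feature improves the model
        selected.append(best_j)
        best_score = candidates[best_j]
    return selected
```

The backward elimination variant is symmetric: start from all features and repeatedly drop the feature whose removal degrades the score the least.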

3.2.2 Feature Extraction Techniques

In many web mining applications, a given set of features can be used to derive a new set that characterises the underlying pattern to be extracted more effectively. In this sense, methods such as Principal Component Analysis, Singular Value Decomposition and Linear Discriminant Analysis, amongst other feature extraction methods, have been considered in different web mining and Information Retrieval applications.

3.2.2.1 Principal Component Analysis

In Principal Component Analysis (PCA) [66] the main idea is to determine a new set of features K from the original feature set X, where |K| \leq |X| = n, and each feature k \in K is generated as an uncorrelated (orthogonal) linear combination of the original features. These features are called the principal components of the original set X. In order to determine the new set of features, this method maximises the variance of the data along each principal component. Given the features X_i \in X, the objective is to determine the principal component z_k by estimating the linear combination coefficients a_k = (a_{1,k}, \ldots, a_{n,k}), where z_k = a_k^T \cdot X:

\max_{a_k} \ \operatorname{var}(z_k)
\text{subject to } \operatorname{cov}(z_k, z_l) = 0, \quad k > l \geq 0
a_k^T \cdot a_k = 1, \quad a_k^T \cdot a_l = 0, \quad \forall k, l \in K,\ k \neq l \qquad (3.3)

where var(·) and cov(·) denote the variance and covariance functions, respectively.
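As a reference, a minimal numpy sketch of PCA via the eigendecomposition of the covariance matrix is shown below; the layout (objects as rows, features as columns) is an assumption of the example.

```python
import numpy as np

def principal_components(X, k):
    """Project X onto its first k principal components via the eigendecomposition
    of the sample covariance matrix (one common way to solve problem 3.3)."""
    Xc = X - X.mean(axis=0)                 # centre each feature
    cov = np.cov(Xc, rowvar=False)          # n x n covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    A = eigvecs[:, order]                   # loading vectors a_1, ..., a_k
    return Xc @ A, A                        # projected data z and the loadings
```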

3.2.2.2 Independent Component Analysis

Independent Component Analysis (ICA) has been used in different web mining applications, such as web usage mining [9] and general web mining feature reduction [31]. This method, unlike PCA, aims to determine features that minimise the mutual information (equation 3.2) within the new feature set Z.

3 Web Pattern Extraction and Storage 61

In order to impose statistical independence between the extracted features, the joint probability distribution of all features is assumed to factorise as the product of the individual feature distributions.

3.2.2.3 Singular Value Decomposition

As most web data is in fact text, a text-specific feature extraction process is introduced. Using the tf-idf matrix [46], a Singular Value Decomposition (SVD) of this matrix reduces the dimensions of the term-by-document space. SVD yields a new representation of the feature space in which the underlying semantic relationship between terms and documents is revealed.

Let the matrix X be an n × p tf-idf representation of the documents and k an appropriate number of dimensions for the reduction and term projection. Given U_k = (u_1, \ldots, u_k), an n × k matrix, the singular values matrix D_k = \operatorname{diag}(d_1, \ldots, d_k), where \{d_i\}_{i=1}^{k} represent the eigenvalues of X X^T, and V_k = (v_1, \ldots, v_k), a p × k matrix, the truncated SVD of X is represented by

X_k = U_k \cdot D_k \cdot V_k^T \qquad (3.4)

Here, the rows of V_k \cdot D_k can be used as the final representation of the documents. As described in [42], SVD preserves the relative distances in the VSM matrix while projecting it onto a Semantic Space Model (SSM) of lower dimensionality. This allows one to keep the minimum information needed to define an appropriate representation of the dataset.
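A minimal numpy sketch of this decomposition is given below, assuming the tf-idf matrix is laid out as terms by documents; the function name is illustrative.

```python
import numpy as np

def lsa_document_vectors(tfidf_term_doc, k):
    """Truncated SVD of a term-by-document tf-idf matrix; rows of V_k * D_k give
    the reduced representation of the documents (cf. equation 3.4)."""
    U, d, Vt = np.linalg.svd(tfidf_term_doc, full_matrices=False)
    Uk, dk, Vk = U[:, :k], d[:k], Vt[:k, :].T   # keep the k largest singular values
    doc_vectors = Vk * dk                       # equivalent to Vk @ diag(dk)
    X_k = Uk @ np.diag(dk) @ Vk.T               # rank-k approximation of the matrix
    return doc_vectors, X_k
```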

3.3 Pattern Extraction from Web Data

In this section, the main web mining models for pattern extraction are presented. These models, grouped into supervised learning, unsupervised learning and ensemble meta-algorithms, are reviewed and adapted to web mining applications, together with their pattern extraction properties.

3.3.1 Supervised Learning Techniques

Over the last few years, several supervised algorithms have been developed for the previously stated tasks. The main algorithms are Support Vector Machines (SVMs) [6, 48], associated with the regularised risk minimisation from the statistical learning theory proposed by Vapnik [53], and the naïve Bayes algorithm [15, 25, 41]. Also, for classification and regression problems, Artificial Neural Networks (ANNs) have been extensively developed since Rosenblatt's perceptron, introduced in [44], where the empirical risk minimisation is used to adjust the decision function.

In web mining, the target feature can represent different web entities related to classification or regression tasks, for example web-page categorisation problems or regression analysis for web user demand forecasting, amongst other applications [34]. Here, supervised learning refers to the fact that the dependent feature y_j \in Y is available. Furthermore, the term learning relates to the fact that the dependent feature must be inferred from the set of features X_i, using a training set T, and validated on a test set of objects V.

In the following sections, these techniques are explained in further detail.

3.3.1.1 Support Vector Machines

Support Vector Machines (SVMs) are based on the Structural Risk Minimisation (SRM) principle [6, 53] from statistical learning theory. Vapnik stated that the fundamental issue when developing a classification model is not the number of parameters to be estimated, but rather the flexibility of the model, given by the VC-dimension, introduced by V. Vapnik and A. Chervonenkis [5, 52, 53] to measure the learning capacity of the model. So, a classification model should not be characterised by the number of parameters, but by the flexibility or capacity of the model, which is related to how complicated the model is. The higher the VC-dimension, the more flexible a classifier is.

The main idea of SVMs is to find the optimal hyperplane that separates objects belonging to two classes in a Feature Space X (|X| = n), maximising the margin between these classes. The Feature Space is considered to be a Hilbert Space defined by a dot product, known as the Kernel function, k(x, x') := (\phi(x) \cdot \phi(x')), where \phi is the mapping that translates an input vector into the Feature Space. The objective of the SVM algorithm is to find the optimal hyperplane w^T x + b defined by the following optimisation problem:

\min_{w, \xi, b} \ \frac{1}{2} \sum_{i=1}^{n} w_i^2 + C \sum_{i=1}^{N} \xi_i
\text{subject to } y_i \left( w^T x_i + b \right) \geq 1 - \xi_i, \quad \forall i \in \{1, \ldots, N\}
\xi_i \geq 0, \quad \forall i \in \{1, \ldots, N\} \qquad (3.5)

The objective function includes the training errors \xi_i while obtaining the maximum margin hyperplane, with the trade-off adjusted by the parameter C. Its dual formulation is defined by the following expression, known as the Wolfe dual formulation:


\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)
\text{subject to } \alpha_i \geq 0, \quad \forall i \in \{1, \ldots, N\}
\sum_{i=1}^{N} \alpha_i y_i = 0 \qquad (3.6)

Finally, after determining the optimal parameters \alpha, the continuous output used to classify the labels is represented by:

g(x_j) = \sum_{i=1}^{N} \alpha_i y_i \, k(x_i, x_j) + b \qquad (3.7)

The resulting classification for the t-th element is given by h(x_t) = \operatorname{sign}(g(x_t)).
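Once the dual variables \alpha and the bias b have been obtained, for instance with an off-the-shelf quadratic programming solver, the decision rule of equation 3.7 can be evaluated as in the sketch below; the RBF kernel and its gamma value are illustrative assumptions, not prescribed by the chapter.

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.1):
    """Gaussian kernel k(x, z); gamma is an assumed hyper-parameter."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def svm_decision(x_new, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Evaluate g(x) of equation 3.7 and return sign(g(x)) as the predicted label."""
    g = sum(a * y * kernel(sv, x_new)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return np.sign(g)
```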

3.3.1.2 Naïve Bayes

Several web mining applications based on this model have been proposed [49, 59]. Here, the main objective is to determine the probability that an object described by the feature set x_i has a given label y_i. In this particular case, it is treated as a binary classification problem where y_i \in \{+1, -1\}. Using Bayes' theorem, and considering every feature in X_i to be independent of the others, the required probability can be written as follows:

P(y_i \mid x_i) = \frac{P(y_i) \cdot P(x_i \mid y_i)}{P(x_i)} = \frac{P(y_i)}{P(x_i)} \cdot \prod_{j=1}^{n} P(x_{ij} \mid y_i) \qquad (3.8)

It is easy to show that, by considering the ratio of the conditional probabilities of the two labels, the previous equation can be extended to the following expression:

\ln\!\left( \frac{P(y_i = +1 \mid x_i)}{P(y_i = -1 \mid x_i)} \right) = \ln\!\left( \frac{P(y_i = +1)}{P(y_i = -1)} \right) + \sum_{j=1}^{n} \ln\!\left( \frac{P(x_{ij} \mid y_i = +1)}{P(x_{ij} \mid y_i = -1)} \right) \qquad (3.9)

The final decision is then given by the following cases, subject to a threshold \tau:

h(x_i) = \begin{cases} +1 & \text{if } \ln\!\left( \frac{P(y_i = +1 \mid x_i)}{P(y_i = -1 \mid x_i)} \right) > \tau \\ -1 & \text{otherwise} \end{cases} \qquad (3.10)

It is important to notice that the previous expression can be maintained by continuously updating the probabilities of the labels (y_i) given the features (x_i), which are estimated from the frequencies of the features present in the incoming stream of messages.
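The sketch below illustrates equations 3.8-3.10 for the special case of binary (0/1) features; the Laplace smoothing and the function names are assumptions added to keep the example runnable.

```python
import numpy as np

def train_naive_bayes(X, y):
    """Estimate the terms of equation 3.9 for binary features and labels in {+1, -1};
    assumes both classes occur in the training set, with Laplace smoothing added."""
    pos, neg = X[y == +1], X[y == -1]
    log_prior = np.log(len(pos) / len(neg))
    p_pos = (pos.sum(axis=0) + 1.0) / (len(pos) + 2.0)   # P(x_j = 1 | y = +1)
    p_neg = (neg.sum(axis=0) + 1.0) / (len(neg) + 2.0)   # P(x_j = 1 | y = -1)
    return log_prior, p_pos, p_neg

def classify(x, log_prior, p_pos, p_neg, tau=0.0):
    """Decision rule of equation 3.10 based on the log-odds of equation 3.9."""
    log_odds = log_prior + np.sum(np.where(x == 1,
                                           np.log(p_pos / p_neg),
                                           np.log((1.0 - p_pos) / (1.0 - p_neg))))
    return +1 if log_odds > tau else -1
```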


3.3.1.3 Artificial Neural Networks

Artificial Neural Networks (ANNs) are a mathematical model of the operation of biological neurons in the brain and, like biological neural structures, ANNs are usually organised in layers. In this context, a neuron is modelled as an activation function that receives stimuli from other neurons, represented by inputs with associated weights W^1 = (w^1_1, \ldots, w^1_L) and W^2 = (w^2_1, \ldots, w^2_L). In a single layer ANN, the objective is to minimise the empirical learning risk represented by an error function E, as stated in the following non-linear optimisation problem:

\min_{W^1, W^2} \ E = \sum_{i=1}^{N} \left( y_i - g_1\!\left( \sum_{l=1}^{L} \sum_{k=1}^{K} w^1_{l,k} \cdot g_2\!\left( \sum_{a=1}^{n} w^2_{a,l} \, x_a \right) \right) \right) \qquad (3.11)

Here, g_1 : L \to K and g_2 : X \to L are the transfer functions, N is the number of objects, L is the number of neurons in the hidden layer, K is the number of labels in the classification problem (for regression problems K = 1), and n is the number of features considered for the characterisation of objects. This minimisation problem is solved by non-linear optimisation algorithms; the best-known method for finding the sets of weights W^1 and W^2 is the back-propagation algorithm [61], in which convergence towards the minimum error is not guaranteed.

In general, multi-layer ANNs are mainly used for:

• Classification. Once trained, an ANN can be used as a feature vector classifier, for example to assign a web page type (blog, news, magazine, etc.) given web-content features. The input layer receives an n-dimensional vector with page characteristics and, in the output layer, the web site is assigned to the type whose corresponding neurons' output is closest to "1" rather than "0".

• Regression. Given a set of training examples, an ANN can be trained to represent an approximate function with which to model a situation, for example web site user demand given the previous historical behaviour of users. When presented with new input examples, the ANN provides the best prediction based on what was learned from the training set of examples.

3.3.2 Unsupervised Techniques

In practice, unsupervised learning has been widely used in the web mining context, as many of the patterns to be extracted from web data are approached from an unsupervised point of view [36, 54, 60, 64, 65]. Many web usage mining applications determine web user behaviour [14, 55] and identify both the web site keywords [57, 58] and web site key-objects [16].


3.3.2.1 k-Means

This algorithm divides a set of data into a predetermined number of clusters [24, 37]. The main idea is to assign each vector to one of a set of given cluster centroids and then to update the centroids given the previously established assignment. This procedure is repeated iteratively until a certain stopping criterion is fulfilled. The number of clusters to be found, k, is a required input of the k-means algorithm. A set of k vectors is selected from the original data as initial centroids, whose initial values are chosen randomly. The clustering process then executes an iterative algorithm whose objective is to minimise the overall sum of distances between the centroids and the objects, expressed as a stopping criterion in a heuristic algorithm.

It is important to notice that one of the main characteristics of the algorithm is the distance defined between objects, which must be relevant to the type of segmentation problem that needs to be solved.

Some clustering evaluation criteria have been introduced for determining the optimal number of clusters [12, 45]. One of the simplest is the Davies-Bouldin index [12] (see section 3.4.4). In this case, the main idea is to determine the optimal number of clusters using as a stopping rule the minimisation of the distance within every cluster and the maximisation of the distance between clusters. Likewise, the optimal k-means parameter-finding problem has been extensively discussed in [63].
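A minimal sketch of the iterative procedure, assuming the Euclidean distance and a random initialisation, is shown below; the interface is illustrative.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=None):
    """Plain k-means: repeat assignment of objects to the closest centroid and
    centroid updates until assignments stop changing or n_iter is reached."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignment = np.full(len(X), -1)
    for _ in range(n_iter):
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignment = distances.argmin(axis=1)
        if np.array_equal(new_assignment, assignment):
            break                                   # stopping criterion reached
        assignment = new_assignment
        for j in range(k):
            if np.any(assignment == j):
                centroids[j] = X[assignment == j].mean(axis=0)
    return centroids, assignment
```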

3.3.2.2 Kohonen Self Organizing Feature Maps

A Kohonen Self Organising Feature Map (SOFM) [33] is a vector quantisation process. It takes a set of high dimensional input vectors and maps them onto an ordered sequence. The SOFM maps from the input data space X (|X| = N) onto a regular two-dimensional array of nodes or neurons. The output lattice can be rectangular or hexagonal. Each neuron is an N-dimensional vector m_i whose components are the synaptic weights. By construction, all the neurons receive the same input at a given moment in time. This machine learning algorithm can be considered as a "non-linear projection of the probability density function of the high dimensional input data onto the bi-dimensional display" [33]. Let X_i \in X be an input data vector. The idea of the learning process is to present X_i to the network and, by using a metric, to determine the most similar neuron (the centre of excitation, or winner neuron).

3.3.2.3 Association Rules

The basic idea of association rules is to find significant correlations within a large data set. A typical example of this technique is the "purchasing analysis", which uses customer buying habits to discover associations among purchased items. In general, the discovered rules are formalised as if <X> then <Y> expressions.

The formal statement for the described situation is proposed in [1, 2] as follows. Let I = \{i_1, \ldots, i_m\} be a set of items and T = \{t_1, \ldots, t_n\} a set of transactions, where each t_i contains a group of items from I. Let X \subseteq I be a group of items from I; a transaction t_i is said to contain X if X \subseteq t_i. An association rule is an implication of the form X \Rightarrow Y, where Y \subseteq I and X \cap Y = \emptyset.

The rule X \Rightarrow Y holds for the transaction set T with support \alpha and confidence \beta: \alpha is the percentage of transactions in T that contain X \cup Y, and \beta is the percentage of transactions containing X that also contain Y. This process can be generalised with multidimensional association rules, whose general form is X_1, \ldots, X_n \Rightarrow Y.
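The support and confidence of a candidate rule can be computed as in the following sketch; the discovery of the candidate rules themselves (e.g., by the Apriori algorithm) is not shown, and the interface is an assumption of the example.

```python
def support_confidence(transactions, X, Y):
    """Support and confidence of the rule X => Y over a list of transactions,
    where each transaction is a set of items."""
    X, Y = set(X), set(Y)
    n = len(transactions)
    n_x = sum(1 for t in transactions if X <= t)
    n_xy = sum(1 for t in transactions if (X | Y) <= t)
    support = n_xy / n
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence
```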

3.3.3 Ensemble Meta-Algorithms

Ensemble methods have been used to improve results over single pattern extraction models. The main idea is to use a large number of models, with carefully chosen parameters, and combine them to boost the performance of a single predictor. Ensemble meta-algorithms are generic, and most data mining models can be used inside an ensemble: ANNs, SVMs, naïve Bayes and decision trees, amongst others. Majority voting, boosting, and bagging methods are reviewed below.

3.3.3.1 Majority Voting

This ensemble method is easy to deploy, scale and use in a wide range of applications, and is most often associated with supervised classification algorithms. Consider m different models generated over a training set, and suppose the simple case of a binary classification problem. After the m models are determined, new objects are presented to be classified. Each object is evaluated by every model and the results are saved. Afterwards, the predicted label with the highest frequency is taken as the most likely hypothesis for the binary classification problem.
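A minimal sketch of this scheme follows, assuming each trained model exposes a predict method returning a class label; the interface is illustrative.

```python
from collections import Counter

def majority_vote(models, x):
    """Combine m trained classifiers by majority voting over their predictions."""
    votes = Counter(model.predict(x) for model in models)
    return votes.most_common(1)[0][0]
```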

3.3.3.2 Bagging

Bagging (bootstrap aggregation) is an ensemble meta-learning method used to improve the predictive accuracy of a given model. The main idea is that, given a training set of N objects, the method generates m sets of objects B_j, j \in \{1, \ldots, m\}, where |B_j| < N. These sets are generated by bootstrap re-sampling with replacement. Each of the new sets B_j is used to train one of m models. The results of these models can be combined as a simple average, or used in a majority voting evaluation scheme.


3.3.3.3 Boosting

This meta-learning technique, introduced by Schapire in [47] and extended to the AdaBoost meta-algorithm presented by Freund & Schapire in [18, 19], uses a different re-sampling method than bagging for classification problems. Here, each subset used to evaluate the different models is determined initially by assigning a constant probability of 1/N to each of the N instances. This probability is updated over time according to the performance of the models. The technique introduces the concept of weak learners, whose classification accuracy is required to be only slightly better than 50%, yet which still capture underlying patterns in the data. In a nutshell, the idea is that the right set of weak learners will create a stronger learner.

3.4 Web Mining model assessment

In previous sections, several web mining techniques were introduced. These learning methods enable the building of hypotheses and models from a set of data. However, in most cases, an accurate measurement of hypothesis quality is necessary. In fact, there are several methods that measure a model's quality from the evidence. In the following sections some of these measures are reviewed.

3.4.1 Evaluation of classifiers

Given a set of data S determined by the objects \{X_i\}_{i=1}^{N}, a hypothesis function h can be determined making use of a supervised learner. A simple measure to evaluate the quality of the classifier h is the sampling error defined in equation 3.12.

\operatorname{Error}_S(h) = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left( f(X_i) \neq h(X_i) \right) \qquad (3.12)

where \delta is a boolean function for which \delta(\text{true}) = 1 and \delta(\text{false}) = 0. It is important to evaluate the trained algorithm on a data set that was not used for training, since a biased evaluation and over-fitting could lead to over-rated predictive models. For a fair evaluation of performance measures, different approaches have been developed. The best known of these evaluation methods are the Hold-out (train/test) validation and k-Cross-Validation, briefly reviewed as follows:

• Hold Out: To evaluate without over-fitting the model, an effective approach consists of separating the set S into two subsets: a Training Set used to define the hypothesis function h and a Test Set used to calculate the sampling error of the hypothesis.


• Cross Validation: k-Cross-Validation consists of dividing the evidence into k disjoint subsets of similar size. Then, the hypothesis is induced from the union of k−1 subsets, and the remaining subset is used to calculate the sampling error. This procedure is repeated k times, using a different subset each time to estimate the sampling error. The final sampling error is the mean of the k partial sampling errors; likewise, the final results are the mean over the experiments with the k independent subsets (see the sketch below).
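The following is a minimal sketch of the k-Cross-Validation procedure, assuming `train_fn` is a user-supplied callback that trains the chosen algorithm and returns a hypothesis function h; the names are illustrative.

```python
import numpy as np

def k_fold_error(X, y, train_fn, k=10, seed=None):
    """k-Cross-Validation: train on k-1 folds, measure the sampling error on the
    held-out fold (equation 3.12) and average the k partial errors."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        h = train_fn(X[train_idx], y[train_idx])          # returns a hypothesis h(.)
        predictions = np.array([h(x) for x in X[test_idx]])
        errors.append(float(np.mean(predictions != y[test_idx])))
    return float(np.mean(errors))
```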

3.4.1.1 Evaluation of Hypothesis based on Cost

Cost-sensitive models have been used in several web mining applications. In binary classification tasks all these models are based on the confusion matrix, generated over the different outcomes that a model can have. The main idea is to determine the model that minimises the overall cost, or maximises the overall gain, of the evaluated model for a given application. In the confusion matrix there are four possible outcomes: correctly classified type I elements or True Positives (TP), correctly classified type II elements or True Negatives (TN), wrongly classified type II elements or False Positives (FP) and wrongly classified type I elements or False Negatives (FN). The following evaluation criteria are commonly considered in machine learning and data mining applications; a small sketch computing these measures from the confusion matrix follows the list.

• The False Positive Rate (FP-Rate) and the False Negative Rate (FN-Rate), the proportions of wrongly classified type II and type I elements respectively.

\text{FP-Rate} = \frac{FP}{FP + TN} \qquad (3.13)

\text{FN-Rate} = \frac{FN}{FN + TP} \qquad (3.14)

• Precision, which states the degree to which elements identified as type I actually belong to the type I class. It can be interpreted as the classifier's safety.

\text{Precision} = \frac{TP}{TP + FP} \qquad (3.15)

• Recall, which states the percentage of type I elements that the classifier manages to classify correctly. It can be interpreted as the classifier's effectiveness.

\text{Recall} = \frac{TP}{TP + FN} \qquad (3.16)

• F-measure, the harmonic mean of precision and recall.

\text{F-measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (3.17)

• Accuracy, the overall percentage of correctly classified elements.


\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3.18)

• Area Under the Curve (AUC), defined as the area under the ROC curve for the evaluation step.

\text{AUC} = \int_0^1 \frac{TP}{P} \, d\!\left( \frac{FP}{N} \right) = \frac{1}{P \cdot N} \int_0^1 TP \, dFP \qquad (3.19)
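The measures of equations 3.13-3.18 can be computed from the confusion matrix counts as in the following sketch; the dictionary-based interface is an assumption of the example.

```python
def classification_measures(tp, tn, fp, fn):
    """Cost-based evaluation measures derived from the confusion matrix
    (equations 3.13-3.18)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
        "fn_rate": fn / (fn + tp) if fn + tp else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": (2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }
```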

3.4.2 Evaluation of Regression models

Given an objective theoretical function f, a regression model h and a set of data S formed by N examples, one of the most commonly used evaluation measures is the Mean Squared Error (MSE):

\text{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( h(X_i) - f(X_i) \right)^2 \qquad (3.20)

However, this evaluation criterion does not allow one to read off the real difference between the estimator and the true value of the quantity being estimated. In fact, the MSE has the same unit of measurement as the square of the quantity being estimated. In analogy to the standard deviation, taking the square root of the MSE fixes the units problem and yields the Root Mean Squared Error (RMSE):

\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( h(X_i) - f(X_i) \right)^2 } \qquad (3.21)

Squaring the differences tends to inflate the contribution of the most extreme errors, which is an issue for both MSE and RMSE. Using the Mean Absolute Error (MAE) limits this problem:

\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| h(X_i) - f(X_i) \right| \qquad (3.22)

Nevertheless, MAE estimates the mean of the errors while ignoring their signs. In this sense, relative errors can be used: by normalising the total squared error by the total squared error of the simple predictor \bar{f}, the Relative Squared Error (RSE) is obtained:

\text{RSE} = \frac{ \sum_{i=1}^{N} \left( h(X_i) - f(X_i) \right)^2 }{ \sum_{i=1}^{N} \left( f(X_i) - \bar{f} \right)^2 }, \quad \text{where } \bar{f} = \frac{1}{N} \sum_{i=1}^{N} f(X_i) \qquad (3.23)


This last measure can be modified in the same way as the MSE: applying a square root or using absolute errors over the RSE yields further measures.
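The regression measures of equations 3.20-3.23 can be computed as in the following sketch; the array-based interface is an assumption of the example.

```python
import numpy as np

def regression_errors(predicted, actual):
    """Regression evaluation measures of equations 3.20-3.23."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    residuals = predicted - actual
    mse = float(np.mean(residuals ** 2))
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(residuals))),
        "rse": float(np.sum(residuals ** 2) / np.sum((actual - actual.mean()) ** 2)),
    }
```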

3.4.3 MDL Principle

The Minimum Description Length (MDL) principle for statistical and machine learning model selection and statistical inference is based on one simple idea: "the best way to capture regular features in data is to construct a model in a certain class which permits the shortest description of the data and the model itself". It can be understood as a formalisation of Occam's Razor, which states that the best hypothesis for an underlying pattern is the simplest one.

Formally, the MDL principle recommends the selection of the hypothesis h that minimises

C(h, D) = K(h) + K(D \mid h) \qquad (3.24)

where C(h, D) is the complexity in bits of the hypothesis h, K(h) is the number of bits needed to describe the hypothesis, and K(D | h) is the number of bits required to describe the evidence D, which includes the hypothesis' exceptions.

3.4.4 Evaluation of Clustering models

Unsupervised models are difficult to evaluate because there is no class or numeric value from which an error can be estimated. A simple measure, often used to identify how compact the determined clusters are, is presented in equation 3.25.

\text{SSE}(M) = \sum_{i=1}^{N} \sum_{k=1}^{K} \| X_i - c_k \|^2 \qquad (3.25)

where M is the model formed by K clusters with centroids c_1, \ldots, c_K, evaluated over the data set S with N objects.

Another simple approach, often related to the analytical interpretation of the obtained results, consists of using the distance between clusters. This distance can be defined in different ways:

• Mean Distance: the mean of the distances between all pairs of examples drawn from the two clusters.

• Nearest Neighbour Distance: the distance between the nearest neighbours of two clusters, i.e., their closest examples.

• Farthest Neighbour Distance: the distance between the farthest neighbours of two clusters, i.e., their most distant examples.


The MDL principle can also be used as an evaluation measure. For example, if, using a particular model, the set E formed by n examples is clustered into k groups, this model can be used as a codification of the set E. Hence, the model which provides the shortest and most representative codification of E will be the most effective.

In k-means, a frequently used index for determining the number of clusters is the Davies-Bouldin index [12], defined as

\text{Davies-Bouldin} = \frac{1}{K} \sum_{i=1}^{K} \max_{i \neq j} \left[ \frac{S_k(Q_i) + S_k(Q_j)}{S(Q_i, Q_j)} \right] \qquad (3.26)

where K is the number of clusters, S_k refers to the mean distance between the objects of a cluster and its centroid, and S(Q_i, Q_j) is the distance between the centroids. According to this index, the best clustering parameter K is the one yielding the smallest index value.
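A minimal sketch of this index, assuming Euclidean distances and that the cluster assignment and centroids come from a run of k-means such as the one sketched in section 3.3.2.1, is shown below.

```python
import numpy as np

def davies_bouldin(X, assignment, centroids):
    """Davies-Bouldin index (equation 3.26): average, over clusters, of the worst
    ratio between within-cluster scatter and the distance between centroids."""
    K = len(centroids)
    scatter = [np.mean(np.linalg.norm(X[assignment == i] - centroids[i], axis=1))
               for i in range(K)]
    index = 0.0
    for i in range(K):
        index += max((scatter[i] + scatter[j]) /
                     np.linalg.norm(centroids[i] - centroids[j])
                     for j in range(K) if j != i)
    return index / K
```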

3.4.5 Evaluating Association Rules

Two measures are commonly used to evaluate rule quality: support and confidence. The support of a rule over a set of feature values is the number (or percentage) of training database instances that contain the values mentioned in the rule, while the confidence is the number (or percentage) of instances containing the antecedent for which the rule is satisfied.

The traditional task of association rule mining is to find all rules with high support and high confidence. Nevertheless, in some applications one is interested in rules with high confidence but low support. Likewise, some rules with high support but low confidence can be useful over generalised data.

3.4.6 Other evaluation criteria

There are several other criteria for evaluating data mining models, some of which provide interesting insights. Some of them are considered subjective and lack usefulness in scientific research, but they are valuable in real-life applications. The most recurrent criteria are:

• Interest: There are models whose assessment measures are excellent but whose extracted knowledge is irrelevant; in other words, the model is not interesting for data mining. For example, "if the web user is a mother then she is female". Interest is related to the prior knowledge of the data miner: if the model adds to this knowledge, then the model is interesting.

• Applicability: A model can be precise with respect to the training data but have poor applicability. For example: "the probability of a web user becoming a client depends on his age and gender". However, if there is no proactive strategy in place to capture this web user data (user/password registration, cookies, etc.), then the model is not applicable.

3.5 A Pattern Webhouse Application

In this section, a Pattern Webhouse is introduced, making use of all the characteristics and evaluation measures previously presented. For this purpose, the Data Webhouse architecture is presented first.

3.5.1 Data Webhouse overview

The issue of extracting information affects more than the Web environment. Indeed, it is a common problem for any company seeking information from its daily operational systems. At first, companies only created computer systems to register data in order to support daily operations. Subsequently, due to market competitiveness, reliable information for decision making became necessary [27]. However, operational systems are not designed for extracting information. In fact, doing so requires complex queries whose processing affects system performance and the company's operations. This situation ultimately costs time, money and resources in the IT area that develops the required reports, while decision makers do not have enough information or material to make their decisions [30].

To handle the aforementioned situation, the Data Warehouse, or information repository, appeared to supply these information needs. As described by Inmon in [27], a Data Warehouse is a "subject oriented, integrated, non-volatile, and time variant collection of data to support management's decisions". It is subject oriented because the data provides non-detailed information about business aspects. It is integrated because it gathers data from different sources and arranges them under a coherent structure. The data is non-volatile because it is stable and reliable: more data can be added, but the data already existing cannot be removed or updated. Furthermore, it is time variant because all of the data in the repository is associated with a particular period of time. Kimball in [30] provides a simpler definition: an information repository is simply a copy of the transactional data structured for querying and analysis.

It is worth mentioning that the warehouse information consists of consolidated data that is ready to be queried. In other words, a prior computational process, run on a schedule of low demand on the systems9, converted the operational data into information, which was then stored in the repository [27].

9 The loads from the operational systems into the repository are usually carried out at night.


The Data Webhouse is the application of an architecture that has been tested for a decade, the Data Warehouse, to web data. It was created to serve the information needs of businesses with a total or partial presence on the Web. Indeed, the proliferation of Web-based businesses such as e-business, e-commerce, e-media and e-markets showed how important it was to know more about the Web and its users, all of this under marketing strategies aimed at creating loyal clients.

Applying the Data Warehouse to web data brought multiple challenges [29, 54]:

• Web data grows exponentially through time because web logs store requests every second. Web users change their behaviour over time and web pages are upgraded. Consequently, more and more data has to be added to the repository.

• The expected response times are much smaller than those of a traditional data warehouse. They should be similar to the time taken to load a web page, which is less than 5 seconds.

Thanks to the Data Webhouse, the web miner, web master or any other person interested in statistics about a web site can carry out multidimensional analysis on web data. Indeed, these users see the world in several dimensions, that is, they associate a fact with a group of causal factors [56]. For example, the sales level of a particular product will depend on the price, the period of time and the branch in which it was sold. Consequently, repositories are designed to respond to information requirements based on how final users perceive the business aspects. In other words, the information is stored according to the dimensions relevant for a final user, in a way that facilitates browsing the repository in order to find answers to their enquiries [27].

For example, a commercial manager interested in the number of visits to the company web site can look for this information for each web page, at each hour of each day of the week. According to multidimensional modelling, the fact under scrutiny is the visits (the data and its values), and the dimensions associated with it are the web page name, the hour and the weekday.

3.5.2 About PMML

The Predictive Model Markup Language (PMML) [20, 43] has been developed over recent years to support the exchange of predictive analytic models and the interoperability of analytical infrastructure. Nowadays, PMML has become a leading standard whose wide set of features has allowed a large variety of applications to be built for specific industrial uses. The deployment of predictive analytics in the PMML standard has been continuously supported by leading business intelligence providers [67], so its adoption and growth should be sustainable in the future.

PMML is based on XML standards, and the description of data mining models in this semi-structured language has led to a fluent exchange of models in an open format, increasing interoperability between consumers and producers of information and analytical applications. Furthermore, data transformations and preprocessing can be described using PMML, so that output features can be used as components served to other PMML applications that represent data mining models.

Today it is straightforward for technological infrastructure components and services using PMML to run on different analytical systems, such as cloud computing architectures. A modeller might use a statistical application to build a model, while the scoring process is carried out in a cloud, which could also be used to produce features from the input data for the statistical model described by the modeller. All this relies on an open standard which is likely to be supported by most data mining and statistical software tools in the future.
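As an illustration of how compact such a description can be, the following sketch uses Python's standard xml.etree.ElementTree module to assemble a minimal PMML-style document for a toy linear regression model (y = 1.0 + 2.0·x). This is only a schematic example: the element and attribute names reflect a common reading of the PMML 4.x schema and should be checked against the official specification at dmg.org before being relied upon.

```python
# Schematic, minimal PMML-style document for the linear model y = 1.0 + 2.0*x.
# Element names are assumptions based on the PMML 4.x schema; verify against
# the official specification (dmg.org) before relying on them.
import xml.etree.ElementTree as ET

pmml = ET.Element("PMML", version="4.2", xmlns="http://www.dmg.org/PMML-4_2")
ET.SubElement(pmml, "Header", description="Toy regression model")

dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="x", optype="continuous", dataType="double")
ET.SubElement(dd, "DataField", name="y", optype="continuous", dataType="double")

model = ET.SubElement(pmml, "RegressionModel",
                      modelName="toy_regression", functionName="regression")
ms = ET.SubElement(model, "MiningSchema")
ET.SubElement(ms, "MiningField", name="x")
ET.SubElement(ms, "MiningField", name="y", usageType="target")
rt = ET.SubElement(model, "RegressionTable", intercept="1.0")
ET.SubElement(rt, "NumericPredictor", name="x", coefficient="2.0")

print(ET.tostring(pmml, encoding="unicode"))
```

A document of this kind can be produced by one tool and scored by another, which is precisely the interoperability argument made above.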

3.5.3 Application

Web patterns can be understood as knowledge artefacts that enable one to represent, in a compact and semantically rich way, large quantities of heterogeneous raw web data. In this sense, since web data may be very heterogeneous, several kinds of patterns exist that can represent hidden knowledge: clusters, association rules, regressions, classifiers, etc. Due to the specific characteristics of patterns, a specific representation is required for pattern management in order to model, store, retrieve and manipulate patterns in an efficient and effective way.

Given the advantages of PMML for describing data mining models, a repository was proposed that works together with PMML files and describes the different web mining studies. By making use of a Data Warehouse architecture and PMML technology, it is possible to build a repository which represents and stores the knowledge extracted from web data. Generally, Data Warehouse and Data Webhouse repositories are implemented in relational databases using a star schema. Nevertheless, different measures exist for every web mining technique, so more than one fact table is required and a fact constellation schema is a more appropriate design. In this sense, the proposed repository is composed of four fact tables and five dimensions.

3.5.3.1 Fact Tables

As shown in figure 3.2, the model includes four fact tables that store measures for each kind of web mining technique. They share seven metrics which are common to every algorithm.

• Training time: the time, in seconds, taken to train the algorithm.

• Prediction time: the time, in seconds, taken by the algorithm to predict a class or numeric value. For some techniques this value will be near zero.

• Start date: the first date on which the web data sources were collected.

• End date: the last date on which the web data sources were collected.

• PMML URL: the path to the PMML file that describes the web mining model.

• MDL Hypothesis: the number of bits that describe the hypothesis, i.e. the web mining pattern.

• MDL Exceptions: the number of bits that describe the exceptions to the hypothesis or web mining pattern.

Likewise, each fact table has its own particular metrics, which are assessment measures whose purpose is to compare different web mining techniques.

1. Classification Fact: includes the measures needed to assess classifiers. Firstly, the precision-based measures: the sampling method used (Training/Test, Cross-Validation or Bootstrapping), the estimated sampling error and the confidence interval for this last measure. It also stores cost-based measures such as: True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN), FP Rate, FN Rate, Precision, Recall, F-measure, Accuracy and Area Under the Curve (AUC); how several of these are derived from the confusion-matrix counts is illustrated in the sketch after this list.

2. Regression Fact: includes the classic quality measures for regression models, for example: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Relative Squared Error (RSE).

3. Clustering Fact: different measures are stored in this table. If the clustering model is probability-based, an entropy measure (S_L) is saved. When the model does not provide a probability function, this table includes measures such as: sum of squared errors per cluster, mean distance, nearest-neighbour distance and farthest-neighbour distance.

4. AssociationRules Fact: the relevance of an association rule is given by its support and confidence levels. Hence, when a web mining study aims to extract association rules, the number of rules and the number of relevant rules are stored. Likewise, through a table relationship, every extracted rule is stored in a separate table called Rules; specifically, the rule expressed as text together with its associated support and confidence levels.
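As referenced in item 1 above, several of the cost-based classification measures can be computed directly from the confusion-matrix counts. The following is a minimal sketch; the function name and the example counts are invented for illustration only.

```python
# Minimal sketch: cost-based measures of the Classification Fact table derived
# from confusion-matrix counts. The counts below are illustrative, not results
# from any real web mining study.
def classification_measures(tp: int, tn: int, fp: int, fn: int) -> dict:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0          # also the TP rate
    return {
        "fp_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "fn_rate": fn / (fn + tp) if (fn + tp) else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": (2 * precision * recall / (precision + recall)
                      if (precision + recall) else 0.0),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

print(classification_measures(tp=80, tn=90, fp=10, fn=20))
```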

3.5.3.2 Dimensions

Every measure in the fact tables is characterised by attributes of five different dimensions.

1. Time: this dimension enables one to perform trend analyses of the metrics included in the fact tables. Every web mining study is characterised by the date when the algorithm was run, specifically the year, month, day and instant of time.

2. Study: every web mining study has an objective, for which several algorithms can be used. For example, to study web user behaviour, two clustering techniques can be used: SOFM and K-means. This dimension stores the subject and description of the study.

3. Knowledge: web patterns often provide interesting knowledge about the structure, usage and content of a web site. This knowledge is described in text and stored in this dimension. Subjective measures such as the Interest and Applicability of the algorithm are also stored.

4. Weblog Source: sometimes a study will use web logs as its data source. In this case, this dimension characterises the source and the associated preprocessing, particularly: the number of web log entries, the number of relevant web log entries (web logs contain irrelevant data), the sessionization method used, the number of identified sessions, the number of sessions relevant to the study, and finally the minimum and maximum length of the sessions used.

5. Webpage Source: this dimension characterises web pages as a data source together with their preprocessing. In particular, it stores: the number of web pages collected, the number of web pages relevant to the study, and the number of websites and relevant websites crawled. Regarding the textual information, the total number of words, the number of stop-words and the mean number of words per page are stored. Finally, the vectorial representation used (for example, the Vector Space Model or the Boolean Retrieval Model) is also stored.
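To give a flavour of how a fact constellation of this kind can be declared, the following sketch, again using Python's standard sqlite3 module, creates two of the four fact tables sharing two of the dimensions described above. All table and column names are illustrative assumptions and do not reproduce the exact schema of figure 3.2.

```python
# Schematic fact constellation (assumed names, not the exact schema of
# figure 3.2): two fact tables sharing the Time and Study dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_time  (time_id  INTEGER PRIMARY KEY, year INT, month INT, day INT);
CREATE TABLE dim_study (study_id INTEGER PRIMARY KEY, subject TEXT, description TEXT);

-- The shared metrics (training time, PMML URL, ...) appear in every fact table.
CREATE TABLE classification_fact (
    time_id  INT REFERENCES dim_time(time_id),
    study_id INT REFERENCES dim_study(study_id),
    training_time_s REAL, pmml_url TEXT,
    accuracy REAL, f_measure REAL, auc REAL
);
CREATE TABLE clustering_fact (
    time_id  INT REFERENCES dim_time(time_id),
    study_id INT REFERENCES dim_study(study_id),
    training_time_s REAL, pmml_url TEXT,
    sse_per_cluster REAL, mean_distance REAL
);
""")
print("fact constellation schema created")
```

The technique-specific measures live in their own fact tables, while the shared dimensions let analysts compare studies across techniques.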

3.6 Summary

This chapter introduced a specific repository to store meta-data about patterns discovered through web mining techniques. In fact, depending on the web mining technique, its parameters, the web data quality and preprocessing, the feature selection and the nature of the web mining study, different patterns can be discovered from web log entries and web page content.

Making use of a Data Warehouse architecture, like the commonly known Data Webhouses, it is possible to store relevant information about usage, structure and content patterns, which enables the manipulation of the hidden knowledge they contain.

Firstly, classic web mining techniques are presented, particularly: association rules, classification, regression and clustering models. Likewise, the main feature selection techniques applied to web data are introduced.

Later, different assessment measures are presented, specifically evaluation criteria for the four kinds of web mining models. The idea is to be able to compare two studies that use the same web mining technique. Also, subjective measures such as Interest and Applicability are introduced; these last measures capture attributes of patterns that the numeric measures cannot.

Finally, the repository is defined as a fact constellation schema. Due to the high complexity and nature of patterns and web mining models, it is necessary to divide the information into four fact tables, one for each kind of technique. In this way, the fact tables are related to five dimensions: Time, Study, Knowledge, Weblog Source and Webpage Source. On the other hand, the patterns' content and their parameters are stored in PMML files, an XML-based standard for exchanging data mining models. This technology allows one to reduce the complexity of the repository, particularly the granularity of the information. Hence, the proposed repository and the PMML files work together effectively in managing the knowledge contained in patterns discovered from web data.

Fig. 3.2 Web Pattern Webhouse


Acknowledgements This work was partially supported by the FONDEF project DO8I-1015, and the Web Intelligence Research Group (wi.dii.uchile.cl) is gratefully acknowledged.

References

1. Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. SIGMOD Rec., 22(2):207–216, 1993.

2. Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. In VLDB ’94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 487–499, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

3. William A. Arnold and John S. Bowie. Artificial Intelligence: A Personal Commonsense Journey. Englewood Cliffs, NJ: Prentice Hall, 1985.

4. Avrim L. Blum and Pat Langley. Selection of relevant features and examples in machine learning. Artif. Intell., 97(1-2):245–271, 1997.

5. Anselm Blumer, A. Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(4):929–965, 1989.

6. Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA, 1992. ACM.

7. Barbara Catania and Anna Maddalena. Hershey, PA, USA.

8. Barbara Catania, Anna Maddalena, and Maurizio Mazza. Psycho: A prototype system for pattern management. In Klemens Bohm, Christian S. Jensen, Laura M. Haas, Martin L. Kersten, Per-Åke Larson, and Beng Chin Ooi, editors, VLDB, pages 1346–1349. ACM, 2005.

9. Siriporn Chimphlee, Naomie Salim, Mohd Salihin Bin Ngadiman, Witcha Chimphlee, and Surat Srinoy. Independent component analysis and rough fuzzy based approach to web usage mining. In AIA’06: Proceedings of the 24th IASTED international conference on Artificial intelligence and applications, pages 422–427, Anaheim, CA, USA, 2006. ACTA Press.

10. R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1:5–32, 1999.

11. T. Davenport and L. Prusak. Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Cambridge, MA, 1997.

12. D.L. Davies and D.W. Bouldin. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:224–227, 1979.

13. Randall Davis, Howard Shrobe, and Peter Szolovits. What is a knowledge representation? AI Magazine, 14(1):17–33, 1993.

14. Robert F. Dell, Pablo E. Roman, and Juan D. Velasquez. Web user session reconstruction using integer programming. In Web Intelligence, pages 385–388. IEEE, 2008.

15. Pedro Domingos, Michael Pazzani, and Gregory Provan. On the optimality of the simple Bayesian classifier under zero-one loss. In Machine Learning, pages 103–130, 1997.

16. Luis E. Dujovne and Juan D. Velasquez. Design and implementation of a methodology for identifying website keyobjects. In Juan D. Velasquez, Sebastian A. Rios, Robert J. Howlett, and Lakhmi C. Jain, editors, KES (1), volume 5711 of Lecture Notes in Computer Science, pages 301–308. Springer, 2009.

17. Francois Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5:1531–1555, 2004.

18. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT ’95: Proceedings of the Second European Conference on Computational Learning Theory, pages 23–37, London, UK, 1995. Springer-Verlag.

19. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In ICML, pages 148–156, 1996.


20. Robert L. Grossman. What is analytic infrastructure and why should you care? SIGKDD Explor. Newsl., 11(1):5–9, 2009.

21. Robert L. Grossman, Mark F. Hornick, and Gregor Meyer. Data mining standards initiatives. Commun. ACM, 45(8):59–61, 2002.

22. Isabelle Guyon and Andre Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.

23. Jiawei Han, Yongjian Fu, Wei Wang, Krzysztof Koperski, and Osmar Zaiane. DMQL: A data mining query language for relational databases. 1996.

24. J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28:100–108, 1979.

25. Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics). Springer-Verlag, 2nd ed. 2009, corr. 3rd printing edition, September 2009.

26. Tomasz Imielinski and Aashu Virmani. MSQL: A query language for database mining. Data Min. Knowl. Discov., 3(4):373–408, 1999.

27. W. H. Inmon. Building the Data Warehouse, 4th Edition. Wiley Publishing, 2005.

28. John F. Elder IV, Francoise Fogelman-Soulie, Peter A. Flach, and Mohammed Javeed Zaki, editors. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Paris, France, June 28 - July 1, 2009. ACM, 2009.

29. R. Kimball and R. Merx. The Data Webhouse Toolkit. Wiley Computer Publisher, 2000.

30. Ralph Kimball and Margy Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling (Second Edition). Wiley, 2002.

31. Mieczyslaw A. Klopotek, Slawomir T. Wierzchon, and Krzysztof Trojanowski. Intelligent Information Processing and Web Mining: Proceedings of the International IIS: IIPWM’06 Conference held in Ustron, Poland, June 19-22, 2006 (Advances in Soft Computing). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

32. Ron Kohavi and George H. John. Wrappers for feature subset selection. Artif. Intell., 97(1-2):273–324, 1997.

33. T. Kohonen, M. R. Schroeder, and T. S. Huang, editors. Self-Organizing Maps. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2001.

34. Jan Larsen, Lars Kai Hansen, Anna Szymkowiak Have, Torben Christiansen, and Thomas Kolenda. Webmining: learning from the world wide web. Computational Statistics & Data Analysis, 38(4):517–532, February 2002.

35. B. Liu. Web Data Mining: Exploring Hyperlinks, Content and Usage Data. Springer, first edition, 2007.

36. Ping Luo, Fen Lin, Yuhong Xiong, Yong Zhao, and Zhongzhi Shi. Towards combining web classification and web information extraction: a case study. In IV et al. [28], pages 1235–1244.

37. J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In L. M. Le Cam and J. Neyman, editors, Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

38. Sebastian Maldonado and Richard Weber. A wrapper method for feature selection using support vector machines. Inf. Sci., 179(13):2208–2217, 2009.

39. Zdravko Markov and Daniel T. Larose. Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage. Wiley-Interscience, 2007.

40. Rosa Meo, Giuseppe Psaila, and Stefano Ceri. An extension to SQL for mining association rules. Data Min. Knowl. Discov., 2(2):195–224, 1998.

41. Tom M. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.

42. Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and Santosh Vempala. Latent semantic indexing: a probabilistic analysis. In PODS ’98: Proceedings of the seventeenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, pages 159–168, New York, NY, USA, 1998. ACM.

43. Rick Pechter. What’s PMML and what’s new in PMML 4.0? SIGKDD Explor. Newsl., 11(1):19–25, 2009.

44. Frank Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.


45. Peter Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20(1):53–65, 1987.

46. G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Commun. ACM, 18(11):613–620, 1975.

47. Robert E. Schapire. The strength of weak learnability. Mach. Learn., 5(2):197–227, 1990.

48. Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2001.

49. Fabrizio Sebastiani. Text categorization. In Alessandro Zanasi, editor, Text Mining and its Applications to Intelligence, CRM and Knowledge Management, pages 109–129. WIT Press, Southampton, UK, 2005.

50. Manolis Terrovitis, Panos Vassiliadis, Spiros Skiadopoulos, Elisa Bertino, Barbara Catania, and Anna Maddalena. Modeling and language support for the management of pattern-bases. Scientific and Statistical Database Management, International Conference on, 0:265, 2004.

51. Kari Torkkola. Feature extraction by non parametric mutual information maximization. Journal of Machine Learning Research, 3:1415–1438, 2003.

52. V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971.

53. Vladimir N. Vapnik. The Nature of Statistical Learning Theory (Information Science and Statistics). Springer, 1999.

54. J. D. Velasquez and V. Palade. Adaptive Web Sites: A Knowledge Extraction from Web Data Approach. IOS Press, 2008.

55. J.D. Velasquez and V. Palade. Building a knowledge base for implementing a web-based computerized recommendation system. International Journal of Artificial Intelligence Tools, 16(5):793–828, 2007.

56. J.D. Velasquez and Vasile Palade. A knowledge base for the maintenance of knowledge extracted from web data. Knowledge Based Systems, 20(3):238–248, 2007.

57. J.D. Velasquez, H. Yasuda, T. Aoki, and R. Weber. A new similarity measure to understand visitor behavior in a web site. IEICE Transactions on Information and Systems, Special Issues in Information Processing Technology for web utilization, E87-D(2):389–396, 2004.

58. Juan D. Velasquez, Sebastian A. Rios, Alejandro Bassi, Hiroshi Yasuda, and Terumasa Aoki. Towards the identification of keywords in the web site text content: A methodological approach. International Journal of Web Information Systems, 1(1):53–57, 2005.

59. Yong Wang, Julia Hodges, and Bo Tang. Classification of web documents using a naive Bayes method. In ICTAI ’03: Proceedings of the 15th IEEE International Conference on Tools with Artificial Intelligence, page 560, Washington, DC, USA, 2003. IEEE Computer Society.

60. Catherine W. Wen, Huan Liu, Wilson X. Wen, and Jeffery Zheng. A distributed hierarchical clustering system for web mining. In WAIM ’01: Proceedings of the Second International Conference on Advances in Web-Age Information Management, pages 103–113, London, UK, 2001. Springer-Verlag.

61. Paul John Werbos. The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. Wiley-Interscience, New York, NY, USA, 1994.

62. Lior Wolf and Amnon Shashua. Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach. J. Mach. Learn. Res., 6:1855–1887, 2005.

63. Junjie Wu, Hui Xiong, and Jian Chen. Adapting the right measures for k-means clustering. In IV et al. [28], pages 877–886.

64. Rui Xu and Donald Wunsch II. Survey of clustering algorithms. Neural Networks, IEEE Transactions on, 16(3):645–678, 2005.

65. Zhijun Yin, Rui Li, Qiaozhu Mei, and Jiawei Han. Exploring social tagging graph for web object classification. In IV et al. [28], pages 957–966.

66. T. Y. Young. The reliability of linear feature extractors. IEEE Transactions on Computers, 20(9):967–971, 1971.


67. Michael Zeller, Robert Grossman, Christoph Lingenfelder, Michael R. Berthold, Erik Marcade, Rick Pechter, Mike Hoskins, Wayne Thompson, and Rich Holada. Open standards and cloud computing: KDD-2009 panel report. In KDD ’09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 11–18, New York, NY, USA, 2009. ACM.

68. Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML ’07: Proceedings of the 24th international conference on Machine learning, pages 1151–1157, New York, NY, USA, 2007. ACM.