Degree project in Computer Science Second cycle Stockholm, Sweden 2013
Automatic web page categorization using text classification methods
Tobias Eriksson

Automatic web page categorization using text classification methods

Automatisk kategorisering av webbsidor med

textklassificeringsmetoder

Master's Degree Project in Computer Science

CSC School of Computer Science and

Communication

Author:

Tobias Eriksson

Author E-mail:

[email protected]

Supervisor:

Johan Boye

Examiner:

Jens Lagergren

Project provider:

Whaam AB

September 8, 2013

Abstract

Over the last few years, the Web has virtually exploded with an enormous number of web pages of different types of content. With the current size of the Web, it has become cumbersome to try to manually index and categorize all of its content. Evidently, there is a need for automatic web page categorization. This study explores the use of automatic text classification methods for the categorization of web pages. The results in this paper are shown to be comparable to results in other papers on automatic web page categorization, though not as good as results on pure text classification.

Referat

Automatisk kategorisering av webbsidor med

textklassificeringsmetoder

Over the last few years, the Web has exploded in size, with millions of web pages of widely varying content. The enormous size of the Web makes it unmanageable to manually index and categorize all of this content. Evidently, automatic methods for categorizing web pages are needed. This study examines how methods for automatic text classification can be used for the categorization of web pages. The results achieved in this report are comparable to results in other literature in the same field, but do not reach the level of results in studies on pure text classification.

Contents

1 Introduction
    1.1 Background
    1.2 Problem statement
    1.3 Scope
    1.4 Purpose and contribution
    1.5 Methodology
    1.6 Related work
    1.7 Description of this document

2 Background
    2.1 Machine learning
        2.1.1 Overview
        2.1.2 Types of machine learning
    2.2 Natural language processing
        2.2.1 Overview
        2.2.2 N-grams
        2.2.3 Tokenization
        2.2.4 Term frequency - inverse document frequency
        2.2.5 Stemming
        2.2.6 Bag-of-words model

3 Automatic text categorization
    3.1 Overview
    3.2 Naive Bayes
    3.3 Multinomial Naive Bayes
    3.4 TWCNB
    3.5 k-nearest-neighbour
    3.6 Support Vector Machine
        3.6.1 Linear SVM
        3.6.2 Nonlinear SVM
        3.6.3 Multi-class SVM
    3.7 N-Gram based classification

4 Automatic web page categorization
    4.1 Overview
    4.2 Data gathering
    4.3 Choice of classifier
    4.4 Training
    4.5 Feature selection and extraction
        4.5.1 Plain text conversion
        4.5.2 Tokenization and stemming
    4.6 Document representation
    4.7 Categorization of unlabeled documents

5 Implementation
    5.1 Overview
    5.2 The content framework package
    5.3 The ml framework package
    5.4 The nlp framework package
    5.5 The tools framework package
    5.6 The web framework package

6 Results
    6.1 Classifiers
    6.2 Data and experiments
    6.3 Evaluation metrics
    6.4 Per-classifier performance
        6.4.1 Micro-averaged F-measure
        6.4.2 Macro-averaged F-measure
        6.4.3 Raw classification accuracy

7 Conclusions
    7.1 Concerns about the data
    7.2 Suggestions for improvements in future work
        7.2.1 Data sets
        7.2.2 Improved feature selection
        7.2.3 Utilizing domain knowledge
        7.2.4 Utilizing web site structure
        7.2.5 Alternative approaches
    7.3 Overall conclusion

Bibliography
    Articles and publications
    Books
    Internet resources

List of Figures

1.1 The number of Internet hosts (1994-2012) according to the ISC Domain Survey [46].
1.2 Overview of the methodology used in this thesis.
3.1 Example of a maximum margin separating two classes of data points [52].
4.1 Training the classifier for web page categorization.
4.2 An example of a web page.
6.1 Mean Micro F-Measure Score with 95% confidence interval shown.
6.2 Mean Macro F-Measure Score with 95% confidence interval shown.
6.3 Mean Classification Accuracy with 95% confidence interval shown.

List of Tables

2.1 N-grams of different lengths for the word "APPLE"
6.1 Distribution of samples over the categories in the data.
6.2 Micro-averaged F1 score for the different classifiers, averaged over all instances.
6.3 Macro-averaged F1 score for the different classifiers, averaged over all instances.
6.4 Accuracy of the different classifiers in terms of correct classification rate, averaged over all instances.

Chapter 1

Introduction

1.1 Background

The World Wide Web has grown tremendously in the last few years, with at least 14 billion web pages indexed by search engines such as Google and Bing [40]. The Internet Systems Consortium publishes quarterly statistics on the number of reachable hosts on the Internet, shown in figure 1.1. The Web has content of virtually any category imaginable: entertainment, fashion, news, multimedia, science and technology (to name a few). The Open Directory project [42] tries to take on the daunting task of categorizing the Web. It is a human-edited index, with over 5 million web pages listed in over 1 million categories. While this is an impressive feat, it is clear that a human-edited index of the Web is not viable as the Web continues to grow.

Figure 1.1: The number of Internet hosts (1994-2012) according to the ISC Domain Survey [46].

One of the applications of machine learning is to automatically classify documents into predefined sets of categories, based on their textual content. This is known as text classification. Often these documents are news articles, research papers and the like. A web page is essentially a text document, and can often be considered to be of a certain category or subject. Therefore it seems probable that proven machine learning methods for text classification can be used to assign categories to web pages.

1.2 Problem statement

This master's thesis is an exploratory study, with the goal of answering the question: how can machine learning and natural language processing tools for text classification be used to perform automatic web page categorization? Machine learning and natural language processing methods and tools for text classification are studied, and a few methods selected. From this, a general method for automatic web page categorization is proposed, and evaluated by experimentation on real web pages.

1.3 Scope

Given that there are many methods for text classification, this thesis focuses on only a selected few. Document classification papers most often use documents in a single language (English is very common), and this is also the case in this thesis.

1.4 Purpose and contribution

The purpose of this thesis is to identify a method for web page classification whose performance, in terms of classification accuracy, is acceptable in practical applications. The author hopes that the thesis inspires others to identify machine learning methods beyond the ones studied for use with the categorization method proposed in this report, and to expand upon it.

1.5 Methodology

The approach to automatic web page categorization proposed in this report is based on contemporary methods for performing automatic text classification. This involves extracting the textual, natural language content from the web page, and encoding the document as a feature vector using natural language processing methods. This subsection gives a brief overview of the methodology used in this thesis. The methodology is summarized in figure 1.2.

Figure 1.2: Overview of the methodology used in this thesis.

First, a study of relevant document classification literature was performed to get an understanding of the field and find suitable machine learning algorithms and natural language processing methods that can be used for web page classification. There were no hard criteria for selecting a specific algorithm or method. The general guidelines were:

- The algorithm or method should be cited in the literature.

- The algorithm or method should have a readily available implementation in Java, or be easily implemented in a short time.

- Preferably, if a machine learning algorithm, it should have been part of a comparative study in document classification.

This is followed by proposing a general method of automatic web page categorization using text classification methods. To evaluate this method, a framework for automatic web page categorization is developed that implements the general method and includes support for several algorithms and tools that have been studied. To establish how well the general method performs, several experiments are done by performing web page categorization using the framework (with the several classifiers) on different data sets of real web page data. The performance of the framework is documented, and statistical analysis of the results is done. From these results, conclusions about the method can be drawn and any further work proposed.

The "real" web pages are given by the project provider, Whaam [1]. Whaam is a discovery engine based on the idea of social surfing. On Whaam, users store links to web pages in link lists that can be shared with friends. Links are separated into a set of eight broad categories: Art & Design, Fashion, Entertainment, Blogs, Sport, Lifestyle, Science & Tech, and Miscellaneous. The web pages have been labeled by the users of Whaam, which raises the issue of labeling quality. One of the concerns is that a cursory examination of the data shows that many samples have been labeled as Miscellaneous when they should have been labeled more specifically. For example, many links to musical artists are in the Miscellaneous category, but should probably belong to the Entertainment category. Another concern is that many samples are media-centered, thus having minimal textual content other than basic meta-information. For example, the following is the textual content taken from a typical Entertainment sample from the video-sharing website YouTube [2]:

Uploaded on Dec 23, 2011

An acoustic rendition of Silent Night by Matt Corby. Merry Christmas from Matt and team!

Samples from the Science & Technology category, for example, are comparatively large in terms of textual content, as illustrated by the following extracted text [3]:

Clojure is a dynamic programming language that targets the Java Virtual Machine (and the CLR, and JavaScript). It is designed to be a general-purpose language, combining the approachability and interactive development of a scripting language with an efficient and robust infrastructure for multithreaded programming. Clojure is a compiled language - it compiles directly to JVM bytecode, yet remains completely dynamic. Every feature supported by Clojure is supported at runtime. Clojure provides easy access to the Java frameworks, with optional type hints and type inference, to ensure that calls to Java can avoid reflection.

Clojure is a dialect of Lisp, and shares with Lisp the code-as-data philosophy and a powerful macro system. Clojure is predominantly a functional programming language, and features a rich set of immutable, persistent data structures. When mutable state is needed, Clojure offers a software transactional memory system and reactive Agent system that ensure clean, correct, multithreaded designs.

I hope you find Clojure's combination of facilities elegant, powerful, practical and fun to use. The primary forum for discussing Clojure is the Google Group - please join us!

[1] See http://www.whaam.com
[2] For reference, at the time of this writing, the sample referred to is http://www.youtube.com/watch?v=z6_Hcr5o_0M.
[3] Again, for reference, at the time of this writing, the sample referred to is http://clojure.org/.

Of course, the texts above are not the full content of the two samples, but the most relevant parts obtained by simple selection. Still, it illustrates the significant difference in the amount of textual content between the different categories.

1.6 Related work

There has been much research done on text classification over the years, and also on different methods for performing categorization of web pages. To get an idea of what work has been done in the past, this section briefly looks at some research papers on automatic web page categorization. A fairly recent paper by Qi and Davison [21] lists multiple methods that are of interest.

An interesting paper by Kan [13] studies the use of URLs for web page categorization. The assumption is that the URL inherently encodes information about a page's category. In the paper, Kan describes several methods for extracting tokens from URLs. One of them uses a finite state transducer to expand URL segments, such as "cs" in http://cs.cornell.edu to "computer science". These tokens are then used as the data for the classifier. The experiments used a Support Vector Machine as the classifier, and tried using different sources for text data as a comparison to using URLs only.

A paper written by Daniele Riboni in 2002 focused on feature selection for web page classification [23]. It highlighted the fact that categorizing web pages is different from, but related to, text classification, mainly because of the presence of HTML markup. Riboni chose a set of sources for text in HTML: the content of the BODY tag, the content of the META tag, the content of the TITLE tag, and combinations of these. Riboni experimented with several feature selection techniques based on information gain, word frequency, and document frequency. The experiments were made using a large set of samples from the subcategories of the Science directory from Yahoo! [4], using a Naive Bayes classifier and a kernel perceptron.

Not unexpectedly, there has also been work on methods for automatic web page categorization that are not based purely on text classification. An interesting paper by Attardi et al. [1] discusses categorization by context. The intended application is cataloguing, by spidering from a root web page. The paper makes the assumption that a web page can be categorized by the links pointing to it, either

[4] http://www.yahoo.com


directly by the anchor text itself or by text in the same context, i.e. surrounding it. Of course, this can be less useful if the link has a generic anchor text and occurs in a generic context, such as "Read more about it here". Another paper [31] explored the possibility of using mixed Hidden Markov models for clickstreams [5] as a model for web page categorization.

This report takes a similar approach to web page categorization as [23], but with more emphasis on studying several classifiers and different sources of text. While Riboni used the whole content of the BODY tag, in this report we explore further by using different parts under the BODY tag as sources for text. Chapter 6 explains these sources in more detail. This report utilizes classic text classification methods, making it possible to make relevant comparisons to other studies on both text classification and web page categorization that use text classification methods.

1.7 Description of this document

Introduction The introduction gives a brief background to the problem, states the problem itself, and discusses the methodology and related work.

Background This chapter introduces the reader to machine learning and natural language processing.

Automatic text categorization This chapter presents the identified machine learning methods for text classification, and describes the algorithms and their reported performance.

Automatic web page categorization This chapter presents a general method for automatic web page categorization using text classification methods.

Implementation This chapter briefly discusses the implementation of the categorization framework.

Results This chapter presents the experiments performed, and discusses theresults.

Conclusions This chapter presents the conclusions drawn based on the experimental results, and proposes further work.

[5] A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing. Users' clicks are logged inside the web server.


Chapter 2

Background

This chapter introduces the fields of machine learning and natural language processing, with some specific concepts that are relevant to later parts of this thesis. This chapter is meant to give a quick overview of the concepts for the reader with minimal to no knowledge of these fields, without going into details. The reader is encouraged to explore the cited literature for more detailed information.

2.1 Machine learning

2.1.1 Overview

The core topic of this paper is learning. Let us take a moment and define what learning is. The Oxford English Dictionary defines learning as the acquisition of knowledge or skills through study, experience, or being taught. Computers can be taught to learn from data, and if we put it into human behavioural terms we can say that they learn from experience. American computer scientist Tom M. Mitchell defines machine learning as follows:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." [35]

Mitchell gives an example:

"A computer program that learns to play checkers might improve its performance as measured by its ability to win at the class of tasks involving playing checkers games, through experience obtained by playing games against itself." [35]

Applications of machine learning include text classification [26], computer vision [24] and medical diagnosis [15].

2.1.2 Types of machine learning

There are several types of machine learning. Supervised learning uses a provided training set of examples with correct responses. Based on this training set, the algorithm generalizes to respond correctly to all possible inputs. Unsupervised learning, on the other hand, does not rely on provided correct responses for the training, but instead tries to identify similarities between the inputs, so that inputs that have something in common are grouped together. A combination of the two former types is reinforcement learning, where the algorithm is notified that it has made an error but is not told how to correct it. Instead, the algorithm has to explore and try out different possible solutions until it makes a correct prediction [34].

Table 2.1: N-grams of different lengths for the word "APPLE"

  N | Resulting N-grams
  2 | _A, AP, PP, PL, LE, E_
  3 | _AP, APP, PPL, PLE, LE_, E__
  4 | _APP, APPL, PPLE, PLE_, LE__, E___

Other types of machine learning are evolutionary learning [34], semi-supervised learning [36] and multitask learning [3].

2.2 Natural language processing

2.2.1 Overview

Natural language processing, or NLP for short, is a field of computer science and linguistics concerned with the interaction between human natural languages and computers. Some major applications of NLP include information retrieval, dialogue and conversational agents, and machine translation [32]. NLP is a large field, and therefore this section will only briefly cover the concepts needed for later chapters of this thesis.

2.2.2 N-grams

An N-gram is a contiguous sequence of n items from a given sequence of text. The items are usually letters (characters), but can also be other units depending on the application (e.g. words). Typically, one slices a word into a set of overlapping N-grams, and usually pads the word with blanks [1] to help with start-of-word and end-of-word situations [32]. Refer to table 2.1 for an example.
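The slicing-and-padding scheme of table 2.1 can be sketched as follows. This is a minimal illustration in Java (the implementation language mentioned elsewhere in this thesis); the class and method names are ours, not part of the thesis framework:

```java
import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // Illustrative sketch, not the thesis framework's code.
    // Pad the word with one leading blank and n-1 trailing blanks
    // (blanks written as underscores, as in table 2.1), then slide
    // a window of length n over the padded word.
    public static List<String> ngrams(String word, int n) {
        String padded = "_" + word + "_".repeat(n - 1);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("APPLE", 3));
        // [_AP, APP, PPL, PLE, LE_, E__]
    }
}
```

For n = 4 this yields _APP, APPL, PPLE, PLE_, LE__, E___, matching the last row of table 2.1.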

2.2.3 Tokenization

Tokenization is the process of breaking up a text into words, called tokens. Tokenization is relatively straightforward in languages such as English, but is particularly difficult for languages such as Chinese that have no word boundaries [9]. In English, words are often separated from each other by blanks or punctuation. However, this does not always apply since, for example, "Los Angeles" and "rock 'n' roll" are each often considered a single word.
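A tokenizer of the simple blank-and-punctuation kind described above might look as follows. This is a rough sketch with illustrative names; it deliberately makes no attempt to handle multi-word tokens such as "Los Angeles":

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Tokenizer {
    // Illustrative sketch, not the thesis framework's code.
    // Naive English tokenizer: lower-case the text, split on any run
    // of characters that are not letters or digits, drop empty tokens.
    public static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase().split("[^\\p{L}\\p{N}]+"))
                     .filter(t -> !t.isEmpty())
                     .collect(Collectors.toList());
    }
}
```

Note that this splits "rock 'n' roll" into three separate tokens, illustrating exactly the kind of case a naive tokenizer gets wrong.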

2.2.4 Term frequency - inverse document frequency

Consider the case where we query a system for the sentence "the car". As "the" is a very common word in English, it will appear in many documents. The result of the query would then include documents not relevant to "car", as "the" likely appears in all of the documents in the system. Instead we want to highlight the importance of the term "car". Term frequency - inverse document frequency, or tf-idf for short, is a measure of how important a term is in a document [32].

[1] In this report, blanks are represented by underscores.


CHAPTER 2. BACKGROUND

The term frequency tf_i of a term i can be chosen to be the raw frequency of the term in the document. Other possibilities include boolean frequencies and logarithmically scaled frequencies [32]. The problem with the term frequency is that it considers all terms equally important. Thus we introduce a factor that discounts a term's importance if it appears in many documents in the set. This approach defines the weight w_i of the term i as:

    w_i = \mathrm{tf}_i \log \frac{n}{n_i}    (2.1)

where n is the total number of documents in the set, and n_i is the number of documents in the set in which term i occurs.
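Equation (2.1) is straightforward to compute once the counts are known. The following sketch (illustrative names; the natural logarithm is assumed, since the base only rescales the weights) shows how a term occurring in every document is weighted down to zero:

```java
public class TfIdf {
    // Illustrative sketch, not the thesis framework's code.
    // Weight of a term per equation (2.1): w = tf * log(n / df),
    // where tf is the term's frequency in the document, n the number
    // of documents in the set, and df the number of documents
    // containing the term.
    public static double weight(int tf, int n, int df) {
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // "the" occurs in all 100 documents: weight 0 regardless of tf.
        System.out.println(weight(10, 100, 100)); // 0.0
        // "car" occurs in only 4 of 100 documents: weighted highly.
        System.out.println(weight(3, 100, 4));
    }
}
```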

2.2.5 Stemming

Many natural languages are inflected, meaning that words sharing the same root can be related to the same topic [27]. In information retrieval and classification, one often wishes to map words with the same meaning to the same term. An example of this is the terms "cat" and "cats", which can both be represented by the term "cat". In some algorithms, the stems produced by the stemming algorithm are not words, and can appear "incorrect" in terms of the natural language. However, this is not seen as a flaw but rather a feature [27]. Stemming algorithms can roughly be classified as affix-removing, statistical or mixed.

The affix removal algorithms can be as simple as removing n characters from the term, or removing -s from plural words. One stemmer for suffix removal is the Porter Stemming Algorithm, which is very popular and can perhaps be considered the de-facto standard for English words [27]. It is based on performing a set of steps for reducing words to stems using transformation rules. An example of a rule in the Porter Stemmer is

SSES → SS

i.e. if the word ends in "sses", change the suffix to "ss" [20].

There are also approaches to stemming that are statistical. For example, there has been work on using Hidden Markov Models for performing unsupervised word stemming [27]. Finally, there are methods that mix techniques from both affix stemmers and statistical stemmers.
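The SSES → SS rule above, together with a bare plural rule, can be sketched as follows. This is only a two-rule illustration in the spirit of the Porter stemmer's first step, not the full algorithm, and the names are ours:

```java
public class SuffixStemmer {
    // Illustrative sketch, not the full Porter algorithm.
    // Two rewrite rules in the spirit of Porter's step 1a:
    //   SSES -> SS   (e.g. "caresses" -> "caress")
    //   S    -> ""   (e.g. "cats" -> "cat"; "caress" is left alone)
    public static String stem(String word) {
        if (word.endsWith("sses")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }
}
```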

2.2.6 Bag-of-words model

The bag-of-words model is a simplifying representation of documents. A bag-of-words is an unordered set of words, with their exact positions ignored [33]. The simplest representation of a bag-of-words is a binary term vector, with each binary feature indicating whether a vocabulary word does or does not occur in the document [32]. For example, assuming the vocabulary is

V = { dog, cat, fish, iguana }

the term vector of a document containing only the words cat and iguana would be

    \vec{d} = (0, 1, 0, 1)^T
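The binary encoding in the example above can be sketched as follows (illustrative names; the vocabulary order fixes the feature order):

```java
import java.util.List;
import java.util.Set;

public class BagOfWords {
    // Illustrative sketch, not the thesis framework's code.
    // Binary term vector: entry i is 1 if vocabulary word i occurs
    // in the document, 0 otherwise.
    public static int[] binaryVector(List<String> vocabulary, Set<String> docWords) {
        int[] v = new int[vocabulary.size()];
        for (int i = 0; i < vocabulary.size(); i++) {
            v[i] = docWords.contains(vocabulary.get(i)) ? 1 : 0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("dog", "cat", "fish", "iguana");
        int[] d = binaryVector(vocab, Set.of("cat", "iguana"));
        System.out.println(java.util.Arrays.toString(d)); // [0, 1, 0, 1]
    }
}
```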


Another, in some cases more useful, term vector representation for the bag-of-words model is to use word frequency as the feature elements. In information retrieval and classification, various term weighting schemes are used that emphasize the relative importance of a term in the context (for example tf-idf weighting).


Chapter 3

Automatic text categorization

3.1 Overview

Automatic text categorization [1] is a supervised learning task, defined as assigning pre-defined category labels to new documents based on the likelihood suggested by a training set of labeled documents [29].

Let us illustrate the concept with an example. Imagine we want to classify the following text into a predefined set of categories:

"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."

Let us assume that the set of categories is a set of authors. Then, the task would be to determine which author has most likely written the given text. For the text above, the correct category would be Charles Dickens.

We want to use machine learning methods to classify this text. To do this, we must encode the text as a feature vector. The simplest approach is to represent the document by a bag-of-words feature vector with the features being word occurrences. However, as we will see in the following sections, the features are usually term weights as determined by a term weighting scheme. For simplicity's sake, we decide that the vocabulary of the training documents is the small set of words:

V = { matchbook, times, age, wisdom, hope, despair }

We can encode the document, based on the frequency of the distinct words, as the bag-of-words feature vector:

    \vec{d} = (0, 2, 2, 1, 1, 1)^T

[1] Text categorization is also known as text classification, document categorization, and document classification. These terms are used interchangeably in this document.


where the first feature represents the frequency of "matchbook" in the text, the second the frequency of "times" in the text, and so on. The vector representation \vec{d} of the document is a point in a feature space of dimension |V|. Using this feature vector, we apply a machine learning algorithm to determine its category.
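The frequency-based encoding used in this example can be sketched as follows (illustrative names; words outside the vocabulary are simply ignored):

```java
import java.util.List;

public class TermVector {
    // Illustrative sketch, not the thesis framework's code.
    // Frequency-based bag-of-words encoding: feature i is the number
    // of times vocabulary word i occurs in the token list.
    public static int[] encode(List<String> vocabulary, List<String> tokens) {
        int[] v = new int[vocabulary.size()];
        for (String token : tokens) {
            int i = vocabulary.indexOf(token);
            if (i >= 0) {
                v[i]++;   // out-of-vocabulary words contribute nothing
            }
        }
        return v;
    }
}
```

Encoding the tokens of the Dickens excerpt against the six-word vocabulary above yields the vector (0, 2, 2, 1, 1, 1).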

In this chapter, we will look at a selection of document classification methods that appear in the studied literature. Essentially, these are the methods that were deemed suitable for experimentation with automatic web page categorization during the literature study. We will take a brief look at how the methods work, and how well they perform at document classification tasks in the literature.

3.2 Naive Bayes

A Naive Bayes classifier is a probabilistic classifier based on applying Bayes' theorem, with the naive assumption of independence between features. What these features are varies from application to application; in text categorization the features are usually term (word) weights calculated using tf-idf or another weighting scheme. These classifiers are commonly studied in machine learning [29], and are frequently used because they are fast and easy to implement [22]. The basic idea of the Naive Bayes classifier is that we want to construct a decision rule d that labels a document with the class that yields the highest posterior probability:

    d(X_1, \dots, X_n) = \arg\max_c P(C = c \mid X_1 = x_1, \dots, X_n = x_n)    (3.1)

This is known as a maximum a posteriori or MAP decision rule. However, the posterior probabilities are usually not known. The trick is to make the rather naive assumption that all features X_1, \dots, X_n are conditionally independent. Bayes' rule states that:

    P(C = c \mid X_1 = x_1, \dots, X_n = x_n) = \frac{P(C = c)\, P(X_1 = x_1, \dots, X_n = x_n \mid C = c)}{P(X_1 = x_1, \dots, X_n = x_n)}    (3.2)

Applying the naive assumption of the independence between features:

    P(X_1 = x_1, \dots, X_n = x_n \mid C = c) = \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.3)

From this we see that:

    P(C = c \mid X_1 = x_1, \dots, X_n = x_n) \propto P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.4)

Using the above, we can write the decision rule (3.1) as:

    d(X_1, \dots, X_n) = \arg\max_c P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)    (3.5)

The assumption of independence between features is a strong one. If one thinks about a typical document, it seems unlikely that its features occur independently of one another. Not only that, but an increase in the number of features makes the model describe noise in the data rather than the actual underlying relationship between features (known as overfitting) [17]. However, the performance of Naive Bayes classifiers varies in the literature. In some cases, NB is one of the poorest of the classifiers compared [11, 12, 29], but in some studies they are comparable to the best known classification methods [5].

There are several variants of the NB classifier, which differ mainly in the assumptions they make regarding the distribution of P(Xi = xi | C = c) [49].


CHAPTER 3. AUTOMATIC TEXT CATEGORIZATION

3.3 Multinomial Naive Bayes

A classic approach [49] for text classification is the Multinomial Naive Bayes (MNB) classifier, which models the distribution of words (features) in a document as multinomial, i.e. the probability of a document given its class follows the multinomial distribution [18]. The estimated individual probabilities for P(Xt = xt | C = c) are generally written as [18, 49]:

P̂(Xt = xt | C = c) = (Nct + α) / (Nc + α|V|)    (3.6)

where |V| is the size of the vocabulary, Nct is the number of times feature t appears in class c in the training set, and Nc is the total count of features of class c. The constant α is a smoothing constant used to handle the problem of overfitting and the edge case where Nct = 0. Setting α = 1 is known as Laplace smoothing [18, 49]. Applying this estimate to the decision rule 3.5 we get

d(X1, . . . , Xn) = argmax_c P(C = c) ∏_{i=1}^{n} P̂(Xi = xi | C = c)
                = argmax_c P(C = c) ∏_{i=1}^{n} (Nci + α) / (Nc + α|V|)    (3.7)

In the literature the equivalent minimum-error classification rule in log space is more often used [14, 22]:

d(X1, . . . , Xn) = argmax_c [ log P(C = c) + ∑_{i=1}^{n} fi log P̂(Xi = xi | C = c) ]
                = argmax_c [ log P(C = c) + ∑_{i=1}^{n} fi log((Nci + α) / (Nc + α|V|)) ]    (3.8)

where fi is the frequency of word xi.
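The MNB estimate (3.6) and the log-space decision rule can be sketched in a few lines. This is an illustrative Python sketch (the implementation used in this thesis comes from WEKA's Java classes); the function names, toy token lists, and the use of raw counts instead of a weighting scheme are assumptions for the example.

```python
import math
from collections import Counter

def train_mnb(docs):
    """Collect priors P(C=c), per-class feature counts N_ct, and the vocabulary.
    docs: list of (token_list, label) pairs."""
    vocab = {t for tokens, _ in docs for t in tokens}
    labels = [label for _, label in docs]
    priors = {c: labels.count(c) / len(labels) for c in set(labels)}
    counts = {c: Counter() for c in priors}     # counts[c][t] = N_ct
    for tokens, label in docs:
        counts[label].update(tokens)
    return priors, counts, vocab

def classify_mnb(tokens, priors, counts, vocab, alpha=1.0):
    """Decision rule 3.8: argmax_c log P(c) + sum_i f_i log((N_ci+a)/(N_c+a|V|))."""
    def score(c):
        n_c = sum(counts[c].values())           # total feature count N_c
        s = math.log(priors[c])
        for t, f in Counter(tokens).items():
            s += f * math.log((counts[c][t] + alpha) / (n_c + alpha * len(vocab)))
        return s
    return max(priors, key=score)
```

With α = 1 (Laplace smoothing), unseen words contribute a small but nonzero probability instead of zeroing out the whole product.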

3.4 TWCNB

Transformed Weight-Normalized Complement Naive Bayes (TWCNB) is a variant on the MNB classifier that, according to the original authors, fixes "many of the classifier's problems without making it slower or significantly more difficult to implement." [22]

While similar to MNB, one of the differences is that the tf-idf normalization transformation is part of the definition of the algorithm. The main difference between the two, however, is that TWCNB estimates the conditional feature probabilities using data from all classes apart from c [14, 22]. The estimated probability is defined as [22]:

θ̂ic = (α + ∑_{k=1}^{|C|} dik) / (α|V| + ∑_{k=1}^{|C|} ∑_{x=1}^{|V|} dxk),   k ≠ c ∧ k ∈ C    (3.9)

where |V| is the size of the vocabulary and dik, dxk are the tf-idf weights of words i and x in class k. Now, we apply this estimate as a normalized word weight [22]:

wic = log θ̂ic / ∑_k log θ̂kc    (3.10)


Letting log P̂(Xi = xi | C = c) = wic and applying it to the decision rule 3.8 we get:

d(X1, . . . , Xn) = argmax_c [ log P(C = c) + ∑_{i=1}^{n} fi wic ]
                = argmax_c [ log P(C = c) + ∑_{i=1}^{n} fi (log θ̂ic / ∑_k log θ̂kc) ]    (3.11)

The literature shows that TWCNB performs about as well as or better than MNB [14, 22].
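A minimal sketch of the complement estimate (3.9), the weight normalization (3.10) and the resulting decision rule, assuming plain term-frequency weights in place of the full tf-idf transform that TWCNB prescribes; the function names and toy data are hypothetical (the implementation used in this thesis comes from Apache Mahout).

```python
import math
from collections import Counter

def complement_weights(docs, alpha=1.0):
    """Complement-class estimates (eq. 3.9) and normalized log weights (eq. 3.10).
    docs: list of (word_weight_dict, label); raw counts stand in for tf-idf here."""
    vocab = sorted({t for weights, _ in docs for t in weights})
    classes = sorted({label for _, label in docs})
    per_class = {c: Counter() for c in classes}
    for word_weights, label in docs:
        per_class[label].update(word_weights)
    norm_weights = {}
    for c in classes:
        comp = Counter()                        # pool every class apart from c
        for k in classes:
            if k != c:
                comp.update(per_class[k])
        total = sum(comp.values())
        log_theta = {t: math.log((alpha + comp[t]) / (alpha * len(vocab) + total))
                     for t in vocab}
        norm = sum(log_theta.values())          # normalize by the sum of log estimates
        norm_weights[c] = {t: log_theta[t] / norm for t in vocab}
    return norm_weights

def classify_twcnb(word_counts, priors, weights):
    """Decision rule 3.11: argmax_c log P(c) + sum_i f_i * w_ic."""
    def score(c):
        return math.log(priors[c]) + sum(f * weights[c].get(t, 0.0)
                                         for t, f in word_counts.items())
    return max(priors, key=score)
```

Because the log estimates are negative and the normalizer is their (negative) sum, words that are rare in the complement classes — i.e. indicative of class c — end up with the largest positive weights.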

3.5 k-nearest-neighbour

k-nearest-neighbor (kNN) is a well-known statistical approach for classification that has been widely studied over the years, and was applied early to text categorization tasks [29]. In the kNN algorithm the data points lie in a feature space, and what the points represent depends on the application. In text classification tasks, the components of the feature vector are usually term (word) weights (as for the NB classifiers).

The algorithm is quite simple. It determines the category of a test document t based on the voting of a set of k documents that are nearest to t in terms of distance, usually Euclidean distance [28]. In some applications, Euclidean distance can start to become meaningless if the number of features is high, which reduces the accuracy of the classifier. However, kNN is still regarded as one of the top performing classifiers on the Reuters corpus [29]. The basic decision rule given a testing document t for the kNN classifier is [10]:

d(t) = argmax_c ∑_{xi ∈ kNN} y(xi, c)    (3.12)

where y(xi, c) is a binary classification function for training document xi (which returns 1 if xi is labeled with c, or 0 otherwise). This rule labels t with the category that is given the most votes in the k-nearest neighborhood. The rule can also be extended by introducing a similarity function s(t, xi), labeling t with the class with the maximum similarity to t [10, 29]:

d(t) = argmax_c ∑_{xi ∈ kNN} s(t, xi) y(xi, c)    (3.13)

The latter, weighted decision function is thought to be better than the former and is more popular [10].
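The similarity-weighted voting rule 3.13 can be sketched as follows, here using cosine similarity as the similarity function s(t, xi) rather than an (inverse) Euclidean distance; all names and toy vectors are illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(test_vec, training, k):
    """Similarity-weighted kNN vote (eq. 3.13): each of the k most similar
    training documents votes for its label with weight s(t, x_i)."""
    neighbours = sorted(training, key=lambda tx: cosine(test_vec, tx[0]),
                        reverse=True)[:k]
    votes = Counter()
    for vec, label in neighbours:
        votes[label] += cosine(test_vec, vec)
    return votes.most_common(1)[0][0]
```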

3.6 Support Vector Machine

Support Vector Machines, or SVM, is a relatively new learning approach introduced by Vapnik in 1995 [37] for solving two-class pattern recognition problems. Empirical evidence suggests that SVM is one of the best techniques for performing automatic text categorization [29]. The SVM problem is commonly solved using quadratic programming techniques [30].


Figure 3.1: Example of a maximum margin separating two classes of data points [52].

3.6.1 Linear SVM

The SVM method is defined over a vector space where the problem is to find a decision surface that "best" separates two classes of data points. To do this, we introduce a margin between the two classes [29]. Similarly to the kNN and NB classifiers, the data points represent term (word) weights when the task is text classification. Figure 3.1 shows an example of a margin that separates two classes of data points in a two-dimensional feature space of linearly separable classes. The solid line is an example of a decision surface that separates the two classes, and the dashed lines parallel to the solid one show how much the decision surface can be moved without causing misclassification of data points. The SVM solves the problem of finding a decision surface that maximizes the margin. The decision surface as seen in figure 3.1 is expressed as [29, 30]:

~w · φ(~x) − b = 0    (3.14)

where ~x is an arbitrary data point to be classified, and φ is a transformation function on the data points. The vector ~w and constant b are learned from a training set of linearly separable data. Let D = {(yi, ~xi)} of size N denote the training set, where yi ∈ {±1} is the classification for ~xi (+1 being a positive example for the given class, −1 a negative example) [29]. The SVM problem is then to find ~w and b that satisfy the following two constraints [30]:

~w · φ(~xi) − b ≥ +1 for yi = +1
~w · φ(~xi) − b ≤ −1 for yi = −1    (3.15)

Refer to figure 3.1 for a graphical example. The circled points on the dotted lines are the support vectors that give the method its name. The decision function for an unlabeled document d, represented by feature vector ~xd, is [30]:

d(~xd) = +1 if ~w^T · φ(~xd) + b > k, −1 otherwise    (3.16)

where k is a user-defined threshold.


3.6.2 Nonlinear SVM

If we take an arbitrary text document, it seems very unlikely that the data in the feature space will be linearly separable. Thus we want a decision surface that can separate nonlinear data points. It can be shown that we can reformulate the decision function as [30]

d(~xd) = +1 if ∑_{i=1}^{N} αi yi K(~xd, ~xi) + b > k, −1 otherwise    (3.17)

where αi ≥ 0. The function K(~x, ~xi) is called a kernel function and allows us to have a decision surface for non-linearly separable data. An example of such a kernel function is the polynomial kernel K(~x, ~xi) = (~x^T · ~xi + 1)^d [30].
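A sketch of the kernelized decision function 3.17, assuming the multipliers αi and offset b have already been obtained from training (e.g. by a quadratic programming solver); the hand-picked support vectors below are purely illustrative.

```python
def poly_kernel(x, z, degree=2):
    """Polynomial kernel K(x, z) = (x . z + 1)^d."""
    return (sum(a * b for a, b in zip(x, z)) + 1) ** degree

def svm_decide(x, support, b=0.0, threshold=0.0, kernel=poly_kernel):
    """Kernelized decision function (eq. 3.17).
    support: list of (alpha_i, y_i, x_i) triples for the support vectors."""
    activation = sum(a * y * kernel(x, xi) for a, y, xi in support) + b
    return +1 if activation > threshold else -1
```

Only the support vectors (those with αi > 0) contribute to the sum, which is why the trained classifier can be stored compactly.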

3.6.3 Multi-class SVM

The SVM as described above is a binary classifier. However, in many practical applications there are multiple categories that a data point can belong to, which is certainly the case in non-trivial document classification applications. There are several studied methods for multi-class classification [2, 8]; the usual approach is to combine several binary SVMs to produce a single multi-class SVM.

In 2005, Duan & Keerthi made an empirical study of some multi-class methods; these are summarized in the following paragraphs. For a given multi-class problem, let M denote the number of classes and ωi, i = 1, . . . , M denote the M classes. For binary classification, we will refer to the two classes as positive and negative.

The first method that we will look at is the so-called one-versus-all winner-takes-all method. The method constructs M binary classifiers. The ith classifier output function pi is trained taking the examples from ωi as positive and all others as negative. For a new document ~t, it assigns the class with the largest value of pi [2].

Another method is the one-versus-one max-wins voting method, which constructs one binary classifier for every pair of distinct classes, altogether M(M − 1)/2 classifiers. The binary classifier Cij is trained taking the examples from ωi as positive and the examples from ωj as negative. When classifying a new document ~t, each classifier casts a vote on which of its two classes ωi, ωj should be assigned to ~t. After all classifiers have voted, the class with the most votes is assigned to ~t [2].

Yet another approach is known as pairwise coupling. It works under the assumption that the output of each binary classifier can be interpreted as the posterior probability of the positive class. The strategy is then to combine the outputs of all one-versus-one binary classifiers to obtain estimates of the posterior probabilities pi = P(ωi|~t). The classifier chooses the class that yields the highest pi; for details see [2].
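The one-versus-one max-wins voting scheme can be sketched independently of the underlying binary SVMs, which are represented here as plain functions; the names and toy classifiers are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def max_wins_vote(x, binary_classifiers, classes):
    """One-versus-one max-wins voting: each pairwise classifier votes for one
    of its two classes; the class collecting the most votes wins.
    binary_classifiers: dict mapping (ci, cj) -> function(x) returning ci or cj."""
    votes = Counter()
    for pair in combinations(classes, 2):   # all M(M-1)/2 distinct pairs
        votes[binary_classifiers[pair](x)] += 1
    return votes.most_common(1)[0][0]
```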

3.7 N-Gram based classification

In 1994, Cavnar and Trenkle proposed an N-gram-based method of text categorization [4], which was used for language and subject classification of USENET2 newsgroup articles.

2USENET is a worldwide distributed Internet discussion system. It is one of the oldest computer network communications systems still in widespread use.


Common sense dictates that human languages invariably have some words which occur more frequently than others. Zipf's law, as re-stated by Cavnar and Trenkle, expresses this as:

"The nth most common word in a human language text occurs with a frequency inversely proportional to n." [4]

An implication of this law is that there is always a set of words in a language that dominates most of the other words in terms of frequency of use. In English, for example, the most frequently used words are function words such as "the", "be", and "to" [48]. Cavnar and Trenkle state that the law also implies that there is always a set of words more frequent for a specific subject. For example, articles about sports may have many mentions of "football", "players" and "fumble", while articles about computers and technology more frequently mention "computer", "programmer" and the like.

A possible conclusion at this point is that if articles in a specific language or subject have a set of words more frequent than others, the same articles should also have a set of N-grams that are more frequent than others. This seems to be a reasonable conclusion, as experiments show that using N-grams for language identification is reliable [4, 7], and fairly reliable for subject identification [4].

The method proposed by Cavnar and Trenkle is based on comparing N-gram frequency profiles from a set of training documents to test documents, and categorizing the latter based on distance measures. The steps for profiling a document are as follows [4]:

- Tokenize the text, discarding digits and punctuation. Pad the tokens with sufficient blanks before and after.

- Generate all possible N-grams (including blanks) for each token, for N = 1 to 5.

- Count the occurrences of each N-gram.

- Sort the N-grams in reverse order of occurrence, i.e. the most frequent N-gram is the first element in the sorted list.
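The profiling steps above can be sketched as follows (an illustrative Python version; Cavnar and Trenkle's original system differs in details such as tokenization, which are assumptions here):

```python
import re
from collections import Counter

def ngram_profile(text, max_ngrams=400):
    """Build a Cavnar-Trenkle style N-gram frequency profile for N = 1..5.
    Returns the N-grams ranked by descending frequency, truncated to max_ngrams."""
    counts = Counter()
    for token in re.findall(r"[a-zA-Z]+", text.lower()):   # drop digits/punctuation
        padded = " " + token + " "                          # pad with blanks
        for n in range(1, 6):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    return [g for g, _ in counts.most_common()][:max_ngrams]
```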

By generating N-grams for all the training documents and merging the profiles of documents in the same category, an N-gram frequency profile for each category is made. The first 300 or so N-grams are considered to be very language dependent, and are thus removed from the profile [4].

Categorizing an incoming document is straightforward: generate an N-gram frequency profile for the document, measure the distance to each category profile, and pick the category with the minimum distance. The profile distance measure used by Cavnar and Trenkle is simple:

Take two N-gram profiles and calculate a simple rank-order statistic (called the "out-of-place" measure) by measuring how far out of place an N-gram in one profile is from its place in the other profile. For each N-gram in the document profile, find its counterpart in the category profile and calculate how far out of place it is in terms of position (rank) in the profile. We can formulate the out-of-place measure of an N-gram n in profiles i and j as

dn(i, j) = M if n is not in both profiles i and j, |rank(n, i) − rank(n, j)| otherwise    (3.18)


where M is a pre-defined maximum out-of-place value given if the N-gram does not exist in both profiles, and rank(n, i) is the rank of the N-gram n in profile i [4]. Using the distance measure in equation 3.18, we can construct the decision rule:

d(t) = argmin_c ∑_{i ∈ Pt} di(Pt, Pc)    (3.19)

where Pc, Pt are the N-gram profiles for class c and document t respectively. Experiments indicate that this method works very well for language classification and fairly well for subject classification, achieving as high as an 80% classification rate for the latter [4]. Cavnar and Trenkle note that a higher subject classification rate could probably be achieved by removing very frequently used words of the language from the data, i.e. using a stop list. It is interesting to see a reasonably good classification rate, given that the method can be seen as less sophisticated compared to other approaches.
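The out-of-place measure (3.18) and the decision rule (3.19) can be sketched as below; setting the maximum penalty M to the category profile length is an assumption for the example.

```python
def out_of_place(doc_profile, cat_profile, max_penalty=None):
    """Rank-order 'out-of-place' distance between two N-gram profiles (eq. 3.18).
    Profiles are lists of N-grams ordered by descending frequency."""
    if max_penalty is None:
        max_penalty = len(cat_profile)      # penalty M for missing N-grams
    rank = {g: i for i, g in enumerate(cat_profile)}
    return sum(abs(i - rank[g]) if g in rank else max_penalty
               for i, g in enumerate(doc_profile))

def classify_by_profile(doc_profile, category_profiles):
    """Decision rule 3.19: pick the category with minimum profile distance."""
    return min(category_profiles,
               key=lambda c: out_of_place(doc_profile, category_profiles[c]))
```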


Chapter 4

Automatic web page categorization

4.1 Overview

In this chapter a general approach to categorizing web pages is presented. To solve the problem of automatic web page categorization, we will make use of automatic text categorization and natural language processing techniques as described in previous chapters. Before we go into details, we should define what we mean by a web page.

Definition 4.1. A web page is a web document that is encoded in the HTML or XHTML format, and can be accessed either locally or from a remote web server with the use of a web browser.

This definition is introduced to emphasize that when we refer to a web page, we refer to the actual document containing the HTML or XHTML markup.

The basic assumption of this method is that document classification can be directly applied to automatic web page categorization, given that the web documents are transformed into plain text documents. The method is deliberately kept general; the reason for this is to make it easy to customize the method for the specific practical application. For example, the method makes no assumptions about which document classifier is used.

4.2 Data gathering

For a web page to be used for categorization, it must first be somehow acquired and stored in a convenient format (which depends on the requirements of the application). For example, the document can be downloaded off the web server and its content stored in a relational database. Another, very simple, approach is to download the document and store it as-is on the local file system.

It is advisable to gather the whole web page data set (both training and test samples) once and store it locally in case the parameters of the classifier are to be tweaked in training. This is important for the simple reason that gathering thousands of web pages from the Web is time consuming even if automated.

4.3 Choice of classifier

The choice of classifier depends on the practical requirements of the application. Of course, one would preferably want to use a classifier that yields good results


Figure 4.1: Training the classifier for web page categorization.

in the established literature for document classification. However, there may be specific requirements in the application that narrow the selection of classifiers, such as the computational complexity of the training and/or testing phase. As mentioned earlier, the rest of this chapter will make no assumption about which classifier is being used.

4.4 Training

The training process requires that a set of categories has been defined, and the training documents need to be labelled with their respective categories. The chosen classifier must be trained on a training set of web pages, and preferably evaluated on a smaller testing set. The evaluation is useful for determining the optimal values of any parameters of the classifier on the available training data. Figure 4.1 illustrates the training process. In the following sections, we will look closer at the individual steps of the training process.

4.5 Feature selection and extraction

A web page document itself is not very suitable for categorization, as it by definition contains HTML or XHTML. In order to use machine learning techniques for document classification, we need to reduce the web page document to a plain text document. This plain text document must then be transformed into a feature vector suitable for use with the chosen machine learning algorithm.

4.5.1 Plain text conversion

Converting a web page to a plain text document can be done quite straightforwardly by applying a regular expression that matches HTML and/or XHTML tags and replacing the occurrences with empty strings. However, web pages contain elements that are not part of the content per se, but part of the web page design (navigation menu entries, for example) that may not be relevant for categorization purposes. Figure 4.2 shows a very simple example of a web page using HTML markup. A better approach would be to make use of a parser to extract the desired content from the web page. The exact content that is extracted depends on the application, but a general approach could be to extract paragraphs and headings. The parser should implement standards from organizations such as the World Wide Web Consortium (W3C) [50] or the Web Hypertext Application Technology Working Group (WHATWG) [51].


<!DOCTYPE html>
<html>
<head>
<title>A web page</title>
</head>
<body>
<h1>A header</h1>
<p>Some text! And a <a href="#">link</a></p>
</body>
</html>

Figure 4.2: An example of a web page.
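The regex-based stripping mentioned above can be sketched as follows; as noted, this is crude compared to a real HTML parser (the implementation in Chapter 5 uses jsoup), and the helper name is hypothetical.

```python
import re

def strip_tags(html):
    """Crude plain-text conversion: drop tags via regex and collapse whitespace.
    A real implementation should use a standards-compliant HTML parser instead."""
    text = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)  # drop non-content blocks
    text = re.sub(r"(?s)<[^>]+>", " ", text)                   # remove remaining tags
    return " ".join(text.split())
```

Applied to the example page in figure 4.2, this keeps only the heading and paragraph text, but it would also keep navigation menus and other design elements, which is exactly why a parser-based extractor is preferable.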

4.5.2 Tokenization and stemming

With the web document converted into a plain text document, the next step is to extract the words that make up the textual content. Blindly applying a tokenizer at this point would not be advisable, seeing as the number of distinct words used throughout the entire training set is likely incredibly large. Therefore we need to reduce the number of distinct tokens, i.e. reduce the number of dimensions in the feature vector. How can we do this?

The first, easy, method of reducing the dimensions of the feature vector is to use a stop list to remove very common words such as "the" in English. These words are common in all sorts of text and are uninteresting for categorization purposes.

However, the number of distinct tokens is still likely to be large. On the remaining tokens, stemming is performed to end up with a final set of stemmed tokens that make up the vocabulary. This will assign related word forms to the same token (for example, "car" and "cars" will both be assigned to the token "car"). The number of distinct stems is very likely to be smaller than the number of distinct words, thus it is likely that stemming will reduce the number of dimensions significantly. The stemming algorithm to be used depends on the language of the web page, thus it may be necessary to classify the language of a certain web page using a language classifier1 before stemming can be performed.
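The tokenize/stop-list/stem pipeline can be sketched as below. The tiny stop list and the naive suffix-stripping stemmer are stand-ins for a real stop list and a proper stemmer such as Porter's algorithm; all names here are illustrative.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # tiny illustrative stop list

def tokenize_and_stem(text):
    """Tokenize, remove stop words, and apply a naive suffix-stripping stemmer.
    A production system would use a real stemmer (e.g. Porter) instead."""
    stems = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):       # extremely naive stemming
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems
```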

We are almost ready to define the feature vector. However, for determining term weights, we also need a vocabulary. We define it as

Definition 4.2. For a set of documents ~d = {d1, · · · , dN}, its vocabulary V is the set of tokens that are extracted and stemmed from the documents in ~d.

1A language classifier has been used in the implementation when filtering the web database, so that all pages used in the experiments are most likely in English. For more details refer to chapter 5.


4.6 Document representation

Let V be the vocabulary of the training set as stated in definition 4.2. We represent a web page as the feature vector

~d = (w1, . . . , wi, . . . , wN)^T    (4.1)

where wi is the weight of term i and N is the size of the vocabulary V. Less formally, a web page is represented as a bag-of-words vector with weights representing the stemmed tokens.

There are several weighting schemes to be considered for the weights wi [25]. A good choice is the popular tf-idf weighting scheme [41], described in section 2.2.4. In some algorithms, such as TWCNB, the weighting scheme is part of the algorithm itself. The weights can therefore also simply be raw frequencies of the tokens if the algorithm applies its own weighting scheme. In all but the simplest weighting schemes, the vocabulary must be extended with term frequencies. This is most easily done with a mapping from stemmed tokens to frequencies, either as part of the vocabulary or as a separate structure.
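Computing tf-idf weighted bag-of-words vectors for a small corpus can be sketched as follows; the variant shown (raw term frequency times log(N/df)) is one common choice among many, and the function name is illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute sparse tf-idf weight vectors for a list of token-list documents.
    Uses raw term frequency and idf = log(N / df_t)."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vectors
```

Note that a term occurring in every document gets idf = 0, so ubiquitous words are automatically down-weighted even without a stop list.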

4.7 Categorization of unlabeled documents

Given an unlabeled document ~d, it is fairly straightforward to categorize it using the trained classifier. As with the training data, the unlabeled document must be processed and turned into a feature vector. This is done by performing the feature extraction as described in previous sections, using the vocabulary of the training set. The feature vector of the unlabeled document is then given to the classifier's decision function, which will in turn make a prediction based on its training and return a category for the unlabeled document.


Chapter 5

Implementation

This chapter gives a brief overview of the automatic web page categorization framework that was implemented to evaluate the performance of different text classification algorithms on real web page data, as discussed in Chapter 6. The purpose of this chapter is not to convey details, but to give an overview of what technologies were used to build the framework and to give proper credit to these projects.

5.1 Overview

The implementation of the automatic web page categorization framework (hereafter simply referred to as the framework) was done in Java, due to it being cross-platform and having a wide availability of third-party libraries for tasks relating to machine learning and natural language processing. The framework makes use of a number of library packages to provide the machine learning and natural language processing methods needed for automatic web page categorization, as described in Chapter 4. It also provides ways of transforming web page document data into formats needed by some of the third-party packages used. The core packages of the framework are the following:

content Interfaces and classes for extracting content from downloaded web documents.

ml Provides an interface for categorizing web pages to a set of predetermined categories, and implementation classes using concrete classifiers.

nlp Classes that use natural language processing tools to transform raw text documents into token streams.

tools Command-line tools for crawling a specific web page database used in the experiments, and for batch processing of web pages into other formats required by some of the third-party libraries.

web Classes for representing and managing web pages.

The framework also consists of several packages dealing with the experiments discussed in Chapter 6; however, these are not discussed here.


5.2 The content framework package

The content package is responsible for extracting content from web pages. For convenience the package provides a number of different content extractors that specialize in extracting different sets of data from a web page, which are used for experimentation (see Chapter 6).

The content extractors are by themselves very simple: they get the desired content of the page (see section 5.6 for how this is done) and perform natural language processing as described in Chapter 4, using the nlp package (see section 5.4).

5.3 The ml framework package

The ml package provides category predictors using the classifiers described in Chapter 3. The classifiers are trained by being given a set of training documents that have been processed by the nlp package and converted into feature vectors (as described in Chapter 4). Categorization is performed similarly by turning the unlabelled page into a feature vector and running the decision function of the classifier.

The MNB, kNN and SVM classifier implementations are provided by the WEKA machine learning Java library [39], while the TWCNB implementation is provided by the Apache Mahout machine learning Java library [44].

5.4 The nlp framework package

The nlp package provides a tool for natural language processing of English-language documents. Essentially, it takes a raw plain text document, i.e. the parsed content from a web page, and performs tokenization, stop-word removal, and stemming. This tool is used by the content package (see section 5.2) to fully extract content from a web page into a suitable format (i.e. a stream of stemmed tokens).

The bulk of the work is performed by the Apache Lucene Java library, which has many packages for natural language processing in different languages [43].

5.5 The tools framework package

The tools package implements various command-line tools. The most important tools are those that convert a set of training documents into formats suitable for the classifiers. The WEKA classifiers expect documents to be in the ARFF format, which is a file format that describes a list of instances sharing a set of attributes [47]. The TWCNB classifier uses the Sequence file format, which is a binary key/value-pair format [38].

Also included in this package are tools for traversing and filtering the link database from the project provider, Whaam1.

5.6 The web framework package

The web package implements classes for representing and managing web pages from the link database. The web page representation makes use of the jsoup HTML parser [45] for extracting text when called by the content package (see section 5.2).

1http://whaam.com/


Chapter 6

Results

This chapter presents the results of a set of evaluations performed using the framework described in Chapter 5. First we present the classifiers used and their parameters (if any), followed by the data and experiments. Then we continue to discuss the different types of evaluation metrics used for measuring the success of the results. The chapter concludes by presenting the results based on the evaluation metrics.

6.1 Classifiers

For the evaluation, we compare Support Vector Machines (SVM), k-nearest-neighbor (kNN), Multinomial Naive Bayes (MNB), Transformed Weight-Normalized Complement Naive Bayes (TWCNB) and the Cavnar-Trenkle N-Gram-based classifier (N-Gram).

The SVM is configured to use the radial basis function, or RBF, kernel with γ = 0.01. RBF is considered to be the most popular kernel to use for Support Vector Machines [6]. For kNN the number of nearest neighbors is k = 30, as some preliminary testing seemed to indicate that it was a fairly good compromise between a high and low k. For the N-Gram classifier the profiles are set to have a maximum length of 400, as suggested by the results of [4].

6.2 Data and experiments

To reiterate, the data used in the evaluation is a link database from the project provider Whaam1, from which web pages were extracted. Whaam is a discovery engine based on the idea of social surfing. On Whaam, users store links to web pages in link lists that can be shared with friends. Links are separated into a set of eight broad categories: Art & Design, Fashion, Entertainment, Blogs, Sport, Lifestyle, Science & Tech, and Miscellaneous. The data is real in the sense that the web pages have been categorized by real users in an existing public product.

As described in Chapter 1, the scope of the evaluation is to only consider web pages that are in English. A cursory examination of the data showed that the vast majority of links point to English and Swedish web sites, with many of the Swedish samples being blogs. Thus the evaluation focuses on English web sites to avoid any artificial grouping of the data that does not depend on the

1See http://www.whaam.com


Table 6.1: Distribution of samples over the categories in the data.

Category                 Number of samples
Art & Design             17
Fashion                  20
Entertainment            789
Blogs                    334
Sport                    319
Lifestyle                1125
Science & Tech           597
Miscellaneous            2546
Total number of samples  5747

"real" content but on language2. All of the English samples have been extracted using the tools described in Chapter 5; the distribution of the samples over the different categories is shown in table 6.1.

In order to see what parts of web pages are relevant to categorization, several sources were used for gathering textual content from the web pages used in the experiment. The sources chosen were inspired by the work done by Kwok on document representation for automatic classification [16], but more so by the paper by Riboni about feature selection for web page categorization [23].

For each data source, five instances were generated. In each of these instances, the samples that go into the training set and test set respectively are randomized. Seven different text sources were used, namely:

- T, the content of the <title> tag;

- H, the contents of all <h1>, ..., <h6> tags;

- P, the contents of all <p> tags;

- TH, the contents of the T and H sources;

- HP, the contents of the H and P sources;

- TP, the contents of the T and P sources;

- THP, the contents of the T, H and P sources.

If we apply this to the simple web page shown in figure 4.2, the content from the different sources can be represented by the following multisets, with HTML markup and punctuation removed:

T = {a, web, page}
H = {a, header}
P = {some, text, and, a, link}
TH = {a, web, page, a, header}
HP = {a, header, some, text, and, a, link}
TP = {a, web, page, some, text, and, a, link}
THP = {a, web, page, a, header, some, text, and, a, link}

Of course, as described in Chapter 4, these sources would be processed and encoded as feature vectors and not used as-is.

2The concern was that any web page in Swedish would most likely be categorized as being a blog, independent of its actual content.


6.3 Evaluation metrics

A commonly used metric for judging classifier performance [29] is the micro-averaged F1 score. It gives equal weight to each document and can therefore be considered an average over all document/category pairs; it tends to be dominated by the classifier's performance on common categories [19, 29].

The F1 score takes a value between 0 and 1, where 0 is the worst possible score and 1 is the best possible score. It is calculated using the precision (p) and recall (r) measures, defined as [19]:

pi = TPi / (TPi + FPi),   ri = TPi / (TPi + FNi)    (6.1)

where TPi is the number of documents correctly labelled with class i (known as true positives); FPi is the number of documents incorrectly labelled with class i (known as false positives); and FNi is the number of documents that should have been labelled with class i but were not (known as false negatives). The F1 measure for class i is then expressed as [19, 29]:

Fi = 2 pi ri / (pi + ri)    (6.2)

The global precision and recall values are obtained by summing over all individual decisions [19]:

p = ∑_{i=1}^{M} TPi / ∑_{i=1}^{M} (TPi + FPi),   r = ∑_{i=1}^{M} TPi / ∑_{i=1}^{M} (TPi + FNi)    (6.3)

where M is the number of categories. The micro-averaged F1 score is then defined as in equation 6.2, but using the global precision and recall values [19]:

F1(micro-averaged) = 2pr / (p + r)    (6.4)

Seeing that table 6.1 shows that the data is dominated by a few categories, it also seems appropriate to have a metric that shows how well a classifier performs on less common categories. The macro-averaged F1 score is understood to do just this [29]. It is calculated by first calculating the individual F1 values for the different categories, and then taking their average [19]:

F1(macro-averaged) = (∑_{i=1}^{M} Fi) / M    (6.5)

To be thorough, the "raw" accuracy (which is simply the percentage of correct classifications) is also used as a performance measure. This measure is less informative, as it does nothing to show how well the classifier performs on individual categories.
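The three measures above can be computed directly from true/predicted label pairs. The following is a minimal sketch (not the evaluation code used in this thesis; the function name and example labels are illustrative) of equations 6.1-6.5:

```python
from collections import Counter

def f1_scores(true_labels, pred_labels):
    """Micro- and macro-averaged F1, following equations 6.1-6.5."""
    classes = sorted(set(true_labels) | set(pred_labels))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1          # correctly labelled with class t
        else:
            fp[p] += 1          # incorrectly labelled with class p
            fn[t] += 1          # should have been labelled t, but was not

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r > 0 else 0.0

    # Macro-averaging: per-class F1 values, then their mean (equations 6.1, 6.2, 6.5)
    per_class = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] > 0 else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] > 0 else 0.0
        per_class.append(f1(prec, rec))
    macro = sum(per_class) / len(classes)

    # Micro-averaging: pool the counts over all classes first (equations 6.3, 6.4)
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(TP / (TP + FP), TP / (TP + FN))
    return micro, macro

micro, macro = f1_scores(["blog", "blog", "misc", "misc"],
                         ["blog", "misc", "misc", "misc"])
print(round(micro, 4), round(macro, 4))  # micro = 0.75, macro ≈ 0.7333
```

Note that in a single-label setting like this one, every error contributes one false positive and one false negative, so the pooled precision and recall are equal and the micro-averaged F1 coincides with raw accuracy; the macro average differs because it weights rare categories as heavily as common ones.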

6.4 Per-classifier performance

6.4.1 Micro-averaged F-measure

Table 6.2 shows the micro-averaged F1 score for the different classifiers, averaged over the five instances for each source, with standard deviations in parentheses. The same results are also shown in figure 6.1. What is observed here is that the scores for SVM, kNN and MNB are significantly lower when compared to other studies on text classification. In the study by Yang & Liu [29], the scores of these classifiers range from roughly 0.8 to 0.9, but here go as low as about 0.45. The scores of the N-Gram and TWCNB classifiers have no comparison in the studied literature; however, it should be highlighted that TWCNB has the highest score of all the classifiers. Also of note is that the HP source has the most "best" scores, which hints that it might be the best general source of the ones compared.

However, when compared to the micro-averaged F-measure in Riboni's paper [23], the performance is more closely matched. The methods described by Riboni achieve a micro-averaged F-measure of 0.6506 for the NB classifier as their best result, also averaged over several instances as here. Interestingly, in Riboni's paper the best result was achieved using their largest text source, which includes the page title, meta-tags and the whole content of the body. The best result in this paper uses less data as the source, using only the content of the headings and individual paragraphs. TWCNB achieves comparable performance results, but it should be highlighted that Riboni uses a different data set and different term weighting schemes.

Table 6.2: Micro-averaged F1 score for the different classifiers, averaged over all instances.

Source   N-Gram            MNB               TWCNB             kNN               SVM
T        0.1908 (0.0156)   0.3415 (0.0276)   0.4152 (0.0347)   0.4366 (0.0109)   0.4343 (0.0038)
H        0.1553 (0.0112)   0.3929 (0.1131)   0.4877 (0.0082)   0.4453 (0.0022)   0.4926 (0.0197)
P        0.2122 (0.0269)   0.5626 (0.0061)   0.6078 (0.0039)   0.4408 (0.0244)   0.5358 (0.0037)
TH       0.1398 (0.0268)   0.4361 (0.0076)   0.4013 (0.0579)   0.4103 (0.0083)   0.4453 (0.0078)
TP       0.1521 (0.0176)   0.4385 (0.0026)   0.4453 (0.0113)   0.4423 (0.0135)   0.4854 (0.0221)
HP       0.1636 (0.0269)   0.5681 (0.0181)   0.6101 (0.0078)   0.4363 (0.0221)   0.5646 (0.0119)
THP      0.1618 (0.0079)   0.3877 (0.0069)   0.3812 (0.0406)   0.3899 (0.0011)   0.4531 (0.0109)

Figure 6.1: Mean Micro F-Measure Score, with 95% confidence intervals shown.

6.4.2 Macro-averaged F-measure

Table 6.3 shows the macro-averaged F1 score for the different classifiers, averaged over the five instances for each source, with standard deviations in parentheses. The same results are also shown in figure 6.2. When compared to the studied literature [29], the macro-averaged F-measures for the SVM and kNN classifiers are significantly worse. However, the scores for the two Bayesian classifiers (MNB and TWCNB) in these experiments outperform the NB classifier in the studied text classification literature. For the macro-averaged F-measure, the source with the most "best" results is the P source, with the HP source in second place.

Interestingly, compared to the macro-averaged F-measure results by Kan [13], we see that the best results of the NB classifiers in this paper are about equal to Kan's best results for the SVM classifier. The SVM classifier in this paper performs significantly worse; however, it is difficult to make a good comparison, as Kan does not present the kernel used in his experiments. Again, it should be highlighted that Kan uses a different data set and that his best results use URLs as well as textual content for classification.

Table 6.3: Macro-averaged F1 score for the different classifiers, averaged over all instances.

Source   N-Gram            MNB               TWCNB             kNN               SVM
T        0.1282 (0.0208)   0.2227 (0.0228)   0.3023 (0.0225)   0.1225 (0.0019)   0.1219 (0.0244)
H        0.1095 (0.0178)   0.2439 (0.1034)   0.3357 (0.0191)   0.1087 (0.0038)   0.1763 (0.0049)
P        0.1783 (0.0369)   0.4435 (0.0177)   0.4457 (0.0236)   0.1689 (0.0132)   0.2451 (0.0195)
TH       0.1078 (0.0089)   0.2777 (0.0079)   0.2782 (0.0615)   0.1224 (0.0068)   0.0801 (0.0091)
TP       0.1114 (0.0149)   0.3165 (0.0186)   0.3141 (0.0042)   0.1614 (0.0046)   0.1964 (0.0101)
HP       0.1253 (0.0294)   0.4359 (0.0033)   0.4497 (0.0140)   0.1519 (0.0087)   0.2925 (0.0033)
THP      0.1246 (0.0059)   0.2793 (0.0239)   0.2592 (0.0306)   0.1422 (0.0036)   0.1846 (0.0053)

Figure 6.2: Mean Macro F-Measure Score, with 95% confidence intervals shown.

6.4.3 Raw classification accuracy

Table 6.4 shows the accuracy of the different classifiers in terms of correct classification rate, averaged over the five instances for each source, with standard deviations in parentheses. The same results are also shown in figure 6.3. When examining these results, one quickly sees that the overall accuracies are quite poor. Also of note is that the standard deviations of both NB classifiers are relatively high, indicating that they are more sensitive to the actual instance being used. An interesting pattern is that the N-Gram classifier performs about equally for all data sources, with relatively minor differences. The standard deviation is also low, which would hint that the N-Gram classifier performs equally well (or badly, depending on one's view) across a wide range of different data. However, its accuracy comes nowhere close to the results given by [4].

Table 6.4: Accuracy of the different classifiers in terms of correct classification rate, averaged over all instances.

Source   N-Gram           MNB               TWCNB             kNN              SVM
T        19.542 (1.165)   40.565 (7.518)    33.643 (7.397)    43.569 (1.561)   43.515 (2.619)
H        15.429 (0.765)   39.702 (9.097)    39.216 (9.069)    43.713 (1.701)   46.048 (2.486)
P        15.333 (3.130)   40.501 (14.777)   43.111 (6.068)    41.867 (3.543)   46.949 (6.639)
TH       15.213 (0.514)   41.157 (3.428)    38.431 (4.366)    41.529 (1.141)   44.920 (1.896)
TP       14.824 (1.111)   40.186 (4.653)    40.451 (5.568)    42.535 (2.017)   46.822 (2.365)
HP       17.154 (1.003)   48.894 (8.482)    53.271 (9.389)    41.717 (0.369)   52.328 (4.201)
THP      14.785 (1.938)   34.136 (4.732)    35.685 (5.851)    39.021 (1.011)   43.243 (2.434)

Figure 6.3: Mean Classification Accuracy, with 95% confidence intervals shown.


Chapter 7

Conclusions

In this chapter, we discuss and draw conclusions from the results and propose future work that can be done on this subject.

7.1 Concerns about the data

The category distribution in the provided data is, to put it mildly, skewed. About 45% of the given data belongs to the Miscellaneous category, and thus it dominates the training and testing samples. Examining the confusion matrices shows that all the classifiers have the most correct answers for the Miscellaneous category, but also confuse many samples from other categories with Miscellaneous. A short, undocumented experiment on the HP dataset shows that accuracy increases when Miscellaneous samples are removed, with raw accuracy approaching 70% for the NB and SVM classifiers. This hints that the Miscellaneous category is too general to be effectively included in the data. However, one advantage of including the Miscellaneous data is that a misclassified, previously unseen web page is likely to be assigned to Miscellaneous. In a practical application, categorizing a web page as Miscellaneous may be seen as acceptable, depending on the views of the user. For example, it seems better to wrongly classify the Science & Technology web site Ars Technica with the category Miscellaneous rather than with the category Fashion.

Another concern about the data is that many samples may well be labelled with the "wrong" category. What is to be considered a wrong category may be discussed at length, but the fact is that the data has in no way been audited. The relatively large size of the Miscellaneous category hints that many users who have contributed to the database through Whaam may have sloppily added links to the Miscellaneous category when they clearly should have belonged somewhere else.

When working with high-dimensional data, the concern is always that there are too many dimensions in the feature space. While steps have been taken to reduce the number of dimensions, through stop word removal and stemming, this may still be a problem.

7.2 Suggestions for improvements in future work

7.2.1 Data sets

The experiments in this thesis were, for practical reasons, performed on only a single, non-standard database. Text classification studies are usually performed on standardized datasets, such as Reuters-21578 [29]. If possible, future work should try to utilize other link databases.

7.2.2 Improved feature selection

This report focused on studying different data sources and classifiers, but took a simplistic approach to feature selection by only eliminating stop words. It is likely that the classification rate would improve if more advanced feature selection methods were used, such as those described by Riboni [23]. While the data sources in this report are more fine-grained than in the paper by Riboni, the extracted text may still be very noisy due to the nature of textual content on the web. It is likely that many terms that are uncommon enough to pass stop-word filtering are irrelevant to the category that the web page belongs to, and should be discarded.

7.2.3 Utilizing domain knowledge

Depending on the application, it might be possible to significantly improve the accuracy of the classification by exploiting domain knowledge about what web sites are being used. For example, if the application is to categorize Wordpress1 blogs, one can choose to gather Wordpress-specific elements from the pages. The actual textual content in Wordpress blog posts is usually inside a div-tag of the class entry-post. Therefore one can choose to extract this content and ignore everything else, since using only, say, the p-tag may bring in text that is part of the web site design and may not be relevant to the blog posts and the overall category of the page.
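As a rough illustration of this idea, the following sketch pulls out only the text inside div elements of class entry-post. It uses Python's standard-library html.parser rather than the jsoup library [45] used in this thesis, and the sample HTML is made up; the entry-post class name is the one mentioned above.

```python
from html.parser import HTMLParser

class EntryPostExtractor(HTMLParser):
    """Collects text that appears inside <div class="entry-post"> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting depth inside an entry-post div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0 and tag == "div":
            self.depth += 1   # a nested div inside the post body
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if "entry-post" in classes:
                self.depth = 1    # entered a post body

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

def extract_post_text(html):
    parser = EntryPostExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ('<div class="sidebar">Recent posts</div>'
        '<div class="entry-post"><p>Hello <b>world</b></p></div>')
print(extract_post_text(page))  # → Hello world
```

Note how the sidebar text, which is part of the site design rather than the post, is dropped entirely instead of being swept in along with every p-tag on the page.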

7.2.4 Utilizing web site structure

Perhaps one major flaw in the method proposed in this thesis is that it does not in any way utilize the fact that web pages belong to web sites. The definition of a web page was chosen to refer to the actual document; however, it might be possible to improve the classification rate by spidering the web site and gathering data from more than one page. This, however, brings up the question of how much data should be gathered from the overall web site, and in some cases web sites may have content from several categories.

7.2.5 Alternative approaches

The results draw into question the viability of using text classification methods for performing web page categorization. While the results may be due to flaws in the data rather than in the approach, one can't help wondering if other methods may be more suitable. Section 1.6 described earlier work done on web page categorization.

7.3 Overall conclusion

As mentioned earlier, the results in this thesis bring up the question of whether text classification methods are viable for web page categorization tasks. However, the best results achieved in this thesis, which are not terrible, hint that with

1 WordPress is a free and open source blogging tool and a content management system (CMS) based on PHP and MySQL which runs on a Web hosting service. See http://wordpress.org/ for more information.


refinements to the method, the rate of successful classifications can be increased. After all, some of the classifiers used have been shown in the literature to perform well in text classification tasks. It should also be highlighted that the results are comparable to other studies on automatic web page categorization that use text classification methods. As this thesis was meant as an exploratory study, there is no real measurement of success. The author would, however, dare to claim that this thesis is successful in that it brings out and shows the problems that can occur when using data from the real world.


Bibliography

Articles and publications

[1] Giuseppe Attardi, Antonio Gullì, and Fabrizio Sebastiani. Categorization by context. Journal of Universal Computer Science, 1998.

[2] Kai bo Duan and S. Sathiya Keerthi. Which is the best multiclass SVM method? An empirical study. In Proceedings of the Sixth International Workshop on Multiple Classifier Systems, pages 278–285, 2005.

[3] Richard A. Caruana. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the Tenth International Conference on Machine Learning, pages 41–48, 1993.

[4] William B. Cavnar and John M. Trenkle. N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, pages 161–175, Las Vegas, US, 1994.

[5] Kian Ming Adam Chai, Hai Leong Chieu, and Hwee Tou Ng. Bayesian online classifiers for text classification and filtering. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '02, pages 97–104, New York, NY, USA, 2002. ACM.

[6] Yin-Wen Chang, Cho-Jui Hsieh, Kai-Wei Chang, Michael Ringgaard, and Chih-Jen Lin. Training and testing low-degree polynomial data mappings via linear SVM. J. Mach. Learn. Res., 99:1471–1490, August 2010.

[7] Ted Dunning. Statistical Identification of Language. Technical report, Computing Research Laboratory, New Mexico State University, 1994.

[8] Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multiclass Support Vector Machines, 2002.

[9] Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh, and Laurent Prévot. Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, pages 69–72. Association for Computational Linguistics, 2007.

[10] Shengyi Jiang, Guansong Pang, Meiling Wu, and Limin Kuang. An improved k-nearest-neighbor algorithm for text categorization. Expert Syst. Appl., 39(1):1503–1509, January 2012.

[11] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998.


[12] Thorsten Joachims. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, 1997.

[13] Min-Yen Kan. Web page classification without the web page. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. '04, pages 262–263, New York, NY, USA, 2004. ACM.

[14] Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. Multinomial naive Bayes for text categorization revisited. In Proceedings of the 17th Australian Joint Conference on Advances in Artificial Intelligence, AI'04, pages 488–499, Berlin, Heidelberg, 2004. Springer-Verlag.

[15] Igor Kononenko. Machine learning for medical diagnosis: history, state of the art and perspective. Artif. Intell. Med., 23(1):89–109, August 2001.

[16] K. L. Kwok. The Use of Title and Cited Titles as Document Representation for Automatic Classification, 1975.

[17] David D. Lewis and Marc Ringuette. A Comparison of Two Learning Algorithms for Text Categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81–93, 1994.

[18] Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press, 1998.

[19] Arzucan Özgür, Levent Özgür, and Tunga Güngör. Text categorization with class-based and corpus-based keyword selection. In Proceedings of the 20th International Conference on Computer and Information Sciences, ISCIS'05, pages 606–615, Berlin, Heidelberg, 2005. Springer-Verlag.

[20] Martin F. Porter. An algorithm for suffix stripping. Program: Electronic Library & Information Systems, 40(3):211–218, 1980.

[21] Xiaoguang Qi and Brian D. Davison. Web page classification: Features and algorithms. ACM Comput. Surv., 41(2):12:1–12:31, February 2009.

[22] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling the Poor Assumptions of Naive Bayes Text Classifiers. In Proceedings of the Twentieth International Conference on Machine Learning, pages 616–623, 2003.

[23] Daniele Riboni. Feature selection for web page classification, 2002.

[24] Peter M. Roth and Martin Winter. Survey of Appearance-based Methods for Object Recognition, 2008.

[25] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. In Information Processing and Management, pages 513–523, 1988.

[26] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Comput. Surv., 34(1):1–47, March 2002.

[27] Ilia Smirnov. Overview of Stemming Algorithms. Technical report, DePaul University, 2008.


[28] Liu Yang. Distance Metric Learning: A Comprehensive Survey, 2006.

[29] Yiming Yang and Xin Liu. A Re-Examination of Text Categorization Methods, 1999.

[30] James Tin-Yau Kwok. Automated Text Categorization Using Support Vector Machine. In Proceedings of the International Conference on Neural Information Processing (ICONIP), pages 347–351, 1998.

[31] Alexander Ypma, Er Ypma, and Tom Heskes. Categorization of web pages and user clustering with mixtures of hidden Markov models. pages 31–43, 2002.

Books

[32] Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, second edition, 2008.

[33] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

[34] Stephen Marsland. Machine Learning: An Algorithmic Perspective. Chapman & Hall/CRC, 1st edition, 2009.

[35] Tom M. Mitchell. Machine Learning. McGraw-Hill Science/Engineering/Math, 1st edition, March 1997.

[36] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2006.

[37] Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., New York, NY, USA, 1995.

Internet resources

[38] Apache Software Foundation. SequenceFile. URL: http://wiki.apache.org/hadoop/SequenceFile.

[39] Machine Learning Group at the University of Waikato. Weka 3: Data Mining Software in Java. URL: http://www.cs.waikato.ac.nz/ml/weka/.

[40] Maurice de Kunder. World Wide Web Size. URL: http://www.worldwidewebsize.com/, retrieved on 2013-02-26.

[41] Michael Dittenbach.

[42] The Open Directory Project Editors. The Open Directory Project. URL: http://www.dmoz.org/, retrieved on 2013-02-26.

[43] The Apache Software Foundation. Apache Lucene Core. URL: http://lucene.apache.org/core/.

[44] The Apache Software Foundation. Apache Mahout. URL: http://mahout.apache.org/.

[45] Jonathan Hedley. jsoup: Java HTML Parser. URL: http://jsoup.org/.

[46] Internet Systems Consortium. The ISC Domain Survey. URL: http://www.isc.org/solutions/survey.

[47] Machine Learning Group at the University of Waikato. Attribute-Relation File Format (ARFF). URL: http://www.cs.waikato.ac.nz/ml/weka/arff.html.

[48] Oxford Dictionaries Online. The OEC: Facts about the language. URL: http://oxforddictionaries.com/words/the-oec-facts-about-the-language, retrieved on 2013-02-08.

[49] scikit-learn. Naive Bayes. URL: http://scikit-learn.org/dev/modules/naive_bayes.html.

[50] World Wide Web Consortium (W3C). HTML. URL: http://www.w3.org/html/wg/drafts/html/master/.

[51] Web Hypertext Application Technology Working Group (WHATWG). HTML. URL: http://www.whatwg.org/html.

[52] Wikipedia Commons. Graphic showing the maximum separating hyperplane and the margin. URL: http://commons.wikimedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png.
