

Journal of Trivia, Volume 1, Issue 4 (July 2009)

Date of Revision: 2003-09-26

An Evaluation of Passage-based Text Categorization

Jinsuk Kim∗
Center for Computational Biology & Bioinformatics, Korea Institute of Science and Technology Information (KISTI), P.O. Box 122, Yuseong-gu, Daejon, Republic of Korea 305-600
[email protected]

Myoung Ho Kim†
Department of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology (KAIST), 373-1, Guseong-dong, Yuseong-gu, Daejon, Republic of Korea 305-701
[email protected]

Category  Computer Science/Information Retrieval, Information Retrieval/Text Categorization

Keywords  Text Categorization, Passage, Non-overlapping Window, Overlapping Window, Paragraph, Bounded-Paragraph, Page, TextTile, Passage Weight Function

Abstract  Research in text categorization has been confined to whole-document-level classification, probably due to the lack of full-text test collections. However, the full-length documents available in large quantities today renew interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. In order to reflect the subtopic structure of a document, we propose a new passage-level, or passage-based, text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, two additional steps, passage splitting and category merging, are required in this model. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents vary from tens to hundreds of kilobytes, we evaluate the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows are best for all test collections tested in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.

1 Introduction

Text categorization, the task of assigning one or more predefined categories to a document, is an active research field in information retrieval and machine learning. However, research interest in text categorization has been confined to problems such

∗ This is a copy of a paper published in another journal. To cite this paper, use the following reference: Jinsuk Kim and Myoung-Ho Kim. “An Evaluation of Passage-Based Text Categorization”. Journal of Intelligent Information Systems 23(1):47-65, 2004.

† To whom correspondence should be addressed.

Copyright © 2004 Journal of Intelligent Information Systems. Copyrighted by Springer Science.


as feature extraction, feature selection, supervised learning algorithms, and hypertext classification. Traditional categorization systems, or classifiers, have treated the whole document as the categorization unit, and there has been little research on the input units of classifiers.

However, the emergence in large quantities of full-length documents such as word-processor files, full-text SGML/XML/HTML documents, and PDF/Postscript files challenges traditional categorization models, which process the whole document as an input unit. As an alternative access method, we regard each document as a set of passages, where a passage is a contiguous segment of text. In information retrieval, the introduction of passages dates back to the early 1990s [4, 3], and various types of passages have been proposed and tested for document retrieval effectiveness [3, 5, 6, 7]. The large quantity of full-length documents available today renews interest in text categorization as well as in information retrieval.

In this article, we propose a new text categorization model, in which a test document is split into passages, categorization is performed on each passage, and then the document's categories are merged from the passages' categories. We call the proposed model passage-based text categorization, in contrast to traditional categorization, which is performed at the document level.

In our experimental results, we compare the effectiveness of several passage types in text categorization, using a kNN (k-Nearest Neighbor) classifier [14]. For a test collection consisting of very long documents, we find that the use of passages improves effectiveness by about 10% for all passage types used in our experiments, compared with document-level categorization. In addition, for collections consisting of rather short documents such as newswires, there is about a 5% improvement as well.

This paper introduces passage-based text categorization in Section 2. The data sets and the measures used in the experiments are explained in Section 3, and the experimental results are given in Section 4. Finally, we conclude in Section 5.

2 Passage-based Text Categorization Model

Generally a document is deliberately structured as a sequence of subtopical discussions that occur in the context of one or more main topic discussions [5]. If this is true, it is natural to treat a document as a sequence of subtopic blocks of any unit, such as sentences, paragraphs, sections, or contiguous text segments. With this motivation, we propose a new text categorization model, passage-based text categorization, shown in Figure 1.

The primary difference in passage usage between information retrieval and text categorization is the target documents. In information retrieval, the documents stored in the databases are split into passages, and queries are evaluated for similarity against the passages [3, 4, 10, 18]. In text categorization, on the other hand, the document to be categorized is split into passages, and the categorization task is applied to these passages instead of the parent document.

As shown in Figure 1, a passage-based categorization system splits the test document into several passages, classifies each passage into categories, and determines the test document's categories by merging all the passage categories. This procedure includes two additional steps, passage splitting and category merging, compared with traditional document-level classification systems. These two additional steps are the topics of Section 2.1 and Section 2.2, respectively.

[Figure 1. Comparison of document-based (upper) and passage-based (lower) text categorization models. In the passage-based model, a Passage Splitter divides the test document into passages 1 to n, a Text Classifier assigns categories to each passage, and a Category Merger combines the passage categories into the document categories; in the traditional model, the whole test document is fed to a single Text Classifier.]
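The control flow of this model can be summarized in a few lines of code. The following is a minimal sketch, not the classifier used in our experiments; the names `split_passages`, `classify`, and `merge_categories` are placeholders for the three components of Figure 1.

```python
# Minimal sketch of the passage-based categorization pipeline in Figure 1.
# split_passages, classify, and merge_categories are placeholders for a
# concrete passage splitter, per-passage classifier, and category merger.

def categorize_document(text, split_passages, classify, merge_categories):
    """Split a test document into passages, classify each passage,
    and merge the passage categories into document categories."""
    passages = split_passages(text)                        # passage splitting
    passage_categories = [classify(p) for p in passages]   # per-passage categorization
    return merge_categories(passage_categories)            # category merging

# Toy stand-ins to show the flow:
if __name__ == "__main__":
    split = lambda text: text.split("\n\n")                # paragraphs as passages
    classify = lambda p: {"grain"} if "wheat" in p else {"trade"}
    merge = lambda cats: set().union(*cats)                # naive union merge
    doc = "wheat prices rose\n\nexport talks continue"
    print(categorize_document(doc, split, classify, merge))  # {'grain', 'trade'} (order may vary)
```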

2.1 Passages in Text Categorization

With regard to passage splitting in Figure 1, the definition of a passage is the key point of this step in passage-based text categorization. Since a passage is defined as any sequence of text from a document [7], many types of passages have been used in document retrieval. [3] grouped these passage types into three classes: discourse passages, semantic passages, and window passages.

2.1.1 Discourse passages

Discourse passages are based on logical components of documents such as sentences and paragraphs [3, 5]. This passage definition is intuitive, because discourse boundaries organize material by content.

There are three problems with discourse passages. First, there is no guarantee that documents have discourse consistency among authors [3]. Second, it is sometimes impossible to build discourse passages, because many documents are supplied without passage demarcation [7]. Finally, the lengths of discourse passages can vary widely, from very long to very short [7].


As an alternative solution to the first and last problems, [3] suggested a passage type known as bounded-paragraphs. As the name implies, when building bounded-paragraphs, short paragraphs are merged into subsequent paragraphs, while paragraphs longer than some minimum length are kept intact. [3] used 50 words as the minimum length of a bounded-paragraph.
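As a rough sketch of how bounded-paragraphs can be built under this description (the function name and the handling of a short trailing remainder are our assumptions, not part of [3]):

```python
def bounded_paragraphs(paragraphs, min_words=50):
    """Merge consecutive short paragraphs until each unit reaches at least
    min_words words; paragraphs already long enough are kept intact."""
    units, buffer = [], []
    for para in paragraphs:
        buffer.append(para)
        if sum(len(p.split()) for p in buffer) >= min_words:
            units.append(" ".join(buffer))
            buffer = []
    if buffer:  # attach a short trailing remainder to the last unit
        units[-1:] = [" ".join(units[-1:] + buffer)]
    return units
```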

2.1.2 Semantic passages

As discussed above, discourse passages may be inconsistent or impractical to build due to poor structure of the source document. An alternative approach is to split a document into semantic passages, each corresponding to a topic or a subtopic. Several algorithms to partition documents into segments have been proposed and developed [7]. One such algorithm, known as TextTiling [5], partitions full-length documents into coherent multi-paragraph segments, known as TextTiles or simply tiles, which represent the subtopic structure of a document.

TextTiling splits a document into small text blocks and computes the similarities of all adjacent blocks based on term frequencies. The boundary between two blocks with relatively low similarity is regarded as a boundary between two adjacent tiles, while blocks with high similarity are merged into a tile.
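A condensed sketch of the idea follows. This is our simplification, not Hearst's full algorithm: fixed-size word blocks, cosine similarity over term frequencies, and tile boundaries placed at similarity valleys.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def tile_boundaries(words, block_size=20):
    """Place tile boundaries at word offsets where the similarity of
    adjacent fixed-size blocks reaches a local minimum (a 'valley')."""
    blocks = [Counter(words[i:i + block_size]) for i in range(0, len(words), block_size)]
    sims = [cosine(blocks[i], blocks[i + 1]) for i in range(len(blocks) - 1)]
    return [(i + 1) * block_size for i in range(1, len(sims) - 1)
            if sims[i] < sims[i - 1] and sims[i] < sims[i + 1]]
```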

2.1.3 Window passages

While discourse and semantic passages are based on structural properties of documents, an alternative approach, called window passages, is based on sequences of words. This is computationally simple and can be applied to documents without explicit structural properties as well as to well-structured documents.

[4] segmented documents into even-sized blocks, each corresponding to a fixed-length sequence of words and starting just after the end of the previous block. Accordingly, there is no shared region between two adjacent blocks, and this passage type is thus referred to as non-overlapping windows. [3] partitioned documents into overlapping windows, where two adjacent segments share words at the boundary. In our experiments, the second half of an overlapping window is shared with the following window.
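Both window types amount to slicing the document's word sequence. A sketch, with a half-window step for the overlapping case to match the sharing scheme used in our experiments:

```python
def non_overlapping_windows(words, size=100):
    """Fixed-length word windows, each starting where the previous one ends."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def overlapping_windows(words, size=100):
    """Windows of `size` words advancing by size // 2 words, so the second
    half of each window is shared with the following window."""
    step = size // 2
    return [words[i:i + size] for i in range(0, max(len(words) - step, 1), step)]
```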

Another window passage type is the page [9]. Pages are similar to bounded-paragraphs, but pages are bounded by physical length in bytes, while bounded-paragraphs are bounded by number of words. A page's minimum length is 1.0 kb [9].

2.2 Selecting Document Categories from Passage Categories

Usually a document is written in an organized manner to convey the author's intention. For example, a newspaper article may place its main topic in the title and the early part to catch its readers' attention. On the other hand, a science article may place its conclusion in its last part. Therefore, depending on their location, passages from a document may contribute to the document's main topic(s) to different degrees.

In our passage-based text categorization model, a passage's degree of contribution to the document categories is expressed as a passage weight function. We chose the six passage weight functions shown in Table 2 in Section 3.4. After categories are assigned to all passages of a document, passage weights are computed by the passage weight function, and a category's weight is the sum of the weights of the passages assigned that category. Categories with weights higher than some predefined value are assigned to the document as the final result. More detailed procedures are described in Section 3.4.

3 Data Sets and Measures

3.1 Data Sets

We used the Reuters version 3 collection [1], its three subsets named GT800, GT1200, and GT1600, and the KISTI-Theses collection to verify the effectiveness of passages in text categorization.

The Reuters version 3 collection was constructed by C. Apte et al. [1]. They removed all unlabeled documents from both the training and test sets and restricted the categories to those with a training set frequency of at least two. After the name of its constructors, we call this data set the Apte collection hereafter.

The GTnnnn test collections were constructed from the Apte data set by removing from the test set all documents whose lengths are less than nnnn bytes. This restriction resulted in test sets of 1,109, 652, and 410 documents for GT800, GT1200, and GT1600, respectively.

Finally, the KISTI-Theses data set was constructed from master's and doctoral theses of KAIST¹, POSTECH², and CNU³, which were submitted in electronic form in partial fulfillment of graduation requirements. The theses amount to 1,042 documents from 22 departments. We regarded the 22 departments as categories and selected one third of the documents as the test set (347 documents) and the remaining two thirds as the training set (695 documents). The majority of the documents are written in Hangul (Korean text).

The category distribution of the KISTI-Theses data set is listed below in <category, #te, #tr, sum> format, where #te is the number of documents in the test set, #tr is the number of documents in the training set, and sum is the sum of #te and #tr:

<Advanced Materials Engineering, 2, 5, 7>
<Aerospace Engineering, 6, 11, 17>
<Automation & Design Technology, 1, 4, 5>
<Biology, 23, 39, 62>
<Chemical Engineering, 27, 47, 74>
<Chemistry, 25, 49, 74>
<Civil Engineering, 9, 17, 26>
<Computer Science, 24, 49, 73>
<Electrical Engineering, 48, 102, 150>
<Environmental Engineering, 5, 6, 11>
<Industrial Design, 1, 4, 5>
<Industrial Engineering, 14, 29, 43>
<Information & Communication Engineering, 6, 15, 21>
<Management Engineering, 44, 90, 134>
<Materials Science, 20, 33, 53>
<Mathematics, 7, 21, 28>
<Mechanical Engineering, 45, 98, 143>
<Metal Engineering, 11, 17, 28>
<Miscellaneous, 0, 1, 1>
<Nuclear Engineering, 7, 19, 26>
<Physics, 15, 27, 42>
<Steel Engineering, 7, 12, 19>

1. Korea Advanced Institute of Science and Technology, Daejon, Korea, http://www.kaist.ac.kr
2. Pohang University of Science and Technology, Pohang, Korea, http://www.postech.ac.kr
3. Chungnam National University, Daejon, Korea, http://www.cnu.ac.kr

The characteristics of the data sets used in this experiment are shown in Table 1.

Table 1. Characteristics of the test collections used in this work.

Collection               Apte   GT800  GT1200  GT1600  KISTI-Theses
Test Set                 3309   1019   652     410     347
Training Set             7789   7789   7789    7789    695
Category Count           93     93     93      93      22
Minimal text size (kb)   0.1    0.8    1.2     1.6     14.8
Average text size (kb)   0.8    1.8    2.2     2.7     92.9

3.2 Similarity Measures

To assess the effectiveness of the various passaging methods, a k-nearest neighbor (kNN) classifier [14] was used as the document-level classifier. As an example-based classifier [14, 11], a kNN classifier has many similarities with traditional information retrieval systems. Our kNN classifier was built on an information retrieval system, KRISTAL-II, which was developed by KISTI's Information Systems Group to manage and retrieve semi-structured texts such as bibliographies, theses, and journal articles (more information about the KRISTAL-II information retrieval system can be obtained at http://www.kristalinfo.com).

To retrieve the k top-ranked documents, the kNN classifier uses a vector-space similarity measure Sim(q, d) between query document q and target document d, which is defined to be

Sim(q, d) = \frac{1}{W_d} \sum_{t \in q \wedge d} \left( w_{q,t} \cdot w_{d,t} + \min(f_{d,t}, f_{q,t}) \right) \quad (1)

with:

w_{d,t} = \log(f_{d,t} + 1) \cdot \log\left(\frac{N}{f_t} + 1\right)

w_{q,t} = \log(f_{q,t} + 1) \cdot \log\left(\frac{N}{f_t} + 1\right)

W_d = \log\left(\sum_{t \in d} f_{d,t}\right)

where f_{x,t} is the frequency of term t in document x; N is the total number of documents; min(x, y) is the smaller of x and y; f_t is the number of documents in which term t occurs more than once; w_{x,t} is the weight of term t in query or document x; and W_d represents the length of document d.

Equation 1 is an empirically derived TF·IDF form of the traditional vector-based information retrieval schemes [13], which have been commonly used due to their robustness and simplicity. The most noticeable modification in Equation 1, compared with traditional vector schemes, is the introduction of the expression min(f_{d,t}, f_{q,t}), which reflects the term frequency in the query or document directly in the query-document similarity. With this expression, categorization performance is slightly better than with traditional vector-space similarity measures (data not shown).

The sum of all min(f_{d,t}, f_{q,t}) terms in Equation 1 is the total frequency of terms that co-occur in the query and the target document. The sum of min(f_{d,t}, f_{q,t}) thus reflects, if rather indirectly, term co-occurrence information in the similarity measure, which may result in improved performance.
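A direct transcription of Equation 1 into code may make the computation concrete. This is a sketch under assumed data structures (dictionaries of term frequencies and a precomputed document-frequency table), not the KRISTAL-II implementation:

```python
import math

def sim(q_tf, d_tf, df, n_docs):
    """Equation 1. q_tf/d_tf map term -> frequency in query/document;
    df maps term -> f_t (document count for the term); n_docs is N.
    Assumes d_tf is non-empty."""
    w_d = math.log(sum(d_tf.values())) or 1.0   # W_d; guard against log(1) = 0
    score = 0.0
    for t in q_tf.keys() & d_tf.keys():         # t in q AND d
        idf = math.log(n_docs / df[t] + 1)
        w_qt = math.log(q_tf[t] + 1) * idf      # w_{q,t}
        w_dt = math.log(d_tf[t] + 1) * idf      # w_{d,t}
        score += w_qt * w_dt + min(d_tf[t], q_tf[t])
    return score / w_d
```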

3.3 Performance Measures

To evaluate the categorization effectiveness of the various passages, we use the standard definitions of precision (p) and recall (r) as basic performance measures:

p = \frac{\text{categories relevant and retrieved}}{\text{categories retrieved}}, \qquad r = \frac{\text{categories relevant and retrieved}}{\text{categories relevant}}

Along with precision and recall, much research in text categorization has used the F1 measure as the performance measure. The F1 measure [12] is the harmonic mean of precision and recall, defined as

F_1 = \frac{2pr}{p + r}

The special point of the F1 measure where precision equals recall is called the precision and recall break-even point, or simply the break-even point (BeP). Since, theoretically, the BeP is always less than or equal to the F1 measure at any point, the BeP is usually used to compare effectiveness among different kinds of classifiers or categorization methods [16, 11]. We present precision, recall, and BeP as the categorization effectiveness; when the BeP cannot be obtained, the F1 measure is presented instead. To average precision and recall across categories, we used the micro-averaging method [11].
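For concreteness, micro-averaging pools the decision counts over all documents and categories before dividing; a sketch:

```python
def micro_metrics(decisions):
    """decisions: list of (retrieved_categories, relevant_categories) sets,
    one pair per test document. Returns micro-averaged (p, r, F1)."""
    tp = sum(len(ret & rel) for ret, rel in decisions)
    n_retrieved = sum(len(ret) for ret, _ in decisions)
    n_relevant = sum(len(rel) for _, rel in decisions)
    p = tp / n_retrieved if n_retrieved else 0.0
    r = tp / n_relevant if n_relevant else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```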

3.4 Category Relevance Measure

In document-level categorization, to determine whether a document d belongs to a category c_j ∈ C = {c_1, c_2, ..., c_{|C|}}, our kNN classifier retrieves the k training documents most similar to d and computes c_j's weight by adding up the similarities between d and the retrieved documents that belong to c_j; if the weight is large enough, the decision is positive, and negative otherwise. Category c_j's weight for document d is called the category relevance score, Rel(c_j, d), and is computed as follows [17]:

Rel(c_j, d) = \sum_{d' \in R_k(d) \cap D_j} Sim(d', d) \quad (2)

where R_k(d) is the set of the k nearest neighbors (the top-ranked training documents, from the 1st to the kth) of document d, D_j is the set of training documents assigned category c_j, and Sim(d', d) is the document-document similarity obtained by Equation 1 in Section 3.2. For each test document, the categories with relevance scores greater than a given threshold are assigned to the document.

Table 2. Passage weight functions

Function†                                        Weighting tendency
pwf_1(p) = 1                                     Head = Body = Tail
pwf_2(p) = p^{-1}                                Head ≫ Body ≫ Tail
pwf_3(p) = p                                     Head < Body < Tail
pwf_4(p) = \sqrt{(p - n/2)^2}                    Head = Tail > Body
pwf_5(p) = \sqrt{(n/2)^2 - (p - n/2)^2}          Head = Tail < Body
pwf_6(p) = (\log(p + 1))^{-1}                    Head > Body > Tail

† Normalization factors are omitted for clarity. p: passage location; n: total number of passages.
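Equation 2 translates into a few lines given any similarity function; a sketch (the data layout, a list of (document, categories) pairs, is our assumption):

```python
from collections import defaultdict

def knn_category_scores(test_doc, training_docs, sim, k=10):
    """Equation 2: Rel(c_j, d) sums Sim(d', d) over the k nearest neighbors
    d' of d (the set R_k(d)) that are labeled with category c_j (D_j)."""
    scored = sorted(((sim(doc, test_doc), cats) for doc, cats in training_docs),
                    key=lambda pair: pair[0], reverse=True)
    rel = defaultdict(float)
    for s, cats in scored[:k]:   # d' in R_k(d)
        for c in cats:           # restrict to D_j, category by category
            rel[c] += s
    return rel

def assign_categories(rel, threshold):
    """Assign every category whose relevance score exceeds the threshold."""
    return {c for c, score in rel.items() if score > threshold}
```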

In passage-level categorization, the same procedure as in the document-level categorization task is applied, and categories are assigned to each passage p_i of the test document d. However, since d's categories are not yet determined, relevance scores for all candidate categories are computed from the categories of all passages as follows:

Rel(c_j, d) = \sum_{p_i \in P_j} pwf_n(i) \quad (3)

where p_i is the ith passage of document d, P_j is the set of passages assigned category c_j, and pwf_n() is one of the passage weight functions shown in Table 2. As in document-level categorization, the categories with relevance scores greater than a given threshold are assigned to the document.

As stated in Section 2.2, we use six passage weight functions (pwfs), which are functions of passage location and return a value between 0 and 1 (for clarity, the normalization factors are omitted in Table 2). The weighting tendency of each pwf is also shown, dividing a document roughly into Head, Body, and Tail. A sketch of these functions and of the merging step in Equation 3 is given below.
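The six functions of Table 2 and the merge of Equation 3 can be written directly; normalization factors are omitted here as in the table, so this sketch ignores the scaling to [0, 1] used in our experiments:

```python
import math

# Passage weight functions of Table 2 (normalization factors omitted).
# p is the passage location (1-based), n the total number of passages.
PWFS = {
    1: lambda p, n: 1.0,                                         # Head = Body = Tail
    2: lambda p, n: 1.0 / p,                                     # Head >> Body >> Tail
    3: lambda p, n: float(p),                                    # Head < Body < Tail
    4: lambda p, n: math.sqrt((p - n / 2) ** 2),                 # Head = Tail > Body
    5: lambda p, n: math.sqrt((n / 2) ** 2 - (p - n / 2) ** 2),  # Head = Tail < Body
    6: lambda p, n: 1.0 / math.log(p + 1),                       # Head > Body > Tail
}

def merge_passage_categories(passage_categories, pwf_id):
    """Equation 3: Rel(c_j, d) = sum of pwf(i) over the passages i assigned
    category c_j. passage_categories is a list, in document order, of the
    category sets assigned to each passage."""
    n = len(passage_categories)
    pwf = PWFS[pwf_id]
    rel = {}
    for i, cats in enumerate(passage_categories, start=1):
        for c in cats:
            rel[c] = rel.get(c, 0.0) + pwf(i, n)
    return rel
```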

4 Experiments

4.1 Experimental Settings

Terms separated by space characters were extracted as features from documents and passages. Stemming was not applied to the terms, since it hurt effectiveness in our experiments (data not shown), as reported in previous research such as [2]. Common words (also known as stopwords) were removed from the feature pool. We also removed digits and numeric values from the feature pool; removing digits slightly improved categorization effectiveness in the case of the Apte data set and its three subsets (data not shown).

Table 3. Average passage number and length.

Passage Type             Apte       GT800      GT1200     GT1600     Theses
Non-overlapping window   1.8 (0.5)  2.2 (0.5)  4.2 (0.6)  4.7 (0.6)  114.0 (0.8)
Overlapping window       2.4 (0.5)  5.1 (0.5)  6.5 (0.6)  8.0 (0.6)  226.4 (0.8)
Paragraph                7.1 (0.1)  10.8 (0.2) 12.8 (0.2) 15.6 (0.2) 90.8 (1.0)
Bounded-paragraph        2.3 (0.4)  2.8 (0.6)  5.4 (0.4)  6.6 (0.4)  72.3 (1.3)
Page                     N/A        N/A        N/A        N/A        56.9 (1.6)
TextTile                 1.9 (0.5)  3.1 (0.6)  3.5 (0.6)  3.8 (0.7)  64.4 (1.4)

Entries give the average number of passages per document; the average passage length in kilobytes is shown in parentheses.

An additional step was applied to the KISTI-Theses data set. Since the majority of its documents are written in Hangul (Korean text), we applied a Hangul morpheme analyzer to the Hangul terms and included the resulting morphemes in the feature pool (the Hangul morpheme analyzer used in this experiment is a component of the KRISTAL-II information retrieval system, which serves as the base of our kNN classifier).

Feature selection was applied according to the terms' document frequencies (DF). [15] showed that DF is a simple, effective, and reliable thresholding measure for selecting features in text categorization. In our experiments, by varying the DF range during feature selection, we chose the minimal DF (DF_min) and maximal DF (DF_max) that performed best in document-level categorization for each test collection. The same DF_min and DF_max were also applied to the passage-level categorization tasks.
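Document-frequency thresholding reduces to counting and filtering; a sketch (the tuned DF_min and DF_max values are collection-specific and not reproduced here):

```python
from collections import Counter

def select_features_by_df(training_docs, df_min, df_max):
    """Keep the terms whose document frequency lies in [df_min, df_max].
    training_docs: an iterable of token lists, one per training document."""
    df = Counter()
    for tokens in training_docs:
        df.update(set(tokens))   # count each term at most once per document
    return {t for t, f in df.items() if df_min <= f <= df_max}
```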

With regard to k in our kNN classifier, the best k value for document-level categorization was also chosen for the passage-level tasks. For the Apte, GT800, GT1200, and GT1600 data sets, k was selected to be 10, and for the KISTI-Theses collection, k = 1 was chosen. The k = 1 for the KISTI-Theses collection is very interesting; it is probably due to the fact that only one category is assigned to each document and each document may be long enough to describe the retrieved category. Further discussion is given in Section 4.3.5.

Table 3 shows the average number of passages per test document and the average passage length for each data set. The data for both non-overlapping and overlapping windows are shown for a passage size of 100 words. Since a page is bounded by a minimal length of 1.0 kb, we think it meaningless to apply the page type to short documents of 1 or 2 kb. We therefore did not test the effectiveness of the page type on the Apte collection and its three subsets.

4.2 Effectiveness of Passage Weight Functions

As stated in Section 2.2, we assume that the location of a passage plays an important role in determining the parent document's main topic(s). To reflect this assumption, we introduced six different kinds of passage weight functions, which are functions of passage location and return passage weights between 0 and 1 (see Table 2). By summing these passage weights, the document's categories are assigned as shown in Figure 1 in Section 2.

[Figure 2. Effectiveness of 6 passage weighting functions. (Data set: GT1600)]

Micro-averaged break-even points for the GT1600 collection are plotted against the various passage types in Figure 2, where the trends for the six passage weight functions can be seen. For the GT1600 collection, passage weight functions 2 (pwf2) and 6 (pwf6) show the best performance. These two functions return a high weight for passages from the Head part, a middle weight for the Body, and a low weight for the Tail (Table 2). pwf3 has the reverse pattern and shows the worst performance. The situation is the same for the Apte, GT800, and GT1200 collections (data not shown). This means that the main topics of Reuters news articles are determined mainly by the early part of the document.

On the other hand, the KISTI-Theses data set shows a very different situation (Figure 3). pwf2 shows the worst performance for the KISTI-Theses data set, while it was best for the newswire data, i.e., Apte and its three subsets. Furthermore, although the weighting tendencies of pwf2 and pwf6 are very similar (Table 2), pwf6 performs well for several passage types while pwf2 is poor for all passage types. This situation makes interpretation very difficult. However, it is clear that different passage weighting functions should be applied to different document types.

Note that pwf1 always returns 1 for any passage location, while the other pwfs return variable values depending on the passage's location in the document. Yet in Figure 2, pwf2 and pwf6 outperform pwf1, and in Figure 3, pwf3 and pwf6 outperform pwf1. These results support our assumption that each passage contributes to the document's main topic(s) to a different degree.

[Figure 3. Effectiveness of 6 passage weighting functions. (Data set: KISTI-Theses)]

4.3 Effectiveness of Passages

4.3.1 Apte collection

Experimental results for the Apte test collection are shown in Table 4. Micro-averaged precision, recall, and break-even point (BeP) with passage weight function 2 (pwf2) applied are presented as performance measures. See Section 4.2 for a comparison of the six pwfs' effectiveness. The differences, ∆%, are the performance improvements in percent compared with document-based categorization, based on break-even points.

Since the Apte, GT800, GT1200, and GT1600 test collections consist of rather short documents (see Table 1), the page type was not applied to these data sets. We also slightly modified TextTiling to yield fine-grained tiles, because standard TextTiling is too coarse-grained a segmentation algorithm for these data sets. The average size of tiles partitioned by the modified TextTiling algorithm is 98 words, corresponding to 454 bytes, while the average size of standard TextTiles is 1 or 2 kilobytes (see Table 3).

Table 4. Effectiveness of passages for the Apte data set with pwf2

                              Precision  Recall  BeP†    ∆%
Document                      0.818      0.818   0.818   0.0
Non-overlapping Windows
  window size = 50            0.817      0.817   0.817   -0.1
  window size = 100           0.820      0.821   0.820   0.3
  window size = 150           0.822      0.822   0.822   0.5
  window size = 200           0.819      0.824   0.822‡  0.4
Overlapping Windows
  window/overlap = 50/25      0.820      0.820   0.820   0.3
  window/overlap = 100/50     0.823      0.823   0.823   0.6
  window/overlap = 150/75     0.823      0.823   0.823   0.6
  window/overlap = 200/100    0.822      0.822   0.822   0.4
Paragraphs                    0.812      0.812   0.812   -0.8
Bounded-Paragraphs            0.761      0.766   0.763‡  -6.7
TextTiles                     0.831      0.812   0.821‡  0.4

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

Though there is no significant improvement in BeP, overlapping windows at passage sizes of 100 and 150 words showed the best performance. The passage type showing the worst performance is bounded-paragraphs. A bounded-paragraph is defined as a paragraph containing at least 50 words and at most 200 words [3]. We reason that the short documents in the Apte collection cause malformed bounded-paragraphs, resulting in poor performance.

The poor performance improvement for all passage types on this data set seems to be due to the very high proportion of very short documents in the Apte collection. For example, about 60% of the documents in the test set are shorter than 100 words, which corresponds to only one non-overlapping window of size 100. The improvement from passage-level categorization is thus overwhelmed by the inaccuracy introduced by the category merging step in Figure 1, resulting in a poor overall improvement.

To eliminate the negative effect of short test documents and to verify the effectiveness of passage-level categorization, we prepared three subsets of the Apte collection: GT800, GT1200, and GT1600. The results are shown in the following sections.

4.3.2 GT800 collection

Experimental results for the GT800 test collection are shown in Table 5. The GT800 collection is the subset of the Apte test collection in which documents shorter than 800 bytes are removed from the test set. The experimental environment is the same as for the Apte collection.

Table 5. Effectiveness of passages for the GT800 data set with pwf2

                              Precision  Recall  BeP†    ∆%
Document                      0.690      0.690   0.690   0.0
Non-overlapping Windows
  window size = 50            0.688      0.688   0.688   -0.3
  window size = 100           0.695      0.699   0.697   1.0
  window size = 150           0.706      0.701   0.703   1.9
  window size = 200           0.706      0.706   0.706   2.3
Overlapping Windows
  window/overlap = 50/25      0.690      0.690   0.690   0.0
  window/overlap = 100/50     0.715      0.707   0.711‡  3.0
  window/overlap = 150/75     0.704      0.704   0.704   2.0
  window/overlap = 200/100    0.706      0.710   0.708   2.6
Paragraphs                    0.696      0.697   0.697   0.9
Bounded-Paragraphs            0.688      0.707   0.697‡  1.0
TextTiles                     0.704      0.702   0.703   1.9

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

There are some improvements in performance for most passage types, and overlapping windows of size 100 work best. Compared with the Apte collection, the performance improvements on this data set are clear. A detailed explanation is given in Section 4.3.4.

4.3.3 GT1200 collection

Experimental results for the GT1200 test collection are shown in Table 6. The GT1200 collection is the subset of the Apte test collection in which documents shorter than 1,200 bytes are removed from the test set. The experimental environment is the same as for the Apte collection.

There are significant improvements in performance for most passage types, and overlapping windows of size 100 work best. As the lengths of the test documents increase, compared with the Apte and GT800 data sets, the difference in BeP between passage-level and document-level categorization grows. A detailed explanation is given in Section 4.3.4.

4.3.4 GT1600 collection

Experimental results for the GT1600 test collection are shown in Table 7. The GT1600 collection is the subset of the Apte test collection in which documents shorter than 1,600 bytes are removed from the test set. The experimental environment is the same as for the Apte collection.

Table 6. Effectiveness of passages for the GT1200 data set with pwf2

                              Precision  Recall  BeP†    ∆%
Document                      0.660      0.660   0.660   0.0
Non-overlapping Windows
  window size = 50            0.674      0.673   0.673   2.1
  window size = 100           0.689      0.683   0.686‡  4.0
  window size = 150           0.665      0.664   0.665   0.8
  window size = 200           0.642      0.637   0.640‡  -3.0
Overlapping Windows
  window/overlap = 50/25      0.677      0.678   0.678   2.7
  window/overlap = 100/50     0.690      0.689   0.689   4.5
  window/overlap = 150/75     0.665      0.664   0.665   0.8
  window/overlap = 200/100    0.643      0.645   0.644   -2.3
Paragraphs                    0.675      0.675   0.675   2.3
Bounded-Paragraphs            0.683      0.683   0.683   3.5
TextTiles                     0.671      0.670   0.671   1.7

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

For the Apte and all GTnnnn collections, overlapping windows showed the best performance. This is probably because the variable length distribution of paragraphs, bounded-paragraphs, and tiles, compared with the evenly distributed lengths of overlapping windows, causes skewed categorization of passages, and accordingly poorer performance. Furthermore, the overlapping window type is superior to the non-overlapping window type for all data sets. While other passage types, including non-overlapping windows, lose some term locality information at the boundaries, overlapping window passages preserve the locality information since they overlap with adjacent passages.

The best-performing passage size for both non-overlapping and overlapping windows is 100 words. This size is about 3 paragraphs for the test documents of the Apte collection and its three subsets. (Note that we do not include numeric values in the word count.)

The point of the GTnnnn collections is the increasing length of the test documents, obtained by removing short documents from the test set of the Apte collection. From Tables 5, 6, and 7, the performance improvements are approximately proportional to the lengths of the test documents. This is clear in Figure 4.

In Figure 4, the performance improvements in percent are plotted against the average sizes of the test documents for the Apte, GT800, GT1200, and GT1600 data sets, which again shows that the overlapping passage type is the best performer. There is a trend that as the average document size grows, performance improves more. Furthermore, there is a strong linear correlation between average document size and performance improvement for the overlapping window type. This means that the longer the test documents, the better the performance of all passage types.

So far we have examined the results for the longer documents among the short-document data sets. In the following section we present passage-level classification of the full-length documents of master's and doctoral theses.


Table 7. Effectiveness of passages for the GT1600 data set with pwf2

                              Precision  Recall  BeP†    ∆%
Document                      0.636      0.636   0.636   0.0
Non-overlapping Windows
  window size = 50            0.649      0.648   0.648   1.9
  window size = 100           0.665      0.663   0.664   4.4
  window size = 150           0.653      0.642   0.647‡  1.8
  window size = 200           0.626      0.621   0.623   -2.0
Overlapping Windows
  window/overlap = 50/25      0.658      0.658   0.658   3.4
  window/overlap = 100/50     0.670      0.670   0.670   5.4
  window/overlap = 150/75     0.668      0.663   0.666   4.6
  window/overlap = 200/100    0.649      0.649   0.649   2.0
Paragraphs                    0.652      0.652   0.652   2.5
Bounded-Paragraphs            0.665      0.665   0.665   4.5
TextTiles                     0.659      0.658   0.658   3.4

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

4.3.5 KISTI-Theses collection

So far we have examined passage-based text categorization on data sets with short documents of one or two kilobytes (Table 1). In this section, we present experimental results for the KISTI-Theses data set, which consists of full-length documents averaging 92.9 kb and varying from 14.8 kb to 533.5 kb. Data with passage weight function 3 (pwf3) are shown in Table 8.

In this experiment, the k value for the kNN classifier is 1, and the feature selection condition is a document frequency of at least 2 and at most 69 (corresponding to 10% of the training documents). The k = 1 for this collection is very interesting. It is probably due to the fact that this collection is a rather small data set and only one category is assigned to each document. This means that the higher the k value, the higher the possibility of inadequate documents being contained among the top-ranked documents, causing performance degradation. In addition, since all the documents in the collection are very long (on average 92,900 characters), even a single top-ranked document may contain sufficient terms to fully describe a category's features.

Categorization with any type of passage is more effective than categorization with the whole document as the unit (Table 8).

Table 8. Effectiveness of passages for the KISTI-Theses data set with pwf3

                              Precision  Recall  BeP†    ∆%
Document                      0.631      0.631   0.631   0.0
Non-overlapping Windows
  window size = 50            0.677      0.677   0.677   7.3
  window size = 100           0.695      0.695   0.695   10.0
  window size = 200           0.683      0.683   0.683   8.2
  window size = 400           0.687      0.689   0.688‡  9.0
Overlapping Windows
  window/overlap = 50/25      0.669      0.669   0.669   5.9
  window/overlap = 100/50     0.697      0.697   0.697   10.5
  window/overlap = 200/100    0.692      0.692   0.692   9.6
  window/overlap = 400/200    0.686      0.686   0.686   8.7
Paragraphs                    0.669      0.669   0.669   5.9
Bounded-Paragraphs            0.686      0.686   0.686   8.7
Pages                         0.689      0.689   0.689   9.1
TextTiles                     0.689      0.689   0.689   9.1

† Micro-averaged precision and recall break-even points
‡ Micro-averaged F1 measures

Based on Figure 4 and Table 8, the overlapping window of passage size 100 shows the best performance for all data sets. For all data sets except the Apte collection, the performance of the bounded-paragraph type is better than that of the paragraph type. As stated in Section 2.1, the lengths of bounded-paragraphs are less skewed than those of paragraphs, because the length of a bounded-paragraph is bounded below by some minimal value, while paragraphs vary from one sentence to tens of sentences. Skew in passage sizes seems to be harmful to passage-based text categorization.

[Figure 4. Correlation between document size and performance improvement for various passage types. The first point of bounded-paragraph is omitted due to its huge bias.]

Finally, we note briefly the speed-effectiveness trade-off in classification of the KISTI-Theses data set. In this experiment, we find that passage-level categorization is 2 to 5 times slower than document-level categorization, but the memory space required is greatly reduced, because splitting a document into passages multiplies the number of categorization tasks by the passage count but shrinks the working unit from document size to passage size. This is a bearable trade-off considering that a document is split into hundreds of passages in the case of the KISTI-Theses collection (see Table 3). The speed-memory trade-off will be very useful when it is impractical to apply whole-document-level categorization due to the bulky volume of the categorization unit, as with full-length digital books and whole web sites.

5 Conclusions

The advent of full-length document databases and the expansion of web sites pose new challenges to text classification, because traditional whole-document classification may be impractical for these document types. In this article, we introduced a new text categorization model, called passage-based text categorization, in which the problem size (the categorization unit) is reduced from the whole document to small passages split from the document.

We explored text categorization using various passage types: non-overlapping windows, overlapping windows, paragraphs, bounded-paragraphs, pages, and tiles. The improvement obtained by passage-level categorization compared with whole-document categorization is greater than 5% for the short-document data sets and greater than 10% for the long-document data set. For all passage types, there are general improvements in categorization effectiveness. However, overlapping windows showed effectiveness superior to the other passage types across the five test collections.

We also introduced passage weights, which are applied to merge passage categories into document categories. Although overall improvements are observed for all passage weighting schemes tested in this article, there exist one or more schemes superior to the simple summation of weights (pwf1). Therefore, careful design of the passage weighting scheme will further improve effectiveness. Furthermore, our results showed that different data sets have different optimal passage weighting schemes.

The superior effectiveness of passages over whole documents may be the result of a few factors. First, passages can capture the subtopic structure of the test document, while whole-document classification ignores the detailed design of the document. In passage-level text categorization, this subtopic structure can be exploited by a pool of classifiers, each of which has a local, i.e., passage-level, view of the document. Second, passages are short segments of the document, which means they embody locality: document-level classification mashes together a term's locality information, but dividing a document into passages preserves partial locality. In this respect, it is natural that overlapping windows, which preserve locality even at the boundaries, outperform the other passage types.

Dividing a document into smaller passages is an effective mechanism for text categorization when collections of long documents are considered. Even in collections of short texts, we found that passages have the potential to improve effectiveness. Though there is a slight speed-effectiveness trade-off in passage-based text categorization, it is bearable for processing really long documents.

We suggest that passage-level text categorization is an influential method in environments such as unstructured full-length documents, XML documents, and whole web sites. For unstructured documents, any passage type cited in this paper can be applied to the categorization task. With regard to XML documents, since they are naturally composed of several passages (each suitable element content serving as a passage), it is reasonable to expect that the passage-based categorization method is well suited to XML documents. For web site classification, since a web site is usually composed of many web pages, the passage-based categorization model can be applied by regarding each web page as a passage.

The passage-based text categorization model resembles classifier committees [8, 11] in two ways. First, its classification result is obtained from a pool of classifiers. Second, it uses a passage weight function to compute the document's categories, while classifier committees use a combination function to choose categories [8, 11]. We therefore expect that many research results on classifier committees can also be applied easily to passage-based categorization in the near future.

Acknowledgements

We would like to thank Wonkyun Joo for helpful comments and fruitful discussions, Hwa-muk Yoon for providing the raw data for the KISTI-Theses test collection, and Changmin Kim and Jieun Chong for supporting this work.

References

1. Apte, C., Damerau, F., and Weiss, S. M. (1994). Towards Language Independent Automated Learning of Text Categorization Models. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 23-30.

2. Baker, L. D. and McCallum, A. K. (1998). Distributional Clustering of Words for Text Classification. Proceedings of the 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 96-103.

3. Callan, J. P. (1994). Passage-Level Evidence in Document Retrieval. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 302-310.

4. Hearst, M. A. and Plaunt, C. (1993). Subtopic Structuring for Full-Length Document Access. Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 59-68.

5. Hearst, M. A. (1994). Multi-Paragraph Segmentation of Expository Text. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, 9-16.

6. Kaszkiel, M., Zobel, J., and Sacks-Davis, R. (1999). Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, 17(4), 406-439.

7. Kaszkiel, M. and Zobel, J. (2001). Effective Ranking with Arbitrary Passages. Journal of the American Society for Information Science and Technology, 52(4), 344-364.

8. Larkey, L. S. and Croft, W. B. (1996). Combining Classifiers in Text Categorization. Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval, 289-297.

9. Moffat, A., Sacks-Davis, R., Wilkinson, R., and Zobel, J. (1994). Retrieval of Partial Documents. NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2), 181-190.

10. Salton, G., Allan, J., and Buckley, C. (1993). Approaches to Passage Retrieval in Full Text Information Systems. Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 49-58.

11. Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1-47.

12. van Rijsbergen, C. J. (1979). Information Retrieval. Butterworths, London.

13. Witten, I. H., Moffat, A., and Bell, T. C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.

14. Yang, Y. (1994). Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, 13-22.

15. Yang, Y. and Pedersen, J. O. (1997). A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (ICML'97), 412-420.

16. Yang, Y. (1999). An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1), 67-88.

17. Yang, Y., Slattery, S., and Ghani, R. (2002). A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, 17(2), 219-241.

18. Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995). Efficient Retrieval of Partial Documents. Information Processing and Management, 31(3), 361-377.