
S.A. Noah et al. (Eds.): M-CAIT 2013, CCIS 378, pp. 283–292, 2013. © Springer-Verlag Berlin Heidelberg 2013

Enrichment of BOW Representation with Syntactic and Semantic Background Knowledge

Rayner Alfred1, Patricia Anthony2, Suraya Alias1, Asni Tahir1, Chin Kim On1, and Lau Hui Keng1

1 Center of Excellence in Semantic Agents, School of Engineering and Information Technology, Universiti Malaysia Sabah, Jalan UMS, 88400, Kota Kinabalu, Sabah, Malaysia

{Ralfred,suealias,asnieta,kimonchin,hklau}@ums.edu.my

2 Department of Applied Computing, Faculty of Environment, Society and Design, Lincoln University, Christchurch, New Zealand

[email protected]

Abstract. The basic Bag of Words (BOW) representation, which is generally used in text document clustering or categorization, loses important syntactic and semantic information contained in the documents. This is particularly problematic when documents contain many stop words or are short. In this paper, we study the contribution of incorporating syntactic features and semantic knowledge into the representation used for clustering a text corpus. We investigate the quality of the clusters produced when incorporating syntactic and semantic information into the representation of text documents by analyzing the internal structure of the clusters using the Davies-Bouldin Index (DBI). This paper studies and compares the quality of the clusters produced when four different text representations are used to cluster a text corpus: the standard BOW representation, the standard BOW representation integrated with syntactic features, the standard BOW representation integrated with semantic background knowledge, and finally the standard BOW representation integrated with both syntactic features and semantic background knowledge. Based on the experimental results, it is shown that the quality of the clusters produced is improved by integrating semantic and syntactic information into the standard bag of words representation of a text corpus.

Keywords: clustering, bag of words, syntactic features, semantic background knowledge, automatic text categorization, knowledge management.

1 Introduction

Text document clustering represents a challenging problem for the text mining and machine learning communities due to the growing demand for automatic information retrieval systems [17]. The large volume of text documents accessible to users hinders access to useful information buried in disorganized, incomplete, and unstructured text. In order to enhance user accessibility, we propose an algorithm that clusters text documents based on the standard Bag of Words (BOW) representation of a text corpus enriched with syntactic features and semantic background knowledge.


Traditionally, the text document clustering process is based on a Bag of Words (BOW) approach, in which each document is represented as a vector with one dimension for each term of the dictionary containing all the words that appear in the corpus, as shown in Fig. 1. Each row represents a document and each column represents one of the unique terms occurring in the documents. For instance, based on Fig. 1, there are n documents that share p unique terms in the BOW representation.

Fig. 1. Standard Bag of Words representation: an n × p matrix in which entry W_ij is the weight of term f_j in document i

The TF-IDF (Term Frequency – Inverse Document Frequency) value associated with a given term represents the weight of the term itself; it is computed from the term's frequency of occurrence within the corresponding document (Term Frequency, or TF) and within the entire corpus (Inverse Document Frequency, or IDF). Much work has been conducted on the preprocessing of documents in order to improve the text document representation, including stemming, removal of stop words and normalization [18]. However, the improvement is still limited due to three major drawbacks: (1) the order of term occurrence is not maintained or considered in clustering the text documents; (2) synonymous words are treated as different components; and (3) ambiguous words are grouped into a single component (e.g., the bank of a river vs. a financial bank). It is therefore essential to further embed semantic and syntactic information in order to enhance the quality of clustering.
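To make the weighting scheme concrete, the following is a minimal sketch of how the TF-IDF matrix of Fig. 1 can be built. The function name, the toy corpus and the unsmoothed log(n/df) variant are illustrative assumptions; the paper does not specify its exact IDF formula.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build the n x p term-weight matrix of Fig. 1.

    docs: list of token lists, one per document.
    Returns (vocabulary, matrix) where matrix[i][j] is the
    TF-IDF weight of term vocabulary[j] in document i.
    """
    n = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in docs for t in set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within the document
        matrix.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, matrix

# Toy corpus (hypothetical): three short, already-tokenized documents.
docs = [["bank", "river", "water"],
        ["bank", "loan", "money"],
        ["river", "water", "flood"]]
vocab, weights = tfidf_matrix(docs)
```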

This paper is organized as follows. Section 2 discusses related work. Section 3 presents the proposed modified BOW representation used to cluster text documents. Section 4 describes the experimental design and discusses the results obtained, and finally the paper is concluded in Section 5.

2 Related Works

In traditional document clustering methods, a document is treated as a bag of words, with no relations between words. The feature vector representing the document is built from the frequency counts of the terms in the document. Weights calculated with techniques such as Inverse Document Frequency (IDF) and Information Gain (IG) are applied to the frequency count associated with each term. Several studies have addressed the preprocessing of documents, including semantic and syntactic analysis, in order to improve the text document representation.


Choudhary and Bhattacharyya [1] have proposed a new method of creating document vectors in order to improve clustering results obtained with a Self-Organizing Map (SOM). This approach uses the Universal Networking Language (UNL) representation of a document. The UNL represents the document in the form of a graph with universal words as nodes and the relations between them as links. Instead of considering the document as a bag of words, they use the information given by the UNL graph to construct the vector. The proposed method improved the clustering accuracy by using the semantic information of the sentences representing the document. Siolas [7] introduced semantic knowledge into Automatic Text Categorization (ATC) by building a kernel that takes into account the semantic distance between different words, first based on WordNet and then using Fisher metrics in a way similar to Latent Semantic Indexing (LSI). Zelikovitz and Hirsh [8] have also shown that ATC accuracy can be improved by adding extra semantic knowledge, extracted from unclassified documents, to the LSI representation.

Caropreso and Matwin [5] and Moschitti and Basili [9] have studied the usefulness of including semantic knowledge in the text representation for the selection of sentences from technical genomic texts [5]. Both works agree that word senses alone are not adequate to improve ATC accuracy, and they have shown that using hierarchical technical dictionaries together with syntactic relations is beneficial for this task when using state-of-the-art machine learning algorithms. Yamakawa et al. have studied a technique for incorporating the vast amount of human knowledge accumulated in Wikipedia into text representation and classification [16]. The aim is to improve classification performance by transforming general terms into a set of related concepts grouped around semantic themes. They proposed a method for breaking the enormous amount of extracted Wikipedia knowledge (concepts) into smaller pieces (subsets of concepts). The subsets of concepts are separately used to represent the same set of documents in a number of different ways, from which an ensemble of classifiers is built. The experiments conducted show that the method improves classification performance compared with a classifier trained on a regular term-document matrix.

Compared to semantics, less work has been done on the syntactic side. Cohen and Singer [3] have conducted experiments that study the importance of introducing the order of the words into the text representation; this is done by defining position-related predicates in their proposed ILP system. This has been extended by Goadrich et al. [4] in the Information Extraction area, incorporating the order of noun phrases into the representation. Lewis [6] compared different representations using either words or syntactic phrases (but not a combination of both) for Information Retrieval (IR) and Automatic Text Categorization (ATC). For the text document classification task, Caropreso and Matwin have proposed a method that uses a bi-gram representation of text documents together with their single words in the BOW representation. It is shown that syntactic bi-grams (formed from words that are syntactically linked) provide extra information that improves classification performance compared to the traditional BOW representation of text documents [5]. In this work, we investigate the effects of integrating both semantic and syntactic information into the BOW representation of documents.


3 Enrichment of Text Representation for Text Documents Clustering

In this research, the syntactic and semantic knowledge of a text document will be incorporated into the BOW representation.

3.1 Syntactic Enrichment of Text Representation with n-gram

Firstly, the syntactic knowledge of a text document is incorporated into the BOW representation by using n-grams. An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs, according to the particular application. An n-gram of size 1 is referred to as a unigram; size 2 is a bigram (or, less commonly, a digram); size 3 is a trigram; and size 4 or more is simply called an n-gram. In this study, the syntactic enrichment of the BOW representation is done by using the bi-gram model, i.e., n-grams of size 2. There are two main steps involved in integrating syntactic features into the standard BOW representation of a text document. First, all possible bi-gram combinations of adjacent words are generated. Then the frequencies of all the possible bi-gram combinations are computed, and only the high-frequency bi-gram terms are taken into consideration as new single items introduced into the standard BOW representation. For instance, when a document D1 is represented by a vector of n terms, D1 = < T1, T2, …, Tn >, the possible bi-gram combinations are T1T2, T2T3, T3T4, …, Tn-1Tn. A sketch of these two steps is given below.
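The sketch below assumes that "high frequency" is implemented as a simple corpus-wide count threshold; the paper does not state its cut-off, and the function and parameter names are illustrative.

```python
from collections import Counter

def enrich_with_bigrams(doc_tokens, min_freq=3):
    """Append frequent bi-grams to each document's token list.

    doc_tokens: list of token lists, one per document.
    min_freq: assumed frequency threshold for 'high frequency'.
    """
    # Step 1: generate all adjacent-word pairs T1T2, T2T3, ..., Tn-1Tn.
    counts = Counter((a, b) for doc in doc_tokens
                     for a, b in zip(doc, doc[1:]))
    # Step 2: keep only the high-frequency bi-grams.
    frequent = {bg for bg, c in counts.items() if c >= min_freq}
    enriched = []
    for doc in doc_tokens:
        extras = ["_".join(bg) for bg in zip(doc, doc[1:]) if bg in frequent]
        # Each frequent bi-gram joins the BOW as a new single item.
        enriched.append(doc + extras)
    return enriched
```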

3.2 Semantic Enrichment of Text Representation with WordNet

Semantic matching is a technique used in computer science to identify information which is semantically related. Given any two graph-like structures, e.g., classifications, database or XML schemas, and ontologies, matching is an operator which identifies those nodes in the two structures that semantically correspond to one another. As an example applied to file systems, it can identify that a folder labeled car is semantically equivalent to another folder labeled automobile, because they are synonyms in English. This information can be taken from a linguistic resource such as WordNet [14].

Semantic matching represents a fundamental technique in many application areas such as resource discovery, data integration, data migration, query translation, peer-to-peer networks, agent communication, and schema and ontology merging. In fact, it has been proposed as a valid solution to the semantic heterogeneity problem, namely managing diversity in knowledge. Interoperability among people of different cultures and languages, having different viewpoints and using different terminology, has always been a huge problem. Especially with the advent of the Web and the consequent information explosion, the problem has become more pronounced.


People face the concrete problem of retrieving, disambiguating and integrating information coming from a wide variety of sources. It has been shown that indexing words with their WordNet synset or sense improves information retrieval performance [15].

WordNet is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, and synsets are interlinked by means of conceptual-semantic and lexical relations. WordNet is freely and publicly available for download.

In this study, the semantic knowledge of a text document is incorporated into the BOW representation based on WordNet. The basic idea is to introduce a new concept that represents two or more words that are semantically related or have a similar meaning, based on the references obtained from WordNet. Once the semantically related words are grouped under one concept, these words are replaced with the newly introduced concept in the BOW representation. Every time a new, mutually exclusive semantic relationship is found between words, a new concept is introduced. In our approach, the method only takes hyponym- and synonym-related terms into consideration; for hyponyms, only verbs are considered. A sketch of this concept-replacement step is given below.
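The sketch assumes NLTK's WordNet interface. Mapping every word to its first (most common) synset is an illustrative simplification; the paper's exact grouping of hyponym- and synonym-related terms is not spelled out.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def concept_for(word):
    """Replace a word with a concept label derived from WordNet.

    Synonyms sharing their most common synset collapse to the same
    concept; words unknown to WordNet are kept as-is.
    """
    synsets = wn.synsets(word)
    return synsets[0].name() if synsets else word

def semantic_enrich(doc_tokens):
    """Apply the concept mapping to every token of every document."""
    return [[concept_for(t) for t in doc] for doc in doc_tokens]

# 'car' and 'automobile' both map to the synset label 'car.n.01',
# so they become a single item in the enriched BOW representation.
print(concept_for("car"), concept_for("automobile"))
```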

4 Experimental Evaluations

The experiment is designed to investigate and compare the effectiveness of clustering text documents based on the following four types of text document representation:

1. Standard BOW representation (SBOW)
2. Enriched BOW representation with syntactic background knowledge (SYBOW)
3. Enriched BOW representation with semantic background knowledge (SEBOW)
4. Enriched BOW representation with syntactic and semantic background knowledge (SSBOW)

The flow of the experiment for all types of text document representation is shown in Fig. 2. The first step performed in the experiment is tokenization. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. Next, the stop word removal process is conducted in order to remove irrelevant items from the text document. The stemming process is then conducted in order to reduce words to their base forms. A sketch of the first two steps is given below.
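A minimal sketch of tokenization and stop word removal, assuming NLTK's tokenizer and English stop-word list (the paper does not name the tools used for these two steps):

```python
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

STOP_WORDS = set(stopwords.words("english"))

def tokenize_and_filter(text):
    """Break a document into tokens, then drop stop words and punctuation."""
    tokens = [t.lower() for t in word_tokenize(text)]
    return [t for t in tokens if t.isalpha() and t not in STOP_WORDS]

tokenize_and_filter("The banks of the river were flooded.")
# -> ['banks', 'river', 'flooded']  (stemming happens in the next step)
```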

In this experiment, the Porter stemmer is not used in the stemming process [10]. Instead, the WordNet morphological process is used to remove the suffixes and prefixes of the words found in the text document. The WordNet morphological process is used because WordNet semantic information will later be used to enrich the BOW representation. Thus, by using the WordNet stemmer, the stemmed word produced can be looked up directly in WordNet for its synonyms and hyponyms. After the stemming process, the BOW representation is enriched with syntactic background knowledge by using the n-gram model. The BOW representation is further enriched with semantic background knowledge by using WordNet. Next, the TF-IDF (Term Frequency – Inverse Document Frequency) weight is computed. TF-IDF is a numerical statistic which reflects how important a word is to a document in a collection or corpus [11]. A sketch of the WordNet-based stemming step follows.
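The WordNet morphological step can be sketched with NLTK's morphy function, which reduces inflected forms against WordNet's own entries; trying the four parts of speech in a fixed order is an illustrative choice, not the paper's stated procedure:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def wordnet_stem(token):
    """Reduce a token to a WordNet base form so that its synonyms
    and hyponyms can be looked up directly in WordNet."""
    for pos in (wn.NOUN, wn.VERB, wn.ADJ, wn.ADV):
        base = wn.morphy(token, pos)
        if base is not None:
            return base
    return token  # not in WordNet; keep the surface form

print(wordnet_stem("geese"))    # -> 'goose'
print(wordnet_stem("studies"))  # -> 'study'
```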

Fig. 2. The flow of the experiment for the four types of text document representation, from tokenization to the enriched vector space representation

Once the weight vector of each document is computed, the similarity between two different documents is computed by using the cosine similarity distance [12]. All documents are clustered by using the partitional k-means clustering method, and the clustering results obtained are evaluated by using the Davies-Bouldin Index (DBI) [13]. DBI is used to measure the quality of the clusters produced because it uses both the within-cluster and between-cluster distances. The DBI is described in equations (1) through (4). Let dcentroid(Qk), defined in (1), denote the average centroid distance within cluster Qk, where xi ∈ Qk, Nk is the number of samples in cluster Qk, ck, defined in (2), is the center of the cluster, and k ≤ K clusters. Let dbetween(Qk, Ql), defined in (3), denote the distance between clusters Qk and Ql, where ck is the centroid of cluster Qk and cl is the centroid of cluster Ql.


$$d_{centroid}(Q_k) = \frac{\sum_{x_i \in Q_k} \lVert x_i - c_k \rVert}{N_k} \quad (1)$$

$$c_k = \frac{1}{N_k} \sum_{x_i \in Q_k} x_i \quad (2)$$

$$d_{between}(Q_k, Q_l) = \lVert c_k - c_l \rVert \quad (3)$$

$$DBI = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{d_{centroid}(Q_k) + d_{centroid}(Q_l)}{d_{between}(Q_k, Q_l)} \quad (4)$$

Therefore, given a partition of the N points into K clusters, the DBI is defined in (4).

This cluster dispersion measure can be incorporated into any clustering algorithm to evaluate a particular segmentation of data.
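The following is a minimal sketch of the evaluation pipeline, implementing equations (1) to (4) directly with NumPy on top of scikit-learn's k-means. Normalizing the rows to unit length so that Euclidean k-means approximates cosine-similarity clustering is an assumption, as is the random toy data:

```python
import numpy as np
from sklearn.cluster import KMeans

def davies_bouldin(X, labels, centroids):
    """Davies-Bouldin Index, following equations (1)-(4)."""
    K = centroids.shape[0]
    # Eq. (1): mean distance of each cluster's points to its centroid c_k.
    s = np.array([np.linalg.norm(X[labels == k] - centroids[k], axis=1).mean()
                  for k in range(K)])
    total = 0.0
    for k in range(K):
        # Eq. (3) in the denominator; worst-case ratio per eq. (4).
        ratios = [(s[k] + s[l]) / np.linalg.norm(centroids[k] - centroids[l])
                  for l in range(K) if l != k]
        total += max(ratios)
    return total / K  # Eq. (4): average over the K clusters

# Toy TF-IDF matrix: 100 documents x 50 terms (hypothetical data).
X = np.random.rand(100, 50)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit rows ~ cosine geometry
km = KMeans(n_clusters=20, n_init=10).fit(X)   # k = 20 as in the experiment
print(davies_bouldin(X, km.labels_, km.cluster_centers_))
```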

In the experiment conducted, firstly, the number of clusters, k, is set to 20. Although k is usually set to the square root of the number of documents, n, as a rule of thumb, we want to investigate the effect of varying the number of documents on the clustering results when the number of clusters is fixed. In this experiment, the number of text documents varies from 500 to 1500, with an increment of 250 for each experiment. Two types of text document representation are used in the first part of the experiment: SBOW and SSBOW. Secondly, the number of documents is fixed to 1000, and the number of clusters formed is set to 20, 25 and 30 respectively. All four types of text document representation are used in the second part of the experiment: SBOW, SYBOW, SEBOW and SSBOW. Table 1 shows the results of clustering text documents, ranging from 500 to 1500, with a fixed number of clusters, using the standard BOW representation (SBOW) and the enriched BOW representation with syntactic and semantic background knowledge (SSBOW).

Based on the results obtained, the quality of the clusters produced is better for documents represented by the enriched BOW representation with syntactic and semantic background knowledge (SSBOW). This is shown in Table 1, in which the DBI measurements are lower when documents are clustered using the SSBOW representation. Thus, the results show the importance of integrating syntactic and semantic background knowledge into the standard BOW representation of text documents. Note that the average DBI value for the clustering results using the SBOW representation is quite high, since more items or words are considered when computing the similarity distance between two different text documents. In the SSBOW representation, fewer items or words are used in computing the similarity distance between two different documents.


Table 1. Comparison of DBI values for the clustering results when using SBOW and SSBOW with different numbers of documents clustered

Experiment   Text Documents   SBOW (DBI)   SSBOW (DBI)
1            500              31.1         7.2
2            750              41.7         1.6
3            1000             27.2         2.1
4            1250             44.3         4.4
5            1500             16.1         4.4

Table 2 shows the results of clustering 1000 text documents with three different numbers of clusters, 20, 25 and 30, using the standard BOW representation (SBOW), the enriched BOW representation with syntactic background knowledge (SYBOW), the enriched BOW representation with semantic background knowledge (SEBOW) and finally the enriched BOW representation with syntactic and semantic background knowledge (SSBOW).

Table 2. Comparison of DBI Values for the Clustering Results When Using SBOW, SYBOW, SEBOW and SSBOW with Different Number of Clusters Formed in Each Experiment

Number of     Number of Text     DBI Values
Clusters      Documents          SBOW    SYBOW   SEBOW   SSBOW
20            1000               4.91    11.1    4.4     3.1
25            1000               27.2    22.8    6.9     2.1
30            1000               9.5     3.5     3.1     4.1

The results also show that the quality of the clusters produced is better for documents represented by the enriched BOW representation with syntactic and semantic background knowledge (SSBOW), regardless of the number of clusters produced. Notice that the best quality of clusters is obtained when the number of clusters is 25 with the SSBOW representation. However, when the number of clusters is 20, the quality of the clusters produced is not consistent, in that the standard BOW representation provides better-quality clusters than the enriched BOW representation with syntactic background knowledge (SYBOW). These results show that syntactic background knowledge alone is not enough to improve the quality of the clustering produced. Table 2 also suggests that the quality of the clusters produced tends to be better when the number of clusters requested is larger: when the number of clusters is larger, the clusters formed are more compact and better separated from one another. On the contrary, the enriched BOW representation with syntactic background knowledge does not always produce better-quality clusters than the SBOW representation. Syntactic background knowledge alone is not enough to obtain better clustering results, and similarly, semantic background knowledge alone is not good enough. In short, better-quality clusters are produced when both syntactic and semantic background knowledge are integrated into the standard bag of words (BOW) representation of text documents.

5 Conclusion

In this paper we have presented comparisons of clustering results obtained using different types of bag of words (BOW) representation for text documents. Besides the standard BOW representation, three other types of BOW representation are investigated: the enriched BOW representation with syntactic background knowledge (SYBOW), the enriched BOW representation with semantic background knowledge (SEBOW) and finally the enriched BOW representation with syntactic and semantic background knowledge (SSBOW). It is shown that adding semantic and syntactic background knowledge to the standard BOW representation is very important for obtaining better clustering results. We have empirically shown that this syntactic and semantic knowledge is useful for improving the clustering results.

Acknowledgments. This work has been supported by the Long Term Research Grant Scheme (LRGS) project funded by the Ministry of Higher Education (MoHE), Malaysia under Grant No. LRGS/TD/2011/UiTM/ICT/04.

References

1. Choudhary, B., Bhattacharyya, P.: Text clustering using Universal Networking Language representation. In: Eleventh International World Wide Web Conference (2003)

2. Zelikovitz, S., Hirsh, H.: Improving Text Classification with LSI Using Background Knowledge. In: Proceedings of CIKM 2001, 10th ACM International Conference on Information and Knowledge Management (2001)

3. Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. ACM Trans. Inf. Syst. (1999)

4. Goadrich, M., Oliphant, L., Shavlik, J.: Learning Ensembles of First-Order Clauses for Recall-Precision Curves: A Case Study in Biomedical Information Extraction. In: Proceedings of the Fourteenth International Conference on Inductive Logic Programming, Porto, Portugal (2004)

5. Caropreso, M.F., Matwin, S.: Incorporating Syntax and Semantics in the Text Representation for Sentence Selection. In: Recent Advances in Natural Language Processing, Borovets, Bulgaria (2007)

6. Lewis, D.D.: Representation and Learning in Information Retrieval, Ph.D. dissertation, University of Massachusetts (1992)

7. Siolas, G.: Modèles probabilistes et noyaux pour l'extraction d'informations à partir de documents. Thèse de doctorat de l'Université Paris (2003)

8. Zelikovitz, S., Hirsh, H.: Improving Text Classification with LSI Using Background Knowledge. In: Proceedings of CIKM 2001, 10th ACM International Conference on Information and Knowledge Management (2001)


9. Moschitti, A., Basili, R.: Complex Linguistic Features for Text Classification: a Comprehensive Study. In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 181–196. Springer, Heidelberg (2004)

10. Porter, M.F.: An algorithm for suffix stripping. In: Jones, K.S., Willett, P. (eds.) Readings in Information Retrieval, pp. 313–316. Morgan Kaufmann Publishers Inc., San Francisco (1997)

11. Salton, G., McGill, M.J.: Introduction to modern information retrieval. McGraw-Hill (1983)

12. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW 2007 - Proceedings of the 16th International World Wide Web Conference, pp. 131–140 (2007)

13. Davies, D.L., Bouldin, D.W.: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 2, 224 (1979)

14. Miller, G.A., Beckwith, R., Fellbaum, C.D., Gross, D., Miller, K.: WordNet: An online lexical database. International Journal of Lexicography 3(4), 235–244 (1990)

15. Gonzalo, J., Verdejo, F., Chugur, I., Cigarrán, J.M.: Indexing with WordNet synsets can improve Text Retrieval, CoRR (1998)

16. Yamakawa, H., Jing, P., Feldman, A.: Semantic enrichment of text representation with Wikipedia for text classification. In: Systems Man and Cybernetics (SMC 2010), pp. 4333–4340 (2010)

17. Alfred, R., Mujat, A., Obit, J.H.: A Ruled-Based Part of Speech (RPOS) Tagger for Malay Text Articles. In: Selamat, A., et al. (eds.) ACIIDS 2013, Part II. LNCS, vol. 7803, pp. 50–59. Springer, Heidelberg (2013)

18. Leong, L.C., Basri, S., Alfred, R.: Enhancing Malay Stemming Algorithm with Background Knowledge. In: Anthony, P., Ishizuka, M., Lukose, D. (eds.) PRICAI 2012. LNCS, vol. 7458, pp. 753–758. Springer, Heidelberg (2012)