
International Journal of Advances in Engineering & Technology, Jan. 2014.

©IJAET ISSN: 22311963

Vol. 6, Issue 6, pp. 2704-2716

PERCEPTION BASED DOCUMENT CATEGORIZATION USING

K-MODE ALGORITHM

Saranya V 1 and Dharmaraj R 2

1 Assistant Professor, Department of Computer Science and Engineering, Sri Vidya College of Engineering & Technology, Virudhunagar, Tamilnadu, India.
2 Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, India

ABSTRACT

Categorization is the process by which objects are recognized, differentiated and understood. Categorization implies that objects are grouped into categories, usually for some specific purpose. There are many categorization theories and techniques, such as conventional categorization, conceptual clustering and prototype theory. Conceptual clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. This paper describes extensions to the K-modes algorithm for clustering categorical data. The simple matching dissimilarity measure for categorical objects allows the K-modes paradigm to obtain clusters with strong intra-cluster similarity and to cluster large categorical data sets efficiently. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.

KEYWORDS: Categorization, Conceptual clustering, K-mode, Dissimilarity measure, text clustering.

I. INTRODUCTION

Data mining is the use of automated data analysis techniques to uncover previously undetected relationships among data items. Data mining often involves the analysis of data stored in a data warehouse. Three of the major data mining techniques are regression, classification and clustering [3]. Data mining is also known as machine learning or knowledge discovery in databases (KDD). Two common data mining techniques for finding hidden patterns in data are clustering [2] and classification. Although classification and clustering are often mentioned in the same breath, they are different analytical approaches.

Classification is a data mining (machine learning) technique used to predict group membership for data instances. In data mining, classification is one of the most important tasks. It maps data to predefined targets; it is supervised learning, since the targets are predefined. The aim of classification is to build a classifier from a set of cases, using some attributes to describe the objects and one attribute to label the group of each object. The classifier is then used to predict the group attribute of new cases from the field, based on the values of the other attributes. For example, you may wish to use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy". Popular classification techniques include decision trees and neural networks. A classifier is constructed (a model is built) from the training set and the values (class labels) of a classifying attribute, and is then used to classify new data.

Clustering [13] is a data mining (machine learning) procedure used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering [4] techniques include K-means clustering and expectation maximization (EM) clustering. Clustering is a division of data into groups of comparable objects. Each group, called a cluster, consists of objects that are similar


between themselves and dissimilar to objects of other groups.

Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression), but achieves simplification: many data objects are represented by a small number of clusters, and hence the data is modelled by its clusters. The clustering [5] problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters. The most distinct characteristic of the clustering operation in data mining is that the data sets often contain both numeric and categorical feature values.

Text document clustering [25] has been studied rigorously because of its important role in data mining and information retrieval [24]. A text document is not only a collection of words but also a collection of concepts. Clustering is a division of data into groups of similar objects; each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. The remaining sections of this article outline work relevant to the topic of this research. The methodology, analysis, findings, discussion and conclusions, followed by suggestions for future work, are also presented explicitly.

II. EXISTING SYSTEM

In the existing approach, the top-k algorithm and two methods are used to categorize the categorical data [15] or documents.

2.1. Top K Algorithm

The top-k algorithm [1] has received much attention in a variety of settings, such as similarity search on multimedia data, ranked retrieval [24] on text and semi-structured documents in digital libraries and on the Web, spatial data analysis, network and stream monitoring, collaborative recommendation and preference queries on e-commerce product catalogs, and ranking of SQL-style query results on structured data sources in general. Top-k search becomes more popular as dataset sizes increase, because users are usually interested in a few best objects rather than a large list of objects as a result.

The text [21] can be viewed as a word sequence, so we do not have to worry about how to break it into words.

1) Use a table to record each word's frequency while traversing the whole word sequence. In this phase, the key is the word and the value is its word-frequency [12]. This takes O(n) time.

2) Sort the (word, word-frequency) pairs, where the key is the word-frequency. This takes O(n log n) time with a standard sorting algorithm.

3) After sorting, we just take the first K words. This takes O(K) time.

To summarize, the total time is O(n + n log n + K); since K is surely smaller than n, this is effectively O(n log n). A minimal sketch of these steps is given after this list.
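As an illustration, the counting and sorting steps above can be sketched in Python; this is only a sketch of the word-frequency procedure just described, not of the text cube search method of [1], and the example sentence is hypothetical.

```python
from collections import Counter

def top_k_words(text, k):
    """Count word frequencies, sort by frequency, and return the first k words."""
    words = text.lower().split()      # view the text as a word sequence
    freq = Counter(words)             # step 1: word -> word-frequency table, O(n)
    ranked = freq.most_common()       # step 2: sort the (word, frequency) pairs, O(n log n)
    return ranked[:k]                 # step 3: take the first K words, O(K)

print(top_k_words("data mining mines data from data sets", 2))
# [('data', 3), ('mining', 1)]
```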

The two methods are: the first scans the inverted index once to compute relevance scores of all cells [1]; the second examines as few cells of the text cube as possible to locate the top-k answers [1].

III. PROPOSED SYSTEM

In the proposed system, the K-mode [13] algorithm is used to categorize documents using a matching dissimilarity measure [7].

3.1. Problem statement

Because the top-k approach relies on similarity measures, its clustering efficiency is low. The proposed system therefore uses the K-mode algorithm to cluster the data, together with several methods for categorizing the data or documents.

3.2 Stop Word

Stop words [11] are a part of human language. Stop word removal means removing these low-content words, for example auxiliary verbs such as 'has', 'is', 'been', and articles such as 'the', 'an', 'a'.

3.3 Stemming

The stemming [8] process removes affixes (suffixes or prefixes) from words, producing a root form called a stem that often closely approximates the root morpheme of a word.

3.4 Filtering

The filter process removes unwanted words, such as connectives, from the text; this process produces the keywords [9].

3.5 Frequency

In the frequency [12] process the system counts the number of times each keyword occurs in the text document [21].

3.6 Distance

In the distance process the system compares the keywords [9] from the text document against the XML files [29]. If a keyword is present in an XML file it is denoted 1, otherwise 0.

3.7 K- Mode Algorithm


Clustering [2] is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The clustering problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters.

The most distinct characteristic of the clustering operation in data mining is that the data sets often contain both numeric and categorical attribute values. This requires the clustering algorithm to be capable of dealing with the complexity of inter- and intra-relations of the data sets expressed in different types of attributes, whether numeric or categorical.

The K-means algorithm is one of the most popular clustering algorithms. However, the K-means clustering algorithm cannot cluster categorical data [15] because of the dissimilarity measure [7] it uses. The K-modes clustering algorithm is based on the K-means paradigm but removes the numeric-data limitation whilst preserving its efficiency, as shown in Fig. 2. The K-modes algorithm [13] extends the K-means paradigm to cluster categorical data by removing the limitation imposed by K-means.

The K-modes algorithm has made the following modifications to the K-means algorithm:

A simple matching dissimilarity measure for categorical objects.

Modes instead of means for clusters.

A frequency [12] based method to find the modes.

Because the K-modes algorithm uses the same clustering process as K-means, it preserves the efficiency of the K-means algorithm. These modifications remove the numeric-only limitation of the K-means algorithm but maintain its efficiency in clustering large categorical data sets.


Figure 1. System Architecture

Fig. 1 represents the overall architecture of the system. The system lets the user find the group of a document very easily and is flexible in clustering the data [4]. The user gives a webpage as the input, which is converted into a text document using an HTML-to-text parser [27]. The input is then processed using the methods described below to find the keywords [9] that occur in the text document [21]. The system then compares the keywords from the text document against the XML files and displays the category using the K-mode algorithm.
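The paper does not state which HTML-to-text parser [27] is used; the sketch below, based on Python's standard html.parser module, only illustrates the idea of turning a webpage into a plain text document by keeping its visible text.

```python
from html.parser import HTMLParser

class HTMLToText(HTMLParser):
    """Collects the visible text of an HTML page, skipping script and style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.parts.append(data.strip())

def webpage_to_text(html):
    parser = HTMLToText()
    parser.feed(html)
    return " ".join(parser.parts)

print(webpage_to_text("<html><body><h1>News</h1><p>Sunny day ahead.</p></body></html>"))
# News Sunny day ahead.
```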

Figure 2 Use case diagram for K Mode algorithm

IV. SYSTEM SEGMENTATION

4.1 Modules and sub modules

Login session

Stop word removal

Stemming

Filter

Frequency

Distance

K-mode algorithm [10]

[Figure 2 content: web page → convert webpage into text document → new text document → analyse the keywords in the document → find out the category using keywords → category matching → allowed into particular category]


4.2 Login Session

In the login session the administrator enters the user name and password. If the user name and password are correct, an 'admin logged in' message is displayed; otherwise an 'invalid login' message is displayed. The administrator can add the XML files and the keywords [9] of any category, and can also process the input data.

4.3 Stop word

Stop words are a part of human language. In enterprise search [14], all stop words, for example common words like 'a', 'and' and 'the', are removed from multi-word queries to improve search performance [11]. Stop word recognition in Japanese is based on grammatical information. Stop words are common words that carry less important meaning than keywords. Usually search engines eliminate stop words from a keyword [9] phrase to return the most relevant results; that is, stop words drive much less traffic than keywords, and these most frequent words often do not carry much meaning. Stop word removal means removing such poor words.

The poor words are articles such as 'the', 'an', 'a'. We generally remove these, as well as numbers, dates, special characters and symbols.

Stop word removal emphasises discrimination: it reduces the number of words in common between document descriptions and tries to reduce the amount of noise. A minimal sketch of stop word removal is given below.
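The sketch below assumes a small illustrative stop list rather than the full list of [11].

```python
# Illustrative stop list; a real system would use a complete list such as [11].
STOP_WORDS = {"a", "an", "the", "has", "is", "been", "and", "of", "in"}

def remove_stop_words(text):
    """Keep only the words that are not in the stop word list."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The weather has been sunny in the morning"))
# ['weather', 'sunny', 'morning']
```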

4.4 Stemming

The stemming process [8] removes affixes (suffixes or prefixes) from words, producing a root form called a stem that often closely approximates the root morpheme of a word [8]. Words that appear in documents have many morphological variants. Since morphological variants of words have similar semantic interpretations, they can be considered equivalent for IR purposes. Morphological analysis tries to associate variants of the same term with a root form; for example 'go', 'goes', 'going' and 'gone' map to the root form 'go'. Many applications, such as document retrieval [24], do not need morphological analyzers to be linguistically correct.

Stemming attempts to remove certain surface markings from words in order to discover their root form. In theory this involves discarding both prefixes (un-, dis-, etc.) and suffixes (-ing, -ness, etc.), although most stemmers used by search engines [14] only remove suffixes. Stemming is a rather rough process, since the root form is not required to be a proper word. After removing high-frequency words, an indexing procedure tries to conflate word variants into the same stem or root using a stemming algorithm.

For example, the words "thinking", "thinkers" or "thinks" may be reduced to the stem "think". In information retrieval [24], grouping words having the same root under the same stem (or indexing term) may increase the success rate when matching documents to a query. Such an automatic procedure may therefore be valuable in enhancing retrieval effectiveness, assuming that words with the same stem refer to the same idea or concept and must therefore be indexed under the same form. A minimal sketch of suffix stripping is given below.
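The following is a minimal suffix-stripping sketch in the spirit of the "think" example above; it is not one of the stemmers analysed in [8], which a real system would use instead (e.g. the Porter stemmer).

```python
# Illustrative suffix list; real stemmers apply much richer rule sets.
SUFFIXES = ("ers", "ing", "ness", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix; the stem need not be a proper word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["thinking", "thinkers", "thinks"]])
# ['think', 'think', 'think']
```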

4.5 Filter

The filter process removes connectives and other low-value words from the text document. Such words are a part of human language. In enterprise search, these words, for example common words like 'in', 'but', 'not', 'for', are removed from multi-word queries to increase search performance. They are common words that carry less important meaning than keywords [9]. Usually search engines remove them from a keyword phrase to return the most relevant result; i.e. these words drive


much less traffic than keywords, and these very frequent words often do not carry much meaning. The filter process means removing such words (see the sketch below). The words removed are auxiliary verbs such as 'has', 'is', 'been'; prepositions such as 'of', 'in'; and connectives such as 'and', 'but', 'because'.
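A minimal sketch of the filter step, using the word classes just listed; the word lists themselves are only illustrative, not the ones used by the authors.

```python
# Illustrative lists of auxiliaries, prepositions and connectives to be filtered out.
AUXILIARIES = {"has", "is", "been"}
PREPOSITIONS = {"of", "in"}
CONNECTIVES = {"and", "but", "because"}
REMOVABLE = AUXILIARIES | PREPOSITIONS | CONNECTIVES

def filter_words(words):
    """Drop auxiliaries, prepositions and connectives, keeping candidate keywords."""
    return [w for w in words if w.lower() not in REMOVABLE]

print(filter_words(["cricket", "has", "been", "played", "in", "summer", "and", "winter"]))
# ['cricket', 'played', 'summer', 'winter']
```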

4.6 Frequency

Frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext; the method is used as an aid to breaking classical ciphers. Here, the document containing the keywords is passed to the frequency process, and the system counts the number of times each keyword occurs in the text document.
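A minimal sketch of the keyword frequency count, assuming the keyword list and document words come from the earlier steps; both are hypothetical here.

```python
from collections import Counter

def keyword_frequencies(keywords, document_words):
    """Count how many times each keyword occurs in the (preprocessed) document."""
    counts = Counter(document_words)
    return {kw: counts[kw] for kw in keywords}

doc = ["cricket", "score", "team", "cricket", "match", "team", "cricket"]
print(keyword_frequencies(["cricket", "team", "football"], doc))
# {'cricket': 3, 'team': 2, 'football': 0}
```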

4.7 Distance

Distance measures are used extensively in data mining and other types of data analysis. Such measures assume it is possible to compute, for each pair of domain objects, their mutual distance.

Much of the research in distance measures concentrates on objects either in attribute-value

representation or in first-order representation. The representation languages are typically classified as

being either attribute-value or relational/first-order.

In an attribute-value representation, the objects in a data set can be summarized in a table, where the

columns represent attributes, and each row represents an object, with the cell entries in that row being

the specific values for the corresponding attributes.

With the increased use of XML [29] (Extensible Mark-up Language) as a means for unambiguous

representation of data across platforms, more and more data is delivered in XML [29].

When performing data analysis over such data, it is normally transformed into an attribute-value or first-order representation if distance measures are involved. Such a transformation may result in

information loss.

The new representation may not contain all the contents or attributes of the original. The XML

structure may not be preserved either.

In the distance process the system compares the keywords from the text document [21] against the XML files. If a keyword is present in an XML file it is denoted 1, otherwise 0; a minimal sketch of this encoding is given below.
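The sketch assumes each category's XML file has already been reduced to a set of keywords (the XML parsing itself is omitted), and the names are illustrative.

```python
def presence_vector(document_keywords, category_keywords):
    """Mark 1 if a document keyword appears in the category's keyword set, else 0."""
    return [1 if kw in category_keywords else 0 for kw in document_keywords]

doc_keywords = ["cricket", "score", "team", "election"]
sports_category = {"cricket", "team", "score", "match"}  # e.g. extracted from a sports XML file
print(presence_vector(doc_keywords, sports_category))
# [1, 1, 1, 0]
```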

4.8 K-Mode algorithm

The K-means clustering algorithm cannot cluster categorical data because of the dissimilarity measure [7] it uses. The K-modes clustering algorithm is based on the K-means paradigm but removes the numeric-data limitation whilst preserving its efficiency [7], [10]. The K-modes algorithm [10] extends the K-means paradigm to cluster categorical data by removing the limitation imposed by K-means through the following modifications:

Using a simple matching dissimilarity measure, i.e. the Hamming distance, for categorical data objects.

Replacing the means of clusters by their modes.

These modifications remove the numeric-only limitation of the K-means algorithm while maintaining its efficiency in clustering categorical data sets [15].

The simple matching dissimilarity measure can be defined as in equations (1) and (2) below. Let X and Y be two categorical data objects described by F categorical attributes. The dissimilarity measure d(X, Y) between X and Y is defined as the total number of mismatches of the corresponding attribute categories of the two objects: the smaller the number of mismatches, the more similar the two objects are. Mathematically,

d(X, Y) = \sum_{j=1}^{F} \delta(X_j, Y_j) \qquad (1)

where

\delta(x_j, y_j) = \begin{cases} 0, & x_j = y_j \\ 1, & x_j \neq y_j \end{cases} \qquad (2)
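The measure of equations (1) and (2) can be written directly in code; the function below is an illustrative sketch, not the authors' implementation.

```python
def dissimilarity(x, y):
    """d(X, Y): number of mismatching attribute values (simple matching / Hamming distance)."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

print(dissimilarity(["sunny", "hot", "weak"], ["sunny", "mild", "weak"]))  # 1 mismatch
```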

The K-modes algorithm consists of the following steps


a) Select K initial modes, one for each cluster.

b) Allocate each data object to the cluster whose mode is nearest to it.

c) Compute the new modes of all clusters.

d) Repeat steps (b) and (c) until no data object has changed cluster membership.

Recently, more attention has been put on clustering categorical data, where records are made up of non-numerical attributes. A minimal sketch of steps (a)-(d) is given below.
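The sketch assumes the simple matching dissimilarity defined above and takes the first K objects as the initial modes; better initialisation schemes exist (e.g. evidence accumulation [10]).

```python
from collections import Counter

def dissimilarity(x, y):
    """Simple matching dissimilarity of equations (1) and (2)."""
    return sum(1 for xj, yj in zip(x, y) if xj != yj)

def column_mode(objects, j):
    """Most frequent category of attribute j among the given objects."""
    return Counter(obj[j] for obj in objects).most_common(1)[0][0]

def k_modes(data, k, max_iter=100):
    modes = [list(obj) for obj in data[:k]]        # (a) select K initial modes (here: first K objects)
    labels = [0] * len(data)
    for _ in range(max_iter):
        changed = False
        for i, obj in enumerate(data):             # (b) allocate each object to its nearest mode
            nearest = min(range(k), key=lambda c: dissimilarity(obj, modes[c]))
            if nearest != labels[i]:
                labels[i] = nearest
                changed = True
        for c in range(k):                         # (c) recompute the mode of every cluster
            members = [data[i] for i in range(len(data)) if labels[i] == c]
            if members:
                modes[c] = [column_mode(members, j) for j in range(len(data[0]))]
        if not changed:                            # (d) stop when no object changes membership
            break
    return labels, modes

data = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "cold"], ["rainy", "mild"]]
print(k_modes(data, 2))
```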

The K-modes algorithm extends the K-means paradigm to cluster categorical data by using

A simple matching dissimilarity measure [7] for categorical objects.

Modes instead of means for clusters.

Because the K-modes algorithm uses the same clustering process as K-means, it preserves the

efficiency of the K-means algorithm, which is highly desirable for data mining.

However, the dissimilarity measure [7] used in K-modes [10] does not consider the relative frequencies of attribute values in each cluster mode; this results in weaker intra-cluster similarity by allocating less similar objects to a cluster.

V. EXPERIMENTAL SETUP

5.1 Dataset

The experimental results, evaluated on 50 large data sets with the K-mode algorithm, show that the new scheme can improve efficiency compared with typical top-k-based approaches, using several of the methods described above to categorize the large categorical data with the matching dissimilarity measure [7]. XML data is used for the data set, and an HTML web page acts as the input.

A data set (or dataset) is a collection of data, usually presented in tabular form. Each column

represents a particular variable. Each row corresponds to a given member of the data set in question.

It lists values for each of the variables, such as height and weight of an object. Each value is known as

a datum. The data set may comprise data for one or more members, corresponding to the number of

rows.

Datasets consist of all of the information gathered during a survey which needs to be analyzed.

Learning how to interpret the results is a key component to the survey process.

VI. EXPERIMENTAL RESULTS

Experimental results are given to illustrate that the K-mode algorithm with the dissimilarity measure performs better in clustering accuracy than the top-k algorithm.

The main aim of this section is to illustrate the convergence result and evaluate the clustering

performance and efficiency of the K-mode algorithm with the dissimilarity measure [7].

The scalability of the K-mode algorithm with the dissimilarity measure is evaluated here. Categorical data [15] are used to evaluate the K-mode algorithm. The numbers of clusters, attributes and categories of the data sets range from 3 to 24. The number of objects ranges from 100 to 1000.

The experiments are run on a machine with 1 GB RAM. The computational times of both algorithms are plotted with respect to the number of clusters, attributes, categories and objects, while the other corresponding parameters are fixed.


Figure 3 Computational times for different numbers of attributes

Figure 4 Computational times for different numbers of clusters

Figure 5 Computational times for different numbers of categories

All the experiments are repeated five times and the average computational times are depicted.

Figure 3 shows the computational times against the number of attributes, while the number of

categories is 50 and the number of objects is 1000.

Figure 4 shows the computational times against the number of clusters, while the numbers of

categories and attributes are 50 and the number of objects is 1000.

Figure 5 shows the computational times against the number of categories, while the number of

attributes is 50 and the number of objects is 1000. The computational times increase linearly with

respect to either the number of attributes, categories, clusters or objects.

The K-mode algorithm with the dissimilarity measure [7] requires more computational time than the original top-k algorithm.

However, according to the tests, the k-mode algorithm with the dissimilarity measure is still scalable,

i.e., it can cluster categorical objects efficiently.


VII. RESULT ANALYSIS

Experimental results (summarized in Table 1) illustrate that the K-mode algorithm with the dissimilarity measure performs better in clustering accuracy than the top-k algorithm.

Table 1 Methods compared

Existing system: The text document is directly considered as the input. | Proposed system: A webpage is taken as input and an HTML parser converts the webpage into a text document.
Existing system: In the top-k algorithm, similarity measures are used to categorize the data. | Proposed system: In the K-mode algorithm, dissimilarity measures are used to categorize the data.
Existing system: Due to the similarity measures, the efficiency of clustering the data is low. | Proposed system: Due to the dissimilarity measures, the efficiency of clustering the data is high.
Existing system: Computational time is high. | Proposed system: Computational time is lower.

7.1 Evaluation Metrics

Precision: the fraction of the predicted web pages that were indeed rated satisfactory by the user; it can also be defined as the fraction of the retrieved pages that are relevant. Precision at K for a given query is the mean fraction of relevant pages ranked in the top K results; the higher the precision, the better the performance. The score is zero if none of the top K results contains a relevant page. The drawback of this metric is that the position of relevant pages within the top K is ignored, while it measures the user's overall potential satisfaction with the top K results.

Recall: this is used to separate high-quality content from the rest of the content and evaluates the quality of the overall pages. If more blocks are retrieved, recall increases while precision usually decreases. A proper evaluation therefore has to produce precision/recall values at given points in the ranking, which provides an incremental view of the retrieval performance. The retrieved pages are analyzed from the top, and the precision-recall values are computed as each relevant page is found.

F measure: The weighted harmonic mean of precision and recall, the traditional F-measure or

balanced F-score is:

F\,Measure = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (3)

This is also known as the F1 measure, because recall and precision are evenly weighted using

equation (3).

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall. F_\beta "measures the effectiveness of retrieval [24] with respect to a user who attaches \beta times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure E = 1 - \frac{1}{\alpha/P + (1-\alpha)/R}. Their relationship is F_\beta = 1 - E, where \alpha = 1/(\beta^2 + 1).
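As a check, equation (3) and the F_\beta relationship can be computed directly; the values below are taken from the K Mode row of Table 2.

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives equation (3)."""
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

# K Mode values from Table 2.
print(round(f_measure(0.9910, 0.9389), 6))            # 0.964247, the F Measure value in Table 2
print(round(f_measure(0.9910, 0.9389, beta=2.0), 6))  # F2, which weights recall twice as much
```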

Table 2 Precision, Recall and F measure values for document clustering methods

Methods Precision Recall F Measure

Top K 0.9812 0.9123 0.945496

K Mode 0.9910 0.9389 0.964247

Interestingly, our proposed method, the K Mode algorithm, produces a higher precision of 0.99 than the other clustering method, as shown in Fig. 6.



Figure 6 User satisfaction predictions

VIII. CONCLUSION

We considered web pages as input to our system and tracked the main category to which each webpage belongs. There are many other applications that classify the data in a document against a particular dataset, using a database to store the datasets.

In addition, we use XML [29] files to store data, and we focus mainly on webpage category classification through the use of an HTML parser, which converts the webpage into text format. This text is then taken through the data mining pre-processing steps: stop word removal, stemming, filtering, distance and frequency [12].

Then, by the K-mode algorithm [10], the set of values obtained for the input is efficiently compared with the datasets already trained into the system. Graphs of computational time against clusters [6], attributes and categories are obtained for both the existing and proposed systems, which exhibit the efficiency of our system.

IX. FUTURE WORK

In future, the dataset for the system can be enlarged, for example for applications used in online search and classification. We currently give the webpage [19] as input; in future, only the URL of the webpage would be given as input, and fetching the webpage, converting it to text and the data mining pre-processing would run as backend processes. The main expectation of our future research is a response from the system in a fraction of a second.

ACKNOWLEDGEMENTS

I would like to express my sincere thanks to Mr. R. Kesavan, Mr. V. G Kalimuthu and Mr. M.

Deenadhayalan for their kind support.

REFERENCES

[1]. Bolin Ding, Bo Zhao, Cindy Xide Lin, Jiawei Han and Chengxiang Zhai, "Top Cells: Keyword-Based Search of Top-k Aggregated Documents in Text Cube," in ICDM, 2010.

[2]. B. Andreopoulos, A. An and X. Wang, "Clustering the internet topology at multiple layers," WSEAS Transactions on Information Science and Applications, vol. 2, no. 10, pp. 1625-1634, 2005.

[3]. C. X. Lin, B. Ding, J. Han, F. Zhu and B. Zhao, "Text cube: Computing IR measures for multidimensional text database analysis," in ICDM, 2008.

[4]. A. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 6, pp. 835-850, 2005.

[5]. A. Fred and A. K. Jain, "Evidence accumulation clustering based on the K-means algorithm," in Proceedings of the International Workshops on Structural and Syntactic Pattern Recognition (SSPR), Windsor, Canada, August 2002.



[6]. Navneeth Kumar V. M. and C. Chandrasekar, "A consistent web document based text clustering using concept based mining model," IJCSI, 2012.

[7]. Michael K. Ng, Mark Junjie Li and Joshua Zhexue Huang, "On the impact of dissimilarity measure in K-modes clustering algorithm," 2007.

[8]. Sandhya N., Sri Lalitha Y., Sowmya V., Anuradha K. and Govardhan A., "Analysis of stemming algorithms for text clustering," IJCSI, 2011.

[9]. S. Agrawal, S. Chaudhuri and G. Das, "DBXplorer: A system for keyword-based search over relational databases," in ICDE, 2002.

[10]. Shehroz S. Khan and Shri Kant, "Computation of initial modes for K-modes clustering algorithm using evidence accumulation," IJCA, 2007.

[11]. "Stop word removal from text," http://jwordnet.sourceforge.net

[12]. Z. He, S. Deng and X. Xu, "Improving k-modes algorithm considering frequencies of attribute values in mode," International Conference on Computational Intelligence and Security, LNAI 3801, pp. 157-162, 2005.

[13]. Z. Huang and M. Ng, "A note on k-modes clustering," Journal of Classification, vol. 20, pp. 257-261, 2003.

[14]. Z. Yang, G. Gan and J. Wu, "A genetic k-modes algorithm for clustering categorical data," in Proc. of ADMA'05, pp. 195-202, 2005.

[15]. Z. Huang and M. K. Ng, "A fuzzy k-modes algorithm for clustering categorical data," IEEE Transactions on Fuzzy Systems, vol. 7, no. 4, pp. 446-452, 1999.

[16]. Z. K. Ng and J. C. Wong, "Clustering categorical data sets using tabu search techniques," Pattern Recognition, vol. 35, no. 12, pp. 2783-2790, 2002.

[17]. A. Hotho, S. Staab and G. Stumme, "WordNet improves text document clustering," in Proc. of the SIGIR 2003 Semantic Web Workshop, 2003.

[18]. George Karypis and Eui-Hong Han, "Fast supervised dimensionality reduction algorithm with applications to document categorization and retrieval," in Proceedings of CIKM-00, pp. 12-19, ACM Press, New York, US, 2000.

[19]. Matthew Lease and Emine Yilmaz, "Crowdsourcing for information retrieval: Introduction to the special issue," ACM Digital Library, vol. 16, no. 2, April 2013.

[20]. A. Maedche and S. Staab, "Ontology learning for the semantic web," IEEE Intelligent Systems, vol. 16, no. 2, pp. 72-79, 2001.

[21]. D. Mladenic, "Text learning and related intelligent agents," IEEE Expert, July/August 1999.

[22]. J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, California, 1993.

[23]. Namita Mittal, Richi Nayak, Mahesh Chandra Govil and Kamal Chand Jain, "Personalised search: a hybrid approach for web information retrieval and its evaluation," International Journal of Knowledge and Web Intelligence, vol. 2, no. 2/3, December 2011.

[24]. G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley, 1989.

[25]. M. Steinbach, G. Karypis and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.

[26]. G. Stumme, R. Taouil, Y. Bastide, N. Pasquier and L. Lakhal, "Computing iceberg concept lattices with Titanic," Journal on Knowledge and Data Engineering, vol. 42, pp. 189-222, 2002.

[27]. J. Avanija and K. Ramar, "A hybrid approach using PSO and K-means for semantic clustering of web documents," Journal of Web Engineering, vol. 12, no. 3-4, July 2013.

[28]. Shehroz S. Khan and Shri Kant, "Computation of initial modes for K-modes clustering algorithm using evidence accumulation," IJCSI, 2007.

[29]. Tobias Blanke, Mounia Lalmas and Theo Huibers, "A framework for the theoretical evaluation of XML retrieval," Journal of the American Society for Information Science and Technology, vol. 63, no. 12, pp. 2463-2473, December 2012.

AUTHORS

Saranya V is working as an Assistant Professor in the Department of Computer Science and Engineering at Sri Vidya College of Engineering and Technology, Virudhunagar. She completed her Bachelors in Computer Science and Engineering from Kamaraj College of Engineering & Technology and her Masters in Software Engineering from Anna University, Trichy. She has a teaching experience of 3 years. Her current research interests include Data Mining and Software Engineering. She is also a life member of ISTE, ISOC, IAENG and IACSIT.


Dharmaraj R received his B.E. degree in Computer Science and Engineering from Kamaraj College of Engineering & Technology, Tamilnadu, India in 2003. He received his M.E. degree in Computer Science and Engineering from Manonmaniam Sundaranar University, Tamilnadu, India in 2008. He has more than 12 years of teaching experience and is currently pursuing a Ph.D. in Computer Science and Engineering at Manonmaniam Sundaranar University, Tamilnadu, India.

APPENDIX-1

EVALUATION RESULTS

Stemming

Filter process

Frequency measure

Distance calculation


K Mode algorithm

Table representation of K Mode