International Journal of Advances in Engineering & Technology, Jan. 2014.
©IJAET ISSN: 22311963
2704 Vol. 6, Issue 6, pp. 2704-2716
PERCEPTION BASED DOCUMENT CATEGORIZATION USING
K-MODE ALGORITHM
Saranya V1 and Dharmaraj R2
1Assistant Professor, Department of Computer Science and Engineering, Sri Vidya College of Engineering & Technology, Virudhunagar, Tamilnadu, India.
2Research Scholar, Manonmaniam Sundaranar University, Tirunelveli, India.
ABSTRACT Categorization is the process by which objects are recognized, differentiated and understood. Categorization implies that objects are grouped into categories, usually for some specific purpose. There are many categorization theories and techniques, such as conventional categorization, conceptual clustering, and prototype theory. Conceptual clustering is a data mining (machine learning) technique used to place data elements into related groups without advance knowledge of the group definitions. This paper describes extensions to the K-modes algorithm for clustering categorical data. The simple matching dissimilarity measure for categorical objects allows the K-modes paradigm to obtain clusters with strong intra-cluster similarity and to cluster large categorical data sets efficiently. By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
KEYWORDS: Categorization, Conceptual clustering, K-mode, Dissimilarity measure, text clustering.
I. INTRODUCTION
Data mining is the use of automated data analysis techniques to uncover previously undetected relationships amongst data items. Data mining often involves the analysis of data stored in a data warehouse; it is also known as machine learning or knowledge discovery in databases (KDD). Three of the major data mining techniques are regression, classification and clustering [3]. Two common data mining techniques for finding concealed patterns in data are clustering [2] and classification analyses. Although classification and clustering are often mentioned in the same breath, they are different analytical approaches.
Classification is a data mining (machine learning) technique used to predict group membership for data instances, and is one of the most important tasks in data mining. It maps the data onto predefined targets; it is supervised learning, since the targets are predefined. The aim of classification is to build a classifier based on some cases, using some attributes to depict the objects or one attribute to describe the group of the objects. The classifier is then used to predict the group attribute of new cases from the field, based on the values of the other attributes. For example, you may wish to use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy". Popular classification techniques include decision trees and neural networks. A classifier constructs a model based on a training set and the values (class labels) of a classifying attribute, and uses it to classify new data.
Clustering [13] is a data mining (machine learning) procedure used to place data elements into related groups without advance knowledge of the group definitions. Popular clustering [4] techniques include K-means clustering and expectation maximization (EM) clustering. Clustering is a division of data into groups of comparable objects. Each group, called a cluster, consists of objects that are similar
between themselves and dissimilar to objects of other groups.
Representing data by fewer clusters necessarily loses certain fine details (akin to lossy data compression) but achieves simplification: many data objects are represented by a small number of clusters, and hence the data is modelled by its clusters. The clustering [5] problem is about partitioning a given data set into groups (clusters) such that the data points in a cluster are more similar to each other than to points in different clusters. A distinctive characteristic of the clustering operation in data mining is that the data sets often contain both numeric and categorical feature values.
Text document clustering [25] has been rigorously studied because of its important role in data mining and information retrieval [24]. A text document is not only a collection of words but also a collection of concepts. Clustering is a partition of data into groups of similar objects; each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. The subsequent sections of this article review work relevant to the topic of this research. The methodology, analysis, findings, discussion, conclusions and suggestions for future work are also explicitly presented.
II. EXISTING SYSTEM
In the existing system, the top-k algorithm and two methods are used to categorize the categorical data [15] or documents.
2.1. Top K Algorithm
The top-k algorithm [1] has received much attention in a variety of settings, such as similarity search on multimedia data, ranked retrieval [24] on text and semi-structured documents in digital libraries and on the Web, spatial data analysis, network and stream monitoring, collaborative recommendation and preference queries on e-commerce product catalogs, and ranking of SQL-style query results on structured data sources in general. Top-k search becomes more popular as dataset sizes increase: users are usually interested in a few best objects rather than a large list of objects as a result.
The text [21] can be viewed as a word sequence, so we do not have to worry about how to break it down into words.
1) Use a table to record every word's occurrences while traversing the whole word sequence. In this phase the key is "word" and the value is "word-frequency" [12]. This takes O(n) time.
2) Sort the (word, word-frequency) pairs, where the key is "word-frequency". This takes O(n log n) time with a standard sorting algorithm.
3) After sorting, take the first K words. This takes O(K) time. In total, the time is O(n + n log n + K); since K is surely smaller than n, this is O(n log n).
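As a rough sketch (not the authors' code), the three steps above can be written in Python; using a heap for the selection step keeps the cost at O(n log k), no worse than the full sort described above:

```python
from collections import Counter
import heapq

def top_k_words(text, k):
    """Steps 1-3 above: count word frequencies in O(n), then select the
    k most frequent words (a heap does this in O(n log k))."""
    counts = Counter(text.lower().split())            # step 1: word -> frequency
    return heapq.nlargest(k, counts, key=counts.get)  # steps 2-3: best k words
```

For example, `top_k_words("the cat and the dog and the bird", 2)` yields `["the", "and"]`.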
The two methods are:
The first one scans the inverted index once to compute relevance scores of all cells [1].
The second one explores as few cells in the text cube as possible to locate the top-k answers [1].
III. PROPOSED SYSTEM
In the proposed system, the K-mode algorithm [13] is used to categorize documents using a matching dissimilarity measure [7].
3.1. Problem statement
Because top-k relies on similarity measures, its clustering efficiency is low. The proposed system therefore uses the K-mode algorithm to cluster the data, together with the following methods to categorize the data or documents.
3.2 Stop Word
Stop words [11] are a part of human language and there is nothing you can do about it. Stop word removal means removing such poor words: auxiliary verbs such as 'has', 'is', 'been',
etc., and articles such as 'the', 'an', 'a'.
3.3 Stemming
The stemming [8] process removes affixes (suffixes or prefixes) from words, producing a root form called a stem that often closely approximates the root morpheme of a word.
3.4 Filtering
The filter process removes unwanted letters from each word; this process produces the keywords [9].
3.5 Frequency
In the frequency [12] process the system counts the number of times each keyword occurs in the text document [21].
3.6 Distance
In the distance process the system compares the keywords [9] from the text document against the XML files [29]. If a keyword is present in an XML file it is denoted 1, otherwise 0.
3.7 K- Mode Algorithm
Clustering [2] is one of the most useful tasks in data mining process for discovering groups and
identifying interesting distributions and patterns in the underlying data. Clustering problem is about
partitioning a given data set into groups (clusters) such that the data points in a cluster are more
similar to each other than points in different clusters.
A distinctive characteristic of the clustering operation in data mining is that the data sets often contain both numeric and categorical attribute values. This requires the clustering algorithm to be capable of dealing with the intricacy of inter- and intra-relations of the data sets expressed in different types of attributes, whether numeric or categorical.
The K-means algorithm is one of the most popular clustering algorithms. The K-means clustering
algorithm cannot cluster categorical data [15] because of the dissimilarity measure [7] it uses.
The K-modes clustering algorithm [13] is based on the K-means paradigm but removes the numeric data limitation whilst preserving its efficiency, as shown in Fig. 2. It extends the K-means paradigm to cluster categorical data by removing the limitation imposed by K-means.
The K-modes algorithm has made the following modifications to the K-means algorithm:
A simple matching dissimilarity measure for categorical objects.
Modes instead of means for clusters.
A frequency [12] based method to find the modes.
Because the K-modes algorithm uses the same clustering process as K-means, it preserves the
efficiency of the K-means algorithm.
These modifications have removed the numeric-only limitation of the K-means algorithm but
maintain its efficiency in clustering large categorical data sets.
Figure 1. System Architecture
Fig. 1 represents the overall architecture of the system, which allows the user to find the group very easily and is flexible in clustering the data [4]. The user gives a webpage as the input; the webpage is then converted into a text document using an HTML-to-text parser [27]. The input is processed by the methods described below to find the keywords [9] that occur in the text document [21]. The system then compares the keywords from the text document against the XML files and displays the category using the K-mode algorithm.
Figure 2 Use case diagram for K Mode algorithm
IV. SYSTEM SEGMENTATION
4.1 Modules and sub modules
Login session
Stop word removal
Stemming
Filter
Frequency
Distance
K-mode algorithm [10]
(Figure 1 flow: web page → convert webpage into text document → new text document → analyse the keywords in the document → find the category using the keywords → category matching → allowed into the particular category)
4.2 Login Session
In the login session the administrator enters a user name and password. If the user name and password are correct, an "admin logged in" message is displayed; otherwise an "invalid login" message is displayed. The administrator can add the XML files and the keywords [9] of any category, and also processes the input data.
4.3 Stop word
Stop words are a part of human language and there is nothing you can do about it. In enterprise search [14], all stop words, for example common words like "a" and "the", are removed from multi-word queries to raise search performance [11]. Stop word recognition in Japanese is based on grammatical information. Stop words are common words that carry less important meaning than keywords. Usually search engines eliminate stop words from a keyword [9] phrase to return the most relevant result; that is, stop words drive much less traffic than keywords. The most frequent words often do not carry much meaning. Stop word removal means removing these poor words.
The poor words are:
Articles 'the', 'an', 'a'.
We generally remove these, as well as numbers, dates, special characters and symbols.
Stop word removal emphasises discrimination:
It reduces the number of words in common between document descriptions.
It tries to reduce the amount of noise.
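A minimal stop-word removal sketch in Python (the stop-word list here is only illustrative; production systems use much larger lists):

```python
# Illustrative stop-word list; real systems use far larger ones.
STOP_WORDS = {"a", "an", "the", "has", "is", "been", "and", "but", "of", "in"}

def remove_stop_words(tokens):
    """Drop stop words, plus bare numbers and non-alphabetic symbols."""
    return [t for t in tokens if t.lower() not in STOP_WORDS and t.isalpha()]
```

For example, `remove_stop_words(["The", "cat", "is", "on", "a", "mat", "42"])` returns `["cat", "on", "mat"]`.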
4.4 Stemming
The stemming process [8] removes affixes (suffixes or prefixes) from words, producing a root form called a stem that often closely approximates the root morpheme of a word [8]. Words that appear in documents have many morphological variants. Since morphological variants of a word have similar semantic interpretations, they can be considered equivalent for IR purposes. Morphological analysis tries to associate variants of the same term with a root form: go, goes, going, gone → root form "go".
Many applications, such as document retrieval [24], do not often need morphological analyzers to be
linguistically correct.
Stemming attempts to eradicate certain surface markings from words in order to discover their root form. In theory this involves discarding both prefixes (un-, dis-, etc.) and suffixes (-ing, -ness, etc.), although most stemmers used by search engines [14] only remove suffixes. Stemming is a rather rough process, since the root form is not required to be a proper word. After removing high-frequency words, an indexing procedure tries to conflate word variants into the same stem or root using a stemming algorithm.
For example, the words "thinking", "thinkers" or "thinks" may be reduced to the stem "think". In
information retrieval [24], grouping words having the same root under the same stem (or indexing
term) may increase the success rate when matching documents to a query. Such an automatic
procedure may therefore be a valuable process in enhancing retrieval effectiveness, assuming that
words with the same stem refer to the same idea or concept and must be therefore indexed under the
same form.
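A deliberately crude suffix-stripping stemmer illustrates the idea (the suffix list is a made-up example; real systems would use a proper algorithm such as the Porter stemmer):

```python
# Longest suffixes first, so "ers"/"ing" are tried before the bare "s".
SUFFIXES = ("ers", "ing", "es", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word
```

With this, "thinking", "thinkers" and "thinks" all reduce to the stem "think", matching the example above.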
4. 5 Filter
The filter process is used to remove connectives and other unwanted words from the text document. These unwanted words are a part of human language and there is nothing you can do about it. In enterprise search, all such words, for example common words like "in", "but", "not", "for", are removed from multi-word queries to increase search performance. Unwanted words are common words that carry less important meaning than keywords [9]. Usually search engines remove unwanted words from a keyword phrase to return the most relevant result; i.e., unwanted words drive
much less traffic than keywords, and the most frequent words often do not carry much meaning.
The filter process removes these unwanted words, such as:
Auxiliary verbs 'has', 'is', 'been', etc.
Prepositions 'of', 'in', etc.
Connectives such as 'and', 'but', 'because', etc.
4.6 Frequency
Frequency analysis is the study of the frequency of letters or groups of letters in a ciphertext; the method is used as an aid to breaking classical ciphers. Here, the document containing the keywords is imported into the frequency process, in which the system counts the number of times each keyword occurs in the text document.
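A sketch of the counting step in Python (function and variable names are our own, not from the paper):

```python
from collections import Counter

def keyword_frequencies(text, keywords):
    """Count how many times each keyword occurs in the text document."""
    counts = Counter(text.lower().split())
    return {kw: counts[kw.lower()] for kw in keywords}
```

For example, `keyword_frequencies("data mining and data clustering", ["data", "mining", "xml"])` returns `{"data": 2, "mining": 1, "xml": 0}`.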
4.7 Distance
Distance measures are used extensively in data mining and other types of data analysis. Such measures assume it is possible to compute, for each pair of domain objects, their mutual distance. Much of the research on distance measures concentrates on objects in either attribute-value representation or first-order representation; representation languages are typically classified as either attribute-value or relational/first-order.
In an attribute-value representation, the objects in a data set can be summarized in a table, where the
columns represent attributes, and each row represents an object, with the cell entries in that row being
the specific values for the corresponding attributes.
With the increased use of XML [29] (Extensible Mark-up Language) as a means for unambiguous
representation of data across platforms, more and more data is delivered in XML [29].
When performing data analysis over such data, this is normally transformed into attribute-value or
first-order representation, if distance measures are involved. Such transformation may result in
information loss.
The new representation may not contain all the contents or attributes of the original. The XML
structure may not be preserved either.
In the distance process the system compares the keywords from the text document [21] against the XML files. If a keyword is present in an XML file it is denoted 1, otherwise 0.
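This presence/absence comparison amounts to building a binary vector per category; a sketch (names are illustrative, not from the paper):

```python
def keyword_vector(doc_keywords, category_keywords):
    """1 if a category keyword (from the XML file) occurs in the
    document's keyword set, 0 otherwise."""
    doc = {k.lower() for k in doc_keywords}
    return [1 if k.lower() in doc else 0 for k in category_keywords]
```

For example, `keyword_vector(["cluster", "data"], ["data", "mining", "cluster"])` returns `[1, 0, 1]`.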
4.8 K-Mode algorithm
The K-means clustering algorithm cannot cluster categorical data because of the dissimilarity measure
[7] it uses. The K-modes clustering algorithm is based on K-means paradigm but removes the numeric
data limitation whilst preserving its efficiency [7], [10].
The K-modes algorithm [10] extends the K-means paradigm to cluster categorical data by removing the limitation imposed by K-means, through the following modifications:
Using a simple matching dissimilarity measure (the Hamming distance) for categorical data objects.
Replacing the means of clusters by their modes.
These modifications remove the numeric-only limitation of the K-means algorithm while maintaining its efficiency in clustering categorical data sets [15].
The simple matching dissimilarity measure is defined in equations (1) and (2). Let X and Y be two categorical data objects described by F categorical attributes. The dissimilarity measure d(X, Y) between X and Y is defined as the total number of mismatches of the corresponding attribute categories of the two objects: the smaller the number of mismatches, the more similar the two objects are. Mathematically,
d(X, Y) = Σ (j = 1 to F) δ(x_j, y_j)    (1)
where δ(x_j, y_j) = 0 if x_j = y_j, and 1 if x_j ≠ y_j    (2)
The K-modes algorithm consists of the following steps
a) Select K initial modes, one for each of the cluster.
b) Allocate data object to the cluster whose mode is nearest to it.
c) Compute new modes of all clusters.
d) Repeat steps b and c until no data object has changed cluster membership.
Recently, more attention has been paid to clustering categorical data, where records are made up of non-numerical attributes.
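Steps a)-d), together with the matching dissimilarity of equations (1) and (2), can be sketched in Python as follows (a minimal illustration, not the authors' implementation):

```python
import random
from collections import Counter

def matching_dissimilarity(x, y):
    """Equations (1)-(2): number of mismatched attribute values."""
    return sum(a != b for a, b in zip(x, y))

def k_modes(objects, k, max_iter=100, seed=0):
    """Minimal K-modes over categorical tuples, following steps a)-d)."""
    rng = random.Random(seed)
    modes = rng.sample(objects, k)                    # a) select K initial modes
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:                           # b) allocate to nearest mode
            i = min(range(k),
                    key=lambda c: matching_dissimilarity(obj, modes[c]))
            clusters[i].append(obj)
        new_modes = []                                # c) recompute cluster modes
        for c, members in enumerate(clusters):
            if not members:                           # keep old mode if cluster empty
                new_modes.append(modes[c])
                continue
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*members)))
        if new_modes == modes:                        # d) stop once memberships settle
            break
        modes = new_modes
    return modes, clusters
```

Note that the per-attribute mode is simply the most frequent value within the cluster, found by the frequency-based counting the text describes.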
The K-modes algorithm extends the K-means paradigm to cluster categorical data by using
A simple matching dissimilarity measure [7] for categorical objects.
Modes instead of means for clusters.
Because the K-modes algorithm uses the same clustering process as K-means, it preserves the
efficiency of the K-means algorithm, which is highly desirable for data mining.
However, the dissimilarity measure [7] used in K-modes [10] does not consider the relative frequencies of attribute values in each cluster mode; this results in weaker intra-cluster similarity by allocating less similar objects to a cluster.
V. EXPERIMENTAL SETUP
5.1 Dataset
Experimental results on 50 large data sets show that the K-mode scheme can improve efficiency compared with typical top-k-based approaches, using the methods above to categorize the large categorical data by the matching dissimilarity measure [7]. XML data are used for the data set, and an HTML web page acts as the input.
A data set (or dataset) is a collection of data, usually presented in tabular form. Each column
represents a particular variable. Each row corresponds to a given member of the data set in question.
It lists values for each of the variables, such as height and weight of an object. Each value is known as
a datum. The data set may comprise data for one or more members, corresponding to the number of
rows.
Datasets consist of all of the information gathered during a survey which needs to be analyzed.
Learning how to interpret the results is a key component to the survey process.
VI. EXPERIMENTAL RESULTS
Experimental results are given to illustrate that the K-mode algorithm with the dissimilarity measure performs better in clustering accuracy than the top-k algorithm.
The main aim of this section is to illustrate the convergence result and evaluate the clustering
performance and efficiency of the K-mode algorithm with the dissimilarity measure [7].
The scalability of the K-mode algorithm with the dissimilarity measure is evaluated here. Categorical data [15] are used to evaluate the K-mode algorithm. The numbers of clusters, attributes and categories of the data sets range from 3 to 24, and the number of objects ranges from 100 to 1000.
The computational results are performed by using a machine with 1GB RAM. The computational
times of both algorithms are plotted with respect to the number of clusters, attributes, categories and
objects, while the other corresponding parameters are fixed.
Figure 3 Computational times for different numbers of attributes
Figure 4 Computational times for different numbers of clusters
Figure 5 Computational times for different numbers of categories
All the experiments are repeated five times and the average computational times are depicted.
Figure 3 shows the computational times against the number of attributes, while the number of
categories is 50 and the number of objects is 1000.
Figure 4 shows the computational times against the number of clusters, while the numbers of
categories and attributes are 50 and the number of objects is 1000.
Figure 5 shows the computational times against the number of categories, while the number of
attributes is 50 and the number of objects is 1000. The computational times increase linearly with
respect to either the number of attributes, categories, clusters or objects.
The K-mode algorithm with the dissimilarity measure [7] requires more computational time than the original top-k algorithm. However, according to the tests, the K-mode algorithm with the dissimilarity measure is still scalable, i.e., it can cluster categorical objects efficiently.
VII. RESULT ANALYSIS
Experimental results (shown in Table 1) are given to illustrate that the K-mode algorithm with the dissimilarity measure performs better in clustering accuracy than the top-k algorithm.
Table 1 Methods compared
7.1 Evaluation Metrics
Precision: the fraction of the predicted web pages that were indeed rated satisfactory by the user; equivalently, the fraction of the retrieved pages that are relevant. Precision at K for a given query is the mean fraction of relevant pages ranked in the top K results; the higher the precision, the better the performance. The score is zero if none of the top K results contains a relevant page. A limitation of this metric is that the position of relevant pages within the top K is ignored, while it measures overall user satisfaction with the top K results.
Recall: this is used to separate high-quality content from the rest and evaluates the quality of the overall pages. If more blocks are retrieved, recall increases while precision usually decreases. A proper evaluation therefore produces precision/recall values at given points in the rank, giving an incremental view of retrieval performance. The retrieved pages are analysed from the top, and the precision-recall values are computed at each relevant page found.
F measure: The weighted harmonic mean of precision and recall, the traditional F-measure or
balanced F-score is:
F Measure = (2 · Precision · Recall) / (Precision + Recall)    (3)
This is also known as the F1 measure, because recall and precision are evenly weighted using
equation (3).
Two other commonly used F measures are the F2 measure, which weights recall twice as much as
precision, and the F0.5 measure, which weights precision twice as much as recall. Fβ "measures the
effectiveness of retrieval [24] with respect to a user who attaches β times as much importance to recall
as precision". It is based on van Rijsbergen's effectiveness measure E=1− (1/ (α/P+ (1−α)/R)). Their
relationship is Fβ = 1 − E where α = 1 / (β2 + 1).
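As a check, the general F_β form can be computed directly; with β = 1 it reproduces equation (3) and the values in Table 2 (a small sketch, names our own):

```python
def f_measure(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives eq. (3)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For the top-k row of Table 2, `f_measure(0.9812, 0.9123)` gives about 0.9455, matching the tabulated 0.945496.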
Table 2 Precision, Recall and F measure values for document clustering methods
Methods Precision Recall F Measure
Top K 0.9812 0.9123 0.945496
K Mode 0.9910 0.9389 0.964247
Interestingly, our proposed K-mode algorithm produces a higher precision of 0.99 than the other clustering method, as shown in Fig. 6.
Existing System | Proposed System
The text document is directly considered as the input. | A webpage is taken as the input and an HTML parser converts the webpage into a text document.
In the top-k algorithm, similarity measures are used to categorize the data. | In the K-mode algorithm, dissimilarity measures are used to categorize the data.
Due to the similarity measures, the efficiency of clustering the data is less. | Due to the dissimilarity measures, the efficiency of clustering the data is high.
Computational time is high in the existing system. | Computational time is less in the proposed system.
Figure 6 User satisfaction predictions
VIII. CONCLUSION
We considered web pages as input to our system and tracked the main category to which each webpage belongs. There are many other applications that classify the data in a document against a particular dataset, using a database to store the datasets. In addition, we use XML [29] files to store data, and we focused mainly on webpage category classification using an HTML parser, which converts the webpage into text format. This text is taken through the data mining preprocessing steps: stop word removal, stemming, filtering, distance and frequency [12]. Then, using the K-mode algorithm [10], the set of values obtained for the input is efficiently compared with the datasets already trained in the system. The graphs of time against clusters [6], attributes and categories, for both the existing and proposed systems, exhibit the efficiency of our system.
IX. FUTURE WORK
In future, the dataset for the system can be extended, for example to applications used in online search and classification. At present we give the webpage [19] as input; in future only the URL of the webpage would be given as input, and fetching the webpage, converting it to text and the data mining preprocessing would run as backend processes. The main expectation of our future research is a response of the system in a fraction of a second.
ACKNOWLEDGEMENTS
I would like to express my sincere thanks to Mr. R. Kesavan, Mr. V. G Kalimuthu and Mr. M.
Deenadhayalan for their kind support.
REFERENCES
[1]. Bolin Ding, Bo Zhao, Cindy Xide Lin, Jiawei Han, Chengxiang Zhai, "TopCells: Keyword-Based Search of Top-k Aggregated Documents in Text Cube," in ICDM, 2010.
[2]. B. Andreopoulos, A. An and X. Wang, "Clustering the internet topology at multiple layers," WSEAS Transactions on Information Science and Applications, vol. 2, no. 10, pp. 1625-1634, 2005.
[3]. C. X. Lin, B. Ding, J. Han, F. Zhu and B. Zhao, "Text Cube: Computing IR measures for multidimensional text database analysis," in ICDM, 2008.
[4]. Fred, A., Jain, A.K. Combining Multiple Clustering Using Evidence Accumulation. IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 27, number 6,835-850, 2005.
[5]. Fred, A., Jain, A.K. Evidence Accumulation Clustering based on the K-means algorithm, in Proceedings
of the International Workshops on Structural and Syntactic Pattern Recognition (SSPR), Windsor,
Canada, August 2002.
[6]. Navneeth kumar V. M and Dr Chandrasekar C, “A consistent web document based text clustering using
concept based mining model”, IJCSI 2012.
[7]. Michael K. Ng, Mark Junjie Li, Joshua Zhexue Huang, "On the Impact of Dissimilarity Measure in K-modes Clustering Algorithm," 2007.
[8]. Sandhya N, Sri Lalitha Y, Sowmya V, Dr Anuradha K and Dr Govardhan A, "Analysis of Stemming Algorithm for Text Clustering," IJCSI, 2011.
[9]. S. Agrawal, S. Chaudhuri and G. Das, "DBXplorer: A system for keyword-based search over relational databases," in ICDE, 2002.
[10]. Shehroz S. Khan, Dr. Shri Kant, "Computation of Initial Modes for K-modes Clustering Algorithm using Evidence Accumulation," IJCA, 2007.
[11]. “Stop word removal from text” http://jwordnet.sourceforge.net
[12]. Z. He, S. Deng and X. Xu, "Improving k-modes algorithm considering frequencies of attribute values in mode," International Conference on Computational Intelligence and Security, LNAI 3801, pp. 157-162, 2005.
[13]. Z. Huang and M. Ng, "A note on k-modes clustering," Journal of Classification, vol. 20, pp. 257-261, 2003.
[14]. Z. Yang, G. Gan, J. Wu, "A genetic k-modes algorithm for clustering categorical data," in Proc. of ADMA'05, pp. 195-202, 2005.
[15]. Z. Huang, M. K. Ng, "A fuzzy k-modes algorithm for clustering categorical data," IEEE Transactions on Fuzzy Systems, 1999, 7(4): 446-452.
[16]. M. K. Ng, J. C. Wong, "Clustering categorical data sets using tabu search techniques," Pattern Recognition, 2002, 35(12): 2783-2790.
[17]. A. Hotho, S. Staab, and G. Stumme. Wordnet improves text document clustering. In Proc. of the SIGIR
2003 Semantic Web Workshop, 2003.
[18]. George Karypis and Eui-Hong Han. Fast supervised dimensionality reduction algorithm with
applications to document categorization and retrieval. In Proceedings of CIKM-00, pages 12–19. ACM
Press, New York, US, 2000.
[19]. Matthew Lease & Emine yilmaz Crowd sourcing for Information Retrieval: Introduction to the special
Issue. In ACM digital Library, Volume 16 Issue 2, April 2013.
[20]. A. Maedche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72 –
79, 2001.
[21]. D. Mladenic. Text learning and related intelligent agents. IEEE Expert, July/August 1999.
[22]. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California, 1993.
[23]. Namita Mittal, Richi Nayak, Mahesh Chandra Govil, Kamal Chand Jain, "Personalised search - a hybrid approach for web information retrieval and its evaluation," International Journal of Knowledge and Web Intelligence, Volume 2, Issue 2/3, December 2011.
[24]. G. Salton, Automatic Text Processing: The Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley, 1989.
[25]. M. Steinbach, G. Karypis, and V. Kumar. A comparison of document clustering techniques. In KDD
workshop on Text Mining, 2000.
[26]. G. Stumme, R. Taouil, Y. Bastide, N. Pasqier, and L. Lakhal. Computing iceberg concept lattices with
Titanic. J. on Knowledge and Data Engineering, 42:189–222, 2002.
[27]. J. Avanija, K. Ramar, A hybrid approach using pso and K-means for semantic clustering of web
documents, Journal of Web Engineering, Volume 12 Issue 3-4, July 2013.
[28]. Shehroz S khan and Dr Shri kant, “Computation of initial modes for K-modes clustering algorithm using
evidence accumulation”, ijcsi 2007.
[29]. Tobias Blanke, Mounia Lalmas, Theo Huibers,A framework for the theoretical evaluation of XML
retrieval, Journal of the American Society for Information Science & technology, December 2012,
Volume 63, Issue 12, Page: 2463 – 2473.
AUTHORS
Saranya V is working as an Assistant Professor in the Department of Computer Science and Engineering at Sri Vidya College of Engineering and Technology, Virudhunagar. She completed her Bachelors in Computer Science and Engineering at Kamaraj College of Engineering & Technology and her Masters in Software Engineering at Anna University, Trichy. She has 3 years of teaching experience. Her current research interests include Data Mining and Software Engineering. She is also a life member of ISTE, ISOC, IAENG and IACSIT.
Dharmaraj R received his B.E degree in Computer Science and Engineering from Kamaraj College of Engineering & Technology, Tamilnadu, India in 2003. He received his M.E degree in Computer Science and Engineering from Manonmaniam Sundaranar University, Tamilnadu, India in 2008. He has more than 12 years of teaching experience, and is currently pursuing a Ph.D in Computer Science and Engineering at Manonmaniam Sundaranar University, Tamilnadu, India.
APPENDIX-1
EVALUATION RESULTS
Screenshots: stemming, filter process, frequency measure, distance calculation.