A Hybrid Approach Using I2 and MC4.5 Algorithms for Mining Multidimensional Data Sets (HIIMC4.5)


Volume 2, No. 08, October 2013 ISSN – 2278-1080

The International Journal of Computer Science & Applications (TIJCSA)

RESEARCH PAPER

Available Online at http://www.journalofcomputerscience.com/

© 2013, http://www.journalofcomputerscience.com - TIJCSA All Rights Reserved


S.Santhosh kumar Research Scholar, PRIST University, Thanjavur

Lecturer, Department of Computer Science Government College for Women (A)

Kumbakonam, Tamil Nadu, India [email protected]

Dr.E.Ramaraj

Director, Computer Center Alagappa University

Karaikudi India

[email protected]

Abstract

This paper presents a combinational approach of clustering and classification known as semi-supervised learning. In this work we develop a hybrid model by combining two algorithms from our earlier contributions. For large databases, primary categorisation is needed before searching, so that the data can be mined efficiently. The proposed hybrid model is compared with the existing hybrid model, also our earlier work, which identifies the closest data patterns in large databases. The new hybrid technique addresses the limitations of the existing hybrid model. Implementation with different data sets yields accurate classification predictions with a lower error rate.

Key Terms: C4.5 Classifier, k-means, MC4.5, I2 Clustering.

1. Introduction

Semi-supervised learning (SSL) [1] is a machine learning technique that handles labelled and unlabelled data simultaneously. It is an emerging field of data mining that has grown popular since 2005. The primary advantage of SSL is its cost effectiveness: processing labelled data requires knowledge, skill, and technique, and is therefore expensive, whereas SSL allows unlabelled data to be extracted with only a small amount of labelled data. Large databases, such as banking and medical databases, contain huge amounts of data, and it is expensive and complex to extract particular (labelled) data, whereas the acquisition of unlabelled data is relatively inexpensive. In such situations, semi-supervised learning can be of great practical value. Many SSL models have been developed for different data types. The generative model is one familiar model that combines classification and clustering techniques based on the joint distribution of the data. Following this combinational approach, our earlier hybrid model is a combination of the k-means algorithm and the C4.5


classification algorithm. In previous work we compared these two algorithms as a single approach. Although it gives better results, it has some limitations, namely the inherent constraints of the k-means and C4.5 algorithms respectively. To overcome the drawbacks of k-means clustering we proposed the Inter-Intra (I2) clustering algorithm in 2012. Recently we developed a modified C4.5 algorithm, called MC4.5, to overcome the limitations in generating decision trees with high information gain and gain ratio. In this paper the new combined approach is proposed as a hybrid model and compared with the existing hybrid model (k-means and C4.5). Our continuing research shows that large databases must be partitioned or categorised before data are searched. In our research we concentrate mostly on biological databases, because they contain huge volumes of data and require specialised techniques to acquire exact information every time.

2. Review of Literature

The heuristic approach of self-training (also known as self-learning or self-labelling) is historically the oldest approach to semi-supervised learning [2], with examples of applications starting in the 1960s. The transductive learning framework was formally introduced by Vladimir Vapnik in the 1970s [2]. Interest in inductive learning using generative models also began in the 1970s. A probably approximately correct (PAC) learning bound for semi-supervised learning of a Gaussian mixture was demonstrated by Ratsaby and Venkatesh in 1995 [3]. Semi-supervised learning has recently become more popular and practically relevant due to the variety of problems for which vast quantities of unlabelled data are available; for a review of recent work, see the survey by Zhu (2008) [4]. Although clustering and classification are different learning methods, for complex data prediction both approaches are used in combined form, called the semi-supervised approach; such research is then classified as a hybrid approach. In 2010, Juanying Xie, Chunxia Wang, Yan Zhang, and Shuai Jiang [5] proposed CSVM, a clustering-based classification technique for unlabelled data prediction. They combined different kinds of k-means algorithm with an SVM classifier to achieve better results; CSVM was proposed to avoid the major drawback of the k-means algorithm, the initialisation of the k value. In 2010 [6], Pritha Mahata proposed a new hierarchical clustering technique called ECHC (exploratory consensus of hierarchical clusterings), used to sub-group the various types of melanoma cancer. This work reveals that the k-means algorithm gives better results for biological subtypes with a proper sub-tree. In 2010 [7], Taysir Hassan A. Soliman proposed clustering and classification as a combined approach to classify different types of diseases based on a gene selection method; the results showed improved accuracy in data prediction. In 2011 [8], Reuben Evans, Bernhard Pfahringer, and Geoffrey Holmes proposed a statistics-based clustering technique for large datasets, using the k-means algorithm as an initial step for centroid prediction and classification as a secondary step. In March 2013 [9], Claudio Gentile, Fabio Vitale, and Giovanni Zappella implemented the combined technique (clustering and classification) in networks using signed graphs. These related studies show that SSL gives better results in various kinds of application areas.

3. Preliminaries

In this section we discuss the k-means algorithm and the C4.5 decision tree algorithm. k-means is the simplest partition-based algorithm and can be used on any kind of database; a known limitation is that it converges to a local rather than a global optimum. For large information-processing applications, k-means is a suitable technique for handling multidimensional data. C4.5 is a decision tree classifier


that handles both continuous and discrete data. It is a familiar algorithm used for grouping data by selecting discriminating attributes from the given samples. There are certain drawbacks in k-means and C4.5, but their basic partitioning functionality allows both algorithms to be used as a combined technique. In earlier work we analysed classifiers to predict the best one [10] for handling any kind of data, with considerable results; the analysis revealed that C4.5 is the best classifier in most cases when compared with other classifiers. Their combinational nature and our new algorithms, namely I2 and MC4.5, are also discussed.

3.1. C4.5 Algorithm

C4.5 is used to generate a decision tree; it was developed by Ross Quinlan [11] and is an extension of the ID3 algorithm. The decision trees generated by C4.5 can be used for classification. C4.5 builds decision trees from a set of training data in the same way as ID3 [12], using the concept of information entropy. The training data is a set of already classified samples. Each sample consists of a p-dimensional vector whose components represent attributes or features of the sample, together with the class into which it falls. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy): the attribute with the highest normalized information gain is chosen to make the decision, and the algorithm then recurses on the smaller sub-lists.
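To make the splitting criterion concrete, the following minimal Python sketch (not from the paper; the toy records and attribute names are invented for illustration) computes the gain ratio that C4.5 uses to rank candidate attributes:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows, attr, label):
    # information gain of splitting `rows` on `attr`, normalized by the
    # split information (C4.5's gain ratio)
    base = entropy([r[label] for r in rows])
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attr], []).append(r[label])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    split_info = entropy([r[attr] for r in rows])  # entropy of the split itself
    return (base - remainder) / split_info if split_info > 0 else 0.0

# toy example: choose the attribute with the highest gain ratio
rows = [
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "overcast", "windy": "true", "play": "yes"},
]
best = max(["outlook", "windy"], key=lambda a: gain_ratio(rows, a, "play"))
print(best)  # -> outlook

Normalizing by the split information penalizes attributes with many distinct values, which plain information gain would otherwise favour.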

3.2. k-means Algorithm

The k-means algorithm was developed by MacQueen [13] based on the standard algorithm. It is one of the most widely used hard clustering techniques. It is an iterative method in which the number of clusters must be specified in advance. Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets ($k \le n$), $S = \{S_1, S_2, \ldots, S_k\}$, so as to minimize the within-cluster sum of squares

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2,$$

where $\mu_i$ is the mean of the points in $S_i$. The algorithm works as follows:

o The number of clusters k must be initialised.
o Randomly select k cluster centres, called centroids, in the data space.
o Assign each data point to the cluster with the shortest Euclidean distance to the cluster centre.
o Re-compute new centroids by averaging the observations assigned to each cluster.
o Repeat the previous two steps until the convergence criterion is satisfied.

A minimal sketch of this procedure is given below.
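The steps above can be read directly as code. The following sketch is only an illustration (Euclidean distance, random initialisation, toy points chosen for the example), not an optimised implementation:

import random

def kmeans(points, k, iters=100, seed=0):
    # plain k-means on a list of d-dimensional tuples
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # pick k initial centroids
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                           # nearest-centroid assignment
            i = min(range(k), key=lambda c: sum((a - b) ** 2
                    for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        new_centroids = [                          # recompute cluster means
            tuple(sum(vals) / len(c) for vals in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:             # convergence check
            break
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans([(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (8.2, 7.9)], k=2)
print(centroids)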

The advantage of this approach is its efficiency in handling large data sets, and it works well with compact clusters. Its major limitations are the requirement to specify the number of clusters beforehand and its assumption that clusters are spherical.

3.3. I2 Clustering

To improve on the performance of the k-means algorithm, we developed an efficient algorithm called I2 clustering [14], which performs better than k-means. The algorithm is based on the density distribution of data objects. The I2


clustering algorithm produces good partitions with a high accuracy rate, using both local and global reachability densities. We have shown that I2 clustering performs better than k-means with respect to correctness of similarity partitioning and accuracy. The I2 algorithm works as follows:

• First, the I2 clustering algorithm takes the number of clusters as input and finds the centroids based on the cluster size.
• Next, it finds the intra-cluster similarity among the elements within each cluster.
• It forms intra-clusters within each cluster and generates sub-clusters with centroids.
• Based on the intra-cluster similarity, the centroids are updated iteratively.
• The next step of the I2 clustering algorithm is to find the inter-cluster relationship among neighbouring clusters.
• If any cluster element has relativity with a neighbouring cluster, it is migrated into that neighbouring cluster, until better clusters are achieved.

3.3.1. I2 Algorithm

Input: k, the number of clusters; D, a dataset containing n objects.
Output: A set of k clusters.

Step 1: Cluster the data objects using the given cluster size.
Step 2: Arrange the clusters in any randomized order.
Step 3: Divide each cluster into sub-clusters and find the (density-based) midpoint of each sub-cluster.
Step 4: Find the intra-cluster similarity within the sub-clusters.
Step 5: Update the cluster centroids iteratively (cluster relativity), repeating from Step 3 until they are stable.
Step 6: Reorder the clusters in any order (updated centroids).
Step 7: Find the inter-cluster similarity between each cluster and the nearest sub-clusters of its nearest cluster.
Step 8: If there is inter-cluster similarity between cluster elements from different clusters, move each such element into the relevant cluster.
Step 9: Update the centroids incrementally.
Step 10: Find the centroid of each cluster using the new sub-cluster centroids.
Step 11: Select the next cluster and find its inter-cluster similarity.
Step 12: Repeat the procedure iteratively until all centroids return the same value, then return the clusters.
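The paper gives I2 only as pseudocode; the Python sketch below is one simplified interpretation of its loop, assuming Euclidean distance, mean centroids, and a two-way sub-cluster split in place of the density-based midpoints. It should be read as an illustration of the control flow, not the authors' exact method:

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    return tuple(sum(vals) / len(points) for vals in zip(*points))

def i2_refine(clusters, max_rounds=20):
    # assumes every input cluster is non-empty
    centroids = [mean(c) for c in clusters]
    for _ in range(max_rounds):
        # intra step: recompute each centroid from two sub-cluster midpoints
        for i, c in enumerate(clusters):
            if len(c) > 1:
                by_dist = sorted(c, key=lambda p: dist2(p, centroids[i]))
                half = len(by_dist) // 2
                centroids[i] = mean([mean(by_dist[:half]), mean(by_dist[half:])])
        # inter step: migrate every point to its nearest centroid
        moved, new_clusters = False, [[] for _ in centroids]
        for ci, c in enumerate(clusters):
            for p in c:
                j = min(range(len(centroids)), key=lambda m: dist2(p, centroids[m]))
                new_clusters[j].append(p)
                moved = moved or j != ci
        clusters = new_clusters
        if not moved:                    # no migration: converged
            break
    return clusters

clusters = i2_refine([[(1.0, 2.0), (8.0, 8.0)], [(1.5, 1.8), (8.2, 7.9)]])
print(clusters)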


3.4. Modified C4.5 Algorithm

In C4.5 the gain ratio tends to overcompensate: it can create a dependency in which an attribute is chosen purely on the basis of its splitting criterion. The important drawback is that C4.5 considers only attributes whose information gain is greater than the average information gain. This fails to consider attributes that have less information gain but are more important for splitting; such attributes may carry additional, basic information that could be used to split the data accurately. We developed a modified approach to the C4.5 algorithm, called MC4.5 [15], based on its information theory, entropy, and gain ratio, together with a new methodology for handling attributes with missing values and for adding additional values to training sets. MC4.5 performs better than C4.5; it also considers the lowest-valued attributes and missing values in the splitting criterion.

3.4.1. MC4.5 Algorithm

Algorithm: Generate_Modified_C4.5_decision_tree. Generate a decision tree with the highest information gain from data D.

Input: D, a partition of the data set containing training tuples with a set of attributes. The procedure uses as splitting criterion the attribute A with the highest information gain, with single or multiple values.

Output: A decision tree with high information gain.

Method:

Create a node N;
If the attributes in D are all in the same category C then
    Return and declare N a leaf node with label C;
Else if N is empty then
    Return N as a leaf node labelled with D;
Apply the attribute selection method (D, attribute_list) to find the best splitting_criterion;
If the training set is partitioned with a single attribute A then
    Find the gain ratio for A with respect to the training sets;
    Return and assign it as the best splitting criterion;
    Label node N with the splitting criterion;
Else
    Add more attributes to the training set to find the best splitting_criterion;
    Find the gain ratio for A with respect to the training sets;
    Label node N with the splitting_criterion;
If the splitting criterion is discrete-valued and multi-splits are allowed then
    attribute_list ← attribute_list − splitting attribute; // the splitting attribute's value is reduced in the overall information gain
For each outcome j of the splitting criterion
    Let Dj be the set of data tuples in D satisfying outcome j;


    If Dj is empty then
        Attach a leaf labelled with the majority class in D to node N;
    Else
        Attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
End for
Return N;
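The difference from vanilla C4.5 that Section 3.4 describes, namely not discarding attributes whose information gain falls below the average, can be illustrated with a small sketch (the gain and split-information numbers here are hypothetical):

def c45_pick(attrs, gain, split_info):
    # vanilla C4.5 heuristic: among attributes with at least average
    # information gain, pick the one with the best gain ratio
    avg = sum(gain[a] for a in attrs) / len(attrs)
    candidates = [a for a in attrs if gain[a] >= avg]
    return max(candidates, key=lambda a: gain[a] / split_info[a])

def mc45_pick(attrs, gain, split_info):
    # sketch of the MC4.5 idea as described above (an interpretation):
    # keep below-average-gain attributes in play and rank every attribute
    # by gain ratio, so a low-gain but highly discriminating attribute
    # can still be chosen
    return max(attrs, key=lambda a: gain[a] / split_info[a])

# hypothetical numbers for illustration only
gain = {"A1": 0.90, "A2": 0.30, "A3": 0.10}
split_info = {"A1": 3.0, "A2": 0.9, "A3": 0.2}
print(c45_pick(list(gain), gain, split_info))   # -> A1 (A3 filtered out)
print(mc45_pick(list(gain), gain, split_info))  # -> A3 wins on gain ratio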

4. The Hybrid Approach

Based on the scope and scalability of clustering and classification techniques, the k-means and C4.5 algorithms are used in combination. We developed a combinational approach to handle multidimensional data in large databases. The combinational approach is proposed because of the following limitations of using a single mining technique:

o The use of a single technique (either labelled or unlabelled) does not give promising results for mixed and multidimensional data sets.
o For very large databases, the dimensionality means that more time is required to acquire data, so partitioning or exclusion in large databases is necessary.
o A combined approach is needed: primary classification of the database, followed by sub-grouping as a secondary process to find the data within the classified groups.

To overcome the drawbacks of the existing techniques, we propose a hybrid model approach that contains primary classification followed by sub-group classification of the data.

4.1. The Hybrid Model

The hybrid model is a combination of the k-means and C4.5 algorithms for mining multidimensional data sets. The major drawback of large datasets is their dimensionality.

Fig. (1). Hybrid Model [flow: multidimensional data set → C4.5 algorithm → classified samples → k-means algorithm (clustering based on subtypes) → clustered groups → results]

When the needed information is accompanied by additional or unrelated data, it becomes more difficult to search, analyse, and transfer the data. To resolve this problem,


the data must first be categorized according to the requirement, so that the major part of the large data can be divided into two or more groups: one category contains the required data and the other contains the unrelated data. The large data sets are categorized using the C4.5 classifier, which produces a decision tree separating relevant and irrelevant attributes in a set. This makes it possible to group similar attributes into a group called a cluster. With the hybrid approach one can divide a large database into major groups, and the major groups can be further clustered into similar groups, which makes it possible to achieve a high accuracy rate with less computation time. This hybrid approach is suitable for large databases with multidimensional complexity: one can retrieve exact information from the database within a stipulated time by removing or minimizing the additional features of the large data.

4.2. Limitations of the Hybrid Model

Experiments and performance analysis show that the hybrid approach gives better results. Its limitations are the inherent constraints of the k-means and C4.5 algorithms respectively. In further work on improving the efficiency of the two existing algorithms, we developed I2 and MC4.5 respectively. The important drawback of the k-means algorithm is the local optimum; we addressed it by measuring inter- and intra-cluster relationships through an iterative process called sub-clustering. Similarly, the drawbacks of C4.5 are the elimination of attributes with low information gain and the lack of selection of training attributes for test data; these are overcome by adding information to the training attributes and by considering the minimum entropy of the least-valued attributes. These enhancements reveal that some further improvements are needed to achieve accurate splitting and prediction of data from large multidimensional data sets. For that purpose, we combine the I2 algorithm with the MC4.5 algorithm for primary categorization and sub-grouping of data. From the combination of I2 and MC4.5 we observed that:

o A combined approach performs better and gives promising results on multidimensional datasets, but the inherent features and constraints of the underlying algorithms can still prevent optimum results.
o For multidimensional data sets, primary partitioning is needed so that a categorized search gives better results in accuracy and time efficiency; the classified samples can then be sub-grouped into smaller sets, so that clustering can be done with both global and local optimization.

A minimal sketch of this hybrid flow is given below.
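The following Python sketch illustrates the hybrid flow of Section 4 using scikit-learn stand-ins: DecisionTreeClassifier with the entropy criterion in place of C4.5, and KMeans for the sub-grouping step. It is an assumption-laden illustration, not the paper's implementation:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# primary classification: an entropy-based decision tree labels the data
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
primary = tree.predict(X)

# secondary step: cluster each predicted class into finer sub-groups
for label in set(primary):
    members = X[primary == label]
    k = min(3, len(members))
    sub = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(members)
    print(f"class {label}: {len(members)} samples in {k} sub-clusters")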

5. Proposed Work

5.1. Proposed Hybrid Model [HIIMC4.5]

In this section we propose a new hybrid model based on the I2 and MC4.5 techniques, called HIIMC4.5 (Hybrid Inter-Intra clustering and Modified C4.5 model). Its functionality is divided into two phases. The first phase divides the large data set into major categories (based on its correlation), which requires discrete values to classify the data; the MC4.5 decision tree algorithm, with its improved information gain and gain ratio, is used to classify the data set into major discrete groups. The second phase finds the intra-measure and inter-measure between the attributes in a group. The sub-clustering allows


an attribute to migrate from one group to another, depending on its sub-cluster centroids. The I2 algorithm creates relativity among the categorized data sets; for that, similar and dissimilar distance measures between the data items are considered.
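In the same illustrative spirit, the two phases of HIIMC4.5 can be sketched as a composition: a primary grouping step standing in for MC4.5, followed by the i2_refine sketch from Section 3.3.1 inside each group. The helper below is hypothetical and only shows how the phases connect:

def hiimc45(X, labels, k):
    # hypothetical composition of the two HIIMC4.5 phases;
    # X: list of feature tuples, labels: primary (MC4.5-style) class labels,
    # k: sub-clusters per group; reuses i2_refine from the earlier sketch
    groups = {}
    for x, lab in zip(X, labels):            # phase 1: primary discrete groups
        groups.setdefault(lab, []).append(tuple(x))
    refined = {}
    for lab, pts in groups.items():          # phase 2: I2 refinement per group
        seeds = [pts[i::k] for i in range(min(k, len(pts)))]
        refined[lab] = i2_refine([s for s in seeds if s])
    return refined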

Fig (2). HIIMC4.5 Model [flow: multidimensional data set → Modified C4.5 (optimized information gain, gain ratio) → primary classification → classified samples → I2 clustering (intra measure, inter measure) → secondary classification → sub-clustering → clustered samples → evaluation & results]

5.2. Analysis of the Hybrid Model and the HIIMC4.5 Model

In this section we take different types of data sets as input to both hybrid models. The attributes contain different kinds of values. The data sets were obtained from various resources, with originality and proven records. The data used for the analysis are tabulated below.

Table (1) Datasets

Data Set             Type           Size
Iris dataset         Multivariate   150
PhenoSense dataset   Mixed          1728

5.2.1. About the Datasets

The Iris flower data set is a multivariate data set introduced by Sir Ronald Fisher in 1936. It consists of 50 samples from each of three species of Iris, namely Iris setosa, Iris virginica, and Iris versicolor. The PhenoSense dataset contains drug resistance data for HIV type 1 viruses; it contains mixed values of various dosage indications of drugs such as APV fold, IDV fold, ATV fold, etc.

5.2.2. Data Set 1


The Iris dataset has four attributes, namely sepal length, sepal width, petal length, and petal width. The graphical representations of these discrete attributes are shown below.

Fig (3.1.a). Hybrid Model (Sepal length)  Fig (3.1.b). HIIMC4.5 Model (Sepal length)

Fig (3.2.a). Hybrid Model (Sepal width)  Fig (3.2.b). HIIMC4.5 Model (Sepal width)

[Scatter plots of paired attribute values (5.1, 3.5, 1.4) coloured by cluster: the Hybrid Model panels show three k-means clusters (c_kmeans_1 to c_kmeans_3), while the HIIMC4.5 panels show five (c_kmeans_1 to c_kmeans_5).]

Fig (3.3.a). Hybrid Model (Petal length)  Fig (3.3.b). HIIMC4.5 Model (Petal length)

Fig (3.4.a). Hybrid Model (Petal width)  Fig (3.4.b). HIIMC4.5 Model (Petal width)

Here the sepal length and petal length attribute values are taken as the x and y values of the graph respectively. The hybrid model figure shows only three attribute groupings, meaning that only the closely related attributes are represented; hence the hybrid model failed to place the distant-valued attributes.

Fig (4). Hybrid Model (sequence attributes 5.1 and 3.5)

When the same input is given to the HIIMC4.5 model, the figure shows all attribute representations: here all the attributes are categorised based on their density and relativity.


Fig (5). HIIMC4.5 Model (sequence attributes 5.1 and 3.5)

5.2.3. Data Set 2

The PhenoSense dataset contains information about drug dosage specifications, with sequence identification, reference identification, and subtype. These last three attributes are taken as training attributes. The following diagrams show individual representations of the training attributes.

Fig (6.a). Hybrid Model (Sequence id attribute)  Fig (6.b). HIIMC4.5 (Sequence id attribute)

The discrete training attributes are given as input for primary classification. The figure shows the sub-grouping of the subtypes of the PhenoSense data.

[Scatter plots of the SeqID and RefID attributes: cluster views coloured by k-means cluster (c_kmeans_1 to c_kmeans_10) and subtype views coloured by predicted subtype (B, A, CRF02_AG, G, J, Others, C, K, D, NA, F, CRF12_BF, CRF01_AE, AG).]

Fig (7.a). Hybrid Model (Reference id attribute)  Fig (7.b). HIIMC4.5 (Reference id attribute)

The following figure shows the sub-grouping of the data based on the training attributes, but the sub-grouped data contains discrete groups with outliers and anomalies.

Fig (8). Hybrid Model

In the case of HIIMC4.5, attributes with the least values or low information gain are also considered. The following figure for HIIMC4.5 shows sub-grouping with high relativity and fewer outliers.

[Scatter plots of RefID against RefID and SeqID, coloured by subtype and by k-means cluster, for the Hybrid and HIIMC4.5 models.]

Fig (9). HIIMC4.5 Model

6. Results and Discussion

From the implementation on both data sets with the two hybrid approaches, HIIMC4.5 gives better results than the hybrid model. Here we use a confusion matrix to determine the better hybrid model. A confusion matrix contains information about the actual and predicted classifications produced by a classification system. The entries of the confusion matrix are:

• a is the number of correct predictions that an instance is negative,
• b is the number of incorrect predictions that an instance is positive,
• c is the number of incorrect predictions that an instance is negative, and
• d is the number of correct predictions that an instance is positive.

6.1. Data Set 1


Table (2) Confusion Matrix

                    Predicted
Actual       Negative    Positive
Negative     a           b
Positive     c           d
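As a quick illustration (the counts are placeholders, not the paper's results), accuracy and error rate follow directly from the four entries:

def rates(a, b, c, d):
    # accuracy and error rate from the confusion matrix entries above
    total = a + b + c + d
    return (a + d) / total, (b + c) / total   # (accuracy, error rate)

# placeholder counts for illustration
print(rates(a=50, b=3, c=2, d=45))            # -> (0.95, 0.05)

The multi-class tables below generalise this: accuracy is the sum of the diagonal divided by the grand total. For example, Table (4) gives (49 + 49 + 45) / 149, about 0.9597 accuracy, i.e. an error rate of about 0.0403, which matches the hybrid entry in Table (7).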


The confusion matrix is applied to both models and to both data sets. Table (3) shows the confusion matrix for the HIIMC4.5 model, which indicates an increased accuracy rate; Table (4) shows the confusion matrix for the hybrid model, which shows a slightly lower accuracy rate and performance.

Table (3) Confusion Matrix for HIIMC4.5 Model

             setosa  versicolor  virginica  Sum
setosa       49      0           0          49
versicolor   0       47          3          50
virginica    0       1           49         50
Sum          49      48          52         149

Table (4) Confusion Matrix for Hybrid Model

             setosa  versicolor  virginica  Sum
setosa       49      0           0          49
versicolor   0       49          1          50
virginica    0       5           45         50
Sum          49      54          46         149

6.2. Data Set 2

The PhenoSense dataset contains many discrete and continuous attributes. Table (5) shows the confusion matrix for the hybrid model using the highest-information-gain attribute alone, whereas Table (6) shows the HIIMC4.5 model, in which additional information is added to the information gain attribute. The results show no difference, because here the outcome depends on the individual attribute.

Table (5) Confusion Matrix for Hybrid Model

           B     A  CRF02_AG  G   J  Others  C   K  D  NA  F  CRF12_BF  CRF01_AE  AG  Sum
B          1586  0  0         0   0  0       1   0  0  0   0  0         0         0   1587
A          0     0  2         3   0  0       0   0  0  0   0  0         0         0   5
CRF02_AG   3     0  20        9   0  0       0   0  0  0   0  0         0         0   32
G          0     0  5         21  0  0       0   0  0  0   0  0         0         0   26
J          0     0  2         0   0  0       0   0  0  0   0  0         0         0   2
Others     1     0  4         0   0  0       0   0  0  0   0  0         0         0   5
C          12    0  2         1   0  0       35  0  0  0   0  0         0         0   50
K          0     0  2         0   0  0       0   0  0  0   0  0         0         0   2
D          3     0  1         1   0  0       1   0  0  0   0  0         0         0   6
NA         1     0  0         0   0  0       0   0  0  0   0  0         0         0   1
F          3     0  3         0   0  0       0   0  0  0   0  0         0         0   6
CRF12_BF   0     0  1         0   0  0       0   0  0  0   0  0         0         0   1
CRF01_AE   1     0  0         2   0  0       0   0  0  0   0  0         0         0   3
AG         1     0  0         0   0  0       0   0  0  0   0  0         0         0   1
Sum        1611  0  42        37  0  0       37  0  0  0   0  0         0         0   1727

Table (6) Confusion Matrix for HIIMC4.5 Model

           B     A  CRF02_AG  G   J  Others  C   K  D  NA  F  CRF12_BF  CRF01_AE  AG  Sum
B          1586  0  0         0   0  0       1   0  0  0   0  0         0         0   1587
A          0     0  2         3   0  0       0   0  0  0   0  0         0         0   5
CRF02_AG   3     0  20        9   0  0       0   0  0  0   0  0         0         0   32
G          0     0  5         21  0  0       0   0  0  0   0  0         0         0   26
J          0     0  2         0   0  0       0   0  0  0   0  0         0         0   2
Others     1     0  4         0   0  0       0   0  0  0   0  0         0         0   5
C          12    0  2         1   0  0       35  0  0  0   0  0         0         0   50
K          0     0  2         0   0  0       0   0  0  0   0  0         0         0   2
D          3     0  1         1   0  0       1   0  0  0   0  0         0         0   6
NA         1     0  0         0   0  0       0   0  0  0   0  0         0         0   1
F          3     0  3         0   0  0       0   0  0  0   0  0         0         0   6
CRF12_BF   0     0  1         0   0  0       0   0  0  0   0  0         0         0   1
CRF01_AE   1     0  0         2   0  0       0   0  0  0   0  0         0         0   3
AG         1     0  0         0   0  0       0   0  0  0   0  0         0         0   1
Sum        1611  0  42        37  0  0       37  0  0  0   0  0         0         0   1727

Table (7) shows the computation time and error rate of the HIIMC4.5 and hybrid approaches on the Iris dataset. HIIMC4.5 achieves less computation time with a reduced error rate.

Figure (10) shows the comparison of computation time on the Iris dataset for the hybrid and HIIMC4.5 approaches; HIIMC4.5 takes only 13 milliseconds.

Fig (10). Computation time comparison (Iris dataset)

The following figure shows the error rate on the Iris dataset; HIIMC4.5 achieves a lower error, which improves the accuracy rate.


Table (7)

Method     Data Set  Computation Time (ms)  Error Rate
HIIMC4.5   Iris      13                     0.0268
Hybrid     Iris      18                     0.0403


Fig (11). Error rate comparison (Iris dataset)

7. Conclusion

From the detailed study and analysis of the hybrid models based on the experimental results, we can conclude that our hybrid model for multidimensional databases gives better results than the existing approaches. For the efficiency of the hybrid model, an improvement over the conventional clustering and classification algorithms is proposed. The proposed HIIMC4.5 model enables the careful selection of the primary classification, which is useful for the further sub-grouping of data objects. The algorithm is based on the density distribution of data objects: the I2 clustering algorithm produces good partitions with a high accuracy rate and with both local and global reachability densities. Experimental results show that HIIMC4.5 performs better than k-means and C4.5 with respect to correctness of partitioning and accuracy. As an extension of this work, categorizing the database at search time, based on attribute values, was experimented with, and the results show the accuracy rate and error rate. For some biological databases, such as primary nucleotide sequence databases, metadatabases, genome databases, protein sequence databases, proteomics databases, and RNA databases, the primary classification can also be refined according to the database to be mined. When the primary classification is refined for the database and the searched patterns are stored along with the relevant data, they can be reused when needed, requiring less memory and time to extract the data. With this methodology, different patterns can be obtained, leading to the generation of informative patterns about the database. For explorative use of the data, this work then needs only the stored patterns to be mined for information retrieval from the database.

8. References

[1] Chapelle, Olivier; Schölkopf, Bernhard; Zien, Alexander. Semi-Supervised Learning. Cambridge, Mass.: MIT Press, 2006. ISBN 978-0-262-03358-9.

[2] Vapnik, V. and Chervonenkis, A. Theory of Pattern Recognition [in Russian]. Nauka, Moscow, 1974. Cited in Chapelle et al., 2006.

[3] Ratsaby, J. and Venkatesh, S. A probably approximately correct learning bound for semi-supervised learning of a Gaussian mixture, 1995.


[4] Zhu, Xiaojin. Semi-Supervised Learning Literature Survey. Computer Sciences TR 1530, University of Wisconsin-Madison, 2006.

[5] Juanying Xie, Weixin Xie, Chunxia Wang, Xinbo Gao. A Novel Hybrid Feature Selection Method Based on IFSFFS and SVM for the Diagnosis of Erythemato-Squamous Diseases. Journal of Machine Learning Research - Proceedings Track 11: 142-151, 2010.

[6] Pritha Mahata. IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, January-March 2010.

[7] Taysir Hassan A. Soliman, Adel A. Sewissy, and Hisham Abdel Latif. A gene selection approach for classifying diseases based on microarray datasets. IEEE 12th International Conference on Bioinformatics and Bioengineering, Cyprus, 2010.

[8] Reuben Evans, Bernhard Pfahringer, Geoffrey Holmes. Clustering for classification. 2011.

[9] Claudio Gentile, Fabio Vitale, Giovanni Zappella. Combined technique (clustering and classification) in networks using signed graphs. March 2013.

[10] S. Santhosh Kumar, E. Ramaraj. Analysis of Sequence Based Classifier Prediction for HIV Subtypes. International Journal of Engineering and Technology, Volume 2, No. 10, October 2012. ISSN: 2049-3444.

[11] Quinlan, J. R. Induction of decision trees. Machine Learning 1: 81-106, 1986. Reprinted in Shavlik and Dietterich (eds.), Readings in Machine Learning, 1993.

[12] MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, 1: 281-297.

[13] S. Santhosh Kumar, A. Sumathi, E. Ramaraj. Development of an Efficient Clustering Technique for Colon Dataset. International Journal of Engineering and Innovative Technology (IJEIT), Volume 1, Issue 5, May 2012.

[14] S. Santhosh Kumar, E. Ramaraj. Modified C4.5 Algorithm with Improved Information Entropy and Gain Ratio. International Journal of Engineering Research & Technology (IJERT), Vol. 2, Issue 9, September 2013.

[15] Zhu, Xiaojin; Goldberg, Andrew B. Introduction to Semi-Supervised Learning. Morgan & Claypool, 2009. ISBN 9781598295481.

[16] Younger, B. A. and Fearing, D. D. (1999). Parsing Items into Separate Categories: Developmental Change in Infant Categorization. Child Development, 70: 291-303.

[17] Xu, F. and Tenenbaum, J. B. (2007). Sensitivity to sampling in Bayesian word learning. Developmental Science, 10: 288-297.

[18] Gweon, H., Tenenbaum, J. B., and Schulz, L. E. (2010). Infants consider both the sample and the sampling process in inductive generalization. Proc Natl Acad Sci U S A, 107(20): 9066-9071.

[19] www.bioafrica.com
[20] www.pdb.org
[21] www.avert.org
[22] www.cdc.gov/dhdsp/data_set_directory.htm
[23] www.infochimps.com