Biological Sequences Encoding for Supervised Classification


Rabie Saidi1, 3, Mondher Maddouri2, and Engelbert Mephu Nguifo1

1 CRIL - CNRS FRE 2499 Université d'Artois - IUT de Lens, France {saidi, mephu}@cril.univ-artois.fr

2 Computer Science Department, National Institute of Applied Sciences and Technologies, Tunis-Carthage 2035, Tunisia

[email protected] FSJEG, University of Jendouba, [email protected]

Abstract. The classification of biological sequences is one of the significant challenges in bioinformatics, for protein sequences as well as for nucleic ones. The huge volume of these data, their ambiguity and especially the high cost of in vitro analysis in terms of time and money make the use of data mining a necessity rather than a mere option. However, data mining techniques, which often process data in relational format, are confronted with the unsuitable format of biological sequences. Hence, a pre-processing step is unavoidable. This work presents the encoding of biological sequences as a preparation step before their classification. We present three existing encoding methods based on motif extraction. We also propose an improvement of one of these methods, and we carry out a comparative study that takes into account the effect of each method on the classification accuracy as well as the number of generated attributes and the CPU time.

1 Introduction

The emergence of bioinformatics that we have witnessed in recent years originates in the technological progress that has made large-scale research projects possible. The most remarkable one was the Human Genome Project (HGP) [10], completed in 13 years from 1990; a period that seems very short compared with the quantity of data collected on the human genome: the 3 billion bases that constitute human DNA. Thus, several problems are open:

How does the gene express its protein? Where does the gene start and where does it end? How do the protein families evolve and how to classify them? How to predict the three-dimensional structure of proteins? …

Answering these questions by biochemical means and in vitro analysis proves to be very expensive in terms of time and money. Indeed, some tasks, such as the determination of the three-dimensional structure of a protein, can extend over months or even years, whereas the quantity of biological sequences generated by the various sequencing programs is growing exponentially. Henceforth, the challenge is not the gathering of biological data but rather their exploration in a faster and more efficient way, making it possible to reveal the secrets of the cell.

In this context, the need to use data mining is increasingly pressing. Data mining is particularly essential considering the enormous quantities of biological data and their ambiguity. Mining tools have proved profitable in several fields that are also known for their large masses of data, such as commerce, finance and information retrieval. However, data mining techniques, which often process data in relational format, are confronted with the unsuitable format of biological sequences. This makes it necessary to transform these data before their analysis. Our work falls within the framework of biological sequence pre-processing, namely the encoding of sequences into a standard format suitable for analysis, which is generally the relational format frequently used by data mining tools. We study and compare some existing encoding methods based on motif extraction. These methods are implemented in the C language and gathered into a DLL, enabling their comparison in terms of classification accuracy, number of generated attributes and CPU time.

An introduction to the problem and the motivation for biological sequence encoding can be found in section 2. In section 3, we give an overview of some encoding methods based on motif extraction. We propose an improvement of one of them in section 4. In section 5, we carry out the experimental study. We then discuss our results in section 6. Section 7 concludes the paper and indicates some possible directions for future work.

2 Protein Sequences Classification by Data Mining Tools

Classification is one of the most significant open problems in bioinformatics. This problem arises for proteins as well as for DNA. Indeed, biologists are often interested in identifying the family to which a newly sequenced protein belongs. This makes it possible to study the evolution of this protein and to discover its biological functions. For DNA, biologists seek, for instance, to classify parts of sequences as coding or non-coding zones [7]. They use biochemical means and in vitro analysis to perform these tasks, which prove to be very expensive in terms of time and money while the quantity of biological data keeps growing.

In this context, the use of data mining techniques proves to be a rational response to this problem, since they have been efficient in various fields and particularly in supervised classification. However, since biological sequences are represented by strings of characters and mining tools often process data in relational format, these tools cannot be applied directly to such data. Hence, the biological sequences have to be encoded into another format. The authors of [8] propose a model of a data mining process, illustrated by Fig. 1, to perform this task. The model presents the three main steps of the Knowledge Discovery in Data (KDD) process applied to the problem of biological sequence classification. The pre-processing step consists of extracting a set of motifs from a set of sequences. These motifs are used as attributes to construct a binary table whose rows are the sequences. The presence or absence of an attribute in a sequence is noted by 1 or 0, respectively. This binary table is called the context. It represents the result of the pre-processing step and the new encoding format of the sequences. It is used as input for the processing step, where a classifier is applied to generate the classification rules. These rules are then used to classify other sequences.

Fig. 1. Process of biological sequences classification by data mining tools
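The construction of the context can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name build_context, the three toy sequences and the three motifs are hypothetical choices made for the example, and the presence of a motif is tested by exact sub-string matching.

```python
def build_context(sequences, motifs):
    """Return one row per sequence and one column per motif (1 = present, 0 = absent)."""
    return [[1 if motif in seq else 0 for motif in motifs] for seq in sequences]


# Hypothetical toy data, chosen only for illustration.
sequences = ["FFVV", "NVVI", "INNVI"]
motifs = ["VV", "VI", "NV"]

context = build_context(sequences, motifs)
for seq, row in zip(sequences, context):
    print(seq, row)
```

Each row of the resulting table can then be handed, together with the class label of its sequence, to any classifier that accepts relational data.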

This paper falls within the framework of biological sequence pre-processing. We study how encoding methods affect the discovered knowledge by measuring the classification accuracy. We assume that a good method will help to increase the accuracy even in hard classification problems.

In the next section we present and describe some encoding methods based on motif extraction.

3 Existing Encoding Methods

Nucleic and protein sequences contain patterns, or motifs, which have been preserved throughout evolution because of their importance for the structure and/or function of the molecule. The discovery of these motifs can help to gather biological sequences into structural or functional families, and also to better understand the rules that govern evolution.

The members of a protein family are often characterized by more than one motif: on average, each family preserves 3 to 4 regions [11]. The motifs often indicate functional and evolutionary relationships between proteins. As for proteins, motif discovery can be used to determine the function of nucleic sequences, for example to identify promoter sites and junction sites.

We present hereafter three methods of motif extraction: N-Grams (NG), Active Motifs (AM) and Discriminative Descriptors (DD). Then, in section 4, we propose a modification of the Discriminative Descriptors method that uses a substitution matrix (DDSM).

3.1 N-Grams

The simplest approach is that of the N-Grams, also known as N-Words or length-N windowing [6]. The motif length is fixed beforehand. An N-gram is a sub-sequence composed of N characters, extracted from a larger sequence. For a given sequence, the set of N-grams that can be generated is obtained by sliding a window of N characters over the whole sequence, one character at a time. At each step, a sub-sequence of N characters is extracted. This process is repeated for all the analyzed sequences (Fig. 2 illustrates this principle). Then, only the distinct N-grams are kept.

Fig. 2. Extraction of 2-grams from the 3 sequences FFVV, NVVI and INNVI. For each sequence of length m, the number of extracted N-grams is m-N+1

N-Grams are widely used in information retrieval and natural language processing [9]. They are also used in local alignment by several systems such as BLAST [1]. N-gram extraction can be done in O(m*n*N) time, where m is the maximum length of a sequence, n is the number of sequences and N is the motif length.
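The sliding-window extraction can be sketched as follows in Python. The function name extract_ngrams is a hypothetical choice, and the sketch is not the C implementation used in our experiments; it only reproduces the principle on the three sequences of Fig. 2.

```python
def extract_ngrams(sequences, n):
    """Return the distinct n-grams of all sequences, in order of first appearance."""
    seen = {}
    for seq in sequences:
        # A sequence of length m yields m - n + 1 windows (none when m < n).
        for start in range(len(seq) - n + 1):
            seen.setdefault(seq[start:start + n], None)
    return list(seen)


sequences = ["FFVV", "NVVI", "INNVI"]
print(extract_ngrams(sequences, 2))
```

The dictionary is used only to drop duplicates while keeping the order of first appearance; each window costs O(N) to copy, which is where the factor N of the O(m*n*N) bound comes from.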

3.2 Active Motifs

This method is founded on the assumption that significant regions are better preserved during evolution and thus appear more frequently than expected. Indeed, this method extracts the commonly occurring motifs whose lengths exceed a specified length, called Active Motifs, from a set of biological sequences. The activity of a motif is the number of sequences which match it within an allowed number of mutations [12].

Motif extraction is based on the construction of a Generalized Suffix Tree (GST). A GST is an extension of the suffix tree [4] dedicated to representing a set of n sequences, each indexed by i = 1..n. Each suffix of a sequence is represented by a leaf (drawn as a rectangle) labelled by the index i of this sequence. The suffix is obtained by concatenating the sub-sequences labelled on the path from the root to the leaf i. Each non-terminal node (drawn as a circle) is labelled by the number of sequences containing its corresponding sub-sequence, which is the concatenation of the sub-sequences labelled on the arcs linking it to the root (Fig. 3). The candidate motifs are the prefixes of the strings labelled on root-to-leaf paths that satisfy the minimum length. Then, only the motifs having an acceptable activity are kept.

Fig. 3. GST illustration for the 3 sequences FFVV, NVVI and INNVI. If we suppose that only exactly matching segments occurring in at least two sequences and having a minimum length of 2 are considered active, then we have 3 active motifs: VV, VI and NV

There are several algorithms for constructing the GST. The authors of [12] state that the GST can be built in O(m*n) time, where m is the maximum size of a sequence and n is the number of sequences. To extract the motifs that satisfy the search conditions, it is necessary to traverse the entire tree, which gives a complexity of O((m*n)²).
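For small inputs, the notion of activity can be illustrated without building a GST. The Python sketch below is a brute-force stand-in, not the GST-based algorithm of [12]: it enumerates every sub-string directly, allows no mutation (exact matching only), and the function name active_motifs is a hypothetical choice.

```python
from collections import defaultdict


def active_motifs(sequences, min_length, min_activity):
    """Sub-strings of at least min_length that occur in at least min_activity sequences."""
    owners = defaultdict(set)  # sub-string -> indices of the sequences containing it
    for index, seq in enumerate(sequences):
        for start in range(len(seq)):
            for end in range(start + min_length, len(seq) + 1):
                owners[seq[start:end]].add(index)
    return sorted(m for m, found_in in owners.items() if len(found_in) >= min_activity)


# Setting of Fig. 3: minimum length 2, activity of at least 2 sequences.
print(active_motifs(["FFVV", "NVVI", "INNVI"], min_length=2, min_activity=2))
```

On the three sequences of Fig. 3 this returns the motifs NV, VI and VV. The enumeration is quadratic in the sequence length, so it is meant only to make the definition concrete.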


3.3 Discriminative Descriptors

Given a set of n sequences assigned to P families/classes F1, F2, .., FP, the goal is to build sub-strings, called Discriminative Descriptors (DD), which make it possible to discriminate a family Fi from the other families, where i = 1..P [8].

This method is based on an adaptation of the Karp, Miller and Rosenberg (KMR) algorithm [5]. This algorithm identifies the repeats in character strings, trees or arrays. It is applied here to biological sequences. The extracted repeats are then filtered in order to keep only the discriminative and minimal ones.

The discriminative descriptors are built in O(m²*n³*N) time, where m is the maximum size of a sequence, n is the number of sequences and N is the maximum motif length.

Repeats Identification. The KMR algorithm is based on the following concept of equivalence: two positions i and j in a character string S of length m are k-equivalent, noted i Ek j, if and only if the two sub-strings of length k, S[i; i+k-1] and S[j; j+k-1], are identical [5]. We also say that the positions i and j belong to the same class of the equivalence relation Ek. An equivalence relation Ek, where 1<=k<=m, can be represented by a vector Vk[1..m-k+1] in which each component Vk[i], with 1<=i<=(m-k+1), holds the number of the class to which position i belongs under Ek. To apply the KMR algorithm to several strings, they are concatenated into one sequence while keeping track of the positions where the individual strings end. Fig. 4 shows an example of equivalence applied to 3 concatenated sequences: FFVV, NVVI and INNVI.

Fig. 4. Illustration of a 2-equivalence between i=5 and j=11. It corresponds to the repeat NV, which represents one of the classes of the equivalence relation E2
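The vector Vk can be computed naively by comparing sub-strings directly, as in the Python sketch below. This only illustrates the equivalence relation: it does not reproduce the incremental construction of the KMR algorithm, it ignores the boundaries between the concatenated strings, and the function name equivalence_classes is a hypothetical choice.

```python
def equivalence_classes(s, k):
    """Return V_k as a list: entry i-1 holds the class number of the 1-based position i."""
    class_of = {}  # sub-string of length k -> class number, in order of first appearance
    vector = []
    for start in range(len(s) - k + 1):
        sub = s[start:start + k]
        vector.append(class_of.setdefault(sub, len(class_of) + 1))
    return vector


concatenated = "FFVV" + "NVVI" + "INNVI"
v2 = equivalence_classes(concatenated, 2)
# Positions 5 and 11 (1-based) both start the repeat NV, as in Fig. 4.
print(v2[5 - 1] == v2[11 - 1])
```

Because the boundaries are ignored here, sub-strings such as VN (positions 4 and 5) straddle two sequences; an actual implementation has to discard them using the stored end positions.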

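Discriminative and Minimal Sub-strings Identification. A sub-string X is considered to be discriminative between the family Fi and the other families Fj, where i = 1..P, j = 1..P and i ≠ j, if:

$\frac{\text{number of sequences of } F_i \text{ where } X \text{ appears}}{\text{total number of sequences of } F_i} \times 100 \geq \alpha$ . (1)

$\frac{\text{number of sequences of } F_j \text{ where } X \text{ appears}}{\text{total number of sequences of } F_j} \times 100 \leq \beta$ . (2)

where α and β are user-specified thresholds (the Alpha and Beta parameters of Table 3).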


A discriminative sub-string is considered minimal if it does not contain any other discriminative sub-string.
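The two filters can be sketched in Python as follows. The sketch assumes that conditions (1) and (2) compare the two percentages with a lower threshold alpha and an upper threshold beta (the Alpha and Beta parameters of Table 3); this reading, the function names and the toy families F1 and F2 are assumptions made for the illustration.

```python
def percentage(substring, family):
    """Percentage of the sequences of a family that contain the sub-string."""
    return 100.0 * sum(substring in seq for seq in family) / len(family)


def is_discriminative(substring, families, target, alpha, beta):
    """Frequent enough in the target family and rare enough in every other family."""
    if percentage(substring, families[target]) < alpha:
        return False
    return all(percentage(substring, seqs) <= beta
               for name, seqs in families.items() if name != target)


def minimal_only(substrings):
    """Drop every sub-string that contains another sub-string of the set."""
    return [x for x in substrings if not any(y != x and y in x for y in substrings)]


# Hypothetical toy families and candidate repeats.
families = {"F1": ["FFVV", "NVVI"], "F2": ["INNVI"]}
candidates = ["VV", "VVI", "VI"]
kept = [x for x in candidates if is_discriminative(x, families, "F1", alpha=50, beta=0)]
print(minimal_only(kept))
```

Here VI is rejected because it also appears in F2, and VVI is discriminative but not minimal since it contains VV.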

4 Proposed Method: Discriminative Descriptors with Substitution Matrix

In the case of proteins, the motifs extracted by the Discriminative Descriptors method make it possible to discriminate between various families. However, this method neglects the fact that some amino acids have similar properties and can thus be substituted for one another without changing either the structure or the function of the protein [3]. So, among the attributes generated by the Discriminative Descriptors method, we can find several motifs which all derive from a single motif. In the same way, during the construction of the context (binary table), we are likely to lose information when we note by 0 the absence of a motif while another motif, which can substitute for it, is present.

The similarity between motifs is based, as already mentioned, on the similarity between the amino acids which constitute them. Indeed, there are various degrees of similarity between amino acids. Since there are 20 amino acids, the mutations between them are scored by a 20x20 matrix called a substitution matrix.

4.1 Substitution Matrix

In bioinformatics, a substitution matrix estimates the rate at which each possible residue in a sequence changes to another residue over time. Substitution matrices are usually seen in the context of amino acid sequence alignment, where the similarity between sequences depends on the mutation rates as represented in the matrix [3].

4.2 Terminology

Let M be a set of n motifs, each noted M[p], p = 1..n. M can be divided into m clusters. Each cluster contains a main motif M* and, possibly, other motifs which can be substituted by M*. The main motif is the motif which has, in its cluster, the highest probability of mutating to another one. For a motif M of k amino acids, this probability, noted Pm(M), is based on the probability Pi (i = 1..k) that each amino acid M[i] of the motif M does not mutate to any other amino acid. We have:

$P_m = 1 - \prod_{i=1}^{k} P_i$ . (3)

Each Pi is calculated from the substitution matrix according to the following formula:

$P_i = S(M[i], M[i]) \,/\, \sum_{j=1}^{20} S^{+}(M[i], AA_j)$ ; i = 1..k . (4)

S(x, y) is the substitution score of the amino acid y by the amino acid x as it appears in the substitution matrix. S+(x, y) denotes a positive substitution score, so only the positive scores of the row of M[i] contribute to the sum. AAj is the amino acid of index j among the 20 amino acids.
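As a worked illustration, the Python sketch below computes Pi and Pm for two short motifs. It reads formula (3) as one minus the product of the Pi. The table POSITIVE_SCORES is a hand-copied excerpt in the style of BLOSUM62 that lists only the positive scores of three rows; its values are assumptions to be checked against an actual matrix, and all names are hypothetical.

```python
# Positive scores only; hand-copied BLOSUM62-style excerpt (assumed values).
POSITIVE_SCORES = {
    "R": {"R": 5, "K": 2, "Q": 1},
    "I": {"I": 4, "V": 3, "L": 2, "M": 1},
    "P": {"P": 7},
}


def no_mutation_probability(amino_acid):
    """P_i of formula (4): self-score divided by the sum of the positive scores of the row."""
    row = POSITIVE_SCORES[amino_acid]
    return row[amino_acid] / sum(row.values())


def mutation_probability(motif):
    """P_m of formula (3): one minus the product of the P_i of the motif."""
    product = 1.0
    for amino_acid in motif:
        product *= no_mutation_probability(amino_acid)
    return 1.0 - product


print(round(mutation_probability("RI"), 2))  # 0.75
print(mutation_probability("PP"))            # 0.0
```

With these scores, R keeps itself with probability 5/8 and I with probability 4/10, which gives Pm(RI) = 1 - 0.25 = 0.75, while proline has no positive score other than its own, so Pm(PP) = 0. Both values agree with the RI and PP columns of Table 1.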

We consider that a motif M substitutes a motif M’ if: (1) M and M’ have the same length k; (2) S(M[i], M’[i]) >= 0 for i = 1..k; (3) SP(M, M’) >= T, where T is a user-specified threshold with 0 <= T <= 1.

We note by SP(M, M’) the substitution probability of the motif M’ by the motif M having the same length k. It measures the possibility that M mutates to M’:

SP(M, M’) = Sm (M, M’) / Sm (M, M) . (5)

Sm(X, Y) is the substitution score of the motif Y by the motif X. It is computed according to the following formula:

$S_m(X, Y) = \sum_{i=1}^{k} S(X[i], Y[i])$ . (6)

According to the substitution matrix, there is only one best motif which can substitute a motif M: M itself, since the amino acids which constitute it are best substituted by themselves. Consequently, the substitution probability of a motif by another one, if they satisfy the substitution conditions, lies between 0 and 1.
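The substitution test can be sketched in Python as follows, reading the motif score Sm as the sum of the position-wise scores. The dictionary SCORES is a hand-copied excerpt in the style of BLOSUM62 limited to the pairs needed by the example; its values and all function names are assumptions made for the illustration.

```python
# Hand-copied BLOSUM62-style excerpt (assumed values); each pair is stored once.
SCORES = {
    ("R", "R"): 5, ("I", "I"): 4, ("V", "V"): 4, ("A", "A"): 4,
    ("I", "V"): 3, ("V", "A"): 0, ("I", "A"): -1,
}


def score(x, y):
    """S(x, y), looked up symmetrically; pairs missing from the excerpt raise KeyError."""
    return SCORES[(x, y)] if (x, y) in SCORES else SCORES[(y, x)]


def motif_score(x, y):
    """S_m(X, Y): sum of the position-wise substitution scores."""
    return sum(score(a, b) for a, b in zip(x, y))


def substitution_probability(m, m_prime):
    """SP(M, M') of formula (5)."""
    return motif_score(m, m_prime) / motif_score(m, m)


def substitutes(m, m_prime, threshold):
    """Same length, no negative position-wise score, and SP(M, M') >= T."""
    if len(m) != len(m_prime):
        return False
    if any(score(a, b) < 0 for a, b in zip(m, m_prime)):
        return False
    return substitution_probability(m, m_prime) >= threshold


print(substitutes("RI", "RV", 0.7))  # True: SP = (5 + 3) / (5 + 4)
print(substitutes("RI", "RA", 0.5))  # False: S(I, A) is negative
```

Since S(x, y) never exceeds S(x, x) in such a matrix and negative scores are excluded by the second condition, the ratio SP stays between 0 and 1 whenever the conditions hold.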

4.3 Methodology

The modification of the Discriminative Descriptors method concerns two aspects. First, the number of extracted motifs is reduced because we keep only one motif for each cluster of substitutable motifs of the same length. Second, we modify the context construction rule mentioned in section 2: we note by 1 the presence of the motif or that of one of its substitutes. The first aspect can itself be divided into two phases: (1) identify the main motifs of the clusters and (2) perform the filtering.

Main Motifs Identification and Filtering. The main motif of a cluster is the motif in this cluster that is most likely to mutate to another one.

To identify all the main motifs, we sort M in descending order by motif length, then by Pm. For each motif M’ of M, we look for the motif M with the highest Pm that can substitute M’. The clustering is based on computing the substitution probability between motifs. A motif may belong to more than one cluster. In this case, it must be the main motif of one of them.


The filtering consists in keeping only the main motifs and removing all the others. The result is a smaller set of motifs which can represent the same information as the initial set.

The main motifs identification and the filtering are performed by the following simplified algorithm:

    begin
      sort M in a descending order by (motifs lengths, Pm);
      for each motif M[i] from i=n to 1
        if Pm(M[i])=0 then
          M[i] becomes a main motif;
        else
          x := position of the first motif having the same length as M[i];
          for each motif M[j] from j=x to i
            if M[j] substitutes M[i] or j=i then
              M[j] becomes a main motif;
              break;
            end if
          end for
        end if
      end for
      for each motif M of M
        if M is not a main motif then
          delete M;
        end if
      end for
    end.

The time complexity of this algorithm is O((n²/2)*k), where n is the number of motifs and k is the maximum motif length.
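A direct Python transcription of this algorithm is sketched below. To keep the example short, the Pm values are taken from Table 1 and the substitution relation is hard-coded from the main-motif row of the same table instead of being computed from a substitution matrix; the names select_main_motifs, PM and CAN_SUBSTITUTE are hypothetical.

```python
def select_main_motifs(motifs, pm, substitutes):
    """Keep only the main motifs, following the simplified algorithm above."""
    # Descending order by (length, Pm); ties keep their input order.
    ordered = sorted(motifs, key=lambda m: (len(m), pm[m]), reverse=True)
    main = set()
    for i in range(len(ordered) - 1, -1, -1):
        current = ordered[i]
        if pm[current] == 0:
            main.add(current)
            continue
        # Position of the first motif having the same length as the current one.
        first = next(j for j, m in enumerate(ordered) if len(m) == len(current))
        for j in range(first, i + 1):
            if j == i or substitutes(ordered[j], current):
                main.add(ordered[j])
                break
    return [m for m in ordered if m in main]


# Pm values of Table 1 and a relation read off its main-motif row (assumption).
PM = {"RI": 0.75, "RV": 0.72, "RF": 0.72, "RA": 0.5, "PP": 0.0}
CAN_SUBSTITUTE = {("RI", "RV"), ("RI", "RF"), ("RV", "RA")}


def substitutes(m, m_prime):
    return (m, m_prime) in CAN_SUBSTITUTE


print(select_main_motifs(list(PM), PM, substitutes))
```

On this subset of Table 1 the function keeps RI, RV and PP: RV is removed as a member of the cluster of RI but kept as the main motif of the cluster that contains RA.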

Example. Given the BLOSUM62 substitution matrix and the following set of motifs (Table 1) sorted by their lengths and Pm, we assign each motif to a cluster represented by its main motif. We get 5 clusters, illustrated by the diagram shown in Fig. 5.

Table 1. Motifs clustering. The third row shows the cluster main motifs

M            LLK    IMK    VMK    GGP    RI     RV     RF     RA     PP
Pm           0.89   0.87   0.86   0      0.75   0.72   0.72   0.5    0
Main motif   LLK    LLK    LLK    GGP    RI     RI     RI     RV     PP

Context building. The context building is done by noting 1 if a sequence matches a main motif or one of the motifs it can substitute; otherwise we note 0. We use the following algorithm:

    begin
      for each sequence S
        for each motif M of length k
          repeat
            extract a k-gram M’ from S;
            if M substitutes M’ then
              note 1 in the context for S and M;
              goto presence;
            end if
          until the end of S
          note 0 in the context for S and M;
          presence: continue
        end for
      end for
    end.

The time complexity of this algorithm is O(m*n*k*l), where n is the number of motifs, m is the number of sequences, k is the maximum motif length and l is the maximum sequence length.

Fig. 5. Motifs clustering. The motif RV belongs to 2 clusters. It is the main motif of one of them.
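The context-building algorithm of this section can be sketched in Python as follows. The substitution test is passed as a function; in the example it is a hypothetical stand-in that accepts a motif itself and one hard-coded pair, and the sequences are toy data chosen for the illustration.

```python
def build_context(sequences, main_motifs, substitutes):
    """Note 1 when some k-gram of the sequence can be substituted by the motif, else 0."""
    context = []
    for seq in sequences:
        row = []
        for motif in main_motifs:
            k = len(motif)
            kgrams = (seq[start:start + k] for start in range(len(seq) - k + 1))
            # any() stops at the first match, like the jump to "presence".
            row.append(1 if any(substitutes(motif, gram) for gram in kgrams) else 0)
        context.append(row)
    return context


# Hypothetical stand-in: a motif substitutes itself, and RI substitutes RV.
PAIRS = {("RI", "RV")}


def substitutes(m, m_prime):
    return m == m_prime or (m, m_prime) in PAIRS


print(build_context(["ARVA", "APPA", "GGGG"], ["RI", "PP"], substitutes))
```

The first sequence receives a 1 for RI although it only contains RV, which is exactly the information that the plain presence/absence rule of section 2 would lose.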

5 Experiments and results

The encoding methods are implemented in the C language and gathered into a DLL. The accepted input formats are the FASTA format for biological sequence files and the format described by Fig. 6 for the classification files. The DLL generates relational files in various formats, such as the ARFF format used by the WEKA workbench [13] and the DAT format used by the DisClass system [7].



Fig. 6. Classification file sample. The file describes a FASTA file containing 4 biological sequences belonging to 2 classes: the first one belongs to the class «TLRH» and the 3 others belong to the class «TLRNH»

5.1 Experimental Data

To compare the encoding methods we use 3 samples of biological sequences, described in Table 2.

Table 2. Experimental data

Data type   Sample   Family/class                                    Number of sequences   Data source
Protein     S1       High-potential Iron-Sulfur Protein              19                    SWISS-PROT
Protein     S1       Hydrogenase Nickel Incorporation Protein HypA   20                    SWISS-PROT
Protein     S1       Glycine Dehydrogenase                           21                    SWISS-PROT
Protein     S2       Human TLR                                       14                    SWISS-PROT
Protein     S2       Non-human TLR                                   26                    SWISS-PROT
Nucleic     S3       Promoter site                                   53                    University of California, Irvine (UCI) repository
Nucleic     S3       Non-promoter site                               53                    University of California, Irvine (UCI) repository

The sample S1 contains three distinct and distant protein families. We assume that classification in this case will be relatively easy, since each family will probably have preserved patterns which differ from those of the other families [11]. The sample S2, however, presents a more delicate classification problem. It consists of distinguishing between the human Toll-like Receptor (TLR) protein sequences and the non-human ones. The difficulty is due to the structural and functional similarity of the two groups. The sample S3 is the subject of a typical classification problem: distinguishing the nucleic sequences that carry promoter sites from those that do not. Promoters are short segments of DNA whose identification facilitates the localization of gene starts.

5.2 Experimental Process

In our experiments, we use the 10-fold cross-validation technique [2]. Each data sample is randomly and evenly partitioned into 10 mutually exclusive subsets. Training and testing are carried out 10 times. In each iteration, one subset is reserved for testing and the others are used together for training. After building the training context CA and the test context CT, we start the classification step. Using the C4.5 classifier of the WEKA workbench [13], we generate the classification rules from CA and test them on CT. The classification accuracy is computed as the average of the accuracies of the 10 iterations. The encoding and experimental process is illustrated by Fig. 7.

Fig. 7. Encoding and experimental process
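The partitioning and averaging steps of this process can be sketched in Python as follows. The sketch covers only the splitting of the sequence indices and the final averaging; it does not call WEKA or C4.5, it does not stratify the folds by class, and the function names and the fixed seed are assumptions made for the illustration.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds of near-equal size."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[f::k] for f in range(k)]


def cross_validation_splits(n_items, k=10, seed=0):
    """Yield (training indices, test indices) once for each of the k folds."""
    folds = k_fold_indices(n_items, k, seed)
    for held_out in range(k):
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, folds[held_out]


def mean_accuracy(fold_accuracies):
    """Classification accuracy reported for a sample: average over the iterations."""
    return sum(fold_accuracies) / len(fold_accuracies)


# Sample S2 contains 14 + 26 = 40 sequences.
splits = list(cross_validation_splits(40))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For the 40 sequences of the sample S2, each iteration trains on 36 sequences and tests on the remaining 4; the training context CA and the test context CT would be built from these two index lists.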


5.3 Results

We first examine each method individually while varying its parameters to seek their optimal values (Table 3). Then, we use the best parameters found to compare the methods in terms of accuracy (rate of correctly classified sequences), number of attributes and CPU time (in seconds). For the DDSM method we use the BLOSUM62 and PAM250 substitution matrices. Results are shown in Table 4.

Table 3. Best parameters

Method   Parameter             S1         S2         S3
NG       N                     3          3          4
AM       Min length            3          3          3
AM       Min activity          50 %       50 %       25 %
DD       Alpha                 0          0          0
DD       Beta                  0          0          0
DDSM     Alpha                 0          0          -
DDSM     Beta                  0          0          -
DDSM     Substitution matrix   BLOSUM62   BLOSUM62   -
DDSM     Threshold             0.7        0.9        -

Table 4. Experimental results

Sample   Measure      NG        AM        DD        DDSM
S1       Accuracy     90 %      95 %      95 %      98.33 %
S1       Attributes   4777      1978      4709      2139
S1       Time (s)     0.82      38        35        37
S2       Accuracy     60 %      55 %      67.5 %    77.5 %
S2       Attributes   5340      3458      6839      6562
S2       Time (s)     1         91        921       954
S3       Accuracy     73.58 %   77.78 %   77.78 %   -
S3       Attributes   244       314       701       -
S3       Time (s)     0.05      2         1.57      -

6 Discussion

According to the experimental study, there are no single optimal values for the parameters of the studied methods. In fact, these values depend on the nature of the data. So, tuning these parameters requires prior knowledge of the data characteristics, such as the lengths of the preserved regions and the mutation rate between the sequences of a family.


The experimental results vary according to the input data. The classification of the sample S1 was relatively easy since the three protein families are completely distinct. Each of them probably has its own motifs which characterize it and discriminate it from the others. This explains the high accuracy reached by all the methods, especially the Discriminative Descriptors with Substitution Matrix method, which reached 98.33 %. The Active Motifs and Discriminative Descriptors methods achieved the best classification accuracy for the sample S3 (the DDSM method does not apply to this sample since it contains nucleic sequences). The sample S2 represents a hard classification challenge since the human TLR and the non-human TLR resemble each other in terms of function and structure. Indeed, the two classes share many similar parts, which explains the low accuracy obtained with the Active Motifs method: this method, which extracts motifs based on their occurrences, builds attributes which belong to both classes at the same time, which increases the possibility of confusion. The N-Grams method achieved a better accuracy, but did not reach the default accuracy of 65 % (obtained by assigning all the sequences to the non-human TLR class). The Discriminative Descriptors method outperforms these two methods. Since it adopts a discrimination-based approach to build the attributes, it allowed a better distinction between the human TLR and the non-human TLR. But to further improve classification on the sample S2, it is necessary to take into account the phenomenon of mutation and substitution between amino acids. Indeed, the DDSM method which we proposed made it possible to reach the highest accuracy while reducing the number of generated attributes.
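The default accuracy of 65 % quoted for the sample S2 follows directly from the class sizes of Table 2, as the short Python check below shows. The function name majority_baseline is a hypothetical choice.

```python
def majority_baseline(class_sizes):
    """Accuracy obtained by always predicting the largest class."""
    return max(class_sizes.values()) / sum(class_sizes.values())


s2 = {"human TLR": 14, "non-human TLR": 26}  # class sizes from Table 2
print(majority_baseline(s2))  # 26 / 40
```

Any encoding whose accuracy on S2 stays below this value, such as NG (60 %) or AM (55 %) in Table 4, does no better than assigning every sequence to the non-human TLR class.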

7 Conclusion

In this paper, we presented the encoding of biological sequences as a pre-processing step before their classification. We described three existing encoding methods based on motif extraction: N-Grams (NG), Active Motifs (AM) and Discriminative Descriptors (DD). Then, we proposed a modification of the DD method that uses a substitution matrix.

In order to examine the effect of each encoding method on the classification accuracy, we undertook an experimental study on various biological data comprising protein and nucleic sequences. We also compared the numbers of attributes generated by these methods and their CPU times. Among the existing methods, we noticed that the DD method gives the best accuracy. The modification of this method by the use of a substitution matrix made it possible to improve the classification even in a relatively delicate case. However, we noticed that the DD method, like its variant with the substitution matrix, is very expensive in terms of CPU time compared with the other methods, especially NG.

This work opens several directions. It would be interesting to design a hybrid encoding method based on the N-Grams, which uses filters like α and β and takes into account the substitution and the order of the extracted motifs in the sequences. We will also try to use other amino acid substitution matrices and to extend the modification which we proposed for the DD method to nucleic sequences by using DNA score tables.

References

1. Altschul S. F., W. Gish, W. Miller, E. W. Myers, D. J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, Vol. 215(3), pp. 403-413, 1990.

2. Han J., M. Kamber. Data Mining: Concepts and Techniques. ISBN 1-55860-489-8. Morgan Kaufmann Publishers: www.mkp.com, 2001.

3. Henikoff S., J. G. Henikoff. Amino acid substitution matrices from protein blocks. National Academy of Sciences, USA, 89, pp. 10915-10919, 1992.

4. Hui L. C. K., M. Crochemore, Z. Galil, & U. Manber, (ed.). Combinatorial Pattern Matching. Lecture Notes in Computer Science in Apostolico, Springer-Verlag, 644, pp. 230-243, 1992.

5. Karp R., R. E. Miller, A. L. Rosenberg. Rapid Identification of Repeated Patterns in Strings, Trees and Arrays. 4th Symposium of Theory of Computing, pp. 125-136, 1972.

6. Leslie C., E. Eskin, & W. S. Noble. The spectrum kernel: a string kernel for SVM protein classification. Pac Symp Biocomput, pp. 564-575, 2002.

7. Maddouri M. & M. Elloumi. A data mining approach based on machine learning techniques to classify biological sequences. Knowledge Based Systems Journal, March 2002.

8. Maddouri M. & M. Elloumi. Encoding of primary structures of biological macromolecules within a data mining perspective. Journal of Computer Science and Technology (JCST), Vol. 19, num. 1. Allerton Press, pp. 78-88, USA, 2004.

9. Miller E., D. Shen, J. Liu & C. Nicholas. Performance and scalability of a large-scale N-gram Based Information Retrieval System. Journal of Digital Information, 1999.

10. National Human Genome Research Institute. National Institute of Health. Available: http://www.nhgri.nih.gov/, June 2006.

11. Nevill-Manning C. G., T. D. Wu, & D. L. Brutlag. Highly specific protein sequence motifs for genome analysis. Proceedings of the National Academy of Sciences of the United States of America, 95(11), pp. 5865-5871, 1998.

12. Wang J. T. L., T. G. Marr, D. Shasha, B. A. Shapiro, & G.-W. Chirn. Discovering active motifs in sets of related protein sequences and using them for classification. Nucleic Acids Research, 22(14), pp. 2769-2775, 1994.

13. Witten I. H. & E. Frank. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
