A Method for Gene Detection Based on Maximum Likelihood and A-Star


Luciana Ribeiro Matos, Nicolas Carels, Diego Frías
Laboratório de Bioinformática da UESC

Rodovia Ilhéus-Itabuna km 16, Itabuna, [email protected]

Abstract: This article describes the use of A-star and maximum likelihood for coding sequence detection. Maximum likelihood maximizes the discrimination between coding and non-coding sequences, and A-star finds the shortest path to the best solution. Combining the two provides a contextual process of gene finding. The A-star algorithm generates a collection of sequence pieces from the original stretch. This dissection is guided by a tree structure in which the largest pieces are found near the root. The algorithm branches through the numerical states that describe each sequence piece and searches for the shortest tree branch (the largest piece of sequence) with the maximum LOD score (the best exon candidate). The search for the shortest branch is enforced by a penalty function that considers the LOD score (maximum likelihood) and the number of branches generated to reach the current point in the tree. The search stops when the maximum LOD value is found, which is detected by monitoring the history of this variable during tree exploration. The method was found to be efficient on DNA sequences from Arabidopsis thaliana and Saccharomyces cerevisiae.

Keywords: A-star algorithm, maximum likelihood, gene finding.

I. INTRODUCTION

Detection of coding regions in large DNA sequences is a complex task that is still an open problem [MAT 02]. It can be approached in two ways: (1) sliding a window of fixed size along the nucleotide sequence from beginning to end, with a coding measure computed at each step of the window [TIW 97]; or (2) dividing the sequence into a set of segments of variable length, each of which is analyzed with a coding measure.

In the former case, a window of size N_J moves by steps of N_v ≤ N_J nucleotides. At each step, the coding potential is evaluated by an appropriate measure. For gene finding, the measure must be related to the coding state of the sequence stretch in the window: when the window enters a coding region the value of the measure should increase, and it should decrease when the window returns to a non-coding region such as an intron or an intergenic sequence. In theory, this process should allow exon boundaries to be identified; in practice, it depends largely on the sensitivity of the measure. Although gene finding methods based on sliding windows are easy to implement, they have several problems. The coding measure is strongly influenced by the window length, and there is always a minimum length below which coding and non-coding sequences (exons and introns, for instance) cannot be distinguished. The computational cost also increases nonlinearly as N_v and N_J decrease. In addition, the sensitivity of the measure depends on the window size, but in this kind of technique the window size cannot be adjusted to the size of the coding region. For this reason, we set out in this study to develop a contextual methodology that does not have this weakness.
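The sliding-window scan described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and `gc_fraction` is a placeholder measure chosen to keep the example self-contained, not the coding measure used in this study.

```python
from typing import Callable, List, Tuple


def sliding_window_scan(seq: str, window: int, step: int,
                        measure: Callable[[str], float]) -> List[Tuple[int, float]]:
    """Return (start, measure value) for each window of size `window` (N_J)
    moved along `seq` by `step` (N_v) nucleotides, with N_v <= N_J."""
    if not 0 < step <= window:
        raise ValueError("step must satisfy 0 < step <= window")
    return [(start, measure(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


def gc_fraction(fragment: str) -> float:
    """Placeholder measure (an assumption for this example only)."""
    return sum(base in "GC" for base in fragment) / len(fragment)


# Toy sequence: a GC-rich stretch flanked by AT-rich stretches.
profile = sliding_window_scan("ATATATGCGCGCGCATATAT", window=6, step=2,
                              measure=gc_fraction)
```

The number of windows evaluated grows as the step shrinks, which is the cost issue raised above, and the fixed `window` argument is the parameter that cannot follow the size of the coding region.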

The second case offers the possibility of a contextual methodology for coding region detection. For instance, sampling subsequences from the original sequence by removing a varying number of windows from its left and right sides has this effect. It is equivalent to searching for the largest subsequence that has the highest coding probability. At each search step (loop), the number of windows removed from the sequence increases by one, and a coding measure value is associated with each resulting subsequence. The coding measure value must be affected by the depth of the subsequence in the tree of states. A simplified example of subsequence generation is given in Fig. 1, which displays the tree structure of the windowing process and the contextual approach to the “best” coding stretch.

Fig. 1: Systematic Method. The best solution for coding region identification is the first solution of level 2, since it is the one that best preserves the size and the integrity of the coding region. In level 1, the non-coding part of the subsequence evaluated by the coding measure is too large to yield the best coding candidate. In (2) and (3) of level 2, the coding measure is decreased by the combination of the reduced size of the coding stretch and the large proportion of non-coding DNA.

Such a strategy can be automated by combining decision trees with coding measure techniques. In this study, we chose the A-star algorithm [RIC 93] to automate the decision tree and maximum likelihood [DUR 98] as the coding measure.

A-star is a graph-based search algorithm that guarantees finding the shortest path to the best solution (a property called admissibility). Graphs are data structures built from pointers that refer to data scattered in computer memory. A graph stores data in structures named nodes, which represent data features. The binary tree is a kind of graph that groups nodes in a hierarchical structure, which allows them to be retrieved. Here, a node represents the features of the search model in a particular state. The best solution consists of restricting node generation to just those nodes that are necessary to find the best coding sequence candidate (the goal node). In other words, A-star can be used to retrieve the path through the nodes that leads to the best coding sequence candidate. The binary tree is used for this purpose because its architecture is compatible with A-star and with the process of subsequence sampling.

In a binary tree, the first node in the hierarchy is named the “root” and represents the primary state of the tree. Each node can generate one or two more nodes, hence the name binary, which makes it easy to simulate the sequence states shown in Fig. 2. Each node keeps track of the node that generated it and of the nodes it generates.

Fig. 2: Deductive Method. The binary tree (on the left) has six nodes. Each node corresponds to a particular state of the subsequence sampling model (on the right) and points (solid arrows) to nodes of higher hierarchical levels (deeper in the tree). Each level of the subsequence sampling process shows its increasing combinatorial complexity. Each level l corresponds to l+1 putative coding subsequences.

To optimize the trajectory through the node tree, A-star uses a route function and classifies nodes into two sets, the open list and the closed list. The open list keeps track of the nodes that still need to be analyzed, while the closed list keeps track of the nodes that have already been analyzed.

Initially, the open list contains just the initial node and the closed list is empty. The search starts with the root. It generates two new nodes that are added to the open list, while the root is transferred to the closed list. The route function is then calculated for the two new nodes. All the nodes in the open list (two at this point) are compared by their route function values. The one with the minimal value is chosen as the parent of new node(s) and is subsequently transferred to the closed list, and so on. The route function is the sum of the heuristic and cost functions, both of which have important consequences for the behavior of the tree. The cost function represents the computational cost of generating new nodes: the more nodes are generated, the higher the cost function. It is therefore a penalty on node generation, and it minimizes the cost of the path from the root node to the goal node. The cost function is cumulative and, therefore, global. In contrast, the heuristic function is local, since it only characterizes the quality of the selected node in relation to the goal node.

Fig. 3: The A-star algorithm and the binary tree. The goal node is the objective of the algorithm. Since it is unknown a priori, the algorithm refers to an attractor (LODmax) to which it compares the nodes from the open list. The penalty score reflects the computational cost of exploring the tree. Therefore, the cost function increases at each node level in the tree, and for this reason it is global. By contrast, the heuristic function is local: it qualifies the efficiency of the algorithm in its progress toward its objective (the goal node).

The maximum likelihood estimator (MLE) maximizes the probability that a particular event fits a given model [DUR 98]. In this study, the ML estimator is a coding measure that maximizes the difference between coding and non-coding sequences. We used ML to measure the “distance” between the codon probability distribution of the coding sequences of a given species (the model) and the codon frequency distribution of the sequence under consideration. If the sequence under consideration is coding, its codon usage will be close to the model; if not, its triplet distribution will be closer to a random distribution (as in introns and intergenic sequences). ML uses a function called the logarithm of odds (LOD) to measure, in terms of probabilities, the codon usage distance between a given sequence and the model. Since the relative frequency of an event is asymptotically equivalent to its probability, constructing the model consists of calculating a vector M containing the relative codon frequencies M_i, i=1,...,64, over a large number of coding sequences.

In this paper, we describe a contextual procedure controlled by the A-star and MLE algorithms that stops the search for the best coding candidate once the global maximum of the corresponding coding measure has been detected.

II. MATERIAL AND METHOD

A. THE SEQUENCES

Coding sequences, introns (284) and contigs (1,380) from Arabidopsis thaliana were extracted from GenBank (Release 142, June 2004) using the Infobiogen server (see http://www.infobiogen.fr) and the ACNUC/QUERY retrieval system [GOU 85].

B. THE CODING MEASURE

The maximum likelihood estimator (MLE) is the coding measure that we used to discriminate coding from non-coding sequences. It uses the logarithm of odds (LOD), which measures the distance, in probability terms, between the codon distribution of a given sequence and that of a model, which is the reference for the species under consideration. Considering a codon i with model probability M_i, the LOD of a sequence S is computed by Formula (1), where N_i(S) is the number of times that the i-th codon appears in S.

$$ \mathrm{LOD}(S) = \sum_{i=1}^{64} \left[ \log M_i - \log \frac{1}{64} \right] N_i(S) \qquad (1) $$

M_i was calculated from a gene set representative of the species under consideration. We verified that the coding sequences of that gene set had all been obtained by laboratory experimentation. When LOD(S) > 0, S was classified as coding; otherwise S was assumed to be non-coding. The probability that the sequence S belongs to its assigned class is proportional to the LOD score.
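As an illustration of Formula (1), the LOD score can be sketched as below. The function names are hypothetical, and the pseudocount used when training the model is an assumption introduced here to avoid log(0) for unseen codons; it is not part of the method described in this paper.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, Iterable

# The 64 possible codons over the DNA alphabet.
CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_counts(seq: str) -> Counter:
    """Count non-overlapping codons in reading frame 0 (N_i(S) in Formula 1)."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))


def train_model(coding_seqs: Iterable[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Relative codon frequencies M_i; the pseudocount is an assumption."""
    counts = Counter()
    for s in coding_seqs:
        counts.update(codon_counts(s))
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def lod_score(seq: str, model: Dict[str, float]) -> float:
    """LOD(S) = sum_i [log M_i - log(1/64)] * N_i(S); unknown triplets are skipped."""
    background = math.log(1 / 64)
    return sum((math.log(model[c]) - background) * n
               for c, n in codon_counts(seq).items() if c in model)


# Toy model trained on a single made-up coding sequence rich in GCT.
toy_model = train_model(["ATG" + "GCT" * 4 + "AAG"])
```

With this toy model, a sequence made of frequent codons such as `"GCTGCTGCT"` receives a positive score and would be classified as coding, whereas `"CCCCCCCCC"` receives a negative one.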

To verify that the probability vector M was indeed representative of the codon probability distribution of the whole gene population, we investigated the number of nucleotides above which M becomes stable. For this, we calculated a succession of vectors {M(k), k=1,2,...} using k*N nucleotides, with N a multiple of 3. The monotonic convergence of this succession is guaranteed if |M(k) - M(k-1)| < |M(k-1) - M(k-2)|, where |.| is the vector norm. Under this condition, for any real e_M > 0 (a bound on the norm of the differences), there is a number k* such that |M(k*) - M(k*-1)| < e_M. We chose e_M small enough for the procedure to run until the above conditions were satisfied.
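The stability check on M can be sketched as follows. This is a minimal sketch with hypothetical function names; the text does not say which vector norm was used, so the Euclidean norm here is an assumption.

```python
import math
from collections import Counter
from itertools import product
from typing import List, Optional

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(nucleotides: str) -> List[float]:
    """Relative frequency of each of the 64 codons in reading frame 0."""
    counts = Counter(nucleotides[i:i + 3]
                     for i in range(0, len(nucleotides) - 2, 3))
    total = sum(counts[c] for c in CODONS)
    return [counts[c] / total if total else 0.0 for c in CODONS]


def convergence_point(coding: str, n: int, eps: float) -> Optional[int]:
    """Smallest k such that |M(k) - M(k-1)| < eps, where M(k) is computed
    from the first k*n nucleotides; None if no such k is reached."""
    if n <= 0 or n % 3:
        raise ValueError("n must be a positive multiple of 3")
    previous = None
    for k in range(1, len(coding) // n + 1):
        current = codon_frequencies(coding[:k * n])
        # Euclidean norm of the difference (assumed choice of norm).
        if previous is not None and math.dist(current, previous) < eps:
            return k
        previous = current
    return None
```

Here `coding` stands for the concatenation of the training coding sequences, and the returned k multiplied by `n` gives the number of nucleotides at which the vector is considered stable.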

Then, we investigated the sensitivity of the LOD score to the length of coding and non-coding sequences by plotting the LOD score as a function of sequence size.

C. THE DEDUCTIVE METHOD

Since the LOD score is proportional to the sequence size, the precision of the coding measure is higher if coding sequences are searched for by scanning from large sequence pieces to smaller ones.

For this, we chose a window whose size in nucleotides is proportional to the length of the sequence to analyze. Each tree level corresponds to a maximum number of windows that can be removed, in various combinations, from the original sequence to produce several subsequences. For instance, level 1 has just one window to remove from either side of the original sequence and generates two nodes from the root node, corresponding to two subsequences (Fig. 2). Similarly, level 2 has one more window to remove. The number of window combinations to remove is equal to the number of subsequences generated, which is the tree level plus one (n+1). As a result, only one of the two nodes generated in level 1 will generate two subsequences. In the example of Fig. 1, it is the right node of level 1 that generates two nodes. The left node can generate just one subsequence, since the other possible one has already been generated by the right node. Therefore, there are three possible window combinations in level 2: (i) two windows cut from the left side, (ii) two windows cut from the right side, or (iii) one window cut from each side of the original sequence. At each new node, the size of the subsequence is decreased by one more window on the left or the right side, generating a new non-redundant sample of the original sequence. In this way, some nodes have no children, because the corresponding combinations have already been generated by sibling nodes and cannot be repeated.
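The non-redundant window combinations of a given tree level can be enumerated directly, as in the following sketch. The function name and the letter-based toy sequence are hypothetical and serve only to make the n+1 subsequences per level visible.

```python
from typing import List, Tuple


def level_subsequences(seq: str, window: int, level: int) -> List[Tuple[int, int, str]]:
    """Return (windows cut on the left, windows cut on the right, subsequence)
    for every way of removing `level` windows in total from both ends."""
    result = []
    for left in range(level + 1):
        right = level - left
        start, end = left * window, len(seq) - right * window
        if start < end:  # skip combinations that leave nothing
            result.append((left, right, seq[start:end]))
    return result


# Toy example: 10 symbols, windows of 2 symbols, tree level 2.
level_2 = level_subsequences("ABCDEFGHIJ", window=2, level=2)
```

Each (left, right) pair appears once per level, which corresponds to the rule that a combination already produced by a sibling node is not generated again.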

Each node of the binary tree is a mathematical model of the corresponding subsequence. All the characteristics of the subsequence (length, window size, LOD score and number of windows removed) are stored in the node. Using A-star, we compared the coding potential (LOD score) of the subsequences through their corresponding nodes. The route function (F) is the A-star function that determines the path to the best coding region candidate (goal node, Fig. 3). It is given by Formula 2, where H(n) is the heuristic function and C(n) is the cost function:

$$ F(n) = H(n) + C(n) \qquad (2) $$

Since the coding measure is obtained through the LOD score, we based the cost and heuristic functions on it. However, the route function does not have a well-defined target, because the coding sequence is unknown a priori. To solve this problem, we considered the best coding candidate to be the one whose LOD score is closest to a reference value, LODmax, which must be chosen higher than any value that the tree nodes can take. However, LODmax must remain realistic, so that nodes can be compared with each other without losing sensitivity.

The cost function penalizes each new level that is generated. The cost function increases in absolute value at each new node; N is the number of nodes generated up to the last one and LOD(n) is the LOD score of the selected node. Formula 3 describes the cost function:

$$ C(n) = \left[ 1 + A_c \left( \mathrm{LOD}_{max} - \mathrm{LOD}(n) \right) \right] N \qquad (3) $$

In contrast to C, H decreases when approaching the best node. Formula 4 describes the heuristic function, where K_h is a constant that keeps H in the same range as C:

$$ H(n) = K_h \left[ \mathrm{LOD}_{max} - \mathrm{LOD}(n) \right] \qquad (4) $$
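A best-first search driven by the route function of Formulas 2 to 4 might look like the sketch below. It is not the implementation used in this study: the function name, the default values of `a_c` and `k_h`, the `max_nodes` guard and the generic `score` argument (standing in for the LOD score) are all assumptions, and the cost is read as [1 + A_c (LODmax - LOD(n))] N. No trend-based stopping rule is included, so the search ends only when the open list is exhausted or `max_nodes` is reached.

```python
import heapq
from typing import Callable, Tuple


def astar_best_subsequence(seq: str, window: int,
                           score: Callable[[str], float], lod_max: float,
                           a_c: float = 0.01, k_h: float = 1.0,
                           min_len: int = 200,
                           max_nodes: int = 10_000) -> Tuple[float, int, int]:
    """Return (score, start, end) of the best subsequence evaluated."""

    def bounds(left: int, right: int) -> Tuple[int, int]:
        return left * window, len(seq) - right * window

    def evaluate(left: int, right: int, generated: int) -> Tuple[float, float]:
        start, end = bounds(left, right)
        lod = score(seq[start:end])
        gap = lod_max - lod
        cost = (1 + a_c * gap) * generated   # Formula 3, N = nodes generated so far
        heuristic = k_h * gap                # Formula 4
        return heuristic + cost, lod         # Formula 2

    generated = 1
    f, lod = evaluate(0, 0, generated)
    open_list = [(f, 0, 0)]                  # heap ordered by route value F
    visited = {(0, 0)}                       # nodes in the open or closed list
    best = (lod, 0, len(seq))
    while open_list and generated < max_nodes:
        # Popping the node with minimal F moves it to the (implicit) closed list.
        _, left, right = heapq.heappop(open_list)
        for child in ((left + 1, right), (left, right + 1)):
            start, end = bounds(*child)
            if child in visited or end - start < min_len:
                continue                     # duplicate of a sibling's child, or too short
            visited.add(child)
            generated += 1
            f, lod = evaluate(child[0], child[1], generated)
            if lod > best[0]:
                best = (lod, start, end)
            heapq.heappush(open_list, (f, child[0], child[1]))
    return best


# Toy run: 'C' symbols stand for coding positions, 'N' for non-coding ones.
toy = "N" * 4 + "C" * 8 + "N" * 6
result = astar_best_subsequence(toy, window=2,
                                score=lambda s: s.count("C") - s.count("N"),
                                lod_max=18.0, min_len=2)
```

In the toy run the score is simply the number of coding symbols minus the number of non-coding ones, so the best subsequence is the block of `C` symbols alone.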

The sequence of events just described is carried out at each new node.

Finally, to save computer resources, we designed a decision function that computes the trend of the LOD score. When the decision function detects that the LOD score has been continuously decreasing over a number of program cycles corresponding to the value calculated by Formula 5 (or if the subsequence size is < 200 nucleotides), it stops the search process and detects the global maximum. The node corresponding to the global maximum represents the subsequence with the highest coding probability: the one that, according to our method, should be considered as an exon or a coding sequence.

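$$ \left( \frac{\partial \mathrm{LOD}}{\partial n} \right)_k = \frac{\sum_{n=1}^{k} \left( \mathrm{LOD}_n - \mathrm{LOD}_{raiz} \right)}{\sum_{n=1}^{k} n} \qquad (5) $$

where k is the number of nodes generated by the binary tree, LODraiz is the LOD of the first (root) node of the tree, and LODn is the LOD value of node n.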

When the value of the function (∂LOD/∂n)_k starts to decrease, the program stops generating new nodes and searches for the global maximum.
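The decision function of Formula 5 can be sketched as follows. The function names are hypothetical, and the `patience` parameter (the number of consecutive decreases required before stopping) is an assumption added for illustration; with the default of 1 the search stops at the first decrease.

```python
from typing import List, Optional, Sequence


def lod_trend(lods: Sequence[float]) -> List[float]:
    """Value of Formula 5 for k = 1..len(lods); lods[0] is the root LOD."""
    root = lods[0]
    values, numerator, denominator = [], 0.0, 0
    for n, lod in enumerate(lods, start=1):
        numerator += lod - root      # running sum of (LOD_n - LOD_raiz)
        denominator += n             # running sum of n
        values.append(numerator / denominator)
    return values


def stopping_node(lods: Sequence[float], patience: int = 1) -> Optional[int]:
    """1-based number of the node at which the trend has decreased for
    `patience` consecutive nodes; None if that never happens."""
    values = lod_trend(lods)
    run = 0
    for k in range(1, len(values)):
        run = run + 1 if values[k] < values[k - 1] else 0
        if run >= patience:
            return k + 1
    return None
```

For a made-up LOD history such as `[0, 2, 4, 6, 3, 1]`, the trend rises over the first four nodes and first decreases at node 5, which is where `stopping_node` would halt node generation.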

D. PERFORMANCE CHARACTERIZATION

To test the performance of this process, we generated 100 sequences containing a coding sequence of size varying between 200 and 15,000 nucleotides bordered by non-coding sequences. Since we knew the exact position of the coding sequence, we could test the detection efficiency of the above algorithm. Two different situations were recognized: (i) a coding sequence is predicted where none exists (false positive) or an existing one is not detected (false negative), and (ii) the coding sequence is imperfectly recognized. The first case was estimated by the percentage of wrong predictions relative to the annotated coding regions. In the second case, the A-star prediction may differ from the annotation by several codons that are missing from the prediction (a_astar - a_exon) or that extend beyond the annotation (b_astar - b_exon) (Fig. 4).

Fig. 4: Process of error quantification. To estimate the error in coding stretch identification, we compared the size of the A-star prediction to the contig annotations of A. thaliana. The predicted coding stretch can be larger, smaller, or larger on one side and smaller on the other. To quantify this imprecision in the size determination of the coding sequence, we sum the differences d_a and d_b between the extremities and express the sum as a percentage of the length of the annotated exon.

We measured this inaccuracy with Formula 6, where l_exon is the size of the annotated exon and d is expressed in percent:

$$ d = \left[ \sum_{i=1}^{n-1} \frac{d_a^i + d_b^i}{l_{exon}^i} \right] \cdot 100 \qquad (6) $$
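The per-exon term of Formula 6 can be illustrated as follows. The function name and the coordinate convention (start, end) are hypothetical, and taking absolute values of the boundary differences is an assumption made so that over- and under-prediction both count as imprecision.

```python
from typing import Tuple


def boundary_error_percent(predicted: Tuple[int, int],
                           annotated: Tuple[int, int]) -> float:
    """(d_a + d_b) / l_exon * 100 for one predicted/annotated pair."""
    (a_pred, b_pred), (a_exon, b_exon) = predicted, annotated
    l_exon = b_exon - a_exon
    if l_exon <= 0:
        raise ValueError("annotated exon must have positive length")
    d_a = abs(a_pred - a_exon)   # difference at the left extremity
    d_b = abs(b_pred - b_exon)   # difference at the right extremity
    return 100.0 * (d_a + d_b) / l_exon


# Made-up coordinates: prediction 10 nt too long on each side of a 1,000 nt exon.
example_error = boundary_error_percent((90, 1110), (100, 1100))
```

In this made-up case the two 10-nucleotide overhangs amount to 2% of the annotated exon length.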

We estimated the algorithm performance (false positives and false negatives) by applying Formula 6 to a sample of 100 coding sequences (partial or not) flanked by intergenic sequences extracted from real Arabidopsis contigs. The coding sequence size varied between 200 and 2,000 bp.

The whole procedure was implemented in the C++ programming language [STR 00].

III. RESULTS

We found that the relative norm of the differences, |M(k) - M(k-1)| / |M(2) - M(1)|, started to satisfy the convergence conditions for e_M > 1.0E-5 at 40,000 nucleotides (Fig. 5). Given this convergence, the corresponding probability vector was used as the model of the gene population.

Fig. 5: The convergence of the MLE. Analyzing the difference between successive probability vectors for increasing amounts of coding sequence, in steps of 999 bp, we found that in Arabidopsis thaliana the vectors start to be representative of the gene population above 40,000 bp.

We then analyzed the relationship between LOD and sequence size in coding (Fig. 6) and non-coding sequences (Fig. 7). Both regressions were found to be linear, with approximately the same slope but opposite signs. As a result, any small non-coding sequence interrupting a coding sequence will decrease its coding measure. Furthermore, the discrimination efficiency increases with the size of the sequence under consideration. The discrimination error is null above 1,000 bp. The error rate increases significantly below 600 bp, where we found ~8% of false negatives for coding sequences and ~10% of false positives for non-coding sequences.

Fig. 6: Dependence of MLE (LOD) on coding sequence size. The false negatives are CDS with a LOD score below 0. They were only found for CDS sizes below 600 bp.

Fig. 7: Dependence of MLE (LOD) on non-coding sequence size. The false positives are non-coding sequences with a LOD score above 0. The vast majority of them were found for sequence sizes below 600 bp.

When the decision function was limited to a threshold size of 200 bp, the program generated many more nodes than necessary, resulting in increased and useless computation and memory costs (Fig. 8). The decision function that takes the cost and heuristic functions into consideration is much more efficient, since the program stops quickly after detecting the global maximum (Fig. 9).

Fig. 8: Program efficiency based on a threshold of sequence size. The number of nodes generated reaches 600, 9 times more than necessary. The bold line represents the average of each 100 LOD values.

Fig. 9: Decision function and program efficiency. By taking the cost and heuristic functions into account, the algorithm succeeds in quickly detecting the global maximum, which is 25 in this case, and stops the ongoing process.

The deductive method of coding region tracking was found to have a discrimination efficiency of 99.3% on the partial contig sample of 200 sequences (100 non-coding and 100 coding sequences, Fig. 10). The algorithm successfully identified 100% of the coding sequences, but with some imprecision. Most of this imprecision was around 2% of the coding region size, and it was at most 10% (Fig. 11).

Fig. 10: Discrimination efficiency of the deductive method of coding region tracking. It was found to have a discrimination efficiency of 99.3% on the partial contig sample of 200 sequences (100 non-coding and 100 coding sequences).

Fig. 11: Precision of coding region detection. The algorithm successfully identified 100% of the coding sequences, but with some imprecision. Most of this imprecision was around 2% of the coding region size, and it was at most 10%.

IV. DISCUSSION

The method described above is a deductive process of contextual coding region identification. It optimizes the time spent in the search process through a binary tree structure based on cost, heuristic and decision functions. The methodology can be implemented with any coding measure, provided that it is positively correlated with the coding potential of the sequence.

The maximum likelihood estimator (MLE) depends on how well the model represents the real population or, in this study, on how well the probability vector represents the coding sequence (CDS) population. Since the genes of the training set must be representative of the complete population, it was important to select only experimentally characterized genes. However, the availability of such genes is relatively limited (~1000 for Arabidopsis thaliana). We were therefore interested in finding out whether the number of codons in such a limited sample is sufficient to be representative of the gene population. Since the model converged within the limits of the sample size, we could use it as an estimator of the true gene population. MLE could therefore be used as a coding measure for coding region tracking with A-star applied to a binary tree.

This methodology is interesting because it provides a contextual approach to the best coding candidate in an unknown sequence. Probabilistic estimators of coding sequences such as MLE are very sensitive to sequence size. Since they are poorly efficient for small sequences, they do not perform well when used in a serial approach such as that of sliding windows. Small CDS or exons have small LOD values, and if the window size is not optimized for their size, a significant part of the window will most probably be occupied by non-coding DNA. This drowns the weak coding measure of the coding stretch in the noise due to the non-coding DNA. Conversely, if the window is chosen small, the probability of false negatives increases, since the coding measure is at its performance limit (the limit of statistical representativity). Under such conditions, it is frequent to find subregions of a coding sequence whose codon distribution is close to that of non-coding DNA. Likewise, one would easily find short intron regions that have the characteristics of CDS with respect to codon distribution. For these reasons, contextual approaches are preferred when probabilistic methods of gene tracking are applied. The method of coding stretch tracking by recursively cutting pieces of sequence from the left and right sides is contextual in essence and well adapted to binary tree technology. The binary tree, in turn, is well adapted to the A-star search process.

MLE gives an error rate of around 20% for coding sequence identification below 600 bp. However, the contextual methodology offered by its association with A-star can easily be adapted to other coding measures. The algorithm is set up to detect only one coding region at a time. This raises the question of how to implement it so that it can detect successive exons of large genes or successive CDS in DNA regions of high gene density. A simple solution is to run the algorithm within a window sliding over the DNA sequence to be analyzed. The trick is to choose sliding steps smaller than the window size, in the range of the smallest introns, and a window size in the range of the average exon size compatible with the error rate. In this way, the algorithm has the possibility of detecting most exons. The drawback of this approach is that it generates high redundancy in the results and overloads computer resources.

ACKNOWLEDGMENTS

This research was supported by the Brazilian agency CNPq, which provided research fellowships to N.C. and L.R.M.

REFERENCES

[ALT 97] ALTSCHUL, S.F., MADDEN, T.L., SCHAFFER, A.A., ZHANG, J., ZHANG, Z., MILLER, W. and LIPMAN, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402.

[DUR 98] DURBIN, R., EDDY, S.R., KROGH, A. and MITCHISON, G. (1998) Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 368 pp.

[GOU 85] GOUY, M., GAUTIER, C., ATTIMONELLI, N., LANAVE, C. and DI PAOLA, G. (1985) CABIOS 1, 167-172.

[MAT 02] MATHÉ, C., SAGOT, M.F., SCHIEX, Th. and ROUZÉ, P. (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30, 4103-4117.

[RIC 93] RICH, E. and KNIGHT, K. (1993) Inteligência Artificial; trad. Maria Cláudia Santos Ribeiro Ratto; revisão técnica Álvaro Antunes. São Paulo: Makron Books, p. 92-96.

[STR 00] STROUSTRUP, B. (2000) A linguagem de programação C++; trad. Maria Lúcia Blanck Lisboa e Carlos Arthur Lang Lisboa. 3 ed. Porto Alegre: Bookman, p. 22-23.

[TIW 97] TIWARI, S., RAMACHANDRAN, S., BHATTACHARYA, A., BHATTACHARYA, S. and RAMASWAMY, R. (1997) Prediction of probable genes by Fourier analysis of genomic sequences. CABIOS 13, 263-270.
