
Experimental Determination of Drosophila Embryonic Coordinates by Genetic Algorithms, the Simplex Method, and Their Hybrid

Alexander V. Spirov 1, Dmitry L. Timakin 2, John Reinitz 3, and David Kosman 3

1 The Sechenov Institute of Evolutionary Physiology and Biochemistry, 44 Thorez Ave., St. Petersburg, 194223, Russia

2 Dept. of Automation and Control Systems, Polytechnic University, 29 Polytechnic St, St. Petersburg, 194064, Russia

3 Dept. of Biochemistry and Molecular Biology, Box 1020 Mt. Sinai Medical School, One Gustave L. Levy Place, New York, NY 10029 USA

Abstract. Modern large-scale "functional genomics" projects are inconceivable without the automated processing and computer-aided analysis of images. The project we are engaged in is aimed at the construction of heuristic models of segment determination in the fruit fly Drosophila melanogaster. The current emphasis in our work is the automated transformation of gene expression data in confocally scanned images into an electronic database of expression. We have developed and tested programs which use genetic algorithms for the elastic deformation of such images. In addition, genetic algorithms and the simplex method, both separately and in concert, were used for experimental determination of Drosophila embryonic curvilinear coordinates. Comparative tests demonstrate that the hybrid approach performs best. The intrinsic curvilinear coordinates of the embryo found by our optimization procedures appear to be well approximated by lines of isoconcentration of a known morphogen, Bicoid.

1 Introduction

1.1 Computer-Aided Analysis of Biological Images.

The ongoing revolution in molecular genetics has progressed from the large-scale automated characterization of genomic sequence to the characterization of the biological function of the genome. These investigations mark the beginning of the era of 'functional genomics' [4]. A key feature of genomic-scale approaches is the automated treatment of large amounts of data. Both current and future work in the field is impossible without the automated processing and computer-aided analysis of images in connection with updating interactive electronic image databases [8].

A key aspect of such processing involves the segmentation of individual images and the registration of serial images. Many problems involving the recognition, classification, segmentation and registration of images can be formulated as optimization problems. These optimization problems are typically difficult, involving multiple minima and a complex search space topology. Contemporary approaches based on evolutionary computations are a promising avenue for the solution of such problems (EvoIASP99).

Here we describe a new method for the determination of intrinsic biological coordinates in embryos of the fruit fly Drosophila melanogaster by means of genetic algorithms (GAs). GAs, the simplex method, and a hybrid of both were applied to the problem, and we find that hybrid methods perform the best. Our results indicate that these coordinates may be determined by a morphogenetic gradient of the protein Bicoid, a result of some biological interest.

S. Cagnoni et al. (Eds.): EvoWorkshops 2000, LNCS 1803, pp. 97-106, 2000. Springer-Verlag Berlin Heidelberg 2000

1.2 Stripe Straightening: Search for Intrinsic Coordinates of the Early Embryo.

Early in development the fruit fly embryo is shaped roughly like a hollow prolate ellipsoid, composed of a shell of nuclei which are not separated by cell membranes. Deviations from the ellipsoidal shape reveal the future polarity of the animal's body: the more pointed end on the long axis makes anterior (head) structures, and the rounder end posterior (tail) structures. From a lateral (side) perspective, one long edge of the embryo is flat and will make dorsal ("back") structures, while the other long edge is rounded and makes ventral ("underside") structures. In this paper we follow the standard biological convention and show embryos with anterior to the left and (if a lateral view) dorsal up (Fig. 1).

Fig. 1 shows that the so-called pair-rule stripes (early markers of the future segmental pattern [1]) are not parallel and straight, but have a crescent-like form. The curvature of the stripes is highest at the termini and minimal in the central part. Each stripe specifies an anterior-posterior (A-P) location, and these stripes can be regarded as contours in an intrinsic coordinate system that is being created by the embryo itself. Another set of embryonic determinants exists for the dorso-ventral (D-V) axis. If the image is smoothly transformed such that the curvilinear coordinates are plotted orthogonally, the stripes appear straight, so the determination of these coordinates can be viewed as a "stripe straightening" procedure. Our task is to understand and characterize this curvilinear coordinate system as it relates to the A-P axis.

Two coupled objectives of this study are:

1. To characterize the intrinsic embryonic curvilinear coordinates.
2. To use carefully characterized and tested computational procedures for the purpose of automatically processing large numbers of images.

2 Methods and Approaches

The work reported here is part of a large-scale project to construct a model of segment determination in the fruit fly D. melanogaster based on coarse-grained chemical kinetic equations [7]. The acquisition and mapping of gene expression data at an unprecedented level of precision is an integral part of this project. The current emphasis in our work is on the automated transformation of gene expression data in confocally scanned images into an electronic database of expression.

Fig. 1. Image of an early (blastoderm stage) fly embryo with crescent-like stripes, in Cartesian physical coordinates. This is a confocally scanned image of an embryo stained by indirect fluorescence (immunostaining with polyclonal antisera against the EVEN-SKIPPED segmentation protein). Each small dot is an individual nucleus.

2.1 Images of Drosophila Gene Expression.

Transformations of embryonic coordinates begin with data expressed in terms of the average fluorescence level (proportional to gene expression level) of each nucleus, where segmentation proteins exert their biological function. These data were obtained as follows.

Antibodies for 14 protein products of segmentation genes were raised and over 500 images were prepared and scanned [2]. These images were computationally treated by means of the Khoros package [6]. Embryos were rotated and cropped automatically such that the physical long axis of the embryo was parallel with the x axis and the short physical axis with the y axis.

Next, the images were segmented [3, Kosman et al. in preparation]. About 2000 segmented and identified nuclei are obtained from each image. Each nucleus is labeled numerically, and the x and y coordinates of its centroid are found, together with the average fluorescence level over that nucleus. The segmented data take the form of tables in ASCII text format. The result is the conversion of an image to a set of numerical data which is then suitable for further processing.
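For illustration, a table of segmented nuclei of the kind described above might be read as in the following sketch. The text states only that the tables are ASCII, so the whitespace-separated layout and the column order (label, x, y, mean fluorescence) are assumptions, and the names `Nucleus` and `parse_nuclei` are hypothetical. The sample values are made up.

```python
from typing import List, NamedTuple


class Nucleus(NamedTuple):
    label: int           # numerical label of the nucleus
    x: float             # centroid x coordinate
    y: float             # centroid y coordinate
    fluorescence: float  # average fluorescence level over the nucleus


def parse_nuclei(text: str) -> List[Nucleus]:
    """Parse rows of the assumed form 'label x y fluorescence'."""
    nuclei = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        label, x, y, level = line.split()
        nuclei.append(Nucleus(int(label), float(x), float(y), float(level)))
    return nuclei


# Made-up sample rows, only to show the assumed layout.
sample = """
# label  x      y      fluorescence
1        12.5   40.0   103.2
2        15.0   42.5   98.7
"""
nuclei = parse_nuclei(sample)
```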

2.2 Stripe straightening algorithm.

In Fig. 1 the crescent-like pair-rule stripes of an embryo in near-sagittal projection are shown. We assume that the center of a pair-rule stripe follows a curve of constant A-P position. The origin of the image coordinate system is at the top left, with image coordinates for width $w$ increasing to the right and height $h$ increasing down.

Our goal here is to find the true A-P and D-V coordinates on the image. We approximate the true coordinate system by a Taylor series as follows. We denote the true A-P coordinate by $x$ and the true D-V coordinate by $y$. We pick the origin of $(x, y)$ and the origin of new image coordinates $(\tilde{x}, \tilde{y})$ so that they are the same and as close as possible.

We note that there is an A-P position at which a stripe is exactly vertical along its whole length. The center of that stripe defines $x = 0$, which is the $y$-axis. Each pair-rule stripe other than the one at $x = 0$ is curved, and we imagine the $x$ axis to intersect each of the stripes at the point where it is exactly vertical. Now we pick new image coordinates $\tilde{x}$ and $\tilde{y}$ such that they have the same origin and orientation as the $(x, y)$ coordinates, that is

$$\tilde{x} = w - w_0, \qquad \tilde{y} = -h - h_0 \tag{1}$$

We now turn our attention to $x$. For now, we can assume that $y = \tilde{y}$. Even if we don't do that, two important things will be true about the relationship between the true coordinates $(x, y)$ and the image coordinates $(\tilde{x}, \tilde{y})$: (1) the $y$ and $\tilde{y}$ axes are coextensive, and (2) the loci $y = \text{const}$ are orthogonal to the $y$ and $\tilde{y}$ axes as they cross $y = \tilde{y} = 0$. Both of these important points follow from the existence of the vertical stripe. We would like to write $x$ in terms of $\tilde{x}$ and $\tilde{y}$, so that

$$x = f(\tilde{x}, \tilde{y}). \tag{2}$$

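We expand in a Taylor series to third order around the origin. With all derivatives evaluated at $\tilde{x} = \tilde{y} = 0$, that gives

$$
\begin{aligned}
x = {} & f(0,0) + \frac{\partial f}{\partial \tilde{x}}\,\tilde{x} + \frac{\partial f}{\partial \tilde{y}}\,\tilde{y} \\
& + \frac{1}{2}\frac{\partial^2 f}{\partial \tilde{x}^2}\,\tilde{x}^2 + \frac{\partial^2 f}{\partial \tilde{x}\,\partial \tilde{y}}\,\tilde{x}\tilde{y} + \frac{1}{2}\frac{\partial^2 f}{\partial \tilde{y}^2}\,\tilde{y}^2 \\
& + \frac{1}{6}\frac{\partial^3 f}{\partial \tilde{x}^3}\,\tilde{x}^3 + \frac{1}{2}\frac{\partial^3 f}{\partial \tilde{x}^2\,\partial \tilde{y}}\,\tilde{x}^2\tilde{y} + \frac{1}{2}\frac{\partial^3 f}{\partial \tilde{x}\,\partial \tilde{y}^2}\,\tilde{x}\tilde{y}^2 + \frac{1}{6}\frac{\partial^3 f}{\partial \tilde{y}^3}\,\tilde{y}^3 .
\end{aligned}
\tag{3}
$$

Now consider the terms and what they mean. $f(0,0) = 0$ by definition. We picked the image coordinates $(\tilde{x}, \tilde{y})$ such that at the origin $\partial f/\partial \tilde{x} = 1$ and $\partial f/\partial \tilde{y} = 0$. For pure $\tilde{y}$ terms we can say more than that. The fact that the $y$ and the $\tilde{y}$ axes are coextensive means that $f(0, \tilde{y}) = 0 \;\forall \tilde{y}$, and so $\partial^2 f/\partial \tilde{y}^2 = \partial^3 f/\partial \tilde{y}^3 = 0$ as well. Thus far we have fixed five of the ten terms of the Taylor expansion: four of them vanish, and the linear term in $\tilde{x}$ has unit coefficient.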

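The unit vector $e_x$ in the $x$ direction is proportional to $\partial f/\partial \tilde{x}$, so $\partial^2 f/\partial \tilde{x}^2$ measures the change in length of $e_x$ as we move along the $x$ axis. This means that $\partial^2 f/\partial \tilde{x}^2 = 0$. Now consider $\partial^2 f/\partial \tilde{x}\,\partial \tilde{y}$. This term can be thought of as the rate of change in size of the unit vector $\|e_x\| = \partial f/\partial \tilde{x}$ along the $y$-axis. Along the $y$-axis, where $x = \tilde{x} = 0$, $\partial f/\partial \tilde{x} = 1 \;\forall \tilde{y}$, so that derivatives of this quantity with respect to $\tilde{y}$ vanish, and hence this term of the series vanishes. This has eliminated all but three unknown coefficients from the series, so now we write the first order model of image transformation in the image coordinates $(\tilde{x}, \tilde{y})$ as

$$x = \tilde{x} + A\tilde{x}\tilde{y}^2 + B\tilde{x}^2\tilde{y} + C\tilde{x}^3. \tag{4}$$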

All of these terms have a clear interpretation. The $\tilde{x}\tilde{y}^2$ term is the main one: it gives quadratic D-V curvature that increases with distance from the $x$-axis. The $\tilde{x}^2\tilde{y}$ term gives residual D-V asymmetry and the $\tilde{x}^3$ term gives residual A-P asymmetry. Lastly, expressing the above equation in terms of the pixel coordinates $w$ and $h$,

$$x = (w - w_0) + A(w - w_0)(-h - h_0)^2 + B(w - w_0)^2(-h - h_0) + C(w - w_0)^3, \tag{5}$$

and expanding brings back lower order terms in $h$ and $w$.

We tested this first order model and found that in more than half of cases it is insufficient for straightening stripes. We therefore expanded the model empirically; the resulting extension of the first order model is given by

$$x = A(w - w_0)(-h - h_0)^2 + B(w - w_0)^2(-h - h_0) + C(w - w_0)^2(-h - h_0)^2 + D(w - w_0)(-h - h_0)^3. \tag{6}$$
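As a minimal sketch, equation 6 can be evaluated for a single pixel position as follows. The function name `straighten_x` and its argument order are hypothetical; the formula itself is copied term by term from equation 6, with the shifted image coordinates of equation 1 computed first.

```python
def straighten_x(w, h, A, B, C, D, w0, h0):
    """Evaluate equation (6) at pixel position (w, h)."""
    xt = w - w0   # shifted image coordinate along the width, equation (1)
    yt = -h - h0  # shifted and flipped image coordinate along the height
    return (A * xt * yt**2
            + B * xt**2 * yt
            + C * xt**2 * yt**2
            + D * xt * yt**3)


# Example with arbitrary coefficients: xt = 2 and yt = 2 here.
value = straighten_x(3, -2, A=1.0, B=1.0, C=1.0, D=1.0, w0=1, h0=0)
```

Every term carries a factor of (w - w0), so any point on the vertical line w = w0 maps to 0, which matches the assumption of one exactly vertical stripe.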

We can interpret these additional fourth order terms as follows (with $\tilde{x} = w - w_0$ and $\tilde{y} = -h - h_0$ as in equation 1): $C\tilde{x}^2\tilde{y}^2$ is a correction term for parabolic splay, while $D\tilde{x}\tilde{y}^3$ serves to correct D-V asymmetry. In general, the situation is typical of a polynomial approximation problem: there is one polynomial that is best, but there are a number of distinct ones that can approximate it very well.

Preliminary calculations have shown that the best outcome is achieved with an independent deformation of the anterior and posterior halves of an embryo. In summary, this requires the determination of 8 deformation parameters plus the values $w_0$, $h_0^1$, $h_0^2$.
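The independent deformation of the two halves could be organized as in the sketch below, which applies equation 6 with one parameter set (A, B, C, D and its own h0) on each side of w0, for 11 parameters in total. Splitting the halves at w = w0, with anterior on the left, is an assumption made here for illustration; the names `HalfParams` and `straighten_x_halves` are hypothetical.

```python
from typing import NamedTuple


class HalfParams(NamedTuple):
    A: float
    B: float
    C: float
    D: float
    h0: float  # vertical offset used for this half only


def straighten_x_halves(w, h, w0, anterior, posterior):
    """Apply equation (6) with a separate parameter set for each half."""
    # Assumption: anterior is to the left, so w < w0 selects the anterior set.
    p = anterior if w < w0 else posterior
    xt = w - w0
    yt = -h - p.h0
    return (p.A * xt * yt**2 + p.B * xt**2 * yt
            + p.C * xt**2 * yt**2 + p.D * xt * yt**3)


anterior = HalfParams(A=1.0, B=0.0, C=0.0, D=0.0, h0=0.0)
posterior = HalfParams(A=2.0, B=0.0, C=0.0, D=0.0, h0=0.0)
left = straighten_x_halves(0, -1, 1, anterior, posterior)
right = straighten_x_halves(2, -1, 1, anterior, posterior)
```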

2.3 Genetic Algorithms Technique, Simplex Method, and Their Hybrid.

GA Search. The optimization problem of finding the coefficient values for proper elastic transformations was initially implemented with GAs. We have reduced the problem to the determination of the factors A, B, C and D of equation 6.

We use the following cost function. Each embryo image under consideration is subdivided into a series of longitudinal strips. Each strip is then subdivided into boxes, and the mean level of the gene product (EVEN-SKIPPED protein) is calculated for each box. Each row of means gives the local profile of even-skipped gene expression along that strip. The cost function is computed by comparing the profiles and summing the squares of the differences between the strips. The task of the GA is to minimize this cost function.

Following the classical GA scheme, the program generates a population of floating-point chromosomes. Initial chromosomes are randomly generated. The program then evaluates every chromosome as described above and, according to the truncation strategy, calculates the average score. Copies of chromosomes with above-average scores replace all chromosomes with below-average scores.

In the next step a predetermined proportion of the chromosome population undergoes mutation, so that one of the coefficients gets a small increment. This cycle is repeated: all chromosomes are evaluated in turn, the average score is calculated, and the winners' offspring replace the losers in the process of reproduction, until an acceptable level of stripe straightening is achieved.
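The cost function described above can be sketched as follows. Several details are assumptions made for this illustration and are not stated in the text: profiles are compared between adjacent strips only, empty boxes are given a mean of zero, and the input is a list of (x, y, level) triples whose x values have already been transformed. The name `strip_profile_cost` is hypothetical.

```python
def strip_profile_cost(points, n_strips, n_boxes, x_range, y_range):
    """Sum of squared differences between the profiles of adjacent strips."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    sums = [[0.0] * n_boxes for _ in range(n_strips)]
    counts = [[0] * n_boxes for _ in range(n_strips)]
    for x, y, level in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # ignore nuclei outside the analysed window
        # Strips run along x (the long axis) and are stacked along y.
        s = min(int((y - y_min) / (y_max - y_min) * n_strips), n_strips - 1)
        b = min(int((x - x_min) / (x_max - x_min) * n_boxes), n_boxes - 1)
        sums[s][b] += level
        counts[s][b] += 1
    # Mean level per box; empty boxes count as zero (simplifying assumption).
    profiles = [[sums[s][b] / counts[s][b] if counts[s][b] else 0.0
                 for b in range(n_boxes)]
                for s in range(n_strips)]
    cost = 0.0
    for upper, lower in zip(profiles, profiles[1:]):
        cost += sum((u - v) ** 2 for u, v in zip(upper, lower))
    return cost


# Toy data: a "stripe" at the same x in both strips, then at different x.
straight = [(0.5, 0.5, 10.0), (1.5, 0.5, 0.0), (0.5, 1.5, 10.0), (1.5, 1.5, 0.0)]
shifted = [(0.5, 0.5, 10.0), (1.5, 0.5, 0.0), (0.5, 1.5, 0.0), (1.5, 1.5, 10.0)]
cost_straight = strip_profile_cost(straight, 2, 2, (0.0, 2.0), (0.0, 2.0))
cost_shifted = strip_profile_cost(shifted, 2, 2, (0.0, 2.0), (0.0, 2.0))
```

When the stripe sits at the same position in every strip the profiles coincide and the cost is zero; any displacement between strips raises it.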

Simplex Search. We also solved the optimization problem by the downhill simplex method in multidimensions of Nelder and Mead [5]. The method requires only function evaluations, not derivatives. This is an important speed advantage over gradient methods, since calculation of the gradients requires many more evaluations of the cost function.

A simplex is the geometric figure in N dimensions consisting of N + 1 vertices and all their interconnecting line segments. The Nelder-Mead method starts with such a set of N + 1 points defining an initial simplex. The downhill simplex method operates by moving the point of the simplex where the function is largest through the opposite face of the simplex to a lower point, and so on until it reaches the vicinity of an extremum.
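The moves described above can be illustrated with a compact, simplified variant of the downhill simplex method. This is a sketch written for this text, not the Numerical Recipes routine cited in [5]: it uses reflection, expansion, a single inside contraction and a shrink step, and the function name `nelder_mead` and its parameters are hypothetical.

```python
def nelder_mead(f, simplex, max_iter=500, tol=1e-10):
    """Minimize f starting from N + 1 points, each a list of N floats."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1
    for _ in range(max_iter):
        pts.sort(key=f)  # best vertex first, worst vertex last
        best, worst = pts[0], pts[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        # Centroid of the face opposite the worst vertex.
        centroid = [sum(p[i] for p in pts[:-1]) / n for i in range(n)]

        def along(t):
            # Point at centroid + t * (centroid - worst).
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)  # worst vertex moved through the opposite face
        if f(reflected) < f(best):
            expanded = along(2.0)
            pts[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(pts[-2]):
            pts[-1] = reflected
        else:
            contracted = along(-0.5)  # halfway between worst and centroid
            if f(contracted) < f(worst):
                pts[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                pts = [best] + [[(b + x) / 2 for b, x in zip(best, p)]
                                for p in pts[1:]]
    return min(pts, key=f)


def bowl(p):
    # Toy cost with its minimum at (1, -2).
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2


result = nelder_mead(bowl, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```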

Hybrid Procedure. Initial experience indicated that the simplex method is fast but does not give high quality answers, while GAs give excellent answers but are slow. We noted that both multiple simplex runs and GA search perform numerous evaluations for many random points in search space. If we use small increments as mutations, we perform practically the same search whether we use GAs or the simplex method. If so, we can use a set of chromosomes from the GA technique as a starting simplex for Nelder-Mead optimization. In the hybrid algorithm, we implement a simple evolution strategy with floating-point chromosomes and small mutational increments. Selection and reproduction are performed as described above. In addition, from the very beginning the program links pointers to mutant offspring so as to build complete lists of N + 1 pointers to N + 1 relatives.

These "clans of mutants" are ready for the simplex procedure. Following the completion of the first list of N + 1 pointers, the program starts to perform not only the mutation, selection and reproduction procedures, but also the simplex procedure for the lowest scoring members of "complete" clans. The more clans achieve completion, the more species undergo the simplex procedure. In summary, the GA provides the search for global optima together with local ones, while the simplex provides fast downhill movement.
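The idea of seeding simplex moves from GA chromosomes can be sketched as below. This is a deliberately simplified stand-in for the procedure described above, not a reconstruction of it: instead of tracking "clans" of N + 1 related mutants through pointers, it takes the N + 1 best chromosomes of each generation as a simplex and applies a single reflection of the worst vertex. All names and default values (`hybrid_minimize`, `pop_size`, `step`, and so on) are hypothetical, and lower cost is treated as the better score.

```python
import random


def hybrid_minimize(cost, n_params, pop_size=40, generations=200,
                    mutation_rate=0.3, step=0.05, init_range=1.0, seed=0):
    """Truncation-selection GA with one simplex reflection per generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-init_range, init_range) for _ in range(n_params)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: chromosomes worse than the average cost
        # are replaced by copies of chromosomes at or below the average.
        scores = [cost(c) for c in pop]
        average = sum(scores) / pop_size
        winners = [c for c, s in zip(pop, scores) if s <= average]
        pop = [list(c) if s <= average else list(rng.choice(winners))
               for c, s in zip(pop, scores)]
        # Mutation: one coefficient of some chromosomes gets a small increment.
        for c in pop:
            if rng.random() < mutation_rate:
                c[rng.randrange(n_params)] += rng.uniform(-step, step)
        # Simplex step: the N + 1 best chromosomes form a simplex; its worst
        # vertex is reflected through the centroid of the others and the
        # reflected point is kept only if it lowers the cost.
        pop.sort(key=cost)
        simplex = pop[:n_params + 1]
        worst = simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n_params
                    for i in range(n_params)]
        reflected = [2 * c - w for c, w in zip(centroid, worst)]
        if cost(reflected) < cost(worst):
            pop[n_params] = reflected
    return min(pop, key=cost)


def toy_cost(p):
    # Toy cost with its minimum at (0.3, 0.3).
    return sum((v - 0.3) ** 2 for v in p)


best = hybrid_minimize(toy_cost, n_params=2)
```

The population must be larger than N + 1 for the simplex step to make sense; with small mutation steps the GA members stay close enough together that the best N + 1 of them form a usable simplex.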

3 Results and Discussion

3.1 Search Space Features for Stripe Straightening Problem.

The task of elastic image deformation described above turned out to be a difficult numerical problem. This is caused first of all by the unusual geometry of the search space. Fig. 2 illustrates its features through cross sections of the search space for one typical embryo under consideration.

Fig. 2. Search space features for one of the cross sections [A + D] for a typical image. This is a surface plot in which the vertical (Z) axis is the evaluation score, while X and Y are the A and D coefficients of expression 6.

As we can see, this cross section includes two grooves, one of which is deeper than the other. The sharp rectangular walls in Fig. 2 are caused by penalty conditions. Omitting the penalties gives a smoother surface with one groove, which corresponds to the deeper groove (not shown). The penalties, in turn, are absolutely needed to avoid highly nonlinear folding of an image instead of smooth deformations.

The bottoms of both grooves have several local minima. As a result, serial simplex searches return tens of such local extrema. GA search is more effective, and it finds the best local extremum on the bottom of the deepest groove. However, jumping from the shallow to the deeper groove is still a difficult task for GA search as well. To overcome this we use large population sizes or a series of runs to achieve the best solution.

3.2 The Results of Stripe Straightening.

After completion of the stripe straightening procedure with 11 coefficients ($w_0$ and two sets of A, B, C, D and $h_0^1$, $h_0^2$) for about two hundred images from the stages when all seven crescent stripes are visible, we could compare the coefficient sets found. These coefficient sets show considerable diversity, so we could not derive a general formula for the elastic deformations that achieve satisfactory stripe straightening. However, the resulting transformations of coordinates are very similar for most of the images. A typical example is shown in Fig. 3.

Fig. 3. Typical example of curvilinear coordinates found by our optimization procedures.

In contrast, comparison of coordinate curves for the anterior and posterior halves of embryos reveals small but quite evident differences (cf. Fig. 1 and Fig. 3).

A biological subject of interest is the source of the pair-rule stripes' curvature. It is known that in Drosophila segmentation the maternally expressed protein BICOID forms an anterior-posterior morphogenetic gradient in the egg which controls all following segmentation events [1]. It is interesting that contour lines of a 2D concentration map of the BICOID gradient closely coincide with the curvilinear coordinates determined by our method. The full biological implications of this observation will be reported elsewhere.


3.3 Effectiveness and Cost of GAs, Simplex and Hybrid Techniques.

Table 1 presents the results of comparative tests on the elastic transformation of a typical image by means of the simplex method, GAs, and GAs with simplex (the hybrid technique). To find the optimization parameters that make the simplex and GAs most effective, the limits of variation of the mutational increment and the range of variability of the initial population were carefully tuned. The hybrid method was tested at the same parameter values as the GA technique. The results were compared according to the time required for calculation and according to the standard deviation of the result. For each tested procedure 100 independent runs were carried out. Inspection of the table reveals that the simplex method is fastest, but also the least precise. The GA technique is the most precise, but requires more than 10 times as much computing. Our results indicate that the hybrid technique is approximately twice as fast as GAs at practically the same accuracy.

Table 1. Comparison of the effectiveness of the three approaches in terms of calculation time (total number of evaluations) and divergence (standard deviation).

| Method | Time (in evaluations) | Standard Deviation |
|---|---|---|
| Simplex Method | 25000 | 4738.646 |
| GA Technique | 301000 | 753.191 |
| Hybrid Technique | 151000 | 761.530 |
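As a quick check of the ratios discussed in the text, using only the values in Table 1:

```latex
\frac{301000}{25000} = 12.04, \qquad
\frac{301000}{151000} \approx 1.99, \qquad
\frac{4738.646}{753.191} \approx 6.29, \qquad
\frac{761.530}{753.191} \approx 1.01
```

That is, the GA needs about 12 times as many evaluations as the simplex method, the hybrid needs about half as many as the GA, the simplex standard deviation is about 6.3 times that of the GA, and the hybrid standard deviation is within about 1% of the GA value.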

The problems we encountered with our optimization task are:

1. An abundance of local minima very close to the global minimum. The simplex can get stuck in very small-scale holes, even with starting conditions exceptionally close to a known solution. We need to allow the optimizer to move from a position that is very close to the global minimum towards the global minimum. It seems that this last stage is the difficult part for the Nelder-Mead simplex method. It is possible to produce good scores with a number of distinct end points in polynomial parameter space, suggesting that the problem is probably over-specified.

2. This is a polynomial approximation problem. There will be one polynomial that is the best for the stripe straightening, but there are a number of distinct ones that can approximate it very well.

As to the first item, the successful approach to a solution of this problem is to employ hybrid techniques. Genetic algorithms alone are usually slow in optimization problems, since they are too coarse-grained to obtain a solution quickly. On the other hand, downhill algorithms are usually fast (in terms of processor cycles) if they are close to the solution, but tend to get stuck in local minima. Combining both kinds of algorithms manages to avoid local minima and finds solutions accurately.


4 Conclusions

In the task of optimizing the parameters of elastic transformation, the simplex method is fastest, but also the least precise, giving the greatest divergence. The GA technique is the most precise, but also requires at least 10 times more time. The hybrid technique is twice as fast as GAs at the same accuracy.

The intrinsic curvilinear coordinates of an embryo found by our optimization procedures appear to be approximated by contour lines of a map of the gradient of the morphogen Bicoid. This is in good agreement with known ideas about the governing role of this gradient in the subsequent processes of segmentation of the early fly embryo.

5 Acknowledgements

This work is supported by INTAS, grant No 97-30950; Russian Foundation for Basic Researches, grant No 96-04-49349; USA National Institutes of Health, grant RO1-RR07801; and CRDF, grant No RB0-685. A.S. wishes to thank Timothy Bowler for stimulating discussions and King-Wai Chu for help with programming.

References

1. Akam, M.: The molecular basis for metameric pattern in the Drosophila embryo. Development 101 (1987) 1-22.

2. Kosman, D. and Reinitz, J.: Rapid preparation of a panel of polyclonal antibodies to Drosophila segmentation proteins. Development, Genes, and Evolution 208 (1998) 290-294.

3. Kosman, D., Reinitz, J. and Sharp, D.H.: Automated assay of gene expression at cellular resolution. In: Altman, R., Dunker, K., Hunter, L. and Klein, T. (eds.), Proceedings of the 1998 Pacific Symposium on Biocomputing, pages 6-17. Singapore: World Scientific Press (1997).

4. Lander, E.S.: The new genomics: Global view of biology. Science 274 (1996) 536.

5. Press, W.H., Flannery, B.P., Teukolsky, S.A. and Vetterling, W.T.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge: Cambridge University Press (1988).

6. Rasure, J. and Young, M.: An open environment for image processing software development. In: 1992 SPIE/ISET Symposium on Electronic Imaging, V.1659 of SPIE Proceedings, SPIE, 1992.

7. Reinitz, J., Kosman, D., Vanario-Alonso, C.E. and Sharp, D.: Stripe forming architecture of the gap gene system. Developmental Genetics 23 (1998) 11-27.

8. Sanchez, C., Lachaize, C., Janody, F., et al.: Grasping at molecular interactions and genetic networks in Drosophila melanogaster using FlyNets, an Internet database. Nucleic Acids Research 27 (1999) 89-94.
