M3G: Maximum Margin Microarray Gridding

RESEARCH ARTICLE Open Access

M3G Maximum Margin Microarray GriddingDimitris Bariamis1 Dimitris K Iakovidis2 Dimitris Maroulis1

Abstract

Background Complementary DNA (cDNA) microarrays are a well established technology for studying geneexpression A microarray image is obtained by laser scanning a hybridized cDNA microarray which consists ofthousands of spots representing chains of cDNA sequences arranged in a two-dimensional array The separation ofthe spots into distinct cells is widely known as microarray image gridding

Methods In this paper we propose M3G a novel method for automatic gridding of cDNA microarray imagesbased on the maximization of the margin between the rows and the columns of the spots Initially the microarrayimage rotation is estimated and then a pre-processing algorithm is applied for a rough spot detection In order todiminish the effect of artefacts only a subset of the detected spots is selected by matching the distribution of thespot sizes to the normal distribution Then a set of grid lines is placed on the image in order to separate each pairof consecutive rows and columns of the selected spots The optimal positioning of the lines is determined bymaximizing the margin between these rows and columns by using a maximum margin linear classifier effectivelyfacilitating the localization of the spots

Results The experimental evaluation was based on a reference set of microarray images containing more thantwo million spots in total The results show that M3G outperforms state of the art methods demonstratingrobustness in the presence of noise and artefacts More than 98 of the spots reside completely inside theirrespective grid cells whereas the mean distance between the spot center and the grid cell center is 12 pixels

Conclusions The proposed method performs highly accurate gridding in the presence of noise and artefactswhile taking into account the input image rotation Thus it provides the potential of achieving perfect gridding forthe vast majority of the spots

BackgroundThe process of protein synthesis inside the cells beginswith the transcription of a gene sequence from DNA tomessenger RNA (mRNA) in the cell nucleus ThemRNA is then transported outside the nucleus and thesequence encoded in the mRNA chain is translated intoamino acids that form the corresponding protein forthat particular sequence Since proteins are translateddirectly from mRNA chains the quantity of each mRNAchain that is present in a cell is indicative of the corre-sponding protein synthesis ie the gene expression Thegoal of a microarray experiment is the quantification ofthe amount of mRNA present in a test sample com-pared to that of a reference sampleThe first step of such an experiment is the isolation of

the test and reference mRNA samples These two

samples are reverse-transcribed into complementaryDNA (cDNA) amplified using polymerase chain reac-tion and labelled usually by means of two distinct fluor-escent dyes such as the red Cy5 and the green Cy3 Thelabelled cDNA is hybridized on a microarray device thatconsists of a solid substrate and a large number ofspots where single-stranded chains of known DNAsequences are attached Each of these sequences corre-sponds to a part of a specific gene The sample cDNAcan only be hybridized with its complementarysequence The hybridized microarray is then scanned atthe wavelength of each dye and the output of theexperiment is a high resolution greyscale digital imagefor each wavelength Such an image consists of a matrixof blocks each of which contains a number of rows andcolumns of spots The grey level intensity of each spotsignifies the degree of hybridization of the labelledcDNA sample to the known DNA sequences therebyindicating the expression levels of the respective genes

Correspondence dbariamisdiuoagr1Department of Informatics and Telecommunications University of AthensAthens Greece

Bariamis et al BMC Bioinformatics 2010 1149httpwwwbiomedcentralcom1471-21051149

copy 2010 Bariamis et al licensee BioMed Central Ltd This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (httpcreativecommonsorglicensesby20) which permits unrestricted use distribution and reproduction inany medium provided the original work is properly cited

The gene expression levels are extracted from micro-array images in three steps The first step of this processis the separation of the blocks present in the image Thenext step is gridding ie the construction of a grid cov-ering each block so that it isolates each spot into a dis-tinct cell enabling the localization of each spot The laststep involves the segmentation of the spots from thebackground of the image and the quantification of theintensity of each spot which corresponds to the expres-sion level of the respective geneThe distance between the blocks of each image is sig-

nificantly larger than the distance between the spots ofeach block thus the blocks can easily be separated Avariety of approaches have been proposed for blockseparation and have achieved accurate results Theseinclude the analysis of the distances between neighbour-ing spots [1] and the use of projections of the imagepixels to the x and y axes [23] In contrast to the blockseparation step the process of gridding poses severalchallenges and has a significant impact on the accuracyof a microarray experiment [4] A gridding algorithmshould be able to grid images that include spots of var-ious shapes sizes and intensities while being robust tonoise and artefacts introduced at a microarray prepara-tion stage as well as rotation due to slight misalign-ments of the scanning robot coordinate system to theimage coordinate system [5] Furthermore it is desirablethat the gridding be automatically performed withoutany user intervention that would possibly affect themicroarray experiment as well as limit the processingthroughput of large amounts of microarray imagesSeveral methods have been proposed for microarray

image gridding they can be viewed in terms of automa-tion as manual semiautomated and fully automated [6]However most of the proposed methods are not fullyautomated and require manual tuning of parameters orother user intervention For example the state of the artmethod implemented in ImaGene [7] is semiautomatedrequiring the tuning of a multitude of parameterswhereas in the manual gridding methods implementedin ScanAlyze [8] and SpotFinder [9] the process ofgridding is performed interactively by the user Themethod proposed by Braumlndle et al [10] is parametricrequiring estimated values for several parametersOnly a few state of the art methods have been pro-

posed as providing automatic gridding but most ofthem do not address all requirements of fully automaticgridding ie handling of irregular spots and robustnessagainst noise artefacts and image rotation The state ofthe art method proposed by Angulo et al [11] is basedon mathematical morphology and requires that gridrows and columns are strictly aligned with the x and yaxes of the microarray image The same requirement isimposed by the hill-climbing approach proposed by

Rueda et al [12] A fully automatic region segmentationapproach based on Markov random fields was proposedby Katzer et al [13] but the results showed that its per-formance is diminishing in the presence of weaklyexpressed spots The Bayesian grid matching methodproposed by Hartelius et al [14] employs an iterativealgorithm to solve a complex deformable model foraccurate microarray gridding whereas methods produ-cing simpler linear grids such as [13] and [15] havebeen proved highly accurate as well Blekas et al [16]proposed a method based on Gaussian mixture modelwhereas later a methodology combining a stochasticsearch approach for the grid positioning and a MarkovChain Monte Carlo method was proposed by Antoniolet al [17] to account for local deformations of themicroarray image However this approach cannot beconsidered as fully automatic since it requires priorknowledge about the number of rows and columns ofthe spots in the microarray image Another methodbased on Voronoi diagrams was proposed by Gianna-keas et al [3] however it requires that artificial spotsare introduced in place of the spots that are very weaklyexpressed Recently a heuristic gridding approach basedon a genetic algorithm was proposed by Zacharia et al[15] This algorithm provides a near optimal griddingoutperforming the method proposed by Blekas et al[16] while being robust to both noise and rotationHowever it is well known that the genetic optimizationprocesses tend to require long processing times to con-verge since a multitude of possible solutions has to becreated and evaluatedIn this paper we propose a novel methodology for

automatic cDNA microarray gridding based on a com-putationally efficient optimization approach The pro-posed methodology is based on the maximization of themargin between the consecutive rows and columns ofthe microarray spots which is implemented by traininga linear maximum margin classifier with an automati-cally detected subset of spots on the microarray imageThe classifier determines the optimal positioning ofeach grid line whereas the use of the soft-margin var-iant provides robustness to outliers This methodologynamed M3G (Maximum Margin Microarray Gridding) issupported by a non-parametric Radon-based rotationestimator of general applicability for cDNA microarrayimages The novel contributions of the proposedmethod include

bull Optimal grid line determination through marginmaximizationbull Fully automatic gridding without user interventionbull Robustness to a wide range of image imperfectionsincluding rotation artefacts noise and weaklyexpressed spots


Page 2 of 11

bull Higher accuracy than the state of the art methods

All algorithms have been implemented under a GNULinux environment The M3G software is publicly avail-able [18] and also provided along with the manuscript[Additional file 1] A preliminary version of the pro-posed methodology which involved an arbitrary thresh-old setting in the spot detection process and a cruderotation estimator has been presented in [19] alongwith a limited experimental evaluation

MethodsThe proposed method consists of the steps illustrated inthe flowchart of Fig 1Rotation estimationThe 16-bit greyscale microarray image (Fig 2a) is initi-ally analyzed by the Radon transform (Eq 1) which isapplied to estimate the image rotation angle The Radontransform has been the method of choice for microarrayrotation estimation in approaches such as [10]

R a r I x y r x a y a dxdy( ) ( ) ( cos sin )

(1)

In the above equation the grey level intensity of theimage in pixel (x y) is denoted by I(x y) In the trans-formed image illustrated in Fig 2b the intensity of eachpixel with coordinates (a r) is equal to the integral ofthe image brightness over a straight line with an angle ato the x axis and a distance r to the origin The rotationangle θ of the microarray image is estimated by locatingthe column with the highest mean brightness in thetransformed image which is denoted by the arrow Theimage is subsequently counter-rotated by angle θ as illu-strated by Fig 2cPreprocessingThe next step involves the pre-processing of the micro-array image by linearly normalizing it so as to fit theintensity histogram to the full dynamic range of the 16-bit image The image is then quantized to 256 greylevels in order to reduce the computational complexityof the next steps Subsequently the edges of the spotsare detected by the application of the Sobel operator[20] on the normalized image A threshold is deter-mined automatically using the Otsu method [21] inorder to binarize the image and isolate the sharpestedges that correspond to spots as illustrated in Fig 2dSince the image has been normalized to the full

dynamic range and the result of the preprocessing stepis a binarized image the quantization error is expectedto affect a very small number of pixels Indeed in thedata set used for the evaluation of the proposed method(see Results) the percentage of the affected pixels in

each binarized image was less than 001 having noeffect on the subsequent grid placementSpot detectionThe binarized image is then analyzed so as to locate allthe groups of consecutive white pixels that reside on thesame spot edge Each of these groups is characterized bythe location of the center pixel and the size of thegroup Fig 2e illustrates the detected groups by

Figure 1 Flowchart of the proposed method


Page 3 of 11

representing them as rectangles that circumscribe thepixels of each group Ideally each rectangle should con-tain the edge pixels of a single microarray spot howeverdepending on the threshold used it might also includeartefacts or multiple merged spots due to the noise pre-sent in the image and the inter-spot proximity A spotselection step is introduced in the next subsection inorder to refine the spots based on their shape and sizecharacteristics resulting in the rectangles shown in Fig2fSpot selectionThe spot selection process aims to the removal of falsespots introduced by noise and artefacts In the spotselection step the aspect ratios of the detected pixelgroups are first evaluated Considering that the idealspot shape is circular the rectangles (Fig 2e) should notdeviate much from being square so that each rectanglecontains only one microarray spot Therefore the aspectratio of each spot must be close to unity Then a lowerbound smin and an upper bound smax of the spot sizesare calculated so as to maximize the similarity of thespot size distribution to the normal distribution Thespots that have sizes out of the calculated bounds areconsidered false and discardedIn order to quantify the similarity of the distribution

of the spot sizes to the normal distribution it has to betaken into account that the spot sizes can only be posi-tive in contrast to the normal distribution N (x μ s)(Eq 2) that also spans into the negatives The compari-son should therefore be made to a normal distributionfor which the negative values are explicitly set to zero

Such a distribution Nm(x μ s) (Eq 3) can be derived bynullifying the probability of N (x μ s) for x lt 0 andscaling it accordingly so that the total probabilityremains equal to unity The corresponding cumulativedistributions C(x μ s) and Cm(x μ s) are expressed byEqs 4 and 5 respectively

N x e

x

( )

( )

12

2

2 2 (2)

N x

x

N xC x

xm( ) ( )( )

0 0

10

(3)

C x erfx

( )

12

12

(4)

C x

x

C x CC x

xm( ) ( ) ( )( )

0 0

01

0(5)

A measure of dissimilarity E between the discreteprobability distribution of the spot sizes and the contin-uous probability distribution Nm(x μ s) can be estab-lished based on their respective cumulative distributionfunctions The cumulative histogram of the spot sizesCh(x) is defined as a function of the histogram h(x) as

Figure 2 Steps of the proposed method (a) input microarray image (b) result of the Radon transform (c) counter-rotated inputimage (d) binarized image (e) detected spots (f) selected spots (g) distance estimation between rows of spots (h) determination ofgrid line and (i) gridded microarray image


Page 4 of 11

shown in Eq 6 The dissimilarity E is defined as thetotal area between Ch(x) and Cm(x μ s) as shown inEq 7

C x h ih

i

x

( ) ( )

0

(6)

E C x C x dxh m

| ( ) ( ) | 0

(7)

The optimal bounds smin and smax are calculated so asto minimize the dissimilarity E defined above By select-ing the spots with sizes within the range defined bythese bounds the resulting cumulative spot size distri-bution closely resembles the normal distribution as

illustrated in the example of Fig 3 In this case anyspot that is smaller than smin = 64 pixels or larger thansmax = 171 pixels is considered false and discarded It isevident that the cumulative histogram of the selectedspots almost coincides with the cumulative normal dis-tribution (Fig 3d) whereas the original cumulative dis-tribution (Fig 3c) differs substantially from therespective cumulative normal distributionDistance estimation between consecutive rows andcolumnsThe optimal distance between spot rows is calculated bysegmenting the input microarray image into horizontalstripes with a height of dr pixels as shown in Fig 2gwhich are then averaged If dr is selected so that it isequal to the distance between the rows the spots of allrows will be in the same relative positions in the hori-zontal stripes therefore they will be highly overlapping

Figure 3 The (a) original (b) selected (c) cumulative original and (d) cumulative selected spot size distributions compared to theirrespective modified normal distributions


Page 5 of 11

in the resulting average stripe Thus the average stripewill contain well defined spot areas as illustrated in Fig4a If a suboptimal value of dr is selected the spots willreside in different relative positions in the horizontalstripes and will thus blend with the background in theaverage stripe (Fig 4b) The optimal value of dr isselected by maximizing the standard deviation of thepixel intensities of the average stripe The standarddeviation can be used as an effective measure of spotoverlap since high values of the standard deviation indi-cate distinct dark and bright areas whereas low valuesof the standard deviation indicate abundant grey areasThus the standard deviation should be maximized withrespect to dr in order to obtain the optimal value of drThe optimal column width dc is likewise estimatedusing vertical stripesA wide range of dr values is tested in order to find the

optimal value ensuring successful estimation withoutany user intervention The standard deviation dr ofthe average stripes is calculated for all values of drwithin that range using a small real valued step Fromall the tested values of dr those that result in local max-ima of the standard deviation are selected These localmaxima are most often located on multiples of the opti-mal dr since such an estimation also results in highlyoverlapping spots For each of the selected dr values themean of the resulting standard deviation dr in itsneighbourhood is calculated The neighbourhood for thecalculation of each dr is equal to the range betweenits adjacent local maxima The value of dr that results inthe highest value of the d dr r

ratio is selected asoptimalAnother method for estimating the distances between

rows and columns of spots has been proposed by Cec-carelli et al [22] It employs the Orientation Matching(OM) and Radon transforms in order to extract the spotpositions and grid rotation respectively Subsequentlythe spot locations are projected on the axes of the gridand the distance between rows and columns is esti-mated This method requires prior knowledge about theradii of the spots and uses a filter for noise reductionIn contrast to this approach the proposed method

performs distance estimation without any parameters orarbitrarily selected filters based on the maximization ofthe standard deviation of the average stripe This maxi-mization over a wide range of dr values allows successfulestimation without any user-defined parameterswhereas the use of the average stripe acts as a low passfilter allowing high tolerance to noise An evaluation ofthe distance estimation for noisy images has beenincluded in the Results sectionMaximum margin griddingIn order to determine the optimal grid lines each spotthat has been selected by the spot selection step (Fig2e) is represented by a vector xi i = 1 N where N isthe total number of selected spots in the image and thecomponents of each vector xi are the coordinates ofthe spot centre These vectors are assigned into distinctrows and columns based on the distances dr and dcwhich were previously estimated Each pair of consecu-tive rows or columns of spots can now be separated bya single separating line The optimal separating line ispositioned so as to maximize the margin between therows or columns of the spots For a pair of rows num-bered k and k + 1 the vectors that belong to row k orto any row above it are assigned a class label ci = +1and the vectors that belong to row k + 1 or to any rowbelow it are assigned a class label ci = -1 These vectorsxi along with their respective class labels ci are pro-vided as a training set to a linear Support VectorMachine (SVM) classifier [23] which produces the max-imum margin grid lineIn this particular application the SVM classifier is

provided with the aforementioned training setD x c x ci i i i ( ) | 2 1 1 By solving a quad-ratic programming optimization problem it producesthe normal vector w and the parameter b of the separ-ating line w x b 0 which maximizes the marginbetween vectors xi of different classes ie the marginbetween spots of distinct rows or columns The width ofthe margin is equal to 2||w|| therefore the widest mar-gin is found by minimizing ||w|| under the constraintsci( w xi - b) ge 1 ie requiring that all the spots lie onthe correct side of the resulting grid lineThe support vector machine described above is called

a hard-margin SVM and does not take into account anyoutliers One of its properties is that the separating lineis solely determined by the vectors that that lie on theedges of the margin called support vectors In a linearSVM a very small number of support vectors determinethe separating line and the margin In the case of out-liers present inside the margin the positioning of theseparating line will be exclusively determined by theoutliers and will thus be suboptimal for gridding Thisproblem can be solved using the soft-margin SVMwhere a slack variable ξi is introduced for each vector

Figure 4 Detail of average stripes for the horizontal directionand for (a) optimal dr and (b) suboptimal dr


Page 6 of 11

c w x bi i i( ) 1 The constraints are then formu-lated as c w x bi i i( ) 1 and the separating hyper-plane can be found by minimizing

min || ||12

2

1

w C i

i

N

(8)

where C is a cost parameter that determines the effectof outliers on the positioning of the resulting grid lineLarge values of C result in a grid line that is mostlydetermined by any outliers while on the other handsmaller values of C result in a grid line that follows thegeneral trend of the spot locations given to the classifiervirtually ignoring any outliers The hard-margin classi-fier is equivalent to a soft-margin classifier with an infi-nitely large C [24]The soft-margin SVM is employed in order to dimin-

ish the effects of misdetected spots that result from arte-facts or noise A small fraction of these outliers mighthave a shape and size similar to valid spots and couldtherefore pass through the selection step without beingdiscarded The soft-margin SVM ensures that such out-liers will not have an impact on the produced grid linesFor ideal microarray images where all spots could besuccessfully detected and no outliers are present ahard-margin SVM could be used as well but a griddingapplication for real microarray images requires robust-ness against outliers Furthermore in ideal noiselessimages the training set for the SVM classifier wouldconsist only of the necessary spots ie those residing onrows k and k + 1 However in real microarray imagesthere are cases where several consecutive spots might bevery weakly expressed and therefore not detected Inorder to cope with this problem spots from rows abovek and below k + 1 are included in the training set pro-viding redundant data to the classifier to ensure success-ful gridding Using an algorithm based on the SequentialMinimal Optimization (SMO) to solve the SVM optimi-zation problem [25] the additional data introduces onlya small computational overhead since such algorithmsevaluate vectors that are far from the separating line inonly the first few iterations of their outer loops [2627]The SVM has been selected over similar methods forthe determination of the grid lines such as a leastsquares fit because the soft margin SVM is adjustablewith regards to its tolerance to outliers through the costparameter C [28]In the case that row k contains less than two detected

spots the two grid lines that separate this row fromrows k - 1 and k + 1 cannot be determined by the useof the SVM classifier This is a rather rare case consid-ering that the image is normalized during the preproces-sing step To cope with this limitation the endpoints of

the two grid lines are positioned equidistantly betweenthe endpoints of the first neighboring grid lines aboveand below them In the case where the top or bottomrows of spots contain less than two spots the endpointsof the grid lines that cannot be determined are posi-tioned dr pixels further from the nearest grid linesFig 5 illustrates the case of gridding in the presence of

an obvious outlier denoted by the arrow It is evidentthat for a small value for the cost parameter C (Fig 5a)the margin is determined by the other spots in the rowand the outlier is ignored whereas for larger values of C(Fig 5b) the outlier affects the positioning of the separ-ating line by moving it significantly closer to the vectorsof the top row reducing the margin and rendering itsuboptimal for gridding

Results and DiscussionData sets and evaluation methodTwo data sets were used for the evaluation of M3G Thefirst data set consists of 54 cDNA microarray imagesfrom the Stanford Microarray Database [29] The imagesare TIFF files with a resolution of 1900 times 5500 pixelsand 16-bit grey level depth Each image includes 48blocks of 870 spots each resulting in a total of2255040 spots in the data set These images have beenproduced for the study of the gene expression profilesof 54 specimens of BCR-ABL-positive and -negativeacute lymphoblastic leukemia [30] This data set is asuperset of the one used by the preliminary version ofthe proposed method [19] and the genetic algorithmapproach proposed in [15] This data set is accompaniedby ground truth annotations regarding the positions andthe sizes of the spots Fig 6 visually validates the resem-blance of the distribution of the sizes of the spots in thedata set to the Nm(x μ s) distributionThe second data set consists of 10 microarray blocks

selected from distinct microarray images that were arti-ficially created or obtained from public microarray data-bases The blocks are stored in TIFF files with 16-bitgrey level depth This data set has been used for theevaluation of the method proposed by Blekas et al [16]and has been obtained from the authors of [16]In order to produce directly comparable results with

the aforementioned methods the same statistical analy-sis is performed Each spot was evaluated as being

Figure 5 The effect of an outlier as a function of the SVM costparameter C (a) Small value of C (b) Large value of C


Page 7 of 11

perfectly marginally or incorrectly gridded when thepercentage of its pixels contained within its grid cell is100 more than 80 or less than 80 respectivelyComparative evaluationThe evaluation results of M3G for the first data set areshown in Table 1 along with the corresponding resultsof the preliminary version of the proposed method andthe genetic algorithm approach In order to evaluate thesensitivity of the proposed method to missing spots twoexperiments were conducted in the first all selectedspots were used for the SVM training process whereasin the second half of the selected spots were randomlydiscarded Both experiments resulted in perfect griddingfor more than 98 of the spots in the data set whereasonly up to 16 and 03 of the spots were marginallyand incorrectly gridded respectively These results illus-trate that the proposed method can achieve almost per-fect gridding even in the case of significantly fewerdetected spotsThe evaluation results for the second data set are

shown in Table 2 and reveal that the proposed methodachieves a significantly lower percentage of marginallyand incorrectly gridded spots compared to the user-guided approaches [89] and the automatic approaches[1516]In comparison to the preliminary version [19] several

improvements allow a significant enhancement of

gridding accuracy

bull Automatic selection of spots based on the distribu-tion of their sizesbull Automatic determination of the operating para-meters such as the edge detection threshold and theupper and lower bounds of the sizes of the spotsbull Inclusion of all selected spots into the training setof each SVM classifier instead of only pairs of rowsor columns of selected spotsbull More accurate rotation detection using the Radontransform

The method used for the evaluation of the griddingaccuracy by [3] involves the measurement of the dis-tance between each spot center and the center of itsrespective grid cell The mean value and standard devia-tion of these distances are used to evaluate the localiza-tion of the spots by the grid The results of M3G ascompared to the ones presented in [3] are shown inTable 3 The proposed method achieves a 50 smallermean distance and a 33 smaller standard deviation ascompared to the best results presented in [3] illustrat-ing the advantageous performance of the proposedmethod with regards to the localization of the spotsEven when half of the selected spots are randomly dis-carded to simulate the case of significantly fewerdetected spots M3G still achieves a better localizationthan the best results of [3]The gridding performance of M3G was evaluated by

following a grid search approach for the determinationof the optimal value for the SVM cost parameter C Theresults are presented in Fig 7 The SVM cost parameterC determines the effect that outliers or noise mighthave on the positioning of the separating lines produced

Figure 6 The distribution of spot diameters in the data setcompared to the modified normal distribution Nm (x μ s)with μ = 807 and s = 166

Table 1 Comparison of gridding results for the first dataset

Perfect Marginal Incorrect

Zacharia et al [15] 946 48 06

Preliminary method [19] 951 45 04

M3G (all spots) 983 15 02

M3G (50 of spots) 981 16 03

Table 2 Comparison of gridding results for the seconddata set


ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03

Table 3 Euclidean distances between spot centers andgrid cell centers

Mean Standard Deviation

Giannakeas et al (Red) 273 pixels 110 pixels

Giannakeas et al (Green) 234 pixels 107 pixels

M3G (all spots) 117 pixels 071 pixels

M3G (50 of spots) 121 pixels 104 pixels


Page 8 of 11

by the SVM therefore a small value of C should beselected for successful gridding The choice of C = 10-2

is supported by the results as it produces the mostaccurately gridded spots compared to the other valuesof C Lower values of C do not increase the achievedaccuracy but the choice of a larger value results in areduction of the achieved accuracy due to the corre-sponding increase of significance of outliersFig 8 illustrates the gridding that results from the

application of M3G on a microarray image area thatincludes a large and bright artefact Even in the vicinityof the artefact the gridding is not affected by its pre-sence Fig 9 illustrates the resulting gridding for threemore such images including a detailed view of the areaaround each artefact Despite the presence of these arte-facts the proposed method achieves successful griddingin all those casesTime performanceThe proposed method was also evaluated with regardsto its computational time requirements The evaluationplatform is based on an Athlon 64 X2 3800+ processorand includes 3GB of RAM whereas the blocks usedbelong to the first data set and their dimensions areroughly 450 times 450 pixels For the first block in eachmicroarray image our method requires 18 seconds ofprocessing time most of which is used for estimatingthe distance between consecutive rows and columns ofspots For any subsequent blocks of the same image therange of possible values for dr and dc is reduced result-ing in 10 seconds of processing time It is worth notingthat the processing times mentioned above can be sig-nificantly reduced by optimizing the implementationand by using multiple cores of the processor as themost time consuming parts of the gridding process can

be efficiently parallelized The genetic algorithmapproach [15] requires a processing time of 92 secondsfor each microarray image block which is nearly oneorder of magnitude larger than the time required byM3GEvaluation of the rotation detectionIn order to assess the performance of the rotation esti-mation step the images in the first data set were ran-domly rotated by angle θreal ranging from -25deg to +25degThe proposed rotation detection method was then usedto compute an estimate θest of the rotation for eachimage Based on that estimate the images were counter-rotated and subsequently gridded The evaluation wasperformed based on the mean and standard deviation ofthe difference between the real and estimated rotationof the images Δθ = θreal - θest Both the mean differenceμΔθ and the standard deviation sΔθ were below 01deg Theaccuracy achieved when gridding the counter-rotatedimages was within 03 of the accuracy obtained usingthe original images thus the introduction of rotationresults in only negligible variation of the griddingaccuracyEvaluation of the distance estimationThe distance estimation step was evaluated with regardsto its tolerance to noise Additive Gaussian noise wasintroduced to fifty randomly selected blocks from dis-tinct real microarray images of the first data set Thestandard deviations used were s = 250 500 and 1000resulting in mean signal to noise ratios of 9 dB 55 dBand 1 dB respectively whereas in several cases the intro-duced noise was stronger than the signal in the imageFor all the images and noise levels tested the distanceestimation step displayed a negligible variance of up to002 pixels illustrating that the use of the average stripefor distance estimation provides very high tolerance tonoise

ConclusionsIn this paper we presented M3G a novel method forgridding cDNA microarray images without user inter-vention based on the maximization of the marginbetween consecutive rows and columns of spots Theproposed method involves several preprocessing stepsincluding a Radon-based rotation estimation for themicroarray image as well as spot detection and selec-tion The distance between rows and columns of spotsis then estimated and the positions of the selected spotsare used to train a set of linear soft-margin SupportVector Machine classifiers The use of soft-marginSVMs allows high tolerance to outliers that result fromartefacts and noise whereas the use of redundant vec-tors in the SVM training set and the automatic determi-nation of the operating parameters facilitate asubstantial increase in gridding accuracy Overall the

Figure 7 Percentage of correctly gridded spots as a function ofthe SVM cost parameter C


Page 9 of 11

proposed method achieves successful automatic griddingof cDNA microarray images in the presence of irregularspots noise and artefacts as well as image rotationThe experimental results on reference DNA microar-

ray images containing more than two million spots

showed that the proposed method outperforms themost accurate state of the art methods providing thepotential of achieving perfect gridding for the vastmajority of the spots

Figure 8 Example of successful gridding in the presence of a large and bright artefact

Figure 9 Details of successful gridding for microarray images with bright artefacts


0 of 11

Additional file 1 M3G Software All algorithms have beenimplemented under a GNULinux environment The M3G software ispublicly available at the Downloads page of httprtsimagediuoagrand also provided along with the manuscript as an additional fileClick here for file[ httpwwwbiomedcentralcomcontentsupplementary1471-2105-11-49-S1TGZ ]

AcknowledgementsThe authors would like to thank K Blekas N P Galatsanos A Likas and I ELagaris for providing one of the data sets used for the evaluation of theproposed gridding method The authors would also like to thank theanonymous reviewers for their constructive and insightful commentsThis work was realized under the framework of the Reinforcement Programof Human Research Manpower (PENED 2003rdquo 03ED324) co-funded 25 bythe General Secretariat for Research and Technology Greece and 75 bythe European Social FundThis work was partially funded by the Special Account of Research Grants ofthe National and Kapodestrian University of Athens (Kapodistriasrdquo 7046445)

Author details1Department of Informatics and Telecommunications University of AthensAthens Greece 2Department of Informatics and Computer TechnologyTechnological Educational Institute of Lamia Lamia Greece

Authorsrsquo contributionsAll authors were equally involved in the conception and design of the M3Gmethod as well as the writing and revision of the manuscript DBdeveloped the implementation of M3G and performed the experiments

Competing interestsThe authors declare that they have no competing interests

Received 31 July 2009Accepted 25 January 2010 Published 25 January 2010

References1 Jung HY Cho HG An automatic block and spot indexing with k-nearest

neighbors graph for microarray image analysis Bioinformatics 200218(Suppl 2)141-151

2 Hirata R Jr Barrera J Hashimoto RF Dantas DO Esteves GH Segmentationof Microarray Images by Mathematical Morphology Real-Time Imaging2002 8(6)491-505

3 Giannakeas N Fotiadis DI An automated method for gridding andclustering-based segmentation of cDNA microarray images ComputerizedMedical Imaging and Graphics 2009 3340-49

4 Lawrence ND Milo M Niranjan M Rashbass P Soullier S Reducing thevariability in cDNA microarray image processing by Bayesian inferenceBioinformatics 2004 20(4)518-26

5 Katzer M Kummert F Sagerer G Methods for automatic microarray imagesegmentation IEEE Transactions on Nanobioscience 2003 2(4)202-214

6 Bajcsy P An Overview of DNA Microarray Grid Alignment andForeground Separation Approaches EURASIP Journal on Applied SignalProcessing 2006 20061-13

7 Biodiscovery Inc ImaGene 2007httpwwwbiodiscoverycomimageneasp8 Eisen MB ScanAlyze 2002httpranalblgovEisenSoftwarehtm9 Hegde P Qi R Abernathy K Gay C Dharap S Gaspard R Hughes JE

Snesrud E Lee N Quackenbush J A concise guide to cDNA microarrayanalysis Biotechniques 2000 29(3)548-550

10 Braumlndle N Bischof H Lapp H Robust DNA microarray image analysisMachine Vision and Applications 2003 1511-28

11 Angulo J Serra J Automatic analysis of DNA microarray images usingmathematical morphology Bioinformatics 2003 19(5)553-562

12 Rueda L Vidyadharan V A Hill-Climbing Approach for Automatic Griddingof cDNA Microarray Images IEEEACM Transactions on ComputationalBiology and Bioinformatics 2006 372-83

13 Katzer M Kummert F Sagerer G A Markov Random Field Model ofMicroarray Gridding Proceedings of the 2003 ACM symposium on AppliedComputing ACM 2003 72-77

14 Hartelius K Carstensen JM Bayesian Grid Matching IEEE Transactions onPattern Analysis and Machine Intelligence 2003 25(2)162-173

15 Zacharia E Maroulis D An Original Genetic Approach to the FullyAutomatic Gridding of Microarray Images IEEE Transactions on MedicalImaging 2008 27(6)805-813

16 Blekas K Galatsanos NP Likas A Lagaris IE Mixture model analysis of DNAmicroarray images IEEE Transactions on Medical Imaging 200524(7)901-909

17 Antoniol G Ceccarelli M Microarray image gridding with stochasticsearch based approaches Image and Vision Computing 200725(2)155-163

18 M3G Maximum Margin Microarray Gridding httprtsimagediuoagr19 Bariamis DG Maroulis D Iakovidis DK Automatic DNA microarray gridding

based on Support Vector Machines Proceedings of the 8th IEEEInternational Conference on Bioinformatics and Bioengineering (BIBE 2008)IEEE 2008 1-5

20 Gonzalez RC Woods RE Digital Image Processing Upper Saddle River NJUSA Prentice-Hall Inc 3 2006

21 Otsu N A threshold selection method from gray-level histograms IEEETransactions on Systems Man and Cybernetics 1979 962-66

22 Ceccarelli M Antoniol G A Deformable Grid-Matching Approach forMicroarray Images IEEE Transactions on Image Processing 200615(10)3178-3188

23 Cortes C Vapnik V Support-Vector Networks Machine Learning 199520(3)273-297

24 Theodoridis S Koutroumbas K Pattern Recognition Academic Press 4 200825 Chang CC Lin CJ LIBSVM a library for support vector machines

2001httpwwwcsientuedutw~cjlinlibsvm26 Platt J Sequential minimal optimization A fast algorithm for training

support vector machines Tech rep Microsoft Inc 199827 Fan RE Chen PH Lin CJ Working Set Selection Using Second Order

Information for Training Support Vector Machines Journal of MachineLearning Research 2005 61889-1918

28 Burges CJC A Tutorial on Support Vector Machines for PatternRecognition Data Min Knowl Discov 1998 2(2)121-167

29 Stanford Microarray Database 2007httpsmdstanfordedu30 Juric D Lacayo NJ Ramsey MC Racevskis J Wiernik PH Rowe JM

Goldstone AH OrsquoDwyer PJ Paietta E Sikic BI Differential gene expressionpatterns and interaction networks in BCR-ABL-positive and -negativeadult acute lymphoblastic leukemias Journal of Clinical Oncology 200725(11)1341-1349

doi1011861471-2105-11-49Cite this article as Bariamis et al M3G Maximum Margin MicroarrayGridding BMC Bioinformatics 2010 1149

Submit your next manuscript to BioMed Centraland take full advantage of

bull Convenient online submission

bull Thorough peer review

bull No space constraints or color figure charges

bull Immediate publication on acceptance

bull Inclusion in PubMed CAS Scopus and Google Scholar

bull Research which is freely available for redistribution

Submit your manuscript at wwwbiomedcentralcomsubmit


Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection

Distance estimation between consecutive rows and columns

Maximum margin gridding

Results and Discussion

Data sets and evaluation method

Comparative evaluation

Time performance

Evaluation of the rotation detection

Evaluation of the distance estimation

Conclusions

Acknowledgements

Author details

Authors contributions

Competing interests

References

The gene expression levels are extracted from micro-array images in three steps The first step of this processis the separation of the blocks present in the image Thenext step is gridding ie the construction of a grid cov-ering each block so that it isolates each spot into a dis-tinct cell enabling the localization of each spot The laststep involves the segmentation of the spots from thebackground of the image and the quantification of theintensity of each spot which corresponds to the expres-sion level of the respective geneThe distance between the blocks of each image is sig-

nificantly larger than the distance between the spots ofeach block thus the blocks can easily be separated Avariety of approaches have been proposed for blockseparation and have achieved accurate results Theseinclude the analysis of the distances between neighbour-ing spots [1] and the use of projections of the imagepixels to the x and y axes [23] In contrast to the blockseparation step the process of gridding poses severalchallenges and has a significant impact on the accuracyof a microarray experiment [4] A gridding algorithmshould be able to grid images that include spots of var-ious shapes sizes and intensities while being robust tonoise and artefacts introduced at a microarray prepara-tion stage as well as rotation due to slight misalign-ments of the scanning robot coordinate system to theimage coordinate system [5] Furthermore it is desirablethat the gridding be automatically performed withoutany user intervention that would possibly affect themicroarray experiment as well as limit the processingthroughput of large amounts of microarray imagesSeveral methods have been proposed for microarray

image gridding they can be viewed in terms of automa-tion as manual semiautomated and fully automated [6]However most of the proposed methods are not fullyautomated and require manual tuning of parameters orother user intervention For example the state of the artmethod implemented in ImaGene [7] is semiautomatedrequiring the tuning of a multitude of parameterswhereas in the manual gridding methods implementedin ScanAlyze [8] and SpotFinder [9] the process ofgridding is performed interactively by the user Themethod proposed by Braumlndle et al [10] is parametricrequiring estimated values for several parametersOnly a few state of the art methods have been pro-

posed as providing automatic gridding but most ofthem do not address all requirements of fully automaticgridding ie handling of irregular spots and robustnessagainst noise artefacts and image rotation The state ofthe art method proposed by Angulo et al [11] is basedon mathematical morphology and requires that gridrows and columns are strictly aligned with the x and yaxes of the microarray image The same requirement isimposed by the hill-climbing approach proposed by

Rueda et al [12] A fully automatic region segmentationapproach based on Markov random fields was proposedby Katzer et al [13] but the results showed that its per-formance is diminishing in the presence of weaklyexpressed spots The Bayesian grid matching methodproposed by Hartelius et al [14] employs an iterativealgorithm to solve a complex deformable model foraccurate microarray gridding whereas methods produ-cing simpler linear grids such as [13] and [15] havebeen proved highly accurate as well Blekas et al [16]proposed a method based on Gaussian mixture modelwhereas later a methodology combining a stochasticsearch approach for the grid positioning and a MarkovChain Monte Carlo method was proposed by Antoniolet al [17] to account for local deformations of themicroarray image However this approach cannot beconsidered as fully automatic since it requires priorknowledge about the number of rows and columns ofthe spots in the microarray image Another methodbased on Voronoi diagrams was proposed by Gianna-keas et al [3] however it requires that artificial spotsare introduced in place of the spots that are very weaklyexpressed Recently a heuristic gridding approach basedon a genetic algorithm was proposed by Zacharia et al[15] This algorithm provides a near optimal griddingoutperforming the method proposed by Blekas et al[16] while being robust to both noise and rotationHowever it is well known that the genetic optimizationprocesses tend to require long processing times to con-verge since a multitude of possible solutions has to becreated and evaluatedIn this paper we propose a novel methodology for

automatic cDNA microarray gridding based on a com-putationally efficient optimization approach The pro-posed methodology is based on the maximization of themargin between the consecutive rows and columns ofthe microarray spots which is implemented by traininga linear maximum margin classifier with an automati-cally detected subset of spots on the microarray imageThe classifier determines the optimal positioning ofeach grid line whereas the use of the soft-margin var-iant provides robustness to outliers This methodologynamed M3G (Maximum Margin Microarray Gridding) issupported by a non-parametric Radon-based rotationestimator of general applicability for cDNA microarrayimages The novel contributions of the proposedmethod include

bull Optimal grid line determination through marginmaximizationbull Fully automatic gridding without user interventionbull Robustness to a wide range of image imperfectionsincluding rotation artefacts noise and weaklyexpressed spots


of 11





(1)






Page 3 of 11




N x e

x

( )

( )

12

2

2 2 (2)

N x

x

N xC x

xm( ) ( )( )

0 0

10

(3)

C x erfx

( )

12

12

(4)

C x

x

C x CC x

xm( ) ( ) ( )( )

0 0

01

0(5)




Page 4 of 11


C x h ih

i

x

( ) ( )

0

(6)

E C x C x dxh m

| ( ) ( ) | 0

(7)





Page 5 of 11










Page 6 of 11


min || ||12

2

1

w C i

i

N

(8)











Page 7 of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References





(1)






of 11




N x e

x

( )

( )

12

2

2 2 (2)

N x

x

N xC x

xm( ) ( )( )

0 0

10

(3)

C x erfx

( )

12

12

(4)

C x

x

C x CC x

xm( ) ( ) ( )( )

0 0

01

0(5)




Page 4 of 11


C x h ih

i

x

( ) ( )

0

(6)

E C x C x dxh m

| ( ) ( ) | 0

(7)





Page 5 of 11










Page 6 of 11


min || ||12

2

1

w C i

i

N

(8)











Page 7 of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References




N x e

x

( )

( )

12

2

2 2 (2)

N x

x

N xC x

xm( ) ( )( )

0 0

10

(3)

C x erfx

( )

12

12

(4)

C x

x

C x CC x

xm( ) ( ) ( )( )

0 0

01

0(5)




of 11


C x h ih

i

x

( ) ( )

0

(6)

E C x C x dxh m

| ( ) ( ) | 0

(7)





Page 5 of 11










Page 6 of 11


min || ||12

2

1

w C i

i

N

(8)











Page 7 of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References


C x h ih

i

x

( ) ( )

0

(6)

E C x C x dxh m

| ( ) ( ) | 0

(7)





of 11










Page 6 of 11


min || ||12

2

1

w C i

i

N

(8)











Page 7 of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References










of 11


min || ||12

2

1

w C i

i

N

(8)











Page 7 of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References


min || ||12

2

1

w C i

i

N

(8)











of 11




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








Page 8 of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References




gridding accuracy










M3G (50 of spots) 981 16 03



ScanAlyze [8] 487 226 287

SpotFinder [9] 728 143 129


Blekas et al [16] 896 92 12

M3G 980 17 03








of 11








Page 9 of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References








of 11







Page 10 of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References







of 11















































Page 11 of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References















































of 11

Abstract

Background

Methods

Results

Conclusions

Background

Methods

Rotation estimation

Preprocessing

Spot detection

Spot selection






Time performance



Conclusions

Acknowledgements

Author details


Competing interests

References

M3G: Maximum Margin Microarray Gridding

Documents

Transcript of M3G: Maximum Margin Microarray Gridding