Lane tracking software for four-color fluorescence-based electrophoretic gel images

9
10.1101/gr.6.11.1110 Access the most recent version at doi: 1996 6: 1110-1117 Genome Res. M L Cooper, D R Maffitt, J D Parsons, et al. electrophoretic gel images. Lane tracking software for four-color fluorescence-based References http://genome.cshlp.org/content/6/11/1110#related-urls Article cited in: This article cites 8 articles, 4 of which can be accessed free at: service Email alerting click here top right corner of the article or Receive free email alerts when new articles cite this article - sign up in the box at the http://genome.cshlp.org/subscriptions go to: Genome Research To subscribe to Copyright © Cold Spring Harbor Laboratory Press Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.org Downloaded from

Transcript of Lane tracking software for four-color fluorescence-based electrophoretic gel images

10.1101/gr.6.11.1110Access the most recent version at doi: 1996 6: 1110-1117Genome Res.

 M L Cooper, D R Maffitt, J D Parsons, et al. electrophoretic gel images.Lane tracking software for four-color fluorescence-based  

References

http://genome.cshlp.org/content/6/11/1110#related-urlsArticle cited in:  

This article cites 8 articles, 4 of which can be accessed free at:

serviceEmail alerting

click heretop right corner of the article orReceive free email alerts when new articles cite this article - sign up in the box at the

http://genome.cshlp.org/subscriptions go to: Genome ResearchTo subscribe to

Copyright © Cold Spring Harbor Laboratory Press

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

GENOME METHODS

Lane Tracking Software for Four-color Fluorescence-based Electrophoretic

Gel Images Mat thew L. Cooper, 1'3 David R. Maffitt, 2 Jeremy D. Parsons, 1

LaDeana Hillier, and David J. States 2

1Genome Sequencing Center, Department of Genetics, and 2Institute for Biomedical Computing, Washington University School of Medicine, St. Louis, Missouri 63110

Software to track sample lanes automatically in four-color, fluorescence-based, electrophoretic gel images has been developed for application in large-scale DNA sequencing projects. Lanes and lane boundaries are tracked by analyzing a first difference approximation to the gradient of a vertically integrated and processed "brightness" profile. Initially lanes are located in a region of the gel image selected for good horizontal lane spacing and signal strength. The software uses models of expected lane and interlane spacing and lateral lane behavior to maintain accurate tracking on imperfect gels. In areas where intensity-based tracking is difficult, interpixel column correlation is also used to locate and define lane features. Summary statistics and compressed-in-time images are generated for user evaluation of tracking performance. The software developed has been tested successfully on gel images with degradations including significant horizontal lane motion (curving} and image artifacts, and is now in full-scale use in our sequencing projects.

Increasing the level of automation of electropho- retic DNA sequencing is an important and well- documented challenge to the Human Genome Project (HGP) (Watson 1985). Several advances toward this end have been made, including, most notably, the development of fluorescence-based sequencing ins t ruments (Smith et al. 1986; Prober et al. 1987; Ansorge et al. 1988; Brum- baugh et al. 1988). Fluorescence-based sequenc- ing systems allow the coupling of electrophoretic separation, fragment detection, image analysis, trace processing, and base-call ing, w i thou t manual intervention. This level of automation is predicated on robust, high-performance software that, in the case of sample lane tracking, is only now becoming available (Berno 1996).

The genomes of several model organisms are being completely sequenced as part of the HGP, including that of Caenorhabditis elegans. In the context of sequencing the C. elegans genome, manual tracking of sample lanes in the gel images has proven to be a time-consuming obstacle to achieving increased throughput and further au- tomation of accurate, reproducible sequence de-

3Corresponding author. E-MAIL [email protected]; FAX (314) 286-1810.

termination (Wilson et al. 1994). Recently, dou- bling the channel sampling across the gel has al- lowed us to increase the number of samples per gel from thirty-six to sixty-four, but has resulted in a concomitant increase in the difficulty and time required to manually reposition the track- ing lanes on the appropriate samples. In the most problematic gel images, manual correction of the vendor-supplied tracking software on difficult gels can require up to 3 hr. This effort is necessi- tated by the failure, almost total in some cases, of the supplied data collection software to accu- rately track samples in gels exhibiting anomalous sample migration patterns.

To further automate the pathway from DNA sequence of individual shotgun clones to con- tiguous cosmid sequences, algorithms to accu- rately locate and track lanes of DNA sequencing fragments in fluorescence-based four-color elec- trophoretic gel images have been developed and tested successfully. The new software for lane tracking significantly reduces the manual pro- cessing time invested in gel image analysis, and represents the first step in the development of a complete UNIX-based gel analysis package.

Our goal is to locate and accurately track the lanes in a gel image that result from excitation and detection of electrophoretically separated,

1110 @ GENOME RESEARCH 6:1110-1117 �9 by Cold Spring Harbor Laboratory Press ISSN 1054-9803/96 $5.00

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

fluorescently labeled fragments of DNA. Many available gel analysis packages rely on vendor- supplied lane tracking software (Golden et al. 1993), manual tracking of the lanes (Giddings et al. 1993), or, most often, a combination of the two. The automation of lane tracking is difficult for a number of reasons. In principle, the nega- tively charged phosphate backbones of the DNA fragments should cause them to electrophorese in a straight path through the pores of the poly- merized acrylamide gel in the presence of a linear potential gradient. In practice, gel or plate con- taminants, excess salt in loaded samples, electric field or polymerization variability, or machine misuse or failures can cause the lanes to move laterally, sometimes significantly. Furthermore, image artifacts of undetermined origin are often observed. These can be described as "stripes" or "patches" in the gel image of approximately con- stant intensity that obscure the underlying data. Postulated causes of artifacts include gel impuri- ties, plate scratches, dust on the detector, laser cutout, buffer leaks, and software failures. Ob- served artifacts, even within a single gel image, fail to exhibit a consistent set of characteristics, and can be mistaken for lanes by tracking soft- ware. Also, gels may contain empty lanes as a result of failed reactions or poorly formed wells, and lateral spacing, lane width, and signal strength can vary considerably with the location in the gel image and the comb used in loading.

Our software, Getlanes, has been designed for use with four-color fluorescence-based elec- trophoretic gel images. All code is written in ANSI C. Thus far, the gel images have been generated by the Applied Biosystems (ABI) 373A Sequenc- ing Instrument, and accompanying 388 Channel Data Collection and Analysis Software (Version 1.2.388) or the ABI 377 Sequencing Machine and 194 Channel Data Collection and Analysis Soft- ware (Version 2.1). The 194 Channel Software has also been used with the ABI 373A. Once data collection (on the Macintosh connected directly to the sequencing machine) is complete, the gel file is transferred to a UNIX file system using one of many available UNIX/Macintosh file-sharing packages, Columbia Appleshare Protocol (CAP, publicly available from Columbia University). This transfer separates the image data, contained in the ABI gel file data fork, and the resource data, contained in the ABI gel file resource fork, into two UNIX files. These two files are then processed by Getlanes in the UNIX environment.

After lane tracking, one-dimensional traces

LANE IRACKING SOFTWARE

are extracted down the lanes from each of the four ABI filter images. These traces are then pro- cessed and used for base calling. Presently, ABI gel files are transferred to a UNIX machine for automatic tracking by a daemon that runs Get- lanes using default options, and then back to a Macintosh for manual inspection. The daemon process runs continuously, and regularly checks specified directories spawning processes to ex- ecute the lane tracking software on any new gel files it discovers. Statistics are compiled in log files. Once the gel is back on a Macintosh, the ABI-supplied software is used for both trace ex- traction and preprocessing.

In the future, additional UNIX tools will be available for gel inspection, manual tracking, and trace preprocessing. Locally optimized UNIX- based base-calling software is already in use in our project (P. Green, unpubl.). Thus our overall aim is to use only the ABI Data Collection soft- ware and to complete the gel analysis using im- proved UNIX-based software, including the soft- ware documented here. The design of these tools is modular to permit comparative testing of each individual component prior to the completion of our entire gel analysis package.

RESULTS

We have tested Getlanes in three ways. First, we have tested the software using gel images exhib- iting extreme (and unusual) imperfections. Fig- ures 1-3 show three such gel images with artifacts and extreme lateral sample. The gel of Figure 3 possesses an artifact in the lower left that crosses

Figure 1 Example gel imperfections. (A) A promi- nent artifact is crossing the curving lanes. (B) The gel is shown tracked using lane-behavior models.

GENOME RESEARCH @ 1 1 ! 1

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

COOPER ET AL.

Figure 2 The most curved section of the gel in A is shown retracked in B, demonstrating the success of our definition of a lane using peaks and bound- aries even in an area of large lateral movement. This gel also has "missing" lanes in which the signal is weak. The gel is retracked with missing lanes added, automatically enabling the application of our spac- ing models.

several of the underlying lanes, but its effects are minimized by the application of the horizontal spacing model (see Methods). Such artifacts con- fuse algorithms lacking lane behavior models, as artifacts often have greater intensity than the lanes nearby, as seen in the intensity plot of Fig- ure 4 of a vertically integrated horizontal slice of this gel image. The gel in Figure 2A exhibits se- vere curving of the lanes at the top of the image (physically, the bottom of the gel). This gel tests

Figure 3 (A) Artifacts that make tracking in the lower left corner extremely difficult and erase most hopes of determining the underlying sequence. (B) The gel is shown tracked by Getlanes salvaging much of the data.

1 1 12 ~ GENOME RESEARCH

the robustness of our definitions for lanes as im- age features. The gel of Figure 3A contains an artifact that makes manual tracking nearly im- possible. This necessitates a combination of hori- zontal and vertical lane behavior models, de- scribed in Methods.

Second, we have tested the software's perfor- mance statistically against ABI software as well as human retracking. A comparison of Getlanes, the software supplied by ABI and human tracking is shown in the Tables 1 and 2, prepared using pro- duction data from our lab. Table 2 presents the results of Table 1 normalized by the correspond- ing manual results. In preparing these statistics, ABI lane-tracking software placed more than one lane on a single sample in some instances. This has skewed our statistics in favor of ABI in some cases. To partially correct for this, ABI results were not allowed to exceed the corresponding human result used for normalization. Obviously, double tracking a single sample can create major problems in any subsequent data processing.

To obtain summary statistics, we subjected the data to our normal laboratory protocols for data processing. Data (tracked with either ABI or Getlanes software or manually) for each lane was extracted and processed using the ABI analysis software for both. Base calls were performed us- ing the program PHRED (P. Green, unpubl.). Each trace was evaluated for quality and removed entirely or trimmed based on measures of peak to highest noncalled peak values and peak to shoul- der ratios. Regardless of trace quality, trimming was not allowed to extend past 400 bases. Se- quences were screened for sequencing and clon- ing vector (10-12% of the sequences are entirely vector) and removed or clipped accordingly. The percentage success shown in Table 1, then, is the number of vector-free, high-quality reads divided by the total number of reads processed (number of lanes loaded) for ABI-tracked data versus Get- lanes-tracked data, both without manual inter- vention. The mean number of bases per success- ful trace is the mean clipped read length (clipped for vector and for low-quality data) for the suc- cessful reads. This method provides a measure of the ability of the tracking to successfully follow the lane throughout the length of the gel. The mean bases per lane loaded is the sum of the quality, vector-clipped read lengths divided by the total number of reads processed (total num- ber of lanes loaded). Finally, the number of en- tered bases is the sum of the quality, vector- clipped read lengths of the successful reads.

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

2.6

2.4

2.2

2

1.8

1.6

t .4 50

x t O c

. . . . . . . . . . . . . . . . . . . .

66 70 75 80 05 gO g5 1 O0 I'1 orizo.tal I~dex

Figure 4 This plot shows the effect of the artifact in the lower left of the gel from Fig. 1. The artifact obscures the peak result- ing from the true lane at index 81. The peak resulting from the true lane becomes the shoulder on the left of the peak resulting from the artifact at index 83. This effect is overcome by the use of horizontal spacing models.

Judging from the successful trace percentage in Table 1, the sample of gels used here is below average in quality, and Getlanes is successful in salvaging data. The statistics demonst ra te that the software performs approximate ly as well as a human , and significantly better t han ABI. In six of the 20 cases summar i zed above, Get lanes tracking yielded >2000 more entered bases than ABI, and in one instance >10,000 more. In four cases where ABI tracking yielded more bases, only once did ABI successfully track an addi t ional sample, indicat ing that it tracks the same sample with more than one lane when it fails to locate an expected number of lanes. In these cases, using the h u m a n result as a correction, the number of

LANE TRACKING SOFTWARE

bases entered using ABI tracking never ex- ceeded the Getlanes result by 1000 bases (on a 64-lane gel). In five cases, the number of bases entered using Getlanes tracking ex- ceeded the h u m a n result.

M13 subclones for the data summa- rized above came from bo th h u m a n ex- pressed sequence tags, p lasmid artificial chromosomes, and C. elegans cosmids. In most cases, sample lanes are tracked very similarly by humans and Getlanes. Human tracking does not yield major gains in the number of bases called and entered into our cosmid sequence data bases. Thus, data tracked by h u m a n s and missed by Getlanes are often of poor quali ty and cannot be in- cluded in assemblies of larger cont iguous sequences.

Finally, we have tested Getlanes by full- scale i n t r o d u c t i o n in to our p r o d u c t i o n groups. Although transferring the gel files from the ABI Macintosh to the UNIX sys- tem and back again is cumbersome, it en- ables us to evaluate the performance of Get- lanes prior to the complet ion of the down- s t r e a m p r o c e s s i n g . N o n e t h e l e s s , t h e performance of Getlanes' has dramatically reduced t ime spent m a n u a l l y retracking

sample lanes as documented below, despite these unnecessary data transfers. It is est imated that the use of this software in our group has reduced the number of people performing this task by up to 60% and t ime spent per gel by up to 80%.

DISCUSSION

The lane tracking approach described here was designed as a useful compromise between speed and p e r f o r m a n c e , and r u n t imes on a Sun SPARC5 workstat ion for 388 Channel ABI 373A gels conta in ing 6000 rows or more are - 1 - 2 min. The m e m o r y required is -14 megabytes, but up to

Table 1. Raw Performance Statistics

Success Mean bases Mean bases (%) per successful trace per lane loaded Entered bases

ABI 57 287.5 203.4 10743 Getlanes 69 312.55 241.1 12923

Statistics shown compiled over 19 standard 64-lane production gels. Entries are averaged over the 19 gels.

GENOME RESEARCH J 1 1 1 3

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

COOPER ET AL.

Table 2. Normalized Performance Statistics

Success Mean bases Mean bases (%) per successful trace per lane loaded Entered bases

ABI 81 95 84 81 Getla nes 100 100 99 98

Statistics normalized by corresponding results of manual tracking. Compiled over 19 standard 64-lane pro- duction gels. Entries are averaged over the 19 gels, then normalized by the corresponding human average.

two-thirds of this may be the gel image data, and could be compressed by an order of magnitude if needed (the compression would merely store in- tensity information as it is laterused: multiple rows summed together so no useful information would be lost). The brightness technique, though simplistic, provides great flexibility in the adap- tation of our models to the data and was found to be significantly better on gels with poor signal strength, though poor, at times, in the presence of artifacts. The correlation approach was more effective when lanes merged and in the presence of certain artifacts. Its pitfall is the use of the four- column neighborhood that implicitly assumes that lane and interlane features can be distin- guished using correlation over any four columns. This may not be true for unusually large gaps between lanes or wide lanes. The choice of a four- column neighborhood yielded the best overall performance for our data, and the two ap- proaches to tracking complement each other very well. Other more sophisticated approaches were considered and sometimes tested, but deemed unnecessary given the success of the cur- rent version.

The most common error made by the soft- ware, although relatively infrequent, is to miss one of the outermost lanes of the gel. It is a straightforward error to correct in general, be- cause additional lanes are placed to the right side of the tracked lanes, so they do not interfere with the tracking of correctly identified lanes. This er- ror is due to stringent requirements of peaks that are labeled as lanes in the starting window. These requirements are used to avoid mistaking inten- sity maxima resulting from artifacts or back- ground fluctuations for lanes. The performance of the software on gels with low signal, missing lanes, or curving lanes is satisfactory in large part because of the definitions of a lane as a global image feature, and the behavior models that are used to track each lane, as described earlier.

Tolerances of the spacing models vary with the tracking techniques to best exploit the differ- ent profiles, though the overall structure is the same. By tracking boundaries closely, it is hoped that in extracted one-dimensional traces, signal strength will be maximized, as the horizontal in- tegrations between the boundaries will span the full lane width. Two hundred locations of lane boundaries for each lane are recorded, and sum- mary statistics, as well as compressed-in-time tagged image file format (TIFF) images are pro- duced. Code is also available to extract traces down the tracked lanes using a horizontal inte- gration over a user-specified or automatically computed lane width moving down the lanes. The program's use of command line flags to con- trol its options is very flexible for interactive use and allows its easy integration into scripts.

Future versions of the software could, if nec- essary, use larger scale modeling of gel features, one- and two-dimensional image enhancement, reduced memory requirements, and dynamic programming. We also plan to make it compat- ible with other sequencing technologies as well as the aforementioned complete UNIX-based gel tracking and analysis interface and package cur- rently in development.

Conclusion

The development of this software has addressed a significant obstacle to automatic high-through- put, reproducible sequence determination. The software is routinely used by our production se- quencing groups. The results of its use on 40-50 gel images per day for several months indicate that it improves upon vendor-supplied software markedly, and approximates human tracking performance. The further automation of lane tracking and gel file management has signifi- cantly reduced the time invested by our produc- tion groups in this tedious task. It is hoped that

1 1 14 @ GENOME RESEARCH

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

our c o m p l e t i o n of a r o b u s t U N I X - b a s e d sof tware

p a c k a g e to t rack lanes , ex t r ac t a n d p r e p r o c e s s

traces, a n d m a k e base calls wil l b o t h inc rease t h e

t h r o u g h p u t a n d i m p r o v e t h e s e q u e n c e q u a l i t y of

large-scale DNA s e q u e n c i n g pro jec ts .

METHODS The algorithm's structure is depicted in Figure 4, and de- tailed below. For additional details, the source code is available as noted at the conclusion of this paper. First the image data is loaded and preprocessed to form a composite image. From the composite image, a starting point is se- lected, and horizontal tracking profiles are generated by filtering the vertical sum of 40-50 rows of the image (de- scribed in more detail below). Using the starting tracking profile, global characteristics of the gel are identified. With these global characteristics and lane behavior models, tracking of lanes and lane boundaries proceeds iteratively over the whole image.

Image Preprocessing The ABI gel file contains four filter images, produced by filtering the scan data with four bandpass filters tuned to the emission spectra of the four fluorescent dye primers used. The four filter images are loaded and summed to form a brightness image. Several image-processing steps are performed to facilitate subsequent tracking. The first is a crude background subtraction. After determining the maximum and minimum intensities of the image, the minimum value is subtracted from the entire image, and the image is normalized to a maximum intensity of 512. Then, a histogram equalization is performed to the nor- malized image, to increase its contrast and dynamic range (Gonzales and Woods 1992).

A starting point for tracking is then selected from a sample of gel regions by comparison of signal strength and horizontal lane spacing. Greater signal strength improves the detection of lanes and lane boundaries. Likewise, regu- lar spacing increases the effectiveness of the lane behavior models applied later and facilitates the determination of characteristics of the gel as a whole, including missing lanes, horizontal boundaries of the lane data, and the number of lanes in the image. Algorithms that attempt to process disjoint portions of the gel without these global parameters are unable to track lanes in more difficult re- gions of the gel (Blanchard 1993).

In each of three regions, a horizontal gel profile is formed by vertically integrating over a fixed window with a size of between 30 and 50 rows. The window size used by these algorithms in the iterations described later is com- puted as a function of the total number of rows in the gel image. In each of these profiles, maxima and minima are located, and signal strength is compared by computing the mean peak intensity. The standard deviation of the inter- maxima widths is used to evaluate the regularity of the horizontal spacing of the lanes. The ratio of the mean peak height to the standard deviation of the intermaxima dis- tance is maximized to select the starting region.

Next, the starting window intensity histogram is computed and used to identify the horizontal boundaries

LANE TRACKING SOFTWARE

of the portion of the image containing the lanes. The modes of this histogram tend to be background intensities. Thus, the maximum background intensity is estimated by comparison of the incremental integral of the histogram and the total integral, as in Equation 1.

max x)} 0 ~ I < 5 1 2 i=0 0 ~ i < 5 1 2

The subscript i represents intensity, between 0 and 511, and xi represents the number of pixels in the starting win- dow with intensity i. Thus the xi comprise the histogram of the starting window.

This equation is a thresholding of the gel image, using a threshold computed by approximating the unnormal- ized cumulative distribution function of the intensity with the integral of the histogram. In each row of the starting window, the outermost columns in which the intensity equals or exceeds XB as computed in Equation 1 are found. Of these, the outermost columns are marked as horizontal boundaries for the area containing the lane data in the starting window. These boundaries are adjusted according to the lateral movements of the outermost lanes through- out the gel image.

The number of lanes in the image, as well as missing lanes, are identified and compared with the expected number, determined according to ABI machine type. If the expected number of lanes is not found, a compensatory number of "artificial" lanes are added on the right side of the image to be available for subsequent manual tracking. Alternatively, the positions of the leftmost and rightmost lanes (horizontal boundaries for lane data), the number of lanes, and starting region can be specified explicitly in the command line.

The software then iterates in both directions verti- cally to cover the entire image. At each iteration, the soft- ware locates and tracks lanes over a 50% redundant win- dow of the gel consisting of typically 30-50 rows. After one or more horizontal gel profiles are used to locate and de- fine the lanes, lane behavior models are used to fine-tune the tracking, and to work through difficult portions of the images.

Tracking Techniques The software uses multiple tracking techniques. The first approach, which has proven the most robust in practice, is the brightness-based approach. An intensity-based profile is formed by vertical integration over the current window of the brightness image. This profile is filtered for peak enhancement, and a first-difference approximation to the gradient is computed to identify maxima, which are marked as lane locations. The output of the peak- enhancement filter, at horizontal index i, is given in Equa- tion 2 for input profile f a row vector with elements in- dexed from 1 to 388, as the ABI channels, and ~fthe mean of the elements of that row vector:

f~-l. ~. f~+l h i - ix f 2 (2)

Location of gradient extrema allows us to determine lane boundaries. Gradient maxima are labeled as left lane boundaries, and gradient minima are labeled as right lane

GENOME RESEARCH ~ 1 1 1 5

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

COOPER ET AL.

M A I N P R O G R A M > I T E R A T I O N Lane Behavior Models

I I...oad Filter Images Composite Image

i

i (Identify Gel Charaderistics )

T

(Tr~k I.~nes In Starting W i n @ i I

Iterate To Cover Gd Above Starting Window) - -

f lterate To Cover Gel Below Startir~ Window) _ _

Lorm Vertically Integrated Ga~ file Over C~rent Windowfl

I Locate Lanes (Maxima) And Edi~ Locations With Previous Lane / Locations ~ fl

i

T t e Lane Boundaries & Edit Previ~ i Current !_~-ae Locations & dary l_~tiom

(Apply Lane Behavior Mod~ , ~Iterativdy f l

(Output TIFF Images & Summary Statistics ) i

' (~Slide Window VerticaUy To Next i i \Location

Extrad Traces Down Tracked

The lane behavior models are applied in two steps. In the first step, each lane lo-

~,. cation is analyzed individually for con- sistency with lane locations in the pre- vious iteration. The lane locations are const ra ined to lie be tween the corre- sponding lane boundaries tracked in the

, previous i terat ion. This res t r ic t ion is 1 based on empirical observations that the I max imum lateral velocity of a lane is ap- / proximately one pixel per 20 rows and

that typical lane widths are three to four pixels. This underlies the concept of lane

j m o m e n t u m , that the lateral movements of lanes occur in consistent directions in localized areas of the gel image. W h e n ,!

/ no peak is found wi thin a lane's bound- ,/ aries from the previous i teration, the

/ software estimates its location based on m o m e n t u m observed in the two previ- ous iterations. Lane boundaries must be within an absolute distance of the lane and their previous locations, or they are estimated similarly. This approach pre- cludes mislabeling maxima or gradient extrema due to background fluctuations, gel artifacts, or even neighboring lanes as a particular lane or lane boundary.

In the second step, the models applied reflect collec- tive characteristics of the lane locations in any horizontal window of the gel. After the current lane and lane bound- ary locations are established as above, first-order statistics are computed by which outliers in terms of lane width or interlane width are identified. These outliers are processed in an iterative routine to correct tracking errors similar to those above. The vertical models used are the same, but here are applied to the lane and interlane widths rather than to the lane and lane boundary locations. The hori- zontal models used impose regularity on the lane and in- terlane widths across the gel. An estimate for lane location or boundaries for statistical outliers is computed by aver- aging estimates based on the horizontal and vertical be- havior models. First-order statistics are recomputed and the iteration continues until the spacing satisfies the con- straints of the model, or until a m a x i m u m number of it- erations is reached, when horizontal lane spacing is un- usually irregular. Generally this second step efficiently fine-tunes the tracking locations established in the first step, and is no more than a quality check.

Figure 5 Flowcharts depicting the stucture of Getlanesl.0. The flow- chart on the right details the two iterations in the flow chart on the left.

boundaries . The a lgor i thm then applies models of ex- pected lane behavior, as detailed below, and iterates to cover the entire image.

The second approach calculates in tercolumn correla- t ion coefficients in the four filter images independent ly over four co lumn neighborhoods, in which the current co lumn is the third co lumn from the left. The columns used in these computat ions span the current window (30- 50 rows). Portions of the four images being used are stored in cache memory to reduce the memory needed for access to the large color images. The correlation coefficients are combined to create an alternative horizontal gel profile. Let the quant i ty xi O Xj equal the correlation coefficient of the vectors xi and Xj. In Equation 3 below, X i denotes the co lumn of the current window at horizontal index i. The value of the profile at horizontal index i is computed as follows:

Xi = ~ [(Xi-2 0 ~/)+ (X/-1 ~ ~i) Filter Images

+ (Xi-20 Xi+I) + (Xi-1 ~ Xi+l)] (3)

This profile is made nonnegat ive by subtraction of its m i n i m u m value, then normalized and scaled to a maxi- m u m value equal to the mean of the maxima of the bright- ness data. This profile is processed similarly to the previ- ously men t ioned brightness approach, and either me thod can be used exclusively th roughou t the gel via a c o m m a n d line option.

The default approach is a combina t ion of the two. The brightness profile is computed initially, and in cases where it fails to reveal expected lanes or lane boundaries as required by the models below, the correlation profile is checked for improved feature definition. Brightness is the initial technique applied for its speed and generally reli- able performance.

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health grant HG00956. The authors would also like to ac- knowledge the review of the code by Darren Platt and Asif Chinwalla , the readings of this manuscr ip t by Elaine Mardis, and the direction of Richard Wilson and Robert Waterston. The informatics and, above all, p roduct ion groups at the Genome Sequencing Center at the Washing- ton University School of Medicine are especially thanked.

The publication costs of this article were defrayed in part by payment of page charges. This article must there-

1 1 16 @ GENOME RESEARCH

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from

fore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

NOTE The following sites on the World Wide Web are available: to obtain CAP and related documenta t ion : h t tp : / / www.cs.mu.oz.au/appletalk/cap.page; to obtain Getlanes and related documentation (note: distribution or use of Getlanes and all software listed here for commercial pur- poses without permission of the Genome Sequencing Cen- ter of the Washington University School of Medicine is s t r ic t ly p roh ib i ted) : h t t p : / / g e n o m e . w u s t l . e d u / g s c / gschmpg.html; to obtain the Getlanes Daemon (Perl Script for automatic running of Getlanes) and related documen- tation: http://genome.wustl.edu/gsc/gschmpg.html; and to obtain the TIFF software library and related documen- tat ion: ht tp: / /www.sgi .com/Fun/ t i f f / t i f f -v3.4beta018/ html/index.html.

REFERENCES

Ansorge, W., B. Sproat, J. Stegemann, C. Schwager, and M. Zenke. 1987. Automated DNA sequencing: Ultrasensitive detection of fluorescent bands during electrophoresis. Nucleic Acids Res. 15" 4593-4602.

Berno, A.J. 1987. A graph theoretic approach to the analysis of DNA sequencing data. Genome Res. 6: 80-91.

Blanchard, A. 1993. "Sequence-specific effects on the incorporation of dideoxynucleotides by a modified T7 polymerase." Ph.D. thesis, California Institute of Technology, Pasadena, CA.

Brumbaugh, J.A., L.R. Middendorf, D.L. Grone, and J.L. Ruth. 1988. Continuous on-line DNA sequencing using oligodeoxynucleotide primers with multiple fluorophores. Proc. Natl. Acad. Sci. 85: 5610-5614.

Giddings, M.C., R.L. Brumley, M. Haker, and L.M. Smith. 1993. An adaptive, object-oriented strategy for base-calling in DNA sequence analysis. Nucleic Acids Res. 21" 4530-4540.

Golden, J.B., D. Torgersen, and C. Tibbetts. 1993. Pattern recognition for automated DNA sequencing: I. On-line signal conditioning and feature extraction for basecalling. Proceedings of the First International Conference on Intelligent Systems for Molecular Biology. AAAI Press.

Gonzales, R.C and R. E. Woods. 1992. Digital image processing. Addison-Wesley, Reading, MA.

Prober, J.M., G.L. Trainor, R.J. Dam, F.W. Hobbs, C.W. Robertson, R.J. Zagursky, A.J. Cocuzza, M.A. Jensen, and K. Baumeister. 1987. A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238" 336-341.

Smith, L.M., J.Z. Sanders, R.J. Kaiser, P. Hughes, C. Dodd, C.R. Connell, C. Heiner, S.B.H. Kent, and L.E. Hood.

LANE TRACKING SOFTWARE

1986. Fluorescence detection in automated DNA sequence analysis. Nature 321" 674-679.

Watson, J.D. 1985. The Human Genome Project: Past, present, and future. Science 248: 44-48.

Wilson, R., R. Ainscough, K. Anderson, C. Baynes, M. Berks, J. Bonfield, J. Burton, M. Connell, T. Copsey, J. Cooper, A. Coulson, M. Craxton, S. Dear, Z. Du, R. Durbin, A. Favello, L. Fulton, A. Gardner, P. Green, T. Hawkins, L. Hillier, M. Jier, L. Johnston, M. Jones, J. Kershaw, J. Kirsten, N. Laister, P. Latreille, J. Lightning, C. Lloyd, A. McMurray, B. Mortimore, M. O'Callaghan, J. Parsons, C. Percy, L. Rifken, A. Roopra, D. Saunders, R. Shownkeen, N. Smaldon, A. Smith, E. Sonnhammer, R. Staden, J. Sulston, J. Thierry-Mieg, K. Thomas, M. Vaudin, K. Vaughan, R. Waterston, A. Watson, L. Weinstock, J. Wilkinson-Sproat, and P. Wohldman. 1994. The C. elegans genome project: Contiguous nucleotide sequence of over two megabases from chromosome III. Nature 368: 32-38.

Received January 5, 1996; accepted in revised form August 21, 1996.

GENOME RESEARCH ~ 1 1 17

Cold Spring Harbor Laboratory Press on January 14, 2009 - Published by genome.cshlp.orgDownloaded from