Bioinformatics of the sugarcane EST project



Marília D.V. Braga, Zanoni Dias, Tzy Li Lin, João Meidanis, José A.A. Quitzau, Felipe R. da Silva, Guilherme P. Telles

Bioinformatics Lab, IC-UNICAMP
CP 6176, CEP 13083-970, Campinas-SP, Brazil

Abstract

The Sugarcane EST Project (SUCEST) produced 291,904 ESTs in a consortium-based fashion. In this project, the Bioinformatics Lab created a web site that served as a "meeting point" of a network of 76 sequencing and data mining laboratories, receiving, processing, analyzing, and providing services to help explore the sequence data. In this paper we describe the information pathway implemented by the Bioinformatics Lab to support this project, including a brief explanation of the clustering procedure, which resulted in 43,141 clusters.

1 Introduction

The application of expressed sequence tag (EST) technology has proven to be an effective tool for gene discovery [1], mapping [2], and generation of gene expression profiles [3].

EST projects are usually conducted by a single laboratory, which makes the libraries, isolates clones, sequences them, analyzes the data and submits it to GenBank. However, the Sugarcane EST Project (SUCEST) involved 24 sequencing labs, a bioinformatics lab, a coordinating lab, a group for international relationship, and 50 data mining groups scattered all over Brazil working in cooperation. A newly formed Brazilian bioinformatics group has recently joined the project, intending to learn from previous experience. Starting early in 1999, the project generated 291,904 sequences from 260,352 clones of 37 different libraries in 15 months.

Genomic research in Brazil has been consortium-based since its first project, the complete genome of the phytopathogenic bacterium Xylella fastidiosa [4], conducted by the ONSA network. A consortium-based genome project provides a larger number of researchers, technicians and sequencing machines, but demands a much more organized data flow. In SUCEST, the Bioinformatics Lab (LBI) was responsible for receiving data from a network of wet labs, assessing their quality, storing, clustering, and providing many other services on the data. These tasks are described in some detail here, and quantitative figures from the project are given.

2 Computational Systems

For a short time in the beginning of the project, the web site was hosted by a personal computer with 128 MB of memory running the Linux operating system (Red Hat 6.2 distribution). Presently the site resides on a Compaq AlphaServer ES40 with two 667 MHz Alpha processors, 8 GB of RAM, and 384 GB of disk, running OSF-1 version 4.0G. However, the bulk of the project was executed on a Compaq AlphaServer DS20 with two 500 MHz Alpha processors, 4 GB of RAM, and 144 GB of disk, running OSF-1 version 4.0F. This is the system on which most of the tools were developed and we will concentrate on it for the rest of the paper.

The web server was Apache [5] version 1.3.9. CGIs and programs were written in Perl [6] version 5.005 and PHP [7] version 3.0.12. The database management system was MySQL [8] version 3.22.26a.

The input data consisted of data received through web forms, including ABI 377 chromatograms, and HTML formatted data mining reports.

The base calling and sequence extraction programs used were phred [9] version 0.980904.e and phd2fasta [9] version 0.990622.d. The comparison programs used were cross_match [9] version 0.990319 and blastall [10, 11] version 2.0.10. Assembly programs were phrap [9] version 0.990319 and CAP3 [12]. Off-the-shelf scripts were used to provide keyword search in data mining reports [13], database administration and other minor tasks. The software used is either free for academic purposes, or was developed by our team.

3 Methods and Results

From a computational viewpoint, SUCEST may be seen as a big data repository and as a provider of Internet-based services for a community of different types of users. Figure 1 shows the major relationships among users, services, data and programs in the project.

Users are members of sequencing labs, which submit chromatograms from clone libraries; members of data mining labs, which perform searches on the project database and publicize their results in data mining reports; and members of the project coordination, which monitor the project status and control plate distribution and validation. Users interact with data through services that add, retrieve, and update the data repositories.

Data include sugarcane ESTs, information about the participants of the project, data mining reports, control data, summaries, and output from programs that perform automated searches in databases, clustering, and categorization of clusters.

Below is a description of what users, data, services and programs in SUCEST are, and how they interact.

[Diagram omitted. Its boxes are grouped under Users, Services, Programs, and Data Repositories. Box labels: Sequencing Labs, Data Mining Labs, Coordination, Sequences Submission, Reports, Objects, Keyword Search, BLAST, Data Mining Reports, Plate Control, BLAST Updates, BLAST nt/nr/dbEST, Comparative Genomics, Clustering, Categorization, Search, Submission Directories, Clustering Directories, BLAST Directories, Data Mining Directories, Project DBMS.]

Figure 1: Major relations among users, services, data, and programs in SUCEST. Arrows indicate the flow of information.

3.1 Objects

SUCEST data is stored in two different kinds of repositories: operating system directories and a relational database. The directories hold biological sequence files, results from BLAST and cross_match searches in biological databases, and data mining reports. Biological sequence files include chromatograms, files in fasta format, quality files, and files generated by clustering, categorization, and comparative genomics procedures. The project uses only one relational database, with several interconnected tables that store other biological and management data, such as information on libraries, sequencing plates, labs, and their members. The database also points to data in directories. The major entities (objects) in our database are described below, where we also introduce quantitative figures and details from the pipeline.

Laboratories. There are 78 laboratories in SUCEST that belong to one or more of five groups: "DNA Coordination", "Bioinformatics", "Data Mining", "Sequencing", and "International Cooperation". The services and data that a member of a lab can access depend on the group to which the lab belongs. One of the members in a lab is its head, and receives notifications of some of the activities performed by the lab members. A two-letter code identifies each lab.

Members. A SUCEST member is a person that belongs to at least one lab. Several members belong to a sequencing lab and also to a data mining lab. Data about members include name, the labs they belong to, e-mail address, phone numbers, and the login and password that grant access to the authorized services. SUCEST had 256 members on March 25, 2001.

Libraries. An EST results from the sequencing reaction of one of the ends of a clone. In turn, a clone is a DNA that was prepared from a particular tissue of the plant in specific conditions. ESTs in SUCEST came from 37 different libraries, prepared from different sugarcane tissues under different conditions [14]. Library name, description, and vector employed are recorded for the libraries. A code composed of two letters and one number was assigned to each library. The letters indicate the tissue from which the library originated, and a consecutive number was assigned to every new library deriving from the same tissue. For example, "LR1" coded for leaf roll with long inserts and "LR2" coded for leaf roll with small inserts. Libraries have three possible statuses: "TEST", for libraries under validation; "START", for libraries released for sequencing; and "STOP", when the coordination decides it is not worthwhile to keep sequencing a distributed library. Of the 37 libraries prepared for the project, 32 were started and 5 were abandoned after the TEST phase. Those not started either produced too much redundancy or very small reads.

Plates. Clones are organized in 96-well plates in SUCEST. Every plate holds clones from the same library in an 8 x 12 grid. Sequencing is done for a whole plate, and this data is sent to LBI for processing and storage. Data for a plate include the library it came from and the lab that is authorized to send data from this plate. A plate has a three-digit identification tag, except for control plates (described below), which have the letter "C" and two digits. The SUCEST database records data from 2,771 different plates.

Reads. A read is the same as an EST. Reads are extracted (using the program phred) from chromatograms submitted by the sequencing labs, and screened for vectors (using the program cross_match). All reads are stored in directories as chromatogram files, and also as a pair of text files holding the sequence and its quality. The following attributes are stored in the database for every read:

• the plate and the position in the plate where it came from,
• information about the submission process, such as the date and time it happened,
• the number of vector and non-vector bases with phred quality equal to or higher than 20,
• the number of vector and non-vector bases with phred quality lower than 20,
• the start and end positions of every vector sequence identified in the read, and
• whether it has trace data.

Every read has a name that is the concatenation of its lab name, library name, plate name, position in plate, and direction. The string SCACAD1001A01.g, for example, is the name for the 5' read of the clone in well A01 of plate 001 of library AD1, sequenced by lab AC. (Note: 5' reads are identified by the .g suffix, while 3' reads use the .b suffix.) Every position in the plate is identified by its row and column names, with row names ranging from A to H, and column names from 01 to 12.

Preparation Sheet. Before a lab can sequence and submit a plate, it must submit information about the plate preparation process. There are records in the database for every position where the bacteria did not grow and for the positions from which it was not possible to obtain DNA. Every position marked with a problem will correspond to a sequence without information relevant to the project.

Control Plates. A scheme for plate validation is used in the project, as follows. For every set of 12 plates a control plate is built with the 8th column of each. Control plates are sequenced, and information about the reads that match in control and in controlled plates is stored in the database. The idea is to identify tracking and naming errors in the sequencing process. Plates with problems may be fixed and resubmitted by the lab that produced them.

Clusters. SUCEST reads are grouped by a clustering procedure that is described below. This clustering procedure creates sets of aligned reads that we call clusters. In our database we store the reads that are part of each cluster. Therefore, in addition to being a set of reads, a cluster has an alignment and a consensus sequence. Alignments, consensus sequences, and quality files are stored in cluster directories. A cluster also has a name, which is equal to the name of the oldest read in it.

3.2 Services and Programs

Data come in and out of SUCEST's data repository through a set of services. These services are provided by web pages hosted at LBI. Data is also generated within the LBI by programs that are executed either automatically or manually. Brief descriptions of these services and programs are presented below, starting with data retrieval, giving a general overview of how the SUCEST web site is organized and how it works.

Data Retrieval. Data is retrieved from SUCEST in units called "objects". SUCEST objects are the same as the data entities described above: Library, Reads, Plates, Control Plate, Member, Laboratory, and Cluster. Each object has its own individual web page, containing information about itself and links to any other object, service, and report directly related to it. Starting from a lab or library object it is possible to reach the web page of any other object.

Some objects point to pages that include data extracted from the project's directory structure. For instance, one can visualize reads and base qualities in several versions: immediately after submission (but before screening), after screening (but before trimming), and after trimming (see Clustering below). For clusters, it is possible to visualize the reads in a cluster and their alignments, including the consensus.

An object search service was created to allow direct access to any given object. Given the code and the type of object, the service delivers its page. For object type Member, it is possible to search by name, e-mail, department, city, and institution.

Besides objects, some reports that summarize data are also available for the project:

• a summary of submitted reads gives totals per lab or per library of submitted, payable and clusterizable reads, and
• a summary of control plates gives totals of plates accepted and rejected by control.

SQL-skilled users may take advantage of a service that allows generic queries to the database. Queries can be typed in a web form and the results are returned in tabular fashion. Entity-relationship diagrams and table descriptions for our database are available to help users in this task.

Sequence Submission. Sequence submissions are done by sequencing laboratories only. The process requires accessing the project's web site using a valid login/password pair, and uploading a set of 96 chromatograms, that is, exactly one plate. When an upload finishes, certain prerequisites are verified: all chromatograms must belong to the same plate, the lab that is trying to submit a plate must be the one authorized to do so, the preparation sheet for that plate must have already been submitted, and the reads must be in accordance with the naming conventions.

If the prerequisites are satisfied, programs phred and phd2fasta are used to extract the sequences and their qualities in fasta format from the chromatograms. Program cross_match is used to mask vector sequences in the reads. These steps take a few minutes, and this time was essentially constant during the entire project, because the analysis done upon submission does not depend on the other reads present in the repositories.

After the submission analysis, a report that summarizes the process and the sequences received is presented to the submitter. At this point the submitter is asked whether to confirm the submission. If the submission is confirmed, the database is updated and, if there is an older version of the plate, it is replaced. Directories are updated as well. If the submission is not confirmed (e.g., if the submitter is not happy with the quality assessment), the submission is discarded.

The path followed by a read in LBI's pipeline, starting from the submission, is shown in Figure 2. The submission procedure corresponds to the part in the figure starting at "Zip file" and extending through the top line. Other steps in the diagram are performed by programs described in the sequel.

[Diagram omitted. Pipeline stages: Zip file, Validation, Phred, Phd2Fasta, Vector screen, Report Generator, Trimmer, Contamination Filtering, Plate Control, Clustering, BLASTX (nr, nt, est), Categorization, Comparative Genomics. Data passed between stages: chromatograms, phd file, screen output, read sequence and quality files, consensus sequence and quality files. Repositories updated: submission directories, clustering directories, BLAST directories, DBMS.]

Figure 2: The operations over a read in the SUCEST pipeline. Black arrows linking boxes indicate data that flow from one stage to the next, and white-headed arrows going out of boxes indicate data repository updates.

Clustering. Clustering of ESTs is important to reduce the amount of sequence data that miners have to look at, and to organize the reads in a less redundant set. In SUCEST, clustering had one additional motivation: the need for redundancy estimation in the libraries.

Two pivotal decisions were made early on. One was that each cluster should reflect a transcript, rather than a gene, an allele, or other biological entity. The second was that a cluster consists not only of a set of reads, but also of an alignment of these reads.

In this context, our first scheme was to group similar transcripts and to produce consensus sequences using the assembly program phrap. This strategy was sufficient in the early stages of the project, but as data accumulated a series of problems forced us to change the scheme, as described below.

To minimize artifacts, reads were trimmed before clustering. This trimming procedure started with vector masking using program cross_match. Then some of the poly-A, vector and adaptor regions were removed. A quality trimmer was also applied, removing bases from the ends of the sequence one by one until there were at least 12 bases with phred quality above 15 in a window of 20 bases at the end. Reads were also checked for contamination against Xylella fastidiosa, Xanthomonas citri, Escherichia coli and other potential contaminators in the labs that produced the libraries. BLAST was used to compare the reads and the potential contaminators, and if a match of at least 100 bases and more than 90% identity occurred, the read was marked as a probable contamination. However, marked reads were kept in clustering and subsequent analyses, to allow data miners to decide by themselves whether a specific read was contaminated or not.

Trimmed reads were assembled using phrap with quality and with the following arguments, which made it very stringent:

    -penalty -15 -bandwidth 14 -minscore 100 -shatter_greedy

Every contig and singlet produced by phrap was taken as a cluster. As new plates came in, our clusters were automatically updated by a program. This daemon updated the database, directories and BLAST results for every cluster that changed. Initially clustering was performed every day, but as the set of sequences grew, the updates became sparser, running once a week. In the final phases of the project, clustering would typically occupy an entire CPU for about 20 hours.

The last assembly done with phrap included 261,609 trimmed reads, and produced 81,223 clusters. However, remarks by several members of the project that the total number of clusters in the database was unreasonably large, that many clusters were malformed, and that some clusters looked like they could be combined, led to changes, described in detail by Telles and da Silva [15].

The new scheme was based on careful testing and evaluation, and consisted of a more elaborate trimming procedure, and of the use of the CAP3 [12] assembler, the same tool used to produce TIGR's gene indices [16]. Trimming in this new procedure included ribosomal RNA removal, a comprehensive removal of poly-A, poly-T, vector and adaptor regions, and improved low-quality-end trimming. CAP3 was fed with 237,954 reads and their qualities, and produced 43,141 clusters.

Both versions of the clustering are presently accessible through the project web site, with data from both methods available for most services.

Keyword Search. Keyword search is a service that allows users to search for a set of keywords in the header lines of every sequence in NCBI's nr, nt and dbEST databases [10] that hits any cluster in SUCEST. To perform a query, the user gives a database name (nr, nt or dbEST), a logical expression of keywords (that may include "or" and "and" connectors), and the maximum e-value required (which is an optional parameter and defaults to 1e-5, that is, 10^-5). The service then returns the clusters that have a hit with the required e-value or better and whose subject heading contains words satisfying the logical expression. The resulting list of clusters is ordered by e-value.

A program was created for keeping BLAST results against nr, nt and dbEST up to date for all SUCEST clusters. A BLAST result against a certain database is considered outdated for a SUCEST cluster if the cluster is newer than the result or if the cluster or the database was modified after the last BLAST run. When the program finds outdated BLAST results, it builds a BLAST queue giving priority to older clusters. If the databases are available on different computers, the system is able to improve the processing time, running several BLASTs in parallel, one on each remote server, and taking about 2 or 3 days. Otherwise, it alternates BLAST runs for each BLAST database on the same machine, taking considerably longer.

Subclustering. This service is used to evaluate statistics about subsets of clusters of the clustering, including read frequency by cluster size, total of reads, total of clusters, redundancy, and novelty.

To select the subset of clusters, the user has to indicate the reads that belong to the clusters. Any cluster that contains a read in the selection is included in the evaluation. To indicate the reads, elements from their names should be selected: lab, library, plate, position and direction. More than one element may be selected. Selecting a particular lab, for example, will generate the statistics for the clusters that have at least one read sequenced by that lab.

BLAST Search. A BLAST service allows searches against SUCEST reads, reads in their trimmed version, and cluster consensi. These databases were updated automatically on a daily basis, to incorporate new reads and consensi.

Data Mining Report. Data mining groups submit HTML formatted reports to the SUCEST site, and update them periodically. Users may access reports through an index page that provides access to the reports of every data mining group. A search by keywords is also available. When a report archive is uploaded, a service takes care of unpacking the files, updating the index page and the searching index.

Information about reports is also kept in SUCEST's database, including the project's name, summary, members, submission date, and submitter name.

Categorization. SUCEST clusters were categorized, in an attempt to determine their function, and to aggregate information. Thirty categories were built by a team of SUCEST members, and 32,438 example proteins were assigned to the categories. Categorization was done in two steps:

1. Automatic: a BLAST database was constructed, containing example proteins for the categories. A BLAST search was performed against this database having SUCEST clusters as input. A cluster was considered to be in a category A if it hit some example protein in A with e-value better than or equal to 10^-10 and covered 70% or more of the example. A cluster could be in many different categories. This method categorized 36% of the 43,141 clusters.

2. Manual: a web service was built to allow manual annotation when automatic annotation produced ambiguous categorization or produced no categorization at all. Based on BLAST results against nr, SUCEST members were able to establish a direct relation between a cluster and a category. Manual annotation increased significantly the number of categorized clusters and, as of March 20th, 2001, 60.5% of the clusters were categorized.

Comparative Genomics. In an effort to obtain knowledge about sugarcane and its relation to other species, cluster consensi from SUCEST were compared against other organisms. The first organism selected for comparison was the model plant Arabidopsis thaliana. Every cluster consensus was BLASTed against A. thaliana's chromosomes, proteins, and ESTs. Clusters that produced no hit against A. thaliana were also BLASTed against ESTs from Lycopersicon esculentum, Glycine max, Lotus japonicus, Hordeum vulgare, Oryza sativa, Sorghum bicolor, Zea mays, Triticum aestivum, and Medicago truncatula. Results from these searches were inserted in our database, allowing several queries to determine the distribution of these hits per library, per cluster, or grouped according to many other criteria.

Management. These services provide a way for the DNA coordination to input management information into the SUCEST database. This information is used mainly by services that perform checking and summarizing operations. Using the library management services, the DNA coordination modifies in the system the status of any library and assigns plates to sequencing laboratories. Manual plate approval is also possible for the DNA coordination, via a service that displays control and controlled plates showing which cells match in control and controlled plates.

4 Discussion

A key aspect of the project was the deep interaction between the biological labs and the bioinformatics lab. Through discussion lists or over the phone, the users would give suggestions of new services, and quickly point out problems with the services (broken links, bugs, etc.). This daily, intensive interaction was undoubtedly one of the main reasons for the success of the project.

Clustering started early, and had a dramatic impact along the project. Reclustering on a regular basis demanded designing and implementing programs to update databases and BLAST results against nr, nt, and dbEST. It also demanded a lot of CPU time. When another clustering scheme was adopted, the web site had to change to accommodate both versions simultaneously, and to show relations among clusters in different versions. Both bioinformatics and data mining staff experienced some level of overhead to adapt to the changes.

Among the lessons learned in the process we point out two important ones:

Avoid changing systems: during this project we had to change the underlying computing system twice, the first time from a PC to a medium-sized server, and then to a bigger server. These changes caused many problems: for instance, programs that used to work on one system would not work on the other, users had to get used to new URLs, and so on. The migration process proved time-consuming and error-prone. Our advice would be to set up a system that is big enough right from the start, and keep the project there for as long as possible. To minimize the impact of migration, it is important to devise the directory structure in a system-independent way: for instance, put your data in directories that won't collide with system directories, and install your programs in standard places, using execution path variables to assure they will work.
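The advice on system-independent directories can be illustrated with a minimal Python sketch. It is not the code used in SUCEST, whose programs were written in Perl and PHP. The environment variable SUCEST_ROOT, the default path, the subdirectory names, and both function names are assumptions made for illustration only. The idea is that every program derives its data paths from one configurable root and finds external tools through the execution path instead of hard-coded locations.

```python
import os
import shutil
from pathlib import Path


def project_dirs(env=None):
    """Return the data directories, all derived from one configurable root.

    SUCEST_ROOT and the default "/data/sucest" are hypothetical names.
    """
    env = os.environ if env is None else env
    root = Path(env.get("SUCEST_ROOT", "/data/sucest"))
    # Subdirectory names are illustrative, loosely following the
    # repositories named in Figure 1.
    names = ("submission", "clustering", "blast", "data_mining")
    return {name: root / name for name in names}


def find_tool(name):
    """Locate an external program through PATH; None if it is not installed."""
    return shutil.which(name)


if __name__ == "__main__":
    for label, path in project_dirs().items():
        print(label, path)
    print("phred:", find_tool("phred"))
```

In this sketch a migration would only require changing SUCEST_ROOT and PATH on the new machine; the programs themselves would stay untouched.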

Another important piece of advice is to use software that combines many physical disks into one big volume, say, of a few hundred gigabytes. Most vendors provide such software for a small fee.

Keep reference sequences, not cluster lists: in this project, data accumulated at a fast rate and clustering was redone frequently. Some data mining groups had problems trying to keep up with the frequent updates because they maintained lists of relevant clusters. Each time the clustering was redone, some clusters would disappear (merge into larger ones), or the read composition of a cluster would change, requiring a lot of manual labor. Our advice would be to use reference sequences from GenBank or another stable sequence database, which will then be used as queries to retrieve the cluster lists via BLAST. Proceeding in this way, the lists could be reconstructed quickly from the reference sequences, using automated methods.

There are many other programs, not presented here, that contributed to the functionality of SUCEST's web site. Some services and programs have already been disabled, for example, sequence submission and plate control, but others, such as keyword search, BLAST and report submission, are still being used by data mining labs, and will be used by the international community when the data goes public. This will certainly transform the meeting point of the project's community into the meeting point of a wider group, setting new demands for services and data storage.
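The second lesson, rebuilding cluster lists from stable reference sequences, can be sketched as follows. This is a hedged illustration in Python, not the project's implementation. The input is assumed to be tab-separated lines holding a reference identifier, a cluster identifier, and an e-value; that layout is a simplification chosen for the example, not the native output of any BLAST program. The function name, the identifiers, and the 1e-5 default cutoff are likewise hypothetical.

```python
from collections import defaultdict


def rebuild_cluster_lists(hit_lines, max_evalue=1e-5):
    """Map each reference sequence to the clusters it hits, best e-value first.

    hit_lines: iterable of "reference_id<TAB>cluster_id<TAB>evalue" strings
    (an assumed, simplified format).
    """
    best = defaultdict(dict)  # reference id -> {cluster id: best e-value seen}
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        ref_id, cluster_id, evalue_text = line.split("\t")[:3]
        evalue = float(evalue_text)
        if evalue > max_evalue:
            continue  # hit is too weak to count
        previous = best[ref_id].get(cluster_id)
        if previous is None or evalue < previous:
            best[ref_id][cluster_id] = evalue
    # Order each list by e-value, breaking ties by cluster id.
    return {
        ref_id: sorted(clusters, key=lambda c: (clusters[c], c))
        for ref_id, clusters in best.items()
    }


if __name__ == "__main__":
    hits = [
        "refA\tC1\t1e-30",
        "refA\tC2\t1e-3",
        "refA\tC3\t1e-50",
        "refB\tC1\t1e-10",
    ]
    print(rebuild_cluster_lists(hits))
```

Because the reference sequences do not change when the clustering is redone, a data mining group following this approach would rerun the search after each reclustering and regenerate its lists, rather than editing them by hand.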

5 Acknowledgments

This work was supported by FAPESP, CNPq, and COOPERSUCAR.

6 Summary in Portuguese (translated)

The SUCEST project (Sugarcane EST Project) produced 291,904 sugarcane ESTs. In this project, the Bioinformatics Laboratory created the web site that was the "meeting point" of the 76 sequencing and data mining laboratories that took part in the project consortium. The Bioinformatics Laboratory received, processed, and analyzed the data, and made tools available for exploring it. In this article the data, services, and programs implemented by LBI for the project are described, including the clustering procedure that generated 43,141 clusters.

References

[1] Adams, M., Kelley, J., Gocayne, J., et al. (1991) Complementary DNA sequencing: "expressed sequence tags" and the human genome project, Science, 252, 1651-1656.
[2] Schuler, G. (1997) Pieces of the puzzle: Expressed sequence tags and the catalog of human genes, Journal of Molecular Medicine, 75(10), 694-698.

[3] Boguski, M. and Schuler, G. (1995) ESTablishing a human transcript map, Nature Genetics, 10, 369-371.
[4] Simpson, A., Reinach, F., Arruda, P., et al. (2000) The genome sequence of the plant pathogen Xylella fastidiosa, Nature, 406, 151-157.
[5] Apache homepage, www.apache.org.
[6] Perl homepage, www.cpan.org.
[7] PHP homepage, www.php.net.
[8] MySQL homepage, www.mysql.com.
[9] The Phred/Phrap/Consed homepage, www.phrap.org.
[10] The National Center for Biotechnology Information (NCBI) homepage, www.ncbi.nlm.nih.gov.
[11] Altschul, S., Madden, T., Schäffer, A., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Research, 25, 3389-3402.
[12] Huang, X. and Madan, A. (1999) CAP3: A DNA sequence assembly program, Genome Research, 9, 868-877.
[13] Weil, B. and Baron, C. (1997) Drag 'n' Drop CGI: Enhance Your Web Site Without Programming, chapter 12, Addison Wesley.

[14] Vettore, A., da Silva, F., Kemper, E., and Arruda, P. (2001) The libraries that made SUCEST, the Sugarcane EST Project, GMB, this issue.
[15] Telles, G. and da Silva, F. (2001) Trimming and clustering sugarcane ESTs, GMB, this issue.
[16] Quackenbush, J., Liang, F., Holt, I., et al. (2000) The TIGR Gene Indices: reconstruction and representation of expressed gene sequences, Nucleic Acids Research, 28(1), 141-145.
