Decoding the rice genome

Shubha Vij, Vikrant Gupta, Dibyendu Kumar, Ravi Vydianathan, Saurabh Raghuvanshi, Paramjit Khurana, Jitendra P. Khurana, and Akhilesh K. Tyagi*

Summary

Rice cultivation is one of the most important agricultural activities on earth, with nearly 90% of the crop being produced in Asia. Rice belongs to the family of crops that includes wheat, maize and barley, and it supplies more than 50% of the calories consumed by the world population. Its immense economic value and relatively small genome size make it a focal point for scientific investigations, so much so that four whole genome sequence drafts of varying quality have been generated by both public and privately funded ventures. The availability of a complete and high-quality map-based sequence has provided the opportunity to study genome organization and evolution. Most importantly, the order and identity of 37,544 genes of rice have been unraveled. The sequence provides the required ingredients for functional genomics and molecular breeding programs aimed at unraveling intricate cellular processes and improving rice productivity. BioEssays 28:421–432, 2006. © 2006 Wiley Periodicals, Inc.

Introduction

Rice is one of the most important food crops of the world. More

than half of the world population depends on rice as the major

source of calories and proteins. About 840 million people in the

world are undernourished, which includes almost 200 million

children from developing countries (http://www.fao.org/). Rice

production will have to be increased substantially to meet

the demand of the growing world population, especially in

the Asian subcontinent. Rice production has, however,

declined in the last 4 years (http://www.irri.org).(1) This is due to

increasing urbanization leading to shortage of cultivable land

and deteriorating environmental conditions. To meet the

growing demand, a combination of breeding strategies and

molecular biology tools has to be used in synchrony to obtain

varieties that are high yielding and also more resistant to

various abiotic and biotic stresses.(2) Sequencing of the rice

genome was initiated with the aim of using the sequence

information to understand the function of its gene repertoire.

The pioneering work, which laid the foundation for rice genome

sequencing, was initiated in the early 1990s.(3) The work centered

on constructing a linkage map (http://rgp.dna.affrc.go.jp/publicdata/geneticmap2000.index.html),(4)

a YAC (yeast artificial chromosome) based physical map,(5,6) a transcript map(7,8)

and a sequence-ready BAC/PAC (bacterial artificial chromosome/P1-derived

artificial chromosome) physical map.(2,9)

Rice is also amenable to genetic transformation, thereby

providing an ideal crop system for functional genomics.(10)

Moreover, the rice genome shares a syntenic relationship with

other cereal crops like sorghum and maize.(11,12) Amongst the

different cereal crops, rice was chosen as the best representative

genome due to a relatively small estimated genome size of

~430 Mb.(13) This review aims to trace the path of rice genome

sequencing from its initiation to the current status and seeks to

interpret the information obtained from the genome of the first

food crop to be sequenced.

Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110 021, India.

Funding agency: The research work of our group is funded by the Department of Biotechnology, Government of India, New Delhi. SV, VG, DK and VR were supported by research fellowships from CSIR/UGC, Government of India, New Delhi.

*Correspondence to: Akhilesh K. Tyagi, Interdisciplinary Centre for Plant Genomics and Department of Plant Molecular Biology, University of Delhi South Campus, New Delhi 110 021, India.

E-mail: [email protected]

DOI 10.1002/bies.20399

Published online in Wiley InterScience (www.interscience.wiley.com).

BioEssays 28:421–432, © 2006 Wiley Periodicals, Inc.

Genes and genomes

Abbreviations: BAC, Bacterial artificial chromosome; bp, Base pairs; EST, Expressed sequence tag; IRGSP, International Rice Genome Sequencing Project; JAK, Janus kinase; Mb, Million base pairs; MDRs, Mathematically determined reads; MOsDB, MIPS Oryza sativa DataBase; MTP, Minimum tiling path; ORF, Open reading frame; PAC, P1-derived artificial chromosome; QTL, Quantitative trait loci; RePS, Repeat masked phrap with scaffolding; RiceGAAS, Rice Genome Automated Annotation System; SNP, Single nucleotide polymorphism; SSR, Simple sequence repeat; STAT, Signal Transducers and Activators of Transcription; STCs, Sequence tag connectors; TIR-NB-LRR, Toll-Interleukin-Region-Nucleotide-Binding site-Leucine-Rich Repeat; WGS, Whole Genome Shotgun; YAC, Yeast artificial chromosome.

Strategies to sequence whole genomes

In order to sequence a large DNA molecule, it is first broken

into small fragments, which are cloned and sequenced. The

overlapping sequence reads are assembled using computer

software programs into contigs. The quality of the sequence

submitted in the database is variously classified as phase 0, I,

II and III. The initial raw sequence generated is referred to as

phase 0 and the assembled sequence represents phase I.

When the contigs are ordered and oriented, the sequence is

designated as phase II. All stages before the final stage

generate draft sequences of variable quality, which refers to the

fact that the sequence is incomplete. The final step is to convert

the draft into the finished sequence, also referred to as phase III

(http://www.ncbi.nlm.nih.gov/HTGS/). For small genomes, like

that of microbes, the finished sequence refers to a complete

sequence, without gaps. However, in the case of eukaryotes, it is

virtually impossible to get the complete genome information in a

single piece because they contain a large amount of repetitive

sequences, which are especially concentrated in the region

spanning centromeres and telomeres.(14) The two main

strategies for whole genome sequencing are discussed below.
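
The assembly step described above, in which overlapping sequence reads are merged into contigs, can be illustrated with a toy greedy merger. This is only a minimal sketch and is not how PHRAP or any production assembler works: the function names `overlap` and `greedy_assemble`, the `min_overlap` parameter and the example reads are hypothetical, and the reads are assumed to be error-free and to come from a single strand.

```python
def overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the best overlap is shorter than `min_overlap`.
    """
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_overlap)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads sampled from the sequence ATGGCGTACCTTA
reads = ["ATGGCGT", "GCGTACC", "TACCTTA"]
print(greedy_assemble(reads))
```

Reads that share no sufficiently long overlap stay as separate contigs, which is the situation the ordering and orienting step (phase II) has to resolve. A repeat longer than a read would let this greedy merger join the wrong pair, which is why repetitive regions are the main obstacle to a gap-free eukaryotic sequence.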

In the Clone-by-Clone Shotgun approach, the genome

is fragmented and cloned in BAC/PAC vectors. Inserts of

genomic DNA fragments in the BAC/PAC vectors are

anchored physically to the genome, with the help of DNA

markers, to develop a minimum tiling path (MTP). The MTP is

generated using a combination of techniques including finger-

print patterns, sequence tag connectors (STCs) and marker

information. Each BAC/PAC (with average insert size of 100–

150 kb) present in the MTP is again broken into small-sized

fragments, cloned and sequenced. The sequence of the

genome is then obtained by merging the individual BAC/PAC

sequences.(14,15) Although this approach is time consuming,

it offers the advantage that each clone is anchored to a specific

chromosome, thus making the task of finishing much

easier.(16) In addition, since the finished genome of a model

organism can be leveraged for other genomes, this technique is

ultimately cost effective. A high-quality sequence has been

generated for the human genome as well as the Arabidopsis and rice

genomes adopting this approach.(17–19)
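
The idea of a minimum tiling path can be sketched as an interval-cover problem: once clones are anchored to chromosome coordinates, choose the fewest clones that together span the region. The code below is an illustrative assumption rather than the procedure used by any sequencing group. In practice, overlaps are inferred from fingerprint patterns, STCs and markers rather than from known coordinates, and the clone names, coordinates and the function `minimum_tiling_path` are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (hypothetical clone name, start bp, end bp)


def minimum_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[str]:
    """Greedy interval cover: at each step pick the clone that starts at or
    before the covered position and extends farthest to the right."""
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            # No clone bridges this position: a physical gap in the map
            raise ValueError(f"physical gap at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones with 100-150 kb inserts on a 350 kb region
clones = [
    ("BAC_A", 0, 120_000),
    ("BAC_B", 50_000, 170_000),
    ("BAC_C", 100_000, 230_000),
    ("BAC_D", 160_000, 300_000),
    ("BAC_E", 220_000, 350_000),
]
print(minimum_tiling_path(clones, 0, 350_000))
```

Only the clones on the returned path need to be subcloned and sequenced, which is where the approach saves effort despite the up-front cost of building the physical map.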

In the Whole Genome Shotgun (WGS) approach, the

genomic DNA as such is broken into small-sized fragments,

cloned and directly used for sequencing. The sequences are

then assembled to reconstruct the whole genome.(15) This

approach avoids the initial task of making BAC/PAC libraries,

constructing an MTP and individual library construction for each

BAC/PAC clone in the MTP. This strategy has been used

extensively for bacterial genomes. The WGS approach has

also been used for sequencing the human genome as well as

indica and japonica rice genomes.(20–22) The potential

problem in use of WGS for eukaryotic genomes is misassembly

due to a high percentage of repetitive elements.(14,16) In addition,

each contig has to be individually anchored to the

chromosome, which makes the task of finishing more

laborious and cumbersome.

2002: The year of rice genome sequence drafts

Overview

Rice was the ideal candidate for genome sequencing after

Arabidopsis since Arabidopsis and rice are widely accepted as

model dicot and monocot plants, respectively.(23) Rice was the

first organism whose sequencing was pursued by four groups

independently, which itself speaks for the importance of its

genome information.(24,25) Although the task of sequencing

the rice genome was initiated by the publicly funded International

Rice Genome Sequencing Project (IRGSP, see next

section),(26) it was a private company, Monsanto (St Louis, MO,

USA), that released the first draft of the rice genome in April

2000, based on the data generated at the University of

Washington. Monsanto sequenced a total of 3,391 BACs using

a clone-by-clone approach, to the level of 5X coverage, to

produce a draft sequence of 399 Mb. This sequence was

assembled in 52,202 contigs, representing 259 Mb of non-overlapping

data, which was expected to cover almost 60% of

the rice genome.(27) Meanwhile, two other groups, Syngenta

(Torrey Mesa Research Institute, San Diego, USA) and Beijing

Genomics Institute (BGI), China, also launched their independent

sequencing programs. IRGSP, Monsanto and Syngenta

chose the japonica cultivar ‘Nipponbare’ while BGI used the

indica cultivar ‘93-11’ for sequencing. The aim of both

the private ventures (Monsanto and Syngenta) and BGI for

rice genome sequencing was primarily gene discovery and

identification of molecular markers for breeding. Hence, these

groups aimed at obtaining a draft sequence to get a broad

overview of the rice genome.(14) The Monsanto and Syngenta

data were not released to the public database but could be

accessed for academic purposes on entering a database

registration agreement through their sites (http://www.rice-research.org,

http://www.tmri.org).(21,27) Both Monsanto and

Syngenta also allowed their sequences to be incorporated into

IRGSP sequence as long as IRGSP used the information to

improve the sequence from draft to finished level. The BGI

data, unlike the Monsanto and Syngenta data, were made

available freely (http://btn.genomics.org.cn/rice/). However,

the aim of IRGSP was to obtain a highly accurate finished

sequence of the rice genome.(26) As a first step, the IRGSP

announced the release of a high-quality map-based draft

sequence in the public domain in December 2002.(2) As a

result of these private and public ventures, the year 2002 saw

the release of three draft sequences of the rice genome. The

details of the participating groups and their efforts to produce

the draft sequence are given below.

Detailed history

The decision to sequence the rice genome was taken at the

International Plant Molecular Biology Conference held in 1997

in Singapore. Countries sharing a common interest in

sequencing the rice genome joined hands to achieve this

task(28) and launched the International Rice Genome Sequencing

Project (IRGSP).(26) It was the third largest public genome

project undertaken after the human and mouse genome

projects.(28) The consortium included laboratories from Japan,

USA, China, Taiwan, France, India, Korea, Brazil, Thailand

and UK.(29) The participants from the member countries are:

Rice Genome Research Program (RGP), Japan (http://rgp.dna.affrc.go.jp);

The Institute for Genomic Research (TIGR), USA (http://www.tigr.org/tdb/e2k1/osa1);

National Center for Gene Research (NCGR), China (http://www.ncgr.ac.cn/);

Genoscope, France (http://www.genoscope.cns.fr/);

Arizona Genomics Institute (AGI), USA (http://www.genome.arizona.edu);

Cold Spring Harbor Laboratory (CSHL), USA (http://nucleus.cshl.org/riceweb);

Academia Sinica Plant Genome Center (ASPGC), Taiwan (http://genome.sinica.edu.tw);

Indian Initiative for Rice Genome Sequencing (IIRGS), India (http://www.genomeindia.org/);

Plant Genome Initiative at Rutgers (PGIR), USA (http://pgir.rutgers.edu);

Korea Rice Genome Research Program (KRGRP), Korea (http://biogen.niast.go.kr);

National Center for Genetic Engineering and Biotechnology (BIOTEC), Thailand (http://www.cs.ait.ac.th/nstda/biotec/biotec.html);

Brazilian Rice Genome Initiative (BRIGI), Brazil (http://www.ufpel.tche.br/faem/fitotecnia/fitomelhoramento);

John Innes Centre, United Kingdom (http://www.jic.bbsrc.ac.uk);

Washington University School of Medicine Genome Sequencing Center (GSC), USA (http://genome.wustl.edu/);

Wisconsin Rice Genome Project (GCOW), USA (http://www.gcow.wisc.edu).

The IRGSP effort evolved around a few basic points: the

sequencing strategy, the rice cultivar to be sequenced, the

accuracy of sequence and the sequence release policy. It was

decided to use the japonica cultivar ‘Nipponbare’ since it had

already been used by the Rice Genome Research Program

(RGP), Japan, as a source of EST sequencing and construction

of a dense linkage and YAC physical map.(26) The guidelines

for the method of sequencing, sequence quality and

release policy were developed largely on the same lines as the

Human Genome Project (http://www.gene.ucl.ac.uk/hugo/bermuda.htm).

The backbone of the IRGSP sequence-ready

physical map for sequencing was derived from a PAC library

comprising 71,040 clones(30) and a BAC library consisting of

48,960 clones.(31) The other equally important sources of

large insert clones for sequencing were a BAC library

(~90,000 clones) made at Clemson University Genomics

Institute (CUGI)(9) and BAC libraries made by Monsanto.(27)

The MTPs for the 12 chromosomes sequenced by IRGSP were

largely constructed using these large insert clones. The clones

were chosen to form the MTP using the fingerprint patterns,

BAC/PAC end sequences and information available from

markers of each clone.(2) The general strategy employed by

IRGSP for sequencing each large insert BAC/PAC clone was

to shear the DNA and make two libraries for each clone having

an insert size of ~2 and 5 kb, respectively. On average, 2,000

clones from the two libraries were randomly sequenced from

both ends to get ~10X coverage and assembled to the phase II

level, also referred to as the draft sequence.(29) The sequences

were then assembled using a combination of the base

caller, PHRED,(32,33) the assembler, PHRAP (http://bozeman.mbt.washington.edu)

and the sequence viewer and editor,

CONSED(34) software. The IRGSP had set the target to finish

the rice genome sequence by 2008.(26) This goal changed

when Monsanto released the draft sequence of ‘japonica’ in

2000.(27) Two other groups, Syngenta and BGI published

drafts of ‘japonica’ and ‘indica’ simultaneously in 2002.(21,22)

Due to these developments, IRGSP decided to release the

draft (phase II data) before releasing the finished sequence.(14)

Consequently, the draft sequence was released

by the consortium at a meeting held in Japan in December

2002 (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html). This

task was speeded by Monsanto’s decision to provide its BAC

libraries sequenced up to 5X coverage to IRGSP.(27) The

IRGSP draft sequence consisted of 3,380 BAC/PAC clones

representing 366 Mb of the rice genome. This sequence

covered 92% of the rice genome at >10X level. A total of

62,435 genes were predicted from the non-overlapping draft

sequence (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html).
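
The coverage figures quoted in this section (5X, 6X, ~10X) are fold coverage: the total length of sequence reads divided by the length of the target. The sketch below works through the IRGSP per-clone numbers given above (about 2,000 subclones sequenced from both ends, for a BAC/PAC insert of up to 150 kb). The usable read length of 400 bp is an assumption chosen for illustration and is not a value reported in this review. The second function uses the idealized Lander-Waterman expectation, which assumes reads land uniformly at random, a condition that repeat-rich genomes do not fully satisfy.

```python
import math


def fold_coverage(n_reads: int, read_length_bp: float,
                  target_length_bp: float) -> float:
    """Total bases sequenced divided by the length of the target."""
    return n_reads * read_length_bp / target_length_bp


def expected_fraction_covered(coverage: float) -> float:
    """Idealized expectation that a given base is hit by at least one read."""
    return 1.0 - math.exp(-coverage)


# 2,000 subclones sequenced from both ends gives 4,000 reads (from the text).
# ASSUMPTION: 400 bp of usable sequence per read; 150 kb insert (upper bound).
per_clone = fold_coverage(n_reads=4000, read_length_bp=400,
                          target_length_bp=150_000)
print(f"per-clone coverage: {per_clone:.1f}X")

for c in (5, 6, 10):
    print(f"{c}X -> expected fraction covered {expected_fraction_covered(c):.4f}")
```

Under these assumptions the per-clone figure comes out close to the ~10X quoted above, and the model suggests why a 5X draft is expected to leave more bases unsampled than a 10X one.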

Syngenta (Torrey Mesa Research Institute, CA, USA)

collaborated with Myriad Genetics (Salt Lake City, Utah) to

sequence the ‘japonica’ variety of rice.(21) The draft was

completed in just 14 months after inception of the program.(35)

The genome was sequenced using a whole genome sequencing

strategy. The repeat sequences were removed from the

data and the remaining sequence represented 390 Mb of

the estimated 420 Mb genome with coverage of 6X.(21) The

number of genes was estimated to be ~32,000 to 50,000 using

a combination of different prediction programs [FGENESH

(monocot), GeneMark.HMM (Arabidopsis and rice) and

GENSCAN (Arabidopsis and maize)].

The Beijing Genomics Institute (BGI), China, announced its

decision to sequence the ‘indica’ rice genome in May 2000.

BGI, like Syngenta, took the whole genome sequencing

route to sequence the rice genome and also released the

draft sequence in 2002.(22) The sequence was made public

by releasing the data on their website (http://www.btn.genomics.org.cn/rice/).

The repeat sequences were identified

mathematically and all 20-mer sequences whose frequency

was above a particular threshold were categorized as

mathematically determined reads (MDRs). On the basis of

this, almost 78 Mb of sequence was identified as repeat

sequence. These data were masked using RePS (Repeat

masked phrap with scaffolding)(36) and the remaining sequence

represented 361 Mb of the estimated 466 Mb genome

with a coverage of 4X. Among the different prediction

programs used for gene identification, FGENESH was found

to be the most useful. The program predicted ~46,022 to

55,615 genes in the BGI draft sequence.(22)
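
The mathematical definition of repeats used for the BGI draft, in which every 20-mer whose frequency exceeds a threshold is flagged, can be sketched with a simple k-mer counter. This is an illustration of the idea only and is not the RePS implementation: the function names, the threshold value and the example reads are hypothetical, and the demonstration uses k = 4 so that the toy reads stay short.

```python
from collections import Counter


def count_kmers(reads, k=20):
    """Count every k-mer (20-mers by default) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def repeat_kmers(counts, threshold):
    """k-mers seen more often than the threshold are treated as repeats."""
    return {kmer for kmer, n in counts.items() if n > threshold}


def mask_read(read, repeats, k=20):
    """Replace every base covered by a repeat k-mer with 'N'."""
    masked = list(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in repeats:
            masked[i:i + k] = "N" * k
    return "".join(masked)


# Hypothetical reads; k = 4 and threshold = 2 are chosen for the toy example
reads = ["ACGTACGTTT", "GGACGTACCA", "TTACGTAGGA"]
counts = count_kmers(reads, k=4)
repeats = repeat_kmers(counts, threshold=2)
print(sorted(repeats))
print(mask_read(reads[2], repeats, k=4))
```

Masking before assembly keeps over-represented sequence from creating false overlaps between reads that come from different copies of a repeat.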

After the release of the indica draft sequence,(22) BGI

did additional sequencing to get a 6.28X coverage of the

genome,(37) which was almost identical to the coverage

obtained in the Syngenta draft,(21) although for a different

cultivar. For the purpose of analysis, repeats were masked in

both the draft sequences and reassembled.(36,38) These

independently assembled scaffolds from BGI and Syngenta

draft data were combined to get super scaffolds in such a way

as to get the order and orientation information but preserve the

SNP differences between the two subspecies.(37) The total

number of genes predicted in BGI, Syngenta and IRGSP data

using FGENESH(39) were 49,088, 45,824 and 43,635, respectively.(37)

This reduction in gene number compared to the

previous estimates (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html)(21,22)

can be attributed to an improved identification

and elimination of TE-related genes.(37) Further, the objective of

obtaining almost all the rice genes in a single piece was fulfilled by

checking the BGI and Syngenta assembled sequence with a

collection of 19,079 full-length cDNA clones available in the

KOME database. Almost 98% of the genes could be aligned in a

single piece to either of the two genomes.(37) The salient features

of the updated BGI and Syngenta draft sequences(37) are

compared with the IRGSP finished sequence(18) in Table 1.

The draft sequences were not expected to match the quality

of finished sequence, yet proved to be quite useful to the rice

research community in general.(14) They were used extensively

for identifying genes in rice and for making comparisons

with other plant species. The drafts also accelerated the pace

for functional genomics, since the work on microarrays,

proteomics and several other genome-wide studies could

move much faster due to the ready availability of the sequence

information.(29) Other areas of research such as breeding for

introgression of better traits and evolutionary studies also gained

from the availability of the draft sequence. It was, however,

necessary to have the complete sequence information not only

for accurate interpretation of the rice genome in its own context,

but also to serve as a standard and resource for other cereal

genomes. The standard itself should be as reliable as possible to

help extrapolation of information in the true sense to other

economically valuable cereals.(40)

2004: The international year of rice; IRGSP releases the map-based finished rice genome sequence

The year 2004 was declared as the International Year of

Rice by the UN General Assembly. The theme of the program

was ‘Rice is life’, reflecting the importance of rice as a

food crop. The declaration was in recognition of the importance

of rice, which provides food to more than half of the

world population and is a source of income for millions of

rice producers (http://www.fao.org/rice2004). The year also

marked the declaration of the complete rice genome sequence

by the IRGSP (http://rgp.dna.affrc.go.jp/IRGSP/celebrates/celebrates.html).

To commemorate the International Year of

Rice, IRGSP received the Research Accomplishment Award

at the world rice research conference for its role in decoding

the rice genome sequence (http://rgp.dna.affrc.go.jp/IRGSP/WRRC2004-Award/WRRC2004-Award.html).

Before the declaration of the completion of the rice genome, the finished

sequence of three chromosomes (1, 4 and 10) had already

been published.(41–43) To obtain the finished sequence, more

than 4,000 BAC/PAC clones were sequenced, of which 3,401

clones (with at least 10X coverage and 99.9% accuracy) were

used to obtain ~95% coverage of the 389 Mb rice genome. The

size of the genome (389 Mb) was estimated by adding the sum

of non-overlapping sequence along with the estimated size of

gaps. The finished sequence includes three completely

sequenced centromeres (chromosome 4, 5 and 8). To reach

phase III level (finished sequence), the sequence of each

clone was checked for problem regions. The aim was to obtain

an error rate of less than one per 10 kb with the fewest possible

gaps (http://demeter.bio.bnl.gov/Guidelines.html). The main

problem regions were gaps (physical/sequencing), low-quality

regions and misassembled regions. Generally, these problems

were solved by any one or a combination of the following

approaches. For low-quality regions, resequencing was done

using universal or custom primers. Sequencing gaps were

closed by sequencing of bridge clones, PCR fragments or

direct sequencing of BAC/PAC clones. Physical gaps were

filled using PCR fragments or 40 kb fosmid clones. Sequencing

using alternate chemistry was done when the normally

used chemistry did not yield results. For regions that were not

solved by these conventional methods, small insert libraries of

the region were made or transposons were used to disrupt the

difficult region. Each finished clone was finally confirmed by

comparing its in silico restriction pattern with the actual

restriction pattern.(18,44)
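
The final check described above, comparing an in silico restriction pattern with the pattern observed on a gel, can be sketched as follows. The review does not name the enzyme used, so EcoRI (recognition site GAATTC, cutting after the first G) is an assumption for illustration, as are the function names, the 5% size tolerance and the test sequence. The clone is also treated as a linear molecule for simplicity.

```python
def in_silico_digest(sequence: str, site: str = "GAATTC",
                     cut_offset: int = 1):
    """Fragment lengths from cutting a linear sequence at every `site`.

    `cut_offset` is the cut position within the site (1 for EcoRI, G^AATTC).
    """
    seq = sequence.upper()
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return sorted(b - a for a, b in zip(bounds, bounds[1:]))


def patterns_match(predicted, observed, tolerance=0.05):
    """True when both patterns have the same number of fragments and each
    size agrees within the given relative tolerance (hypothetical 5%)."""
    if len(predicted) != len(observed):
        return False
    return all(abs(p - o) <= tolerance * p
               for p, o in zip(sorted(predicted), sorted(observed)))


# Hypothetical 21 bp sequence containing two EcoRI sites
assembled = "AAAGAATTCCCCCGAATTCTT"
print(in_silico_digest(assembled))
```

A mismatch in fragment number or size between the predicted and observed patterns points to a misassembly or a collapsed repeat in the clone sequence.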

The 370,733,456 bp long finished sequence was used to

construct 12 chromosome-specific pseudomolecules in

57 contigs with an average continuous sequence length of

6.9 Mb (Fig. 1). A total of 62 physical gaps still remain in the

finished sequence including 9 centromeres and 17 telomeres

constituting 18.1 Mb of rice genome. The total number of

genes predicted for the finished sequence is ~37,544. EST

markers were used to measure the genome coverage in the

finished sequence. Almost 99.4% of the available ESTs were

represented in the pseudomolecules.(18) The strength and

validity of the gene prediction programs were checked by

comparing the predicted genes to full-length cDNAs(45) and

ESTs (http://www.ncbi.nlm.nih.gov/dbEST/) available in the

database. A total of 61% of the predicted genes showed a match with

either a cDNA or an EST.(18)

Table 1. Comparison of BGI and Syngenta draft with IRGSP finished rice genome sequence

Sequencing group      Syngenta*             BGI*           IRGSP**
Genome size           433 Mb                466 Mb         389 Mb
Coverage              >6X                   6.28X          >10X
Assembled contigs     46,246                64,052         57
Subspecies/cultivar   japonica/Nipponbare   indica/93-11   japonica/Nipponbare
Sequencing strategy   WGS                   WGS            Clone-by-clone shotgun
Predicted genes       45,824                49,088         37,544

Based on reference 38* and 18**. The contigs in Syngenta and BGI draft sequences were linked together to create much larger scaffolds and super scaffolds.

What does the rice genome sequence reveal?

General features

The map-based sequence of the rice genome is estimated to

cover 95% of the 389 Mb rice genome. A total of 37,544 genes

have been predicted for the complete sequence with an

average gene density of one gene per 9.9 kb and average

gene length of 2,699 bp. Chromosomes 1 and 3 have the

highest gene density with a gene density of one gene per 8.9 kb

andone gene per 8.7 kb, respectively. Chromosomes11and12

had the lowest gene density of one gene per 10.7 kb and one

gene per 11.6 kb, respectively, compared to the rest of the rice

chromosomes.(18)
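
The headline figures in this paragraph can be cross-checked with simple arithmetic on numbers quoted in the text: the 370,733,456 bp finished sequence, the 389 Mb estimated genome size, the 18.1 Mb attributed to gaps, and the 37,544 predicted genes of 2,699 bp average length. The sketch below is a consistency check only. It assumes that gene density was computed over the sequenced bases, which the review does not state explicitly, and the function names are hypothetical.

```python
FINISHED_SEQUENCE_BP = 370_733_456   # from the text
ESTIMATED_GENOME_BP = 389_000_000    # from the text (389 Mb)
GAP_ESTIMATE_BP = 18_100_000         # from the text (18.1 Mb)
PREDICTED_GENES = 37_544             # from the text
AVERAGE_GENE_LENGTH_BP = 2_699       # from the text


def kb_per_gene(sequence_bp: int, n_genes: int) -> float:
    """Average spacing of genes, in kb of sequence per gene."""
    return sequence_bp / n_genes / 1000


def genic_fraction(sequence_bp: int, n_genes: int, mean_gene_bp: int) -> float:
    """Share of the sequence occupied by predicted genes."""
    return n_genes * mean_gene_bp / sequence_bp


print(f"gene density: one gene per "
      f"{kb_per_gene(FINISHED_SEQUENCE_BP, PREDICTED_GENES):.1f} kb")
print(f"sequence + gaps: "
      f"{(FINISHED_SEQUENCE_BP + GAP_ESTIMATE_BP) / 1e6:.1f} Mb")
print(f"genome covered: {FINISHED_SEQUENCE_BP / ESTIMATED_GENOME_BP:.1%}")
print(f"genic fraction: "
      f"{genic_fraction(FINISHED_SEQUENCE_BP, PREDICTED_GENES, AVERAGE_GENE_LENGTH_BP):.1%}")
```

Under that assumption the numbers agree with one another: the sequence plus the estimated gaps comes to about 389 Mb, the sequenced portion is about 95% of that total, and the spacing works out to about one gene per 9.9 kb.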

The rice genome was estimated to comprise ~10–25%

repeat elements before the availability of the genome sequence.(46,47)

In the finished sequence, repeats constitute at

least 35% of the rice genome. The transposable element

content was highest for chromosomes 8 (38%) and 12

(38.3%) and lowest for chromosomes 1 (31%), 2 (29.8%) and 3

(29%). The number of class II repeat elements like hAT,

CACTA, IS630/Tc1/mariner, IS256/Mutator and IS5/Tourist is

more than two-fold greater than class I elements like LINEs,

SINEs, Ty1/copia and Ty3/gypsy. However, the class I

elements contribute more to the genome (19.4%) compared

to class II elements (12.9%). The presence of class II

elements such as IS256/Mutator, IS5/Tourist and IS630/Tc1/mariner

in the rice genome correlated with gene density, and

they were most frequently present on the first three chromosomes.(18)

Figure 1. Pseudomolecules of the 12 rice chromosomes. The participating nations responsible for sequencing each chromosome are

given on the top. The arrowheads indicate the location of centromeres and green colour represents the position of physical gaps. The gap on

the short arm of chromosome 9 represents the nucleolar organizer consisting of 17S–5.8S–25S rDNA coding units. Values given at the bottom

represent estimated (red) and sequenced (green) bases for each chromosome in Mb (modified from reference 18).

Sequencing nations shown in the figure: Chr1, Japan and Korea; Chr2, Japan and UK; Chr3, USA; Chr4, China; Chr5, Taiwan; Chr6, Japan; Chr7, Japan; Chr8, Japan; Chr9, Brazil, Japan, Korea and Thailand; Chr10, USA; Chr11, France, India and USA; Chr12, France.

Detailed analysis of the rice genome has led to the identification

of three main classes of duplications. The first class

of duplication is segmental, involving duplication of a large

number of genes along the length of the chromosome.(18,37,48,49)

The second class is tandem duplications involving individual

genes and the third is background duplications accounting for all

other duplications that could not be classified into either of the

first two categories.(18,37) When only those rice genes showing

homology to non-redundant KOME cDNAs were considered for

duplication analysis, a total of 18 pairs of duplicated segments

covering more than 65% of the length of the mapped super-scaffolds

were identified in the indica rice genome sequence.(37)

Analysis of the japonica rice genome sequence has shown

that almost 60% of the rice genome is duplicated.(18) All the

chromosomes have duplicated segments; however, the biggest

duplicated block is shared between chromosome 11 and 12.

From analysis of the duplicated segments, it seems that the

whole genome duplication occurred about 55 to 70 million years

ago before the divergence of the major cereals from their

common ancestor. Most of the observed duplications can be

attributed to this event. However, the chromosome 11–12

duplication is probably recent in origin and represents a

segmental duplication, which was earlier predicted to have

occurred about 20 million years ago.(21,37,49–51) A recent

analysis on the basis of the finished sequence, however, esti-

mates this to have happened as recently as 7.7 million years

ago.(52) It may be mentioned that, for assessing the age of

segmental duplications, the quality of sequence and annotation

is very important. Thus, analysis of the duplication events in rice

provides evidence for whole genome duplication, a recent

segmental duplication and several individual duplication events.

Analysis of the finished sequence for organellar insertions

showed that there were at least 421 chloroplast and 909

mitochondrial DNA insertions contributing to ~0.2% each of

the nuclear genome. The pattern of chloroplast and mitochondrial

insertions in the rice genome suggests that their transfer

processes were independent of each other.(18) In another

analysis, the nuclear localized plastid DNA was similarly found

to be 0.2% of the total rice nuclear genome and was

predominantly present near the pericentromeric regions. On

the one hand, the number of such insertions was highest

in chromosome 1 and lowest for chromosomes 9, 10 and 11.

On the other hand, the amount of insertions (in kb) was greatest in

chromosome 10 and least in 11. Age distribution analysis

revealed that the phenomenon of chloroplast–nuclear DNA

flux involved a constant process of integration, shuffling and

elimination, with 80% of them being eliminated from the nuclear

genome in the span of a million years.(53)

The GC content varies widely amongst different organisms

ranging from 26 to 65%.(54) Study of GC content in plant

species showed that Gramineae genomes were richer in GC

content compared to dicot genomes.(55) The overall GC

content of the Arabidopsis genome is 34.7% with the exons

having 44.1% and introns having 32.7% GC content.(17) The

rice genome has an average GC content of 43.6% with 54.2%

GC content of exons and 38.3% GC content of introns.(18) This

GC content is much higher compared to Arabidopsis,

especially in the coding regions. Another difference observed

between the GC content of the two plants was that a distinct

gradient in GC content existed within the rice genes, with the 5′

end having on average 25% more GC content compared to

the 3′ end. Such a gradient in GC content was not seen in

Arabidopsis genes.(22) In another study, the GC content of two

Gramineae data sets (rice and maize) was compared with two

dicots (Arabidopsis and tobacco) and a similar difference in the

gradient of GC content in the direction of transcription was

observed.(54)
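The GC comparisons above rest on a simple calculation that can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the function names `gc_content` and `gc_gradient`, the choice of four equal windows, and the toy sequence are all hypothetical and are not taken from the cited studies, which may have used different windowing and data sets.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Return the GC percentage of a DNA sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        return 0.0
    return 100.0 * gc / (gc + at)


def gc_gradient(seq: str, n_windows: int = 4) -> List[float]:
    """GC percentage of equal-sized, consecutive windows from the 5' to the 3' end.

    Trailing bases that do not fill a whole window are ignored.
    """
    if n_windows < 1:
        raise ValueError("n_windows must be at least 1")
    size = len(seq) // n_windows
    if size == 0:
        raise ValueError("sequence is shorter than the number of windows")
    return [gc_content(seq[i * size:(i + 1) * size]) for i in range(n_windows)]


# Toy sequence (hypothetical, not a rice gene) that is GC-rich at its 5' end.
toy = "GCGCGCGC" + "GCATGCAT" + "ATATGCAT" + "ATATATAT"
print(gc_content(toy))   # 43.75
print(gc_gradient(toy))  # [100.0, 50.0, 25.0, 0.0]
```

A falling series of window values, as in the toy output, is the kind of 5′ to 3′ gradient described for rice genes; a flat series would correspond to the pattern reported for Arabidopsis.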

The centromere is the physical entity on the chromosome

that binds microtubules and other centromeric-associated

proteins so that it serves as the point of chromatid segregation

during cell division.(56) With the exception of the yeast centromere,

which is made of a ~125 bp unique sequence, eukaryotic

centromeres are known to contain long stretches of repetitive

DNA sequences. Due to this, centromeres have long remained

recalcitrant to cloning, sequencing and subsequent assembly.(56)

Although most rice centromeres, like those of other eukaryotes,

are large in size (>1 Mb) and thus difficult to sequence, some

rice centromeres (chromosomes 4, 5 and 8) were smaller and

hence could be sequenced fully.(18,57,58) Rice centromeres

typically comprise 165 bp CentO satellite repeat sequences

and retrotransposon elements.(59) The rice chromosome 4

centromere is 124 kb long with 18 tracts of 379 CentO repeats

(59 kb) and 19 centromeric retroelements forming the core

centromere. There were four different types of retroelements

but LTR retrotransposons such as the Ty3/gypsy-like elements

constituted the largest retrotransposon family.(58) The

chromosome 8 centromere was made of three clusters of

CentO repeats spanning 68.5 kb while for chromosome 5 the

size was 50.3 kb. The CentO repeats were tandemly arrayed

and interrupted by ~220 TE-related sequences, mostly Ty3/

gypsy-like retrotransposons. Chromosomes 4 and 8 had

similar amounts of CentO repeats but had different numbers

of retroelements. Surprisingly, 201 ORFs were predicted in the

1.97 Mb region around the chromosome 8 centromere. The

majority of these predicted genes were found to code for

hypothetical proteins but at least 20% showed similarity to

known proteins or rice full-length cDNAs.(57) Out of these

genes, 14 were present in the centromeric region and 12 of

these were experimentally confirmed to be functional.(60) The

presence of functional genes inside rice centromeres was

an interesting finding since centromeres were previously

considered to be transcriptionally silent heterochromatic

zones.(61,62) This finding is similar to the presence of genes in

the human neocentromeres.(63) Possibly, human neocentro-

meres represent an earlier stage and the rice centromeres

(chromosomes 4 and 8) represent an intermediate stage in

centromere evolution. Thus, the rice centromeres are probably

not fully developed and, in due course, the

centromeric region would adapt to its role in cell division

and accumulate repetitive sequences, and the genes will lose

their expression and become transcriptionally silent.(60)

The complete sequencing of the rice centromeres has

revealed the basic structure of eukaryotic centromeres, has

helped identify the minimum sequence required for centro-

mere function and would prove useful for understanding their

evolution.


Gene predictions

Gene annotation consists of two basic steps. In the first

step, different computer prediction programs are used for

gene prediction and, in the second step, the predicted

genes are validated using information on gene function

available in the database.(64) Computational gene prediction

in rice is facilitated by several publicly available databases.

The different gene prediction programs used for annotating

rice by different groups include Genscan (http://genes.mit.edu/GENSCAN.html),

FGENESH (http://www.softberry.com/berry.phtml),

GeneMark.hmm (http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi),

GlimmerR (http://www.tigr.org/software/glimmer/),

RiceHMM (http://rgp.dna.affrc.go.jp/RiceHMM/),

tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/),

SplicePredictor (http://bioinformatics.iastate.edu/cgi-bin/sp.cgi),

GeneSplicer (http://www.tigr.org/tdb/GeneSplicer/),

GeneFinder (http://rulai.cshl.org/tools/genefinder/)

and NetGene2 (http://www.cbs.dtu.dk/services/NetGene2/).(65)

Out of these, FGENESH has been found to be the most useful

prediction tool available for rice.(22) Several websites provide

detailed information about rice gene annotation. These include

RiceGAAS, the Rice Genome Automated Annotation System

(http://RiceGAAS.dna.affrc.go.jp); TIGR, The Institute for Genomic

Research (http://www.tigr.org/tdb/e2k1/osa1); Gramene

(http://www.gramene.org/); and MOsDB, the MIPS Oryza sativa

DataBase (http://mips.gsf.de/proj/plant/jsf/rice/index.jsp).(66–69)

The availability of ~400,000 ESTs and at least 32,000 full-length

cDNA clones has helped to a large extent in the validation of

computational gene prediction in rice (http://www.ncbi.nlm.nih.gov/UniGene,

cdna01.dna.affrc.go.jp/cDNA).(70) The rice genome

annotation at TIGR, named Osa1 (Oryza sativa 1), is the

most widely used database for certain major gene array,

transcriptomics and annotation projects, as it provides details of the

annotation and sequence assembly of the rice genome.(71) The

annotation details of each gene model are linked to its functional

information, such as expression data, gene ontologies and tagged

lines (http://www.tigr.org/tdb/e2k1/osa1). Recently, another database,

the Rice Annotation Project Database (RAP-DB), has

been made public. It utilizes the IRGSP assembly and can be

accessed through http://rapdb.lab.nig.ac.jp/.(72)
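The two-step logic described above (computational prediction followed by validation against transcript evidence) can be sketched as a coordinate-overlap check. This is a minimal illustration, not the procedure used by any of the annotation groups named here: the `Interval` type, the function names and the toy coordinates are hypothetical, and real pipelines rely on spliced alignments and strand information rather than simple overlap.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two intervals lie on the same chromosome and share at least one base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def supported_models(models: List[Interval], transcripts: List[Interval]) -> List[Interval]:
    """Step 2: keep only predicted gene models that overlap an aligned EST or cDNA."""
    return [m for m in models if any(overlaps(m, t) for t in transcripts)]


# Step 1 output (hypothetical gene models from a prediction program).
models = [
    Interval("Chr1", 100, 900),
    Interval("Chr1", 2000, 2600),
    Interval("Chr2", 100, 900),
]
# Hypothetical EST/cDNA alignments used as evidence.
transcripts = [Interval("Chr1", 850, 1200), Interval("Chr2", 1000, 1500)]

print(supported_models(models, transcripts))
# [Interval(chrom='Chr1', start=100, end=900)]
```

Models left without support are not necessarily wrong; as discussed for rice, they may simply lack expression evidence in the libraries sampled so far.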

The number of genes predicted in the finished rice genome

sequence is ~37,544.(18) In addition, a 7 Mb region on

chromosome 9 and a 0.25 Mb region on chromosome 11 code

for ribosomal RNA, and 763 tRNA genes were also predicted. This

gene number is smaller in comparison to the numbers predicted in draft

sequences. In the IRGSP data, the vast difference in the

number of genes predicted in the previously finished chromosomes

1, 4 and 10 as well as the draft sequence in comparison to

the finished genome sequence is explained by improvement

in the annotation process. The number for the finished sequence excludes

17,752 transposon-related genes; FGENESH predicted a total of

55,296 genes for the finished sequence, which was comparable

to the number of genes estimated from the previously finished

chromosomes or draft sequence. The view that the rice genes

without Arabidopsis homologues could include wrongly pre-

dicted genes is also supported by the finding that this subset of

genes is largely different in its features from the rest of the rice

genes. These differences include smaller size, more introns

and unusual 3′ GC richness. Another striking feature of these

genes is that only a very small percentage is supported by EST

data.(73) To further prove this point, these rice genes were

annotated using the maize transcriptome data (representing

more than 80% of maize genes). Only 15% of the rice genes

lacking Arabidopsis homologues were supported by maize

ESTs. Further, manual annotation of these genes showed that

at least 30% of these genes were transposable elements.(73)

This study supports the number of genes predicted by IRGSP

for the finished sequence.(18) It is possible that the predicted

rice genes that are not supported by ESTs could be supported

by expression evidence from other functional genomics

approaches like tiling microarrays(74–76) or MPSS (http://mpss.udel.edu/rice),

and only detailed analysis will give a true

picture about the nature of these genes.
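The gene numbers quoted in this section are linked by simple arithmetic, which can be checked directly. The sketch below uses only the two counts given in the text (55,296 FGENESH models and 17,752 transposon-related genes); the function and constant names are hypothetical and chosen for illustration.

```python
FGENESH_TOTAL = 55_296        # gene models predicted by FGENESH (from the text)
TRANSPOSON_RELATED = 17_752   # transposon-related genes excluded (from the text)


def non_te_gene_count(total: int, te_related: int) -> int:
    """Gene models remaining after transposon-related models are removed."""
    if te_related > total:
        raise ValueError("te_related cannot exceed total")
    return total - te_related


def percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, as a percentage."""
    return 100.0 * part / whole


print(non_te_gene_count(FGENESH_TOTAL, TRANSPOSON_RELATED))     # 37544
print(round(percent(TRANSPOSON_RELATED, FGENESH_TOTAL), 1))     # 32.1
```

The difference reproduces the 37,544 genes reported for the finished sequence, and shows that roughly a third of the raw FGENESH models were set aside as transposon-related.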

The 37,544 genes predicted from the IRGSP finished

sequence could be classified into 3,328 different types of

domains. Out of the most abundant domains predicted, five

were protein kinases. More than half of the predicted proteins

could be associated with a biological process.(18) A total of

71% predicted rice gene products had homologues in

Arabidopsis, while the percentages of rice gene products with

homologues in humans, Drosophila, C. elegans, yeast,

Synechocystis and E. coli were 40.8%, 38%, 36.5%, 30.2%,

17.6% and 10.2%, respectively (Fig. 2).
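Homologue counts of this kind depend on the e-value cut-off applied, which is why Figure 2 reports hits in several e-value bins. A minimal sketch of such binning is given below. The bin labels are copied from the Figure 2 legend; the treatment of bin boundaries (strictly less than the upper bound), the function names and the example e-values are assumptions made for illustration and are not taken from reference 18.

```python
from collections import Counter
from typing import Iterable, Optional

# Upper bounds of the e-value bins named in the Figure 2 legend, most stringent first.
BIN_UPPER_BOUNDS = [
    (1e-200, "<E-200"),
    (1e-150, "<E-150 to E-200"),
    (1e-100, "<E-100 to E-150"),
    (1e-50, "<E-50 to E-100"),
    (1e-10, "<E-10 to E-50"),
    (1e-5, "<E-5 to E-10"),
]


def evalue_bin(evalue: float) -> Optional[str]:
    """Return the label of the most stringent bin the e-value falls into, or None."""
    for upper_bound, label in BIN_UPPER_BOUNDS:
        if evalue < upper_bound:
            return label
    return None  # weaker than the loosest cut-off: not counted as a homologue


def bin_counts(evalues: Iterable[float]) -> Counter:
    """Count best-hit e-values per bin, skipping hits that pass no cut-off."""
    labels = (evalue_bin(e) for e in evalues)
    return Counter(label for label in labels if label is not None)


# Hypothetical best-hit e-values for five rice proteins against one proteome.
hits = [0.0, 3e-160, 2e-60, 1e-7, 0.01]
print(bin_counts(hits))
```

Summing the bins for one organism gives the number of rice proteins with a homologue at the loosest cut-off; tightening the cut-off simply drops the weaker bins.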

Comparison of the two sequenced plant genomes

Rice and Arabidopsis are distantly related species that

diverged about 200 million years ago.(77) The rice genome is

about three times larger and has almost 50% more genes than

the Arabidopsis genome.(18,65) The largest syntenic region

observed to date between the two organisms was identified in

an analysis of chromosome 4 finished sequence with the

Arabidopsis genome. The syntenic region covered 119

Arabidopsis proteins showing an identity of at least 70% over

a minimum stretch of 30 amino acids. This analysis revealed

that collinearity between the two genomes existed but was

preserved only to a small extent.(21) Analysis of the Arabidopsis

genome revealed that only 35% of its genes were unique

while at least 17% of the genes were tandemly duplicated.(17)

Similarly, in rice, almost 60% of the genome is duplicated while

14% of its genes are tandemly duplicated.(18) The high number

of duplicated genes in both the plant genomes indicates that

gene diversity in plants has probably arisen through genome

duplication.(22)

In the recent study done with the IRGSP data, almost 90%

of Arabidopsis proteins had rice homologues, while ~71% of


predicted rice proteins had an Arabidopsis homologue. To

eliminate the possibility of wrongly predicted genes, homology

search was also done using only those predicted rice genes

that were supported by an EST or a cDNA and the percentage

of predicted rice genes with an Arabidopsis homologue

increased to 88%.(18) Comparison of the predicted rice and

Arabidopsis genes shows that the organisms share many

common genes. These include most of the disease- and

flowering-related genes, phosphate transporters, transcrip-

tion factors and those involved in metabolism. In contrast,

there are several common genes that are absent in these two

plant genomes, but are present in other sequenced organisms.

These include members of gene families encoding nuclear

steroid receptor, p53, Notch/lin12, Janus kinase (JAK) and

Signal Transducers and Activators of Transcription (STAT).

Another important category was genes that were present

in either Arabidopsis or rice, but not both. Some of the Arabidopsis genes

that do not have rice homologues are FRIGIDA, FLOWERING

LOCUS C, UNUSUAL FLORAL ORGANS and SUPERMAN

amongst the flowering-related genes and TIR-NB-LRR

(Toll/Interleukin-1 Receptor-Nucleotide-Binding site-Leucine-Rich

Repeat) amongst the disease-related genes.(21,25) Amongst the

predicted rice genes, ~8% do not show homologues in

Arabidopsis and include well-known cereal-specific genes like

prolamins along with several proteins such as chitinase

precursor, seed allergen, starch branching enzyme, wound-

induced protease inhibitor and abscisic stress ripening

protein.(18) However, the majority of these genes either show no

hits in the database or match only hypothetical proteins. The basic

difference between monocots and dicots will become clear

only when the function of these largely unknown cereal-

specific genes becomes clear.(18)

Comparison of rice with cereal genomes

The cereals diverged from their common ancestor around

60 million years ago.(78) Despite this period of independent

evolution, the genes as well as their order in cereals are seen to

be quite conserved.(12) A major advantage of sequencing the

rice genome was its syntenic relationship with other cereal

species.(26) Analysis of rice genome sequence draft showed

that homologues of almost 98% of wheat, barley and maize

proteins could be identified in rice.(21) However, most of the

analyses, which report strong syntenic relationship among

cereals, have been done at a low resolution due to the limited

number of common markers available.(79)

Figure 2. Comparison of predicted rice proteins with proteins from model organisms at different e-value cut-offs (modified from reference 18). [Figure not reproduced. Y-axis: number of predicted proteins (0 to 26,000). X-axis: Arabidopsis, yeast, C. elegans, Drosophila, human, Synechocystis, E. coli. Legend (e-value bins): <E-200, <E-150 to E-200, <E-100 to E-150, <E-50 to E-100, <E-10 to E-50, <E-5 to E-10.]

There are several

instances, however, where collinearity between cereals was

found to be disrupted when studied at a higher resolution. For

instance, high-resolution mapping was done for studying

wheat–rice synteny using a total of 4,485 wheat ESTs for

comparison with the rice genome sequence. The analysis

revealed that there was a general conservation of genes and

their order in the two species. However, several breaks in

the collinearity were observed.(79) In a similar study, 2,932

predicted genes from the long arm of chromosome 11 were

compared with wheat ESTs. Although the genes were

conserved in the analyzed region, several rearrangements

could be seen that disrupted the gene order.(80) An analysis of

rice sequence with 2,629 maize markers identified 656 putative

orthologs but revealed several breaks in collinearity.(81) Similar

sequence-based alignments of rice done with sorghum and

barley revealed that there were some rearrangements along

with a general conservation of synteny.(82,83) Also, there are

studies where identifying candidate genes on the basis of

synteny did not prove useful. For instance, attempts to identify

the Rph7 (leaf rust resistance) gene and Ppd-H1 (photoperiod

response) gene in barley on the basis of their expected

collinearity in rice did not yield the expected results.(84,85)

However, comparative genomics based on the syntenic

relationship of rice with other cereals has helped in identifying

several important genes, such as the QTL for malting quality in

barley, a major heading date QTL in perennial ryegrass, the liguleless

region in sorghum and Ror2, a gene conferring resistance to

powdery mildew disease in barley.(86–89) Hence, from these

studies, it seems that cross-species comparison would be

useful in identifying genes of interest. However, in each case,

collinearity will have to be investigated at the micro level in the

region of interest using high-density genetic maps.(90)
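The breaks in collinearity discussed in this section can be located with a simple rank comparison once orthologous markers have been placed on both genomes. The sketch below is an illustration only, under the assumptions that all markers fall on a single rice chromosome and have distinct coordinates; the function name and the toy positions are hypothetical, and the cited studies used their own, more elaborate criteria.

```python
from typing import List, Sequence


def collinearity_breaks(rice_positions: Sequence[float]) -> List[int]:
    """Return indices i where markers i and i+1 are not neighbours in rice.

    `rice_positions` holds the rice coordinate of each orthologous marker,
    listed in marker order along the other cereal's map.
    """
    # Rank each marker by its rice coordinate (0 = leftmost in rice).
    by_rice = sorted(range(len(rice_positions)), key=lambda i: rice_positions[i])
    rank = [0] * len(rice_positions)
    for r, i in enumerate(by_rice):
        rank[i] = r
    # Adjacent markers are collinear if their rice ranks differ by exactly one
    # (forward or inverted order); anything else marks a rearrangement breakpoint.
    return [i for i in range(len(rank) - 1) if abs(rank[i + 1] - rank[i]) != 1]


# Hypothetical example: the last four markers are inverted relative to rice.
print(collinearity_breaks([10, 20, 30, 80, 70, 60, 40]))  # [2]
# Hypothetical example: two markers translocated into the middle of the block.
print(collinearity_breaks([10, 20, 90, 95, 30, 40]))      # [1, 3]
```

With few markers, as in low-resolution maps, small rearrangements fall between markers and go undetected, which is consistent with synteny appearing stronger at low resolution than in sequence-based comparisons.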

Conclusions

Sequencing of the rice genome was initiated by four different

groups. Monsanto and IRGSP used the clone-by-clone

approach, while Syngenta and BGI made use of the WGS

approach to sequence the rice genome. Amongst these

different groups, only the publicly funded IRGSP was

interested in the complete sequence information, while the

other groups pursued sequencing for gene discovery and

marker information.(14) Analysis of the rice genome sequence

has confirmed the syntenic relationship amongst cereal

crops.(21) However, it also shows that collinearity is not as well

preserved as previously thought. Hence, information from rice

can be extrapolated to other cereal crops, but only after

studying microcollinearity in the region of interest.(90) A total of

18,828 SSRs and a large number of SNPs (0.5–0.8%) have

been identified in the finished rice genome sequence, which

will aid map-based cloning.(18) In fact, the availability of the

sequence has already facilitated such efforts, and several rice

genes have been cloned utilizing the rice genome sequence

information: Hd1 (a major photoperiod-sensitive QTL),

PLASTOCHRON1 (a regulator of leaf initiation), Spl7 (a heat-stress

transcription factor), Rf-1 (a fertility-restorer gene), Xa26 (a gene

conferring resistance to Xanthomonas oryzae pv. oryzae) and

Gn1a (a cytokinin oxidase gene representing a QTL for grain

production).(91–96)

The sequence availability of two plant genomes (Arabidopsis

and rice) provided an opportunity to compare their sequences

and understand the special features of plant genomes. Plants

have a much larger number of genes compared to other

sequenced organisms. This is mainly due to the higher number

of duplicated genes.(21) Thus, plant genomes seem to have

evolved through polyploidization and subsequent gene loss.(97)

There is also a need for functional validation of predicted genes

in rice and Arabidopsis to serve as the landmark for other plant

genomes for which full sequencing will probably never be

done.(98–100) The recent use of rice genome sequence in

microarray projects indicates its importance as a tool for global

gene expression profiling, discovery of new genes and validating

computational gene predictions.(74–76,101) Rice is also amongst

the few organisms for which sequences are available in two

subspecies. Analysis of the alignments shows that, although the

genes are highly conserved in the two subspecies, the major

difference lies in the intergenic regions.(37) Thus, the availability

of the rice genome sequences has given a deeper insight into

the gene content, regulatory elements and the nature of repeats

in its genome.(18) But, in the end, the real worth of the rice genome

sequence will be measured in terms of the agro-economic

benefits. The huge efforts made in studying the rice genome will

finally be justified when the sequence information is used in

developing better rice varieties with greater yield and enhanced

tolerance to various abiotic and biotic stresses.

Box 1. Glossary of terms

Bacterial artificial chromosome (BAC): A bacterial cloning vector that can typically carry 100–150 kb insert DNA.

Contig: A contiguous DNA sequence generated by assembling overlapping sequences.

Draft sequence: An incomplete sequence in terms of both contiguity and likelihood of errors.

Finished sequence: A sequence with an error rate of less than one error per 10 kb, assembled in the correct order and orientation with the fewest possible gaps.

Minimum tiling path (MTP): The least number of overlapping clones that span a chromosomal region.

P1-derived artificial chromosome (PAC): A cloning vector derived from P1 phage that can typically carry 100–150 kb insert DNA.

Retrotransposon: A type of transposon that can move by producing an RNA intermediate.

Scaffolds: An ordered set of contigs placed on the chromosome.

Sequencing gap: A gap in the sequence that can be filled by sequencing of bridge clones available in the region.

Shotgun sequencing: An approach to sequence DNA by breaking it into a large number of fragments that can be sequenced individually.

Transposon: Any segment of DNA that can change its position in the genome.

Yeast artificial chromosome (YAC): A high capacity cloning vector that can typically carry 300–400 kb DNA.
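Box 1 defines the minimum tiling path (MTP) as the least number of overlapping clones that span a chromosomal region. When clone coordinates are already known, this is a classic interval-covering problem that a greedy choice solves. The sketch below assumes idealised clone coordinates on a single axis; the function name, the tuple layout and the toy clones are hypothetical, and actual MTP selection in the sequencing projects relied on fingerprint and marker data rather than known coordinates.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) on a common coordinate axis


def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy selection of the fewest clones covering [region_start, region_end].

    At each step, among clones that start at or before the position covered so
    far, pick the one that extends furthest. Raises ValueError at a physical gap.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path: List[str] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan every clone that overlaps or abuts the covered part.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"physical gap at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


# Hypothetical clones (coordinates in kb) over a 400 kb region.
clones = [("A", 0, 120), ("B", 50, 150), ("C", 100, 260), ("D", 140, 230), ("E", 250, 400)]
print(minimum_tiling_path(clones, 0, 400))  # ['A', 'C', 'E']
```

Clones B and D are redundant in this toy set because A, C and E already overlap end to end; a region where no clone reaches the covered position is reported as a gap rather than silently skipped.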

References

1. Peng S, Huang J, Sheehy JE, Laza RC, Visperas RM, et al. 2004. Rice

yields decline with higher night temperature from global warming. Proc

Natl Acad Sci USA 101:9971–9975.

2. Sasaki T, Matsumoto T, Antonio BA, Nagamura Y. 2005. From mapping

to sequencing, post-sequencing and beyond. Plant Cell Physiol 46:3–

13.

3. Sasaki T. 1998. The rice genome project in Japan. Proc Natl Acad Sci

USA 95:2027–2028.

4. Harushima Y, Yano M, Shomura A, Sato M, Shimano T, et al. 1998. A

high-density rice genetic linkage map with 2275 markers using a single

F2 population. Genetics 148:479–494.

5. Umehara Y, Inagaki A, Tanoue H, Yasukochi Y, Nagamura Y, et al.

1995. Construction and characterization of a rice YAC library for

physical mapping. Mol Breed 1:79–89.

6. Saji S, Umehara Y, Antonio BA, Yamane H, Tanoue H, et al. 2001. A

physical map with yeast artificial chromosome (YAC) clones covering

63% of the 12 rice chromosomes. Genome 83:32–37.

7. Yamamoto K, Sasaki T. 1997. Large-scale EST sequencing in rice.

Plant Mol Biol 35:135–144.

8. Wu J, Maehara T, Shimokawa T, Yamamoto S, Harada C, et al. 2002. A

comprehensive rice transcript map containing 6591 expressed

sequence tag sites. Plant Cell 14:525–535.

9. Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, et al.

2002. An integrated physical and genetic map of the rice genome.

Plant Cell 14:537–545.

10. Tyagi AK, Mohanty A. 2000. Rice transformation for crop improvement

and functional genomics. Plant Science 158:1–18.

11. Moore G, Devos KM, Wang Z, Gale MD. 1995. Grasses, line up and

form a circle. Curr Biol 5:737–739.

12. Gale MD, Devos KM. 1998. Plant comparative genetics after 10 years.

Science 282:656–659.

13. Goff SA. 1999. Rice as a model for cereal genomics. Curr Opin Plant

Biol 2:86–89.

14. Buell CR. 2002. Obtaining the sequence of the rice genome and

lessons learned along the way. Trends Plant Sci 7:538–542.

15. Green ED. 2001. Strategies for the systematic sequencing of complex

genomes. Nat Rev Genet 2:573–583.

16. Waterston RH, Lander ES, Sulston JE. 2002. On the sequencing of the

human genome. Proc Natl Acad Sci USA 99:3712–3716.

17. The Arabidopsis Genome Initiative. 2000. Analysis of the genome

sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–

815.

18. International Rice Genome Sequencing Project. 2005. The map-based

sequence of the rice genome. Nature 436:793–800.

19. International Human Genome Sequencing Consortium. 2001. Initial

sequencing and analysis of the human genome. Nature 409:860–

921.

20. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The

sequence of the human genome. Science 291:1304–1351.

21. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2002. A draft

sequence of the rice genome (Oryza sativa L. ssp. japonica). Science

296:92–100.

22. Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the

rice genome (Oryza sativa L. ssp. indica). Science 296:79–92.

23. Izawa T, Shimamoto K. 1996. Becoming a model plant: the importance

of rice to plant science. Trends Plant Sci 1:95–99.

24. Buell CR. 2002. Current status of the sequence of the rice genome and

prospects for finishing the first monocot genome. Plant Physiol

130:1585–1586.

25. Delseny M. 2003. Towards an accurate sequence of the rice genome.

Curr Opin Plant Biol 6:101–105.

26. Sasaki T, Burr B. 2000. International Rice Genome Sequencing Project:

the effort to completely sequence the rice genome. Curr Opin Plant Biol

3:138–141.

27. Barry GF. 2001. The use of the Monsanto draft rice genome sequence

in research. Plant Physiol 125:1164–1165.

28. Eckardt NA. 2000. Sequencing the rice genome. Plant Cell 12:2011–

2017.

29. Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur A, et al. 2004.

Structural and functional analysis of rice genome. J Genet 83:79–99.

30. Baba T, Katagiri S, Tanoue H, Tanaka R, Chiden Y, et al. 2000.

Construction and characterization of rice genomic libraries: PAC library

of japonica variety, nipponbare and BAC library of indica variety,

Kasalath. Bulletin of the NIAR 14:41–49.

31. Wu J, Mizuno H, Hayashi-Tsugane M, Ito Y, Chiden Y, et al. 2003.

Physical maps and recombination frequency of six rice chromosomes.

Plant J 36:720–730.

32. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated

sequencer traces using phred. I. Accuracy assessment. Genome Res

8:175–185.

33. Ewing B, Green P. 1998. Base-calling of automated sequencer traces

using phred. II. Error probabilities. Genome Res 8:186–194.

34. Gordon D, Abajian C, Green P. 1998. Consed: a graphical tool for

sequence finishing. Genome Res 8:195–202.

35. Davenport RJ. 2001. Rice genome. Syngenta finishes, consortium goes

on. Science 291:807.

36. Wang J, Wong GK, Ni P, Han Y, Huang X, et al. 2002. RePS: a

sequence assembler that masks exact repeats identified from the

shotgun data. Genome Res 12:824–831.

37. Yu J, Wang J, Lin W, Li S, Li H, et al. 2005. The genomes of Oryza

sativa: A history of duplications. PLoS Biol 3:e38.

38. Zhong L, Zhang K, Huang X, Ni P, Han Y, et al. 2003. A statistical

approach designed for finding mathematically defined repeats in

shotgun data and determining the length distribution of clone-inserts.

Genomics Proteomics Bioinformatics 1:43–51.

39. Salamov A, Solovyev V. 2000. Ab initio gene finding in Drosophila

genomic DNA. Genome Res 10:516–522.

40. Leach J, McCouch S, Slezak T, Sasaki T, Wessler S. 2002. Why

finishing the rice genome matters. Science 296:45.

41. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, et al. 2002.

The genome sequence and structure of rice chromosome 1. Nature

420:312–316.

42. Feng Q, Zhang Y, Hao P, Wang S, Fu G, et al. 2002. Sequence and

analysis of rice chromosome 4. Nature 420:316–320.

43. The Rice Chromosome 10 Sequencing Consortium. 2003. In-depth

view of structure, activity, and evolution of rice chromosome 10.

Science 300:1566–1569.

44. de la Bastide M, Johnson D, Balija V, McCombie WR. 2001. Strategies

and techniques for finishing genomic sequence. In: Khush GS, Brar

DS, Hardy B, editors. Rice Genetics IV. New Delhi: Science Publishers,

Inc. pp 197–213.

45. Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, et al. 2003.

Collection, mapping, and annotation of over 28,000 cDNA clones from

japonica rice. Science 301:376–379.

46. Mao L, Wood TC, Yu Y, Budiman MA, Tomkins J, et al. 2000. Rice

transposable elements: a survey of 73,000 sequence-tagged-connec-

tors. Genome Res 10:982–990.

47. Turcotte K, Srinivasan S, Bureau T. 2001. Survey of transposable

elements from rice genomic sequences. Plant J 25:169–179.

48. Wang S, Wang J, Jiang J, Zhang Q. 2000. Mapping of centromeric

regions on the molecular linkage map of rice (Oryza sativa L.) using

centromere-associated sequences. Mol Gen Genet 263:165–172.


49. Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidi-

zation predating divergence of the cereals, and its consequences

for comparative genomics. Proc Natl Acad Sci USA 101:9903–

9908.

50. Salse J, Piegu B, Cooke R, Delseny M. 2002. Synteny between

Arabidopsis thaliana and rice at the genome level: a tool to identify

conservation in the ongoing rice genome sequencing project. Nucleic

Acids Res 30:2316–2328.

51. Vandepoele K, Simillion C, Van de Peer Y. 2003. Evidence that rice

and other cereals are ancient aneuploids. Plant Cell 15:2192–

2202.

52. The Rice Chromosomes 11 and 12 Sequencing Consortia. 2005. The

sequence of rice chromosomes 11 and 12, rich in disease resistance

genes and recent gene duplications. BMC Biology 3:20.

53. Matsuo M, Ito Y, Yamauchi R, Obokata J. 2005. The rice nuclear

genome continuously integrates, shuffles, and eliminates the chlor-

oplast genome to cause chloroplast-nuclear DNA flux. Plant Cell 17:

665–675.

54. Wong GK, Wang J, Tao L, Tan J, Zhang J, et al. 2002. Compositional

gradients in Gramineae genes. Genome Res 12:851–856.

55. Carels N, Bernardi G. 2000. Two classes of genes in plants. Genetics

154:1819–1825.

56. Cooke HJ. 2004. Silence of the centromeres-not. Trends Biotechnol

22:319–321.

57. Wu J, Yamagata H, Hayashi-Tsugane M, Hijishita S, Fujisawa M, et al.

2004. Composition and structure of the centromeric region of rice

chromosome 8. Plant Cell 16:967–976.

58. Zhang Y, Huang Y, Zhang L, Li Y, Lu T, et al. 2004. Structural features

of the rice chromosome 4 centromere. Nucleic Acids Res 32:2023–

2030.

59. Lamb JC, Theuri J, Birchler JA. 2004. What’s in a centromere? Genome

Biol 5:239.

60. Nagaki K, Cheng Z, Ouyang S, Talbert PB, Kim M, et al. 2004.

Sequencing of a rice centromere uncovers active genes. Nat Genet

36:138–145.

61. Hosouchi T, Kumekawa N, Tsuruoka H, Kotani H. 2002. Physical map-

based sizes of the centromeric regions of Arabidopsis thaliana

chromosomes 1, 2, and 3. DNA Res 9:117–121.

62. Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, et al. 2003.

Chromatin immunoprecipitation reveals that the 180-bp satellite repeat

is the key functional DNA element of Arabidopsis thaliana centromeres.

Genetics 163:1221–1225.

63. Saffery R, Sumer H, Hassan S, Wong LH, Craig JM, et al. 2003.

Transcription within a functional human centromere. Mol Cell 12:509–

516.

64. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000.

The genome sequence of Drosophila melanogaster. Science 287:

2185–2195.

65. Schoof H, Karlowski WM. 2003. Comparison of rice and Arabidopsis

annotation. Curr Opin Plant Biol 6:106–112.

66. Sakata K, Nagamura Y, Numa H, Antonio BA, Nagasaki H, et al. 2002.

RiceGAAS: an automated annotation system and database for rice

genome sequence. Nucleic Acids Res 30:98–102.

67. Yuan Q, Ouyang S, Liu J, Suh B, Cheung F, et al. 2003. The TIGR

rice genome annotation resource: annotating the rice genome and

creating resources for plant biologists. Nucleic Acids Res 31:229–

233.

68. Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, et al. 2002. Gramene, a tool for

grass genomics. Plant Physiol 130:1606–1613.

69. Karlowski WM, Schoof H, Janakiraman V, Stuempflen V, Mayer KF.

2003. MOsDB: an integrated information resource for rice genomics.

Nucleic Acids Res 31:190–192.

70. Rensink WA, Buell CR. 2004. Arabidopsis to rice. Applying knowledge

from a weed to enhance our understanding of a crop species. Plant

Physiol 135:622–629.

71. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, et al. 2005. The Institute for

Genomic Research Osa1 rice genome annotation database. Plant

Physiol 138:18–26.

72. Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, et al.

2006. The Rice Annotation Project Database (RAP-DB): hub for Oryza

sativa ssp. japonica genome information. Nucleic Acids Res 34:D741–

D744.

73. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. 2004.

Consistent over-estimation of gene number in complex plant genomes.

Curr Opin Plant Biol 7:732–736.

74. Jiao Y, Jia P, Wang X, Su N, Yu S, et al. 2005. A tiling microarray

expression analysis of rice chromosome 4 suggests a chromosome-

level regulation of transcription. Plant Cell 17:1641–1657.

75. Li L, Wang X, Xia M, Stolc V, Su N, et al. 2005. Tiling microarray

analysis of rice chromosome 10 to identify the transcriptome and

relate its expression to chromosomal architecture. Genome Biol 6:

R52.

76. Li L, Wang X, Stolc V, Li X, Zhang D, et al. 2006. Genome-wide

transcription analyses in rice using tiling microarrays. Nat Genet 38:

124–129.

77. Wolfe KH, Gouy M, Yang Y-W, Sharp PM, Li W-H. 1989. Date of the

monocot-dicot divergence estimated from chloroplast DNA sequence

data. Proc Natl Acad Sci USA 86:6201–6205.

78. Petsko GA. 2002. Grain of truth. Genome Biol 3:1007.

79. Sorrells ME, La Rota M, Bermudez-Kandianis CE, Greene RA, Kantety

R, et al. 2003. Comparative DNA sequence analysis of wheat and rice

genomes. Genome Res 13:1818–1827.

80. Singh NK, Raghuvanshi S, Srivastava SK, Gaur A, Pal AK, et al. 2004.

Sequence analysis of the long arm of rice chromosome 11 for rice-

wheat synteny. Funct Integr Genomics 4:102–117.

81. Salse J, Piegu B, Cooke R, Delseny M. 2004. New in silico insight into

the synteny between rice (Oryza sativa L.) and maize (Zea mays L.)

highlights reshuffling and identifies new duplications in the rice

genome. Plant J 38:396–409.

82. Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan L, et al.

2001. Comparative sequence analysis of colinear barley and

rice bacterial artificial chromosomes. Plant Physiol 125:1342–1353.

83. Klein PE, Klein RR, Vrebalov J, Mullet JE. 2003. Sequence-based

alignment of sorghum chromosome 3 and rice chromosome 1 reveals

extensive conservation of gene order and one major chromosomal

rearrangement. Plant J 34:605–621.

84. Dunford RP, Yano M, Kurata N, Sasaki T, Huestis G. 2002. Comparative

mapping of the barley Ppd-H1 photoperiod response gene region,

which lies close to a junction between two rice linkage segments.

Genetics 161:825–834.

85. Brunner S, Keller B, Feuillet C. 2003. A large rearrangement involving

genes and low-copy DNA interrupts the microcollinearity between rice

and barley at the Rph7 locus. Genetics 164:673–683.

86. Han F, Kleinhofs A, Ullrich SE, Kilian A, Yano M. 1998. Synteny with

rice-analysis of barley malting quality QTLs and RPG4 chromosomal

regions. Genome 41:373–380.

87. Zwick MS, Islam-Faridi MN, Czeschin DG, Wing RA, Hart GE, et al.

1998. Physical mapping of the liguleless linkage group in Sorghum

bicolor using rice RFLP-selected sorghum BACs. Genetics 148:1983–

1992.

88. Collins NC, Thordal-Christensen H, Lipka V, Bau S, Kombrink E, et al.

2003. SNARE-protein-mediated disease resistance at the plant cell

wall. Nature 425:973–977.

89. Armstead IP, Turner LB, Farrell M, Skot L, Gomez P, et al. 2004.

Synteny between a major heading-date QTL in perennial ryegrass

(Lolium perenne L.) and the Hd3 heading-date locus in rice. Theor Appl

Genet 108:822–828.

90. La Rota M, Sorrells ME. 2004. Comparative DNA sequence analysis of

mapped wheat ESTs reveals the complexity of genome relationships

between rice and wheat. Funct Integr Genomics 4:34–46.

91. Yano M, Katayose Y, Ashikari M, Yamanouchi U, Monna L, et al. 2000.

Hd1, a major photoperiod sensitivity quantitative trait locus in rice, is

closely related to the Arabidopsis flowering time gene CONSTANS.

Plant Cell 12:2473–2484.

92. Yamanouchi U, Yano M, Lin H, Ashikari M, Yamada K. 2002. A rice

spotted leaf gene, Spl7, encodes a heat stress transcription factor

protein. Proc Natl Acad Sci USA 99:7530–7535.

93. Komori T, Ohta S, Murai N, Takakura Y, Kuraya Y, et al. 2004. Map-

based cloning of a fertility restorer gene, Rf-1, in rice (Oryza sativa L.).

Plant J 37:315–325.

94. Miyoshi K, Ahn BO, Kawakatsu T, Ito Y, Itoh J, et al. 2004.

PLASTOCHRON1, a timekeeper of leaf initiation in rice, encodes

cytochrome P450. Proc Natl Acad Sci USA 101:875–880.

95. Sun X, Cao Y, Yang Z, Xu C, Li X, et al. 2004. Xa26, a gene conferring

resistance to Xanthomonas oryzae pv. oryzae in rice, encodes an LRR

receptor kinase-like protein. Plant J 37:517–527.

96. Ashikari M, Sakakibara H, Liu S, Yamamoto T, Takashi T, et al. 2005.

Cytokinin oxidase regulates rice grain production. Science 309:741–

745.

97. Bancroft I. 2002. Insights into cereal genomes from two draft genome

sequences of rice. Genome Biol 3:10–15.

98. Mayer K, Mewes H-W. 2002. How can we deliver the large plant

genomes? Strategies and perspectives. Curr Opin Plant Biol 5:173–

177.

99. Rabinowicz PD, McCombie WR, Martienssen RA. 2003. Gene enrich-

ment in plant genomic shotgun libraries. Curr Opin Plant Biol 6:150–

156.

100. Barbazuk WB, Bedell JA, Rabinowicz PD. 2005. Reduced representa-

tion sequencing: a success in maize and a promise for other plant

genomes. BioEssays 27:839–848.

101. Rensink WA, Buell CR. 2005. Microarray expression profiling resources

for plant genomics. Trends Plant Sci 10:603–609.
