Decoding the rice genome
Shubha Vij, Vikrant Gupta, Dibyendu Kumar, Ravi Vydianathan, Saurabh Raghuvanshi, Paramjit Khurana, Jitendra P. Khurana, and Akhilesh K. Tyagi*
Summary
Rice cultivation is one of the most important agricultural activities on earth, with nearly 90% of the crop being produced in Asia. Rice belongs to the family of crops that includes wheat, maize and barley, and it supplies more than 50% of calories consumed by the world population. Its immense economic value and relatively small genome size make it a focal point for scientific investigations, so much so that four whole genome sequence drafts of varying quality have been generated by both public and privately funded ventures. The availability of a complete and high-quality map-based sequence has provided the opportunity to study genome organization and evolution. Most importantly, the order and identity of 37,544 genes of rice have been unraveled. The sequence provides the required ingredients for functional genomics and molecular breeding programs aimed at unraveling intricate cellular processes and improving rice productivity. BioEssays 28:421–432, 2006. © 2006 Wiley Periodicals, Inc.
Introduction
Rice is one of the most important food crops of the world. More
than half of the world population depends on rice as the major
source of calories and proteins. About 840 million people in the
world are undernourished, which includes almost 200 million
children from developing countries (http://www.fao.org/). Rice
production will have to be increased substantially to meet
the demand of the growing world population, especially in
the Asian subcontinent. Rice production has, however,
declined in the last 4 years (http://www.irri.org).(1) This is due to
increasing urbanization leading to shortage of cultivable land
and deteriorating environmental conditions. To meet the
growing demand, a combination of breeding strategies and
molecular biology tools has to be used in synchrony to obtain
varieties that are high yielding and also more resistant to
various abiotic and biotic stresses.(2) Sequencing of the rice
genome was initiated with the aim of using the sequence
information to understand the function of its gene repertoire.
The pioneering work, which laid the foundation for rice genome
sequencing, was initiated in the early 1990s.(3) The work centered
on constructing a linkage map (http://rgp.dna.affrc.go.jp/pub-
licdata/geneticmap2000.index.html),(4) YAC (yeast artificial
chromosome) based physical map,(5,6) a transcript map(7,8)
and sequence-ready BAC/PAC (bacterial artificial chromo-
some/P1-derived artificial chromosome) physical map.(2,9)
Rice is also amenable to genetic transformation, thereby
providing an ideal crop system for functional genomics.(10)
Moreover, the rice genome shares a syntenic relationship with
other cereal crops like sorghum and maize.(11,12) Amongst the
different cereal crops, rice was chosen as the best representative
genome due to a relatively small estimated genome size of
~430 Mb.(13) This review aims to trace the path of rice genome
sequencing from its initiation to the current status and seeks to
interpret the information obtained from the genome of the first
food crop to be sequenced.
Genes and genomes
Interdisciplinary Centre for Plant Genomics and Department of Plant
Molecular Biology, University of Delhi South Campus, New Delhi
110 021, India.
Funding agency: The research work of our group is funded by the
Department of Biotechnology, Government of India, New Delhi. SV,
VG, DK and VR were supported by research fellowships from CSIR/
UGC, Government of India, New Delhi.
*Correspondence to: Akhilesh K. Tyagi, Interdisciplinary Centre for
Plant Genomics and Department of Plant Molecular Biology, University
of Delhi South Campus, New Delhi 110 021, India.
E-mail: [email protected]
DOI 10.1002/bies.20399
Published online in Wiley InterScience (www.interscience.wiley.com).
BioEssays 28:421–432, © 2006 Wiley Periodicals, Inc.
Abbreviations: BAC, Bacterial artificial chromosome; bp, Base pairs;
EST, Expressed sequence tag; IRGSP, International Rice Genome
Sequencing Project; JAK, Janus kinase; Mb, Million base pairs; MDRs,
Mathematically determined reads; MOsDB, MIPS Oryza sativa
DataBase; MTP, Minimum tiling path; ORF, Open reading frame;
PAC, P1-derived artificial chromosome; QTL, Quantitative trait loci;
RePS, Repeat masked phrap with scaffolding; RiceGAAS, Rice
Genome Automated Annotation System; SNP, Single nucleotide
polymorphism; SSR, Simple sequence repeat; STAT, Signal Transducers
and Activators of Transcription; STCs, Sequence tag connectors;
TIR-NB-LRR, Toll-Interleukin-Region-Nucleotide-Binding site-Leucine-
Rich Repeat; WGS, Whole Genome Shotgun; YAC, Yeast artificial
chromosome.
Strategies to sequence whole genomes
In order to sequence a large DNA molecule, it is first broken
into small fragments, which are cloned and sequenced. The
overlapping sequence reads are assembled using computer
software programs into contigs. The quality of the sequence
submitted in the database is variously classified as phase 0, I,
II and III. The initial raw sequence generated is referred to as
phase 0 and the assembled sequence represents phase I.
When the contigs are ordered and oriented, the sequence is
designated as phase II. All stages before the final stage
generate draft sequences of variable quality, which refers to the
fact that the sequence is incomplete. The final step is to convert
the draft into the finished sequence, also referred to as phase III
(http://www.ncbi.nlm.nih.gov/HTGS/). For small genomes, like
that of microbes, the finished sequence refers to a complete
sequence, without gaps. However, in the case of eukaryotes, it is
virtually impossible to get the complete genome information in a
single piece because they contain a large amount of repetitive
sequences, which are especially concentrated in the region
spanning centromeres and telomeres.(14) The two main
strategies for whole genome sequencing are discussed below.
In the Clone-by-Clone Shotgun approach, the genome
is fragmented and cloned in BAC/PAC vectors. Inserts of
genomic DNA fragments in the BAC/PAC vectors are
anchored physically to the genome, with the help of DNA
markers, to develop a minimum tiling path (MTP). The MTP is
generated using a combination of techniques including finger-
print patterns, sequence tag connectors (STCs) and marker
information. Each BAC/PAC (with average insert size of 100–
150 kb) present in the MTP is again broken into small-sized
fragments, cloned and sequenced. The sequence of the
genome is then obtained by merging the individual BAC/PAC
sequences.(14,15) Although this approach is time consuming,
it offers the advantage that each clone is anchored to a speci-
fic chromosome, thus making the task of finishing much
easier.(16) In addition, since the finished genome of a model
organism will leverage other genomes, this technique is
eventually cost effective. A high-quality sequence has been
generated for the human genome as well as the Arabidopsis and rice
genomes by adopting this approach.(17–19)
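The choice of clones for a minimum tiling path can be pictured as an interval-covering problem. The sketch below is only an illustration under a strong simplifying assumption: every clone is taken to be already anchored at known start and end coordinates, whereas in practice the path is inferred from fingerprints, STCs and markers. The clone names and coordinates are hypothetical.

```python
from typing import List, Tuple

Clone = Tuple[str, int, int]  # (name, start_kb, end_kb); hypothetical coordinates


def minimum_tiling_path(clones: List[Clone], region_start: int, region_end: int) -> List[str]:
    """Greedy interval cover: at each step pick the clone that starts at or
    before the current position and extends furthest to the right."""
    path: List[str] = []
    position = region_start
    while position < region_end:
        candidates = [c for c in clones if c[1] <= position and c[2] > position]
        if not candidates:
            # No clone spans this point: this corresponds to a physical gap.
            raise ValueError(f"physical gap at {position} kb")
        best = max(candidates, key=lambda c: c[2])
        path.append(best[0])
        position = best[2]
    return path


clones = [
    ("BAC_A", 0, 120),
    ("BAC_B", 60, 190),
    ("PAC_C", 100, 250),
    ("BAC_D", 180, 330),
    ("BAC_E", 240, 400),
]
print(minimum_tiling_path(clones, 0, 400))
```

Only the clones on the returned path would then be subcloned and sequenced, which is what keeps the redundancy of the clone-by-clone approach low.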
In the Whole Genome Shotgun (WGS) approach, the
genomic DNA as such is broken into small-sized fragments,
cloned and directly used for sequencing. The sequences are
then assembled to reconstruct the whole genome.(15) This
approach avoids the initial tasks of making BAC/PAC libraries,
constructing an MTP and constructing an individual library for each
BAC/PAC clone in the MTP. This strategy has been used
extensively for bacterial genomes. The WGS approach has
also been used for sequencing the human genome as well as
indica and japonica rice genomes.(20–22) The potential
problem in the use of WGS for eukaryotic genomes is misassembly
due to a high percentage of repetitive elements.(14,16) In
addition, each contig has to be individually anchored to the
chromosome, which makes the task of finishing more
laborious and cumbersome.
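The assembly step common to both strategies can be illustrated with a toy example. The sketch below uses greedy suffix-prefix merging, a textbook simplification and not the algorithm of PHRAP or RePS; the sequence, read length and minimum overlap are hypothetical, and sequencing errors and reverse-complement reads are ignored.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


genome = "ATGGCGTACCTGAAGTCA"  # toy repeat-free sequence
reads = [genome[i:i + 8] for i in range(0, len(genome) - 7, 2)]
print(greedy_assemble(reads))
```

Because this toy sequence contains no repeated 6-mers, every best overlap is a true one and a single contig is recovered. When a repeat is longer than the reads, two unrelated reads can share the best overlap, which is the source of the misassembly risk described above.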
2002: The year of rice genome
sequence drafts
Overview
Rice was the ideal candidate for genome sequencing after
Arabidopsis since Arabidopsis and rice are widely accepted as
model dicot and monocot plants, respectively.(23) Rice was the
first organism whose sequencing was pursued by four groups
independently, which itself speaks for the importance of its
genome information.(24,25) Although the task of sequencing
the rice genome was initiated by the publicly funded Interna-
tional Rice Genome Sequencing Project (IRGSP, see next
section),(26) it was a private company, Monsanto (St Louis, MO,
USA), that released the first draft of the rice genome in April
2000, based on the data generated at the University of
Washington. Monsanto sequenced a total of 3,391 BACs using
a clone-by-clone approach, to the level of 5X coverage, to
produce a draft sequence of 399 Mb. This sequence was
assembled in 52,202 contigs, representing 259 Mb of
non-overlapping data, which was expected to cover almost 60% of
the rice genome.(27) Meanwhile, two other groups, Syngenta
(Torrey Mesa Research Institute, San Diego, USA) and Beijing
Genomics Institute (BGI), China, also launched their independent
sequencing programs. IRGSP, Monsanto and Syngenta
chose the japonica cultivar ‘Nipponbare’ while BGI used the
indica cultivar ‘93-11’ for sequencing. The aim of both
the private ventures (Monsanto and Syngenta) and BGI for
rice genome sequencing was primarily gene discovery and
identification of molecular markers for breeding. Hence, these
groups aimed at obtaining a draft sequence to get a broad
overview of the rice genome.(14) The Monsanto and Syngenta
data were not released to the public database but could be
accessed for academic purposes on entering a database
registration agreement through their site (http://www.rice-
research.org, http://www.tmri.org).(21,27) Both Monsanto and
Syngenta also allowed their sequences to be incorporated into
the IRGSP sequence as long as IRGSP used the information to
improve the sequence from draft to finished level. The BGI
data, unlike the Monsanto and Syngenta data, were made
available freely (http://btn.genomics.org.cn/rice/). However,
the aim of IRGSP was to obtain a highly accurate finished
sequence of the rice genome.(26) As a first step, the IRGSP
announced the release of a high-quality map-based draft
sequence in the public domain in December 2002.(2) As a
result of these private and public ventures, the year 2002 saw
the release of three draft sequences of the rice genome. The
details of the participating groups and their efforts to produce
the draft sequence are given below.
Detailed history
The decision to sequence the rice genome was taken at the
International Plant Molecular Biology Conference held in 1997
in Singapore. Countries sharing a common interest in
sequencing the rice genome joined hands to achieve this
task(28) and launched the International Rice Genome Sequencing
Project (IRGSP).(26) It was the third largest public genome
project undertaken after the human and mouse genome
projects.(28) The consortium included laboratories from Japan,
USA, China, Taiwan, France, India, Korea, Brazil, Thailand
and UK.(29) The participants from the member countries
are Rice Genome Research Program (RGP) Japan (http://
rgp.dna.affrc.go.jp), The Institute for Genomic Research
(TIGR) USA (http://www.tigr.org/tdb/e2k1/osa1), National
Center for Gene Research (NCGR) China (http://www.
ncgr.ac.cn/), Genoscope France (http://www.genoscope.
cns.fr/), Arizona Genomics Institute (AGI) USA (http://
www.genome.arizona.edu), Cold Spring Harbor Laboratory
(CSHL) USA (http://nucleus.cshl.org/riceweb), Academia Sini-
ca Plant Genome Center (ASPGC) Taiwan (http://genome.
sinica.edu.tw), Indian Initiative for Rice Genome Sequencing
(IIRGS) India (http://www.genomeindia.org/), Plant Genome
Initiative at Rutgers (PGIR) USA (http://pgir.rutgers.edu),
Korea Rice Genome Research Program (KRGRP) Korea
(http://biogen.niast.go.kr), National Center for Genetic Engi-
neering and Biotechnology (BIOTEC) Thailand (http://
www.cs.ait.ac.th/nstda/biotec/biotec.html), Brazilian Rice Ge-
nome Initiative (BRIGI) Brazil (http://www.ufpel.tche.br/faem/
fitotecnia/fitomelhoramento), John Innes Centre United King-
dom (http://www.jic.bbsrc.ac.uk), Washington University
School of Medicine Genome Sequencing Center (GSC) USA
(http://genome.wustl.edu/) and Wisconsin Rice Genome
Project (GCOW) USA (http://www.gcow.wisc.edu).
The IRGSP effort evolved around a few basic points: the
sequencing strategy, the rice cultivar to be sequenced, the
accuracy of sequence and the sequence release policy. It was
decided to use the japonica cultivar ‘Nipponbare’ since it had
already been used by the Rice Genome Research Program
(RGP), Japan, as a source of EST sequencing and construction
of a dense linkage and YAC physical map.(26) The guidelines
for the method of sequencing, sequence quality and
release policy were developed largely on the same lines as the
Human Genome Project (http://www.gene.ucl.ac.uk/hugo/bermuda.htm).
The backbone of the IRGSP sequence-ready
physical map for sequencing was derived from a PAC library
comprising 71,040 clones(30) and a BAC library consisting of
48,960 clones.(31) The other equally important sources for
large insert clones for sequencing were a BAC library
(~90,000 clones) made at Clemson University Genomics
Institute (CUGI)(9) and BAC libraries made by Monsanto.(27)
The MTPs for the 12 chromosomes sequenced by IRGSP were
largely constructed using these large insert clones. The clones
were chosen to form the MTP using the fingerprint patterns,
BAC/PAC end sequences and information available from
markers of each clone.(2) The general strategy employed by
IRGSP for sequencing each large insert BAC/PAC clone was
to shear the DNA and make two libraries for each clone having
an insert size of ~2 and 5 kb, respectively. On an average, 2000
clones from the two libraries were randomly sequenced from
both ends to get ~10X coverage and assembled to the phase II
level, also referred to as the draft sequence.(29) The sequences
were then assembled using a combination of the base
caller, PHRED,(32,33) the assembler, PHRAP
(http://bozeman.mbt.washington.edu) and sequence viewer and editor,
CONSED(34) software. The IRGSP had set the target to finish
the rice genome sequence by 2008.(26) This goal changed
when Monsanto released the draft sequence of ‘japonica’ in
2000.(27) Two other groups, Syngenta and BGI published
drafts of ‘japonica’ and ‘indica’ simultaneously in 2002.(21,22)
Due to these developments, IRGSP decided to release the
draft (phase II data) before releasing the finished se-
quence.(14) Consequently, the draft sequence was released
by the consortium at a meeting held in Japan in December
2002 (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html). This
task was speeded by Monsanto’s decision to provide its BAC
libraries sequenced up to 5X coverage to IRGSP.(27) The
IRGSP draft sequence consisted of 3,380 BAC/PAC clones
representing 366 Mb of the rice genome. This sequence
covered 92% of the rice genome at >10X level. A total of
62,435 genes were predicted from the non-overlapping draft
sequence (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html).
Syngenta (Torrey Mesa Research Institute, CA, USA)
collaborated with Myriad Genetics (Salt Lake City, Utah) to
sequence the ‘japonica’ variety of rice.(21) The draft was
completed in just 14 months after inception of the program.(35)
The genome was sequenced using a whole genome shotgun
strategy. The repeat sequences were removed from the
data and the remaining sequence represented 390 Mb of
the estimated 420 Mb genome with coverage of 6X.(21) The
number of genes was estimated to be ~32,000 to 50,000 using
a combination of different prediction programs [FGENESH
(monocot), GeneMark.HMM (Arabidopsis and rice) and
GENSCAN (Arabidopsis and maize)].
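The coverage depths quoted for the various drafts (4X to 10X) can be related to expected completeness with the Lander–Waterman model. This model is not part of the review; it is brought in here as an assumption, and it treats reads as landing uniformly at random, which repeats and cloning bias violate in practice. Under it, the expected fraction of bases covered by at least one read at depth c is 1 − e^(−c).

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Lander-Waterman estimate: probability that a given base is hit by at
    least one read when reads fall uniformly at random at depth `coverage`."""
    return 1.0 - math.exp(-coverage)


# Depths quoted in the text for the different sequencing efforts.
for label, depth in [("4X", 4), ("5X", 5), ("6X", 6), ("10X", 10)]:
    fraction = expected_fraction_covered(depth)
    print(f"{label}: {100 * fraction:.3f}% of bases expected to be covered")
```

The model explains why a few-fold coverage already samples most bases, but it says nothing about whether the reads can be assembled correctly, which is where repeats make draft and finished sequence differ.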
The Beijing Genomics Institute (BGI), China, announced its
decision to sequence the ‘indica’ rice genome in May 2000.
BGI, like Syngenta, took the whole genome shotgun
route to sequence the rice genome and also released the
draft sequence in 2002.(22) The sequence was made public
by releasing the data on their website, http://www.btn.genomics.org.cn/rice/.
The repeat sequences were identified
mathematically: all 20-mer sequences whose frequency
was above a particular threshold were categorized as
mathematically determined reads (MDRs). On the basis of
this, almost 78 Mb of sequence was identified as repeat
sequence. These data were masked using RePS (Repeat
masked phrap with scaffolding)(36) and the remaining sequence
represented 361 Mb of the estimated 466 Mb genome
with a coverage of 4X. Among the different prediction
programs used for gene identification, FGENESH was found
to be the most useful. The program predicted ~46,022 to
55,615 genes in the BGI draft sequence.(22)
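The idea of flagging repeats by k-mer frequency can be sketched as follows. This is a minimal illustration and not the RePS implementation: the text gives k = 20 but does not state the frequency threshold, so the threshold here is hypothetical, and a small k and toy reads are used so the result can be checked by eye.

```python
from collections import Counter
from typing import List, Set


def frequent_kmers(reads: List[str], k: int = 20, threshold: int = 3) -> Set[str]:
    """Return every k-mer seen more than `threshold` times across all reads.
    The threshold value is a hypothetical choice."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n > threshold}


def mask_read(read: str, repeats: Set[str], k: int) -> str:
    """Replace every base that lies inside a flagged k-mer with 'N'."""
    masked = list(read)
    for i in range(len(read) - k + 1):
        if read[i:i + k] in repeats:
            masked[i:i + k] = ["N"] * k
    return "".join(masked)


# Toy data with k = 4 instead of 20; "ACGT" occurs five times.
reads = ["ACGTACGTAC", "TTACGTGG", "GGACGTCC", "CCACGTAA"]
repeats = frequent_kmers(reads, k=4, threshold=3)
print(repeats)
print(mask_read("TTACGTGG", repeats, k=4))
```

Masked bases are then ignored when overlaps are sought, so that reads are joined only through sequence that is unique at the chosen k.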
After the release of the indica draft sequence,(22) BGI
did additional sequencing to get a 6.28X coverage of the
genome,(37) which was almost identical to the coverage
obtained in the Syngenta draft,(21) although for a different
cultivar. For the purpose of analysis, repeats were masked in
both the draft sequences and reassembled.(36,38) These
independently assembled scaffolds from BGI and Syngenta
draft data were combined to get super scaffolds in such a way
as to get the order and orientation information but preserve the
SNP differences between the two subspecies.(37) The total
number of genes predicted in BGI, Syngenta and IRGSP data
using FGENESH(39) were 49,088, 45,824 and 43,635, respec-
tively.(37) This reduction in gene number compared to the
previous estimates (http://rgp.dna.affrc.go.jp/rgp/Dec18_NEWS.html)(21,22)
can be attributed to an improved identification
and elimination of TE-related genes.(37) Further, the objective of
obtaining almost all the rice genes in a single piece was fulfilled by
checking the BGI and Syngenta assembled sequence with a
collection of 19,079 full-length cDNA clones available in the
KOME database. Almost 98% of the genes could be aligned in a
single piece to either of the two genomes.(37) The salient features
of the updated BGI and Syngenta draft sequences(37) are
compared with the IRGSP finished sequence(18) in Table 1.
The draft sequences were not expected to match the quality
of finished sequence, yet proved to be quite useful to the rice
research community in general.(14) They were used extensively
for identifying genes in rice and for making comparisons
with other plant species. The drafts also accelerated the pace
for functional genomics, since the work on microarrays,
proteomics and several other genome-wide studies could
move much faster due to the ready availability of the sequence
information.(29) Other areas of research such as breeding for
introgression of better traits and evolutionary studies also gained
from the availability of the draft sequence. It was, however,
necessary to have the complete sequence information not only
for accurate interpretation of the rice genome in its own context,
but also to serve as a standard and resource for other cereal
genomes. The standard itself should be as reliable as possible to
help extrapolation of information in the true sense to other
economically valuable cereals.(40)
2004: The international year of rice—IRGSP
releases the map-based finished rice genome
sequence
The year 2004 was declared as the International Year of
Rice by the UN General Assembly. The theme of the program
was ‘Rice is life’, reflecting the importance of rice as a
food crop. The declaration was in recognition of the impor-
tance of rice, which provides food to more than half of the
world population and is a source of income for millions of
rice producers (http://www.fao.org/rice2004). The year also
marked the declaration of the complete rice genome sequence
by the IRGSP (http://rgp.dna.affrc.go.jp/IRGSP/celebrates/
celebrates.html). To commemorate the International Year of
Rice, IRGSP received the Research Accomplishment Award
at the world rice research conference for its role in decoding
the rice genome sequence (http://rgp.dna.affrc.go.jp/IRGSP/
WRRC2004-Award/WRRC2004-Award.html). Before the de-
claration of the completion of the rice genome, the finished
sequence of three chromosomes (1, 4 and 10) had already
been published.(41–43) To obtain the finished sequence, more
than 4,000 BAC/PAC clones were sequenced, of which 3,401
clones (with at least 10X coverage and 99.9% accuracy) were
used to obtain ~95% coverage of the 389 Mb rice genome. The
size of the genome (389 Mb) was estimated by adding the length
of non-overlapping sequence to the estimated size of
gaps. The finished sequence includes three completely
sequenced centromeres (chromosomes 4, 5 and 8). To reach
phase III level (finished sequence), the sequence of each
clone was checked for problem regions. The aim was to obtain
an error rate of less than one per 10 kb with the fewest possible
gaps (http://demeter.bio.bnl.gov/Guidelines.html). The main
problem regions were gaps (physical/sequencing), low-quality
regions and misassembled regions. Generally, these problems
were solved by any one or a combination of the following
approaches. For low-quality regions, resequencing was done
using universal or custom primers. Sequencing gaps were
closed by sequencing of bridge clones, PCR fragments or
direct sequencing of BAC/PAC clones. Physical gaps were
filled using PCR fragments or 40 kb fosmid clones. Sequencing
using alternate chemistry was done when the normally
used chemistry did not yield results. For regions that were not
resolved by these conventional methods, small insert libraries of
the region were made or transposons were used to disrupt the
difficult region. Each finished clone was finally confirmed by
comparing its in silico restriction pattern with the actual
restriction pattern.(18,44)
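The final check, comparing an in silico restriction pattern with the observed one, can be sketched as below. The sketch makes several assumptions that are not in the text: the clone sequence is treated as linear, a single enzyme is used (EcoRI, which cuts G^AATTC, serves as the example), and the 5% size tolerance is a hypothetical stand-in for gel resolution.

```python
from typing import List


def in_silico_digest(sequence: str, site: str, cut_offset: int) -> List[int]:
    """Fragment lengths obtained by cutting a linear sequence at every
    occurrence of `site`; `cut_offset` is the cut position within the site."""
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(boundaries, boundaries[1:]) if b > a]


def patterns_match(predicted: List[int], observed: List[int], tolerance: float = 0.05) -> bool:
    """True if both patterns have the same number of fragments and each
    size-ranked pair agrees within the (hypothetical) relative tolerance."""
    if len(predicted) != len(observed):
        return False
    return all(
        abs(p - o) <= tolerance * p
        for p, o in zip(sorted(predicted), sorted(observed))
    )


# Toy assembled sequence with two EcoRI sites.
assembled = "AAAA" + "GAATTC" + "CCCCCCCC" + "GAATTC" + "TT"
predicted = in_silico_digest(assembled, "GAATTC", cut_offset=1)
print(predicted)
print(patterns_match(predicted, [7, 5, 14]))
```

A mismatch in fragment number or size between the predicted and the observed digest points to a misassembly or a collapsed repeat in the clone sequence.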
The 370,733,456 bp long finished sequence was used to
construct 12 chromosome-specific pseudomolecules in
57 contigs with an average continuous sequence length of
6.9 Mb (Fig. 1). A total of 62 physical gaps still remain in the
finished sequence including 9 centromeres and 17 telomeres
constituting 18.1 Mb of rice genome. The total number of
genes predicted for the finished sequence is ~37,544. EST
markers were used to measure the genome coverage in the
finished sequence. Almost 99.4% of the available ESTs were
represented in the pseudomolecules.(18) The strength and
validity of the gene prediction programs were checked by
comparing the predicted genes to full-length cDNAs(45) and
ESTs (http://www.ncbi.nlm.nih.gov/dbEST/) available in the
database. A total of 61% of predicted genes showed a match with
either a cDNA or an EST.(18)
Table 1. Comparison of BGI and Syngenta draft
with IRGSP finished rice genome sequence
Sequencing group       Syngenta*             BGI*           IRGSP**
Genome size            433 Mb                466 Mb         389 Mb
Coverage               >6X                   6.28X          >10X
Assembled contigs      46,246                64,052         57
Subspecies/cultivar    japonica/Nipponbare   indica/93-11   japonica/Nipponbare
Sequencing strategy    WGS                   WGS            Clone-by-clone shotgun
Predicted genes        45,824                49,088         37,544
Based on reference 38* and 18**. The contigs in Syngenta and BGI draft
sequences were linked together to create much larger scaffolds and
super scaffolds.
What does the rice genome sequence reveal?
General features
The map-based sequence of the rice genome is estimated to
cover 95% of the 389 Mb rice genome. A total of 37,544 genes
have been predicted for the complete sequence with an
average gene density of one gene per 9.9 kb and average
gene length of 2,699 bp. Chromosomes 1 and 3 have the
highest gene density, with one gene per 8.9 kb
and one gene per 8.7 kb, respectively. Chromosomes 11 and 12
have the lowest gene density, with one gene per 10.7 kb and one
gene per 11.6 kb, respectively.(18)
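The headline figures can be cross-checked with simple arithmetic from the numbers quoted for the finished sequence. The sketch below assumes that gene density is computed over the sequenced bases (370,733,456 bp) rather than over the estimated 389 Mb genome; the review does not state which denominator was used.

```python
FINISHED_BP = 370_733_456          # length of the finished sequence
ESTIMATED_GENOME_BP = 389_000_000  # estimated genome size (389 Mb)
PREDICTED_GENES = 37_544

# Fraction of the estimated genome represented in the finished sequence.
coverage_percent = 100 * FINISHED_BP / ESTIMATED_GENOME_BP

# Average spacing between genes, assuming the sequenced bases as denominator.
gene_spacing_kb = FINISHED_BP / PREDICTED_GENES / 1000

print(f"Genome covered: {coverage_percent:.1f}%")
print(f"One gene per {gene_spacing_kb:.1f} kb")
```

Both results agree with the rounded values given in the text (about 95% coverage and one gene per 9.9 kb).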
The rice genome was estimated to comprise ~10–25%
repeat elements before the availability of the genome
sequence.(46,47) In the finished sequence, repeats constitute at
least 35% of the rice genome. Transposable element content
was highest for chromosomes 8 (38%) and 12
(38.3%) and lowest for chromosomes 1 (31%), 2 (29.8%) and 3
(29%). The number of class II repeat elements like hAT,
CACTA, IS630/Tc1/mariner, IS256/Mutator and IS5/Tourist is
more than two-fold greater than that of class I elements like LINEs,
SINEs, Ty1/copia and Ty3/gypsy. However, the class I
elements contribute more to the genome (19.4%) compared
to class II elements (12.9%). The presence of class II
elements such as IS256/Mutator, IS5/Tourist and IS630/Tc1/
mariner in the rice genome correlated with gene density and
they were most frequently present on the first three
chromosomes.(18)
Detailed analysis of the rice genome has led to the identification
of three main classes of duplications. The first class
of duplication is segmental, involving duplication of a large
number of genes along the length of the chromosome.(18,37,48,49)
The second class is tandem duplications involving individual
genes and the third is background duplications accounting for all
other duplications that could not be classified into either of the
first two categories.(18,37) When only those rice genes showing
homology to non-redundant KOME cDNAs were considered for
duplication analysis, a total of 18 pairs of duplicated segments
covering more than 65% of the length of the mapped super-scaffolds
were identified in the indica rice genome sequence.(37)
Analysis of the japonica rice genome sequence has shown
that almost 60% of the rice genome is duplicated.(18) All the
chromosomes have duplicated segments; however, the biggest
duplicated block is shared between chromosomes 11 and 12.
From analysis of the duplicated segments, it seems that the
whole genome duplication occurred about 55 to 70 million years
ago, before the divergence of the major cereals from their
common ancestor. Most of the observed duplications can be
attributed to this event. However, the chromosome 11–12
duplication is probably recent in origin and represents a
segmental duplication, which was earlier predicted to have
occurred about 20 million years ago.(21,37,49–51) A recent
analysis on the basis of the finished sequence, however,
estimates this to have happened as recently as 7.7 million years
ago.(52) It may be mentioned that, for assessing the age of
segmental duplications, the quality of sequence and annotation
is very important. Thus, analysis of the duplication events in rice
provides evidence for whole genome duplication, a recent
segmental duplication and several individual duplication events.
Figure 1. Pseudomolecules of the 12 rice chromosomes. The participating nations responsible for sequencing each chromosome are
given on the top. The arrowheads indicate the location of centromeres and green colour represents the position of physical gaps. The gap on
the short arm of chromosome 9 represents the nucleolar organizer consisting of 17S–5.8S–25S rDNA coding units. Values given at the bottom
represent estimated (red) and sequenced (green) bases for each chromosome in Mb (modified from reference 18).
Chromosome   Sequenced by                    Estimated (Mb)   Sequenced (Mb)
Chr1         Japan, Korea                    45.05            43.26
Chr2         Japan, UK                       36.78            35.95
Chr3         USA                             37.37            36.18
Chr4         China                           36.15            35.48
Chr5         Taiwan                          30.00            29.73
Chr6         Japan                           31.60            30.73
Chr7         Japan                           30.28            29.64
Chr8         Japan                           28.57            28.43
Chr9         Brazil, Japan, Korea, Thailand  30.53            22.69
Chr10        USA                             23.96            22.68
Chr11        France, India, USA              30.76            28.35
Chr12        France                          27.77            27.56
Analysis of the finished sequence for organellar insertions
showed that there were at least 421 chloroplast and 909
mitochondrial DNA insertions contributing ~0.2% each to
the nuclear genome. The pattern of chloroplast and mitochondrial
insertions in the rice genome suggests that their transfer
processes were independent of each other.(18) In another
analysis, the nuclear localized plastid DNA was similarly found
to be 0.2% of the total rice nuclear genome and was
predominantly present near the pericentromeric regions. The
number of such insertions was highest
in chromosome 1 and lowest for chromosomes 9, 10 and 11,
whereas the amount of inserted sequence (in kb) was greatest in
chromosome 10 and least in chromosome 11. Age distribution analysis
revealed that the phenomenon of chloroplast–nuclear DNA
flux involved a constant process of integration, shuffling and
elimination, with 80% of the insertions being eliminated from the nuclear
genome in the span of a million years.(53)
The GC content varies widely amongst different organisms,
ranging from 26 to 65%.(54) Study of GC content in plant
species showed that Gramineae genomes were richer in GC
content compared to dicot genomes.(55) The overall GC
content of the Arabidopsis genome is 34.7% with the exons
having 44.1% and introns having 32.7% GC content.(17) The
rice genome has an average GC content of 43.6% with 54.2%
GC content of exons and 38.3% GC content of introns.(18) This
GC content is much higher compared to Arabidopsis,
especially in the coding regions. Another difference observed
between the GC content of the two plants was that a distinct
gradient in GC content existed within the rice genes, with the 5′
end having on an average 25% more GC content compared to
the 3′ end. Such a gradient in GC content was not seen in
Arabidopsis genes.(22) In another study, the GC content of two
Gramineae data sets (rice and maize) was compared with two
dicots (Arabidopsis and tobacco) and a similar difference in the
gradient of GC content in the direction of transcription was
observed.(54)
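GC content and its gradient along a gene can be computed directly from sequence. The sketch below is a minimal illustration: the helper names, the number of windows and the toy sequence are hypothetical, and the gene is assumed to be supplied in the 5′ to 3′ direction of transcription.

```python
from typing import List


def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc_gradient(gene: str, windows: int = 4) -> List[float]:
    """GC percentage in consecutive equal-sized windows from the 5' end to
    the 3' end; trailing bases that do not fill a window are ignored."""
    size = len(gene) // windows
    return [gc_content(gene[i * size:(i + 1) * size]) for i in range(windows)]


# Toy sequence built to show a falling 5' to 3' gradient.
toy_gene = "GCGCGGCC" + "GCGCATAT" + "GCATATAT" + "ATATATAT"
print(gc_content(toy_gene))
print(gc_gradient(toy_gene))
```

Applied to real gene models, a profile that falls from the first window to the last would correspond to the gradient reported for rice, while a flat profile would correspond to what was seen in Arabidopsis.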
The centromere is the physical entity on the chromosome
that binds microtubules and other centromere-associated
proteins and serves as the point of chromatid segregation
during cell division.(56) With the exception of the yeast centromere,
which is made of ~125 bp of unique sequence, eukaryotic
centromeres are known to contain long stretches of repetitive
DNA sequences. Due to this, centromeres have long remained
recalcitrant to cloning, sequencing and subsequent
assembly.(56) Although most rice centromeres, like those of other eukaryotes,
are large in size (>1 Mb) and thus difficult to sequence, some
rice centromeres (chromosomes 4, 5 and 8) were smaller and
hence could be sequenced fully.(18,57,58) Rice centromeres
typically comprise 165 bp CentO satellite repeat sequences
and retrotransposon elements.(59) The rice chromosome 4
centromere is 124 kb long with 18 tracts of 379 CentO repeats
(59 kb) and 19 centromeric retroelements forming the core
centromere. There were four different types of retroelements
but the LTR retrotransposons like Ty3/gypsy-like retrotransposons
constituted the largest retrotransposon family.(58) The
chromosome 8 centromere was made of three clusters of
CentO repeats spanning 68.5 kb while for chromosome 5 the
size was 50.3 kb. The CentO repeats were tandemly arrayed
and interrupted by ~220 TE-related sequences, mostly Ty3/
gypsy-like retrotransposons. Chromosomes 4 and 8 had
similar amounts of CentO repeats but had different numbers
of retroelements. Surprisingly, 201 ORFs were predicted in the
1.97 Mb region around the chromosome 8 centromere. The
majority of these predicted genes were found to code for
hypothetical proteins but at least 20% showed similarity to
known proteins or rice full-length cDNAs.(57) Out of these
genes, 14 were present in the centromeric region and 12 of
these were experimentally confirmed to be functional.(60) The
presence of functional genes inside rice centromeres was
an interesting finding since centromeres were previously
considered to be transcriptionally silent heterochromatic
zones.(61,62) This finding is similar to the presence of genes in
the human neocentromeres.(63) Possibly, human neocentromeres
represent an earlier stage and the rice centromeres
(chromosomes 4 and 8) represent an intermediate stage in
centromere evolution. Thus, the rice centromeres are probably
not fully developed and, in due course of time, the
centromeric region would adapt to its role in cell division
and accumulate repetitive sequences, and the genes would lose
their expression and become transcriptionally silent.(60)
The complete sequencing of the rice centromeres has
revealed the basic structure of eukaryotic centromeres, has
helped identify the minimum sequence required for centromere
function and should prove useful for understanding their
evolution.
Genes and genomes
426 BioEssays 28.4
Gene predictions
Gene annotation consists of two basic steps. In the first
step, different computer prediction programs are used for
gene prediction and, in the second step, the predicted
genes are validated using information on gene function
available in the database.(64) Computational gene prediction
in rice is facilitated by several publicly available databases.
The different gene prediction programs used for annotating
rice by different groups include Genscan
(http://genes.mit.edu/GENSCAN.html), FGENESH
(http://www.softberry.com/berry.phtml), GeneMark.hmm
(http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi), GlimmerR
(http://www.tigr.org/software/glimmer/), RiceHMM
(http://rgp.dna.affrc.go.jp/RiceHMM/), tRNAscan-SE
(http://www.genetics.wustl.edu/eddy/tRNAscan-SE/), SplicePredictor
(http://bioinformatics.iastate.edu/cgi-bin/sp.cgi), GeneSplicer
(http://www.tigr.org/tdb/GeneSplicer/), GeneFinder
(http://rulai.cshl.org/tools/genefinder/) and NetGene2
(http://www.cbs.dtu.dk/services/NetGene2/).(65)
Of these, FGENESH has been found to be the most useful
prediction tool available for rice.(22) Several websites provide
detailed information about rice gene annotation. These include
the Rice Genome Automated Annotation System, RiceGAAS
(http://RiceGAAS.dna.affrc.go.jp), The Institute for Genomic
Research, TIGR (http://www.tigr.org/tdb/e2k1/osa1), Gramene
(http://www.gramene.org/) and the MIPS Oryza sativa
DataBase, MOsDB (http://mips.gsf.de/proj/plant/jsf/rice/index.jsp).(66–69)
The availability of ~400,000 ESTs and at least 32,000 full-length
cDNA clones has helped to a large extent in validating
computational gene predictions in rice
(http://www.ncbi.nlm.nih.gov/UniGene, cdna01.dna.affrc.go.jp/cDNA).(70)
The rice genome annotation at TIGR, named Osa1 (Oryza sativa 1), is the
most widely used database for several major gene array,
transcriptomics and annotation projects, as it provides details of
the annotation and sequence assembly of the rice genome.(71) The
annotation details of each gene model are linked to functional
information such as expression data, gene ontologies and tagged
lines (http://www.tigr.org/tdb/e2k1/osa1). Recently, another
database, the Rice Annotation Project Database (RAP-DB), has
been made public. It uses the IRGSP assembly and can be
accessed at http://rapdb.lab.nig.ac.jp/.(72)
The number of genes predicted in the finished rice genome
sequence is ~37,544.(18) In addition, a 7 Mb region on
chromosome 9 and a 0.25 Mb region on chromosome 11 code
for ribosomal RNA, and 763 tRNA genes were predicted. The
gene count is smaller than the numbers predicted from the draft
sequences. In the IRGSP data, the large difference between the
number of genes predicted for the previously finished chromosomes
1, 4 and 10 and for the draft sequences, on the one hand, and for
the finished genome sequence, on the other, is explained by
improvements in the annotation process. The count of 37,544
excludes 17,752 transposon-related genes: FGENESH predicted a
total of 55,296 genes for the finished sequence, which was
comparable to the numbers estimated from the previously finished
chromosomes or draft sequences. The view that the rice genes
without Arabidopsis homologues could include wrongly pre-
dicted genes is also supported by the finding that this subset of
genes is largely different in its features from the rest of the rice
genes. These differences include smaller size, more introns
and unusual 3′ GC richness. Another striking feature of these
genes is that only a very small percentage is supported by EST
data.(73) To test this point further, these rice genes were
annotated using the maize transcriptome data (representing
more than 80% of maize genes). Only 15% of the rice genes
lacking Arabidopsis homologues were supported by maize
ESTs. Further, manual annotation of these genes showed that
at least 30% of these genes were transposable elements.(73)
This study supports the number of genes predicted by IRGSP
for the finished sequence.(18) It is possible that the predicted
rice genes that are not supported by ESTs could be supported
by expression evidence from other functional genomics
approaches like tiling microarrays(74–76) or MPSS (http://
mpss.udel.edu/rice) and only detailed analysis will give a true
picture about the nature of these genes.
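
The relation between the FGENESH total and the reported gene count can be checked with simple arithmetic. The short Python sketch below uses only the numbers quoted above; the variable and function names are illustrative assumptions and do not belong to any annotation pipeline.

```python
# Counts quoted in the text for the finished IRGSP sequence.
FGENESH_TOTAL = 55_296        # all FGENESH gene predictions
TRANSPOSON_RELATED = 17_752   # predictions classed as transposon-related


def non_te_gene_count(total: int, transposon_related: int) -> int:
    """Return the number of predictions left after removing TE-related ones."""
    return total - transposon_related


non_te = non_te_gene_count(FGENESH_TOTAL, TRANSPOSON_RELATED)
te_fraction = TRANSPOSON_RELATED / FGENESH_TOTAL

print(non_te)                 # 37544
print(f"{te_fraction:.1%}")   # 32.1%
```

On these figures, roughly a third of the raw FGENESH predictions were set aside as transposon-related before arriving at the reported gene number.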
The 37,544 genes predicted from the IRGSP finished
sequence could be classified into 3,328 different types of
domains. Out of the most abundant domains predicted, five
were protein kinases. More than half of the predicted proteins
could be associated with a biological process.(18) A total of
71% of the predicted rice gene products had homologues in
Arabidopsis, while the percentages of rice gene products with
homologues in humans, Drosophila, C. elegans, yeast,
Synechocystis and E. coli were 40.8%, 38%, 36.5%, 30.2%,
17.6% and 10.2%, respectively (Fig. 2).
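
As a rough illustration, the percentages above can be converted into approximate gene counts. The sketch assumes that each percentage is taken over all 37,544 predicted genes, which the text does not state explicitly; the names used are illustrative.

```python
TOTAL_GENES = 37_544  # genes predicted from the IRGSP finished sequence

# Percentage of predicted rice gene products with a homologue, as quoted.
HOMOLOGUE_PERCENT = {
    "Arabidopsis": 71.0,
    "human": 40.8,
    "Drosophila": 38.0,
    "C. elegans": 36.5,
    "yeast": 30.2,
    "Synechocystis": 17.6,
    "E. coli": 10.2,
}


def approx_count(total: int, percent: float) -> int:
    """Convert a percentage of `total` into a rounded gene count."""
    return round(total * percent / 100)


approx = {org: approx_count(TOTAL_GENES, pct)
          for org, pct in HOMOLOGUE_PERCENT.items()}

for organism, count in approx.items():
    print(f"{organism:14s} ~{count}")
```

Under this assumption, the Arabidopsis figure corresponds to about 26,656 genes.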
Comparison of the two sequenced
plant genomes
Rice and Arabidopsis are distantly related species that
diverged about 200 million years ago.(77) The rice genome is
about three times larger and has almost 50% more genes than
the Arabidopsis genome.(18,65) The largest syntenic region
observed to date between the two organisms was identified in
an analysis of chromosome 4 finished sequence with the
Arabidopsis genome. The syntenic region covered 119
Arabidopsis proteins showing an identity of at least 70% over
a minimum stretch of 30 amino acids. This analysis revealed
that collinearity between the two genomes exists but is
preserved only to a small extent.(21) Analysis of the Arabidopsis
genome revealed that only 35% of its genes were unique
while at least 17% of the genes were tandemly duplicated.(17)
Similarly, in rice, almost 60% of the genome is duplicated while
14% of its genes are tandemly duplicated.(18) The high number
of duplicated genes in both the plant genomes indicates that
gene diversity in plants has probably arisen through genome
duplication.(22)
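
The similarity criterion quoted above (at least 70% identity over a minimum stretch of 30 amino acids) can be expressed as a simple filter. The sketch below is only an illustration: the `Hit` record, its field names and the example hits are hypothetical and are not the output format of any particular alignment program or of the cited analysis.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One protein alignment between a rice gene and an Arabidopsis protein."""
    rice_gene: str
    arabidopsis_protein: str
    percent_identity: float  # identity over the aligned stretch, in percent
    aligned_length: int      # length of the aligned stretch, in amino acids


def passes_threshold(hit: Hit, min_identity: float = 70.0,
                     min_length: int = 30) -> bool:
    """True if the hit meets both the identity and the length cut-off."""
    return (hit.percent_identity >= min_identity
            and hit.aligned_length >= min_length)


# Hypothetical example hits, for illustration only.
hits = [
    Hit("rice_gene_A", "At_protein_1", 82.5, 140),  # passes both cut-offs
    Hit("rice_gene_B", "At_protein_2", 91.0, 22),   # stretch too short
    Hit("rice_gene_C", "At_protein_3", 55.0, 300),  # identity too low
]

kept = [h.rice_gene for h in hits if passes_threshold(h)]
print(kept)  # ['rice_gene_A']
```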
In a recent study done with the IRGSP data, almost 90%
of Arabidopsis proteins had rice homologues, while ~71% of
predicted rice proteins had an Arabidopsis homologue. To
eliminate the possibility of wrongly predicted genes, the homology
search was repeated using only those predicted rice genes
that were supported by an EST or a cDNA, and the percentage
of predicted rice genes with an Arabidopsis homologue
increased to 88%.(18) Comparison of the predicted rice and
Arabidopsis genes shows that the organisms share many
common genes. These include most of the disease- and
flowering-related genes, phosphate transporters, transcrip-
tion factors and those involved in metabolism. In contrast,
several genes that are present in other sequenced organisms
are absent from these two plant genomes.
These include members of gene families encoding nuclear
steroid receptor, p53, Notch/lin12, Janus kinase (JAK) and
Signal Transducers and Activators of Transcription (STAT).
Another important category comprises genes that are present
in either Arabidopsis or rice, but not both. Some of the Arabidopsis genes
that do not have rice homologues are FRIGIDA, FLOWERING
LOCUS C, UNUSUAL FLORAL ORGANS and SUPERMAN
amongst the flowering-related genes and TIR-NB-LRR (Toll/
Interleukin-1 Receptor-Nucleotide-Binding site-Leucine-Rich
Repeat) amongst the disease-related genes.(21,25) Amongst the
predicted rice genes, ~8% do not show homologues in
Arabidopsis and include well-known cereal-specific genes like
prolamins along with several proteins such as chitinase
precursor, seed allergen, starch branching enzyme, wound-
induced protease inhibitor and abscisic stress ripening
protein.(18) However, the majority of these genes either show no
hits in the database or match only hypothetical proteins. The basic
difference between monocots and dicots will become clear
only when the function of these largely unknown cereal-
specific genes becomes clear.(18)
Comparison of rice with cereal genomes
The cereals diverged from their common ancestor around
60 million years ago.(78) Despite this period of independent
evolution, the genes as well as their order in cereals are seen to
be quite conserved.(12) A major advantage of sequencing the
rice genome was its syntenic relationship with other cereal
species.(26) Analysis of the rice genome draft sequence showed
that homologues of almost 98% of wheat, barley and maize
proteins could be identified in rice.(21) However, most of the
analyses, which report strong syntenic relationship among
cereals, have been done at a low resolution due to the limited
number of common markers available.(79) There are several
instances, however, where collinearity between cereals was
found to be disrupted when studied at a higher resolution.

Figure 2. Comparison of predicted rice proteins with proteins from model organisms at different e-value cut-offs (modified from
reference 18). [Bar chart. Y-axis: number of predicted proteins (0 to 26,000). X-axis: Arabidopsis, yeast, C. elegans, Drosophila, human, Synechocystis, E. coli. Legend (e-value bins): <E-200; <E-150 to E-200; <E-100 to E-150; <E-50 to E-100; <E-10 to E-50; <E-5 to E-10.]

For
instance, high-resolution mapping was done for studying
wheat–rice synteny using a total of 4,485 wheat ESTs for
comparison with the rice genome sequence. The analysis
revealed that there was a general conservation of genes and
their order in the two species. However, several breaks in
the collinearity were observed.(79) In a similar study, 2,932
predicted genes from the long arm of chromosome 11 were
compared with wheat ESTs. Although the genes were
conserved in the analyzed region, several rearrangements
could be seen that disrupted the gene order.(80) An analysis of
rice sequence with 2629 maize markers identified 656 putative
orthologs but revealed several breaks in collinearity.(81) Similar
sequence-based alignments of rice done with sorghum and
barley revealed that there were some rearrangements along
with a general conservation of synteny.(82,83) Also, there are
studies where identifying candidate genes on the basis of
synteny did not prove useful. For instance, attempts to identify
the Rph7 (leaf rust resistance) gene and the Ppd-H1 (photoperiod
response) gene in barley on the basis of their expected
collinearity in rice did not yield the expected results.(84,85)
However, comparative genomics based on the syntenic
relationship of rice with other cereals has helped in identifying
several important genes, such as the QTL for malting quality in
barley, a major heading-date QTL in perennial ryegrass, the liguleless
region in sorghum and Ror2, a gene conferring resistance to
powdery mildew disease in barley.(86–89) Hence, from these
studies, it seems that cross-species comparison would be
useful in identifying genes of interest. However, in each case,
collinearity will have to be investigated at the micro level in the
region of interest using high-density genetic maps.(90)
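
The idea of a break in collinearity can be made concrete with a small sketch. It assumes, purely for illustration, that orthologous markers have already been ordered along one genome (for example wheat) and that each carries the rank of its counterpart in the rice genome; a break is then counted wherever two neighbouring markers do not have adjacent rice ranks. This is one simple definition, not the method used in the cited studies, and the function name is hypothetical.

```python
from typing import List, Tuple


def collinearity_breaks(rice_ranks: List[int]) -> List[Tuple[int, int]]:
    """Return index pairs (i, i + 1) of neighbouring markers whose rice
    ranks are not adjacent, i.e. where gene order is interrupted.

    `rice_ranks` lists, in the marker order of the other genome, the rank
    of each orthologue along the rice chromosome. Adjacent ranks in either
    direction (+1 or -1) are treated as collinear, so an inverted block is
    flagged only at its two ends.
    """
    breaks = []
    for i in range(len(rice_ranks) - 1):
        if abs(rice_ranks[i + 1] - rice_ranks[i]) != 1:
            breaks.append((i, i + 1))
    return breaks


# Hypothetical example: a block (ranks 7, 8) inserted out of place.
example = [1, 2, 3, 7, 8, 4, 5, 6]
print(collinearity_breaks(example))  # [(2, 3), (4, 5)]
```

With sparse markers, many such breaks fall between markers and go unseen, which is consistent with the point above that collinearity must be examined at the micro level in the region of interest.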
Conclusions
Sequencing of the rice genome was initiated by four different
groups. Monsanto and IRGSP used the clone-by-clone
approach, while Syngenta and BGI made use of the WGS
approach to sequence the rice genome. Amongst these
different groups, only the publicly funded IRGSP was
interested in the complete sequence information, while the
other groups pursued sequencing for gene discovery and
marker information.(14) Analysis of the rice genome sequence
has confirmed the syntenic relationship amongst cereal
crops.(21) However, it also shows that collinearity is not as well
preserved as previously thought. Hence, information from rice
can be extrapolated to other cereal crops, but only after
studying microcollinearity in the region of interest.(90) A total of
18,828 SSRs and a large number of SNPs (0.5–0.8%) have
been identified in the finished rice genome sequence, which
will aid map-based cloning.(18) In fact, the availability of the
sequence has already facilitated such efforts, and several rice
genes have been cloned using the rice genome sequence
information. These include Hd1, a major photoperiod-sensitivity QTL;
PLASTOCHRON1, a regulator of leaf initiation; Spl7, a heat-stress
transcription factor; Rf-1, a fertility-restorer gene; Xa26, a gene
conferring resistance to Xanthomonas oryzae pv. oryzae; and
Gn1a, a cytokinin oxidase gene representing a QTL for grain
production.(91–96)
The sequence availability of two plant genomes (Arabidopsis
and rice) provided an opportunity to compare their sequences
and understand the special features of plant genomes. Plants
have a much larger number of genes compared to other
sequenced organisms. This is mainly due to the higher number
of duplicated genes.(21) Thus, plant genomes seem to have
evolved through polyploidization and subsequent gene loss.(97)
There is also a need for functional validation of predicted genes
in rice and Arabidopsis to serve as the landmark for other plant
genomes for which full sequencing will probably never be
done.(98–100) The recent use of the rice genome sequence in
microarray projects indicates its importance as a tool for global
gene expression profiling, discovery of new genes and validation of
computational gene predictions.(74–76,101) Rice is also amongst
the few organisms for which sequences are available in two
subspecies.

Box 1. Glossary of terms
Bacterial artificial chromosome (BAC): A bacterial cloning vector that can typically carry 100–150 kb insert DNA.
Contig: A contiguous DNA sequence generated by assembling overlapping sequences.
Draft sequence: An incomplete sequence in terms of both contiguity and likelihood of errors.
Finished sequence: A sequence with an error rate of less than one error per 10 kb, assembled in the correct order and orientation with least possible gaps.
Minimum tiling path (MTP): The least number of overlapping clones that span a chromosomal region.
P1-derived artificial chromosome (PAC): A cloning vector derived from P1 phage that can carry typically 100–150 kb insert DNA.
Retrotransposon: A type of transposon that can move by producing an RNA intermediate.
Scaffolds: An ordered set of contigs placed on the chromosome.
Sequencing gap: A gap in the sequence that can be filled by sequencing of bridge clones available in the region.
Shotgun sequencing: An approach to sequence DNA by breaking it into a large number of fragments that can be sequenced individually.
Transposon: Any segment of DNA that can change its position in the genome.
Yeast artificial chromosome (YAC): A high capacity cloning vector that can typically carry 300–400 kb DNA.

Analysis of the alignments shows that, although the
genes are highly conserved in the two subspecies, the major
difference lies in the intergenic regions.(37) Thus, the availability
of the rice genome sequences has given a deeper insight into
the gene content, regulatory elements and the nature of repeats
in its genome.(18) But, in the end, the real worth of the rice genome
sequence will be measured in terms of its agro-economic
benefits. The huge efforts made in studying the rice genome will
finally be justified when the sequence information is used in
developing better rice varieties with greater yield and enhanced
tolerance to various abiotic and biotic stresses.
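
Box 1 defines a minimum tiling path as the least number of overlapping clones that span a chromosomal region. A minimal sketch of how such a path could be chosen is given below, assuming each clone is described by hypothetical start and end coordinates on the region. It uses the standard greedy interval-cover approach (always extend with the clone that reaches furthest) and is not the procedure used by any of the sequencing groups. For simplicity, clones that merely abut are treated as joined, whereas a real tiling path would require a minimum overlap.

```python
from typing import List, Optional, Tuple

Clone = Tuple[str, int, int]  # (name, start, end) in arbitrary coordinates


def minimum_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> Optional[List[str]]:
    """Greedy selection of the fewest clones covering the region.

    Returns the clone names in order along the region, or None if the
    clones leave a gap somewhere between region_start and region_end.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start position
    path: List[str] = []
    covered = region_start  # everything left of this point is covered
    i = 0
    while covered < region_end:
        best: Optional[Clone] = None
        # Among clones starting at or before the covered point,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            return None  # no clone extends the coverage: a gap
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical clones (coordinates in kb), for illustration only.
clones = [("a", 0, 120), ("b", 50, 150), ("c", 100, 260),
          ("d", 140, 250), ("e", 240, 400)]
print(minimum_tiling_path(clones, 0, 400))  # ['a', 'c', 'e']
```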
References
1. Peng S, Huang J, Sheehy JE, Laza RC, Visperas RM, et al. 2004. Rice
yields decline with higher night temperature from global warming. Proc
Natl Acad Sci USA 101:9971–9975.
2. Sasaki T, Matsumoto T, Antonio BA, Nagamura Y. 2005. From mapping
to sequencing, post-sequencing and beyond. Plant Cell Physiol 46:3–
13.
3. Sasaki T. 1998. The rice genome project in Japan. Proc Natl Acad Sci
USA 95:2027–2028.
4. Harushima Y, Yano M, Shomura A, Sato M, Shimano T, et al. 1998. A
high-density rice genetic linkage map with 2275 markers using a single
F2 population. Genetics 148:479–494.
5. Umehara Y, Inagaki A, Tanoue H, Yasukochi Y, Nagamura Y, et al.
1995. Construction and characterization of a rice YAC library for
physical mapping. Mol Breed 1:79–89.
6. Saji S, Umehara Y, Antonio BA, Yamane H, Tanoue H, et al. 2001. A
physical map with yeast artificial chromosome (YAC) clones covering
63% of the 12 rice chromosomes. Genome 83:32–37.
7. Yamamoto K, Sasaki T. 1997. Large-scale EST sequencing in rice.
Plant Mol Biol 35:135–144.
8. Wu J, Maehara T, Shimokawa T, Yamamoto S, Harada C, et al. 2002. A
comprehensive rice transcript map containing 6591 expressed
sequence tag sites. Plant Cell 14:525–535.
9. Chen M, Presting G, Barbazuk WB, Goicoechea JL, Blackmon B, et al.
2002. An integrated physical and genetic map of the rice genome.
Plant Cell 14:537–545.
10. Tyagi AK, Mohanty A. 2000. Rice transformation for crop improvement
and functional genomics. Plant Science 158:1–18.
11. Moore G, Devos KM, Wang Z, Gale MD. 1995. Grasses, line up and
form a circle. Curr Biol 5:737–739.
12. Gale MD, Devos KM. 1998. Plant comparative genetics after 10 years.
Science 282:656–659.
13. Goff SA. 1999. Rice as a model for cereal genomics. Curr Opin Plant
Biol 2:86–89.
14. Buell CR. 2002. Obtaining the sequence of the rice genome and
lessons learned along the way. Trends Plant Sci 7:538–542.
15. Green ED. 2001. Strategies for the systematic sequencing of complex
genomes. Nat Rev Genet 2:573–583.
16. Waterston RH, Lander ES, Sulston JE. 2002. On the sequencing of the
human genome. Proc Natl Acad Sci USA 99:3712–3716.
17. The Arabidopsis Genome Initiative. 2000. Analysis of the genome
sequence of the flowering plant Arabidopsis thaliana. Nature 408:796–
815.
18. International Rice Genome Sequencing Project. 2005. The map-based
sequence of the rice genome. Nature 436:793–800.
19. International Human Genome Sequencing Consortium. 2001. Initial
sequencing and analysis of the human genome. Nature 409:860–
921.
20. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, et al. 2001. The
sequence of the human genome. Science 291:1304–1351.
21. Goff SA, Ricke D, Lan TH, Presting G, Wang R, et al. 2002. A draft
sequence of the rice genome (Oryza sativa L. ssp. japonica). Science
296:92–100.
22. Yu J, Hu S, Wang J, Wong GK, Li S, et al. 2002. A draft sequence of the
rice genome (Oryza sativa L. ssp. indica). Science 296:79–92.
23. Izawa T, Shimamoto K. 1996. Becoming a model plant: the importance
of rice to plant science. Trends Plant Sci 1:95–99.
24. Buell CR. 2002. Current status of the sequence of the rice genome and
prospects for finishing the first monocot genome. Plant Physiol
130:1585–1586.
25. Delseny M. 2003. Towards an accurate sequence of the rice genome.
Curr Opin Plant Biol 6:101–105.
26. Sasaki T, Burr B. 2000. International Rice Genome Sequencing Project:
the effort to completely sequence the rice genome. Curr Opin Plant Biol
3:138–141.
27. Barry GF. 2001. The use of the Monsanto draft rice genome sequence
in research. Plant Physiol 125:1164–1165.
28. Eckardt NA. 2000. Sequencing the rice genome. Plant Cell 12:2011–
2017.
29. Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur A, et al. 2004.
Structural and functional analysis of rice genome. J Genet 83:79–99.
30. Baba T, Katagiri S, Tanoue H, Tanaka R, Chiden Y, et al. 2000.
Construction and characterization of rice genomic libraries: PAC library
of japonica variety, nipponbare and BAC library of indica variety,
Kasalath. Bulletin of the NIAR 14:41–49.
31. Wu J, Mizuno H, Hayashi-Tsugane M, Ito Y, Chiden Y, et al. 2003.
Physical maps and recombination frequency of six rice chromosomes.
Plant J 36:720–730.
32. Ewing B, Hillier L, Wendl MC, Green P. 1998. Base-calling of automated
sequencer traces using phred. I. Accuracy assessment. Genome Res
8:175–185.
33. Ewing B, Green P. 1998. Base-calling of automated sequencer traces
using phred. II. Error probabilities. Genome Res 8:186–194.
34. Gordon D, Abajian C, Green P. 1998. Consed: a graphical tool for
sequence finishing. Genome Res 8:195–202.
35. Davenport RJ. 2001. Rice genome. Syngenta finishes, consortium goes
on. Science 291:807.
36. Wang J, Wong GK, Ni P, Han Y, Huang X, et al. 2002. RePS: a
sequence assembler that masks exact repeats identified from the
shotgun data. Genome Res 12:824–831.
37. Yu J, Wang J, Lin W, Li S, Li H, et al. 2005. The genomes of Oryza
sativa: A history of duplications. PLoS Biol 3:e38.
38. Zhong L, Zhang K, Huang X, Ni P, Han Y, et al. 2003. A statistical
approach designed for finding mathematically defined repeats in
shotgun data and determining the length distribution of clone-inserts.
Genomics Proteomics Bioinformatics 1:43–51.
39. Salamov A, Solovyev V. 2000. Ab initio gene finding in Drosophila
genomic DNA. Genome Res 10:516–522.
40. Leach J, McCouch S, Slezak T, Sasaki T, Wessler S. 2002. Why
finishing the rice genome matters. Science 296:45.
41. Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, et al. 2002.
The genome sequence and structure of rice chromosome 1. Nature
420:312–316.
42. Feng Q, Zhang Y, Hao P, Wang S, Fu G, et al. 2002. Sequence and
analysis of rice chromosome 4. Nature 420:316–320.
43. The Rice Chromosome 10 Sequencing Consortium. 2003. In-depth
view of structure, activity, and evolution of rice chromosome 10.
Science 300:1566–1569.
44. de la Bastide M, Johnson D, Balija V, McCombie WR. 2001. Strategies
and techniques for finishing genomic sequence. In: Khush GS, Brar
DS, Hardy B, editors. Rice Genetics IV. New Delhi: Science Publishers,
Inc. pp 197–213.
45. Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, et al. 2003.
Collection, mapping, and annotation of over 28,000 cDNA clones from
japonica rice. Science 301:376–379.
46. Mao L, Wood TC, Yu Y, Budiman MA, Tomkins J, et al. 2000. Rice
transposable elements: a survey of 73,000 sequence-tagged-connec-
tors. Genome Res 10:982–990.
47. Turcotte K, Srinivasan S, Bureau T. 2001. Survey of transposable
elements from rice genomic sequences. Plant J 25:169–179.
48. Wang S, Wang J, Jiang J, Zhang Q. 2000. Mapping of centromeric
regions on the molecular linkage map of rice (Oryza sativa L.) using
centromere-associated sequences. Mol Gen Genet 263:165–172.
49. Paterson AH, Bowers JE, Chapman BA. 2004. Ancient polyploidi-
zation predating divergence of the cereals, and its consequences
for comparative genomics. Proc Natl Acad Sci USA 101:9903–
9908.
50. Salse J, Piegu B, Cooke R, Delseny M. 2002. Synteny between
Arabidopsis thaliana and rice at the genome level: a tool to identify
conservation in the ongoing rice genome sequencing project. Nucleic
Acids Res 30:2316–2328.
51. Vandepoele K, Simillion C, Van de Peer Y. 2003. Evidence that rice
and other cereals are ancient aneuploids. Plant Cell 15:2192–
2202.
52. The Rice Chromosomes 11 and 12 Sequencing Consortia. 2005. The
sequence of rice chromosomes 11 and 12, rich in disease resistance
genes and recent gene duplications. BMC Biology 3:20.
53. Matsuo M, Ito Y, Yamauchi R, Obokata J. 2005. The rice nuclear
genome continuously integrates, shuffles, and eliminates the chlor-
oplast genome to cause chloroplast-nuclear DNA flux. Plant Cell 17:
665–675.
54. Wong GK, Wang J, Tao L, Tan J, Zhang J, et al. 2002. Compositional
gradients in Gramineae genes. Genome Res 12:851–856.
55. Carel N, Bernardi G. 2000. Two classes of genes in plants. Genetics
154:1819–1825.
56. Cooke HJ. 2004. Silence of the centromeres-not. Trends Biotechnol
22:319–321.
57. Wu J, Yamagata H, Hayashi-Tsugane M, Hijishita S, Fujisawa M, et al.
2004. Composition and structure of the centromeric region of rice
chromosome 8. Plant Cell 16:967–976.
58. Zhang Y, Huang Y, Zhang L, Li Y, Lu T, et al. 2004. Structural features
of the rice chromosome 4 centromere. Nucleic Acids Res 32:2023–
2030.
59. Lamb JC, Theuri J, Birchler JA. 2004. What’s in a centromere? Genome
Biol 5:239.
60. Nagaki K, Cheng Z, Ouyang S, Talbert PB, Kim M, et al. 2004.
Sequencing of a rice centromere uncovers active genes. Nat Genet
36:138–145.
61. Hosouchi T, Kumekawa N, Tsuruoka H, Kotani H. 2002. Physical map-
based sizes of the centromeric regions of Arabidopsis thaliana
chromosomes 1, 2, and 3. DNA Res 9:117–121.
62. Nagaki K, Talbert PB, Zhong CX, Dawe RK, Henikoff S, et al. 2003.
Chromatin immunoprecipitation reveals that the 180-bp satellite repeat
is the key functional DNA element of Arabidopsis thaliana centromeres.
Genetics 163:1221–1225.
63. Saffery R, Sumer H, Hassan S, Wong LH, Craig JM, et al. 2003.
Transcription within a functional human centromere. Mol Cell 12:509–
516.
64. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, et al. 2000.
The genome sequence of Drosophila melanogaster. Science 287:
2185–2195.
65. Schoof H, Karlowski WM. 2003. Comparison of rice and Arabidopsis
annotation. Curr Opin Plant Biol 6:106–112.
66. Sakata K, Nagamura Y, Numa H, Antonio BA, Nagasaki H, et al. 2002.
RiceGAAS: an automated annotation system and database for rice
genome sequence. Nucleic Acids Res 30:98–102.
67. Yuan Q, Ouyang S, Liu J, Suh B, Cheung F, et al. 2003. The TIGR
rice genome annotation resource: annotating the rice genome and
creating resources for plant biologists. Nucleic Acids Res 31:229–
233.
68. Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, et al. 2002. Gramene, a tool for
grass genomics. Plant Physiol 130:1606–1613.
69. Karlowski WM, Schoof H, Janakiraman V, Stuempflen V, Mayer KF.
2003. MOsDB: an integrated information resource for rice genomics.
Nucleic Acids Res 31:190–192.
70. Rensink WA, Buell CR. 2004. Arabidopsis to rice. Applying knowledge
from a weed to enhance our understanding of a crop species. Plant
Physiol 135:622–629.
71. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, et al. 2005. The Institute for
Genomic Research Osa1 rice genome annotation database. Plant
Physiol 138:18–26.
72. Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, et al.
2006. The Rice Annotation Project Database (RAP-DB): hub for Oryza
sativa ssp. japonica genome information. Nucleic Acids Res 1:D741–
D744.
73. Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W. 2004.
Consistent over-estimation of gene number in complex plant genomes.
Curr Opin Plant Biol 7:732–736.
74. Jiao Y, Jia P, Wang X, Su N, Yu S, et al. 2005. A tiling microarray
expression analysis of rice chromosome 4 suggests a chromosome-level
regulation of transcription. Plant Cell 17:1641–1657.
75. Li L, Wang X, Xia M, Stolc V, Su N, et al. 2005. Tiling microarray
analysis of rice chromosome 10 to identify the transcriptome and
relate its expression to chromosomal architecture. Genome Biol 6:
R52.
76. Li L, Wang X, Stolc V, Li X, Zhang D, et al. 2006. Genome-wide
transcription analyses in rice using tiling microarrays. Nature Genet 38:
124–129.
77. Wolfe KH, Gouy M, Yang Y-W, Sharp PM, Li W-H. 1989. Date of the
monocot-dicot divergence estimated from chloroplast DNA sequence
data. Proc Natl Acad Sci USA 86:6201–6205.
78. Petsko GA. 2002. Grain of truth. Genome Biol 3:1007.
79. Sorrells ME, La Rota M, Bermudez-Kandianis CE, Greene RA, Kantety
R, et al. 2003. Comparative DNA sequence analysis of wheat and rice
genomes. Genome Res 13:1818–1827.
80. Singh NK, Raghuvanshi S, Srivastava SK, Gaur A, Pal AK, et al. 2004.
Sequence analysis of the long arm of rice chromosome 11 for rice-
wheat synteny. Funct Integr Genomics 4:102–117.
81. Salse J, Piegu B, Cooke R, Delseny M. 2004. New in silico insight into
the synteny between rice (Oryza sativa L.) and maize (Zea mays L.)
highlights reshuffling and identifies new duplications in the rice
genome. Plant J 38:396–409.
82. Dubcovsky J, Ramakrishna W, SanMiguel PJ, Busso CS, Yan L, et al.
2001. Comparative sequence analysis of colinear barley and
rice bacterial artificial chromosomes. Plant Physiol 125:1342–1353.
83. Klein PE, Klein RR, Vrebalov J, Mullet JE. 2003. Sequence-based
alignment of sorghum chromosome 3 and rice chromosome 1 reveals
extensive conservation of gene order and one major chromosomal
rearrangement. Plant J 34:605–621.
84. Dunford RP, Yano M, Kurata N, Sasaki T, Huestis G. 2002. Comparative
mapping of the barley Ppd-H1 photoperiod response gene region,
which lies close to a junction between two rice linkage segments.
Genetics 161:825–834.
85. Brunner S, Keller B, Feuillet C. 2003. A large rearrangement involving
genes and low-copy DNA interrupts the microcollinearity between rice
and barley at the Rph7 locus. Genetics 164:673–683.
86. Han F, Kleinhofs A, Ullrich SE, Kilian A, Yano M. 1998. Synteny with
rice-analysis of barley malting quality QTLs and RPG4 chromosomal
regions. Genome 41:373–380.
87. Zwick MS, Islam-Faridi MN, Czeschin DG, Wing RA, Hart GE, et al.
1998. Physical mapping of the liguleless linkage group in Sorghum
bicolor using rice RFLP-selected sorghum BACs. Genetics 148:1983–
1992.
88. Collins NC, Thordal-Christensen H, Lipka V, Bau S, Kombrink E, et al.
2003. SNARE-protein-mediated disease resistance at the plant cell
wall. Nature 425:973–977.
89. Armstead IP, Turner LB, Farrell M, Skot L, Gomez P, et al. 2004.
Synteny between a major heading-date QTL in perennial ryegrass
(Lolium perenne L.) and the Hd3 heading-date locus in rice. Theor Appl
Genet 108:822–828.
90. La Rota M, Sorrells ME. 2004. Comparative DNA sequence analysis of
mapped wheat ESTs reveals the complexity of genome relationships
between rice and wheat. Funct Integr Genomics 4:34–46.
91. Yano M, Katayose Y, Ashikari M, Yamanouchi U, Monna L, et al. 2000.
Hd1, a major photoperiod sensitivity quantitative trait locus in rice, is
closely related to the Arabidopsis flowering time gene CONSTANS.
Plant Cell 12:2473–2484.
92. Yamanouchi U, Yano M, Lin H, Ashikari M, Yamada K. 2002. A rice
spotted leaf gene, Spl7, encodes a heat stress transcription factor
protein. Proc Natl Acad Sci USA 99:7530–7535.
93. Komori T, Ohta S, Murai N, Takakura Y, Kuraya Y, et al. 2004. Map-
based cloning of a fertility restorer gene, Rf-1, in rice (Oryza sativa L.).
Plant J 37:315–325.
94. Miyoshi K, Ahn BO, Kawakatsu T, Ito Y, Itoh J, et al. 2004.
PLASTOCHRON1, a timekeeper of leaf initiation in rice, encodes
cytochrome P450. Proc Natl Acad Sci USA 101:875–880.
95. Sun X, Cao Y, Yang Z, Xu C, Li X, et al. 2004. Xa26, a gene conferring
resistance to Xanthomonas oryzae pv. oryzae in rice, encodes an LRR
receptor kinase-like protein. Plant J 37:517–527.
96. Ashikari M, Sakakibara H, Liu S, Yamamoto T, Takashi T, et al. 2005.
Cytokinin oxidase regulates rice grain production. Science 309:741–
745.
97. Bancroft I. 2002. Insights into cereal genomes from two draft genome
sequences of rice. Genome Biol 3:10–15.
98. Mayer K, Mewes H-K. 2001. How can we deliver the large plant
genomes? Strategies and perspectives. Curr Opin Plant Biol 5:173–
177.
99. Rabinowicz PD, McCombie WR, Martienssen RA. 2003. Gene enrich-
ment in plant genomic shotgun libraries. Curr Opin Plant Biol 6:150–
156.
100. Barbazuk WB, Bedell JA, Rabinowicz PD. 2005. Reduced representa-
tion sequencing: a success in maize and a promise for other plant
genomes. Bioessays 27:839–848.
101. Rensink WA, Buell CR. 2005. Microarray expression profiling resources
for plant genomics. Trends Plant Sci 10:603–609.