The Human Genome: An Introduction

JEROEN AERSSENS, MARTIN ARMSTRONG, RON GILISSEN, NADINE COHEN

Department Pharmacogenomics, Janssen Research Foundation, Beerse, Belgium

The Oncologist 2001;6:100-109 www.TheOncologist.com

Correspondence: Jeroen Aerssens, Ph.D., Department of Pharmacogenomics, Janssen Research Foundation, Turnhoutseweg 30, Beerse B-2340, Belgium. Telephone: 32-14-606146; Fax: 32-14-607162; e-mail: [email protected] Received September 29, 2000; accepted for publication November 28, 2000. ©AlphaMed Press 1083-7159/2001/$5.00/0

The Oncologist Fundamentals of Cancer Medicine

INTRODUCTION

During the past two decades, tremendous progress has been made in genetics and genomics. Diseases that run in families have been recognized for many centuries, but it was only in the early 1980s that the first mutations in a gene responsible for a disease could be identified. Subsequently, numerous disease-related mutations in other genes have been discovered, initially in rare single-gene disorders, but more recently also in common disorders such as Alzheimer’s disease and cancer. Applications of this newly discovered information provide new opportunities for progress in medical science, including the design of genetic tests to diagnose or predict (subtypes of) diseases, the redefinition of diseases and the understanding of their pathogenesis based on the molecular mechanisms behind them, and the selection of new target molecules for drug discovery. The time has now come for such applications to move further toward the day-to-day practice of the clinician and transform the practice of clinical medicine. For this to happen, clinicians need an appropriate education in, and understanding of, the basic concepts of genetics and genomics. This article aims to provide a general introduction to these concepts for clinicians not familiar with these fields. The list of references and the websites indicated in the manuscript should encourage the reader to get a broader appreciation of this research area.

WHAT GENETICISTS ARE ALWAYS TALKING ABOUT: DNA

Our inherited information is encoded in a macromolecule called “DNA” (deoxyribonucleic acid). Basically, our DNA is like a bar code: it encrypts information. The vast majority of DNA molecules are stored within the nucleus of the cell and covered with proteins, together forming chromosomes. DNA completely dissolves in aqueous solutions (and thus in its natural environment). When precipitated in alcoholic solvents (ethanol, isopropanol) during extraction procedures, the DNA becomes visible as a white viscous clot.

All multicellular organisms, including humans, start their life as a single cell (the fertilized egg), and almost all cells of the organism developed from this single cell contain a full and identical copy of the DNA of this single cell. In humans, each somatic cell contains approximately 0.006 nanogram of DNA, which harbors all the genetic information needed to develop and function normally. For an adult human body, this makes a total of about 600 grams of DNA. For clinical genetic analyses, genomic DNA is usually isolated from leukocytes, although identical DNA could be isolated from virtually any other cell of an individual as well. From 5 to 10 ml of whole blood, approximately 100-200 micrograms of DNA can be extracted, which is in most cases more than sufficient to perform genetic analyses.
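The quantities quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the cell count is not stated in the article but is what the two quoted figures imply, and the helper `cells_for_yield` is a hypothetical name that assumes every cell is lysed and no DNA is lost during extraction.

```python
DNA_PER_CELL_NG = 0.006   # from the text: nanograms of DNA per somatic cell
TOTAL_DNA_G = 600         # from the text: grams of DNA in an adult body
NG_PER_G = 1e9            # nanograms per gram

# Number of cells implied by the two figures above (not stated in the article).
implied_cells = TOTAL_DNA_G * NG_PER_G / DNA_PER_CELL_NG


def cells_for_yield(micrograms):
    """Cells needed for a given DNA yield, assuming a loss-free extraction."""
    nanograms = micrograms * 1000
    return nanograms / DNA_PER_CELL_NG


print(f"Implied cells in the body: {implied_cells:.1e}")
print(f"Cells needed for 100 micrograms: {cells_for_yield(100):.1e}")
```

Under these assumptions, the 100-200 micrograms obtained from a blood sample correspond to a few tens of millions of leukocytes.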

THE DNA IS IN THE CHROMOSOMES: TWO COPIES OF A LIBRARY

The DNA in our cells is divided over a constant number of chromosomes (46 in humans), each of them with a specific size and form, as can be observed under the microscope using specific coloring techniques (e.g., Giemsa staining). Two sex chromosomes (X and Y) determine the gender of an individual (XX for females, XY for males); the other 44 so-called autosomal chromosomes are not different in males and females. Together, these 46 chromosomes comprise two nearly identical copies of the whole genome: one copy of the genome is inherited from the father (via a set of 22 autosomal chromosomes and an X or Y chromosome) and the other copy from the mother (via another set of 22 autosomal chromosomes and one X chromosome).

The autosomal chromosomes derived from the father and mother are two-by-two homologous: they look similar under the microscope and comprise the same genes, or possibly variants of the same genes. These chromosomes are numbered from 1 to 22, mainly based on size (1 being the largest and 22 the smallest chromosome). Thus, each somatic cell contains two copies of each of these 22 chromosomes, and therefore two copies of each of the genes located on these chromosomes. One could compare this with a library which contains two copies of each book, although there might sometimes be different editions of a specific book. In women, the two X chromosomes are also homologous: one copy is inherited from their mother and the other copy from their father. Men, in contrast, have one X chromosome inherited from their mother and one Y chromosome inherited from their father. The Y chromosome is much smaller than the X chromosome and contains far fewer genes as well.

In germ cells, only one copy of each homologous chromosome is present (thus in total, 22 autosomal and one sex chromosome). During reproduction, two germ cells (one egg cell and one sperm cell) combine their genetic information so that the offspring will contain two copies of each chromosome: one from the father and one from the mother (Fig. 1). Thus, the gender of the offspring, determined by the combination of the sex chromosomes of the mother (always X) and the father (X or Y), is completely dependent on whether the sperm cell contains an X or Y chromosome.
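The combination of one copy from each parent can be enumerated explicitly. The following sketch is an illustration only: the function names are hypothetical, and the hair-style rule is the example given in the caption of Figure 1 (only the G/G genotype gives straight hair).

```python
from collections import Counter
from itertools import product


def offspring_genotypes(mother, father):
    """Count the genotypes obtained by pairing one copy from each parent."""
    return Counter("/".join(sorted(pair)) for pair in product(mother, father))


def hair_phenotype(genotype):
    """Figure 1 example: G is recessive, so only G/G gives straight hair."""
    return "straight" if genotype == "G/G" else "curly"


# Sex chromosomes: the mother always passes on an X, the father an X or a Y.
sex = offspring_genotypes(("X", "X"), ("X", "Y"))

# A gene with two variants (A and G), both parents carrying A/G.
hair = offspring_genotypes(("A", "G"), ("A", "G"))

for genotype, count in sorted(hair.items()):
    print(genotype, count, hair_phenotype(genotype))
```

Each of the four combinations is equally likely, so half of the offspring are X/X and half X/Y, and one in four children of two A/G parents carries the G/G genotype.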

THE SIZE AND STRUCTURE OF DNA: A DOUBLE HELIX

From a structural point of view, the DNA looks like a long chain of connected letters without any spaces or punctuation marks (Fig. 2A). The total physical length of all the DNA chains in each of our cells is approximately two meters, with a diameter of 0.000002 mm. In order to write the DNA text, the body has at its disposal four different but related building blocks (called nucleotides or, more precisely, deoxyribonucleotides): A, C, G, and T, representing respectively adenine, cytosine, guanine, and thymine. These nucleotides are connected by a deoxyribose-phosphate backbone. Each phosphate links the hydroxyl group on the 3′ carbon atom of a deoxyribose of one nucleotide to the hydroxyl group on the 5′ carbon atom in the deoxyribose group of the adjacent nucleotide. Importantly, all the information content encrypted within the DNA is in the specific sequence of these nucleotides. The information stored within the DNA code can be used for translation into functional activity (i.e., production of proteins) only in one orientation on the backbone, namely the 5′-to-3′ direction. Therefore, nucleotide sequences are usually displayed from the 5′ end to the 3′ end (from left to right).

Attached to this DNA strand is a second DNA strand which is the exact complement of the first one. This is possible because of the complementary chemical structure of the DNA building blocks. Two kinds of base pairs, often referred to as complementary base pairs, exist in all DNA: the As on one strand always pair with Ts on the other strand (via two hydrogen bonds) while Cs pair with Gs (via three hydrogen bonds). Thus, if the sequence of one strand is known, the sequence of the complementary strand can easily be derived. The two deoxyribose-phosphate backbones have opposite 5′-to-3′ orientations and are wound around each other to form a double-helix structure.

Figure 1. Schematic overview of the inheritance of our genetic information. Each individual has two copies of each chromosome (each harboring one copy of the genes on the chromosome). The germ cells comprise only one of these copies. Through combination of the genetic material from the sperm cell and egg cell, the offspring inherits one copy of each chromosome from the father and one from the mother. Assume a particular gene in the DNA which determines the phenotype hair style, and for which two variants exist (A and G). The genotype, which is the combination of the variants on the two inherited homologous chromosomes in an individual (e.g., A/G), will determine the phenotype. In the example, the G/G genotype is linked with straight hair, while individuals with the other genotypes (A/G and A/A) have curly hair (indicating that the G variant, associated with straight hair, is a recessive characteristic).
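Deriving the complementary strand from a known sequence is mechanical, as the short sketch below illustrates. The function names are hypothetical; the pairing rules and hydrogen-bond counts are those given in the text, and the result is reversed because the two strands run in opposite 5′-to-3′ orientations.

```python
# Pairing rules from the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(strand):
    """Return the partner strand, also written from its 5' end to its 3' end."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand):
    """Hydrogen bonds holding the duplex together: 2 per A-T, 3 per C-G pair."""
    return sum(2 if base in "AT" else 3 for base in strand)


strand = "ATGC"
print(strand, "pairs with", complementary_strand(strand))
print("hydrogen bonds:", hydrogen_bonds(strand))
```

Applying `complementary_strand` twice returns the original sequence, which reflects the fact that each strand fully determines the other.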

The size of DNA molecules is noted as the number of base pairs (bp) or a multiple (1,000 bp = 1 kbp, 1,000 kbp = 1 Mbp). Our complete library of genetic information is called “the human genome,” and comprises somewhat more than three billion bp (3,000 Mbp), distributed over 22 autosomal chromosomes (numbered from 1 to 22) and two sex chromosomes (X and Y). A printed edition of this sequence would require approximately one million printed pages with single-line spacing. The elucidation of this genomic DNA sequence is of extreme interest, as it contains, in encrypted form, all the inherited information needed to develop and direct the functioning of the human body. Table 1 summarizes the dimensions and information content of the human genome in numbers.
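The units and orders of magnitude mentioned here can be made concrete with a small sketch. The helper `format_bp` is a hypothetical name, and the rise of 0.34 nm per base pair is an assumption taken from the standard description of the double helix rather than from this article; with it, the two genome copies in a cell come out close to the two meters quoted earlier.

```python
GENOME_BP = 3_000_000_000   # from the text: about three billion bp
PRINTED_PAGES = 1_000_000   # from the text: about one million pages
RISE_PER_BP_NM = 0.34       # assumption: helix rise per base pair, not from the article


def format_bp(n_bp):
    """Express a size in bp, kbp, or Mbp (1,000 bp = 1 kbp, 1,000 kbp = 1 Mbp)."""
    if n_bp >= 1_000_000:
        return f"{n_bp / 1_000_000:g} Mbp"
    if n_bp >= 1_000:
        return f"{n_bp / 1_000:g} kbp"
    return f"{n_bp} bp"


# Letters that each printed page would have to hold.
letters_per_page = GENOME_BP // PRINTED_PAGES

# Length of the two genome copies present in one somatic cell, in meters.
diploid_length_m = 2 * GENOME_BP * RISE_PER_BP_NM * 1e-9

print(format_bp(GENOME_BP), letters_per_page, round(diploid_length_m, 2))
```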

GENES IN THE GENOME

The unit of information in the DNA is the gene, which is a stretch of DNA sequence that contains the code for the production of a protein, which is a single piece of the whole machinery required by the cell to function normally in its environment. More specifically, each gene comprises all the detailed instructions that determine the precise composition of a specific protein, as well as the regulatory instructions that determine when this specific protein will be produced and in what quantity. The size of a gene at the genomic level can vary widely (usually between 10,000 and 150,000 bp). Although most of the genomic DNA sequence is currently known, there is still a large debate ongoing on the number of genes present: estimates of experts in the field vary between 30,000 and 120,000, with an average around 60,000 (www.ensembl.org/genesweep.html). Very intriguingly, the regions in the genomic DNA which encode for proteins account for about 150 million nucleotides, which is less than 5% of the complete human genome. Apart from some of the DNA sequences which comprise instructions needed to regulate the expression of the genes and specific instructions for the chromosomes to function correctly, the significance of the other 95% of the genome is at present largely unknown and/or poorly understood.

This latter part of the genome contains large numbers of highly repeated DNA sequence families. Two major types of repeat families can be distinguished: tandemly repeated DNA and interspersed repetitive DNA. Tandemly repeated DNA families consist of long or short arrays of DNA repeat units, with the repeat being a simple or moderately complex sequence (size usually between 2 and 100 bp). Depending on the size of the arrays of repeat units, this is called satellite DNA (>100 kbp), minisatellite DNA (0.1-20 kbp), or microsatellite DNA (<150 bp). Interspersed repetitive DNA consists of individual repeat units which are not clustered at a specific location on a chromosome, but are dispersed at numerous locations. Among these are the SINEs (short interspersed nuclear elements) and the LINEs (long interspersed nuclear elements). Well-known examples are Alu repeats (SINE with a full length of 280 bp; approximately 1,000,000 copies in the human genome) and the LINE-1 or L1 element (LINE with a full length of 6.1 kbp; approximately 80,000 copies in the human genome).

Figure 2. (A) The human genome sequence can be compared with a text lacking any spaces or punctuation marks. It is extremely difficult to read the genomic text without an analysis tool; even special software programs which have been specifically written to identify meaningful sentences (genes) only have a limited success rate. (B) This is because in between the words (the exons) of the text which form a meaningful sentence (a gene), variable amounts of nonsense letters (the introns) are placed. In fact, more than 90% of the human sequence consists of text whose meaning is currently not understood. (C) Variation frequently occurs in the human genome (about one letter differs in every 1,000 letters between the genomic texts of two individuals). This might have consequences for the meaning of the sentence, or possibly make the sentence unreadable. These variations in the DNA are called mutations or polymorphisms (depending on their frequency and on whether there is or is not a direct link to the cause of a disease).

Table 1. The human genome in numbers

• 3,000,000,000 nucleotides in the human genome (estimated)

• 22 autosomal chromosomes and two sex chromosomes (X and Y)

• 46 chromosomes in each somatic cell (two copies of the whole genome)

• 30,000–120,000 genes in the human genome (estimated)

• 35,000 nucleotides of sequence per gene at the genomic level, including intronic sequences (on average)

• 1,500 nucleotides of directly coding sequence per gene (on average)

• less than 5% of the human genome sequence directly encodes for proteins

• four different nucleotides in the genome (adenine, cytosine, guanine, thymine)

• three nucleotides in a codon, which encodes one amino acid

• one nucleotide difference between two unrelated individuals per 1,000 nucleotides of sequence (on average)
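The proportions quoted in this section follow from simple ratios, as sketched below. All inputs are the rounded estimates given in the text and in Table 1; the derived gene count is only what those rounded averages imply, not an independent estimate of the number of genes.

```python
GENOME_BP = 3_000_000_000        # nucleotides in the human genome (estimated)
CODING_BP = 150_000_000          # nucleotides that encode proteins (estimated)
AVG_GENE_BP = 35_000             # average gene size, including introns
AVG_CODING_BP_PER_GENE = 1_500   # average directly coding sequence per gene

# Share of the whole genome that directly encodes proteins.
coding_fraction = CODING_BP / GENOME_BP

# Share of an average gene that is coding sequence (the rest is mostly intron).
coding_fraction_per_gene = AVG_CODING_BP_PER_GENE / AVG_GENE_BP

# Gene count implied by dividing total coding sequence by the per-gene average.
implied_gene_count = CODING_BP / AVG_CODING_BP_PER_GENE

print(f"coding share of genome: {coding_fraction:.1%}")
print(f"coding share of an average gene: {coding_fraction_per_gene:.1%}")
print(f"implied number of genes: {implied_gene_count:,.0f}")
```

With these round numbers the coding share comes out at about 5% of the genome, and the implied gene count falls inside the 30,000–120,000 range of expert estimates.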

THE HUMAN GENOME PROJECT

In 1987, a worldwide scientific effort called the Human Genome Project was initiated to unravel the complete DNA sequence of the human genome. Recently, a first draft covering 85%-90% of the complete human genome sequence (3.12 billion bp) has been announced simultaneously by scientists of the publicly funded Human Genome Project (www.sanger.ac.uk/hgp; www.gene.ucl.ac.uk/huqo; www.nhgri.nih.gov; www.ncbi.nlm.nih.gov/genome/seq/) and the private company Celera Genomics (Rockville, MD; www.celera.com). As a consequence of the strategies used to determine the human genome sequence, experts anticipate that it will be another two years before the complete human genome sequence will be known with a confidence of more than 99.99% [1]. Although the complete sequence of the whole human genome will soon be known, it is expected that it will take many more decades before all this information (i.e., identification of all genes and their regulation, significance of genetic variations, etc.) will be fully understood. Nevertheless, the scientific importance of the achievements reached so far by the Human Genome Project can hardly be overestimated and is at least of the same order of magnitude as the Apollo lunar program.

FROM GENE TO PROTEIN

The main role of the DNA in the cell is to permanently store and make available all the information needed to regulate each of the activities in the cell. The production of proteins, which are the functionally active molecules in the cell, takes place in the cytoplasm of the cell. Since the instructions for how to make the proteins are within the DNA, which is stored in the nucleus, an intermediate molecule (messenger RNA, or mRNA) is used to transfer this information from the nucleus to the cytoplasmic protein factory. As a matter of comparison, imagine a library (the genomic DNA) which contains many books (genes) on many different topics. A reader makes a copy (the mRNA) of a specific book which contains the specific information on how to make a cake and takes this home, as he is not allowed to make cake in the library. At home, the person can then make a cake (the protein) using all the required ingredients and supplies as described in the copied information.

When and how much of a gene should be expressed in a cell is directed by specific proteins (transcription factors) which are present in the nucleus and which can interact in a stimulatory or inhibitory manner with regulatory sequences in the DNA flanking the coding part of the gene. When this fine-tuned regulation mechanism indicates that additional copies of the gene should be expressed, an enzyme in the nucleus (RNA polymerase) transcribes the genetic information from the DNA template into an RNA (ribonucleic acid) copy. The structure of RNA is similar to a single-strand DNA molecule, although thymine (T) is replaced by uracil (U). Because the protein-coding information in the DNA is interrupted by irrelevant sequences (called introns), the RNA must be further edited (spliced) to remove these intron sequences and join the coding sequences (called exons) (Fig. 2B). In some genes, a choice between several alternative exons is made during this splicing process, which will result in different proteins. The RNA molecule that results from transcription and splicing is called messenger RNA (mRNA). This mRNA (on average 1,500 nucleotides) is transported to the cytoplasm where it is used as a template for the generation of a protein. There, the mRNA is threaded through ribosomes as a tape is threaded through the head of a tape player in order to decode the information and assemble the amino acids into chains. For the decoding, each subsequent group (called a “codon”) of three nucleotides on the mRNA specifies a new amino acid. Mostly starting from a so-called start codon with the sequence “ATG” (read as “AUG” in the mRNA, where it encodes a methionine), each adjacent codon on the mRNA specifies the next amino acid to be linked to the growing protein chain. After completion of this translation process, additional modifications are made to the protein (e.g., phosphorylation, glycosylation), resulting in a mature and functional protein.

In summary, the properties of each protein depend on the sequence of the amino acids used to construct it, and this sequence in turn is determined directly by the nucleotide sequence of the mRNA, which in turn is an (edited) copy of the genomic DNA sequence (Fig. 3). It should be noted that, although the information for making any single protein is generally encoded by a single gene, one gene may (as a result of differential splicing) carry the information needed to make several (usually related) proteins.
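The path from gene to protein summarized above can be sketched in a few lines of code. This is a toy illustration, not a model of the cellular machinery: the function names, the example gene, and its exon coordinates are hypothetical, the input is taken to be the strand that reads like the mRNA (so transcription reduces to replacing T by U), and the codon table is the standard genetic code written with one-letter amino acid symbols (`*` marks a stop codon).

```python
from itertools import product

# Standard genetic code, codons ordered with U, C, A, G at each position.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Copy the DNA text into RNA: thymine (T) is replaced by uracil (U)."""
    return dna.replace("T", "U")


def splice(pre_mrna, exons):
    """Join the exons, given as (start, end) positions, and drop the introns."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Read codons from the first AUG until a stop codon or the end of the mRNA."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical gene: exon 1 (ATGGCT), an intron, then exon 2 (AAATAA).
gene = "ATGGCT" + "GTAAGTTTTCAG" + "AAATAA"
mrna = splice(transcribe(gene), [(0, 6), (18, 24)])
protein = translate(mrna)
print(mrna, "->", protein)
```

Changing the list of exon coordinates passed to `splice` is a simple way to see how differential splicing lets one gene give rise to several related proteins.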

GENE EXPRESSION IN THE CELL: WHICH GENES AND IN WHAT AMOUNT

As indicated above, the DNA content is identical in each cell or tissue type of the body; however, not all our cells are identical in terms of structure, function, or behavior. What makes them different is the pattern of genes which are expressed and translated into proteins during the life cycle of the cells. Some cell types express many genes (e.g., in brain cells approximately 30,000 genes are expressed), while in others a large number of the genes are transcriptionally inactive (e.g., in red blood cells only 30 genes are expressed). Apart from an overall switching of the expression of specific genes from “on” to “off” (or vice versa), fine-tuning of the expression level of specific genes might also occur. Changes in the level of expression may be the result of a disease or may eventually lead to a disease. Therefore, there is an enormous scientific interest in studying and comparing the level of expression of genes, i.e., gene expression in disease status versus in healthy controls.

Analysis of mRNA samples is very useful, as these contain only the transcribed sequences of the human genome (and thus the genes). Therefore, several research groups and biotech companies have cloned and analyzed large libraries from mRNA sequences. For example, the mRNA extracted from a brain sample contains copies of thousands of transcribed genes which might be of interest. Technically, in order to clone the transcribed genes, the mRNA molecules first need to be converted into double-stranded molecules. This can be done by adding the complementary nucleotides on a second DNA strand (a process called “reverse transcription”), resulting in double-stranded DNA molecules which contain only the exon sequences of the genes (but not the intron sequences). These are called cDNA molecules (copy DNA), and contain the open reading frame of the gene, which can easily be converted into the amino acid sequence of the resulting protein. In large projects, several thousands of these cDNA clones have been partially sequenced and have revealed previously unknown fragments of expressed genes (often called ESTs, expressed sequence tags).

Figure 3. Schematic overview of how the information comprised within the genetic code is being used to synthesize the proteins. This involves the processes of transcription from DNA into RNA, RNA splicing to form mRNA, transport of the mRNA from the nucleus to the cytoplasm, translation into a chain of amino acids, and finally post-translational modifications and folding of the synthesized protein. Note that at both the 5′ and 3′ ends of the coding region in the exonic sequence, an untranslated region (UTR) is also transcribed and spliced into mRNA (respectively 5′-UTR and 3′-UTR regions).

Comparison of databases of EST sequences might eventually also reveal new information on tissue-specific expression of some genes. In the laboratory, gene expression levels in tissue samples can now also be evaluated simultaneously for thousands of genes, thanks to the enormous progress in the development of microarray technology (more popularly, “DNA chip” technology) during the last few years. Today, this technology enables scientists to compare the expression levels of several thousand genes in a single experiment, on a surface smaller than a stamp. This technology is based on the hybridization of RNA samples (e.g., extracted from diseased and healthy tissue) on glass slides (DNA chips) containing DNA molecules with the specific sequences of thousands of different genes. The intensity of the hybridization signals, which is a measure of the expression levels of the different genes, can be evaluated using powerful software [2].
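One common way to summarize such a comparison is the log2 ratio of the signal in diseased versus healthy tissue, sketched below. This is a simplified illustration and not the software referred to in the text: the gene names and intensity values are invented, and real analyses include background correction, normalization, and replicate measurements that are omitted here.

```python
import math


def log2_fold_changes(diseased, healthy):
    """Log2 ratio of diseased over healthy signal for genes measured in both."""
    return {
        gene: math.log2(diseased[gene] / healthy[gene])
        for gene in diseased
        if gene in healthy
    }


# Hypothetical hybridization intensities (arbitrary units).
healthy = {"GENE_A": 200.0, "GENE_B": 800.0, "GENE_C": 500.0}
diseased = {"GENE_A": 800.0, "GENE_B": 200.0, "GENE_C": 500.0}

changes = log2_fold_changes(diseased, healthy)
for gene, change in sorted(changes.items()):
    print(f"{gene}: {change:+.1f}")
```

A value of +2 means a fourfold higher signal in the diseased sample, -2 a fourfold lower signal, and 0 no change.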

VARIATIONS IN THE GENOME: THE BASIS OF HUMAN DIVERSITY

When the DNA sequence of a gene is identified in different individuals from the population, some differences in the nucleotide sequence are often detected (Fig. 2C). The information content of DNA can be altered dramatically by such variations in the nucleotide sequence, especially if these differences are located in protein-coding or regulatory sequences. Such variations might lead to the insertion of a different amino acid at a specific position in the protein, or to a different level of expression of a protein. Variations located in the intronic regions of genes or outside the genes will usually have fewer consequences. The different forms of a genetic variation are called the “alleles” of the variation. Frequently occurring variations are often called “polymorphisms,” while rarer variations (with allele frequency below 1%) and variations with a direct relationship to a disease are often called “mutations” (although these definitions are arbitrary). Genetic variations can involve only 1 bp (called single nucleotide polymorphism, SNP), a few bp (e.g., di- and trinucleotide repeat polymorphisms), up to large stretches of DNA. Roughly, the variations can be divided into substitutions, insertions, deletions, amplifications, and translocations (Fig. 4).

The major contributors to genetic variation, comprising some 80% of all known polymorphisms, are the single nucleotide polymorphisms. An SNP located in the coding region of a gene is indicated as “cSNP.” It has been estimated that, on average, the DNA sequences of two unrelated individuals differ by 0.1% (1 in 1,000 bp), which would in the complete genome account for three million nucleotides. As a comparison, the DNA sequences of a human and a chimpanzee are estimated to differ by 2% (1 in 50 base pairs).

Figure 4. Schematic summary of the various forms of variations which occur in genes. These might involve only one or a few base pairs (small mutations) or large genomic regions (large mutations). Adapted from [9].
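The figures in this paragraph can be reproduced by scaling the quoted difference rates to the whole genome, and single-nucleotide differences between two aligned sequences can be counted directly. The sketch below is an illustration under simple assumptions: the function name and example sequences are hypothetical, the sequences must already be aligned and of equal length, and insertions and deletions are ignored.

```python
GENOME_BP = 3_000_000_000  # from the text: about three billion bp


def count_differences(seq_a, seq_b):
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


# One difference per 1,000 bp between two unrelated individuals.
expected_between_individuals = GENOME_BP // 1_000

# One difference per 50 bp between human and chimpanzee.
expected_human_vs_chimp = GENOME_BP // 50

print(count_differences("ACGTACGT", "ACGTACCT"))
print(f"{expected_between_individuals:,} and {expected_human_vs_chimp:,}")
```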

SEARCHING FOR DISEASE GENES USING VARIATIONS IN THE GENOME

There is major interest among scientists in studying variations in genes, especially in the regulatory and protein-coding sequences, because such variations might be directly related to specific diseases or other specific characteristics (e.g., eye or hair color). The investigation of potential relationships of variations in specific genes with a specific disorder might be very useful if the candidate gene(s) to be investigated can be well chosen. Such choice of candidate genes could be based on scientific knowledge or on new experimental evidence (e.g., altered serum level of a protein in a specific patient group, microarray expression experiments, etc.).

Good candidate genes are, unfortunately, not always available. Therefore, genetic approaches have been developed in the past based on the analysis of highly polymorphic dinucleotide repeat markers (microsatellites) in DNA samples from individuals from large families with multiple disease-affected individuals. The strategy is based on the identification of chromosomal markers cosegregating with the disease in the families. Such linkage studies are very attractive because they allow identification of a chromosomal region on the genetic map which contains a disease-causing gene without requiring any functional knowledge of the disease gene. Once a chromosomal region with significant linkage is found, the disease-causing gene needs to be cloned and the responsible mutation(s) identified. This positional cloning strategy has been very successful, especially for identifying genes involved in single-gene disorders (also called simple genetic disorders, or Mendelian inherited disorders). Indeed, even for some more common disorders such as breast cancer or Alzheimer’s disease, a positional cloning strategy has been successfully applied and has led to the identification of the genes involved (BRCA1 and BRCA2 in familial breast cancer, and presenilin genes in early onset Alzheimer’s disease, respectively). Although it is clear that genetic tests for these mutations might be extremely useful for predicting disease risk in other members of these families, it should be noted that defects in these genes can explain only a small fraction (usually less than 5%) of the whole population of patients suffering from these common disorders, which consists mainly of non-familial cases.

Unfortunately, however, the resolution that is obtained using these family studies is rather limited: at the very best, it narrows the search to a region of about one million bp. As this is still a very large region, which may contain more than 50 genes, it is key to refine this region of interest. Because of their high frequency in the genome, SNP markers have been proposed as a possible tool for this refinement. SNP markers in the region of interest can be analyzed in a population of affected individuals and a population of matched healthy controls. For each of the analyzed SNPs, the allele frequency in both populations is then compared. A statistically significant difference in allele frequency of a genetic marker is suggestive of an association of this marker with the disease.
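The allele-frequency comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the analysis pipeline of any particular study: the genotype counts are invented for the example, the function names are hypothetical, and a plain Pearson chi-square test on a 2×2 allele table (one degree of freedom, no continuity correction) is assumed as the test statistic.

```python
import math


def allele_counts(genotype_counts):
    """Convert genotype counts for a biallelic SNP into (A, G) allele counts.

    Each individual carries two alleles, so homozygotes contribute two
    copies of one allele and heterozygotes one copy of each.
    """
    hetero = genotype_counts.get("A/G", 0)
    a = 2 * genotype_counts.get("A/A", 0) + hetero
    g = 2 * genotype_counts.get("G/G", 0) + hetero
    return a, g


def chi_square_2x2(cases, controls):
    """Pearson chi-square statistic and p-value for a 2x2 allele table.

    cases and controls are (A, G) allele counts. The p-value uses the
    identity P(chi2_1df > x) = erfc(sqrt(x / 2)).
    """
    a, b = cases
    c, d = controls
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (a * d - b * c) ** 2 / margins
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts in 100 patients and 100 matched controls.
    cases = allele_counts({"A/A": 30, "A/G": 50, "G/G": 20})
    controls = allele_counts({"A/A": 50, "A/G": 40, "G/G": 10})
    chi2, p_value = chi_square_2x2(cases, controls)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

In a real study many SNPs are tested at once, so the significance threshold would have to be corrected for multiple comparisons.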

Following the successes in genetic mapping and identification of the molecular basis of Mendelian traits, attention has rapidly shifted to more complex and more prevalent genetic disorders that involve multiple genes and environmental effects (e.g., cardiovascular disease, diabetes, and schizophrenia). It is believed that SNPs could be the best available markers in the search for the origins of complex genetic diseases. Moreover, it has been hypothesized that ultimately, if enough SNP markers become available with a chromosomal localization evenly dispersed over the whole human genome, it should be feasible to directly perform population-based whole-genome association studies, which would permit skipping the initial step of family-based linkage studies. As a consequence of the great promise of SNPs, ten of the world’s pharmaceutical giants, along with five academic partners, entered into a close collaboration in April 1999 called “The SNP Consortium.” The major mission of this consortium is to create a high-quality, dense, genome-wide SNP map, which will be made available to the public. More specifically, The SNP Consortium aims to generate genome SNP maps which would allow whole-genome, population-based association studies. It is estimated that this will require at least one marker every 5 to 50 kbp of DNA. To cover the whole genome at this resolution would require the identification and chromosomal localization of 200,000-300,000 new SNP markers. By July 2000, more than 800,000 SNPs had already been made available to the public (www.ncbi.nlm.nih.gov/snp; http://snp.cshl.org).
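The marker numbers quoted above follow from simple arithmetic on genome size and marker spacing. The sketch below assumes a human genome of roughly 3,000 Mbp (an approximate figure that is not stated in this paragraph) and perfectly even spacing, which real SNP maps do not achieve; the function names are illustrative only.

```python
import math

# Assumption: approximate size of the haploid human genome in base pairs.
HUMAN_GENOME_BP = 3_000_000_000


def markers_needed(spacing_bp, genome_bp=HUMAN_GENOME_BP):
    """Number of evenly spaced markers for one marker every spacing_bp."""
    return math.ceil(genome_bp / spacing_bp)


def average_spacing_bp(n_markers, genome_bp=HUMAN_GENOME_BP):
    """Average distance between markers if n_markers are evenly spread."""
    return genome_bp / n_markers


if __name__ == "__main__":
    for spacing in (5_000, 50_000):
        print(f"one marker per {spacing:,} bp -> {markers_needed(spacing):,} markers")
    for n_markers in (200_000, 300_000):
        print(f"{n_markers:,} markers -> one per {average_spacing_bp(n_markers):,.0f} bp")
```

Under these assumptions, one marker every 5 kbp corresponds to 600,000 markers and one every 50 kbp to 60,000, while 200,000-300,000 evenly spaced markers correspond to a spacing of 10-15 kbp.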

COMPARATIVE GENOMICS: ANOTHER TOOL FOR IDENTIFYING AND UNDERSTANDING GENES RELEVANT IN HUMAN DISEASE

A powerful tool for understanding the human genome is comparison with the genome information from other organisms; this area of research is called “comparative genomics” [3]. The currently available sequencing technology allows determination of the complete genome sequence of organisms within reasonable time frames. In 1995, the first entire genome sequence of a free-living organism, Haemophilus influenzae (1.8 Mbp), was published. Since then, the complete genome sequences of a constantly growing list of microorganisms (bacteria and viruses) have become known (size usually between 0.5 and 5 Mbp). The sequence information can be used to identify specific genes and their structure, regulation, and function, which might potentially lead to new drugs that target a specific microorganism.

Saccharomyces cerevisiae (baker’s yeast) was the first eukaryotic organism for which the entire genome sequence (15 Mbp) was published (http://genome-www.stanford.edu/Saccharomyces/). An enormous amount of information is known about the structure, regulation, and function of yeast genes. Of particular interest for developmental biology research is the availability of the complete genome sequence of the roundworm Caenorhabditis elegans (97 Mbp); this animal consists of 959 somatic cells, the exact lineage of which is known for every cell (http://elegans.swmed.edu/). A cross-comparison of the complete gene sets of S. cerevisiae (6,000 genes) and C. elegans (19,000 genes) has revealed that 23% of the proteins encoded by yeast genes have apparent homologues in the nematode worm, reflecting functions common to both organisms [4].

The fruit fly (Drosophila melanogaster) (137 Mbp) has a long history in genetic research, especially because genotype and phenotype are easily correlated in this organism. Because crucially important gene functions and developmental processes appear to be highly conserved between species, the relevance for human disease research is clear. Moreover, there is also an important conservation in the area of the cell-cycle control genes (and DNA repair and apoptosis), with immediate relevance to human cancer (http://flybase.bio.indiana.edu/).

The mouse genome (3,000 Mbp) shows large subchromosomal areas with a strong conservation of linkage (synteny) between mouse and humans. This implies that, based on the chromosomal localization of a gene on the mouse genome, predictions can be made on the chromosomal localization of its human homologue. Nearly every human gene appears to have a mouse homologue (http://www.ncbi.nlm.nih.gov/Homology/). Because of their small body size, their short generation time, and the technical ability to modify the DNA content of mouse cells at the germline level, these animals also provide a powerful tool for studying gene expression and function and for creating models of human disease (for example, by means of knock-out and/or transgenic mice) [5].

An interesting observation emerging from comparing complete gene sets in model organisms known to date is that gene number is not necessarily a good measure of complexity. For example, the fruit fly would be considered more anatomically complex (with 10× more cells) than the nematode, and the fruit fly undergoes a more complex developmental process than C. elegans. Yet the fruit fly genome contains only 13,000 genes, compared with the 19,000 genes found in the nematode genome. It is generally expected that comparative genomic assessments will become increasingly important because they allow expansion of the utility of the genomic information known and documented (“annotated”) in one species toward other species, including humans.
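The point that gene number does not track complexity can be made concrete with the figures quoted in this section. The sketch below only restates those genome sizes and gene counts and derives a gene density from them; the dictionary and function names are illustrative.

```python
# Genome size (Mbp) and approximate gene count, as quoted in this section.
GENOMES = {
    "S. cerevisiae": (15, 6_000),
    "C. elegans": (97, 19_000),
    "D. melanogaster": (137, 13_000),
}


def genes_per_mbp(organism):
    """Average number of genes per million base pairs for one organism."""
    size_mbp, gene_count = GENOMES[organism]
    return gene_count / size_mbp


if __name__ == "__main__":
    for organism, (size_mbp, gene_count) in GENOMES.items():
        density = genes_per_mbp(organism)
        print(f"{organism}: {gene_count:,} genes in {size_mbp} Mbp "
              f"({density:.0f} genes per Mbp)")
```

With these figures the fruit fly has both fewer genes and a lower gene density than the nematode, despite its greater anatomical complexity.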

GENOTYPE AND PHENOTYPE

The two copies of a specific gene inherited from the father and the mother are not always identical, because for most genes many variants (alleles) exist. The combination of the two alleles present on the two homologous chromosomes is defined as the “genotype” for a specific genetic variation. For example, imagine an SNP in a gene with two possible alternative alleles: allele A and allele G. The possible genotypes are thus A/A, A/G, and G/G. When two different alleles of a gene are identified on the two homologous chromosomes of an individual, the genotype is called “heterozygous” (e.g., A/G); if the same allele is present on both homologous chromosomes, the genotype is “homozygous” (e.g., A/A and G/G). More generally, the genotype of an individual can be defined as the complete composition of an individual’s genome (including all the information on the variations within his/her genome), as established at conception. When used in clinical genetic applications, however, the term genotype usually refers to some specific variation(s) in a small part of the DNA, often named for the gene involved. For example, the APOE gene can exist as allele e2, e3, or e4, the latter of which is associated with an increased risk of developing Alzheimer’s disease. The age of onset of the disease is lower in individuals with a genotype harboring one or more copies of the e4 allele, namely the e2/e4 or e3/e4, and especially the e4/e4 genotype.
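The way genotypes arise from a set of alleles can be illustrated with a short Python sketch. It assumes genotypes are unordered pairs written as “A/G”; the function names are hypothetical and chosen for this example only.

```python
from itertools import combinations_with_replacement


def possible_genotypes(alleles):
    """All unordered pairs of alleles, written like 'A/G'."""
    return ["/".join(pair) for pair in combinations_with_replacement(alleles, 2)]


def zygosity(genotype):
    """Classify a genotype as homozygous or heterozygous."""
    first, second = genotype.split("/")
    return "homozygous" if first == second else "heterozygous"


def copies_of(genotype, allele):
    """Number of copies (0, 1, or 2) of an allele in a genotype."""
    return genotype.split("/").count(allele)


if __name__ == "__main__":
    print(possible_genotypes(["A", "G"]))
    apoe = possible_genotypes(["e2", "e3", "e4"])
    for genotype in apoe:
        print(genotype, zygosity(genotype), "e4 copies:", copies_of(genotype, "e4"))
```

In general, n alleles give n(n+1)/2 possible genotypes: three for the biallelic SNP and six for the three APOE alleles.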

In contrast to the genotype, the “phenotype” can be defined as the combination of all the observable or measurable characteristics of an individual (e.g., eye color, hair type, body height, disease status, etc.). The phenotype is, at least partially, determined by the genotype, because it depends on the level at which specific genes can be expressed. The latter depends on the variations in the DNA sequence but also on environmental influences (e.g., nutrition status).

Very importantly, it should be pointed out that although the phenotype may appear to be equal in two individuals, their genotypes might be different. This could be due to a significant environmental influence which overrides the genetic contribution, or alternatively because some of the possible genotypes do not result in phenotypic differences (e.g., genotypes A/A and A/G can both have straight hair, while only genotype G/G shows curly hair; Fig. 1). In a molecular diagnostic setting, the determination of the genotype usually aims to predict the phenotype. Indeed, the genotype is sometimes fully predictive of the phenotype (especially in single-gene Mendelian inherited disorders, such as cystic fibrosis). Unfortunately, most often the analyzed genotype is not fully predictive of the phenotype but merely allows assignment of a certain risk level to an individual for expressing or developing a specific phenotype. For example, susceptibility-conferring genotypes at the BRCA1 and BRCA2 gene loci confer a relative risk of breast cancer of about 5. Consequently, it is strongly advised that every molecular genetic diagnostic report to be used in the clinic is accompanied by an interpretation of the test results.
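The observation that different genotypes can share one phenotype can be sketched as follows. This is a toy model under strong assumptions: a single locus, complete dominance of one allele, and no environmental influence. The function name is hypothetical, and the trait labels follow the corrected wording given in the erratum to this article (A/A and A/G curly, G/G straight).

```python
def phenotype(genotype, dominant_allele, dominant_trait, recessive_trait):
    """Phenotype at one locus under complete dominance.

    One copy of the dominant allele is enough to show the dominant trait;
    the recessive trait appears only when both copies are the other allele.
    """
    alleles = genotype.split("/")
    return dominant_trait if dominant_allele in alleles else recessive_trait


if __name__ == "__main__":
    for genotype in ("A/A", "A/G", "G/G"):
        trait = phenotype(genotype, "A", "curly hair", "straight hair")
        print(f"{genotype}: {trait}")
```

Two of the three genotypes map to the same trait here, so the phenotype alone cannot distinguish A/A from A/G.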

GENETICS AND GENOMICS IN CANCER

As mentioned above, familial cases of some specific cancer types are known, indicative of an inherited trait similar to any other genetic disorder. For some of these cancer types, successful positional cloning projects have allowed identification of genes harboring mutations which cause the disease. As these mutations are inherited by the next generation, it implies that the responsible gene defect is present in the germline cells. At present, more than 20 different hereditary cancer syndromes have been defined and attributed to specific germline mutations. Collectively, these syndromes affect approximately 1% of all cancer patients [6]. For several of the inherited cancer syndromes, genetic testing for disease susceptibility is feasible and already part of the clinical management of affected families. Controversy on its value has been raised, however, especially in cases where the risks of developing cancer associated with a predisposing mutation are less certain, or where there is no effective intervention to offer those with a positive result [7].

Most cancer patients do not have any pronounced family history, yet genomic defects are at the basis of the disease. The genetic information present in normal cells can also be altered (e.g., due to incorrect DNA duplication during cell division), either by gross chromosomal changes such as translocations, deletions, inversions, and amplifications, or through more subtle changes such as point mutations and microdeletions [8]. The accumulation of these genetic alterations can finally lead to the expression of the full cancer phenotype. It should be noted that, apart from the familial cases, these changes in the DNA occur in the somatic cells and do not transfer to the germline cells. Consequently, these abnormalities are not inherited by the children of these patients.

Historically, chromosomal abnormalities in tumors were first recognized when an unusually small chromosome, the “Philadelphia chromosome,” was observed in white blood cells as a hallmark of chronic myeloid leukemia. The significance of these chromosomal abnormalities has only relatively recently become clear through a combination of improved cytogenetics and molecular biology. The central concept is that of proto-oncogenes and tumor suppressor genes: normal cellular genes controlling growth, development, differentiation, DNA repair, and DNA modification become deregulated in the neoplastic cancer cell due to mutations, fusions, or deletions. The normal structure of a resident proto-oncogene may be converted to a dominant oncogene by mutations or chromosomal rearrangements. Such conversion in one copy of the gene (one chromosome homologue) is sufficient to result in neoplastic transformation. In contrast, loss or inactivation of tumor suppressor genes may release a cell from constraints imposed by these genes, resulting in uncontrolled growth. Their behavior is recessive, and both allele copies must be lost for tumor activation to occur. Therefore, recurrent deletions of chromosomal material are recognized as indications of the presence of tumor suppressor genes. On the other hand, recurring specific chromosomal aberrations (translocations, amplifications, and inversions) have been instrumental in identifying proto-oncogenes. The cloning of the chromosomal breakpoints of such aberrations has proven to be an effective strategy for identifying mutant genes in tumors (e.g., ETV6 in leukemia, c-MYC in Burkitt’s lymphoma). At present, more than 50 chromosomal translocation breakpoints have been molecularly cloned and the involved genes identified. The vast majority of these tumors were of hematopoietic origin, as cytogenetic data on these tumors are easier to obtain and they are thus more extensively studied. It is clear that cytogenetic analysis might allow subtyping of patients with an apparently similar phenotype. Depending on the chromosomal abnormalities (and thus the genes involved), the efficacy of therapy can be predicted to a certain extent. Therefore, cytogenetic analysis has now become a routine analysis in many centers (http://www.waisman.wisc.edu/cytogenetics/Bmproject/CancerCyto.htmlx). Finally, the fluorescence in situ hybridization (FISH) technique allows direct identification of the breakpoint region involved in a chromosomal abnormality at a relatively high resolution (10-100 kbp). Specific FISH probes which recognize recurrent abnormalities of specific regions of the genome are now commercially available for routine analysis of cancer cells derived from oncology patients.
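The contrast between dominant oncogenes and recessive tumor suppressor genes can be summarized in a deliberately simplified sketch. It models a single gene in isolation, whereas real tumors arise from the accumulation of alterations in many genes; the function name and the string labels are hypothetical.

```python
def promotes_tumor_growth(gene_class, altered_copies):
    """Toy rule for whether alterations in one gene favor tumor growth.

    gene_class: 'proto-oncogene' or 'tumor suppressor'.
    altered_copies: how many of the two gene copies are altered (0, 1, or 2).
    """
    if altered_copies not in (0, 1, 2):
        raise ValueError("a diploid cell carries two copies of the gene")
    if gene_class == "proto-oncogene":
        # Dominant behavior: conversion of one copy is sufficient.
        return altered_copies >= 1
    if gene_class == "tumor suppressor":
        # Recessive behavior: both copies must be lost or inactivated.
        return altered_copies == 2
    raise ValueError(f"unknown gene class: {gene_class}")


if __name__ == "__main__":
    for gene_class in ("proto-oncogene", "tumor suppressor"):
        for altered in (0, 1, 2):
            result = promotes_tumor_growth(gene_class, altered)
            print(f"{gene_class}, {altered} altered copies: {result}")
```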

A useful overview of the currently known genetics and genomics behind cancers is provided in a subsection of the web site of the National Center for Biotechnology Information (NCBI), which specifically deals with this topic (http://www.ncbi.nlm.nih.gov/disease/Cancer.html). Ongoing research in oncology is further directed toward a more complete and basic understanding of why some somatic cells at a certain point in time become tumor cells. In this respect, the National Cancer Institute coordinates the Cancer Genome Anatomy Project (CGAP), which provides a valuable resource of information and technological tools required to analyze the molecular anatomy of the cancer cell (http://www.ncbi.nlm.nih.gov/ncicgap).

TOWARD GENETICS AND GENOMICS APPLICATIONS IN THE CLINIC

Originally, molecular genetics was used in medicine only to identify gene defects in major single-gene disorders such as cystic fibrosis. The excitement in the field has shifted gradually toward more common and complex genetic disorders. It is therefore not surprising that the amount of genetic information on an ever-increasing number of diseases has exploded over the past few years. The Online Mendelian Inheritance In Man (OMIM) is a database of bibliographic information about human genes and genetic disorders and is freely available online (http://www.ncbi.nlm.nih.gov/omim/). With more than 10,000 entries (newly defined for each distinct disease gene or genetic disorder for which sufficient information exists), it provides probably the most comprehensive, authoritative, and timely compendium of information in human genetics. Clinicians can use OMIM as an aid in differential diagnosis by searching the database using key clinical features of a patient.

An important question for the clinician is what impact on future medical practice might be expected from ongoing research activities in genomics and genetics. The main contribution to date has been in the identification of new molecular targets for drug action, which might in the long term result in new and better drug therapies. It is expected, however, that clinical practice will also be increasingly affected by new diagnostic tests based on genetic markers associated with increased disease risk, therapeutic efficacy, or adverse events. In this respect, the research area designated as pharmacogenomics is expected to become a driving force toward a more rational use of pharmaceutical products. Expensive therapies might possibly no longer be authorized without a definite diagnosis based on a genetic test. Validated genetic tests enabling prediction of increased risk of disease development might eventually lead to a shift from curative toward predictive treatment, long before clinical symptoms of the disease can be observed.

In conclusion, it might be expected that genomics and genetics will have a large impact on future medical practice. Several genomics-based applications are on their way to enter the clinic within the next few years, and many more will most probably follow at a later stage. For clinicians of the 21st century, it will be key to be well prepared for and open-minded toward this molecular future of medicine.

REFERENCES

1 Macilwain C. World leaders heap praise on human genome landmark. Nature 2000;405:983-984.

2 Lockhart DJ, Winzeler EA. Genomics, gene expression and DNA arrays. Nature 2000;405:827-836.

3 Bentley DR. Decoding the human genome sequence. Hum Mol Genet 2000;9:2353-2358.

4 Rubin GM, Yandell MD, Wortman JR et al. Comparative genomics of the eukaryotes. Science 2000;287:2204-2215.

5 O’Brien S, Menotti-Raymond M, Murphy WJ et al. The promise of comparative genomics in mammals. Science 1999;286:458-481.

6 Fearon ER. Human cancer syndromes: clues to the origin and nature of cancer. Science 1997;278:1043-1050.

7 Ponder B. Genetic testing for cancer risk. Science 1997;278:1050-1054.

8 Lengauer C, Kinzler KW, Vogelstein B. Genetic instabilities in human cancers. Nature 1998;396:643-649.

9 Varmus H, Weinberg RA. Genes and the biology of cancer. New York: Scientific American Library, 1993:10-14.

ADDITIONAL READING

Brown PO, Hartwell L. Genomics and human disease: variations on variation. Nat Genet 1998;18:91-93.

Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science 1997;278:1580-1581.

Collins FS. Medical and societal consequences of the human genome project. N Engl J Med 1999;341:28-37.

Hamosh A, Scott AE, Amberger J et al. Online Mendelian Inheritance in Man (OMIM). Hum Mutat 2000;15:57-61.

Holtzman NA, Marteau TM. Will genetics revolutionize medicine? N Engl J Med 2000;343:141-144.

Lander ES, Schork NJ. Genetic dissection of complex traits. Science 1994;265:2037-2048.

Poste G. Molecular medicine and information-based targeted healthcare. Nat Biotech 1998;16(suppl 1):19-21.

Roses AD. Pharmacogenetics and future drug development and delivery. The Lancet 2000;355:1358-1361.

Schafer AJ, Hawkins JR. DNA variation and the future of human genetics. Nat Biotech 1998;16:33-39.

Strachan T, Read AP. Human Molecular Genetics, 2nd Ed. Oxford: Bios Scientific Publishers Ltd., 1999:1-53, 139-168, 295-314, 351-375, 427-444.

Wolf CR, Smith G, Smith RL. Pharmacogenetics. BMJ 2000;320:987-990.

ERRATUM

The Oncologist 2001;6:222 www.TheOncologist.com

THE HUMAN GENOME: AN INTRODUCTION

Jeroen Aerssens, Martin Armstrong, Ron Gilissen, Nadine Cohen. The Oncologist 2001;6:100-109

On page 107, in the final paragraph, “(e.g., genotypes A/A and A/G can both have straight hair, while only genotype G/G shows curly hair-Fig. 1)” should be “(e.g., genotypes A/A and A/G can both have curly hair, while only genotype G/G shows straight hair-Fig. 1).”