Composition-sensitive analysis of the human genome for regulatory signals

In Silico Biology 3 (2003) 145–171, IOS Press

Electronic publication can be found in In Silico Biol. 3, 0013 <http://www.bioinfo.de/isb/2003/03/0013/>, 27 June 2003. 1386-6338/03/$8.00 © 2003 – IOS Press and Bioinformation Systems e.V. All rights reserved

Olga V. Kel-Margoulis 1,*, Dmitri Tchekmenev 1, Alexander E. Kel 1, Ellen Goessling 1, Klaus Hornischer 1, Birgit Lewicki-Potapov 1 and Edgar Wingender 1,2

1 BIOBASE GmbH, Halchtersche Str. 33, D-38304 Wolfenbüttel, Germany
2 Department of Bioinformatics, UKG, Georg-August-University Göttingen, Goldschmidtstr. 1, D-37077 Göttingen, Germany

* Corresponding author. E-mail: [email protected].

Edited by T. Werner; received 11 October 2002; revised and accepted 24 June 2003; published 27 June 2003

ABSTRACT: Known transcription regulatory signals which generally act as transcription factor binding sites (TFs) differ significantly in their base composition. Therefore, their occurrence in a genome largely depends on the local base composition. In an attempt to initiate an all human genome analysis for the occurrence of potential TFs, we systematically analyzed the GC-content of distinct functional regions (e.g., upstream and downstream gene regions, exons, long and short introns, repetitive elements) and correlated the frequencies of potential binding sites of a representative set of TFs in these regions. For these analyses, we used the pattern collection of the TRANSFAC® database on transcriptional regulation, the information about functionally relevant combinations of them from the database TRANSCompel®, and our new resource, TRANSGenome™, which provides an overall annotation of the human genome with emphasis on its regulatory characteristics. We show that the occurrence of sequence patterns with regulatory potential may be supported by, but cannot be fully explained by either the GC content of a whole chromosome or its putative promoter regions, nor by the information content of the patterns. Several patterns, HNF-3, NFAT, and GC box, show a clear overrepresentation in all promoter groups as well as in all chromosomes. Other patterns, like E2F and CRE-BP1, are underrepresented in all promoter groups as well as in all chromosomes in comparison with random sequences. Simultaneously, both patterns are over-represented in promoters in comparison with repetitive elements. We define several structural characteristics of the proximal promoters that differentiate them from other functional genomic regions. Two well-known promoter elements, GC- and TATA-boxes, are statistically enriched in promoters in comparison with random sequences, repetitive elements and exons. Altogether, our findings provide insights into the macroheterogeneity amongst the individual chromosomes, into the microheterogeneity among different functional regions of individual chromosomes, contribute to further understanding of structural organization of gene regulatory regions, and give first hints on the development of regulatory features during evolution.

KEYWORDS: human genome, transcription factor binding sites, computational analysis, gene regulation, promoters, repetitive elements


O.V. Kel-Margoulis et al. / Composition-Sensitive Analysis of the Human Genome for Regulatory Signals 146

INTRODUCTION

Functional genomics approaches depend heavily on a thorough preceding computational analysis of the genome under consideration with respect to the features to be analyzed. Understanding the biological function of a whole genome and of its parts, the genes it comprises, requires considering both the potential gene products, with their structure and (potential) functions in the cell, and the regulation of the encoding genes. Over many years, information relevant for gene regulation in all kinds of eukaryotes has been collected and stored in databases such as the Eukaryotic Promoter Database (EPD) [Praz et al., 2002], TRANSFAC® [Matys et al., 2003], TRRD [Kolchanov et al., 2002] or COMPEL [Kel et al., 1995; Kel-Margoulis et al., 2002]. On the basis of their contents, programs were developed that identify individual regulatory sequence elements (transcription factor binding sites, TFS) [e.g. Chen et al., 1995; Quandt et al., 1995; Wingender et al., 1996; Goessling et al., 2001; Kel et al., 2003], complex compositions of regulatory regions [Frech et al., 1997, 1998; Kel et al., 1999, 2001], or whole promoters. The accuracy of promoter prediction varies widely among the different tools [for review, see Fickett and Hatzigeorgiou, 1997], though more recent developments may have provided real breakthroughs in this field [Scherf et al., 2000; Davuluri et al., 2001; Bajic et al., 2002].

Now, with the (nearly) complete sequence of the human genome at hand, we can start to apply systematically what we have learned so far about individual regulatory sequence signals and their context and, thus, to evaluate the regulatory potential of a complex genome as a whole. This study presents initial investigations towards that goal, making best use of the databases we have developed earlier (TRANSFAC®, TRANSCompel®) together with a set of new tools, which are described as well.

METHODS

The TRANSGenome™ resource

TRANSGenome™ consists basically of three parts: the central database, a supplementary collection of links to other databases (EXDB), and a viewer with a number of search facilities.

The TRANSGenome™ database itself has been filled using a system of Perl scripts that parse external database information and store it in tables of a relational database system. The reference sequence of the genomic database is currently based on RefSeq as the data source for the nucleotide sequence and the NCBI annotation results, including gene models and SNP information [Wheeler et al., 2002]. RefSeq information is parsed automatically. Nomenclature information is added from HGNC (Human Genome Nomenclature Committee) [Povey et al., 2001] and updated automatically as soon as changes become available. Proprietary information, e.g. from the TRANSFAC® database, is matched through scripts and subsequent manual annotation to provide a maximum level of reliability.

EXDB contains extracts from the publicly available databases LocusLink and UniGene [Wheeler et al., 2002], HGNC [Povey et al., 2001], EMBL/GenBank/DDBJ [Stoesser et al., 2002; Benson et al., 2002; Tateno et al., 2002], and Swiss-Prot [O’Donovan et al., 2002]. Several features from these databases have been read in through Perl scripts and stored in an SQL database system. The data stored in EXDB are required for TRANSGenome™ but are not directly part of the central database, and are therefore formally separated from TRANSGenome™.

The TRANSGENOME VIEWER is developed in Java. It is the window to the TRANSGenome™ database and enables the user not only to retrieve the information stored in the database, but also to combine and restrict search items for specified queries by applying different constraints. The data output can be configured to yield a number of formats, so that information can be extracted for subsequent use in analysis tools, whose results might be fed back directly into the TRANSGenome™ system.


A particular feature of TRANSGenome™ is the mapping of upstream regions of primary transcripts. This mapping has been done by using gene models from RefSeq and mRNA sequences matched to the reference sequence. For those gene models for which LocusLink numbers are available, the nucleotide at the 5'-end of the first exon was assumed to be the start of transcription. Though obviously still incomplete and certainly still containing some mistakes, the resulting set of 5'-flanking regions is a rather comprehensive and at least highly enriched set of human promoters. Comparable data sets for exons, introns and 3'-flanking regions have been retrieved from TRANSGenome™ as well. As for introns, we differentiated between “short” (<700 bp) and “long” (>2000 bp) introns according to Figure 36c in [International Human Genome Sequencing Consortium, 2001], where it has been shown that they exhibit clearly distinct GC contents.

Databases on transcription regulation (TRANSFAC® and TRANSCompel®)

This study uses the TRANSFAC® database on transcription factors (TF), their expression patterns, genomic and artificial binding sites, and positional weight matrices for many of the TF stored in the database [Matys et al., 2003]. It is made available as a collection of interlinked flat files (with the number of entries in release 6.3 from October 2002): FACTOR (4921), SITE (12514), GENE (3821), MATRIX (596), CLASS (50), CELL (1455); more recently, REFERENCE (9444) was added as a new resource instead of providing references only as information incorporated in the individual entries of the other tables.

TRANSCompel® is a database on composite elements [Kel-Margoulis et al., 2002]. These are combinations of individual TF binding sites which, as such, provide a new regulatory quality. TRANSCompel® documents the position and binding sequence of each of the constituent regulatory sequence elements, the interacting TF (with TRANSFAC® link), and the experimental evidence of its function. With its release 6.3 from October 2002, the database comprised information about 350 composite elements extracted from 411 references; these entries are linked with 294 transcription factors described in TRANSFAC®.

Programs for the analysis of regulatory regions (Match™ and TRANSPLORER™)

Match™ is a weight matrix-based tool for searching putative transcription factor binding sites (TFs) in DNA sequences [Goessling et al., 2001; Kel et al., 2003]. It is closely interconnected and distributed together with the TRANSFAC® database [Matys et al., 2003]. In particular, Match™ uses the matrix library collected in TRANSFAC® and therefore makes it possible to search for a large variety of different transcription factor binding sites. Users may construct and save their own profiles, which are selected subsets of matrices with default or user-defined cut-off values. A public version of the Match™ tool is available at http://www.gene-regulation.com/pub/programs.html (Match).

TRANSPLORER™ (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences. It includes a tool for the prediction of potential TFs in any sequence of interest, which is currently based on the same algorithm as Match™. The TRANSPLORER™ site prediction tool also uses position weight matrix (PWM) collections. In contrast to Match™, however, it is able to use several matrix sources: the largest and most up-to-date library of matrices derived from the TRANSFAC® Professional database, other matrix libraries, and any user-developed matrix library. A search can be made using all matrices of the libraries or pre-defined or user-defined subsets of them.

TRANSPLORER™ also visualizes the search results together with feature information of EMBL or GenBank entries. It comes with a large number of filtering options that specify the (kind of) sites to appear in the program output. Thus, the program can be restricted to show only potential binding sites for human factors, for factors belonging to a certain class, for tissue-specific factors, etc.

Both tools allow the construction of user-specific profiles and provide specific thresholds for the TRANSFAC® matrix libraries. Furthermore, predefined tissue-specific profiles are included. In the near future, TRANSPLORER™ will also provide the possibility to identify composite elements using weight matrices.

Algorithm for identifying potential binding sites using weight matrices

The algorithm uses two score values, the matrix similarity score and the core similarity score, and in this respect resembles the previously published MatInspector™ algorithm [Quandt et al., 1995]. The matrix similarity score denotes the quality of a match between the sequence and the matrix, whereas the core similarity score is a weight for the quality of matches between the core sequence of a matrix (the five most conserved consecutive positions in a matrix) and the input sequence. Both scores range from 0 to 1, where 1 denotes an exact match. We calculate the matrix similarity score slightly differently from MatInspector™ (see below). The core length used for matrices in Match™ and TRANSPLORER™ is extended to increase the speed of the algorithm. The main steps of the algorithm are:

1. For each matrix, all possible matches with the matrix core in a sequence s are searched. Since the length of a core is defined as 5, all subsequences x of length 5 within the sequence s are found.
2. For each of these subsequences, the start position in the sequence s and the core similarity value are stored in a table.
3. For each entry with a core similarity value higher than a certain cut-off, each occurrence of this subsequence is looked up in the sequence s and is extended at both ends so that it fits the matrix length. Then the matrix similarity value is calculated, and those matches with a matrix similarity score higher than a certain cut-off are shown in the program output.
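The three steps above can be sketched as a two-stage scan. This is a minimal illustration under stated assumptions, not the Match™ implementation: the function name `scan`, the parameter `core_offset` (position of the core within the matrix) and the two scoring callbacks are hypothetical, and a hit is accepted when its score is greater than or equal to the cut-off (the text says "higher than"), so that a cut-off of 1.0 still admits exact matches. The toy scorer `identity` merely counts identical positions and stands in for the similarity scores defined in the text.

```python
def scan(seq, matrix_len, core_offset, core_score, matrix_score,
         core_cutoff, matrix_cutoff, core_len=5):
    """Two-stage search: core pre-filter, then full-length scoring.

    core_score and matrix_score are caller-supplied functions returning a
    similarity between 0 and 1 (hypothetical hooks for illustration).
    Returns a list of (start, core_similarity, matrix_similarity) tuples.
    """
    # Steps 1-2: tabulate every subsequence of core length with start and score.
    table = [(p, core_score(seq[p:p + core_len]))
             for p in range(len(seq) - core_len + 1)]
    hits = []
    for core_start, c_sim in table:
        if c_sim < core_cutoff:
            continue  # core too dissimilar, skip the more expensive step
        # Step 3: extend the core at both ends to the full matrix length.
        start = core_start - core_offset
        end = start + matrix_len
        if start < 0 or end > len(seq):
            continue  # the matrix would overhang the sequence
        m_sim = matrix_score(seq[start:end])
        if m_sim >= matrix_cutoff:
            hits.append((start, c_sim, m_sim))
    return hits


def identity(consensus):
    """Toy scorer: fraction of positions identical to `consensus`."""
    return lambda s: sum(a == b for a, b in zip(s, consensus)) / len(consensus)


hits = scan("GGTGACGTCAGG", matrix_len=8, core_offset=1,
            core_score=identity("GACGT"), matrix_score=identity("TGACGTCA"),
            core_cutoff=1.0, matrix_cutoff=0.8)
print(hits)
```

The pre-filter is what makes a longer core attractive: only windows whose core passes the first cut-off are scored over the full matrix length.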

The score for the matrix similarity of a subsequence x of length L of the sequence s is calculated as

\[
\mathrm{mat\_sim}(x) = \frac{\mathit{Current} - \mathit{Min}}{\mathit{Max} - \mathit{Min}} \tag{1}
\]

Here, \(\mathit{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}\), with \(f_{i,B}\) as the frequency of nucleotide B (B ∈ {A, C, G, T}) in position i of the matrix and \(b_i\) as the nucleotide of x in position i; \(\mathit{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}\), with \(f_i^{\min}\) as the frequency of the nucleotide which is rarest in position i of the matrix; and \(\mathit{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}\), with \(f_i^{\max}\) as the highest frequency in position i.

The position information vector is calculated as follows:

\[
I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \log_2 f_{i,B} \;-\; \sum_{B \in \{A,T,G,C\}} \tfrac{1}{4} \log_2 \tfrac{1}{4} \;=\; \sum_{B \in \{A,T,G,C\}} f_{i,B} \log_2 \left(4 f_{i,B}\right) \tag{2}
\]
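Equations (1) and (2) translate directly into a few lines of code. The sketch below assumes that a matrix is given as a list of per-position dictionaries of relative nucleotide frequencies (each position summing to 1) and that the sequence contains only the upper-case letters A, C, G and T; the function names and the toy matrix are illustrative and not taken from TRANSFAC®.

```python
import math

NUCLEOTIDES = "ACGT"


def information_vector(freqs):
    """I(i) = sum over B of f_iB * log2(4 * f_iB), as in eq. (2).

    Terms with f_iB = 0 contribute nothing (limit of f * log f for f -> 0).
    """
    return [
        sum(f * math.log2(4 * f) for f in (col[b] for b in NUCLEOTIDES) if f > 0)
        for col in freqs
    ]


def mat_sim(subseq, freqs):
    """Matrix similarity score of eq. (1): 0 for the worst, 1 for the best match."""
    if len(subseq) != len(freqs):
        raise ValueError("subsequence and matrix must have the same length")
    info = information_vector(freqs)
    current = sum(i * col[b] for i, col, b in zip(info, freqs, subseq))
    lowest = sum(i * min(col.values()) for i, col in zip(info, freqs))
    highest = sum(i * max(col.values()) for i, col in zip(info, freqs))
    if highest == lowest:  # no position carries any information
        return 0.0
    return (current - lowest) / (highest - lowest)


# Toy matrix: one fully conserved, one two-fold degenerate, one uniform position.
toy = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.5, "G": 0.5, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(information_vector(toy))   # 2 bits, 1 bit, 0 bits
print(mat_sim("ACT", toy), mat_sim("AAT", toy), mat_sim("TAA", toy))
```

In the toy matrix, a mismatch in the uniform third position costs nothing, a mismatch in the second position lowers the score moderately, and a mismatch in the fully conserved first position is penalised most strongly.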

The information vector I(i) describes the conservation of each position i in a matrix [Quandt et al., 1995]. It ranges from 0, when all nucleotides are equally represented in a given position, up to 2, when a position has the same nucleotide in all sites of the set. Multiplication of the frequencies with the information vector leads to a higher acceptance of mismatches in less conserved regions, whereas acceptance of mismatches in highly conserved regions is significantly reduced. We normalize the matrix similarity score during its calculation by subtracting Min. The core similarity score is calculated similarly.

Matrix similarity cut-off estimation

The appropriate cut-off selection is very important and depends largely on the user’s objectives. Different sites for one transcription factor clearly vary in their similarity to the consensus, and this is reflected in the variation of their mat_sim score. A search for only exact matches between matrix and sequence can lack biological relevance, since some transcription factors have low-affinity binding sites of biological significance. Therefore we have calculated three different cut-off values for each matrix, each suited to a distinct problem.

Cut-offs minimizing the false negative rate (minFN): We mainly followed a previously published approach [Pickert et al., 1998]. We applied individual matrices to sets of corresponding genomic binding sites collected in TRANSFAC®. In contrast to Pickert et al. [1998], our test sets included not only those sites which were used to calculate a matrix, but all genomic binding sites contained in TRANSFAC® for the binding factor of the respective matrix. Each of these binding sites was extended by ten base pairs at each end with the help of the linked corresponding EMBL entry. Only sets containing at least 15 individual binding sites were used, which allowed us to apply this approach to 111 matrices. Here, we took as the minFN cut-off the cut-off which leads to 10% false negative matches, i.e. the cut-off which recognizes 90% of the TRANSFAC® sites of the sample.

With the minFN cut-offs the user will find almost all (known) genomic binding sites, but a high rate of false positives has to be taken into account as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.

Cut-offs minimizing the false positive rate (minFP): Again we followed Pickert et al. [1998] in estimating the cut-offs that minimize the false positive rate. We applied Match™ to a collection of second exon sequences (~6 × 10^6 bp). For every matrix, the lowest cut-off which gives no match in this set of exons is considered the minFP cut-off.

When minFP cut-offs are applied to search a DNA sequence, Match™ and TRANSPLORER™ return a relatively low number of matches per nucleotide. The output contains only putative sites with a high similarity to the weight matrix; however, some known genomic binding sites will not be recognized. This kind of cut-off is useful, for instance, for finding the most promising potential binding sites in extended genomic DNA sequences.
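The minFN and minFP definitions can be illustrated with a short sketch. It assumes that the best mat_sim score of each known binding site and the scores of all candidate positions in the exon background have already been collected in plain lists, and that a position counts as a match when its score is greater than or equal to the cut-off. These conventions, the function names and the example values are assumptions made for illustration; `math.nextafter` requires Python 3.9 or later.

```python
import math


def count_matches(scores, cutoff):
    """Number of scores accepted at `cutoff` (assumed rule: score >= cutoff)."""
    return sum(s >= cutoff for s in scores)


def min_fn_cutoff(site_scores, fn_rate=0.10):
    """Highest cut-off that misses at most `fn_rate` of the known sites."""
    ranked = sorted(site_scores, reverse=True)
    allowed_misses = int(fn_rate * len(ranked))
    return ranked[len(ranked) - allowed_misses - 1]


def min_fp_cutoff(background_scores):
    """Lowest cut-off that gives no match in the background set.

    With the >= rule this is the smallest float above the best background score.
    """
    return math.nextafter(max(background_scores), math.inf)


# Made-up scores for ten known sites and four background positions.
sites = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]
background = [0.31, 0.62, 0.87, 0.44]

fn10 = min_fn_cutoff(sites)      # recovers 9 of the 10 sites
fp = min_fp_cutoff(background)   # just above the best background score
print(fn10, count_matches(sites, fn10))
print(fp, count_matches(background, fp), count_matches(sites, fp))
```

In the example, the minFP cut-off removes every background match but retains only two of the ten known sites, which illustrates the trade-off between the two error types.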

Cut-offs minimizing the sum of both errors (minSum): These thresholds were computed to balance both error rates against each other, in a similar way as described earlier [Reuter, 2000]. To find cut-offs that give optimal numbers of both false positives and false negatives, we compute a sum of both error rates. For that, we computed for every matrix the number of matches found in the exon sequences using the minFN cut-off. This number is defined as 100% of false positives. The sum of the corresponding percentages of false positives and false negatives is then computed for every cut-off ranging from minFN to minFP. The cut-off that gives the minimum sum is referred to as the minSum cut-off.

Selected positional weight matrices

We made the selection out of the whole collection of about 300 matrices (for vertebrate factors only) such that some of the most important and widespread promoter elements (such as TATA and GC boxes) are included, as well as at least one representative member of each TF class [Wingender, 1997]. As an additional criterion for matrix selection, we used an estimate of the false negative rate on genomic binding sites; that is, we selected those matrices for which a sufficient number of genomic binding sites has been collected in TRANSFAC®. The 23 selected vertebrate matrices are listed in Table 1 along with their TRANSFAC® identifiers and accession numbers, the commonly used name of the corresponding transcription factor, and the consensus motif derived from the matrix. Also included are the mat_sim cut-off values used for the analyses described here and the total information content, which is the sum of the information vector (eq. 2) over all positions of the matrix. We chose FN50 values, i.e. we adjusted the threshold so as to re-identify only 50% of the known genomic binding sites but, more importantly, to reduce the number of false positive matches drastically.


Table 1 Characteristics of the positional weight matrices selected for genome analysis

| TRANSFAC ID | Accession number | Transcription factor | Consensus | GC-content (%) | Length | Information content | mat_sim (FN50) |
|---|---|---|---|---|---|---|---|
| V$AP1_C | M00199 | AP-1 | TGASTCAG | 41.1 | 9 | 10.70 | 0.87 |
| V$CEBPA_01 | M00116 | C/EBPα | ATTRCNNAA | 38.2 | 14 | 7.04 | 0.93 |
| V$CREBP1_Q2 | M00179 | CRE-BP1 | VSTGACGTMAS | 54.8 | 12 | 12.85 | 0.9 |
| V$E2F_Q6 | M00427 | E2F | TTTSGCGS | 64.6 | 8 | 9.88 | 0.94 |
| V$EGR1_01 | M00243 | Egr-1 | WTGCGTGGGCGK | 67.8 | 12 | 17.38 | 0.79 |
| V$ER_Q6 | M00191 | ER, estrogen receptor | ARGNNANNNTGACCY | 51.2 | 19 | 11.29 | 0.95 |
| V$GATA1_04 | M00128 | GATA-1 | CWGATAR | 41.2 | 13 | 10.18 | 0.97 |
| V$GC_01 | M00255 | GC-box | RGGGGCGGGGCNK | 68.2 | 14 | 11.53 | 0.92 |
| V$GR_Q6 | M00192 | GR, glucocorticoid rec. | CNNTNTGTNCT | 46.0 | 19 | 8.98 | 0.89 |
| V$HNF1_01 | M00132 | HNF-1 | GGTTAATNWTTAMM | 27.6 | 15 | 14.23 | 0.89 |
| V$HNF3B_01 | M00131 | HNF-3β | KGNANTRTTTRYTTW | 26.2 | 15 | 12.85 | 0.89 |
| V$HNF4_01 | M00134 | HNF-4 | RGGNCAAAGKTCA | 49.5 | 19 | 11.18 | 0.82 |
| V$MEF2_02 | M00231 | MEF-2 | KCTAWAAATAGM | 36.1 | 22 | 16.89 | 0.84 |
| V$MYB_Q6 | M00183 | c-Myb | AACKGNC | 51.7 | 10 | 7.04 | 0.98 |
| V$NF1_Q6 | M00193 | NF-1 | TTGGCNNNNNNCC | 53.0 | 18 | 9.51 | 0.97 |
| V$NFAT_Q6 | M00302 | NF-AT | WGGAAAA | 37.2 | 12 | 10.17 | 0.98 |
| V$NFKAPPAB_01 | M00054 | NF-κB | GGGAMTTYCC | 61.5 | 10 | 13.16 | 0.96 |
| V$NFY_Q6 | M00185 | NF-Y / CP1 / CBF | TRRCCAATSR | 47.3 | 11 | 11.47 | 0.97 |
| V$OCT_C | M00210 | Octamer motif | CTNATTTGCATAY | 32.4 | 13 | 14.95 | 0.9 |
| V$SRF_C | M00215 | SRF | DCCWTATATGGNCW | 46.2 | 15 | 15.22 | 0.92 |
| V$TATA_01 | M00252 | TATA box / TBP | STATAAAWR | 37.7 | 15 | 10.08 | 0.86 |
| V$USF_Q6 | M00187 | USF | GYCACGTGNC | 63.3 | 10 | 9.41 | 0.94 |
| V$YY1_01 | M00059 | YY1 | CCATNTWNNNW | 44.4 | 17 | 7.90 | 0.96 |

RESULTS

5'-flanking regions of the human genome exhibit an elevated GC-content

It is known that the human genome has a GC-content below 50%, which nevertheless differs between the individual chromosomes. As a first approximation, we can distinguish five groups, as shown in Table 2.

Table 2 Classification of human chromosomes according to GC-content

| Group | GC-content | Chromosomes |
|---|---|---|
| I | 0.38 | 4, 13 |
| II | 0.40 | 2, 3, 5, 6, 7, 8, 12, 14, 18, 21, X, Y |
| III | 0.42 | 1, 9, 10, 11, 15 |
| IV | 0.44 | 16, 17, 20 |
| V | 0.48 | 19, 22 |

With the help of TRANSGenome™, we compared the GC-content of different functional regions of human chromosomes. In nearly all chromosomes, the immediate 5'-flanking regions (−300/+50 bp around the transcription start site, TSS, which is set to +1) are the most GC-rich sequences (Figure 1). Only in chr 16 and 19 do short introns (see Methods) exhibit an equal or even higher GC-content, whereas in chr Y exons have a slightly higher GC-content than 5'-flanking sequences. Long introns (>2000 bp) show the lowest GC-content among the functional regions examined here in nearly all chromosomes.

As expected, the GC-content of the upstream regions increases gradually as the region considered is narrowed towards the TSS. The GC-content of the −1000/+100 regions is higher than that of the −2000/+100 regions, and the GC-content of the proximal promoters (−300/+50) is remarkably higher than that of the −1000/+100 regions (Figure 1).

To study the GC-enrichment around the transcription start sites in more detail, we computed the GC-content within 300-bp windows between positions −3000 and +3000. In most chromosomes, the peak clearly lies within the region −1000/+1000, with the sharpest peak within 300 bp downstream of the TSS, whereas the preceding and subsequent windows also have high, though slightly lower, GC contents than the +1/+300 region (Figure 2a). A further refined analysis of 50-bp windows between −1000 and +1000 shows that there may be several maxima around the TSS, between −100 and +300, the prevalence of which differs considerably between the individual chromosomes (Figure 2b).
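A window analysis of this kind can be sketched as follows. The function names are illustrative; ambiguous bases such as N are ignored when computing the GC fraction, and window boundaries are given as 0-based offsets from the TSS index, which simplifies the −1/+1 numbering used in the text. Both choices are assumptions of this sketch.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else float("nan")


def gc_profile(seq, tss_index, window=300, span=3000):
    """GC fraction in consecutive windows from -span to +span around the TSS.

    tss_index is the 0-based index of the transcription start site in seq.
    Returns (start_offset, end_offset, gc) tuples; windows that would extend
    beyond the sequence are skipped.
    """
    profile = []
    for offset in range(-span, span, window):
        lo = tss_index + offset
        hi = lo + window
        if lo < 0 or hi > len(seq):
            continue
        profile.append((offset, offset + window, gc_fraction(seq[lo:hi])))
    return profile


# Artificial example: an AT-only upstream half and a GC-only downstream half.
example = "AT" * 150 + "GC" * 150
for start, end, gc in gc_profile(example, tss_index=300, window=300, span=300):
    print(start, end, gc)
```

Calling `gc_profile` with `window=50` and `span=1000` would correspond to the refined 50-bp analysis; averaging the profiles over all genes of a chromosome would then give curves of the kind shown in Figure 2.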

Fig. 1. GC content of different functional regions in human chromosomes. Indicated is the GC content of the whole chromosomes, of “complete” genes (i.e., transcribed sequences) including exons and introns, of genes together with the 10 kb upstream nucleotides, of the 5'-flanking 10, 2, 1 and 0.3 kb, resp., of exons, introns <700 bp and introns >2000 bp, as well as of the 2 or 10 kb 3'-flanking sequences. Only genes that appear with a LocusLink in RefSeq were analyzed.


Fig. 2. GC-content of different intervals around starts of 1st exons (LocusLink genes). A) 300 bp intervals B) 50 bp intervals. The vertical straight line indicates the transcription start site position, the dashed lines point to the position of the maximal GC content in different chromosomes.


Distribution of AT- and GC-rich 5'-flanks does not correlate with overall chromosomal GC-content

Because of their elevated GC-content in comparison with the surrounding sequences, and because of our previous findings [Kel et al., 2001], we selected the regions −300/+50 around the TSS for further analysis and refer to them as proximal promoters. Here we show that the correlation between the general chromosomal GC content and that of the upstream regions is very poor (Figure 3).

We analyzed the distribution of the proximal promoters according to their GC content and found a strong asymmetry. If we pool the 5'-flanking regions of all human chromosomes, the distribution shows a peak around a GC-content of 0.5, with a relatively sharp decline towards lower GC contents and a tail towards higher GC contents (Figure 4).

Fig. 3. Correlation between the general chromosomal GC content and that of upstream regions, −350/+50 relative to the starts of the 1st exons.

Fig. 4. Distribution of 5'-flanking regions (−300/+50 relative to the transcription start site) of all human genes with a LocusLink (12,291 genes) according to their GC content.


Fig. 5. Distribution of 5'-flanking regions (−300/+50 relative to the transcription start site) of genes with a LocusLink (12,291 genes) according to their GC content, separately for human chromosomes of group I (A), group II (B–D), group III (E, F), group IV (G), and group V (H), as they are defined in Table 2.


The average GC-content of these immediate upstream regions differs considerably between the individual chromosomes (Figure 5 A–H). It shifts slightly towards higher values in the chromosomes of the higher groups, i.e. of higher general GC-content (Figure 5 G,H). The most GC-rich promoters, with an average GC-content of about 60%, are observed on chr 21, although this chromosome belongs to the second group with an average GC-content of 40% (Figure 5 D). For this chromosome, the difference in GC-content between proximal promoters and surrounding regions is the most remarkable among human chromosomes. This example further illustrates the poor correlation between the GC content of a chromosome and that of the proximal promoters on the same chromosome.

Presence of potential transcription factor sites in human chromosomes

To investigate the occurrence of potential binding sites for a set of selected transcription factors (TF), we applied 23 selected vertebrate matrices from the TRANSFAC® database. For the selection criteria, see the Methods section. The threshold for all matrices was set to FN50, i.e. relatively stringent conditions which allowed the detection of (only) 50% of the known genomic sites, but significantly reduced the number of false positive matches.

As to be expected, the number of hits for individual TF DNA-binding profiles largely varies (mainly) amongst the different patterns as well as between the chromosome groups (Table 3). In general, the order of matching frequencies for the individual matrices is: V$TATA_01 > V$GR_Q6 > V$HNF3B_01 > V$CEBPA_01 > V$AP1_C > V$YY1_01 > V$GATA1_04, V$HNF4_01 > V$NFAT_Q6 > V$MEF2_02 > V$NF1_Q6 > V$MYB_Q6 > V$EGR1_01 > V$USF_Q6 > V$GC_01 > V$OCT_C > V$HNF1_01 > V$NFY_Q6 > V$ER_Q6 > V$CREBP1_Q2 > V$NFKAPPAB_01 > V$SRF_C > V$E2F_Q6

Some of the matching frequencies are very close to each other so that the order can be different in indi-vidual chromosomes and chromosomal groups. In particular the most GC-rich chromosomes (group V) exert a number of deviations in the order of matching frequencies (Table 3).

Also, the number of matches for each pattern strongly depends on the GC-content of both the pattern itself and the chromosome. For instance, the highest number of matches was found for V$TATA_01 in the AT-rich chromosomes of group I (824.2 hits per 100 kb), and it drops to nearly half of this value in the chromosomes of group V (436.9 hits per 100 kb). At the other end of the scale, the matrix V$E2F_Q6 yields as few as 0.8 to 1.4 hits per 100 kb, with the higher values in the more GC-rich chromosome groups (Table 3).
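The normalization behind these figures can be written down in a few lines. The sketch below is only an illustration: the helper name is hypothetical and the raw count passed to it is invented, while the two TATA frequencies are the group I and group V values quoted above.

```python
def hits_per_100kb(n_hits, length_bp):
    """Normalize a raw match count to matches per 100 kb of sequence."""
    return n_hits * 100_000 / length_bp

# Frequencies for V$TATA_01 quoted in the text (hits per 100 kb).
tata_group_i = 824.2
tata_group_v = 436.9

# Fraction of the group I frequency that remains in group V.
remaining = tata_group_v / tata_group_i

# Invented example: 8242 matches in 1 Mb correspond to 824.2 hits per 100 kb.
example = hits_per_100kb(8242, 1_000_000)
```

The remaining fraction is about 0.53, which is what "nearly half" refers to.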

Interestingly, the number of matches shows no correlation with the stringency (e.g. in terms of the inverse of the absolute number of hits produced in the whole chromosomes or genomes; cf. Tables 3 and 4) or the information content (Table 1) of the matrix, but rather reflects its particular pattern. For instance, matrices comprising an obligatory "CG" dinucleotide, such as the E2F profile, match at a much lower rate than others. But this correlation is not strict either: five patterns comprise a CpG in the conserved part of their binding site (V$EGR1_01, V$USF_Q6, V$GC_01, V$CREBP1_Q2, V$E2F_Q6), and all of them are located in the rarely matching half of the series shown above. However, many patterns without an obligatory CpG lie in between them, and the one that even comprises two CpG dinucleotides (V$EGR1_01) is by far not the rarest (Table 3).

Table 3 Pattern frequencies in groups of human chromosomes and random sequences of same GC-content

Matrix         Group I (rand 0.38)     Group II (rand 0.40)    Group III (rand 0.42)   Group IV (rand 0.44)    Group V (rand 0.48)
               average rand    ratio   average rand    ratio   average rand    ratio   average rand    ratio   average rand    ratio
V$GC_01        9.0     1.3     6.78    11.2    1.8     6.16    14.1    2.6     5.37    20.8    3.7     5.60    32.1    7.1     4.55
V$NFAT_Q6      103.6   42.7    2.42    99.0    39.8    2.49    93.6    37.3    2.51    81.7    34.4    2.37    66.7    28.0    2.38
V$HNF3B_01     375.3   183.2   2.05    347.9   150.7   2.31    324.0   123.6   2.62    293.8   100.0   2.94    268.5   63.9    4.20
V$OCT_C        17.2    10.7    1.61    15.7    9.3     1.68    14.0    8.0     1.74    11.1    6.9     1.60    8.5     4.9     1.72
V$NFKAPPAB_01  2.8     1.5     1.87    2.9     1.7     1.66    3.0     1.8     1.65    3.0     2.1     1.42    3.1     2.6     1.18
V$MEF2_02      61.5    46.7    1.32    58.7    40.2    1.46    57.7    35.1    1.64    56.5    29.7    1.90    59.3    21.0    2.82
V$YY_01        172.9   127.2   1.36    172.6   123.5   1.40    171.2   118.8   1.44    168.5   113.3   1.49    162.8   99.5    1.64
V$ER_Q6        4.4     3.3     1.31    4.8     3.7     1.30    5.1     4.0     1.28    5.7     4.1     1.40    6.5     4.7     1.40
V$NF1_Q6       40.3    30.0    1.34    43.3    33.7    1.29    45.7    37.3    1.23    50.5    40.9    1.23    52.3    48.0    1.09
V$GR_Q6        507.7   408.7   1.24    510.8   414.3   1.23    505.9   416.8   1.21    490.4   416.3   1.18    475.3   404.4   1.18
V$TATA_01      824.2   715.5   1.15    737.0   602.8   1.22    658.5   503.2   1.31    535.8   418.7   1.28    436.9   284.1   1.54
V$HNF4_01      95.0    80.4    1.18    97.2    84.7    1.15    99.7    87.9    1.13    98.6    89.7    1.10    95.0    92.7    1.02
V$MYB_Q6       43.2    41.2    1.05    43.8    43.5    1.01    43.4    45.5    0.95    42.0    47.3    0.89    37.8    48.8    0.77
V$CEBPA_01     288.7   310.5   0.93    265.4   284.1   0.93    245.0   257.1   0.95    207.2   230.6   0.90    165.1   180.7   0.91
V$AP1_C        262.1   288.7   0.91    247.6   270.9   0.91    236.2   252.9   0.93    213.6   234.3   0.91    182.7   196.0   0.93
V$GATA_04      119.5   131.9   0.91    110.7   121.2   0.91    103.8   109.8   0.94    86.0    99.6    0.86    69.4    79.0    0.88
V$NFY_Q6       7.4     8.7     0.85    7.2     8.7     0.82    6.7     8.3     0.81    6.1     7.7     0.79    5.2     6.8     0.76
V$HNF1_01      12.9    17.0    0.76    11.5    14.7    0.78    10.5    12.1    0.87    8.6     9.9     0.86    7.0     6.5     1.08
V$EGR_01       13.0    15.3    0.85    15.0    19.4    0.77    17.6    24.4    0.72    24.9    30.6    0.81    35.7    45.9    0.78
V$SRF_C        1.3     1.8     0.69    1.4     1.9     0.74    1.3     1.9     0.67    1.2     1.9     0.64    1.1     1.8     0.59
V$USF_Q6       13.0    18.9    0.69    14.7    23.2    0.64    16.7    27.4    0.61    23.1    32.4    0.71    28.6    43.7    0.65
V$CREBP1_Q2    4.0     24.0    0.16    4.2     25.6    0.17    4.4     26.9    0.16    5.2     28.0    0.19    6.1     29.5    0.21
V$E2F_2Q6      0.8     10.9    0.07    0.9     12.1    0.07    0.9     13.7    0.07    1.2     14.7    0.08    1.4     17.3    0.08

Note: For each chromosome group, "average" is the average frequency in the group, "rand" the frequency in random sequences of the given GC-content, and "ratio" the quotient group/rand. The chromosome groups are as defined in Table 2. The patterns are sorted according to the over-/under-representation factor in chromosome group II, which comprises half of all human chromosomes. Yellow and orange: over-/under-representation is >2-fold or <0.5-fold, respectively; grey: patterns appear as expected from random sequences within a +/− 25% range.


Table 4 Pattern matches in distinct genomes

Length of the fragments analysed: Aeropyrum pernix 1,669,695 bp; E. coli K12 4,639,221 bp; Bacillus subtilis 4,214,814 bp; yeast 12,070,522 bp; C. elegans 95,078,296 bp; each random sequence 10^6 bp.

Matrix         A. pernix (1)           E. coli K12 (3)         B. subtilis (2)         Yeast (4)               C. elegans
               genome  random  ratio   genome  random  ratio   genome  random  ratio   genome  random  ratio   genome  random  ratio
GC-content     0.56    0.56            0.5     0.5             0.44    0.44            0.38    0.38            0.34    0.34
V$AP1_C        33.9    123.1   0.28    186.7   177.5   1.05    207.8   234.3   0.89    212.2   288.7   0.73    217.4   319.8   0.68
V$CEBPA_01     42.2    104.0   0.41    217.7   159.6   1.36    284.4   230.6   1.23    357.4   310.5   1.15    480.5   369.7   1.30
V$CREBP1_Q2    17.6    29.2    0.60    28.8    29.8    0.97    20.1    28.0    0.72    18.9    24.0    0.79    23.5    20.9    1.12
V$E2F_2Q6      6.2     20.5    0.30    35.4    18.1    1.95    23      14.7    1.56    10.3    10.9    0.95    22.1    8.2     2.70
V$EGR_01       105.9   95.2    1.11    80.6    55.9    1.44    24.7    30.6    0.81    16.9    15.3    1.10    18      8.9     2.02
V$ER_Q6        3.3     5.1     0.65    5.7     4.8     1.19    2.9     4.1     0.71    3.5     3.3     1.06    2.2     2.6     0.84
V$GATA_04      72.7    45.3    1.60    104.7   69.6    1.50    99.7    99.6    1.00    119.2   131.9   0.90    92      154.1   0.60
V$GC_01        40.8    22.2    1.84    4.8     9.6     0.50    4.2     3.7     1.13    1.8     1.3     1.36    7.1     0.5     13.30
V$GR_Q6        267.3   352.1   0.76    307     394.9   0.78    402.9   416.3   0.97    410.8   408.7   1.01    393.7   387.5   1.02
V$HNF1_01      1.6     2.4     0.66    8.3     5.2     1.58    5.3     9.9     0.53    10.2    17.0    0.60    11.5    23.6    0.49
V$HNF3B_01     22.6    23.8    0.95    83      50.9    1.63    186.7   100.0   1.87    260     183.2   1.42    367.5   267.3   1.37
V$HNF4_01      56.1    90.5    0.62    100.5   93.6    1.07    101.9   89.7    1.14    97.2    80.4    1.21    78      70.9    1.10
V$MEF2_02      14.2    9.3     1.53    13      17.2    0.75    22.2    29.7    0.75    37.7    46.7    0.81    50.2    60.8    0.83
V$MYB_Q6       18.7    47.4    0.39    95.1    49.1    1.94    44.3    47.3    0.94    38.6    41.2    0.94    42.6    35.4    1.20
V$NF1_Q6       38      59.1    0.64    51.8    51.3    1.01    54      40.9    1.32    45.2    30.0    1.51    47.2    22.7    2.08
V$NFAT_Q6      17.5    16.2    1.08    52.3    24.6    2.13    85.5    34.4    2.49    99.7    42.7    2.33    136.9   46.8    2.93
V$NFKAPPAB_01  2.2     3.2     0.69    2.5     2.8     0.89    2.2     2.1     1.03    1.6     1.5     1.09    1.8     1.0     1.80
V$NFY_Q6       0.6     4.6     0.13    9.7     6.2     1.56    7       7.7     0.91    9.8     8.7     1.13    8.1     9.0     0.90
V$OCT_C        0.5     2.2     0.23    5.3     4.1     1.30    10.6    6.9     1.53    11.1    10.7    1.04    12.2    13.4    0.91
V$SRF_C        1.7     1.5     1.12    0.7     1.9     0.38    1.2     1.9     0.62    1.9     1.8     1.05    0.7     1.5     0.45
V$TATA_01      222.3   118.4   1.88    262.2   229.8   1.14    491.1   418.7   1.17    652.3   715.5   0.91    613.6   993.7   0.62
V$USF_Q6       31.7    71.3    0.44    20.8    50.0    0.42    15.9    32.4    0.49    12.4    18.9    0.66    22.6    12.4    1.82
V$YY_01        70.2    69.5    1.01    100.4   92.4    1.09    144.1   113.3   1.27    154.7   127.2   1.22    132.1   130.8   1.01

(1) Aeropyrum pernix: Archaea; Crenarchaeota; Aeropyrum.
(2) Bacillus subtilis: Bacteria; Firmicutes; Bacillus/Clostridium group; Bacillus/Staphylococcus group; Bacillus.
(3) Escherichia coli K12: Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia; Escherichia coli.
(4) Saccharomyces cerevisiae: Eukaryota; Fungi; Ascomycota; Saccharomycetes; Saccharomycetales; Saccharomycetaceae; Saccharomyces.


To correlate the frequency with which certain patterns appear in the whole chromosome with the occurrence that could have been expected, we generated randomized chromosomal sequences with the same GC content as the chromosomes of the individual groups (Table 2). The results for the individual groups are also shown in Table 3. We found that about half of the matrices match in whole chromosomes as expected from their GC-content within a deviation of +/− 25%. However, there are some clear exceptions, and most of them hold true for any group. Thus, HNF-3, NFAT, and GC box patterns show a clear over-representation in all five groups of chromosomes (>2-fold compared with random sequences of same GC-content). The absolute number of hits for the GC box element increases from group I to V, whereas that of the TATA box decreases, but the enrichment is significant in all cases.
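The comparison with random sequences can be sketched as follows. This is a minimal illustration that assumes random sequences are drawn base by base with G and C sharing the target GC-content equally; how the randomized chromosomal sequences were actually generated is not specified here, and the function names are hypothetical. The cut-offs of 2-fold, 0.5-fold and +/− 25% are those used in the note to Table 3.

```python
import random

def random_sequence(length, gc_content, rng):
    """Draw bases independently; G and C share gc_content, A and T the rest."""
    at = (1 - gc_content) / 2
    gc = gc_content / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))

def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def representation(observed_per_100kb, random_per_100kb):
    """Ratio observed/random and a label following the Table 3 note."""
    ratio = observed_per_100kb / random_per_100kb
    if ratio > 2:
        label = "over-represented"
    elif ratio < 0.5:
        label = "under-represented"
    elif 0.75 <= ratio <= 1.25:
        label = "as expected"
    else:
        label = "intermediate"
    return ratio, label

rng = random.Random(0)  # fixed seed so that the example is reproducible
background = random_sequence(200_000, 0.40, rng)

# Group II values from Table 3 (hits per 100 kb): chromosomes vs. random.
gc_box = representation(11.2, 1.8)
e2f = representation(0.9, 12.1)
myb = representation(43.8, 43.5)
```

Ratios recomputed in this way from the rounded table entries can differ slightly from the printed ratios (for the GC box in group II, about 6.2 instead of 6.16).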

Potential E2F and CRE-BP1 sites are consistently under-represented throughout all groups (to less than half of the expected frequency). The EGR1 and USF patterns are slightly under-represented in all groups, except in the most AT-rich one (group I).

Some other over-represented patterns exhibit a more heterogeneous picture. In particular, MEF2 is 2.8-fold over-represented in group V and 1.9-fold in group IV chromosomes, whereas the NF-κB pattern is over-represented in group I chromosomes (Table 3).

Comparison of pattern frequencies between human and other genomes

To elucidate particular regulatory features within the human genome, we further compared the frequencies of the 23 regulatory patterns with their occurrence in the genomes of an archaebacterium (Aeropyrum pernix K1, a strictly aerobic hyperthermophilic archaeon), of a gram-positive (Bacillus subtilis) and a gram-negative bacterium (Escherichia coli), and of two lower eukaryotes (the yeast Saccharomyces cerevisiae and the nematode Caenorhabditis elegans). Since these genomes are quite different in their GC-content, we have compared them with random sequences of the corresponding GC-content (Table 4). Our results show remarkable differences in over- and under-representation for a number of patterns between the human and the other genomes.

Matches for V$GC_01 are highly over-represented in all 5 groups of human chromosomes, from 4.55-fold in group V to 6.78-fold in group I, in comparison with random sequences with the same GC-content. In the C. elegans genome, which is more AT-rich than all human chromosomes, matches for V$GC_01 are even more strongly over-represented (13.3-fold). They are also slightly over-represented in the A. pernix, B. subtilis and yeast genomes. Surprisingly, in the E. coli genome matches for V$GC_01 are 2-fold under-represented.

Matches for V$NFAT_Q6 are about 2.5 times more frequent in human chromosomes than in the corresponding random sequences. A similar relation between genome frequencies and random frequencies is found for E. coli, B. subtilis, yeast, and C. elegans, whereas in the A. pernix genome the frequency of NFAT patterns is as expected from its GC-content.

Frequencies of the matches for V$TATA_01 are higher in human chromosomes than in random sequences, 1.15- to 1.54-fold in different groups. A similar over-representation of the TATA-box pattern occurs in the genomes of A. pernix, E. coli, and B. subtilis. In contrast, in the yeast and C. elegans genomes the TATA-box is under-represented. This possibly reflects a counter-selection against TATA-like patterns in the AT-rich genomes of yeast and C. elegans.

Matches for V$NFKAPPAB_01 are 1.18–1.87-fold over-represented in human chromosomes. Similarly, these matches are 1.8-fold over-represented in the C. elegans genome. In the yeast and B. subtilis genomes the number of matches is the same as expected from the GC-content, and in E. coli and A. pernix this pattern is even under-represented.

Matches for V$E2F_Q6 are extremely under-represented in all groups of human chromosomes: the number of matches found constitutes just 7–8% of that in random sequences with the same GC-content. For A. pernix, this ratio is 30%. For yeast, the number of matches is the same as expected from the GC-content. In the B. subtilis, E. coli, and C. elegans genomes, E2F matches are about 2 times over-represented. This observation further supports the possible counter-selection against E2F patterns in the human genome.

Although the average GC-content of the group I human chromosomes is the same as that of the yeast genome (0.38), the frequencies of certain patterns are clearly different (Table 4). For example, the GC-box pattern occurs 5 times more frequently, and the matrices for NF-κB, MEF2, Oct and HNF-3B match about 2 times more frequently in the human group I chromosomes than in the yeast genome; TATA, HNF1, ER, GR and AP1 are slightly (1.24–1.26 times) more frequent in these human chromosomes than in the yeast genome.
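These fold differences can be recomputed directly from the group I averages in Table 3 and the yeast genome frequencies in Table 4. The short sketch below does only this arithmetic; the dictionary names are chosen for the example.

```python
# Hits per 100 kb: human group I averages (Table 3) and yeast genome (Table 4).
human_group_i = {"V$GC_01": 9.0, "V$TATA_01": 824.2, "V$HNF1_01": 12.9,
                 "V$ER_Q6": 4.4, "V$GR_Q6": 507.7, "V$AP1_C": 262.1}
yeast = {"V$GC_01": 1.8, "V$TATA_01": 652.3, "V$HNF1_01": 10.2,
         "V$ER_Q6": 3.5, "V$GR_Q6": 410.8, "V$AP1_C": 212.2}

# Fold difference human group I / yeast, rounded to two decimals.
fold = {m: round(human_group_i[m] / yeast[m], 2) for m in human_group_i}
```

The GC box comes out at 5-fold, and the remaining five matrices fall between 1.24 and 1.26, in line with the range given in the text.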

In summary, our results suggest that differences of pattern frequencies between human and other genomes cannot be explained just by the distinct GC-content of the genomes. Distinct sequence patterns that could function as transcription factor binding sites are under positive or negative selection during evolution.

Classification of upstream sequences according to their GC-content

The general examination of the distribution of GC-content in upstream sequences suggested the existence of different promoter classes (Figures 4 and 5). To further investigate the over- or under-representation of the TF binding patterns in the proximal upstream sequences (−300/+50), we classified promoters into four groups. This structural classification takes into account the GC-content of promoters only and is chromosome-independent (Table 5).

It is interesting to note that within the set analyzed, comprising 12,291 upstream sequences of the most reliably annotated human genes, the majority of the proximal 5'-flanks are AT-rich, with a GC-content of less than 0.5 (groups A and B; 6,788 upstream regions).
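The classification can be illustrated with a small sketch that assigns a sequence to one of the four groups by its GC-content, using the boundaries of Table 5. The function names are hypothetical, and placing a GC-content of exactly 0.6 into group D is an assumption, since Table 5 does not state where this boundary value belongs.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def promoter_group(seq):
    """Assign a promoter sequence to group A-D by its GC-content (Table 5)."""
    gc = gc_fraction(seq)
    if gc < 0.4:
        return "A"
    if gc < 0.5:
        return "B"
    if gc < 0.6:
        return "C"
    return "D"  # assumption: a GC-content of exactly 0.6 is placed in group D

# Group sizes as listed in Table 5.
GROUP_SIZES = {"A": 3239, "B": 3549, "C": 2480, "D": 3023}
at_rich = GROUP_SIZES["A"] + GROUP_SIZES["B"]
total = sum(GROUP_SIZES.values())
```

Summing the group sizes reproduces the 12,291 upstream sequences of the whole set and the 6,788 regions of groups A and B.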

To estimate the over- or under-representation of distinct motifs, we generated four sets of random sequences (random promoters) according to the four promoter groups. Each random set contains the same number of sequences of the same length and the same GC-content as the corresponding promoter set. We compared the frequency of matches found for each matrix in the four groups of promoters with the frequencies in the corresponding set of random promoters (Table 6). Student's t-test has been applied to calculate the probability that the frequency of matches in the real promoters does not differ from the frequency of matches in the random promoters. The lower this probability, the more reliable is the difference between promoters and random sequences (Table 6).
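A comparison of this kind can be sketched as a two-sample Student's t-test on per-sequence match counts. The code below is a minimal illustration under stated assumptions: the per-sequence counts are invented, the pooled-variance form of the test is assumed, and the two-sided probability is approximated with the normal distribution, which is reasonable only for the large sample sizes used here (thousands of sequences per group). It is not meant to reproduce the exact procedure behind Table 6.

```python
import math
from statistics import NormalDist, mean, variance

def student_t(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance, and its degrees of freedom."""
    na, nb = len(sample_a), len(sample_b)
    pooled = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

def two_sided_p_normal(t):
    """Two-sided probability from the normal approximation to the t distribution."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

# Invented per-sequence match counts for one matrix (400 sequences per set).
promoter_counts = [2, 3, 1, 4, 2, 3, 2, 3] * 50
random_counts = [0, 1, 0, 1, 1, 0, 1, 0] * 50

t_value, dof = student_t(promoter_counts, random_counts)
p_value = two_sided_p_normal(t_value)
```

A positive t value together with a very small probability corresponds to a significant over-representation in promoters; a negative t value with a small probability corresponds to an under-representation.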

In particular, the matches of six matrices are significantly over-represented in all classes of upstream sequences: GC box, GR, HNF-3, NFAT, YY1 and HNF-4 (Tables 6 and 7). These patterns exhibit rather diverse GC-contents by themselves, ranging from 0.26 for HNF-3B to 0.68 for the GC-box. In all upstream sequence groups with GC-content higher than 0.4, potential TATA boxes, MEF2 and C/EBP sites are over-represented as well. In the most AT-rich class, MEF2 appears as expected, whereas putative TATA and C/EBP elements are even under-represented (Tables 6 and 7). It seems that these motifs were counter-selected in these promoters during evolution to avoid erroneous transcription start sites. NF-1 and c-Myb matches are found specifically over-represented in the most AT-rich upstream sequences only (group A), both of them representing more GC-rich patterns of GC-content 0.53 and 0.52, respectively. On the other hand, potential NF-Y and AP-1 sites seem to be characteristic of the more GC-rich upstream sequences of groups C and D (GC-content >0.5), and those for HNF-1 and GATA factors of the most GC-rich group D only (GC-content >0.6), where they appear to be significantly over-represented 8- and 1.5-fold, respectively. Interestingly, EGR1 and the octamer motif exhibit a more discontinuous behavior: EGR1 is slightly but significantly over-represented in groups A and D (1.9- and 1.5-fold, respectively). In groups B and C, it tends to be rather depleted, although this is not statistically significant (Table 6). The octamer motif is found at exactly the same rate in group A upstream sequences as in random sequences, and seems to be over-represented in group C with an error probability that just misses the set threshold of p<0.001 (Table 6), but it is clearly over-represented with high significance in group B and, particularly, in group D.

Table 5 Size and features of classes of potential promoter regions

Group  GC-content        Number of promoters  Total length, bp  Average GC-content
A      GC < 0.4          3,239                1,133,650         0.3433
B      0.4 ≤ GC < 0.5    3,549                1,242,150         0.4471
C      0.5 ≤ GC < 0.6    2,480                868,000           0.5458
D      GC > 0.6          3,023                1,058,050         0.6900

Table 6 Occurrences of transcription factor binding patterns in human upstream sequences

Frequency in group A promoters and in random sequences (GC=0.34)

Matrix         Promoters  Random   Promoters/random  Probability (p <)
V$GC_01        19.2       0.7      27.21             5.57E-30
V$NFAT_Q6      108.2      43.8     2.47              2.80E-67
V$NF1_Q6       41.4       19.4     2.13              9.92E-21
V$EGR1_01      17.4       9.0      1.93              4.89E-07
V$HNF3B_01     414.2      256.8    1.61              1.28E-67
V$GR_Q6        514.4      363.8    1.41              4.96E-62
V$HNF4_01      97.9       70.4     1.39              4.23E-12
V$MYB_Q6       52.2       37.7     1.39              3.50E-07
V$YY1_01       158.7      126.6    1.25              8.06E-11
V$TATA_01      787.6      953.9    0.83              1.23E-17
V$CEBPA_01     289.6      356.6    0.81              1.06E-16
V$AP1_C        253        314.8    0.80              4.48E-13
V$GATA1_04     109.7      149.0    0.74              1.14E-16
V$HNF1_01      13         21.0     0.62              4.76E-06
V$CREBP1_Q2    4.2        19.6     0.21              3.50E-24
V$E2F_2Q6      1.2        7.2      0.17              4.36E-12
V$ER_Q6        4          2.3      1.74              2.34E-02
V$NFKAPPAB_01  1.8        0.9      2.04              8.57E-02
V$MEF2_02      57         60.1     0.95              3.51E-01
V$USF_Q6       15.1       13.5     1.12              4.04E-01
V$SRF_C        1          1.3      0.76              5.16E-01
V$NFY_Q6       6.9        7.6      0.91              5.37E-01
V$OCT_C        13.1       13.1     1.00              1.00E+00

Frequency in group B promoters and in random sequences (GC=0.45)

Matrix         Promoters  Random   Promoters/random  Probability (p <)
V$GC_01        29.6       4.8      6.13              3.18E-37
V$HNF3B_01     326.9      88.5     3.70              2.57E-205
V$NFAT_Q6      96.4       31.8     3.03              1.50E-85
V$OCT_C        12.5       4.6      2.72              1.52E-11
V$TATA_01      602.3      376.3    1.60              2.02E-41
V$MEF2_02      43.1       28.3     1.52              9.24E-10
V$HNF4_01      110.5      82.0     1.35              8.90E-13
V$YY1_01       140.6      108.0    1.30              2.98E-13
V$GR_Q6        510.9      396.8    1.29              1.47E-38
V$CEBPA_01     234.3      210.3    1.11              3.27E-04
V$USF_Q6       25.0       37.4     0.67              2.82E-06
V$CREBP1_Q2    7.2        27.1     0.27              5.87E-29
V$E2F_2Q6      1.4        13.8     0.10              3.58E-28
V$ER_Q6        6.5        4.1      1.59              9.43E-03
V$NFKAPPAB_01  4.4        2.3      1.90              2.10E-02
V$NF1_Q6       45.5       41.0     1.11              9.01E-02
V$EGR1_01      29.3       33.2     0.88              1.10E-01
V$GATA1_04     85.9       91.4     0.94              1.63E-01
V$AP1_C        224.4      221.1    1.01              6.64E-01
V$NFY_Q6       7.2        7.6      0.95              7.11E-01
V$SRF_C        1.5        1.4      1.12              7.77E-01
V$HNF1_01      10.5       10.8     0.97              8.04E-01
V$MYB_Q6       47.3       47.5     0.99              9.31E-01

Frequency in group C promoters and in random sequences (GC=0.55)

Matrix         Promoters  Random   Promoters/random  Probability (p <)
V$HNF3B_01     215.0      27.6     7.77              4.24E-143
V$GC_01        82.3       16.8     4.89              2.21E-61
V$NFAT_Q6      70.7       14.7     4.80              1.03E-66
V$MEF2_02      27.1       7.8      3.46              5.61E-21
V$NFY_Q6       13.7       4.5      3.05              1.71E-09
V$TATA_01      342.5      123.4    2.78              4.80E-75
V$YY1_01       121.7      66.4     1.83              3.85E-31
V$CEBPA_01     156.5      109.7    1.43              5.38E-14
V$AP1_C        167.6      124.4    1.35              4.25E-09
V$GR_Q6        474.4      353.6    1.34              2.81E-31
V$HNF4_01      110.3      88.1     1.25              4.94E-06
V$CREBP1_Q2    20.9       30.5     0.68              5.77E-04
V$USF_Q6       40.1       67.5     0.59              5.83E-11
V$E2F_2Q6      6.1        16.7     0.37              8.95E-11
V$EGR1_01      68.1       82.7     0.82              1.07E-03
V$OCT_C        6.5        3.0      2.15              1.44E-03
V$GATA1_04     55.1       45.5     1.21              5.27E-03
V$HNF1_01      5.0        2.5      1.95              1.22E-02
V$NF1_Q6       55.5       50.5     1.10              1.52E-01
V$ER_Q6        6.1        4.8      1.26              2.74E-01
V$NFKAPPAB_01  4.6        3.6      1.29              3.32E-01
V$MYB_Q6       48.0       45.9     1.05              5.01E-01
V$SRF_C        1.0        1.3      0.82              7.23E-01

Frequency in group D promoters and in random sequences (GC=0.69)

Matrix         Promoters  Random   Promoters/random  Probability (p <)
V$OCT_C        4.2        0.1      44.00             2.91E-10
V$HNF3B_01     52.1       3.7      14.13             4.09E-51
V$NFY_Q6       12.0       1.3      9.07              4.95E-19
V$HNF1_01      2.3        0.3      8.00              1.57E-04
V$NFAT_Q6      32.4       4.3      7.62              6.72E-44
V$MEF2_02      9.9        1.8      5.53              1.98E-13
V$TATA_01      88.2       20.0     4.40              6.45E-30
V$GC_01        406.5      99.0     4.11              1.88E-224
V$YY1_01       62.9       24.0     2.62              1.71E-36
V$CEBPA_01     58.1       27.4     2.12              3.45E-21
V$CREBP1_Q2    34.3       19.7     1.75              1.16E-07
V$AP1_C        71.6       46.9     1.53              2.75E-08
V$EGR1_01      367.8      241.9    1.52              9.59E-37
V$GATA1_04     19.8       13.3     1.48              4.19E-04
V$HNF4_01      86.3       61.0     1.42              4.13E-11
V$GR_Q6        264.3      202.8    1.30              7.23E-18
V$USF_Q6       83.0       120.2    0.69              1.98E-12
V$E2F_2Q6      11.7       17.0     0.69              2.14E-03
V$NFKAPPAB_01  4.9        2.7      1.79              3.57E-02
V$SRF_C        0.9        0.2      5.00              5.92E-02
V$NF1_Q6       57.7       53.5     1.08              2.01E-01
V$MYB_Q6       29.5       27.5     1.07              3.99E-01
V$ER_Q6        3.6        4.2      0.86              5.20E-01

Table 7 Over- and underrepresentation of patterns in distinct promoter classes

Pattern  GC-content  Group A (GC < 0.4)  Group B (GC = 0.4–0.5)  Group C (GC = 0.5–0.6)  Group D (GC > 0.6)
USF      0.63                            −                       −                       −
NF-1     0.53        +
Myb      0.52        +
EGR1     0.68        +                                                                   +
GC       0.68        +                   +                       +                       +
GR       0.46        +                   +                       +                       +
HNF-3B   0.26        +                   +                       +                       +
NFAT     0.37        +                   +                       +                       +
YY1      0.44        +                   +                       +                       +
HNF-4    0.50        +                   +                       +                       +
MEF2     0.36                            +                       +                       +
NF-Y     0.47                                                    +                       +
OCT      0.32                            +                                               +
TATA     0.38        −                   +                       +                       +
C/EBP    0.38        −                   +                       +                       +
AP-1     0.41        −                                           +                       +
GATA     0.41        −                                                                   +
HNF-1    0.28        −                                                                   +
CRE-BP1  0.55        −                   −                       −                       +
E2F      0.65        −                   −                       −

On the other hand, some patterns are mainly conspicuous because of their under-representation in three out of the four promoter groups: USF (in groups B–D), CRE-BP1 and E2F (in groups A–C).

Three of the investigated patterns do not show any statistically significant enrichment or depletion in any of the four groups of upstream sequences when compared with random sequences: ER, NF-κB, and SRF (Table 6).
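The relation between the numerical results of Table 6 and the qualitative summary of Table 7 can be made explicit with a small sketch: a pattern receives a "+" or "-" only if its probability is below the threshold of p<0.001 mentioned above, and the sign follows the ratio promoters/random. The function name is hypothetical; the octamer values are those listed in Table 6.

```python
def mark(ratio, p_value, alpha=0.001):
    """Return '+' or '-' for a significant deviation from random, else ''."""
    if p_value >= alpha:
        return ""
    if ratio > 1:
        return "+"
    if ratio < 1:
        return "-"
    return ""

# V$OCT_C in Table 6: (promoters/random, probability) for each promoter group.
oct_rows = {"A": (1.00, 1.00e+00), "B": (2.72, 1.52e-11),
            "C": (2.15, 1.44e-03), "D": (44.00, 2.91e-10)}
oct_marks = {group: mark(*values) for group, values in oct_rows.items()}

# V$E2F_2Q6 in Table 6.
e2f_rows = {"A": (0.17, 4.36e-12), "B": (0.10, 3.58e-28),
            "C": (0.37, 8.95e-11), "D": (0.69, 2.14e-03)}
e2f_marks = {group: mark(*values) for group, values in e2f_rows.items()}
```

With this rule the octamer motif is marked only in groups B and D, and E2F only in groups A to C, which is how both patterns appear in Table 7.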

Altogether, there is no clear correlation between over- or under-representation of individual patterns in the distinct upstream sequence groups with their GC-content (Table 7).

Occurrence of regulatory patterns within SINE and LINE repetitive elements

Sequences of the three types of repetitive elements, class LINE families L1 and L2, and class SINE family Alu, have been isolated from human chromosome 1 according to the Ensembl annotation. Randomized L1, L2, and Alu sequences have been generated which have exactly the same nucleotide frequencies as the original repeats.
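One simple way to obtain a randomized sequence with exactly the same nucleotide frequencies is to shuffle the bases of the original sequence. The sketch below shows this approach as an assumption for illustration; the text does not state which randomization procedure was actually used, and the example sequence is invented.

```python
import random
from collections import Counter

def shuffle_preserving_composition(seq, rng):
    """Permute the bases of seq; the base counts stay exactly the same."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

rng = random.Random(1)  # fixed seed so that the example is reproducible
repeat = "GGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCA"  # invented example sequence
randomized = shuffle_preserving_composition(repeat, rng)
```

Because only the order of the bases changes, the mononucleotide composition is preserved exactly, whereas dinucleotide frequencies (such as CpG) and all longer patterns are not.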

Frequencies of matches for the matrices presented vary significantly between L1, L2 and Alu, which could be explained in part by the different GC-content of these three types of repeats. To study the influence of the nucleotide composition, we have compared the frequencies for matrices between repeats and randomized sequences (Table 8).


Table 8 Frequencies of patterns in repetitive elements in comparison with randomized repetitive elements and with the whole chromosome 1

Matrix         H chr 1     LINE L1     L1 rand     LINE L2     L2 rand     SINE Alu    Alu rand
Length, bp     230,311,131 31,917,906  31,970,605  9,320,827   9,361,090   25,003,925  25,043,712
GC-content     0.42        0.36                    0.43                    0.51
V$AP1_C        236.4       220.7       253.2       270.1       223.9       77          141.3
V$CEBPA_01     244.2       296.5       284.2       198.1       216.7       82.8        122.8
V$CREBP1_Q2    4.3         2.7         18.3        4.1         19.9        3.6         25.4
V$E2F_2Q6      0.9         0.4         11          0.6         10.7        0.3         20.3
V$EGR_01       17.8        8.9         15.8        14.6        39.5        14.7        62.6
V$ER_Q6        5.2         3.3         2.4         4.6         2.5         7.6         4
V$GATA_04      104.3       121.6       155.7       124.8       126.2       10.6        67.9
V$GC_01        14.5        6.4         1.7         8.8         11.5        4           13.7
V$GR_Q6        501.2       607.2       422.5       580         354.3       125.8       395.6
V$HNF1_01      10.6        13.6        20.3        6.3         10.5        6           4
V$HNF3B_01     325.1       383.5       387.5       228.2       144.7       258.9       61.8
V$HNF4_01      99.9        100.4       83.1        128         90.2        37.6        102.8
V$MEF2_02      58.2        58.6        54.2        28.5        31          167.6       14.6
V$MYB_Q6       43.2        44          33.5        40.6        35          6           43.5
V$NF1_Q6       45.9        27.2        25.1        36          28.1        63.6        48.6
V$NFAT_Q6      92.4        114.2       63.3        83.8        59          13.4        33.6
V$NFKAPPAB_01  3           3.8         1.2         1.5         1.2         0.4         2.6
V$NFY_Q6       6.7         7.9         9.8         4.9         6.5         0.3         5.6
V$OCT_C        13.8        30.1        11.1        8.4         6.8         1           3.4
V$SRF_C        1.3         0.8         1.3         1.1         1.5         0.1         1.6
V$TATA_01      659.1       899.3       894         418.4       507.3       141.4       197.8
V$USF_Q6       16.1        5.8         15.5        12.7        23.1        19.9        50.4
V$YY_01        171.4       189.1       112.9       137.5       121.6       256.4       79.7

In Alu repeats, frequencies of matches for the ER, HNF-3B, MEF2, and YY1 matrices are significantly higher than expected, and these patterns are over-represented in all chromosomes in comparison with random sequences (Table 3). Thus, the over-representation of these patterns in human chromosomes can be explained, at least in part, by their frequent occurrence in Alu repeats. The over-representation of MEF2 matches in Alu repeats is especially remarkable, 11.5-fold.

Over-representation of some other patterns in human chromosomes cannot be explained by their high frequencies in repeats. For example, the GC-box motif is clearly over-represented in all chromosomal groups (Table 3), and is under-represented in Alu and L2 repeats (Table 8).
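The fold values discussed for the repetitive elements follow directly from Table 8 as the ratio of the frequency in the repeats to the frequency in the randomized repeats. The sketch below recomputes a few of them; the variable names are chosen for the example.

```python
# Hits per 100 kb from Table 8: (repeat, randomized repeat).
alu = {"V$MEF2_02": (167.6, 14.6), "V$YY_01": (256.4, 79.7),
       "V$HNF3B_01": (258.9, 61.8), "V$ER_Q6": (7.6, 4.0),
       "V$GC_01": (4.0, 13.7)}
l2_gc_box = (8.8, 11.5)

# Ratio repeat / randomized repeat for each matrix in Alu repeats.
alu_ratio = {m: observed / rand for m, (observed, rand) in alu.items()}
l2_gc_ratio = l2_gc_box[0] / l2_gc_box[1]
```

The MEF2 ratio in Alu repeats rounds to 11.5, and the GC-box ratios in Alu and L2 repeats are both below 1.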

Matches for a number of matrices are under-represented in repetitive elements in comparison with randomized sequences. Thus, matches for CRE-BP1, E2F, EGR and HNF1 are under-represented in all three types of repetitive elements (Table 8). It is interesting to note that matches for CRE-BP1 and E2F are also under-represented in all chromosomes (Table 3) and significantly under-represented in three of the four promoter groups (groups A–C; Table 6).

Next, we have compared the frequencies of matches between repetitive elements and the promoter group with the most similar GC-content. Thus, L1 repeats have been compared with group A, L2 repeats with group B, and Alu repeats with group C (Table 9). In each comparison, promoters differ from the corresponding repetitive elements in the frequencies of several matrices. Most of the matrices under study match more frequently in promoters than in repetitive elements. Thus, matches for five matrices are over-represented in all three promoter groups in comparison with the corresponding repetitive elements: CRE-BP1, E2F, EGR, GC-box, and USF. Matches for NF-κB, NF-Y, Oct, and TATA-box are over-represented in the promoter groups B and C in comparison with L2 or Alu elements, respectively.

Table 9 Frequencies of matches in repetitive elements LINE and SINE in comparison with promoter groups with similar average GC-content

Matrix         Group A     LINE L1     Group B     LINE L2     Group C     SINE Alu
Length, bp     1,133,650   31,917,906  1,242,150   9,320,827   868,000     25,003,925
GC-content     0.34        0.36        0.45        0.43        0.55        0.51
V$AP1_C        253         220.7       224.4       270.1       167.6       77
V$CEBPA_01     289.6       296.5       234.3       198.1       156.5       82.8
V$CREBP1_Q2    4.2         2.7         7.2         4.1         20.9        3.6
V$E2F_2Q6      1.2         0.4         1.4         0.6         6.1         0.3
V$EGR_01       17.4        8.9         29.3        14.6        68.1        14.7
V$ER_Q6        4           3.3         6.5         4.6         6.1         7.6
V$GATA_04      109.7       121.6       85.9        124.8       55.1        10.6
V$GC_01        19.2        6.4         29.6        8.8         82.3        4
V$GR_Q6        514.4       607.2       510.9       580         474.4       125.8
V$HNF1_01      13          13.6        10.5        6.3         5.0         6
V$HNF3B_01     414.2       383.5       326.9       228.2       215.0       258.9
V$HNF4_01      97.9        100.4       110.5       128         110.3       37.6
V$MEF2_02      57          58.6        43.1        28.5        27.1        167.6
V$MYB_Q6       52.2        44          47.3        40.6        48.0        6
V$NF1_Q6       41.4        27.2        45.5        36          55.5        63.6
V$NFAT_Q6      108.2       114.2       96.4        83.8        70.7        13.4
V$NFKAPPAB_01  1.8         3.8         4.4         1.5         4.6         0.4
V$NFY_Q6       6.9         7.9         7.2         4.9         13.7        0.3
V$OCT_C        13.1        30.1        12.5        8.4         6.5         1
V$SRF_C        1           0.8         1.5         1.1         1.0         0.1
V$TATA_01      787.6       899.3       602.3       418.4       342.5       141.4
V$USF_Q6       15.1        5.8         25.0        12.7        40.1        19.9
V$YY_01        158.7       189.1       140.6       137.5       121.7       256.4

Matches for several matrices that are over-represented in promoters in comparison with random promoters (Table 6) are also over-represented in repeats in comparison with random repeats (Table 8). Thus in Alu repeats, matches for HNF3B, MEF2 and YY1 are clearly over-represented in comparison with random Alu repeats. Matches for these three matrices are significantly over-represented in the promoter group C.

To summarize, our analysis clearly shows that promoters possess structural characteristics that differentiate them from repetitive elements.

DISCUSSION

This study is one of the first attempts to find regularities in the distribution of potential transcriptional regulatory elements in the human genome. The results obtained may contribute to the current understanding of the organization of gene regulatory regions, in particular proximal promoters. We have demonstrated that promoters possess structural characteristics that differentiate them from random sequences as well as from other functional parts of the genome, such as surrounding sequences, exons, introns, and repetitive elements.

Examining the heterogeneity in the GC content within individual human chromosomes clearly shows some particularities for 5'-flanking sequences. Even though our set of 5'-flanking regions is not a substitute for a database of verified human promoters such as EPD [Praz et al., 2002], we consider it a collection of sequences which, compared with sets of other functional regions, is significantly enriched in promoter sequences. The results obtained by Suzuki et al. [2002] and collected in the DBTSS database showed that for the majority of genes the transcription starts practically coincide with the RefSeq annotation (for ~7500 genes out of ~8800 studied genes). For about 400 genes, the TSS is shifted in the 5'-direction, but only by ~100 bp on the genomic level; and for about 1000 genes, the TSS is shifted by 1000 bp and more in the 5'-direction. For the last group of genes, such an extended difference between the 1st exon annotated in RefSeq and that mapped by Suzuki et al. could be explained by alternative starts of transcription. Thus, we are aware that by collecting promoters according to the RefSeq annotation of the 1st exon, we are dealing in many cases with alternative promoters that are not the most upstream ones. In any case, for this study we have taken just one promoter per gene. Therefore, we consider our data set of regions potentially containing promoters to be of sufficient quality to yield valuable results. It is nevertheless intended to refine the data by repeating the analysis using information from DBTSS and by analyzing several alternative promoters of the same gene, when they are known.

We have shown that in general the immediate 5'-flanking regions (−1000/+1000 relative to the known or circumstantially evidenced transcription start site) exhibit a higher GC content than surrounding sequences, with particularly pronounced peaks of GC-content within the region −300/+300 (Figure 2 A, B). Similar results have been obtained, e.g., by Ohler et al. [2001] for Drosophila promoters. The asymmetric distribution of the GC content among the 5'-flanking regions qualitatively resembles the reported distribution for the whole human genome (see Figure 12 in [International Human Genome Sequencing Consortium, 2001]). However, there is no straight correlation between the general GC content of a chromosome and that of the 5' flanking regions of its genes (Figure 3) and, thus, no fixed offset of increased GC content of promoters in a certain sequence background.

On the basis of the elevated GC-content in comparison with surrounding sequences as well as of our previous findings [Kel et al., 2001], we have selected the regions of −300/+50 around the TSS for further analysis. We have classified promoters into four groups according to their GC-content. Within these groups, the percentage of promoters overlapping with CpG islands increases with increasing GC-content (data not shown). In the most GC-rich group D, 30% of promoters overlap with CpG islands.

To look for more specific sequence features than just the GC content, we searched for the presence of transcription factor binding site patterns, represented by positional weight matrices. We selected a set of 23 matrices from the TRANSFAC® database that cover all known types of DNA binding domains and for which representative sets of experimentally verified genomic binding sites are available. The search criteria were optimized in the same way for all patterns to ensure that at least 50% of the true sites are found.

Under these conditions, with the false negative rate equal to 50% for all matrices selected, the absolute number of matches varies in a very broad range, from 824.2 hits per 100 kb in whole chromosomes for V$TATA_01 to 0.8 for V$E2F_Q6. The number of matches obviously depends on many different characteristics, such as the GC content of the matrix and that of the sequence analyzed, the stringency of the matrix as composed of matrix width and conservation (i.e., the overall information content of a matrix; see Table 1), and the presence of CpG dinucleotides in the pattern itself. But none of these characteristics alone exhibited a clear correlation with the number of matches found.
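For orientation, the information content of a matrix is commonly computed column by column from the base frequencies and summed over the matrix width. The sketch below uses the standard definition relative to a uniform background (2 bits for a fully conserved position, 0 bits for an unconstrained one); this definition and the toy columns are assumptions for illustration and need not coincide with the values reported in Table 1.

```python
import math

def column_information(freqs):
    """Information content of one column in bits, relative to a uniform background."""
    return 2.0 + sum(p * math.log2(p) for p in freqs.values() if p > 0)

def matrix_information(columns):
    """Overall information content: the sum over all columns of the matrix."""
    return sum(column_information(column) for column in columns)

# Toy columns, invented for this example.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
unconstrained = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
two_base = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}

toy_matrix = [conserved, conserved, two_base, unconstrained]
total_bits = matrix_information(toy_matrix)
```

A longer or more strongly conserved matrix has a higher information content and is expected to match less often in random sequence, which is the sense in which stringency is used above.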

However, when examining the over- or under-representation of these matches in whole chromosomes compared to random sequences of the same GC content, four out of the five CpG-containing patterns are generally under-represented (E2F, CRE-BP1, USF, EGR1), whereas one (GC box) is clearly over-represented. The GC box is particularly highly enriched in 5'-flanking regions of all groups, compared with random sequences of identical GC content. Matches for V$GC_01 are also over-represented in exon sequences, but much less than in promoters, as could be shown for chromosome 22 (Figure 6). Moreover, GC-boxes are over-represented in all promoter groups in comparison with SINE and LINE repetitive elements (Table 9).
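
The comparison with random sequences of identical GC content can be sketched as follows. This is an illustration under stated assumptions, not the procedure of this study: random sequences are drawn base by base from a mononucleotide model, an exact overlapping word count stands in for a weight matrix search, and all names are hypothetical.

```python
import random


def gc_fraction(seq):
    """Fraction of G+C among A, C, G and T in the sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0


def random_sequence(length, gc, rng):
    """Draw independent bases so that the expected GC fraction equals `gc`."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


def count_word(seq, word):
    """Count overlapping exact occurrences of `word` (stand-in for a matrix search)."""
    return sum(seq.startswith(word, i) for i in range(len(seq) - len(word) + 1))


def observed_vs_expected(seq, count_fn, n_random=100, seed=0):
    """Return the observed count and the mean count over GC-matched random sequences."""
    rng = random.Random(seed)
    gc = gc_fraction(seq)
    observed = count_fn(seq)
    expected = sum(
        count_fn(random_sequence(len(seq), gc, rng)) for _ in range(n_random)
    ) / n_random
    return observed, expected
```

A pattern would be called over-represented when the observed count clearly exceeds the expected one, and under-represented in the opposite case; a usage example is `observed_vs_expected(region, lambda s: count_word(s, "GGGCGG"))`, where the hexamer is only an example word. A mononucleotide model ignores dinucleotide effects such as CpG depletion, which is one reason why CpG-containing patterns can deviate from this kind of expectation.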

The TATA box pattern is slightly over-represented in all chromosomal groups when compared with the corresponding random sequences. It is even more significantly over-represented in promoter sequences with GC-content higher than 0.4 (groups B, C, D). Matches for V$TATA_01 are significantly enriched in promoters versus exons, as shown for chromosome 22 (Figure 6). Matches for the TATA-box are over-represented in the promoter groups B and C in comparison with L2 and Alu elements, respectively (Table 9). Thus, both the GC-box and the TATA-box patterns, one GC-rich and the other AT-rich, are enriched in promoters in comparison with random sequences, repetitive elements and exons. The statistical enrichment of proximal promoters in GC-box and TATA-box patterns is in good agreement with their well-established function in transcription regulation.

It is interesting to note that certain TF binding patterns are clearly over-represented in upstream sequences in general, regardless of whether the putative promoter regions are AT- or GC-rich. Others seem to be more indicative of one promoter subclass or the other, which is not due to the a priori expectation from the GC content. This observation may give a first hint that the straightforward classification of promoters according to their GC content may have a functional correlate, which is reflected by distinct TF binding site compositions. We made a similar observation previously when analyzing different functional groups of promoters. We have shown that immune cell-specific promoters, which are characterized by an elevated AT content, have a high frequency of NFAT composite elements that provide the basic functionality of these promoters [Kel et al., 1999]. The GC-rich cell cycle-specific promoters, in turn, appeared to be highly enriched in binding sites for E2F, which is known as the key regulator of cell cycle gene expression [Kel et al., 2001].

Fig. 6. Frequencies of pattern matches in the 5' flanks versus exons for the genes in chromosome 22. 5'-flanks and exons are compared with random sequences of the corresponding GC content, that is 0.58 for 5' flanks and 0.54 for exons.


When searching for potential transcription factor binding sites in all chromosomal sequences or, more specifically, in 5'-flanking regions, for instance by a standard weight matrix approach, it cannot be claimed that all matrix matches are binding sites, even when applying very restrictive search conditions which recover only half of the experimentally known binding sites. However, these patterns and their accumulation in certain chromosomal regions may be indicative of a certain function. For instance, it is known that certain regulatory elements may occur in a highly redundant manner, or that the presence of mutilated binding sites may support the function of a real one. It seems conceivable that a favorable sequence background facilitates the evolution of efficiently working regulatory elements out of a reservoir of potential sites, some of which may lose their function later during evolution.

We are presently extending our analyses to additional patterns, to homo- and heterogeneous clusters of TF binding sites, and investigating the specificity of the individual patterns by complementing the results obtained with randomized chromosomes and functional regions by systematic shuffling of the patterns as well. Also, it will be interesting to investigate heterogeneities in the 5'-flanking regions sorted according to the gene classification proposed by the Gene Ontology group [The Gene Ontology Consortium, 2001].

REFERENCES

• Bajic, V. B., Seah, S. H., Chong, A., Zhang, G., Koh, J. L. and Brusic, V. (2002). Dragon Promoter Finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics 18, 198–199.

• Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A. and Wheeler, D. L. (2002). GenBank. Nucleic Acids Res. 30, 17–20.

• Chen, Q. K., Hertz, G. Z. and Stormo, G. D. (1995). MATRIX SEARCH 1.0: a computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comput. Appl. Biosci. 11, 563–566.

• Davuluri, R. V., Grosse, I. and Zhang, M. Q. (2001). Computational identification of promoters and first exons in the human genome. Nat. Genet. 29, 412–417.

• Fickett, J. W. and Hatzigeorgiou, A. G. (1997). Eukaryotic promoter recognition. Genome Res. 7, 861–878.

• Frech, K., Danescu-Mayer, J. and Werner, T. (1997). A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J. Mol. Biol. 270, 674–687.

• Frech, K., Quandt, K. and Werner, T. (1998). Muscle actin genes: a first step towards computational classification of tissue specific promoters. In Silico Biol. 1, 0005. http://www.bioinfo.de/isb/1998/01/0005/

• Goessling, E., Kel-Margoulis, O. V., Kel, A. E. and Wingender, E. (2001). MATCH™ — a tool for searching transcription factor binding sites in DNA sequences. Application for the analysis of human chromosomes. In: Proceedings of the German Conference on Bioinformatics GCB 2001, Wingender, E., Hofestädt, R. and Liebich, I. (eds.), Braunschweig, pp. 158–161. http://www.bioinfo.de/isb/gcb01/poster/goessling.html

• International Human Genome Sequencing Consortium (2001). Initial sequencing and analysis of the human genome. Nature 409, 860–921.

• Kel, O. V., Romaschenko, A. G., Kel, A. E., Wingender, E. and Kolchanov, N. A. (1995). A compilation of composite regulatory elements affecting gene transcription in vertebrates. Nucleic Acids Res. 23, 4097–4103.

• Kel, A., Kel-Margoulis, O., Babenko, V. and Wingender, E. (1999). Recognition of NFATp/AP-1 composite elements within genes induced upon the activation of immune cells. J. Mol. Biol. 288, 353–376.

• Kel, A. E., Kel-Margoulis, O. V., Farnham, P. J., Bartley, S. M., Wingender, E. and Zhang, M. Q. (2001). Computer-assisted identification of cell cycle-related genes: new targets for E2F transcription factors. J. Mol. Biol. 309, 99–120.

• Kel, A. E., Gössling, E., Reuter, I., Cheremushkin, E., Kel-Margoulis, O. V. and Wingender, E. (2003). MATCH™: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res. 31, 3576–3579.

• Kel-Margoulis, O. V., Kel, A. E., Reuter, I., Deineko, I. V. and Wingender, E. (2002). TRANSCompel: a database on composite regulatory elements in eukaryotic genes. Nucleic Acids Res. 30, 332–334.

• Kolchanov, N. A., Ignatieva, E. V., Ananko, E. A., Podkolodnaya, O. A., Stepanenko, I. L., Merkulova, T. I., Pozdnyakov, M. A., Podkolodny, N. L., Naumochkin, A. N. and Romashchenko, A. G. (2002). Transcription Regulatory Regions Database (TRRD): its status in 2002. Nucleic Acids Res. 30, 312–317.

• Matys, V., Fricke, E., Geffers, R., Goessling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A. E., Kel-Margoulis, O. V., Kloos, D.-U., Land, S., Lewicki-Potapov, B., Michael, H., Münch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., Wingender, E. (2003). TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374–378.

• O’Donovan, C., Martin, M.J., Gattiker, A., Gasteiger, E., Bairoch, A. and Apweiler, R. (2002). High-quality protein knowledge resource: SWISS-PROT and TrEMBL. Brief Bioinform. 3, 275–284.

• Ohler, U., Niemann, H., Liao, G.-c. and Rubin, G. M. (2001). Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17, S199–S206.

• Pickert, L., Reuter, I., Klawonn, F. and Wingender, E. (1998). Transcription regulatory region analysis using signal detection and fuzzy clustering. Bioinformatics 14, 244–251.

• Povey, S., Lovering, R., Bruford, E., Wright, M., Lush, M. and Wain, H. (2001). The HUGO Gene Nomenclature Committee (HGNC). Hum. Genet. 109, 678–680.

• Praz, V., Perier, R., Bonnard, C. and Bucher, P. (2002). The Eukaryotic Promoter Database, EPD: new entry types and links to gene expression data. Nucleic Acids Res. 30, 322–324.

• Quandt, K., Frech, K., Karas, H., Wingender, E. and Werner, T. (1995). MatInd and MatInspector — New fast and sensitive tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23, 4878–4884.

• Reuter, I. (2000). Dissertation, TU Braunschweig. http://www.biblio.tu-bs.de/ediss/data/20000317a/20000317a.html

• Scherf, M., Klingenhoff, A. and Werner, T. (2000). Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol. 297, 599–606.

• Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Redaschi, N., Stoehr, P., Tuli, M.A., Tzouvara, K. and Vaughan, R. (2002). The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 30, 21–26.

• Suzuki, Y., Yamashita, R., Nakai, K. and Sugano, S. (2002). DBTSS: DataBase of human transcriptional start sites and full-length cDNAs. Nucleic Acids Res. 30, 328–331.

• Tateno, Y., Imanishi, T., Miyazaki, S., Fukami-Kobayashi, K., Saitou, N., Sugawara, H. and Gojobori, T. (2002). DNA Data Bank of Japan (DDBJ) for genome scale research in life science. Nucleic Acids Res. 30, 27–30.

• The Gene Ontology Consortium. (2001). Creating the gene ontology resource: design and implementation. Genome Res. 11, 1425–1433.

• Wheeler, D. L., Church, D. M., Lash, A. E., Leipe, D. D., Madden, T. L., Pontius, J. U., Schuler, G. D., Schriml, L. M., Tatusova, T. A., Wagner, L. and Rapp, B. A. (2002). Database resources of the National Center for Biotechnology Information: 2002 update. Nucleic Acids Res. 30, 13–16.

• Wingender, E. (1997). Classification scheme of eukaryotic transcription factors. Mol. Biol. Engl. Tr. 31, 483–497.

• Wingender, E., Karas, H. and Knüppel, R. (1996). TRANSFAC Database as a Bridge between Sequence Data Libraries and Biological Function. Pacific Symposium on Biocomputing ’97 (PSB’97), R. B. Altman, A. K. Dunker, L. Hunter, T. E. Klein (eds.). World Scientific, Singapore – New Jersey – London – Hong Kong 1996, pp. 477–485.