Nucleosomal DNA property database

Victor G. Levitsky, Mikhail P. Ponomarenko, Julia V. Ponomarenko, Anatoly S. Frolov and Nikolay A. Kolchanov

Laboratory of Theoretical Genetics, Institute of Cytology & Genetics, 630090, Lavrentieva 10, Novosibirsk, Russia

Received on November 30, 1998; revised on March 11, 1999; accepted on April 22, 1999

BIOINFORMATICS, Vol. 15 nos 7/8 1999, Pages 582-592. © Oxford University Press 1999

Abstract

Motivation: Chromatin structure plays a crucial role in proper gene functioning. Therefore, it is very important to investigate nucleosomal DNA properties and recognize genome nucleosome positioning sequences. Nevertheless, applying different sequence analysis methods separately is insufficient for a complete description of nucleosomal DNA. One of the most probable reasons for that is the weakness of nucleosome positioning signals. The present paper offers a set of methods to reveal the most important nucleosomal DNA characteristics and to show a common pattern of nucleosome site properties.

Results: A complex approach was used to determine the conformational and physicochemical properties that are most significant for nucleosome binding site description. The integrated database of nucleosomal DNA properties is compiled. This database comprises different sections for the description of DNA characteristics. Revealing significant DNA characteristics allows the classification of various samples of site sequences and the generation of programs for site recognition.

Availability: The current version of the database is available at http://wwwmgs.bionet.nsc.ru/system/BDNAvideo/. C-code of the recognition program may be found in the section FEATURE. WWW-available programs for testing arbitrary sequences are accessible at http://wwwmgs.bionet.nsc.ru/Programs/bDNA/NA_bDNA.htm/. The links to the mirror site(s) can be found at http://wwwmgs.bionet.nsc.ru/mgs/links/mirrors.html.

Contact: [email protected]

Introduction

Chromatin structure strongly influences the transcription of eukaryotic genes. The nucleosome, an elementary repeating subunit of the chromatin, may collaborate with numerous regulatory proteins to provide proper regulation of gene expression (for a review, see van Holde, 1989). The nucleosome core particle (histone octamer) is composed of a central tetramer (H3/H4)2 flanked by two H2A/H2B dimers. The nucleosome is composed of a histone octamer and 146 bp of DNA that is wrapped around it as a left-handed superhelix. The linker histone H1, binding to approximately 20 bp, completes the second turn of the superhelix and organizes a subnucleosomal particle, the chromatosome. In the chromatin, neighboring nucleosome particles are connected by a linker DNA segment of variable length.

Both transcription initiation and elongation require perturbation of the chromatin structure. In this connection, the cases in which nucleosome location is strictly confined are of special interest. These cases may be considered as rotational or translational positioning. The first refers to the orientation of the DNA helix relative to the core particle surface, so that some phased minor or major groove regions face the octamer surface. The second, translational positioning, means that the nucleosome has a precisely defined position along the DNA sequence. A positioned nucleosome is often found near gene regulatory regions; hence, it may confine the accessibility of DNA, or bring two DNA pieces close to each other (Wolffe, 1994). The nucleosome positioning signal dispersed in the eukaryotic genome may be considered as a specific chromatin code, one of many codes contained in genomic DNA (Trifonov, 1997). That code may be in superposition with the other codes, but it appeared to be necessary for proper DNA segregation and chromatin packaging.

The sequence periodicities of eukaryotic DNA that reflect its deformational anisotropy were first considered as factors facilitating chromatin folding (Trifonov and Sussman, 1980). This was strongly supported by the prominent periodicities in the distribution of the di- and trinucleotides revealed for nucleosomal DNA sequences (Satchwell et al., 1986). Multi-alphabet consensus search reveals that the nucleosomal pattern is of a very degenerate nature (Ulyanov and Stormo, 1995). Multiple sequence alignment shows that the main part of that signal is created by a specially phased AA and TT dinucleotide positional frequency distribution (Ioshikhes et al., 1996). A variety of methods were developed for investigating nucleosome positioning (Calladine and Drew, 1986; Staffelbach et al., 1994). Different approaches were proposed for the study of nucleosomal DNA conformation (Fitzgerald et al., 1994; Sivolob and Kharpunov, 1995; Ponomarenko et al., 1997b). Crystal structure analysis revealed that the path of nucleosomal DNA is kinked at several points and irregularly curved (Luger et al., 1997). All these investigations of nucleosomal DNA conformation


confirm that there exists a special pattern of DNA helix curvature facilitating core particle binding. Finally, many approaches confirm that the nucleosome site pattern consists of two regions, 50–60 bp in length, of increased bending specificity, which are divided by the central 15–20 bp region of the site, where DNA is more indifferent to bending.

Owing to its ubiquitous and degenerate nature, the nucleosome signal probably has no obvious consensus and most likely is defined by recognition of a particular DNA conformation. In general, signals of nucleosome positioning appear to be very weak; hence, a combination of different methods may enable them to be localized more reliably. In this connection, the present research is devoted to the analysis of various conformational and physicochemical parameters (38 sets in total), aimed at finding significant DNA features. To investigate nucleosomal DNA, an experimentally defined nucleosome site database was used (Ioshikhes and Trifonov, 1993). Our paper classifies significant nucleosomal DNA peculiarities and reveals their hierarchy. The compilation of the revealed DNA characteristics is presented in the Nucleosomal DNA property database. On the basis of the revealed significant nucleosomal DNA characteristics, programs for nucleosome site recognition were generated. These programs are available on the WWW and may be used for chromatin structure research and genome study.

System and methods

The Nucleosomal DNA property database comprises several sections (Figure 1). The database SAMPLES contains site sequences. The database PROPERTY includes the sets of context-dependent conformational and physicochemical properties. The database PROFILES presents nucleosomal DNA property profiles with their extrema and linear trend descriptions. The database FEATURES consists of knowledge on the most significant DNA properties and recognition programs generated from this knowledge.

Fig. 1. The scheme of the Nucleosomal DNA property database.

The Nucleosomal DNA Database (Ioshikhes and Trifonov, 1993) was used for extraction of 130 nucleosome site sequences of 200 bp in length from EMBL. It is available at http://www.embl-heidelberg.de/Services/index.html, section Molecular Biology Databases, catalog nucleosomal_dna. The retrieved site set consists of sequences of different species located in various genome regions. Owing to the heterogeneity of the site set, it was subdivided. Five individual samples, each based on a particular genome region or gene type, were generated. An additional sample of stable nucleosome binding sites was used (Widlund et al., 1997). These sequences, available at http://130.241.160.80/sequence/, are localized at the centromeric regions of mouse metaphase chromosomes. After adjusting to a common length of 108 bp, they were compiled into a separate sample. Table 1 describes the resulting six nucleosome site samples. The sequences of all samples were aligned relative to the footprint center. Because of the internal symmetry of the nucleosome, all site sequences were analyzed in both possible orientations.
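Analyzing a site in both orientations can be illustrated with a short Python sketch. It assumes that "both possible orientations" means the sequence as given together with its reverse complement; the function names and the toy sequences are hypothetical and are not taken from the database or from the authors' programs.

```python
# Complement table for the four DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def both_orientations(sequences):
    """Return every sequence followed by its reverse complement."""
    oriented = []
    for seq in sequences:
        oriented.append(seq)
        oriented.append(reverse_complement(seq))
    return oriented


# Toy sequences (assumption), far shorter than the 200 bp sites of the paper.
sample = ["AACGT", "GGATC"]
doubled = both_orientations(sample)
```

Doubling a sample in this way makes every averaged profile symmetric about the site center, which is consistent with the mirrored extrema and trends reported later in the text.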

The sequences of all samples used are available at http://wwwmgs.bionet.nsc.ru/Dbases/NSamples/auto1.exe (SAMPLES database). Thirty-eight dinucleotide parameter sets compiled in the PROPERTY database, at http://wwwmgs.bionet.nsc.ru/systems/BDNAVideo/, section DNA Conformational and Physicochemical Parameters, were considered as DNA properties. The PROPERTY database contains a description and literature reference for every property. Some properties are illustrated graphically. The majority of the considered DNA properties were described earlier (Kolchanov et al., 1998).

Table 1. Nucleosome site samples used for analysis

| Sample number | Sample name | No. of sequences | Sequence length (bp) |
|---|---|---|---|
| 1 | Moderate repeats (rRNA, tRNA, histone genes) | 24 | 200 |
| 2 | Satellite DNA | 14 | 200 |
| 3 | Virus SV40 genome | 24 | 200 |
| 4 | 5′ and 3′ region sequences of unique genes | 26 | 200 |
| 5 | Coding region sequences of unique genes | 42 | 200 |
| 6 | Stable mouse sequences | 87 | 108 |


The database PROFILES contains profiles of DNA properties and descriptions of their extrema and linear trends. To construct a profile, various sliding window sizes were applied. The best suited window size was evaluated by comparing the sample profile with the profile constructed on random sequences with the same dinucleotide content as that of the sites. The best selected profile is included in the database entry. The profiles were then checked for linear trends, and the resulting data are also compiled in the database.

The database FEATURES includes the data on the most significant DNA properties and the programs for site recognition generated on the basis of this knowledge. For this purpose, average property values were considered for a certain site region. The distributions of these values for the sets of nucleosome sites and random sequences were compared by several criteria. Afterwards, a fixed utility value was assigned to each comparison and these values were compared.

Algorithm

Input data for algorithm execution are:

1. Consider a sample of $N$ nucleotide sequences $\{S\} = \{S_1, \ldots, S_n, \ldots, S_N\}$ of length $L$ bp such that $S_n = X_1 X_2 \ldots X_L$, where $1 \le n \le N$, and $X$ denotes a nucleotide type.

2. For 16 dinucleotide types $\{D_1, \ldots, D_{16}\}$, 38 sets of dinucleotide parameters $d^{(k)} = \{d^{(k)}_1, \ldots, d^{(k)}_{16}\}$, where $1 \le k \le 38$, were used.

Profiles section

The first step of the algorithm is the calculation of the position-specific dinucleotide frequency matrix $F_{i,j}$, where $i$ denotes a position along a sequence ($1 \le i \le L - 1$) and $j$ denotes a dinucleotide type ($1 \le j \le 16$). This matrix was calculated as follows:

$$F_{i,j} = \frac{1}{N} \sum_{n=1}^{N} f^{(n)}_{i,j}, \quad \text{so that for every } i: \ \sum_{j=1}^{16} F_{i,j} = 1$$

Here $f^{(n)}_{i,j}$ is the observed frequency of the dinucleotide $D_j$ in the $i$th position of the $n$th sequence of the sample, i.e. $f^{(n)}_{i,j} = 1$ if the dinucleotides $X_i X_{i+1}$ and $D_j$ are the same; otherwise, $f^{(n)}_{i,j} = 0$.
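The construction of the position-specific dinucleotide frequency matrix can be sketched in Python as follows. This is a minimal illustration, not the authors' C code: the alphabetical ordering of the 16 dinucleotide types, the function name and the toy sample are assumptions, and positions are 0-based in the code while the text counts from 1.

```python
from itertools import product

# The 16 dinucleotide types D_1..D_16 in an assumed alphabetical order.
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
INDEX = {d: j for j, d in enumerate(DINUCLEOTIDES)}


def frequency_matrix(sequences):
    """Return F as a list of L-1 rows, each holding 16 dinucleotide frequencies."""
    n_seq = len(sequences)
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must share the same length L")
    matrix = [[0.0] * 16 for _ in range(length - 1)]
    for seq in sequences:
        seq = seq.upper()
        for i in range(length - 1):
            # f^(n)_{i,j} is 1 for the dinucleotide observed at position i
            # and 0 for the other fifteen types; dividing by N averages it.
            matrix[i][INDEX[seq[i:i + 2]]] += 1.0 / n_seq
    return matrix


# Toy sample (assumption): four aligned sequences of length 6.
toy_sample = ["ACGTAC", "ACGTTC", "TCGTAC", "ACGAAC"]
F = frequency_matrix(toy_sample)
```

Each row of the result sums to 1 because every sequence contributes exactly one dinucleotide per position.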

To compare peculiarities of different sizes, a similar matrix $F(W)$ was calculated with several dinucleotide averaging window sizes $W = 1, \ldots, W_{max}$ as follows:

$$F_{i,j}(1) = F_{i,j} = \frac{1}{N} \sum_{n=1}^{N} f^{(n)}_{i,j}, \qquad F_{i,j}(W) = \frac{1}{W} \sum_{r=0}^{W-1} F_{i+r,j}(1) \qquad (1)$$

The matrix $F_{i,j}(W)$ and the parameter set $d^{(k)}$ produce the profile $\{P^{(k)}_i(W)\}$ of property $k$ and window size $W$ as follows:

$$P^{(k)}_i(W) = \sum_{j=1}^{16} F_{i,j}(W) \, d^{(k)}_j \qquad (2)$$

A value of 10 bp was used as the upper limit of the window size $W_{max}$.
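Equations (1) and (2) can be sketched together in Python. The functions below take a frequency matrix as a list of rows and one parameter set as a list of values; the function names are hypothetical, and the two-column toy matrix stands in for the real 16-column matrix purely to keep the numbers easy to follow.

```python
def window_average(F, W):
    """Equation (1): average W consecutive rows of F (len(F) - W + 1 rows result)."""
    if W < 1 or W > len(F):
        raise ValueError("window size must lie between 1 and the number of rows")
    n_types = len(F[0])
    return [
        [sum(F[i + r][j] for r in range(W)) / W for j in range(n_types)]
        for i in range(len(F) - W + 1)
    ]


def property_profile(F, d, W=1):
    """Equation (2): P_i(W) is the frequency-weighted sum of parameter values d_j."""
    return [sum(f * dj for f, dj in zip(row, d)) for row in window_average(F, W)]


# Toy input (assumption): three positions, two dinucleotide types.
toy_F = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
toy_d = [10.0, 20.0]

profile_w1 = property_profile(toy_F, toy_d, W=1)
profile_w2 = property_profile(toy_F, toy_d, W=2)
```

A matrix with $L - 1$ rows yields $L - W$ profile points, which matches the index range $1 \le i \le L - W$ used for the profile in the text.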

Profile selection was executed by calculation of a quality ratio $R = Q_{site}/Q_{random}$, where $Q$ depends on the profile window size being optimized and the indices denote the sample of site sequences and the sample of random sequences with the same dinucleotide content as the site sequences. The dependence of the quality function $Q$ on the window size $W$ was taken as the ratio of the profile span to its standard deviation:

$$Q^{(k)}(W) = \frac{P^{(k)}_{max}(W) - P^{(k)}_{min}(W)}{\sigma^{(k)}(W)} \qquad (3)$$

Here $P^{(k)}_{max}(W)$ and $P^{(k)}_{min}(W)$ denote the maximum and minimum profile values calculated with window size $W$, and $\sigma^{(k)}(W)$ is the standard deviation of the profile $\{P^{(k)}_i(W)\}$, $1 \le i \le L - W$.

The profile with the best suited window size was included in the database entry. This profile was then examined for significant extreme values. Student's criterion was used to evaluate the significance level. A list of profile extrema with a significance level α < 0.02 was included in the database entry.
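The window selection step can be sketched as follows. The text does not state how the best window is chosen from the ratio $R$, so taking the window with the largest $R$ is an assumption here, as is the use of the population standard deviation; the function names and toy profiles are hypothetical.

```python
from statistics import pstdev


def quality(profile):
    """Equation (3): profile span divided by the profile standard deviation."""
    sigma = pstdev(profile)  # population standard deviation (assumption)
    if sigma == 0:
        return 0.0  # flat profile; returning 0 is an assumed convention
    return (max(profile) - min(profile)) / sigma


def best_window(site_profiles, random_profiles):
    """Return the window size with the largest R = Q_site / Q_random.

    Both arguments map a window size W to the profile computed with it.
    """
    ratios = {
        W: quality(site_profiles[W]) / quality(random_profiles[W])
        for W in site_profiles
    }
    return max(ratios, key=ratios.get), ratios


# Toy profiles (assumption) for two window sizes.
site = {1: [0.0, 0.0, 1.0, 1.0], 2: [0.0, 1.0, 2.0, 5.0]}
background = {1: [0.0, 0.0, 1.0, 1.0], 2: [0.0, 0.0, 1.0, 1.0]}
chosen, ratios = best_window(site, background)
```

The sketch leaves out the significance test for extrema (Student's criterion), which needs the per-sequence profile values rather than the averaged profile alone.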

The least squares method was used to find linear trends. The trend significance level was estimated by the Fisher–Snedecor criterion. Only linear trends with a significance level α < 0.05 were compiled in the PROFILES database. The trend for the $k$th DNA property over the points $\{X_i\}$ on the interval $[X_{start}; X_{end}]$ is presented in the following form:

$$Y^{(k)}(X_i) = Y^{(k)}_{mean} + K^{(k)} (X_i - X_{center}) \qquad (4)$$

Here $Y^{(k)}_{mean}$ is the mean value of the property over the region $[X_{start}; X_{end}]$, $K^{(k)}$ is the respective slope coefficient and $X_{center} = (X_{start} + X_{end})/2$ is the trend center position. If two trends (1 and 2) of a single profile overlap partially, then the following restriction is applied:

$$\left| X^{(1)}_{center} - X^{(2)}_{center} \right| > \frac{1}{2} \left[ \left( X^{(2)}_{end} - X^{(2)}_{start} \right) + \left( X^{(1)}_{end} - X^{(1)}_{start} \right) \right] \qquad (5)$$

This means that the distance between the centers of the two trends should be greater than the average trend length.

The profiles of conformational and physicochemical DNA parameters, their significant extremum points and linear trends are described in the PROFILES database.
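The trend form (4) and the restriction (5) can be sketched in Python. The sketch assumes evenly spaced points, in which case the least-squares line passes through the point (X_center, Y_mean) and formula (4) is the fitted line itself; the Fisher–Snedecor significance test is omitted, and all function names are hypothetical.

```python
def linear_trend(xs, ys):
    """Least-squares line returned as (Y_mean, K, X_center), as in equation (4)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_mean) for x, y in zip(xs, ys))
    x_center = (xs[0] + xs[-1]) / 2  # equals x_bar for evenly spaced points
    return y_mean, sxy / sxx, x_center


def trend_value(trend, x):
    """Evaluate Y(x) = Y_mean + K * (x - X_center)."""
    y_mean, slope, x_center = trend
    return y_mean + slope * (x - x_center)


def satisfies_restriction(interval_1, interval_2):
    """Restriction (5): center distance must exceed the average trend length."""
    (start_1, end_1), (start_2, end_2) = interval_1, interval_2
    distance = abs((start_1 + end_1) / 2 - (start_2 + end_2) / 2)
    return distance > ((end_2 - start_2) + (end_1 - start_1)) / 2


# Toy data (assumption): five evenly spaced dinucleotide positions.
trend = linear_trend([0.5, 1.5, 2.5, 3.5, 4.5], [1.0, 3.0, 5.0, 7.0, 9.0])
```

For the two trend regions quoted later in the text, [–44.5; –16.5] and [16.5; 44.5], the centers are 61 positions apart while the average length is 28, so `satisfies_restriction` returns True for that pair.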

Features section

An algorithm used to find nucleosomal DNA contextual features was developed earlier for functional sites (Kel et al., 1993). Let us consider the $k$th DNA parameter set for a single sequence $S$ of the sample. The mean value of the $k$th parameter over the region $[a; b]$ ($1 \le a \le b \le L$) is calculated as follows:

$$P_{k,a,b}(S) = \frac{1}{b - a} \sum_{i=a}^{b-1} P_k(X_i X_{i+1}) \qquad (6)$$
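Equation (6) translates directly into a few lines of Python. The parameter values below are a toy set (the number of A or T bases in each dinucleotide) chosen only so the result is easy to check by hand; they are not one of the 38 parameter sets of the PROPERTY database, and the function name is hypothetical. The sketch requires a < b, since a = b would give an empty sum divided by zero.

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# Toy parameter set (assumption): count of A/T bases in each dinucleotide.
toy_params = {d: float(d.count("A") + d.count("T")) for d in DINUCLEOTIDES}


def regional_mean(seq, params, a, b):
    """Equation (6): mean parameter value over dinucleotides X_i X_{i+1}, i = a..b-1.

    Positions a and b are 1-based, as in the text.
    """
    if not 1 <= a < b <= len(seq):
        raise ValueError("need 1 <= a < b <= L")
    seq = seq.upper()
    total = sum(params[seq[i - 1:i + 1]] for i in range(a, b))
    return total / (b - a)


whole = regional_mean("AACGT", toy_params, 1, 5)   # AA, AC, CG, GT
prefix = regional_mean("AACGT", toy_params, 1, 3)  # AA, AC
```

Applying the function to every sequence of a sample at fixed k, a and b gives the distribution that is later compared with the one obtained for random sequences.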

Applying equation (6) to the site sequence set $\{S\}$ at fixed $k$, $a$ and $b$ yields the distribution $P_{k,a,b}\{S\}$ for the site. Similarly, the distribution $P_{k,a,b}\{R\}$ is generated for random sequences $\{R\}$ with the same nucleotide frequencies as in the real sequences. The difference between the distributions $P_{k,a,b}\{S\}$ and $P_{k,a,b}\{R\}$ is tested for significance by using six statistical criteria. Each criterion was tested on 100 subsets $\{S_n\}$ and $\{R_n\}$ ($1 \le n \le 100$), randomly retrieved from $\{S\}$ and $\{R\}$, respectively. If the difference between the distributions $P_{k,a,b}\{S_n\}$ and $P_{k,a,b}\{R_n\}$ is significant by the $m$th criterion ($1 \le m \le 6$), then a positive value between 0 and 1 is assigned to the weight $U_{mn}(P_{k,a,b})$; otherwise, a negative value between –1 and 0 is assigned. Hence, the total number of weights $\{U_{mn}(P_{k,a,b})\}$ is 6 × 100 = 600. The generalized difference between $P_{k,a,b}\{S\}$ and $P_{k,a,b}\{R\}$ is the mean of the 600 weights:

$$U(P_{k,a,b}) = \frac{1}{600} \sum_{m=1}^{6} \sum_{n=1}^{100} U_{mn}(P_{k,a,b}) \qquad (7)$$
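The averaging in equation (7) can be sketched as below. The six statistical criteria are not specified in this section, so the sketch accepts them as caller-supplied functions that return a weight in [–1, 1]; `toy_criterion` is a placeholder threshold on the difference of means and is not one of the paper's criteria. The subset size, the seed and all names are likewise assumptions.

```python
import random
from statistics import mean


def utility(site_values, random_values, criteria, n_subsets=100,
            subset_size=None, seed=0):
    """Equation (7): mean weight over all criteria and random subsets."""
    rng = random.Random(seed)
    if subset_size is None:
        # Assumed default: half of the smaller sample, at least two values.
        subset_size = max(2, min(len(site_values), len(random_values)) // 2)
    weights = []
    for criterion in criteria:
        for _ in range(n_subsets):
            site_subset = rng.sample(site_values, subset_size)
            random_subset = rng.sample(random_values, subset_size)
            weights.append(criterion(site_subset, random_subset))
    return sum(weights) / len(weights)


def toy_criterion(site_subset, random_subset, threshold=0.5):
    """Placeholder criterion: +1 if the means differ by more than threshold, else -1."""
    gap = abs(mean(site_subset) - mean(random_subset))
    return 1.0 if gap > threshold else -1.0


# Toy feature values (assumption), e.g. regional means for 20 sequences each.
site = [12.0 + 0.01 * i for i in range(20)]
background = [11.0 + 0.01 * i for i in range(20)]

u_separated = utility(site, background, [toy_criterion] * 6)
u_identical = utility(background, background, [toy_criterion] * 6)
```

With six criteria and 100 subsets the function averages 600 weights, as in the text; a value near 1 indicates a feature that separates sites from random sequences under every criterion and in every subset.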

Thus, the calculated value $U(X_{k,a,b})$, where $X_{k,a,b}$ denotes the feature defined by the mean $P_{k,a,b}$ of equation (6), is the integral characteristic of the discriminating ability of $X_{k,a,b}$. It is called a utility and has two important properties:

$$U(X_{k,a,b}) < 0 \ \text{implies that } X_{k,a,b} \text{ falls short of significance} \qquad (8)$$

$$U(X_{k,a,b}) > U(X_{q,c,d}) \ge 0 \ \text{implies that } X_{k,a,b} \text{ is more significant than } X_{q,c,d} \qquad (9)$$

Note that the highest value of $U(X_{k,a,b})$ points to the best, in terms of utility, B-DNA feature $X_{k,a,b}$ of the site. Each conformational feature $X_{k,a,b}$ with $U(X_{k,a,b}) < 0$ is discarded by decision (8). If any two features $X_{k,a,b}$ and $X_{q,c,d}$ correlate, the feature $X_{q,c,d}$ with the lower value of $U(X_{q,c,d})$ is discarded by decision (9).
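Decisions (8) and (9) amount to a filtering pass over the candidate features, which can be sketched as follows. The text does not say how correlation between two features is measured or in what order pairs are examined, so the sketch takes a caller-supplied predicate and processes features greedily from the highest utility down; both choices, and every name and number in the example, are assumptions.

```python
def select_features(utilities, correlated):
    """Keep features with non-negative utility, dropping the weaker of correlated pairs.

    utilities:  dict mapping a feature name to its utility U.
    correlated: function (name_1, name_2) -> bool, supplied by the caller.
    """
    kept = []
    for name in sorted(utilities, key=utilities.get, reverse=True):
        if utilities[name] < 0:
            continue  # decision (8): falls short of significance
        if any(correlated(name, other) for other in kept):
            continue  # decision (9): a correlated feature with higher U is kept
        kept.append(name)
    return kept


# Hypothetical utilities and one hypothetical correlated pair.
toy_utilities = {"feature_a": 0.9, "feature_b": 0.8,
                 "feature_c": 0.6, "feature_d": -0.2}
correlated_pairs = {frozenset(("feature_a", "feature_b"))}


def toy_correlated(name_1, name_2):
    return frozenset((name_1, name_2)) in correlated_pairs


selected = select_features(toy_utilities, toy_correlated)
```

Processing in descending order of utility guarantees that, whenever two correlated features compete, the one with the higher utility has already been kept.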

Implementation

An entry of the PROFILES database is exemplified by the description of the DNA property twist, theoretically calculated by Sklenar (Karas et al., 1996), for 5′ and 3′ region nucleosome sites. The entry header (Figure 2a) contains the name of the site sample in the field SD and the link to the DNA sequences in the SAMPLES database in the field LD. The list of links to the individual property entries (profile identifiers) and the names of the properties are included in the field PI. The detailed profile description is found in the separate property entry (Figure 2b). The best selected window size (9 bp) is presented in the field PW; the average profile value 36.54 and the standard deviation 0.12 are shown in fields PA and PD, respectively. The numbers of significant extrema (two maxima and two minima) are included in field PN. Every extremum is described in the subsequent fields ET–EP. Each extremum is supplied with its profile value, position and significance level. This part of the entry contains two links to figures (fields FG), which are also shown in Figure 3a and b. The first figure explains why this particular profile window size was selected and the second shows the profile itself.

Fig. 2. The entry of the PROFILES database. (a) DNA property twist (Karas et al., 1996) for the sample of 5′ and 3′ gene region nucleosome sites. Entry header, which presents the site sample and the DNA property list. (b) Individual property entry.

Fig. 3. Charts linked to the PROFILES database entry. The DNA property twist (Karas et al., 1996) for the sample of 5′ and 3′ gene region nucleosome sites is used. (a) Window selection. Dependence of the window quality function Q(W) on window size. (b) Profile with the best selected window size. For each point, the standard deviation calculated from individual sequence profiles is given. The average value computed over all points of the profile is also shown. (c) Profile of the minimal window size (one dinucleotide) and the revealed significant linear trends (significance level α < 0.05).

Table 2. The best conformational and physicochemical features which appeared to be significant for all six samples of nucleosome sites

| Property name | Units | Sample name | Region [a; b] | Utility U | Averaged mean for SITE | Averaged mean for RANDOM |
|---|---|---|---|---|---|---|
| Probability of contact with a nucleosome core (Satchwell and Travers, 1989) | % | Repeats | –50; 50 | 0.98 | 12.48 ± 0.70 | 11.28 ± 0.50 |
| | | Satellites | –71; 71 | 0.845 | 12.30 ± 0.74 | 11.27 ± 0.42 |
| | | Virus | –62; 62 | 0.999 | 12.94 ± 0.36 | 11.28 ± 0.44 |
| | | 5′ and 3′ regions | –86; 86 | 0.993 | 12.53 ± 0.35 | 11.27 ± 0.38 |
| | | Coding regions | –26; 26 | 0.66 | 12.27 ± 0.75 | 11.31 ± 0.71 |
| | | Stable | –45; 45 | 0.675 | 12.03 ± 1.17 | 11.28 ± 0.55 |
| Tilt for DNA–protein complex (Suzuki et al., 1996) | degree | Repeats | –90; 90 | 0.992 | 0.94 ± 0.06 | 0.78 ± 0.05 |
| | | Satellites | –92; 92 | 0.935 | 0.95 ± 0.03 | 0.78 ± 0.06 |
| | | Virus | –71; 71 | 0.915 | 0.95 ± 0.08 | 0.78 ± 0.06 |
| | | 5′ and 3′ regions | –95; 95 | 0.936 | 0.92 ± 0.07 | 0.78 ± 0.05 |
| | | Coding regions | –96; 96 | 0.635 | 0.87 ± 0.08 | 0.78 ± 0.05 |
| | | Stable | –51; 51 | 0.655 | 0.54 ± 0.24 | 0.78 ± 0.13 |
| Twist [theoretically calculated by Sklenar (Karas et al., 1996)] | degree | Repeats | –85; 85 | 0.87 | 36.47 ± 0.26 | 36.14 ± 0.24 |
| | | Satellites | –93; 93 | 0.903 | 36.51 ± 0.16 | 36.14 ± 0.22 |
| | | Virus | –79; 79 | 0.945 | 36.63 ± 0.25 | 36.14 ± 0.24 |
| | | 5′ and 3′ regions | –97; 97 | 0.905 | 36.61 ± 0.15 | 36.14 ± 0.21 |
| | | Coding regions | –78; 78 | 0.89 | 36.58 ± 0.29 | 36.14 ± 0.24 |
| | | Stable | –53; 53 | 0.634 | 36.50 ± 0.70 | 36.13 ± 0.29 |

Table 3. The significant conformational and physicochemical features revealed for a sample of nucleosome sites from 5′ and 3′ gene regions

| Property name | Units | Region [a; b] | Utility U | Averaged mean for SITE | Averaged mean for RANDOM |
|---|---|---|---|---|---|
| Probability of contact with a nucleosome core | % | –86; 86 | 0.993 | 12.53 ± 0.35 | 11.27 ± 0.38 |
| Tilt for DNA–protein complex | degree | –95; 95 | 0.936 | 0.92 ± 0.07 | 0.78 ± 0.05 |
| Twist | degree | –97; 97 | 0.905 | 36.61 ± 0.15 | 36.14 ± 0.21 |
| Enthalpy change | kcal/mol | –66; 66 | 0.807 | –8.20 ± 0.35 | –8.64 ± 0.19 |
| Free energy change | kcal/mol | –90; 90 | 0.806 | –1.48 ± 0.12 | –1.61 ± 0.05 |
| Entropy change | cal/mol/K | –76; 76 | 0.78 | –21.71 ± 0.65 | –22.64 ± 0.42 |
| Propeller twist | degree | –70; 70 | 0.701 | –13.33 ± 0.58 | –12.53 ± 0.31 |
| Free DNA roll | degree | –31; 31 | 0.65 | 1.07 ± 0.40 | 0.65 ± 0.42 |
| Rise | degree | –26; 26 | 0.634 | 3.47 ± 0.06 | 3.52 ± 0.05 |
| Clash strength | angstrom | –24; 24 | 0.627 | 1.00 ± 0.10 | 1.10 ± 0.07 |

The trends are presented in the database entry in the following way. The borders of the trend region are included in the field GX. For the DNA property twist for the 5′ and 3′ region nucleosome sites, two significant trends were found, located within the site regions [–44.5; –16.5] and [16.5; 44.5]. The average property value for this region (twist equaling 36.54°) enters the field GA. For the right trend, the slope coefficient is 0.026°/bp and its 0.05 confidence interval is 0.022°/bp. These values are given in fields GK and GI, respectively. The finally compiled trend formula (4) constitutes field GF. The last field, FG, contains a link to the figure that displays all revealed trends and the profile of the minimal window size (W = 1 bp, one dinucleotide). An illustration for the DNA property twist for 5′ and 3′ region nucleosome sites is given in Figure 3c.
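As a worked instance of formula (4), the values quoted above for the right trend can be substituted directly. This assumes that the mean value of 36.54° reported in field GA applies to the right trend on [16.5; 44.5]; the endpoint values are derived here from the quoted numbers and are not taken from the database entry.

```latex
\begin{align*}
X_{center} &= \frac{16.5 + 44.5}{2} = 30.5 \\
Y(X) &= 36.54 + 0.026\,(X - 30.5) \\
Y(16.5) &= 36.54 - 0.026 \cdot 14 = 36.176 \\
Y(44.5) &= 36.54 + 0.026 \cdot 14 = 36.904
\end{align*}
```

Under this assumption the fitted twist rises by about 0.73° across the 28 positions of the right trend region.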

Figure 4 contains the profiles and all the significant linear trends for the conformational DNA property bend. The profile examples in Figure 4 show that profile trends may be very useful for a collective presentation of nucleosomal DNA. In the central 10 bp of the site, there are breaks or small falls of bend. In the intervals [–50; –10] and [+10; +50], the bend tends to decrease towards the edges of the site. In the regions [–90; –70] and [+70; +90], which enter the linker DNA, the bend angle tends to increase away from the site center.

Fig. 4. Profiles of the DNA property bend and corresponding trends with significance level α < 0.05 for all six considered nucleosome site samples. Window size is equal to one dinucleotide. Bend angle unit is degree.

The presentation of significant results in the database FEATURES is described elsewhere (Ponomarenko et al., 1997a). Consider again the DNA property twist for 5′ and 3′ gene region nucleosome sites. In this example (Table 2), the sites within the region [–97; 97] relative to the center of the site differ significantly from the sample of random sequences in the mean value of the DNA property. The utility of this feature is U = 0.905. The averaged value of the twist over the region was 36.610 ± 0.154 for the site sequences and 36.138 ± 0.211 for the random sequences. The distributions of the mean twist calculated over the considered region for the real and random sequences are compared in Figure 5a. Note the right-shift of the distribution for the real sites compared to that for the random sequences. This result confirms the previous observation (Ponomarenko et al., 1997a). The database FEATURES entry contains automatically generated C-code of a computer program that calculates the value of the considered parameter from a DNA sequence. There are also links to the executable program that constructs the profiles of the significant conformational and physicochemical features along an arbitrary DNA sequence. Recognition programs apply the list of all significant site properties revealed, so that the user may choose any of them for site recognition or apply them all to calculate the mean recognition by all significant site properties. The distributions of the mean value calculated by all significant properties for 5′ and 3′ gene region nucleosome sites and random sequences are compared in Figure 5b.

Fig. 5. Histograms of the mean values of the revealed significant features of the nucleosome sites (black columns) and the random sequences (white columns). (a) DNA property twist (Karas et al., 1996) for the sample of 5′ and 3′ gene region nucleosome sites. (b) All significant features found for the sample of 5′ and 3′ gene region nucleosome sites used for discrimination. (c) All significant features found for the united set of nucleosome sites used (the whole set includes samples 1–5 from Table 1).

The information accumulated in the database FEATURES may be useful for studying nucleotide sequences of interest. The database FEATURES also contains the link to the executable program. Via this link, a user may input a DNA sequence from the known databases or files and obtain a profile of a DNA feature of interest. Consider the significant feature revealed for 5′ and 3′ gene region nucleosome sites for the parameter twist on the region [–97; 97] relative to the site center. The value profile of this feature along the sequence of the human pS2 gene (EMBL: X05030; HSPS2G1) with experimentally determined nucleosome binding sites (Sewack and Hansen, 1997) is shown in Figure 6a. The mean recognition profile for this gene is given in Figure 6b. Figure 6 shows that the sequence has the highest feature values in the two regions of nucleosome binding as compared to the internucleosomal and neighboring regions.
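Computing the profile of a feature along an arbitrary sequence can be sketched as a sliding mean of dinucleotide parameter values. This is an illustration of the idea only and not the authors' generated C-code: the parameter values are a toy set, the function names are hypothetical, and the window is given as a number of dinucleotide steps chosen by the caller rather than derived from the region [–97; 97].

```python
from itertools import product

DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]

# Toy parameter set (assumption): count of A/T bases in each dinucleotide.
toy_params = {d: float(d.count("A") + d.count("T")) for d in DINUCLEOTIDES}


def sliding_mean(values, window):
    """Mean of every run of `window` consecutive values, via a running total."""
    if window < 1 or window > len(values):
        raise ValueError("window must lie between 1 and len(values)")
    total = sum(values[:window])
    means = [total / window]
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def feature_profile(seq, params, window):
    """Feature value at each window position along an arbitrary DNA sequence."""
    seq = seq.upper()
    values = [params[seq[i:i + 2]] for i in range(len(seq) - 1)]
    return sliding_mean(values, window)


# Toy sequence (assumption): an A-rich half followed by a C-rich half.
profile = feature_profile("AAAACCCC", toy_params, window=3)
```

Peaks of such a profile mark the positions where the sequence most resembles the training sites in the chosen feature; averaging the profiles of several features corresponds to the mean recognition described in the text.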

In order to assess the predictive capabilities of our program and compare it with other analogous programs, we compiled two sets of human regulatory sequences. The first set contains 53 promoter sequences extracted from EPD (region [–600; +600] relative to the transcription start). The second set consists of donor splicing site sequences (region [–400; +400] relative to the exon–intron boundary). We expect that both sets have special patterns of the average profiles of the nucleosome site recognition function. The constructed average profile for promoters is shown in Figure 7a. Both profile sections remote from the transcription start (distance >400 bp) are characterized by high values of the recognition function. Within the upstream region [–200; +1] there is a steep fall, and in the downstream region [+100; +400] a less steep growth of the function is observed. The minimal value of the profile is located downstream of the transcription start point (region [+1; +100]). The average profile for the splicing site is given in Figure 7b. This profile indicates that introns, which lack the burden of the genetic code, may bind nucleosomes better than exons. This supports the earlier assumption about the role of introns in chromatin folding (Solovyev and Kolchanov, 1985).

Discussion

Six samples of nucleosomal DNA sequences from the database SAMPLES were considered. Thirty-eight conformational and physicochemical dinucleotide parameters from the database PROPERTY were examined. Finally, for every sample, the profiles of all properties were constructed. In addition, all significant extrema and linear trends were described in the database PROFILES. Besides, each sample of sites is characterized by a specific set of significant conformational and physicochemical features in the knowledge database FEATURES.

Table 2 presents the following most significant features: probability of contact with a nucleosome core, tilt for DNA–protein complex and twist. These properties appeared to be significant for all six considered sets of nucleosome sites. The location of the region within the site, the averaged property values over this region and in random sequences, and the utilities are also indicated in the table. Among the other valuable nucleosomal DNA properties are the following: clash strength, enthalpy change, entropy change, free energy change and propeller twist. These properties are significant for almost all site samples. Table 3 shows the most significant features with the highest utilities for the set of 5′ and 3′ region nucleosome sites. This sample seems to be very interesting because of the important role of chromatin structure in the regulatory gene regions.

The database integration, which was performed in thecurrent work, is very valuable for site characteristic

V.G.Levitsky et al.

590

Fig. 6. Nucleosome site recognition by revealed significant features. Sequence of the human pS2 gene (EMBL: X05030; HSPS2G1) with twoexperimentally determined nucleosome binding sites (Sewack and Hansen, 1997) is used. For program generating, the sample of nucleosomesites from 5′ and 3′ gene regions was taken as the training set of nucleosomal DNA sequences.. Arrows indicate approximate experimentallydefined nucleosome site centers. (a) Recognition by significant DNA property twist (Karas et al., 1996). (b) Mean recognition over all significantDNA properties found.

investigation. For example, consider again the 5′ and 3′ gene region nucleosome sites and the DNA property twist. The database FEATURE states that this property is one of the most important for the sites. The twist distribution is shifted to the right for the real sites compared to that for the random sequences (Figure 5a). The database PROFILES reveals the fine structure of the twist value distribution along the site sequences. The selected twist profile (Figure 3b) has a pair

Fig. 7. Nucleosome site recognition profiles for human regulatory regions (averaged by sets of sequences). (a) Promoter region (53 sequences from EPD, profiles show positions [–500; +500] relative to transcription start). (b) Splice site region (50 sequences of >800 bp in length from EMBL aligned relative to the exon–intron boundary; every sequence has an exon in the right half).

of minima at positions –20.5 and +20.5 (see database entry in Figure 2) and a pair of maxima at positions –41.5 and +41.5. The profile (Figure 3b) shows that the central region of the site [–30; +30] has the lowest twist in comparison with the two neighboring regions [–50; –30] and [+30; +50], which have extremely high twist. Additional information is given by two revealed significant trends (Figure 3c), which are located in the regions [–44.5; –16.5] and [+16.5; +44.5]. Such a complex presentation of the property profile helps to understand the pattern of the property twist for the sample of sites. The output data of the generated program for nucleosome site recognition (Figure 6) show that the revealed significant site features may be useful for investigating DNA affinity to the nucleosome. This is strongly confirmed by the comparison of distributions of the features for real and random sequences. To check this idea, the sets of nucleosome sites were united in a set comprising five samples (1–5 in Table 1) and consisting of 130 sequences. Comparison of the real and random distributions calculated by

all revealed significant features (Figure 5c) for the united site set confirms that the recognition program gives reliable results.
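The idea of a recognition profile averaged over several significant features (as in Figure 6b) can be sketched with a sliding window. This is an illustration under explicit assumptions and not the C program distributed with the database: the `Feature` record, the z-score-like per-feature score and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    table: dict         # dinucleotide -> property value (placeholder data)
    start: int          # first step of the region, counted from the window start
    end: int            # last step of the region, inclusive
    random_mean: float  # region mean expected for random sequences
    random_sd: float    # its standard deviation for random sequences


def feature_score(window, f):
    """Assumed score: deviation of the region mean from the random expectation."""
    steps = [f.table[window[k:k + 2]] for k in range(f.start, f.end + 1)]
    return (sum(steps) / len(steps) - f.random_mean) / f.random_sd


def recognition_profile(seq, features, site_len):
    """Mean feature score for every window of site_len bases along seq.

    Every feature must satisfy 0 <= start <= end <= site_len - 2.
    """
    seq = seq.upper()
    profile = []
    for pos in range(len(seq) - site_len + 1):
        window = seq[pos:pos + site_len]
        scores = [feature_score(window, f) for f in features]
        profile.append(sum(scores) / len(scores))
    return profile


if __name__ == "__main__":
    toy = {"AA": 1.0, "AT": 3.0, "TA": 2.0, "TT": 4.0}  # artificial values
    feats = [Feature(toy, 0, 1, 2.0, 1.0), Feature(toy, 1, 1, 3.0, 1.0)]
    print(recognition_profile("AATT", feats, site_len=3))
```

Applying the same scoring to real sites and to random sequences yields two score distributions whose separation indicates how reliable the recognition is.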

In the future, we plan to develop our complex representation of nucleosome site sequences and to present the results in the integrated database Nucleosomal DNA property. The analysis scheme described above is useful for identifying significant DNA conformational features and for classifying and evaluating the utilities of various context-dependent DNA parameter sets for a sample of sequences.

Acknowledgements

We are grateful to Dr E.N.Trifonov for helpful discussions and critical comments. The work was supported by the Russian Foundation for Basic Research, the Russian Human Genome Program, the Russian State Committee on Science and Technology and the Integrated Program of the Siberian Department of the Russian Academy of Sciences.

References

Calladine,C.R. and Drew,H.R. (1986) Principles of sequence-dependent flexure of DNA. J. Mol. Biol., 192, 907–918.

Fitzgerald,D.J., Dryden,G.L., Bronson,E.C., Williams,J.S. and Anderson,J.N. (1994) Conserved patterns of bending in satellite and nucleosome positioning DNA. J. Biol. Chem., 269, 21303–21314.

Ioshikhes,I. and Trifonov,E.N. (1993) Nucleosomal DNA sequence database. Nucleic Acids Res., 21, 4857–4859.

Ioshikhes,I., Bolshoy,A., Derenshteyn,K., Borodovsky,M. and Trifonov,E.N. (1996) Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J. Mol. Biol., 262, 129–139.

Karas,H., Knuppel,R., Schulz,W., Sklenar,H. and Wingender,E. (1996) Combining structural analysis of DNA with search routines for the detection of transcription regulatory elements. Comput. Appl. Biosci., 12, 441–446.

Kel,A.E., Ponomarenko,M.P., Likhachev,E.A., Orlov,Yu.L., Ischenko,I.V., Milanesi,L. and Kolchanov,N.A. (1993) SITEVIDEO: a computer system for functional site analysis and recognition. Investigation of the human splice sites. Comput. Appl. Biosci., 9, 617–627.

Kolchanov,N.A., Ponomarenko,M.P., Ponomarenko,J.V., Podkolodnyi,N.L. and Frolov,A.S. (1998) Functional sites of pro- and eukaryotic genomes: computer modeling and predicting activity. Mol. Biol. (Mosk.), 32, 255–267.

Luger,K., Mader,A.W., Richmond,R.K., Sargent,D.F. and Richmond,T.J. (1997) Crystal structure of the nucleosome core particle at 2.8 Å resolution. Nature, 389, 251–260.

Ponomarenko,M.P., Ponomarenko,J.V., Kel,A.E. and Kolchanov,N.A. (1997a) Search for DNA conformational features for functional sites. Investigation of the TATA box. Pacif. Symp. Biocomput., 2, 340–351.

Ponomarenko,M.P., Savnikova,L.K., Ponomarenko,J.V., Kel,A.E., Titov,I.I. and Kolchanov,N.A. (1997b) Modeling TATA-box sequences in eukaryotic genes. Mol. Biol. (Mosk.), 31, 726–732.

Satchwell,S.C. and Travers,A.A. (1989) Asymmetry and polarity of nucleosomes in chicken erythrocyte chromatin. EMBO J., 8, 229–238.

Satchwell,S.C., Drew,H.R. and Travers,A.A. (1986) Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol., 191, 659–675.

Sewack,G.F. and Hansen,U. (1997) Nucleosome positioning and transcription-associated chromatin alterations on the human estrogen-responsive pS2 promoter. J. Biol. Chem., 272, 31118–31129.

Sivolob,A.V. and Kharpunov,S.N. (1995) Translational positioning of nucleosomes on DNA: the role of sequence-dependent isotropic DNA bending stiffness. J. Mol. Biol., 247, 918–931.

Solovyev,V.V. and Kolchanov,N.A. (1985) The eucaryotic genes exon-intron structure can be determined by the nucleosomes organisation of the chromatin and related characteristics of gene expression regulation. Dokl. Akad. Nauk SSSR, 284, 232–237.

Staffelbach,H., Koller,T. and Burks,C. (1994) DNA structural patterns and nucleosome positioning. J. Biomol. Struct. Dyn., 12, 301–325.

Suzuki,M., Yagi,N. and Finch,J.T. (1996) Role of base-backbone and base-base interactions in alternating DNA conformations. FEBS Lett., 379, 148–152.

Trifonov,E.N. (1997) Genetic level of DNA sequences is determined by superposition of many codes. Mol. Biol. (Mosk.), 31, 759–767.

Trifonov,E.N. and Sussman,J.L. (1980) The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl Acad. Sci. USA, 77, 3816–3820.

Ulyanov,A.V. and Stormo,G.D. (1995) Multi-alphabet consensus algorithm for identification of low specificity protein–DNA interactions. Nucleic Acids Res., 23, 1434–1440.

Van Holde,K.E. (1989) Chromatin. Springer-Verlag, New York.

Widlund,H.R., Cao,H., Simonsson,S., Magnusson,E., Simonsson,T., Nielsen,P.E., Kahn,J.D., Crothers,D.M. and Kubista,M. (1997) Identification and characterization of genomic nucleosome-positioning sequences. J. Mol. Biol., 267, 807–817.

Wolffe,A.P. (1994) Nucleosome positioning and modification: chromatin structures that potentiate transcription. Trends Biochem. Sci., 19, 240–244.