Download - Multi-scale coding of genomic information: From DNA sequence to genome structure and function

Transcript

Physics Reports 498 (2011) 45–188

Contents lists available at ScienceDirect

Physics Reports

journal homepage: www.elsevier.com/locate/physrep

Multi-scale coding of genomic information: From DNA sequence togenome structure and functionAlain Arneodo a,b,∗, Cédric Vaillant a,b, Benjamin Audit a,b, Françoise Argoul a,b,Yves d’Aubenton-Carafa c, Claude Thermes c

a Université de Lyon, F-69000 Lyon, Franceb Laboratoire Joliot-Curie and Laboratoire de Physique, CNRS, Ecole Normale Supérieure de Lyon, F-69007 Lyon, Francec Centre de Génétique Moléculaire, CNRS, Allée de la Terrasse, 91198 Gif-sur-Yvette, France

a r t i c l e i n f o

Article history:Accepted 24 September 2010Available online 8 October 2010editor: H. Orland

Keywords:DNA sequenceChromatinNucleosomeGenome organizationEpigeneticsTranscriptionReplicationCompositional strand asymmetryStatistical physicsHetero-polymerGeneralized worm-like-chain modelScale-invarianceMulti-fractalMulti-scale analysisWavelet transformLong-range correlationsAtomic force microscopy

a b s t r a c t

Understanding how chromatin is spatially and dynamically organized in the nucleus ofeukaryotic cells and how this affects genome functions is one of the main challenges ofcell biology. Since the different orders of packaging in the hierarchical organization ofDNA condition the accessibility of DNA sequence elements to trans-acting factors thatcontrol the transcription and replication processes, there is actually a wealth of structuraland dynamical information to learn in the primary DNA sequence. In this review, weshow that when using concepts, methodologies, numerical and experimental techniquescoming from statistical mechanics and nonlinear physics combined with wavelet-basedmulti-scale signal processing, we are able to decipher the multi-scale sequence encodingof chromatin condensation–decondensation mechanisms that play a fundamental role inregulating many molecular processes involved in nuclear functions.

© 2010 Elsevier B.V. All rights reserved.

Contents

1. Introduction............................................................................................................................................................................................. 481.1. Multi-scale coding of genomic information.............................................................................................................................. 481.2. Outline of the report ................................................................................................................................................................... 50

2. Long-range correlations in eukaryotic DNA sequences: a footprint of nucleosome packaging ........................................................ 522.1. About the controversy concerning the existence of long-range correlations in DNA sequences ......................................... 522.2. Coding DNA sequences for statistical analysis.......................................................................................................................... 53

2.2.1. DNA walk representation............................................................................................................................................ 532.2.2. DNA bending profile .................................................................................................................................................... 54

∗ Corresponding author at: Laboratoire Joliot-Curie and Laboratoire de Physique, CNRS, Ecole Normale Supérieure de Lyon, F-69007 Lyon, France.E-mail addresses: [email protected] (A. Arneodo), [email protected] (C. Vaillant), [email protected] (B. Audit),

[email protected] (F. Argoul), [email protected] (Y. d’Aubenton-Carafa), [email protected] (C. Thermes).

0370-1573/$ – see front matter© 2010 Elsevier B.V. All rights reserved.doi:10.1016/j.physrep.2010.10.001

46 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

2.3. Mastering the mosaic structure of DNA walks with the wavelet transform microscope ...................................................... 542.4. Application of the wavelet transform modulus maxima method to multifractal analysis of DNA sequences..................... 55

2.4.1. DNA walks are monofractal ........................................................................................................................................ 552.4.2. About the Gaussian nature of the fluctuations in DNA walk landscapes................................................................. 56

2.5. Long-range correlations in DNA sequences do not originate from genome plasticity .......................................................... 572.5.1. Uncovering long-range correlations in coding DNA sequences ............................................................................... 572.5.2. Nucleotide composition effects on the long-range correlation properties of human genes .................................. 57

2.6. Towards a structural interpretation of long-range correlations in DNA sequences............................................................... 602.6.1. Investigating long-range correlations in fully sequenced genomes......................................................................... 602.6.2. Evidencing long-range correlations between DNA bending sites in eukaryotic organisms................................... 602.6.3. Long-range correlations in eukaryotic DNA: a signature of the nucleosomal structure......................................... 612.6.4. Long-range correlations in genomic DNA: a requisite for condensation–decondensation processes? ................. 63

3. DNA sequence effect on the nucleosomal organization of the eukaryotic chromatin fiber............................................................... 653.1. Experiments confirm the existence of long-range correlations in yeast nucleosomal occupancy profiles.......................... 65

3.1.1. In vivo nucleosome occupancy data........................................................................................................................... 653.1.2. In vitro nucleosome occupancy data .......................................................................................................................... 66

3.2. Grand canonical modeling of nucleosome occupancy landscape............................................................................................ 673.2.1. Nucleosome occupancy profile ................................................................................................................................... 673.2.2. Nucleosome formation energy ................................................................................................................................... 683.2.3. In vitro data validate theoretical nucleosome occupancy predictions..................................................................... 703.2.4. In vivo nucleosome ordering: organizing via excluding? ......................................................................................... 71

3.3. Functional location of genomic energy barriers and nucleosome-free regions ..................................................................... 713.3.1. Yeast gene data ............................................................................................................................................................ 723.3.2. In vitro nucleosome-free regions ............................................................................................................................... 723.3.3. In vivo nucleosome-free regions ................................................................................................................................ 733.3.4. Nucleosome-free regions are evolutionary conserved.............................................................................................. 73

3.4. Thermodynamics of intragenic nucleosome ordering ............................................................................................................. 743.4.1. Yeast genes display a highly organized nucleosomal architecture in vivo.............................................................. 743.4.2. Two classes of intragenic chromatin architecture: crystal-like versus bi-stable structures .................................. 753.4.3. Intragenic chromatin architecture conforms to equilibrium statistical ordering principles.................................. 77

3.5. Distinct modes of gene regulation by chromatin nucleosomal organization ......................................................................... 813.5.1. Transcription initiation regulation by promoter nucleosome occupancy ............................................................... 813.5.2. A novel strategy of transcription regulation by intragenic nucleosome ordering .................................................. 813.5.3. Promoter versus intragenic nucleosome organization: distinct modes of gene regulation? ................................. 84

3.6. ‘‘Intrinsic’’ versus ‘‘extrinsic’’ nucleosome positioning ............................................................................................................ 853.6.1. Transcription factors ................................................................................................................................................... 853.6.2. ATP-dependent chromatin remodeling factors ......................................................................................................... 85

4. Atomic force microscopy imaging of sequence effect on the first step of DNA compaction inside eukaryotic nuclei .................... 864.1. Motivations ................................................................................................................................................................................. 864.2. Probing persistence in DNA curvature properties with atomic force microscopy ................................................................. 87

4.2.1. Materials and methods................................................................................................................................................ 874.2.2. Defining the experimental protocol appropriated to characterize intrinsic structural disorder ........................... 884.2.3. Experimental check of 2D thermodynamic equilibrium conditions ........................................................................ 924.2.4. Experimental demonstration of the existence of long-range correlations in human DNA molecules .................. 94

4.3. Atomic force microscopy imaging of nucleosome positioning by genomic excluding energy barriers ............................... 964.3.1. Reconstituting and imaging nucleosomes on native DNA ........................................................................................ 974.3.2. Genomic energy barriers locally inhibit nucleosome formation.............................................................................. 994.3.3. Genomic energy barriers condition the collective assembly of neighboring nucleosomes consistently with

equilibrium statistical ordering principles ................................................................................................................ 1004.3.4. Genomic energy barriers direct the intrinsic nucleosome occupancy of regulatory sites...................................... 103

5. Low-frequency rhythms in human DNA sequences: from the detection of replication origins to the modeling of replication inhigher eukaryotes ................................................................................................................................................................................... 1045.1. Evidencing low-frequency rhythms in the GC content and compositional strand asymmetries.......................................... 104

5.1.1. GC content.................................................................................................................................................................... 1045.1.2. Strand asymmetries..................................................................................................................................................... 106

5.2. Transcription-coupled strand asymmetries in the human genome........................................................................................ 1085.2.1. Strand asymmetries in human gene sequences ........................................................................................................ 1085.2.2. Revealing the bifractality of human skew DNA walks with the wavelet transform modulus maxima method... 1105.2.3. Transcription-induced step-like skew profiles in the human genome.................................................................... 1115.2.4. From skew multifractal analysis to the detection of genes expressed in the germline .......................................... 112

5.3. Replication-associated strand asymmetries in mammalian genomes.................................................................................... 1135.3.1. Replication-associated strand asymmetries in prokaryotic genomes: the replicon model ................................... 1145.3.2. Analysis of strand asymmetries around experimentally determined replication origins in the human genome 1145.3.3. Conservation of replication-associated strand asymmetries in mammalian genomes .......................................... 1165.3.4. Factory-roof skew profiles in the human genome .................................................................................................... 117

5.4. A wavelet-based method to detect putative replication origins ............................................................................................. 118

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 47

5.5. A model of replication in mammalian genomes....................................................................................................................... 1206. Disentangling transcription- and replication-associated strand asymmetries reveals a remarkable gene organization in the

human genome ....................................................................................................................................................................................... 1236.1. A wavelet-basedmethodology to disentangle the replication and the transcription contributions to the compositional

strand asymmetries in replication N-domains ......................................................................................................................... 1236.1.1. A working model of mammalian ‘‘factory roof’’ skew profiles ................................................................................. 1236.1.2. A space-scale pattern matching method to identify replication N-domains .......................................................... 1236.1.3. Disentangling transcription- and replication-associated strand asymmetries ....................................................... 1246.1.4. Statistical analysis of replication-associated strand asymmetries ........................................................................... 1266.1.5. Statistical analysis of transcription-associated strand asymmetries ....................................................................... 1286.1.6. Correlating transcription-associated strand asymmetries to gene expression in the germline ............................ 129

6.2. DNA replication timing data corroborate in silico human replication origin predictions ..................................................... 1316.2.1. Human chromosome 6 ................................................................................................................................................ 1326.2.2. Human autosomes....................................................................................................................................................... 133

6.3. Gene organization in the detected replication N-domains...................................................................................................... 1346.3.1. Gene orientation .......................................................................................................................................................... 1346.3.2. Gene breadth expression ............................................................................................................................................ 134

6.4. Identifying several mega-base-pair long replication split-N-domains................................................................................... 1366.4.1. Detection of replication split-N-domains .................................................................................................................. 1366.4.2. Linking split-N-domains to GC-poor isochores and gene deserts ............................................................................ 137

7. From sequence analysis to the modeling of the chromatin tertiary structure ................................................................................... 1387.1. Sequence effects on the structure of the eukaryotic chromatin fiber ..................................................................................... 138

7.1.1. The fiber structure ....................................................................................................................................................... 1387.1.2. Geometrical and physical modeling of the chromatin fiber ..................................................................................... 1407.1.3. Modeling heterogeneous chromatin fibers with defects .......................................................................................... 142

7.2. Depletion effects and chromatin loop formation ..................................................................................................................... 1437.2.1. Current models for the chromatin tertiary structure................................................................................................ 1437.2.2. Chromatin loops formed by clustered polymerases.................................................................................................. 1457.2.3. Spontaneous emergence of sequence-dependent rosette-like folding of chromatin fiber .................................... 146

7.3. Genomic DNA codes for open chromatin around ‘‘master’’ replication origins in human cells ............................................ 1507.3.1. N-domain borders are hypersensitive to DNase I digestion ..................................................................................... 1507.3.2. DNA sequence codes for the accumulation of nucleosome-free regions around N-domains borders .................. 1527.3.3. DNA hypomethylation is associated with N-domain borders .................................................................................. 1547.3.4. Open chromatin regions around N-domain borders have a characteristic size ...................................................... 1547.3.5. Open chromatin around N-domain borders are potentially fragile regions involved in chromosome instability155

7.4. Master replication origins at the heart of the chromatin tertiary structure........................................................................... 1567.4.1. Recent experimental replication origins mapped on ENCODE confirm N-domain border predictions ................ 1567.4.2. N-domain borders: a subset of ‘‘master’’ replication origins at the heart of the spatio-temporal program of

replication .................................................................................................................................................................... 1567.4.3. Unifying replication foci and transcription factories ................................................................................................ 157

8. Concluding remarks ................................................................................................................................................................................ 159Acknowledgements................................................................................................................................................................................. 159Appendix A. A wavelet-based multifractal formalism: the wavelet transform modulus maxima method................................. 160A.1. The continuous wavelet transform............................................................................................................................................ 160A.2. Analyzing wavelets ..................................................................................................................................................................... 161A.3. Scanning singularities with the wavelet transform modulus maxima ................................................................................... 162A.4. The wavelet transform modulus maxima method for multifractal analysis .......................................................................... 162

A.4.1. Singularity spectrum ................................................................................................................................................... 162A.4.2. The wavelet transform modulus maxima method .................................................................................................... 163A.4.3. Monofractal versus multifractal functions................................................................................................................. 163

Appendix B. Test applications of the wavelet transform modulus maxima method on monofractal and multifractalsynthetic random signals........................................................................................................................................................................ 163B.1. Fractional Brownian signals ....................................................................................................................................................... 163B.2. Random W-cascades .................................................................................................................................................................. 165Appendix C. Modeling the effect of sequence-dependent long-range correlated structural disorder on the thermodynam-ical properties of 2D elastic chains ........................................................................................................................................................ 167C.1. The worm-like-chain model ...................................................................................................................................................... 167

C.1.1. Persistence length estimate from the mean square end-to-end distance ............................................................... 168C.1.2. Persistence length estimate from the mean projection of the end-to-end vector on the initial orientation........ 169

C.2. Revisiting polymer statistical physics to account for the presence of long-range correlated structural disorder in 2DDNA chains .................................................................................................................................................................................. 169C.2.1. The worm-like-chain model still at work for DNA chains with uncorrelated intrinsic disorder ........................... 169C.2.2. Generalized 2D worm-like-chain model for long-range correlated DNA chains .................................................... 171

Appendix D. DNA simulations of uncorrelated and long-range correlated 2D elastic chains....................................................... 171D.1. Intrinsically straight DNA........................................................................................................................................................... 172D.2. DNA chains with intrinsic uncorrelated structural disorder ................................................................................................... 172

48 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

D.3. DNA chains with long-range correlated structural disorder ................................................................................................... 172References................................................................................................................................................................................................ 173

Abbreviations

2D Two dimensional3C, 4C, 5C, high-C Chromatin conformation capture3D Three dimensionalAFM Atomic force microscopyAPTES AminopropyltriethoxysilaneATP Adenosine triphosphateCGI CpG islandCHO Chinese hamster ovaryCL Cross-linkedDHS DNase I hypersensitive siteDLA Diffusion limited aggregateDNA Deoxyribonucleic acidDNS Direct numerical simulationDPN Depleted proximal nucleosomeEM Electron microscopyEST Expressed sequence tagFACS Fluorescence activated cell sortingFARIMA Fractional autoregressive integrated mov-

ing averagefBm Fractional Brownian motionFISH Fluorescent in situ hybridizationFITC Fluorescein isothiocyanateGWLC Generalized wormlike chainHCV Hepatitis C RNA virusHMG High mobility groupHMGA High mobility group A

HU Histone-like DNA-binding proteinID InterdigitatedLINES Long interspersed nuclear elementsLRC Long-range correlationsMLS Multi-loop subcompartmentNCP Nucleosome core particleNFR Nucleosome-free regionNRL Nucleosome repeat lengthOPN Occupied proximal nucleosomeORC Origin recognition complexPCR Polymerase chain reactionPIC Pre-initiation complexRLC Rod-like-chainRNA Ribonucleic acidRW/GL Random walk/giant loopSAGE Serial analysis of gene expressionSEM Standard error of the meanSINES Short interspersed nuclear elementsSNS Small nascent strandTFS Transcription factor binding siteTOTO Thiazole orange dimerTSS Transcription start siteTTS Transcription termination siteWLC Wormlike chainWT Wavelet transformWTMM Wavelet transform modulus maxima

1. Introduction

1.1. Multi-scale coding of genomic information

The relation between the DNA primary structure and its biological function is one of the outstanding problems inmodernbiology. There are many objective reasons to believe that the functional role of the DNA sequence is not only to code forproteins, which represents less than 5% of mammalian genomes, but also to contribute to controlling the spatial structure ofDNA in chromatin. Nowadays, it is well-recognized that the dynamics of DNA folding and unfoldingwithin living cells play amajor role in regulating many biological processes, such as gene expression, DNA replication, recombination and repair [1–4]. As sketched in Fig. 1 [5], the genomic DNA of eukaryotic cells is tightly packaged into nucleosomes which constitute thebasic units of chromatin [6,7]. As experimentally detailed by high resolution X-ray analysis [8–10], each nucleosome consistsof almost two turns of DNA wrapped around an octamer of core histone proteins. An additional DNA fragment separatessuccessive nucleosomes which are disposed as beads-on-a-string along DNA. This nucleosomal array is further organizedinto successive higher order structures [4] including condensation into the 30 nm chromatin fiber and the formation ofchromatin loops, up to a full extent of condensation in metaphase chromosomes. Actually, the structure and dynamicsof chromatin are under the control of a number of mechanisms involving DNA-protein interactions which may dependupon the local sequence-dependent double-helix physical (structural, mechanical. . . ) properties. The precise influence ofthe so-called primary structure (i.e. the sequence) on the organization of chromatin at all scales remains controversial. On alocal scale, specific sequence elements have been identified to interact with protein components of chromatin. For instance,some sequence motifs that favor the formation and positioning of nucleosomes were found to be regularly spaced, e.g. the10 bp periodicity exhibited by some dinucleotides like AA/TT [11–13]. Alternatively, similar motifs were shown to presentlong-range correlations along the genome that are a signature of nucleosomes [14–16]. Other DNA regions, the scaffold ormatrix attachment regions that constitute the anchor points of chromatin loop domains, are constituted by ∼1 kbp AT-richsequence patterns [17,18]. On larger scales, the folding of the nucleosomal strings into higher-order structures has been theissue of various models involving, e.g., random packing, coiling into hierarchical helical structures (solenoids) [19–21], orloop-models [17,18,22–24], but the DNA sequence itself was not taken into account. Further experimental results suggest

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 49

Fig. 1. Hierarchical structure of eukaryotic DNA. Each DNA molecule is packed into a mitotic chromosome that is 1/50 000 shorter than its extendedlength.Source: Adapted with permission from Ref. [5].© 2003, by Nature Publishing Group.

that loopsmight be organized by the active transcription complexes [25–27]. Accordingly, gene positions and transcriptionalactivity would constitute major determinants of themicroscopic structure of chromatin that would self-organize in a ratherpredictable way: the 3D structure would then result to some extent from the DNA primary sequence.

Actually there is much more to be learned about the different stages of DNA compaction inside the cell nucleus (Fig. 1)from the DNA sequence than is commonly thought. The originality of the approach described in this report relies on the factthatwe are going to extract structural, dynamical and functional information from theprimaryDNA sequence using conceptscoming from statistical mechanics and nonlinear physics and methodologies issuing from physics and signal processing[28–32]. More precisely, we will mainly use a mathematical microscope, namely the continuous wavelet transform (WT)[28,33–55], to explore the structural complexity of signals generated from adequate codings of the DNA sequences.

At high magnification, for scales up to 20–40 kbp, we will unravel information concerning the first two orders of DNApackaging in eukaryotic nuclei, namely the linear nucleosomal array commonly called the 10 nm chromatin fiber and itscondensation into the packed nucleosomal 30 nm fiber (see the top panels in Fig. 1). On that range of rather small scales, theconcepts at work are scale-invariance, fractal, multifractal, long-range correlations (LRC) and the underlying mechanismsare the ones encountered in the physics of disordered systems [56–79]. The observation of LRC between DNA bending sites[14–16] has urged the necessity of revisiting polymer statistical physics to account for the long-range structural disorderinduced by theDNA sequence [31,80].When further developing a grand canonical physicalmodeling of nucleosomeorderingalong the 10 nm chromatin fiber [31,81,82], it is the actual role of the DNA sequence in nucleosome positioning and its rolein regulating gene expression [83] that will be elucidated. These physical modelings based on the principles of equilibriumstatistical physics will be used as a theoretical guide for in vitro imaging experiments by atomic force microscopy (AFM),aiming at studying the effect of the DNA sequence on (i) the elastic properties of genomic DNA molecules [31,84,85] and(ii) the nucleosome ordering along genomic DNA fragments [86].

At scales above 40 kbp, when decreasing the magnification of our mathematical WT microscope, some characteristicscales will emerge as the signature of the breaking of scale invariance properties (symmetry under dilations) [30,87]. Inthe second part of this report, our goal will be to show that, at these large scales (hundreds of kbp up to several Mbp),the primary sequence still contains structural and mechanical information, no longer on DNA, but rather on the 30 nm

50 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

chromatin fiber and its propensity to form loops and multi-looped structural patterns (see the lower panels in Fig. 1), thatare likely to stabilize chromatin domains of autonomous DNA replication and gene expression [30]. This studywill lead us topropose (i) some universal mechanism accounting for the self-organization of multi-looped rosettes prior to the replicationand transcription proteic complexes coming into play [30,88] and (ii) some dynamical modeling of the spatio-temporalreplication programme in the human genome, with replication first initiating at ‘‘master’’ replication origins likely specifiedby an open chromatin structure favored by the DNA sequence [30,32,89–92] followed by successive activations of secondaryorigins in a ‘‘cascade’’ or ‘‘domino’’ like model.

Altogether the results reviewed in this report will contribute to decipher the superimposition of various levels ofgenomic information encrypted in the DNA sequence. As an important component, a multi-scale coding of the hierarchicalpackaging of DNA inside eukaryotic chromatin actually contains fundamental information about the constitutive regulationof functional processes like transcription and DNA replication.

1.2. Outline of the report

The report is organized as follows:

Section 2. In this section, we report the observation that in eukaryotic genomes, sequence motifs that favor the formationof nucleosomes present particular correlation properties. These correlations, which are referred to as long-rangecorrelations (LRC), extend over large distances up to 20–40 kbp and are related to scale-invariant properties ofDNA sequences. We first explain how the LRC are identified and quantified using a mathematical microscope, theso-calledWavelet Transform (WT).We show the existence of LRC in both coding and non-coding DNAwalkswhichdiscards any understanding of these LRC in terms of genome plasticity. Then we compare eukaryotic, eubacterialand archeal sequences using various coding tables including a structural coding table based on nucleosomepositioning data. The corresponding DNA chain bending profiles display scale-invariance properties characterizedby two LRC regimes. In the 10–200 bp range, LRC are observed for eukaryotic sequences as the signature ofthe nucleosomal structure. In contrast, no LRC can be detected for eubacterial sequences. Over larger distances(&200 bp) up to few kbp, stronger LRC seem to exist in all DNA sequences and are likely to play a role in thecondensation of the nucleosomal string notably into the so-called eukaryotic 30 nm chromatin fiber.

Section 3. The recent flowering of high throughput micro-array and sequencing data has provided an unprecedentedopportunity to study nucleosome positioning along eukaryotic chromosomes. In this section, we carry out astatistical study of experimental nucleosome occupancy landscapes in the yeast genome based on power spectrumand correlation function analysis. This study confirms that the spatial organization of nucleosomes is long-rangecorrelated with characteristics similar to the LRC imprinted in the DNA sequence. Then we develop a realistic butsimple model of nucleosome assembly that relies on the computation of the free-energy cost of bending a DNAfragment of a given sequence from its natural curvature to the final superhelical structure around the histonecore. This model accounts for the nucleosome-free region (NFR) observed at yeast promoters. It also faithfullypredicts promoter strength in fly as encoded in distinct chromatin architectures characteristic of strongly andweakly expressed genes.When further extending this theoreticalmodeling, we obtain nucleosome density profilesthat remarkably reproduce the heterogeneity of the nucleosome occupancy landscapes observed in vivo in yeast,with alternation of ‘‘crystal-like’’ phases of confined regularly spaced nucleosomes and ‘‘fluid-like’’ phases ofrather diluted nonpositioned nucleosomes. Furthermore, we provide a very attractive interpretation of the NFRsexperimentally observed at gene promoters in terms of very high inhibitory energy barriers that are absent inthe energy landscape generated from uncorrelated sequences. After noticing that most yeast genes exhibit a NFRat the transcription start site (TSS) but also at the transcription termination site (TTS), corresponding in fact tothe polyadenylation site, we revisit the in vivo nucleosome occupancy probability data with special focus on theintragenic chromatin structure. We put into light a remarkably organized nucleosome distribution resulting fromthe collective confinement of nucleosomes by the inhibitory energy barriers located at both gene extremities.According to the gene size, we clearly identify ‘‘crystal’’ gene domains where gene chromatin is characterizedby a well-defined nucleosome repeat length (NRL). Interestingly, at the transition between successive crystaldomains, we observe ‘‘bistable’’ genes with a fuzzy-looking nucleosomal profile resulting from a statistical mixingof two possible crystal states, one with a rather expanded chromatin (n nucleosomes) and the other one with amore compact chromatin (n + 1 nucleosomes). Within each crystal domain, the transcription rate, and in turnthe expression level, decreases when the nucleosome compaction decreases. As compared to crystal-like genesthat present a constitutive expression level, bistable genes show a higher transcriptional plasticity and are moresensitive to chromatin regulators. Accordingly, by mean of a single nucleosome switching, bistable genes maydrastically alter their expression level in response to various stimuli. Finally, the fact that bistable genes areenriched in intron containing genes suggests that chromatin is indeed involved in the splicing regulation. As regardto previous works on transcription initiation regulation by promoter nucleosome occupancy, our results shed anew light on the role of the intra-gene chromatin structure on gene expression regulation.

Section 4. This section is devoted to two recent experimental studies that provide strong support to our structuralinterpretation of the existence of LRC in DNA sequences. The first one concerns the observation of the influenceof genomic LRC on the elastic properties of DNA fragments of the human genome deposited on mica under 2D

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 51

thermodynamical equilibrium conditions, by atomic force microscopy (AFM). As compared to the 2D equilibriumconformations of synthetic intrinsically straight DNA and of uncorrelated DNA from hepatitis C RNA virus (HCV),the conformations of human DNA clearly display a macroscopic intrinsic curvature that confirms that a persistentLRC distribution of DNA bending sites constitutes a realistic and very attractive alternative to the more popular10 bp periodicity nucleosome positioning signature. In the second experimental study, we combine AFM imagingin liquid and physical modeling of nucleosome assembly to study the effect of the DNA sequence on nucleosomepositioning. In particular, (i) by comparing the statistical positioning distribution of a single nucleosome (whichis not accessible in vivo) to the theoretical nucleosome energy profile, we characterize the ability of the sequenceto locally favor or inhibit nucleosome formation, and (ii) by comparing experimental to theoretical equilibriumnucleosomedensity profiles,we can quantify the contribution of theDNA sequence to the collective organization ofnucleosomes observed in vitro as well as in vivo. We show that the sequence signaling that prevails are high energybarriers that locally inhibit nucleosome formation as observed in vivo (NFRs). We also bring the first experimentaldemonstration that these excluding genomic energy barriers condition the collective positioning of neighboringnucleosomes. Analysis of Saccharomyces cerevisiae and human gene promoters reveals that by directing thepositioning of these energy barriers relative to the TSS, the sequence specifies the intrinsic nucleosome occupancyat regulatory sites, thereby contributing to gene regulation.

Section 5. In this section, we start unraveling the genomic information contained at large scales in DNA sequences. Firstwe show that along the human chromosomes, the GC content displays rather regular nonlinear oscillations with,among others, two main periods of 110 ± 20 kbp and 400 ± 50 kbp, that are well recognized characteristic scalesof chromatin loops and loop domains involved in the hierarchical folding of the chromatin fibers. These periodsare also remarkably similar to the size of mammalian replicons and replicon clusters. Analysis of deviations fromintrastrand equimolarities between T and A, and between G and C, the so-called TA and GC skews, corroborates theexistence of these rhythms as the footprints of replication and/or transcription mutation bias. We show that theseoscillations enlighten a remarkable cooperative gene organization. Analysis of human intron-containing genesreveals that these skews are correlated to each other and display a characteristic step-like profile exhibiting sharptransitions between transcribed (finite bias) and non-transcribed (zero bias) regions. We further use the wavelettransform modulus maxima (WTMM) method to investigate the multifractal properties of (TA + GC) skew walkprofiles along human chromosomes. This study reveals the bifractal nature of these skew profiles which involvetwo competing scale-invariant components. The first corresponds to the long-range correlated homogeneousfluctuations previously observedwith DNA structural codings. The second is associatedwith the presence of jumpsin the original (TA + GC) strand-asymmetry profile. We confirm that a majority of upward (downward) jumps inthis profile co-locate with gene TSS and TTS. We find that the skew displays rather sharp upward jumps fromnegative to positive values at the locations of experimentally identified human replication origins. Wavelet-basedmulti-scale analysis of the skew profiles along the 22 human autosomes reveals numerous sharp upward jumpsthat are good candidates as replication initiation zones and allows us to predict a thousand of them. Between twoneighboring sharp upward jumps, the skewdisplays a linear decreasing profile leading to a characteristic N-shapedpattern also observed in mouse and dog. Along this observation, we propose a model of replication in mammaliangermline cells with well positioned origins and random terminations.

Section 6. In this section, we develop a multi-scale pattern recognition methodology based on the WT of the total(TA + GC) skew S, using a best-adapted analyzing wavelet to detect N-shaped replication domains. Since theskew S contains some contribution induced by transcription (step-like blocks), we use a least-squares fittingprocedure to provide unprecedented estimates of the replication and transcription biases. When looking atindividual genes,we show that their transcriptionbias decreaseswith thedistance to theputative origins borderingthe replication N-domains. Importantly, when comparing transcription bias to expression data, we find thatthe estimated transcription bias strongly correlates to the expression level in the germline (spermatogonia) butnot in somatic lineages. Altogether, these results suggest that (i) transcription bias is indeed a consequence ofgermline transcription and (ii) putative replication origins correspond to transcriptionally active regions in manytissues. Similar results are obtained for the mouse genome which suggests that the mechanisms underlying thetranscription- and replication-associated skews are characteristics of mammalian genomes. Taking advantageof the recent availability of high resolution timing data, we show that most of the putative replication originsthat border N-domains are replicated earlier in the S phase than their surroundings whereas the central regionsreplicate late. These results bring the first experimental confirmation of these putative replication origins. Wheninvestigating further the large scale organization of human genes with respect to replication, we show that thesereplication origins, gene orientation and gene expression are not randomly distributed but on the opposite areat the heart of a strong organization of human chromosomes. Around the putative origins, genes are abundantand broadly expressed, and their transcription is co-oriented with replication fork progression. These featuresweaken progressively with the distance from putative replication origins. At the center of domains, genes arerare and expressed in few tissues. We propose that this specific organization could result from the constraints ofaccommodating the replication and transcription initiation processes at chromatin level. Finally, we report therecent discovery of a novel kind of skew domains, the so-called split-N-domains similar to N-domains at theirborders, but exhibiting at their heart a large region of null and constant skew. We show that the central regions of

52 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

these large split-N-domains (fewMbp long) correspond toheterochromatin genedeserts found in lowGC isochoresandwe propose that the initiation of replication in these regions could be random. AltogetherN-domains and split-N-domains cover more than 50% of the human genome.

Section 7. In this section, we address the issue of the role of the sequence on the chromatin tertiary structure. First, wereview the two-angle model of the 30 nm chromatin fiber and propose to let the nucleosome repeat length tovary, in order to account for the nucleosome ordering predicted by our sequence-dependent physical model. Thisshould allow us to derive the genomic map of in vitro chromatin fiber structure and compactness. By comparisonwith various models of interphase chromatin based on a multi-looped (rosette-like) structure of the 30 nm fiberinduced by the action of external factors (e.g. transcription factors), we elaborate on the very attractive alternativethat the chromatin fiber can self-organize into multi-looped patterns thanks to its heterogeneous structure. In thecrowded environment of the eukaryotic nucleus, we show that the presence of intrinsic structural defects (localopen chromatin fiber defects) predisposes the chromatin fiber to form rosette-like structures. These multi-loopedpatterns self-organize through entropy-driven clustering of these sequence-induced fiber defects by depletiveforces, prior to any external factor coming into play. This clustering is likely to favor the recruiting of proteincomplexes involved in the activation of replication and transcription. In this context, the set of ∼1000 putativereplication initiation zones identified in silico in the human genome are good candidates for some intrinsicdecondensed fiber defects. When mapping experimental and numerical chromatin mark data in replication N-domains, we find, surrounding most of the putative replication origins that replicate early in the S phase, regionsof a few hundred kbp wide that are hypersensitive to DNase cleavage, that are hypomethylated and that presenta significant enrichment in genomic energy barriers that impair nucleosome formation (NFRs). This suggeststhat these replication origins, further qualified as ‘‘master’’ origins, are specified by an open chromatin structurefavored by the DNA sequence. We propose that replication would initiate in early S phase at these privilegedlocations and the diverging replication fork progression would further trigger secondary origins, likely separatedby shorter distances (∼50–300kbp as compared to∼1Mbp formaster origins), as recently observed in the ENCODEregions, in a ‘‘domino cascade’’ manner. As structural defects (bursts of ‘‘openness’’) in the chromatin fiber, thesemaster replication origins might also be central to the tertiary structure of eukaryotic chromatin into rosette-like structures. Thus the spontaneous emergence of rosette patterns, likely maintained by the origin recognitioncomplexes (ORCs), provides a very attractive description of the so-called replication foci observed in interphasemammalian nuclei as stable structural domains of autonomous replication. Furthermore, the remarkable geneorganization discovered around the putative replication origins suggests that these rosettes contribute to thecompartmentalization of the genome into autonomous domains of gene transcription. Via the self-organizingstructural role of the replication origins, the DNA sequence might therefore code, to some extent, for the tertiarychromatin structure. When investigating evolutionary breakpoint regions (breakage of synteny) along humanchromosomes, we show that they appear more frequently near the predicted origins than in N-domain centralregions, suggesting that the distribution of large-scale rearrangements in mammals reflects a mutational biastowards fragile regions of high transcriptional activity and replication initiation. The fact that chromosomeanomalies involved in the tumoral process coincide with replication N-domain extremities raises the possibilitythat the replication origins detected in silico are potential candidate loci susceptible to breakage in some cancercell types.

This review ends with some concluding remarks in Section 8, and 3 Appendices collectingmaterial used in themain text.

2. Long-range correlations in eukaryotic DNA sequences: a footprint of nucleosome packaging

2.1. About the controversy concerning the existence of long-range correlations in DNA sequences

The possible relevance of scale invariance and fractal concepts to the structural complexity of genomic sequences hasbeen the subject of considerable increasing interest [29,93]. During the past fifteen years or so, there has been intensediscussion about the existence, the nature and the origin of LRC in DNA sequences. Different techniques including mutualinformation functions [94,95], autocorrelation functions [96–98], power spectra [95,99,100], ‘‘DNAwalk’’ representation [93,101], Zipf analysis [102–104] and entropies [105–108], were used for statistical analysis of DNA sequences. For years therehas been some permanent debate on rather struggling questions like the fact that the reported LRC might be just an artifactcaused by the compositional heterogeneity of the genome organization [29,93,96,97,109–114]. Another controversial issueis whether or not LRC properties are different for protein-coding (exonic) and non-coding (intronic, intergenic) sequences[29,93–104,111,113,115–120]. Actually, there were many objective reasons for this somehow controversial situation. Mostof the pioneering investigations of LRC in DNA sequences were performed using different techniques that all consisted inmeasuring the power-law behavior of some characteristic quantity, e.g., the fractal dimension of the DNA walk, the scalingexponent of the correlation function or the power-law exponent of the power spectrum. Therefore, in practice, they all facedthe same difficulties, namely finite-size effects due to the finiteness of the sequence [121–123] and statistical convergencethat required some precautions when averaging over many sequences [93,111]. But beyond these practical problems, therewas also a more fundamental restriction since the measurement of a unique exponent characterizing the global scaling

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 53

Table 1Trinucleotide coding tables of relative local bending properties of the DNA double helix axis. (a) PNuc table has been established from nucleosomepositioning data [134,135] and is thought to code for the local curvature properties. (b) DNase I table is based on the sensitivity of DNA fragments toDNase I [136] and more likely codes for DNA local flexibility properties. The values of theses two tables were arbitrarily assigned between 0° and 10°.

Trinucleotide Bending Trinucleotide BendingPNuc DNase I PNuc DNase I

AAA/TTT 0.0 0.1 CAG/CTG 4.2 9.6AAC/GTT 3.7 1.6 CCA/TGG 5.4 0.7AAG/CTT 5.2 4.2 CCC/GGG 6.0 5.7AAT/ATT 0.7 0.0 CCG/CGG 4.7 3.0ACA/TGT 5.2 5.8 CGA/TCG 8.3 5.8ACC/GGT 5.4 5.2 CGC/GCG 7.5 4.3ACG/CGT 5.4 5.2 CTA/TAG 2.2 7.8ACT/AGT 5.8 2.0 CTC/GAG 5.4 6.6AGA/TCT 3.3 6.5 GAA/TTC 3.0 5.1AGC/GCT 7.5 6.3 GAC/GTC 5.4 5.6AGG/CCT 5.4 4.7 GCA/TGC 6.0 7.5ATA/TAT 2.8 9.7 GCC/GGC 10.0 8.2ATC/GAT 5.3 3.6 GGA/TCC 3.8 6.2ATG/CAT 6.7 8.7 GTA/TAC 3.7 6.4CAA/TTG 3.3 6.2 TAA/TTA 2.0 7.3CAC/GTG 6.5 6.8 TCA/TGA 5.4 10.0

properties of a sequence fails to resolve multifractality [113,124], and thus provides very poor information upon the natureof the underlying LRC (if there were any). Actually, it can be shown that for homogeneous (monofractal) DNA sequence, thescaling exponents estimated with the techniques previously mentioned, can all be expressed as a function of the so-calledHurst or roughness exponent H of the corresponding DNA walk landscape [29,93,101,113,124]. H = 1/2 corresponds toclassical Brownian, i.e., uncorrelated random walks. For any other value of H , the steps (increments) are either positivelycorrelated (H > 1/2: ‘‘persistent’’ random walks) or anti-correlated (H < 1/2: ‘‘antipersistent’’ random walks) (seeAppendix B).

One of the main obstacles to LRC analysis in DNA sequences is the genuine mosaic structure of these sequences whichare well-known to be formed by ‘‘patches’’ of different underlying compositions [125–127]. When using the ‘‘DNA walk’’representation, these patches appear as trends in the DNA walk landscapes that are likely to break scale invariance [29,93,101,109–114,124,128]. Most of the techniques, e.g., the variancemethod, used for characterizing the presence of the LRC arenotwell adapted to study non-stationary sequences. There have been somephenomenological attempts to differentiate localpatchiness from LRC using specific methods such as the so-called ‘‘min–maxmethod’’ [101] and the ‘‘detrended fluctuationsanalysis’’ [120,129]. In previous work [29,113,124], we have emphasized the WT as a well suited technique to overcomethis difficulty (see Appendix A). By considering analyzing wavelets that make the WT microscope blind to low-frequencytrends, any bias in the DNAwalk can be removed and the existence of power-law correlations with specific scale invarianceproperties can be revealed accurately. In this section, we report the results of a comparative analysis of the LRC propertiesfor both DNA texts and DNA bending profiles of various eukaryotic, eubacterial and archaeal genomes [14–16,113,124,130].

2.2. Coding DNA sequences for statistical analysis

2.2.1. DNA walk representationA DNA sequence is a four-letter (A, C, G, T) text where A, C, G and T are the bases adenine, cytosine, guanine, and thymine

respectively. A popular method to graphically portray the genetic information stored in DNA sequences consists in mappingthem in some variants of n-dimensional random-walks [131–133]. In this section, we follow the strategy originally proposedby Peng et al. [101]. The so-called ‘‘DNA walk’’ analysis requires first to convert the DNA text into a binary sequence. Thiscan be done, for example, on the basis of purine (Pu = A,G) versus pyrimidine (Py = C, T) distinction, by defining anincremental variable that associates to position i the value χ(i) = 1 or −1, depending on whether the ith nucleotide of thesequence is Pu or Py. Each DNA sequence is thus represented as a string of purines and pyrimidines. Then the graph of theDNA walk is defined by the cumulative variables:

f (n) =

n−i=1

χ(i). (1)

Various mono- and dinucleotide codings have been used in the literature to investigate the possible existence of LRCproperties in the so-reconstructed DNA walks. These codings are actually inspired from the binary coding method exten-sively used by Voss [99,100] and which consists in decomposing the nucleotide sequence in four sequences correspondingto A, C, G or T, codingwith 1 at the considered nucleotide positions and 0 at the other positions. Similarly dinucleotide codingrules, e.g. AT, consists in coding with 1 at the positions of the As that are followed by a T and at the positions of the Ts thatare preceded by an A and 0 at all other positions.

54 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 2. Schematic representation of a frozen DNA chain. (a) Classical polymer representation of the double helix with an intrinsically straight axis; thebase-pair planes are parallel and each base-pair twists with respect to its neighbor by about 34.3°. (b) Intrinsic structural disorder induced by the sequence:the base-pair planes are no longer parallel and the double helix axis displays local curvature fluctuations θo(i) [3].

2.2.2. DNA bending profileTo construct DNA bending profiles that account for the fluctuations of the local double helix curvature induced by the

sequence (Fig. 2) [1–3], we have used in our original work [14–16,130], the trinucleotide model proposed in Ref. [134] (herecalled PNuc) and which was deduced from experimentally determined nucleosome positioning [135] (Table 1). To test thatour results do not come out from a simple recoding of the DNA text, we have also used the trinucleotide coding table definedin Ref. [136] (here called DNase) which is based on sensitivity of DNA fragments to DNaseI and that likely codes for the DNAlocal flexibility properties (Table 1). The ‘‘‘PNuc’’ and ‘‘DNase’’ trinucleotide coding rules are thus defined by coding thenucleotide ni at position i by the numerical value given by either one of these tables for the trinucleotide defined by thisnucleotide and its two nearest neighbors, i.e., the triplet (ni−1, ni, ni+1). A complete coding of DNA sequence is achieved byrepeating this operation for all the positions i from 2 to L − 1, where L is the overall length of the sequence.

As illustrated in Fig. 3, it is worth noticing that the differences between the ‘‘PNuc’’ and ‘‘DNase’’ tables are clearlysufficient to produce, at least at first sight, significantly different DNA bending profiles. Both the structural and mechanicalbending profiles look very irregular with some qualitative resemblancewith the landscapes obtained from the displacementof a random walker (see Appendix B.1) [15,29,137]. They both display scale invariance properties: when zooming insome part of these profiles, we recover similar fluctuating landscapes which are statistically undistinguishable from thecorresponding original profiles.

2.3. Mastering the mosaic structure of DNA walks with the wavelet transform microscope

Because of the characteristic heterogeneity of genome composition, one of themain features of DNAwalks is theirmosaicstructure [125–129]. Even though the patchy structure of DNA sequences probably contains biological information of greatimportance, it is rather cumbersome as far as LRC investigation is concerned. In Fig. 4 is shown the WT analysis of thebacteriophage λ DNA walk (Fig. 4(a)) when using the binary purine-pyrimidine coding defined in Section 2.2.1 (Eq. (1))[29,113,124]. Fig. 4(b) shows the WT space-scale representation of this DNA signal when using the first derivative of theGaussian function g(1) (Appendix A, Fig. A.1(a)) as analyzing wavelet. In Fig. 4(d,e), two horizontal cuts of Tg(1)(b, a) areshown at two different scales a = a1 = 22 and a2 = 27 that are represented by the dashed lines in Fig. 4(b). When takinginto account the characteristic size of the analyzing wavelet at the scale a = 1 corresponding to 3 nucleotides, these twoscales correspond to looking at the fluctuations of the DNA walk over a characteristic length of the order of 12 and 384

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 55

Fig. 3. (a) Cumulative bending profiles for a human DNA fragment (chromosome 21, positions 192–200 kbp). Abscissa is the position on the sequence;the curves are cumulative representations of PNuc (black) and DNase (grey) codings (in order to facilitate the comparison, the mean drift of the curves hasbeen eliminated). (b) Enlargement of the portion of the cumulative bending profiles contained in the inset of (a).Source: Adapted from Ref. [137].

nucleotides respectively. When focusing theWTmicroscope at the small scale a = a1 in Fig. 4(d), since g(1) is orthogonal toconstants (nψ = 1) (Appendix A, Eq. (A.3)), we reveal the local (high frequency) fluctuations of f (n), i.e., the local fluctuationsin purine and pyrimidine compositions over small size (∼12 nucleotides) domains. When decreasing theWTmagnificationin Fig. 4(e), we realize that these fluctuations actually occur around three successive linear trends; g(1) not being blind tolinear behavior, the WT coefficients fluctuate around non-zero constant behavior that corresponds to the slopes of thoselinear trends. Note that this result is a direct consequence of Eq. (A.4) (Appendix A). Although this phenomenon is morepronouncedwhen progressively increasing the scale parameter a towards a value that corresponds to the characteristic size(∼15000 nucleotides) of these strand biases, it is indeed present at all scales. In Fig. 4(f), at the same coarse scale a = a2 asin Fig. 4(e), the fluctuations of theWT coefficients are shown as computed with the order-2 analyzing wavelet g(2) (nψ = 2)(Appendix A, Eq. (A.3)). The WT microscope now being also orthogonal to linear behavior, the WT coefficients fluctuateabout zero and we do not see the influence of the strand bias any longer, and this at all scales (Fig. 4(c)). Furthermore, byconsidering successively g(3), g(4), . . ., we can not only remove linear trends from the DNA signal but also eliminate morecomplicated high-order polynomial trends that could be mistaken with the presence of LRC in DNA sequences [113,124]. Inthat respect the WT microscope is a very efficient tool to filter the scale-invariant component of the DNA walk landscapeand thus to characterize the possible existence of LRC properties (see Appendices A and B) [29].

2.4. Application of the wavelet transform modulus maxima method to multifractal analysis of DNA sequences

2.4.1. DNA walks are monofractalTo quantify the scale invariance properties of DNA walks, we have intensively used the WTMM method [29,138] that is

described in Appendix A together with test applications in Appendix B. In an early work [113,124,130], we have mainlyfocused on the statistical analysis of coding and noncoding sequences in the human genome. The results reported inFig. 5 correspond to the study of 2184 introns of length L ≥ 800 bp and of 226 exons of length L ≥ 600 bp selected(in the late nineties) in the EMBL data bank [15,16,130]. The length criteria used to select these sequences result fromsome compromise between the control of finite-size effects in the WTMM scaling analysis (those sequences are rathershort sequences: Lintrons ≃ 800 bp, Lexons ≃ 150 bp) and the achievement of statistical convergence (2184 introns and226 exons are rich enough statistical samples). Fig. 5(a) displays plots of the partition functions Z(q, a) (Appendix A, Eq.(A.12)), computed from theWT skeletons using g(2) (Appendix A, Eq. (A.3)) as analyzing wavelet, versus the scale parametera in a logarithmic representation. These results correspond to quenched averaging over the corresponding 2184 intronwalks(red curves) and 226 exon walks (blue curves) generated using the G mononucleotide coding. For a rather wide range of qvalues: −2 ≤ q ≤ 4, scaling actually operates over a sufficiently large range of scales for the estimate of the τ(q) exponentsto be meaningful. The τ(q) spectra obtained from linear regression fits of the intron and exon data are shown in Fig. 5(b).For both these noncoding and coding sequences, the data points fall remarkably on a straight line which indicates that theconsidered intron walks and exon walks are likely to display monofractal scaling. But the slope of the straight line obtainedfor introns H = ∂τ(q)/∂q = 0.60 ± 0.02 is definitely larger than the slope of the straight line derived for the exonsH = 0.53 ± 0.02. This is a clear indication that intron walks display LRC (H > 1/2), while exon walks look much morelike uncorrelated random walks (H ≃ 1/2) (see Appendix B). At first sight these results are in good agreement with theconclusions of previous studies concerning the existence of LRC in noncoding DNA sequences only [93,101,116,117]. Similarobservations are reported in Refs. [113,124], when investigating the largest individual introns and exons found in the EMBLdata bank.

56 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 4. WTanalysis of the bacteriophage λ genome (L = 48 502 bp). (a) DNAwalk displacement f (n) based on purine-pyrimidine distinction, vs. nucleotidedistance n. (b) WT of f (n) computed with g(1) (Appendix A, Fig. A.1(a)); Tg(1) (b, a) is coded using 256 colors from black (min|Tg(1) |) to red (max|Tg(1) |).(c) Same analysis as in (b) but with the second-order analyzing wavelet g(2) (Appendix A, Fig. A.1(b)). (d) Tg(1) (b, a = a1) vs. b for a1 = 12 (nucleotides).(e) Tg(1) (b, a = a2) vs. b for a2 = 384 (nucleotides). (f) Same analysis as in (e) but with g(2) . The scales a1 and a2 are indicated by the horizontal dashedlines in (b) and (c). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Reprinted from Ref. [29].

One of the most striking results of our WTMM analysis in Fig. 5, is the fact that the τ(q) spectra extracted for both setsof introns and exons, are well fitted by Eq. (B.8), i.e., the analytical spectrum for fBm’s (see Appendix B). Let us note thatthis remarkable finding as well as the respective values of H obtained from intronic and exonic sequences are robust withrespect to change in the considered nucleotide coding used to generate the DNA walks (see Table 2).

2.4.2. About the Gaussian nature of the fluctuations in DNA walk landscapesWithin the perspective of confirming the monofractality of DNA walks, we have studied the probability density function

(pdf) of wavelet coefficient values ρa(Tg(2)(., a)), as computed at a fixed scale a in the fractal scaling range. According to themonofractal scaling properties, one expects these pdfs to satisfy the self-similarity relationship (Appendix B, Eq. (B.9)):

aHρa(aHT ) = ρ(T ), (2)

where ρ(T ) is a ‘‘universal’’ pdf (actually the pdf obtained at scale a = 1) that does not depend on the scale parameter a. InRefs. [113,124], we have shown that when plotting aHρa vs. aHT , all the ρa curves corresponding to different scales actuallycollapse on a unique curve when using the exponent H = 0.6 for the set of human introns and H = 0.53 for the set ofhuman exons. Moreover, independently of the coding or noncoding nature of the sequences, the so-obtained universal pdfscannot be distinguished from a Gaussian distribution. As shown in Fig. 6, this result is quite general since similar GaussianWT coefficient statistics are observed on a comparable range of (small) scales in between 10 and 200 nucleotides, for both

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 57

Fig. 5. Comparative WTMM analysis of the DNA walks of 2184 introns (L ≥ 800 bp) and 226 exons (L ≥ 600 bp) in the human genome. The analyzingwavelet is g(2) (Appendix A, Fig. A.1(b)). The DNA walks are generated using the G mononucleotide coding for introns (red) and exons (blue). The reportedresults correspond to quenched averaging over our statistical samples of introns and exons respectively. (a) ⟨log2 Z(q, a)⟩ vs. log2 a for various values ofq. (b) τ(q) vs. q; the solid lines correspond to the theoretical spectrum τ(q) = qH − 1 (Appendix B, Eq. (B.8)) for fBm’s with H = 0.60 ± 0.02 (introns)and 0.53 ± 0.02 (exons). (c) h(q) = (τ (q)+ 1)/q vs. q for the coding subsequences relative to position 1 (blue circles), 2 (green circles) and 3 (red circles)of the bases within the codons; the data for the introns (red squares) are shown for comparison; the horizontal broken-lines correspond to the followingrespective values of the Hurst exponent H = 0.55, 0.58 and 0.60. (For interpretation of the references to color in this figure legend, the reader is referredto the web version of this article.)Source: Reprinted from Ref. [29].

the S. cerivisiae (H = 0.59, Fig. 6(c)) and Escherichia coli (H = 0.50, Fig. 6(d)), when using the Amononucleotide coding (seeTable 2) [15,16]. Thus, as explored through the optics of theWTmicroscope, the basic fluctuations (about the low frequencytrends due to compositional patchiness) in DNA walks are likely to have monofractal Gaussian statistics. The presence ofLRC in the human introns is in fact contained in the scale dependence of the root mean-square (rms) fluctuations of waveletcoefficients (see Appendix B):

σ(a) ∝ aH , (3)

with H = 0.60 ± 0.02 like for persistent random walk fBms (see Appendix B.1), as compared to uncorrelated random walkvalue H = 0.53 ± 0.02 ≃ 1/2 obtained for the coding sequences.

2.5. Long-range correlations in DNA sequences do not originate from genome plasticity

2.5.1. Uncovering long-range correlations in coding DNA sequencesBecause of the ‘‘period three’’ codon structure of coding DNA, it was natural to investigate separately the three

subsequences relative to the position (1, 2 or 3) of the bases within their codons [139,140]. We have build up thesesubsequences from our set of 226 human exons and we have repeated the WTMM analysis. The data obtained for thecorresponding τ(q) spectra again fall on straight lines which corroborates monofractal scaling properties for the threesubsequences. As shown in Fig. 5(c), for the subsequences relative to positions 1 and 2, we get the same slope h(q) =

∂τ(q)/∂q = 0.55±0.02, which is undistinguishable from the value obtained for the overall coding sequences. Surprisingly,the data for the subsequence relative to position 3, exhibit a slope which is clearly largerH = 0.58±0.02, i.e., a value whichis very close to the exponent H = 0.60 ± 0.02 estimated for the set of introns. This observation suggests that this thirdcoding subsequence is likely to display the same degree of LRC as noncoding sequences [140].

2.5.2. Nucleotide composition effects on the long-range correlation properties of human genesThe idea of looking for a link between the LRC properties and theGC content of the sequences results from the remark that

theWTMMmethod indeed fails to distinguish a few introns from actual exons [113]. For example, two introns of the humanfactor XIIIb subunit gene of length L = 9952 and 2874 nucleotides have the following Hölder exponent H = 0.49 ± 0.02

58 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 6. Probability distribution functions of wavelet coefficient values of ‘‘A’’ DNA walks. The analyzing wavelet is the Mexican hat g(2) (Appendix A,Fig. A.1(b)). S. cerevisiae (complete genome): (a) log2(ρa) vs. Tg(2) for the set of scales a = 12 (), 24 (), 48 (), 192 (N), 384 ( ), and 768 (•); (c) small-scaleregime: log2(aHρa(aHTg(2))) vs. Tg(2) with H = 0.59; (e) large-scale regime: log2(aHρa(aHTg(2))) vs. Tg(2) with H = 0.75. E. coli (complete genome): (b)log2(ρa) vs. Tg(2) for the set of scales a = 24 (), 48 (), 96 (), 384 (N), 768 ( ), and 1536 (•); (d) small-scale regime: log2(aHρa(aHTg(2))) vs. Tg(2) withH = 0.50; (f) large-scale regime: log2(aHρa(aHTg(2))) vs. Tg(2) with H = 0.80. Scales are in nucleotide units.Source: Reprinted from Ref. [29].

and 0.46 ± 0.02 respectively. Similarly, one large (L = 10 986) intron of the human retinoblastoma susceptibility gene hasa Hölder exponent H = 0.51 ± 0.02 which is undistinguishable from the value 1/2 for uncorrelated biased random walks.After having tried to understand to which extent these introns were different from other introns, we realized that they allcorresponded to DNA sequences with a low percentage in GC content (from 31% to 36%). As shown in Fig. 7 (first row), thestrength H of the LRC observed in both the human introns and exons definitely increases when increasing the GC content ofthe sequence under study from values close to 1/2 at low GC content (.40%) up to values larger than 0.6 at high GC content(&60%). These results clearly confirm the existence of LRC in coding and noncoding sequences [15,16,130] and make quitequestionable most of the models proposed in the literature to account for the LRC and that are based on genome plasticity(insertion-deletion, rearrangment events) [93,95,107,141–145]. Actually, because of selection pressure, these models fail toexplain the existence of LRC in exonic regions due to the strong constraints imposed by their coding properties. The resultsreported in Fig. 7 further suggest that the LRCmight be related to themechanisms underlying the ‘‘isochore structure’’ of thehuman genome [119,143]. The isochore structure of the human andmore generally mammalian genomes is manifested as auniformity of GC content within specific domains, called isochores, and appreciable scatter of the average GC content when

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 59

Fig. 7. Global estimate of the rms of WT coefficients: log2 σ(a) − 0.6 log2 a is plotted versus log2 a; the scale is expressed in nucleotide units; the thinstraight lines corresponding to uncorrelated (H = 1/2) and strongly correlated (H = 0.8) regimes are drawn to guide the eyes. (Some horizontal lineon this logarithmic representation will correspond to H = 0.6). The analyzing wavelet is g(2) (Appendix A, Fig. A.1(b)). The results correspond to someaveraging over the mononucleotide ‘‘G’’ and ‘‘C’’ codings for introns (left column) and exons (central column) of length L ≥ 600 nucleotides, and for theposition 3 nucleotides within codons (right column) for the eukaryotic organisms human, rodents, birds and drosophila from top to bottom. Introns andexons were grouped into classes of different GC content: GC% ∈ [0, 40] (dotted line), [40, 50] (dashed line), [50, 60] (thin black line) and [60, 100] (thickblack line).Source: Adapted from Ref. [130].

comparing different domains [127,146–152]. Consistently, GC dependent LRC are also observed in coding DNA sequences inrodents and birds (Fig. 7). But if the isochores are well recognized structures in warm blood vertebrates, their characteristiclength ∼105–106 nucleotides is much larger than the characteristic sizes of both introns and exons and there are not foundin Drosophila where LRC are equally observed as shown in Fig. 7 (lower row) [15,16,130]. This explains that in the latenineties, the origin of LRC in DNA sequences was still a very debated question [29,93,107,113,120,123].

60 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

2.6. Towards a structural interpretation of long-range correlations in DNA sequences

2.6.1. Investigating long-range correlations in fully sequenced genomesThe success in DNA sequencing has provided a wide range of investigations for statistical analysis of genomic sequences

[150,153–161]. The availability of fully sequenced genomes offers the possibility to study scale-invariance properties ofDNA sequences over a wide range of scales extending from tens to thousands of nucleotides. The first completely sequencedeukaryotic genome S. cerevisiae [158]was an opportunity to performa comparativewavelet analysis of the scaling propertiesdisplayed by each chromosome [14–16,130]. When looking at the scale dependence of the root-mean square σ(a) ofthe wavelet coefficient pdf computed with g(2) (Appendix A, Eq. (A.3)), for the ‘‘A’’ DNA walks in each of the 16 yeastchromosomes,we see, in Fig. 8(a), that they all present superimposable behavior.Wenotably observe the same characteristicscale that separates two different scaling regimes. Let us note that common behavior to the 16 yeast chromosomes hasalready been pointed out in Refs. [114,126]. At small scales, 20 . a . 200, weak power-law correlations are observed ascharacterized by H = 0.59 ± 0.02, a mean value which is significantly larger than 1/2. At large scales, 200 . a . 5000,strong power-law correlations with H = 0.82± 0.01 become dominant with a cutoff around 104 bp (a number which is bynomeans accurate) where uncorrelated behavior is observed. (In this section the scale parameter is expressed in nucleotideunits). The existence of these two scaling regimes is confirmed in Fig. 6(a), (c) and (e) [29], where the WT pdfs computedat different scales (Fig. 6(a)) are shown to collapse on a single curve, as predicted by the self-similarity relationship (2),provided we use the scaling exponent value H = 0.59 in the scale range 10 . a . 100 (Fig. 6(c)) and H = 0.75 in the scalerange 200 . a . 1000 (Fig. 6(e)). In the small-scale regime, the pdfs are very well approximated by Gaussian distributions(Fig. 6(c)) which recalls the results of the WTMM analysis of eukaryotic intronic and exonic sequences in Section 2.4.2.In the large-scale regime, the pdfs have stretched exponential-like tails (Fig. 6(e)). In both regimes, the fact that Eq. (2) isverified, corroborates the monofractal nature of the roughness fluctuations of the yeast DNA walks [14–16,130]. We havealso examined other eukaryotic contigs from different organisms (human, rodent, avian, plant and insect) and we haveobserved the same characteristic features as those obtained with S. cerevisiae (Table 2). In Fig. 8(c), a similar but slightlysmaller characteristic scale a∗

≃ 100 bp is clearly seen on a human contig (see also Fig. 9). Moreover, the cross-over from aH = 0.62 ± 0.01 to a H = 0.75 ± 0.02 scaling regime is remarkably robust since the data for the four ‘‘A’’, ‘‘C’’, ‘‘G’’ and ‘‘T’’DNA walks fall almost on the same curve in the range 20 . a . 2000 [14–16,130].

The striking overall similarity of the results obtained with these different eukaryotic genomes prompted us to alsoexamine the scale invariance properties of bacterial genomes. In Figs. 6(b), (d), (f) and 8(b), are reported the results obtainedfor E. coli [159] which are typical of what we have observed with fifteen other bacterial genomes (Table 2) [14–16,130].Again, there exists a well defined characteristic scale a∗

≃ 100–200 that delimits the transition to stronger LRC withH = 0.85 ± 0.02 at large scales. In order to examine if these properties actually extend homogeneously over the wholegenomes, σ(a) was calculated over a window of width l = 2000 nucleotides, sliding along the DNA walk profiles. Theresults reported in Fig. 9 clearly reveal the existence of a characteristic scale a∗

≃ 100–200 nucleotides which seems to berobust all along the corresponding DNAmolecules and this for all investigated genomes. There exists however an importantdifference between eukaryotic and bacterial genomes where uncorrelated (H = 1/2) Brownian motion-like behavior isobserved in the small-scale regime (Fig. 6(d)). But the striking feature of these data is that the strong large-scale power-lawcorrelations are present in all bacterial, archaeabacterial and eukaryotic genomes (Fig. 8; see also Fig. 11(a)) [14–16,99,100,130].

2.6.2. Evidencing long-range correlations between DNA bending sites in eukaryotic organismsWhat mechanism or phenomenon might explain the small-scale LRC in eukaryotic genomes? Their total absence in

bacterial genomes raises the possibility that they could be related to certain nucleotide arrangements in the 147 bplong DNA regions which are wrapped around histone proteins to form the eukaryotic nucleosomes [1–4,6–10]. Indeed,eubacterial genomic DNA is associated with histone-like proteins (e.g. Hu), but no nucleosome-type structure has beendetected in these organisms [162]. Following this hypothesis, we have extended the application of the WT microscope tothe investigation of the multi-scale properties of DNA bending signals that are likely to reflect the hierarchical organizationof nucleoprotein complexes. To construct a bending profile that accounts for the fluctuations of the local DNA curvature,we use the trinucleotide coding table ‘‘PNuc’’ (Table 1) [134,135]. The results of the WTMM analysis of the bending profilesof the yeast chromosomes are reported in Figs. 8(a) and 10(a). This analysis reveals striking similarities with the curvesresulting from the DNA walk analysis, in both the small-scale and the large-scale regimes. As shown in Fig. 10(a), in theformer regime, the wavelet coefficient pdfs computed at different scales again collapse on a same parabola in a semi-logarithmic representation, as predicted by the self-similarity relationship (2), when using the scaling exponent H = 0.6,the hallmark of LRC Gaussian statistics [14–16,130]. Eq. (2) also applies in the later regime when adjusting H = 0.75, butas previously observed with the mononucleotide codings in Fig. 6(e), the pdfs have now stretched exponential-like tails.These observations are not simply due to a ‘‘recoding’’ of the DNA sequences since when using the DNase coding table(Table 1) [136], we notice a significant weakening of the H exponent in the large scale regime (H ≃ 0.6) (Fig. 8(a)).

As shown in Figs. 8(b) and 10(b), the structural information that is contained in the scaling properties of the bendingprofiles obtained for E. coli can also be directly extracted from theWTMM reading of the DNA texts [14–16,130]. But there isa major difference between the scaling properties observed for eubacteria (Fig. 10(b) and Table 2) and the ones previously

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 61

Fig. 8. Global estimate of the rms ofWT coefficients: log10 σ(a)−0.6 log10 a is plotted versus log10 a; the scale is expressed in nucleotide units; the straightdashed lines corresponding to uncorrelated (H = 1/2) and strongly correlated (H = 0.80) regimes are drawn to guide the eyes. (Some horizontal linein this logarithmic representation will correspond to H = 0.6.) The analyzing wavelet is the ‘‘Mexican hat’’ wavelet g(2) (Appendix A, Fig. A.1(b)). (a) S.cerevisiae: ‘‘A’’ DNA walks of the 16 S. cerevisiae chromosomes (grey line) and of the corresponding bending profiles obtained with the PNuc () and DNase() coding tables when averaged over the 16 chromosomes. (b) E. coli: ‘‘A’’ (grey line), ‘‘T’’ (grey –––), ‘‘G’’ (black line) and ‘‘C’’ (black –––) DNA walksand the corresponding PNuc () and DNase () bending profiles. (c) Human chromosome 21: ‘‘A’’, ‘‘C’’, ‘‘G’’ and ‘‘T’’ DNA walks and corresponding PNuc andDNase bending profiles; same symbols as in (b). (d) Human chromosome 21: comparative analysis of DNA walks for all adenines (line), adenines part of adinucleotide AA (· · ·) and isolated adenines not part of a dinucleotide AA (– - –).Source: Reprinted from Ref. [137].

observed for S. cerivisiae (Fig. 10(a)) and more generally for all eukaryotic PNuc bending profiles (Fig. 8(c) and Table 2),namely the fact that in the small-scale regime no correlations are observed for prokaryotes whereas LRC are likely to be thesignature of eukaryotic nucleosomal DNA [14–16,130].

2.6.3. Long-range correlations in eukaryotic DNA: a signature of the nucleosomal structureFollowing the nucleosomal interpretation of LRC in the small-scale regime, the observation of such correlations in

archaeal genomes in Fig. 11 and Table 2 [15,16,130] is consistent with the presence in archaebacteria of structures similar tothe eukaryotic nucleosomes [163–169]. This analysis has also been extended to viral genomes. Small-scale LRC are clearlydetected in organelle genomes (data not shown) and inmost viral double-stranded DNA genomes (Herpes-, Adeno-, Papova-,Parvo- and Hepadna-viruses) as shown for Epstein-barr virus in Fig. 11 [15,16,130]. This further supports the hypothesis ofnucleosome-based LRC since nucleosomes have been observed on several classes of double-strandedDNAviruses [170–174].The Poxviridae, which are the only animal DNA viruses replicating in the cytoplasm of their host cells, code for a bacterial-type of histone-like protein [175], and no LRC are found in this scale range as shown in Fig. 11 for Melanoplus sanguinipes

62 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 9. Local estimate of the rms σ(a, x) of the WT coefficients of the ‘‘A’’ DNA walk, computed with the Mexican hat analyzing wavelet g(2) (Appendix A,Fig. A.1(b)). σ(a) is computed over a window of width l = 2000 bp, sliding along the first 106 bp of the yeast chromosome IV (a), E. coli (b) and a humancontig (c). log10 σ(a) − 2/3 log10 a is coded using 128 colors from black (min) to red (max). In this space-scale wavelet like representation, x and a areexpressed in bp units. The horizontal white dashed lines mark the scale a∗ where some minimum is observed consistently along the entire genomes:a∗

= 200 bp for S. cerevisiae, a∗= 200 bp for E. coli and a∗

= 100 bp for the human contig. (For interpretation of the references to color in this figurelegend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [29].

Fig. 10. Probability distribution functions ofwavelet coefficient values of PNuc bending profiles. The analyzingwavelet is theMexican hat g(2) (Appendix A,Fig. A.1(b)). (a) S. cerevisiae: log2(aHρa(aHT )) vs. T for the set of scales a = 12 (), 24 (), 48 (), 192 (N), 384 (), and 768 (•) in nucleotide units; H = 0.60(H = 0.75) in the small (large)-scale regime. (b) E. coli: same as in (a) but with H = 0.50 (H = 0.80) in the small (large)-scale regime.Source: Reprinted from Ref. [14].

virus. This observation is consistent with our hypothesis and suggests that the genomic DNA of these viruses is submitted topackaging process different from other animal viruses. Finally, bacteriophage genomes do not present any small-scale LRC(Fig. 11 for T4 bacteriophage) as already observed for their eubacterial hosts (Table 2). Other classes of virus genomes like the

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 63

Fig. 11. Global estimate of the rms of WT coefficients of (a) ‘‘A’’ DNA walks and (b), (c) PNuc bending profiles [137]. The various symbols correspond tothe following genomes: Archaeglobus fulgidus (squares), Epstein-Barr virus (dots), Melanoplus sanguinipes entomapoxvirus (open circles), T4 bacteriophage(stars), average over 21 single-stranded RNA viruses (open triangles) and 17 retroviruses (black triangles). Same representation as in Fig. 8.Source: Reprinted from Ref. [137].

single and double-stranded RNA viruses (to the exception of the retroviruses) are very unlikely associated to nucleosomes. Inall cases except retroviruses, we observe a total absence of small-scale LRC (Fig. 11). In the case of retroviruses, it is knownthat the integrated viral DNA is associated to nucleosomes in the cell nucleus [176]; we clearly confirm the presence ofsmall-scale LRC (H ≃ 0.57±0.02) (Fig. 11(c)). These wavelet based fractal analysis of viral and cellular genomes of all threekingdoms sustain without failure the fact that small-scale LRC provide a reliable diagnostic of the existence of eukaryoticnucleosomes (Table 2) [14–16,130].

To further investigate this LRC nucleosomal diagnostic, we ask whether particular dinucleotides which are known toparticipate to the positioning and formation of nucleosomes [11–13,177–189] (e.g. AA dinucleotides) would carry LRCspecifically associated to eukaryotic genomes. This can be examined if one performs the analysis of different DNA walksgenerated with (i) all adenines, (ii) only adenines that are part of a dinucleotide AA and (iii) isolated adenines that are notpart of a dinucleotide AA. The analysis of human (Fig. 8(d)) and eukaryotes (Table 2) shows that the ‘‘isolated A’’ DNA walkexhibits a clear weakening of the LRC properties at small scale, while the ‘‘AA’’ DNA walk accounts for a major part of theobserved LRC on the ‘‘A’’ DNA walk, which confirms the nucleosomal signature of small-scale LRC [14–16,130]. Note thatthis observation is an additional illustration of the fact that recoding does not trivially conserve the correlation law [15,130].

2.6.4. Long-range correlations in genomic DNA: a requisite for condensation–decondensation processes?Several studies have established the presence in genomic sequences of DNA motifs related to bending properties.

A 10.2 bp periodicity has been observed using either Fourier or correlation function analysis, specifically in eukaryoticgenomes where it has been interpreted in relation to the nucleosomal structure [11–13,177–189]. However there is afundamental difference between this nucleosome diagnostic based on periodicity (i.e. invariance with respect to discretetranslations) and our analysis based on scale invariance properties (i.e. invariance with respect to continuous dilations)which strongly suggests that the mechanisms underlying the nucleosomal structure of eukaryotic genomes are multi-scalephenomena that actually involve the whole set of scales in the 10–200 bp range. In this respect, periodicity (which concerns5% of sequences that present affinities for the histone octamer that are significantly larger than average [190]) and scaleinvariance (which concerns 95% of bulk genomic DNA sequences having an affinity for the histone octamer similar to that ofrandomsequences [190]) should not be considered as opposed to each other but rather complementary. Hence, in contrast tothe tight histone binding obtainedwith an adequate periodic distribution of bending sites, LRCmight facilitate the formationand the positioning of the histone core throughout a major part of the genome. If we consider the translational positioningof nucleosomes as a mechanism of diffusion along the DNA chain and if we assume that, once the histone core is bound toDNA, the distribution of bending sites has a direct consequence on this diffusion process, then the LRC are likely to increasethe average nucleosome displacement along DNA with respect to uncorrelated sequences (anomalous diffusion) [137,191].

What mechanisms might explain the strongly correlated patterns observed on the DNA walks as well as on thecorresponding bending profiles in all genomes, in the 200–5000 bp range? As already suggested in Ref. [190], the signalsinvolved in nucleosome binding and positioning may act collectively over large distances to the packing of nucleosomalarrays into high-order chromatin structures [18,192–194]. Since DNA bending sites are key elements for nucleosomalstructures, the detailed investigation of large-scale LRC observed in eukaryotic bending profiles should shed light on thecompaction mechanisms at work in the hierarchical formation and dynamics of chromatin (Fig. 1). An important clueprovided by our studies is that similar long distance correlations in bending profiles are also observed in eubacteriaand archaebacteria. It is then tempting to conjecture that the compaction of the eubacterial nucleoid in rosette-shapedchromosomal structures [195] is submitted to similar multi-scale structural constraints. Indeed, it is well established

64 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Table 2Values of the Hurst exponent H in the small-scale regime. This exponent is estimated using our wavelet-based method as described in Appendices A andB. The numbers in each line correspond to a linear regression fit of log10 σWT (a) versus log10 a, in the 10(20)–100 bp range, for the indicated sequence orset of sequences. The error bars are estimated from the fluctuations of the local slope of the data in this range of scales. Each column indicates the codingrule that is used (Section 2.2). n.a., non-attributed, when the statistical sample is not large enough to allow reliable measurements.Source: Reprinted table from Ref. [16].

PNuc DNase AA Aiso A GG Giso G(=TT ) (=Tiso) (+T ) (=CC) (=Ciso) (+C)

Homo sapiens 0.67 0.59 0.64 0.55 0.61 n.a 0.54 0.59±0.04 ±0.03 ±0.02 ±0.01 ±0.03 ±0.02 ±0.03

Homo sapiens 0.68 0.60 0.64 0.54 0.60 n.a 0.54 0.60Introns ±0.02 ±0.02 ±0.01 ±0.02 ±0.02 ±0.01 ±0.01

Homo sapiens 0.55 0.49 n.a 0.50 0.53 0.54 0.51 0.54Exons ±0.01 ±0.02 ±0.01 ±0.01 ±0.01 ±0.01 ±0.01

Exons 0.53 0.50 0.50 0.50 0.53 n.a 0.50 n.aGC% < 50 ±0.04 ±0.02 ±0.05 ±0.01 ±0.02 ±0.02

Exons 0.55 0.50 n.a 0.52 0.53 0.56 0.53 0.58GC% > 60 ±0.02 ±0.01 ±0.02 ±0.01 ±0.03 ±0.03 ±0.03

Danio rerio 0.61 0.58 0.58 0.58 0.60 n.a 0.56 0.57±0.05 ±0.03 ±0.05 ±0.03 ±0.03 ±0.03 ±0.06

Drosophila 0.62 0.56 0.60 0.59 0.63 n.a 0.56 0.61melanogaster ±0.03 ±0.03 ±0.05 ±0.02 ±0.05 ±0.01 ±0.06

Caenorhabditis 0.66 0.59 0.63 0.56 0.59 n.a 0.57 0.62elegans ±0.06 ±0.05 ±0.05 ±0.02 ±0.05 ±0.04 ±0.07

Arabidopsis 0.60 0.55 0.58 0.55 0.60 n.a 0.54 0.58thaliana ±0.07 ±0.02 ±0.05 ±0.01 ±0.03 ±0.02 ±0.07

Saccharomyces 0.54 0.54 0.54 0.55 0.57 n.a 0.52 0.53cerevisiae ±0.01 ±0.02 ±0.01 ±0.01 ±0.03 ±0.01 ±0.03

Herpesviridae 0.57 0.52 n.a 0.53 0.53 0.57 0.53 0.59±0.01 ±0.01 ±0.01 ±0.01 ±0.02 ±0.01 ±0.01

Adenoviridae 0.57 0.53 0.55 0.54 0.53 n.a 0.52 0.54±0.01 ±0.02 ±0.04 ±0.02 ±0.02 ±0.04 ±0.02

Melanoplus 0.51 0.49 0.50 0.51 0.49 n.a 0.49 0.51sanguinipes ±0.01 ±0.02 ±0.02 ±0.03 ±0.02 ±0.02 ±0.01

Vaccinia virus 0.51 0.49 0.51 0.50 0.48 n.a 0.51 0.48±0.02 ±0.01 ±0.02 ±0.02 ±0.04 ±0.01 ±0.01

positive-strand 0.53 0.52 0.51 0.53 0.50 n.a 0.53 0.49ssRNA viruses ±0.01 ±0.02 ±0.01 ±0.01 ±0.01 ±0.02 ±0.02

dsRNA viruses 0.55 0.49 n.a 0.51 0.50 0.53 0.53 0.51±0.01 ±0.02 ±0.01 ±0.03 ±0.03 ±0.03 ±0.01

Human 0.62 0.50 0.53 0.51 0.49 n.a 0.62 0.54spumaretrovirus ±0.02 ±0.02 ±0.01 ±0.02 ±0.01 ±0.04 ±0.02

Retroviridae 0.57 0.50 0.57 0.51 0.53 0.61 0.56 0.54±0.03 ±0.01 ±0.01 ±0.02 ±0.02 ±0.01 ±0.01 ±0.01

Escherichia 0.49 0.48 0.50 0.50 0.51 0.50 0.50 0.50coli ±0.02 ±0.02 ±0.02 ±0.01 ±0.01 ±0.01 ±0.01 ±0.01

Rickettsia 0.54 0.50 0.54 0.52 0.52 n.a 0.52 0.51prowazekii ±0.03 ±0.02 ±0.02 ±0.01 ±0.02 ±0.02 ±0.03

Helicobacter 0.51 0.51 0.52 0.52 0.52 n.a 0.50 0.55pylori 26695 ±0.05 ±0.02 ±0.04 ±0.01 ±0.04 ±0.03 ±0.03

Chlamydia 0.51 0.51 0.52 0.52 0.46 n.a 0.51 0.49trachomatis ±0.01 ±0.02 ±0.01 ±0.01 ±0.02 ±0.01 ±0.01

Treponema 0.54 0.51 0.50 0.51 0.50 0.49 0.53 0.50pallidum ±0.04 ±0.01 ±0.01 ±0.01 ±0.02 ±0.03 ±0.01 ±0.01

Mycoplasma 0.51 0.51 0.52 0.52 0.52 0.52 0.52 0.52pneumoniae ±0.04 ±0.03 ±0.03 ±0.01 ±0.02 ±0.03 ±0.03 ±0.01

Bacillus 0.51 0.49 0.50 0.51 0.52 n.a 0.50 0.51subtilis ±0.02 ±0.01 ±0.02 ±0.01 ±0.01 ±0.02 ±0.01

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 65

Table 2 (continued)

PNuc DNase AA Aiso A GG Giso G(=TT ) (=Tiso) (+T ) (=CC) (=Ciso) (+C)

Synechocystis 0.51 0.47 0.50 0.50 0.50 0.49 0.49 0.51PCC6803 ±0.01 ±0.02 ±0.01 ±0.01 ±0.01 ±0.02 ±0.01 ±0.01

Thermotoga 0.52 0.49 0.52 0.51 0.51 0.52 0.51 0.50maritima ±0.02 ±0.01 ±0.03 ±0.01 ±0.02 ±0.02 ±0.01 ±0.02

Aquifex 0.57 0.51 0.57 0.51 0.54 0.52 0.51 0.53aeolicus ±0.04 ±0.02 ±0.03 ±0.01 ±0.03 ±0.02 ±0.01 ±0.02

Bacteriophage 0.50 0.50 0.50 0.52 0.49 n.a 0.53 0.47T4 ±0.01 ±0.01 ±0.01 ±0.01 ±0.01 ±0.02 ±0.01

Bacteriophage 0.49 0.50 0.50 0.52 0.51 n.a 0.51 0.50SPBc2 ±0.01 ±0.03 ±0.01 ±0.01 ±0.01 ±0.01 ±0.01

Thermoplasma 0.56 0.50 0.52 0.50 0.54 0.52 0.51 0.51acidophilum ±0.03 ±0.02 ±0.05 ±0.01 ±0.03 ±0.03 ±0.01 ±0.02

Methanococcus 0.56 0.52 0.55 0.51 0.55 n.a 0.51 0.53jannaschii ±0.05 ±0.03 ±0.03 ±0.01 ±0.04 ±0.02 ±0.02

Pyrococcus 0.52 0.51 0.53 0.51 0.53 0.50 0.51 0.52horikoshii ±0.04 ±0.02 ±0.04 ±0.01 ±0.01 ±0.02 ±0.01 ±0.02

Archaeoglobus 0.54 0.51 0.52 0.50 0.50 0.50 0.52 0.58fulgidus ±0.05 ±0.02 ±0.05 ±0.01 ±0.03 ±0.02 ±0.01 ±0.02

Aeropyrum 0.53 0.51 n.a 0.50 0.48 0.50 0.51 0.58pernix ±0.01 ±0.01 ±0.01 ±0.03 ±0.01 ±0.01 ±0.02

Sulfolobus 0.54 0.50 0.53 0.51 0.53 n.a 0.51 0.55solfataricus ±0.03 ±0.01 ±0.02 ±0.01 ±0.02 ±0.01 ±0.02

that architectural proteins interacting with DNA bending sites participate to the stabilisation of large nucleoproteinarrays, as for example, the eukaryotic HMG and the eubacterial HU histone-like proteins [196]. Moreover, the individualdomains constituting the bacterial nucleoid are variable in size and position [197]. Actually eukaryotic, eubacterial andarchaebacterial chromosomes are submitted to dynamical constraints that might result in scale invariant properties of theDNA text. For instance, large-scale LRC may result from DNA structural features related to the modulation of transcription,replication and recombination events [198–200]. A deep understanding of the large-scale LRC and their interpretation interms of these structural and dynamical constraints raise challenging questions that will be addressed, at least as far aseukaryotic chromatin is concerned, in the next sections.

3. DNA sequence effect on the nucleosomal organization of the eukaryotic chromatin fiber

3.1. Experiments confirm the existence of long-range correlations in yeast nucleosomal occupancy profiles

Since the pioneering experiment of Yuan et al. [201], on chromosome III of S. cerevisiae, the recent flowering of genome-wide experimental maps of nucleosome positions for many different organisms and cell types [202–217], has provided anunprecedented opportunity to elucidate to which extent the DNA sequence participates to the positioning of nucleosomesobserved in vivo along eukaryotic chromosomes. Nucleosome organization is generally analyzed by micrococcal nuclease(MNase) digestion of chromatin. To perform large-scale studies, the distribution of MNase cleavage sites is determinedthroughout genomic regions or in the whole genome by means of either high resolution oligonucleotide tiling microarrays[201,205–208] or more recently be several different massive DNA sequencing technologies [202–204,209–217]. In thissection, we will mainly focus on the simple organism S. cerevisiae whose genome is highly genic (∼75% gene coverage)and for which in vivo [201,204,205,208–210,213,215,216], and also in vitro [216,217] genome-wide positioning data areavailable.

3.1.1. In vivo nucleosome occupancy dataIn Fig. 12 is shown the in vivo nucleosome occupancy profile obtained by Lee et al. [205] along a 30 kbp long fragment

of S. cerevisiae chromosome II. This profile P(s) is an estimate (up to some normalization) of the probability for a base pairlocated at position s to be occupied by a nucleosome of length 146 bp. When plotted in a semi-logarithmic representation,this experimental nucleosome occupancy landscape reveals a patchy organizationwith some alternation of genomic regionsof somehowperiodic oscillations ofmeanperiod l∗ ∼ 167±10bp, corresponding to regular arrays of nucleosomes, separatedby either nucleosome depleted regions also called nucleosome-free regions (NFRs) [201,205] or disordered (unstructured)regions of ‘‘fuzzy’’ positioning. At a statistical level, this peculiar organization can be quantified by performing power-spectrum and correlation analyses. These studies reveal that nucleosomal 167 bp periodicity is not the only main spectral

66 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 12. Nucleosome occupancy profile (δY (s) = Y (s)− Y , where Y (s) = log2(P(s))) along the yeast chromosome II: (a) in vivo data [205] (red); (b) in vitrodata [216] (orange); (c) corresponding power spectra; (d) autocorrelation function (C(1s) = ⟨δY (s)δY (s + 1s)⟩). In (c) and (d) are also represented theresults of our theoretical modeling (Section 3.2): energy landscape (green) for δE = ⟨(E − E)2⟩1/2 = 2 kT; nucleosome occupancy profile for δE = 2 kT andµ = µ− E = −1.3 kT (dark blue) and −6 kT (cyan) so that the mean in vivo and in vitro nucleosome densities are correctly reproduced respectively. In (c)and (d) the data correspond to the results obtained when averaging over the 16 yeast chromosomes. In (c), the dashed lines correspond to the power-lawscaling exponents ν = 0.65, 0.74, 0.68, 0.74 and 0.46 from top to bottom corresponding to H = 0.82, 0.87, 0.84, 0.87 and 0.77 LRC properties respectively.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

characteristic of the experimental occupancy profile. As shown in Fig. 12(c), the corresponding power-spectrum displays avery convincing power-law decay S(k) ∝ k−ν with exponent ν = 2H − 1 = 0.74 ± 0.02 (H = 0.87) that is likely to be adirect consequence of the large-scale (low frequency k . 1/200) LRC regime observed in yeast DNA bending profiles (seeSection 2.6, Fig. 8(a)). In Fig. 12(d), the auto-correlation function C(1s) clearly confirms the presence of a small-scale periodicarrangement of nucleosomes with a characteristic spacing l∗ = 167 ± 10 bp in good agreement with the well-established165 bp value for yeast [218]. Note that this characteristic nucleosome repeat length (NRL) observed in the stretches of well-ordered nucleosomes is significantly smaller than the mean in vivo nucleosome repeat length ∼210 bp estimated whenassuming an homogeneous 75% coverage of the 16 yeast chromosomes by nucleosomes. This suggests the presence of somelong-distance confining process. But this experimental two-point correlation function also reveals that this ‘‘nucleosomal’’periodicity statistically appears as a modulation of a dominant slow decaying component which characterizes the large-scale disordered occupancy landscape fluctuations. C(1s) ∝ 1s−α actually decays as a power-law (See Appendix B, Eq.(B.5)), with an exponent α = 0.48 which, under the assumption of monofractality (α = 2 − 2H), is in reasonably goodagreement with the H = 0.75 value characterizing the large scale (1s & 200 bp) LRC in yeast bending profiles (Fig. 10(a)).Altogether these results demonstrate that the spatial organization of nucleosomes observed in vivo is long-range correlatedwith characteristics similar to the LRC imprinted in the DNA sequence [81].

3.1.2. In vitro nucleosome occupancy dataVery recently, Kaplan et al. [216] have assembled chicken erythrocyte histone octamers on purified yeast genomic DNA

by salt gradient dialysis. They succeeded in producing in vitro genome-wide nucleosome occupancy data but for a genomecoverage by nucleosomes ∼30%, which corresponds to a much smaller nucleosome density (mean NRL ∼500 bp) thanpreviously observed in vivo [201,205]. As shown in Fig. 12(b), the in vitro nucleosome occupancy profile along the same30 kbp long fragment of S. cerevisiae chromosome II looks much more disordered as compared to the in vivo profile inFig. 12(a). In particular, if we still observe rather localized nucleosome depleted regions (NFRs), there is no longer evidenceof stretches of periodically distributed nucleosomes. This is statistically confirmed in Fig. 12(d), where the auto-correlationfunction C(1s) does not oscillate any longer and the power spectrum in Fig. 12(c) no longer displays a bump at highfrequencies corresponding to a characteristic NRL l∗ ∼ 167 bp. But what is remarkable is the fact that both the in vitropower spectrum and auto-correlation function still present very convincing power-law behavior with respective exponentν = 0.74 (Fig. 12(c)) and α = 0.33 (Fig. 12(d)) that are both consistent with a value of H ≃ 0.85 > 1/2, the hallmark of the

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 67

presence of LRC in the in vitro yeast nucleosome occupancy profile. Furthermore, the specific value H ≃ 0.85 found for bothin vivo and in vitro nucleosome occupancy data is reminiscent of the one found in Section 2.6 (Figs. 8(a) and 10(a)) for theyeast DNA bending profiles. This strongly suggests that along the line of the theoretical modeling of the thermodynamics of2DDNA loops in the presence of LRC structural disorder in Refs. [137,191], the LRC observed in the experimental nucleosomeoccupancy profiles are a direct consequence of the LRC present in the DNA sequence at scales &200 bp. In the remaining ofthis section, our goal is to show that the LRC observed in the sequence are not only recovered in the nucleosome occupancyprofiles but that they globally provide a convincing understanding of the succession of patterns observed in vivo, regions ofwell-ordered nucleosomes confined near excluding energy barriers alternatingwith regions ofmore diluted non-positionednucleosomes (Fig. 12(a)) [81].

3.2. Grand canonical modeling of nucleosome occupancy landscape

Preliminary analyses [219,220] of experimental nucleosome occupancy data have concluded that a significant set(30%–50%) of nucleosome positions found in vivo could effectively be related to a ‘‘genomic nucleosomal pattern’’ based ona ∼10 bp periodicity in the distribution of some di- or trinucleotides (e.g. AA/TT) showing higher affinity for nucleosomes[11–13,177–189]. The periodic positioning of these motifs over a few helical pitches would contribute to a globalspontaneous curvature of DNA that would favor its wrapping on the histone surface. However this nucleosome code [219,220], based on the statistical significance of this 10 bp periodicity nucleosome positioning signal, remains very debated.According to more recent reports [221,222], in S. cerevisiae, no more than 20% of the in vivo nucleosome positioning abovewhat is expected by chance is determined by intrinsic signals in the genomic DNA. Thanks to the recent availability ofnew in vivo and in vitro genome-wide nucleosome occupancy data [216], a model of DNA-sequence-dependent nucleosomepositioning based on statistical learning [209,216] was shown to be significantly predictive of nucleosome organization invivo in yeast as well as in other organisms like fly and human. In this section, in the spirit of the physical modeling developedin a previous work [137,191], we propose a model which is based on sequence-dependent DNA bending properties [81]. Toevaluate the performance and the predictive power of our model, we will systematically compare our results with those ofField et al. model [209,216] further considered as a reference statistical model.

3.2.1. Nucleosome occupancy profileWhen focusing on the dynamical assembly of histone octamers along the DNA chain, chromatin can be reasonably

modeled by a fluid of rods of finite extension ℓ (the DNA wrapping length around the octamer), binding and movingin an external potential E(s) (the effective nucleosome formation potential) and interacting (potential υ(s, s′)) along a1D substrate (the DNA chain) [81]. Within the grand canonical formalism, considering that the fluid is in contact with athermal bath (at reciprocal temperature β) and a histone octamer reservoir (at chemical potentialµ), the thermodynamicalequilibrium properties of the system are described by the grand partition function:

Ξ =

Nmax−N=0

1N!

∫D[s(N)] exp

−β(V (s(N))− µN)

, (4)

where V (s(N)) is the total potential energy of the N (&1) rod system:

V (s(N)) =

N−1

E(sk)+ 1/2−j=k

υ(sk, sj), (5)

E(s) an external field and υ(s, s′) the interaction potential. From Eq. (4) we obtain the nucleosome density profile as:

ρ(s) = −1β

∂ lnΞ∂E(s)

=1Ξ

Nmax−N=1

1(N − 1)!

∫Ds(N−1) exp −β V (s, s(N−1))

− µN

. (6)

The thermodynamics of this system has been largely investigated: in the case of monodisperse hard rods in a uniformexternal potential, it is the well-known Tonks gas [223]. In the case of a non-uniform external potential of interest here,the problem was solved by Percus who derived an exact functional relationship between the residual chemical potentialµ− E(s) and the hard rod density ρ(s) [224]:

βµ = βE(s)+ ln ρ(s)− ln1 −

∫ s+l

sρ(s′)ds′

+

∫ s

s−l

ρ(s′)

1 − s′+ls′ ρ(s′′)ds′′

ds′. (7)

Eq. (7) has an explicit solution [225] that requires numerical integration. The nucleosome occupancy profile P(s) is thenobtained by convolving the nucleosome density ρ(s)with the rectangular functionΠ of width 146 bp:

P(s) = ρ Π146(s). (8)

For pedagogical purposes, we show in Fig. 13 the theoretical nucleosome occupancy profile obtained at rather highnucleosome density for a toy energy landscape bordered by two infinite walls and that displays a stretch of square-likewells

68 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 13. Nucleosome occupancy profile (thin line) obtained from the Vanderlick et al. solution [225] of Eq. (7) for the theoretical energy landscape E(s)(thick line) and the chemical potential µ = −1 kT.

on the right side and two square-like energy barriers on the left side. As expected, an array of well positioned nucleosomes isobserved as induced by the stretch of regularly distributed potential wells (e.g. a stretch of highly positioning sequences likethewell known601 sequence [190]).More interesting is the nucleosome ordering observed near the two energy barriers andthe limiting walls. This is a very simple illustration of the ‘‘statistical ordering’’ mechanism originally proposed by Kornbergand Stryer [226] andwhich is awell-knownphenomenon of nonuniformhard rod fluids. If we study 1D hard rod distributionin the vicinity of a boundary (i.e., a fixed exclusion zone), we obtain, for a sufficiently dense system, a damped oscillatingdensity pattern, i.e., a statistical short-range ordering due to the interplay between boundary confinement and rod–rodexcluded volume interaction. Realistic boundaries or barriers can be induced by unfavorable sequences that resist to thestructural distortions required for nucleosome formation (e.g. the presence of poly(dA:dT) [209,210,216,227,228] mightimpair nucleosome formation or favor nucleosome disassembly [229,230] by increasing the DNAwrapping free energy cost)or particular sequences that may recruit transcription factors [216,231] and/or other protein complexes such as chromatinregulators [232–234] that may compete with nucleosomes. Note that fixed nucleosomes on strong positioning sequencesmay also play the role of motionless obstacles. As illustrated on the left of Fig. 13, this confinement-induced nucleosomeordering can be enhanced when two high energy barriers are found at sufficiently short distance from each other, thusproviding an alternative scenario for collective nucleosome organization by excluding energy barriers [81,82].

3.2.2. Nucleosome formation energyTo compute the free energy landscape E(s) associated to the formation of one nucleosome at a given position along

the DNA chain, we will assume that [81] (i) DNA is an unshearable elastic rod whose conformations are described by theset of local angles Ω1(s) (tilt), Ω2(s) (roll), Ω3(s) (twist) (Fig. 2) and (ii) the DNA chain along the nucleosome at positions is constrained to form an ideal superhelix of radius R = 4.19 nm and pitch P = 2.59 nm as observed in the X-ray crystallographic nucleosome structure (Fig. 14) [8–10] over a total length ℓ, which fixes the distribution of angulardeformation rate (ΩNuc

i (u))i=1,2,3, u = s, . . . , s + ℓ. Within linear elasticity approximation, the energy cost reads:

βE(s) =

∫ s+ℓ

s

3−i=1

Ai

2

ΩNuc

i (u)−Ωio(u)2

du, (9)

where A1, A2 and A3 are the stiffnesses associated to the tilt, roll and twist deformations around their intrinsic value Ωo1,Ωo2 andΩo3 respectively. Consistently with our previous work (Section 2) [14–16,130,137,191], we will use here the PNucstructural roll coding table (Table 1) which is mainly a trinucleotide roll coding table (Ωo2), with zero tilt (Ωo1 = 0) andconstant twist (Ωo3 = 2π/10.5). As the values for this bending table were arbitrary assigned between 0 and π/18 [134,135], we will perform the following affine rescalingΩ∗

o2 = γΩo2 − η with γ = 0.4, η = 0.06, in order to get a comparableenergy fluctuation range as in experiments [201,205].

In Fig. 15 is shown the theoretical nucleosome occupancy profile obtained along the yeast chromosome II when fixingthe model parameters to provide a very good match, at a statistical level, with the experimental in vitro [216] and invivo [205] data. The 2D map in Fig. 15(a) actually represents the evolution of the nucleosome occupancy probability whenincreasing the mean residual chemical potential µ = µ − E. In Fig. 15(b) is shown the predicted energy landscapewhen fixing the effective nucleosome wrapping length to l = 125 bp, a value which is smaller than the typical and wellaccepted 146 bp nucleosomal DNA length [8]. Note that this might actually reflect the fact that nucleosome–nucleosomeinteractions have a short-range attractive component [235–239] that is lacking in our hard core model but that we recoverin an effective way by considering a smaller hard core DNA length. Interestingly, Fig. 15(a) enlightens the fundamental roleof the energy landscape and its topography (amplitude, size and distribution of favorable and unfavorable regions) thatentirely control the fluctuations in the nucleosome occupancy profile but in a non-trivial (nonlinear and non-local) mannerthat depends on the chemical potential. As a main feature, the highest energy barriers present in the energy landscapecorrespond to regions that are robustly depleted in nucleosomes whatever the overall nucleosome density. At low density,

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 69

Fig. 14. (Top) X-ray structure of the nucleosome core particle: front (left) and side (right) views. Adapted with permission from Ref. [8]. Copyright 1997by Nature Publishing Group. (Bottom) Our physical modeling consists in computing the energy cost to bend a DNA fragment of length l into the almost twoturns of the DNA double helix that are involved in the crystallized nucleosome particle (radius R = 4.19 nm, pitch P = 2.59 nm). Adapted with permissionfrom Ref. [10]. Copyright 2003 by Nature Publishing Group.

Fig. 15. (a) 2Dmap representing the theoretical nucleosome occupancy (log2) along a 12 kbp long fragment of the yeast chromosome II as a function of themean residual chemical potential µ = µ− E [82]: dark blue corresponds to low probability and red to high probability. The two white occupancy profilesare the theoretical profiles obtained for µ = −6 kT and−1.3 kT that correspond to a genome nucleosome coverage of 30% and 75% as observed in vitro [216]and in vivo [205] respectively; the corresponding in vitro and in vivo experimental nucleosome occupancy profiles are shown in red for comparison. (b)The corresponding energy landscape 1E(s) = E(s) − E computed with the following parameter values: δE = ⟨(E − E)2⟩1/2 = 2 kT and l = 125 bp.In (a) are indicated the positions of the transcription start site (TSS) (black dots), of the transcription termination site (TTS), corresponding in fact to thepolyadenylation site (circles), and of the transcription factor binding sites (TFS) (black triangles). (For interpretation of the references to color in this figurelegend, the reader is referred to the web version of this article.)

the confinement is weak and the nucleosomes distribute everywhere in between the highest energy barriers according tothe energy landscape fluctuations. When the nucleosome density is increased, ordering progressively appears leading to anoverall organization with ‘‘crystal-like’’ phases of regularly positioned nucleosomes confined near or in between excludingenergy barriers, coexisting with ‘‘fluid-like’’ phases where ordering is lost. Thus, our sequence-dependent grand canonicalmodel generates nucleosome occupancy profiles that strongly resemble to the collective nucleosome organization observed

70 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 16. Energy landscape statistics (1E(s) = E(s)− E) computed with the following parameter values: δE = ⟨(E − E)2⟩1/2 = 2 kT and l = 125 bp. Thecolors correspond to the LRC genomic yeast DNA sequence (green) and to its randomly shuffled uncorrelated version (black). (a)1E(s) along a 50 kbp longfragment of yeast chromosome II. (b) Energy pdfs computed for the 16 yeast chromosomes. (c) Energy autocorrelation function C(1s)/C(0) vs. 1s. (Forinterpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

in vivo in Fig. 12(a). In particular it illustrates the importance in this nucleosomal organization of the presence of excludingenergy barriers encoded in the DNA sequence. Importantly, as shown in Fig. 16(a), when recomputing the energy landscapeafter randomly shuffling the DNA sequence, the obtained energy profile displays Gaussian fluctuations without any moreevidence of high energy barriers. This is a strong indication that these barriers result from the presence of LRC in the DNAbending profile. Indeed, the energy pdf obtained for the genuine DNA sequence significantly deviates from the Gaussiandistribution recovered from the shuffled sequence (Fig. 16(b)), in particular as far as its right exponential tail correspondingto large 1E > 0 fluctuations is concerned [137,191]. As further shown in Fig. 16(c), while the energy correlation functiondecreases rather fast to zero for the uncorrelated (randomly shuffled) DNA chains, in the presence of LRC structural disorder,this correlation function decreases more slowly at large distances and this as a power law in very good agreement with thetheoretical prediction C(1s) ∝ 1s−α (Appendix B, Eq. (B.5)) with an exponent α = 2 − 2H = 0.51 comparable to thevalue H = 0.75 previously found in Section 2 for the large-scale LRC exponent in yeast bending profiles (Fig. 10(a)). Theseresults confirm the importance of the LRC observed in the DNA sequence on the nucleosome organization along the yeastchromosomes via the emergence of excluding energy barriers.

3.2.3. In vitro data validate theoretical nucleosome occupancy predictionsAs shown in Fig. 15(a), when adjusting the chemical potential µ to obtain the nucleosome density observed in vitro by

Kaplan et al. [216],we get a nucleosomeoccupancy profile (white profile) that reproduces the experimental data (red profile)remarkably. The mean Pearson correlation computed along the 12 millions bp of the yeast genome is r = 0.71, a resultwhich is as good as the correlation value r = 0.77 obtained with the Field et al. model based on statistical learning [209].Furthermore this very satisfactory mean Pearson correlation value really reflects the pertinence and consistency of ourphysical model all along the yeast chromosomes. As shown in Fig. 17(a), the histogram of correlation values computed in1 kbp sliding windows along the entire genome is mainly concentrated over a range 0.7 . r . 1, with a well definedmaximum for a value as high as r = 0.85. For the sake of comparison, we have also reported in Fig. 17(a), the histogramof correlation values obtained between the predictions of our physical model and those of the Field et al. statistical model.This histogram is even more concentrated at very large r values with a rather sharp maximum for r = 0.92. This bringsthe demonstration that our model based on the structural and mechanical properties of the DNA double helix performs aswell as rather sophisticated models requiring statistical learning [209,216]. This is confirmed in Fig. 12(c) and (d) whereboth the predicted power spectrum and autocorrelation functions display a power-law decay with respective exponentsν = 2H − 1 = 0.65 and α = 2 − 2H = 0.61, very similar to the ones obtained for the in vitro data and the ones computeddirectly from the energy landscapes (ν = 0.68, α = 0.51). Let us emphasize that all these results are consistent with thevalue H = 0.83 ± 0.03 (>1/2) derived from the energy landscape as the signature of the presence of LRC dictated by the

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 71

Fig. 17. Histograms of correlation values r between theoretical and experimental nucleosome occupancy profiles, as measured in 1 kbp sliding windowover the 16 yeast chromosomes. (a) In vitro data: Pearson correlation between our physical modeling (δE = 2 kT, µ = −6 kT) and in vitro Kaplan et al.data [216] (light blue), Field et al. statistical model [209] (pink) and a random occupancy landscape (black). (b) In vivo data: Pearson correlation betweenour physical model (δE = 2 kT, µ = −1.3 kT) and in vivo Lee et al. data [205] (dark blue) and a random occupancy landscape (black). (For interpretationof the references to color in this figure legend, the reader is referred to the web version of this article.)

DNA bending profile and therefore by the DNA sequence. Importantly, the physical model neither predicts any oscillatorymodulation of the power-law decay of the auto-correlation function (Fig. 12(d)) nor the presence of some periodicity(∼167 bp) in the power-spectrum (Fig. 12(c)), thereby confirming the absence of statistical ordering at low nucleosomedensity as previously noticed in the analysis of in vitro data (Fig. 12).

3.2.4. In vivo nucleosome ordering: organizing via excluding?As shown in Fig. 15(a), when increasing the chemical potential µ to reach the in vivo nucleosome density, we get a

nucleosome occupancy profile (white profile) that is still in good agreement with the experimental data (red profile).However, as reported in Fig. 17(b), the histogram of Pearson correlation values is significantly shifted to lower values, ascompared to the one previously obtained at lower (in vitro) nucleosome density in Fig. 17(a), with amean value r = 0.27 anda rather wide maximum at rmax = 0.31. The weakest correlations observed for our model are also shared by the statisticalmodel of Field et al. [209] that does not perform much better with a mean Pearson correlation r = 0.33. As a carefulinspection of the theoretical and in vivo experimental nucleosome occupancy profiles (Fig. 15(a)) seems to indicate, theseweakest correlations result from two main features, i.e., (i) experimental NFRs that do not correspond to genomic energybarriers butmore likely result from the action of external factors like transcription or other proteic factors (TF, PIC, Pol I/II/III,. . . ) and (ii) regions (up to 1 kbp) where the experimental nucleosomal pattern is shifted by a few tens bp with respect tothe predicted nucleosomal pattern as the possible outcome of ATP consuming remodeling factors. Wewill come back to thispoint in more details in Section 3.6.

Let us emphasize that, at a statistical level, our physical model now accounts remarkably well for the harmonicmodulation with period l∗ ∼ 167 bp observed in vivo in the two-point correlation function (Fig. 12(d)) as well as for thecorresponding bump at high frequency (k ∼ 1/167) in the power spectrum (Fig. 12(c)). These results confirm that thestretches of well-ordered nucleosomes observed in vivo but not in vitro (i.e., at high but not at low nucleosome density)are the direct consequence of the organizing role of the genomic energy barriers that condition nucleosome ordering overrather long distances according to equilibrium statistical physics principles [81,82,226]. As a final remark, (i) the fact that ourphysical model yields power-law decay exponents ν = 0.46 and α = 0.63 for both power spectrum and auto-correlationfunction, in good agreement with the values obtained for the in vitro data and (ii) altogether with the exponent valuesfound in vitro, the fact that the nucleosome occupancy profiles are long-range correlated with LRC exponent H ≃ 0.75–0.85directed by the DNA bending profile, strongly suggest that in the disordered ‘‘liquid’’ phase, far away from the genomicenergy barriers, the nucleosomes are randomly distributed according to the sequence-induced LRC fluctuations of thenucleosome formation energy landscape (Fig. 15) [81,82].

3.3. Functional location of genomic energy barriers and nucleosome-free regions

The modeling of nucleosome occupancy data in Section 3.2 has confirmed the genomic excluding energy barriers andmore generally the NFRs, as major actors in the collective assembly of nucleosomes observed in vivo [81,210,222,240]. Inthat regards, these genome-wide results raise the fundamental question of elucidating the possible functional role of thesenucleosome depleted regions. In this section, we will investigate the specific location of NFRs relative to various regulatoryregions, namely the TSS, the TTS (defined in Fig. 15) and the transcription factor binding sites (TFS). To specify the positionof these NFRs in the theoretical nucleosome occupancy profiles, we first threshold P(s) < 0.35 and only keep the minimaof the resulting signal. Then we check for the presence of a barrier in E(s): after removing the low-frequency trends in theenergy landscapewith a high-pass filter, if the energy profile is still higher than 3 kT at aminimum of P(s), then this position

72 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 18. Average (4554 yeast genes) in vitro (orange) and theoretical (cyan) nucleosome occupancy (log2) profiles, and mean theoretical energy profile(green) plotted around (a) the TSS, (b) the TTS, (c) the 5′ nucleosome downstream the TSS and (d) the 3′ nucleosome upstream the TTS. The physical modelparameters are δE = ⟨(E− E)2⟩1/2 = 2 kT and µ = µ− E = −6 kT. In (a–d) are shown for comparison the nucleosome occupancy profiles (pink) predictedby the Field et al. [209] statistical model. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)

is accepted as a NFR. A score is then assigned to the so-defined energy barriers as the mean number of nucleosomes in a125 bp wide window centered at the barrier position, the smaller the better.

3.3.1. Yeast gene dataA selection of 4554 yeast genes for which high-confidence TSS and TTS could be assignedwas determined as follows [83].

We used the high resolution transcriptome analysis from David et al. [241] in which genome-wide tiling arrays were usedto define a set of transcribed segments along the S. cerevisiae genome. Transcribed segments overlapping known non-dubious genes on less than 50% of the segment length were not considered. The TSS and TTS were determined from theboundaries of the given transcribed segments. Note that the transcribed segment used to determine the 5′ end could bedifferent from the one used for the 3′ end of the gene. This resulted in 4554 genes for which high-confidence TSS and TTScould be assigned. Genes containing introns were downloaded from the Saccharomyces Genome Database (SGD project:http://www.yeastgenome.org/). Only genes with one intron were retained for our analysis of intron size. Genome widedata on yeast RNA polymerase II was retrieved from Ref. [242]. A given position was considered enriched in Pol II when itsscore was in the highest quintile. We also considered that each polymerase has a certain occupancy in space, thereforewe considered each enriched Pol II position to be included in a ±30 bp surrounding box. Transcription plasticity wasretrieved from Ref. [243]: it is the average of the square log2 expression-ratio estimated from a large number of microarrayexperiments. Sensitivity to disruption of chromatin modifiers was obtained from Refs. [243,244]: it quantifies the extent bywhich gene expression depends on chromatin regulators activity.

3.3.2. In vitro nucleosome-free regionsUsing a simple thresholding algorithm (δY < −3), we have extracted from the in vitro nucleosome occupancy profile

(Fig. 12(b)) [216] about 3500 NFRs that are predominantly associated to gene TSS (24%) and TTS (55%). A large majority(∼63%) co-localized at 70 bp (68% at 200 bp) with NFRs predicted by our physical modeling at low nucleosome density(µ = −6 kT). Actually 63% of the in vitro TSS NFRs correspond to high genomic energy barriers detected in the energylandscape. A similar estimate (70%) is found for the in vitro TTS NFRs. This means that nomore than 15% of the yeast TSS and30% of TTS regions are constitutively depleted in nucleosome and likely encoded in the DNA sequence during evolution.

In Fig. 18(a) is shown the mean in vitro nucleosome occupancy profile obtained when averaging around the 4554 yeastgene TSS. The observed nucleosome depleted region (5′ NFR) is remarkably well described by our physical model (as well

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 73

by the Field et al. statistical model) that simply accounts for the presence of a high energy barrier in the energy landscape.A nucleosome depleted region (3′ NFR) is also observed in the mean in vitro occupancy profiles when aligned at the TTS(Fig. 18(b)). Actually themean nucleosome occupancy profiles observed at the TSS and TTS look very similar and symmetric.They both exhibit no flanking nucleosome ordering except the first 5′ nucleosome downstream of the 5′ NFR (Fig. 18(a))and the last 3′ nucleosome upstream the 3′ NFR (Fig. 18(b)) [82] that are the two nucleosome (relative) positioning that areknown to be associatedwith a 10 bpperiodic distribution of AA/TT dinucleotides [210,213,219]. This privileged positioning isclearly evidenced in Fig. 18(c) and (d) when averaging the nucleosome occupancy profiles around the 5′ and 3′ nucleosomesrespectively.

3.3.3. In vivo nucleosome-free regionsWhen extending this study of NFRs (δY < −2) to genome-wide in vivo data (Fig. 12(a)), we realize that significantly

more NFRs (4800) are detected that are now predominantly associated to TSS (70%) and to a lesser extent to gene TTS (33%).Clearly, as compared to the in vitro NFRs, more NFRs are found in vivo at the gene TSS as the probable signature of the(sequence-dependent) action of external factors such as transcription factors (a NFR is observed at themajority of TFS out ofour physical model predictions, see Fig. 15(a) and Section 3.6) and remodelers (some phase shift is experimentally observedwith respect to the sequence-induced nucleosome occupancy profile, see Section 3.6) [245]. This explains that our sequence-directed physical model at high nucleosome density (µ = µ − E = −1.3 kT) predicts a number of NFRs corresponding toa smaller percentage (29%) of the in vivo 5′ NFRs but that still remarkably match (46%) with the experimental in vitro NFRs.Interestingly, almost half (40%) of the in vivo 3′ NFRs are predicted by our physical model and match to a large extent (67%)the NFRs previously observed in vitro. Thusmost of the NFRs observed in vitro as corresponding to genomic excluding energybarriers are consistently observed in vivo, and this both at the gene TSS and TTS, besides some substantial additional 5′ NFRsthat are not accounted for by our sequence-dependent physical model.

In Fig. 19(a) is shown themean in vivonucleosomeoccupancyprofile around the 4554 yeast gene TSSwhich again displaysa nucleosome depleted region but now flanked, especially towards the ORF of the gene, by stretches (5–6) of well positionednucleosomes with a periodic ordering ∼167 bp. Our physical modeling accounts very well for the depleted zone but withno ordering on the sides, very similar to what is experimentally observed at gene TTS in Fig. 19(b), but significantly smallerandwider than the one observed at TSS (Fig. 19(a)). We interpret this discrepancy as a consequence of a less precise phasingof the theoretical NFR with respect to the TSS, leading to the averaging out of the periodic ordering and a widening of thedepletion. As shown in Fig. 19(a), if the histogram of experimental distance values between the TSS and 5′ nucleosome issharply peaked at ∼70 bp as the signature of a strong phasing, the corresponding theoretical histogram is much wider andrather flat. This simple lack of phasing is confirmed in Fig. 19(c) where when now averaging the nucleosome occupancyprofile around the 5′ NFRs, we recover theoretically a periodic ordering bordering the NFR on both sides. A similar result isobtained in Fig. 19 (d) around the 3′ NFR, corroborating the organizing role of these NFRs over distances encompassing arather large number of nucleosomes. As illustrated in Fig. 19(e) and (f), the statistical ordering observed in vivo around yeastgene TSS and TTS can also be evidencedwhen averaging the nucleosome occupancy profile around the 5′ and 3′ nucleosomesthat are rather well phased with respect to the 5′ and 3′ NFRs.

In order to elucidate the mechanisms underlying the substantial set of NFRs detected in vivo at gene promoters butthat are not predicted by our physical model and not observed in vitro, we have averaged the experimental and theoreticalnucleosome occupancy profiles around annotated TF binding sites. As shown in Fig. 20, most TFS reside in nucleosome-depleted regions in vivo (Fig. 20(b)) [201,204,205] but not in vitro (Fig. 20(a)), suggesting that in these genomic regionsthe TF have won the competition with the histones for binding on the DNA target sites [246]. This is confirmed by ourphysical modeling which does not predict an inhibitory energy barrier at the TFS but simply some mild hilly profilethat is not discernible in the in vitro nucleosome occupancy profile (Fig. 20(a)) but that induces a rather weak relativenucleosome depletion at larger nucleosome density (by phasing the neighboring nucleosome arrangement) as compared tothe pronouncedmean NFR profile found in vivo (Fig. 20(b)). To account for these extra in vivoNFRs, in the following sections,whenevermentioned, wewill locally impose at the corresponding TSS and TTS the presence in E(s) of an excluding barrier oftrapezoidal shape to mimic the effect of external factors likely involved in the transcription machinery. As shown in Figs. 19and 20 (b), this new ingredient improves evenmore the comparison between our physical model predictions and the in vivonucleosome occupancy data.

3.3.4. Nucleosome-free regions are evolutionary conservedIn their pioneering experiment, Yuan et al. [201] have investigated sequence conservation in seven closely related

Saccharomyces species. When aligning promoters by the NFR and averaging the conservation scores, they found a markof conservation in the NFRs that contrasts with nucleosomal intergenic sequences that appeared to be poorly conservedacross evolution. Because these regions of conservation often include a great deal of sequences beyond the well-knownconserved TF binding motifs [247], they proposed as a complementary interpretation, the presence of multiple poly(dA:dT)stretches that are known to be poorly incorporated into nucleosome because of their relative rigidity [209,210,216,227–230] and to have an increased likelihood of being linkers. In a more recent study, Washietl et al. [248] have confirmed thaton average, lower substitution rates are observed in linker regions than in nucleosomal yeast DNA and this in intergenic aswell as in coding regions. Actually, among intergenic regions, the NFRs observed at gene TSS and TTS seem to be relatively

74 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 19. Average (4554 yeast genes) in vivo (red) and theoretical (dark blue) nucleosome occupancy (log2) profiles, and mean theoretical energy profile(green) plotted around (a) the TSS, (b) the TTS, (c) the 5′ NFR, (d) the 3′ NFR, (e) the 5′ nucleosome and (f) the 3′ nucleosome. The red (dark blue) histogramscorrespond to experimental (theoretical) distance between (a) TSS and 5′ nucleosome, (b) TTS and 3′ nucleosome, (c) 5′ NFR and 5′ nucleosome and (d)3′ NFR and 3′ nucleosome. The model’s parameters are δE = ⟨(E − E)2⟩1/2 = 2 kT and µ = µ − E = −1.3 kT. The dark blue dashed line correspondsto the theoretical nucleosome profile obtained when imposing an artificial excluding energy barrier at TFS when not predicted by the sequence. (Forinterpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

more conserved, some probably because of a concentration of TFS, some others probably because of the presence of severalpoly(dA:dT) stretches. In that context, investigating the conservation status of the sequences underlying the excludingenergy barriers predicted by our physical model should bring further understanding of the strong nucleosome depletionobserved in vitro and in vivo at some yeast gene TSS and TTS and in turn of the concomitant observation of long-rangenucleosome ordering inside genes [83].

3.4. Thermodynamics of intragenic nucleosome ordering

3.4.1. Yeast genes display a highly organized nucleosomal architecture in vivoWhatever the origin of the effective energy barriers that impair nucleosome formation at the observed NFRs, they result

in a confining and an ordering of flanking nucleosomes and as such, condition the chromatin organization inside the genes.When ordering yeast genes by the distance L that separates the first (5′) and last (3′) nucleosomes [210,249], we obtain a2D map that reveals a striking organization of the nucleosome distribution inside the genes (Fig. 21(a)) [82,83]. Note thatwe use L as a substitute of the distance L between the 5′ and 3′ NFRs that is more difficult to measure accurately becauseof the NFR shape variability. Small genes (L . 1.5 kbp) present a clear periodic packing in between the two borderingNFRs with a well defined number n of regularly spaced nucleosomes (Fig. 21(b)). As the inter-distance L increases, these

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 75

Fig. 20. Average nucleosome occupancy (log2) profiles around TF binding sites. (a) In vitro data (orange), theoretical physical model (light blue) and Fieldet al. (pink). (b) In vivo data (red), theoretical physical model (dark blue), and physical model with artificial excluding energy barriers at TSS and/or TTSwhen not predicted by the sequence (dark blue dashed). The red (dark blue) histogram corresponds to experimental (theoretical) distance between theNFR and the TFS. Themodel’s parameters are the same as in Figs. 18 and 19 respectively. The green line corresponds to themean theoretical energy profile.(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

‘‘crystallized’’ genes cluster into L-domains with genes having the same number of nucleosomes, from n = 2 to about 9nucleosomes. For rather large gene sizes (L & 1.5 kbp), the nucleosome positioning appears periodic essentially at the twoboundaries and fuzzy in the middle where the confinement induced by both boundaries is probably too weak to constrainthe positioning of the central nucleosomes (Fig. 21(a)). Close inspection of this 2D-map shows that the range of influence ofthe 5′ NFR extends to about 7 nucleosomes as compared to about 5 nucleosomes for the 3′ NFR. This likely results from NFRnucleosome occupancy profiles that are more pronounced, on average, at the 5′ than at the 3′ gene extremities (Fig. 19(a)and (b)).We thus observe a periodic ordering that extends inside the genes, consistentwith a statistical orderingmechanisminduced by exclusion from the boundaries [82,83]. In this ordering mechanism, the strength, period and range depend onthe degree of nucleosome confinement that increases with the nucleosome exclusion strength of the boundaries and withthe average nucleosome density [226]. The more nucleosomes are confined, the more they adopt a long-range and compactperiodic organization, with the inter-boundary distance as an additional control parameter.

This interpretation is strengthened by the corresponding 2D-map obtained from in vitro data in Fig. 22(a), where exceptfrom some positioning of the 5′ and 3′ nucleosomes at gene extremities, no collective periodic nucleosome ordering isobserved inside the genes. This is an indication that the main positioning signal is not specified by the high affinity ofintragenic positioning sequences but rather by the long-range confinement induced by the bordering excluding barriers. Thisis confirmed by the 2D-map of nucleosome occupancy predicted by our physical model which (i) at low nucleosome density(Fig. 22(b)), accounts for the featureless structure of the in vitro 2D map (Fig. 22(a)) and (ii) at high nucleosome density(Fig. 21(b)) reproduces remarkably well the highly organized intragenic nucleosomal architecture observed in vivo [82,83].

As shown in Fig. 22(c), when investigating nucleosome positioning inside intergenic regions in vivo, we obtain a 2D-map that again strongly differs from the one observed inside the genes (Fig. 21(a)). Along intergenic regions, that consistin regions located between the NFRs of neighbor genes, only the bordering nucleosomes are well positioned. We observeno propagation of nucleosome phasing from bordering to internal nucleosomes that present a rather fuzzy distribution.This lack of intergenic crystal-like nucleosomal organization is likely related to the asymmetric shape of the NFRs that aresignificantly sharper on the gene side than on the intergenic side where their range of influence appears to be limited to theclosest nucleosome (Fig. 19).

3.4.2. Two classes of intragenic chromatin architecture: crystal-like versus bi-stable structuresFrom spectral analysis of the in vivo nucleosome occupancy profiles, we assign [83] genes to one of the three categories,

crystal-like (1940/4554), bi-stable (946/4554) and other (1668/4554), depending on the crystallization state of theirchromatin. Genes presenting a profile with a single and dominant periodic contribution are considered as ‘‘crystal genes’’(Fig. 23(a), (c), (d) and (f)). Genes presenting two periodic contributions are considered as ‘‘bi-stable’’ genes (Fig. 23(b) and(e)). Among large genes, we also detect genes presenting more than two periodic contributions thus displaying propertiesof multi-stable genes.

As shown in Fig. 23(g), we observe [83] a quantized distribution of L values for all genes, with maxima corresponding tocrystal-like genes with L = n × 167 (n = 2, 3, etc.. . . ). It results that genes with L values corresponding to bi-stability areunder-represented. Indeed, our criteria to select ‘‘bi-stable’’ genes from their power spectrum was chosen very stringentlyon purpose (Fig. 23), in order to ensure no contamination by ‘‘crystal’’ or ‘‘other’’ genes; this certainly contributed to someunderestimation of the ‘‘bi-stable’’ category to the benefit of higher confidence level. In agreement with analyses of the invivo 2D-map (Fig. 21(a)), we find that themean proportion of crystal genes clearly presents a 167 bp periodicity as a functionof L up to around 1.5 kbp (Fig. 23(h)). In addition, this proportion is globally larger at small L values and decreases whenincreasing L. This actually defines a crystallization regime associated with L values, in which large domains of crystal-likegenes alternate with smaller domains where crystal-gene density locally drops. At transitions between n and n+1 domains

76 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 21. (a) 2D map of local minima (red) of the experimental in vivo nucleosome occupancy profile at yeast genes. The 4554 yeast genes are orderedvertically by the distance Lbetween the 5′ and3′ nucleosomes. Insets:mean experimental (red) and one individual theoretical (blue) nucleosomeoccupancyprofiles for ‘‘crystal’’ genes harboring 5 nucleosomes (right, top), 6 nucleosomes (right, bottom) and the bi-stable genes with 5/6 nucleosomes (left).(b) Zoom on the first 2000 genes in (a); on the top of the experimental data (red) are superimposed the predictions of our physical modeling (blue) whenimposing an artificial excluding energy barrier at TSS and/or TTS when not predicted by the sequence (δE = 0.75 kT, µ = 1.5 kT). (c) Same as in (b)but where the theoretical predictions (blue) correspond now to the noiseless energy landscape defined in Fig. 24(a); the black curves indicate the 5′ and3′ end position of the theoretical excluding energy barriers. In (b) and (c), horizontal grey-shaded bands correspond to some ‘‘bi-stable’’ L-domains. (Forinterpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Refs. [82,83].

(Fig. 23(i)), intermediate Lwindows can be identifiedwhere the nucleosome occupancy profile becomes apparently irregularwithout clear nucleosome positioning (Fig. 23(b)). As revealed by the power spectrum analysis, this irregularity does notmean that the crystallization mechanisms are no longer at work inside those genes. The presence of two dominating peaksin the power spectrum reveals the statistical coexistence of two crystal-like chromatin states with respectively n and n + 1nucleosomes (Fig. 23(b) and (e)). Indeed when increasing L, more peaks can appear in the power spectrum suggesting someevolution from a bi-stable to a multi-stable structure.

This quantized distribution of L values observed for all gene classes (Fig. 23(g)–(i)) actually reflects specific crystallizationmechanisms at work inside genes [82,83]. A naive way to order nucleosomes in a crystal-like fashion is to impose a strictperiod of 167 bp which will lead to a trivially quantized distribution of L distances: L = n × 167 bp. A more realisticinterpretation of the observed quantized distributions is to relax this very strict constraint allowing the nucleosomal repeatlength (NRL) to fluctuate around a given value with some lower and upper bounds: lmin < NRL = L/(n − 1) ≃ L/n < lmax.Then n-crystal stateswill exist in the range (n−1)lmin < L < (n−1)lmax. When L slightly increases (nlmin < L < (n−1)lmax),there is possible coexistence of n- and (n + 1)-crystal states. Similarly, when L slightly decreases ((n − 1)lmin < L <

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 77

Fig. 22. 2D-map of nucleosome occupancy along yeast genes (same coding as in Fig. 21(a)): (a) in vitro data retrieved from Ref. [216]; (b) theoreticalphysical model for parameter values δE = 2 kT and µ = −6 kT (low nucleosome density); (c) 2D-map of nucleosome occupancy along yeast intergenicregions: in vivo data retrieved from Ref. [205].

(n− 2)lmax), there is possible coexistence of (n− 1)- and n-crystal states. This very tentative scenario raises the issue of thecrystallizationmechanisms thatwould produce suchNRL distribution and in turn the succession of n-crystal genes separatedby bi-stable (multi-stable) genes observed in vivo (Fig. 21(a)).

As reported in Table 3, we have investigated whether crystal-like and bi-stable gene classes display some enrichment inspecific functions [83]. We have only observed a slightly significant enrichment in cytoskeleton organization and biogenesisgenes among bi-stable genes (56%, P = 10−2) and in genes involved in translation among crystal-like genes (74%, P = 10−2).

3.4.3. Intragenic chromatin architecture conforms to equilibrium statistical ordering principlesTo test whether a statistical ordering mechanism induced by excluding boundaries at gene extremities may account for

the highly organized nucleosomal yeast chromatin observed in vivo, let us simplify even more the nucleosome formationenergy landscape by assuming that it is flat (no sequence effect) inside the genes and bordered by two linear energy barriersthat amounts to impose a constant force on both sides of the intragenic nucleosome array (Fig. 24(a)) [83]:

E(s) = EM s < s1 = sTSS −1

E(s) = EM(1 − (s − s1)/1) s1 < s < sTSSE(s) = EM for sTSS < s < s2 = sTTS −1

E(s) = EM(s − s2)/1 s2 < s < sTTSE(s) = EM s > sTTS

(10)

78 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 23. Average experimental nucleosome occupancy profiles of crystal-like and bi-stable genes (black), together with some examples of individual geneprofiles (grey): (a) crystal geneswith n = 5 nucleosomes; (b) bi-stable genes at the transition between the n = 5 and n = 6 crystal gene domains; (c) crystalgenes with n = 6 nucleosomes. Power spectrum (PS) S2(k) = S2(k)/

∑i=1 S

2(i) of nucleosome occupancy profiles: (d) PS of one crystal gene profile witha dominant peak (>0.06) at n = 5 nucleosomes; (e) PS of one bi-stable gene with two main peaks of comparable magnitude at n = 5 and 6 nucleosomes;(f) PS of one crystal gene profile with a dominant peak at n = 6 nucleosomes. (g) Total number of genes, (h) proportion (density) of crystal genes and (i)proportion (density) of bi-stable genes as a function of the gene size L; the plotted data correspond to average values computed in a 50 bp sliding window.In (g)–(i), the vertical grey bands define bi-stability domains and the vertical dotted lines indicate the successive L = n × 167, with n = 2, 3, . . ..Source: Adapted from Ref. [83].

Fig. 24. Theoretical probability of nucleosome occupancy at each point of a box bordered by two linear energy barriers (see Eq. (10)): (a) box large enoughto shelter 5 nucleosomes (solid line); (b) larger box where the two n = 5 (dotted line) and n = 6 (dashed line) configurations are possible; the weightedaverage of the 5 and 6 nucleosome crystal-like profiles yields an apparently irregular profile (solid line); (c) larger boxwhere 6 nucleosomes can be inserted(solid line).Source: Adapted from Ref. [83].

Small genes (L . 1.5 kbp): As shown in Fig. 21(c), when fixing the height EM = 6 kT and the width 1 = 80 bp of thebordering linear energy barriers, we get for µ = 1 kT so that nucleosomes cover 75% of the yeast genome, a 2D-mapof nucleosome occupancy in remarkable agreement with the in vivo data [83]. Indeed this theoretical 2D-map appears asthe backbone of the noisy 2D-map obtained with our physical model (Fig. 21(b)) that accounts for the sequence-inducedenergy landscape fluctuations. In that respect it can be considered as the footprint of intragenic nucleosome ordering underconstant force imposed by the bordering NFRs. This very simplemodel actually provides some understanding (Fig. 24) of thealternation of crystal-like and bistable nucleosome patterns observed in vivowhen increasing the distance L (∼L+ 188 bp)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 79

Table 3Numbers of S. cerevisiae genes displaying a given Gene Ontology (GO) that areclassified as bi-stable or crystal-like [83].

GO-term Bistable Crystal-like

Cytoskeleton organization and biogenesis 18 14Translation 46 127Cytokinesis 18 16Response to stress 23 66Pseudohyphal growth 14 14Amino acid and derivative metabolic process 33 43Protein modification 55 82Signal transduction 14 16Vitamin metabolic process 3 13Conjugation 9 9DNA metabolic process 74 117Transport 84 135Cell budding 10 11Transcription 33 76Cell homeostasis 11 13Biological process unknown 262 516Lipid metabolic process 18 44Electron transport 2 8Meiosis 13 18Vesicle mediated transport 27 59Carbohydrate metabolic process 16 36Ribosome biogenesis and assembly 51 104Cellular respiration 9 21Organelle organization and biogenesis 37 75Sporulation 6 8Protein catabolic process 14 29Generation of precursor metabolites and energy 8 13Cell cycle 27 52RNA metabolic process 49 90Anatomical structure morphogenesis 5 8Cell wall organization and biogenesis 25 46Membrane organization and biogenesis 11 21Nuclear organization and biogenesis 5 10

between the bordering NFRs (Fig. 23). Indeed, the gene nucleosome occupancy profile is the sum of n = 1, 2, . . . nucleosomeoccupancy profiles. The theoretical weighted probability of a n-nucleosome configuration according to its size L is shownin Fig. 25(a). For small genes, this model predicts that in between the crystal n (Fig. 24(a)) and n + 1 (Fig. 24(c)) domains,there is a coexistence domain (Fig. 25(a)) where these two (or more) crystalline configurations contribute statistically toan apparently irregular occupancy profile (Fig. 24(b)). Note that in good agreement with in vivo observations, the domainsof ‘‘crystallization’’ are characterized by a NRL that increases with L over the range: 140 bp . lmin < NRL ∼ L/n <lmax . 200 bp (Fig. 25(b)) [82,83]. Globally, crystallization is observed up to gene sizes of at least 10 nucleosomes, as shownin the in vivo 2D-map. In our modeling, the 2D map geometry is symmetrical with respect to gene center by constructionalthough in vivo the 3′ NFR range of influence is slightly smaller than on the 5′ side. This elementary thermodynamicalmodeling thus provides a simple interpretation of the crucial role of the inter-barrier distance as a fine control parameter ofchromatin structure, and demonstrates that the complex pattern of intragenic nucleosome distribution mainly depends onthe presence of the bordering NFRs and not somuch on the gene sequence [83]. From the characteristics of the linear energybarriers mimicking the NFRs at gene extremities, we get the following estimate of the force they apply on the nucleosomalarray:

F =EM1

=6 kT

27.2 nm∼ 1 pN. (11)

This estimate is significantly smaller than the force required for full DNA unwrapping as measured in previous pullingexperiments, namely ∼6 pN for simple nucleosome [250] and substantially more ∼15–25 pN [251–256] in the contextof a nucleosomal array. A possible explanation of these rather high unwrapping forces is the fact that most of these pullingexperiments have used DNA arrays consisting of repeats of strongly positioning sequences (e.g. 601 sequence [190]) thatoverstabilize the wrapped conformation. Note however that 1 pN is comparable and actually a little less than the fewpiconewton tensions generated by polymerases [257,258] and helicases [259,260] suggesting that these enzymes candrastically affect the nucleosomepattern observed inside yeast genes. Hopefully equilibrium statistical nucleosomeordering(likely resulting from the action of chromatin remodeling factors, see Section 3.6) will be recovered along most genes witha characteristic time much shorter than the typical time separating the successive chromatin alterations induced by theseexternal factors.

80 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 25. (a) Theoretical probability of crystal configurations with a fixed number n of nucleosomes vs. the box size; the energy profile is given by Eq. (10)(Fig. 24(a)). Vertical dashed lines correspond to the interbarrier distances used in Fig. 24(a)–(c) respectively. (b) NRL dependency on the box size: thin blackdotted lines correspond to a fixed number n of nucleosomes and the thick black dotted lines to the NRL at a fixed nucleosome density (∼85%); grey dotscorrespond to individual crystal gene values. Vertical grey-shaded bands correspond to the experimental ‘‘bi-stable’’ L-domains [83].

Fig. 26. Example of nucleosome occupancy profile along a large gene. Profile of the KOG1 gene (TSS = 480 731, TTS = 475 999) on yeast chromosomeVIII: (a) in vivo data from Ref. [205]; (b) Physical model when imposing artificial excluding energy barriers at TSS and/or TTS when not predicted by thesequence (δE = 0.75 kT, µ = 1.5 kT); (c) in vitro data from Ref. [216]. (a’–c’) represent a zoom on the 1 kbp central region of the KOG1 gene. To determinethe nucleosomes that can be considered as well positioned (dots), we smooth the nucleosome occupancy data by a (σ = 18 bp) Gaussian window andthen select the maxima above a 0.20 (in vivo), 0.15 (in vitro) and 0.718 (theoretical P(s)) threshold. The vertical dotted–dashed line marks the location of anucleosome detected in vitro as well as in vivo and also predicted by our theoretical modeling. In (b’), the blue dashed-line corresponds to the predictionof our toy model with energy barriers bordering a flat energy potential.

Long genes (L > 3 kbp): This understanding of intragenic nucleosome organization in terms of statistical ordering is furthersupported by the nucleosome positioning observed in vivo in the central region (1 kbp) of the N = 483 longest yeastgenes with L > 3000 bp. In these regions, far away from the influence of the bordering inhibitory energy barriers, 1270nucleosomes were found to be well positioned corresponding to ∼39% coverage of the sequence [83], i.e. a number smallerthan the coverage∼80% observed close to the gene extremities due to boundary confining, but significantly larger than zeroas an indication that the sequence plays some role in the statistical positioning of central nucleosomes (Fig. 26). Importantly,among this set of well positioned central nucleosomes observed in vivo [205], only 191 correspond to (at a 35 bp precision)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 81

well positioned nucleosomes also observed in the in vitro data [216]. Since intrinsic histone-DNA interactions are likely toresult into well positioned nucleosomes at the lower in vitro nucleosome density, this means that less than 15% (191/1270)of in vivowell positioned nucleosomes in central intragenic regions can be attributed to strongly positioning DNA sequences.Again themajor determinant of nucleosomepositioning is statistical ordering, no longer imposed by the bordering excludingbarriers but by the fluctuations in the central energy landscape induced by the DNA sequence [83]. As expected, our toymodel with constant energy profile E(s) between the bordering barriers does not generate any well positioned nucleosomein the central region of large genes (data not shown). When considering our physical model, the corresponding nucleosomeoccupancy profiles exhibit some central nucleosome positioning (Fig. 26(b’)). Indeed 1309 nucleosomes are predicted tobe well positioned (corresponding to ∼40% coverage of the sequence) among which we recover 28% (360/1270) of the invivo central nucleosomes including 53% (100/191) of the ones also observed in vitro. Moreover, when shuffling the genesequence, we recover a similar number ∼1140 of well positioned nucleosomes as the signature of statistical ordering in anoisy energy landscape but we lose a majority (74/100) of the nucleosomes intrinsically positioned by sequences with highaffinity to histones.

Altogether the results reported in this section [82,83] agreewith the central conclusion of recent experimental in vitro andin vivo studies of nucleosome positioning in S. cerevisiae [217], which provides additional evidence that intrinsic histone-DNA interactions make only a modest contribution to the in vivo intragenic nucleosome positioning pattern, the majordeterminant being the statistical ordering mainly induced by the excluding NFRs located at gene extremities.

3.5. Distinct modes of gene regulation by chromatin nucleosomal organization

3.5.1. Transcription initiation regulation by promoter nucleosome occupancyThe role of chromatin structure (nucleosome density and positioning) on gene expression regulation has been mostly

investigated at the level of transcription initiation and different regulation strategies associated to different kinds ofpromoter structural design have been revealed [205,243,246,261,262]. As reported in Section 3.3, recent high-resolutionexperiments have revealed a specific gene promoter pattern characterized by a nucleosome depleted region (NFR), of typicalsize 100–200 bp, upstream the TSS and flanked by periodically ordered nucleosomes, notably at the gene 5′ end [201,204,205,210,213]. However, we can further distinguish from this ‘‘average’’ picture some structural gene classes: if amajority of genes presents indeed this stereotypic pattern, a significant subset reveals a depleted but delocalized regionand another subset does not present any noticeable NFR [205,243]. Interestingly, the expression level has been shown to be(negatively) correlated with the absolute nucleosome occupancy at promoter whereas the relative nucleosome occupancywould rather specify the logic of expression regulation and control the ability for genes to vary (adapt) their expressionunder environmental changes [243,262]. For example, Tirosh and Barkai [243] clearly showed that genes that present in vivohigher (relative) occupancy at the 150 bp vicinity of the TSS (‘‘OPN’’ for occupied proximal nucleosome, see Fig. 27(c)), revealhigher transcriptional plasticity, expressionnoise and expressiondivergence [243,263]. In contrast, geneswith a pronouncedproximal depletion region (‘‘DPN’’ for depleted proximal nucleosome, see Fig. 27(d)) correspond to constitutively expressedgenes and are less versatile. These observations are corroborated by the fact that as shown in Fig. 27(b), DPNare also observedin vitro [216] as well as in our physical model at low nucleosome density as resulting from the presence of an inhibitoryenergy barrier encoded in the DNA sequence. Note that if the mean energy barrier observed at DPN is not high enough toaccount statistically for the NFRs observed in vivo at a majority of yeast genes (Fig. 27(d)), the in vivo mean nucleosomeoccupancy profile at OPN (Fig. 27(c)) also displays some nucleosome statistical ordering towards the ORF gene that is notreproduced by our physicalmodel. Consistentwith these observations, there is amain difference in the spatial distribution ofTFS in yeast gene promoters. They aremainly localized at the NFRs of DPNswhereas for the OPN, they are broadly distributedover all the promoter region. This suggests that, at promoters, some genes have a rather ‘‘static’’ chromatin pattern with a‘‘stable (stereotypic)’’ nucleosome depleted region facilitating the permanent access/recruitment of activators/repressorswhile some others present a more homogeneous and dynamic chromatin reflecting a finer level (program) of regulation bya nucleosome positioning control at regulatory sites [243,261,262,264].

Changes in transcriptional regulation during evolution are important for generating phenotypic diversity among species,but the mechanisms underlying these regulatory changes are far from being understood. In a recent work [215], geneexpression changes were shown to be linked to a co-evolution of promoter nucleosome organization. In anaerobic yeastspecies, where cellular respiration genes are relatively inactive under typical growth conditions, the promoter sequences ofthe respiration genes encode a relatively closed (OPN) chromatin organization. In contrast in aerobic yeast species whererespiration genes are active under typical growth conditions, their promoter sequences encode a relatively open (DPN)nucleosome organization. These results [215] suggest that these changes in the expression of yeast respiration genes isachieved, at least in part, through DNA sequence evolution that alters the nucleosome occupancy profile in their promoterregions.

3.5.2. A novel strategy of transcription regulation by intragenic nucleosome ordering

Intragenic nucleosome density correlates positively with transcription rate: In this section, we investigate the functionalimplications of the intragenic chromatin structure by analyzing howdifferent features relate to the two types of nucleosomalorganization identified in the in vivo data [83]. We first examine the way transcription rate, estimated by the Pol II

82 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 27. Average nucleosome occupancy profiles plotted around the 5′ nucleosome of (N = 503) OPN (a, c) and (N = 458) DPN (b, d) yeast genes [243].The colored curves have the same meaning as in Figs. 18 and 19: in vitro data (orange), in vivo data (red), physical model δE = 2 kT, µ = −6 kT (lightblue) and −1.3 kT (dark blue). In (a,b) are shown for comparison the corresponding theoretical energy profiles. Note that in (c, d), the mean theoreticalnucleosome occupancy profile has beenmultiplied by a factor of 3 indicating that the in vivo data result from the combination of ‘‘intrinsic’’ and ‘‘extrinsic’’(TFs, chromatin regulators, Pols) sequence specific force fields. (For interpretation of the references to color in this figure legend, the reader is referred tothe web version of this article.)

Fig. 28. Bi-stable nucleosome organization controls gene expression. Sliding window (50 bp) analysis of average (a) transcription rate estimated by Pol IIdensity [242], (b) nucleosome repeat length (dark blue) and intragenic mean nucleosome occupancy (light blue), (c) transcriptional plasticity [243], (d) H3turnover rate [265], (e) sensitivity to chromatin regulators disruption [244] and (f) H2A.Z occupancy [243,266] as a function of the distance L. The verticalgrey bands define bi-stability domains. The dotted curves correspond to the results obtained when excluding the 175 (/4554) ribosomal protein genesfrom the analysis; no significant changes are observed. (For interpretation of the references to color in this figure legend, the reader is referred to the webversion of this article.)Source: Reprinted from Ref. [83].

density [242], relates to L distance (Fig. 28(a)). In individual crystal n-domains, the transcription rate decreases whenincreasing L. Some significant anti-correlation is actually observed between inter-nucleosome distance (Fig. 28(b)) and

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 83

transcription rate in n-crystal domains: for n = 3, r = −0.2, P = 3.6 × 10−2; n = 4, r = −0.24, P = 7 × 10−4;n = 5, r = −0.21, P = 3.5 × 10−3; n = 6, r = −0.3, P = 1.3 × 10−3 and n = 7, r = −0.23, P = 5 × 10−3. Hence,the more compact the nucleosomal array, the shorter the linker size, the higher the transcription rate. Consistently, in theintermediate bi-stable domains between the n- and (n+1)-crystal domains, we observe a local and sharp increase of the ratecoincidingwith an increasing proportion of the compact (n+1)-crystal pattern that coexists with themore diluted n-crystalchromatin state. Since intragenic nucleosome arrangement may drastically affect the elongation process, our analysis raisesthe question of why a well-ordered and regularly compacted nucleosome array would enhance the transcription rate [83].As previously proposed [264], a compact 10 nm chromatin fiber is likely to restrict aberrant binding of transcription factorsor of other factors thatmay perturb the proper progression of Pol II and/or induce cryptic transcription. Crystallization of thenucleosome array actually means stronger linear confinement, i.e.well-defined nucleosome positioning within genes. Pol IIelongation has to deal with chromatin structure and in particular requires structural regulation that involves nucleosomemodifications and remodeling. We guess that a regular intragenic nucleosome array might, upon gene activation, enhancetranscription rate by reducing intrinsic disturbances. According to geometrical modeling of the 30 nm chromatin fiber[267–275], a short linker size would rather lead to an openwell ordered chromatin secondary structure that would facilitatethe sequential action of chromatin regulators associated with Pol II progression such as the FACT complex [276], as wellas the action of chromatin modifiers. Recently, reversomes (for reverse nucleosomes, built on a right-handed tetrasome)were proposed to form under the action of a DNA supercoiling wave pushed in front of the RNA polymerase [277]. Thenucleosome reversome would facilitate transcription elongation by giving the RNA polymerase a lever to break the dockingof the H2A-H2B dimers which otherwise exerts a stringent blocking against transcription in absence of other factors [278].By a dynamical domino-like effect, the elongation process of compact regularly spaced nucleosomal intragenic arrays mightresult in a reversome wave that would progress faster than the RNA polymerase.Genes showing a bi-stable dynamic chromatin structure are expression regulated genes: The distribution of transcriptional plas-ticity,which quantifies for each gene the variation of expression level as the result of environmental condition changes [243],presents large variations with the L distance, bi-stable (resp. crystal) genes presenting high (resp. low) plasticity values(Fig. 28(c)). Bi-stable genes present, at least, two crystal chromatin states, a diluted (NRL = L/(n − 1)) and a compact (NRL= L/n) state respectively. According to our model, their transcription rate profile results from the contributions of highand low transcription rates. These genes may thus adopt one of the two crystal states and switch from one to the otherstate under some perturbation or condition changes, e.g. a very small change in the L distance, that would in turn lead tostrong changes of transcription rate and to high level of transcriptional plasticity. In full agreement with this suggestion,we observe that bi-stable gene density is significantly correlated to transcriptional plasticity (r = 0.42, P = 1.1 × 10−3).These properties of the plasticity distribution strongly sustain our model of the nucleosomal array compaction-dependenceof transcription rate [83].

In a recent study, it has been shown that gene expression can substantially vary when disrupting chromatin regulators,including chromatin modifiers such as HATs, HDACs, HMTs and ATP-dependent chromatin remodeling factors [244]. Thedistribution of sensitivity to chromatin regulators reveals that bi-stable genes are in average significantly more sensitivethan crystal genes to such a lack of structural regulation (Fig. 28(e)). This is consistentwith the fact that chromatin regulatorsmay control the inter-barrier distance L and in turn the distance L (∼L − 188 bp) via the statistical ordering imposed bynucleosome excluding barriers. Note that in the precise case of Isw2 remodeling [208,232], we did not obtain conclusiveresults concerning the crystal/bi-stable nature of the target genes (data not shown). When compared to crystal genes, bi-stable genes are thus highly plastic, with a wide dynamic range of expression level as the signature of a more dynamic andregulated chromatin structure [83].Intragenic chromatin structure and epigenetic marks: We next examine the histone variant H2A.Z occupancy [243,266] andthe histone H3 turnover rate [265] within crystal and bi-stable gene domains [83]. Crystal genes present a high enrichmentinH2A.Z (Fig. 28(f)), whereas bi-stable genes present a highH3 turnover rate (Fig. 28(d)). These results are in agreementwiththe hypothesis that the H2A.Z histone variant, which is mainly deposited at the 5′ nucleosome position [204], contributesto stabilize this nucleosome thereby reinforcing the position and possibly the nucleosome exclusion strength of the 5′ endNFR. This would in turn enhance the crystallization properties (periodicity and phasing) of the nucleosome array along thegene. Consequently H2A.Z incorporation may contribute to lock the chromatin into a very stable crystal state. By contrast,low H2A.Z enrichment in the 5′ nucleosome of bi-stable genes may favor the ability for these genes to change from oneto the other of the n- and (n + 1)-nucleosome states [83]. On the other hand, the high H3 turnover rate observed for bi-stable genes likely reflects the dynamical nature of their chromatin resulting from a facilitated transition between the n- and(n+ 1)-nucleosome states. More frequent eviction and reassembly of one nucleosome may thus improve a de novo histoneH3 deposition [83].Intron containing genes preferentially localize within bi-stable domains: To examine possible relation between chromatinstate and the presence of introns, we have selected the 107 genes containing one intron from the 4554 yeast genes [83].As shown in Fig. 29(a), these intron containing genes are mainly found within bi-stable domains [83]. This observationsuggests that intron size in yeast might work as a mechanism allowing for the transition of a gene from one to anothercrystal-like/bi-stable nucleosomal architecture. Observation that intronic size distribution is bi-modal with peaks at 90 and410 bp (Fig. 29(b)) sustains this hypothesis. Note that similar bi-modal distributions are also obtained for Saccharomyceskluywerii, Kluyveromyces thermotolerans and Debaryomyces hansenii (data not shown). Indeed, 90 bp is about the distance

84 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 29. (a) Proportion of single intron containing yeast genes (107/4554) as a function of the gene size L; the vertical grey bands define bi-stabilitydomains. The dotted curves correspond to the density of ribosomal protein genes (54) (black dotted curve) and to the density of the non-ribosomal proteingenes (53) (red dotted curve). (b) Histogram of intron sizes in single-intron containing genes. (For interpretation of the references to color in this figurelegend, the reader is referred to the web version of this article.)Source: Reprinted from Ref. [83].

Fig. 30. Venn diagrams showing the repartition of DPN and OPN yeast genes [243] among crystal and bistable genes [83].

that would allow a gene to change from bi-stable to crystal chromatin state (and vice-versa). Unfortunately, when using ourmodel to compute the nucleosome occupancy profiles on intron-containing intragenic sequences with and after removingthe intron [279], the sampling of each categories (crystal, bi-stable, other) was too small to draw significant conclusions.An alternative hypothesis relies on recent studies indicating that splicing regulation might be associated with chromatin-mediated regulation of transcriptional elongation, involving the action of remodeling factors such as SWI/SNF [280,281]. Aproper chromatin structure might thus be favored in order to facilitate/regulate intron splicing.

Ribosomal protein genes are generally short, highly transcribed and intron containing. Among the 107 single introncontaining genes, the subset of 54 ribosomal protein genes is significantly enriched in bi-stable genes (Fig. 29(a)), consistentwith our previous observation concerning the high plasticity level of bi-stable genes (Fig. 28(c)) [83].

3.5.3. Promoter versus intragenic nucleosome organization: distinct modes of gene regulation?Interestingly, when comparing the results reported in Sections 3.5.1 and 3.5.2, similar functional/structural relationships

are obtained for the intragenic chromatin [83] as those observed by Tirosh and Barkai [243] at promoters. Crystal genes, asDPN genes, correspond to constitutively expressed genes with a well-defined and stable chromatin pattern, i.e. a periodicnucleosome arrangement for crystal genes and a well-localized and pronounced NFR for DPN (Fig. 27(d)). On the otherhand, bi-stable genes, as OPN genes, correspond to transcriptionally plastic genes with a more dynamic and regulatedchromatin. We observe an apparently irregular pattern for bi-stable genes (Fig. 23(b)) that actually corresponds to thestatistical superposition of two crystal states and an extended less depleted region for OPN genes (Fig. 27(c)). Howeverdespite similar structural regulation strategies, we do not observe any significant correlation between the bi-stable/crystalnature of genes and their OPN/DPN promoter architecture (Fig. 30). For both crystal-like and bi-stable genes, nucleosomeorganization results from the presence of NFRs at gene extremities. These gene categories are thus expected to correlatewith the DPN gene class that reflects the presence of a deep trough in the nucleosome occupancy profile upstream of theTSS. DPN genes are significantly enriched in crystal and bi-stable genes (56% are crystal and 25% are bi-stable) as comparedto OPN genes (40% are crystal and 20% are bi-stable). This latter level of enrichment may indicate that some OPN genescan have a 5′ NFR with more variable phasing with respect to the TSS but that still induces nucleosome confinement anda periodic ordering for short gene sizes as confirmed by the weaker periodicity observed at the 5′ gene end in the averagenucleosome occupancy profile (Fig. 27(c)).

The nucleosomal structure of promoters and its implications in transcription initiation has been studied in variousorganisms from yeast to human, including the nematode and Drosophila [201,205,207,210,213,214,240,243,282,283]. InRefs. [82,83], we have identified new regulation mechanisms in S. cerevisiae that involve intragenic chromatin structure. Towhat extent these chromatin mediated regulation processes generalize to distant eukaryotic species is a very challengingquestion for future studies.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 85

Fig. 31. Nucleosome occupancy profiles observed in vivo (red) and predicted by our physical model for parameter values δE = 2 kT, µ = −1.3 kT(dark blue) along fragments of yeast chromosome II (a), VII (b), II (c) and VI (d). For comparison are also represented the corresponding theoretical energylandscapes (green). The dots represent the positions of TSS (red) and TFS (black). (For interpretation of the references to color in this figure legend, thereader is referred to the web version of this article.)

3.6. ‘‘Intrinsic’’ versus ‘‘extrinsic’’ nucleosome positioning

The small-scale chromatin structure, as defined by the local nucleosome occupancy, conditions the regulation oftranscription in particular by modulating the accessibility of TFs to their cognate regulatory sites [7,246,264,284,285].Actually, as seen in the previous sections, the nucleosome occupancy profile predicted, directly from the DNA sequence,by our physical model using a grand canonical description, accounts remarkably well for the nucleosome occupancy profileobserved in vitro (Section 3.2) [81,82]. However the comparison with in vivo data reveals that the ‘‘intrinsic’’ nucleosomepositioning encoded in the sequence can be influenced and perturbed by the action of ‘‘extrinsic’’ factors like TFs and ATP-dependent remodelers [263,286].

3.6.1. Transcription factorsTFs can influence nucleosome positioning in vivo by competing with histones to access to their DNA target sites [287].

The outcome of this competition likely depends on the relative affinities of the nucleosomes and TFs to the underlying DNAsequence but also on their relative concentrations [286]. As shown in Fig. 31(a) and (b), when comparing the nucleosomeoccupancy profile predicted by our physical model at high nucleosome density with the in vivo yeast data, we observemainly two kinds of differences. There are locations where a NFR is observed in vivo but not in vitro as predicted by ourphysicalmodel (Fig. 31(a)). Atmany other locationswhere some nucleosome depletion is predicted by ourmodel, the in vivonucleosome occupancy profile displays a deeper and pronounced depleted region. At a larger scale, as seen in the intergenicregion between the two divergent yeast genes in Fig. 31(b), the high concentration of TFs coincides with a significantlowering of the in vivo mean nucleosome occupancy. Importantly, this lowering has not disturbed the regular nucleosomeordering predicted by the physical model consistent with the emerging view that nucleosome and TFs compete to occupyDNA in thermodynamic equilibrium [246,286,288]. Evenmore interesting, TFS reside in the predicted nucleosome depletedlinker regions of the intergenic regular nucleosome array suggesting some cooperativity in TF binding [289–292] as the resultof a collaborative competition against ‘‘intrinsic’’ collective nucleosome ordering and not just of specific protein–proteininteractions. Note that the theoretical NFR of the anti-sense gene at the left of Fig. 31(b) has been disturbed (mainly shifted)by the binding of TFs that have induced a second NFR nearby possibly catalyzed by remodeling factors.

3.6.2. ATP-dependent chromatin remodeling factorsNucleosome positioning can also be controlled by a family of enzymes that consume the energy from ATP hydrolysis

to move nucleosomes to different locations along the DNA or even to disassemble nucleosomes [208,213,233,245,261,293–300]. In vitro thesemolecular motors were shown to actively drive nucleosomes away from presumed equilibrium positions[301,302] and this regardless of the underlying DNA sequence. As illustrated in Fig. 31(c) and (d), some in vivo nucleosome

86 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

occupancy patterns including the 5′ NFR (resp. 3′ NFR) and the flanking ordered nucleosomes inside the corresponding genesare globally shifted by∼50–100 bp from their predicted positions by our DNA sequence directed grand canonical modeling.These distances are typical of distances over which these ATP consuming machines are known to operate in vivo [208,213,245]. This ‘‘extrinsic’’ nucleosome positioning under the action of remodeling factors is likely to explain the strong phasingof the 5′ NFR with respect to the gene TSS observed in vivo as compared to our physical model predictions (Fig. 19(a)).

But as noticed by Rippe et al. [302], some remodeling complexes only change the relative nucleosome occupancywithoutaltering nucleosome positions. This suggest that equilibrium positioning can also be relevant even during remodelingactivity. Since these chromatin remodelers are found all over the yeast chromosomes, they may contribute to increasethe effective temperature so that thermal equilibrium is attained much faster, in particular along most yeast genes asdiscussed in Section 3.4. In that context, our grand canonical physical model can be used as a theoretical reference for invitro nucleosome positioning whose comparison with in vivo data is likely to provide very instructive information for futuremodeling of both remodeler [234] and TF [288] driven ‘‘extrinsic’’ nucleosome positioning.

4. Atomic force microscopy imaging of sequence effect on the first step of DNA compaction inside eukaryotic nuclei

4.1. Motivations

During the nineties, several experimental and theoretical studies have investigated the effect of the sequence on theelastic properties of long DNA chains. In particular, micromanipulation experiments performed on single molecules ofDNA have provided very interesting information about the elastic response of these macromolecules to external stretching[303–308] and twisting [307,309,310] forces. One remarkable feature revealed by these pioneering experiments is the factthat the extension vs. force data are very well described by a simple elastic model, the so-called Worm-Like-Chain (WLC)model [311–317] with a single elastic constant, the persistence length Aeff that characterizes the DNA bending rigidity(see Appendix C). In physiological salt concentrations, at room temperature, most experiments yield similar estimates ofAeff ≃ 50 nm (∼170 bp). In the presence of a twisting constraint, the Rod-Like-Chain (RLC)model [318–322], which involvesan extra parameter, the twist persistence length Ceff , reproduces quite well the experimental extension vs. supercoilingcurves (f . 0.5 pN) with Ceff between 75 and 110 nm [318–322]. But the values of the bend and twist persistence lengthsmeasured by torsionally constrained stretching experiments, namely Aeff and Ceff , do not correspond to the genuine bend (A)and twist (C) rigidities of the DNA double helix since the intrinsic structural disorder induced by the sequence contributesto the elastic response to the external stresses. In the late eighties, Trifonov et al. [323] suggested that Aeff can be expressedas:

1/Aeff = 1/A + 1/Ao, (12)

where Ao is the ‘‘static’’ bend persistence length of the random walk defined by the axis of the DNA double helix in theabsence of any thermal fluctuations, and A is the ‘‘dynamic’’ bend persistence length of a DNA double helix in the absenceof any intrinsic structural disorder. Eq. (12) has received some early theoretical and computational confirmation [324,325].Experimentally, from the investigation of native [323,326] and ‘‘intrinsically straight’’ synthetic [327–329] DNA, Eq. (12) hasled to values of A ranging from 60 up to 210 nm, as compared to the generally accepted value Aeff ≃ 50 nm. More recently,Nelson [330] has shown that Eq. (12) is correct in the limit of weak structural disorder, i.e.:

Aeff = A(1 − A/Ao), for A/Ao ≪ 1. (13)

This correction differs from Aeff = A(1−√A/Ao/2) found by Bensimon et al. [331] in a random version of the Kratky–Porod

model [311]. Also, according to Nelson’s calculations [330], the twist persistence length would not suffer such a correction:Ceff = C .

In a previouswork [332], we have revisited the results of single-molecule DNA stretching experiments using a RLCmodelthat explicitly includes some intrinsic LRC structural disorder induced by the sequence. The investigation of artificial and realsequences has confirmed that theWLCmodel reproduces quite well the data but with an effective bend stiffness Aeff , whichunderestimates the true elastic bend stiffness according to Trifonov et al. formula (12) and this independently of the elastictwist stiffness C . In particular, we have shown that this RLC model provides a remarkable fit of the bacteriophage λ-DNAwhen considering A = 70± 10 nm, in good agreement with previous experimental estimates of the ‘‘dynamic’’ persistencelength [326–328]. But themainmessage of this full 3Dmodeling of the double-helix elastic response to external stress is thefact that the extension vs. force RLCmodel predictionsmainly depend on the amplitude σo of the structural disorder and turnout to be rather insensitive to the presence of LRC in the sequence. As illustrated in Fig. 32, the presence of LRC structuraldisorder is likely to become noticeable when confining the DNA heteropolymer in a plane (2D). When comparing somerealizations of uncorrelated and LRC ‘‘unconstrained’’ 2D (equilibrium) trajectories, due to the persistence of the double-helix axis orientation fluctuations, LRC 2D spontaneous trajectories are more looped than the uncorrelated ones. One maythus suspect that LRC structural disorder can significantly affect the thermodynamical properties of semi-flexible chains likeDNAwhen confined in 2D. Since the DNA loops around histone octamer are approximately planar [8–10], the LRC structuraldisorder is likely to play a fundamental role in the formation, the positioning and the dynamics of nucleosomes.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 87

Fig. 32. Frozen trajectories for two 2D semi-flexible chains of length L = 3000 bp with uncorrelated (H = 0.5, blue) and LRC (H = 0.8, red) structuraldisorder (σo = 0.05). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Reprinted from Ref. [31].

4.2. Probing persistence in DNA curvature properties with atomic force microscopy

Various experimental techniques have been used to investigate DNA bending and flexibility, including X-ray crystallog-raphy, gel electrophoresis, electric birefringence, DNA cyclization, NMR, high resolution microscopy (Electron Microscopyand Atomic Force Microscopy) [327,333–341]. Single molecule experiments have also provided very interesting informa-tion about the elastic response of rather long (several tens of kbp) DNA fragments to external stretching forces [303–308].Among these methods, because it provides a visualization of the 2D conformations of a DNA molecules, AFM has provedto be a very efficient technique, not only because it can be used to estimate the persistence length of DNA via end-to-enddistance measurements [336–341], but more interestingly because it can give access, via some thermodynamical averagingover a population of these conformations, to both the intrinsic DNA curvature and bending flexibility and this with a surpris-ingly good resolution of a few double helix pitches [339,342–345]. During the past decade, DNA structures imaged by AFMhave been analyzed in the framework of the WLC model (see Appendix C) that accounts for the entropic elastic behavior ofan intrinsically straight semi-flexible polymer [339–341]. In this section, our goal is to use AFM in air and in liquid to bringexperimental evidence of sequence-induced LRC in naked DNA molecules deposited onto mica under 2D thermodynamicequilibrium conditions [84,85], thus making the experimental estimate of the persistence length amenable to treatment bythe generalized 2DWLC model described in the Appendix C [80,346].

4.2.1. Materials and methodsDesign and preparation of the DNA fragments: To illustrate our purpose, we have selected three different types of DNA

molecules. (i) Intrinsically straight DNA fragment: an intrinsically straight 200 bp DNA fragmentwas obtained from a plasmidkindly provided by Vologodskii [329]. Four of these 200 bp straight DNAwere ligated together to produce a straightmoleculeof total length L = 800 bp. Special care was taken when designing restriction and ligation sites so that the overall intrinsiccurvature (at scale above the helical pitch) remains zero. (ii) Uncorrelated Hepatitis C RNA virus fragment: a fragment ofL = 2200 bp ((G + C) = 50%) was extracted from the HCV genome via EcoRI digestion of phCMVcE1E2 plasmid [347].HCV belongs to the family of non-retrotranscribed RNA viruses that display an uncorrelated bending profile [15,16] (Section2.6). (iii)Human genomic DNAmolecules: amixture of DNAmolecules of size ranging from 1000 to 3000 bp, representative ofhuman complete genomewas obtained byDraI digestion of genomic DNA. (iv)HumanDNA fragments of different GC contents:Four repeated sequence free loci of size L ≃ 2200 bp were delineated along the human genome using UCSC data (May 2004release): (α) chr 21: 21524828–21527029, L = 2002 bp, G + C = 31%; (β) chr 21: 34394438–34396643, L = 2006 bp,G+C = 38%; (γ ) chr 21: 37044232–37046437, L = 2006 bp, G+C = 43%; (δ) chr 8: 125789398–125791587, L = 2189 bp,G + C = 46%. They were extracted by PCR amplification, cloned in pGEMTeasy vector and digested by NotI and EcoRI.

All DNA fragments of interest were purified by agarose gel electrophoresis without DNA intercalating dyes. Prior to theAFM analysis, we have performed a LRC analysis of the corresponding DNA chain bending profiles generated with the PNuctrinucleotide coding table (Table 1), using the same wavelet-based methodology as reported in Section 2. For the RNA virussequence we confirm that, over a whole range of scales from 10 to 2200 bp, we observe a nice scaling behavior with Hurstexponent H = 1/2 as the signature of uncorrelated bending fluctuations. We detect LRC in the four human DNA sequences,in particular for scales ranging from 200 to 2200 bp which will be probed by the AFM apparatus where the Hurst exponentH = 0.73 ± 0.03 is definitely found larger than 1/2. Note that this estimate is robust when extending the scaling analysisto larger portions of chromosomes 8 and 21, taking into account the neighboring 30 kbp left and right sequences of the 4selected L = 2200 bp sequences. Actually this estimate is quite representative of the H values obtained all over the humangenome [15,16].

88 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Table 42D persistence length measurement from the mean projection of the end-to-end vector on the initial orientation of N DNA molecules deposited on micawith different procedures (see Appendix C) [85]. The persistence lengths ℓp (= ℓdp) and ℓ

effp were estimated froma least squares fit of the corresponding ℓS

p (s)data for the straight DNA (Fig. 35(a)) and the HCV (Fig. 35(b)) molecules respectively, using theWLCmodel prediction (Appendix C, Eq. (C.9)). σo =

2/ℓsp

(Eq. (C.17)), was estimated using Eq. (14).

Experimental protocol Straight DNA L = 800 bp HCV L = 2200 bpℓp(N) ℓ

effp (N) σo

AirBuffer A1 490 ± 30 bp (193) 440 ± 30 bp (65) 0.022± 0.01210 mM Tris-HCL, 5 mMMgCl2, pH = 7.4 157 ± 10 nm 141 ± 10 nm

Buffer A2 395 ± 25 bp (108) 360 ± 25 bp (99) 0.023± 0.01510 mM Tris-HCL, 5 mMMgCl2, 0.05 mM NiCl2 , 126 ± 8 nm 115 ± 8 nmpH = 7.8

Buffer A3 297 ± 20 bp (101) 291 ± 20 bp (71) 0.012± 0.03810 mM Tris-HCL, 0.5 mM NiCl2, pH = 7.8 95 ± 6 nm 93 ± 6 nm

LiquidBuffer L 285 ± 20 bp (109) 276 ± 20 bp (98) 0.015± 0.03410 mM Tris-HCL, 0.5 mM NiCl2, pH = 7.8 102 ± 6 nm 99 ± 6 nm

Atomic force microscopy: AFM imaging was carried out in air and in solution using a Nanoscope IIIa (Veeco/Digital Instru-ments, Santa Barbara, CA) equipped with type-E scanner operating in tapping mode. Freshly cleaved mica (muscovite mica,grade V-I, SPI) served as a support for sample adsorption. DNA preparations were diluted (0.3 ng/µl) in different buffersaccording to different experimental protocols. (i) AFM imaging in air: For scanning in air, we used the three buffer solutionsA1, A2 and A3 defined in Table 4 (Experimental Protocol). As shown in Refs. [338,348,349], in the presence of doubly chargedcounter ions (Mg2+,Ni2+) in the solution, DNA molecules are able to equilibrate on the surface before being captured in agiven conformation. A 5 µl droplet of the solution was deposited for 2 min, then the mica disk was rinsed carefully with1 ml of MilliQ water and dried under a nitrogen flow prior to imaging. Samples were imaged with silicon tips (type MPP-11100, Veeco Instruments) at resonance frequency f of 250–350 Hz and set point of 0.6–1.2 V. (ii) AFM imaging in liquid: Forscanning in liquid, we followed a slightly different protocol with the following buffer L (Table 4) similar to Buffer A3 used inair. As before a 5 µl droplet of the solution was deposited on to a freshly cleaved mica disk and incubated for 2 min. Then a100µl drop of Buffer L was introduced in the liquid cell prior to imaging with a commercial silicon nitride probe (type NP-S,Veeco Instruments) at resonance frequency f of 9–10 Hz and a set point of 0.3–0.4 V.

The images were recorded at room temperature, both in air and liquid, at a scan size of 1.5×1.5µm2, a scan rate of 2 Hzand a resolution of 512×512 pixel2. Prior to image processing, images were oversampled to 1024×1024 pixel2 resulting ina resolution ∼4 bp per pixel. To avoid molecular crowding effects resulting from the interaction of neighboring moleculesduring their equilibrium process [338], we used a high dilution rate so that each image contained only a few distant DNAmolecules, at the expense of our statistical sampling.

Image processing: The DNA traces were analyzed using a homemade Matlab (MathWorks Inc., Natick, MA) toolbox. Firstthe images were flattened to remove the long-term drift of the set up. A few flattened images of DNA molecules areshown in Figs. 33 and 34. Then the DNA chains were identified and digitized using morphological tools such as hysteresisthresholding, erosion, dilation and skeletonization to estimate their freeDNA length using Kulpa and corner chain estimators[350]. As reported in previous studies [338,350,351], our estimate of the mean contour length of the DNAs under study,yields an helical rise of 3.20 ± 0.05 Å (resp. 3.60 ± 0.05 Å) per bp in air (resp. in liquid) which underestimates (resp.overestimates) the 3.38 Å per bp measured by crystallography. For each entropic realization of a DNA molecule of size L(∼2200 bp), we estimated both the end-to-end distance R(s) and the scalar product u(0).R(s) (Appendix C, Fig. C.1) with aresolution of 1 pixel = 4 bp. Then by averaging over all the DNA trajectories obtained from the AFM images, we estimatedℓDp (s) = ⟨R2(s)⟩/2s (Appendix C, Eq. (C.6)) and ℓS

p (s) = ⟨u(0).R(s)⟩ (Appendix C, Eq. (C.9)) that both converged to the 2Dpersistence length in the limit s → L → ∞ (Appendix C, Eqs. (C.5) and (C.8) respectively).

4.2.2. Defining the experimental protocol appropriated to characterize intrinsic structural disorderOne of the most attractive features of the AFM is that it can operate equally well with the cantilever immersed in liquid

as in air, making it possible to image biological molecules like DNA and proteins under quasi-physiological conditions [336,344,345,348,350–353]. Whereas the weak electrostatic attachment of DNA to the mica surface obtained by adding thedivalent cations Mg2+ into the buffer is efficient enough when imaging in air (Figs. 33 and 34), it is definitely too loosewhen scanning in liquid [348,352]. In aqueous buffer, Ni2+ ions are generally preferred because of their higher affinity formica which allows a better confinement of DNA to the mica surface without either immobilizing it or releasing it in theliquid cell. Because the persistence length of DNAmolecules is indeed sensitive tomany experimental parameters includingtemperature, pH, ionic forces and excluded volume effects for long DNA fragments, in this section, we perform a series ofpreliminary persistence length measurements using different AFM imaging protocols with the objective of selecting themost efficient one to evidence and quantify intrinsic DNA curvature disorder [84,85].

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 89

Fig. 33. AFM images in tapping mode in air using Buffer A1 of L = 800 bp intrinsically straight DNA molecules.Source: Reprinted from Ref. [31].

Persistence length measurement in air using Buffer A1: In Figs. 33 and 34 are shown AFM images in air using Buffer A1 of theL = 800 bp intrinsically straight DNA fragment (no intrinsic curvature disorder) and of the L = 2200 bp Hepatitis cRNAvirus molecule (uncorrelated structural disorder) respectively. Theoretically, according to theWLCmodel extension to DNAchains with uncorrelated intrinsic curvature disorder (see Appendix C), from Eqs. (C.16) and (C.17) relating the persistencelengths of the straight DNA (ℓdp = 2A) and of the virus molecule (ℓeffp ), one should be in a position to estimate the amplitudeσo of the curvature disorder intrinsic to HCV DNA:

σ 2o = 2

1

ℓeffp

−1ℓdp

. (14)

Figs. 35(a) and 36(a) show the experimental data for the local persistence length versus s obtained from the DNA tracesof N = 193 straight DNA molecules deposited using Buffer A1 protocol (Table 4). The data obtained when using eitherthe mean projection of the end-to-end vector on to the initial orientation estimate of ℓSp(s) (Eq. (C.9)) (Fig. 35(a)) or themean square end-to-end distance estimate of ℓD

p (s) (Eq. (C.6)) (data not shown), are remarkably well fitted by the WLC

90 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 34. AFM images in tapping mode in air using Buffer A1 of L = 2200 bp uncorrelated HCV DNA molecules.Source: Reprinted from Ref. [31].

model predictions. Fig. 36(a) shows the estimate of ℓdp = 2A when fitting the data over the first s bp. For s ranging from200 bp to the total length 800 bp, all the data points for 2AD and 2AS fall in the range ℓp = 2A = 490 ± 30 bp. This yieldsan estimate of the bend rigidity A = 245 ± 15 bp which corresponds, according to Eq. (C.18), to a 3D persistence lengthℓ3Dp = A × 0.32 ≃ 79 nm, in good agreement with previous measurement by cryo-electron microscopy [327] and recentmolecular dynamic simulations [354]. For comparison, in Fig. 36(b) and (c) are shown similar estimates of 2AD and 2AS fortwo sets of N = 193 2D equilibrium conformations of straight DNA of total length L = 800 bp numerically generated byDNA simulations (see Appendix D), using a centered bend angle distribution of variance 1/A given by Eq. (C.3) (ℓ = 1) with2A = 490 bp. It is clear that for a statistical sample of about 200 DNA trajectories of rather small total length L = 800 bp,there is some statistical limitation in the achievable accuracy when estimating the persistence length ℓp = 2A. Results ofthe numerical simulations in Fig. 36(b) and (c) show that estimates of ℓD

p (s) and ℓSp (s) can differ by a few tens of bp which

corroborates our estimate of the experimental uncertainty ℓp = 2A = 490 ± 30 bp in Fig. 36(a).In Figs. 35(b) and 36(d) are reported the results of a similar analysis of a collection of AFM images (Fig. 34) from which

the DNA traces of N = 65 HCV molecules were digitized. Again the persistence length ℓDp (s) and ℓ

Sp(s) are well fitted by

the WLC model predictions but with an effective persistence length ℓeffp significantly smaller than the dynamic persistencelength estimated just above for intrinsically straight DNA. Indeed, as shown in Fig. 36(d), when performing a fit of the datawith Eqs. (C.6) and (C.9) over different ranges of scales from s = 200 to 2000 bp, we get values for 2AD and 2AS that allfall in the interval ℓeffp = 2Aeff

= 440 ± 30 bp which corresponds to a 3D persistence length ℓ3Dp = Aeff× 0.32 ≃ 70 nm.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 91

a b

Fig. 35. Persistence length measurement of ℓSp (Appendix C, Eq. (C.9)) when averaging over N imaged molecules of (a) intrinsically straight DNA

(L = 800 bp) and (b) uncorrelated HCV DNA (L = 2200 bp). The symbols correspond to the following AFM imaging protocols: Air with Buffer A1 (),A2 () and A3 (); liquid with Buffer L (•). The solid lines correspond to least squares fit estimates of the persistence length with theWLCmodel prediction(Appendix C, Eq. (C.9)). The so-obtained estimates of ℓp and the corresponding numbers of analyzed molecules are reported in Table 4.Source: Reprinted from Ref. [85].

a d

b e

c f

Fig. 36. Fit of the data for ℓDp (s) (•) and ℓ

Sp (s) () with the WLC model predictions (Appendix C, Eqs. (C.6) and (C.9)): 2A estimated when restricting the

domain of fit of the data in Fig. 35 to the first s bp only. Intrinsically straight DNA: (a) AFM study (N = 193, L = 800 bp), (b) and (c) DNA simulations(N = 193, L = 800 bp) with elastic bend rigidity 2A = 490 bp. Uncorrelated DNA: (d) AFM study of HCV molecules (N = 65, L = 2200 bp), (e) and (f)DNA simulations (N = 100, L = 2200 bp) of 2D equilibrium conformations of two uncorrelated random chains (H = 1/2, σo = 0.022) generated with anelastic bend rigidity 2A = 490 bp and corresponding to 2Aeff

= 440 bp (see Appendix D).Source: Reprinted from Ref. [85].

Since the experimental protocol used to deposit and visualize the different types of DNAs was rigorously the same, wecan reasonably assume that the bend rigidities of all molecules are identical. Then, we get the following estimate of thecurvature angle function amplitude σo = 0.022 ± 0.012 typical of a weak structural disorder and corresponding to a staticpersistence length ℓsp = 2/σ 2

o = 4310 bp. We report in Fig. 36(e) the results of a comparative DNA simulation study ofN = 100 2D equilibrium conformations of an uncorrelated random chain (H = 1/2, σo = 0.022), generated with an elasticbend rigidity 2A = 490 bp as described in Appendix D (Fig. D.1(c)). Like for the HCV molecules, the estimated persistencelength definitely underestimates the elastic bend rigidity consistently with Trifonov et al. formula (Appendix C, Eq. (C.16)).Redoing these DNA simulations for a second realization of uncorrelated random chains with the same curvature disorderamplitude σo = 0.022, then, as shown in Fig. 36(f), some average estimate of 2Aeff

= 440 ± 20 bp is obtained in goodagreement with Eq. (C.15) which predicts ℓeffp = 2Aeff

= 436 bp. These results bring the demonstration that AFM imagingis capable of detecting and quantifying the presence of rather weak curvature structural disorder in short (a few kbp long)DNA molecules [84,85].Persistence lengthmeasurement in air using Buffer A2: To investigate the effect of different ionic conditions,wehave performedadditional persistence length measurements in air when using Buffer A2 that contains both Mg2+ and Ni2+ divalentcounterions. More precisely, Buffer A2 was obtained by adding 0.05 mM NiCl2 to the 5 mM MgCl2 containing Buffer A1.The ℓS

p (s) data recorded with Buffer A2 () fall systematically below the corresponding experimental points previouslyobtained with Buffer A1 () and this for both sets of N = 108 intrinsically straight DNA molecules (Fig. 35(a)) andN = 99 HCV molecules (Fig. 35(b)). From the very good fits obtained with the WLC model (Appendix C, Eq. (C.6)), we

92 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

get the following estimates ℓp = 395 ± 25 bp for the former molecules and ℓp = 360 ± 25 bp for the latter ones, i.e.values that are systematically below the corresponding ones obtained with Buffers A1 (Table 4). But what is remarkableis that when introducing the so-obtained values in Eq. (14), identifying the dynamic persistence length ℓdp to ℓp (straightDNA) and the effective persistence length ℓeffp to ℓp (HCV), we get a value for the intrinsic curvature fluctuation amplitudeσo = 0.023 ± 0.015 which is quantitatively similar to the one obtained in air with Buffer A1 (σo = 0.022 ± 0.012) (seeTable 4) [85].Persistence length measurement in air using Buffer A3: In Fig. 35 are also reported the results obtained when using Buffer A3that contains Ni2+ divalent cations instead of Mg2+ (Table 4). The data for ℓS

p (s) obtained from the analysis of N = 101intrinsically straight DNA molecules and N = 71 HCV molecules are still well fitted by the WLC model (Appendix C, Eq.(C.6)) with ℓp = 2A = 297 ± 20 bp for the former and ℓp = 2A = 291 ± 20 bp for the later, i.e. values which aresignificantly smaller than the ones previously obtained with Buffers A1 and A2 but that are no longer distinguishable, up tothe statistical and experimental uncertainties [85].Persistence length measurement in liquid using Buffer L: When performing AFM imaging in liquid with Buffer L (identical toBuffer A3) that contains NiCl2 and no more MgCl2 (Table 4), we obtain from the analysis of N = 109 intrinsically straightDNAmolecules and N = 98 HCVmolecules, data for ℓS

p (s) that systematically fall in the error bars of the corresponding datapoints previously obtained when imaging in air using Buffer A3 (Fig. 35), yielding to the estimates ℓp = 2A = 285 ± 20 bpand ℓeffp = 2Aeff

= 276 ± 20 bp respectively. These results show that when using the same experimental depositionprocedure with the same buffer composition, AFM imaging in air and in liquid provides consistent and robust estimates ofthe persistence length of short DNA chains [85].

Thus the variations observed in the effective persistence length ℓeffp of these HCV molecules when changing the ionicconditions (Fig. 35(b)) are likely to be accounted for by the variations of the dynamic persistence length ℓdp contributionin Eq. (14) as estimated from the set of intrinsically straight DNA molecules (Fig. 35(a)). As summarized in Table 4, whenprogressively replacing Mg2+ by Ni2+ cations, ℓdp = 2A decreases from 490 ± 30 bp (Buffer A1) to 290 ± 20 bp (BufferA3 and L), which corresponds to a decrease of the DNA bending rigidity A and in turn of the 3D persistence length ℓ3Dp(Appendix C, Eq. (C.18)) from 79 nm down to 49 nm, i.e. down to a value consistent with the 50 nm consensus valueobtained for native DNA by independent methods [188,313,334]. This increase in flexibility is likely to result from somecharge-induced electrostatic softening of the DNA due to the high affinity of Ni2+ ions to DNA that can attenuate repulsionsbetween negative charges on the phosphates [348,349,352,355]. But what our AFM results suggest is that these changesin ionic conditions have not affected the static persistence length ℓsp of the HCV molecules, which quantifies the intrinsiccurvature disorder induced by the HCV DNA sequence with an amplitude as small as σo =

2/ℓsp ≃ 0.02 (Table 4) [84,

85]. Note that the prohibitive size of the error bars in the estimate of σo when imaging in air with Buffer A3 and in liquidwith Buffer L likely explains why the effect of the sequence on the DNA elastic properties was found negligible in manyexperimental studies performed in quasi-physiological conditions. However, a careful examination of the data in Fig. 35reveals that if the ℓS

p (s) data points for the intrinsically straight DNA (Fig. 35(a)) and the HCV DNA (Fig. 35(b)) fall in theerror bars of each other, nevertheless for each s value, ℓS

p (s) for HCV molecules is found systematically smaller than for theintrinsically straight DNA molecules. This strongly suggests that despite the statistical (and experimental) uncertainty, thesystematic underestimation of the persistence length observed for HCV relative to intrinsically straight DNA is an indicationof the the presence of an uncorrelated intrinsic curvature disorder in the former molecules [84,85].

The consistency of the AFM results reported in Fig. 35 and Table 4 brings into light a remark of practical importance.Since the weak curvature structural disorder evidenced for the HCV molecules σo ≃ 0.02 at the bp scale is likely to betypical of native DNA molecules, according to Eq. (14), the best situation to estimate σ 2

o as accurately as possible is whenthe two terms 1/ℓeffp and 1/ℓdp are both as small as their difference, i.e.when ℓeffp andmainly ℓdp = 2A are as large as possible.In other words, if we intend to characterize the intrinsic structural disorder induced by the DNA sequence and possibly toevidence the presence of LRC, it is better to operate under deposition and AFM imaging conditions such that the consideredDNA molecule be as rigid as possible (ideally, for uncorrelated disorder, σ 2

o = 2/ℓeffp in the limit ℓdp = 2A → +∞). Thisjustifies our choice in the next sections to operate in air using the AFM imaging protocol with Buffer A1 (Table 4) [84,85].

4.2.3. Experimental check of 2D thermodynamic equilibrium conditionsA crucial issue in the present interpretation of experimental AFM studies in terms of the WLC model (Appendix C.1) and

its generalization to LRCDNA chains (Appendix C.2), is the control of the deposition process in order to avoid kinetic trappingof the DNAmolecules on the surface before equilibrating in a given 2D conformation. As described in Section 4.2.1, we havefollowed protocols used in previous studies [338,348,349,352,353] as attaining 2D thermodynamic equilibrium. This sectionis devoted to some experimental check on our set of N = 193 imaged conformations of the L = 800 bp intrinsically straightDNA fragment when imaged in air using Buffer A1 (Fig. 33) [85]. The results of the statistical analysis of DNA bend anglefluctuations 1θ(s, ℓ) over different distances ℓ are reported in Fig. 37. In Fig. 37(a) and (b) are shown the pdfs Pℓ(1θ)obtained for ℓ = 230 bp and 310 bp respectively. As a check of 2D thermodynamic equilibrium, we have computed the firstfour moments of Pℓ(1θ)with the specific goal of testing the Gaussian nature of these pdfs (Appendix C, Eq. (C.3)). As shownin Fig. 37(e) and (f), the data for ⟨1θℓ⟩ and ⟨1θ3ℓ ⟩ do not display any significant departure from zero as expected for the odd

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 93

a b

c d

e

f

g

h

i

Fig. 37. Statistical analysis of DNA bend angle fluctuations for the N = 193 intrinsically straight DNA molecules of length L = 800 bp imaged in air withBuffer A1. Histogram Pℓ(1θ) estimated for different scales ℓ: AFM experiment for ℓ = 230 bp (a) and 370 bp (b); DNA simulations with 2A = 490 bp andfor ℓ = 230 bp (c) and 370 bp (d). Moment analysis: (e) ⟨1θ⟩, (f) ⟨1θ3⟩, (g) ⟨1θ2⟩, (h) 2ℓ/⟨1θ2⟩ and (i) ⟨1θ4⟩/⟨1θ2⟩2 versus ℓ; the symbols (•) correspondto AFM experiments and (, ) to two DNA simulations respectively. The horizontal solid lines in (e), (f), (h) and (i) correspond to ⟨1θℓ⟩ = 0, ⟨1θ3ℓ ⟩ = 0,2ℓ/⟨1θ2ℓ ⟩ = 490 bp and ⟨1θ4ℓ ⟩/⟨1θ

2ℓ ⟩

2= 3 respectively. The solid line in (g) corresponds to the slope 2/2A with 2A = 490 bp.

Source: Reprinted from Ref. [85].

94 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

moments of a Gaussian distribution. For the second moment, Eq. (C.3) predicts:

⟨1θ2ℓ ⟩ =2ℓ2A

=2ℓℓp. (15)

The data for ⟨1θ2ℓ ⟩ in Fig. 37(g) exhibit a very convincing linear behavior when plotted versus ℓ. Accordingly, when plotting2ℓ/⟨1θ2ℓ ⟩ vs. ℓ in Fig. 37(h), a well-defined plateau at the value 2A = 490±30 bp is observed, in remarkable agreementwiththe estimate of the bend rigidity previously obtained from persistence length measurements (see Table 4). The ratio of thefourth moment to the square of the second moment, the so-called kurtosis of a Gaussian distribution satisfies the followingremarkable relationship:

⟨1θ4ℓ ⟩

⟨1θ2ℓ ⟩2

= 3, ∀ℓ. (16)

As shown in Fig. 37(i), the ratio determined from the experimental AFM data is very close to the theoretical value 3. Overall,the results reported in Fig. 37 bring the experimental demonstration that the bend angle fluctuations of the intrinsicallystraight DNA fragment have Gaussian statistics for distances ℓ between two chain vectors ranging from 100 to 500 bp [85].For shorter distances, the analysis is biased by the AFM resolution. For distances larger than 500 bp, the analysis startsbeing affected by finite sampling effects. For the sake of comparison, we have reproduced this test of Gaussian statisticsfor two independent sets of N = 193 numerically generated 2D equilibrium conformations of a L = 800 bp straight DNAwith a bend rigidity 2A = 490 bp (see Appendix D). As shown in Fig. 37, the results obtained for the real and simulatedmolecules are quite comparable and consistent. In particular the slight differences observed in the first four moments’behavior of Pℓ(1θ) between our two sets of simulated molecules are as important as the differences observed betweenthe real and the numerical molecules. This observation attests that there is no significant departure from Gaussian statisticsin the fluctuations of DNA bending angle imaged by AFM corroborating that 2D thermodynamic equilibrium is achieved byour experimental deposition protocol [84,85].

4.2.4. Experimental demonstration of the existence of long-range correlations in human DNA molecules

Genomic human DNA: DNA fragments of total length 1000 . L . 3000 bp were purified from exponentially growing HeLacells using standard procedure as described in Section 4.2.1. Fig. 38(a) shows representative AFM images obtained usingBuffer A1 of 2D equilibrium conformations of these human genomic molecules when submitted to thermal fluctuations.From a large set of images collected in different experiments, the DNA traces of N = 290 molecules were digitized witha total length selection: 1700 . L . 2500 bp. Since each of these imaged DNA molecules most probably correspondsto a different DNA sequence, the experimental estimate of the persistence length is likely to account for averaging overboth entropic and intrinsic structural disorders thus making the comparison with the predictions of our GWLC model (seeAppendix C.2) amenable [84,85]. Fig. 38(b) shows the experimental data for ℓD

p (s) and ℓSp (s) obtained for our set of 290

human DNAmolecules.When using the theoretical predictions (Appendix C, Eqs. (C.21) and (C.22)) of the GWLCmodel to fitthese data, limiting the range of variation of H between 0.7 and 0.8 according to the LRC detected in the genomic sequences,we get a very good agreement for the parameter values (2A = 487 bp, H = 0.73, σo = 0.0077) and (2A = 490 bp,H = 0.73, σo = 0.0062) respectively. As a first observation, we recover a value of the elastic bend rigidity in remarkableagreement with the reference value obtained for the intrinsically straight DNA fragment 2A = 490 ± 30 bp. As far asthe amplitude of the intrinsic curvature disorder σo ≃ 0.007 is concerned, it looks smaller than the one obtained forthe uncorrelated HCV molecule. However, as discussed in Section 2.6.4, the H = 0.73 LRC regime is observed for scales&200 bp only, whereas for scales .200 bp there is a cross-over to a H = 0.6 LRC regime (Sections 2.6.2 and 2.6.3) [14–16]. Thus, up to scale 200 bp, σℓ = σoℓ

0.6 rather than σℓ = σ ′oℓ

0.73 which is only valid above 200 bp; by continuityσ ′o = σo2000.6/2000.73. Taking this property into account to estimate the intrinsic curvature disorder amplitude at the

bp scale, we get σo = 0.007 × 2000.73/2000.6≃ 0.015, i.e a value similar to the one (σo = 0.022) found for the HCV DNA

(Table 4). This is, a posteriori, an important observation since the local curvature fluctuations observed between successivebase-pairs is expected to be of the same magnitude whatever the DNAmolecules under consideration. Let us point out thatfor the parameter values (2A = 490 bp, H = 0.73, σ = 0.007), our GWLC model predicts a persistence length ℓp = 396 bpthat corresponds to a value of the 3D persistence length ℓ3Dp = ℓp/2 × 0.32 ≃ 63 nm that is significantly smaller thanthe persistence lengths previously found for the intrinsically straight DNA (ℓ3Dp = 79 nm) and the uncorrelated HCV DNA(ℓ3Dp = 70 nm) [84,85].Individual human DNA molecules of different GC content: As reported in Fig. 38(b), the classical WLC model predictions(Appendix C, Eqs. (C.6) and (C.9)) also provide an acceptable fit of the genomic human molecules when fixing 2Aeff

=

410 ± 10 bp, which corresponds, according to Eq. (C.15), to a larger value of σo = 0.028. Actually, a close inspection ofthe behavior of ℓS

p (s) in the range of scales 600 . s . 1000 bp reveals that the WLC model indeed fails to account for theobserved pronounced ‘‘knee’’ behavior hastening the convergence of ℓS

p (s). As illustrated in Fig. C.2, this faster convergenceto a lower value of ℓp is likely to be the signature of the presence of LRC in the local DNA curvature fluctuations [80,346].Unfortunately, for rather weak structural disorder with σo . 0.01, as observed for the human genomic DNA molecules,

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 95

Fig. 38. (a) AFM images in tapping mode in air using Buffer A1 (Table 4) of the 1700 . L . 2500 bp LRC human genomic DNA. (b) Persistence lengthmeasurements of ℓD

p (s) (•) and ℓSp (s) () when averaging over N = 290 molecules. The solid lines correspond to the predictions of the GWLC model

(see Appendix C.2, Eqs. (C.21) and (C.22) respectively) for 2A = 490 bp, H = 0.73 and σo = 0.007; the predictions of the WLC model for 2A = 410 bp(Appendix C.1, Eqs. (C.6) and (C.9)) are shown in dashed lines for comparison.Source: Adapted from Ref. [85].

Fig. 39. AFM images in tapping mode in air using Buffer A1 of human DNA fragments of different GC contents. (a) chr 21: L = 2200 bp, G + C = 31%; (b)chr21: L = 2006 bp, G + C = 38%; (c) chr 21: L = 2206 bp, G + C = 43%; (d) chr 8: L = 2189 bp, G + C = 46%.Source: Reprinted from Ref. [85].

this ‘‘knee’’ behavior in ℓp(s) is not sufficiently pronounced to allow us to definitely discard the WLC model. Thankfully thehuman genome is well known to be compartimentalized into wide (∼105 bp) domains with uniform GC content, commonlycalled isochores [127,146–152,160], with appreciable scatter of the average GC content when comparing different domains.One of the main isochore characteristics is the fact that the variance of the GC fluctuations increases significantly with themean GC content. As noticed in a previous work [332], when using structural tri-nucleotide coding tables like the PNuctable (Table 1), the variance of the generated bending profiles is mainly governed by the variance of the GC content. Moreprecisely, when comparing GC poor (∼30%) to GC rich (∼55%) isochores, the variance σ 2

o of bend angle fluctuations can bemultiplied by a factor of 2 or more. As described in Section 4.2.1, along this observation, we have selected four repeated-sequence free DNA fragments of size L ≃ 2200 bp along the human genomewith GC content ranging from 31% to 46%. Threeof these sequences are from chromosome 21, with the last one from chromosome 8. Fig. 39 shows AFM images of typical2D equilibrium conformations obtained in air with Buffer A1 for these four selected human DNA fragments. Fig. 40(a) and(b) show the corresponding experimental data for ℓD

p (s) and ℓSp (s) respectively, as obtained when averaging over statistical

samples of about N ≃ 100 molecules. Several observations must be underlined. First the ℓDp (s) and ℓ

Sp (s) data for the

DNA fragment of lowest GC content (31%) are very similar to the ones previously obtained for the human genomic DNAmolecules in Fig. 38(b). According to our GWLC model [80] (with 2A = 490 bp, H = 0.73, σo = 0.007), we predict forthe low GC human DNA fragment the same persistence length ℓp = 396 bp, corresponding to a 3D persistence lengthℓ3Dp = ℓp/2 × 0.32 = 63 nm, suggesting that our experimental sample of N = 290 human genomic DNA molecules isindeed typical of the rather low average GC content (∼41%) observed in the human genome. For the other DNA fragmentswith higher GC contents, both ℓD

p (s) and ℓSp (s) are found significantly lower than for the HCV (Fig. 35(b)) and the human

96 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

genomic DNA (Fig. 38(b)). Indeed for each of these single DNA fragments, the observed behavior of ℓDp (s) and ℓ

Sp (s) versus

s is no longer comparable to the predictions of the WLC model (Appendix C, Eqs. (C.6) and (C.9)). This is particularly patentfor the G + C = 46% human DNA fragment for which both ℓD

p (s) and ℓSp (s) start decreasing beyond s ≃ 700–800 bp as

a possible indication of the presence of some large-scale cooperative intrinsic curvature. A simple visual inspection of theAFM images of these human DNA fragments in Fig. 39, confirms this interpretation. Whereas molecules can still adopt bentor straighter HCV-like configurations indicating that their flexibility is likely to be retained, some proportion of them displaysome macroscopic curvature. But what is remarkable is that this proportion becomes very significant for the DNA fragmentof the highest GC content (Fig. 39(d)) [85]. To demonstrate that this persistence in curvature properties is not due to some‘‘hyperperiodic’’ distribution of curved DNA fragments [356], but rather to the presence of LRC in the random distributionof bend sites [137,191], we have performed DNA simulations of LRC chains with parameters 2A = 490 bp, H = 0.73and σo ranging from 0.006 to 0.015 (see Appendix D). We have selected three frozen LRC chains that are illustrated at theright edge of Fig. 40(d) as representative of increasing intrinsically macroscopically curved DNA chains. The ℓD

p (s) and ℓSp (s)

numerical data obtained when averaging over N = 100 2D equilibrium conformations of these three frozen chains areshown in Fig. 40(c) and (d) respectively. These numerical data display the same characteristics as experimentally observedin Fig. 40(a) and (b) for the four human DNA fragments. Going from the less intrinsically curved synthetic DNA to the mostcurved one, we recover the WLC-like behavior observed for the GC poor DNA as well as the decrease of ℓD

p (s) and ℓSp (s)

observed at large s values (s & 800 bp) for the GC rich DNA fragments [84,85].Now if we look at the DNA simulations of uncorrelated and LRC DNA chains in Fig. D.1(a) and (b) (Appendix D), we

recognize that the frozen LRC chains display some marked macroscopic curvature. Actually, from the Gaussian statistics ofintrinsic bend angle fluctuations (Appendix C, Eq. (C.12)), we can show that the characteristic size of a frozen LRC chainmaking a half-loop is ℓ∗

= (π/σo)1/H ; this yields for H = 0.73 the values ℓ∗

= 4295 and 895 bp for σo = 0.007 and 0.022respectively. Since the internal end-to-end distance is expected to decrease when s reaches half the loop-size, this explainsthat the decrease observed for s & 800 bp in ℓD

p (s) and ℓSp (s) of GC rich DNA fragments is quite representative of the presence

of LRC between randomly distributed curvature sites. Actually, in order to get a similar decrease with an uncorrelated(H = 1/2) frozen chain, we would need a quite unrealistic large value of σo =

π2/ℓ∗ ≃

π2/895 ≃ 0.105. With a

more realistic value of σo ≃ 0.02–0.03, the probability to get by chance an uncorrelated chain that displays a macroscopiccurvature similar to the one observed for the GC rich DNA fragments is less than 10−3, which again discards a possibleinterpretation of the AFM data in Figs. 39 and 40 in terms of the WLC model. Finally, for σo = 0.010, 0.012 and 0.015,our GWLC model [80] for 2A = 490 bp, H = 0.73 yields the following respective asymptotic estimates of the persistencelength of GC rich DNA fragments: ℓp = 339, 306 and 262 bp. These estimates of ℓp correspond to 3D persistence lengthvalues ℓ3Dp = 54, 49 and 42 nm that are consistent with the 50 nm consensus values for native DNA obtained from previousmeasurements by independent methods [313,327,334,335].

Persistence length of DNA molecules is indeed sensitive to many experimental parameters like temperature, pH, ionicforces and excluded volume effects for long DNA fragments, that can explain significant discrepancies between ℓp valuesestimated by different groups using various techniques including light scattering [357,358], ligase-catalyzed cyclization[329,359–362], flow dichroism [363], transient electric birefringence [364,365] and transient electric dichroism [366].However all these methods require simplifying theoretical models to interpret the data that may affect the results. UsuallyDNA is assumed to be a classical intrinsically straight polymer and the local sequence-dependent variations in localcurvature and flexibility are neglected; this can significantly affect the efficiency of circularization in biochemical cyclizationexperiments [329,359–362]. As aforementioned, electron microscopy [367] and atomic force microscopy [336–345,348–352] are unique techniques that provide essential information to persistence length measurement, namely the contourof the individual molecules and a population of such contours. However these direct visualization techniques require theDNA molecules to be confined in 2D for imaging which can cause some changes in the preferred DNA conformations. Asemphasized by Bednar et al. [327], cryo-electron microscopy overcomes the drawback of classical electron microscopysince there is no adsorption, no drying and no staining artifact, the molecules being suspended in a thin layer of cryo-vitrified buffer. As far as AFM is concerned, two extreme cases of DNA molecule adsorption on supporting films can bedistinguished [338,348,349]: (i) strong kinetic trapping roughly corresponding to a projection of the 3D conformations and(ii) weak adsorption so that the molecules have time to freely equilibrate on the surface before being trapped in a particular2D equilibrium conformation. In this section, we have used AFM in tappingmode in air and in liquid to image DNAmoleculesunder different ionic conditions [85]. We have defined an experimental protocol that ensures 2D thermal equilibriumand enhances the rigidity of the DNA molecules making accurate measurement of the intrinsic curvature disorder (σo,H) amenable. By assisting AFM imaging by DNA simulations (see Appendix D), we have reported the first experimentaldemonstration of the existence of LRC in the intrinsic curvature fluctuations of humanDNA resulting in a significant loweringof the persistence length [84,85]. This lowering does not correspond to some increased flexibility butmore likely results fromsome large-scale intrinsic curvature due to a persistent distribution ofDNA curvature sites. This raises the issue of developingmethodologies capable of separating the local sequence induced curvature and flexibility along the DNA chain [345].

4.3. Atomic force microscopy imaging of nucleosome positioning by genomic excluding energy barriers

The structure and dynamics of nucleosomes have attracted increasing experimental and theoretical interest [6,7]. Highresolution X-ray analyses [8,10] have provided deep insight into the wrapping of about 147 bp of DNA in almost two super-

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 97

a b

c d

Fig. 40. Persistence length measurement of ℓDp and ℓS

p when averaging over N DNA traces. (a) ℓDp (s) and (b) ℓS

p (s)measured from an AFM study in air withBuffer A1 of human DNA fragments of G+ C content () 31% (N = 99), () 38% (N = 80), () 43% (N = 87) and (♦) 46% (N = 134). (c) ℓD

p (s) and (d) ℓSp (s)

computed from DNA simulations of N = 200 2D equilibrium conformations generated with an elastic bend rigidity 2A = 490 bp from the three frozenLRC chains (•, , ) illustrated at the right edge of (d) and selected as representative of increasingly macroscopically curved DNA chains generated withH = 0.73 and σo = 0.01 (see Appendix D).Source: Reprinted from Ref. [85].

helical turns around a histone octamer to form the nucleosome. These nucleosomes are disposed as beads-on-a-string alongthe DNA to form the 10 nm chromatin fiber that is further organized into successive high-order structures spanning a widerange of scales to reach a full extent of condensation in metaphase nucleosomes (Fig. 1). Actually, nucleosomes are highlydynamical structures that can be moved along DNA by chromatin remodeling complexes at the expenses of ATP [201,233,234,293–302,368–370] but that also move autonomously on short DNA fragments [353,371–376]. Different models havebeen proposed to account for the nucleosome mobility [377–379] extending from the DNA reptation model [380] to thetwist defect diffusion model [381]. In that context, as reported in the previous sections, recent genome-wide nucleosomemappings alongwith bioinformatics studies have shown that the DNA sequence plays amore important role in the collectiveorganization of nucleosomes in vivo than previously thought [81–83,179,210,216,217,219–222,240,243,246,263,286]. Yet,in living cells, this organization also results from the action of various external factors like DNA binding proteins andchromatin remodelers (Section 3.6). To decipher the code for intrinsic chromatin organization, there is thus a need for in vitroexperiments to bridge the gap between computational models of nucleosome sequence preferences and in vivo nucleosomeoccupancy data [216,217]. In this section, we combine AFM in liquid and the theoretical physical modeling described inSection 3.2 to confirm experimentally that a major sequence signaling in vivo are high energy barriers that locally inhibitnucleosome formation rather than favorable positioning motifs [86]. Besides providing direct single molecule visualizationand precise quantification of intrinsic nucleosome positioning, our approach has several advantages since for a same DNAtemplate, very instructive and complementary information is made available: (i) by comparing the statistical positioningdistribution of a single mono-nucleosome (which is not accessible in vivo) to the theoretical nucleosome energy profile,we are able to characterize the ability of the sequence to locally favor nucleosome formation and (ii) by comparing theexperimental nucleosome density profile obtained by increasing the input histone level during reconstitution to enhancenucleosome loading, to the theoretical grand canonical nucleosome equilibrium density, we are in position to understandand quantify to which extent the DNA sequence contributes to the collective organization of nucleosomes observed in vivo.

4.3.1. Reconstituting and imaging nucleosomes on native DNAEarly AFM imaging was applied to the characterization of chromatin structure at the nanoscale level [382–384]. But

except for a few studies of telomeric [385,386] and centromeric [387] nucleosomes, AFM has been used to image, mainlyin air, nucleosome assembly on specific positioning sequences (e.g. Xenopus 5S rDNA and 601 DNA sequences) and arraysof concatenated repetitions of these sequences [388,389]. In this section, we report the results of AFM experiments [86]performed in aqueous solution according to the protocol defined in Ref. [390]. Mono- and di-nucleosomes are reconstituted

98 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 41. AFM image treatment and analysis of amono-nucleosome reconstituted on the 595 bp long yeast (chr. 7) genomic DNA fragment [86]. (a)Moleculepath skeletization when DNA is loaded; LS corresponds to the length of the skeleton. (b) Sketch illustrating the entry and exit sites of the nucleosome coreparticle (NCP), the dyad position , the lengths of the longest (L+ = LS − (lM + d/2)) and shortest (L− = lM − d/2) free DNA arms outside the nucleosomeand the length (Lc = L − (L+ + L−)) of complexed DNA. (c) Topographical profile along the molecule path in (a); the diameter d of the NCP is fixed tod = 11 nm; the NCP is positioned at the position lM corresponding to the local height maximumM . (d,e) Illustration of the methodology used to orient thehuman ILR2A fragment (L = 898 bp): the pattern of height variation along the skeleton of the fragment (red) is compared to the calculated pitch variation(black) obtained when using the Bolshoy et al. DNA coding table [393]. (For interpretation of the references to color in this figure legend, the reader isreferred to the web version of this article.)

on genomic yeast and human DNA templates using standard salt dialysis procedure [391]. Selection of the properly foldednucleosomes, determination of dyad position and resulting positioning map are then performed according to the proceduredetailed just below.Preparation of DNA fragments: The DNA fragments were selected according to the energy landscape and nucleosomeoccupancy profile obtained from the physical modeling of nucleosome positioning (see Section 3.2) [86]. The short((A) 394 bp, (B) 386 bp and (C) 387 bp) DNA fragments were obtained by PCR amplification with Taq polymerase (Promega)from yeast S. cerevisiae genomic DNA. Templates were amplified from sequences of yeast chromosome 3 at positions: (A)249050–249443, (B) 61210–61595, (C) 214107–214493 (SGD assembly database, March 2010). The 595 bp long sequencecontaining the gene YGR105W (698500–698950) located on yeast chromosome 7: 698436–699030, was amplified by PCRfrom genomic DNA. The human ILR2A promoter sequence (898 bp) on chromosome 10: 4433–5331 (NCBI, March 2010, locusn NG007403.1), was amplified by PCR with Taq polymerase (Sigma) from human genomic DNA.Protein purification and nucleosome reconstitution: Recombinant full-length histone proteins of Xenopus laevis were overex-pressed in bacteria and purified as previously described in Ref. [392]. Nucleosome reconstitution was performed by the saltdialysis procedure [391]. The reaction was stopped in TE buffer (10 mM Tris-HCl, pH 7.4, 1 mM EDTA) and 10 mM NaCl.For all considered DNA templates, nucleosomes were reconstituted with histone/DNA ratio 1/0.8. The saturating loading ofyeast chromosome 7 DNA fragment was obtained with a histone/DNA ratio of 1.1/0.8.Atomic force microscopy: AFM imaging was performed in solution using a Nanoscope IIIa microscope (Digital Instruments,Santa Barbara, CA, USA) equippedwith a type-E scanner and in tappingmode, essentially as described in previouswork [390].10 µl of a 1 nM reconstitution solution in imaging buffer (10 mM Tris-HCl, 1 mM NiCl2, pH 7.9) were deposited on freshlycleaved mica and incubated for two minutes. A 100 µl drop of imaging buffer was added in the liquid cell prior to imaging.For scanning directly in the adsorption buffer, commercial silicon nitride probes (type NP-S, Veeco Instruments) were usedat drive frequency of 8–9.5 kHz and a set point of 0.3–0.4 V at constant temperature. AFM images were recorded from1 × 1 µm2 or 2 × 2 µm2 frames, at a scan rate of 2.18 Hz and a resolution of 512 × 512pixel2.Image analysis: AFM image treatment and analysis were done using the ‘‘Scanning Adventure’’ software [351]. Briefly, lineby line second-order flattening was first applied, followed by thresholding and filtering. Zooms on individual mono- ordi-nucleosomes were performed and the molecule path was skeletonized; then topographical analysis and length mea-surements were carried out as explained in Fig. 41 [86]. The results enabled a measurement of nucleosomal height and amapping of the nucleosome position along DNA fragments. The position x of each nucleosome was considered to be thelength corresponding to the midpoint of the DNA wrapping length named ‘‘dyad’’ (Fig. 41(b)). As shown in Fig. 42(a), thedistributions of mono-nucleosome height obtained for the three short yeast DNA templates (A), (B) and (C), all display twodiscrete peaks centered at 1 and 2.9 nm. The largest peak value is slightly smaller than the expected height of 5.0 nm forthe nucleosome core particle (NCP), as previously estimated by AFM in liquid when using an APTES surface treatment [387].

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 99

Fig. 42. (a) Height distribution of nucleosomal particles reconstituted on the three small yeast genomic DNA fragments: (A) red (L = 394 bp, N =136),(B) green (L = 386 bp, N = 149) and (C) blue (L = 387 bp, N = 143); bin size = 0.25 nm; errors on the bin height were evaluated as the square root ofthe number of counts. (b) Distribution of DNA complexed length Lc (Fig. 41(c)), for the nucleosomes reconstituted on the three small yeast DNA fragments:(A) red (L = 394 bp, N = 107), (B) green (L = 386 bp, N = 102) and (C) blue (L = 387 bp, N = 105); bin size = 5 bp; errors on bin heights wereevaluated as the square root of the number of counts. (c) Distribution of the lengths ofN = 100 unloaded naked DNA during the experiment with the ILR2ADNA fragment (L = 898 bp); the mean experimental length Lexp = 322.5 nm is indicated by the orange vertical line; the blue (resp. purple) vertical linecorresponds to the theoretical length Ltheo = 324 nm (resp. 306 nm) obtained when considering 0.36 nm (resp. 0.34 nm) as the average distance betweennearest neighbor bps in B-DNA [86]. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)

The smallest peak value ∼1 nm corresponds to the height of sub-nucleosomal particles that do not contain the full histoneoctamer but more likely the histone tetramer. In all our AFM nucleosome positioning studies, nucleosomes were identifiedas peaks in the topographical profiles of height h ≥ 2 nm. The nucleosome diameter measured by AFM is usually largerthan the real diameter due to tip convolution, which means that the actual entry/exit sites of DNA in the nucleosome can-not be unambiguously localized. For the sake of comparison with previous AFM studies [385,387], the entry/exit sites wereempirically determined by positioning the NCP at the positionM corresponding to the local height maximum and fixing itsdiameter to the crystallographic value [8] d = 11 nm (Fig. 41(c)). In this way the length of free DNA outside the nucleosome,namely L+ = LS − (ℓM + d/2) for the longest arm and L− = ℓM − d/2 for the shortest one, could be measured as well asthe corresponding length Lc of DNA that was complexed into nucleosome:

Lc = L − (L+ + L−), (17)

where L is the total length of the considered DNA fragment. The distributions of complexed DNA length obtained for the setsof nucleosomes reconstituted on the three small yeast DNA fragments look quite similar (Fig. 42(b)). The mean complexedlength Lc is found∼150 bp in agreement with the expected core particle length of 147 bp. Let us point out that some peak atLc = 135 bp is observed for DNA fragment (A) where, as induced by the underlying sequence, a majority of nucleosomes arefound to be located at the edges of the fragment (see Fig. 44(a)). We have checked that those nucleosomes are the ones thatare effectively partially wrapped by DNA. Note that, consistent with previous AFM studies in liquid [351], the DNA lengthwas converted in bp by considering 0.36 nm as the average distance between the nearest neighbor bps in B-DNA (Fig. 42(c)).

From the measurement of L+, L− and Lc (Fig. 41), we position the dyad along the DNA fragment by adding Lc/2 to eitherL+ or L−. But to compare the so-obtained experimental dyad position distribution to the nucleosome probability densitypredicted by our physical modeling, we need to orient DNA.We have succeeded doing it for the 898 bp long ILR2A promoterfragmentmaking use of the AFM topography profile (Fig. 41(d) and (e)). For all the other DNA fragments, for each AFM image,we counted the fragment twice by positioning the dyad according to the two possible orientations, namely the 5′ end is onthe longest arm side or vice-versa (Figs. 44–46) [86].

4.3.2. Genomic energy barriers locally inhibit nucleosome formationIn a first experiment [86],wehave reconstituted a single nucleosomeon three short (394, 386 and 387 bp)DNA templates.

These three DNA fragments have been selected in the yeast chromosome 3 according to the nucleosome formation energyprofiles predicted by our physicalmodeling (Fig. 43). Two of these profiles display rather high energy barriers correspondingto NFRs observed in vivo [205]; for the first fragment (A), there is a barrier positioned at the center (Fig. 43(a)) whereas twoenergy barriers are bordering the second fragment (B) (Fig. 43(b)). As a reference, the third fragment (C) presents a flat energyprofile without any barrier susceptible to impair nucleosome formation (Fig. 43(c)). As illustrated on a few characteristicsinglemolecule AFM images (Fig. 44(a)), the nucleosome ismostly observed at the edges of DNA fragment (A). The statisticalanalysis of the (symmetrized) dyad positioning distribution obtained from N = 107 molecules (Fig. 44(d)) confirms anenrichment at both extremities whereas some lack of positioning events is noticeable at the center where the sequence-induced energy barrier is located (Fig. 43(a)). This experimental distribution is in good agreement (up to free end effects)with the theoretical mono-nucleosome occupancy probability profile predicted by our physical modeling. This is a strongindication of the underlying control of the DNA sequence on nucleosome formation. This observation is corroborated by themono-nucleosome positioning observed (N = 102) on DNA fragment (B) (Fig. 44(b)), where no nucleosome bound to thefragment ends because of the presence of inhibitory energy barriers (Fig. 43(b)). Indeed the nucleosome density is mainlyconcentrated in the central region as predicted by our physicalmodelingwhich explains the remarkable agreement observedbetween the experimental and theoretical symmetrized nucleosome occupancy probability profiles (Fig. 44(e)). Consistent

100 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 43. Theoretical nucleosome formation energy landscapes (color) and nucleosome occupancy probability profiles (black) predicted by our physicalmodeling, for the three small yeast DNA fragments (Section 3.2): (a) 394 bp fragment (A); (b) 386 bp fragment (B); (c) 387 bp fragment (C). For eachfragment, the chemical potential µ was adjusted so that the average number of nucleosomes is one.Source: Adapted from Ref. [86].

with its featureless theoretical energy landscape (Fig. 43(c)), the experimental nucleosome occupancy profile (N = 105)obtained for DNA fragment (C) (Fig. 44(f)) turns out to be rather flat as the signature of a statistical homogeneous dyadpositioning distribution (Fig. 44(c)). This result confirms that the observed nucleosome positioning at the edges of fragment(A) (Fig. 44(a) and (d)) is not due to some possible entropic advantage [394], but more likely results from the exclusion fromthe energetically unfavorable central region [81,82,210,222,240] (Section 3.3). Similarly, the remarkable positioning of thenucleosome in the central region of DNA fragment (B) results from the excluding role of the bordering energy barriers andnot from some positioning signal as observed in a test AFM experimentwith a 255 bp DNA fragment containing the 601 DNAsequence (Fig. 45) which was recently shown to prevent the nucleosome from sliding [353]. This first series of experimentsdemonstrate the excluding role of high energy barriers that are encoded in the sequence and that preclude nucleosomeformation [86].

4.3.3. Genomic energy barriers condition the collective assembly of neighboring nucleosomes consistently with equilibriumstatistical ordering principles

In a second series of experiments (Fig. 46), we have reconstituted nucleosomes on a slightly longer (L = 595 bp)DNA fragment in yeast chromosome 7 that contains the gene YGR105W coding for a vacuolar membrane protein [86]. Asdiscussed in Sections 3.4 and 3.5, this fragment is quite representative of the in vivo highly organized nucleosomal chromatinobserved in a majority of yeast genes that are constitutively expressed (Fig. 21) [82,83,205,210,213,243,246]. Two ratherwell positioned nucleosomes (Fig. 46(g)) are bordered by two NFRs, one just upstream the TSS facilitating the permanentaccess of activator/repressor (Section 3.3.3) [82,83,201,205,210,213,221,222,240,243,246], and another one at the gene 3′

end possibly involved in transcription termination, anti-sense initiation and recycling of RNA polymerase to the promoterby DNA looping [210] (see Section 3.3). As shown in Fig. 46(g), these two gene bordering NFRs actually correspond to highenergy barriers that are encoded in theDNA sequence.When adjusting the chemical potential µ tomimic in vivonucleosomedensity, our grand canonical physical model (Section 3.2) predicts a nucleosome occupancy profile with two confined intra-genic nucleosomes in very good agreement with the in vivo data.

AFM imaging of mono-nucleosomes reconstituted on this 597 bp DNA fragment (Fig. 46(b)) shows a lack of positioningat the two extremities, consistent with the presence of two inhibitory energy barriers. Indeed, the statistical (symmetrized)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 101

Fig. 44. AFM imaging in liquid ofmono-nucleosome positioning along the three small yeast (chr. 3) genomic DNA fragments: (a, d), yeast DNA fragment (A)with L = 394 bp, N = 107 molecules; (b, e), yeast DNA fragment (B) with L = 386 bp, N = 102; (c,f), yeast DNA fragment (C) with L = 387 bp, N = 105.Panels (a)–(c) give examples of four single molecule AFM images for each of these DNA templates respectively. (d), (e) and (f) show the correspondingsymmetrized dyad positioning distributions; a color bar was positioned at each experimentally detected dyad location and at its symmetrical positionwith respect to the center of the fragment; the cross-hatching at the edges corresponds to unphysical dyad positioning. The experimental nucleosomeoccupancy probability profile (color curve), obtained as a moving average of the symmetrized dyad density over a window of length 43.2 nm, is comparedto the theoretical nucleosome occupancy probability profiles (Fig. 43) after symmetrization (black curve). In the physicalmodeling, the boundary conditionswere such that no nucleosome could settle on the sequence if not adsorbed on its total length (147 bp); the chemical potential µwas adjusted so that theaverage number of nucleosomes is one.Source: Adapted from Ref. [86].

Fig. 45. AFM imaging in liquid of mono-nucleosome positioning along a 255 bp DNA fragment containing the 601 nucleosome positioning sequence [86].(a) Examples of single molecule AFM images. (b) Symmetrized dyad positioning distribution (N = 105 molecules); similar representation as in Fig. 44.Note that the physical modeling predicts some nucleosome positioning on the 52 and 56 bp DNA fragments bordering the 601 positioning sequence thatis not observed with our AFM statistical sampling.

102 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 46. AFM imaging in liquid of (a, b), mono-nucleosomes and (a, c), di-nucleosomes along the yeast (chr. 7) DNA fragment (L = 595 bp) containingthe gene YGR105W. (d) Statistical analysis of mono-nucleosome positioning (N = 113 molecules); red bars correspond to experimentally detected dyadlocations and to their symmetrical position with respect to the center of the fragment. The experimental nucleosome occupancy profile (red curve) iscompared to the theoretical predictions of our physical modeling after symmetrization (black curve). (e) Statistical analysis of di-nucleosome positioning(N = 62); for each image, blue bars were drawn at both the 2 dyad positions and their symmetric locations with respect to the center of the fragment. Theexperimental nucleosome occupancy probability profile (blue curve) is compared to the symmetrized theoretical profile (black curve). (f) Histogram ofinter-nucleosomal distances between di-nucleosomes; the continuous black line corresponds to the distribution predicted by the pair-function (Eq. (18)).(g) Oriented nucleosome occupancy probability distribution observed in vivo [205] (black dots) and predicted by the physical modeling (black curve); inred is shown the location of the YGR105W gene with the TTS (arrow) and the TTS (cross). Theoretical nucleosome formation energy landscape (greencurve) and oriented nucleosome occupancy probability profile observed in vitro [216] (circles). Theoretical mono (d) and di-(e, f, g) nucleosome occupancyprofiles were computed by adjusting the chemical potential µ in our physical modeling so that the average number of loaded nucleosomes is one and tworespectively. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [86].

dyad positioning distribution obtained for N = 113 molecules (Fig. 46(d)) is similar to the one previously observed for theshort yeast DNA fragment (B) (Fig. 44(e)). The mono-nucleosome density is mainly concentrated in the central genic region,whereas the 5′ promoter and 3′ downstream regions appear to be significantly depleted in nucleosome. The experimentalnucleosome occupancy probability profile (Fig. 46(d)) is rather flat in the gene, as predicted by our physical modeling. Thisattests to a homogeneous statistical dyad positioning inside the gene without any preferential location induced by thepresence of an underlying sequence strongly favorable to nucleosome formation [86]. This conclusion is corroborated bythe nucleosome occupancy distribution observed in vitro at low nucleosome density in a genome wide experiment [216](Fig. 46(g)).

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 103

The statistical analysis of N = 62 di-nucleosomes reconstituted on the same L = 595 bp yeast DNA fragment (Fig. 46(c)),clearly reveals a bimodal (symmetrized) nucleosome occupancy probability profile with two privileged intragenic dyadlocations, as expected from the physical modeling when fixing the chemical potential so that the average number ofloaded nucleosomes is two (Fig. 46(e)). Since the physical modeling reproduces the oriented in vivo nucleosome occupancyprofile (Fig. 46(g)), our AFM imaging of in vitro nucleosome positioning is remarkably consistent with the one observedin vivo. Importantly, the two privileged locations of maximal nucleosome occupancy observed in vivo (Fig. 46(g)), didnot emerge in the in vitro AFM mono-nucleosome occupancy distribution (Fig. 46(d)) as well as in the genome wide invitro data (Fig. 46(g)). This strongly suggests that the in vivo nucleosome positioning was not driven by the presence ofsome highly positioning DNA sequences as in Fig. 45, but more likely corresponds to a global equilibrium ordering oftwo non-overlapping objects confined between fixed excluding energy barriers at both gene extremities [82,83,226]. Thisnucleosome confinement scenario is confirmed by the inter-nucleosomal distance distribution obtained from AFM imagingof di-nucleosome positioning in Fig. 46(f). According to our grand canonical physical modeling (Section 3.2), the theoreticalprobability of linker size xℓ can be computed from the pair function associated with the underlying energy E(s) [86]:

P(x = l + xl) =1Γ

∫ L−1

0eβ(µ−E(s))eβ(µ−E(s+x))ds, (18)

where Γ is a normalizing constant. In good agreement with the theoretical prediction, the experimental distribution(Fig. 46(f)) presents a main peak at small linker length (.10 nm), corresponding to one or two helical pitches (extreme leftand extreme right configurations in Fig. 46(c)), and a lower secondary peak at larger linker length (∼25 nm) correspondingto less probable configurations. The organizational role of inhibitory energy barriers is a second important result comingout from our in vitro AFM imaging of nucleosome assembly on genomic DNA templates [86]. Along the line of our thermo-dynamical modeling in Section 3 [81], statistical nucleosome ordering is nucleated from these stable ‘‘genomic boundaries’’(see Fig. 19), likely imprinted in the DNA sequence during evolution.

4.3.4. Genomic energy barriers direct the intrinsic nucleosome occupancy of regulatory sitesFinally, we have investigated nucleosome assembly on a DNA fragment long enough to allow orientation using DNA to-

pography profile as illustrated in Fig. 41(e), without the need for end labeling (e.g.with a biotin–avidin or biotin-streptavidincomplex and/or by the use of gold nanoparticles) susceptible to affect DNA structure and nucleosome positioning. We haveselected a L = 898 bp DNA fragment in the human genome that contains the promoter of the gene ILR2A that plays a majorrole in the control of the immune system response [86]. This fragment includes different functional elements: (i) the TSS, (ii)the core promoter with the TATA box, (iii) two Positive Regulatory Regions, PRRI and PRRII lying at positions −289/−216and −137/−64 (in bp units) upstream of the major TSS respectively (Fig. 47(b)). These regulatory elements are importantfor regulating inducible transcription of the ILR2A gene [395]. Previous in vivo study in unstimulated T-cells [396] revealedthat prior to transcription, an inhibitory nucleosome was positioned on the TATA box and TSS regions, therefore preventingpre-initiation complex (PIC) formation on the TATA box. In addition, this critically positioned nucleosome may also takepart in the prevention of uncontrolled DNA melting in the TSS region [397]. This makes this promoter a suitable model toinvestigate the actual role of the DNA sequence in this peculiar repressive chromatin environment.

From the analysis of N = 100 oriented DNA molecules loaded by a mono-nucleosome (Fig. 47(a)), we confirm [86] therepressive nucleosome positioning on the TATA box and TSS as observed in vivo [396]. As shown in Fig. 47(b), we also reveala favorable positioning region ∼350 bp upstream of the TSS that covers the regulatory sequence PRRI. More importantly,the observed in vitro nucleosome occupancy probability profile is in remarkable agreement with the theoretical profileobtained from our physical modeling [86]. There exists an energy barrier ∼150 bp upstream of the TSS that is coded in thesequence and that extends over the PRRII regulatory region thus explaining the lack of nucleosome positioning recordedfrom −200 bp to −100 bp (Fig. 47(b)). This stable genomic excluding barrier further conditions, by classical orderingphenomenon close to a wall [226], the observed repressive nucleosome positionings (i) at its right foot on the TATA boxand TSS and (ii) at its left foot encompassing the PRRI region (Fig. 47(b)). These results demonstrate the fundamental role ofthe DNA sequence in conditioning a peculiar local nucleosome chromatin structure that constitutively contributes to generepression. Furthermore by inhibiting nucleosome formation in the PRRII region, the sequence may favor DNA accessibilityto HMGA proteins that are known to induce a negative torsional stress and, by accumulation, a negative supercoil that mayhelp nucleosome remodeling and/or eventually nucleosome ejection [397,398]. Altogether these experimental observationsenlighten the importance of this sequence-induced local nucleosomal organization for regulating inducible transcription ofthe ILR2A gene [86].

In conclusion, in this experimental section devoted to AFM imaging in liquid of nucleosome assembly on genomic DNAfragments, we have provided in vitro confirmation [86] of the main predictions of our grand canonical physical modelingdeveloped in Section 3 [81–83]. In particular, we have demonstrated that sequences coding for high energy barriers impairnucleosome formation and condition the collective assembly of neighboring nucleosomes consistently with equilibriumstatistical ordering principles. By directing the positioning of these energy barriers relative to the TSS, the sequencecontributes to the specification of the activatory or inhibitory role of the local chromatin environment. These studies are astep forward imaging nucleosome assembly in various functional regions (promoters, replication origins, . . . ). In particular,comparing in vivo nucleosome positioning data to intrinsic sequence-induced positioning will provide direct measurement

104 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 47. AFM imaging in liquid of mono-nucleosome positioning along the human ILR2A promoter DNA fragment (L = 898 bp). (a) Examples of moleculesthat were oriented using the DNA topography profile methodology (see Fig. 41(e)). (b) Statistical analysis (N = 100 molecules) of dyad positioning (redbars) relative to the locations of the two regulatory regions PRRI and PRRII, the TATA box and the TSS. The experimental nucleosome occupancy probabilityprofile (red curve) is compared to the theoretical prediction of our physical modeling (black curve) when adjusting the chemical potential µ so that theaverage number of loaded nucleosomes is 1 (Section 3.2). The green curve corresponds to the theoretical nucleosome formation energy landscape. (Forinterpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [86].

of the action of external factors (DNA binding proteins, remodelers, . . . ) on chromatin architecture (see Section 3.6). Thestrategy developedhere can also be generalized to real-timedynamical studies of nucleosomeunwrapping and repositioningon genomic sequences, thanks to the recent development of high-speed AFM performing time-lapsed imaging at speedsmuch faster than those of traditional AFM techniques [353].

5. Low-frequency rhythms in human DNA sequences: from the detection of replication origins to the modeling ofreplication in higher eukaryotes

5.1. Evidencing low-frequency rhythms in the GC content and compositional strand asymmetries

5.1.1. GC contentThe recent sequencing of the human genome [150] has opened the door to the statistical analysis of genomic sequence

complexity at a chromosomal scale. One of the most striking features of eukaryotic chromosomes is their large-scale varia-tions in base composition. In particular, an extraordinary large heterogeneity of the GC content is observed in mammalian

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 105

Fig. 48. Space-scale representation of the GC content of a 10Mbp long fragment of human chromosome 22when using a Gaussian smoothing filter g(0)(x)(Appendix A, Eq. (A.3)). (a) GC content fluctuations computed in adjacent 1 kbp intervals. (b) Color coding of the convolution product Tg(0)[GC](n, a) =

(GC ∗ g(0)(·/a))(n) using 256 colors from black (0) to red (max); superimposed are shown the smoothed GC profiles obtained at scales a∗

1 = 40 kbp anda∗

2 = 160 kbp. On the right-hand side is shown vertically the scale (frequency−1) spectrumΛ(a) (Appendix A, Eq. (A.6)) computedwith the complexMorletwavelet (Appendix A, Eq. (A.5)) over the entire chromosome 22. The horizontal dashed lines in the color picture correspond to the two main characteristicoscillation lengths l1 = 100 and l2 = 400 kbp. (For interpretation of the references to color in this figure legend, the reader is referred to the web versionof this article.)Source: Reprinted from Ref. [87].

genomes; this has led Bernardi [146,148] to propose a description of these genomes in terms of a mosaic organization ofdomains of relatively constant GC levels, originally called isochores. The isochoremodel is a topic of controversial discussions[127,146–152,399–404]. Nevertheless there is definite evidence that the compositional heterogeneity in a DNA sequencecorrelates with its GC content [400] which is unanimously recognized as a fundamental property of the chromosomal DNAand is likely to be one of the possible key to the understanding of the organization of eukaryotic genomes [146,148,149,400]. Indeed the GC content has a taxonomy value [405]; it determines the amino acid composition of the encoded proteinsand is also related to codon usage in genes [406]. Moreover there is conspicuous evidence [149,160,400] that GC-rich andGC-poor regions respectively match the cytogenic R and G bands and correlate well with early and late replicating domainsin the cell cycle. GC-rich regions correspond to regions of very high density of genes including the housekeeping genes andassociated CpG islands and also of short inter-dispersed repetitive DNA elements (SINES, Alu) [150]. In contrast, GC-poorregions are definitively poor in genes, predominantly tissue-specific genes containing rather long introns, but are relativelyrich in long inter-disperse repetitive DNA elements (LINES) [150] that are significantly more abundant in these regions. TheGC content has also some impact on the structure of chromatin. For example, it has been suggested [18] that the proteicchromosomal scaffold that serves as a structural skeleton for the organization of chromatin loops is much less tightly folded(to the benefit of replication and transcription processes) in GC-rich than in GC-poor regions.

In Fig. 48 are reported the results [87,407] of a space-scale decomposition of the GC content fluctuations of a 10 Mbplong fragment of human chromosome 22, when using the Gaussian g(0)(x) (Appendix A, Eq. (A.3)) as smoothing filter.This decomposition reveals that for distances larger than ∼20–30 kbp, the GC content can no longer be considered asfluctuating homogeneously (at smaller scales the fluctuations cannot be distinguished from a monofractal long-rangecorrelated noise [407]); it instead displays rather regular nonlinear oscillatory behavior. The corresponding scale spectrumΛ(a) (Appendix A, Eq. (A.6)) shown vertically in Fig. 48(b), reveals the existence of two main broad peaks correspondingto the scales l1 = 100 ± 20 kbp and l2 = 400 ± 50 kbp respectively, that emerge from a continuous background.The former is the characteristic length of the basic oscillations obtained with the low-pass filtering scale a∗

1 = 40 kbp,although one may observe from time to time oscillations that have a larger length (∼2l1 = 200 kbp). If we use a largerfiltering scale a∗

2 = 160 kbp, in order to smooth both the ‘‘small scales’’ (high frequencies) colored noise component andthe basic oscillations of scale l1, we get some oscillatory profile with a fundamental length l2 = 400 kbp as illustratedin Fig. 49(a). Let us point out that the investigation of large GC-rich fragments in various human chromosomes revealssimilar periodicities [87], namely l1 = 120 ± 30 kbp, l2 = 410 ± 60 kbp for chromosome 11 (24 Mbp, NT_033899.3),l1 = 130±30 kbp, l2 = 420±60 kbp for chromosome 14 (68Mbp, NT_02637.9), and l1 = 110±20 kbp, l2 = 390±50 kbpfor chromosome 21 (29 Mbp, NT_011512.7).

106 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 49. Compositional oscillations observed in the human chromosome 22 fragment (23 Mbp, NT_011520.8) after low-pass filtering at scale a∗

2 =

160 kbp (see Fig. 48). (a) GC content. (b) Total skew S = STA + SGC (Eq. (20)). The red (blue) portions of the profiles correspond to the location of Watson(Crick) genes that have the same (opposite) orientation than the sequence. The location of the immunoglobulin locus is shown in pink. (For interpretationof the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [87].

A possible interpretation of these low-frequency rhythms observed in the GC content is of structural nature and isrelated to recent experimental and numerical observations of the high-order hierarchical folding of chromatin into fibersand loops of different sizes (Fig. 1). 100 kbp is typically the size of DNA loops that are observed by electron microscopy[17,408,409] to be held together by a longitudinal network of scaffolding proteins in histone-depleted chromosomes,favoring a radial loop/scaffold model [18,410,411] of metaphase chromosome structure. 100 kbp is also very close to thechromosome loop size (∼80 kbp) estimated from physical measurements of the dynamics of force relaxation in singlemitotic chromosomes [412]. Furthermore, 400 kbp is likely to be the size of larger chromatin loops made of a few basicloops that have some coherent dynamical behavior during the cell cycle possibly governed by themechanisms that underliereplication and transcription processes [413–417]. As an alternative to the loop/scaffold [18,410,411] and multi-loopsubcompartment [24,418] models, 100 and 400 kbp might well be characteristic scales involved into the successive levelsof helical coiling of chromatin into fibers or tubes of diameters ranging from 30 to 700 nm observed during interphase usingeither electron or light microscopy [19,20,419–424].

5.1.2. Strand asymmetriesAn alternative interpretation of the rhythms observed in theGC content is of functional nature and is a direct consequence

of the observation that 100 and 400 kbp correlate well with the replicon sizes observed in warm-blooded vertebrateorganisms [425–427]. Early experimental investigations [428,429] of replicon size by fiber auto-radiography or fluorographyhave led to the classical view that mammalian replicons are heterogeneous in size but that most fall into the range of30–450 kbp with the most frequent sizes in the range 75–150 kbp. Furthermore there is experimental evidence [425,427,428] that replicons are likely to be in groups, the so-called replicon foci, with all the replicons in each group firing at similartime in the S-phase. Results obtained with modern replicon-mapping methods clearly show the existence of much largerreplicons (the largest ones being as large as a few Mbp) than previously thought, requiring most or all the S-phase to becompleted [427,430,431]. In particular, the average size of a mammalian replicon has been reconsidered to be more likely∼500 kbp [430,431].

According to the second parity rule [432,433], under no strand-bias conditions, each genomic DNA strand should presentequimolarities [434,435] of A and T and of G and C. Because transcription (Fig. 50) and replication (Fig. 51) both require theopening of the DNA double helix, there are likely to induce some departure from intrastrand equimolarities resulting fromasymmetry in mutational and repair events on the two strands. As indicators of DNA strand-asymmetry, we will mainly usein the following, the TA and GC skews computed in non-overlapping 1 kbp windows:

STA =nT − nA

nT + nA, SGC =

nG − nC

nG + nC, (19)

where nA, nC, nG and nT are respectively the numbers of A, C, G and T in the windows. Because of the observed correlationbetween the TA and GC skews (see Section 5.2), we will also consider the total skew S:

S = STA + SGC. (20)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 107

Transcription

3’

5’ 3’

5’U

Rewinding of DNA

CA

TG

A

AT

CG

GC CTG

AC

RNA polymerase

Promoter region

Unwinding of DNA

mRNA

3’

Ribosome binding sites

5’

Fig. 50. Schematic representation of transcription elongation [4]. During the transcription elongation, i.e. the progression of the RNA polymerase proteincomplex, this enzyme catalyses the unwinding of the DNA double helix on a little more than one turn of the helix (one turn corresponds to 10-11 basepairs). The template DNA strand is exposed to the synthesis of a complementary RNA chain, in the 5′ to 3′ direction. This nascent RNA is first hybridized tothe template DNA and further separated from its template, while the two DNA strands reform the double helix during the progression of the ‘‘transcriptionbubble’’.

5’

3’

5’

5’

3’

3’Replication bubble

Semiconservative replication

3’

5’

Okazaki fragments

Replication

Okazaki fragments

Fig. 51. Schematic representation of replication initiation [4]. Replication initiates at local regions on the genome called replication origins (there is aunique origin in eubacterial genomes and many origins in eukaryotic genomes). At each replication origin position, the two strands of the parental DNAdouble helix are separated from each other to serve as templates for the synthesis of the two daughter DNA strands. On each extremity of the resulting‘‘replication bubble’’, two replication forks are formed: onemoves rightwardwith the leading (bottom) and lagging (top) strands; the othermoves leftwardwith the leading (top) and lagging (bottom) strands. All strands are synthesized in the 5′ to 3′ direction. It results that the DNA synthesized on the leadingstrand is made continuously while the lagging strand is made discontinuously, i.e. as successive short sized fragments called Okazaki fragments.

From the skews STA(n), SGC(n) and S(n), obtained along the sequences, where n is the position (in kbp units) from the origin,we will also compute the cumulative skew profiles:

ΣTA(n) =

n−j=1

STA(j), ΣGC(n) =

n−j=1

SGC(j), (21)

and

Σ(n) =

n−j=1

S(j). (22)

108 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

0.03

0.02

0.02

101 102 103

a (kbp)

Fig. 52. Scale spectrum Λ(a) (Appendix A, Eq. (A.6)) vs. a of the total skew S (Eq. (20)) computed over the entire chromosome 22 of the humangenome [440].

Deviations from intrastrand equimolarities have been extensively studied in prokaryote, organelle and viral genomes forwhich they have been used to detect the origins of replication [436–439]. During replication, mutational events can affectthe leading and lagging strands (see Section 5.2) differently and one strand can be more efficiently repaired than the otherone, leading to strand compositional asymmetry. In eukaryotes, the existence of compositional biases has been debatedand most attempts to detect the replication origins from strand compositional asymmetry have been inconclusive. Whenperforming a space-scale analysis of both the STA and SGC skews of human chromosome 22, using the smoothing Gaussianfilter g0(x) (Appendix A, Eq. (A.3)), we get oscillatory profiles similar to those obtained for the GC content, with still twomaincharacteristic lengths l1 = 140±20 kbp and l2 = 400±40 kbp as shown in Fig. 52, where the corresponding scale spectrumof the total skew S (Eq. (20)) displays twomain bumps, the latter at 400 kbpbeing themost pronounced. In Fig. 49(b) is shownthe oscillatory skew profile obtained for the smoothing scale a∗

2 = 160 kbp. This profile displays rather regular oscillationtrends of basic length ∼400 kbp. Quite remarkably this oscillatory skew profile provides a guide for the organization of thespatial location and orientation of the (largest) genes [87,407]: (+) genes with the same orientation as the sequence arelocated around the positive maxima of the oscillations (among transcribed sequences, this corresponds to 79.6 ± 1.9% (ch.22), 84.0± 2.6% (ch. 11), 89.2± 1.2% (ch. 14) and 88.1± 2.4% (ch. 21) of 1 kbp fragments that have the same orientation asthe Watson strand), while (−) genes are quite symmetrically located around the minima (mainly negative). This might besome indication of the existence of a transcription mutation bias. However, since these skew oscillations are also observedin large intergenic regions, but with smaller amplitude (data not shown), they may also arise from replication mutationbiases. Indeed, as we will elaborate on in the next sections, these oscillations of rather marked relaxational character arelikely to reflect some correlation between gene organization into clusters with preferential gene orientation and replication.For more details on the existence of low frequency rhythms in DNA sequences, we refer the reader to the PhD manuscriptsof Brodie of Brodie [440] and Nicolay [441], where the analysis of the 22 human autosomes has confirmed the existence ofrhythms in the S skew profile but with a wide range of characteristic sizes from ∼100 kbp to several Mbp.

5.2. Transcription-coupled strand asymmetries in the human genome

During genome evolution, mutations do not occur at random as illustrated by the diversity of the nucleotide substitutionrate values [442–445]. This non-randomness is considered as a by-product of the various DNAmutation and repair processesthat can affect each of the two DNA strands differently. Deviations from intrastrand equimolarities, the so-called Chargaff’ssecond parity rule [432,433], have been extensively studied and the observed skews have been attributed to asymmetriesintrinsic to the replication (Fig. 51) or to the transcription (Fig. 50) processes. Asymmetries of substitution rates coupledto transcription have been mainly observed in prokaryotes [446–448], with only preliminary results in eukaryotes. In thehuman genome, excess of T was observed in a set of gene introns [449] and some large-scale asymmetry was observedin human sequences but they were attributed to replication [450]. A comparative analysis of mammalian sequencesdemonstrated a transcription-coupled excess of G+T over A+C in the coding strand [451]. In contrast to the substitutionbiases observed in bacteria presenting an excess of C → T transitions, these asymmetries are characterized by an excess ofpurine (A → G) transitions relatively to pyrimidine (T → C) transitions. These might be a by-product of the transcription-coupled repair mechanism acting on uncorrected substitution errors during replication [452]. In this section, we report theresults of a genome-scale analysis of human genes that definitely establish the existence of transcription-coupled nucleotidebiases [453–455].

5.2.1. Strand asymmetries in human gene sequencesWe have started examining nucleotide compositional strand asymmetries in transcribed regions of human sequences

[453]. We have computed the STA and SGC skews (Eq. (19)) for intron sequences since, in contrast to exonic sequences, they

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 109

a b

Fig. 53. Probability density functions of the skews STA (•) and SGC ( ) values computed from the intronic sequences of the 14854 intron-containing genesafter removing repeated sequences. (a) (+) genes; (b) (−) genes.Source: Adapted from Ref. [440].

Fig. 54. (a) STA and SGC skews (in % units) in human introns: each point corresponds to one of the 14854 intron-containing genes; repeated elements areremoved from the analysis; red points correspond to (+) genes (7508) with the same orientation as the Watson strand; blue points correspond to (−)genes (7346) with opposite orientation; black crosses represent the standard deviations of the distributions. (b) Correlation between STA and SGC skewsdetermined on the coding strand from intronic regions without repeats: each point corresponds to a gene for which the total length of intronic regions isl > 25 kbp (7797 genes); Pearson’s correlation coefficient r = 0.61 (the slope of the regression line is 0.58). (For interpretation of the references to colorin this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [453].

can be considered as weakly selected sequences. Human intron sequences were downloaded from RefGene (April 2003)at UCSC. When several genes presented identical exonic sequences, only the longest one was retained; repeated elementswere removedwith RepeatMasker [456]. The introns of each gene were taken as a single sequence; introns without repeatswere also taken as a single sequence; to avoid the skew associated with splicing signals, 560 bp were removed at both in-tron extremities.When the resulting intron sequenceswere shorter than 1120 bp, theywere not considered for the analysis,leading to 14854 intron-containing genes. The distributions of the TA and the GC skews, computed on this set of intron-containing genes, present positivemean values for (+) genes (7508), namely STA = 4.72%±0.07% and SGC = 2.97%±0.07%,and nearly opposed values for (−) genes (7346), namely STA = −4.56%±0.07% and SGC = −3.05%±0.07%.When removingthe repeated sequences from the analysis, the TA and GC biases are not strongly altered. The corresponding pdfs of STA andSGC are shown in Fig. 53 in a semi-logarithmic representation. For both (+) (Fig. 53(a)) and (−) (Fig. 53(b)) genes, we observethe presence, for both skews, of large tails that clearly indicates some departure from a parabolic profile, the signature ofGaussian statistics. When examined on the coding strand, the mean value for all intron sequences without repeats presentsignificant excess of T over A, namely STA = 4.49% ± 0.01%, and excess of G over C, namely SGC = 3.29% ± 0.01% (afterappropriately removing intron extremities).

A question of interest is the possible existence of correlations between STA and SGC skews in intronic sequences withoutrepeats. When all genes are considered, only a small correlation is observed (Pearson’s correlation coefficient r = 0.09).However, the values of the skews from small genes turn out to be highly noisy. When we exclude these small genes, STAand SGC present a larger correlation, e.g. r = 0.45 for genes with total intron length l > 10 kbp and r = 0.61 for geneswith l > 25 kbp, as illustrated in Fig. 54(b). Let us point out that STA and SGC present weak correlation with the intronic GCcontent as well as with the sequence length and this even if only the large genes are considered [453].

110 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 55. (a) Skew profile S(n) (Eq. (20)) of a repeat-masked fragment of human chromosome 6; red (resp. blue) 1 kbp window points correspond to (+)genes (resp. (−) genes) lying on the Watson (resp. Crick) strand; black points to intergenic regions. (b) Cumulated skew profileΣ(n) (Eq. (22)). (c) WT ofΣ; Tg(2) (n, a) is coded from black (min) to red (max); the WT skeleton defined by the maxima lines is shown in solid (resp. dashed) lines correspondingto positive (resp. negative) WT values. For illustration yellow solid (resp. dashed) maxima lines are shown to point to the positions of 2 upward (resp. 2downward) jumps in S (vertical dashed lines in (a) and (b)) that coincide with gene transcription starts (resp. ends). In green are shownmaxima lines thatpersist above a ≥ 200 kbp and that point to sharp upward jumps in S (vertical solid lines in (a) and (b)) that are likely to be the locations of putativereplication origins (see Section 5.3); note that 3 out of those 4 jumps are co-located with transcription start sites. (For interpretation of the references tocolor in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [455].

5.2.2. Revealing the bifractality of human skew DNA walks with the wavelet transform modulus maxima methodIn this section, our goal is to show that, when examined over a range of scales not too large as compared to the mean

characteristic size of human genes (∼30–40 kbp), the skew DNA walks Σ (Eq. (22)) of the 22 human autosomes displayan unexpected (with respect to previous monofractal diagnosis (Section 2.4.1) [14–16,113,124]) bifractal scaling behavioras the signature of the presence of transcription-induced jumps in the LRC noisy S profiles [455]. Sequences and geneannotation data (‘‘RefGene’’) were retrieved from the UCSC Genome Browser (May 2004). Again, we used RepeatMaskerto exclude repetitive elements that might have been inserted recently and would not reflect long-term evolutionarypatterns. As an illustration of our wavelet-based methodology (see Appendix A), we show in Fig. 55 the S skew profile of afragment of human chromosome 6 (Fig. 55(a)), the corresponding skew DNA walk (Fig. 55(b)) and its space-scale waveletdecomposition (Fig. 55(c)) using the Mexican hat analyzing wavelet g(2) (Appendix A, Fig. A.1(b)). When computing thepartition function Z(q, a) (Appendix A, Eq. (A.12)) from the WT skeletons of the skew DNA walks Σ of the 22 humanautosomes, we get convincing power-law behavior for −1.5 ≤ q ≤ 3 (data not shown). In Fig. 56(a) are reported theτ(q) exponents (Appendix A, Eq. (A.13)) obtained using a linear regression fit of ln Z(q, a) vs. ln a over the range of scales10 kbp ≤ a ≤ 40 kbp. All the data points remarkably fall on two straight lines τ1(q) = 0.78q − 1 and τ2(q) = q − 1, whichstrongly suggests the presence of two types of singularities h1 = 0.78 and h2 = 1, respectively on two sets S1 and S2 withthe same Haussdorf dimension D = −τ1(0) = −τ2(0) = 1, as confirmed when computing the D(h) singularity spectrum(Appendix A, Eq. (A.14)) in Fig. 56(b). This observation means that Z(q, a) can be split in two parts [455]:

Z(q, a) = C1(q)aqh1−1+ C2(q)aqh2−1, (23)

where C1(q) and C2(q) are prefactors that depend on q. Since h1 < h2, in the limit a → 0+, the partition function is expectedto behave like Z(q, a) ∼ C1(q)aqh1−1 for q > 0, and like Z(q, a) ∼ C2(q)aqh2−1 for q < 0, with a so-called phase transition[457–459] at the critical value qc = 0. Surprisingly, it is the contribution of the weakest singularities h2 = 1 that controlsthe scaling behavior of Z(q, a) for q > 0, while the strongest ones h1 = 0.78 actually dominate for q < 0 (Fig. 56(a)). Thisinverted behavior originates from finite (1 kbp) resolution which prevents the observation of the predicted scaling behaviorin the limit a → 0+. The prefactors C1(q) and C2(q) in Eq. (23) are sensitive to (i) the number of maxima lines in the WTskeleton alongwhich theWTMMbehave as ah1 or ah2 , and (ii) the relative amplitude of theseWTMM.Over the range of scalesused to estimate τ(q), the WTMM along the maxima lines pointing (at small scale) to h2 = 1 singularities are significantlylarger than those along the maxima lines associated to h1 = 0.78 (see Fig. 56(c) and (d)). This implies that the larger q > 0,

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 111

a b

c d

Fig. 56. Multifractal analysis of Σ(n) (Eq. (22)) of the 22 human (filled symbols) and 19 mouse (open circle) autosomes using the WTMM method withg(2) (see Appendix A) over the range 10 kbp ≤ a ≤ 40 kbp. (a) τ(q) vs. q. (b) D(h) vs. h. (c) WTMM pdf: ρ is plotted vs. |T |/aH where H = h1 = 0.78, insemi-log representation; the inset is an enlargement of the pdf central part in linear representation. (d) Same as in (c) but with H = h2 = 1. In (c) and (d),the symbols correspond to scales a = 10 (•), 20 () and 40 (N) in kbp units.Source: Adapted from Ref. [455].

the stronger the inequality C2(q) ≫ C1(q) and the more pronounced the relative contribution of the second term in ther.h.s. of Eq. (23). On the opposite for q < 0, C1(q) ≫ C2(q) which explains that the strongest singularities h1 = 0.78 nowcontrol the scaling behavior of Z(q, a) over the explored range of scales.

In Fig. 56(c) and (d) are shown the WTMM pdfs computed at scales a =10, 20 and 40 kbp after rescaling by ah1 and ah2respectively. We note that there does not exist a value of H such that all the pdfs collapse on a single curve as expected fromEq. (B.9) formonofractal DNAwalks (seeAppendix B). Consistentwith the τ(q)data in Fig. 56(a) andwith the inverted scalingbehavior discussed above, when using the two exponents h1 = 0.78 and h2 = 1, we succeed in superimposing respectivelythe central (bump) part (Fig. 56(c)) and the tail (Fig. 56(d)) of the rescaledWTMMpdfs. This corroborates the bifractal natureof the skewDNAwalks that display two competing scale-invariant components of different Hölder exponents: (i) h1 = 0.78corresponds to LRC homogeneous fluctuations previously observed over the range 200 bp . a . 20 kbp in DNA walksgenerated with structural codings (Section 2.6) [14–16,130] and (ii) h2 = 1 is associated to convex ∨ and concave ∧ shapesin the DNA walks Σ indicating the presence of discontinuities in the derivative of Σ , i.e., of jumps in S (Fig. 55(a) and (b)).At a given scale a, according to Eq. (A.4), a large value of the WTMM in Fig. 55(c) corresponds to a strong derivative of thesmoothed S profile and the maxima line to which it belongs is likely to point to a jump location in S. This is particularly thecase for the coloredmaxima lines in Fig. 55(c): upward (resp. downward) jumps (Fig. 55(a)) are so-identified by themaximalines corresponding to positive (resp. negative) values of the WT.

5.2.3. Transcription-induced step-like skew profiles in the human genomeIn order to identify the origin of the jumps observed in the skew profiles, we have performed a systematic investigation

of the skews observed along 14854 intron containing genes [453,454]. In Fig. 57 are reported themean values of STA and SGCskews for all genes as a function of the distance to the 5′- or 3′ end. At the 5′ gene extremities (Fig. 57(a)), a sharp transitionof both skews is observed from about zero values in the intergenic regions to finite positive values in transcribed regionsranging between 4% and 6% for STA and between 3% and 5% for SGC. At the gene 3′ extremities (Fig. 57(b)), the TA andGC skewsalso exhibit transitions from significantly large values in transcribed regions to very small values in untranscribed regions.However, in comparison to the steep transitions observed at 5′ ends, the 3′ end profiles present a slightly smoother transitionpattern extending over ∼5 kbp and including regions downstream of the 3′ end likely reflecting the fact that transcription

112 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a b

Fig. 57. TA (•) and GC ( ) skew profiles in the regions surrounding 5′ and 3′ gene extremities. STA and SGC are calculated in 1 kbp windows startingfrom each gene extremity in both directions. In abscissa is reported the distance (n) of each 1 kbp window to the indicated gene extremity; zero valuesof abscissa correspond to 5′ (a) and 3′ (b) gene extremities. In ordinate is reported the mean value of the skews over our set of 14854 intron-containinggenes for all 1 kbp windows at the corresponding abscissa. Error bars represent the standard error of the means.Source: Adapted from Ref. [440].

a b

Fig. 58. Statistical analysis of skew variations at the singularity positions determined at scale 1 kbp from the maxima lines that exist at scales a ≥ 10 kbpin the WT skeletons of the 22 human autosomes. For each singularity, we have computed the variation amplitudes1S = S(3′)− S(5′) over two adjacent5 kbp windows, respectively in the 3′ and 5′ directions and the distances 1n to the closest TSS (resp. TTS). (a) Histograms N(|1S|) for upward (1S > 0,red) and downward (1S < 0, black) skew variations. (b) Histograms of the distances1n of upward (red) or downward (black) jumps with |1S| ≥ 0.1 tothe closest TSS (•, ) and TTS (, ). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)Source: Adapted from Ref. [455].

continues to some extent downstream of the polyadenylation site. In pluricellular organisms, mutations responsible for theobserved biases are expected to have mostly occurred in germline cells. It could happen that gene 3′ ends annotated in thedatabank differ from the poly-A sites effectively used in germline cells. Such differenceswould then lead to some broadeningof the skew profiles.

5.2.4. From skew multifractal analysis to the detection of genes expressed in the germlineIn Fig. 58 are reported the results of a statistical analysis of the jump amplitudes in human S profiles [455]. For maxima

lines that extend above a∗= 10 kbp in theWT skeleton (see Fig. 55(c)), the histograms obtained for upward and downward

variations are quite similar, especially their tails that are likely to correspond to jumps in the S profiles (Fig. 58(a)). Whencomputing the distance between upward or downward jumps (|1S| ≥ 0.1) to the closest TSS or TTS (Fig. 58(b)), we revealthat the number of upward jumps in close proximity (|1n| . 3 kbp) to TSS over-exceeds the number of such jumps closeto TTS. Similarly, downward jumps are preferentially located at TTS. These observations are consistent with the step-likeshape of skew profiles induced by transcription: S > 0 (resp. S < 0) is constant along a (+) (resp. (−)) gene and S = 0 inthe intergenic regions (Fig. 57) [453]. Since a step-like pattern is edged by one upward and one downward jump, the set ofhuman genes that are significantly biased is expected to contribute to an even number of1S > 0 and1S < 0 jumps whenexploring the range of scales 10 . a . 40 kbp, typical of humangene size. Note that in Fig. 58(a), the number of sharp upwardjumps actually slightly exceeds the number of sharp downward jumps, consistent with the experimental observation thatwhereas TSS are well defined, TTS may extend over 5 kbp resulting in smoother downward skew transitions (Fig. 57(b)).

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 113

a b

Fig. 59. (a) Number of TSS with an upward jump within |1n| (abscissa) for jump amplitudes1S > 0.1 (black), 0.15 (dark grey) and 0.2 (light grey). Solidlines correspond to true jump positions while dashed lines to the same analysis when jump positions were randomly drawn along each chromosome.(b) Among the Ntot(1S∗) upward jumps of amplitude larger than some threshold 1S∗ , we plot the proportion of those that are found within 1 kbp (•),2 kbp () or 4 kbp (N) of the closest TSS vs. the number NTSS of the so-delineated TSS. Curves were obtained by varying1S∗ from 0.1 to 0.3 (from right toleft). Open symbols correspond to similar analyses performed on random upward jump and TSS positions.Source: Reprinted from Ref. [455].

This TTS particularity also explains the excess of upward jumps found close to TSS as compared to the number of downwardjumps close to TTS (Fig. 58(b)).

In Fig. 59(a), we report the analysis of the distance of TSS to the closest upward jump [455]. For a given upward jumpamplitude, the number of TSS with a jumpwithin |1n| increases faster than expected (as compared to the number found forrandomized jump positions) up to |1n| ≃ 2 kbp. This indicates that the probability to find an upward jump within a genepromoter region is significantly larger than elsewhere. For example, out of 20023 TSS, 36% (7228) are delineated within2 kbp by a jump with 1S > 0.1. This provides a very reasonable estimate for the number of genes expressed in germlinecells as compared to the 31.9% recently experimentally found to be bound to Pol II in human embryonic stem cells [460].

Combining the previous results presented in Figs. 58(b) and 59(a), we report in Fig. 59(b) an estimate of theefficiency/coverage relationship by plotting the proportion of upward jumps (1S > 1S∗) lying in TSS proximity as a functionof the number of so-delineated TSS [455]. For a given proximity threshold |1n|, increasing1S∗ results in a decrease of thenumber of delineated TSS, characteristic of the right tail of the gene bias pdf. Concomitant to this decrease, we observe anincrease of the efficiency up to a maximal value corresponding to some optimal value for1S∗. For |1n| < 2 kbp, we reach amaximal efficiency of 60% for1S∗

= 0.225; 1403 out of 2342 upward jumps delineate a TSS. Given the fact that the actualnumber of human genes is estimated to be significantly larger (∼30000) than the number provided by RefGene, a largepart of the the 40% (939) of upward jumps that have not been associated to a RefGene could be explained by this limitedcoverage. In other words, jumps with sufficiently high amplitude are very good candidates for the location of highly-biasedgene promoters. Let us point that out of the above 1403 (resp. 2342) upward jumps, 496 (resp. 624) jumps are still observedat scale a∗

= 200 kbp. We will see in the next section that these jumps are likely to also correspond to replication originsunderlying the fact that large upward jumps actually result from the cooperative contributions of both transcription- andreplication-associated biases [87,89–91]. The observation that 80% (496/624) of the predicted replication origins are co-located with TSS enlightens the existence of a remarkable gene organization at replication origins [91].

To summarize, we have demonstrated the bifractal character of skew DNA walks in the human genome. When usingthe WT microscope to explore (repeat-masked) scales ranging from 10 to 40 kbp, we have identified two competinghomogeneous scale-invariant components characterized by Hölder exponents h1 = 0.78 and h2 = 1 that respectivelycorrespond to LRC colored noise and sharp jumps in the original DNA compositional asymmetry profiles. Remarkably, theso-identified upward (resp. downward) jumps aremainly found at the TSS (resp. TTS) of humangeneswithhigh transcriptionbias and thus very likely highly expressed in the germline. As illustrated in Fig. 56(a), similar bifractal properties are alsoobserved when investigating the 19 mouse autosomes. This suggests that the results reported in this section are generalfeatures of mammalian genomes [455].

5.3. Replication-associated strand asymmetries in mammalian genomes

DNA replication is an essential genomic function responsible for the accurate transmission of genetic informationthrough successive cell generations. According to the so-called ‘‘replicon’’ paradigm derived from prokaryotes [461], thisprocess starts with the binding of some ‘‘initiator’’ protein complex to a specific ‘‘replicator’’ DNA sequence called origin ofreplication. The recruitment of additional factors initiates the bi-directional progression of two divergent replication forksalong the chromosome. As illustrated in Figs. 51 and 60(a), one strand is replicated continuously (leading strand), while theother strand is replicated in discrete steps towards the origin (lagging strand). In eukaryotic cells, this event is initiated at

114 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a number of replication origins and propagates until two converging forks collide at a terminus of replication [462,463].The initiation of different replication origins is coupled to the cell cycle but there is a definite flexibility in the usage of thereplication origins at different developmental stages [464–468]. Also, it can be strongly influenced by the distance and timingof activation of neighboring replication origins, by the transcriptional activity and by the local chromatin structure [465–468]. Actually, sequence requirements for a replication origin vary significantly between different eukaryotic organisms. Inthe unicellular eukaryote S. cerevisiae, the replication origins spread over 100-150 bp and present some highly conservedmotifs [462]. However, among eukaryotes, S. cerevisiae seems to be the exception that remains faithful to the repliconmodel.In the fission yeast Schizosaccharomyces pombe, there is no clear consensus sequence and the replication origins spread overat least 800 to 1000 bp [462]. In multicellular organisms, the nature of initiation sites of DNA replication is even morecomplex [463]. Metazoan replication origins are rather poorly defined and initiation may occur at multiple sites distributedover a thousand of base pairs [469]. The initiation of replication at random and closely spaced sites was repeatedly observedin Drosophila and Xenopus early embryo cells, presumably to allow for extremely rapid S phase, suggesting that any DNAsequence can function as a replicator [464,470,471]. A developmental change occurs around midblastula transition thatcoincides with some remodeling of the chromatin structure, transcription ability and selection of preferential initiationsites [464,471]. Thus, although it is clear that some sites consistently act as replication origins in most eukaryotic cells, themechanisms that select these sites and the sequences that determine their location remain elusive in many cell types [463,472,473]. As recently proposed by many authors [474–476], the need to fulfill specific requirements that result from celldiversification may have led high eukaryotes to develop various epigenetic controls over the replication origin selectionrather than to conserve specific replication sequence. This might explain that for many years, very few replication originshave been identified inmulticellular eukaryotes, namely around 20 inmetazoa and only about 10 in humanwhenwe startedthis study [89,90]. Since then, about a few hundred replication origins have been further experimentally verified [477–479]on the ENCODE regions that cover one percent of the human genome only. Along the line of this epigenetic interpretation,one might wonder what can be learned about eukaryotic DNA replication from DNA sequence analysis.

5.3.1. Replication-associated strand asymmetries in prokaryotic genomes: the replicon modelAsmentioned in Section 5.1.2, the existence of replication-associated strand asymmetries has beenmainly established in

bacterial genomes [435–439]. As illustrated in Fig. 60, theGC and TA skews abruptly switch sign (over fewkbp) fromnegativeto positive values at the replication origin and in the opposite direction from positive to negative values at the replicationterminus. This step-like profile is characteristic of the repliconmodel [461]. In Bacillus subtilis, as inmost bacteria, the leading(resp. lagging) strand (Fig. 60(a)) is generally richer (resp. poorer) in G than in C (Fig. 60(b)), and to a lesser extent in T than inA (data not shown). This typical pattern is particularly clearwhen plotting the cumulated skewsΣGC (Fig. 60(c)) andΣTA (Eq.(21)); both present decreasing (or increasing) profiles in regions situated 5′ (or 3′) to the origin, displaying a characteristic∨-shape pointing to the replication origin position (similarly a characteristic ∧-shape is observed at the terminus position).The research of ∨ patterns in the cumulated skews has been extensively used as a strategy to detect the position of the(unique) replication origin in (generally circular) bacterial genomes [436–439].

As shown in Fig. 60(b) and (c), when looking at the gene organization around the replication origin of Bacillus subtilis,we observe that most of the (+) (resp. (−)) genes are preferentially on the right (resp. left) of the replication origin. Thissuggests that the replication forks progression is co-oriented with transcription, as to minimize the risk of frontal collisionbetween DNA and RNA polymerases [480–483].

5.3.2. Analysis of strand asymmetries around experimentally determined replication origins in the human genomeIn eukaryotes, the existence of compositional biases is unclear and most attempts to detect the replication origins from

strand compositional asymmetry have been inconclusive. Several studies have failed to show compositional biases relatedto replication, and analysis of nucleotide substitutions in the region of the β-globin replication origin in primates do notsupport the existence ofmutational bias between the leading and the lagging strands [436,484,485]. Other studies have led torather opposite results. For instance, strand asymmetries associatedwith replication have been observed in the subtelomericregions of S. cerevisiae chromosomes, supporting the existence of replication-coupled asymmetric mutational pressure inthis organism [486]. With the same methodology as the one developed in Section 5.2.3 for gene extremities, we present inthis section analyses of strand asymmetries flanking experimentally determined human replication origins [89,90].

As shown in Fig. 61, most of the known replication origins in the human genome correspond to rather sharp (over severalkbp) transitions from negative to positive STA and SGC skew values that clearly emerge from the noisy background. This isreminiscent of the behavior observed in Fig. 60 for Bacillus subtilis, except that the leading strand is relatively enriched inT over A than in G over C. This observation is even more patent when looking at the cumulated skewΣTA andΣGC profilesthat both display characteristic ∨-shapes pointing to the experimentally identified initiation zones. According to the geneenvironment, the amplitude of the jump observed in the skew profiles can be more or less important and its position moreor less localized (from a few kbp to a few tens of kbp). Indeed, we have seen in Section 5.2 that transcription generatespositive TA and GC skews on the coding strand [453,454,492], which explains that larger jumps are observed when the (+)and/or the (−) genes are on the leading strand so that replication and transcription biases add to each other. To measurecompositional asymmetries that would result from replication only, we have calculated the skews in intergenic regions onboth sides of the origins [90]. The total skew S definitely shifts fromnegative (S = −6.2%±0.4%) to positive (S = 11.1%±1%)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 115

a

leadinglagging

5’ 3’

Replication Origin

5’3’

c

b O T

Fig. 60. (a) Schematic representation of the divergent bi-directional progression of the two replication forks from the replication origin (see Fig. 51).(b) SGC calculated in 1 kbp windows along the genomic sequence of Bacillus subtilis. (c) Cumulated skew ΣGC . The vertical lines correspond respectivelyto the replication origin (O) and termination (T) positions. In (b) and (c), red (resp. blue) points correspond to (+) (resp. (−)) genes that have the same(opposite) orientation than the sequence. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)

Table 5Strand asymmetries associated with human replication origins. The skews were calculated in the regions flanking the six human replication origins(Fig. 61(a)) and in the corresponding homologous regions of themouse genome. Intergenic sequenceswere always considered in the direction of replicationfork progression (leading strand); they were considered in totality (all) or after elimination of conserved regions (ncr.) between human (Homo sapiens,H.s.) and mouse (Mus musculus, M.m.) (≤6% of intergenic sequences). To calculate the mean skew in introns, the sequences were considered on thenontranscribed strand. For Slead , the orientation of transcription was the same as the replication fork progression; for Slag , the situation was the opposite.The mean values of the skews STA , SGC , and S are given in percent (±SEM). l, total sequence length in kbp.Source: Reprinted from Ref. [90].

STA SGC S l G+ C, %

Intergenic (H.s.) all 3.9 ± 0.4 3.0 ± 0.4 6.9 ± 0.4 487 42Intergenic (H.s.) ncr. 4.0 ± 0.4 3.0 ± 0.5 7.0 ± 0.5 461 42Intergenic (M.m.) ncr. 3.6 ± 0.4 2.2 ± 0.5 5.8 ± 0.5 441 42Slead (H.s. introns) 7.5 ± 0.3 6.8 ± 0.4 14.3 ± 0.4 358 40Slag (H.s. introns) −1.9 ± 1.0 −0.3 ± 1.4 −2.2 ± 1.3 49 44

values when crossing the replication origin. This result strongly suggests the existence of mutational pressure associatedwith replication, leading to the mean compositional biases STA = 4.0% ± 0.4% and SGC = 3.0% ± 0.5% (Table 5) [90]. Letus note that the value of the skew could vary from one origin to another, possibly reflecting different initiation efficiencies.From the calculation of the intron skew values on the leading and lagging strands reported in Table 5, we can estimate themean skew associated with transcription by subtracting intergenic skews from Slead values giving STA = 3.6% ± 0.7% andSGC = 3.8%± 0.9%. These estimations are remarkably consistent with those obtained with our large set of human introns inSection 5.2.1 (Figs. 53 and 54), further supporting the existence of replication-coupled strand asymmetries. Overall, theseresults indicate that the mean replication bias on the leading strand and the mean transcriptional bias on the coding strandare of the same order of magnitude, namely S = STA + SGC ∼ 7% (Table 5) [90].

In that context, one can wonder to which extent the biases observed in intergenic regions may result from the possiblepresence of still undetected genes. Two pieces of evidence argue against this eventuality. First, we have been careful enoughto retain as transcribed regions one of the largest sets of transcripts available, resulting in a stringent definition of intergenicregions. Second, several studies have demonstrated the existence of hitherto unknown transcripts in regions where no

116 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a

b

c

Fig. 61. TA and GC skew profiles around experimentally determined human replication origins. (a) The skew profiles were determined in 1 kbp windowsin regions surrounding (±100 kbp without repeats) experimentally determined human replication origins. (Upper) TA and GC cumulated skew profilesΣTA (thick line) andΣGC (thin line). (Lower) Total skew S calculated in the same regions. The jump amplitude1S associated with these origins, calculatedas the difference of the skewsmeasured in 20 kbp windows on both sides of the origins, are: (31%)MCM4 [487], (29%) HSPA4 [488], (18%) TOP1 [489], (14%)MYC [490], (38%) SCA7 [491], and (14%) AR [491]. (b) Cumulated skew profiles calculated in the six regions of themouse genome homologous to the humanregions analyzed in (a). (c) Cumulated skew profiles in the six regions of the dog genome homologous to human regions analyzed in (a). The abscissa (n)represents the distance (in kbp) of a sequence window to the corresponding origin; the ordinate represents the values of S and Σ given in percent. Thecolors have the following meaning: red, (+) genes (coding strand identical to the Watson strand); blue, (−) genes (coding strand opposite to the Watsonstrand); black, intergenic regions. In (c), genes are not represented. (For interpretation of the references to color in this figure legend, the reader is referredto the web version of this article.)Source: Adapted from Ref. [90].

protein coding genes have been previously identified [493–496]. Taking advantage of the set of non-protein-coding RNAsidentified in the ‘‘H-Inv’’ database [497], we have checked that none of them are present in the intergenic regions studiedhere. Finally we have eliminated the possibility that intergenic skews are due to conserved sequences by checking thatthe removal of homologous segments found in the mouse genome (∼5.3% of all intergenic sequences) does not changesignificantly the skews in intergenic regions [90].

5.3.3. Conservation of replication-associated strand asymmetries in mammalian genomesAs a next step of our study, we have analyzed [90] the STA an SGC skew profiles in DNA regions of mammalian genomes

homologous to the six human origin investigated in Fig. 61(a). As shown in Fig. 61(a)–(c), the human, mouse and dogcumulated skew profiles look strikingly similar to each other, suggesting that in mouse and dog, these regions alsocorrespond to replication initiation zones (note that they are very similar in primate genomes). For each replication origin,

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 117

n (kbp)

ΣT

AG

C

–300 –100 100 300

0

500

1000

–300 –100 100 300

0

500

1000

–300 –100 100 300

–1000

–500

0

500

DNMT1 Lamin B2 globinβ−

Σ

Fig. 62. Cumulated skew profiles calculated around the origin of replication DNMT1 [498], Lamin B2 [499], and β-globin [500] in the human genome:ΣTA(thick line) andΣGC (thin line). The colors have the same meaning as in Fig. 61.Source: Reprinted from Ref. [90].

we robustly observe a ∨-shape characteristic of a sharp upward jump from negative to positive skew values. A detailedexamination of the mouse intergenic regions suggests the existence of a compositional bias associated with replicationS = STA + SGC ∼ 5.8% ± 0.5% (Table 5). Let us point out that, at these homologous loci, human and mouse intergenicsequences present almost no (∼5.3%) conserved elements. Hence, the presence of strand asymmetry in regions that havestrongly diverged during evolution further supports the existence of compositional bias associated with replication in bothorganisms. In the absence of such a process, intergenic sequences would have lost a significant fraction of their strandasymmetry.

Altogether, these results establish the existence of strand asymmetries associated with replication in mammaliangermline cells [90]. They show that most replication origins experimentally detected in somatic cells coincide with sharpupward transitions of the skew profile. They also imply that for the majority of experimentally determined origins, the po-sition of initiation zones are conserved in mammalian genomes as recently confirmed by the identification of a replicationorigin in the mouseMYC locus [501].

Remark. Let us emphasize that among the nine experimentally-known human origins analyzed here, three do not presenttypical ∨-shape cumulated profiles as reported in Fig. 62. For DNMT1 (left panel in Fig. 62), the sharp central part of the ∨

profile is replaced by a large horizontal plateau (few tens of kbp), possibly reflecting the presence of several origins dispersedover the whole plateau. Note that dispersed origins have been observed, e.g. in the hamster DHFR initiation zone [502]. Bycontrast, the cumulated skew profiles of the Lamin B2 (central panel of Fig. 62) and β-globin (right panel of Fig. 62) originspresent no ∨ profile suggesting that they might be inactive in germline cells or less active than neighboring origins.

5.3.4. Factory-roof skew profiles in the human genomeAs illustrated in Fig. 63(a), for TOP1 replication origin, when examining the behavior of the skews at larger distances

from the origin, one does not observe a step-like pattern with upward and downward jumps at the origin and terminationpositions respectively, as expected for the bacterial replicon model (Fig. 60(b)). Surprisingly, on both sides of the upwardjump, the noisy S profile decreases steadily in the 5′ to 3′ direction without clear evidence of pronounced downward jumps.As shown in Fig. 63(b–d), sharpupward jumps of amplitude1S & 15%, similar to the ones observed for the known replicationorigins (Fig. 61(a)), seem to exist also atmany other locations along the human chromosomes. But themost striking feature isthe fact that in between two neighboringmajor upward jumps, not only the noisy S profile does not present any comparabledownward sharp transition, but it displays a remarkable decreasing linear behavior. At chromosome scale, one thus getsjagged S profiles that have the aspect of ‘‘factory roofs’’ [89,90].

Remark. For comparison, we show in Fig. 64, the STA, SGC and S profiles obtained for a large fragment of the human chromo-some 12 after (Fig. 64(a–c)) and without (Fig. 64(a’–c’)) removing the repeated sequences. There is no doubt that repeatedsequences increase the level of noise in the skewprofiles. Indeed factory roofs aremore easily seen on themasked sequencesand specially on the total skews S = STA + SGC. As reported in Fig. 65, the pdfs of STA, SGC and S are nearly Gaussian for themasked sequences; some large tails are present but for skew amplitudes larger than 40%. The fact that the skew pdfs of thenative sequences, andmore particularly the STA pdf, significantly depart fromGaussian distributions justifies, a posteriori, theneed of removing repeated sequences prior to our statistical analysis. Most of these sequences have been inserted recentlyin the human genome and do not reflect long-term evolutionary skew patterns.

The jagged S profiles shown in Figs. 63(a–d) and 64(a–c) look somehow disordered because of the extreme variability inthe distance between two successive upward jumps, from spacing ∼50–100 kbp (∼100–200 kbp for the native sequences)mainly in GC rich regions (Fig. 63(d)), up to 2–3 Mbp (∼4–5 Mbp for the native sequences) in agreement with recentexperimental studies [427] that have shown that mammalian replicons are heterogeneous in size with an average size∼500 kbp, the largest ones being as large as a few Mbp. But what is important to notice is that some of these segmentsbetween two successive upward jumps of the skew are entirely or predominantly intergenic (Fig. 63(a) and (c)), clearly

118 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a b

c d

e f

Fig. 63. S profiles along mammalian genome fragments. (a) Fragment of chromosome 20 including the TOP1 origin (red vertical line).(b and c) Chromosome 4 and chromosome 9 fragments, respectively, with low GC content (36%). (d) Chromosome 22 fragment with larger GC content(48%). In (a) and (b), vertical lines correspond to selected putative origins (see Section 5.4); yellow lines are linear fits of the S values between successiveputative origins. Black, intergenic regions; red, (+) genes; blue, (−) genes. Note the fully intergenic regions upstream of TOP1 in (a) and a predominantlyintergenic region frompositions 5290-6850 kbp in (c). (e) Fragment ofmouse chromosome 4 homologous to the human fragment shown in (c). (f) Fragmentof dog chromosome 5 homologous to the human fragment shown in (c). In (e) and (f), genes are not represented. (For interpretation of the references tocolor in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [90].

illustrating the particular profile of a strand bias resulting solely from replication [89,90]. In most other cases, we observethe superimposition of this replication profile and of the step-like profiles of (+) and (−) genes, appearing as upward anddownward blocks standing out from the replication pattern (Fig. 63(c)). Importantly, as illustrated in Fig. 63(e) and (f), thefactory-roof pattern is not specific to human sequences but is also observed in numerous regions of the mouse and doggenomes [90].

5.4. A wavelet-based method to detect putative replication origins

We have shown in Section 5.3 that experimentally determined human replication origins coincide with large-amplitudeupward transitions in noisy skew profiles. The corresponding1S ranges between 14% and 38%, owing to possible differentreplication initiation efficiencies and/or different contributions of transcriptional biases (Fig. 61). To predict replicationorigins, we thus need a methodology to detect jumps in noisy signals. The basic principle of the detection of jumps in thenoisy skew profiles with the WT transform has been discussed in Section 5.2.2 and illustrated in Fig. 55. From Eq. (A.4) (seeAppendix A), when using the first derivative of the Gaussian function as analyzing wavelet, it is obvious that at a fixed scalea, a large value of the modulus of the WT coefficient corresponds to a strong derivative of the smoothed skew profile. Inparticular, a jump manifests as local maxima of the WT modulus that connect on to maxima lines in the WT skeleton thatoriginate at small scale from the jump location and extend over a range of scales that depends on the amplitude of the jump.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 119

Fig. 64. Skew profiles along a large fragment of the human chromosome 12. Repeat-masked sequence (8 Mbp): STA (a), SGC (b), and S = STA + SGC (c).Native sequence (15.7 Mbp): STA (a’), SGC (b’), and S (c’).Source: Reprinted from [440].

0

a b c

Fig. 65. Probability density functions of the skews STA (a), SGC (b), and S = STA + SGC (c) values computed in nonoverlapping 1 kbp windows from the DNAsequences of the 22 human autosomal chromosomes. Symbols have the following meaning: () native sequences and (•) repeat-masqued sequences.Source: Reprinted from [440].

We have checked that upward jumps observed in the skew S at the experimentally identified replication origins (Fig. 61(a))correspond to maxima lines (with positive wavelet coefficients) that extend to rather large scales a > a∗

= 200 kbp.This observation has led us to select, in theWT skeleton, themaxima lines that still exist above a∗

= 200 kbp (Fig. 55), i.e.a scale which is smaller than the typical replicon size and larger than the typical gene size [89,90]. In this way, we not onlyreduce the effect of noise but we also reduce the contribution of the upward (TSS) and downward (TTS) jumps associated tothe step-like skew pattern induced by transcription only (see Section 5.2, Fig. 57). The detected jump locations are estimatedas the positions at scale 20 kbp of the so-selected maxima lines. According to Eq. (A.4), upward (resp. downward) jumps areidentified by themaxima lines corresponding to positive (resp. negative) values of theWT as illustrated in Fig. 55 by the solid(resp. dashed) maxima lines. When applying this methodology to the total skew S along the repeat-masked DNA sequencesof the 22 human autosomal chromosomes, 2415 upward jumps are detected and, as expected, a similar number (namely2686) of downward jumps. In Fig. 66(a) are reported the histograms of the amplitude |1S| of the so-identified upward(1S > 0) and downward (1S < 0) jumps respectively. These histograms no longer superimpose, as previously observedat smaller scales in Fig. 58(a), the former being significantly shifted to larger |1S| values. When plotting N(|1S| > 1S∗) vs.1S∗ in Fig. 66(b), we can see that the number of large amplitude upward jumps overexceeds the number of large amplitudedownward jumps. These results confirm that most of the sharp upward transitions in the S profiles in Figs. 63 and 64 haveno sharp downward transition counterpart. This excess likely results from the fact that, contrasting with the prokaryotereplicon model (Fig. 60) where downward jumps result from precisely positioned replication terminations, in mammalstermination appears not to occur at specific positions but to be randomly distributed [89,90] (this point will be detailed

120 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

350

300

250

200

150

100

50

0

N

0 10 20 30 40 50

|ΔS| ΔS*

a2000

1500

1000

500

010 20 30 40 50

R=3

b

Fig. 66. Statistical analysis of the sharp jumps detected in the S profiles of the 22 human autosomal chromosomes by the WT microscope at scalea∗

= 200 kbp for repeat-masked sequences. |1S| = |S(3′)−S(5′)|, where the averageswere computed over the two adjacent 20 kbpwindows, respectively,in the 3′ and 5′ direction from the detected jump location. (a) Histograms N(|1S|) of |1S| values. (b) N(|1S| > 1S∗) vs.1S∗ . In (a) and (b), the black (resp.red) line corresponds to downward 1S < 0 (resp. upward 1S > 0) jumps. R = 3 corresponds to the ratio of upward over downward jumps presentingan amplitude |1S| ≥ 12.5% (see text). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)Source: Adapted from Ref. [89].

in Section 5.5). Accordingly the small number of downward jumps with large |1S| is likely to result from transcription(Fig. 55) and not from replication. These jumps are probably due to highly biased genes that also generate a small numberof large-amplitude upward jumps, giving rise to false-positive candidate replication origins. In that respect, the number oflarge downward jumps can be taken as an estimation of the number of false positives. In a first step, we have retained asacceptable a proportion of 33% of false positives. As shown in Fig. 66(b), this value results from the selection of upwardand downward jumps of amplitude |1S| ≥ 12.5%, corresponding to a ratio of upward over downward jumps R = 3. Let usnotice that the value of this ratio is highly variable along the chromosome (Fig. 67). In GC poor regions, namely G+C < 37%,we observe in Fig. 67(a, a’) the largest R value, namely R = 6.5. In regions with 37% ≤ G + C ≤ 42%, we obtain R = 3.9(Fig. 67(b, b’)) that contrasts with small R values, R = 1.9 (Fig. 67(c, c’)) found in regions with G + C > 42%. In these latterregions (accounting for∼40% of the genome) with high gene density and small gene length [150], the skew profiles oscillaterapidly with large upward and downward amplitudes (Fig. 63(d)), resulting in a too large estimate of the number of falsepositives (∼53%).

In a final step, we have decided [89,90] to retain as putative replication origins upward jumps with |1S| ≥ 12.5%detected in regions with G + C ≤ 42%. This selection leads to a set of 1012 candidates among which our estimate of theproportion of true replication origins is 79% (R = 4.76). Some of these putative replication origins are illustrated in Fig. 63.In Fig. 68 is shown the mean skew profile calculated in intergenic windows on both sides of the 1012 putative replicationorigins [90]. This mean skew profile presents a rather sharp transition from negative to positive values when crossing theorigin position. To avoid any bias in the skew values that could result from incompletely annotated gene extremities (e.g. 5′

and 3′ UTRs), we have removed 10 kbp sequences at both ends of all annotated transcripts. As shown in Fig. 68, the removalof these intergenic sequences does not significantly modify the mean skew profile, indicating that the observed values donot result from transcription. On both sides of the jump, we observe a linear decrease of the bias with some flatteningof the profile close to the transition point. Note that, due to (i) the potential presence of signals implicated in replicationinitiation and (ii) the possible existence of dispersed origins [502], one might question the meaningfulness of this flatteningthat leads to a significant underestimate of the jump amplitude. Furthermore, according to our detection methodology, thenumerical uncertainty on the putative origin position estimatemay also contribute to this flattening. As illustrated in Fig. 68,when extrapolating the linear behavior observed at distances>100 kbp from the jump, one gets a skew of 5.3%, i.e. a valueconsistent with the skew measured in intergenic regions around the six experimentally known replication origins namely7.0% ± 0.5% (Table 5). Overall, the detection of sharp upward jumps in the skew profiles with characteristics similar tothose of experimentally determined replication origins and with no downward counterpart, further supports the existenceof replication-associated strand asymmetries in human chromosomes, leading to the identification of numerous putativereplication origins active in germline cells.

5.5. A model of replication in mammalian genomes

Following the observation of jagged skewprofiles similar to factory roofs in Section 5.3, and the quantitative confirmationof the existence of such (piecewise linear) profiles in the neighborhood of 1012 putative origins in Fig. 68, we have proposed,in Refs. [89,90], a rather crude model for replication in the human genome that relies on the hypothesis that the replicationorigins are quite well positioned while the terminations are randomly distributed. Although some replication termini havebeen found at specific sites in S. cerevisiae and to some extent in S. pombe [504], they occur randomly between active origins

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 121

NN

N

|ΔS| ΔS*

(a) (a,)

(b) (b,)

(c) (c,)

R=6.5

R=3.9

R=1.9

Fig. 67. Statistical analysis of the sharp jumps detected in the S profiles of the 22 human autosomal chromosomes by the WT microscope at a scale ofa∗

= 200 kbp for repeat-masked sequences. The detected jumps have been classified into three categories according to the GC content found in a 100 kbpwindow centered at the position of the jump. Same as in Fig. 66: (a, a’) G + C < 37%; (b, b’) 37% ≤ G + C ≤ 42%; (c, c’) 42% < G + C.Source: Adapted from Ref. [503].

in Xenopus egg extracts [464,505,506]. Our results indicate that this property can be extended to replication in humangermline cells. As illustrated in Fig. 69, replication termination is likely to rely on the existence of numerous terminationsites distributed along the sequence. For each termination site (used in a small proportion of cell cycles), strand asymmetriesassociated with replication will generate a step-like skew profile with a downward jump at the position of terminationand upward jumps at the positions of the adjacent origins (as in bacteria, Fig. 60(b)). Various termination positions willthus correspond to classical replicon-like skew profiles (Fig. 69, left panel). Addition of those profiles will generate theintermediate profile (Fig. 69, central panel). In a simple picture, we can reasonably suppose that termination occurs withconstant probability at any position on the sequence. This behavior can, for example, result from the binding of sometermination factor at any position between successive origins, leading to a homogeneous distribution of termination sitesduring successive cell cycles. The final skew profile is then a linear segment decreasing between successive origins (Fig. 69,right panel). Let us point out that random firing of replication origins during the S phase [507]might result in some flatteningof the skew profile at the origins as sketched in Fig. 69 (right panel, grey curve). These results [89,90] support the hypothesisof random replication termination in human, and more generally in mammalian cells (Figs. 63 and 64), but further analyseswill be necessary to determine what scenario is precisely at work.

122 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

S

n (kbp)

Fig. 68. Mean skew profile of intergenic regions around putative replication origins. The skew S was calculated in 1 kbp windows (Watson strand) aroundthe position (±300 kbp without repeats) of the 1012 detected upward jumps; 5′ and 3′ transcript extremities were extended by 0.5 kbp and 2 kbp,respectively (•), or by 10 kbp at both ends (∗). The abscissa represents the distance (in kbp) to the corresponding origin; the ordinate represents the skewscalculated for the windows situated in intergenic regions (mean values for all discontinuities and for 10 consecutive 1 kbp window positions). The skewsare given in percent (vertical bars, SEM). The lines correspond to linear fits of the values of the skew (∗) for n < −100 kbp and n > 100 kbp.Source: Adapted from Ref. [90].

Fig. 69. Model of replication termination. Schematic representation of the skew profiles associated with three replication origins O1 , O2 , and O3; wesuppose that these replication origins are adjacent, bidirectional origins with similar replication efficiency. The abscissa represents the sequence position;the ordinate represents the S value (arbitrary units). Upward (or downward) steps correspond to origin (or termination) positions. For convenience, thetermination sites are symmetric relative to O2 . (Left) Three different termination positions Ti , Tj , and Tk , leading to elementary skew profiles Si , Sj , and Sk .(Center) Superposition of these three profiles. (Right) Superposition of a large number of elementary profiles leading to the final factory-roof pattern. Inthe simple model, termination occurs with equal probability on both sides of the origins, leading to the linear profile (thick line). In the alternative model,replication termination is more likely to occur at lower rates close to the origins, leading to a flattening of the profile (grey line).Source: Adapted from Ref. [90].

In this section, we have revealed a factory roof skew profile as an alternative in mammalian genomes to the repliconstep-like profile observed in bacteria (Fig. 60). This pattern is displayed by a set of 1012 upward transitions, each flankedon each side by DNA segments of ∼300 kbp (without repeats), which can be roughly estimated to correspond to 20%–30%of the human genome. In these regions, which are characterized by low and medium GC content (G + C ≤ 42%), skewprofiles reveal a portrait of germline replication consisting of putative origins separated by rather long DNA segments(∼1–3 Mbp on the native sequences). Although such segments are much larger than expected from the classical view[427–429] (∼100–500 kbp on the native sequences), they are not incompatible with estimations showing that replicon sizecan reach up to 1 Mbp [427,430], and that replicating units in meiotic chromosomes are much longer than those engagedin somatic cells [508]. Finally, it is not unlikely that in GC-rich (gene-rich) regions (Fig. 63(d)) replication origins wouldbe closer to each other than in other regions, further explaining the greater difficulty in detecting origins in these regions.Indeed, the wavelet-basedmethodology developed in this section remains efficient as long as there exists a clear separationbetween the characteristic size of a replicon and the characteristic size of a gene; while this separation is unquestionable atlow and medium GC content, this is no longer obvious in high GC regions.

For more details on the existence and modeling of replication-associated strand asymmetries in mammalian genomes,we refer the reader to the Ph.D. thesis manuscripts of Brodie of Brodie [440], Nicolay [441] and Touchon [503].

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 123

Fig. 70. (a) N-shaped replication-associated skew profile. (b) Transcription-associated skew profile showing positive square blocks at (+) gene positionsand negative square blocks at (−) gene positions. (c) Superimposition of the replication- and the transcription-associated skew profiles producing the finalfactory-roof pattern that defines ‘‘N-domains’’.Source: Adapted from Ref. [91].

6. Disentangling transcription- and replication-associated strand asymmetries reveals a remarkable gene organiza-tion in the human genome

6.1. A wavelet-based methodology to disentangle the replication and the transcription contributions to the compositional strandasymmetries in replication N-domains

6.1.1. A working model of mammalian ‘‘factory roof’’ skew profilesAccording to the results reported in Section 5, from now on, we will use as a working model that the overall factory roof

skew profiles observed in mammalian genomes (Figs. 63 and 64) actually result from the superimposition of two patternsas sketched in Fig. 70 [91]. The first one decreases steadily from the 5′ to the 3′ direction and would be attributable toreplication in germline cells (Fig. 70(a)). As predicted by our toymodel in Fig. 69, the linear decline of the skewwould reflecta progressive change in the proportion of center- and border-oriented replication forks. The other patternwould result fromtranscription-associated strand asymmetries that generate step-like profiles (Fig. 57) corresponding to sense (+) and anti-sense (−) genes (Fig. 70(b)). When the two patterns are superimposed, this leads to the factory roof pattern illustratedin Fig. 70(c) [91]. Because the typical gene size (∼30 kbp) is much smaller than the characteristic distance between twoadjacent putative replication origins, we have named these replication domains as ‘‘N-domains’’ [91] with regard to theiroverall qualitative N-shape.

6.1.2. A space-scale pattern matching method to identify replication N-domainsThe first step to detect putative replication domains consists in developing amulti-scale pattern recognitionmethodology

based on the WT of the strand compositional asymmetry S using, as analyzing wavelet, the N-let ψN(x) (Appendix A, Eq.(A.7)) that is adapted to perform an objective segmentation of factory roof skew profiles into N-domains. As illustrated inFig. 71, the space-scale location of significant maxima values in the 2DWT decomposition (red areas in Fig. 71(b)) indicatesthemiddle position (spatial location) of candidate replicationN-domainswhose size is given by the scale location. In order toavoid false positives,we then check that there does exist awell-definedupward jumpat eachdomain extremity. These jumpsappear in Fig. 71(b) as blue cone-shape areas pointing at small scale to the jump positions where the putative replicationorigins are located. Note that because the analyzing wavelet is of zero mean (Eq. (A.7)), theWT decomposition is insensitiveto (global) asymmetry offset. In that respect, to enforce strong compatibility with the mammalian replicon model proposedin Section 5.5, we will only retain the domains the most likely to be bordered by putative replication origins, namely thosethat are delimited by upward jumps corresponding to a transition from a negative S value ≤ −ϵ to a positive S value ≥ ϵ(when estimated in 20 kbp windows). According to the mean skew profile of intergenic regions observed around putativereplication origins in Fig. 68, we choose to fix the threshold parameter to ϵ = 3%, that indeed corresponds to a very stringentselection criterion that we may somehow relax in a future step.Human autosomes: When applying our wavelet-based segmentation method to the skew profiles along the 22 humanautosomes [509], we detect 759 N-domain candidates (Fig. 71). Among these domains, we discard 17 domains that containstretches of N (unknown nucleotides) longer than 100 kbp. We also remove from the remaining N-domains those (64)of length L > 2.8 Mbp whose shape is reminiscent of an N but split in half, leaving in the center a region of null skewwhose length increases with domain size. As further studied in Section 6.4, these split-N-domains have a central regioncorresponding to large heterochromatic gene deserts of homogeneous composition, i.e. both a null skew and a constant andlow GC content. We end with a library of 678 N-domains bordered by 1062 putative replication origins spanning 33.8%of the genome (Table 6). As shown in Fig. 72, the size of these N-domains ranges from ∼200 kbp (resp. 100 kbp whenmasked) to 2.8 Mbp (resp. 1.6 Mbp when masked) with a mean length L = 1.19 Mbp (resp 0.63 Mbp when masked). Mostof the 1062 putative replication origins at the extremities of the detected replication domains are intergenic (78%) and arelocated near to a gene promoter more often than would be expected by chance (data not shown). These N-domains containapproximately equal numbers of genes oriented in each direction (1653 (+) genes and 1690 (−) genes) with a mean genenumber per domain of 4.93. As observed in Ref. [91], gene distributions in the 5′ half ofN-domains containmore (+) than (−)genes, and vice versa for the 3′ half ofN-domains. As reported in Table 6 and Fig. 73, theseN-domains have a high intergenic

124 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

6 8 10 12 14

–0.5

00.

5

S TA

+S G

C

102

103

104

a (k

bp)

n (Mbp)

a

b

Fig. 71. (a) Skew profile S of a 9 Mbp repeat-masked fragment of human chromosome 21. (b) N-let transform of S using ΨN (Eq. (A.7)); TΨN [S](n, a) iscolor-coded from dark-blue (min; negative values) to red (max; positive values) through green (null values). Light blue and purple lines illustrate thedetection of two replication domains of significantly different sizes. Note than in (b), blue cone-shape areas signing upward jumps point at small scale(top) towards the putative replication origins and that the vertical positions of the WT maxima (red areas) corresponding to the two indicated replicationdomains match the distance between the putative replication origins (1.6 Mbp and 470 kbp respectively). (For interpretation of the references to color inthis figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [509].

Table 6Replication N-domains detected in the human and mouse genomes [509].

Number Mean length (Mbp) ± std. dev. Genome coverage (%) Mean number of genes Intergenic coverage (%) GC (%)

Human 678 0.63 ± 0.33 (masked) 33.8 4.93 57.3 (masked) 40.81.19 ± 0.62 (native)

Mouse 587 0.54 ± 0.34 (masked) 22.3 4.13 58.2 (masked) 42.40.91 ± 0.62 (native)

(masked) coverage (57.3%) where the skew S is likely to result from replication only. Indeed only a few N-domains (64/678)have a (masked) intergenic coverage less than 20% (Fig. 73(b)) [509].

Mouse autosomes: When reproducing the same wavelet-based analysis for the 19 mouse autosomes [509], 634 N-domainscandidates are identified with no domain containing large stretches of unsequenced nucleotides (N). After discarding 47detected N-domains of length L > 2.8 Mbp, we end with a library of 587 N-domains that cover 22.3% of the genome, i.e. apercentage slightly smaller than previously obtained for the human genome. This results from the fact that more smalldomains are detected in the mouse genome (Fig. 72) with a mean native (resp. masked) length of 0.91 Mbp (resp. 0.54Mbp). Consistently with previous observations that replicon sizes are smaller in GC rich regions (Fig. 63(d)) [91], the meanGC content of the mouse N-domains (42.4%) is slightly higher than in the human N-domains (40.8%). Despite these slightquantitative differences,most of the features concerning gene organization in humanN-domains are also observed inmouseN-domains [509] with a mean number of genes of 4.13 and globally a similar number of genes oriented in each direction(1220 (+) genes and 1204 (−) genes). Like in the human autosomes, a majority of the 994 putative replication origins thatborder the 587 mouse N-domains are intergenic (71%). Importantly the relative coverage of these N-domains by intergenicregions is important (58.2%) and statistically very similar to what is observed with the human autosomes (Fig. 73).

6.1.3. Disentangling transcription- and replication-associated strand asymmetriesThe second step in our methodology consists in deciphering in each detected N-domain, the contribution to the skew

coming from replication and the contribution associated to transcription [509,510]. According to the working model

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 125

a b

Fig. 72. Normalized histograms ofN-domain length detected in the human (black) andmouse (grey) genomes: (a) native sequences; (b)masked sequences.Source: Reprinted from Ref. [509].

a b

Fig. 73. Normalized histograms of intergenic regions in N-domains detected in the human (black) andmouse (grey) genomes: (a) total masked intergeniclength; (b) coverage by masked intergenic regions per N-domain.Source: Reprinted from Ref. [509].

introduced in Section 6.1.1 (Fig. 70) [91], the theoretical skew profile in a N-domain can be written as:

S(x′) = SR(x′)+ ST (x′) = −2δx′

−12

+ h +

−genes

cgχg(x′), (24)

where position x′ within the domain has been rescaled between 0 and 1, h and δ (>0) are parameters that define thereplication bias (S5

R = h + δ at the 5′ N-domain extremity (Watson strand) and S3′

R = h − δ at the 3′ extremity), χg isthe characteristic function for the gth gene that belongs to the N-domain (1 when the point is in the gene and 0 elsewhere)and cg is its transcriptional bias calculated on the Watson strand (likely to be positive for (+) genes and negative for (−)genes). For each N-domain detected, we use a general least-square fit procedure to estimate, from the noisy S profile, theparameters δ, h and each of the cg ’s. The resulting χ2 value is then used to select the N-domain candidates where theskew is well described by Eq. (24). As illustrated in Fig. 74 for a fragment of human chromosome 6 that contains 4 adjacentreplication N-domains (Fig. 74(a)), this method provides a very efficient way to disentangle the square-like transcriptionskew component (Fig. 74(c)) from the N-shaped component induced by replication (Fig. 74(d)).

Remark. In the least-squares fit procedure,we fix the varianceσ 2 of theGaussian noise to the varianceσ 2=

12σ

2δS computed

in each N-domain from the pdf of the skew increments δS(n)/√2 = [S(n + 1) − S(n)]/

√2. As quantitatively verified a

posteriori (Fig. 75), this variance, directly estimated from the total skew S (Fig. 74(a)), is a very good approximation of thenoise component of the skew once subtracted our model skew profile (Fig. 74(b)).

As reported in Table 7, among the 678 N-domains detected in the 22 human autosomes, our disentangling methodologyfails to provide satisfactory results (prohibitively too large χ2 values) for 14 domains only. We have checked that the mainreason for which our working hypothesis Eq. (24) does not apply is the fact that some regions present anomalous highamplitude skew values. As far as the mouse genome is concerned, among the 587 N-domains detected in the 19 mouse

126 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a

b

c

d

Fig. 74. (a) Skew profile S of a 4.3Mbp repeat-masked fragment of human chromosome 6; each point corresponds to a 1 kbpwindow: red, (+) genes; blue(−) genes; black, intergenic regions (the color is defined bymajority rule); the estimated skew profile (Eq. (24)) is shown in green; vertical lines correspondto the locations of 5 putative replication origins that delimit 4 adjacent N-domains identified by the wavelet-based methodology. (b) Noise componentSNoise obtained by subtracting the estimated total skew (green line in (a)) from the original skew profile in (a). (c) Transcription-associated skew ST obtainedby subtracting the estimated replication-associated profile (green line in (d)) from the original S profile in (a); the estimated transcription step-like profile(third term in the rhs of Eq. (24)) is shown in green. (d) Replication associated skew SR obtained by subtracting the estimated transcription step-like profile(green lines in (c)) from the original S profile in (a); the estimated replication serrated profile (first two terms on the rhs of Eq. (24)) is shown in green; thelight-blue dots correspond to high-resolution replication timing tr data (Section 6.2.1). (For interpretation of the references to color in this figure legend,the reader is referred to the web version of this article.)Source: Adapted from Ref. [509].

autosomes, only 2 are discarded as not satisfying Eq. (24). The results reported in the next subsections will thus correspondto a statistical analysis performed on 664 human and 585 mouse N-domains [509].

6.1.4. Statistical analysis of replication-associated strand asymmetries

Human autosomes: In Fig. 76 are reported the results of our estimate of the replication parameters S5′

R = h + δ (Fig. 76(a)),S3

R = h−δ (Fig. 76(b)), and h (Fig. 76(c)). The normalized histogramof the offset parameter h (vertical shift of theN profile) issymmetric (Fig. 76(c)), with amean value h = (−9.5%±8.4%)×10−2 (Table 7), that cannot be distinguished from zero. Thismeans that the replication N-shaped profile is mainly observed centered at zero with equal statistical probability of upwardand downward vertical shift by a few percent. The histograms of replication bias at the 5′ (Fig. 76(a)) and 3′ (Fig. 76(b)) N-domain extremities are quite symmetric one from the other with mean values S5

R = 7.2% ± 0.1% and S3′

R = −7.4% ± 0.1%,

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 127

a b

Fig. 75. Semi-log representation of normalized histograms of skew values in: (a) a N-domain of length L = 2.6 Mbp (1.5 Mbp masked) in the humangenome and (b) a N-domain of length L = 2.3 Mbp (1.3 Mbp masked) in the mouse genome. The symbols correspond to the histogram of the skewcomponent SNoise (•) and of the total skew increments δS/

√2 (). The continuous parabola corresponds to Gaussian distributions of standard deviation

σ = 0.090 (a) and 0.087 (b).Source: Reprinted from Ref. [509].

a b c

Fig. 76. Normalized histograms of replication parameters S5′

R (a), S3′

R (b) and h (c) estimated in 664 N-domains identified in the 22 human autosomes(black) and in 585 N-domains detected in the 19 mouse autosomes (grey) (see Table 7).Source: Reprinted from Ref. [509].

Table 7Mean replication parameters (in percent ±SEM) computed with our wavelet-based disentangling method from human and mouse skew profiles (Eq.(24)) [509].

Number of N-domains S5′

R S3′

R h

Human 664 7.2 ± 0.1 −7.4 ± 0.1 (−9.5±8.4)×10−2

Mouse 585 6.8 ± 0.2 −6.8 ± 0.2 (−1.8±7.9)×10−2

as expected from an antisymmetric N-shaped skew pattern of zero mean. Altogether these results provide some estimate ofthe mean replication bias δ = 7.3% ± 0.1%, which corresponds to an upward jump of mean amplitude ≃14.6% in the skewprofile at the putative replication origins that border the replication N-domains. These estimates are consistent with thoseobtained in our pioneering study of large amplitude upward jumps in human skew profiles (Sections 5.3 and 5.4, Table 5)[89,90]. As a test of the reliability of our methodology, we compare in Fig. 77, the results of our estimates of the replicationparameters h and δ for each N-domain to the corresponding values obtained directly from some fit of the S profile whenconsidering only the intergenic regions where the observed skew is supposed to result from replication only. As observedin Fig. 77(a) for the parameter h and in Fig. 77(b) for the parameter δ, a large majority of points fall, up to the numericaluncertainty, on the diagonal. Thus, except for a small percentage of N-domains where intergenic coverage (Fig. 73) is toosmall to allow us to compute the replication parameters directly from the intergenic skew profile, the estimates reported inTable 7 are quite consistent with the skew values observed genome wide in intergenic sequences [509].

Mouse autosomes: As shown in Fig. 76, the histograms of replication parameters (S5′

R , S3′

R , h) values computed with ourmethodology for the mouse autosomes are undistinguishable from the ones previously obtained for the human autosomes.

128 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a b

Fig. 77. Test of the consistency of our disentangling methodology: for each N-domain detected in the 22 human autosomes, the replication parameterscomputed directly from the intergenic skew only are plotted vs. the corresponding parameters derived from our disentangling method. (a) h vs hintergenic ;(b) δ vs δintergenic . For clarity, only (437/664) N-domains containing more than 200 kbp of intergenic masked sequences are represented such that the errorbars remain small enough.Source: Reprinted from Ref. [509].

Fig. 78. Normalized histograms of transcription bias ST computed as explained in the main text for human (black) and mouse (grey) genes of length≥20 kbp. (a) (+) Genes; (b) (−) genes (see Table 8).Source: Reprinted from Ref. [509].

The detected N-domains have a zeromean replication skew (h = (−1.8%±7.9%)×10−2) and an antisymmetric shape withS5

R = 6.8% ± 0.2% and S3′

R = −6.8% ± 0.2% (Table 7), corresponding to an upward jump in the mouse skew profile, at thedetected putative replication origins, of characteristic amplitude ∼13.6% similar to the ∼14.4% previously observed for thehuman autosomes.

6.1.5. Statistical analysis of transcription-associated strand asymmetries

Human autosomes: To estimate the mean transcription bias of the genes belonging to a given N-domain, we consider thetranscription-associated skew ST obtained after subtracting the estimated N-shaped replication profile as illustrated inFig. 74(c). Then as first proposed in Ref. [454], the transcription skew of the genes is measured by averaging ST over intronsequences after removing 490 bp at each intron extremities in order to get rid of the contribution to the skew coming fromsplicing signals. In Fig. 78 and Table 8 are reported the transcription bias so estimated for 726 (+) genes and 731 (−) genes oflength≥ 20 kbp, so that the total intronic coverage is enough to ensure convergence in the estimate of the gene transcriptionskew. The histograms of ST values for (+) and (−) genes are remarkably symmetric with means S(+)T = 5.1% ± 0.2% andS(−)T = −5.2%± 0.2% respectively. This confirms that the local genic contribution of transcription to the total skew is of thesamemagnitude (∼5%) than the contribution induced by replication (∼7.5%) which can be seen a posteriori as a justificationof the need to develop a methodology capable to disentangle these two skew components. This suggests that previousestimates [453,454] of total transcription bias (∼8%, see Fig. 57) were biased by the presence of a replication-associatedskew. According to the remarkable gene organization evidenced in replication N-domains [91] (see Section 6.3), we furtherfocus on R+ genes ((+) genes in the 5′ N-domain half and (−) genes in the 3′ N-domain half) that are transcribed in thesame direction as the putative replication fork progression and R− genes ((−) genes in the 5′ N-domain half and (+) genesin the 3′ N-domain half) that are transcribed in the opposite direction. Close to the bordering putative origins, R+ genes

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 129

Table 8Mean transcription skews (in percent±SEM) computedwith our wavelet-based disentanglingmethod from human andmouse skew profiles (see Eq. (24))S(+)T , S(−)T , SR+T , SR−T , are the mean transcription skews computed for (+), (−), R+ and R− genes of length ≥ 20 kbp. For R+ and R− genes, the transcriptionskew is computed on the gene strand, and therefore expected to be positive (see Fig. 57) [509].

S(+)T (n(+)) S(−)T (n(−)) SR+T (nR+) SR−T (nR−)

Human 5.1 ± 0.2 (726) −5.2 ± 0.2 (731) 5.3 ± 0.1 (942) 4.8 ± 0.3 (309)Mouse 5.1 ± 0.2 (467) −4.9 ± 0.2 (481) 5.1 ± 0.2 (561) 5.1 ± 0.4 (213)

Fig. 79. Mean transcription skew computed on the gene strand for R+ (•) and R− () genes of length ≥20 kbp, as a function of the (non-masked) distanceto the closest N-domain border: (a) Human genes; (b) mouse genes (see Table 8). Each point corresponds to average over 40 gene promoter positions andtranscriptional skews; vertical error bars represent SEM.Source: Reprinted from Ref. [509].

are more abundant and longer than R− genes. As reported in Table 8, R+ and R− genes have a similar transcriptional skewSR+T = 5.3% ± 0.1% and SR−T = 4.8% ± 0.3%. Furthermore, as shown in Fig. 79(a), when plotting their transcriptionalskew vs. their distance to their proximal N-domain border, we observe some significant decreases for both R+ and R−genes from values∼7% close to the putative replication origin to values∼3% (R−) and∼4% (R+). Thus genes lying in a closeneighborhood of the replication origins aremore biased by transcription than central genes, a result which seems to be quiteconsistent with the recent observation that these putative replication origins correspond to particular chromatin regionspermissive to transcription, that may have been imprinted in the DNA sequence during evolution [92] (see Section 7.3). Inparticular, if according to the size of the error bars, we cannot conclude about the significance of the damped oscillationsobserved for the R+ transcription bias in Fig. 79(a), they resemble to those displayed by the mean density profile of Pol IIbinding at gene promoter (±2 kbp around TSS) when plotted vs. the distance to the closest N-domain border (see Fig. 113)[92].Mouse autosomes: Similarly, the estimate of the transcription bias for (+) (Fig. 78(a)) and (−) (Fig. 78(b)) mouse genes arein remarkable agreement with those obtained for human genes. As reported in Table 8, we get the following mean valuesS(+)T = 5.1% ± 0.2%, S(−)T = −4.9% ± 0.2% for (+) and (−) genes respectively, and SR+T = 5.1% ± 0.2%, SR−T = 5.1% ± 0.4%for the transcription bias for R+ and R− genes. Interestingly, the transcription bias for R+ and R− genes decrease fromvalues ∼6%–8% close to the N-domain borders down to values ∼2% towards the N-domain center (Fig. 79(b)). Again werecover some feature previously evidenced with human genes, except that no (damped) oscillatory behavior is observedin the decrease of SR+T as a function of the distance to the nearest N-domain border. This difference will be the subject offurther investigation via a specific comparison of the replication and transcription biases of orthologous genes in the humanand mouse genomes and also of their relative positioning inside the replication N-domains.

6.1.6. Correlating transcription-associated strand asymmetries to gene expression in the germlineThe quantitative agreement obtained between the human and mouse skew estimates strongly suggests that both

the replication and transcription components of the skew inside the N-domains are conserved in mammalian genomes.These results open new perspectives in genomic and post-genomic data analysis. In particular, as reported in Fig. 80 andTable 9, they provide the possibility to correlate separately the replication-associated and transcription-associated skewcomponents to gene expression data. Here we focus our analysis on a key group of 315 protein-coding genes that werefound [511] to be differentially expressed betweenmale germcells and somatic controls and to display highly similarmeiotic

130 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Table 9Correlation analysis between expression data in germline cells (SG= spermatogonia, SC= spermatocytes, ST= spermatids, TU= tubules) and somatic cells(Brain, Liver) with transcription-associated skew ST and replication-associated skew SR . r = Pearson’s product moment correlation coefficient and P is theassociated P-value for no association (r = 0). The expression data correspond to the key group of 315 protein-coding genes with conserved transcriptomein human and rodent male gametogenesis identified in Ref. [511].

Skew Germline SomaticSG SC ST TU Brain Liver

STr 0.36 −0.019 −0.19 −0.042 0.12 −0.0023p 6.9 × 10−11 0.73 7.7 × 10−4 0.46 5 × 10−5 0.94

SRr 0.17 0.025 −0.058 0.051 0.11 −0.0040p 2.1 × 10−3 0.65 0.30 0.36 2.6 × 10−4 0.90

a b

c d

Fig. 80. Correlation between expression data in male germline in mouse with transcription skew ST and replication skew SR computed on the gene strandfor the 315 protein-coding genes defined in Ref. [511]. (a) ST for meiotic spermatocytes: r = −0.019, P = 0.73; (b) ST for post-meiotic round spermatids:r = −0.19, P = 7.7× 10−4; (c) ST for mitotic spermatogonia: r = 0.36, P = 0.9× 10−11; (d) SR for mitotic spermatogonia: r = 0.17, P = 2.1× 10−3 (seeTable 9). The slopes of the regression lines are respectively (a) −0.0004 ± 0.0011, (b) −0.0036 ± 0.0010, (c) 0.0077± 0.0011 and (d) 0.0034 ± 0.0010.

and post-meiotic patterns of transcriptional induction across human, mouse and rat populations. This correlation analysisdoes not reveal any significant correlation between replication-induced component SR of the skew and expression data insomatic as well as germline cells (Fig. 80(d) and Table 9), which can be seen a posteriori as some control of the efficiency ofour disentangling methodology.

A similar absence of correlations is also found for the transcription-induced component ST of the skew and this not onlyfor somatic cells but also for themeiotic class of spermatocytes containing genes required for reproduction, spermatogenesis,the meiotic cell cycle and protein degradation by the ubiquitin cycle (Fig. 80(a), Table 9) as well as for the class of genesinduced in post-meiotic spermatids mainly involved in reproduction, spermatogenesis and fertilization (Fig. 80(b), Table 9).Interestingly, among germline cells, significant positive correlations are only found between ST and expression data inmitotic spermatogonia (r = 0.36, P = 6.9 × 10−11) that were shown [511] to be associated with the cell cycle, germcell and embryonic development, chromosome organization and biogenesis (Fig. 80(c), Table 9). This result indicates that

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 131

a b c

Fig. 81. (a) Histogram N(tr ) of the replication timing ratio tr at the 83 putative replication origins identified in the human chromosome 6. (b) HistogramN(1tr ) of the difference 1tr in replication timing ratio between the borders and the center of the 38 replication N-domains of length L ≥ 1 Mbp. (c)Percentage of border-center pairs such that1tr > 0 (N) and mean value1tr (•) vs. the N-domain length L; N-domains have been ordered by length andbinned into 6 group of border-center pairs.Source: Reprinted from Ref. [510].

a b

Fig. 82. (a) Average replication timing ratio (±SEM) determined around the 83 putative replication origins (•), 20 origins with well-defined localmaxima () and 10 late replicating origins (). 1x is the native distance to the origins in Mbp units. (b) Histogram of Pearson’s correlation coefficientvalues between tr and the absolute value of SR over the 38 predicted domains of length L ≥ 1 Mbp. The dotted line corresponds to the expected histogramcomputed with the correlation coefficients between tr and |S| profiles over independent windows randomly positioned along chromosome 6 and with thesame length distribution as the 38 detected domains.Source: Reprinted from Ref. [510].

the contribution to the asymmetry of strand composition that has accumulated during transcription from cell cycle to cellcycle and further transmitted from generation to generation in the germline actually reflects gene expression level inmitoticgermline cells.

6.2. DNA replication timing data corroborate in silico human replication origin predictions

During the duplication of eukaryotic genomes that occurs during the S phase of the cell cycle, the different replicationorigins are not all activated simultaneously [427,462,463,465,469,507,512]. Recent technical developments in genomicclone microarrays have led to novel way of detecting the temporal order of DNA replication [507,512]. The arrays areused to estimate replication timing ratios, i.e. ratios between the average amount of DNA in the S phase at a locus alongthe genome and the usual amount of DNA present in the G1 phase for that locus. These ratios should vary between 2(throughout the S phase, the amount of DNA for the earliest replicating regions is twice the amount during G1 phase) and

132 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 83. Correlation between the replication-associated compositional asymmetry SR and the replication timing ratio tr along 5 replication N-domainsbordered by rather early replicating putative origins. (a–e) Phase portrait representation of tr vs. SR; the arrows points in the 5′–3′ direction. (a’–e’)Corresponding SR (black points) and tr (blue dots) profiles. (For interpretation of the references to color in this figure legend, the reader is referred tothe web version of this article.)Source: Reprinted from Ref. [510].

1 (the latest replicating regions are not duplicated until the end of S phase). This approach has been successfully used togenerate genome-wide maps of replication timing for S. cerevisiae [513], Drosophila melanogaster [466] and human [514].Recently, two new analyses of human chromosome 6 [512] and 22 [507] have improved replication timing resolution from1 Mbp down to ∼100 kbp using arrays of overlapping tile path clones. In this section, we report on a very promising firststep towards the experimental confirmation of the thousand putative replication origins located at N-domain borders asdiscussed in Section 6.1. The strategy will consist in mapping them on the recent high-resolution timing data [512] and inchecking that these regions replicate earlier than their surrounding [510].

6.2.1. Human chromosome 6Chromosome 22 [507] being rather atypical in gene and GC contents, we mainly report here on the correlation

analysis [510] between nucleotide compositional skew and timing data for chromosome 6 [512] which is morerepresentative of thewhole human genome. Note that timing data for clones completely included in another clone have beenremoved after checking for timing ratio value consistency leaving 1648 data points. The timing ratio value at each point hasbeen chosen as the median over the 4 closest data points to remove noisy fluctuations resulting from clone heterogeneity(clone length 100 ± 51 kbp and distance between successive clone mid-points 104 ±89 kbp), so that the spatial resolutionis rather inhomogeneous ∼300 kbp. Note that using asynchronous cells also results in some smoothing of the data, possiblymasking local maxima.

Our wavelet-based methodology has identified 54 replication domains in human chromosome 6 [509,510]; thesedomains are bordered by 83 putative replication origins among which 25 are common to two adjacent domains. Four ofthese contiguous domains are shown in Fig. 74. In Fig. 74(d), on top of the replication skew profile SR, are reported forcomparison the high-resolution timing ratio tr data from Ref. [512]. The histogram of tr values obtained at the 83 putativeorigin locations is shown in Fig. 81(a). It displays a maximum at tr ≃ tr ≃ 1.5 and confirms what is observed in Fig. 74(d),namely that amajority of the predicted origins are rather early replicatingwith tr ≥ 1.4. This contrastswith the rather low tr(≃1.2) values observed in N-domain central regions (Fig. 74(d)). In Fig. 81(c) are reported the results of a statistical analysisof the difference1tr of replication timing ratios between borders and domain centers vs. the native length L of the domain.The average 1tr steadily increases from 0 to 0.2 for the largest domains. Actually more than 75% of the border-center pairsare such that1tr > 0 when the total domain length L ≥ 1 Mbp, i.e. when the border-center distance is significantly largerthan the timing data resolution. The overall histogram of 1tr values obtained for these 76 border-center pairs is shown inFig. 81(b). It mainly expands over positive 1tr values with a right tail reaching 1tr = 0.4, as a confirmation that while amajority of putative origins at the borders of the replication N-domains replicate rather early in the S phase, the central partof these domains systematically corresponds to regions that replicate quite late.

But there is an even more striking feature in the replication timing profile in Fig. 74(d): 3 among the 5 predicted originscorrespond, relatively to the experimental resolution, to local maxima of the tr profile. As shown in Fig. 82(a), the average trprofile around the 83 putative replication origins decreases regularly on both sides of the origins over a few (4–6) hundredskbp confirming statistically that domain borders replicate earlier than their left and right surroundings, which is consistentwith these regions being true replication origins mostly active early in S phase. In fact, when averaging over the top 20origins with a well-defined local maximum in the tr profile, tr displays a faster decrease on both sides of the origin and a

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 133

a b

Fig. 84. Genome-wide high resolution (kbp) experimental verification of putative replication origins using massive sequencing technology [528,529]. (a)Abundance of nascent replicating strands on both sides of the 678 replication N-domains identified in the 22 human autosomes. (b) Median replicationtiming profile along humanN-domains; themedian timing TR50 is defined as the fraction of the S phase (0 ≤ TR50 ≤ 1) at which 50% of DNA is replicated ina defined genome region (50% of cumulative enrichment); the data have been obtained by linear interpolation of enrichment values in four compartmentsof the S phase [529].

higher maximum value ∼1.55 corresponding to the earliest replicating origins. On the opposite, when averaging tr profilesover the top 10 late replicating origins, we get, as expected, a rather flat mean profile (tr ∼ 1.2) (Fig. 82(a)). Interestingly,these origins are located in rather wide regions of very low GC content (.34%, not shown) correlating with chromosomal Gbanding patterns predominantly composed of GC-poor isochores [515,516]. This illustrates how the statistical contributionof rather flat profiles observed around late replicating origins may significantly affect the overall mean tr profile. Individualinspection of the 38 replication domainswith L ≥ 1Mbp shows that, in those domains that are bordered by early replicatingorigins (tr & 1.4–1.5), the replication timing ratio tr and the absolute value of the replication skew |SR| turn out to be stronglycorrelated. This is quantified in Fig. 82(b) by the histogramof the Pearson’s correlation coefficient values that is clearly shiftedtowards positive values with a maximum at ∼0.4. As illustrated in Fig. 83 for 5 of those replication N-domains [510], whenusing the phase portrait reconstruction technique from dynamical system theory [517–524], we reveal a remarkable noisycycle-like behavior when plotting tr vs. SR (Fig. 83(a-e)) and identifying the position from the domain 5′ extremity to time. If,like in Fig. 83(d), we include the left upward jump in SR in the domain, we start observing a fast variation of SR from negative(∼−0.08) to positive (∼+0.08) values while tr ∼ 1.5 remains almost unchanged. Then, when entering the domain, SR startsdecreasing rather linearly while tr decreases slowly leading to a somehow semi-circular behavior of the trajectory in the (SR,tr ) phase space; aminimum is actually reached in the domain central regionwhere SR (≃0) switches frompositive to negativevalues and tr takes its smallest value (∼1.2–1.25), before increasing again while SR keeps decreasing to negative values untilthe 3′ extremity of the domain is reached. Let us emphasize that this noisy cycle-like behavior of relaxational character isrobustly observed in 16(/38) replication domains that are bordered by early replicating putative origins (Fig. 83(a–e)). In11(/38) domains the cycle is deformed due to the late replication of one of the borders [510].

6.2.2. Human autosomesIn several very recent studies, small nascent DNA strands synthesized at origins were purified by various methods

[477–479,525,526] and hybridized to ENCODE microarrays, to map a few hundred putative origins in 1% of the humangenome. However, the concordance between the different studies is very low (from<5% to<25%), even when they employthe same technique [479,526], for reasons that are currently unclear [527]. Insufficient purification of origin DNA or biasedamplification of the starting material may explain some of these results. In a recent experiment with HeLa cells, O. Hyrienand collaborators havemassively sequenced small nascent strands (SNS) purified using λ-exo digestion [528]. They confirmthat the ENCODE origins identified by Cadoret et al. [479] are enriched in their SNS preparation, but more interestingly,as reported in Fig. 84(a), they find that the replication N-domain borders show a comparable SNS enrichment, stronglysuggesting thatmost of them are functional origins in HeLa cells [528]. However, detailed analysis suggests that backgroundcontamination still hampers the reliable detection of individual origins and that the number of sequence reads should beincreased by a factor of 10–20 to solve this problems, a task that is currently quite prohibitive.

In their pilot experiment, O. Hyrien and collaborators have also determined the replication timing profile of the entirehuman genome. Asynchronous cells were pulse-labeled with BrDU, a thymidine analogue that is incorporated in replicatingDNA, sorted by FACS into four temporal compartments of S phase, and the newly synthesized DNA from each compartmentwas immunoprecipitatedwith anti-BrDU antibodies andmassively sequenced using a Solexa/Illumina platform. Replicationtiming of each genome position was estimated as the time in S phase at which 50% of the sequence reads that map inthe region were obtained [529]. Importantly, strong correlations are observed between these high resolution data and(i) fragmentary timing data from HeLa cells measured in ENCODE regions [530], (ii) low-resolution whole genome timing

134 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 85. Illustration of the gene organization around 4 putative replication origins in a fragment of the human chromosome 9. The color coding of therepeat masked skew profile S is the same as in Fig. 74. In most cases, genes are oriented in the same direction as the putative replication fork progression.

data for human lymphocytes [512,531], and (iii) more recent data in a variety of human cell types [532,533]. Therefore,in contrast to the origin-mapping data based on microarray hybridization of short nascent strands, the replication timingprofiles generated by different groups are satisfyingly concordant. As shown in Fig. 84(b), the genome-wide replicationtiming data of Hyrien’s group confirm and strengthen the results previously obtained in Figs. 81–83 on chromosome 6 [510]:N-domains borders replicate earlier than their flanking sequences and likely correspond to replication origins mostly activein the early S phase whereas the central regions replicate late.

6.3. Gene organization in the detected replication N-domains

6.3.1. Gene orientationAs reported in Section 6.1.2, most of the 1062 putative replication origins that border the 678 detected replication N-

domains are intergenic (78%) and are located near to a gene promoter more often than would be expected by chance [91].The N-domains contain approximatively equal number of genes oriented in each direction (Section 6.1.2). As illustrated inFig. 85 and quantified in Fig. 86(b), gene distributions in the 5′ halves of N-domains contain more (+) genes than (−) genes,regardless of the total number of genes located in the half domains. Symmetrically, the 3′ halves contain more (−) genesthan (+) genes. About 33% of half-domains contain one gene and 51% containmore than one gene,with an overallmean genenumber per N-domain of 4.93. Importantly, when disregarding the genes that overlay a putative replication origin, the 678replicationN-domains contain significantlymore R+ genes (1911), oriented in the same direction as the putative replicationfork progression, than R− genes (1111) that have the opposite direction. Within 50 kbp of putative replication origins, themean density of R+ genes is 8.2 times greater than that of R− genes. This asymmetry weakens progressively with thedistance from the putative origins, up to∼ 250 kbp (Fig. 86(b)). A similar asymmetric pattern is observedwhen the domainscontaining duplicated genes are eliminated from the analysis, whereas control domains obtained after randomization ofdomain positions present similar R+ and R− gene density distributions (Supplementary in Ref. [91]). Themean length of theR+ genes near the putative origins is significantly greater (∼ 160 kbp) than that of the R− genes (∼ 50 kbp), however bothtend towards similar values (∼ 70 kbp) at the center of the domains (Fig. 86(c)). Within 50 kbp of the putative origins, theratio between the numbers of base pairs transcribed in the R+ and R− directions is 23.7; this ratio falls to∼ 1 at the domaincenters (Fig. 86(d)). This strong transcriptional polarity could be mainly attributable to the preferential R+ orientation ofthe first gene (the closest to the N-domain extremity). Yet polarity is still observed for half-domains harboring various genenumbers even after the first gene has been eliminated [91].

6.3.2. Gene breadth expressionIn Fig. 86(e) are reported the results of the analysis of the breadth of expression, Nt (number of tissues in which a gene

is expressed) of genes located within the detected domains [91]. As measured by EST data (similar results are obtained bySAGE or microarray data [91]), Nt is found to decrease significantly from the extremities to the center in a symmetricalmanner in the 5′ and 3′ half-domains. Thus genes located near the putative replication origins tend to be widely expressedwhereas those located far from them are mostly tissue-specific. Note that this result is robustly observed after eliminatingduplicated genes. We have further checked that the decrease in both Nt values (Fig. 86(e)) and gene length L (Fig. 86(c)),from theN-domain border to its center does not reflect a correlation between gene length and expression breadthmeasuredusing EST, SAGE or microarray data [91].

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 135

Fig. 86. Analysis of the genes located in the identified replication N-domains. (a) Arrows indicate the R+ orientation, i.e. the same orientation as themost frequent direction of putative replication fork progression; R− orientation (opposed direction); red, (+) genes; blue, (−) genes. (b) Gene density: thedensity is defined as the number of 5′ ends (for (+) genes) or of 3′ ends (for (−) genes) in 50 kbp adjacentwindows, divided by the number of correspondingdomains; in the abscissa, the distance, d, in Mbp, to the closest domain extremity. (c) Mean gene length: genes are ranked by their distance, d, from theclosest domain extremity, grouped by sets of 150 genes, and the mean length (kbp) is computed for each set. (d) Relative number of base pairs transcribedin the + direction (red), − direction (blue) and non-transcribed (black) determined in 10 kbp adjacent sequence windows. (e) Mean expression breadthusing EST data. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [91].

Fig. 87. Model of gene organization coordinated by replication and transcription. Two successive putative replication origins (ORI) delineate a replicationN-domain. Open chromatin: arrows illustrate an open chromatin state at replication origin position. Replication timing: the triangles figure the replicationtiming values along the N-domain. Replication fork orientation: the triangles indicate the proportion of replication forks progressing from each extremityto the other extremity along the domain (during the successive cell cycles, replication terminates at random siteswithin the domain). Breadth of expressionis maximumnear the replication origins and decreases toward the domain center (grey triangles). Transcription orientation: it is preferentially co-orientedwith the replication fork progression; the colored triangles indicate the proportion of base pairs along the domain transcribed in the + direction (red) and− direction (blue). Gene organization: red (resp. blue) arrows indicate (+) (resp. (−)) genes in the domains. (For interpretation of the references to colorin this figure legend, the reader is referred to the web version of this article.)Source: Reprinted from Ref. [91].

136 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 88. Masked skewprofiles of three split-N-domains calculated in non-overlapping 1 kbpwindows (masked sequence), with colored dots correspondingto (+) genes (red) and (−) genes (blue) [549,550]. Split-N-domain shape are sketched in green line. Masked GC content calculated in non-overlapping100 kbp windows (with a step of 100 kbp) is drawn in purple line. The cyan curve corresponds to high resolution replication timing TR50 data (100 kbpwindows shifted by 10 kbp steps) (see Fig. 84) [529]. (For interpretation of the references to color in this figure legend, the reader is referred to the webversion of this article.)

Altogether the results reported in this section provide the first demonstration of the quantitative relationships in thehuman genome between gene expression, orientation and distance fromputative replication origins [91,534]. A possible keyto the understanding of this complex architecture is the coordination between replication and transcription. The putativereplication origins would mostly be active early in the S phase in most tissues. Their activity could result from particulargenomic context involving transcription factor binding sites and/or from the transcription of their neighboringhousekeepinggenes. This activity could also be associated with an open chromatin structure, permissive to early replication and geneexpression in most tissues [535–538]. This open conformation could extend along the first gene, possibly promoting theexpression of further genes. This effect would progressively weaken with the distance from the putative replication origin,leading to the observed decrease in expression breadth. This model (Fig. 87) is consistent with a number of data showingthat in metazoans, ORC and RNA polymerase II colocalize at transcriptional promoter regions [539], and that replicationorigins are determined by epigenetic information such as transcription factor binding sites and/or transcription [540–543].It is also consistent with studies in Drosophila and humans that report correlation between early replication timing andincreased probability of expression [466,507,512,539,544]. Furthermore, near the putative origins bordering the replicationdomains, transcription is preferentially oriented in the same direction as replication fork progression. This co-orientationhas been interpreted as possibly reducing head-on collisions between the replication and transcription machineries, whichmay induce deleterious recombination events either directly or via stalling of the replication fork [545,546], as observed inbacteria for essential genes [547]. But this interpretation has been recently questioned by the analysis of the subset of humangenes that are transcribed during the S phase and that do not present a significant co-orientation with the replication-forkprogression [548].

6.4. Identifying several mega-base-pair long replication split-N-domains

6.4.1. Detection of replication split-N-domainsAs reported in Fig. 72(a), N-domain size extends up to 2.8–3 Mbp [509]. However in Section 5.4, with our wavelet-based

methodology, we have detected large amplitude skew upward jumps that are separated by distances larger than 3 Mbpwithout any other upward (and downward) jump in between. As illustrated on three examples in Fig. 88, the skew S nolonger decreases over the whole domain length but only on ∼1.5 Mbp nearby the bordering putative replication origins,independently on the domain size, leaving a large central region of null skew. This shape is reminiscent of an N but split inhalf andwith a central region of null skew that can reach several Mbp; this led us to name these domains ‘‘split-N-domains’’[549,550]. We have detected 113 split-N-Domains with sizes ranging between 3 and 15 Mbp and that cover collectively

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 137

Fig. 89. Description of (a, a’, a’’) the skew S, (b, b’, b’’) the GC content, (c, c’, c’’) the gene density and (d, d’, d’’) the coverage in DNase HypersensitiveSites (DHS) along the 74 central regions of human split-N-domains [550]. (a) Variance of the masked skew calculated in 100 kbp windows vs. mean ofthe masked skew; (a’, a’’) corresponding mean and variance histograms. (b) Variance vs. mean of the masked GC content; (b’, b’’) corresponding meanand variance histograms. (c) Variance vs. mean of the gene density (number of genes per Mbp); (c’, c’’) corresponding mean and variance histograms.(d) Variance vs. mean of the coverage in DHS (in %bp); (d’, d’’) corresponding mean and variance histograms. In (a–d), each red circle corresponds to oneof 74 central regions of split-N-domains; blue circles correspond to regions randomly drawn in the genome that have the same size distribution as the74 split-N-domain central regions. Red (resp. blue) dashed lines correspond to the median of the red (resp. blue) histograms of mean and variance in thecentral regions (resp. in randomly drawn regions in the human genome) [549,550]. (For interpretation of the references to color in this figure legend, thereader is referred to the web version of this article.)

17% of the human genome. We have checked that split-N-domain borders have similar properties as N-domain borders inparticular as far as gene organization (Section 6.3) is concerned [549,550]. This is a strong indication that the split-N-domainborder length ∼1.5 Mbp is a characteristic size in the organization, coordination and regulation of replication, transcriptionand chromatin.

6.4.2. Linking split-N-domains to GC-poor isochores and gene desertsWhat distinguishes these new replication split-N-domains are their large central regions with a homogeneous null skew

that are replicated very late in the S phase (Fig. 88). To study these central regions, we have retained the 74 split-N-domainswhose size is superior to 3.5 Mbp, and considered the central region distant of more than 1.5 Mbp from the two borders.These 74 central regions have sizes ranging from 500 kbp to 11.5 Mbp and overall cover 128 Mbp (4.5% of genome length).

Central regions of split-N-domains: homogeneous null skew. Central regions have a quasi-null skew. What is interesting is notthat the skew is on average null but that it is homogeneously null (Fig. 88). At the scale of a chromosome, there are as muchpositive as negative contributions to the skew profile of the human genome, so that the skew of a chromosome or of thewhole genome has a distribution centered on zero (Fig. 89(a) and (a’)). But, when comparing central regions to randomlydrawn regions of the same size, the skew distribution of central regions is much more narrowed (Fig. 89(a) and (a’)). Theproportion of central regions with an average skew comprised between −0.01 and 0.01 reaches 54.1%, far more than theexpected value of 6.2%. This homogeneity of the skew appears more clearly when directly looking at its variance insideeach central region compared with the random expectation (Fig. 89(a) and (a’’)). A homogeneously null compositional skewsuggests that there is no preferred directionality of the replication fork. One way to account for this is to have a randominitiation of replication in these regions. Indeed, if initiation events are randomly distributed, the opposite contributions

138 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

to strand asymmetry of forks progressing in both directions would compensate each other, and thus the whole region ofrandom initiation would harbour a null skew [549,550].Central regions of split-N-domains: GC poor isochores. As shown in Fig. 89(b–b’’), central regions of split-N-domains alsocorrespond to very low GC isochores [127,146–152]. Their GC content has a median value of 0.34 whereas the median forrandomly drawn regions of the same size is 0.39 (Fig. 89(b) and (b’)). Close to 22% of the central regions and less than 1% ofthe random regions have a GC content inferior to 0.32. Similarly to what we have observed for the skew, their GC content isnot only low but homogeneous (Fig. 89(b’’)) [549,550].Central regions of split-N-domains: gene deserts. LowGC isochores are known to be poor in genes [127,146–152]. Consistently,we find that split-N-domains central regions correspond to gene deserts. Out of the 74 central regions, 33 are void of genesand the remaining 41 contain altogether 205 genes. On average, they have a very low gene density (1.53 gene per Mbp) and∼49% of the central regions and only 1.9% of the random regions have a gene density inferior to one gene perMbp (Fig. 89(c)and (c’)). As for the GC content and the skew, the variance of the gene density in the central regions of split-N-domains issmaller than expected (Fig. 89(c’’)) [549,550].

To summarize, the central regions of split-N-domains correspond to late replicating gene desert with a low GC contentand a null skew (Figs. 88 and 89). The replication of a large region at the end of the S phase, when there is not muchtime left to do so, possibly requires an increase in the number of origin firings. Indeed, studies of replication initiation inXenopus early embryos, where no transcription occurs, have revealed that (i) location of replication initiation is randomeventhough replication timing is deterministic at large scale (Mbp) [464,551], and (ii) the rate of initiation (number of originsfiring per unit of time) increases with time during the S phase [552]. Analyses of the replication timing profiles of severalother eukaryotes have revealed that the increase in the rate of initiation during the S phase could be universal [553,554].Besides, transcriptional activity was shown to contribute to the definition of the replication program as replication initiationis suppressed within transcription units in Xenopus eggs [555] and transcription initiation might specify the activation ofreplication origins [541,556]. Hence the low gene density in these regions could underlay the absence of strong origins andthe randomness of replication initiation [549,550].

In this section, we have segmented the human genome in replication N-domains and split-N-domains that altogethercovermore than half of the genome. In these replication domains that are likely to be conserved inmammalian genomes (seeSection 6.1 forN-domains, data not shown for split-N-domains), genes are remarkably organized in position and orientation,strongly suggesting that these domains are functional units where transcription and replication are highly coordinated. Inthe next section, we will discuss the possibility that these units are indeed structural domains at the heart of the tertiarystructure of chromatin inside eukaryotic nuclei. In particular, wewill report the results of the analysis of chromatinmarkersthat will show that the putative replication origins that border both N-domains and split-N-domains are likely specified bya peculiar open chromatin environment [92].

7. From sequence analysis to the modeling of the chromatin tertiary structure

7.1. Sequence effects on the structure of the eukaryotic chromatin fiber

7.1.1. The fiber structureSo far, we have mainly focused on the primary structure of the nucleosomal array, i.e. on the linear nucleosome

positioning along eukaryotic chromosomes (see Sections 3 and 4). In particular, we have shown that this positioningis mainly controlled by a mechanism of ‘‘equilibrium statistical ordering’’ which provides some understanding of thealternation of well-ordered nucleosome patterns confined near excluding energy barriers (NFRs) and of regions of morediluted non-positioned nucleosomes, observed in vivo [81–83,86]. However, as already pointed out (Fig. 1), this 10 nmchromatin fiber has the propensity to condense into a highly folded and compact structure called the 30 nm chromatinfiber [1,2], depending on the ionic strength and on the presence or not of linker histones [557,558]. But if the structure ofthe nucleosome is known at atomic resolution [8–10,559,560], the structure of the fiber and its higher order organizationremain controversial and various models for the fiber geometry are still under investigation [268,269,274,561,562]. Amongthe various difficulties encountered in this structural issue, one comes from the size of the fiber in between the range of lightmicroscopy (scales larger than a few hundreds nm) and of crystallography and other experimental techniques available atmolecular scales (less than a few tens of nm). Another technical difficulty concerns the necessity of extracting the fiber fromthe cell nucleus without disturbing and modifying its structure. Furthermore, it is by no mean certain that the structuresobserved in vitro for purified fibers or with reconstituted nucleosome arrays, reproduce the actual functional and dynamicalstructure of the fiber in vivo. Evenmore crucial, the fiber structural and functional organization is likely to vary in space, alongthe chromosomes (e.g. heterochromatin versus euchromatin) and in time, during the cell cycle [268]. The structure designedto achieve the very tight and optimal packing required duringmitosis condensation [563–566] is presumably different fromthe more extended and less condensed conformations that are expected to favor the accessibility of regulatory sites toexternal factors involved in functional processes like transcription and replication [416,563,567].

As illustrated in Fig. 90, early in vitro visualization of native or reconstituted chromatin fibers by cryo-electronmicroscopy[568–571] and scanning force microscopy [382,383,572] has revealed a three-dimensional fiber structure at low ionicstrength with a diameter of ∼30 nm and an irregular, zig–zag arrangement of nucleosomes (Fig. 90(a)). At higher ionic

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 139

Fig. 90. A gallery of electron micrographs of chromatin. (a) Low ionic-strength chromatin spread, the ‘‘beads on a string’’; size marker: 30 nm. (b) Isolatedmono-nucleosomes derived from nuclease-digested chromatin; size marker: 10 nm. (c) Chromatin spread at a moderate ionic strength to maintain the30 nm higher-order structure; size marker: 50 nm.Source: Adapted with permission from Ref. [571].© 2003, by Nature Publishing Group.

Fig. 91. Models for theDNApath in the chromatin fiber. Higher order structuremodels: (a) one-start solenoidal, (b) two-start supercoiled, and (c) two-starttwisted. Upper views have the fiber axis running vertically; lower views are down the fiber axis. DNA associated with the nucleosome core is red/blue, andlinker DNA running between cores is yellow. These models are idealized, with nucleosome cores in each start contacting each other. (For interpretation ofthe references to color in this figure legend, the reader is referred to the web version of this article.)Source: Reprinted with permission from Ref. [562].© 2004, by the American Association for the Advancement of Science.

strength, these fibers collapse into a compact helical structure with closer nucleosomal spacing (6–7 nucleosomes/11 nm)and a diameter ∼30–33 nm (Fig. 90(c)) [383,573]. Recently, highly compact fibers have been reconstructed in vitro on DNAcontaining tandem repeats of strong nucleosome positioning sequences with predefined nucleosome repeat length (NRL),at high divalent cation concentration and in the presence of linker histones [574]. EM images reveal that the fiber thicknessactually remains constant for NRL ranging from 177 up to 207 bp (corresponding to a linker size varying from 30 to 60 bp),roughly 35 nm in diameter and 11 nucleosomes/11 nm in compaction. However, arrays with larger linker length (from 70 to90 bp), corresponding to NRL ranging from 217 to 237 bp, produce fibers of 45 nm in diameter and 15 nucleosomes /11 nmin compaction. These results were interpreted [574] in terms of some compact helical structure reminiscent of the so-calledsolenoid model introduced two decades ago [193,557,558] and closely related to the interdigitated solenoidal structureproposed in Ref. [575].

Various models of the chromatin fiber geometry have been proposed in the literature. These models can be groupedinto two main classes [562]: (i) the one-start helical stack of nucleosomes with bent linker DNA connecting each pair ofnucleosome cores (Fig. 91(a)) [193], and (ii) the two-start zig–zag helix where two stacks of helically arranged nucleosomecores are connected by straight linker DNA (Fig. 91(b) and (c)) [561,576]. The historical representative of the one-start classis the solenoid model where the nucleosomes coil a central cavity with some number of nucleosomes per turn quantifyingthe level of compaction [193,557,558,577]. A critical feature of the solenoid model is the interaction between consecutivenucleosomes on the DNA chain. This requires significant bending of the intervening linker DNA which could be facilitated

140 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 92. Two local views of the chromatin fiber that illustrate the basic parameters of the Extended 2 Angle (E2A) model of Diesinger and Heerman [585]:the entry–exit angle αi , the rotational angle βi , the linker length bi and the pitch di . The DNA linkers are fixed in front of the nucleosome by the H1 histone(orange ellipsoid). A large entry–exit-angle was chosen here to make the visualization clear.Source: Adapted with permission from Ref. [585].© 2008, by Elsevier Limited.

by association with the linker histone H1 [193]. As mentionned just above, recent EM studies [574] of chromatin fibersreconstituted in vitrowith longer repeat lengths (NRL∼177–237 bp) and one linker histone per nucleosome, have confirmedthe existence of one-start helical structures with interdigitated nucleosomes [574,576]. The two-start class can actually bedivided into twomain structuremodels, namely the helical ribbonmodel (Fig. 91(b)) [576,578] and the crossed-linkermodel(Fig. 91(c)) [579,580]. In the latter, linker DNA crosses the fiber center, connecting nucleosomes that site on the oppositesides of the fiber, so that internucleosomal contacts are made between nucleosomes at positions i and i + 2 [267]. Theexperimental data currently available support the view that at least, two types of two-start fiber conformations do exist.The first one was proposed on the basis of EM studies of native chromatin fibers extracted from chicken erythrocytes thatshow a zig–zag like DNA backbone with a so-called nucleosome stem motif [569,581]. In this nucleosome stem, the linkerhistone H5 mediates the association of the two DNA segments, leaving the nucleosome core particle over a distance of3–5 nm (10–20 bp) before the linker DNA separates [272,569,581]. The second type of two-start fiber helix is based on thecrystal structure of a tetranucleosome particle [582]. This structure has a short NRL of 169 bp and was determined in theabsence of linker histones. Indeed the tetranucleosome unit can be extended in a fiber model with a predicted compactionof 5.5 nucleosomes/11 nm [270,582].

7.1.2. Geometrical and physical modeling of the chromatin fiberAs a simple geometrical and staticmodel of the inner structure of the chromatin fiber, the two-anglemodel introduced by

Woodcock et al. [267] has played a major role in the recent blooming of chromatin modeling. In this model, the local helicalgeometry of the chromatin fiber is described by two angles (α, β), that define the relative orientation of the entry and exitlinker DNAs, and by the linker size (Fig. 92). Considering the nucleosome as identical solid bodies, these three parametersdetermine the linker DNA path within the chromatin fiber and the relative orientation of successive nucleosomes [268]. Themain structural predictions of the two-angle model are (i) the extreme sensibility and variability of the fiber conformationwhen exploring the (α, β) parameter space (Fig. 93), from flat ribbon and open structure for large entry/exit angle α (∼180)between the linkers, to more compact and dense superhelical coil for small α (∼0) and twist angle β (∼0) along thelinker, and (ii) the robustness of the fiber diameter value that remains around 30 nm whatever the inner structure andcompaction level [268,583,584]. As regards to its utmost simplicity, it was under a delusion to expect this geometricalmodelto faithfully reproduce the wealth of fiber conformations observed experimentally. This explains that extended versions ofthe two-anglemodel have been further developed [272,275,585]. As illustrated in Figs. 92 and 93 for the 4-parametermodelproposed by Diesinger and Heerman [585] to account for the dynamical binding of the linker histone H1, these variants ofthe geometrical two-angle model involved more parameters and in turn a higher-dimensional chromatin phase diagram(Fig. 93). Interestingly, the presence of H1 can be directly introduced in the model by replacing the nucleosome by the so-called chromatosome [270], that regroups the nucleosome core plus the linker histone and the associated ∼169 bp (147 bparound thehistone octamer+22bpbound toH1) ofDNA. Indeed, the refined structure of the chromatosome is unknownandvarious models have been proposed for the integration of the linker histone H1 or its globular domain into the nucleosome[586–589].

Coarse-grained physical modeling, based on the geometrical parametrization of the two-angle model and its variants,have been recently developed to investigate the equilibrium properties of these helical chromatin structures [268,590].These models take explicitely into account the linker DNA elasticity and the interaction between DNA and betweennucleosomes that are known to bemodulated by the ionic environment. At low-salt concentration, all thesemodels supportthe picture of an open zig–zag like fiber structure (Fig. 90(a)), that is consistent with the force–extension curves obtainedon single chromatin fibers under these experimental conditions [235]. These single molecule experimental data actuallyturn out to be in good agreement with the results of computer simulations [591] and of analytical approaches [583,584,592] based on crossed-linker fiber geometries. All these studies suggest that, even at ‘‘moderate’’ condensation level, thelinker size and the nucleosome–nucleosome interaction [272,593] are the key control parameters of the fiber structural

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 141

Fig. 93. A cut-out of the actually four-dimensional chromatin phase diagram of the Diesinger and Heermanmodel [585], as a function of the two-angles αand β (see Fig. 92). The pitch d and the linker length b are fixed. A single point in this diagram corresponds to a specific chromatin stucture. The forbiddenstructures lay left and below the dashed line which is the excluded volume borderline. Due to the parameter distributions of the model, one should notexpect a specific chromatin structure but instead a distribution of structures in the phase diagram. This probability distribution is shown in the back of thefigure as a 2D colored map.Source: Adapted with permission from Ref. [275].© 2009, by Elsevier Limited.

(nucleosome compaction) and elastic (persistence length) properties. In particular, the nucleosome–nucleosome interactionappears to be a crucial driving packing force at the highest physiological salt concentration. As supported by experiments onliquid crystals of mono-nucleosomes [594,595], nucleosome–nucleosome interaction is generally described by a Gay–Bernepotential [270,272,593,596] that presents an attractive short-range component and some excluded volume (hard core)component. As reported in Refs. [272,593], increasing the excluded volume leads to a fiber stiffening, whereas increasingnucleosome–nucleosome interaction energy produces more compact structures. The predicted persistence length lp of thechromatin fiber spans a wide range from 22 to 390 nm which correlates well with the 30–220 nm range determinedexperimentally [597]. In agreement with Monte Carlo simulations of chromatin fiber conformations [596,598], Mergellet al. [593] and Stehr et al. [272] have shown that increasing the NRL of cross-linked (CL) and solenoid interdigitated (ID)fibers (Section 7.1.1) leads to a decrease of both packing density and persistence length. The predicted high flexibility ofthe CL fiber with lp = 22–351 nm is in agreement with the 30 nm persistence length derived from in situ cross-linkingexperiments [599]. ID fibers are expected to be much stiffer with lp = 390 nm for a NRL ∼187 bp. Very recently, MonteCarlo simulations coupled to in vitro experiments performed on a regular nucleosome array [600] have shown that (i)condensation without H1 and Mg2+ generates an open zig–zag structure, whereas (ii) condensation with H1 and Mg2+produces a heteromorphic chromatin with predominantly two-start fiber structure (with interaction between i and i + 2nucleosomes) and some one-start fiber (with i and i+1 interacting nucleosomes) with highly bent linker (Fig. 91). However,so far, all these numerical simulations have failed to achieve the high level of compaction observed by Robinson et al. [574]with 15 nucleosomes/11 nm for a NRL ranging from 217 to 237 bp. In their comparative numerical analysis of the CL modelwithoutH1 and the solenoidal IDmodel proposed byRobinson et al. [574], Kepper et al. [270] have confirmed that the ID fiberis thicker, more compact and stiffer than the CL fiber, but they have failed to observe a thermodynamically stable fiber witha NRL of size large than∼200 bp. As suggested by the experimental demonstration of a cooperative binding of linker histoneH1 toDNA, pointing to the existence of protein–protein interaction in the resulting complex [601,602], this frustrationmightbe overcome by taking into account, in the Monte Carlo simulations, the interactions between neighboring linker histones.Along the line of a former all atom modeling of highly compact structures containing linker histones [603], Depken andSchiessel [604] have recently shown thatwhen considering a simple geometric criteria basedon thenucleosome shape alone,they can reproduce the observed characteristics of the condensed chromatin fiber and how these change with varying theNRL. This study puts into light the important role of nucleosome surface interactions in the regulation of chromatin structureand function.

142 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 94. Example of chromatin fiber conformation of length 98.5 kbp with depletion effects. The linker histone skip rate is 6% and the nucleosome skiprate is 8%. The linker histone skips are marked in orange. (For interpretation of the references to color in this figure legend, the reader is referred to theweb version of this article.)Source: Adapted from Ref. [608].

7.1.3. Modeling heterogeneous chromatin fibers with defectsThe picture that emerges from the numerical and experimental works reported in Sections 7.1.1 and 7.1.2, is that

extended 2-angle models with proper nucleosome interactions and nucleosome shape provide a reliable framework tocharacterize the different levels of helical folding of the 10 nm fiber. The dominant influence of the linker DNA length, thecrucial role of the linker histone H1, the fundamental contribution of the form and strength of the nucleosome interactionunder various salt conditions have been systematically investigated and shown to produce a wealth of different chromatinfiber conformations. If these fiber conformations can be seen as the paradigm of the 30 nm fiber reconstituted in vitro fromnative or synthetic nucleosomal arrays, their relevance in vivo is far from being established. Only rare situations of ‘‘native’’fibers of highly compact structure have been observed like the 12–13 nucleosomes/11 nm fiber from echinoid spermatozoawith a NRL of about 70 nm (∼206 bp) [605] and the dense fiber frommammalian centromeres [606]. More likely, in vivo, thechromatin fiber is expected to be highly heteromorphic with a rather broad distribution of packing density, structural andmechanical properties, as a direct consequence of the heterogeneous distribution of ‘‘control’’ parameter values along thegenomes. The analysis of mouse fibroblast chromatin has indeed revealed a packing density that varies along the genome;for example, centromeric heterochromatin has a packing density of ∼12 nucleosomes/11 nm, while those of pericentricheterochromatin and ‘‘bulk’’ chromatin are lower [606]. Very recently, Revaud et al. [607] have shown that in CHO cells ofhamsters, intersticial telomeric sequences display a dense nucleosomal array with very short linker (167 bp) as comparedto the bulk chromatin (187 bp).

All the current models discussed in the previous subsections indeed assume that the chromatin fiber is homogeneous,i.e. with the same equilibrium structural (linker length) and physical (elasticity, interactions) properties all along the chro-mosome which is very unlikely in vivowhere these properties are expected to vary not only in space but also in time duringthe cell cycle. Actually, as revealed by the genome wide experimental maps of nucleosome positions for many differentorganisms and cell types [201–217], nucleosome positioning and occupancy vary along the chromosomes: some regionspresent a well ordered periodic nucleosome organization, some other regions are depleted in nucleosomes and some othersdo not display any positioning pattern (see Section 3). However, as illustrated in Section 3.4 by the remarkable nucleosomeordering observed inside yeast genes (Fig. 21), even within periodic regions, the DNA linker size may slightly vary. Wood-cock et al. [267], as well as more recently Diesinger and Heermann [275], have shown that introducing variability in thelinker size induces some decrease of the fiber compaction level as it unfavors proper nucleosome–nucleosome stacking. Ifthe variability is too large, onemay observe an unfolded structure and ultimately some nucleosome depleted regions similarto the NFRs observed in vivo at gene promoters (see Section 3.3). As reported by Diesinger and Heermann [275], increasingthe amount of nucleosome skips drastically affect the large-scale conformational properties of the fiber that becomes moreflexible (lp decreases) (Fig. 94). Similarly, the effect of increasing linker histone H1 skips affects the flexibility and the exten-sion of the chromatin fiber (Fig. 94). Actually, the average linker size has been shown to vary between species and betweentissues, and this linearly with the average H1/nucleosome stoichiometry [218]. The global in vivo stoichiometry is not 1/1,clearly suggesting that the fiber is heterogeneous with alternation of stable dense structures (with H1) and loose soft struc-tures (without H1). In the yeast, that lacks H1, the NRL is shorter (see Section 3) and the ‘‘folded’’ structure is probably a loose2-start structure. For higher eukaryotes, the stoichiometry increases and the average linker size does the same. This suggeststhat in these organisms, stable compact folded fibers exist on particular genomic regions with regular nucleosome orderingand where H1 has been targeted. In addition, chromatin is subject to various epigenetic biochemical modifications [609],

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 143

such as histone tail acetylation and histone variants incorporation that are likely to play a role in modulating the foldingproperties of the 30 nm fiber [274,610].

These results stress again the importance of nucleosome depletion. We have seen in Section 3, that the NFRs observed invivo in many organisms, contribute to the statistical linear ordering of nearby nucleosomes (probably under the dynamicalaction of remodelers) that likely promotes the local condensation of the 30 nm fiber. A next step in modeling the effect ofthe DNA sequence on the structural and mechanical properties of the 30 nm fiber will be to combine our grand canonicalphysical modeling of nucleosome assembly along the 10 nm chromatin fiber described in Section 3 [81–83] with currentphysical models of the 30 nm chromatin fiber like the ones of Diesinger and Heermann [275], Diesinger [608] and of Wonget al. [603]. This modeling of the chromatin secondary structure (see Fig. 1) is presently under progress.

7.2. Depletion effects and chromatin loop formation

7.2.1. Current models for the chromatin tertiary structureDNA replication and transcription require great reproducibility and coordination, all this in the crowded environment of

the cell nucleus. Regulation of these complex processes relies in part on the conformation and dynamics of the 30 nm chro-matin fiber that ultimately condition DNA sequence accessibility. As we have just discussed, the chromatin fiber is a nucleo-protein filament with non-homogeneous structural and mechanical properties [2,21]. This heterogeneity evidently affectshow the fiber folds and organizes into higher order structures like loops, coil or chromonema (see below). However, despiteincreasing experimental [415–417,611–614] andmodeling [22–25] efforts, the so-called tertiary chromatin structure is stillvery controversial [424,615] and the possible role, if any, of the DNA sequence at such large scales remains an enigma.

Although the existence of chromatin loops, ranging in size from several kbp to 10 Mbp or more [424,611,616], has beenextensively discussed in the literature, it has generally been inferred from indirect assays. The geometry of chromatin hasbeen mainly experimentally studied in human cells where several models have been proposed to describe how the 30 nmfiber folds into a higher order structure through loop formation. Without aiming at being exhaustive, let us however reviewsome models that have given impetus to the recent increasing experimental and theoretical interest devoted to the studyof the tertiary chromatin structure.

• The radial loop protein scaffold model [17,18,409–411,417,617] has long been proposed to account for the experimentalobservation [408] in extracted metaphase chromosomes of DNA loops attached at their basis to a nuclear substructure,the nuclear matrix, that is believed to organize the chromatin into looped domains during interphase [618]. Note thatloops can be anchored to the peripheral lamina at the nuclear envelope [619] as well as to the nucleolus [620].

• The chromonema fiber model in which successive levels of helical coiling generate a series of thicker chromatin fiberswith diameters 60–80 nm (tertiary structure) and 100–130 nm (quaternary structures) [19,20,419,420,423]. Rather thana scaffold to provide structural support, it is thought that chromonema fibers are held together by fiber–fiber interactionswhich could be regulated by different histone variants and tailmodifications [21]. Thismodel has been inspired from lightand electron microscopy studies of mitotic and interphase artificial chromosomes [19,621,622].

• The model of chromatin loops formed by clustered polymerases [25,26]. In this model, non-specific entropic forces be-tween DNA and/or RNA polymerases, already engaged on the chromatin fiber, are supposed to drive the aggregation ofthese polymerases thereby promoting the genome compartmentalization into rosette-like multi-loop patterns contain-ing several thousands (and sometimes millions) of base pairs. This dynamical multi-loop model has been emphasized asproviding a very attractive description of replication foci and transcription factories [623].

However, whether the chromatin loops originate from specific or nonspecific interactions, their wide range of sizesfrom several kbp to several Mbp raises the issue of confining in the limited nucleus volume, polymer-like chromatin chainsthat can be expected to resist intermingling thereby influencing the localization of other functional compartments throughexclusion interactions [624]. In the early nineties, fluorescence in situ hybridization (FISH) data on 3D distances betweendefined genomic sequences have become available [625] and used to improve the modeling of the tertiary structure ofeukaryotic chromosomes. In particular, the early observation of deviations from random-walk behavior at scales largerthan 1–1.5 Mbp during the G0/G1 part of the cell cycle, was initially interpreted by a polymer model in which the DNAof any one chromosome is confined to a spherical subvolume of the interphase nucleus [626]. An alternative suggestionwas that the deviations could be due to the presence of large flexible chromatin loops, each comprising several Mbp ofDNA (∼3 Mbp on average), that are attached on a random-walk backbone [22,627]. Monte-Carlo simulations based on thisrandom-walk/giant loop (RW/GL) model have unfortunately shown that it cannot explain new available data sets on thespatial distance distribution between chromatinmarkers in interphase nuclei. Then themulti-loop sub-compartment (MLS)model [24,418] has been introduced and shown to account rather well for interphase distances. This model proposes anarrangement of chromatin loops into rosette-like sub-compartments (of∼1Mbp in size) connected by small chromatin fiberfragments (∼120 kbp). Variants of the MLS model have been further developed, for example to account for the chromatinanchoring to the nuclear structures [23] or for the spherical structure of interphase chromosome territories [628].

To summarize, over the years, different experimental studies of the tertiary chromatin structure have yielded apparentlyconflicting models based on loops and scaffold, multi-loops or chromonema fibers. Whether these different models can bereconciled or some of them definitely discarded, will depend on future technological advances. So far progress has been

144 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 95. The depletion attraction. (a) This schematic view shows a suspension of large and small spheres in a box. The shaded regions around the fourlarge spheres are excluded to the center of masses of the small spheres. When one large sphere contacts another, their excluded volumes overlap (1) toincrease the volume available to the small spheres (increasing their entropy); then, aggregation of the large spheres paradoxically increases the entropyof the system. An analogous effect is found when a large sphere contacts the wall (2). The attraction can also be viewed as an osmotic phenomenon; smallspheres cannot enter excluded volumes, and a force equivalent to their osmotic pressure acts on each side of the two touching large spheres to force themtogether (or on one side of the large sphere at the wall to force it to the wall). (b) Spheres bound to each end of a string will also tend to aggregate orassociate with the wall, to loop the connecting string (which has an associated entropic cost).Source: Adapted with permission from Ref. [629].© 2009, by Elsevier Limited.

Fig. 96. Model of eukaryotic genome structure. DNA is coiled around a histone octamer, and runs of nucleosomes form a zigzagging string (bottom). Atthe intermediate level in the structural hierarchy (middle), this string is organized into loops (from kbp to Mbp) by attachment to factories (pink circles)of entropy driven self-assembled polymerases (orange triangles and ovals). (For interpretation of the references to color in this figure legend, the reader isreferred to the web version of this article.)Source: Adapted with permission from Ref. [26].© 2002, by Nature Publishing Group.

hampered by important experimental limitations such as the difficulty of getting the right biological material to study. Forexample, it is far from being certain thatmitotic, meiotic, polytene and synthetic chromosomes are representative of normalinterphase chromatin, and if so, whether the structures observed after extraction and fixation are genuine. Chromatin iseasily shearedwhen it is extracted fromnuclei, and then it often aggregates into an intractable gel. Non-physiological buffersare often used to minimize this drawback, but they can induce some distortions and modifications of the structure. Evenmore penalizing, the resolution of light microscopy is far from allowing us to resolve the various levels of chromatin folding,and the use of electron microscopy with its higher resolution introduces additional problems of preserving the structure in

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 145

Fig. 97. Visualizing replication foci in HeLa cells [623]. Left: A cell in mid-S phase was grown for 5 min in 150 µM BrdU and fixed; then, Br-DNA wasindirectly immunolabeled with a fluorochrome (Cy3) and a fluorescent image of the center of the cell was collected with a confocal microscope. Newlymade DNA appears as discrete white foci in the nucleus (black ‘‘holes’’ are nucleoli [425]). Scale bar, 2.5 µm. Right: Newly made DNA in individual DNAfibers. Cells were grown for 15 min (top) or 30 min (bottom) in BrdU and DNA fibers were spread; Br-DNA was indirectly immunolabeled with Cy3 andphotographed in a conventional fluorescence microscope [425]. Each panel contains three regions of newly replicated DNA strung along one fiber of∼125 µm (∼375 kbp). The three regions probably initiated together as they have equal lengths.Source: Adapted with permission from Ref. [623].© 1999, by the American Association for the Advancement of Science.

vacuo. In their experimental study of transcriptionally active tertiary chromatin, Müller et al. [424] have shown that theirlight microscope images cannot distinguish among competing models; if they are in agreement with the predictions of thechromonema fiber models, they can also be explained by a radial-loop/scaffold model and even by a simple nucleosomeaffinity, random-chainmodel. We hope that the recent efforts spent in developing new fluorescencemicroscopy techniquesthat provide a resolution higher than that afforded by conventionalmicroscopy, like 4Pimicroscopy [630,631], togetherwiththe outstanding progress made in chromosome conformation capture (3C) technique [599] and its 4C [632,633], 5C [634]and Hi–C genome-wide [635] extensions, will allow one not only to reveal long-range chromatin interactions across theentire genome as a footprint of the different levels of chromatin folding, but ultimately to visualize these folding patternsand their dynamics in relation with gene activity and the functional state of the cell.

7.2.2. Chromatin loops formed by clustered polymerasesSome fifty-six years ago, Asakura and Oosawa [636] pointed out that large rigid spheres immersed in a solution of

smaller spheres are subject to an attractive force due to the depletion induced by increasing the space available to smallspheres as the large ones come close to one another (Fig. 95(a)). The environment within a living cell is crowded, with20%–30% of the volume occupied by macromolecules [2,637,638]; then the aggregation of the largest particles can lowerthe free energy of the system through an increase in entropy of the many smaller particles [636,639]. As illustrated inFig. 95(b), this entropy-driven aggregationmechanismhas led Cook and collaborators [26,629,640–643] to propose a passivethermodynamic scenario for DNA looping as an alternative to looping generated by active mechanisms (e.g. molecularmotors). In particular, they have shown that despite the entropic cost of looping the intervening DNA fiber, the clustering oflarge (DNA or RNA) polymerizing complexes strung along the chromatin fiber, actually increases the entropy of the systemthat spontaneously organizes into rosette-like multi-loop patterns (Fig. 96) [26,629,640,642]. Indeed, loops are expected toappear and disappear as DNA or RNA polymerases initiate and terminate [26], but since clustering ensures that the localconcentration of polymerases is high, these multi-loop structures are likely to be self-sustaining. This self-organizing multi-loop structural picture of the chromatin tertiary structure has been emphasized as providing some understanding to theexperimental observations of the so-called replication foci [26,623,644] and transcription factories [26,623,642,645,646].Replication foci: Replication domains in animal cells were first described in the pioneering experiments of Nakamuraet al. [647], where rodent fibroblast were shown to have about 100 sites of active DNA synthesis, which grew in size forabout an hour before new active sites were activated at adjacent nuclear positions. Improved imaging techniques havefurther revealed that mouse 3T3 cells [426] and Hela cells [425] both contain ∼1000 replication sites ∼500 nm in size(∼1Mbp) in early S-phase (Fig. 97, left panel). Since during the early stage of S-phase, the average replicon size is∼150 kbp,this is consistent with the view that each replication site must encompass 5–6 replicons in order to complete DNA synthesisin the observed time frame of 45–60 min (Fig. 97, right panels), given that replication forks progress at a rate of about1.5 kbp/min [425,426]. An important property of these groups of replicons is that they appear to remain stably associatedwithin replication foci during all the cell cycle including mitosis [414,425,426] suggesting that they are a fundamental unitof chromatin organization. As reported in Refs. [427,430,431], many replicons are larger than previously thought (the largestones being as large as a few Mbp), requiring most of the S-phase to be completed. In addition to this large heterogeneity

146 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 98. Visualizing transcription factories. (a) Cells were permeabilized, nascent transcripts extended in BrUTP, cells cryosectioned (100 nm), the resultingBrRNA immunolabeled with FITC (green), nucleic acids counterstained with TOTO-3 (red) and a fluorescence image collected on a confocal microscope.Newly made BrRNA is concentrated in factories in the cytoplasm (made by mitochondrial polymerases), nucleoplasm and nucleoli. (b, c) Conventionalelectron micrographs of spread transcription units. In (b), a crescent, like one of the two in the nucleolar factory in (a), has been stripped off the underlyingstructure; approx. 125 transcripts can be seen engaged on the rDNA unit. In (c), one of the approx. 8 active transcription units in a nucleoplasmic factory likethe one in (a) is shown; the template is associated with one polymerase and transcript. (d) Electronmicrograph of a nucleoplasmic factory obtained using aspecialized technique (ESI) that can detect endogenous phosphorus and nitrogen in unstained sections. HeLa cells were permeabilized, nascent transcriptsextended in BrUTP, the resulting BrRNA immunolabeled with 5-nm gold particles; after sectioning (70 nm), maps of phosphorus (red), nitrogen (green),and the gold particles (white)marking BrRNAwere collected. The image shows amerge of the threemaps. Five gold particlesmark BrRNA in a nitrogen-richfactory (perimeter indicated). Absolute numbers of nitrogen and phosphorus atoms within this perimeter can be calculated by using nearby nucleosomesas references (as they contain known numbers of atoms). Scale bars, 1 µm (a, b) and 100 nm (d). Modified material reproduced with permission fromRef. [645]. Copyright 2008 by the Biochemical Society (http://www.biochemsoctrans.org). Panel (a) reprinted with permission from Ref. [623]. Copyright1999 by the American Association for the Advancement of Science. Panel (b) modified with permission from Ref. [648]. Copyright 1972 by the EuropeanSociety of Endocrinology. Panel (c) adapted with permission from Ref. [649]. Copyright 1998 by the American Society for Cell Biology. Panel (d) adaptedwith permission from Ref. [650]. Copyright 2008 by the Company of Biologists.

in replicon size, it is now recognized that there is also a corresponding heterogeneity in the size and individual intensity ofreplication foci [427].Transcription factories: A typical view of a eukaryotic transcription factory has been provided by studies of RNA polymerase I[623,651], associated with the production of 45S rRNA in the nucleolus. A (triploid) HeLa cell contains ∼540 such genesarranged in tandem repeats on different chromosomes. During interphase, only∼120 genes, on someof these chromosomes,become active and come together to form ∼30 ‘‘fibrilar centers’’ [649]. Each of the 4 or so active genes associated with onecenter is transcribed by∼125 RNA polymerase I and their transcripts can be visualized after extension in Br-UTP as crescent-shape structures in the nucleolus (Fig. 98(a)). Thus, a nucleolar transcription factory contains ∼500 active polymerases and∼4 transcription units [649,652]. For comparison, early studies of nucleoplasmic transcription have estimated that 20000to 100000 polymerases are active within the nucleoplasm of mammalian cells [26,623,645]. Actually RNA polymerase IIgeneratesmostmessenger RNAs and is found in∼8000 nucleoplasmic factories, indicating that these transcription factories∼50 nm in diameter (Fig. 98(a)) contain ∼8 active polymerases each engaged on a different transcription unit [649,653].As pointed out by Cook [26], if the soluble molecules of RNA polymerase II in a HeLa cell were spread randomly insidethe nucleus, they would be spaced approximately every 120 nm, corresponding to a concentration ∼1 µM as observedfor inactive promoters. But the promoter of an active gene is likely to be tethered close to a factory where, thanks to theclustering, the local concentration of polymerases can be up to ∼1000 times higher.

Models for replication and transcription often display polymerases that track along their helical templates as they makeDNA or RNA [4,654,655]. What makes the rosette-like compartmentalization picture of interphase chromosomes advocatedby Cook and collaborators somewhat provocative is that active polymerases are not only clustered but immobilized byattachment to larger structures, acting as ties that organize andmaintain genomic loops.We refer the reader to the followingarticles of the Cook’s group [26,623,642,645,646] where experimental evidence for models involving tracking and immobilepolymerases is reviewed.

7.2.3. Spontaneous emergence of sequence-dependent rosette-like folding of chromatin fiberIn the spirit of the entropy driven looping picture by clustered polymerases described in Section 7.2.2, our goal here is to

show that multi-looped patterns can self-organize through the aggregation of sequence-induced fiber defects by depletiveforces prior to any external factor coming into play [30,88].When generalizing the study of depletion attraction from spheres(Fig. 95) to stiff (but not rigid) impenetrable tubes, Snir and Kamien [656] have shown that depletive forces can drive shortmolecular chains to a helix configuration. However, this holds only for uniform and relatively short tubes of length of the

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 147

Fig. 99. Examples of possible local defects along the fiber [30,88]. (a) Local swelling or attachment of an external agent (e.g. DNA or RNA polymerases inthe model of Cook [25,26]); (b) local shrinking; (c) any form of fiber denaturation inducing a depletive potential well, according to the position along thefiber and the entry–exit angle; (d) as an example of (c), the fiber seen as a compact helix (condensed nucleosomal array) with local partial decondensationillustrating a situation where the excluded volume gain is quite important and the entry–exit angle is fixed.Source: Adapted from Ref. [30].

Fig. 100. Steps involved in loop formation in depleting environment [30,88]: (a) free evolution of the tube; (b) formation of an unstable loop at around(3–4)lp; (c) gliding of the loop governed by the positions of the two contact points along the fiber and the entry–exit angle; (d) trapping of the loop by localdefects. The translucent green surface represents the excluded volume for the fluid of hard spheres; in (b, c, d) we see that some of the excluded volumeis reduced from the overlap resulting from formation of the loop.Source: Adapted from Ref. [88].

order of a few persistence lengths lp of the rod. For longer chains, the picture rapidly grows in complexity with a plethoraof optimal configurations (e.g. hairpin, beta-sheet, superhelix, torus) leading to an overwhelmingly rich phase diagram.In this section, we extend this study to long tubes with ‘‘frozen’’, heterogeneously distributed elastic and/or geometricproperties [88]. By frozen we mean that these fluctuations are imprinted on the 30 nm chromatin fiber by the sequenceitself. We have seen in Section 7.1, that this fiber is known to be dependent upon the properties of the nucleosomal string-of-beads (10 nm chromatin fiber) [267,268,561,593], which in turn is influenced by the double helix sequence-dependentmechanical properties (Section 3). In Section 7.3, we will show that the putative origins that border N-domains (Section 6)provide privileged locations for intrinsic decondensed fiber defects [92] that may be at the heart of the tertiary chromatinstructure.

For a semi-flexible tube in a dilute environment, local repulsive potentials among parts of the fiber induce a self-avoidingrandomwalk configuration (swollen coil [314]). In a crowded environment, the depletion actionmay dominate and the fiberwill tend to collapse on itself, forming a globular phase. We know from standard statistical physics of polymers that thislatter phase does not admit a universal description in terms of macroscopic parameters (such as total length, Kuhn lengthand virial coefficients) but rather depends on a detailed understanding of the interaction potential. An important featureof the depletion potential lies in its simplistic geometrical nature. We thus consider a system constituted of a dense fluidof hard spheres bathing a semi-flexible tube. The tube is assumed to be nonuniform, with localized geometrical defectse.g. local thickening or thinning of the cross-section (Fig. 99). To make the computation easily tractable, this semi-flexibletube can bemodeled by a compact helix [657], whereas the geometrical defect would be a local decompaction to form ‘‘openspots’’, as illustrated in Fig. 99(d). Note that the helix structure as well as the mechanism that led to it are of no importancefor our purpose, the only critical aspect for our demonstration being that the local degree of compaction will vary along

148 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 101. Depletion potential between two ‘‘opened’’ regions along an helicoidal fiber as a function of the relative angle of the two strands (abscissa) andthe relative shift along the axis of one of the two strands (ordinate). The volume overlap vovl is shown in grey level coding for the following parametervalues relative to the radius of the fiber itself (rf = 1): radius of both helices rh = 3, helix pitches ph = 9 and sphere radius rs = 0.2.Source: Reprinted from Ref. [88].

the fiber to mimic decondensed structural defects [88]. The elastic nature of the tube (or helix) prevents the appearance oftoo high a curvature; consequently the first step in the condensation of the tube is the formation of loops (Fig. 100). Loopformation involves a competition between the bending energy of the tube and the entropic gain of the hard sphere fluid. Thefree energy cost is dominated by elastic energy for small loops and by entropy for large ones. This results in a preferentiallength of (3–4)lp in the worm-like-chain (WLC) model with a characteristic loop-formation dynamics (∼10−2 s) which ismuch faster than the replication time scale (20 min to a few hours) [658–660].

Once a loop is formed (Fig. 100(a) and (b)), contact will be maintained by depletion forces; hence the loop willpreferentially relax through local gliding of the two contact points (Fig. 100(c)). This is where local defects come intoplay: when they meet from this gliding process, they act as local geometrical wells and ‘‘stick’’ together (Fig. 100(d)).This defect-induced stabilization is important since it prevents further depletion mechanisms from occurring. Indeed bymodifying locally the angle of tangent vectors at the contact points, the depletion force could drive them to align in oppositedirections, forming the first turn of an helix or toroidal condensate; alternatively it could align them in the same direction,favoring the formation of hairpins. The presence of defects, by favoring a specific contact geometry, breaks the symmetries(translational, axial) essential to the formation of these compact structures. For example [88], let us consider the overlapvolume of excluded regions induced by the entanglement of a pair of ‘‘opened’’ regions along the inhomogeneously compactfiber of our pedagogical example (Fig. 99(d)). For a given geometry of the defects (number of turns, pitch, radius) and puttingaside the elastic/entropic contribution involved in loop formation, we can limit the description of the depletive potentialin terms of two configuration parameters, namely the angle between the two strands and the relative axial translation (orshift). For a given configuration (i.e. for given values of these two parameters), there is a minimal distance between helixaxes, constrained by contact points. We have computed numerically the overlap of excluded volume vovl for two defectsinvolving 4 complete turns of both helices (Fig. 99(d)). Fig. 101 actually shows vovl; the depletion potential itself dependslinearly on vovl (see Eq. (28)). What should be noticed is the rather sharp potential well centered around 32°; indeed we canobserve a narrow valley that cover angles ranging from about 28° to 36°. Note also that the bottom of the valley varies inshift with the angle, a trivial consequence of the fact that when the two defects are deeply entangled, they can still rotatewith respect to one another if they are allowed to glide axially. In any event, the message here is that a simple geometricaldefect conformation can lead to the presence of these local depletive locks that may select some characteristic inner loopangle in the rosette pattern’s self-organization [30,88].

As illustrated in Fig. 102, the emergence of rosette-like folding of the chromatin fiber in the crowded cell nucleus resultsfrom the aggregation of defects. Here, we characterize the distribution of the number of loops per rosette from minimalparametrization of the system [88]. We consider a dense fluid composed of a large number Ns of identical spheres bathinga tube which, for simplicity, contains N equidistant defects separated by a distance l along the tube. Consistently with theexperimental observation that replicons involved in replication foci are likely to be adjacent to each other on the samechromosomal DNA molecule [425–429], we assume that rosettes are formed while respecting sequential order of defectsalong the tube. If we cannot rule out the possibility of non-local loops, we somehow implicitly hypothesize that the crowdedenvironment, far from fluid, imposes additional mechanical constraints that render the contribution of far-reaching defectcollisions to the overall entropy to increase faster than logarithmically. Let n denote the number of rosettes along the tube;solitary defects are also considered as trivial ‘‘rosettes’’ with zero loops. The case n = N represents the absence of clusteringsince all defects are then solitary. The case n = 1 corresponds to a single large rosette assembling all defects. We separatein a natural fashion the system in two parts, namely the hard sphere fluid and the tube itself [88]. Let F , Ft and Fs denotethe free energies of the system, the tube and the spheres respectively (all of which depend on n) such that we can write

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 149

Fig. 102. Illustration of the spontaneous emergence of rosette-like folding of the chromatin fiber in the crowded environment of the cell nucleus. Thenumber of leaves per rosette fluctuates from one rosette to the next; this is due both to statistical fluctuations and variations in local environment. Fromone cell cycle to the next, leaves can be exchanged between neighboring rosettes.Source: Adapted from Ref. [30].

F = Ft + Fs. The most probable value for n is obtained by equating to zero the chemical potential µr of a rosette:

µr ≡∂F∂n

=∂Ft∂n

+∂Fs∂n

= 0. (25)

The equation of state of a fluid of hard spheres has been extensively studied in the past [661]. We follow the method usedby Dinsmore et al. [662] and make use of the Carnahan-Starling approximation [663]:

Ps(n)Vs(n)NskBT

=1 + ϕ + ϕ2

− ϕ3

(1 − ϕ)3, (26)

where ϕ = Nsvs/Vs(n) is the sphere density, Ps(n) the fluid osmotic pressure, vs the volume of each sphere and Vs(n) is thevolume available to the spheres. The Ft term can be expressed as

Ft = E0 + (N − n) ·1Fl − kBT · log nN

, (27)

where 1Fl > 0 is the free energy cost for the formation of a single loop, and E0 is an energy term assumed to beindependent of n. We appreciate that this assumption may seem an oversimplification; however our goal here is to proposean approximation with minimal complexity that still leads to enough structure such that we can derive a useful result forthe chemical potential µr . This pragmatic approach is motivated by the fact that the parameters that we just introducedcannot be measured with great precision for the time being as we will discuss below. The last term on the right-hand sideof Eq. (27) corresponds to the number of arrangements of n rosettes from N defects and contributes to the entropy of thetube. From Eqs. (26) and (27), and from the thermodynamical identity ∂Fs

∂Vs= −Ps, we get:

µr = −kBT · logN − n

n

−1Fl + kBT ·

vovl

vs

·

ϕ + ϕ2

+ ϕ3− ϕ4

(1 − ϕ)3

, (28)

where vovl = −∂Vs∂n represents the overlapping excluded volume gained for the spheres due to the interaction between two

defects. Thus, this simple model depends on three parameters: the free energy cost of a loop 1Fl, the normalized overlapvolume per loop vovl/vs and ϕ.1Fl can be approximated by the sum of an elastic energy term and an entropic term:

1Fl = 2kBTχ lp/l + ckBT ln(l/lp) . 9kBT , (29)

where l ≃ (3− 4)lp is the length of a typical loop, χ = 7 and c = 3− 4 [658–660]. Physiological values for the hard-spherefluid density Φ = ϕ(n = N) vary between ≃0.2–0.3 [638]. For a 30 nm fiber we can expect vovl ∼ 103 nm3 (see Fig. 101);the typical size of proteins is vs ∼ 102 nm3, leading to values of vovl/vs around 10 [88].

As illustrated in Fig. 103, increasing the sphere densityΦ or the normalized overlap volume results in an increase of theaverage number of loops per rosette (N/n − 1), thus in a more compact structure. On the other hand, increasing the freeenergy cost of loop formation (stiffening of the fiber) decreases the average number of loops. Interestingly, we find that theaverage number of loops per rosette can be regulated by fine-tuning the values of these parameters within physiologicalrange. For instance, for vovl/vs = 10 and1Fl = 9kBT and varyingΦ between 0.28 and 0.29 results in the average number of

150 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 103. Chemical potential vs. the number of loops per rosette (N/n − 1). The thick solid curve serves as a reference and corresponds to the parametervalues:Φ = 0.28,1Fl = 9kBT and vovl/vs = 10. For these values, a mean value of 2.8 loops (•) per rosette is expected. IncreasingΦ to 0.29 (dashed dottedcurve) increases the mean to 7 (); similarly, increasing vovl/vs to 11.5 (dotted curve) results in a mean value of 12.6 (). Increasing1Fl to 10kBT (dashedcurve) reduces the mean to 1 (). Decreasing1Fl to 8kBT (thin solid curve) increases the mean to 7.6 ().Source: Adapted from Ref. [88].

loops per rosette running from 2.8, i.e. low clustering, to 7. This provides attractive scenarios for the spontaneous emergenceof chromatin rosettes in the nucleus prior to their possible stabilization by external factors, e.g. DNA binding proteins(Section 7.2.2). As recently reported by de Nooijer et al. [664], computer simulations based on Molecular Dynamics haveconfirmed that non specific interactions can explain the self-organization of rosette-like structures in Arabidopsis thaliana.Furthermore, these authors have shown that in the absence of chromatin loops, chromosomes mix whereas the formationof loops in the Mbp range proves to be sufficient for chromosome territory stability during the cell cycle. This would implythat chromosome territory formation and stability could in fact also be explained by thermal equilibrium principles, ratherthan being the signature of the slow relaxation kinetics of very long entangled polymers as proposed by Rosa and Everaers[624] for higher eukaryote genomes.

7.3. Genomic DNA codes for open chromatin around ‘‘master’’ replication origins in human cells

In higher-eukaryotes, extensive connections have been established between replication timing, genome organizationand gene transcriptional state; early replication tends to co-localize with active transcription and, in mammals, to gene-dense GC rich isochores [466,512,530,539,665–667] and to transcription initiation early in development [556]. Interestingly,as reported in Section 6, the putative origins at N-domain borders were also shown to be at the heart of a remarkablegene organization [91]: in a close neighborhood, genes are abundant and broadly expressed and their transcriptionis mainly directed away from the borders. This preferential orientation was interpreted in relation to replication forkdirectionality [91]. All these features weaken progressively with the distance to domain borders. Altogether, these resultssuggest that N-domain borders are landmarks of the human genome organization and possible triggers of the replicationprogram; they correspond to early replicating origins separated by large distance (∼1 Mbp) around which replication andtranscription are highly coordinated. Genome-wide investigation of chromatin architecture has revealed that, at large scales(100 kbp–1 Mbp), regions enriched in open chromatin fibers correlate with regions of high-gene density [535] whereas,at small scales (.1 kbp), DNA accessibility and nucleosome distribution and modifications are important determinantfor transcriptional activity [207,212,668–670]. Moreover, there is a growing body of evidence that transcription factorsare regulator of origin activation [671]. In this context, we can reasonably ask to which extent the remarkable genomeorganization observed aroundN-domain borders ismediated byparticular chromatin structure favorable to the specificationof early replication origins. In this subsection, we map experimental and numerical chromatin mark data in the 678replication N-domains (Fig. 104(a) and (b)) and we show that a significant subset of N-domain borders corresponds toparticular open chromatin regions, permissive to transcription, that may have been imprinted in the DNA sequence duringevolution [92].

7.3.1. N-domain borders are hypersensitive to DNase I digestionBy combining DNase sequencing and DNase tiled microarray strategies, high resolution DNase I hypersensitivity (HS)

sites were identified in human primary CD4+ T cells as markers of open chromatin across the genome [670]. The resultinglibrary contains 94925 DNase I HS sites covering 60 Mbp (2.1%) of the human genome. We first observe [92] that, whereasreplication N-domains cover 30% of the human autosomes, only 22.7% of the autosomal DNase I HS sites fall within anN-domain. This is significantly lower than expected if the DNase I HS sites were uniformly distributed along the humangenome (P < 10−15 using a binomial test).Whenmapping DNase I HS sites inside the 678 replicationN-domains previouslyidentified in the human autosomes [92], we observe that the mean site coverage is maximum at the N-domain extremities

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 151

(a)

(b)

(c)

(d)

(e)

Fig. 104. Open chromatin markers along N-domains. (a) Nucleotide compositional asymmetry profile S (Eq. (20)) along a 19Mbp long fragment of humanchromosome 6 that contains 7 replication N-domains (horizontal green lines) bordered by 11 distinct putative replication origins O1 to O11 (vertical greenlines). Each dot corresponds to the compositional asymmetry S calculated in awindow of 1 kbp of repeat-masked sequence. The different colors correspondto: black, intergenic regions; red, (+) genes; blue, (−) genes. (b) The different colored profiles correspond to the DNase I HS score (black, resolution1 kbp), the NFR density (blue, resolution 100 kbp), the CpG o/e (red, resolution 100 kbp) and the replication timing ratio tr (light blue, inhomogeneousspatial resolution ∼300 kbp). The vertical dashed green line marks the location O0 of the closest upward jump to the region where we observe theconcomitant occurrence of a prominent burst in DNase I HS data, NFR numerical density and CpG o/e, in a region where the replication timing ratiois high (tr ∼ 1.6). (c) Correlation between replication timing ratio tr and DNase I HS site coverage (window size = 300 kbp) along the 54 N-domainsidentified in chromosome 6 (r = 0.56, P = 1.3× 10−14); dots are color coded according to CpG o/e value. (d) Correlation between replication timing ratiotr and NFR density (window size = 300 kbp, GC content ≤ 41%) along the 54 N-domains in chromosome 6 (r = 0.41, P = 1.7 × 10−6); dots are colorcoded according to DNase I HS sites coverage. (e) Correlation between CpG o/e and NFR density (window size = 300 kbp, GC content ≤ 41%) along the 678N-domains identified along the 22 human autosomes (r = 0.71, P < 10−15); dots are color coded according to DNase I HS site coverage. In (c, d, e) the3 color code is provided as a lateral color bar; each color corresponds to one third of the data points. (For interpretation of the references to color in thisfigure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [92].

and decreases significantly from the extremities to the center that is rather insensitive to DNase I cleavage (Fig. 105(a)). Thisdecrease extends over ∼150 kbp suggesting that N-domain extremities are at the center of an open chromatin region of∼300 kbp. This is illustrated in the 19 Mbp long fragment of human chromosome 6 containing 7 N-domains (Fig. 104(a))and showing peaks of DNase I hypersensitivity that co-localize within 3 kbp for 7 (O1, O3, O6, O7, O8, O10 and O11) out of the11 distinct N-domain borders (O1 to O11) (Fig. 104(b)). Similar observations weremade for each of the 22 human autosomes(see Supplementary in Ref. [92]).

When examining high resolution replication timing data previously measured in chromosome 6 [512], we observe asignificant correlation between DNase I HS sites coverage and the replication timing ratio showing that DNase I HS sitesare preferentially located in early replicating regions (r = 0.56, P = 1.3 × 10−14; Fig. 104(c)). This is confirmed inFig. 105(a) where the mean DNase I HS sites coverage profiles computed around (i) the 25 earliest replicating (tr > 1.51)and the 25 latest replicating (tr < 1.4) N-domain borders among the 83 putative replication origins that border the 54N-domains predicted along chromosome 6, are compared. In contrast to the earliest borders located within ±150 kbpregions characterized by a high sensitivity to DNase I cleavage, the regions around the 25 latest borders do not display suchan enrichment in DNase I HS sites. This suggests that these putative origins lie in a less accessible chromatin environmentsimilar to the one observed at the N-domains centers which replicate late and where genes are rare and expressed in a fewtissues [91,510]. This is further illustrated by the observation that among the 11 N-domain borders predicted in the human

152 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

(a) (b)

Fig. 105. Mean profiles of DNase I HS sites coverage over the 678 replication N-domains identified in the human genome. (a) DNase I HS site coverage vs.the distance to the closestN-domain border. Black lines correspond to the overall average; orange (resp. light blue) line corresponds to the average over thehalf N-domains bordered by the 25 earliest (resp. the 25 latest) chromosome 6 putative replication origins (out of a total of 83); (b) Same as (a) but whenconditioning the analysis by the GC content; color lines correspond to the average over loci belonging to different isochores — blue lines: G + C < 37%(L1), green lines: 37 < G + C < 41% (L2) and red lines G + C > 41% (H1–3). (For interpretation of the references to color in this figure legend, the readeris referred to the web version of this article.)Source: Adapted from Ref. [92].

chromosome 6 fragment previously examined (Fig. 104(a)), the only 2 (O2 and O9) that are found without any DNase I HSsite in a close neighborhood, actually present low timing ratios (tr = 1.2 and tr = 1.25, respectively), as the signature oflate replication. For comparison, the 7 putative origins that turned out to be hypersensitive to DNase I cleavage are all earlyreplicating with tr & 1.5 [92].

Recently, it has been observed that the density of DNase I HS sites is positively correlated with the GC content, indicatinga significant compositional preference in the accessibility of chromatin to DNase I [672]. We have examined the DNase IHS site coverage around the set of putative replication origins when conditioning the analysis by the GC content, 100 kbpwindows being grouped into three classes according to their GC level (Fig. 105(b)). Consistently with previous results [672],we observe an overall increase of DNase I HS sites coverage with the GC content. Yet, whatever the GC class, a significantdecrease of the DNase I HS sites coverage with the distance to the N-domain borders is robustly observed over a similar±150 kbp distance around these bordering origins. This observation demonstrates that DNase I HS at putative replicationorigins is not a simple consequence of sequence composition [92].

7.3.2. DNA sequence codes for the accumulation of nucleosome-free regions around N-domains bordersPrevious analysis has revealed that promoter regions for protein-coding genes are extremely hypersensitive to DNase I

digestion [670]. These regionswere shown to be nucleosomedepleted [207,212,668–670], verymuch like theNFRs observedat yeast promoters. As reported in Section 3, to a large extent, these NFRs are coded in the DNA sequence via high energybarriers that (i) impair nucleosome formation and (ii) condition the collective nucleosomal organization observed overrather large distances along the chromatin fiber [81–83,86,210]. Here we use the same physical modeling of nucleosomeformation energy based on sequence-dependent bending properties as described in Section 3.2 [81,82]. Since the GC contentof S. cerevisiae is rather homogeneous around 39% as compared to the heterogeneous isochore structure of the humangenome [127,146–152], we restrict our modeling of nucleosome positioning to the light isochores L1 and L2 (GC < 41%).Combining the nucleosome occupancy probability profile and the original energy profile, we identify nucleosome NFRs asthe genomic energy barriers that are high enough to induce a nucleosome depleted region in the nucleosome occupancyprofile (SupportingMaterial, PhysicalModeling in Ref. [92]).When averaging the nucleosome occupancy profiles around thepredicted NFR positions along human light isochores, we observe a striking correlation between the profiles obtained usingour physical modeling and the experimental genome-wide nucleosomemapping reported in Ref. [212], which confirms therelevance of our physical nucleosome modeling to the human genome (Fig. 106). Moreover, the concomitant observationof ∨-shape experimental occupancy profiles coinciding with high energy barriers in the sequence-derived nucleosomalformation energy landscape indicates that the regions depleted in nucleosome in vivo are likely to be encoded, at least tosome extent, in the DNA sequence [92].

The distribution of NFRs along the 678N-domains shows amean density profile that ismaximal atN-domain extremities(∼0.7 NFR/kbp) and that decreases from extremities to center where some NFR depletion is observed (∼0.62 NFR/kbp)(Fig. 107(a)). This decay over a characteristic length scale ∼150 kbp is strikingly similar to that displayed by DNase I HS

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 153

Fig. 106. Nucleosome occupancy profiles around in silico Nucleosome-Free Regions in the human genome [92]. (a) (resp. (b)) Average theoreticalnucleosome occupancy probability (blue) and experimental nucleosome score [212] (purple) around the 1017747 predicted NFRs in low GC contentregions (≤41%) when aligned on their 5′ (resp. 3′) borders. The green bars represent the theoretical NFR predictions (see Supplementary in Ref. [92]).(For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [92].

(a) (b)

Fig. 107. Mean profile of NFR density (GC content <41%) over the 678 replication N-domains identified in the human genome. (a) NFR density vs. thedistance to the closest N-domain border. Black lines correspond to the overall average; brown (resp. purple) line corresponds to the average over locipresenting a high DNase I HS sites coverage >1% (resp. low, <0.2%). (b) same as in (a) but when conditioning the analysis by the GC content; color linescorrespond to the average over loci belonging to different isochores — blue lines: G + C < 37% (L1) and green lines: 37 < G + C < 41% (L2). (Forinterpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [92].

site coverage (Fig. 105(a)). The excess of NFRs at putative replication origins is robustly observed when conditioning theanalysis by the GC content but to a lesser extent in the neighborhood of the GC poorest origins that likely replicate late in Sphase (Fig. 107(b)). Indeed, verymuch like the DNase I HS site coverage (Fig. 104(b)), the in silicoNFR density displays strongcorrelation with the replication timing ratio data in human chromosome 6 (r = 0.41, P = 1.7×10−6; Fig. 104(d)). This is inagreement with the fact that no excess of NFRs is observed at the putative replication origins O2 and O9 that fire late in theS phase (tr = 1.2 and 1.25, respectively) and where no DNase I HS site is found in a close neighborhood (Fig. 104(b)) [92].

Altogether, these results show that the NFR density profile displays the same characteristic increase around N-domainborders, as the experimental DNase I HS site coverage profile. In fact, when comparing theNFR density profiles obtained overloci presenting high (resp. low) DNase I HS site coverage, we confirm that NFR enrichment is correlated to higher sensitivityto DNase I (Fig. 107(a)). If this correlation was expected, the fact that we recover it using a sequence-based modeling ofnucleosome occupancy suggests that putative replication origins that border the N-domains are situated within regionsof accessible open chromatin likely encoded in the DNA sequence via excluding energy barriers that inhibit nucleosomeformation and participate to the collective organization of the nucleosome array [81,82].

154 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

(a) (b)

Fig. 108. Mean profiles of the CpG o/e over the 678 replication N-domains identified in the human genome as a function of the distance to the closestN-domain border. In (a), black line corresponds to the overall average; blue (resp. green) line corresponds to the average over NFR (resp. non-NFR) loci. In(b), colors have the same meaning as in Fig. 105(b).Source: Adapted from Ref. [92].

7.3.3. DNA hypomethylation is associated with N-domain bordersCytosine DNA methylation is a mediator of gene silencing in repressed heterochromatic regions, while in potentially

active open chromatin regions, DNA is essentially unmethylated [673]. DNA methylation is continuously distributed inmammalian genomes with the notable exceptions of CpG islands (CGIs), short unmethylated regions rich in CpGs, and ofcertain promoters and TSS [674]. Since there was no genome wide map of DNAmethylation available, we have investigatedthe distribution of DNA methylation using instead indirect estimators calculated directly from the genomic sequence [92].Methyl-cytosines being hypermutable, prone to deamination to thymines, we have considered the CpG observed/expected(CpG o/e) ratio as an estimator of DNA methylation [675]. Using data from the Human Epigenome Project [676], we haveconfirmed that hypomethylation in sperm corresponds to high values of the CpG o/e outside CGIs (Supporting Materialin Ref. [92]). We have also observed that the hypomethylation level of CGIs extends to about 1 kbp around the annotatedCGIs, so that the sequence coverage by CGIs enlarged 1 kbp at both extremities provided a complementary marker forhypomethylated regions.

When computing CpG o/e after removing CGIs from the analysis along the 19 Mbp long reference fragment of humanchromosome 6, we find that 9 out of the 11 putative replication origins that border the 7 N-domains correspond to a welldefined local maximum of the CpG o/e profile (Fig. 104(b)). Since CpG o/e values are known to be positively correlatedwith the GC content [677], we have determined the CpG o/e profiles for fixed GC content. When averaging over the 678N-domains, the overall mean CpG o/e profile (Fig. 108(a)) as well as the mean CpG o/e profiles obtained for each classof GC content (Fig. 108(b)) present a maximum at origins positions, as the signature of hypomethylation, and decreaseover a characteristic distance ∼150 kbp, similar to the one found for DNase I HS sites coverage and NFR density profiles(Figs. 105 and 107), from the extremities to the center of N-domains where a minimal level of CpG o/e is attained. Thesedata show that the peak of CpG o/e observed around the putative replication origins, even when CpG islands were removedfrom the genome, cannot be attributed to some peculiar GC content environment but more likely to an hypomethylatedopen chromatin state where CpG o/e is correlated with DNase I HS sites coverage (r = 0.35, P < 10−15) and NFR density(r = 0.71, P < 10−15) (Fig. 104(e)). The decrease of the CpG o/e with the distance to N-domain border is robustly observedwhen considering separately NFR and non-NFR regions (Fig. 108(a)) showing that the gradient of hypomethylation over∼150 kbp around N-domain borders is not specific to either of these regions. The correlation measured between CpG o/eand replication timing (r = 0.30, P = 1.1× 10−4, Fig. 104(c)) further indicates that this property is significantly associatedwith the putative replication origins located in early-replicating regions. Complementary analysis using the 1 kbp-enlargedCGI coverage as hypomethylation marker provides exactly the same diagnosis. We observe that each of the 11 N-domainborders presented in Fig. 104(b), corresponds to a peak in the 1 kbp-enlarged CGI coverage profile (data not shown) and thatthe average over the 678N-domains decreases from borders to center over a similar characteristic distance∼150 kbp. Theseobservations are consistent with the hypothesis [678] that CGIs are protected frommethylation due to their co-localizationwith replication origins.

7.3.4. Open chromatin regions around N-domain borders have a characteristic sizeThe decreasing behavior over a characteristic distance of±150 kbp fromN-domain borders common to themeanDNase I

HS site coverage profile, themeanNFR density profile, themean CpGo/e profile and the average 1 kbp-enlarged CGI coverage

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 155

(a) (b)

(c) (d)

Fig. 109. Mean profiles of DNase I HS site coverage (a), NFR density (GC content <41%) (b), CpG o/e (c) and GC content (d) as a function of the distanceto the closest N-domain border over the 678 replication N-domains identified in the human genome, for 3 N-domain size categories: L < 0.8 Mbp (red),0.8 < L < 1.5 Mbp (green) and L > 1.5 Mbp (blue). (For interpretation of the references to color in this figure legend, the reader is referred to the webversion of this article.)Source: Adapted from Ref. [92].

is actually observed whatever the size of the replication N-domains (Fig. 109(a–c)). This contrasts with the GC contentprofile that behaves quite differently (Fig. 109(d)). For N-domains of small size (L < 0.8 Mbp), the GC profile is rather flatall along the domains, whereas for the larger sizes (L > 0.8 Mbp), it decreases very slowly towards the center of the N-domains. These results confirm that the excess of CpG o/e observed around the putative replication origins does not simplyreflect some localized high GC environment but more likely some open chromatin state with a mean characteristic size of∼300 kbp [92].

Chromatin structure has also been analyzed at the fiber level using separation by sucrose gradient sedimentation [535].As shown in Fig. 110(a), the proportion of microarray DNA fragments presenting an open/input ratio >1.5 decreasessignificantly (five fold) from N-domain borders to centers. This result provides additional support for the peculiar propertyof chromatin in the neighborhood of N-domain borders. Note that only close to these borders this proportion of open/inputchromatin >1.5 exceeds (for small N-domains) the mean value ∼0.2 obtained when averaging over the 22 humanautosomes. This means that the central part of the N-domains and a fortiori of the larger split-N-domains (Section 6.4),definitely correspond to heterochromatic gene poor regions (see Fig. 89(c) and (d)). When considering these two sets ofreplication domains, they overall covermore than 50% of the human genome. But as reported in Fig. 110(b), some variabilityis observed when examining each one of the 22 human autosomes. Interestingly, the 5 autosomes that have the largest(>30%) proportions of DNA fragments presenting an open/input chromatin ratio>1.5, namely chromosomes 19, 22, 17, 16and 20, correspond to 5 among the 6 chromosomes that have the lowest coverages byN-domains and split-N-domains. Thisconfirms that our wavelet-basedmethodology to detect N-domains (Section 6.1) has been rather efficient for chromosomesthat are less open than average as for example chromosomes 3, 4, 9 and 13, and where, unlike for the top five most openchromosomes, the characteristic size of the N-domains is significantly larger than the mean gene size (Section 6.1).

7.3.5. Open chromatin around N-domain borders are potentially fragile regions involved in chromosome instabilitySince chromatin accessibility and openness are possible factors responsible for fragility and instability,N-domain borders

could also play a key role in genome dynamics during evolution and genome instability in pathologic situations like cancer.Taking advantage of recent progress in the detection of rearrangement breakpoints in mammalian chromosomes [679],

156 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

(a) (b)

Fig. 110. Proportion of DNA fragments presenting a ratio of ‘‘open’’ over input chromatin greater than 1.5 vs. (a) the distance to the closest N-domainborder and (b) the coverage by skew domains (including N- and split-N-domains) for each of the 22 human autosomes. In (a), colors correspond to threeN-domain size categories: L < 0.8 Mbp (red), 0.8 < L < 1.5 Mbp (green) and L > 1.5 Mbp (blue). (For interpretation of the references to color in thisfigure legend, the reader is referred to the web version of this article.)Source: Panel (a) is adapted from Ref. [92].

we have recently studied their distribution along the human genome relatively to the position of genes, to the isochoreorganization and to the segmentation by replication N-domains [680]. As shown in Fig. 111, the available set of 622 high-resolution non-intersecting breakpoint regions fall more frequently nearN-domain borders than in their central regions andthe way the breakpoint density decreases from the borders displays the same characteristic scale ∼±150 kbp as previouslyobserved for the different open chromatin markers. This suggests that the distribution of large-scale rearrangementsin mammals reflects a mutational bias towards regions of high transcriptional activity and replication initiation [680].Furthermore, the fact that chromosome anomalies involved in the tumoral process like at the RUNX1T1 oncogene locus (seeSupplementary Fig. S6 in Ref. [92]) coincide with replication N-domain extremities raises the possibility that the replicationorigins detected in silico are potential candidate loci susceptible to breakage in some cancer cell types [92].

7.4. Master replication origins at the heart of the chromatin tertiary structure

7.4.1. Recent experimental replication origins mapped on ENCODE confirm N-domain border predictionsPrevious analyses of nucleotide strand compositional asymmetries have shown that, out of the 9 experimentally

identified replication origins, 7 (78%) presented an upward jump in the asymmetry profile analog to those bordering N-domains (Section 5.3) [89,90]. Recently, the localization of replication origins has been experimentally investigated along1% of the human genome (ENCODE regions) by hybridization to Affymetrix ENCODE tiling arrays of purified small nascentDNA strands and of restriction fragments containing small replication bubble [477,479]. Out the 7 N-domain borders thatreside within an ENCODE region, 4 match an experimental replication origin (P = 4 × 10−3): 3 to the bubble trappingdataset (Bubble-HL, P = 0.017) and 1 to a nascent strand purification dataset (NS-HL-2, P = 0.09) (Table 10). Actually,as previously noted [479], a second N-domain border is located within 1 kbp of a NS-HL-2 origin (Table 10). Hence, thereis direct experimental evidence that 5/7 (71%) N-domain borders correspond to active replication origins at a few kbpresolution. These results are all the more significant considering the rather low overlap between the experimental datasets.For example, only 69 (25%) of the NS-HL-2 origins overlap with a Bubble-HL origins, only 4 (1.4%) with the second nascentstrand dataset in HeLa cells (NS-HL-1) and only 12 (4.3%) even when extending NS-HL-1 and NS-HL-2 origins by 1 kbp onboth sides.

7.4.2. N-domain borders: a subset of ‘‘master’’ replication origins at the heart of the spatio-temporal program of replicationIn metazoans, recognition of replication origins by the origin recognition complex (ORC) does not involve a simple

consensus DNA sequence. Initiation sites do not share common genetic entities but seem to be favored by various factorsthat can differ from one origin to another and be required or dispensable under different conditions [463,473]. Specificationof initiation sites can be favored by negatively supercoiled DNA [681] (possibly resulting from the removal or displacementof nucleosomes), by interacting proteins that chaperone ORC to specific chromatin sites [682], by the transcriptionalactivity [541] or by open chromatin to which ORC might bind in a non-specific way [683]. Our finding in Section 7.3 showthat the putative early replication origins located at N-domain borders are likely to be specified by the ∼300 kbp wideregion of open chromatin that surrounds this peculiar subset of origins that will be further qualified as ‘‘master’’ replication

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 157

Fig. 111. Mean profiles of DNase I HS site coverage (black), NFR density (GC content<41%, blue), CpG o/e (red) and evolutionary breakpoint density (green)as a function of the distance to the closest N-domain border, after masking CGIs and genes extended by 2 kbp at both extremities [92]. (For interpretationof the references to color in this figure legend, the reader is referred to the web version of this article.)

Table 10Correspondence between N-domain borders and experimental replication origins datasets along ENCODE regions [92]. Main characteristics of replicationorigin prediction along ENCODE regions based on purified restriction fragments containing replication bubble (Bubble) [525] or purified small nascentstrands (NS) [466]. First column indicates the experimental method (Bubble or NS) and the cell type (HL: HeLa cells and GM: GM06990 cell lines). Theyare two independent NS-HL datasets labeled 1 and 2. ‘‘All’’ corresponds to the four initial datasets considered together. NS + 1 kbp-HL-2 corresponds tothe NS-HL-2 dataset when extending replication origins by 1 kbp on both sides. +Bublle-HL corresponds to merging the NS + 1 kbp-HL-2 and Bubble-HLdatasets. We provide the number of replication origins, their total coverage of ENCODE regions, the number of N-domain borders out of 7 within ENCODEregions that match with one of the experimental replication origins and the corresponding P-value using a binomial test (P-values<0.02 are in bold).

Method Number Coverage (%) Match with N-domain borders (P-value) Reference

Bubble-HL 234 8.6 3 (0.017) [477]NS-GM 758 1.0 0 [477]NS-HL-1 434 0.6 0 [477]NS-HL-2 282 1.4 1 (0.093) [479]

All 11.0 4 (0.004)NS + 1 kbp-HL-2 3.2 2 (0.019)+Bublle-HL 11.2 5 (0.0003)

origins [92]. Interestingly, location O0 in Fig. 104 that was not identified as a N-domain border but displays all these openchromatin characteristics, indeed corresponds to a sharp upward jump in the skew profile as the hallmark of the presence ofa ‘‘master’’ replication origin. Such a strong gradient of accessible and open chromatin environment is not observed around alarge fraction of the 283 replication origins experimentally identified along ENCODE regions [479]. Actually, besides a strongassociationwith CGIs, only 29% overlap a DNase I hypersensitive site and half of these origins do not present open chromatinepigenetic marks and are not associated with active transcription [92]. Furthermore, the typical inter-origin distance in thehuman somatic cells has been estimated to be of the order of 50-100 kbp [479,684], a value significantly smaller than thetypical size (∼1 Mbp) of N-domains (Fig. 72, Section 6.1) [91,509]. As illustrated in Fig. 112, we are currently working,in collaboration with A. Goldar and O. Hyrien, on a more refined model of the spatio-temporal replication program wherereplication first initiates at ‘‘master’’ origins atN-domain borders in an open chromatin environment, followed by successiveactivations of secondary origins within N-domains according to the observed timing profile (see Figs. 74(d) and 83). Onepossible mechanism for this is that secondary origins are remotely activated by the approach of a center oriented fork.This ‘‘saltatory transmission’’ of origin activation could explain why replication progress from N-domain borders muchfaster (3–5 times) than the known speed of single fork obtained by DNA combing techniques [685,686]. Furthermore,the linear decline of the skew along N-domain is likely to reflect a progressive change in the proportion of center- andborder- oriented forks that itself reveals the dynamic patternwithwhich secondary initiations occurwithinN-domains. Theresults of preliminary numerical simulations based on previousmodeling studies [553,554,687,688] of replication in severaleukaryotic species look very encouraging. It will be interesting to analyze to which extent chromatin state influences forkprogression and secondary initiations and whether outside of N-domains, the genome replicates according to a similar orcompletely different set of rules.

7.4.3. Unifying replication foci and transcription factoriesThe results reported in Section 7.3 show that the ∼300 kbp wide regions of open chromatin around the putative

replication origins that border N-domains are good candidates as intrinsic structural defects (burst of ‘‘openness’’) in the30 nm chromatin fiber [92]. Actually these regions of higher accessibility are to some extent encoded in the DNA sequence

158 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. 112. A cascade model for replication of N-domains [689]. Replication initiates at master origins (ORI) at N-domain borders followed by successivefiring of secondary origins (ori) which explains why replication is observed to progress faster than single forks. According to replication timing data [529],the dynamic pattern with which secondary initiations occur predicts a progressive change in average fork polarity that matches the progressive inversionof the skew across N-domains.

(a) (b)

Fig. 113. Mean profile of Pol II (a) and H3K4me3 (b) Chip-Seq tag density ±2 kbp around TSS vs. the distance to the closest N-domain border [92]. Colorscorrespond to three N-domain size categories: L < 0.8Mbp (red), 0.8 < L < 1.5Mbp (green) and L > 1.5Mbp (blue). (For interpretation of the referencesto color in this figure legend, the reader is referred to the web version of this article.)Source: Adapted from Ref. [92].

via an enrichment in nucleosome excluding energy barriers (NFRs) (Fig. 107). In that respect, these master replicationoriginsmight play a central role in the self-organization of eukaryotic chromatin into rosette-like structures according to theentropy-driven scenario described in Section 7.2.3 [30,88]. Since the previously investigated chromatin markers have beenassociated, at least to some extent, with genes (e.g. 16% of all DNase I HS sites are in the first exon or at the TSS of a gene,and 42% are found inside a gene [670]), we have reproduced the analysis of their distribution along the N-domains aftermasking the genes extended by 2 kbp at both extremities and the CGIs. As shown in Fig. 111, the fact that the mean DNase IHS site coverage, NFR density and CpG o/e profiles still present the decaying behavior over∼150 kbp, demonstrates that theexcess observed around the putative replication origins does not simply reflect the rather packed gene organization at theN-domain borders (Fig. 86(b)). Additionally, when investigating the transcriptional potential of genes as a function of theirdistance to N-domain borders, we observe in Fig. 113 that Pol II binding and H3K4me3 Chip-Seq tag density ±2 kbp [668]around TSS, also present a strongly decaying mean profile over a length of ∼150 kbp from N-domain borders to center(five-fold and three-fold respectively). The presence of these two marks at the TSS have been shown to correlate with geneactivity [668]. These results thus indicate that the open chromatin regions around putative replication origins are prone totranscription, whereas N-domain central regions appear transcriptionally silent [92].

Remark. The fact that someN-domain borders (like O2 in Fig. 104) do not present an open chromatin signature nor an earlyreplication timing in the cell line used experimentally, but still exhibit a sharp upward jump in the skew profile, raises thequestion of whether they have a different status or are associated with open chromatin and early replication like the othersonly in the germline. To address this issue, we are currently investigating open chromatin marker and replication timingdata in different cell lines.

Altogether the results reported in Sections 7.3 and 7.4 suggest a rather unifying picture of the tertiary chromatin structurewhere ‘‘master’’ replication origins will be at the heart of a compartmentalization of the human genome into multi-looped

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 159

structural domains that are likely to account for both replication foci [414,425–427] and transcription factories [26,642,645,646] (Section 7.2.2). According to the clustering mechanism described in Section 7.2.2, the spatial proximity in the cellnucleus of replication origins and active gene promoters will contribute to increase significantly the local concentration ofDNA and RNA polymerases that in turn will participate in maintaining these super-structures dynamically stable duringthe cell cycle. Note that this unified understanding of replication foci and transcription factories provides an alternativeinterpretation of the gene organization observed around N-domain borders in Section 6.3 [91]. The observed excess of R+genes oriented in the same direction as the putative replication fork progression (Fig. 86(c)) might simplymean that for long(possibly housekeeping) genes (&150 kbp) to have an active promoter, better to be well oriented R+ so that their promoterbe located in the accessible open chromatin environment surrounding themaster replication origins that borderN-domains.Further investigation of the gene organization and orientation inside and outsideN-domains is necessary to corroborate thisintimate link between replication and transcription via the subset of putative replication origins that we have qualified, forthat very peculiar purpose, as ‘‘master’’ origins [92].

8. Concluding remarks

In a recent past, theDNAdouble helixwas simply considered as a biologicalmacromolecule (a polymer)whose nucleotidesequence codes for our genes. The regulation and control of DNA replication and gene transcription was supposed to befully delegated to proteins. Nowadays, DNA is more andmore recognized as a complex heteropolymer whose structural andmechanical properties play a relevant part in the management of gene information. The results reported in this reviewdemonstrate that there is a lot of information encoded in the DNA sequences concerning the different stages of DNAcompaction inside the cell nucleus (Fig. 1). Surprisingly, if the local structural and mechanical properties of the double helixare ‘‘written’’ in the sequence, which in turn influences the formation and dynamics of the nucleosomal string, our resultsshow that the sequence also conditions to some extent the next compaction levels. When playing on the magnificationof our WT microscope, we have seen that at large scales (&30–40 kbp), the primary sequence still contains structural andmechanical information, no longer on the DNA double helix, but rather on the 30 nm chromatin fiber and its propensityto form loops and multi-loop (rosette) patterns that are likely to correspond to structural domains of autonomous DNAreplication and gene expression (replication foci, transcription factories). Since introns and intergenic regions constitutemore than 95% of the human genome, our study therefore contributes to give a role to the non-coding regions in eukaryoticgenomes. These regions actually play a driving role in the condensation and decondensation processes of the chromatinarchitecture as well as in many related regulatory functions.

The results reported here open new perspectives in DNA sequence and genomic data analysis, in modeling as well as inexperiment. From a methodological point of view, they underline the need and potential benefits of a multi-scale approachfor improving our understanding of chromatin-mediated regulation of transcription and replication. Indeed, joint evolutionand adaptation of the different levels of chromatin organization (DNA, nucleosome, fiber, chromatin loop) (Fig. 1) haveled to a situation where all these levels are consistently coupled functionally and cannot be investigated separately. Wehave seen how the lowest levels can condition the organization at larger scales; conversely the upper levels via (Mbp) longchromatin loops, control back and even participate to the shaping of the smaller-scale levels. This explains that regulatorymechanisms involving the chromatin structure also offer numerous ways of epigenetic control (e.g. histone variants andpost-translational modifications) that may amplify direct ‘‘physical by-products’’ of the genomic sequence [268,609]. Aconsistent framework bridging scales and levels of organization from DNA to chromosome is therefore desirable to get anintegrated picture of nuclear functions. Various scientific projects aiming at deciphering the mechanisms underlying howinformation flows across scales during the different stages of the cell cycle, are in current progress.

Acknowledgements

The material reported in this review corresponds to more than fifteen years of collaborative work. We want to thank E.Bacry, A. Baker, L. Berguiga, P. Bouvet, E.-B. Brodie of Brodie, C.L. Chen, G. Chevereau, L. Duquenne, J. Elezgaray, C. Faivre-Moskalenko, L. Farinelli, E. Fontaine, C. Gautier, A. Goldar, P.V. Graves, G. Guilbaud, Z. Haftek-Terreau, M. Huvet, O. Hyrien, P.Kestener, A. Khalil, G. Lavorel, C. Lemaitre, M. Marilley, P. Milani, K. Mills, F. Mongelard, J. Moukhtar, J.-F. Muzy, S. Nicolay, L.Palmeira, A. Rapailles,M.-F. Sagot, P. St-Jean, E. Tannier,M. Touchon, L. Zaghloul and S.-J. Zhang for their participation to someof the reported scientific projects. We would like to acknowledge all scientists and colleagues with whom we had and stillhave stimulating and fruitful discussions and interactions on this rapid growing field of research.We are also very grateful toall themembers of the Joliot-Curie laboratory at the EcoleNormale Supérieure of Lyon (http://www.ens-lyon.fr/Joliot-Curie/)for their technical help and support aswell as for creating such an enthusiastic and creative ambiance. Thismultidisciplinaryplatform gathers theoretical and experimental physicists together with cellular and molecular biologists who share acommon interest in the study of the structure and dynamics of chromatin in relation with the regulation of DNA replicationand gene expression. With a rare combination of expertise present in this laboratory: biochemistry and molecular biology,experimental physics (near-fieldmicroscopywith the very promising high resolution Scanning Surface PlasmonMicroscopy(SSPM) non-intrusive optical set-up [690–693] that allows imaging of biologicalmacromolecules in a liquidmediumwithoutthe need of fluorescentmarkers), theoretical physics, bioinformatics, signal and image processing,we hope to keep followingthe track that has been taking shape for the past decade leading to the results reviewed in this report.

160 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

This research was supported by the GIP GREG (project ‘‘Motif dans les séquences’’, 1996), the Ministère de l’EducationNationale, de l’Enseignement Supérieur, de la Recherche et de l’Insertion Professionnelle, ACC-SV (project ‘‘Génétique etEnvironnement’’, 1998), the Action Bioinformatique of the CNRS (project ‘‘Ecriture fractale des génomes et propriétésstructurales de l’ADN-Construction de génomes in silico’’, 2000), the Conseil Régional Rhônes-Alpes (project ‘‘Etudestatistique et dynamique des macromolécules biologiques’’, Emergence 2002), the Progamme Bioinformatique inter EPST(project ‘‘Détection de signaux dans les génomes par une approche énergétique des interactions ADN-protéines’’, 2002), theAction Concertée Incitative Informatique, Mathématiques, Physique et Biologie Moléculaire (project ‘‘Réplicor: Détection insilicodes origines de réplication chez les eucaryotes supérieurs: analyse et évolution’’, ACI IMPBio 2004), the Conseil RégionalRhônes-Alpes (project ‘‘Le rôle de la séquence sur la structure et la dynamique de la chromatine’’, Emergence 2005), the PAITournesol (2006), and the Agence Nationale de la Recherche ‘‘Programme Physique et Chimie du Vivant’’ (project ‘‘DNAnucl:Influence de la séquence ADN sur la structure et la dynamique du nucléosome’’, PCV 2006 and project ‘‘HUGOREP: originesde réplication chez les métazoaires: caractérisation et nouvelles approches à l’échelle du génome’’, PCV 2005).

Appendix A. A wavelet-based multifractal formalism: the wavelet transformmodulus maxima method

The continuous wavelet transform (WT) is a mathematical technique introduced in signal analysis in the early 1980s[33–35]. Since then, it has been the subject of considerable theoretical developments and practical applications in awide variety of fields [28,37–39,41–55]. The WT has been early recognized as a mathematical microscope that is welladapted to reveal the hierarchy that governs the spatial distribution of singularities of multifractal measures [36,40,694].What makes the WT of fundamental use in the study of ‘‘DNA walks’’ in Section 2, is that its singularity scanning abilityequally applies to singular functions than to singular measures [36,40,694–700]. Taking advantage of this property, aunified thermodynamic description of multifractal distributions including measures and functions, the so-called WaveletTransform Modulus Maxima (WTMM) method has been developed [457–459,701,702]. By using wavelets instead of boxesor increments, one can take advantage of the freedom in the choice of these ‘‘generalized oscillating boxes’’ to get rid ofpossible (smooth) polynomial behavior that might either mask singularities or perturb the estimation of their strength h(Hölder exponent), remedying in this way for one of the main failures of the classical multifractal methods (e.g. the box-counting algorithms in the case of measures and the structure function method in the case of functions [457–459,702]). Theother fundamental advantage of using wavelets is that the skeleton defined by the WTMM [699,700], provides an adaptivespace-scale partitioning fromwhich one can extract the D(h) singularity spectrum via the Legendre transform of the scalingexponents τ(q) (q real, positive as well as negative) of some partition functions defined from the WT skeleton [28,51]. Werefer the reader to Bacry et al. [457], Jaffard [703,704] for rigorous mathematical results and to Hentschel [705] for thetheoretical treatment of random multifractal functions. Note that alternative approaches to the WTMM method have beendeveloped using discrete wavelet bases [706–709], including the recent use of wavelet leaders [710–712].

Applications of theWTMMmethod to 1D signals have already provided insights into a wide variety of problems [29], e.g.,the validation of the log-normal cascade phenomenology of fully developed turbulence [713–716] and of high-resolutiontemporal rainfall [717–719], the characterization and the understanding of long-range correlations in DNA sequences(Section 2) [14–16,113,124,130], the demonstration of the existence of a causal cascade of information from large to smallscales in financial time series [720,721], the use of the multifractal formalism to discriminate between healthy and sickheartbeat dynamics [722,723], the discovery of a Fibonacci structural ordering in 1D cuts of diffusion limited aggregates(DLA) [724–727]. The canonical WTMM method has been further generalized from 1D to 2D [728,729] with the specificgoal to achieve multifractal analysis of rough surfaces with fractal dimensions DF anywhere between 2 and 3 [730–732]. The 2D WTMM method has been successfully applied to characterize the intermittent nature of satellite imagesof the cloud structure [733,734], to perform a morphological analysis of the anisotropic structure of atomic hydrogen(HI ) density in Galactic spiral arms [735], to assist in the diagnosis in digitized mammograms [736] and to perform anautomatic segmentation of chromosome territories in confocal cell nucleus imaging [737,738]. We refer the reader toArneodo et al. [739] for a review of the 2DWTMMmethodology, from the theoretical concepts to experimental applications.Recently, the WTMMmethod has been further extended to 3D scalar [740] as well as 3D vector [741,742] field analysis andapplied to 3D (velocity, vorticity, dissipation, enstrophy) numerical data issued from direct numerical simulations (DNS)of incompressible Navier–Stokes equations [741,742]. Because it combines singular value decomposition and multifractaldescription, the so-called tensorial wavelet transform modulus maxima method for vector fields looks very promising forfuture simultaneous multifractal and structural (velocity sheets, vorticity filaments) analysis of turbulent flows [741,742].

In this Appendix, we summarize the main steps of the WTMMmethod for the multifractal analysis of 1D signals like theDNA walks in Section 2.

A.1. The continuous wavelet transform

The WT is a space-scale analysis which consists in expanding signals in terms of wavelets which are constructed froma single function, the ‘‘analyzing wavelet’’ ψ , by means of translations and dilations. The WT of a real-valued function f isdefined as [33–35]:

Tψ [f ](x0, a) =1a

∫+∞

−∞

f (x)ψx − x0

a

dx, (A.1)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 161

Fig. A.1. Set of analyzing wavelets ψ(x) that can be used in Eq. (A.1). (a) g(1) and (b) g(2) as defined in Eq. (A.3). (c) ψN as defined in Eq. (A.7) and used inSection 6 to detect replication N-domains. (d) Box function φT used in Section 6 to model step-like skew profiles induced by transcription.

where x0 is the space parameter and a (>0) the scale parameter. The analyzing wavelet ψ is generally chosen to be welllocalized in both space and frequency. Usually ψ is required to be of zero mean for the WT to be invertible. But for theparticular purpose of singularity tracking that is of interest here, we will further require ψ to be orthogonal to low-orderpolynomials [457–459,696–702]:∫

+∞

−∞

xmψ(x)dx, 0 ≤ m < nψ . (A.2)

As originally pointed out by Mallat and collaborators [699,700], for the specific purpose of analyzing the regularity of afunction, one can get rid of the redundancy of the WT by concentrating on the WT skeleton defined by its modulus maximaonly. These maxima are defined, at each scale a, as the local maxima of |Tψ [f ](x, a)| considered as a function of x. TheseWTMM are disposed on connected curves in the space-scale (or time-scale) half-plane, called ‘‘maxima lines’’ (see Fig. B.1(e,f)). Let us define L(a0) as the set of all the maxima lines that exist at the scale a0 and which contain maxima at any scalea ≤ a0. An important feature of these maxima lines, when analyzing singular functions, is that there is at least one maximaline pointing towards each singularity [458,699,700].

A.2. Analyzing wavelets

There are almost as many analyzing wavelets as applications of the continuousWT [29,36,40,458,459,694]. A commonlyused class of analyzing wavelets is defined by the successive derivatives of the Gaussian function:

g(N)(x) =dN

dxNe−x2/2, (A.3)

for which nψ = N and more specifically g(1) and g(2) that are illustrated in Fig. A.1(a,b). Note that the WT of a signal f withg(N) (Eq. (A.3)) takes the following simple expression:

Tg(N) [f ](x, a) =1a

∫+∞

−∞

f (y)g(N)y − xa

dy,

= (−a)NdN

dxNTg(0) [f ](x, a). (A.4)

162 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Eq. (A.4) shows that theWT computed with g(N) at scale a is nothing but the Nth derivative of the signal f (x) smoothed by adilated version g(0)(x/a) of the Gaussian function. This property is at the heart of various applications of theWTmicroscopeas a very efficient multi-scale singularity tracking technique [29].

One of the main advantages of the WT is its adaptive ability to perform time–frequency analysis [43,51] when usingcomplex analyzing wavelets like the Morlet’s wavelet:

ψM(x) =1

√2π

eiωxe−x2/2

−√2e−ω2/4e−x2

, (A.5)

where the second term in the r.h.s. is negligible for large ω values (ω & 5). The ‘‘scale-spectrum’’ of a signal S of total lengthL is defined as:

Λ(a) =1L

∫ L

0

TψM [S](x, a) dx. (A.6)

With the specific goal of disentangling the contributions to the nucleotide composition strand asymmetry comingrespectively from transcription and replication processes, we use in Section 6, an adapted analyzing wavelet called N-letbecause of its shape that looks like the letter N (Fig. A.1(c)) [32,509]:

ψN(x) = −xχ[−1,1](x), (A.7)where χ[−1,1] is the characteristic function of the interval [−1, 1]. By performing multi-scale pattern recognition in the(space, scale) half-plane with this N-let, we are able to achieve some segmentation of the human and more generallymammalian genomes into replication domains bordered by putative replication origins [32,91,509,510].

A.3. Scanning singularities with the wavelet transform modulus maxima

The strength of the singularity of a function f at point x0 is given by the ‘‘Hölder exponent’’, i.e., the largest exponent suchthat there exists a polynomial Pn(x − x0) of order n < h(x0) and a constant C > 0, so that for any point x in a neighborhoodof x0, we have [457,458,696–700]:

|f (x)− Pn(x − x0)| ≤ C |x − x0|h. (A.8)If f is n times continuously differentiable at the point x0, then one can use for the polynomial Pn(x − x0), the order-n Taylorseries of f at x0 and prove that h(x0) > n. Thus h(x0)measures how irregular the function f is at the point x0. The higher theexponent h(x0), the more regular the function f .

The main interest in using the WT for analyzing the regularity of a function lies in its ability to be blind to polynomialbehavior by an appropriate choice of the analyzing waveletψ . Indeed, let us assume that according to Eq. (A.8), f has, at thepoint x0, a local scaling (Hölder) exponent h(x0); then, assuming that the singularity is not oscillating [700,743–745], we caneasily prove that the local behavior of f is mirrored by the WT which locally behaves like [457–459,696–704]:

Tψ [f ](x0, a) ∼ ah(x0), a → 0+, (A.9)provided nψ > h(x0), where nψ is the number of vanishingmoments ofψ (Eq. (A.2)). Therefore one can extract the exponenth(x0) as the slope of a log–log plot of theWT amplitude versus the scale a. On the contrary, if we choose nψ < h(x0), theWTstill behaves as a power-law but with a scaling exponent which is nψ :

Tψ [f ](x0, a) ∼ anψ , a → 0+. (A.10)Thus, around a given point x0, the faster the WT decreases when the scale goes to zero, the more regular f is around thatpoint. In particular, if f ∈ C∞ at x0 (h(x0) = +∞), then the WT scaling exponent is given by nψ , i.e. a value which isdependent on the shape of the analyzing wavelet. According to this observation, we can hope to detect the points where fis smooth by just checking the scaling behavior of theWTwhen increasing the order nψ of the analyzing wavelet [457–459,701,702].

A very important point (at least for practical purpose) raised byMallat and Hwang [699] is that the local scaling exponenth(x0) can be equally estimated by looking at the value of theWTmodulus along amaxima line converging towards the pointx0. Indeed we can prove that both Eq. (A.9) and (A.10) still hold when following a maxima line from large down to smallscales [699,700].

A.4. The wavelet transform modulus maxima method for multifractal analysis

A.4.1. Singularity spectrumAs originally defined by Parisi and Frisch [746], the multifractal formalism of multi-affine functions amounts to comput-

ing the so-called ‘‘singularity spectrum’’ D(h) defined as the Hausdorff dimension of the set where the Hölder exponent isequal to h [457–459]:

D(h) = dimHx, h(x) = h, (A.11)where h can take, a priori, positive as well as negative real values (e.g., the Dirac distribution δ(x) corresponds to the Hölderexponent h(0) = −1) [703].

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 163

A.4.2. The wavelet transform modulus maxima methodA natural way of performing amultifractal analysis of fractal functions consists in generalizing the ‘‘classical’’ multifractal

formalism [747–751] using wavelets instead of boxes or increments. By taking advantage of the freedom in the choice of the‘‘generalized oscillating boxes’’ that are the wavelets, we can hope to get rid of possible smooth behavior that could masksingularities or perturb the estimation of their strength h. But the major difficulty with respect to box-counting techniques[750,752–756] for singular measures, consists in defining a covering of the support of the singular part of the function withour set of wavelets of different sizes. As emphasized in Refs. [457–459,701,702], the branching structure of theWT skeletonsof fractal functions in the (x, a) half-plane enlightens the hierarchical organization of their singularities (see Fig. B.1(e, f)).The WT skeleton can thus be used as a guide to position, at a considered scale a, the oscillating boxes in order to obtaina partition of the singularities of f . The wavelet transform modulus maxima (WTMM) method amounts to compute thefollowing partition function in terms of WTMM coefficients [457–459,701,702]:

Z(q, a) =

−l∈L(a)

sup(x,a′)∈la′≤a

|Tψ [f ](x, a′)|

q

, (A.12)

where q ∈ R and the sup can be regarded as a way to define a scale adaptive ‘‘Hausdorff-like’’ partition. Now from thedeep analogy that links the multifractal formalism to thermodynamics [459,754], we can define the exponent τ(q) from thepower-law behavior of the partition function:

Z(q, a) ∼ aτ(q), a → 0+, (A.13)

where q and τ(q) play respectively the role of the inverse temperature and the free energy. The main result of this wavelet-based multifractal formalism is that in place of the energy and the entropy (i.e. the variables conjugated to q and τ ), one hash, the Hölder exponent, and D(h), the singularity spectrum. This means that the singularity spectrum of f can be determinedfrom the Legendre transform of the partition function scaling exponent τ(q) [457,703,704]:

D(h) = minq(qh − τ(q)). (A.14)

A.4.3. Monofractal versus multifractal functionsFrom the properties of the Legendre transform, it is easy to see that ‘‘homogeneous’’ fractal functions that involve

singularities of unique Hölder exponent h = ∂τ/∂q, are characterized by a τ(q) spectrumwhich is a linear function of q (seeFig. B.2(c)). On the contrary, a nonlinear τ(q) curve is the signature of nonhomogeneous functions that exhibit ‘‘multifractal’’properties, in the sense that the Hölder exponent h(x) is a fluctuating quantity that depends upon the spatial position x (seeFig. B.2(c)). As illustrated in Fig. B.2(d), the D(h) singularity spectrum of a multifractal function displays a single humpedshape that characterizes intermittent fluctuations corresponding to Hölder exponent values spanning a whole interval[hmin, hmax], where hmin and hmax are the Hölder exponents of the strongest and weakest singularities respectively.

For more details on the 1D, 2D and 3DWTMMmethod, we refer the reader to Ref. [757].

Appendix B. Test applications of the wavelet transform modulus maxima method on monofractal and multifractalsynthetic random signals

This Appendix is devoted to test applications of the WTMMmethod to random functions generated either by ‘‘additive’’models like fractional Brownianmotions [758,759] or by ‘‘multiplicative’’models like randomW-cascades onwavelet dyadictrees [713,714,760,761]. For each model, we first wavelet transform 1000 realizations of length L = 65 536 with the firstorder (nψ = 1) analyzing wavelet g(1). From the WT skeletons defined by the WTMM, we compute the mean partitionfunction (Eq. (A.12)) fromwhichwe extract the annealed τ(q) (Eq. (A.13)) and, in turn, D(h) (Eq. (A.14)) multifractal spectra.We systematically test the robustness of our estimates with respect to some change of the shape of the analyzing wavelet,in particular when increasing the number nψ of zero moments, going from g(1) to g(2) (Eq. (A.3)).

B.1. Fractional Brownian signals

Since its introduction by Mandelbrot and Van Ness [758], the fractional Brownian motion (fBm) has become a verypopular model in signal and image processing [28,56–79]. In 1D, fBm has proved useful for modeling various physicalphenomena with long-range dependence, e.g., ‘‘1/f ’’ noises. The fBm exhibits a power-law spectral density:

S(ω) ∼ 1/ωβ , with β = 2H + 1, (B.1)

where ω is the frequency and the spectral exponent β is related to the Hurst exponent H . fBm has been extensively usedas test stochastic signals for Hurst exponent measurements. The performances of classical methods (e.g., height–heightcorrelation function, variance andpower spectralmethods, first return andmulti-return probability distributions,maximum

164 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. B.1. WT of monofractal and multifractal stochastic signals. Fractional Brownian motion: (a) A realization of B1/3 (L = 65 536); (c) WT of B1/3 as coded,independently at each scale a, using 256 colors from black (|Tψ | = 0) to red (maxb |Tψ |); (e) WT skeleton defined by the set of all the maxima lines. Log-normal random W-cascades: (b) A realization of the log-normal W-cascade model (L = 65536) with the following parameter valuesm = −0.355 ln 2 andσ 2

= 0.02 ln 2 (see Ref. [761]); (d) WT of the realization in (b) represented with the same color coding as in (c); (f) WT skeleton. The analyzing wavelet isg(1) (Eq. (A.3)) (see Fig. A.1(a)). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

likelihood techniques) [129,762–770] have been recently competed by wavelet-based techniques [771–781]. A fBm BH(x)indexed by H ∈]0, 1[, is a Gaussian process of mean value 0 and whose correlation function is given by

⟨BH(x)BH(y)⟩ =σ 2

2

|x|2H + |y|2H − |x − y|2H

, (B.2)

where ⟨· · ·⟩ represents the mean value. The variance of such process is

var(BH(x)) = σ 2|x|2H . (B.3)

The classical Brownian motion corresponds to H = 1/2 and to a variance var(B1/2(x)) = σ 2|x|. We can easily show that the

increments of a fBm over a distance l:

δBH,l = BH(x + l)− BH(x) (B.4)

are stationary. Indeed, the correlation function depends only on1x = x − y and l:

CH,l(1x) = ⟨δBH,l(x)δBH,l(x +1x)⟩ =σ 2

2

|1x + l|2H + |1x − l|2H − 2|1x|2H

,

∼1x/l≫1

H(2H − 1)1x

l

2H−2

. (B.5)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 165

ForH = 1/2,we recover the fact that the classical Brownianmotion has uncorrelated increments: CH,l(1x) = 0 for |1x| > l.For H = 1/2, CH,l(1x) decays very slowly as a power-law. For 1/2 < H < 1, the increments are positively correlated,CH,l(1x) > 0 and the randomwalk is called persistent. On the contrary, for 0 < H < 1/2, the increments are anti-correlatedand the random walk is called antipersistent [29,56,458].

Note that from Eq. (B.2), we get:

BH(x + λy)− BH(x) ≃ λH (BH(x + y)− BH(x)) , (B.6)

where ≃ stands for the equality in law. This means that fBm is a self-affine process [56,759,782,783] and that the Hurstexponent is H . The higher H , the more regular the motion. But since Eq. (B.6) holds for any x and y, this means that almostall realizations of the fBm are continuous, everywhere non-differentiable with a unique Hölder exponent h(x) = H , ∀x[56,458,759,784]. Thus fBm is a homogeneous fractal characterized by a singularity spectrum which reduces to a singlepoint:

D(h) = 1 if h = H,= −∞ if h = H. (B.7)

By Legendre transforming D(h) (Eq. (A.14)), we get the following expression for the partition function exponents:

τ(q) = qH − 1. (B.8)

τ(q) is a linear function of q with a slope given by the index H of the fBm. Let us point out that τ(2) = 0 (i.e. H = 1/2)indicates the presence of long-range correlations.

In Figs. B.1 and B.2, we report the results of a statistical analysis of fBm’s using the WTMM method [457–459,701,702].We mainly concentrate on B1/3 since it has a ω−5/3 power-spectrum similar to the spectrum of the multifractal stochasticsignal we will study in the next section. Actually our goal is to demonstrate that, where the power spectrum analysis fails,the WTMM method succeeds in discriminating unambiguously between these two fractal signals. The numerical signalswere generated by filtering uniformly generated pseudo-random noise in Fourier space in order to have the required ω−5/3

spectral density. A B1/3 fractional Brownian trail is shown in Fig. B.1(a). Fig. B.1(c) illustrates the WT coded, independentlyat each scale a, using 256 colors. The analyzing wavelet is g(1) (nψ = 1). Fig. B.2(a) displays some plots of log2 Z(q, a) vs.log2(a) for different values of q, where the partition function Z(q, a) has been computed on the WTMM skeleton shown inFig. B.1(e), according to the definition in Eq. (A.12). Using a linear regression fit, we then obtain the slopes τ(q) of thesegraphs. As shown in Fig. B.2(c), when plotted vs. q, the data for the exponents τ(q) consistently fall on a straight line thatis remarkably fitted by the theoretical prediction τ(q) = q/3 − 1 (Eq. (B.8)). As expected theoretically, we find from thenumerical application of theWTMMmethod, that the fBm B1/3 is a nowhere differentiable homogeneous fractal signal witha unique Hölder exponent h = H = 1/3, as given by the slope of the linear τ(q) spectrum (the hallmark of homogeneousfractal scaling). Similar good estimates are obtained when using analyzing wavelets of different orders, and this whateverthe value of the index H of the fBm [457–459,701,702].

Within the perspective of confirming the monofractality of fBm’s, we have studied the probability density function (pdf)of wavelet coefficient values ρa(Tg(1)(., a)), as computed at a fixed scale a in the fractal scaling range. According to themonofractal scaling properties, we expect these pdfs to satisfy the self-similarity relationship [29,113,124]:

aHρa(aHT ) = ρ(T ), (B.9)

where ρ(T ) is a ‘‘universal’’ pdf (actually the pdf obtained at scale a = 1) that does not depend on the scale parameter a.As shown in Fig. B.3 for B1/3, when plotting aHρa(aHT ) vs. T , all the ρa curves corresponding to different scales (Fig. B.3(a))remarkably collapse on a unique curve when using a unique exponent H = 1/3 (Fig. B.3(a’)). Furthermore the so-obtaineduniversal curve cannot be distinguished from a parabola in semi-log representation as the signature of the monofractalGaussian statistics of fBm fluctuations [29,113,458].

B.2. Random W-cascades

Multiplicative cascade models have enjoyed increasing interest in recent years as the paradigm of multifractalobjects [458,705,747–749,756,785]. The notion of cascade actually refers to a self-similar process whose properties aredefined multiplicatively from coarse to fine scales. In that respect, it occupies a central place in the statistical theory ofturbulence [73,746,756]. Originally, the concept of self-similar cascades was introduced to model multifractal measures(e.g. dissipation or enstrophy) [756]. It has been recently generalized to the construction of scale-invariant signals(e.g. longitudinal velocity, pressure, temperature) using orthogonal wavelet basis [760,761,786]. Instead of redistributingthe measure over sub-intervals with multiplicative weights, we allocate the wavelet coefficients in a multiplicative wayon the dyadic grid. This method has been implemented to generate multifractal functions (with weights W ) from a givendeterministic or probabilistic multiplicative process. Along the line of the modeling of fully developed turbulent signals bylog-infinitely divisible multiplicative processes [787,788], wemainly concentrate here on the log-normal W-cascades [761]in order to calibrate the WTMM method. If m and σ 2 are respectively the mean and the variance of lnW (where W is a

166 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. B.2. Determination of the τ(q) andD(h)multifractal spectra of fBm B1/3 (circles) and log-normal randomW-cascades (dots) using theWTMMmethod.(a) log2 Z(q, a) vs. log2 a: B1/3 . (b) log2 Z(q, a) vs. log2 a: log-normal W-cascades with the same parameters as in Fig. B.1(b). (c) τ(q) vs. q; the solid linecorresponds to the theoretical spectra (B.8) and (B.10). (d) D(h) vs. h; the solid line corresponds to the theoretical prediction (B.11). The analyzing waveletis g(1) (Fig. A.1(a)). The reported results correspond to annealed averaging over 1000 realizations of L = 65536.

multiplicative random variable with log-normal probability distribution), then, as shown in Ref. [761], a straightforwardcomputation leads to the following τ(q) spectrum:

τ(q) = − log2⟨Wq⟩ − 1, ∀ q ∈ R

= −σ 2

2 ln 2q2 −

mln 2

q − 1, (B.10)

where ⟨· · ·⟩ means ensemble average. The corresponding D(h) singularity spectrum is obtained by Legendre transformingτ(q) (Eq. (A.14)):

D(h) = −(h + m/ ln 2)2

2σ 2/ ln 2+ 1. (B.11)

According to the convergence criteria established in Ref. [761], m and σ 2 have to satisfy the conditions: m < 0 and|m|/σ >

√2 ln 2. Moreover, by solving D(h) = 0, one gets the following bounds for the support of the D(h) singularity

spectrum:

hmin = −mln 2

√2σ

√ln 2

and hmax = −mln 2

+

√2σ

√ln 2

. (B.12)

In Fig. B.1(b) is illustrated a realization of a log-normal W-cascade for the parameter values m = −0.355 ln 2 andσ 2

= 0.02 ln 2. The corresponding WT andWT skeleton as computed with g(1) are shown in Fig. B.1(d) and (f) respectively.The results of the application of theWTMMmethod are reported in Fig. B.2. As shown in Fig. B.2(b), when plotted vs. the scaleparameter a in a logarithmic representation, the annealed average of the partition function Z(q, a) displays a well defined

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 167

Fig. B.3. Probability distribution functions of wavelet coefficient values of fBm B1/3 (open symbols) and log-normal random W-cascades (filled symbols)with the same parameters as in Fig. B.1(b). (a) ρa vs. Tg(1) for the set of scales a = 10 (), 50 (), 100 (), 1000 (), 9000 (); (a’) aHρa(ρ(aHTg(1) )) vs.Tg(1) with H = 1/3; the symbols have the same meaning as in (a). (b) ρa vs. Tg(1) for the set of scales a = 10 (N), 50 (), 100 (•), 1000 ( ), 9000 (H);(b’) aHρa(aHTg(1) ) vs. Tg(1) with H = −m/ ln 2 = 0.355. The analyzing wavelet is g(1) (Fig. A.1(a)).

scaling behavior over a range of scales of about 5 octaves. Note that scaling of quite good quality is found for a rather widerange of q values:−5 ≤ q ≤ 10.When processing to a linear regression fit of the data over the first four octaves, one gets theτ(q) spectrum shown in Fig. B.2(c). This spectrum is clearly a nonlinear function of q, the hallmark of multifractal scaling.Moreover, the numerical data are in remarkable agreement with the theoretical quadratic prediction (Eq. (B.10)). Similarquantitative agreement is observed on the D(h) singularity spectrum in Fig. B.2(d) which displays a single humped parabolashape that characterizes intermittent fluctuations corresponding to Hölder exponents values ranging from hmin = 0.155 tohmax = 0.555. Unfortunately, to capture the strongest and the weakest singularities, we need to compute the τ(q) spectrumfor very large values of |q|. This requires the processing of manymore realizations of the considered log-normal random W-cascade. Themultifractal nature of log-normalW-cascade realizations is confirmed in Fig. B.3(b, b’) where the self-similarityrelationship (Eq. (B.9)) is shown not to apply. Actually there does not exist a H value allowing to superimpose onto a singlecurve the WT pdfs computed at different scales.

The test applications reported in this Appendix demonstrate the ability of the WTMM method to resolve multifractalscaling of 1D signals, a hopeless task for classical power spectrum analysis. They were used on purpose to calibrate andto test the reliability of our methodology (and of the corresponding numerical tools) with respect to finite-size effects andstatistical convergence [29,757].

Appendix C. Modeling the effect of sequence-dependent long-range correlated structural disorder on the thermody-namical properties of 2D elastic chains

C.1. The worm-like-chain model

In this subsection, we summarize the main results of the WLC model [311–317] that accounts for the entropic elasticbehavior of an intrinsically straight semi-flexible polymer. In 2D, the chain conformations of length L with local curvatures

168 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

Fig. C.1. Schematic representation of a DNA chain of total length L with curvature disorder induced by the sequence (intrinsic) and possibly by thermalfluctuations (T = 0). The local curvature at the base-pair n is defined as C(n) = 1/r(n) = 1θn = θn − θn−1 . In the continuous limit, R is the end-to-endvector, u(s) and u(s′) are unit vectors tangent to the chain at locations s and s′ respectively and uo is the unit tangent vector at the origin.

induced by thermal fluctuations (Fig. C.1) are controlled by the elastic energy function [789]:

EeℓkT

=A2

∫ L

0

[dθ(s)ds

]2ds, (C.1)

where the elastic bending modulus A actually controls the exponential decay of the orientational correlation betweentangent unit vectors to the chain, u(s) and u(s′), separated by a curvilinear distance ℓ = |s′ − s|:

⟨u(s).u(s + ℓ)⟩ = ⟨cos(θ(s + ℓ)− θ(s))⟩ = exp

−ℓ

2A

, ∀s (C.2)

where ⟨·⟩ stands for averaging over thermal realizations. Thus 2A = ℓp is nothing but the persistence length, i.e. the decaylength through which the memory of the initial orientation of the chain persists. Actually the probability of having a certainbend angle1θ(s, ℓ) = θ(s + ℓ)− θ(s) over the distance ℓ at any position s, follows a Gaussian distribution [340,341,789]:

P(1θ) =

A

2πℓexp

A2ℓ1θ2

. (C.3)

C.1.1. Persistence length estimate from the mean square end-to-end distanceThen, upon a straightforward integration, we get the following expression for the mean square end-to-end distance of

the polymer:

⟨R2⟩ =

∫ L

0

∫ L

0⟨u(s).u(s′)⟩dsds′,

= 4AL[1 −

2AL

1 − exp

L2A

], (C.4)

from which we can estimate the persistence length

ℓp = limL→∞

⟨R2⟩/2L = 2A. (C.5)

For finite length polymers, we will use the following definition of the ‘‘local persistence length’’:

ℓDp (L) =

⟨R2⟩

2L= ℓp

[1 −

ℓp

L

1 − exp

Lℓp

], (C.6)

where the superscript D stands for the double integration required to compute ⟨R2⟩ in Eq. (C.4).

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 169

C.1.2. Persistence length estimate from the mean projection of the end-to-end vector on the initial orientationAlternatively, we can estimate the persistence length from the projection of the end-to-end vector R on the initial unit

tangent vector to the chain [340], i.e.

⟨u(0).R⟩ =

∫ L

0⟨u(0).u(s)⟩ds,

= 2A(1 − exp(−L/2A)), (C.7)

from which we get in the limit L ≫ 2A:

limL→∞

⟨u(0).R⟩ = 2A = ℓp. (C.8)

Again, for finite length polymers, we will use the following definition:

ℓSp (L) = ⟨u(0).R⟩ = ℓp

1 − exp

−L/ℓp

, (C.9)

where the superscript S now stands for the simple integration involved in Eq. (C.7). Let us remark that as compared to therather slow 1/L convergence of ℓD

p (L) towards the persistence length ℓp = 2A, ℓSp (L) converges exponentially fast to ℓp.

C.2. Revisiting polymer statistical physics to account for the presence of long-range correlated structural disorder in 2DDNA chains

As sketched in Fig. 32, DNA is not intrinsically straight and the main limitation of the WLC model is that it does not takeinto account the intrinsic sequence-dependent structural disorder θo(s) (Fig. 2(b)). Then the elastic energy function takesthe following new form [80,346]:

EeℓkT

=A2

∫ L

0

θ (s)− θo(s)

2ds, (C.10)

where the local fluctuation θ (s) = dθ(s)/ds for a straight polymer (Eq. (C.1)) has been replaced by a local fluctuationθ (s) − θo(s) around the intrinsic frozen DNA trajectory (Fig. 32). Let us point out that when modeling the local intrinsiccurvature of the chain θo(s) as the realization of a fractional Gaussian noise of zero mean and variance σ 2

o (see Appendix D),then

1θo(s, ℓ) =

∫ s+ℓ

sθo(u)du = BH(s + ℓ)− BH(s) = δBH,ℓ(s) (C.11)

is the increment over a distance ℓ of a fBm BH (Appendix B, Eq. (B.4)) [56,758], which is quite consistent with the Gaussianstatistics of wavelet coefficients obtained, at small scales, for the PNuc bending profiles in Fig. 10. This means that theintrinsic bend angle1θo(ℓ) over a distance ℓ follows a Gaussian law

P(1θo) =

1

2πσ 2o ℓ

2Hexp

1θ2o

2σ 2o ℓ

2H

(C.12)

of zero mean and variance σ 2(ℓ) = σ 2o ℓ

2H , where as discussed in the Appendix B and Section 2, the Hurst exponent Haccounts for the possible existence of LRC (1/2 < H < 1) in the distribution of the intrinsic curvature along the chain(Fig. 32). Under this assumption, we can analytically compute the directional correlations between tangent unit vectors andEq. (C.2) transforms into [80,346]:

⟨u(s).u(s + ℓ)⟩ = ⟨cos [(θ(s + ℓ)− θo(s + ℓ))− (θ(s)− θo(s))]⟩ = exp

−ℓ

2A−σ 2o ℓ

2H

2

, (C.13)

where · stands for averaging over intrinsic curvature disorder (θo) realizations.

C.2.1. The worm-like-chain model still at work for DNA chains with uncorrelated intrinsic disorderLet us remark that for uncorrelated (or short-range correlated) DNA chains, introducing H = 1/2 in Eq. (C.13) yields:

⟨u(s).u(s + ℓ)⟩ = exp

−ℓ

2

1A

+ σ 2o

. (C.14)

We recover the exponential behavior predicted by the WLC model (Eq. (C.2)) but with an effective persistence length:

ℓeffp =2A

1 + Aσ 2o. (C.15)

170 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a c

db

Fig. C.2. Persistence length ℓDp (solid line) and ℓS

p (dashed line) computed from Eqs. (C.21) and (C.22) respectively [346]. (a) 2A = 310 bp, σo = 0.01,H = 0.5, 0.7 and 0.9 from top to bottom; (b) same as (a) but for 2A = 490 bp; (c) 2A = 490 bp, H = 0.5, σo = 0.007, 0.022 and 0.04 from top to bottom;(d) same as (c) but for H = 0.73. The symbols ℓD

p (•) and ℓSp () correspond to DNA simulations when averaging over one thermal realization for N = 103

different frozen chains generated with the corresponding parameters as described in Appendix D. In (a) and (b), the asymptotic convergence of ℓDp and ℓS

pis illustrated up to s = 10 000 bp disregarding excluded volume effects [350,790].Source: Adapted from Ref. [85].

This equation can be rewritten in the following form:1

ℓeffp

=1ℓdp

+1ℓsp, (C.16)

where

ℓdp = 2A, ℓsp = 2/σ 2o (C.17)

are respectively the dynamical persistence length associated with entropic disorder and the static persistence lengthassociated with the intrinsic curvature disorder. Eq. (C.16) is identical to the Trifonov et al. Eq. (12) [323] except that itapplies in 2D and not in 3D.

Remark. From themean square end-to-end distance of an uncorrelated chainwhich freely equilibrates on a surface, we canshow, under the hypothesis that the chain is a continuous, homogeneous and isotropicmaterial, that its apparent persistencelength is twice the one found for the 3D conformations [340,341]:

ℓ2Dp = 2ℓ3Dp . (C.18)

Importantly, WLC equations (C.6) and (C.9) for ℓDp and ℓS

p , still apply to heteropolymers with uncorrelated structuraldisorder but with an effective elastic bend rigidity that has been renormalized by the amplitude σ 2

o of the intrinsic curvaturefluctuations, the larger σ 2

o , the smaller the effective persistence length:

Aeff =

1A

+ σ 2o

−1

. (C.19)

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 171

Note that in the weak disorder limit, σ 2o ≪ 1/A, Eq. (C.19) leads to Nelson’s formula (Eq. (13)) [330]:

leffp = 2A1 − Aσ 2

o

. (C.20)

C.2.2. Generalized 2D worm-like-chain model for long-range correlated DNA chainsFor LRC structural disorder, when using Eq. (C.13) to compute both the mean end-to-end distance and the mean

projection of the end-to-end vector on the initial unit tangent vector, then Eqs. (C.4) and (C.7) become, respectively [80,346]:

⟨R2⟩ =

∫ L

0ds′∫ s′

0ds exp

s2A

−s2Hσ 2

o

2

, (C.21)

and

⟨u(0).R⟩ =

∫ L

0ds exp

s2A

−s2Hσ 2

o

2

. (C.22)

Both these equations can be handled perturbatively and numerically. But unlike Eq. (C.15) for uncorrelated static disorder(H = 1/2), there is no longer analytically tractable expression of the persistence length when considering the presence ofLRC (1/2 < H < 1).

As far as the convergence to the persistence length is concerned, we show in Fig. C.2, the behavior of ℓDp (s) and ℓ

Sp (s)

obtainedwhennumerically integrating Eqs. (C.21) and (C.22) for values ofA,σo andH allowing for some comparative analysisof persistence length estimates of various DNA fragments using AFM imaging in Section 4. As shown in Fig. C.2(a) and (b)for 2A = 310 bp and 490 bp respectively, when fixing σo = 0.01, we see that for uncorrelated H = 1/2 chains, ℓD

p andℓSp both converge to the effective persistence length ℓeffp = 305 bp and 478 bp according to the WLC equations (C.6) and

(C.9) respectively, and in good agreement with the weak-disorder perturbative formula (C.20). When introducing LRC, forboth A values, we observe a faster convergence to a smaller persistence length, the larger the H , the smaller the ℓp. ℓS

p stillconverges faster than ℓD

p to the persistence length that is well approximated by the following weak disorder formula forσ 2o ≪ 2/[(2A)2HΓ (2H + 1)] [80,346]:

ℓeffp = 2A1 −

(2A)2Hσ 2o

2Γ (2H + 1)

, (C.23)

where Γ is the Gamma function. As shown in Fig. C.2(c) for H = 1/2 and (d) for H = 0.73, when increasing the amplitudeσo of the intrinsic curvature disorder, the decrease of the persistence length is even more pronounced for LRC chains thanfor uncorrelated ones. This significant decrease of ℓp that cannot be accounted for by the WLC model is not the signatureof some increased flexibility, but rather the consequence of the LRC induced persistence in the distribution of intrinsicallybent sites [80,346]. In Fig. C.2(c) and (d) are also reported for comparison the results of numerical estimates of ℓD

p and ℓSp

from DNA simulations when averaging over one thermal realization for N = 103 different uncorrelated (H = 1/2) andLRC (H = 0.73) frozen chains respectively (see Appendix D). For the two values of σo = 0.007 and 0.022 considered inthese simulations, the numerical results obtained for the LRC chains in Fig. C.2(d) are in just as good agreement with thetheoretical predictions (Eqs. (C.21) and (C.22)), as are the numerical results for the uncorrelated chains with theWLCmodelpredictions in Fig. C.2(c). These results strengthen our GWLC model as a very reliable guide to future experiments.

Remark. In this Appendix, we have neglected electrostatic effects assuming that the ionic strength is such that theelectrostatic persistence length [791] is much larger than the entropic persistence length [792–795]. In other words, thelocal DNA curvature fluctuations will be assumed to be significantly larger than the thickness of the Debye layer.

Appendix D. DNA simulations of uncorrelated and long-range correlated 2D elastic chains

The GWLC model developed in Appendix C, involves some averaging over the structural disorder, i.e. over many DNAsequences. As experienced in Section 4, this is without any doubt a rather prohibitive experimental task that led us to extendthe DNA simulations performed in Ref. [339] for intrinsically straight DNA to uncorrelated and LRC DNAmolecules [80,346].Because of the rather small size (.3000 bp) of the numerical DNA chains under study, DNA chains are simulated withouttaking into account excluded volume effects among segments of the samemolecule (self-avoiding effects) which have beenshown to show up for sizes larger than 10–20 ℓp only [350,790]. The interaction energy between the DNA chain and thesurface is assumed to be zero. Thus, a simulated chain consists of a series of L segments of length 1 pixel and infinitesimalthickness, joined together at one end.

172 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

a

c

b

d

Fig. D.1. (a)N = 100 numerically generated frozen DNA chains of length L = 2200 bpwith uncorrelated (H = 1/2, blue) and LRC (H = 0.73, red) intrinsiccurvature angle fluctuations of amplitude σo = 0.007; (b) same as in (a) with σo = 0.022. (c) DNA simulations of N = 100 2D equilibrium conformations(black trajectories) generated with an elastic bend rigidity 2A = 490 bp from a chain of length L = 2200 bp with uncorrelated (H = 1/2) intrinsic bendangle fluctuations of amplitude σo = 0.022 (blue trajectory); (d) same as in (c) but for a chain with LRC (H = 0.73) intrinsic bend angle fluctuations ofamplitude σo = 0.007 (red trajectory). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of thisarticle.)Source: Adapted from Ref. [85].

D.1. Intrinsically straight DNA

2D equilibrium conformations of straight DNA are generated using the centered Gaussian distribution of variance 1/Agiven by Eq. (C.3) (ℓ = 1) to randomly choose the angle deformation1θ(n) between successive segments.

D.2. DNA chains with intrinsic uncorrelated structural disorder

To model the presence of some intrinsic uncorrelated curvature disorder induced by the sequence, we first generate aGaussian white noise 1θo(n) of zero mean and variance σ 2

o from which we construct a frozen uncorrelated trajectory likethe one shown in blue in Fig. D.1(a) and (b) for σo = 0.007 and 0.022 respectively. Then 2D equilibrium conformations aregenerated as before but using for 1θ(n) a Gaussian law of variance 1/A which is no longer of zero mean but centered at1θo(n). In Fig. D.1(c) we show for illustrative purpose,N = 100 trajectoriesmimicking the effect of the thermal fluctuations(2A = 490 bp) on the original uncorrelated blue frozen chain (H = 1/2, σo = 0.022).

D.3. DNA chains with long-range correlated structural disorder

Along the line of our GWLCmodel (Appendix C.2.2) [80,346], wemodel the presence of LRC in the DNA intrinsic curvatureproperties using a fractional Gaussian noise of zeromean and variance σ 2

o andwhose index (1/2 < H < 1) actually controls(i) the power-law increase of the varianceσ 2

o ℓ2H of the distribution of intrinsic curvature1θo(ℓ) over a distance ℓ (Eq. (C.12)),

and (ii) the slow power-law decay of the intrinsic curvature correlation function (Eqs. (C.11) and (B.5)). Typical constructedfrozen LRC trajectories with H = 0.73 and σo = 0.007 and 0.022 are shown in Fig. D.1(a) and (b) respectively. Then 2Dequilibrium conformations are generated as before for uncorrelated chains, i.e., using for1θ(n) a Gaussian law centered at1θo(n) and of variance 1/A. In Fig. D.1(d), we show N = 100 trajectories mimicking the effect of thermal fluctuations onsome LRC frozen chains (H = 0.73, σo = 0.007) andwhich, according to the value of the elastic bend rigidity (2A = 490 bp),more or less retain the overall large-scale curvature of the original LRC chain.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 173

Remark. To use fractional Gaussian noise (and in turn fBm) in numerical experiments, we need to sample it. It has beenshown in Refs. [796–798] that a fractional autoregressive integrated moving average (FARIMA) time series with Gaussianinnovations [784,799,800] corresponds to a well defined sampling (using a particular sampling filter). All our LRC DNAsimulations were performed using FARIMA processes [80,130,346,781].

References

[1] K.E. van Holde, Chromatin, Springer-Verlag, New York, 1988.[2] A.P. Wolffe, Chromatin Structure and Function, 3rd ed., Academic Press, London, 1998.[3] C.R. Calladine, H.R. Drew, Understanding DNA, Academic Press, San Diego, 1999.[4] B. Alberts, D. Bray, J. Lewis, M. Raff, K. Roberts, J.D. Watson, Molecular Biology of the Cell, 4th ed., Garland Publishing, New York, 2002.[5] G. Felsenfeld, M. Groudine, Controlling the double helix, Nature 421 (2003) 448–453.[6] J. Widom, Structure dynamics and function of chromatin in vitro, Annu. Rev. Biophys. Biomol. Struct. 27 (1998) 285–327.[7] R.D. Kornberg, Y. Lorch, Twenty-five years of the nucleosome, fundamental particle of the eukaryote chromosomes, Cell 98 (1999) 285–294.[8] K. Luger, A.W. Mäder, R.K. Richmond, D.F. Sargent, T.J. Richmond, Crystal structure of the nucleosome core particle at 2.8 Å resolution, Nature 389

(1997) 251–260.[9] L. Chantalat, J.M. Nicholson, S.J. Lambert, A.J. Reid, M.J. Donovan, C.D. Reynolds, C.M. Wood, J.P. Baldwin, Structure of the histone-core octamer in

KCl/phosphate crystals at 2.15 Å resolution, Acta Crystallogr. D Biol. Crystallogr. 59 (2003) 1395–1407.[10] T.J. Richmond, C.A. Davey, The structure of DNA in the nucleosome core, Nature 423 (2003) 145–150.[11] S.C. Satchwell, H.R. Drew, A.A. Travers, Sequence periodicities in chicken nucleosome core DNA, J. Mol. Biol. 191 (1986) 659–675.[12] I. Ioshikhes, A. Bolshoy, K. Derenshteyn, M. Borodovsky, E.N. Trifonov, Nucleosome DNA sequence pattern revealed by multiple alignment of

experimentally mapped sequences, J. Mol. Biol. 262 (1996) 129–139.[13] J. Widom, Role of DNA sequence in nucleosome stability and dynamics, Q. Rev. Biophys. 34 (2001) 269–324.[14] B. Audit, C. Thermes, C. Vaillant, Y. d’Aubenton-Carafa, J.-F.Muzy, A. Arneodo, Long-range correlations in genomicDNA: a signature of the nucleosomal

structure, Phys. Rev. Lett. 86 (2001) 2471–2474.[15] B. Audit, C. Vaillant, A. Arneodo, Y. d’Aubenton-Carafa, C. Thermes, Long-range correlations between DNA bending sites: relation to the structure and

dynamics of nucleosomes, J. Mol. Biol. 316 (2002) 903–918.[16] B. Audit, C. Vaillant, A. Arneodo, Y. d’Aubenton-carafa, C. Thermes, Wavelet analysis of DNA bending profiles reveals structural constraints on the

evolution of genomic sequences, J. Biol. Phys. 30 (2004) 33–81.[17] U.K. Laemmli, E. Käs, L. Poljak, Y. Adachi, Scaffold-associated regions: cis-acting determinants of chromatin structural loops and functional domains,

Current Opin. Genetics Dev. 2 (1992) 275–285.[18] Y. Saitoh, U.K. Laemmli, From the chromosomal loops and the scaffold to the classic bands of metaphase chromosomes, Cold Spring Harb. Symp.

Quant. Biol. 58 (1993) 755–765.[19] A.S. Belmont, K. Bruce, Visualization of G1 chromosomes: a folded, twisted, supercoiled chromonema model of interphase chromatid structure,

J. Cell Biol. 127 (1994) 287–302.[20] A.S. Belmont, S. Dietzel, A.C. Nye, Y.G. Strukov, T. Tumbar, Large-scale chromatin structure and function, Current Opin. Cell Biol. 11 (1999) 307–311.[21] P.J. Horn, C.L. Peterson, Chromatin higher order folding: wrapping up transcription, Science 297 (2002) 1824–1827.[22] R.K. Sachs, G. van den Engh, B. Trask, H. Yokota, J.E. Hearst, A random-walk/giant-loop model for interphase chromosomes, Proc. Natl. Acad. Sci. USA

92 (1995) 2710–2714.[23] J. Ostashevsky, A polymer model for the structural organization of chromatin loops and minibands in interphase chromosomes, Mol. Biol. Cell 9

(1998) 3031–3040.[24] C. Münkel, R. Eils, S. Dietzel, D. Zink, C. Mehring, G.Wedemann, T. Cremer, J. Langowski, Compartmentalization of interphase chromosomes observed

in simulation and experiment, J. Mol. Biol. 285 (1999) 1053–1065.[25] P.R. Cook, A chromomeric model for nuclear and chromosome structure, J. Cell Sci. 108 (1995) 2927–2935.[26] P.R. Cook, Predicting three-dimensional genome structure from transcriptional activity, Nat. Genet. 32 (2002) 347–352.[27] N.L. Mahy, P.E. Perry, W.A. Bickmore, Gene density and transcription influence the localization of chromatin outside of chromosome territories

detectable by FISH, J. Cell Biol. 159 (2002) 753–763.[28] A. Arneodo, F. Argoul, E. Bacry, J. Elezgaray, J.-F. Muzy, OndelettesMultifractales et Turbulences: de l’ADN aux croissances cristallines, Diderot Editeur

Arts et Sciences, Paris, 1995.[29] A. Arneodo, B. Audit, N. Decoster, J.-F. Muzy, C. Vaillant, Wavelet based multifractal formalism: application to DNA sequences, satellite images of the

cloud structure and stock market data, in: A. Bunde, J. Kropp, H.J. Schellnhuber (Eds.), The Science of Disasters: Climate Disruptions, Heart Attacks,and Market Crashes, Springer Verlag, Berlin, 2002, pp. 26–102.

[30] A. Arneodo, Y. d’Aubenton-Carafa, B. Audit, E.-B. Brodie of Brodie, S. Nicolay, P. St-Jean, C. Thermes, M. Touchon, C. Vaillant, DNA in chromatin: fromgenome-wide sequence analysis to the modeling of replication in mammals, Adv. Chem. Phys. 135 (2007) 203–252.

[31] A. Arneodo, B. Audit, C. Faivre-Moskalenko, J. Moukhtar, C. Vaillant, F. Argoul, Y. d’Aubenton-Carafa, C. Thermes, From DNA sequence to chromatinorganization: the fundamental role of genomic long-range correlations, in: Mémoire de la Classe des Sciences, in: Collection en −8o , 3e série, TomeXXVIII, no 2049, Académie Royale de Belgique, Bruxelles, 2008.

[32] A. Arneodo, B. Audit, E.-B. Brodie of Brodie, S. Nicolay, M. Touchon, Y. d’Aubenton-Carafa, M. Huvet, C. Thermes, Fractals and wavelets: what can welearn on transcription and replication fromwavelet-basedmultifractal analysis of DNA sequences? in: R.A. Meyers (Ed.), Encyclopedia of Complexityand Systems Science, Springer, New York, 2009, pp. 3893–3924.

[33] P. Goupillaud, A. Grossmann, J. Morlet, Cycle-octave and related transforms in seismic signal analysis, Geoexploration 23 (1984) 85–102.[34] A. Grossmann, J. Morlet, Decomposition of Hardy functions into square integrablewavelets of constant shape, SIAM J. Math. Anal. 15 (1984) 723–736.[35] A. Grossmann, J. Morlet, Decomposition of functions into wavelets of constant shape and related transforms, in: L. Streit (Ed.), Mathematics and

Physics, Lectures on Recent Results, World Scientific, Singapore, 1985, pp. 17–22.[36] A. Arneodo, F. Argoul, J. Elezgaray, G. Grasseau,Wavelet transformanalysis of fractals: application tononequilibriumphase transitions, in:G. Turchetti

(Ed.), Nonlinear Dynamics, World Scientific, Singapore, 1989, pp. 130–180.[37] J.-M. Combes, A. Grossmann, P. Tchamitchian (Eds.), Wavelets, Springer, Berlin, 1989.[38] P.G. Lemarie (Ed.), Les Ondelettes en 1989, Springer, Berlin, 1990.[39] Y. Meyer, Ondelettes, Herman, Paris, 1990.[40] A. Arneodo, F. Argoul, E. Bacry, J. Elezgaray, E. Freysz, G. Grasseau, J.-F. Muzy, B. Pouligny, Wavelet transform of fractals, in: Y. Meyer (Ed.), Wavelets

and Applications, Springer, Berlin, 1992, pp. 286–352.[41] C.K. Chui, An Introduction to Wavelets, Academic Press, Boston, 1992.[42] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, 1992.[43] Y. Meyer (Ed.), Wavelets and their Applications, Springer, Berlin, 1992.[44] M.B. Ruskai, G. Beylkin, R. Coifman, I. Daubechies, S. Mallat, Y. Meyer, L. Raphael (Eds.), Wavelets and their Applications, Jones and Barlett, Boston,

1992.[45] M. Farge, J.C.R. Hunt, J.C. Vassilicos (Eds.), Wavelets, Fractals and Fourier Transforms, Clarendon Press, Oxford, 1993.[46] P. Flandrin, Temps-Fréquence, Hermès, Paris, 1993.

174 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[47] Y. Meyer, S. Roques (Eds.), Progress in Wavelets Analysis and Applications, Éditions Frontières, Gif-sur-Yvette, 1993.[48] G. Erlebacher, M.Y. Hussaini, L.M. Jameson (Eds.), Wavelets: Theory and Applications, Oxford University Press, Oxford, 1996.[49] M. Holschneider, Wavelets: An Analysis Tool, Oxford University Press, Oxford, 1996.[50] P. Abry, Ondelettes et Turbulences, Diderot Éditeur, Art et Sciences, Paris, 1997.[51] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, New York, 1998.[52] B. Torresani, Analyse Continue par Ondelettes, Éditions de Physique, Les Ulis, 1998.[53] B.W. Silverman, J.C. Vassilicos (Eds.), Wavelets: The Key to Intermittent Information? Oxford University Press, Oxford, 2000.[54] S. Jaffard, Y. Meyer, R.D. Ryan (Eds.), Wavelets: Tools for Science & Technology, SIAM, Philadelphia, 2001.[55] J.P. Antoine, R. Murenzi, P. Vandergheynst, S.T. Ali, Two-DimensionalWavelets and their Relatives, Cambridge University Press, Cambridge, UK, 2008.[56] B.B. Mandelbrot, The Fractal Geometry of Nature, Freeman, San Francisco, 1982.[57] L. Pietronero, E. Tosatti (Eds.), Fractals in Physics, North-Holland, Amsterdam, 1986.[58] H.E. Stanley, N. Ostrowski (Eds.), On Growth and Form: Fractal and Non-Fractal Patterns in Physics, Martinus Nijhof, Dordrecht, 1986.[59] D. Avnir (Ed.), The Fractal Approach to Heterogeneous Chemistry: Surfaces, Colloids, Polymers, Wiley, New York, 1988.[60] J. Feder, Fractals, Pergamon, New York, 1988.[61] H.E. Stanley, N. Ostrowski (Eds.), Random Fluctuations and Pattern Growth, Kluwer, Dordrecht, 1988.[62] A. Aharony, J. Feder (Eds.), Fractals in Physics, Essays in Honour of B.B. Mandelbrot, Physica D 38, North-Holland, Amsterdam, 1989.[63] T. Vicsek, Fractal Growth Phenomena, World Scientific, Singapore, 1989.[64] B.J. West, Fractal Physiology and Chaos in Medecine, World Scientific, Singapore, 1990.[65] A. Bunde, S. Havlin (Eds.), Fractals and Disordered Systems, Springer, Berlin, 1991.[66] F. Family, T. Vicsek, Dynamics of Fractal Surfaces, World Scientific, Singapore, 1991.[67] H.O. Peitgen, H. Jurgens, D. Saupe (Eds.), Chaos and Fractals: New Frontiers of Science, Springer, New York, 1992.[68] A. Bunde, S. Havlin (Eds.), Fractals in Science, Springer, Berlin, 1994.[69] T. Vicsek, M. Schlesinger, M. Matsuchita (Eds.), Fractals in Natural Science, World-Scientific, Singapore, 1994.[70] B.J. West, W. Deering, Fractal Physiology for Physicists: Levy Statistics, Phys. Rep. 246 (1994) 1–100.[71] A.L. Barabàsi, H.E. Stanley, Fractals Concepts in Surface Growth, Cambridge University Press, Cambridge, 1995.[72] F. Family, P. Meakin, B. Sapoval, R. Wood (Eds.), Fractal Aspects of Materials, in: Material Research Society Symposium Proceedings, vol. 367, MRS,

Pittsburg, 1995.[73] U. Frisch, Turbulence, Cambridge Univ. Press, Cambridge, 1995.[74] G.G. Wilkinson, J. Kanellopoulos, J. Megier (Eds.), Fractals in Geoscience and Remote Sensing, in: Image Understanding Research Series, vol. 1, ECSC-

EC-EAEC, Brussels, Luxembourg, 1995.[75] J.-P. Bouchaud, M. Potters, Théorie des Risques Financiers, Cambridge University Press, Cambridge, 1997.[76] P. Meakin (Ed.), Fractals, Scaling and Growth far from Equilibrium, Cambridge Univ. Press, Cambridge, 1998.[77] D. Ben Avraham, S. Havlin (Eds.), Diffusion and Reactions in Fractals and Disordered Systems, Cambridge Univ. Press, Cambridge, 2000.[78] R.N. Mantegna, H.E. Stanley, An Introduction to Econophysics, Cambridge University Press, Cambridge, 2000.[79] A. Bunde, J. Kropp, H.J. Schellnhuber (Eds.), The Science of Disasters: Climate Disruptions, Heart Attacks and Market Crashes, Springer, Berlin, 2002.[80] J. Moukhtar, C. Vaillant, B. Audit, A. Arneodo, Generalizedwormlike chainmodel for long-range correlated heteropolymers, Europhys. Lett. 86 (2009)

48001.[81] C. Vaillant, B. Audit, A. Arneodo, Experiments confirm the influence of genome long-range correlations on nucleosome positioning, Phys. Rev. Lett.

99 (2007) 218103.[82] G. Chevereau, L. Palmeira, C. Thermes, A. Arneodo, C. Vaillant, Thermodynamics of intra-genic nucleosome ordering, Phys. Rev. Lett. 103 (2009)

188103.[83] C. Vaillant, L. Palmeira, G. Chevereau, B. Audit, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, A novel strategy of transcription regulation by intra-

genic nucleosome ordering, Genome Res. 20 (2010) 59–67.[84] J. Moukhtar, E. Fontaine, C. Faivre-Moskalenko, A. Arneodo, Probing persistence in DNA curvature properties with atomic force microscopy, Phys.

Rev. Lett. 98 (2007) 178101.[85] J. Moukhtar, C. Faivre-Moskalenko, P. Milani, B. Audit, C. Vaillant, E. Fontaine, F. Mongelard, G. Lavorel, P. St-Jean, P. Bouvet, F. Argoul, A. Arneodo,

Effect of genomic long-range correlations on DNA persistence length: from theory to single molecules experiments, J. Phys. Chem. B 114 (2010)5125–5143.

[86] P. Milani, G. Chevereau, C. Vaillant, B. Audit, Z. Haftek-Terreau, M. Marilley, P. Bouvet, F. Argoul, A. Arneodo, Nucleosome positioning by genomicexcluding-energy-barriers, Proc. Natl. Acad. Sci. USA 106 (2009) 22257–22262.

[87] S. Nicolay, F. Argoul, M. Touchon, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, Low frequency rhythms in human DNA sequences: a key to theorganization of gene location and orientation? Phys. Rev. Lett. 93 (2004) 108101.

[88] P. St-Jean, C. Vaillant, B. Audit, A. Arneodo, Spontaneous emergence of sequence-dependent rosettelike folding of chromatin fiber, Phys. Rev. E 77(2008) 061923.

[89] E.-B. Brodie of Brodie, S. Nicolay, M. Touchon, B. Audit, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, From DNA sequence analysis to modelingreplication in the human genome, Phys. Rev. Lett. 94 (2005) 248103.

[90] M. Touchon, S. Nicolay, B. Audit, E.-B. Brodie of Brodie, Y. d’Aubenton-Carafa, A. Arneodo, C. Thermes, Replication-associated strand asymmetries inmammalian genomes: toward detection of replication origins, Proc. Natl. Acad. Sci. USA 102 (2005) 9836–9841.

[91] M. Huvet, S. Nicolay, M. Touchon, B. Audit, Y. d’Aubenton-Carafa, A. Arneodo, C. Thermes, Human gene organization driven by the coordination ofreplication and transcription, Genome Res. 17 (2007) 1278–1285.

[92] B. Audit, L. Zaghloul, C. Vaillant, G. Chevereau, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, Open chromatin encoded in DNA sequence is thesignature of ‘‘master’’ replication origins in human cells, Nucleic Acids Res. 37 (2009) 6064–6075.

[93] H.E. Stanley, S.V. Buldyrev, A.L. Goldberger, S. Havlin, S.M. Ossadnik, C.-K. Peng, M. Simons, Fractal landscapes in biological systems, Fractals 1 (1993)283–301.

[94] W. Li, Mutual information functions versus correlation-functions, J. Stat. Phys 60 (1990) 823–837.[95] W. Li, Generating non trivial long-range correlations and 1/f spectra by replication and mutation, Int. J. Bifurcation Chaos 2 (1992) 137–154.[96] M.Y. Azbel’, Universality in a DNA statistical structure, Phys. Rev. Lett. 75 (1995) 168–171.[97] H. Herzel, I. Große, Measuring correlations in symbol sequences, Physica A 216 (1995) 518–542.[98] H. Herzel, I. Große, Correlations in DNA sequences: the role of protein coding segments, Phys. Rev. E 55 (1997) 800–810.[99] R.F. Voss, Evolution of long-range fractal correlations and 1/f noise in DNA base sequences, Phys. Rev. Lett. 68 (1992) 3805–3808.

[100] R.F. Voss, Long-range fractal correlations in DNA introns and exons, Fractals 2 (1994) 1–6.[101] C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, F. Sciortino, M. Simons, H.E. Stanley, Long-range correlations in nucleotide sequences, Nature

356 (1992) 168–170.[102] R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H.E. Stanley, Linguistic features of noncoding DNA sequences, Phys.

Rev. Lett. 73 (1994) 3169–3172.[103] S. Havlin, S.V. Buldyrev, A.L. Goldberger, R.N. Mantegna, C.-K. Peng, M. Simons, H.E. Stanley, Statistical and linguistic features of DNA sequences,

Fractals 3 (1995) 269–284.[104] R.N. Mantegna, S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H.E. Stanley, Systematic analysis of coding and noncoding DNA

sequences using methods of statistical linguistics, Phys. Rev. E 52 (1995) 2939–2950.[105] H. Herzel, W. Ebeling, A. Schmitt, Entropies of biosequences: the role of repeats, Phys. Rev. E 50 (1994) 5061–5071.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 175

[106] P. Bernaola-Galván, R. Román-Roldán, J.L. Oliver, Compositional segmentation and long-range fractal correlations in DNA sequences, Phys. Rev. E 53(1996) 5181–5189.

[107] W. Li, The measure of compositional heterogeneity in DNA sequences is related to measures of complexity, Complexity 3 (1997) 33–37.[108] R. Román-Roldán, P. Bernaola-Galván, J.L. Oliver, Sequence compositional complexity of DNA through an entropic segmentation method, Phys. Rev.

Lett. 80 (1998) 1344–1347.[109] S. Nee, Uncorrelated DNA walks, Nature 357 (1992) 450.[110] B. Borštnik, D. Pumpernik, D. Lukman, Analysis of apparent 1/f α spectrum in DNA sequences, Europhys. Lett. 23 (1993) 389–394.[111] C.A. Chatzidimitriou-Dreismann, D. Larhammar, Long-range correlations in DNA, Nature 361 (1993) 212–213.[112] S. Karlin, V. Brendel, Patchiness and correlations in DNA sequences, Science 259 (1993) 677–680.[113] A. Arneodo, Y. d’Aubenton-Carafa, E. Bacry, P.V. Graves, J.-F. Muzy, C. Thermes, Wavelet based fractal analysis of DNA sequences, Physica D 96 (1996)

291–320.[114] G.M. Viswanathan, S.V. Buldyrev, S. Havlin, H.E. Stanley, Long-range correlationmeasures for quantifying patchiness: deviations fromuniformpower-

law scaling in genomic DNA, Physica A 249 (1998) 581–586.[115] P.J. Munson, R.C. Taylor, G.S. Michaels, DNA correlations, Nature 360 (1992) 636.[116] W. Li, K. Kaneko, DNA correlations, Nature 360 (1992) 635–636.[117] W. Li, K. Kaneko, Long-range correlation and partial 1/f α spectrum in a noncoding DNA-sequence, Europhys. Lett. 17 (1992) 655–660.[118] V.V. Prabhu, J.M. Claverie, Correlations in intronless DNA, Nature 359 (1992) 782.[119] B. Borštnik, The character of the correlations in DNA sequences, Int. J. Quantum Chem. 52 (1994) 457–463.[120] S.V. Buldyrev, A.L. Goldberger, S. Havlin, R.N. Mantegna, M.E. Matsa, C.-K. Peng, M. Simons, H.E. Stanley, Long-range correlation properties of coding

and noncoding DNA sequences: GenBank analysis, Phys. Rev. E 51 (1995) 5084–5091.[121] C.-K. Peng, S.V. Buldyrev, A.L. Goldberger, S. Havlin, M. Simons, H.E. Stanley, Finite-size effects on long-range correlations: implications for analysing

DNA sequences, Phys. Rev. E 47 (1993) 3730–3733.[122] C.L. Berthelsen, J.A. Glazier, S. Raghavachari, Effective multifractal spectrum of a random walk, Phys. Rev. E 49 (1994) 1860–1864.[123] W. Li, The study of correlation structures of DNA sequences: a critical review, Comput. Chem. 21 (1997) 257–271.[124] A. Arneodo, E. Bacry, P.V. Graves, J.-F. Muzy, Characterizing long-range correlations in DNA sequences from wavelet analysis, Phys. Rev. Lett. 74

(1995) 3293–3296.[125] K. Gardiner, Base composition and gene distribution: critical patterns in mammalian genome organization, Trends Genet. 12 (1996) 519–524.[126] W. Li, G. Stolovitzky, P. Bernaola-Galván, J.L. Oliver, Compositional heterogeneity within, and uniformity between, DNA sequences of yeast

chromosomes, Genome Res. 8 (1998) 916–928.[127] G. Bernardi, Isochores and the evolutionary genomics of vertebrates, Gene 241 (2000) 3–17.[128] D. Larhammar, C.A. Chatzidimitriou-Dreismann, Biological origins of long-range correlations and compositional variations in DNA, Nucleic Acids Res.

21 (1993) 5167–5170.[129] C.-K. Peng, S.V. Buldyrev, S. Havlin, M. Simons, H.E. Stanley, A.L. Goldberger, Mosaic organization of DNA nucleotides, Phys. Rev. E 49 (1994)

1685–1689.[130] B. Audit, Analyse statistique des séquences d’ADN par l’intermédiaire de la transformée en ondelettes, Ph.D. Thesis, Université de Paris VI Pierre et

Marie Curie, 1999.[131] C.L. Berthelsen, J.A. Glazier, M.H. Skolnick, Global fractal dimension of human DNA sequences treated as pseudorandomwalks, Phys. Rev. A 45 (1992)

8902–8913.[132] G.A. Dietler, Y.C. Zhang, Crossover from white noise to long-range correlated noise in DNA sequences and writings, Fractals 2 (1994) 473–479.[133] J.A. Glazier, S. Raghavachari, C.L. Berthelsen, M.H. Skolnick, Reconstructing phylogeny from the multifractal spectrum of mitochondrial DNA, Phys.

Rev. E 51 (1995) 2665–2668.[134] D.S. Goodsell, R.E. Dickerson, Bending and curvature calculations in B-DNA, Nucleic Acids Res. 22 (1994) 5497–5503.[135] A. Gabrielian, S. Pongor, Correlation of instrinsic DNA curvature with DNA property periodicity, FEBS Lett. 393 (1996) 65–68.[136] I. Brukner, R. Sanchez, D. Suck, S. Pongor, Trinucleotide models for DNA bending propensity: comparison of models based on DNase I digestion and

nucleosome packaging data, J. Biomol. Struct. Dynam. 13 (1995) 309–317.[137] C. Vaillant, B. Audit, C. Thermes, A. Arneodo, Formation and positioning of nucleosomes: effect of sequence-dependent long-range correlated

structural disorder, Eur. Phys. J. E 19 (2006) 263–277.[138] A. Arneodo, Y. d’Aubenton-carafa, B. Audit, E. Bacry, J.-F. Muzy, C. Thermes, What can we learn with wavelets about DNA sequences? Physica A 249

(1998) 439–448.[139] P. Allegrini, M. Barbi, P. Grigolini, B.J. West, Dynamical model for DNA sequences, Phys. Rev. E 52 (1995) 5281–5296.[140] A. Arneodo, Y. d’Aubenton-Carafa, B. Audit, E. Bacry, J.-F. Muzy, C. Thermes, Nucleotide composition effects on the long-range correlations in human

genes, Eur. Phys. J. B 1 (1998) 259–263.[141] S.V. Buldyrev, A.L. Goldberger, S. Havlin, C.-K. Peng, M. Simons, H.E. Stanley, Generalized Lévy-walk model for DNA nucleotide sequences, Phys. Rev.

E 47 (1993) 4514–4523.[142] S.V. Buldyrev, A.L. Goldberger, S. Havlin, H.E. Stanley, M.H.R. Stanley, M. Simons, Fractal landscapes and molecular evolution: modeling the myosin

heavy chain gene family, Biophys. J. 65 (1993) 2673–2679.[143] W.T. Li, T.G. Marr, K. Kaneko, Understanding long-range correlations in DNA-sequences, Physica D 75 (1994) 392–416.[144] N.V. Dokholyan, S.V. Buldyrev, S. Havlin, H.E. Stanley,Model of unequal chromosomal crossing over in DNA sequences, Physica A 249 (1998) 594–599.[145] A. Provata, Random aggregation models for the formation and evolution of coding and non-coding DNA, Physica A 264 (1999) 570–580.[146] G. Bernardi, The isochore organization of the human genome, Annu. Rev. Genet. 23 (1989) 637–661.[147] D. Mouchiroud, G. Bernardi, Compositional properties of coding sequences and mammalian phylogeny, J. Mol. Evol. 37 (1993) 109–116.[148] G. Bernardi, The human genome: organization and evolutionary history, Annu. Rev. Genet. 29 (1995) 445–476.[149] G. Bernardi, Misunderstandings about isochores. Part 1, Gene 276 (2001) 3–13.[150] E.S. Lander, et al., Initial sequencing and analysis of the human genomes, Nature 409 (2001) 860–921.[151] W. Li, P. Bernaola-Galván, P. Carpena, J.L. Oliver, Isochores merit the prefix iso, Comput. Biol. Chem. 27 (2003) 5–10.[152] L. Duret, A. Eyre-Walker, N. Galtier, A new perspective on isochore evolution, Gene 385 (2006) 71–74.[153] J.D. Watson, M. Gilman, J. Witkowski, H. Zoller, Recombinant DNA, Freeman, New York, 1992.[154] A. Danchin, Le séquençage des petits génomes, La Recherche 24 (1993) 222–232.[155] R.D. Fleischmann, M.D. Adams, O. White, R.A. Clayton, E.F. Kirkness, A.R. Kerlavage, C.J. Bult, J.F. Tomb, B.A. Dougherty, J.M. Merrick, Whole-genome

random sequencing and assembly of Haemophilus influenzae Rd, Science 269 (1995) 496–512.[156] A. Bernot, L’Analyse des Génomes, Nathan, Paris, 1996.[157] J. Carol, et al., Complete genome sequence of the methanogenic archæon, Methanococcus jannaschii, Science 273 (1996) 1058–1073.[158] B. Dujon, The yeast genome project: what did we learn? Trends. Genet. 12 (1996) 263–270.[159] F.R. Blattner, et al., The complete genome sequence of Escherichia coli K-12, Science 277 (1997) 1453–1474.[160] D. Graur, W.H. Li, Fundamentals of Molecular Evolution, Sinauer Associates, Sunderland, MA, 1999.[161] K. Liolios, K. Mavromatis, N. Tavernarakis, N.C. Kyrpides, The genomes on line database (gold) in 2007: status of genomic and metagenomic projects

and their associated metadata, Nucleic Acids Res. 36 (2008) D475–D479.[162] L.D. Murphy, S.B. Zimmerman, Stabilization of compact spermidine nucleoids from Escherichia coli under crowded conditions: implications for in

vivo nucleoid structures, J. Struct. Biol. 119 (1997) 336–346.[163] M.R. Starich, K. Sandman, J.N. Reeve, M.F. Summers, NMR structure of HMfB from the hyperthermophile, Methanothermus Fervidus, confirms that

this archaeal protein is a histone, J. Mol. Biol. 255 (1996) 187–203.

176 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[164] S.L. Pereira, R.A. Grayling, R. Lurz, J.N. Reeve, Archael nucleosomes, Proc. Natl. Acad. Sci. USA 94 (1997) 12633–12637.[165] J.N. Reeve, K. Sandman, C.J. Daniels, Archaeal histones nucleosomes and transcription initiation, Cell 89 (1997) 999–1002.[166] K.A. Bailey, S.L. Pereira, J. Widom, J.N. Reeve, Archaeal histone selection of nucleosome positioning sequences and the procaryotic origin of histone-

dependent genome evolution, J. Mol. Biol. 303 (2000) 25–34.[167] K. Decanniere, A. Babu, K. Sandman, J.N. Reeve, U. Heinemann, Crystal structures of recombinant histones HMfA and HMfB from the

hyperthermophilic archaeon Methanothermus Fervidus, J. Mol. Biol. 303 (2000) 35–47.[168] D. Musgrave, P. Forterre, A. Slesarev, Negative constrained DNA supercoiling in archaeal nucleosomes, Mol. Microbiol. 35 (2000) 341–349.[169] K. Sandman, J.N. Reeve, Structure and functional relationships of archaeal and eukaryal histones and nucleosomes,Microbiology 173 (2000) 165–169.[170] M. Coca-Prados, H.Y. Yu, M.T. Hsu, Intracellular forms of simian virus 40 nucleoprotein complexes. IV. Micrococcal nuclease digestion, J. Virol. 44

(1982) 603–609.[171] C.J. Marcus-Sekura, B.J. Carter, Chromatin-like structure of adeno-associated virus DNA in infected cells, J. Virol. 48 (1983) 79–87.[172] M.D. Challberg, T.J. Kelly, Animal virus DNA replication, Annu. Rev. Biochem. 58 (1989) 671–717.[173] S.L. Deshmane, N.W. Fraser, During latency herpes simplex virus type 1 DNA is associated with nucleosomes in a chromatin structure, J. Virol. 63

(1989) 943–947.[174] G.J. Olsen, C.R. Woese, Archaeal genomics: an overview, Cell 89 (1997) 991–994.[175] M.V. Borca, P.M. Irusta, G.F. Kutish, C. Carillo, C.L. Afonso, A.T. Burrage, J.G. Neilan, D.L. Rock, A structural DNA binding protein of african swine fever

virus with similarity to bacterial histone-like proteins, Arch. Virol. 141 (1996) 301–313.[176] S.A. Stanfield-Oakley, J.D. Griffith, Nucleosomal arrangement of HIV-1 DNA:maps generated from an integrated genome and an EBV-based episomal

model, J. Mol. Biol. 256 (1996) 503–516.[177] E.N. Trifonov, J.L. Sussman, The pitch of chromatin DNA is reflected in its nucleotide sequence, Proc. Natl. Acad. Sci. USA 77 (1980) 3816–3820.[178] H.R. Drew, A.A. Travers, DNA bending and its relation to nucleosome positioning, J. Mol. Biol. 186 (1985) 773–790.[179] T.E. Shrader, D.M. Crothers, Artificial nucleosome positioning sequences, Proc. Natl. Acad. Sci. USA 86 (1989) 7418–7422.[180] I. Ioshikhes, A. Bolshoy, E.N. Trifonov, Preferred positions of AA and TT dinucleotides in aligned nucleosomal DNA sequences, J. Biomol. Struct. Dynam.

9 (1992) 1111–1117.[181] M. Bina, Periodicity of dinucleotides in nucleosomes derived from simian virus 40 chromatin, J. Mol. Biol. 235 (1994) 198–208.[182] S. Muyldermans, A.A. Travers, DNA sequence organization in chromatosomes, J. Mol. Biol. 235 (1994) 855–870.[183] A. Bolshoy, CC dinucleotides contribute to the bending of DNA in chromatin, Nature Struct. Biol. 2 (1995) 446–448.[184] J. Widom, Short-range order in two eukaryotic genomes: relation to chromosome structure, J. Mol. Biol. 259 (1996) 579–588.[185] H.R.Widlund, H. Cao, S. Simonsson, E. Magnusson, T. Simonsson, P.E. Nielsen, J.D. Kahn, D.M. Crothers, M. Kubista, Identification and characterization

of genomic nucleosome-positioning sequences, J. Mol. Biol. 267 (1997) 807–817.[186] H. Herzel, O. Weiss, E.N. Trifonov, 10-11 bp periodicities in complete genomes reflect protein structure and DNA folding, Bioinformatics 15 (1999)

187–193.[187] A. Stein, M. Bina, A signal encoded in vertebrate DNA that influences nucleosome positioning and alignment, Nucleic Acids Res. 27 (1999) 848–853.[188] A. Thaström, P.T. Lowary, H.R. Widlund, H. Cao, M. Kubista, J. Widom, Sequence motifs and free energies of selected natural and non-natural

nucleosome positioning DNA sequences, J. Mol. Biol. 288 (1999) 213–219.[189] A.B. Cohanim, Y. Kashi, E.N. Trifonov, Yeast nucleosome DNA pattern: deconvolution from genome sequences of S. cerevisiae, J. Biomol. Struct. Dyn.

22 (2005) 687–694.[190] P.T. Lowary, J. Widom, Nucleosome packaging and nucleosome positioning of genomic DNA, Proc. Natl. Acad. Sci. USA 94 (1997) 1183–1188.[191] C. Vaillant, B. Audit, A. Arneodo, Thermodynamics of DNA loops with long-range correlated structural disorder, Phys. Rev. Lett. 95 (2005) 068101.[192] T.J. Richmond, J.T. Finch, B. Rushton, D. Rhodes, A. Klug, Structure of the nucleosome core particle at 7Å resolution, Nature 311 (1984) 532–537.[193] J. Widom, A. Klug, Structure of the 300Å chromatin filament: X-ray diffraction from oriented samples, Cell 43 (1985) 207–213.[194] A. Grosberg, Y. Rabin, S. Havlin, A. Neer, Crumpled globule model of the three-dimensional structure of DNA, Europhys. Lett. 23 (1993) 373–378.[195] D.E. Pettijohn, Escherichia coli and salmonella, in: F. Neidhardt, R. Curtiss, J.L. Ingraham, E.C.C. Lin, K.B. Low, B. Magasanik, et al. (Eds.), The Nucleoid,

AMS Press, Washington, DC, 1996, pp. 158–166.[196] A.M. Segall, S.D. Goodman, H.A. Nash, Architectural elements in nucleoprotein complexes: interchangeability of specific and non-specific DNA

binding proteins, EMBO J. 13 (1994) 4536–4548.[197] N.P. Higgins, X. Yang, Q. Fu, J.R. Roth, Surveying a supercoil domain by using the γ δ resolution system in Salmonella typhimurium, J. Bacteriol. 178

(1996) 2825–2835.[198] G.J. Pruss, K. Drlica, DNA supercoiling and prokaryotic transcription, Cell 56 (1989) 521–523.[199] R. Kanaar, A. Klippel, E. Shekhtman, J.M. Dungan, R. Khamann, N.R. Cozzarelli, Processive recombination by the phage Mu Gin system: implications

for the mechanisms of DNA strand exchange, DNA site alignment, and enhancer action, Cell 27 (1990) 353–366.[200] H.A. Nash, Bending and supercoiling of DNA at the attachment site of bacteriophage lambda, Trends Biochem. Sci. 15 (1990) 222–227.[201] G.-C. Yuan, Y.-J. Liu, M.F. Dion, M.D. Slack, L.F. Wu, S.J. Altschuler, O.J. Rando, Genome-scale identification of nucleosome positions in S. cerevisiae,

Science 309 (2005) 626–630.[202] A. Fire, R. Alcazar, F. Tan, Unusual DNA structures associatedwith germline genetic activity in Caenorhabditis elegans, Genetics 173 (2006) 1259–1273.[203] S.M. Johnson, F.J. Tan, H.L. Mccullough, D.P. Riordan, A.Z. Fire, Flexibility and constraint in the nucleosome core landscape of Caenorhabditis elegans

chromatin, Genome Res. 16 (2006) 1505–1516.[204] I. Albert, T.N. Mavrich, L.P. Tomsho, J. Qi, S.J. Zanton, S.C. Schuster, B.F. Pugh, Translational and rotational settings of H2A.Z nucleosomes across the

Saccharomyces cerevisiae genomes, Nature 446 (2007) 572–576.[205] W. Lee, D. Tillo, N. Bray, R.H. Morse, R.W. Davis, T.R. Hughes, C. Nislow, A high-resolution atlas of nucleosome occupancy in yeast, Nat. Genet. 39

(2007) 1235–1244.[206] Y. Mito, J.G. Henikoff, S. Henikoff, Histone replacement marks the boundaries of cis-regulatory domains, Science 315 (2007) 1408–1411.[207] F. Ozsolak, J.S. Song, X.S. Liu, D.E. Fisher, High-throughput mapping of the chromatin structure of human promoters, Nat. Biotechnol. 25 (2007)

244–248.[208] I. Whitehouse, O.J. Rando, J. Delrow, T. Tsukiyama, Chromatin remodelling at promoters suppresses antisense transcription, Nature 450 (2007)

1031–1036.[209] Y. Field, N. Kaplan, Y. Fondufe-Mittendorf, I.K. Moore, E. Sharon, Y. Lubling, J. Widom, E. Segal, Distinct modes of regulation by chromatin encoded

through nucleosome positioning signals, PLoS Comput. Biol. 4 (2008) e1000216.[210] T.N. Mavrich, I.P. Ioshikhes, B.J. Venters, C. Jiang, L.P. Tomsho, J. Qi, S.C. Schuster, I. Albert, B.F. Pugh, A barrier nucleosome model for statistical

positioning of nucleosomes throughout the yeast genome, Genome Res. 18 (2008) 1073–1083.[211] T.N. Mavrich, C. Jiang, I.P. Ioshikhes, X. Li, B.J. Venters, S.J. Zanton, L.P. Tomsho, J. Qi, R.L. Glaser, S.C. Schuster, D.S. Gilmour, I. Albert, B.F. Pugh,

Nucleosome organization in the Drosophila genome, Nature 453 (2008) 358–362.[212] D.E. Schones, K. Cui, S. Cuddapah, T.-Y. Roh, A. Barski, Z. Wang, G.Wei, K. Zhao, Dynamic regulation of nucleosome positioning in the human genome,

Cell 132 (2008) 887–898.[213] S. Shivaswamy, A. Bhinge, Y. Zhao, S. Jones,M. Hirst, V.R. Iyer, Dynamic remodeling of individual nucleosomes across a eukaryotic genome in response

to transcriptional perturbation, PLoS Biol. 6 (2008) e65.[214] A. Valouev, J. Ichikawa, T. Tonthat, J. Stuart, S. Ranade, H. Peckham, K. Zeng, J.A. Malek, G. Costa, K. Mckernan, A. Sidow, A. Fire, S.M. Johnson, A. High-

resolution, nucleosome position map of C. elegans reveals a lack of universal sequence-dictated positioning, Genome Res. 18 (2008) 1051–1063.[215] Y. Field, Y. Fondufe-Mittendorf, I.K. Moore, P. Mieczkowski, N. Kaplan, Y. Lubling, J.D. Lieb, J. Widom, E. Segal, Gene expression divergence in yeast is

coupled to evolution of DNA-encoded nucleosome organization, Nat. Genet. 41 (2009) 362–366.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 177

[216] N. Kaplan, I.K. Moore, Y. Fondufe-Mittendorf, A.J. Gossett, D. Tillo, Y. Field, E.M. Leproust, T.R. Hughes, J.D. Lieb, J. Widom, E. Segal, The DNA-encodednucleosome organization of a eukaryotic genome, Nature 458 (2009) 362–366.

[217] Y. Zhang, Z. Moqtaderi, B.P. Rattner, G. Euskirchen, M. Snyder, J.T. Kadonaga, X.S. Liu, K. Struhl, Intrinsic histone-DNA interactions are not the majordeterminant of nucleosome positions in vivo, Nat. Struct. Mol. Biol. 16 (2009) 847–852.

[218] C.L. Woodcock, A.I. Skoultchi, Y. Fan, Role of linker histone in chromatin structure and function: H1 stoichiometry and nucleosome repeat lengths,Chromosome Res. 14 (2006) 17–25.

[219] I.P. Ioshikhes, I. Albert, S.J. Zanton, B.F. Pugh, Nucleosome positions predicted through comparative genomics, Nat. Genet. 38 (2006) 1210–1215.[220] E. Segal, Y. Fondufe-Mittendorf, L. Chen, A. Thaström, Y. Field, I.K. Moore, J.-P.Z. Wang, J. Widom, A genomic code for nucleosome positioning, Nature

442 (2006) 772–778.[221] H.E. Peckham, R.E. Thurman, Y. Fu, J.A. Stamatoyannopoulos,W.S. Noble, K. Struhl, Z.Weng, Nucleosomepositioning signals in genomic DNA, Genome

Res. 17 (2007) 1170–1177.[222] G.-C. Yuan, J.S. Liu, Genomic sequence is highly predictive of local nucleosome depletion, PLoS Comput. Biol. 4 (2008) e13.[223] L. Tonks, The complete equation of state of one, two and three-dimensional gases of hard elastic spheres, Phys. Rev. 50 (1936) 955–963.[224] J.K. Percus, Equilibrium state of a classical fluid of hard rods in an external-field, J. Stat. Phys. 15 (1976) 505–511.[225] T.K. Vanderlick, L.E. Scriven, H.T. Davis, Solution of Percus equation for the density of hard-rods in an external-field, Phys. Rev. A 34 (1986) 5130–5131.[226] R.D. Kornberg, L. Stryer, Statistical distributions of nucleosomes: nonrandom locations by a stochastic mechanism, Nucleic Acids Res. 16 (1988)

6677–6690.[227] Y. Bao, C.L. White, K. Luger, Nucleosome core particles containing a poly(dA.dT) sequence element exhibit a locally distorted DNA structure, J. Mol.

Biol. 361 (2006) 617–624.[228] E. Segal, J. Widom, Poly(dA:dT) tracts: major determinants of nucleosome organization, Current Opin. Struct. Biol. 19 (2009) 65–71.[229] V. Iyer, K. Struhl, Poly(dA:dT) a ubiquitous promoter element that stimulates transcription via its intrinsic DNA structure, EMBO J. 14 (1995)

2570–2579.[230] B. Suter, G. Schnappauf, F. Thoma, Poly(dAdT) sequences exist as rigid DNA structures in nucleosome-free yeast promoters in vivo, Nucleic Acids Res.

28 (2000) 4083–4089.[231] R.-H. Pusarla, V. Vinayachandran, P. Bhargava, Nucleosome positioning in relation to nucleosome spacing and DNA sequence-specific binding of a

protein, FEBS J. 274 (2007) 2396–2410.[232] I. Whitehouse, T. Tsukiyama, Antagonistic forces that position nucleosomes in vivo, Nat. Struct. Mol. Biol. 13 (2006) 633–640.[233] C.R. Clapier, B.R. Cairns, The biology of chromatin remodeling complexes, Annu. Rev. Biochem. 78 (2009) 273–304.[234] V. Teif, K. Rippe, Predicting nucleosome positions on the DNA: combining intrinsic sequence preferences and remodeler activities, Nucleic Acids Res.

37 (2009) 5641–5655.[235] Y. Cui, C. Bustamante, Pulling a single chromatin fiber reveals the forces that maintain its higher-order structure, Proc. Natl. Acad. Sci. USA 97 (2000)

127–132.[236] S. Mangenot, A. Leforestier, P. Vachette, D. Durand, F. Livolant, Salt-induced conformation and interaction changes of nucleosome core particles,

Biophys. J. 82 (2002) 345–356.[237] A. Bertin, A. Leforestier, D. Durand, F. Livolant, Role of histone tails in the conformation and interactions of nucleosome core particles, Biochemistry

43 (2004) 4773–4780.[238] F.J. Solis, R. Bash, J. Yodh, S.M. Lindsay, D. Lohr, A statistical thermodynamic model applied to experimental AFM population and location data is able

to quantify DNA-histone binding strength and internucleosomal interaction differences between acetylated and unacetylated nucleosomal arrays,Biophys. J. 87 (2004) 3372–3387.

[239] M. Shogren-Knaak, H. Ishii, J.-M. Sun, M.J. Pazin, J.R. Davie, C.L. Peterson, Histone H4-K16 acetylation controls chromatin structure and proteininteractions, Science 311 (2006) 844–847.

[240] V.Miele, C. Vaillant, Y. d’Aubenton-Carafa, C. Thermes, T. Grange, DNAphysical properties determine nucleosomeoccupancy fromyeast to fly, NucleicAcids Res. 36 (2008) 3746–3756.

[241] L. David, W. Huber, M. Granovskaia, J. Toedling, C.J. Palm, L. Bofkin, T. Jones, R.W. Davis, L.M. Steinmetz, A high-resolution map of transcription inthe yeast genome, Proc. Natl. Acad. Sci. USA 103 (2006) 5320–5325.

[242] E.J. Steinmetz, C.L. Warren, J.N. Kuehner, B. Panbehi, A.Z. Ansari, D.A. Brow, Genome-wide distribution of yeast RNA polymerase II and its control bySen1 helicase, Mol. Cell 24 (2006) 735–746.

[243] I. Tirosh, N. Barkai, Two strategies for gene regulation by promoter nucleosomes, Genome Res. 18 (2008) 1084–1091.[244] I. Steinfeld, R. Shamir, M. Kupiec, A genome-wide analysis in Saccharomyces cerevisiae demonstrates the influence of chromatin modifiers on

transcription, Nat. Genet. 39 (2007) 303–309.[245] P.D. Hartley, H.D. Madhani, Mechanisms that specify promoter nucleosome location and identity, Cell 137 (2009) 445–458.[246] E. Segal, J. Widom, From DNA sequence to transcriptional behaviour: a quantitative approach, Nat. Rev. Genet. 10 (2009) 443–456.[247] M. Kellis, N. Patterson, M. Endrizzi, B. Birren, E.S. Lander, Sequencing and comparison of yeast species to identify genes and regulatory elements,

Nature 423 (2003) 241–254.[248] S. Washietl, R. Machné, N. Goldman, Evolutionary footprints of nucleosome positions in yeast, Trends Genet. 24 (2008) 583–587.[249] B.J. Venters, B.F. Pugh, How eukaryotic genes are transcribed, Crit. Rev. Biochem. Mol. Biol. 44 (2009) 117–141.[250] S. Mihardja, A.J. Spakowitz, Y. Zhang, C. Bustamante, Effect of force on mononucleosomal dynamics, Proc. Natl. Acad. Sci. USA 103 (2006)

15871–15876.[251] B. Ladoux, J.P. Quivy, P. Doyle, O.du. Roure, G. Almouzni, J.L. Viovy, Fast kinetics of chromatin assembly revealed by single-molecule videomicroscopy

and scanning force microscopy, Proc. Natl. Acad. Sci. USA 97 (2000) 14251–14256.[252] M.L. Bennink, S.H. Leuba, G.H. Leno, J. Zlatanova, B.G.de. Grooth, J. Greve, Unfolding individual nucleosomes by stretching single chromatin fibers

with optical tweezers, Nat. Struct. Biol. 8 (2001) 606–610.[253] B.D. Brower-Toland, C.L. Smith, R.C. Yeh, J.T. Lis, C.L. Peterson, M.D. Wang, Mechanical disruption of individual nucleosomes reveals a reversible

multistage release of DNA, Proc. Natl. Acad. Sci. USA 99 (2002) 1960–1965.[254] C. Claudet, D. Angelov, P. Bouvet, S. Dimitrov, J. Bednar, Histone octamer instability under single molecule experiment conditions, J. Biol. Chem. 280

(2005) 19958–19965.[255] P. Ranjith, J. Yan, J.F. Marko, Nucleosome hopping and sliding kinetics determined from dynamics of single chromatin fibers in Xenopus egg extracts,

Proc. Natl. Acad. Sci. USA 104 (2007) 13649–13654.[256] M. Kruithof, F.-T. Chien, A. Routh, C. Logie, D. Rhodes, J. van Noort, Single-molecule force spectroscopy reveals a highly compliant helical folding for

the 30-nm chromatin fibers, Nat. Struct. Mol. Biol. 16 (2009) 534–540.[257] M.D. Wang, M.J. Schnitzer, H. Yin, R. Landick, J. Gelles, S.M. Block, Force and velocity measured for single molecules of RNA polymerase, Science 282

(1998) 902–907.[258] M.A. Hall, A. Shundrovsky, L. Bai, R.M. Fulbright, J.T. Lis, M.D. Wang, High-resolution dynamic mapping of histone-DNA interactions in a nucleosome,

Nat. Struct. Mol. Biol. 16 (2009) 124–129.[259] T.R. Strick, M.N. Dessinges, G. Charvin, N.H. Dekker, J.F. Allemand, D. Bensimon, V. Croquette, Stretching of macromolecules and proteins, Rep. Progr.

Phys. 66 (2003) 1–45.[260] T. Lionnet, A. Dawid, S. Bigot, F.-X. Barre, O.A. Saleh, F. Heslot, J.-F. Allemand, D. Bensimon, V. Croquette, DNA mechanics as a tool to probe helicase

and translocase activity, Nucleic Acids Res. 34 (2006) 4232–4244.[261] H. Boeger, J. Griesenbeck, R.D. Kornberg, Nucleosome retention and the stochastic nature of promoter chromatin remodeling for transcription, Cell

133 (2008) 716–726.

178 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[262] F.H. Lam, D.J. Steger, E.K. O’Shea, Chromatin decouples promoter threshold from dynamic range, Nature 453 (2008) 246–250.[263] M. Radman-Livaja, O.J. Rando, Nucleosome positioning: how is it established, and why does it matter?, Dev. Biol. 339 (2010) 258–266.[264] R.H. Morse, Transcription factor access to promoter elements, J. Cell. Biochem. 102 (2007) 560–570.[265] M.F. Dion, T. Kaplan, M. Kim, S. Buratowski, N. Friedman, O.J. Rando, Dynamics of replication-independent histone turnover in budding yeast, Science

315 (2007) 1405–1408.[266] H. Zhang, D.N. Roberts, B.R. Cairns, Genome-wide dynamics of Htz1, a histone H2A variant that poises repressed/basal promoters for activation

through histone loss, Cell 123 (2005) 219–231.[267] C.L.Woodcock, S.A. Grigoryev, R.A. Horowitz, N.Whitaker, A chromatin foldingmodel that incorporates linker variability generates fibers resembling

the native structures, Proc. Natl. Acad. Sci. USA 90 (1993) 9021–9025.[268] A. Lesne, J.-M. Victor, Chromatin fiber functional organization: some plausible models, Eur. Phys. J. E 19 (2006) 279–290.[269] C. Wu, A. Bassett, A. Travers, A variable topology for the 30-nm chromatin fiber, EMBO Rep. 8 (2007) 1129–1134.[270] N. Kepper, D. Foethke, R. Stehr, G. Wedemann, K. Rippe, Nucleosome geometry and internucleosomal interactions control the chromatin fiber

conformation, Biophys. J. 95 (2008) 3692–3705.[271] A. Routh, S. Sandin, D. Rhodes, Nucleosome repeat length and linker histone stoichiometry determine chromatin fiber structure, Proc. Natl. Acad. Sci.

USA 105 (2008) 8872–8877.[272] R. Stehr, N. Kepper, K. Rippe, G. Wedemann, The effect of internucleosomal interaction on folding of the chromatin fiber, Biophys. J. 95 (2008)

3677–3691.[273] J.-P. Wang, Y. Fondufe-Mittendorf, L. Xi, G.-F. Tsai, E. Segal, J. Widom, Preferentially quantized linker DNA lengths in Saccharomyces cerevisiae, PLoS

Comput. Biol. 4 (2008) e1000175.[274] A. Bassett, S. Cooper, C. Wu, A. Travers, The folding and unfolding of eukaryotic chromatin, Current Opin. Genetics Dev. 19 (2009) 159–165.[275] P.M. Diesinger, D.W. Heermann, Depletion effects massively change chromatin properties and influence genome folding, Biophys. J. 97 (2009)

2146–2153.[276] G.A. Hartzog, Transcription elongation by RNA polymerase II, Current Opin. Genetics Dev. 13 (2003) 119–126.[277] C. Lavelle, A. Prunell, Chromatin polymorphism and the nucleosome superfamily: a genealogy, Cell Cycle 6 (2007) 2113–2119.[278] A. Bancaud, G. Wagner, N.E. Conde silva, C. Lavelle, H. Wong, J. Mozziconacci, M. Barbi, A. Sivolob, E. Le Cam, L. Mouawal, J.-L. Viovy, J.-M. Victor, A.

Prunell, Nucleosome chiral transition under positive torsional stress in single chromatin fibers, Mol. Cell. 27 (2007) 135–147.[279] J. Parenteau, M. Durand, S. Véronneau, A.-A. Lacombe, G. Morin, V. Guérin, B. Cecez, J. Gervais-Bird, C.-S. Koh, D. Brunelle, R.J. Wellinger, B. Chabot,

S.A. Elela, Deletion of many yeast introns reveals a minority of genes that require splicing for function, Mol. Biol. Cell 19 (2008) 1932–1941.[280] A.R. Kornblihtt, Chromatin transcript elongation and alternative splicing, Nat. Struct. Mol. Biol. 13 (2006) 5–7.[281] E. Allemand, E. Batsché, C. Muchardt, Splicing transcription and chromatin: a ménage à trois, Current Opin. Genetics Dev. 18 (2008) 145–151.[282] B.E. Bernstein, C.L. Liu, E.L. Humphrey, E.O. Perlstein, S.L. Schreiber, Global nucleosome occupancy in yeast, Genome Biol. 5 (2004) R62.[283] C.-K. Lee, Y. Shibata, B. Rao, B.D. Strahl, J.D. Lieb, Evidence for nucleosome depletion at active regulatory regions genome-wide, Nat. Genet. 36 (2004)

900–905.[284] B. Li, M. Carey, J.L. Workman, The role of chromatin during transcription, Cell 128 (2007) 707–719.[285] O.J. Rando, K. Ahmad, Rules and regulation in the primary structure of chromatin, Current Opin. Cell. Biol. 19 (2007) 250–256.[286] E. Segal, J. Widom, What control nucleosome positions?, Trends Genet. 25 (2009) 335–343.[287] R.T. Koerber, H.S. Rhee, C. Jiang, B.F. Pugh, Interaction of transcriptional regulators with specific nucleosomes across the Saccharomyces genomes,

Mol. Cell. 35 (2009) 889–902.[288] T. Raveh-Sadka, M. Levo, E. Segal, Incorporating nucleosomes into thermodynamic models of transcription regulation, Genome Res. 19 (2009)

1480–1496.[289] C.C. Adams, J.L. Workman, Binding of disparate transcriptional activators to nucleosomal DNA is inherently cooperative, Mol. Cell. Biol. 15 (1995)

1405–1421.[290] K.J. Polach, J. Widom, A model for the cooperative binding of eukaryotic regulatory proteins to nucleosomal target sites, J. Mol. Biol. 258 (1996)

800–812.[291] S. Vashee, K. Melcher, W.V. Ding, S.A. Johnston, T. Kodadek, Evidence for two modes of cooperative DNA binding in vivo that do not involve direct

protein–protein interactions, Curr. Biol. 8 (1998) 452–458.[292] J.A. Miller, J. Widom, Collaborative competition mechanism for gene activation in vivo, Mol. Cell. Biol. 23 (2003) 1623–1632.[293] T. Tsukiyama, C. Wu, Chromatin remodeling and transcription, Current Opin. Genetics Dev. 7 (1997) 182–191.[294] G. Längst, E.J. Bonte, D.F. Corona, P.B. Becker, Nucleosome movement by CHRAC and ISWI without disruption or trans-displacement of the histone

octamer, Cell 97 (1999) 843–852.[295] Y. Lorch, M. Zhang, R.D. Kornberg, Histone octamer transfer by a chromatin-remodeling complex, Cell 96 (1999) 389–392.[296] A. Travers, An engine for nucleosome remodeling, Cell 96 (1999) 311–314.[297] I. Whitehouse, A. Flaus, B.R. Cairns, M.F. White, J.L. Workman, T. Owen-Hughes, Nucleosome mobilization catalysed by the yeast SWI/SNF complex,

Nature 400 (1999) 784–787.[298] C.L. Peterson, J.L.Workman, Promoter targeting and chromatin remodeling by the SWI/SNF complex, Current Opin. Genetics Dev. 10 (2000) 187–192.[299] A. Hamiche, J.G. Kang, C. Dennis, H. Xiao, C. Wu, Histone tails modulate nucleosome mobility and regulate ATP-dependent nucleosome sliding by

NURF, Proc. Natl. Acad. Sci. USA 98 (2001) 14316–14321.[300] D. Angelov, A. Molla, P.-Y. Perche, F. Hans, J. Cote, S. Khochbin, P. Bouvet, S. Dimitrov, The histone variant macroH2A interferes with transcription

factor binding and SWI/SNF nucleosome remodeling, Mol. Cell 11 (2003) 1033–1041.[301] F.Montel, E. Fontaine, P. St-Jean,M. Castelnovo, C. Faivre-Moskalenko, Atomic forcemicroscopy imaging of SWI/SNF action:mapping the nucleosome

remodeling and sliding, Biophys J. 93 (2007) 566–578.[302] K. Rippe, A. Schrader, P. Riede, R. Strohner, E. Lehmann, G. Längst, DNA sequence- and conformation-directed positioning of nucleosomes by

chromatin-remodeling complexes, Proc. Natl. Acad. Sci. USA 104 (2007) 15635–15640.[303] S.B. Smith, L. Finzi, C. Bustamante, Direct mechanical measurements of the elasticity of single DNA molecules by using magnetic beads, Science 258

(1992) 1122–1126.[304] T.T. Perkins, S.R. Quake, D.E. Smith, S. Chu, Relaxation of a single DNA molecule observed by optical microscopy, Science 264 (1994) 822–826.[305] P. Cluzel, A. Lebrun, C. Heller, R. Lavery, J.L. Viovy, D. Chatenay, F. Caron, DNA: an extensible molecule, Science 271 (1996) 792–794.[306] S.B. Smith, Y. Cui, C. Bustamante, Overstretching B-DNA: the elastic response of individual double-stranded and single-stranded DNA molecules,

Science 271 (1996) 795–799.[307] T.R. Strick, J.F. Allemand, D. Bensimon, A. Bensimon, V. Croquette, The elasticity of a single supercoiledDNAmolecules, Science 271 (1996) 1835–1837.[308] M.D. Wang, H. Yin, R. Landick, J. Gelles, S.M. Block, Stretching DNA with optical tweezers, Biophys. J. 72 (1997) 1335–1346.[309] J.F. Allemand, D. Bensimon, R. Lavery, V. Croquette, Stretched and overwound DNA forms a Pauling-like structure with exposed bases, Proc. Natl.

Acad. Sci. USA 95 (1998) 14152–14157.[310] J.F. Léger, G. Romano, A. Sarkar, J. Robert, L. Bourdieu, D. Chatenay, J.F. Marko, Structural transitions of a twisted and stretched DNA molecule, Phys.

Rev. Lett. 83 (1999) 1066–1069.[311] O. Kratky, G. Porod, X-ray investigation of dissolved chain molecules, Recueil: J. Roy. Netherlands Chem. Soc. 68 (1949) 1106–1123.[312] J.A. Schellman, Flexibility of DNA, Biopolymers 13 (1974) 217–226.[313] C. Bustamante, J.F. Marko, E.D. Siggia, S. Smith, Entropic elasticity of lambda-phage DNA, Science 265 (1994) 1599–1600.[314] A.Y. Grossberg, A.R. Khoklov, in: R. Larson, P.A. Pincus (Eds.), Statistical Physics of Macromolecules, AIP Series in Polymers and Complex Materials,

AIP Press, Woodbury, 1994.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 179

[315] A. Vologoskii, DNA extension under the action of an external forces, Macromolecules 27 (1994) 5623–5625.[316] J.F. Marko, E.D. Siggia, Stretching DNA, Macromolecules 28 (1995) 8759–8770.[317] C. Bouchiat, M.D. Wang, J.F. Allemand, T. Strick, S.M. Block, V. Croquette, Estimating the persistence length of a worm-like chain molecule from

force-extension measurements, Biophys. J. 76 (1999) 409–413.[318] J.F. Marko, E.D. Siggia, Statistical mechanics of supercoiled DNA, Phys. Rev. E 52 (1995) 2912–2938.[319] J.D. Moroz, P. Nelson, Torsional directed walks, entropic elasticity and DNA twist stiffness, Proc. Natl. Acad. Sci. USA 94 (1997) 14418–14422.[320] A.V. Vologodskii, J.F. Marko, Extension of torsionally stressed DNA by external force, Biophys. J. 73 (1997) 123–132.[321] C. Bouchiat, M. Mezard, Elasticity model of a supercoiled DNA molecule, Phys. Rev. Lett. 80 (1998) 1556–1559.[322] C. Bouchiat, M. Mezard, Elastic rod model of a supercoiled DNA molecule, Eur. Phys. J. E 2 (2000) 377–402.[323] E.N. Trifonov, R.K.Z. Tan, S.C. Harvey, in: W.K. Olson, M.H. Sarma, M. Sundaralingam (Eds.), DNA Bending Curvature, Academic Press, Schenectady,

1987, p. 243.[324] J.A. Schellman, S.C. Harvey, Static contributions to the persistence length of DNA and dynamic contributions to DNA curvature, Biophys. Chem. 55

(1995) 95–114.[325] V. Katritch, A. Vologodskii, The effect of intrinsic curvature on conformational properties of circular DNA, Biophys. J. 72 (1997) 1070–1079.[326] L. Song, J.M. Schurr, Dynamic bending rigidity of DNA, Biopolymers 30 (1990) 229–237.[327] J. Bednar, P. Furrer, V. Katritch, A.Z. Stasiak, J. Dubochet, A. Stasiak, Determination of DNA persistence length by cryo-electronmicroscopy: separation

of the static and dynamic contributions to the apparent persistence length of DNA, J. Mol. Biol. 254 (1995) 579–594.[328] P. Furrer, J. Bednar, A.Z. Stasiak, V. Katritch, D. Michoud, A. Stasiak, J. Dubochet, Opposite effect of counterions on the persistence length of nicked

and non-nicked DNA, J. Mol. Biol. 266 (1997) 711–721.[329] M. Vologodskaia, A. Vologodskii, Contribution of the intrinsic curvature to measured DNA persistence length, J. Mol. Biol. 317 (2002) 205–213.[330] P. Nelson, Sequence-disorder effects on DNA entropic elasticity, Phys. Rev. Lett. 80 (1998) 5810–5812.[331] D. Bensimon, D. Dohmi, M. Mézard, Stretching a heteropolymers, Europhys. Lett. 42 (1998) 97–102.[332] C. Vaillant, B. Audit, C. Thermes, A. Arneodo, Influence of the sequence on the elastic properties of long DNA chains, Phys. Rev. E 67 (2003) 032901.[333] J.C. Marini, S.D. Levene, D.M. Crothers, P.T. Englund, A bent helix in kinetoplast DNA, Cold Spring Harb. Symp. Quant. Biol. 47 (1983) 279–283.[334] P.J. Hagerman, Flexibility of DNA, Annu. Rev. Biophys. Biophys. Chem. 17 (1988) 265–286.[335] D.M. Crothers, T.E. Haran, J.G. Nadeau, Intrinsically bent DNA, J. Biol. Chem. 265 (1990) 7093–7096.[336] Y. Lyubchenko, L. Shlyakhtenko, R. Harrington, P. Oden, S. Lindsay, Atomic force microscopy of long DNA: imaging in air and under water, Proc. Natl.

Acad. Sci. USA 90 (1993) 2137–2140.[337] H.G. Hansma, I. Revenko, K. Kim, D.E. Laney, Atomic forcemicroscopy of long and short double-stranded, single-stranded and triple-stranded nucleic

acids, Nucleic Acids Res. 24 (1996) 713–720.[338] C. Rivetti, M. Guthold, C. Bustamante, Scanning force microscopy of DNA deposited onto mica: equilibration versus kinetic trapping studied by

statistical polymer chain analysis, J. Mol. Biol. 264 (1996) 919–932.[339] J.A. Cognet, C. Pakleza, D. Cherny, E. Delain, E. Le Cam, Static curvature and flexibility measurements of DNA with microscopy. A simple

renormalization method, its assessment by experiment and simulation, J. Mol. Biol. 285 (1999) 997–1009.[340] C. Rivetti, C. Walker, C. Bustamante, Polymer chain statistics and conformational analysis of DNA molecules with bends or sections of different

flexibility, J. Mol. Biol. 280 (1998) 41–59.[341] C. Anselmi, P. Desantis, A. Scipioni, Nanoscale mechanical and dynamical properties of DNA single molecules, Biophys. Chem. 113 (2005) 209–221.[342] G. Zuccheri, A. Scipioni, V. Cavaliere, G. Gargiulo, P. De santis, B. Samorì, Mapping the intrinsic curvature and flexibility along the DNA chain, Proc.

Natl. Acad. Sci. USA 98 (2001) 3074–3079.[343] A. Scipioni, C. Anselmi, G. Zuccheri, B. Samori, P. De santis, Sequence-dependent DNA curvature and flexibility from scanning force microscopy

images, Biophys. J. 83 (2002) 2408–2418.[344] A. Scipioni, G. Zuccheri, C. Anselmi, A. Bergia, B. Samorì, P. De santis, Sequence-dependent DNAdynamics by scanning forcemicroscopy time-resolved

imaging, Chem. Biol. 9 (2002) 1315–1321.[345] M. Marilley, A. Sanchez-Sevilla, J. Rocca-Serra, Fine mapping of inherent flexibility variation along DNA molecules: validation by atomic force

microscopy (AFM) in buffer, Mol. Genet. Genomics 274 (2005) 658–670.[346] J. Moukhtar, Effet de séquence sur les propriétés thermodynamiques de brins d’ADN: de la théorie à l’expérience, Ph.D. Thesis, Ecole Normale

Supérieure de Lyon, 2008.[347] B. Bartosch, A. Vitelli, C. Garnier, C. Goujon, J. Dubuisson, S. Pascale, E. Scarselli, R. Cortese, A. Nicosia, F.L. Cosset, Cell entry of hepatitis C viruse

requires a set of co-receptor thats include the CD81 tetraspanin and the SR-B1 scavenger receptor, J. Biol. Chem. 278 (2003) 41624–41630.[348] D. Pastré, O. Piétrement, S. Fusil, F. Landousy, J. Jeusset, M.-O. David, L. Hamon, E. Le Cam, A. Zozime, Adsorption of DNA tomicamediated by divalent

counterions: a theoretical and experimental study, Biophys. J. 85 (2003) 2507–2518.[349] M.L. Sushko, A.L. Shluger, C. Rivetti, Simple model for DNA adsorption onto a mica surface in 1:1 and 2:1 electrolyte solutions, Langmuir 22 (2006)

7678–7688.[350] C. Rivetti, S. Codeluppi, Accurate length determination of DNA molecules visualized by atomic force microscopy: evidence for a partial B- to A-form

transition on mica, Ultramicroscopy 87 (2001) 55–66.[351] A. Sanchez-Sevilla, J. Thimonier, M. Marilley, J. Rocca-Serra, J. Barbet, Accuracy of AFM measurements of the contour length of DNA fragments

adsorbed on mica in air and in aqueous buffer, Ultramicroscopy 92 (2002) 151–158.[352] O. Piétrement, D. Pastré, S. Fusil, J. Jeusset, M.-O. David, F. Landousy, L. Hamon, A. Zozime, E. Le Cam, Reversible binding of DNA on NiCl2-treated

mica by varying the ionic strengths, Langmuir 19 (2003) 2536–2539.[353] L.S. Shlyakhtenko, A.Y. Lushnikov, Y.L. Lyubchenko, Dynamics of nucleosomes revealed by time-lapse atomic force microscopy, Biochemistry 48

(2009) 7842–7848.[354] A.K. Mazur, Evaluation of elastic properties of atomistic DNA models, Biophys. J. 91 (2006) 4507–4518.[355] A. Podesta, M. Indrieri, D. Brogioli, G.S. Manning, P. Milani, R. Guerra, L. Finzi, D. Dunlap, Positively charged surfaces increase the flexibility of DNA,

Biophys. J. 89 (2005) 2558–2563.[356] F. Moreno-Herrero, R. Seidel, S.M. Johnson, A. Fire, N.H. Dekker, Structural analysis of hyperperiodic DNA from Caenorhabditis elegans, Nucleic Acids

Res. 34 (2006) 3057–3066.[357] Z. Kam, N. Borochov, H. Eisenberg, Dependence of laser light scattering of DNA on NaCl concentration, Biopolymers 20 (1981) 2671–2690.[358] E. Sobel, J.A. Harpst, Effetcs of Na+ on the persistence length and excluded volume of T7 bacteriophage DNA, Biopolymers 31 (1991) 1559–1564.[359] T.E. Cloutier, J. Widom, Spontaneous sharp bending of double-stranded DNA, Mol. Cell 14 (2004) 355–362.[360] Q. Du, C. Smith, N. Shiffeldrim, M. Vologodskaia, A. Vologodskii, Cyclization of short DNA fragments and bending fluctuations of the double helix,

Proc. Natl. Acad. Sci. USA 102 (2005) 5397–5402.[361] L. Czapla, D. Swigon, M.V.W.K. Olson, Sequence-dependent effects in the cyclization of short DNA, J. Chem. Theory Comput. 2 (2006) 685–695.[362] Y. Zhang, D.M. Crothers, Statistical mechanics of sequence-dependent circular DNA and its application for DNA cyclization, Biophys. J. 84 (2003)

136–153.[363] V. Rizzo, J. Schellman, Flow dichroism of T7 DNA as a function of salt concentrations, Biopolymers 20 (1981) 2143–2163.[364] P.J. Hagerman, Investigation of the flexibility of DNA using transient electric birefringence, Biopolymers 20 (1981) 1503–1535.[365] Y. Lu, B. Weers, N.C. Stellwagen, DNA persistence length revisited, Biopolymers 61 (2001) 261–275.[366] D. Porschke, Persistence length and bending dynamics of DNA from electro-optical measurements at high salt concentrations, Biophys. Chem. 40

(1991) 169–179.

180 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[367] J. Griffith, M. Bleyman, C.A. Rauch, P.A. Kitchin, P.T. Englund, Visualization of the bent helix in kinetoplast DNA by electronmicroscopy, Cell 46 (1986)717–724.

[368] C.L. Peterson, ATP-dependent chromatin remodeling: going mobile, FEBS Lett. 476 (2000) 68–72.[369] M. Vignali, A.H. Hassan, K.E. Neely, J.L. Workman, ATP-dependent chromatin-remodeling complexes, Mol. Cell Biol. 20 (2000) 1899–1910.[370] P.B. Becker, Nucleosome sliding: facts and fiction, EMBO J. 21 (2002) 4749–4753.[371] G. Meersseman, S. Pennings, E.M. Bradbury, Mobile nucleosomes–a general behavior, EMBO J. 11 (1992) 2951–2959.[372] A. Flaus, T.J. Richmond, Positioning and stability of nucleosomes on MMTV 3’LTR sequences, J. Mol. Biol. 275 (1998) 427–441.[373] G. Li, M. Levitus, C. Bustamante, J. Widom, Rapid spontaneous accessibility of nucleosomal DNA, Nat. Struct. Mol. Biol. 12 (2005) 46–53.[374] A. Bucceri, K. Kapitza, F. Thoma, Rapid accessibility of nucleosomal DNA in yeast on a second time scale, EMBO J. 25 (2006) 3123–3132.[375] W.J.A. Koopmans, A. Brehm, C. Logie, T. Schmidt, J. van Noort, Single-pair FRETmicroscopy reveals mononucleosome dynamics, J. Fluoresc. 17 (2007)

785–795.[376] H.S. Tims, J. Widom, Stopped-flow fluorescence resonance energy transfer for analysis of nucleosome dynamics, Methods 41 (2007) 296–303.[377] A. Flaus, T. Owen-Hughes, Mechanisms for nucleosome mobilization, Biopolymers 68 (2003) 563–578.[378] H. Schiessel, Topical review: the physics of chromatin, J. Phys.: Condens. Matter. 15 (2003) R699–R774.[379] H. Schiessel, The nucleosome: a transparent, slippery, sticky and yet stable DNA-protein complex, Eur. Phys. J. E 19 (2006) 251–262.[380] H. Schiessel, J. Widom, R.F. Bruinsma, W.M. Gelbart, Polymer reptation and nucleosome repositioning, Phys. Rev. Lett. 86 (2001) 4414–4417.[381] I.M. Kulić, H. Schiessel, Chromatin dynamics: nucleosomes go mobile through twist defects, Phys. Rev. Lett. 91 (2003) 148103.[382] S.H. Leuba, G. Yang, C. Robert, B. Samori, K. van Holde, J. Zlatanova, C. Bustamante, Three-dimensional structure of extended chromatin fibers as

revealed by tapping-mode scanning force microscopy, Proc. Natl. Acad. Sci. USA 91 (1994) 11621–11625.[383] S.H. Leuba, C. Bustamante, J. Zlatanova, K. van Holde, Contributions of linker histones and histone H3 to chromatin structure: scanning force

microscopy studies on trypsinized fibers, Biophys. J. 74 (1998) 2823–2829.[384] J.G. Yodh, Y.L. Lyubchenko, L.S. Shlyakhtenko, N. Woodbury, D. Lohr, Evidence for nonrandom behavior in 208-12 subsaturated nucleosomal array

populations analyzed by AFM, Biochemistry 38 (1999) 15756–15763.[385] S. Pisano, E. Pascucci, S. Cacchione, P. De Santis, M. Savino, AFM imaging and theoretical modeling studies of sequence-dependent nucleosome

positioning, Biophys. Chem. 124 (2006) 81–89.[386] A. Poulet, R. Buisson, C. Faivre-Moskalenko, M. Koelblen, S. Amiard, F. Montel, S. Cuesta-Lopez, O. Bornet, F. Guerlesquin, T. Godet, J. Moukhtar,

F. Argoul, A.-C. Déclais, D.M.J. Lilley, S.C.Y. Ip, S.C. West, E. Gilson, M.-J. Giraud-Panis, TRF2 promotes, remodels and protects telomeric Hollidayjunctions, EMBO J. 28 (2009) 641–651.

[387] M. Bussiek, G.Müller,W.Waldeck, S. Diekmann, J. Langowski, Organisation of nucleosomal arrays reconstitutedwith repetitive African greenmonkeyalpha-satellite DNA as analysed by atomic force microscopy, Eur. Biophys. J. 37 (2007) 81–93.

[388] J. Zlatanova, S.H. Leuba, K. van Holde, Chromatin fiber structure: morphology, molecular determinants, structural transitions, Biophys. J. 74 (1998)2554–2566.

[389] J. Yodh, N. Woodbury, L. Shlyakhtenko, Y. Lyubchenko, D. Lohr, Mapping nucleosome locations on the 208-12 by AFM provides clear evidence forcooperativity in array occupation, Biochemistry 41 (2002) 3565–3574.

[390] P. Milani, M. Marilley, J. Rocca-Serra, TBP binding capacity of the TATA box is associated with specific structural properties: AFM study of the IL-2Ralpha gene promoter, Biochimie 89 (2007) 528–533.

[391] V. Mutskov, D. Gerber, D. Angelov, J. Ausio, J. Workman, S. Dimitrov, Persistent interactions of core histone tails with nucleosomal DNA followingacetylation and transcription factor binding, Mol. Cell. Biol. 18 (1998) 6293–6304.

[392] K. Luger, T.J. Rechsteiner, T.J. Richmond, Expression and purification of recombinant histones and nucleosome reconstitution, MethodsMol. Biol. 119(1999) 1–16.

[393] A. Bolshoy, P. McNamara, R.E. Harrington, E.N. Trifonov, Curved DNA without A-A: experimental estimation of all 16 DNA wedge angles, Proc. Natl.Acad. Sci. USA 88 (1991) 2312–2316.

[394] T. Sakaue, K. Yoshikawa, S.H. Yoshimura, K. Takeyasu, Histone core slips along DNA and prefers positioning at the chain end, Phys. Rev. Lett. 87 (2001)078105.

[395] H. Kim, J. Imbert, W. Leonard, Both integrated and differential regulation of components of the IL-2/IL-2 receptor system, Cytokine Growth FactorRev. 17 (2006) 349–366.

[396] R. Reeves,W.J. Leonard, M.S. Nissen, Binding of HMG-I(Y) imparts architectural specificity to a positioned nucleosome on the promoter of the humaninterleukin-2 receptor alpha gene, Mol. Cell. Biol. 20 (2000) 4666–4679.

[397] P. Milani, Caractéristiques structurales et dynamiques du promoteur du gène de la chaîne alpha du récepteur de l’interleukine 2: étude parmodélisation et par microscopie de force atomique, Ph.D. Thesis, Faculté de Médecine, Marseille, 2007.

[398] M.S. Nissen, R. Reeves, Changes in superhelicity are introduced into closed circular DNA by binding of high mobility group protein I/Y, J. Biol. Chem.270 (1995) 4355–4360.

[399] A. Nekrutenko, W.H. Li, Assessment of compositional heterogeneity within and between eukaryotic genomes, Genome Res. 10 (2000) 1986–1995.[400] A. Eyre-Walker, L.D. Hurst, The evolution of isochores, Nat. Rev. Genet. 2 (2001) 549–555.[401] D. Häring, J. Kypr, No isochores in the human chromosomes 21 and 22? Biochem. Biophys. Res. Commun. 280 (2001) 567–573.[402] W. Li, Are isochore sequences homogeneous?, Gene 300 (2002) 129–139.[403] A. Pavlícek, J. Paces, O. Clay, G. Bernardi, A compact view of isochores in the draft human genome sequence, FEBS Lett. 511 (2002) 165–169.[404] N. Cohen, T. Dagan, L. Stone, D. Graur, GC composition of the human genome: in search of isochores, Mol. Biol. Evol. 22 (2005) 1260–1272.[405] H. Hori, S. Osawa, Evolutionary change in 5S rRNA secondary structure and a phylogenic tree of 352 5S rRNA species, Biosystems 19 (1986) 163–172.[406] G. D’onofrio, D. Mouchiroud, B. Aïssani, C. Gautier, G. Bernardi, Correlations between the compositional properties of human genes, codon usage,

and amino acid composition of proteins, J. Mol. Evol. 32 (1991) 504–510.[407] S. Nicolay, E.-B. Brodie of Brodie, M. Touchon, Y. d’Aubenton-carafa, C. Thermes, A. Arneodo, From scale invariance to deterministic chaos in DNA

sequences: towards a deterministic description of gene organization in the human genome, Physica A 342 (2004) 270–280.[408] J.R. Paulson, U.K. Laemmli, The structure of histone-depleted metaphase chromosomes, Cell 12 (1977) 817–828.[409] S.M. Gasser, U.K. Laemmli, A glimpse at chromosomal order, Trends Genet. 3 (1987) 16–22.[410] J.B. Rattner, C.C. Lin, Radial loops and helical coils coexist in metaphase chromosomes, Cell 42 (1985) 291–296.[411] E. Boy de la Tour, U.K. Laemmli, The metaphase scaffold is helically folded: sister chromatids have predominantly opposite helical handedness, Cell

55 (1988) 937–944.[412] M.G. Poirier, A. Nemani, P. Gupta, S. Eroglu, J.F. Marko, Probing chromosome structure with dynamic force relaxation, Phys. Rev. Lett. 86 (2001)

360–363.[413] J.M. Bridger, W.A. Bickmore, Putting the genome on the map, Trends Genet. 14 (1998) 403–409.[414] D. Zink, T. Cremer, R. Saffrich, R. Fischer, M.F. Trendelenburg, W. Ansorge, E.H. Stelzer, Structure and dynamics of human interphase chromosome

territories in vivo, Hum. Genet. 102 (1998) 241–251.[415] A.S. Belmont, Visualizing chromosome dynamics with GFP, Trends Cell Biol. 11 (2001) 250–257.[416] T. Cremer, C. Cremer, Chromosome territories, nuclear architecture and gene regulation in mammalian cells, Nat. Rev. Genet. 2 (2001) 292–301.[417] S.M. Gasser, Visualizing chromatin dynamics in interphase nuclei, Science 296 (2002) 1412–1416.[418] C. Münkel, J. Langowski, Chromosome structure predicted by a polymer model, Phys. Rev. E 57 (1998) 5888–5896.[419] L. Manuelidis, A view of interphase chromosomes, Science 250 (1990) 1533–1540.[420] L. Manuelidis, T.L. Chen, A unified model of eukaryotic chromosomes, Cytometry 11 (1990) 8–25.[421] C.L. Woodcock, H. Woodcock, R.A. Horowitz, Ultrastructure of chromatin. I. Negative staining of isolated fibers, J. Cell Sci. 99 (1991) 99–106.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 181

[422] C.L. Woodcock, Chromatin fibers observed in situ in frozen hydrated sections. Native fiber diameter is not correlated with nucleosome repeat length,J. Cell Biol. 125 (1994) 11–19.

[423] A.S. Belmont, Large-scale chromatin organization, in: Genome Structure and Function, Kluwer Academic Publishers, Dordrecht, 1997, p. 261.[424] W.G. Müller, D. Rieder, G. Kreth, C. Cremer, Z. Trajanoski, J.G. Mcnally, Generic features of tertiary chromatin structure as detected in natural

chromosomes, Mol. Cell Biol. 24 (2004) 9359–9370.[425] D.A. Jackson, A. Pombo, Replicon clusters are stable units of chromosome structure: evidence that nuclear organization contributes to the efficient

activation and propagation of S phase in human cells, J. Cell Biol. 140 (1998) 1285–1295.[426] H. Ma, J. Samarabandu, R.S. Devdhar, R. Acharya, P.C. Cheng, C. Meng, R. Berezney, Spatial and temporal dynamics of DNA replication sites in

mammalian cells, J. Cell Biol. 143 (1998) 1415–1425.[427] R. Berezney, D.D. Dubey, J.A. Huberman, Heterogeneity of eukaryotic replicons, replicon clusters, and replication foci, Chromosoma 108 (2000)

471–484.[428] H.J. Edenberg, J.A. Huberman, Eukaryotic chromosome replication, Annu. Rev. Genet. 9 (1975) 245–284.[429] R. Hand, Eucaryotic DNA: organization of the genome for replication, Cell 15 (1978) 317–325.[430] Y.B. Yurov, N.A. Liapunova, The units of DNA replication in themammalian chromosomes: evidence for a large size of replication units, Chromosoma

60 (1977) 253–267.[431] N.A. Liapunova, Organization of replication units and DNA replication in mammalian cells as studied by DNA fiber radioautography, Int. Rev. Cytol.

154 (1994) 261–308.[432] E. Chargaff, Structure and function of nucleic acids as cell constituents, Fed. Proc. 10 (1951) 654–659.[433] R. Rudner, J.D. Karkas, E. Chargaff, Separation of B. subtilis DNA into complementary strands. 3. Direct analysis, Proc. Natl. Acad. Sci. USA 60 (1968)

921–922.[434] J.W. Fickett, D.C. Torney, D.R. Wolf, Base compositional structure of genomes, Genomics 13 (1992) 1056–1064.[435] J.R. Lobry, Properties of a general model of DNA evolution under no-strand-bias conditions, J. Mol. Evol. 40 (1995) 326–330.[436] J. Mrázek, S. Karlin, Strand compositional asymmetry in bacterial and large viral genomes, Proc. Natl. Acad. Sci. USA 95 (1998) 3720–3725.[437] A.C. Frank, J.R. Lobry, Asymmetric substitution patterns: a reviewof possible underlyingmutational or selectivemechanisms, Gene 238 (1999) 65–77.[438] E.P. Rocha, A. Danchin, A. Viari, Universal replication biases in bacteria, Mol. Microbiol. 32 (1999) 11–16.[439] E.R. Tillier, R.A. Collins, The contributions of replication orientation, gene direction, and signal sequences to base-composition asymmetries in

bacterial genomes, J. Mol. Evol. 50 (2000) 249–257.[440] E.-B. Brodie of Brodie, De l’analyse des séquences d’ADN à la modélisation de la réplication chez les mammifères, Ph.D. Thesis, ENS de Lyon, France,

2005.[441] S. Nicolay, Analyse des séquences d’ADN par la transformée en ondelettes: extraction d’informations structurelles, dynamiques et fonctionnelles,

Ph.D. Thesis, University of Liège, Belgium, 2006.[442] T. Gojobori, W.H. Li, D. Graur, Patterns of nucleotide substitution in pseudogenes and functional genes, J. Mol. Evol. 18 (1982) 360–369.[443] W.H. Li, C.I.Wu, C.C. Luo, Nonrandomness of pointmutation as reflected in nucleotide substitutions in pseudogenes and its evolutionary implications,

J. Mol. Evol. 21 (1984) 58–71.[444] D.A. Petrov, D.L. Hartl, Patterns of nucleotide substitution in Drosophila and mammalian genomes, Proc. Natl. Acad. Sci. USA 96 (1999) 1475–1479.[445] Z. Zhang, M. Gerstein, Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes, Nucleic Acids

Res. 31 (2003) 5338–5348.[446] J.M. Freeman, T.N. Plasterer, T.F. Smith, S.C. Mohr, Patterns of genome organization in bacteria, Science 279 (1998) 1827.[447] A. Beletskii, A. Grigoriev, S. Joyce, A.S. Bhagwat, Mutations induced by bacteriophage T7 RNA polymerase and their effects on the composition of the

T7 genome, J. Mol. Biol. 300 (2000) 1057–1065.[448] M.P. Francino, H. Ochman, Deamination as the basis of strand-asymmetric evolution in transcribed Escherichia coli sequences, Mol. Biol. Evol. 18

(2001) 1147–1150.[449] L. Duret, Evolution of synonymous codon usage in metazoans, Current Opin. Genetics Dev. 12 (2002) 640–649.[450] C. Shioiri, N. Takahata, Skew of mononucleotide frequencies, relative abundance of dinucleotides, and DNA strand asymmetry, J. Mol. Evol. 53 (2001)

364–376.[451] P. Green, B. Ewing,W.Miller, P.J. Thomas, E.D. Green, Transcription-associatedmutational asymmetry inmammalian evolution, Nat. Genet. 33 (2003)

514–517.[452] J.Q. Svejstrup, Mechanisms of transcription-coupled DNA repair, Nat. Rev. Mol. Cell Biol. 3 (2002) 21–29.[453] M. Touchon, S. Nicolay, A. Arneodo, Y. d’Aubenton-Carafa, C. Thermes, Transcription-coupled TA and GC strand asymmetries in the human genome,

FEBS Lett. 555 (2003) 579–582.[454] M. Touchon, A. Arneodo, Y. d’Aubenton-Carafa, C. Thermes, Transcription-coupled and splicing-coupled strand asymmetries in eukaryotic genomes,

Nucleic Acids Res. 32 (2004) 4969–4978.[455] S. Nicolay, E.-B. Brodie of Brodie, M. Touchon, B. Audit, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, Bifractality of human DNA strand-asymmetry

profiles results from transcription, Phys. Rev. E 75 (2007) 032902.[456] A.F.A. Smit, R. Hubley, P. Green, RepeatMasker Open 3.0 http://www.repeatmasker.org (1996–2004).[457] E. Bacry, J.-F. Muzy, A. Arneodo, Singularity spectrum of fractal signals from wavelet analysis: exact results, J. Stat. Phys. 70 (1993) 635–674.[458] J.-F. Muzy, E. Bacry, A. Arneodo, The multifractal formalism revisited with wavelets, Int. J. Bifurcation Chaos 4 (1994) 245–302.[459] A. Arneodo, E. Bacry, J.-F. Muzy, The thermodynamics of fractals revisited with wavelets, Physica A 213 (1995) 232–275.[460] T.I. Lee, et al., Control of developmental regulators by polycomb in human embryonic stem cells, Cell 125 (2006) 301–313.[461] F. Jacob, S. Brenner, F. Cuzin, On the regulation of DNA replication in bacteria, Cold Spring Harb. Symp. Quant. Biol. 28 (1963) 329–342.[462] S.P. Bell, A. Dutta, DNA replication in eukaryotic cells, Annu. Rev. Biochem. 71 (2002) 333–374.[463] M.L. Depamphilis (Ed.), DNA Replication and Human Disease, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2006.[464] O. Hyrien, M. Méchali, Chromosomal replication initiates and terminates at random sequences but at regular intervals in the ribosomal DNA of

Xenopus early embryos, EMBO J. 12 (1993) 4511–4520.[465] S.A. Gerbi, A.K. Bielinsky, DNA replication and chromatin, Current Opin. Genetics Dev. 12 (2002) 243–248.[466] D. Schübeler, D. Scalzo, C. Kooperberg, B. van Steensel, J. Delrow, M. Groudine, Genome-wide DNA replication profile for Drosophila melanogaster:

a link between transcription and replication timing, Nat. Genet. 32 (2002) 438–442.[467] M. Anglana, F. Apiou, A. Bensimon, M. Debatisse, Dynamics of DNA replication in mammalian somatic cells: nucleotide pool modulates origin choice

and interorigin spacing, Cell 114 (2003) 385–394.[468] D. Fisher, M. Méchali, Vertebrate HoxB gene expression requires DNA replication, EMBO J. 22 (2003) 3737–3748.[469] D.M. Gilbert, Making sense of eukaryotic DNA replication origins, Science 294 (2001) 96–100.[470] D. Coverley, R.A. Laskey, Regulation of eukaryotic DNA replication, Annu. Rev. Biochem. 63 (1994) 745–776.[471] T. Sasaki, T. Sawado, M. Yamaguchi, T. Shinomiya, Specification of regions of DNA replication initiation during embryogenesis in the 65-kilobase

DNApolalpha-dE2F locus of Drosophila melanogaster, Mol. Cell. Biol. 19 (1999) 547–555.[472] J.A. Bogan, D.A. Natale, M.L. Depamphilis, Initiation of eukaryotic DNA replication: conservative or liberal?, J. Cell. Physiol. 184 (2000) 139–150.[473] D.M. Gilbert, In search of the holy replicator, Nat. Rev. Mol. Cell Biol. 5 (2004) 848–855.[474] C. Demeret, Y. Vassetzky, M. Méchali, Chromatin remodelling and DNA replication: from nucleosomes to loop domains, Oncogene 20 (2001)

3086–3093.[475] M. Méchali, DNA replication origins: from sequence specificity to epigenetics, Nat. Rev. Genet. 2 (2001) 640–645.[476] A.J. Mcnairn, D.M. Gilbert, Epigenomic replication: linking epigenetics to DNA replication, Bioessays 25 (2003) 647–656.

182 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[477] The ENCODE project consortium: identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project, Nature447 (2007) 799–816.

[478] I. Lucas, A. Palakodeti, Y. Jiang, D.J. Young, N. Jiang, A.A. Fernald, M.M. Le Beau, High-throughput mapping of origins of replication in human cells,EMBO Rep. 8 (2007) 770–777.

[479] J.-C. Cadoret, F. Meisch, V. Hassan-Zadeh, I. Luyten, C. Guillet, L. Duret, H. Quesneville, M.-N. Prioleau, Genome-wide studies highlight indirect linksbetween human replication origins and gene regulation, Proc. Natl. Acad. Sci. USA 105 (2008) 15837–15842.

[480] B.J. Brewer, When polymerases collide: replication and the transcriptional organization of the E. coli chromosome, Cell 53 (1988) 679–686.[481] E.P.C. Rocha, P. Guerdoux-Jamet, I. Moszer, A. Viari, A. Danchin, Implication of gene distribution in the bacterial chromosome for the bacterial cell

factory, J. Biotech. 78 (2000) 209–219.[482] P. Lopez, H. Philippe, Composition strand asymmetries in prokaryotic genomes: mutational bias and biased gene orientation, C. R. Acad. Sci. III 324

(2001) 201–208.[483] E.P.C. Rocha, Is there a role for replication fork asymmetry in the distribution of genes in bacterial genomes, Trends Microbiol. 10 (2002) 393–395.[484] M. Bulmer, Strand symmetry of mutation rates in the beta-globin region, J. Mol. Evol. 33 (1991) 305–310.[485] M.P. Francino, H. Ochman, Strand symmetry around the beta-globin origin of replication in primates, Mol. Biol. Evol. 17 (2000) 416–422.[486] A. Gierlik, M. Kowalczuk, P. Mackiewicz, M.R. Dudek, S. Cebrat, Is there replication-associated mutational pressure in the Saccharomyces cerevisiae

genome? J. Theoret. Biol. 202 (2000) 305–314.[487] E.-M. Ladenburger, C. Keller, R. Knippers, Identification of a binding region for human origin recognition complex proteins 1 and 2 that coincides

with an origin of DNA replication, Mol. Cell. Biol. 22 (2002) 1036–1048.[488] T. Taira, S.M. Iguchi-Ariga, H. Ariga, A novel DNA replication origin identified in the human heat shock protein 70 gene promoter, Mol. Cell. Biol. 14

(1994) 6386–6397.[489] C. Keller, E.-M. Ladenburger, M. Kremer, R. Knippers, The origin recognition complex marks a replication origin in the human TOP1 gene promoter,

J. Biol. Chem. 277 (2002) 31430–31440.[490] L. Vassilev, E.M. Johnson, An initiation zone of chromosomal DNA replication located upstream of the c-myc gene in proliferating HeLa cells, Mol.

Cell. Biol. 10 (1990) 4899–4904.[491] T. Nenguke, M.I. Aladjem, J.F. Gusella, N.S. Wexler, N. Arnheim, Candidate DNA replication initiation regions at human trinucleotide repeat disease

loci, Hum. Mol. Genet. 12 (2003) 1021–1028.[492] E. Louie, J. Ott, J. Majewski, Nucleotide frequency variation across human genes, Genome Res. 13 (2003) 2594–2601.[493] J. Chen, M. Sun, S. Lee, G. Zhou, J.D. Rowley, S.M. Wang, Identifying novel transcripts and novel genes in the human genome by using novel SAGE

tags, Proc. Natl. Acad. Sci. USA 99 (2002) 12257–12262.[494] P. Kapranov, S.E. Cawley, J. Drenkow, S. Bekiranov, R.L. Strausberg, S.P.A. Fodor, T.R. Gingeras, Large-scale transcriptional activity in chromosomes

21 and 22, Science 296 (2002) 916–919.[495] J.L. Rinn, G. Euskirchen, P. Bertone, R. Martone, N.M. Luscombe, S. Hartman, P.M. Harrison, F.K. Nelson, P. Miller, M. Gerstein, S. Weissman, M. Snyder,

The transcriptional activity of human Chromosome 22, Genes Dev. 17 (2003) 529–540.[496] D. Kampa, J. Cheng, P. Kapranov, M. Yamanaka, S. Brubaker, S. Cawley, J. Drenkow, A. Piccolboni, S. Bekiranov, G. Helt, H. Tammana, T.R. Gingeras,

Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22, Genome Res. 14 (2004) 331–342.[497] T. Imanishi, et al., Integrative annotation of 21 037 human genes validated by full-length cDNA clones, PLoS Biol. 2 (2004) e162.[498] F.D. Araujo, J.D. Knox, S. Ramchandani, R. Pelletier, P. Bigey, G. Price, M. Szyf, M. Zannis-Hadjopoulos, Identification of initiation sites for DNA

replication in the human dnmt1 (DNA-methyltransferase) locus, J. Biol. Chem. 274 (1999) 9335–9341.[499] M. Giacca, et al., Fine mapping of a replication origin of human DNA, Proc. Natl. Acad. Sci. USA 91 (1994) 7119–7123.[500] D. Kitsberg, S. Selig, I. Keshet, H. Cedar, Replication structure of the human beta-globin gene domain, Nature 366 (1993) 588–590.[501] C. Girard-Reydet, D. Grégoire, Y. Vassetzky, M. Méchali, DNA replication initiates at domains overlapping with nuclear matrix attachment regions in

the xenopus and mouse c-myc promoter, Gene 332 (2004) 129–138.[502] L.T. Vassilev, W.C. Burhans, M.L. Depamphilis, Mapping an origin of DNA replication at a single-copy locus in exponentially proliferating mammalian

cells, Mol. Cell. Biol. 10 (1990) 4685–4689.[503] M. Touchon, Biais de composition chez les mammifères: rôle de la transcription et de la réplication, Ph.D. Thesis, University Denis Diderot, Paris VII,

France, 2005.[504] S. Codlin, J.Z. Dalgaard, Complex mechanism of site-specific DNA replication termination in fission yeast, EMBO J. 22 (2003) 3431–3440.[505] R.D. Little, T.H. Platt, C.L. Schildkraut, Initiation and termination of DNA replication in human rRNA genes, Mol. Cell Biol. 13 (1993) 6600–6613.[506] D. Santamaria, E. Viguera, M.L. Martinez-Robles, O. Hyrien, P. Hernandez, D.B. Krimer, J.B. Schvartzman, Bi-directional replication and random

termination, Nucleic Acids Res. 28 (2000) 2099–2107.[507] E.J.White, O. Emanuelsson, D. Scalzo, T. Royce, S. Kosak, E.J. Oakeley, S.Weissman,M. Gerstein,M. Groudine,M. Snyder, D. Schübeler, DNA replication-

timing analysis of human chromosome 22 at high resolution and different developmental states, Proc. Natl. Acad. Sci. USA 101 (2004) 17771–17776.[508] H.G. Callan, Replication of DNA in the chromosomes of eukaryotes, Proc. R. Soc. Lond. B Biol. Sci. 181 (1972) 19–41.[509] A. Baker, S. Nicolay, L. Zaghloul, Y. d’Aubenton-Carafa, C. Thermes, B. Audit, A. Arneodo, Wavelet-based method to disentangle transcription- and

replication-associated strand asymmetries in mammalian genomes, Appl. Comp. Harm. Anal. 28 (2010) 150–170.[510] B. Audit, S. Nicolay, M. Huvet, M. Touchon, Y. d’Aubenton-Carafa, C. Thermes, A. Arneodo, DNA replication timing data corroborate in silico human

replication origin predictions, Phys. Rev. Lett. 99 (2007) 248102.[511] F. Chalmel, A.D. Rolland, C. Niederhauser-Wiederkehr, S.S.W. Chung, P. Demougin, A. Gattiker, J. Moore, J.-J. Patard, D.J. Wolgemuth, B. Jégou, M.

Primig, The conserved transcriptome in human and rodent male gametogenesis, Proc. Natl. Acad. Sci. USA 104 (2007) 8346–8351.[512] K. Woodfine, D.M. Beare, K. Ichimura, S. Debernardi, A.J. Mungall, H. Fiegler, V.P. Collins, N.P. Carter, I. Dunham, Replication timing of human

chromosome 6, Cell Cycle 4 (2005) 172–176.[513] M.K. Raghuraman, E.A. Winzeler, D. Collingwood, S. Hunt, L. Wodicka, A. Conway, D.J. Lockhart, R.W. Davis, B.J. Brewer, W.L. Fangman, Replication

dynamics of the yeast genome, Science 294 (2001) 115–121.[514] Y. Watanabe, A. Fujiyama, Y. Ichiba, M. Hattori, T. Yada, Y. Sakaki, T. Ikemura, Chromosome-wide assessment of replication timing for human

chromosomes 11q and 21q: disease-related genes in timing-switch regions, Hum. Mol. Genet. 11 (2002) 13–21.[515] M. Costantini, O. Clay, C. Federico, S. Saccone, F. Auletta, G. Bernardi, Human chromosomal bands: nested structure, high-definition map and

molecular basis, Chromosoma 116 (2007) 29–40.[516] C. Schmegner, H. Hameister,W. Vogel, G. Assum, Isochores and replication time zones: a perfectmatch, Cytogenet. Genome Res. 116 (2007) 167–172.[517] P. Cvitanovic (Ed.), Universality in Chaos, Hilger, Bristol, 1984.[518] J. Guckenheimer, P. Holmes, Nonlinear Oscillations Dynamical Systems and Bifurcations of Vector Fields, Springer, Berlin, 1984.[519] B.L. Hao (Ed.), Chaos, World Scientific, Singapore, 1984.[520] H.G. Schuster, Deterministic Chaos, Physik-Verlag, Weinheim, 1984.[521] P. Bergé, Y. Pomeau, C. Vidal, Order within Chaos, Wiley, New York, 1986.[522] A.V. Holden (Ed.), Chaos, Manchester Univ. Press, Manchester, 1986.[523] H.B. Stewart, Nonlinear Dynamics Chaos, Wiley, New York, 1986.[524] P. Bergé (Ed.), Le Chaos Collection du CEA, Eyrolles, Paris, 1988.[525] L.D. Mesner, E.L. Crawford, J.L. Hamlin, Isolating apparently pure libraries of replication origins from complex genomes, Mol. Cell 21 (2006) 719–726.[526] N. Karnani, C.M. Taylor, A. Dutta, Microarray analysis of DNA replication timing, Methods Mol. Biol. 556 (2009) 191–203.[527] J.L. Hamlin, L.D. Mesner, P.A. Dijkwel, A winding road to origin discovery, Chromosome Res. 18 (2010) 45–61.[528] O. Hyrien, Private communication.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 183

[529] C.-L. Chen, A. Rappailles, L. Duquenne, M. Huvet, G. Guilbaud, L. Farinelli, B. Audit, Y. d’Aubenton carafa, A. Arneodo, O. Hyrien, C. Thermes, Impact ofreplication timing on non-CpG and CpG substitution rates in mammalian genomes, Genome Res. 20 (2010) 447–457.

[530] N. Karnani, C. Taylor, A. Malhotra, A. Dutta, Pan-S replication patterns and chromosomal domains defined by genome-tiling arrays of ENCODEgenomic areas, Genome Res. 17 (2007) 865–876.

[531] K. Woodfine, H. Fiegler, D.M. Beare, J.E. Collins, O.T. Mccann, B.D. Young, S. Debernardi, R. Mott, I. Dunham, N.P. Carter, Replication timing of thehuman genome, Hum. Mol. Genet. 13 (2004) 191–202.

[532] R. Desprat, D. Thierry-Mieg, N. Lailler, J. Lajugie, C. Schildkraut, J. Thierry-Mieg, E.E. Bouhassira, Predictable dynamic program of timing of DNAreplication in human cells, Genome Res. 19 (2009) 2288–2299.

[533] R.S. Hansen, S. Thomas, R. Sandstrom, T.K. Canfield, R.E. Thurman, M. Weaver, M.O. Dorschner, S.M. Gartler, J.A. Stamatoyannopoulos, Sequencingnewly replicated DNA reveals widespread plasticity in human replication timing, Proc. Natl. Acad. Sci. USA 107 (2010) 139–144.

[534] M. Huvet, Détection in silico des origines de réplication dans les génomes de mammifères, Ph.D. Thesis, Université de Paris VII, 2008.[535] N. Gilbert, S. Boyle, H. Fiegler, K.Woodfine, N.P. Carter,W.A. Bickmore, Chromatin architecture of the human genome: gene-rich domains are enriched

in open chromatin fibers, Cell 118 (2004) 555–566.[536] L.D. Hurst, C. Pál, M.J. Lercher, The evolutionary dynamics of eukaryotic gene order, Nat. Rev. Genet. 5 (2004) 299–310.[537] L. Chakalova, E. Debrand, J.A. Mitchell, C.S. Osborne, P. Fraser, Replication and transcription: shaping the landscape of the genome, Nat. Rev. Genet.

6 (2005) 669–677.[538] D. Sproul, N. Gilbert, W.A. Bickmore, The role of chromatin structure in regulating the expression of clustered genes, Nat. Rev. Genet. 6 (2005)

775–781.[539] D.M. Macalpine, H.K. Rodriguez, S.P. Bell, Coordination of replication and transcription along a Drosophila chromosome, Genes Dev. 18 (2004)

3094–3105.[540] C.M. Lin, H. Fu, M. Martinovsky, E. Bouhassira, M.I. Aladjem, Dynamic alterations of replication timing in mammalian cells, Curr. Biol. 13 (2003)

1019–1028.[541] E. Danis, K. Brodolin, S. Menut, D. Maiorano, C. Girard-Reydet, M. Méchali, Specification of a DNA replication origin by a transcription complex, Nat.

Cell Biol. 6 (2004) 721–730.[542] M. Ghosh, G. Liu, G. Randall, J. Bevington, M. Leffak, Transcription factor binding and induced transcription alter chromosomal c-myc replicator

activity, Mol. Cell Biol. 24 (2004) 10193–10207.[543] M.L. Depamphilis, Cell cycle dependent regulation of the origin recognition complex, Cell Cycle 4 (2005) 70–79.[544] Y. Jeon, S. Bekiranov, N. Karnani, P. Kapranov, S. Ghosh, D. Macalpine, C. Lee, D.S. Hwang, T.R. Gingeras, A. Dutta, Temporal profile of replication of

human chromosomes, Proc. Natl. Acad. Sci. USA 102 (2005) 6419–6424.[545] A.M. Deshpande, C.S. Newlon, DNA replication fork pause sites dependent on transcription, Science 272 (1996) 1030–1033.[546] Y. Takeuchi, T. Horiuchi, T. Kobayashi, Transcription-dependent recombination and the role of fork collision in yeast rDNA, Genes Dev. 17 (2003)

1497–1506.[547] E.P.C. Rocha, A. Danchin, Essentiality not expressiveness drives gene-strand bias in bacteria, Nat. Genet. 34 (2003) 377–378.[548] A. Necsulea, C. Guillet, J.-C. Cadoret, M.-N. Prioleau, L. Duret, The relationship between DNA replication and human genome organization, Mol. Biol.

Evol. 26 (2009) 729–741.[549] L. Zaghloul, Transcriptional activity chromatin state and replication timing in domains of compositional skew in the human genome, Ph.D. Thesis,

Université de Lyon, Ecole Normale Supérieure de Lyon, 2009.[550] L. Zaghloul, A. Arneodo, Y. d’Aubenton carafa, C. Thermes, B. Audit, Large replication domains with homogeneous composition delimit

heterochromatic gene deserts, Nucleic Acids Res. (2010) (submitted for publication).[551] H. Labit, I. Perewoska, T. Germe, O. Hyrien, K.Marheineke, DNA replication timing is deterministic at the level of chromosomal domains but stochastic

at the level of replicons in Xenopus egg extracts, Nucleic Acids Res. 36 (2008) 5623–5634.[552] J. Herrick, P. Stanislawski, O. Hyrien, A. Bensimon, Replication fork density increases during DNA synthesis in X. laevis egg extracts, J. Mol. Biol. 300

(2000) 1133–1142.[553] A. Goldar, H. Labit, K. Marheineke, O. Hyrien, A dynamic stochastic model for DNA replication initiation in early embryos, PLoS One 3 (2008) e2919.[554] A. Goldar, M.-C. Marsolier-Kergoat, O. Hyrien, Universal temporal profile of replication origin activation in eukaryotes, PLoS One 4 (2009) e5899.[555] O. Hyrien, C. Maric, M. Méchali, Transition in specification of embryonic metazoan DNA replication origins, Science 270 (1995) 994–997.[556] J. Sequeira-Mendes, R. Diaz-Uriarte, A. Apedaile, D. Huntley, N. Brockdorff,M. Gomez, Transcription initiation activity sets replication origin efficiency

in mammalian cells, PLoS Genet. 5 (2009) e1000446.[557] J.T. Finch, A. Klug, Solenoidal model for superstructure in chromatin, Proc. Natl. Acad. Sci. USA 73 (1976) 1897–1901.[558] F. Thoma, T. Koller, A. Klug, Involvement of histone H1 in the organization of the nucleosome and of the salt-dependent superstructures of chromatin,

J. Cell Biol. 83 (1979) 403–427.[559] C.A. Davey, D.F. Sargent, K. Luger, A.W. Maeder, T.J. Richmond, Solvent mediated interactions in the structure of the nucleosome core particle at 1.9

Å resolution, J. Mol. Biol. 319 (2002) 1097–1113.[560] Y. Fan, T. Nikitina, J. Zhao, T.J. Fleury, R. Bhattacharyya, E.E. Bouhassira, A. Stein, C.L. Woodcock, A.I. Skoultchi, Histone H1 depletion in mammals

alters global chromatin structure but causes specific changes in gene regulation, Cell 123 (2005) 1199–1212.[561] K. van Holde, J. Zlatanova, What determines the folding of the chromatin fiber? Proc. Natl. Acad. Sci. USA 93 (1996) 10548–10555.[562] B. Dorigo, T. Schalch, A. Kulangara, S. Duda, R.R. Schroeder, T.J. Richmond, Nucleosome arrays reveal the two-start organization of the chromatin

fibers, Science 306 (2004) 1571–1573.[563] J.C. Hansen, Conformational dynamics of the chromatin fiber in solution: determinants, mechanisms, and functions, Annu. Rev. Biophys. Biomol.

Struct. 31 (2002) 361–392.[564] K. Maeshima, U.K. Laemmli, A two-step scaffolding model for mitotic chromosome assembly, Dev. Cell 4 (2003) 467–480.[565] E.M.M. Manders, A.E. Visser, A. Koppen, W.C. de Leeuw, R. van Liere, G.J. Brakenhoff, R. van Driel, Four-dimensional imaging of chromatin dynamics

during the assembly of the interphase nucleus, Chromosome Res. 11 (2003) 537–547.[566] R. Gassmann, P. Vagnarelli, D. Hudson, W.C. Earnshaw, Mitotic chromosome formation and the condensin paradox, Exp. Cell Res. 296 (2004) 35–42.[567] D.L. Spector, The dynamics of chromosome organization and gene regulation, Annu. Rev. Biochem. 72 (2003) 573–608.[568] R.A. Horowitz, D.A. Agard, J.W. Sedat, C.L. Woodcock, The three-dimensional architecture of chromatin in situ: electron tomography reveals fibers

composed of a continuously variable zig-zag nucleosomal ribbon, J. Cell Biol. 125 (1994) 1–10.[569] J. Bednar, R. Horowitz, S.G.L. Carruthers, J. Hansen, A. Koster, C. Woodcock, Nucleosomes linker DNA and linker histone form a unique structural

motif that directs the higher-order folding and compaction of chromatin, Proc. Natl. Acad. Sci. USA 93 (1998) 14173–14178.[570] C.L. Woodcock, R.A. Horowitz, Electron microscopic imaging of chromatin with nucleosome resolution, Methods Cell Biol. 53 (1998) 167–186.[571] D.E. Olins, A.L. Olins, Chromatin history: our view from the bridge, Nat. Rev. Mol. Cell Biol. 4 (2003) 809–814.[572] C. Bustamante, G. Zuccheri, S.H. Leuba, G. Yang, B. Samori, Visualization and analysis of chromatin by scanning force microscopy, Comput. Methods

Enzymol. 12 (1997) 73–83.[573] R.A. Horowitz, A.J. Koster, J. Walz, C.L. Woodcock, Automated electron microscope tomography of frozen-hydrated chromatin: the irregular three-

dimensional zigzag architecture persists in compact, isolated fibers, J. Struct. Biol. 120 (1997) 353–362.[574] P. Robinson, L. Fairall, V. Huynh, D. Rhodes, EM measurements define the dimensions of the ‘‘30-nm’’ chromatin fibre: evidence for a compact,

interdigitated structures, Proc. Natl. Acad. Sci. USA 103 (2006) 6506–6511.[575] J. Daban, A. Bermudez, Interdigitated solenoid model for compact chromatin fiber, Biochemistry 37 (1998) 4299–4304.[576] C. Woodcock, L. Frado, J. Rattner, The higher-order structure of chromatin: evidence for a helical ribbon arrangement, J. Cell Biol. 99 (1984) 42–52.[577] G. Felsenfeld, J. McGhee, Structure of the 30 nm chromatin fiber, Cell 44 (1986) 375–377.

184 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[578] A. Worcel, S. Strogatz, D. Riley, Structure of chromatin and the linking number of DNA, Proc. Natl. Acad. Sci USA 78 (1981) 1461–1465.[579] S. Williams, B. Athey, L. Muglia, R. Schappes, A. Gough, J. Langmore, Chromatin fibers are left-handed double helice with diameter and mass per unit

length that depend on linker length, Biophys. J. 49 (1986) 233–248.[580] M. Smith, B. Athey, S. Williams, J. Langmore, Radial density distribution of chromatin: evidence that chromatin fibers have solid centers, J. Cell Biol.

110 (1990) 245–254.[581] A. Hamiche, P. Schultz, V. Ramakrishnan, P. Oudet, A. Prunell, Linker histone-dependent DNA structure in linear mononucleosomes, J. Mol. Biol. 257

(1996) 20–41.[582] T. Schalch, S. Duda, D. Sargent, T. Richmond, X-ray structure of a tetranucleosome and its implication for the chromatin fibre, Nature 436 (2005)

138–141.[583] E. Ben-Haim, A. Lesne, J.-M. Victor, Chromatin: a tunable spring at work inside chromosomes, Phys. Rev. E 64 (2001) 051921.[584] E. Ben-Haim, A. Lesne, J.-M. Victor, Adaptive elastic properties of chromatin fiber, Physica A 314 (2002) 592–599.[585] P. Diesinger, D. Heerman, The influence of the cylindrical shape of the nucleosome and the H1 defects on properties of chromatin, Biophys. J. 94

(2008) 4165–4172.[586] M. Vignali, J.L. Workman, Location and function of linker histones, Nat. Struct. Biol. 5 (1998) 1025–1028.[587] A. Travers, The location of the linker histone on the nucleosome, Trends Biochem. Sci. 24 (1999) 4–7.[588] M.M.S. Bharath, N.R. Chandra, M.R.S. Rao, Molecular modeling of the chromatosome particle, Nucleic Acids Res. 31 (2003) 4264–4274.[589] D.T. Brown, T. Izard, T. Misteli, Mapping the interaction surface of linker histone H10 with the nucleosome of native chromatin in vivo, Nat. Struct.

Mol. Biol. 13 (2006) 250–255.[590] J. Langowski, Polymer chain models of DNA and chromatin, Eur. Phys. J. E 19 (2006) 241–249.[591] V. Katritch, C. Bustamante,W. Olson, Pulling chromatin fibers: computer simulations of direct physical manipulations, J. Mol. Biol. 295 (2000) 29–40.[592] H. Schiessel, W.M. Gelbart, R. Bruinsma, DNA folding: structural and mechanical properties of the two-angle model for chromatin, Biophys. J. 80

(2001) 1940–1956.[593] B. Mergell, R. Everaers, H. Schiessel, Nucleosome interactions in chromatin: fiber stiffening and hairpin formation, Phys. Rev. E 70 (2004) 011915.[594] S. Mangenot, A. Leforestier, D. Durand, F. Livolant, X-ray diffraction characterization of the dense phases formed by nucleosome core particles,

Biophys. J 84 (2003) 2570–2584.[595] F. Livolant, S. Mangenot, A. Leforestier, A. Bertin, M. de Frutos, E. Raspaud, D. Durand, Are liquid crystalline properties of nucleosomes involved in

chromosome structure and dynamics? Philos. Transact. R. Soc. Astrophys. J. Suppl. Ser. 364 (2006) 2615–2633.[596] G. Wedemann, J. Langowski, Computer simulation of the 30-nanometer chromatin fiber, Biophys. J. 82 (2002) 2847–2859.[597] K. Rippe, Making contact on a nucleic acid polymer, Trends Biochem. Sci. 26 (2001) 733–740.[598] F. Aumann, F. Lankas, M. Caudron, J. Langowski, Monte Carlo simulation of chromatin stretching, Phys. Rev. E 73 (2006) 041927.[599] J. Dekker, K. Rippe, M. Dekker, N. Kleckner, Capturing chromosome conformation, Science 295 (2002) 1306–1311.[600] S. Grigoryev, G. Arya, S. Corell, C. Woodcock, T. Schlick, Evidence for heteromorphic chromatin fibers from analysis of nucleosome interactions, Proc.

Natl. Acad. Sci. USA 106 (2009) 13317–13322.[601] P. Draves, P. Lowary, J. Widom, Co-operative binding of the globular domain of histone H5 to DNA, J. Mol. Biol. 225 (1992) 1105–1121.[602] J. Thomas, C. Rees, J. Finch, Cooperative binding of the globular domains of histone H1 and H5 to DNA, Nucleic Acids Res. 20 (1992) 187–194.[603] H. Wong, J.-M. Victor, J. Mozziconacci, An all-atommodel for the chromatin fiber containing linker hisones reveals a versatile structure tuned by the

nucleosomal repeat lengths, PLoS One 2 (2007) e877.[604] M. Depken, H. Schiessel, Nucleosome shape dictates chromatin fiber structure, Biophys. J 96 (2009) 777–784.[605] B.D. Athey,M.F. Smith, D.A. Rankert, S.P.Williams, J.P. Langmore, The diameters of frozen-hydrated chromatin fibers increasewith DNA linker length:

evidence in support of variable diameter models for chromatin, J. Cell Biol. 111 (1990) 795–806.[606] N. Gilbert, J. Allan, Distinctive higher-order chromatin structure at mammalian centromeres, Proc. Natl. Acad. Sci. USA 98 (2001) 11949–11954.[607] D. Revaud, J. Mozziconacci, L. Sabatier, C. Desmaze, C. Lavelle, Sequence-driven telomeric chromatin structure, Cell Cycle 8 (2009) 1099–1100.[608] P.M. Diesinger, Genome folding at the 30 nm scale, Ph.D. Thesis, Heidelberg University, 2009.[609] C.D. Allis, T. Jenuwein, D. Reinberg, M.-L. Caparros (Eds.), Epigenetics Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York, 2006.[610] F. Montel, H. Menoni, M. Castelnovo, J. Bednar, S. Dimitrov, D. Angelov, C. Faivre-Moskalenko, The dynamics of individual nucleosomes controls the

chromatin condensation pathway: direct atomic force microscopy visualization of variant chromatin, Biophys. J. 97 (2009) 544–553.[611] J. Dekker, A closer look at long-range chromosomal interactions, Trends Biochem. Sci. 28 (2003) 277–280.[612] N. Gilbert, S. Gilchrist, W.A. Bickmore, Chromatin organization in the mammalian nucleus, Int. Rev. Cytol. 242 (2005) 283–336.[613] M.R. Branco, A. Pombo, Intermingling of chromosome territories in interphase suggests role in translocations and transcription-dependent

associations, PLoS Biol. 4 (2006) e138.[614] L.S. Shopland, C.R. Lynch, K.A. Peterson, K. Thornton, N. Kepper, J. von Hase, S. Stein, S. Vincent, K.R. Molloy, G. Kreth, C. Cremer, C.J. Bult, T.P. O’brien,

Folding and organization of a contiguous chromosome region according to the gene distribution pattern in primary genomic sequence, J. Cell Biol.174 (2006) 27–38.

[615] P.R. Cook, Principles of Nuclear Structure and Functions, Wiley, New York, 2001.[616] S. Chambeyron, W.A. Bickmore, Does looping and clustering in the nucleus regulate gene expression?, Current Opin. Cell Biol. 16 (2004) 256–262.[617] S.M. Gasser, B.B. Amati, M.E. Cardenas, J.F. Hofmann, Studies on scaffold attachment sites and their relation to genome function, Int. Rev. Cytol. 119

(1989) 57–96.[618] D.A. Jackson, Structure–function relationships in eukaryotic nuclei, Bioessays 13 (1991) 1–10.[619] A. Taddei, F. Hediger, F.R. Neumann, S.M. Gasser, The function of nuclear architecture: a genetic approach, Annu. Rev. Genet. 38 (2004) 305–345.[620] T.M. Yusufzai, H. Tagami, Y. Nakatani, G. Felsenfeld, CTCF tethers an insulator to subnuclear sites, suggesting shared insulator mechanisms across

species, Mol. Cell 13 (2004) 291–298.[621] A.S. Belmont, J.W. Sedat, D.A. Agard, A three-dimensional approach to mitotic chromosome structure: evidence for a complex hierarchical

organization, J. Cell Biol. 105 (1987) 77–92.[622] A.S. Belmont, M.B. Braunfeld, J.W. Sedat, D.A. Agard, Large-scale chromatin structural domains within mitotic and interphase chromosomes in vivo

and in vitro, Chromosoma 98 (1989) 129–143.[623] P.R. Cook, The organization of replication and transcription, Science 284 (1999) 1790–1795.[624] A. Rosa, R. Everaers, Structure and dynamics of interphase chromosomes, PLoS Comput. Biol. 4 (2008) e1000153.[625] G. van den Engh, R. Sachs, B.J. Trask, Estimating genomic distance from DNA sequence location in cell nuclei by a random walk model, Science 257

(1992) 1410–1412.[626] P. Hahnfeldt, J.E. Hearst, D.J. Brenner, R.K. Sachs, L.R. Hlatky, Polymer models for interphase chromosomes, Proc. Natl. Acad. Sci. USA 90 (1993)

7854–7858.[627] H. Yokota, G. van den Engh, J.E. Hearst, R.K. Sachs, B.J. Trask, Evidence for the organization of chromatin in megabase pair-sized loops arranged along

a random walk path in the human G0/G1 interphase nucleus, J. Cell Biol. 130 (1995) 1239–1249.[628] T. Cremer, G. Kreth, H. Koester, R.H. Fink, R. Heintzmann, M. Cremer, I. Solovei, D. Zink, C. Cremer, Chromosome territories interchromatin domain

compartment, and nuclear matrix: an integrated view of the functional nuclear architecture, Crit. Rev. Eukaryot. Gene Expr. 10 (2000) 179–212.[629] D. Marenduzzo, C. Micheletti, P.R. Cook, Entropy-driven genome organization, Biophys J. 90 (2006) 3712–3721.[630] P.E. Hänninen, S.W. Hell, J. Salo, E. Soini, C. Cremer, Two-photon excitation 4Pi confocal microscope: Enhanced axial resolution microscope for

biological research, Appl. Phys. Lett. 66 (1995) 1698–1700.[631] M. Schrader, K. Bahlmann, G. Giese, S.W. Hell, 4Pi-confocal imaging in fixed biological specimens, Biophys. J. 75 (1998) 1659–1668.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 185

[632] M. Simonis, P. Klous, E. Splinter, Y.Moshkin, R.Willemsen, E.de.Wit, B. van Steensel,W.de. Laat, Nuclear organization of active and inactive chromatindomains uncovered by chromosome conformation capture-on-chip (4C), Nat. Genet. 38 (2006) 1348–1354.

[633] Z. Zhao, G. Tavoosidana, M. Sjölinder, A. Göndör, P. Mariano, S. Wang, C. Kanduri, M. Lezcano, K.S. Sandhu, U. Singh, V. Pant, V. Tiwari, S. Kurukuti,R. Ohlsson, Circular chromosome conformation capture (4C) uncovers extensive networks of epigenetically regulated intra- and interchromosomalinteractions, Nat. Genet. 38 (2006) 1341–1347.

[634] J. Dostie, T.A. Richmond, R.A. Arnaout, R.R. Selzer,W.L. Lee, T.A. Honan, E.D. Rubio, A. Krumm, J. Lamb, C. Nusbaum, R.D. Green, J. Dekker, Chromosomeconformation capture carbon copy (5C): a massively parallel solution for mapping interactions between genomic elements, Genome Res. 16 (2006)1299–1309.

[635] E. Lieberman-Aiden, N.L. van Berkum, L. Williams, M. Imakaev, T. Ragoczy, A. Telling, I. Amit, B.R. Lajoie, P.J. Sabo, M.O. Dorschner, R. Sandstrom,B. Bernstein, M.A. Bender, M. Groudine, A. Gnirke, J. Stamatoyannopoulos, L.A. Mirny, E.S. Lander, J. Dekker, Comprehensive mapping of long-rangeinteractions reveals folding principles of the human genome, Science 326 (2009) 289–293.

[636] S. Asakura, F. Oosawa, On interaction between two bodies immersed in a solution of macromolecules, J. Chem. Phys. 22 (1954) 1255–1256.[637] R.J. Ellis, Macromolecular crowding: obvious but underappreciated, Trends Biochem. Sci. 26 (2001) 597–604.[638] A.P. Minton, The influence of macromolecular crowding and macromolecular confinement on biochemical reactions in physiological media., J. Biol.

Chem. 276 (2001) 10577–10580.[639] A.G. Yodh, K.H. Lin, J.C. Crocker, A.D. Dinsmore, R. Verma, P.D. Kaplan, Entropically driven self-assembly and interaction in suspension, Phil. Trans.

Royal Soc. London A 359 (2001) 921–937.[640] D.Marenduzzo, K. Finan, P.R. Cook, The depletion attraction: an underappreciated force driving cellular organization, J. Cell Biol. 175 (2006) 681–686.[641] N.M. Toan, D. Marenduzzo, P.R. Cook, C. Micheletti, Depletion effects and loop formation in self-avoiding polymers, Phys. Rev. Lett. 97 (2006) 178302.[642] D. Marenduzzo, I. Faro-Trindade, P.R. Cook, What are the molecular ties that maintain genomic loops?, Trends Genet. 23 (2007) 126–133.[643] P.R. Cook, D. Marenduzzo, Entropic organization of interphase chromosomes, J. Cell Biol. 186 (2009) 825–834.[644] A.A. Philimonenko, D.A. Jackson, Z. Hodný, J. Janácek, P.R. Cook, P. Hozák, Dynamics of DNA replication: an ultrastructural study, J. Struct. Biol. 148

(2004) 279–289.[645] D.R.F. Carter, C. Eskiw, P.R. Cook, Transcription factories, Biochem. Soc. Trans. 36 (2008) 585–589.[646] P.R. Cook, A model for all genomes: the role of transcription factories, J. Mol. Biol. 395 (2010) 1–10.[647] H. Nakamura, T. Morita, C. Sato, Structural organizations of replicon domains during DNA synthetic phase in the mammalian nucleus, Exp. Cell Res.

165 (1986) 291–297.[648] O.L. Miller, A.H. Bakken, Morphological studies of transcription, Acta Endocrinol. Suppl. (Copenh) 168 (1972) 155–177.[649] D.A. Jackson, F.J. Iborra, E.M. Manders, P.R. Cook, Numbers and organization of RNA polymerases, nascent transcripts, and transcription units in HeLa

nuclei, Mol. Biol. Cell 9 (1998) 1523–1536.[650] C.H. Eskiw, A. Rapp, D.R.F. Carter, P.R. Cook, RNA polymerase II activity is located on the surface of protein-rich transcription factories, J. Cell Sci. 121

(2008) 1999–2007.[651] P.J. Shaw, E.G. Jordan, The nucleolus, Annu. Rev. Cell Dev. Biol. 11 (1995) 93–121.[652] P. Hozák, P.R. Cook, C. Schöfer, W. Mosgöller, F. Wachtler, Site of transcription of ribosomal RNA and intranucleolar structure in HeLa cells, J. Cell Sci.

107 (1994) 639–648.[653] A. Pombo, D.A. Jackson, M. Hollinshead, Z. Wang, R.G. Roeder, P.R. Cook, Regional specialization in human nuclei: visualization of discrete sites of

transcription by RNA polymerase III, EMBO J. 18 (1999) 2241–2253.[654] A. Kornberg, T.A. Baker, DNA Replication, WH Freeman Company, New York, 1992.[655] H. Lodish, A. Berk, S.L. Zipursky, P. Matsudaira, D. Baltimore, J. Darnell, Molecular Cell Biology, 4th ed., WH Freeman Co., New York, 1999.[656] Y. Snir, R.D. Kamien, Entropically driven helix formation, Science 307 (2005) 1067.[657] C.W. Jones, J.C. Wang, R.W. Briehl, M.S. Turner, Measuring forces between protein fibers by microscopy, Biophys. J. 88 (2005) 2433–2441.[658] H. Yamakawa, W.H. Stockmayer, Statistical mechanics of wormlike chains. II. Excluded volume effects, J. Chem. Phys. 57 (1972) 2843–2854.[659] S. Jun, J. Bechhoefer, B.-Y. Ha, Diffusion-limited loop formation of semiflexible polymers: Kramers theory and the intertwined time scales of chain

relaxation and closing, Europhys. Lett. 64 (2003) 420–426.[660] S. Jun, J. Herrick, A. Bensimon, J. Bechhoefer, Persistence length of chromatin determines origin spacing in Xenopus early-embryo DNA replication:

quantitative comparisons between theory and experiment, Cell Cycle 3 (2004) 223–229.[661] H. Reiss, H.L. Frisch, J.L. Lebowitz, Statistical mechanics of rigid spheres, J. Chem. Phys. 31 (1959) 369–380.[662] A.D. Dinsmore, A.G. Yodh, D.J. Pine, Phase diagrams of nearly-hard-sphere binary colloids, Phys. Rev. E 52 (1995) 4045–4057.[663] N.F. Carnahan, K.E. Starling, Equation of state for nonattracting rigid spheres, J. Chem. Phys. 51 (1969) 635–636.[664] S.de. Nooijer, J. Wellink, B. Mulder, T. Bisseling, Non-specific interactions are sufficient to explain the position of heterochromatic chromocenters

and nucleoli in interphase nuclei, Nucleic Acids Res. 37 (2009) 3558–3568.[665] M.A. Goldman, G.P. Holmquist, M.C. Gray, L.A. Caston, A. Nag, Replication timing of genes and middle repetitive sequences, Science 224 (1984)

686–692.[666] S. Farkash-Amar, D. Lipson, A. Polten, A. Goren, C. Helmstetter, Z. Yakhini, I. Simon, Global organization of replication time zones of the mouse

genome, Genome Res. 18 (2008) 1562–1570.[667] I. Hiratani, T. Ryba, M. Itoh, T. Yokochi, M. Schwaiger, C.-W. Chang, Y. Lyou, T.M. Townes, D. Schubeler, D.M. Gilbert, Global reorganization of

replication domains during embryonic stem cell differentiation, PLoS Biol. 6 (2008) e245.[668] A. Barski, S. Cuddapah, K. Cui, T.-Y. Roh, D.E. Schones, Z. Wang, G. Wei, I. Chepelev, K. Zhao, High-resolution profiling of histone methylations in the

human genome, Cell 129 (2007) 823–837.[669] N.D. Heintzman, R.K. Stuart, G. Hon, Y. Fu, C.W. Ching, R.D. Hawkins, L.O. Barrera, S.V. Calcar, C. Qu, K.A. Ching, W. Wang, Z. Weng, R.D. Green,

G.E. Crawford, B. Ren, Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome, Nat. Genet.39 (2007) 311–318.

[670] A.P. Boyle, S. Davis, H.P. Shulha, P. Meltzer, E.H. Margulies, Z. Weng, T.S. Furey, G.E. Crawford, High-resolution mapping and characterization of openchromatin across the genome, Cell 132 (2008) 311–322.

[671] H. Kohzaki, Y. Murakami, Transcription factors and DNA replication origin selection, Bioessays 27 (2005) 1107–1116.[672] M. Di filippo, G. Bernardi, Mapping DNase-I hypersensitive sites on human isochores, Gene 419 (2008) 62–65.[673] A.P. Bird, A.P. Wolffe, Methylation-induced repression–belts, braces, and chromatin, Cell 99 (1999) 451–454.[674] M.M. Suzuki, A. Bird, DNA methylation landscapes: provocative insights from epigenomics, Nat. Rev. Genet. 9 (2008) 465–476.[675] A. Bird, DNA methylation patterns and epigenetic memory, Genes Dev. 16 (2002) 6–21.[676] F. Eckhardt, J. Lewin, R. Cortese, V.K. Rakyan, J. Attwood, M. Burger, J. Burton, T.V. Cox, R. Davies, T.A. Down, C. Haefliger, R. Horton, K. Howe,

D.K. Jackson, J. Kunde, C. Koenig, J. Liddle, D. Niblett, T. Otto, R. Pettett, S. Seemann, C. Thompson, T. West, J. Rogers, A. Olek, K. Berlin, S. Beck,DNA methylation profiling of human chromosomes 6, 20 and 22, Nat. Genet. 38 (2006) 1378–1385.

[677] L. Duret, N. Galtier, The covariation between TpA deficiency, CpG deficiency, and G+C content of human isochores is due to a mathematical artifact,Mol. Biol. Evol. 17 (2000) 1620–1625.

[678] F. Antequera, A. Bird, CpG islands as genomic footprints of promoters that are associated with replication origins, Curr. Biol. 9 (1999) R661–R667.[679] C. Lemaitre, E. Tannier, C. Gautier,M.-F. Sagot, Precise detection of rearrangement breakpoints inmammalian chromosomes, BMCBioinform. 9 (2008)

286.[680] C. Lemaitre, L. Zaghloul, M.-F. Sagot, C. Gautier, A. Arneodo, E. Tannier, B. Audit, Analysis of fine-scale mammalian evolutionary breakpoints provides

new insight into their relation to genome organisation, BMC Genomics 10 (2009) 335.

186 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[681] D. Remus, E.L. Beall, M.R. Botchan, DNA topology, not DNA sequence, is a critical determinant for Drosophila ORC-DNA binding, EMBO J. 23 (2004)897–907.

[682] A. Schepers, M. Ritzi, K. Bousset, E. Kremmer, J.L. Yates, J. Harwood, J.F. Diffley, W. Hammerschmidt, Human origin recognition complex binds to theregion of the latent origin of DNA replication of epstein-barr virus, EMBO J. 20 (2001) 4588–4602.

[683] S. Vashee, C. Cvetic, W. Lu, P. Simancek, T.J. Kelly, J.C. Walter, Sequence-independent DNA binding and replication initiation by the human originrecognition complex, Genes Dev. 17 (2003) 1894–1908.

[684] C. Conti, B. Sacca, J. Herrick, C. Lalou, Y. Pommier, A. Bensimon, Replication fork velocities at adjacent replication origins are coordinately modifiedduring DNA replication in human cells, Mol. Biol. Cell 18 (2007) 3059–3067.

[685] H. Labit, A. Goldar, G. Guilbaud, C. Douarche, O. Hyrien, K. Marheineke, A simple and optimized method of producing silanized surfaces for FISH andreplication mapping on combed DNA fibers, Biotechniques 45 (2008) 649–658.

[686] K.Marheineke, A. Goldar, A. Krude, O. Hyrien, Use of DNA combing to study DNA replication in Xenopus and human cell-free systems, in: S. Vengrova,J.Z. Dalgaard (Eds.), DNA Replication Methods and Protocols, Springer, New York, 2009, pp. 575–603.

[687] O. Hyrien, K. Marheineke, A. Goldar, Paradoxes of eukaryotic DNA replication: MCM proteins and the random completion problem, Bioessays 25(2003) 116–125.

[688] O. Hyrien, A. Goldar, Mathematical modelling of eukaryotic DNA replication, Chromosome Res. 18 (2010) 147–161.[689] A. Goldar, O. Hyrien, Private communication.[690] L. Berguiga, S. Zhang, F. Argoul, J. Elezgaray, High-resolution surface-plasmon imaging in air and in water: V(z) curve and operating conditions, Opt.

Lett. 32 (2007) 509–511.[691] T. Roland, A. Khalil, A. Tanenbaum, L. Berguiga, P. Delichere, L. Bonneviot, J. Elezgaray, A. Arneodo, F. Argoul, Revisiting the physical processes of

vapodeposited thin gold films on chemically modified glass by atomic force and surface plasmon microscopies, Surf. Sci. 603 (2009) 3307.[692] S. Zhang, L. Berguiga, J. Elezgaray, N. Hugo, W.X. Li, T. Roland, H. Zeng, F. Argoul, Advances in surface plasmon resonance-based high throughput

biochips, Front. Phys. China 4 (2009) 469–480.[693] J. Elezgaray, T. Roland, L. Berguiga, F. Argoul, Modeling of scanning surface plasmon microscope, J. Opt. Soc. Am. A 27 (2010) 450–457.[694] A. Arneodo, G. Grasseau, M. Holschneider, Wavelet transform of multifractals, Phys. Rev. Lett. 61 (1988) 2281–2284.[695] M. Holschneider, On the wavelet transform of fractal objects, J. Stat. Phys. 50 (1988) 963–993.[696] S. Jaffard, Hölder exponents at given points and wavelet coefficients, C. R. Acad. Sci. Paris Sér. I 308 (1989) 79–81.[697] M. Holschneider, P. Tchamitchian, Régularité locale de la fonction non-différentiable de Riemann, in: P.G. Lemarié (Ed.), in: Les Ondelettes en 1989,

Springer, Berlin, 1990, pp. 102–124.[698] S. Jaffard, Pointwise smoothness two-microlocalization and wavelet coefficients, Publ. Mat. 35 (1991) 155–168.[699] S. Mallat, W. Hwang, Singularity detection and processing with wavelets, IEEE Trans. Inf. Theory 38 (1992) 617–643.[700] S. Mallat, S. Zhong, Characterization of signals from multiscale edges, IEEE Trans. Pattern Recognit. Mach. Intell. 14 (1992) 710–732.[701] J.-F. Muzy, E. Bacry, A. Arneodo, Wavelets and multifractal formalism for singular signals: application to turbulence data, Phys. Rev. Lett. 67 (1991)

3515–3518.[702] J.-F. Muzy, E. Bacry, A. Arneodo, Multifractal formalism for fractal signals: the structure–function approach versus the wavelet-transform modulus-

maxima methods, Phys. Rev. E 47 (1993) 875–884.[703] S. Jaffard, Multifractal formalism for functions part I: results valid for all functions, SIAM J. Math. Anal. 28 (1997) 944–970.[704] S. Jaffard, Multifractal formalism for functions part II: self-similar functions, SIAM J. Math. Anal. 28 (1997) 971–998.[705] H.G.E. Hentschel, Stochastic multifractality and universal scaling distributions, Phys. Rev. E 50 (1994) 243–261.[706] D. Veitch, P. Abry, A wavelet-based joint estimator of the parameters of long-range dependence, IEEE Trans. Inf. Theory 45 (1999) 878–897.[707] P. Abry, P. Flandrin, M.S. Taqqu, D. Veitch, Wavelets for the analysis estimation and synthesis of scaling data, in: K. Parks, W. Willinger (Eds.), Self

Similar Network Traffic Analysis and Performance Evaluation, Wiley, 2000, pp. 39–88.[708] P. Abry, R. Baraniuk, P. Flandrin, R. Riedi, D. Veitch, Multiscale nature of network traffic, IEEE Signal Process. Mag. 19 (2002) 28–46.[709] P. Abry, P. Flandrin, M.S. Taqqu, D. Veitch, Self-similarity and long-range dependence through the wavelet lens, in: P. Doukhan, G. Oppenheim, M. S.

Taqqu (Eds.), Theory and Applications of Long Range Dependence, Birkhauser, Boston, 2002.[710] S. Jaffard, B. Lashermes, P. Abry, Wavelet leaders in multifractal analysis, in: T. Qian, M.I. Vai, Y. Xu (Eds.), Wavelet Analysis and Applications,

Birkhäuser Verlag, Basel, Switzerland, 2006, pp. 219–264.[711] H. Wendt, P. Abry, Multifractality tests using bootstrapped wavelet leaders, IEEE Trans. Signal Process. 55 (2007) 4811–4820.[712] H. Wendt, P. Abry, S. Jaffard, Bootstrap for empirical multifractal analysis, IEEE Signal Process. Mag. 24 (2007) 38–48.[713] A. Arneodo, S. Manneville, J.-F. Muzy, Towards log-normal statistics in high Reynolds number turbulence, Eur. Phys. J. B 1 (1998) 129–140.[714] A. Arneodo, S. Manneville, J.-F. Muzy, S.G. Roux, Revealing a lognormal cascading process in turbulent velocity statistics withwavelet analysis, Philos.

Trans. R. Soc. Lond. A 357 (1999) 2415–2438.[715] S. Roux, J.-F. Muzy, A. Arneodo, Detecting vorticity filaments using wavelet analysis: about the statistical contribution of vorticity filaments to

intermittency in swirling turbulent flows, Eur. Phys. J. B 8 (1999) 301–322.[716] J. Delour, J.-F. Muzy, A. Arneodo, Intermittency of 1D velocity spatial profiles in turbulence: a magnitude cumulant analysis, Eur. Phys. J. B 23 (2001)

243–248.[717] V. Venugopal, S.G. Roux, E. Foufoula-Georgiou, A. Arneodo, Scaling behavior of high resolution temporal rainfall: new insights from a wavelet-based

cumulant analysis, Phys. Lett. A 348 (2006) 335–345.[718] V. Venugopal, S.G. Roux, E. Foufoula-Georgiou, A. Arneodo, Revisiting multifractality of high-resolution temporal rainfall using a wavelet-based

formalism, Water Resour. Res. 42 (2006) W06D14.[719] S.G. Roux, V. Venugopal, K. Fienberg, A. Arneodo, E. Foufoula-Georgiou, Evidence for inherent nonlinearity in temporal rainfall, Adv. Water Resourc.

32 (2009) 41–48.[720] A. Arneodo, J.-F. Muzy, D. Sornette, ’’Direct’’ causal cascade in the stock market, Eur. Phys. J. B 2 (1998) 277–282.[721] J.-F. Muzy, D. Sornette, J. Delour, A. Arneodo, Multifractal returns and hierarchical portfolio theory, Quant. Finance 1 (2001) 131–148.[722] P.C. Ivanov, M.G. Rosenblum, C.-K. Peng, J. Mietus, S. Havlin, H.E. Stanley, A.L. Goldberger, Scaling behaviour of heartbeat intervals obtained by

wavelet-based time-series analysis, Nature 383 (1996) 323–327.[723] P.C. Ivanov, L.A. Amaral, A.L. Goldberger, S. Havlin, M.G. Rosenblum, Z.R. Struzik, H.E. Stanley, Multifractality in human heartbeat dynamics, Nature

399 (1999) 461–465.[724] A. Arneodo, F. Argoul, E. Bacry, J.-F. Muzy, M. Tabard, Goldenmean arithmetic in the fractal branching of diffusion-limited aggregates, Phys. Rev. Lett.

68 (1992) 3456–3459.[725] A. Arneodo, F. Argoul, J.-F. Muzy, M. Tabard, Structural 5-fold symmetry in the fractal morphology of diffusion-limited aggregates, Physica A 188

(1992) 217–242.[726] A. Arneodo, F. Argoul, J.-F. Muzy, M. Tabard, Uncovering Fibonacci sequences in the fractal morphology of diffusion-limited aggregates, Phys. Lett. A

171 (1992) 31–36.[727] A. Kuhn, F. Argoul, J.-F. Muzy, A. Arneodo, Structural-analysis of electroless deposits in the diffusion-limited regime, Phys. Rev. Lett. 73 (1994)

2998–3001.[728] E. Freysz, B. Pouligny, F. Argoul, A. Arneodo, Optical wavelet transform of fractal aggregates, Phys. Rev. Lett. 64 (1990) 745–748.[729] J.-F. Muzy, B. Pouligny, E. Freysz, F. Argoul, A. Arneodo, Optical-diffraction measurement of fractal dimensions and f (α) spectrum, Phys. Rev. A 45

(1992) 8961–8964.[730] J. Arrault, A. Arneodo, A. Davis, A. Marshak, Wavelet based multifractal analysis of rough surfaces: application to cloud models and satellite data,

Phys. Rev. Lett. 79 (1997) 75–78.

A. Arneodo et al. / Physics Reports 498 (2011) 45–188 187

[731] A. Arneodo, N. Decoster, S.G. Roux, A wavelet-based method for multifractal image analysis. I. Methodology and test applications on isotropic andanisotropic random rough surfaces, Eur. Phys. J. B 15 (2000) 567–600.

[732] N. Decoster, S.G. Roux, A. Arneodo, A wavelet-based method for multifractal image analysis. II. Applications to synthetic multifractal rough surfaces,Eur. Phys. J. B 15 (2000) 739–764.

[733] A. Arneodo, N. Decoster, S.G. Roux, Intermittency, log-normal statistics and multifractal cascade process in high-resolution satellite images of cloudstructure, Phys. Rev. Lett. 83 (1999) 1255–1258.

[734] S.G. Roux, A. Arneodo, N. Decoster, A wavelet-based method for multifractal image analysis. III. Applications to high-resolution satellite images ofcloud structure, Eur. Phys. J. B 15 (2000) 765–786.

[735] A. Khalil, G. Joncas, F. Nekka, P. Kestener, A. Arneodo, Morphological analysis of HI features. II. Wavelet-based multifractal formalism, Astrophys. J.Suppl. Ser. 165 (2006) 512–550.

[736] P. Kestener, J.-M. Lina, P. Saint-Jean, A. Arneodo, Wavelet-based multifractal formalism to assist in diagnosis in digitized mammograms, Image Anal.Stereol. 20 (2001) 169–174.

[737] L.B. Caddle, J.L. Grant, J. Szatkiewicz, J. van Hase, B.-J. Shirley, J. Bewersdorf, C. Cremer, A. Arneodo, A. Khalil, K.D. Mills, Chromosome neighborhoodcomposition determines translocation outcomes after exposure to high-dose radiation in primary cells, Chromosome Res. 15 (2007) 1061–1073.

[738] A. Khalil, J.L. Grant, L.B. Caddle, E. Atzema, K.D. Mills, A. Arneodo, Chromosome territories have a highly nonspherical morphology and nonrandompositioning, Chromosome Res. 15 (2007) 899–916.

[739] A. Arneodo, N. Decoster, P. Kestener, S.G. Roux, A wavelet-based method for multifractal image analysis: from theoretical concepts to experimentalapplications, Adv. Imaging Electr. Phys. 126 (2003) 1–92.

[740] P. Kestener, A. Arneodo, Three-dimensional wavelet-based multifractal method: the need for revisiting the multifractal description of turbulencedissipation data, Phys. Rev. Lett. 91 (2003) 194501.

[741] P. Kestener, A. Arneodo, Generalizing thewavelet-basedmultifractal formalism to randomvector fields: application to three-dimensional turbulencevelocity and vorticity data, Phys. Rev. Lett. 93 (2004) 044501.

[742] P. Kestener, A. Arneodo, A multifractal formalism for vector-valued random fields based on wavelet analysis: application to turbulent velocity andvorticity 3D numerical data, Stoch. Environ. Res. Risk Assess. 22 (2007) 421–435.

[743] A. Arneodo, E. Bacry, J.-F. Muzy, Oscillating singularities in locally self-similar functions, Phys. Rev. Lett. 74 (1995) 4823–4826.[744] A. Arneodo, E. Bacry, S. Jaffard, J.-F. Muzy, Oscillating singularities on Cantor sets: a grand-canonical multifractal formalism, J. Stat. Phys. 87 (1997)

179–209.[745] A. Arneodo, E. Bacry, S. Jaffard, J.-F. Muzy, Singularity spectrum of multifractal functions involving oscillating singularities, J. Fourier Anal. Appl. 4

(1998) 159–174.[746] G. Parisi, U. Frisch, Fully developed turbulence and intermittency, in: M. Ghil, R. Benzi, G. Parisi (Eds.), Turbulence and Predictability in Geophysical

Fluid Dynamics and Climate Dynamics, in: Proc. of Int. School, North-Holland, Amsterdam, 1985, pp. 84–88.[747] T.C. Halsey, M.H. Jensen, L.P. Kadanoff, I. Procaccia, B.I. Shraiman, Fractal measures and their singularities: the characterization of strange sets, Phys.

Rev. A 33 (1986) 1141–1151.[748] P. Collet, J. Lebowitz, A. Porzio, The dimension spectrum of some dynamical systems, J. Stat. Phys. 47 (1987) 609–644.[749] G. Paladin, A. Vulpiani, Anomalous scaling laws in multifractal objects, Phys. Rep. 156 (1987) 147–225.[750] P. Grassberger, R. Badii, A. Politi, Scaling laws for invariant measures on hyperbolic and non hyperbolic attractors, J. Stat. Phys. 51 (1988) 135–178.[751] D. Rand, The singularity spectrum for hyperbolic Cantor sets and attractors, Ergod. Th. Dyn. Sys. 9 (1989) 527–541.[752] J.D. Farmer, E. Ott, J.A. Yorke, The dimension of chaotic attractors, Physica D 7 (1983) 153–180.[753] P. Grassberger, I. Procaccia, Measuring the strangeness of strange attractors, Physica D 9 (1983) 189–208.[754] T. Bohr, T. Tèl, The thermodynamics of fractals, in: B.L. Hao (Ed.), in: Direction in Chaos, vol. 2, World Scientific, Singapore, 1988, pp. 194–237.[755] F. Argoul, A. Arneodo, J. Elezgaray, G. Grasseau, Wavelet analysis of the self-similarity of diffusion-limited aggregates and electrodeposition clusters,

Phys. Rev. A 41 (1990) 5537–5560.[756] C. Meneveau, K.R. Sreenivasan, The multifractal nature of turbulent energy-dissipation, J. Fluid Mech. 224 (1991) 429–484.[757] A. Arneodo, B. Audit, P. Kestener, S.G. Roux, Wavelet-based multifractal analysis, Scholarpedia 3 (2008) 4103.[758] B.B. Mandelbrot, J.W. van Ness, Fractional Brownian motions, fractal noises and applications, SIAM Rev. 10 (1968) 422–437.[759] H.O. Peitgen, D. Saupe (Eds.), The Science of Fractal Images, Springer, New York, 1987.[760] R. Benzi, L. Biferale, A. Crisanti, G. Paladin, M. Vergassola, A. Vulpiani, A random process for the construction of multiaffine fields, Physica D 65 (1993)

352–358.[761] A. Arneodo, E. Bacry, J.F. Muzy, Random cascades on wavelet dyadic trees, J. Math. Phys. 39 (1998) 4142–4164.[762] T. Higuchi, Relationship between the fractal dimension and the power law index for a time series: a numerical investigation, Physica D 46 (1990)

254–264.[763] N.P. Greis, H.P. Greenside, Implication of a power-law power-spectrum for self-affinity, Phys. Rev. A 44 (1991) 2324–2334.[764] G. Wornell, A.V. Oppenheim, Estimation of fractal signals from noisy measurements using wavelets, IEEE Trans. Signal Process. 40 (1992) 611–623.[765] J. Beran, Statistics for Long-Memory Processes, Chapman & Hall, New York, 1994.[766] J. Schmittbuhl, J.-P. Vilotte, S. Roux, Reliability of self-affine measurements, Phys. Rev. E 51 (1995) 131–147.[767] A. Scotti, C. Meneveau, S.G. Saddoughi, Fractal dimension of velocity signals in high-Reynolds-number hydrodynamic turbulence, Phys. Rev. E 51

(1995) 5594–5608.[768] M.S. Taqqu, V. Teverovsky, W. Willinger, Estimators for long-range dependence: an empirical study, Fractals 3 (1995) 785–798.[769] A.R. Mehrabi, H. Rassamdana, M. Sahimi, Characterization of long-range correlations in complex distributions and profiles, Phys. Rev. E 56 (1997)

712–722.[770] B. Pilgram, D.T. Kaplan, A comparison of estimators for 1/f noise, Physica D 114 (1998) 108–122.[771] P. Flandrin, On the spectrum of fractional Brownian motions, IEEE Trans. Info. Theory 35 (1989) 197–199.[772] P. Flandrin, Wavelet analysis and synthesis of fractional brownian motion, IEEE Trans. Inf. Theory 38 (1992) 910–917.[773] A. Tewfik,M. Kim, Correlation structure of the discretewavelet coefficients of fractional Brownianmotions, IEEE Trans. Inf. Theory 38 (1992) 904–909.[774] E. Masry, The wavelet transform of stochastic processes with stationary increments and its application to fractional Brownian motions, IEEE Trans.

Inf. Theory 39 (1993) 260–264.[775] P. Abry, P. Gonçalvès, P. Flandrin,Wavelets spectrum and 1/f processes, in: A. Antoniadis, G. Oppenheim (Eds.), in: Lect. Notes Stat., vol. 105, Springer

Verlag, Berlin, 1995, pp. 15–30.[776] C.L. Jones, G.T. Lonergan, D.E. Mainwaring, Wavelet packet computation of the Hurst exponent, J. Phys. A: Math. Gen. 29 (1996) 2509–2527.[777] P. Abry, P. Flandrin, M. Taqqu, D. Veitch, Wavelet for the analysis, estimation and synthesis of scaling data, in: K. Parks, W. Willinger (Eds.), Self

Similarity in Network Traffic, Wiley, New York, 1998.[778] J. Pando, L.-Z. Fang, Discrete wavelet transform power spectrum estimator, Phys. Rev. E 57 (1998) 3593–3601.[779] I. Simonsen, A. Hansen, O.M. Nes, Determination of the Hurst exponent by use of wavelet transforms, Phys. Rev. E 58 (1998) 2779–2787.[780] L. Delbeke, P. Abry, Stochastic integral representation and properties of the wavelet coefficients of the linear fractional stable motions, Stochastic

Process. Appl. 86 (2000) 177–182.[781] B. Audit, E. Bacry, J.-F. Muzy, A. Arneodo, Wavelet-based estimators of scaling behavior, IEEE Trans. Inf. Theory 48 (2002) 2938–2954.[782] B. Dubuc, J.F. Quiniou, C. Roques-Carmes, C. Tricot, S.W. Zucker, Evaluating the fractal dimension of profiles, Phys. Rev. A 39 (1989) 1500–1512.[783] R.F. Voss, Random fractals: self-affinity in noise, music, mountains, and clouds, Physica D 38 (1989) 362–371.[784] G. Samorodnisky, M.S. Taqqu, Stable Non-Gaussian Random Processes, Chapman and Hall, New York, 1994.

188 A. Arneodo et al. / Physics Reports 498 (2011) 45–188

[785] B.B. Mandelbrot, Intermitttent turbulence in self-similar cascades: divergence of highmoments and dimension of the carrier, J. FluidMech. 62 (1974)331–358.

[786] A. Arneodo, E. Bacry, S.Manneville, J.F.Muzy, Analysis of random cascades using space-scale correlation functions, Phys. Rev. Lett. 80 (1998) 708–711.[787] E.A. Novikov, Infinitely divisible distributions in turbulence, Phys. Rev. E 50 (1994) 3303–3305.[788] B. Castaing, B. Dubrulle, Fully-developed turbulence — a unifying point-of-view, J. Phys. II France 5 (1995) 895–899.[789] L.D. Landau, E.M. Lifshitz, Theory of Elasticity, Pergamon Press, Oxford NY, 1970.[790] F. Valle, M. Favre, P.De. Los rios, A. Rosa, G. Dietler, Scaling exponents and probability distributions of DNA end-to-end distance, Phys. Rev. Lett. 95

(2005) 158105.[791] R.R. Netz, H. Orland, Variation theory for a simple polyelectrolyte chain, Eur. Phys. J. B 8 (1999) 81–98.[792] T. Odijk, Polyelectrolytes near the rod limit, J. Polym. Sci. 15 (1977) 477–483.[793] J. Skolnick, M. Fixman, Electrostatic persistence length of a wormlike polyelectrolyte, Macromolecules 10 (1977) 944–948.[794] J.L. Barrat, J.-F. Joanny, Persistence length of polyelectrolyte chains, Europhys. Lett. 24 (1993) 333–338.[795] B.-Y. Ha, D. Thirumalai, Electrostatic persistence length of a polyelectrolyte chain, Macromolecules 28 (1995) 577–581.[796] F. Sellan, Synthèse de mouvements fractionnaires à l’aide de la transformation en ondelettes, C. R. Acad. Sci. Paris Sér. I 321 (1995) 351–358.[797] P. Abry, F. Sellan, The wavelet-based synthesis for fractional brownian motion proposed by F. Sellan and Y. Meyer: remarks and implementation,

Appl. Comp. Harm. Anal. 3 (1996) 377–383.[798] Y. Meyer, F. Sellan, M. Taqqu, Wavelets, generalized white noise and fractional integration: the synthesis of fractional brownian motion, J. Fourier

Anal. Appl. 5 (1999) 466–494.[799] G. Box, G. Jenkins, Time Series Analysis: Forecasting and Control, Holdenday, Oakland, 1976.[800] J.R.M. Hosking, Fractional differencing, Biometrika 68 (1981) 165–176.