Using Trawler_standalone to discover overrepresented motifs in DNA and RNA sequences derived from...


PROTOCOL

© 2010 Nature Publishing Group http://www.nature.com/natureprotocols

Using Trawler_standalone to discover overrepresented motifs in DNA and RNA sequences derived from various experiments including chromatin immunoprecipitation

Yannick Haudry1,3,4, Mirana Ramialison1,3,4, Benedict Paten2, Joachim Wittbrodt1,3 & Laurence Ettwiller1,3

1Developmental Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany. 2Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California, USA. 3Present addresses: Heidelberg Institute of Zoology, University of Heidelberg, Heidelberg, Germany (Y.H., M.R., J.W. and L.E.); Institute for Toxicology and Genetics, Forschungszentrum Karlsruhe, Postfach 3640, Karlsruhe D-76021, Germany (J.W.). 4These authors contributed equally to this work. Correspondence should be addressed to L.E. ([email protected]).

Published online 4 February 2010; doi:10.1038/nprot.2009.158

Genome-wide location analysis has become a standard technology to unravel gene regulation networks. The accurate characterization of nucleotide signatures in sequences is key to uncovering the regulatory logic but remains a computational challenge. This protocol describes how to best characterize these signatures (motifs) using the new standalone version of Trawler, which was designed and optimized to analyze chromatin immunoprecipitation (ChIP) data sets. In particular, we describe the three main steps of Trawler_standalone (motif discovery, clustering and visualization) and discuss the appropriate parameters to be used in each step depending on the data set and the biological questions addressed. Compared to five other motif discovery programs, Trawler_standalone is in most cases the fastest algorithm to accurately predict the correct motifs, especially for large data sets. Its running time ranges from a few seconds to several minutes, depending on the size of the data set and the parameters used. This protocol is best suited for bioinformaticians seeking to use Trawler_standalone in a high-throughput manner.

NATURE PROTOCOLS | VOL.5 NO.2 | 2010 | 323

INTRODUCTION
The instructions necessary for the precise spatio-temporal control of gene transcription are encoded in cis-regulatory sequences. Such regulatory sequences are generally composed of short degenerate motifs that are recognized by a limited number of trans-acting factors. In a vast and constantly growing number of experiments aimed at studying the binding pattern of transcription factors, the occurrences of motifs are expected and observed to be enriched within bound loci. The characterization of motifs is key to understanding the regulatory logic. This is particularly true within genome-wide chromatin immunoprecipitation (ChIP) experiments that locate not only the specific cis-regulatory loci of a particular factor but also the surrounding sequences.

By combining the output of these experiments with a fast and accurate algorithm able to investigate the distribution of a large number of motifs, one can define part of the underlying regulatory logic. Three challenges are inherent in the informatics of this process: (a) the number of overrepresented binding sites is unknown; (b) factors often bind with significant affinity to various sites, resulting in consensus-binding sites that are often degenerate and which rarely exceed 10 bp; and (c) frequently, most of the sequences used in the discovery process do not correspond to binding site(s) and therefore constitute noise.

We have previously developed Trawler [1], a computational pipeline aimed at characterizing overrepresented motifs in ChIP pulled down sequences. Trawler does not set any prior expectation on the number of motifs to be found; rather, similar overrepresented motifs are clustered to form degenerate motifs characteristic of transcription factor-binding specificity. These motifs are then compared with known binding sites from Jaspar [2], as well as from a large set of Homeodomain protein recognition sites [3]. If orthologous sequences or alignments are provided, Trawler can also search for the most likely functional instances of the motif, based on their evolutionary conservation.

Technically, Trawler is not limited to ChIP pulled down sequences and can process any data set in which nucleotide signatures are expected to be overrepresented relative to a background of random sequences. Trawler has been successfully applied to analyze data sets derived from different types of wet-lab experiments (e.g., expression profiling using microarrays, large-scale in situ hybridization, DNA adenine methyltransferase identification) as well as bioinformatics analyses (e.g., data mining, GO category) [4]. Furthermore, the method is not limited to the discovery of transcription factor-binding sites. For example, Trawler has been shown to be able to accurately define microRNA targets [4], finding 70% of targets in a test data set. Other signatures that can be analyzed by Trawler include, but are not limited to, insulator binding domains and Polycomb protein complex-binding sites.

Alternative methods
Other motif discovery tools are already available as a web-based version and/or as a standalone version (e.g., AlignACE [5], Amadeus [4], MDscan [6], MEME [7] and Weeder [8]). We have previously shown that Trawler is the fastest computational pipeline for motif discovery [1]. Nevertheless, to ascertain that Trawler_standalone retains the same properties, we have compared Trawler_standalone with the motif discovery algorithms cited above (Table 1) that are publicly available in a standalone mode, easily installable and that require FASTA formatted sequences as input. We used two different data sets: CREB ChIP-chip data [9] and miR106b microRNA target genes [10] (derived from the Amadeus compendium [4]). Trawler_standalone, Amadeus and Weeder are the only programs that could accurately predict the correct binding sites. We therefore compared the running time performance of these three programs by analyzing three additional data sets: E2F [11] and MYOD [12] ChIP-chip data and miR155 microRNA target genes [13] (Table 2). Motifs are compared with the expected motifs [14–16]. In three of the five data sets, Trawler_standalone is the fastest.

Advantages and limitations of the protocol
Trawler_standalone has been designed to run on a command line. Therefore, it is particularly suited to the use of batch queries for high-throughput bioinformatics analysis. As this standalone version is locally installed on the user’s computer, it runs faster than the web version installed on a remote server. It also provides an in-house solution for industry users and others who have concerns about sending data through the web. Compared with the web version of Trawler, Trawler_standalone has a number of options that can be tailored to a specific biological question and is not restricted to signatures of transcription factor binding sites. Specifically, these options are ‘-overlap’ (the percentage of overlap between similar instances that will be clustered in one single motif), ‘-motif_number’ (the number of instances to be used for the clustering) and ‘-xtralen’ (the number of positions to add around the final position weight matrix). Regarding the software itself, we have added new algorithmic features: ‘-strand’ (allowing the search for motifs either on a single-stranded molecule (e.g., for RNA sequences) or on a double-stranded molecule (e.g., for DNA sequences)) and ‘-nb_of_cluster’ (the number of motif clusters to expect, found by k-means clustering; if the option is set to ‘som’, the self-organizing map (SOM) clustering algorithm is run instead). Finally, we have provided a new html interface for displaying Trawler_standalone results that makes them easier to visualize. Furthermore, the positions of the motifs, relative to the length of the submitted sequences, are now visible through a distribution graph. Nevertheless, because Trawler_standalone has been designed as a command line tool and needs to be installed locally, casual users may want to use the Trawler web interface instead.

In this protocol, we describe how to install the standalone version of the program and run the Trawler_standalone pipeline with various types of data sets. The step-by-step procedure will guide the reader through the important stages of choosing a proper background sequence set and defining the parameters to efficiently extract the overrepresented motifs.
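Because Trawler_standalone is driven from the command line, a batch of sample files can be processed with a short script. The following is a minimal sketch under stated assumptions, not part of the Trawler_standalone distribution: the function name print_trawler_batch is hypothetical, file names are assumed to contain no single quotes, and only the -sample and -background arguments described in this protocol are used. It prints the commands instead of running them, so that they can be reviewed first.

```shell
# Hypothetical helper (not shipped with Trawler_standalone): print one
# trawler.pl command per sample file, all sharing the same background file.
# Usage: print_trawler_batch BACKGROUND SAMPLE [SAMPLE ...]
print_trawler_batch() (
    background=$1
    shift
    for sample in "$@"; do
        # Paths are wrapped in single quotes so the printed line can be
        # pasted into a shell (assumes the paths contain no single quotes).
        printf "perl bin/trawler.pl -sample '%s' -background '%s'\n" \
            "$sample" "$background"
    done
)
```

Called from the Trawler_standalone installation directory, for example as `print_trawler_batch examples/background_yeast_Young_6k.fsa my_samples/*.fsa` (my_samples/ is a placeholder directory), the function lists one command per sample file; piping that output to `sh` runs the commands one after another.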

Experimental design
The Trawler_standalone program takes a minimum of two arguments: the sample and background files containing the DNA/RNA sequences.

The sample file corresponds to the sequences expected to contain the overrepresented motifs. For ChIP data, for example, these sequences correspond to the pulled down loci. Sequences need to be in FASTA format and should correspond to the nucleotide alphabet (ATCG or N, in uppercase letters). The file can contain as many sequences as needed. The length of the sequences can be variable or fixed. Unless the signal is expected to occur specifically in repeat regions, the sequences should be repeat-masked for both species-specific repeats and low complexity regions. The RepeatMasker program provides extensive help and a list of species-specific repeats. When appropriate, repeat-masking the sequence is often key to the success of the procedure, as motifs in repeats tend to be highly overrepresented if the repeat is found slightly more often in the sample sequence compared with the background.
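The alphabet requirement above can be checked before a run. The sketch below is an illustration only: check_fasta_alphabet is a hypothetical helper, not a Trawler_standalone command. It reports every sequence line that contains a character other than uppercase A, C, G, T or N, for example soft-masked lowercase bases, and returns a non-zero status when any such line is found.

```shell
# Hypothetical helper: list sequence lines that contain characters other
# than uppercase A, C, G, T or N. Output format is "line_number:line".
# Returns 0 when the file is clean and 1 when at least one line is flagged.
check_fasta_alphabet() {
    awk '
        # Skip FASTA headers (">...") and empty lines; flag everything else
        # that contains a character outside the accepted alphabet.
        !/^>/ && NF && /[^ACGTN]/ { printf "%d:%s\n", NR, $0; bad = 1 }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

For example, `check_fasta_alphabet my_sample.fa && echo "alphabet OK"` (my_sample.fa is a placeholder name) prints the confirmation only when no line was flagged.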

The background file is as important as the sample because the overrepresentation of motifs is always relative to the background that is provided. The background should therefore be of the same ‘nature’ as the sample sequences. For example, if the sample sequences correspond to the 5-kb upstream sequence of the gene start site in one species, the background should correspond to 5-kb sequences upstream of random genes in the same species. If the background corresponds to broader intergenic sequences and the sample corresponds to a promoter, for example, Trawler_standalone is likely to find general features of promoters (TATA, CCAAT-box) instead of the specific signal from the sample data set. In general, the background should correspond to the biological question. The identical procedures (repeat-masking) should be applied to both the sample and the background sequences. Downloadable backgrounds are available from the Trawler_standalone package. In addition to the Trawler_standalone package, backgrounds corresponding to various regions (promoter, 3′-UTRs, random sequences) in the most commonly used species can be obtained from http://ani.embl.de/laurence/blog/backgrounds.zip.

Table 1 | Motif finding performances (top 2 highest scoring motifs).

Expected motifs: CREB (from JASPAR); miR106b (from miRBASE [16]), uaaagugcugacagugcagau.

Standalone program     Parameters
Trawler_standalone     -nb_of_cluster 2; -strand single; memory: -Xmx2000m
Amadeus                Defaults
Weeder                 Small – T15
MDscan                 Max length = 10; half the background only for miR106b
AlignACE               Defaults
MEME                   Defaults

[Motif logos for each program and data set not shown.]

The programs were run using an Intel Core 2 Duo, 2.4 GHz processor with 4 GB of memory. WebLogo [14] was used to generate the motif logos for Weeder, MDscan, AlignACE and MEME.

Table 2 | Comparisons between motifs found by Trawler_standalone, Amadeus [4] and Weeder [8] and respective time performances (h:min:sec).

Motif expected                              Trawler_standalone       Amadeus            Weeder
Parameters                                  JAVA_OPTS = -Xmx2000m    run_1.3G_mem.bat   Small – T15
E2F [11]                                    00:00:14                 00:09:16           00:03:04
MyoD [12]                                   00:04:45                 00:11:42           00:04:26
Creb [9]                                    00:01:57                 00:07:06           00:34:53
miR155 [13] (uuaaugcuaauugugauaggggu)       00:01:54                 00:04:52           00:06:12
miR106b [10] (uaaagugcugacagugcagau)        00:01:54                 00:01:12           00:07:07

[Motif logos not shown.]

Expected motifs were retrieved from Transfac [15], Jaspar [2] and miRBASE [16]. WebLogo [14] was used to generate the motif logos for Weeder. The programs were run on an Intel Core 2 Duo, 2.4 GHz processor with 4 GB of memory. For each data set, the programs’ speeds are indicated according to the following color code: fastest (cyan), medium (gray) and slowest (yellow).

Figure 1 | Overview of the Trawler_standalone pipeline. Description of the different steps of Trawler_standalone when analyzing data with or without orthologs. Diagram labels: files to prepare (sample file, background file; with orthology, a list of files and the alignment files); command line (file parameters, motif description parameters, clustering parameters, output parameters); overrepresented motifs; clusters; annotate sequence or alignment; compare with known TFBS (treg); web interface.


Alternatively, other types of background can be obtained from other sources [4].

The default parameters were optimized for ChIP data sets, but these parameters can be modified directly from the command line (see Fig. 1). Once launched, Trawler_standalone automatically performs three steps. First, it searches for overrepresented motifs satisfying the motif description (see motif description parameters). Second, the motifs are clustered into families (see clustering parameters) using strongly connected components (default) or Cluster (SOM or k-means) [17]. Third, instances of the motifs are mapped back to the sample sequences.

Trawler_standalone can also assess the evolutionary conservation of overrepresented motifs. Using this option, the overrepresented motifs are mapped back to aligned sequences instead (see PROCEDURE Step 3, and other options in subsequent steps for running Trawler_standalone with orthologs). The alignment and the highlighted motifs can be viewed using Jalview [18]. Trawler_standalone also provides a ranked list of loci where the most likely functional sites were found, based on the motif conservation.

Furthermore, the motifs found de novo are screened for known transcription factor-binding sites using a database of binding sites compiled from Jaspar [2] and a recently published Homeobox binding site data set [3] (see Extra parameters). The comparison is performed using Treg Comparator [19]. Finally, Trawler_standalone displays all the data in a user-friendly interface.

All these steps are tunable and can be tailored to one’s specific data set through the use of different options (Fig. 2). For instance, for analyzing microRNA target genes, the default parameters are usually sufficient. For ChIP data sets, the -strand option could be set to double, to also search for DNA binding sites on the reverse strand of the sample sequences. To find overrepresented motifs in microarray expression profiling or co-expression data sets, we recommend setting the -occurrence option to half of the number of sequences submitted in the sample file. Indeed, it is usually difficult to extract meaningful motifs from putatively co-regulated genes if the number of occurrences of shared motifs is set too low.
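The recommended -occurrence value for expression data sets can be derived from the sample file itself. The helper below is a sketch, not part of Trawler_standalone: half_of_sequences is a hypothetical name, and rounding odd counts up is an assumption, because the protocol only says "half of the number of sequences".

```shell
# Hypothetical helper: count the FASTA records in a file (lines starting
# with ">") and print half that number, rounded up for odd counts.
half_of_sequences() (
    # grep -c exits with status 1 when it counts zero matches, so "|| true"
    # keeps the function usable under "set -e".
    n=$(grep -c '^>' "$1" || true)
    echo $(( (n + 1) / 2 ))
)
```

The value can then be passed on the command line, for example `perl bin/trawler.pl -sample my_sample.fa -background my_background.fa -occurrence "$(half_of_sequences my_sample.fa)"`, where both file names are placeholders.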

Trawler_standalone results are presented through a web-based display page (Figs. 3–5) and all the output files are available in the results directory.

Controls
The Trawler_standalone package includes example data sets of different types known to contain overrepresented motifs. The user should first run the example data set as described in this protocol and verify whether the version of Trawler_standalone installed locally can find overrepresented motifs and whether the overrepresented motifs correspond to the expected motif(s) as described in Tables 1 and 2. Once this control is positive, the user can run Trawler_standalone on their own data set by replacing the provided example files with their own files (see Steps 1–3 of the PROCEDURE for the file formats needed).

Figure 2 | Tunable options of Trawler_standalone. Trawler_standalone’s outputs according to various options selected. For each option, the figure shows the Trawler_standalone output (motif logos, not shown here) and the following description.

Default options: These are the options that give satisfying results in most ChIP data sets. These options have been optimized for speed. For this data set, the running time is less than a minute (99 sequences, 50,000 bp).

Default options -strand single: Using the single-strand option forces Trawler_standalone to consider the reverse complement motif as a different motif (unless it is a true palindromic motif). In this example (ChIP data set), Trawler_standalone reports two clusters, one being the reverse complement of the other one. This option is useful if the distinction between the plus and minus strand is important, as is the case with RNA.

Default options -nb_of_cluster som: With this option set to ‘som’, the SOM clustering algorithm is used rather than the default one (based on strongly connected graphs). The SOM clustering tends to under-cluster the motifs. Consequently, the output of Trawler_standalone consists of many clusters of slightly different motif composition. This option is useful for finding small differences in binding specificity.

Default options -nb_of_cluster 1: The user can fix the number of clusters instead of letting Trawler_standalone automatically evaluate the number of clusters. Using this option, the k-means clustering algorithm is used. Setting this option to 1 forces all the motifs found to be included in one cluster. Before fixing the number of clusters, it is advisable to first run Trawler_standalone using the default options to estimate the total number of different motif types.

Default options -xtralen 4: No change in the motif discovery or clustering steps. No change in terms of running time. The PWM reported contains four flanking bases on both sides of the motif found.

Default options -wildcard 4: With more wildcards allowed in individual motifs, Trawler_standalone is more sensitive. The motif description should not change but the running time increases drastically (6 min). To be used only if no satisfactory motifs have been found with wildcard 2 (default).

Default options -occurrence 50: With this option, we force the motif to be found at least 50 times in the sample set. With an increase in the occurrence, the motif’s description tends to be smaller. To be used if the real signal is masked by a sequence repeated a few times (repeat sequences).

Default options -occurrence 100: In this example, the occurrence parameter has been set too high. Clearly, not all the sequences in the sample have the binding motif, and therefore Trawler_standalone finds only a small part of the binding motif.

Default options -mlength 11: The minimum size of each individual motif has to be 11 bp long. The PWMs obtained might be too specific if the -mlength is too high. May miss small motifs if -mlength is too high.


MATERIALS
EQUIPMENT
A personal computer and a web browser are required (see EQUIPMENT SETUP).

EQUIPMENT SETUP
Hardware: Supports common operating systems including Linux-based distributions (e.g., Ubuntu and Fedora), Mac OS X (Apple) and Windows (Microsoft).

Java platform: Ensure that the Standard Edition of the Java software is installed (version 5.0 or above, see http://java.sun.com/javase/downloads/index.jsp).

Perl: Ensure that a recent version of Perl is installed (5.6 or above) with the Algorithm::Cluster module (see Boxes 1 and 2).

Trawler standalone: Download Trawler_standalone from the Trawler homepage. This website also hosts links to examples of background files, documentation, answers to FAQs, examples and package distribution downloads. Select the download link and the appropriate version for the operating system. See Boxes 1 and 2 for detailed instructions on the installation. The distribution contains an INSTALL file and a README file where one can find details on the installation procedure and command line arguments. Detailed documentation on Trawler_standalone is available in the Trawler_standalone ‘README’ file, which is accessible through a link in the top-right corner of the main display page (Fig. 3a). Additional links to Trawler_standalone’s blog and Trawler’s web version are also provided, together with links to license terms and contact information.

PROCEDURE
Preparing sequences and parameters to run Trawler_standalone ● TIMING a few minutes
1| Compiling the sample file. Sequences need to be in FASTA format and should correspond to the nucleotide alphabet (ATCG or N, in uppercase letters). Repeat-mask the sequences if needed using the RepeatMasker program.
CRITICAL STEP Mask the sequences with ‘N’; do not soft-mask the sequences.
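If only a soft-masked file is available, in which repeats are typically written in lowercase letters, it can be converted to the hard-masked form required in Step 1. This is a minimal sketch that assumes lowercase bases mark the masked regions; hard_mask_fasta is a hypothetical helper, not part of RepeatMasker or Trawler_standalone.

```shell
# Hypothetical helper: turn a soft-masked FASTA file (assumption: repeats
# are lowercase) into a hard-masked one by replacing every lowercase letter
# on sequence lines with N. Header lines (">...") are left untouched.
hard_mask_fasta() {
    # LC_ALL=C keeps the [[:lower:]] class restricted to ASCII a-z.
    LC_ALL=C sed '/^>/!s/[[:lower:]]/N/g' "$1"
}
```

The result is written to standard output, so a call such as `hard_mask_fasta soft_masked.fa > hard_masked.fa` (placeholder file names) leaves the input file unchanged.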

2| Compiling the background file. As for the sample file, the sequences need to be in FASTA format and should correspond to the nucleotide alphabet (ATCG or N, in uppercase letters). Repeat-mask the background sequences if the sample file has been repeat-masked. It may be possible to use a precompiled background; the Trawler_standalone package provides backgrounds corresponding to various regions (promoter, UTRs, random sequences) in the most commonly used species, as described in Box 3. Download these backgrounds separately from http://ani.embl.de/laurence/blog/backgrounds.zip. Alternatively, other types of background can be obtained from other sources [4]. If no background corresponding to the user’s sample data set is available in Trawler_standalone or other sources, it is desirable to produce one’s own background sequence file.
CRITICAL STEP Ensure that an identical procedure (repeat-masking, trimming, etc.) is applied to both the sample and the background sequences.
CRITICAL STEP Ensure that the background file is larger than the sample file in terms of the total length of sequences. Ideally, the background file should be at least twice as large as the sample file.
CRITICAL STEP Ensure that both data sets (the background and sample sequences) contain only one instance of any given sequence.
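The size and uniqueness checks in Step 2 can be scripted. The helpers below are a sketch under stated assumptions: all three function names are hypothetical, "size" is taken to mean the total number of residues, and duplicates are detected by exact sequence identity within one file regardless of the sequence IDs.

```shell
# Hypothetical helper: total number of residues in a FASTA file
# (header lines are ignored; whitespace is not counted).
fasta_total_length() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END { printf "%.0f\n", total }' "$1"
}

# Hypothetical helper: succeed when the background (second argument) holds
# at least twice as many residues as the sample (first argument).
background_is_large_enough() (
    sample_len=$(fasta_total_length "$1")
    background_len=$(fasta_total_length "$2")
    [ "$background_len" -ge $(( 2 * sample_len )) ]
)

# Hypothetical helper: print "count<TAB>sequence" for every sequence that
# occurs more than once in the file, whatever its ID.
duplicate_sequences() {
    awk '
        /^>/ { if (seq != "") count[seq]++; seq = ""; next }
        { gsub(/[ \t\r]/, ""); seq = seq $0 }
        END {
            if (seq != "") count[seq]++
            for (s in count) if (count[s] > 1) print count[s] "\t" s
        }
    ' "$1"
}
```

An empty output from duplicate_sequences, run on each file in turn, together with a successful `background_is_large_enough my_sample.fa my_background.fa` (placeholder file names), indicates that the last two critical steps are satisfied.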

3| If Trawler_standalone is run using orthologs, also prepare a file containing the path to all the alignment files that need to be assessed. The alignment files may correspond to the sample sequences used by Trawler_standalone to find the overrepresented motifs, but this is not a requirement. The alignment file names (without the extension) correspond to the ID of the alignment in the Trawler_standalone output. Each file contains the aligned sequences in the FASTA format. The ID can be anything as long as it is unique within that file. The format used by Trawler_standalone supports the output from the maf2fasta program (from the multiz package [20]) used as a standard format for UCSC alignment files [21].
CRITICAL STEP Ensure that the names of the alignment files and the IDs do not contain spaces or special characters.
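The file of alignment paths described in Step 3 can be generated rather than typed by hand. The sketch below rests on several assumptions that should be checked against the README: that the list holds one path per line, that the alignment files end in .fa, and that "special characters" means anything other than letters, digits, dot, underscore and hyphen. The function name write_alignment_list is hypothetical.

```shell
# Hypothetical helper: write the paths of all *.fa files in a directory to a
# list file, one path per line (assumed format). Files whose names contain
# spaces or other unexpected characters are skipped and reported on stderr.
# Returns 1 if any file was skipped, 0 otherwise.
write_alignment_list() (
    align_dir=$1
    list_file=$2
    status=0
    : > "$list_file"                      # start from an empty list
    for path in "$align_dir"/*.fa; do
        [ -e "$path" ] || continue        # no match: the glob stays literal
        name=${path##*/}                  # file name without the directory
        case $name in
            *[!A-Za-z0-9._-]*)
                echo "skipped (space or special character in name): $path" >&2
                status=1 ;;
            *)
                printf '%s\n' "$path" >> "$list_file" ;;
        esac
    done
    return "$status"
)
```

For example, `write_alignment_list my_alignments my_alignment_list.txt` (both names are placeholders) collects the usable files and signals through its exit status whether any file needs renaming.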

Figure 3 | Trawler_standalone output summary. Snapshots of Trawler_standalone’s result page. (a) The list of overrepresented motifs is displayed. The conservation column only appears if a multiple alignment of orthologous sequences has been submitted. (b) List of similar motifs identified in transcription factor-binding site (TFBS) databases. (c) Distribution plot of the motifs’ positions.


4| Specify the parameters. It is recommended that the Trawler_standalone parameters be adjusted to the characteristics of the data set, using the options described in Box 4.

Running Trawler_standalone ● TIMING 2–20 min
5| In a terminal window, go to the directory where Trawler_standalone is installed (/directory_path/trawler-1.0/). Run Trawler_standalone without orthologs using option A, or with orthologs using the alignment file by following option B.
(A) Running Trawler_standalone without orthologs ● TIMING 2–5 min
(i) The values of the default parameters are specified in the README file and have been optimized for standard ChIP data sets. In this case, Trawler_standalone takes only two arguments: first, the file containing the sample data set (-sample) and second, a file containing the background sequences (-background), as shown in the following example (the rest of the parameters are optional). Substitute the example files with one’s data set by replacing the values of the arguments -sample and -background with the appropriate files (described in Steps 1 and 2):
perl bin/trawler.pl -sample examples/REB1_YPD.fsa -background examples/background_yeast_Young_6k.fsa

Figure 4 | Occurrences of the motifs in the given sequences. Trawler_standalone also provides a visualization of the motifs found directly on the sequence. (a) List of sequences containing one instance of the motif. (b) Visualization of one sequence using Jalview. The instance of the motif within this sequence is highlighted in pink. Panel labels: mouse over on highlighted motif; Jalview menu.

Figure 5 | Input and output information. Trawler_standalone provides, in addition to the result, a summary of both the input files and options submitted for this particular run and the downloadable output files generated by the algorithm. (a) Summary of the parameters used as input. (b) List of output files available for download.

protocol

NatUre protocols | VOL.5 NO.2 | 2010 | 329


This example should run without an error message, and all the output files should be created in the directory /directory_path/trawler_standalone-1.0/myResults/date_and_time. The name of the directory created includes the date and time of the run to prevent the overwriting of earlier experiments.

CRITICAL STEP Always provide the full path (relative or absolute) to the files that are passed as arguments.

? TROUBLESHOOTING

(ii) It is strongly recommended that parameters be adjusted to the characteristics of the data set. To do so, specify the parameter(s) to change directly on the command line. For example:

perl bin/trawler.pl -sample examples/REB1_YPD.fsa -background examples/background_yeast_Young_6k.fsa -occurrence 20 -nb_of_cluster 1

This example runs with all the default parameters, except that the minimum number of occurrences is set to 20 and the number of clusters to 1.

CRITICAL STEP Trawler_standalone can also search for microRNA binding sites; in this case, the sequence files correspond to gene transcripts or UTRs (each nucleotide U should be replaced by T), and the parameter -strand should be set to single. An example of such an analysis is available in the examples/miR106b/ folder.
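The U-to-T substitution required for transcript or UTR input can be done with any text tool. As an illustration only, the following Python sketch (the function name rna_to_dna_fasta is hypothetical and not part of Trawler_standalone) rewrites the sequence lines of a FASTA file while leaving the header lines untouched, so that a ‘U’ in a sequence name is not altered.

```python
def rna_to_dna_fasta(lines):
    """Yield FASTA lines with U/u replaced by T/t in sequence lines only."""
    table = str.maketrans("Uu", "Tt")
    for line in lines:
        if line.startswith(">"):
            # Header line: keep the sequence name exactly as given.
            yield line
        else:
            yield line.translate(table)


if __name__ == "__main__":
    # Hypothetical two-line record used purely as a demonstration.
    record = [">UTR_example_1", "AUGGCUUAGCu"]
    for converted in rna_to_dna_fasta(record):
        print(converted)
```

For a real file, the same generator can be fed the lines of the input file and its output written to a new file, which is then passed to -sample or -background.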

(B) Running Trawler_standalone with orthologs using the alignment file ● TIMING 10–20 min

(i) Provide Trawler_standalone with a minimum of four arguments, as shown in the following example:

perl bin/trawler.pl -sample examples/creb/creb.fa -background examples/creb/creb_background.fa -alignments examples/creb/aligns_creb.fa -ref_species Homo_sapiens

CRITICAL STEP Trawler_standalone contains an example of an orthologs file (aligns_creb.fa), located in the examples/creb folder.

? TROUBLESHOOTING
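The command lines of Step 5 can also be assembled programmatically, which makes it easy to confirm that the input paths exist before a run is launched. The following is a minimal Python sketch and not part of the Trawler_standalone distribution: the helper names build_trawler_command and missing_inputs are hypothetical, and the only options assumed are those shown in this protocol.

```python
from pathlib import Path


def build_trawler_command(sample, background, script="bin/trawler.pl", **options):
    """Return the argument list for one Trawler_standalone run.

    Extra keyword arguments become command-line options, e.g.
    occurrence=20 is rendered as "-occurrence 20".
    """
    cmd = ["perl", script, "-sample", str(sample), "-background", str(background)]
    for name, value in options.items():
        cmd += [f"-{name}", str(value)]
    return cmd


def missing_inputs(*paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


if __name__ == "__main__":
    # Example files named in option B of Step 5.
    sample = "examples/creb/creb.fa"
    background = "examples/creb/creb_background.fa"
    alignments = "examples/creb/aligns_creb.fa"
    cmd = build_trawler_command(
        sample, background, alignments=alignments, ref_species="Homo_sapiens"
    )
    absent = missing_inputs(sample, background, alignments)
    if absent:
        print("Missing input files:", ", ".join(absent))
    else:
        print("Ready to run:", " ".join(cmd))
```

When executed from the installation directory, the returned list can be handed to subprocess.run(cmd, check=True) to start the analysis.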

Box 1 | TRAWLER_STANDALONE INSTALLATION FOR LINUX AND MAC OS X

Trawler_standalone installation [Linux, Mac OS X]

Installation procedure for a machine running a Unix-based operating system (Linux, Mac OS X).

1. In the web browser, open a connection to http://ani.embl.de/laurence/blog/ and click on the TRAWLER-DOWNLOAD title in the main menu. Download and save the archived package trawler_standalone-1.1-linux.tar.gz or trawler_standalone-1.1-mac.tar.gz, depending on the operating system, to a directory of one’s choice on the local computer.

2. Unpack the archive:

> tar zxvf trawler_standalone-1.1-linux.tar.gz

or

> tar zxvf trawler_standalone-1.1-mac.tar.gz

The command creates a directory called ‘trawler_standalone-1.1,’ which contains a complete Trawler_standalone distribution. This directory has several subdirectories, including bin/ with Perl scripts and conf/ with the Trawler_standalone configuration file.

Install Trawler_standalone dependencies

3. First, check whether all the required dependencies are installed by running the pre-install_trawler.sh script as follows:

> ./pre-install_trawler.sh

If the required dependencies are already installed, go to step 5. If some dependencies are missing, go to step 4.

4. Install the required Algorithm-Cluster (http://search.cpan.org/~mdehoon/) Perl module. To check whether Perl is installed, open a terminal and run ‘perl -v’. If Perl is installed, this command will return its version. Note that these operations require administrative rights on the machine (if root privileges are not available, more information is provided in the INSTALL.txt file). To install the module with CPAN, open a terminal and enter these commands:

> sudo perl -MCPAN -e shell

cpan> install Algorithm::Cluster

CRITICAL STEP Ensure Ghostscript is installed on the local computer. To check whether Ghostscript is installed, open a terminal and type gs -version. If Ghostscript is installed, this command will return its version. If Ghostscript is not installed, install it using the package manager.

Running Trawler_standalone

5. To verify the installation, run the examples provided by executing

> cd trawler_standalone-1.1/

This command changes the current working directory to the Trawler directory.

> ./examples/run_trawler.sh

This command executes Trawler_standalone and will produce output messages on the console. When the job is complete, one can browse the results, stored by default in trawler_standalone-1.1/myResults/, using the web browser and the index.html file.
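The manual checks in steps 3 and 4 of this box (Perl, the Algorithm::Cluster module and Ghostscript) can be approximated by a short script. The Python sketch below is an illustration only and does not replace pre-install_trawler.sh; the function names are hypothetical, and the assumption that ‘perl’ and ‘gs’ are the executable names on the PATH follows the commands quoted in this box.

```python
import shutil
import subprocess

# Executable names as used in this box ("perl -v", "gs -version").
REQUIRED_TOOLS = ("perl", "gs")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


def perl_module_available(module):
    """Return True if `perl -M<module> -e 1` exits successfully."""
    if shutil.which("perl") is None:
        return False
    result = subprocess.run(
        ["perl", f"-M{module}", "-e", "1"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
    status = "installed" if perl_module_available("Algorithm::Cluster") else "missing"
    print("Algorithm::Cluster:", status)
```

Loading the module with -M and running an empty program is a common way to test whether a Perl module can be found; a non-zero exit status indicates that it cannot be loaded.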


Web viewing

6| Trawler_standalone results are displayed through a user-friendly web interface; view the results using the display features described in Box 5. Please refer to the TROUBLESHOOTING section if the appropriate output files are not obtained. Overrepresented motifs are displayed as families that share a common consensus sequence (Box 6).

? TROUBLESHOOTING

● TIMING

Trawler_standalone runs reasonably fast on any platform. A typical run with 5 Mb of sequences should take around 2–5 min. If an ortholog alignment file is provided, the complete analysis can take longer (10–20 min).

? TROUBLESHOOTING

Troubleshooting advice can be found in Table 3.

Box 2 | TRAWLER_STANDALONE INSTALLATION FOR WINDOWS

Trawler_standalone installation [Windows]

Installation procedure on a machine running Windows.

1. In the web browser, open a connection to http://ani.embl.de/laurence/blog/ and click on the TRAWLER-DOWNLOAD title in the main menu. Download and save the archived package trawler_standalone-1.1-win.zip to a directory of one’s choice on the local computer.

2. Uncompress the archive using a zip utility for Windows. This creates a directory called ‘trawler_standalone-1.1,’ which contains a complete Trawler_standalone distribution. This directory has several subdirectories, including bin\ with Perl scripts and conf\ with the Trawler_standalone configuration file.

Install Trawler_standalone dependencies

3. Install the required Algorithm-Cluster (http://search.cpan.org/~mdehoon/) Perl module. First install ActivePerl (http://www.activestate.com/Products/activeperl/index.html); Algorithm::Cluster can then be installed from the precompiled package by executing this command in a shell window:

> ppm install http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/Algorithm-Cluster.ppd

Note: To open a shell window on Windows, use Start -> Run -> cmd

CRITICAL STEP Ensure that Ghostscript is installed on the local computer (http://pages.cs.wisc.edu/~ghost/). WebLogo, which is included in the Trawler distribution, must then be configured. Go to trawler_standalone-1.1\weblogo and rename the file logo.conf.init to logo.conf. Edit the logo.conf file and uncomment the line gs = C:\Program Files\gs\gs8.63\bin\gswin32c.exe by removing the ‘#’ symbol at the beginning of the line. Note that the gs variable should point to the Ghostscript executable.

Running Trawler_standalone

4. To verify the installation, run the examples provided by executing

> cd trawler_standalone-1.1\

This command changes the current working directory to the Trawler_standalone directory.

> examples\run_trawler.bat

This command executes Trawler_standalone and will produce output messages on the console. When the job is complete, browse the results, stored by default in trawler_standalone-1.1\myResults\, using the web browser and the index.html file.

Table 3 | Troubleshooting table.

Step | Problem | Possible reason | Solution

5 | Java memory error | Trawler_standalone’s default value for the Java memory allocation is -Xmx256m. However, this setting may not be sufficient depending on the size of the data set, in which case the following error message will be displayed: ‘Java OutOfMemoryError’ | Close other applications that are running on the same computer or increase the Java memory allocation. For the latter, modify the trawler.cfg file in the ‘conf’ folder by replacing JAVA_OPTS = -Xmx256m with a higher value, e.g., JAVA_OPTS = -Xmx500m

5 | No motifs found | If Trawler_standalone cannot detect any overrepresented motifs, the following error will appear in the shell: ‘No motif satisfying your criteria has been found’ | Re-run Trawler_standalone using different parameters; for instance, increase or decrease the minimum number of motif occurrences (-occurrence)

(continued)


Box 3 | TRAWLER_STANDALONE PRE-COMPILED BACKGROUNDS

Pre-computed backgrounds available

Species used are Homo sapiens (human), Mus musculus (mouse), Danio rerio (zebrafish) and Drosophila melanogaster (Drosophila).

[1] Promoter regions: 500, 1,000 and 5,000 bp upstream of the annotated start site of 1,000 random genes were retrieved from Ensembl (v.53) and repeat-masked. These files correspond to

human_promoter_500bp.fa.masked

human_promoter_1000bp.fa.masked

human_promoter_5000bp.fa.masked

mouse_promoter_500bp.fa.masked

mouse_promoter_1000bp.fa.masked

mouse_promoter_5000bp.fa.masked

zebrafish_promoter_500bp.fa.masked

zebrafish_promoter_1000bp.fa.masked

zebrafish_promoter_5000bp.fa.masked

drosophila_promoter_500bp.fa.masked

drosophila_promoter_1000bp.fa.masked

drosophila_promoter_5000bp.fa.masked

[2] Random regions: These files contain 1,000 loci (each 1,000 bp long). The sequences have been repeat-masked. These files correspond to

human_random_1000bp.fa.masked

mouse_random_1000bp.fa.masked

zebrafish_random_1000bp.fa.masked

drosophila_random_1000bp.fa.masked

[3] 3’-UTR of random genes: These files correspond to

human_3utr.fa.masked

mouse_3utr.fa.masked

zebrafish_3utr.fa.masked

drosophila_3utr.fa.masked

Box 4 | TRAWLER_STANDALONE PARAMETERS

A. Motif discovery parameters

The minimum motif length (-mlength): Trawler_standalone assesses motifs of variable length. The maximum length is fixed at 20 nucleotides, but the minimum length is a user-definable parameter. A good minimum length is around 4–6 bp if one is searching for overrepresented transcription factor or microRNA binding sites. Setting this parameter too high is not recommended, as the algorithm will restrict the search space to only large motifs and consequently either miss the signal or find motifs with artificially few instances (Fig. 2).

(continued)

Table 3 | (Continued)

Step | Problem | Possible reason | Solution

6 | HTML output: no images | The images of the motif’s PWM cannot be displayed (especially for Windows users) | Ensure that the Ghostscript dependency has been installed (please refer to the INSTALL file for installation details)

6 | HTML output: no tables | The web browser does not have JavaScript enabled | Modify the browser preferences to enable JavaScript (e.g., in Safari, go to Preferences/Security and check ‘Enable JavaScript’)

6 | Low information content motifs | Repeats | Re-run Trawler_standalone with repeat-masked sequences as input


Box 5 | WEB-VIEWING DISPLAY FEATURES

1. Open the ‘index.html’ file, which should have been created in the ‘myResults/tmp_date’ directory. Any of the main browsers can be used, such as Firefox, Safari, Opera or Internet Explorer. A web page containing multiple tabs will appear; the contents of the ‘Results’ tab will be displayed by default (Fig. 3a). This page contains a table summarizing the motifs identified by Trawler_standalone with the following information: the name of the motif; a graphical representation of the motif’s PWM (generated with WebLogo, ref. 14); the names of known TFBSs similar to the motif; the overrepresentation score of the motif (Z-score); and, if sequence alignments of the orthologous sequences have been provided, the conservation of the motif calculated from the alignments.

Box 4 | (Continued)

Maximum number of mismatches (-wildcard): The motif definition in Trawler_standalone allows for degeneracy. Using IUPAC codes, the following degenerate nucleotides are recognized: M, R, Y, W, S and K. N can also be used but is counted as two mismatches within the motif. For example, the motif ‘ATGYGNT’ has three mismatches (one for Y and two for N).

CRITICAL STEP Setting the maximum number of mismatches too high will enlarge the search space to values that may exceed the computer’s capacity (Fig. 2).

Minimum number of motif occurrences (-occurrence): The minimum number of positions at which the motif must occur in the sample sequence set, regardless of the number of sequences. -occurrence 2 could therefore mean that the motif occurs either twice in one sequence or once in each of two sequences. The lower limit defined by the algorithm is two occurrences. Nevertheless, if the ChIP data set contains more than just a few sequences, one can expect the motif to occur quite often. For typical ChIP data, a good minimum number of motif occurrences is 10–20. If nothing interesting is found, try altering this parameter in combination with the motif length and degeneracy parameters (-mlength and -wildcard). In general, this parameter is very important.

Strand (-strand): The strand can be either single or double (default double), to accommodate DNA and RNA sequences. For most ChIP experiments, -strand should be set to double; when working with RNA (RNA, microRNA, UTRs), it should be set to single.

B. Motif clustering parameters

Clustering procedure (-clustering): By default, the entire Trawler_standalone pipeline is run. If the clustering parameter is set to 0, the pipeline only searches for overrepresented motifs and does not cluster the discrete motif instances to generate a position weight matrix (PWM). No web interface is provided either. Setting this option to 0 is useful for analyses in which only the single instances are needed (e.g., further bioinformatics analysis), as the procedure is greatly sped up.

Number of motifs to be clustered (-motif_number): The maximum number of motif instances to be considered for clustering (if the -clustering option has been set to 1). If the option is set to 200, the best 200 motifs returned by the discovery step will be used for the clustering step and for the building of PWMs. If the motif discovery step has found fewer than 200 significant motifs, the clustering step will cluster only the significant motifs.

Fixed number of clusters (-nb_of_cluster): By default, the number of clusters is defined by the number of strongly connected components (SCCs). By setting this option to a positive integer, the user sets the number of expected clusters. In this case, the SCC approach is replaced by the k-means clustering algorithm. The user can also set this option to ‘som’, in which case the SOM algorithm is used instead.

CRITICAL STEP It is better to run Trawler_standalone with no fixed number of clusters (default parameter) to first estimate the number of different motif types.
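The mismatch counting rule given for -wildcard can be made concrete with a few lines of code. The Python sketch below simply restates the rule from this box (one mismatch for each of M, R, Y, W, S and K; two for N); it is not taken from the Trawler_standalone source, and the function name count_mismatches is hypothetical. Other IUPAC codes are rejected because the box does not list them as recognized.

```python
# Cost of each degenerate IUPAC code, as stated in Box 4.
DEGENERATE_COST = {code: 1 for code in "MRYWSK"}
DEGENERATE_COST["N"] = 2


def count_mismatches(motif):
    """Return the number of mismatches carried by an IUPAC motif string."""
    total = 0
    for base in motif.upper():
        if base in "ACGT":
            continue  # plain nucleotides carry no mismatch
        if base not in DEGENERATE_COST:
            raise ValueError(f"unsupported symbol in motif: {base!r}")
        total += DEGENERATE_COST[base]
    return total


if __name__ == "__main__":
    print(count_mismatches("ATGYGNT"))  # Y counts 1 and N counts 2
```

A motif would fall within a given -wildcard setting only if this count does not exceed the chosen maximum.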

C. Output parameters

Extra length flanking the PWM (-xtralen, default: 0): Adds an extra number of nucleotides on each side of the PWM description. -xtralen modifies neither the motif discovery, nor the clustering procedure, nor the mapping to the sequence(s).

Directory path (-directory, default: /trawler_path/myResults/): This parameter specifies the location in which the Trawler_standalone results directory will be created. By default, all the Trawler_standalone results are stored in myResults, but a new path can be specified using this parameter.

Directory name (-dir_id, default: NULL): Changing this parameter modifies the name of the newly created Trawler_standalone results directory.

D. Additional parameters for including the alignment when running Trawler_standalone using orthologs

When running Trawler_standalone using orthologs, set the motif discovery, clustering and output parameters as above. Two further parameters, -alignments and -ref_species, are specific to this mode:

The -alignments parameter takes the alignment file described in Step 3.

The -ref_species parameter corresponds to the ID of the reference sequence in the alignment files. These sequences will be used by Trawler_standalone to map the coordinates of the overrepresented motif. In the example provided with the standalone version, the reference sequence is homo_sapiens.

CRITICAL STEP Ensure that the ID of the reference sequence is identical in all the alignment files.
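The requirement that the reference ID be identical in every alignment can be checked before a run. The Python sketch below is a rough illustration under a stated assumption: it treats each alignment as FASTA-like text in which the sequence ID is the first word after ‘>’ on a header line. The alignment format itself is defined in Step 3, so this assumption should be verified against one’s own files; the function names and the example data are hypothetical.

```python
def fasta_ids(lines):
    """Return the ID (first word after '>') of every header line."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def files_missing_reference(alignments, ref_species):
    """Return the names of alignments that lack an ID equal to ref_species.

    `alignments` maps a name to an iterable of text lines. The comparison
    is exact and case-sensitive, so IDs differing only in letter case are
    reported as missing.
    """
    return [
        name
        for name, lines in alignments.items()
        if ref_species not in fasta_ids(lines)
    ]


if __name__ == "__main__":
    # Hypothetical alignments; the second uses a different letter case.
    example = {
        "region_1": [">Homo_sapiens", "ACGTAC", ">Mus_musculus", "ACGTAA"],
        "region_2": [">homo_sapiens", "ACGTAC", ">Mus_musculus", "ACGTAA"],
    }
    print(files_missing_reference(example, "Homo_sapiens"))
```

Any name reported by the check points to an alignment whose reference ID should be corrected, or to a -ref_species value that does not match the files.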


Box 5 | (Continued)

Note: All the tables described below are sortable, i.e., one can click on the header of a table to sort it according to the chosen column. Multiple columns can be selected for the sorting. Furthermore, some columns contain a further description, which is made visible by positioning the mouse over the column’s header.

2. Click on the motif’s family name or on its graphical representation to display further information about this motif (Figs. 3b and 4). This can also be achieved by clicking on the tab of the corresponding motif name (family_#).

The first table displayed (Fig. 3b) contains detailed information about the hits of the motif when compared against known TFBSs in the following databases: JASPAR (ref. 2) and the Homeobox data set (ref. 3). In brief, the first two columns, ‘Query_ID’ and ‘Query_Consensus,’ display the name and the consensus of the motif investigated. The remaining columns contain information about known TFBSs similar to this motif. The name, the database source, the ID in this database, the length and the consensus sequence of the known TFBS are displayed in the columns ‘Subject_Name,’ ‘Source_DB,’ ‘Subject_ID,’ ‘Length’ and ‘Subject_Consensus,’ respectively. Finally, information related to the comparison between the query motif and the subject TFBS is available: the orientation of the two corresponding consensus sequences, the shift, the dissimilarity score and the overlap between the two corresponding matrices are displayed in the columns ‘Orientation,’ ‘Offset,’ ‘Divergence’ and ‘Overlap,’ respectively.

Note: If custom TFBSs are to be included in the comparison, copy and paste the corresponding matrices into the ‘matrix_set_all’ file (using a text editor). The file is located in the treg_comparator folder, and a description of the format is available in the README file.

The second table displayed (Fig. 4a) contains the list of sequences in which the motif is present. As the motif is composed of several instances, nonoverlapping positions of the motif’s instances within the sequence are displayed. The starting positions of the motifs are indicated with respect to either the start of the sequence (column ‘Start_position (from start)’) or the end of the sequence (column ‘Start_position (from end)’). The best Z-score of the instance at this position is displayed. The column ‘Strand’ indicates the direction of the motif with respect to the submitted sequences (useful when the option -strand has been set to double). A text query field is available for this table, which allows rows to be filtered for entries containing, e.g., a given sequence name, instance or part of an instance.

3. Click on the sequence name to view this sequence in Jalview (ref. 18; Fig. 4b). Jalview will appear in a pop-up window; all the features provided by the Jalview-specific menu are functional. The instance of the motif will be highlighted in pink within the sequence. Placing the mouse over this motif will display the consensus sequences of the instances that correspond to the motif.

CRITICAL STEP It could take up to 1 min before the alignment is displayed if Jalview is being loaded for the first time.

4. Click on the ‘Download PWM’ link to download the text version of the PWM corresponding to the motif (Fig. 3b).

5. Click on the ‘Download instances (motif)’ link to display the single instances that have been used to generate the PWM (Fig. 3b).

6. Click on the ‘Motif distribution’ link to visualize the distribution of the positions of the instances, expressed as a percentage of the length of the submitted sample sequences (Fig. 3c).

7. Click on the ‘Input’ tab (Fig. 5a) to display the command line executed to perform the run and a synopsis of the corresponding parameters.

8. Click on the ‘Download’ tab (Fig. 5b) to access the raw output files from Trawler_standalone, which are also available in the ‘myResults’ directory. Placing the mouse over a link will display the content of the file on the same page; clicking on the link will open it in a separate window to allow the download:

Trawler_standalone raw data: This file lists all the instances found by Trawler_standalone. They are described by their occurrences in the sample and background, and the Z-score.

Trawler_standalone sorted data: This file contains only the instances that are overrepresented in the data set (Z-score above the threshold).

Clustered motifs: This file lists all instances belonging to one motif (or family, see Box 6).

PWMs: A PWM representing the motif, which is also available from the motif results page.

Additional display features specific to running Trawler_standalone with orthologs using the alignment file

9. Open the ‘index.html’ file and click on one of the family names. The second table also contains the best conservation score at this position. The conservation score represents the number of species in which the motif is conserved.

10. Click on the sequence name to view the multiple alignment in Jalview (ref. 18; Fig. 4a).


Box 6 | DEFINITIONS

Motif: A sequence of between -mlength and a maximum of 20 nucleotides, written in IUPAC code, used by Trawler_standalone to calculate overrepresentation and conservation scores.

Family: Group of similar instances that together represent a putative TFBS; these are compiled into a position weight matrix.

Consensus sequence: Consensus sequence of the motif built from the compilation of its instances.


ANTICIPATED RESULTS

The user should expect to obtain a limited number of families of motifs. Typically, our large-scale analysis of a compendium of ChIP data sets reveals an average of two to three families per data set. If the number of families is too high (more than ten) and the families are very different from one another, run Trawler_standalone with different parameters. In any case, because Trawler_standalone estimates the significance of overrepresentation using a stochastic process, the results of each run can vary slightly.

In general, a real signal should be robust to variable parameter settings and repeated runs of Trawler_standalone. In any case, the user is advised to repeatedly run Trawler_standalone with different parameter settings (within a certain range, see the PROCEDURE section) and analyze the output. Similar family(ies) recurrently found are generally a good indicator of a strong overrepresentation signal.
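Repeated runs over a range of settings are easier to manage when the command lines are generated systematically. The Python sketch below is one possible way to enumerate them and is not part of Trawler_standalone: the function name parameter_grid is hypothetical, and the values shown (-mlength 4 and 6, -occurrence 10, 15 and 20) are merely examples drawn from the ranges suggested in Box 4.

```python
from itertools import product


def parameter_grid(sample, background, grid, script="bin/trawler.pl"):
    """Yield one Trawler_standalone argument list per parameter combination.

    `grid` maps an option name (without the leading dash) to the list of
    values to try; option names are processed in alphabetical order.
    """
    names = sorted(grid)
    for values in product(*(grid[name] for name in names)):
        cmd = ["perl", script, "-sample", sample, "-background", background]
        for name, value in zip(names, values):
            cmd += [f"-{name}", str(value)]
        yield cmd


if __name__ == "__main__":
    grid = {"mlength": [4, 6], "occurrence": [10, 15, 20]}
    for cmd in parameter_grid(
        "examples/REB1_YPD.fsa",
        "examples/background_yeast_Young_6k.fsa",
        grid,
    ):
        print(" ".join(cmd))
```

Each printed line is a complete command that can be run from the installation directory; families that recur across the resulting runs are the ones to prioritize.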

ACKNOWLEDGMENTS We thank the Wittbrodt lab for fruitful discussions and Florence Besse, Dirk-Dominik Dolle and Mythily Ganapathi for testing Trawler_standalone. This work was supported by FP7-CISSTEM.

AUTHOR CONTRIBUTIONS Y.H., with the help of M.R., built the standalone distribution; Y.H., M.R., B.P. and L.E. contributed to improving the algorithm; and Y.H., M.R., B.P. and L.E. wrote the paper.

Published online at http://www.natureprotocols.com/. Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Ettwiller, L., Paten, B., Ramialison, M., Birney, E. & Wittbrodt, J. Trawler: de novo regulatory motif discovery pipeline for chromatin immunoprecipitation. Nat. Methods 4, 563–565 (2007).

2. Bryne, J.C. et al. JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res. 36, D102–D106 (2008).

3. Berger, M.F. et al. Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell 133, 1266–1276 (2008).

4. Linhart, C., Halperin, Y. & Shamir, R. Transcription factor and microRNA motif discovery: the Amadeus platform and a compendium of metazoan target sets. Genome Res. 18, 1180–1189 (2008).

5. Hughes, J.D., Estep, P.W., Tavazoie, S. & Church, G.M. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296, 1205–1214 (2000).

6. Liu, X.S., Brutlag, D.L. & Liu, J.S. An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol. 20, 835–839 (2002).

7. Bailey, T.L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc. Int. Conf. Intell. Syst. Mol. Biol. 2, 28–36 (1994).

8. Pavesi, G., Mereghetti, P., Mauri, G. & Pesole, G. Weeder Web: discovery of transcription factor binding sites in a set of sequences from co-regulated genes. Nucleic Acids Res. 32, W199–W203 (2004).

9. Zhang, X. et al. Genome-wide analysis of cAMP-response element binding protein occupancy, phosphorylation, and target gene activation in human tissues. Proc. Natl. Acad. Sci. USA 102, 4459–4464 (2005).

10. Linsley, P.S. et al. Transcripts targeted by the microRNA-16 family cooperatively regulate cell cycle progression. Mol. Cell Biol. 27, 2240–2252 (2007).

11. Ren, B. et al. E2F integrates cell cycle progression with DNA repair, replication, and G(2)/M checkpoints. Genes Dev. 16, 245–256 (2002).

12. Cao, Y. et al. Global and gene-specific analyses show distinct roles for Myod and Myog at a common set of promoters. EMBO J. 25, 502–511 (2006).

13. Rodriguez, A. et al. Requirement of bic/microRNA-155 for normal immune function. Science 316, 608–611 (2007).

14. Crooks, G.E., Hon, G., Chandonia, J. & Brenner, S.E. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190 (2004).

15. Matys, V. et al. TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 34, D108–D110 (2006).

16. Griffiths-Jones, S., Saini, H.K., van Dongen, S. & Enright, A.J. miRBase: tools for microRNA genomics. Nucleic Acids Res. 36, D154–D158 (2008).

17. Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95, 14863–14868 (1998).

18. Clamp, M., Cuff, J., Searle, S.M. & Barton, G.J. The Jalview Java alignment editor. Bioinformatics 20, 426–427 (2004).

19. Roepcke, S., Grossmann, S., Rahmann, S. & Vingron, M. T-Reg Comparator: an analysis tool for the comparison of position weight matrices. Nucleic Acids Res. 33, W438–W441 (2005).

20. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).

21. Karolchik, D. et al. The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res. 36, D773–D779 (2008).