Microarray Analysis through Transcription Kinetic Modeling and Information Theory

Microarray Analysis through Transcription Kinetic

Modeling and Information Theory

Abdallah Sayyed-Ahmad, Kagan Tuncay and Peter Ortoleva*

Center for Cell and Virus Theory, Department of Chemistry, Indiana University, Bloomington IN 47405,

USA

ABSTRACT

cDNA microarray and other multiplex data hold promise for addressing the challenges of cellular complexity,

disease progression and drug discovery. We believe that combining transcription kinetic modeling with

microarray time series data through information theory will yield more information about the gene regulatory

networks than obtained previously. A novel analysis of gene regulatory networks is presented based on the

integration of microarray data and cell modeling through information theory. Given a partial network and time

series data, a probability density is constructed that is a functional of the time course of intra-nuclear

transcription factor (TF) thermodynamic activities, and is a function of RNA degradation and transcription

rate and equilibrium constants for TF/gene binding. The most probable TF time courses and the values of

aforementioned parameters are computed. Accuracy and robustness of the method are evaluated and an

application to Escherichia Coli is demonstrated. A kinetic (and not a steady state) formulation allows us to

analyze phenomena with a strongly dynamical character (e.g. the cell cycle, metabolic oscillations, viral

infection or response to changes in the extra-cellular medium).

1 INTRODUCTION

cDNA (Schena et al., 1995; DeRisi et al., 1997; Sauter et al., 2003) and other multiplex data acquisition

techniques yield a great volume of information and thereby hold promise for addressing the challenges of

cellular complexity, disease progression and drug discovery (Brown and Botstein, 1999; Debouck and

Goodfellow, 1999; Gerhold et al., 1999; Chitler, 2004). Recently we proposed an approach to the analysis

of such data based on its integration with cell modeling through information theory (Sayyed-Ahmad et al.,

2003). Here we show how this approach can be extended to analyze microarray time series data. The

approach has the potential to yield more information on the network of cellular processes from this data

than existing methods. Cell models hold great promise for understanding the complex networks of

processes underlying cell behavior (Slepchenko et al., 2003; Weitzke and Ortoleva, 2003; Navid and

Ortoleva, 2004). Unfortunately they suffer from a lack of information about many of the rate and

equilibrium constants for reaction and transport processes (Mendes and Kell, 1998; Sayyed-Ahmad et al.,

2003). Furthermore, key aspects of the biochemical network have yet to be resolved i.e. we are presented

with the challenge of calibrating and using an incomplete model. In contrast, microarray, protein mass

* To whom correspondence should be addressed. E-mail:[email protected], Telephone: 812-855-2717, Fax: 812-855-8300

https://www.researchgate.net/publication/224941082_Toward_Automated_Cell_Model_Development_through_Information_Theory?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==




https://www.researchgate.net/publication/8529060_Simulated_Complex_Dynamics_of_Glycolysis_in_the_Protozoan_Parasite_Trypanosoma_brucei?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==


https://www.researchgate.net/publication/15629117_Quantitative_Monitoring_of_Gene_Expression_Patterns_With_a_Complementary_DNA_Microarray?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/13372648_Microarrays_in_Drug_Discovery_and_Development?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==


https://www.researchgate.net/publication/9043044_Quantitative_cell_biology_with_the_Virtual_Cell?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/8987069_Weitzke_E_L_Ortoleva_P_J_Simulating_cellular_dynamics_through_a_coupled_transcription_translation_metabolic_model_Comput_Biol_Chem_27_469-480?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/12977099_DNA_chips_Promising_toys_have_become_powerful_tools?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/220263719_Non-linear_optimization_of_biochemical_pathways_Applications_to_metabolic_engineering_and_parameter_estimation?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/13372645_Exploring_the_New_World_of_the_Genome_With_DNA_Microarrays?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/289277208_Exploring_the_metabolic_and_genetic_control_of_gene_expression_on_a_genomic_scale?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

spectroscopy, NMR and other multiplex data acquisition techniques yield many simultaneous

measurements but they are often only indirectly related to the quantities we seek (e.g. protein and mRNA

production and degradation rates and binding constants or the stoichiometry or the biochemical network).

Our MACI (Model-Automated Cell Informatics) approach integrates cell modeling and multiplex data

analysis into one seamless endeavor, using the strengths of one to overcome weaknesses in the other.

The state of a cell is specified by a set of variables Ψ for which we know the governing equations

and a set ( )T t which are at the frontier of our understanding for which we do not know the governing

equations. The challenge is that the dynamics of Ψ is given by a cell model,

( ), ,d

G Tdt

Ψ = Ψ Λ . (1)

in which the rate G depends not only on many rate and equilibrium constants Λ , but also on the frontier

variables ( )T t at each time. Thus the model cannot be solved completely (i.e. Ψ can only be determined

as a functional of the unknown time course ( )T t ).

A microarray time series data M reflects the evolving state of a cell and hence one should in

principle be able to construct synthetic time series synM from the predicted RNA time course (or more

generally Ψ and ( )T t ). By solving (1) one can determine Ψ as a function of Λ and initial

state ( )0tΨ = and as a functional of ( )T t over the time of the experiment (i.e. [ )0, ,Ψ = Ψ Φ Λ Ψ ). In this

way the microarray data constructed, synM , is a function of Λ and ( )0Ψ and a functional of ( )T t . An

attractive aspect of the present approach is that the physics and chemistry built into a kinetic cell model

puts strong constraints on the relationship between Ψ and ( ) ,T t Λ and ( )0Ψ to facilitate the

determination of these variables from time series data. In a sense, physics and chemistry enhance the

solvability of the inverse problem – i.e. the determination of ( ) ,T t Λ and ( )0Ψ given the time series

microarray data.

In addressing cellular complexity, we must account for the omnipresent uncertainty in observed

data and the structure of a cell model. Thus a probabilistic framework is needed. We suggest that the

probability of interest (denoted ρ ) is a function of Λ and ( )0Ψ and a functional of the time course ( )T t

of the frontier variables. We use information theory to construct ρ , develop a model (e.g. realization of

(1)), and introduce time series microarray data as a way to constrain ρ (i.e. we know M so, in principle,

some ( ) 0, , and T tΛ Ψ are more likely to be consistent with it than others).

Time series experiments involve monitoring a synchronized culture of cells over their cycle or

observing changes due to time-varying conditions in the extra-cellular medium. Other dynamical

phenomena of interest involve behaviors in response to nuclear transplantation, fertilization or viral

infection, or the time course of normal development, gene insertion/deletion, radiation, heat shock and

transitions to abnormality or drug resistance. All these phenomena, and the analysis of time series data on

them, require kinetic approaches if the full dynamics of these systems is to be explored. For example,

steady state approaches can only yield ratios of rate coefficients and not all coefficients independently.

A number of methods have been proposed for extracting information and overcoming systematic

errors from microarray data after it has been preprocessed (e.g. through image analysis, statistical

approaches and channel normalization) (Wilson et al., 2002; Hyduke et al., 2003; Symth and Speed, 2003).

Among them are Boolean network models (Shmulevich et al., 2002; Shmulevich et al., 2003), Bayesian

network models (Pe'er et al., 2001), Bayesian statistics (Friedman et al., 2000; Li et al., 2002), cluster

analysis (Azuaje, 2002; Bolshakova and Azuaje, 2003), independent component analysis (ICA)

(Liebermeister, 2002), principal component analysis (PCA) (Holter et al., 2000; Holter et al., 2001) and

network component analysis (NCA) (Liao et al., 2003; Kao et al., 2004). These techniques assume that the

system is at steady state. In Boolean and Bayesian network models the goal is to infer gene regulatory

network structure. Cluster analysis, Bayesian statistics, ICA, and PCA classify genes into groups; genes that

have similar expression profiles are identified as functionally similar. NCA differs from other techniques in

that the structure of the transcription control network is assumed to be known. Furthermore, a number of

assumptions are made in NCA to arrive at the final model. The MACI algorithm presented here also

requires the structure of the transcription control network. However, we place no restrictions on the

structure of this network, use a kinetic model (1), construct synthetic time series microarray, apply a

physically motivated regularization constraint for the time-dependence of TF activities and then derive the

information of interest. We determine rate and equilibrium constants that appear in our chemical kinetic

formulation of RNA polymerase binding to the genes. Rate constants for transcription and RNA

degradation are also obtained. The use of a kinetic model also allows us to generalize our approach to

proteomic and metabolic data either by themselves or with cDNA microarray data. Our method is

significantly different from the approach proposed by (Gardner et al., 2003) whose objective is to construct

the gene control network using microarray data. However we demonstrate how our method provides a

method to fill in incomplete networks to identify inconsistencies of a proposed network with observed data.

https://www.researchgate.net/publication/10650947_New_normalization_methods_for_cDNA_microarray_data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/8943768_Transcriptome-based_determination_of_multiple_transcription_regulator_activities_in_Escherichia_coli_by_using_network_component_analysis?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/11514380_Probabilistic_Boolean_Networks_A_Rule-based_Uncertainty_Model_for_Gene_Regulatory_Networks?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/5654666_Liao_JC_et_al_Network_component_analysis_reconstruction_of_regulatory_signals_in_biological_systems_Proc_Natl_Acad_Sci_USA_100_15522-15527?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/11525079_Linear_modes_of_gene_expression_determined_by_independent_coponent_analysis?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/9036647_A_Software_Package_for_cDNA_Microarray_Data_Normalization_and_Assessing_Confidence_Intervals?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/10677039_Inferring_Genetic_Networks_and_Identifying_Compound_Mode_of_Action_Via_Expression_Profiling?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/12428090_Holter_N_S_et_al_Fundamental_patterns_underlying_gene_expression_profiles_simplicity_from_complexity_Proc_Natl_Acad_Sci_USA_97_8409-8414?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/220309515_Using_Bayesian_Networks_to_Analyze_Expression_Data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/11869653_Inferring_Subnetworks_from_Perturbed_Expression_Profiles?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/11084458_Bayesian_automatic_relevance_determination_algorithms_for_classifying_gene_expression_data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/9027353_Normalization_of_cDNA_Microarray_Data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/222299800_Cluster_validation_techniques_for_genome_expression_data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

https://www.researchgate.net/publication/10575827_The_role_of_certain_Post_classes_in_Boolean_network_models_of_genetic_networks?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

2 A TRANSCRIPTION MODEL

2.1 The Model

In the MACI approach, one no longer requires a comprehensive cell model to make quantitative predictions

even though processes within and among the genome, proteome and metabolome are all strongly coupled.

Rather our approach only requires an incomplete model wherein governing equations for the variables Ψ

of (1) are to be set forth and the model must contain those variables with which one may construct synthetic

data of the type available. Thus the following model is based on equations for RNA levels so that synthetic

microarray data can be constructed. But to do so one must know the time course of TFs as they up/down

regulate genes. Specifically, to implement our approach the TF intra-nuclear thermodynamic activities are

identified as the frontier variables ( )T t of (1). Closure is obtained by using time series data to generate

functional differential equations for the ( )T t . Thus one need not make oversimplified assumptions on

( )T t to compute Ψ i.e. one may calibrate and run an incomplete transcription model even though the

dynamical variables (e.g. RNA populations) are strongly coupled to ( )T t .

We first develop a forward model in which given a set of time courses of TF thermodynamic

activities, ( )T t , the time course of intra-cellular RNA populations ( )R t is predicted. Due to the dense

environment within the nucleus, TF thermodynamic activities are preferred over concentrations. With such

a model and time series microarray data, we show in the next section how transcription rate and other

central genome model parameters, the most probable ( )T t can be obtained, and the gene control network

can be quantitatively characterized. Finally, note that the following model is rather simple. While more

complete models could be used (Weitzke and Ortoleva, 2003), our purpose here is to demonstrate the

approach and not to address the supercomputer challenge of a very detailed approach. Given a transcription

control network with TFN TFs and gN genes, the i-th gene (gi), 1, 2, gi N= , is assumed to have isN TF

binding sites labeled 1, 2, isj N= . Assume the binding at any site is independent of the state of others and

that only one type of TF can bind to each site. While the latter two assumptions are not an inherent

restrictions of our approach, they are made here for simplicity. Let ijn label the TF that can bind to site j on

gene i. Let ijb be minus one or plus one indicate the nature of regulation (down or up) of TF type ijn on gi.

At TF/gene equilibrium, the probability iH that gi is available for RNA polymerase (RP) complexing is

taken to be given by an equilibrium Langmuir uncompetitive adsorption isotherm (Laidler and Meiser,

1995)

( )( ) ( )1

2

1

1

i bijs

ij ij

N

i ij n ij nj

H Q T Q T+

=

= +∏ , (2)


for binding constant ijQ (liter/mole) and intra-nuclear activity ijnT of the ijn -th TF. The rate of RNA

transcription initiation is written as

( )max , ,i i ik k H Q b T= , (3)

where maxik is a saturation rate coefficient that we suggest is diffusion-limited. After RP binds to a gene

RNA elongation commences. If nucleotide concentrations are roughly steady during transcription and the

RP advancement velocity iu (nucleotides/sec), then the transcription polymerization rate iA is taken to be

( )1 1

[ ]

inuc

i i i

N

A k RP u= + . (4)

A form which captures the rate limiting step of the two serial processes (RP binding and the transcription).

With this, the governing equation for the RNA populations is written

ii i i

dRA R

dtλ= − , (5)

where inucN is the total length of the gene gi, and iλ is the decay constant. If the rate limiting step for

transcription is RNA polymerase binding to the gene, then the second term in (4) may be dropped. Finally,

[ ]RP is the activity of free RNA polymerase and is assumed constant (and henceforth is subsumed in

maxik ), then the governing equation for RNA evolution becomes

( )max , ,ii i i i

dRk H Q b T R

dtλ= − . (6)

Finally, our methodology can easily be generalized by relaxing any of the above assumptions.

3 INFORMATION THEORY MODEL/DATA INTEGRATION

4 RESULTS

4.1 Synthetic Example

In order to test our implementation of the approach described above, and to find its practical limitations, we

used a model network that consists of 20 genes and 10 TFs. The gene control network is shown in Table 1

where 1± implies up/down regulation**. We took all binding constants and initial RNA concentrations to

be unity and 910 M− , respectively.

We generated TF time courses and created the synthetic time series microarray data using our

transcription model, selecting 10 “data points” that are 500 seconds apart. In the following, we demonstrate

the robustness of our approach in reconstructing ( )T t despite incorrect regulatory network and noise in

microarray data, conditions commonly encountered in practice.

4.2 Uncertain Regulatory Network Information

Sequence analysis can be used to determine the structure of the gene control network based on likely

binding sites. However, this approach is likely to suggest a large number of incorrect interactions in the

gene control network. It is of interest to test whether our approach can filter the redundant nonzero entries

in the control network. In our approach, if a TF is assumed to upregulate a gene, large binding constants,

i.e. 1QT , imply that this interaction is unlikely (redundant) as ( )1 1QT QT+ ≈ . A similar argument

hold for wrongly assumed down regulation as indicated by small binding constants ( 1QT , and therefore

( )1 1 1QT+ ≈ ). Therefore our methodology conveniently filters out incorrect interactions by assigning

large/small binding constants for up/down regulation. To check the vulnerability of our approach to such

redundant interactions, we added random nonzero factors in the control network, and obtained a

“conjectured” full control matrix as shown in Table 2. As this matrix is full, the NCA method fails. In our

approach, the match between the predicted and real TF time courses is remarkable even when the

“conjectured” full matrix was used. Fig. 2 illustrates the effect of the number of redundant interactions on

the mismatch between the predicted and actual TF time courses. The mismatch (relative to the one obtained

using the actual network) does not exceed 2 even when the full interaction matrix is used as a starting point.

Thus our approach effectively filters the gene control network of unnecessary interactions.

** Table 1 and Table 2 are provided in the supplementary materials

Fig.2 Mismatch between the predicted and actual TF time course relative to the one obtained using the actual gene

control network shown in Table 1 Although the mismatch increases as the number of interactions in the gene control

network increases, it stays within a limit of 2 in this particular example, showing the robustness of our approach to

discover the operating gene control network hidden in a larger one (see Table 1 and 2 provided in the supplementary

material).

4.3 Robustness to Microarray Error Levels

Despite advances in the technology, microarray data has considerable levels of error. In our

implementation, we input raw microarray channel data, perform standard channel normalization (based on

housekeeping genes determined using ranking of channel intensities and quality filtering for multiple spot

and slide data (Hyduke et al., 2003)). The resulting confidence intervals constitute prior information about

the level of noise in the microarray response. In this test, we investigate the vulnerability of our approach to

error in microarray data. We added random noise to the synthetic microarray data that was obtained using

the assumed TF time courses, and the gene control network (Table 1). Fig. 3 shows the mismatch (relative

to the TF time course obtained without added noise) as a function of added noise level with and without the

regularization constraint (see equation 10). Regularization constraint yields a better match at noise levels

higher than 30%, providing robustness to the calibration process for very noisy data. More generally,

Lagrange multiplier for the regularization constraint in (12) should decrease with the width of the

confidence intervals although we did not pursue this relationship here. This example also illustrates one of


the advantages of time series data over steady state data in that time series data is less vulnerable to noise

due to the use of our novel regularization constraint.

Fig.3 Mismatch between the predicted and actual TF time course (relative to that obtained using noise-free microarray

data) as a function of the amplitude of random noise added to the microarray data. The diamond and square markers

represent the mismatch with and without the regularization constraint. For large noise levels the regularization

constraint provides significant improvement in the calibration process.

4.4 Healing Mistakes in a Proposed Regulatory Network

Often we rely on TF-gene interactions obtained from questionable quality resources. Therefore, it is

important that our algorithm is robust to potential sign mistakes in the gene control network. In this case,

we run our code in a discovery mode that searches for network mistakes. We first rank genes based on the

mismatch between the predicted and observed microarray response, highest ranked having the greatest

mismatch. Then, we rank the TFs based on the rank and number of genes that they regulate. As calibration

progresses, we periodically check the genes whose mismatch is greater than averageE aσ+ where averageE

is the average mismatch and σ is the standard deviation of gene mismatch and a is an empirical

parameter. Once the genes satisfying this criterion are identified, we change the sign of the highest ranked

TF (up/down). We also consider additional input that the user provides in his/her confidence in each

element of the gene control matrix. At a given step in this process, we only change one sign per column of

the gene control network. After a few iterations, we monitor the mismatch behavior; if the sign change

failed to improve the mismatch, we change the sign back. To test this algorithm, we took the gene control

network of Table 1 and introduced four mistakes by changing the sign of the diagonal elements of genes

1-4. Our algorithm successfully corrected the network after only 10 iterations. Fig. 4 shows the predicted

and observed microarray data.

Fig. 4 Comparison of observed (solid line) and predicted microarray data for genes 1-4 (see Table 1). Triangles

indicate the best microarray fit when there are four mistakes in the gene control network (the first four entries in the

upper left diagonal for genes 1-4); squares indicate the best microarray fit when we allow our program to correct the

network. This shows that our algorithm is not only able to calculate the TF time courses, binding constants, etc.; it can

also be used as a tool to decide whether a TF is up or down regulating a gene.

The square markers represent the best fit to the microarray data when the control network with four

mistakes is assumed, whereas the triangle markers represent the best fit after our algorithm corrected the

mistakes. This demonstrates another aspect of our methodology, correcting a user-supplied gene control

network via microarray data. For large, real systems, we believe that many iterations will be necessary to

arrive at an accurate network. The methodology will recommend changes in the gene control network

based on the microarray data and a user-suggested network. The rankings supplied with improved network

can be used to guide literature searches or carry out sequence analysis that can be used to further refine the

network.

4.5 Application to E. coli

To test our methodology we used the E.coli microarray data obtained for carbon source transition from

glucose to acetate media. Details of the experimental conditions and the microarray procedure are provided

in Kao et al. (2004). The data included expression levels (relative to initial time) of 100 genes at 300, 900,

1800, 3600, 7200, 10800, 14400, 18000 and 21600 seconds. The gene control matrix used is based on

RegulonDB (Salgado et al., 2001) as modified by Kao et al. (2004). We made additional changes based on

EcoCyc (Karp et al., 1998). Fig. 5 shows the time courses of 16 (out of 38) representative TF activities.

Kao et al. (2004) applied NCA (Liao et al. 2003) to the same problem. Furthermore, the transcription

kinetics in our approach differs from the steady state assumption and binding formulation used in NCA.

Despite these differences, 15 out of 16 TF activity time courses (Kao et al. only presented the time courses

of 16) are in qualitative agreement. As shown in Fig. 5 PhoB increases as a result of response to acetate

enrichment of the medium in contrast to the decreasing activity predicted by NCA. Therefore one would

expect the phoB activity to increase as well. As another check on our results, consider the arcA gene which

makes ArcA TF that upregulates gene itself.

Fig. 5 TF activity time courses for 16 of 38 TFs. These results are in qualitative agreement with those obtained by Kao

et al. (2003) except for PhoB. pstC and pstS are upregulated by PhoB and their level of expression increases in time

(shown in Fig. 6), therefore one would expect the activity of PhoB to increase as well.



https://www.researchgate.net/publication/13440896_EcoCyc_Encyclopedia_of_Escherichia_coli_Genes_and_Metabolism?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==

One would expect a correlation between the expression of arcA and TF activity of ArcA. To have an

unbiased test, we took the expression of arcA from the microarray data and calculated the time course of

ArcA activity based on the other 17 genes in the network that it regulates. The resultant ArcA time course

was indistinguishable from the one obtained via that using the arcA microarray expression data. A

comparison of ArcA activity (Fig. 5) and expression of arcA (Fig. 6) shows a similar trend. We also added

up to 30% noise to the E. Coli. Microarray data, the approach is still found to be robust.

Fig. 6 Comparison of predicted (hollow markers) and observed (solid markers) microarray response for a) nuoJ, nuoA,

b) arcA, c) livK, ppsA, pykF, d) pstC and pstS.

CONCLUSIONS

ACKNOWLEDGEMENTS

This project was supported by the United States Air Force (through the Defense Advanced Research

Projects Agency), the United States Department of Energy (Genomics: Genome to Life Program) and IBM

Life Sciences Institute of Innovations as well as the College of Arts and Sciences and the Office of the Vice

President for research (Indiana University) as general support for the Center for Cell and Virus Theory.

REFERENCES

Azuaje, F. (2002) A cluster validity framework for genome expression data. Bioinformatics, 18, 319-320.

Bolshakova, N. and F. Azuaje (2003) Cluster validation techniques for genome expression data. Signal Processing, 83,

825-833.

Brown, P. and D. Botstein (1999) Exploring the new world of the genome with DNA microarray. Nature Genetics, 21,

33-37.

Chitler, S. (2004) DNA microarrays:Tools for the 21(st) century. Combinatorial Chemistry and High throughput

Screening, 7, 531-537.

Debouck, C. and P. N. Goodfellow (1999) DNA microarrays in drug discovery and development. Nature Genetics, 21,

48-50.

DeRisi, J. L., V. R. Iyer and P. O. Brown (1997) Exploring the Metabolic and Genetic Control of Gene Expression on

a Genome Scale. Science, 278, 680-686.

Friedman, N., M. Linial, I. Nachman and D. Pe'er (2000) Using Bayesian networks to analyze expression data. Journal

of Compuational Biology, 7, 601-620.

Gardner, T. S., D. d. Bernardo, D. Lorenz and J. J. Collins (2003) Inferring Genetic Networks and Identifying

Compound Mode of Action via Expression Profiling. Science, 301, 102-105.

Gerhold, D., T. Rushmore and C. T. Caskey (1999) DNA chips: promising toys have become powerful tools. Trends in

Biochemical Sciences, 24, 168-173.

Holter, N. S., A. Maritan, M. Cieplak, N. V. Fedroff and J. R. Banavar (2001) Dynamics modeling of gene expression

data. Proceedings of National Academy of Science, 98, 1693-1698.

Holter, N. S., M. Mitra, A. Maritan, M. Cieplak, J. R. Banavar and N. V. Fedroff (2000) Fundamental patterns

underlying gene expression profiles: Simplicity from complexity. Proceedings of National Academy of Science,

97, 8409-8414.

Hyduke, D., L. Rohlin, K. Kao and J.Liao (2003) A software package for cDNA microarray data normalization and

assessing confidence intervals. OMICS: A Journal of Integrative Biology, 7, 227-234.

Kao, K. C., Y.-L. Yang, R. Boscolo, C. Sabatti, V. Roychowdhury and J. C. Liao (2004) Transcriptome-based

determination of multiple transcription regular activities in Escherichia coli by using network component analysis.

Proceedings of National Academy of Science, 101, 641-646.

Karp, P. D., M. Riley, S. M. Paley, A. Pellegrini-Toole and M. Krummenacker (1998) EcoCyc:Encyclopedia of

Escherichia coli genes and metabolism. Nucleic Acids Research, 26, 50-53.

Laidler, K. J. and J. H. Meiser (1995). Physical Chemistry. Boston, Houghton Mifflin Company.

Li, Y., C. Cambell and M. Tipping (2002) Bayesian automatic relevance determination algorithms for classifying gene

expression data. Bioinformatics, 18, 1332-1339.

Liao, J. C., R. Boscolo, Y.-L. Yang, L. M. Tran, C. Sabatti and V. P. Roychowdhury (2003) Network component

analysis:Reconstruction of regulatory signals in biological systems. Proceedings of National Academy of Science,

100, 15522-15527.

Liebermeister, W. (2002) Linear modes of gene expression determined by independent compnent analysis.

Bioinformatics, 18, 51-60.

Mendes, P. and D. Kell (1998) Nonlinear optimization of biochemical pathways: applications to metabolic engineering

and parameter estimation. Bioinformatics, 14, 869-883.




















https://www.researchgate.net/publication/11514386_A_cluster_validity_framework_for_genome_expression_data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==















Navid, A. and P. J. Ortoleva (2004) Simulated complex dynamics of glycolysis in the protozoan parasite Trypanosoma

brucei. Journal of Theoretical Biology, 228, 449-458.

Pe'er, D., A. Regev, G. Elidan and N. Friedman (2001) Inferring subnetworks from perturbed expression profiles.

Bioinformatics, 17, S215-S224.

Salgado, H., A. Santos-Zavaleta, S. Gama-Castro, D. Millen-Zarate, E. Diaz-Peredo, F. Sanchez-Solano, E. Perez-

Rueda and C. Bonavides-Martinez (2001) RegulonDB(version 3.2):transcriptional regulation and operon

organization in Escherichia coli K-12. Nucleic Acids Research, 29, 72-74.

Sauter, G., R. Simon and K. Hillan (2003) Tissue microarrays in drug discovery. Nature Reviews Drug Discovery, 2,

962-972.

Sayyed-Ahmad, A., K. Tuncay and P. J. Ortoleva (2003) Toward Automated Cell Model Development through

Information Theory. Journal of Physical Chemistry A, 107, 10554-10565.

Schena, M., D. Shalon, R. W. Davis and P. O. Brown (1995) Quantitative Monitoring of Gene Expression Patterns with

a Complementary DNA microarray. Science, 270, 467-470.

Shmulevich, I., E. R. Dougherty, S. Kim and W. Zhang (2002) Probablistic Boolean Networks: a rule based uncertainty

for gene regulatory networks. Bioinformatics, 18, 261-274.

Shmulevich, I., H. Lahdesmaki, E. R. Dougherty, J. Astola and W. Zhang (2003) The role of certain Post classes in

Boolean newtork models of genetic networks. Proceedings of National Academy of Science, 100, 10734-

10739.

Slepchenko, B., J. Schaff, I. Macara and L. Loew (2003) Quantitative cell biology with the virtual cell. Trends in Cell

Biology, 13, 570-576.

Symth, G. and T. Speed (2003) Normalization of cDNA Microarray Data. Methods, 31, 265-273.

Weitzke, E. L. and P. J. Ortoleva (2003) Simulating cellular dynamics through a coupled transcription, translation,

metabolic model. Computational Biology and Chemistry, 27, 469-480.

Wilson, D. L., M. J. Buckley, C. A. Helliwell and I. W. Wilson (2002) New normalization methods for cDNA

microarray data. Bioinformatics, 19, 1325-1332.













https://www.researchgate.net/publication/24459711_RegulonDB_version_32_Transcriptional_regulation_and_operon_organization_in_Escherichia_coli_K-12?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==







https://www.researchgate.net/publication/9027353_Normalization_of_cDNA_Microarray_Data?el=1_x_8&enrichId=rgreq-b5163f9f58993146c21d21f3bfbd1577-XXX&enrichSource=Y292ZXJQYWdlOzI0OTg0NTMyNjtBUzoxMTQ5MTQ1NzIxMTU5NzBAMTQwNDQwOTE3Nzk5MA==




Microarray Analysis through Transcription Kinetic Modeling and Information Theory

Documents

Transcript of Microarray Analysis through Transcription Kinetic Modeling and Information Theory