caGEDA

Applied Bioinformatics 2004:3(1) 49–62 © 2004 Open Mind Journals Limited. All rights reserved.


APPLICATION NOTE

caGEDA: a web application for the integrated analysis of global gene expression patterns in cancer

Satish Patel1 and James Lyons-Weiler1–7

1University of Pittsburgh Cancer Institute, 2Department of Pathology, 3Cancer Biomarkers Laboratory, 4Center for

Pathology Informatics, 5Benedum Center for Oncology Informatics, 6Center for Biomedical Informatics, 7Interdisciplinary

Biomedical Graduate Program; University of Pittsburgh Medical School, Pittsburgh, PA, USA

Abstract: The explosion of microarray data from pilot studies, basic research and large-scale clinical trials requires the development of

integrative computational tools that can not only analyse gene expression patterns but that can also evaluate the methods of analysis

adopted and then provide a boost to post-analysis translational interpretation of those patterns. We have developed a web application

called caGEDA (cancer gene expression data analyzer) that can: (1) upload gene expression profiles from cDNA or oligonucleotide

microarrays; (2) conduct a diverse range of serial linear normalisations; (3) identify differentially expressed genes using a variety of tests

– either threshold or permutation tests; (4) produce tables of literature references to papers reporting that specific genes (identified by

accession numbers) are up- or down-regulated in specific cancers; (5) estimate the error of sample class prediction using the significant

gene set for features; (6) perform low-bias and accurate validated learning using three computational validation techniques (leave-one-

out validation, k-fold validation, random resampling validation); and (7) validate a classifier with a randomly selected or user-defined

validation set. Significant genes are reported in a table of links to entries in the following databases: LocusLink, Genome View, UCSC,

Ensembl, UniGene, dbSNP, AmiGO and OMIM. caGEDA is seamlessly integrated via embedded forms with UCSD’s (University of

California at San Diego) 2HAPI server (for medical subject heading (MeSH) term exploration) and EZ-Retrieve (to identify common

transcription factors located upstream of sets of genes that exhibit similar modes of differential expression). caGEDA offers a variety of

previously described and novel tests for differentially expressed genes, most notably the permutation percentile separability test, which

is most appropriate for identifying genes that are significantly differentially expressed in a subset of patients. caGEDA, which is open

source and free to academic users, will soon be greatly enhanced by operating with the components of the National Cancer Institute’s new

cancer bioinformatics grid (caBIG).

Keywords: microarray analysis, cancer, genomics

Availability: http://bioinformatics.upmc.edu/GE2/GEDA.html

Contact: James Lyons-Weiler ([email protected])

Correspondence: James Lyons-Weiler, UPCI, Benedum Oncology Informatics Center, 304 Cancer Pavilion, 5150 Centre Avenue, Pittsburgh, PA 15232, USA; tel +14126237866; fax +14126475380; [email protected]

Introduction

Microarray technologies permit the simultaneous monitoring

of gene expression values for tens of thousands of genes in

one experiment (Schena et al 1995; Lockhart et al 1996;

DeRisi et al 1997). These extremely powerful technologies

have revolutionised functional genomics by allowing cross-

tissue comparisons of expression profiles. Methods for the

analysis of global patterns of gene expression from

microarray studies are typically used to address two important

aims: (1) to identify genes that are differentially expressed

between cell populations (tissue types, or diseased and

normal tissue sample groups); and (2) to determine whether,

and how, the set of differentially expressed genes can be

used either to erect a classification of sample groups (class

discovery) or to predict class membership of new unknown

samples. Microarrays are now being used to study expression

patterns in a wide variety of diseases including, and most

notably, cancer, but also: diabetes (Susztak et al 2003);

arthritis (Urbanowska et al 2003); pancreatitis (Ji et al 2003);

neurological degenerative diseases, including Alzheimer’s

disease (Loring et al 2001; Pasinetti 2001), Parkinson’s

disease (Mandel et al 2003) and amyotrophic lateral

sclerosis (ALS; Lou Gehrig’s disease) (Dangood et al 2004);

heart disease (Ueno et al 2003); and a variety of other diseases

(Hawse et al 2003; Hayden et al 2003; Minning et al 2003).

A multitude of comparisons are being made in the study of


cancer, including normal with tumour, metastatic with quiescent (Chakrabarti et al 2002; Seftor et al 2002; van’t Veer et al 2002; Chopra et al 2003; Cooper et al 2003; DePrimo et al 2003; Kikuchi et al 2003; Nagata et al 2003; Ramaswamy et al 2003; Takahashi et al 2003), organ-confined with aggressive tumours (Luo et al 2002), drug responders with nonresponders (Lyons-Weiler et al 2003; Ohno and Nakamura 2003) and survivors with nonsurvivors (Miura et al 2002). (See also references listed in Lyons-Weiler et al 2004.)

In the field of cancer research, the set of differentially expressed genes represents a potential goldmine of markers for molecular medicine. Specifically, studies have demonstrated the vast potential of differentially expressed genes as biomarkers for molecular diagnosis (DeRisi et al 1996; Welford et al 1998), including stage and grade (Nocito et al 2001; Rickman et al 2001; Allander et al 2002; Inoue et al 2002; Sasaki 2002; Luo et al 2004), prognosis (Beer et al 2002; Inoue et al 2002), therapy outcome prediction (Kihara et al 2002; Okutsu et al 2002; Lyons-Weiler et al 2003), malignancy and metastasis (Li et al 2001; Shih 2001; Gildea et al 2002; Ramaswamy et al 2003) and tumour aggressiveness (Chan and Huang 2001). With the ability to detect drug-resistant tumours using a drug response index that summarises signature profiles (positive and negative), physicians will be able to prescribe drugs that are less likely to fail and thus generate more individualised prescriptions that have higher positive response rates. Eventual uses of this information will extend beyond profiling. Targeted therapeutics under development include antisense oligonucleotides designed to suppress the expression of genes, with the aim of altering drug responsiveness. These may also someday be used as a direct therapy by inducing cell cycle arrest, apoptosis and suppression of anchorage-dependent cell growth. A number of recent studies highlight this potential. Okutsu et al (2002) identified 28 genes whose presence appears to correlate well with chemosensitivity of AML patients. Chang et al (2003) report genes whose presence appears to predict the response of breast cancer patients to docetaxel. Kihara et al (2001) have identified 52 genes whose presence appears to predict the response of oesophageal tumours to adjuvant chemotherapy. Dan et al (2002) report some success in finding genes that may be predictive of a response to a variety of drugs in 39 cancer cell lines.

Because drug responses can be expected to involve very specific metabolic and transfer pathways, success in the preliminary identification and subsequent validation of predictive gene sets is key to the success of basic and translational research in this area, and is required before individualised therapy can take place and become the normative approach to the treatment of cancer. One ultimate goal of such research is the development of highly individualised treatments by predicting which of an array of possible treatment options is optimal for a specific patient (Eyster and Lindahl 2001; Kallioniemi 2001; Miyachi 2002). For some treatments, this may include molecular prediction of drug toxicity (Cunningham et al 2000; Kaminski et al 2000; Vrana et al 2003) and then optimising the dosage. Microarrays are also being considered for use in matching donors for organ transplantation (Chua and Sarwal 2003).

Microarray data from medical research projects include data from pilot studies, basic research (in vitro and model animal), pre-clinical studies and large-scale clinical trials. Each new study reporting results from microarray experiments is accompanied by new methods of analysis. These creative efforts have led to a wealth of methods but a dearth of comprehension on the comparative performance of these methods. Researchers should employ methods of analysis that do the least harm to, and that can most accurately summarise and represent, the clinically significant information in the data. A series of decisions is required in an ‘analysis decision tree’. Selection of an option at each node of the tree always involves a trade-off between loss and gain of information, and most studies are conducted with little idea of the knowledge required to optimise the gain/loss ratio for the complex series of decisions along any path in the tree. Ways of optimising the gain/loss ratio include methods for normalisation that lead to the least loss of signal when normalisation is applied where it is not required, and that also lead to the recovery of the greatest amount of information that would have been lost due to the intrusion of nuisance factors that arise despite careful study design. Tests for identifying differentially expressed genes should be as powerful (sensitive) and yet as specific as possible. Classification approaches should be capable of accurately representing patterns that reflect true relationships among samples and genes, and the entire process should be as immune as possible to the hazards of overtraining. The software should provide feedback to the user on data quality and the fundamentals of experimental design.

Such high standards are goals shared by all involved in the development and deployment of methods for the analysis of microarray data. The ability to traverse the analysis decision tree first requires building the tree (ie assembling the many methods of pre-interpretive analysis) for finding differentially expressed genes, for performing classification operations and for evaluating the application of any such path with computational validation (statistical machine learning). Ideally, one would also be able to explore these


analysis options with additional datasets, some of which are like and others unlike the specific dataset at hand. Given a set of results from such analysis, additional functional and translational interpretation (such as gene ontology and transcription factor analysis) should be facilitated by a seamless integration of the results of the analysis (ie the list of differentially expressed genes) with existing genomic databases. Researchers should be able to relate their new discoveries to existing knowledge on gene and protein dysregulation with a minimum of effort. Finally, the software itself should be designed with best practices, and it should be sufficiently modular, extensible and open source to not restrict the researcher to one paradigm of analysis.

There is a growing demand for easy-to-use tools for the analysis and interpretation of global patterns of gene expression from microarray experiments. While commercial platforms are available, most are specifically designed for data from one platform. Most of the software for array analysis available from academia requires updates via downloads from the Web, which often both limits the number of platforms on which the software can run and requires the use of plug-ins or additional software downloads. In contrast, server-based application designs, using servlets and JavaServer Pages™ (JSP), allow users to upload their data securely to a server bank where the options for analysis are selected through the graphical interface in the users’ browsers. Server-side analysis lets users perform other tasks on the computer they are using for browsing without burdening their local CPU with the analysis, and it provides 0-second ‘instant upgrades’ every time they load the GEDA.html file into their browsers. Moreover, server-side applications can be easily mirrored at a large number of sites, which can then (in principle) become a coordinated grid, ensuring sufficient CPU resources for complex analyses. This is part of the strategy in the development of the new caBIG (cancer bioinformatics grid) initiative of the National Cancer Institute (NCI) (http://cabig.nci.nih.gov).

There were therefore sufficient reasons for building caGEDA, the first such application developed specifically to facilitate the evaluation of a researcher’s decision processes by integrating a diversity of options at each node in the analysis decision tree in a systematic manner that is wrapped in a computational validation framework.

Materials and methods

Software design and programming

caGEDA is written using the Java language. Java is a platform-independent language, so the caGEDA software can run on virtually any operating system for which the Java

runtime environment is available. caGEDA is a web-based

application with the standard HTML 4.0 and JavaScript™

interface, which is compatible with most modern Internet

browsers. caGEDA was developed using the Java Servlet

2.3 and JSP 1.1 technology. We followed the MVC (model-

view-controller) design strategy of pattern development. As

part of the development, we used Java SDK 1.4 with Apache

Tomcat 4.0.6 as an application server for testing purposes.

caGEDA requires an ACME image library (ACME 2001) to

create the GIF images, and O’Reilly’s MIME parsing library

(O’Reilly 2002) to handle file operations. All the tools,

libraries and supporting software used in the development

are either freeware or open source.

The installation of a caGEDA mirror in the new

environment is very simple. We distribute caGEDA as an

installer, which is a self-extracting executable file. This

installer runs only on the Microsoft® Windows® platform.

As a part of the installation procedure, it checks for the

required software on the user’s machine. If required

components are not installed, the installer starts their

installation. Upon successful completion of installation, the

installer creates short cuts in the start menu to access the

caGEDA web application.

Data formats

Currently, caGEDA accepts data in two formats, caGEDA

format 1 and caGEDA format 2. They are identical with one

difference: format 2 includes the accession number to

facilitate post-analysis translational interpretation.

caGEDA format 1:

Name    S1     S2     S3     S4     ...    SN
G1      I11    I12    I13    I14    ...    I1N
G2      I21    I22    I23    I24    ...    I2N
G3      I31    I32    I33    I34    ...    I3N
G4      I41    I42    I43    I44    ...    I4N
.       .      .      .      .      ...    .
.       .      .      .      .      ...    .
Gm      Im1    Im2    Im3    Im4    ...    ImN

caGEDA format 2:

Name    ACC_NO    S1     S2     S3     S4     ...    SN
G1      ACC-1     I11    I12    I13    I14    ...    I1N
G2      ACC-2     I21    I22    I23    I24    ...    I2N
G3      ACC-3     I31    I32    I33    I34    ...    I3N
G4      ACC-4     I41    I42    I43    I44    ...    I4N
.       .         .      .      .      .      ...    .
.       .         .      .      .      .      ...    .
Gm      ACC-m     Im1    Im2    Im3    Im4    ...    ImN



In both of these formats, the dataset is a tab-delimited text

file where the first column is the name of the gene (G1, G2

... Gm) for m genes and N samples. In caGEDA format 2, the

second column is the gene accession number, which is used

to generate links to public databases. Subsequent columns

are for individual arrays where S1, S2 ... SN are sample

names.
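As an illustration of the layout described above, the following minimal Python sketch reads a tab-delimited matrix in either format. It is not part of caGEDA (which is written in Java); the function name `parse_cageda`, the returned dictionary keys and the rule of recognising format 2 by an `ACC_NO` header in the second column are assumptions made for this example.

```python
import csv
import io


def parse_cageda(text):
    """Parse a tab-delimited caGEDA-style expression matrix (format 1 or 2).

    Hypothetical helper for illustration only.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    header, body = rows[0], rows[1:]
    # Assumption: format 2 is recognised by an ACC_NO header in column 2.
    has_acc = len(header) > 1 and header[1] == "ACC_NO"
    first = 2 if has_acc else 1
    genes, accessions, matrix = [], [], []
    for row in body:
        if len(row) != len(header):
            raise ValueError(
                f"row {row[0]!r} has {len(row)} fields, expected {len(header)}"
            )
        genes.append(row[0])
        accessions.append(row[1] if has_acc else None)
        # 'NA' and '?' mark missing values; they are stored as None.
        matrix.append(
            [None if v in ("NA", "?") else float(v) for v in row[first:]]
        )
    return {
        "samples": header[first:],
        "genes": genes,
        "accessions": accessions,
        "matrix": matrix,  # one row per gene, one column per sample
    }
```

Keeping the accession numbers in a separate list mirrors the only difference between the two formats, so downstream steps can treat both identically.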

Forms for specifying the numbers of samples and genes,

the sample name labels and the sample group assignment

labels are found at the bottom of the web page under the

Group Settings heading. Sample names are entered as a list

with sample membership specified by a group number (1, 2,

3…) that is separated from the name by a space or tab.

Missing values

Missing values are encoded as ‘NA’ or ‘?’ and can be

estimated using k-nearest neighbours. A total of three

neighbours in Euclidean space are used; if one of the

neighbours also has a missing value, the next neighbour is

found.
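The neighbour search described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the caGEDA implementation: neighbours are taken to be other genes (rows), distances are computed over the samples observed in both rows and scaled by the number of shared samples, and the estimate is the mean of the three donors. None of these details are specified in the text; `impute_knn` is a hypothetical name.

```python
import math


def impute_knn(matrix, k=3):
    """Fill None entries from the k nearest gene rows (Euclidean distance)."""

    def distance(a, b):
        # Use only samples observed in both rows; divide by their count so
        # rows with different numbers of shared samples stay comparable.
        shared = [(x, y) for x, y in zip(a, b)
                  if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Rank every other gene by its distance to gene i.
            ranked = sorted(
                (distance(row, other), g)
                for g, other in enumerate(matrix) if g != i
            )
            # A neighbour that is also missing at sample j is skipped and
            # the next neighbour is used instead.
            donors = [matrix[g][j] for d, g in ranked
                      if d != math.inf and matrix[g][j] is not None][:k]
            if donors:
                filled[i][j] = sum(donors) / len(donors)  # assumed: mean
    return filled
```

Distances are always computed on the original matrix, so an imputed value never influences another imputation.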

Public domain datasets

The cancer datasets in Table 1 are available from the public

domain, and the data are ready for re-analysis with virtually

no reformatting or uploading by the user. The sample number

and sample group assignment forms at the bottom of the

caGEDA page autopopulate when these datasets are selected

under the Sample Data File pull-down menu. Each of these

datasets is also an entry in the University of Pittsburgh Cancer

Microarray Dataset Link Database, located online (http://bioinformatics.upmc.edu/Help/UPITTGED.html). Cancer

microarray researchers are invited to send us their published

datasets to be added to the database. A more permanent

repository is caArray (http://caarray.nci.nih.gov).

Text and tabular output

All output appears in a new window. Text output summarises

the user-specific information about the data and analysis

settings: sample class label settings; whether transformation

and/or normalisation was used; whether data were tested for

differentially expressed genes; permutation or threshold

settings; distance measure (if used); classification algorithm;

and computational validation method used. A table of

significant genes is generated, with links to their respective

LocusLink, Genome View, UCSC, Ensembl, UniGene,

dbSNP, AmiGO and OMIM entries.

Data quality measures

The text output also includes a number of simple data quality

measures, including the among-array coefficient of variation

(COV) and a confounding index (CI) for global expression,

the latter of which compares the average within-group

correlation to the average between-group correlation

according to the following equation:

\[ \mathrm{CI} = \frac{r_{A} + r_{B}}{2\, r_{AB}} \qquad (1) \]

given sample groups A and B, where r is the correlation

coefficient. The CI should be as close to 1.0 as possible.

Values higher than 1.0 may indicate incidental confounding

in the experimental design.
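A minimal Python sketch of a confounding index of this kind is given below. It assumes one reading of the description: the CI is the mean of the two average within-group sample correlations divided by the average between-group sample correlation, using Pearson's r. The function names and the choice of Pearson correlation are assumptions for illustration, not details confirmed by the text.

```python
from itertools import combinations, product
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def confounding_index(group_a, group_b):
    """group_a, group_b: lists of samples, each a list of expression values.

    Assumed form: (mean r within A + mean r within B) / (2 * mean r between).
    Each group needs at least two samples.
    """
    r_a = mean(pearson(x, y) for x, y in combinations(group_a, 2))
    r_b = mean(pearson(x, y) for x, y in combinations(group_b, 2))
    r_ab = mean(pearson(x, y) for x, y in product(group_a, group_b))
    return (r_a + r_b) / (2 * r_ab)
```

When samples correlate equally well within and between groups the ratio is 1.0; higher within-group than between-group correlation pushes it above 1.0.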

Graphical output and data visualisation

caGEDA generates a number of informative figures that aid

in pre-interpretive and post-interpretive analysis (Figures

1–3). The first is an all-gene, two-group frequency histogram

(Figure 1a). The second is a series of box-and-whisker plots,

one for each sample (Figure 1b). The third is a global

correlation plot of sample means (Figure 1c). The fourth is a

Table 1 Datasets available in the public domain that are included in the caGEDA web application

Cancer            Comparison                          Samples   Genes   Reference
Astrocytoma       Slow or fast progression            13        12625   (Khatua et al 2003)
Breast cancer     BRCA1, BRCA2 or sporadic            17        3226    (Hedenfalk et al 2001)
Breast cancer     Tumour subtypes or normal           84        1753    (Perou et al 2000)
Colon cancer      Tumour or normal                    62        7464    (Alon et al 1999)
Leukaemia         Success or failure (chemotherapy)   15        7129    (Golub et al 1999)
Lung cancer       Tumour or normal                    72        917     (Garber et al 2001)
Lymphoma          DLBCL1 or DLBCL2                    96        4027    (Alizadeh et al 2000)
Lymphoma          Tumour or normal                    25        7129    (de Vos et al 2003)
Lymphoma          DLBCL, FL or CLL                    77        7129    (Shipp et al 2002)
Medulloblastoma   Tumour or normal                    23        2059    (MacDonald et al 2001)
Melanoma          Tumour or normal cell lines         38        8067    (Bittner et al 2000)
Ovarian cancer    Tumour or normal                    39        7129    (Welsh et al 2001)


mean versus variance plot to assess intensity-related

heteroscedasticity (Figure 1d). The fifth is an M-A

(multiplicative-additive) plot of all genes (Figure 1e). These

are considered useful in assessing the need for, and effects of,

transformation. A histogram of the test score for the top 50

differentially expressed genes is generated: genes that are

Figure 1 Examples of data visualisation (analysis output) from caGEDA. Data are the BRCA1 and BRCA2 breast cancer sample data provided by Hedenfalk et al (2001). (a) All-gene frequency histogram with expression value bins on the x-axis and their frequency on the y-axis. (b) Box-and-whisker plots after linear global mean adjustment (GMA) with sample names on the x-axis and expression values on the y-axis. (c) Mean versus mean correlation plot, with mean of group 1 on the x-axis and mean of group 2 on the y-axis. (d) Mean versus variance plot with mean intensities on the x-axis and variance on the y-axis. (e) M-A (multiplicative-additive) plot. (f) Hierarchical agglomerative sample cluster diagram using Euclidean distance (differentially expressed genes only). In (c), (d) and (e), differentially expressed genes under the selected test are shown in red on screen.

underexpressed in group A relative to B (BA pattern genes)

are coloured green; genes that are overexpressed in group A

relative to B (AB pattern genes) are coloured red. A histogram

of the sample means for the top differentially expressed genes

is generated, as well as a test score frequency distribution

(see Figure 2 for examples).



Gene expression pattern grids

caGEDA can generate an expression pattern grid that

organises the differentially expressed genes according to the

‘mode’ of differential expression. The latter is determined

by a count of the number of samples in one group that exhibit

expression intensities in the upper or lower 95th percentile

of the alternate group.

There are six predominant ‘modes’ of differential

expression in the gene expression pattern grid (see Figure

3). The two most commonly considered modes are those in

which most samples in both groups appear to be quite distinct

from the other group; for example (1) genes that are

overexpressed in A compared with B and underexpressed in

B compared with A, and (2) those that are underexpressed

in A compared with B and overexpressed in B compared

with A. In these modes, the sample block (ie individual) is

red if that value is greater than the upper 95th percentile in

Figure 2 (a) Test score histogram for the set of significant genes, in the comparison of BRCA1 with BRCA2, found at threshold T = 4.0 under the J5 test. (b) Histogram of means of group 1 and means of group 2 for the set of significant genes. (c) Test score frequency histogram, all genes. (d) SAM plot from the SAM test module in caGEDA.

the alternate group, and green if the expression intensity value

is less than the lower 95th percentile of the alternate group.

A given sample (ie individual), however, might not be

different from the other group, and these samples are painted

black. This leads to other modes of differential expression

(see Figure 3). Each mode reflects a predominant pattern,

and the grid is a type of clustering on that pattern. Genes

that are significantly different between two groups due to

outliers can be readily identified in the grid.
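The colouring rule for a single gene can be sketched in Python as follows. This is an illustration under stated assumptions rather than caGEDA's code: the ‘upper’ and ‘lower’ boundaries are taken to be the 95th and 5th percentiles of the alternate group, computed by linear interpolation, and the helper names `percentile` and `paint_gene` are hypothetical.

```python
def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def paint_gene(values_a, values_b, lower=5, upper=95):
    """Colour each sample of one gene against the alternate group.

    Returns (colours for group A samples, colours for group B samples).
    """

    def paint(own, other):
        low, high = percentile(other, lower), percentile(other, upper)
        # red: above the alternate group's upper boundary
        # green: below its lower boundary; black: within the boundaries
        return ["red" if v > high else "green" if v < low else "black"
                for v in own]

    return paint(values_a, values_b), paint(values_b, values_a)
```

For example, `paint_gene([10, 11, 12, 2.5], [1, 2, 3, 4])` paints three group A samples red and one black, which is the kind of outlier-bearing pattern the grid is meant to expose.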

Transformation

Occasionally, datasets contain negative values; for example,

where background subtraction has been performed. Users

can either filter out negative numbers, which are then

assigned a null value, or subtract the global minimum. Users

can opt to log transform their data in three ways (log2, log10,

ln) or to perform a square-root or cube-root transformation.



Table 2 Transformation and normalisation methods available in caGEDA

Type of method                        Method
Negative numbers                      Filter negative numbers
                                      Subtract global minimum
Transformations                       Log2
                                      Log10
                                      Ln
                                      Square root
                                      Cube root
Multiplicative linear normalisation   Within array
                                        Sum
                                        Mean
                                        Median
                                        Quantile
                                        Trimmed mean
                                      Among array
                                        Minimum mean ratio
                                        Median mean ratio
Additive linear normalisation         Global mean adjustment (GMA)
                                      Min = 0, Max = 1

Normalisation

The normalisation options implemented to date are listed in

Table 2. Some of the normalisation functions require further

input from the user. For example, the within-array quantile

and trimmed mean normalisation steps require the

specification of a quantile and upper or lower trim on the

within-array distribution expression. Users can select up to

three transformation and normalisation steps to be applied

in series. Researchers who wish to use caGEDA to normalise

the data and receive back the entire dataset can do so by

selecting ‘None’ under the test for differentially expressed

genes. The box-and-whisker plots are generated on the

transformed and normalised data.
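Applying up to three steps in series can be pictured with the short Python sketch below. It is an illustration only: the three step functions are simple readings of ‘subtract global minimum’, ‘log2’ and ‘Min = 0, Max = 1’ (the last applied per array, which is an assumption), and `apply_in_series` is a hypothetical name. The matrix has one row per gene and one column per array.

```python
import math


def subtract_global_min(m):
    """Shift all values so that the smallest value in the dataset is 0."""
    lowest = min(v for row in m for v in row)
    return [[v - lowest for v in row] for row in m]


def log2_transform(m):
    """Log2 transform; all values must be strictly positive."""
    return [[math.log2(v) for v in row] for row in m]


def min_max_per_array(m):
    """Rescale each array (column) to the range 0..1.

    Assumes every array contains at least two distinct values.
    """
    columns = list(zip(*m))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [[(v - lows[j]) / spans[j] for j, v in enumerate(row)]
            for row in m]


def apply_in_series(m, steps, max_steps=3):
    """Apply the chosen steps one after another, in the order given."""
    if len(steps) > max_steps:
        raise ValueError(f"at most {max_steps} steps can be applied")
    for step in steps:
        m = step(m)
    return m
```

Because each step consumes the output of the previous one, order matters: a log transform after subtracting the global minimum would meet a zero, so in practice the choice and order of steps need to be considered together.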

Finding differentially expressed genes

Tests for the two-class comparison implemented to date are

listed in Table 3. These include a number of new tests (J5,

D1, simple separability, weighted separability, the

permutation percentile separability test (PPST) and the ABA

test). Each of these is described briefly in Appendix 1.

False discovery rate estimation with SAM

We have included original source code that implements the

SAM test of Tusher et al (2001) that is designed to report an

estimate of the false discovery rate (false positive/(true

positive + false positive)). A description of SAM as

implemented in caGEDA is available as part of the online

Figure 3 Gene expression pattern grid for the set of significant genes, for the comparison of BRCA1 with BRCA2, under the J5 test. Each mode of differential expression is made up of genes that follow a predominant pattern of difference. On screen, black signifies that the expression intensity of a given gene in a given sample is within the upper and lower 95th percentile boundaries of the alternate group. On screen, green signifies that the expression intensity is below the lower 95th percentile of the alternate group, and red signifies that the expression value is above the upper 95th percentile of the alternate group. Crossed boxes (visible on screen), regardless of colour, signify expression intensity values in the lowest 5% of the entire dataset. Genes are ranked within each mode by the number of red- and green-painted samples, facilitating outlier detection by visual inspection.

help file (http://bioinformatics.upmc.edu/Help/SAM/SAMINFO.htm).

Sample classification

Users can select among eight pairwise distances (Table 4)

that are computed for pairs of samples. These distances are

used in hierarchical clustering (average linkage, complete

linkage or minimum linkage), k-means clustering (where k

is the number of expected clusters) or a locally developed

n-nearest neighbours clustering algorithm, as designated by

the user. The user can specify the number of expected clusters

(default = 2). The n-nearest neighbours algorithm does not

require a priori knowledge of the number of clusters and is

described in Appendix 2.
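To make the clustering step concrete, here is a minimal Python sketch of agglomerative hierarchical clustering with average linkage on Euclidean distances, stopped at a user-specified number of clusters (default 2). It is a generic textbook version written for illustration, not caGEDA's code, and it does not cover the locally developed n-nearest neighbours algorithm; the function names are hypothetical.

```python
import math
from itertools import combinations


def euclidean(x, y):
    """Euclidean distance between two sample profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(samples, n_clusters=2, dist=euclidean):
    """Merge the two closest clusters until n_clusters remain.

    samples: list of expression profiles (one list of values per sample).
    Returns clusters as sorted lists of sample indices.
    """
    clusters = [[i] for i in range(len(samples))]

    def linkage(c1, c2):
        # Average linkage: mean of all pairwise distances between clusters.
        total = sum(dist(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]
```

Passing a different `dist` function swaps in another pairwise distance; replacing the mean in `linkage` with `max` or `min` would give complete or minimum linkage respectively.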

Computational validated learning

Users can study the robustness of classifiers obtained with

differentially expressed genes by performing nonparametric

bootstrapping (Efron 1979), leave-one-out validation and

random resampling validation. Many studies report the

training error rate (the rate at which a classifier built using a

set of samples can predict the class label of that same set of

samples). This error rate is highly biased (overly optimistic).

To avoid this bias, caGEDA splits a dataset into two groups

of samples, a training set and a test set. Learning (classifier

construction) is conducted on the training set and evaluated

on the test set. The user defines the size of the training and

test datasets; these splits are usually performed with a 70/30

ratio. Features (differentially expressed genes) are selected

using the specific test and threshold, and class prediction is

performed on the test dataset (eg naive Bayes posterior

probability). Random resampling can be conducted at a given

threshold, or over a range of thresholds or significance levels

for a test.

When computational validation is performed with a single

split of training set/test set, the estimated rate of test

classification error has a high variance. To decrease that

variance, caGEDA allows users to perform random

resampling validated learning. The average rate for test

classification error is determined over a range of values of

the test threshold or significance level (number of features

included), using multiple independent training/test set splits

to avoid the possibility that a lucky (or unlucky) single

training/test set split might misinform the investigator on

the rate of test classification error. Average rates of test error

generated this way have a lower variance. The user defines

the number of independent splits to be created.
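The resampling idea described above can be sketched as follows: the test error is estimated on several independent random training/test splits and then averaged. This is only an illustrative sketch. The nearest-centroid classifier is a simple stand-in (an assumption made to keep the example short), not the naive Bayes classifier used by caGEDA, and no feature selection is performed. All names (`ResamplingValidation`, `meanTestError`, and so on) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class ResamplingValidation {

    /** Nearest-centroid prediction for two classes labelled 0 and 1 (stand-in classifier). */
    static int predict(double[][] x, int[] y, List<Integer> train, double[] query) {
        int genes = query.length;
        double[][] centroid = new double[2][genes];
        int[] n = new int[2];
        for (int s : train) {
            n[y[s]]++;
            for (int g = 0; g < genes; g++) {
                centroid[y[s]][g] += x[s][g];
            }
        }
        double[] dist = new double[2];
        for (int c = 0; c < 2; c++) {
            for (int g = 0; g < genes; g++) {
                double diff = query[g] - centroid[c][g] / Math.max(n[c], 1);
                dist[c] += diff * diff;
            }
        }
        return dist[0] <= dist[1] ? 0 : 1;
    }

    /** Test error for one random split; assumes the split leaves at least one test sample. */
    static double singleSplitError(double[][] x, int[] y, double trainFraction, Random rng) {
        List<Integer> order = new ArrayList<>();
        for (int s = 0; s < x.length; s++) {
            order.add(s);
        }
        Collections.shuffle(order, rng);
        int nTrain = (int) Math.round(trainFraction * x.length);
        List<Integer> train = order.subList(0, nTrain);
        List<Integer> test = order.subList(nTrain, order.size());
        int errors = 0;
        for (int s : test) {
            if (predict(x, y, train, x[s]) != y[s]) {
                errors++;
            }
        }
        return (double) errors / test.size();
    }

    /** Mean test error over the requested number of independent random splits. */
    static double meanTestError(double[][] x, int[] y, double trainFraction, int splits, long seed) {
        Random rng = new Random(seed);
        double sum = 0.0;
        for (int k = 0; k < splits; k++) {
            sum += singleSplitError(x, y, trainFraction, rng);
        }
        return sum / splits;
    }

    public static void main(String[] args) {
        double[][] x = { {0.0}, {0.1}, {0.2}, {0.3}, {0.4}, {10.0}, {10.1}, {10.2}, {10.3}, {10.4} };
        int[] y = { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 };
        System.out.println(meanTestError(x, y, 0.7, 100, 42L));
    }
}
```

Averaging over many splits is what reduces the variance of the estimate; a single call to `singleSplitError` corresponds to the classical single training set/test set design.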

This validated learning test error can be further validated

using the classical single training set/test set split. Users can

also exclude samples from analysis and create a ‘pristine’

test set (a ‘validation set’) by assigning samples to group 0

in the Group Settings forms. caGEDA will then report the

training error, the validated learning test error and the

validation set error.

Interoperability: towards translational research

To facilitate translational research, the HTML output file of

caGEDA produces a table of links to the entries in a number

of databases. If the dataset analysed is in caGEDA format 2

(ie contains accession numbers, which may be UniGene ID,

GenBank® numbers or some commercial chip accession

numbers), the HTML output file of an analysis also contains

embedded forms so users can download the accession

numbers for their set of differentially expressed genes and

then upload them again to two online servers of interest:

UCSD’s (University of California at San Diego) 2HAPI server (Fink et al 2003), which provides ranked output on the association of a gene and medical subject heading (MeSH) terms; and EZ-Retrieve (Zhang et al 2002), which can be used to search upstream for regulatory elements common to a set of differentially expressed genes. Naturally, the set of accession numbers in the output can also be used by the user with any other server-based analysis tool that accepts a list of accession numbers as input (such as some gene ontology browsers).

Table 3 Tests for differentially expressed genes available in caGEDA

Test                                          Reference
J5 test                                       Appendix 1
Simple separability                           Appendix 1
Weighted separability                         Appendix 1
Pooled variance t-test
Simple t-test
Paired values t-test
Fold-change (ratio of means)
Fold-change (mean of ratios)
Fold-change ((M1 – M2)/M2) where M2 < M1
D1 test                                       Appendix 1
S/N                                           (Furey et al 2000)
BSS/WSS                                       (Dudoit and Fridlyand 2002)
PPST test                                     Appendix 1
ABA test                                      Appendix 1

S/N, signal-to-noise ratio; BSS/WSS, between sums of squares/within sums of squares ratio; PPST, permutation percentile separability.

Table 4 Pairwise distances available for sample classification inferences (class discovery and class prediction) in caGEDA

Types of distances
Euclidean distance (equivalent to 1-Pearson’s correlation coefficient (centred))
1-Pearson’s correlation coefficient (uncentred)
Absolute (1-Pearson’s correlation coefficient (uncentred))
Manhattan/City Block distance
Chebychev distance
Canberra distance
Minkowski distance
Chord distance

Data source

Owing to space limitations, and the ease with which readers

can generate results using the datasets in the public domain,

we do not provide a complete representation of the repertoire

of results provided by caGEDA. For a representative set of

results, we have re-analysed gene expression profiles from

BRCA1 and BRCA2 subtypes of breast cancer (Hedenfalk et

al 2001).

Gene expression data simulator

caGEDA is intimately associated with an online gene expression data simulator (http://bioinformatics.upmc.edu/GE2), the basis, capabilities and results of which will be described in a paper in the near future.

Results

The data visualisation provided by caGEDA is helpful for

assessing data quality and for assessing the effects of

transformation and normalisation. The plots include a

dataset-wide frequency distribution (Figure 1a), sample-wide

box-and-whisker plots (Figure 1b), a correlation plot between

group-wide means (Figure 1c), a mean versus variance plot

(Figure 1d) and a ratio intensity or M-A plot (Figure 1e).

The frequency distribution (Figure 1a) informs about the

global data distribution. Ideally, the box-and-whisker plots

should have all features (median, upper median, lower

median, upper and lower 95th percentiles) aligned across all

samples. Both the plot of mean versus variance and the M-A

plot are expected to be flat and homogeneous. A cluster

diagram of the samples derived using the significant genes

as features is also provided in the output (Figure 1f).

The differentially expressed genes are presented in a score

histogram (Figure 2a) and a histogram of the means of the

groups (Figure 2b). A posterior distribution of the test statistic

is rendered (Figure 2c) to depict the number of significant

genes at a given threshold. If the SAM test has been selected,

a SAM plot comparing the observed and expected test

statistic values by rank (position) is provided (Figure 2d).

Expression modes for the genes found under the J5 test in

the comparison of BRCA1 with BRCA2 are summarised in

the expression pattern grid (Figure 3). All modes of

differential expression (A–F in Figure 3) are found under

the J5 test. A screenshot of the table of links to genomic

databases for a set of differentially expressed genes is shown

in Figure 4.

Discussion

It is our intention to accumulate as many datasets on global

patterns of gene expression in cancer as possible, and place

them in the public domain. This strategy brings algorithms

and data together in a forum accessible to all researchers.

Researchers can submit their datasets to the caGEDA tool

by email to the authors. The installer makes mirroring the

tool, and the datasets, at different cancer research institutions

simple. caGEDA is also an open-source web application;

academic researchers interested in hosting a mirror are invited

to contact us. Individuals who do not program in Java can

contact the corresponding author with a request for

information on various mechanisms for working with the

caGEDA development team to include their methods.

An interesting potential application of the expression

pattern grid and the study of modes of differential expression

is the search for potential gene deletions (samples in group

A down-regulated compared with samples in group B). We

might expect that genes that exhibit similar modes of

differential expression may share a common root cause for

that differential expression (eg coordinated expression

regulation, gene amplification, gene deletion). Crossed-out

boxes in the expression pattern grids provide a visual aid in

formulating hypotheses about deletion events. Furthermore,

the gene expression pattern grid is useful in identifying false

positives that occur due to outliers.

Online help and user support

Various features of the analysis carried out by caGEDA are

described, and more information on its implementation is

available online (http://bioinformatics.upmc.edu/Help/GEDADescription.html). caGEDA is constantly being

upgraded, so online help is always out of date. Readers are

referred to the URL above for additional information.

Some intended extensions

• Integration into caBIG. The caBIG initiative is a new,

NCI-led, multi-institutional initiative designed to foster

interactions among bioinformaticists working on clinical



trial management systems and tools for tissue banking,

pathology and integrative cancer research. The design

specifications for resources are stated under the caCORE

(Covitz et al 2003) description (http://ncicb.nci.nih.gov/NCICB/core), and include a collection of cancer

bioinformatics infrastructure objects (ie ‘biomedical

objects’) (caBIO), a cancer data standards repository

(caDSR) and the enterprise (controlled) vocabulary

services (EVS). The caDSR implements the ISO/IEC

11179 standard for metadata registries and includes,

among other things, tools for establishing and using

common data elements. Our long-term intent is to bring

caGEDA into caCORE compliance, and our short-term

intent is to make caGEDA interoperable with as many

caBIG components as possible through the use of

embedded forms (as we have done with the 2HAPI

server). Components of caBIG that we will be building

towards immediately include MatchMiner (Bussey et al

2003) and cMAP (Buetow et al 2002). caGEDA is not

intended as a permanent data repository, but is instead a

research tool. Researchers might consider submitting

their dataset to NCI’s caArray data portal (http://caarray.nci.nih.gov) for more permanent archival. We

encourage, therefore, developers of caBIG-related web

applications to post embedded-form HTML on the pages

of their web applications to enhance interoperability

within the cancer bioinformatics grid.

• Model-based analysis. A number of methods, notably

ANOVA, have been described that can be used to study

various potential sources of unwanted variability. We

intend, in the near future, to add ANOVA to caGEDA

with the ability to upload a design matrix so users can

perform multiparameter ANOVAs. Currently, users are

restricted to performing multigroup classifications using

the naive Bayes classifier. caGEDA will soon be upgraded

so that differentially expressed genes can be found among

more than two groups.

• Data formats. We intend to add the capability to allow

users to upload datasets that contain background values.

Ideally, caGEDA would be able to open all files in a

directory and parse, according to a setting in a simple

user-controlled interface, data elements from tab-

delimited text files that result from scans of arrays. The

effects of background subtraction can currently be evaluated using the computational validation options if the user creates and analyses two datasets (one with and one without background subtraction).

Figure 4 Partial screenshot of the links table offered to expedite interpretation and translational analysis for the BRCA1 and BRCA2 datasets. Each link opens a new browser window to that particular database entry for the given gene.

• Nonlinear normalisation. We intend to add LOWESS transformations and other nonlinear normalisation approaches advocated by some for use with two-dye experiments. These normalisations provide some benefit at an, as yet, unmeasured cost of lost information. Listgarten et al (2003) recently demonstrated that print-tip LOWESS appears to be superior to others at correcting the problems inherent to the two-dye platform (dye bias).

• Gene clustering. caGEDA has the ability to perform gene clustering, but this option has been disabled because the computations place a heavy burden on the servers performing the analyses. The usefulness of clustering genes without taking prior knowledge into account is generally recognised as fairly limited.

Acknowledgements

We thank Milos Hauskrecht and Richard Pelikan for their generous assistance in implementing the naive Bayes algorithm and for extensive discussion on resampling-based validation. This work was funded exclusively by Dr Lyons-Weiler’s faculty recruitment funds, provided by a grant from the Claude Worthington Benedum Foundation to Dr Michael Becich and Dr Ronald Herberman.

Appendix 1

Novel tests included in the caGEDA web application

Simple separability test

We define $N_1$ as the number of samples in group A for which the expression value for the ith gene is less than the minimum value of that gene of group B (the subscript i in $N_{1i}$ has been omitted for ease of reading); $N_2$, as the number of samples in group A for which the expression value for the ith gene is greater than the maximum value of that gene of group B; $N_3$, as the number of samples in group B for which the expression value for the ith gene is less than the minimum value of that gene of group A; and $N_4$, as the number of samples in group B for which the expression value for the ith gene is greater than the maximum value of that gene of group A. This is equivalent to the proportion of non-overlap between the two distributions.

The counts $N_1$, $N_2$ and $N_3$, $N_4$ are paired such that in the perfectly separable case when $N_1 = N_A$, then $N_2 = 0$, or vice versa; and when $N_3 = N_B$, then $N_4 = 0$, or vice versa, where $N_A$ and $N_B$ are the numbers of samples in groups A and B. Our score, $S_i$, which is:

$$S_i = \frac{N_1 + N_4}{N_A + N_B} \quad \text{or} \quad \frac{N_2 + N_3}{N_A + N_B}$$

(whichever is greater), has an upper limit of 1.0. The threshold, T, ranges from 0 to 1. Genes with S scores greater than or equal to T are retained (1 = perfect separability, no overlap in the distribution).
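The simple separability score for a single gene can be sketched directly from the definitions of the four counts. The class and method names below are hypothetical, and the sketch assumes that both groups contain at least one sample.

```java
import java.util.Arrays;

final class SimpleSeparability {

    /**
     * Simple separability score for one gene.
     * a and b hold the expression values of that gene in groups A and B.
     */
    static double score(double[] a, double[] b) {
        double minA = Arrays.stream(a).min().getAsDouble();
        double maxA = Arrays.stream(a).max().getAsDouble();
        double minB = Arrays.stream(b).min().getAsDouble();
        double maxB = Arrays.stream(b).max().getAsDouble();

        int n1 = 0; // samples in A below the minimum of B
        int n2 = 0; // samples in A above the maximum of B
        int n3 = 0; // samples in B below the minimum of A
        int n4 = 0; // samples in B above the maximum of A
        for (double v : a) {
            if (v < minB) n1++;
            if (v > maxB) n2++;
        }
        for (double v : b) {
            if (v < minA) n3++;
            if (v > maxA) n4++;
        }
        double total = a.length + b.length;
        // The larger of the two paired sums, as a fraction of all samples
        return Math.max(n1 + n4, n2 + n3) / total;
    }

    public static void main(String[] args) {
        double[] groupA = { 1, 2, 5 };
        double[] groupB = { 4, 5, 6 };
        System.out.println(score(groupA, groupB));
    }
}
```

A gene would then be retained when `score(a, b)` is greater than or equal to the user-selected threshold T.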

Weighted separability

Clearly, simple separability is not as informative as it might be, especially in the comparison of two sample distributions that are perfectly separated, one by a large magnitude and the other by a smaller magnitude. Under weighted separability, the score for each gene is weighted using information on the magnitude of the difference between the group means. Specifically, the weight of each of m given genes is:

$$w_i = \frac{\mathrm{mean}_{Ai} - \mathrm{mean}_{Bi}}{\frac{1}{m}\sum_{j=1}^{m}\left(\mathrm{mean}_{Aj} - \mathrm{mean}_{Bj}\right)}$$

Under weighted separability, the score $S_i$ becomes $w_i S_i$ and has a lower limit of 0 and an upper limit of $N_A + N_B$.
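The weighting step can be illustrated with a short sketch that follows the weight formula as printed, using signed differences between group means. The names are hypothetical, and the sketch assumes that the average difference over all genes is non-zero; it does not guard against that case.

```java
final class WeightedSeparability {

    /**
     * Weight of each gene: its difference in group means divided by the
     * average difference in group means over all m genes.
     */
    static double[] weights(double[] meanA, double[] meanB) {
        int m = meanA.length;
        double sum = 0.0;
        for (int j = 0; j < m; j++) {
            sum += meanA[j] - meanB[j];
        }
        double average = sum / m; // assumed to be non-zero
        double[] w = new double[m];
        for (int i = 0; i < m; i++) {
            w[i] = (meanA[i] - meanB[i]) / average;
        }
        return w;
    }

    /** Weighted scores: each simple separability score multiplied by its gene weight. */
    static double[] weightedScores(double[] simpleScores, double[] w) {
        double[] out = new double[simpleScores.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = w[i] * simpleScores[i];
        }
        return out;
    }

    public static void main(String[] args) {
        double[] meanA = { 4, 6, 8 };
        double[] meanB = { 2, 2, 2 };
        double[] simple = { 1.0, 0.5, 1.0 };
        double[] weighted = weightedScores(simple, weights(meanA, meanB));
        System.out.println(java.util.Arrays.toString(weighted));
    }
}
```

In this toy example the first and third genes are both perfectly separable, but the third receives the larger weighted score because its group means differ by a larger amount.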

J5 test

The J5 test is a gene-specific ratio between the mean difference in expression intensity between two groups, A and B, to the average mean group difference of all m genes:

$$J5_i = \frac{\bar{A}_i - \bar{B}_i}{\frac{1}{m}\sum_{j=1}^{m}\left(\bar{A}_j - \bar{B}_j\right)}$$

where $\bar{A}_i$ and $\bar{B}_i$ are the mean expression intensities of gene i in groups A and B.

The J5 test is likely to be useful in pilot studies where, due to high variance, t-tests are likely to exhibit unacceptably low specificity (high false discovery rates). A paired sample J5 test is under development.

D1 test

The D1 test is a twice-iterated J5 test where the remaining genes are tested again after removing the genes found to be significant in the first iteration.

Permutation percentile separability test

PPST examines whether a gene is overexpressed (or underexpressed) in a surprisingly large number of patients in the test group compared with the control group, and vice versa. This is achieved by first counting the number of individuals in a group that exhibit an expression intensity that is greater than (or less than) the upper (or lower) 95th percentile of the alternate group. This number is compared with a distribution of counts that results when an arbitrarily large number of permutations of sample class labels are performed. PPST identifies genes with AB (ie where A > B) and BA (ie where B > A) patterns of differential expression.
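The permutation step of the PPST can be sketched for a single gene and a single direction (samples of group A above the upper 95th percentile of group B). This is a minimal illustration and not the caGEDA implementation: the nearest-rank percentile definition, the p-value convention with one added to numerator and denominator, and all names are assumptions. Both groups are assumed to be non-empty.

```java
import java.util.Arrays;
import java.util.Random;

final class PpstSketch {

    /** Nearest-rank percentile (an assumed definition), with p in (0, 100]. */
    static double percentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** Number of group-A samples above the upper 95th percentile of group B. */
    static int countAbove(double[] x, boolean[] inA) {
        int nB = 0;
        for (boolean a : inA) {
            if (!a) nB++;
        }
        double[] b = new double[nB];
        int k = 0;
        for (int s = 0; s < x.length; s++) {
            if (!inA[s]) b[k++] = x[s];
        }
        double cutoff = percentile(b, 95.0);
        int count = 0;
        for (int s = 0; s < x.length; s++) {
            if (inA[s] && x[s] > cutoff) count++;
        }
        return count;
    }

    /** Fraction of label permutations whose count is at least the observed count. */
    static double pValue(double[] x, boolean[] inA, int permutations, long seed) {
        int observed = countAbove(x, inA);
        Random rng = new Random(seed);
        boolean[] shuffled = inA.clone();
        int atLeast = 0;
        for (int p = 0; p < permutations; p++) {
            // Fisher-Yates shuffle of the class labels
            for (int s = shuffled.length - 1; s > 0; s--) {
                int t = rng.nextInt(s + 1);
                boolean tmp = shuffled[s];
                shuffled[s] = shuffled[t];
                shuffled[t] = tmp;
            }
            if (countAbove(x, shuffled) >= observed) atLeast++;
        }
        return (atLeast + 1.0) / (permutations + 1.0);
    }

    public static void main(String[] args) {
        double[] x = { 10, 11, 12, 13, 14, 1, 2, 3, 4, 5 };
        boolean[] inA = { true, true, true, true, true, false, false, false, false, false };
        System.out.println(countAbove(x, inA) + " samples above cutoff, p = " + pValue(x, inA, 999, 1L));
    }
}
```

The BA direction follows by swapping the roles of the two groups, and the lower tail by counting values below the lower percentile instead.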

ABA test

The ABA test is a special form of the PPST and reports genes that show both a surprisingly large number of samples that

are overexpressed compared with the alternate group and a

surprisingly large number of samples that are underexpressed

compared with the alternate group. The test is called ABA

to reflect the pattern in which samples from the A group are

found in the tails of the B group, and it reports genes that

exhibit either significant ABA or BAB patterns. caGEDA

provides output for both, and an expression pattern grid for

the ABA pattern genes (tumour–normal–tumour). Our

convention is that tumours (cases, treatments) are assigned

group label 1, while normals (controls) are assigned group

label 2 for a two-class comparison.

Appendix 2

Pitt n-nearest neighbours algorithm

This variant of nearest neighbours clustering works as follows. Given a pairwise distance matrix for N samples, the number of common nearest neighbours within a defined neighbourhood size is calculated for each pair of samples. Sample pairs are grouped if they share a common set of neighbours (set size = Ns). If i and j share at least Ns neighbours but are not yet assigned to a group, then a new group of i and j is created. The process uses the principle of guilt by association: if i, but not j, is assigned to a group, and i and j share at least Ns neighbours in the defined neighbourhood size, j is assigned to the same group as i. After iterating over all possible pairs of samples, we get x sample groups, with the members of each group sharing at least Ns neighbours with at least one other group member. This algorithm (Algorithm A1) does not require the specification of the number of groups.

Algorithm A1 Pitt n-nearest neighbours Java source code

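private boolean PittClustering(Vector Clusters) throws Exception {
    // Pairwise distances between samples (columns) under the selected distance function
    DistanceMatrix = DistanceLibrary.CalculateDistance(DistanceFunction, GE_Data,
        Rows, Cols, RetainedGenes, RetainedSamples, GeneMask, SampleMask, Threshold);
    Object neighbours[] = new Object[Cols];
    double distance[] = new double[Cols];
    // For each sample, rank the other samples by increasing distance
    for (int i = 0; i < Cols; i++) {
        GetDistance(i, Cols, DistanceMatrix, distance);
        neighbours[i] = sortPitt(i, distance);
    }
    int GroupPitt[] = new int[Cols]; // 0 means "not yet assigned to a group"
    int TotalGroups = 0;
    for (int i = 0; i < Cols; i++) {
        for (int j = 0; j < Cols; j++) {
            if (i != j) {
                if (Common((int[]) neighbours[i], (int[]) neighbours[j],
                        NeighbourhoodSize, Cols) >= CommonNeighbours) {
                    if ((GroupPitt[i] == 0) && (GroupPitt[j] == 0)) {
                        // Neither sample is grouped yet: start a new group
                        TotalGroups++;
                        GroupPitt[i] = TotalGroups;
                        GroupPitt[j] = TotalGroups;
                    } else if ((GroupPitt[i] != 0) && (GroupPitt[j] != 0)) {
                        // Both samples are already grouped: assignments are left unchanged
                        if (GroupPitt[i] != GroupPitt[j]) {
                            //Out.println("<BR>Assigned Groups not equal "+i+","+j);
                        } else {
                            //Out.println("<BR>Assigned Groups equal "+i+","+j);
                        }
                    } else if (GroupPitt[i] == 0) {
                        // Guilt by association: i joins the group of j
                        GroupPitt[i] = GroupPitt[j];
                    } else if (GroupPitt[j] == 0) {
                        // Guilt by association: j joins the group of i
                        GroupPitt[j] = GroupPitt[i];
                    } else {
                        //Out.println("Invalid Conditions");
                    }
                }
            }
        }
    }
    // Copy the group assignments into GroupK
    GroupK = new int[Cols];
    for (int i = 0; i < Cols; i++)
        GroupK[i] = GroupPitt[i];
    // Report whether the assignments agree with the labels held in Group
    for (int i = 0; i < Cols; i++)
        if (Group[i] != GroupPitt[i])
            return false;
    return true;
}

private int Common(int data1[], int data2[], int Cols, int TotalCols) {
    if (Cols > TotalCols)
        Cols = TotalCols;
    // Mark every sample that appears among the first Cols neighbours of either list
    int Indexes[] = new int[TotalCols];
    for (int i = 0; i < Cols; i++) {
        Indexes[data1[i]] = 1;
        Indexes[data2[i]] = 1;
    }
    int count = 0;
    for (int i = 0; i < TotalCols; i++) {
        count += Indexes[i];
    }
    // count is the size of the union; for lists without repeated entries,
    // the number of shared neighbours is 2*Cols minus the size of the union
    count = Cols * 2 - count;
    return count;
}

private Object sortPitt(int index, double Data[]) {
    int Index[] = new int[Data.length];
    for (int i = 0; i < Index.length; i++)
        Index[i] = i;
    // Order the sample indices by increasing distance
    for (int i = 0; i < Index.length; i++)
        for (int j = i + 1; j < Index.length; j++)
            if (Data[Index[i]] > Data[Index[j]]) {
                int temp = Index[i];
                Index[i] = Index[j];
                Index[j] = temp;
            }
    // Shift left by one to drop the first entry (expected to be the sample itself)
    for (int i = 0; i < Cols - 1; i++)
        Index[i] = Index[i + 1];
    return (Object) Index;
}

private void GetDistance(int index, int Cols, double DistanceMatrix[][], double distance[]) {
    // Read the distances for sample 'index' from the upper triangle of the matrix
    for (int i = 0; i < index; i++)
        distance[i] = DistanceMatrix[i][index];
    for (int i = index; i < Cols; i++)
        distance[i] = DistanceMatrix[index][i];
}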

References

ACME Laboratories. 2001. ACME Laboratories freeware [online]. Accessed 7 Apr 2004. URL: http://www.acme.com

Alizadeh A, Eisen M, Davis RE et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–11.

Allander SV, Illei PB, Chen Y et al. 2002. Expression profiling of synovial sarcoma by cDNA microarrays: association of ERBB2, IGFBP2, and ELF3 with epithelial differentiation. Am J Pathol, 161:1587–95.

Alon U, Barkai N, Notterman DA et al. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA, 96:6745–50.

Beer DG, Kardia SL, Huang CC et al. 2002. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med, 8:816–24.

Bittner M, Meltzer P, Chen Y et al. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406:536–40.

Buetow KH, Klausner RD, Fine H et al. 2002. Cancer molecular analysis project: weaving a rich cancer research tapestry. Cancer Cell, 1:315–18.

Bussey KJ, Kane D, Sunshine M et al. 2003. MatchMiner: a tool for batch navigation among gene and gene product identifiers. Genome Biol, 4:R27.

Chakrabarti R, Robles LD, Gibson J et al. 2002. Profiling of differential expression of messenger RNA in normal, benign, and metastatic prostate cell lines. Cancer Genet Cytogenet, 139:115–25.

Chan WC, Huang JZ. 2001. Gene expression analysis in aggressive NHL. Ann Hematol, 80(Suppl 3):B38–41.

Chang JC, Wooten EC, Tsimelzon A et al. 2003. Gene expression profiling predicts therapeutic response to docetaxel in breast cancer patients. Lancet, 362:362–9.

Chopra A, Brown KM, Rood BR et al. 2003. The use of gene expression analysis to gain insights into signaling mechanisms of metastatic medulloblastoma. Pediatr Neurosurg, 39:68–74.

Chua MS, Sarwal MM. 2003. Microarrays: new tools for transplantation research. Pediatr Nephrol, 18:319–27.

Cooper CR, Chay CH, Gendernalik JD et al. 2003. Stromal factors involved in prostate carcinoma metastasis to bone. Cancer, 97(Suppl 3):739–47.

Covitz PA, Hartel F, Schaefer C et al. 2003. caCORE: a common infrastructure for cancer informatics. Bioinformatics, 18:2404–12.

Cunningham MJ, Liang S, Fuhrman S et al. 2000. Gene expression microarray data analysis for toxicology profiling. Ann NY Acad Sci, 919:52–67.

Dan S, Tsunoda T, Kitahara O et al. 2002. An integrated database of chemosensitivity to 55 anticancer drugs and gene expression profiles of 39 human cancer cell lines. Cancer Res, 62:1139–47.

Dangond F, Hwang D, Camelo S et al. 2004. Molecular signature of late-stage human ALS revealed by expression profiling of postmortem spinal cord gray matter. Physiol Genomics, 16:229–39.

DePrimo SE, Wong LM, Khatry DB et al. 2003. Expression profiling of blood samples from an SU5416 phase III metastatic colorectal cancer clinical trial: a novel strategy for biomarker identification. BMC Cancer, 3:3.

DeRisi JL, Iyer VR, Brown PO. 1997. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278:680–6.

de Vos S, Hofmann WK, Grogan TM et al. 2003. Cell cycle alterations in the blastoid variant of mantle cell lymphoma (MCL-BV) as detected by gene expression profiling of mantle cell lymphoma (MCL) and MCL-BV. Diagn Mol Pathol, 12:35–43.

Dudoit S, Fridlyand J. 2002. A prediction-based resampling method to estimate the number of clusters in a dataset. Genome Biol, 3:0036.1–0036.21.

Efron B. 1979. Bootstrap methods: another look at the jackknife. Ann Stat, 7:1–26.

Eyster KM, Lindahl R. 2001. Molecular medicine: a primer for clinicians. Part XII: DNA microarrays and their application to clinical medicine. SDJ Med, 54:57–61.

Fink JL, Drewes S, Patel H et al. 2003. 2HAPI: a microarray data analysis system. Bioinformatics, 19:1443–5.

Furey TS, Cristianini N, Duffy N et al. 2000. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16:906–14.

Garber ME, Troyanskaya OG, Schluens K et al. 2001. Diversity of gene expression in adenocarcinoma of the lung. Proc Natl Acad Sci USA, 98:13784–9. [Erratum in: Proc Natl Acad Sci USA, 2002; 99:1098.]

Gildea JJ, Seraj MJ, Oxford G et al. 2002. RhoGDI2 is an invasion and metastasis suppressor gene in human cancer. Cancer Res, 62:6418–23.

Giles ED, Singh G. 2003. Role of insulin-like growth factor binding proteins (IGFBPs) in breast cancer proliferation and metastasis. Clin Exp Metastasis, 20:481–7.

Golub TR, Slonim DK, Tamayo P et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–7.

Hawse JR, Hejtmancik JF, Huang Q et al. 2003. Identification and functional clustering of global gene expression differences between human age-related cataract and clear lenses. Mol Vis, 9:515–37.

Hayden PS, El-Meanawy A, Schelling JR et al. 2003. DNA expression analysis: serial analysis of gene expression, microarrays and kidney disease. Curr Opin Nephrol Hypertens, 12:407–14.

Hedenfalk I, Duggan D, Chen Y et al. 2001. Gene-expression profiles in hereditary breast cancer. N Engl J Med, 344:539–48.

Inoue H, Matsuyama A, Mimori K et al. 2002. Prognostic score of gastric cancer determined by cDNA microarray. Clin Cancer Res, 8:3475–9.



Ji B, Chen XQ, Misek DE et al. 2003. Pancreatic gene expression during the initiation of acute pancreatitis: identification of EGR-1 as a key regulator. Physiol Genomics, 14:59–72.

Kallioniemi OP. 2001. Biochip technologies in cancer research. Ann Med, 33:142–7.

Kaminski N, Allard J, Heller RA. 2000. Use of oligonucleotide arrays to analyze drug toxicity. Ann NY Acad Sci, 919:1–8.

Khatua S, Peterson KM, Brown KM et al. 2003. Overexpression of the EGFR/FKBP12/HIF-2alpha pathway identified in childhood astrocytomas by angiogenesis gene profiling. Cancer Res, 63:1865–70.

Kihara C, Tsunoda T, Tanaka T et al. 2001. Prediction of sensitivity of esophageal tumors to adjuvant chemotherapy by cDNA microarray analysis of gene-expression profiles. Cancer Res, 61:6474–9.

Kikuchi T, Daigo Y, Katagiri T et al. 2003. Expression profiles of non-small cell lung cancers on cDNA microarrays: identification of genes for prediction of lymph-node metastasis and sensitivity to anti-cancer drugs. Oncogene, 22:2192–205.

Li S, Ross DT, Kadin ME et al. 2001. Comparative genome-scale analysis of gene expression profiles in T cell lymphoma cells during malignant progression using a complementary DNA microarray. Am J Pathol, 158:1231–7.

Listgarten J, Graham K, Damaraju S et al. 2003. Clinically validated benchmarking of normalisation techniques for two-colour oligonucleotide spotted microarray slides. Appl Bioinform, 2:219–28.

Lockhart DJ, Dong H, Byrne MC et al. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol, 13:1675–80.

Loring JF, Wen X, Lee JM et al. 2001. A gene expression profile of Alzheimer’s disease. DNA Cell Biol, 20:683–95.

Luo JH, Yu YP, Cieply K et al. 2002. Gene expression analysis of prostate cancers. Mol Carcinog, 33:25–35.

Lyons-Weiler J, Patel S, Bhattacharya S. 2003. A classification-based machine learning approach for the analysis of genome-wide expression data. Genome Res, 13:503–12.

Lyons-Weiler J, Patel S, Ngyuen T et al. 2004. UPITT Cancer Gene Expression Data Set Link Database, maintained by the University of Pittsburgh Benedum Oncology Informatics Center [online]. Accessed 12 Apr 2004. URL: http://bioinformatics.upmc.edu/Help/UPITTGED.html

MacDonald TJ, Brown KM, LaFleur B et al. 2001. Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat Genet, 29:143–52.

Mandel S, Weinreb O, Youdim MB. 2003. Using cDNA microarray to assess Parkinson’s disease models and the effects of neuroprotective drugs. Trends Pharmacol Sci, 24:184–91.

Minning TA, Bua J, Garcia GA et al. 2003. Microarray profiling of gene expression during trypomastigote to amastigote transition in Trypanosoma cruzi. Mol Biochem Parasitol, 131:55–64.

Miura K, Bowman ED, Simon R et al. 2002. Laser capture microdissection and microarray expression analysis of lung adenocarcinoma reveals tobacco smoking- and prognosis-related molecular profiles. Cancer Res, 62:3244–50.

Miyachi H. 2002. Recent advances in molecular diagnostic tests. Rinsho Byori, 123(Suppl):19–29.

Nagata M, Fujita H, Ida H et al. 2003. Identification of potential biomarkers of lymph node metastasis in oral squamous cell carcinoma by cDNA microarray analysis. Int J Cancer, 106:683–9.

Nocito A, Bubendorf L, Maria Tinner E et al. 2001. Microarrays of bladder cancer tissue are highly representative of proliferation index and histological grade. J Pathol, 194:349–57.

Ohno R, Nakamura Y. 2003. Prediction of response to imatinib by cDNA microarray analysis. Semin Hematol, 40(Suppl 3):42–9.

Okutsu J, Tsunoda T, Kaneta Y et al. 2002. Prediction of chemosensitivity for patients with acute myeloid leukemia, according to expression levels of 28 genes selected by genome-wide complementary DNA microarray analysis. Mol Cancer Ther, 1:1035–42.

O’Reilly. 2002. com.oreilly.servlet [online]. Accessed Dec 2002. URL: http://www.servlets.com/cos/index.html

Pasinetti GM. 2001. Use of cDNA microarray in the search for molecular markers involved in the onset of Alzheimer’s disease dementia. J Neurosci Res, 65:471–6.

Perou CM, Sorlie T, Eisen MB et al. 2000. Molecular portraits of human breast tumours. Nature, 406:747–52.

Ramaswamy S, Ross KN, Lander ES et al. 2003. A molecular signature of metastasis in primary solid tumors. Nat Genet, 33:49–54.

Rickman DS, Bobek MP, Misek DE et al. 2001. Distinctive molecular profiles of high-grade and low-grade gliomas based on oligonucleotide microarray analysis. Cancer Res, 61:6885–91.

Sasaki H, Ide N, Fukai I et al. 2002. Gene expression analysis of human thymoma correlates with tumor stage. Int J Cancer, 101:342–7.

Schena M, Shalon D, Davis RW et al. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467–70.

Seftor EA, Meltzer PS, Kirschmann DA et al. 2002. Molecular determinants of human uveal melanoma invasion and metastasis. Clin Exp Metastasis, 19:233–46.

Shih JY, Yang SC, Hong TM et al. 2001. Collapsin response mediator protein-1 and the invasion and metastasis of cancer cells. J Natl Cancer Inst, 93:1392–400.

Shipp MA, Ross KN, Tamayo P et al. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med, 8:68–74.

Susztak K, Sharma K, Schiffer M et al. 2003. Genomic strategies for diabetic nephropathy. J Am Soc Nephrol, 14(Suppl 3):S271–8.

Takahashi M, Sugimura J, Yang X et al. 2003. Gene expression profiling of renal cell carcinoma and its implications in diagnosis, prognosis, and therapeutics. Adv Cancer Res, 89:157–81.

Tusher VG, Tibshirani R, Chu G. 2001. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA, 98:5116–21.

Ueno S, Ohki R, Hashimoto T et al. 2003. DNA microarray analysis of in vivo progression mechanism of heart failure. Biochem Biophys Res Commun, 307:771–7.

Urbanowska T, Mangialaio S, Hartmann C et al. 2003. Development of protein microarray technology to monitor biomarkers of rheumatoid arthritis disease. Cell Biol Toxicol, 19:189–202.

van’t Veer LJ, Dai H, van de Vijver MJ et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–6.

Vrana KE, Freeman WM, Aschner M. 2003. Use of microarray technologies in toxicology research. Neurotoxicology, 24:321–32.

Welford SM, Gregg J, Chen E et al. 1998. Detection of differentially expressed genes in primary tumor tissues using representational differences analysis coupled to microarray hybridization. Nucleic Acids Res, 26:3059–65.

Welsh JB, Zarrinker PP, Sapinoso LM et al. 2001. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc Natl Acad Sci USA, 98:1176–81.

Zhang H, Ramanathan Y, Soteropoulos P et al. 2002. EZ-Retrieve: a web-server for batch retrieval of coordinate-specified human DNA sequences and underscoring putative transcription factor-binding sites. Nucleic Acids Res, 30:E121.
