Extending the BioArray Software Environment through PrognoChip-BASE


Anastasia Analyti, Haridimos Kondylakis, Dimitris Manakanatas, Manos Kalaitzakis, Dimitris Plexousakis

Institute of Computer Science, FORTH-ICS, Greece
Department of Computer Science, University of Crete, Greece

Dimitris Kafetzopoulos, Thanassis Margaritis

Institute of Molecular Biology and Biotechnology, FORTH-IMBB, Greece

The BioArray Software Environment (BASE) v1.2 [3] (http://base.thep.lu.se/) is a web-based information system for managing spotted DNA microarray experiments and the massive amounts of data they generate. In particular, it is a MIAME-compliant¹ database server and analysis platform, designed to be installed in any microarray laboratory and to serve many users simultaneously via the web. BASE v1.2 was developed at Lund University and is free software released under the GNU General Public License. It runs on a Linux server with a MySQL database backend and is written in PHP, Java, JavaScript, and C++.

In short, BASE v1.2 manages biomaterials (samples, extracts, labeled extracts), reporters (i.e., the oligonucleotides spotted on a microarray slide), and related annotations. Additionally, it manages the details of array production. When all related information is available, raw hybridization data sets (also called measured bioassay data in MAGE-OM, the MicroArray and Gene Expression Object Model²) can be stored. Several scanners and image processors are supported. Each step of a microarray experiment is associated with a protocol description. Raw (hybridization) data sets can be organized in Experiments and normalized. Normalization plug-ins are already installed, but users can also develop and install their own. Normalization is performed in a hierarchical way, so several normalization methods (such as lowess and median) can be applied to refine intermediate results.

PrognoChip-BASE extends BASE v1.2.16 to ease the biologist’s task and to provide additional functionality. In particular, several quality indicators have been added to Extracts, Labeled Extracts, and Hybridizations so that users can store and review the quality of their experiments. Specifically, the Extract form has been enriched with the 260/280 absorbance ratio field, which holds the ratio of the absorbance at 260 nm to that at 280 nm; the Distant/Proximal field, which declares whether the sample is degraded; and the Quantity Amplified/Quantity Used field, a ratio that denotes the extract’s amplification capability. Finally, extracts are enhanced with the ability to load the image of gel electrophoresis, which can be an indicator of RNA quality (see Figure 1). In the Labeled Extract form, the Base-to-Dye Ratio field and the 260, 280, 545, and 645 (nm) absorbance fields have been added, indicating the extent of labeling (see Figure 2). In the Hybridization form, a quality indicator has been added, indicating the general quality of the hybridization (see Figure 3).
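As an illustration of the two ratio fields on the Extract form, the following minimal Python sketch computes them from raw readings. The class name `ExtractQuality` and its attribute names are hypothetical and are not taken from the PrognoChip-BASE code; the sketch only restates the arithmetic described above.

```python
from dataclasses import dataclass


@dataclass
class ExtractQuality:
    """Hypothetical container for the readings behind two Extract quality fields."""
    a260: float                # absorbance measured at 260 nm
    a280: float                # absorbance measured at 280 nm
    quantity_amplified: float  # quantity obtained after amplification
    quantity_used: float       # quantity used as input, in the same unit

    def ratio_260_280(self) -> float:
        """Return the 260/280 absorbance ratio."""
        if self.a280 <= 0:
            raise ValueError("absorbance at 280 nm must be positive")
        return self.a260 / self.a280

    def amplification_ratio(self) -> float:
        """Return Quantity Amplified divided by Quantity Used."""
        if self.quantity_used <= 0:
            raise ValueError("quantity used must be positive")
        return self.quantity_amplified / self.quantity_used
```

The readings would be entered by the user; the sketch deliberately applies no acceptance thresholds, since the text does not specify any.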

1 http://www.mged.org/Workgroups/MIAME/miame.html
2 http://www.mged.org/Workgroups/MAGE/mage.html


Furthermore, new raw data set fields have been added (see Figure 4). Reporter annotations have been extended with (i) the type³ of the reporter, (ii) the number of transcripts of the corresponding gene⁴, (iii) the description of the corresponding protein, (iv) the location of the reporter, and (v) ids in additional public databases, such as Ensembl⁵ and EMBL⁶ (see Figure 5).

Gene Ontology⁷ (GO) is a well-known ontology for the annotation of gene products in terms of the biological processes in which they participate, the particular molecular functions that they perform, and the cellular components in which they act. In particular, GO consists of three independent taxonomies, namely GO Biological Process, GO Molecular Function, and GO Cellular Component. Each ontology term is associated with a GO id and a human-readable GO name. Each gene product annotation should be accompanied by an evidence code, indicating the type of evidence that supports it.

In PrognoChip-BASE, reporters are further annotated with GO ids, names, and evidence codes. As seen in Figure 6, the GO annotations of a reporter in the three GO taxonomies are stored in the GO Biological Process, GO Molecular Function, and GO Cellular Component fields, respectively. Further, because each reporter can have multiple annotations in a GO taxonomy, each GO annotation field should be a semicolon-separated (;) string, where each substring has the following form:

GO_id | GO_name | evidence_code

and any part (e.g., "| evidence_code") is optional. For example, all of the following forms are valid:

GO:0006928|cell motility|IMP ; GO:0005515|protein binding|IC
GO:0006928|cell motility ; GO:0005515|protein binding|IC
GO:0006928 ; GO:0005515|protein binding|IC
|cell motility|IMP ; GO:0005515|protein binding|IC
|cell motility| ; GO:0005515|protein binding|IC
|cell motility ; GO:0005515|protein binding|IC

This format is not a HARD constraint. However, it guarantees that the GO annotations of a reporter will be correctly stored in the (new) tables:

ReporterGOBioproc (reporterid, goid, goname, evidencecode),
ReporterGOmolfun (reporterid, goid, goname, evidencecode),
ReporterGOcellcomp (reporterid, goid, goname, evidencecode).
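To make the field format concrete, here is a minimal Python sketch of how one GO annotation field could be split into rows shaped like the tables above. It is not the PrognoChip-BASE implementation: the names `GOAnnotation`, `parse_go_field`, and `to_rows` are hypothetical, and treating empty parts as missing values (`None`) is an assumption.

```python
from typing import List, NamedTuple, Optional, Tuple


class GOAnnotation(NamedTuple):
    goid: Optional[str]
    goname: Optional[str]
    evidencecode: Optional[str]


def parse_go_field(field: str) -> List[GOAnnotation]:
    """Split a semicolon-separated GO field into (goid, goname, evidencecode) triples."""
    annotations = []
    for entry in field.split(";"):
        parts = [p.strip() for p in entry.split("|")]
        parts += [""] * (3 - len(parts))  # pad entries with omitted trailing parts
        # Assumption: an empty part means "not provided" and is stored as None.
        goid, goname, evidence = (p or None for p in parts[:3])
        if goid or goname or evidence:    # skip entries that are entirely empty
            annotations.append(GOAnnotation(goid, goname, evidence))
    return annotations


def to_rows(reporterid: str, field: str) -> List[Tuple]:
    """Build (reporterid, goid, goname, evidencecode) rows for one GO taxonomy."""
    return [(reporterid, *annotation) for annotation in parse_go_field(field)]
```

For instance, `parse_go_field("|cell motility| ; GO:0005515|protein binding|IC")` yields one annotation with only a name and one with all three parts filled in.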

Though these tables are not used by PrognoChip-BASE itself, they can be useful to other collaborating subsystems. For example, in PrognoChip Mediator⁸ [1], the user can filter the reporters appearing in interesting gene expression profiles (and stored in a PrognoChip-BASE database), based on their associated GO Biological Process, GO Molecular Function, and GO Cellular Component names and evidence codes (see http://www.ics.forth.gr/isl/projects/PrognoChip).

3 A reporter is characterized as common, partial common, or individual overlapping, if it represents all, a subset, or only one transcript of the gene, respectively.
4 If the transcript count is greater than 1, then the gene has alternatively spliced transcripts.
5 http://www.ensembl.org
6 http://www.ebi.ac.uk/embl/
7 http://www.geneontology.org/
8 PrognoChip Mediator is the mediator component of the PrognoChip Integrated Clinico-Genomic Environment [2], through which the integration of the clinical and genomic information subsystems is achieved.

Experiments participating in a main study are called study experiments and are marked with a special flag (see Figure 7). For these experiments, an integrity constraint verifies that the same (cancerous and “reference”) samples are used in all participating hybridizations and that the designs of the participating arrays are different. This guarantees that a study experiment in PrognoChip-BASE corresponds to a single wet lab experiment. In particular, the following WARNINGS and HARD constraints are applied to study experiments.

• The Array Designs corresponding to the Raw Data Sets assigned to a Study Experiment should be pairwise different. Additionally, all Raw Data Sets assigned to a Study Experiment should have an associated Array Design. If this constraint is not satisfied, a WARNING is displayed.
• There should be no more than one Sample per Label in a Study Experiment. If this constraint is not satisfied, a WARNING is displayed.
• The dates of all Hybridizations of a Study Experiment should be inserted. This is a HARD constraint.
• All BioAssaySets in each Study Experiment should have distinct names. If this constraint is not satisfied, a WARNING is displayed.

Finally, the “List” views of menu items have been enhanced with additional descriptors, and sorting on these descriptors is provided. For example, in the “List” view of Experiments, the descriptor Study Experiment has been added, indicating whether an experiment is a study experiment (see Figure 8).

PrognoChip-BASE is released under the GNU General Public License and can be downloaded from SourceForge.net (http://sourceforge.net/projects/prognochip-base/). Details about the installation can be found in the Documentation folder of the package. A detailed presentation of PrognoChip-BASE is provided at: http://www.ics.forth.gr/~analyti/PrognoChip/PrognoChip_BASE_presentation.ppt
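The four study-experiment checks listed earlier (three WARNINGs and one HARD constraint) can be sketched in Python as follows. This is an illustrative model only, not the PrognoChip-BASE code: the `RawDataSet` class, its attributes, the function `check_study_experiment`, and the message texts are all hypothetical, and each raw data set is assumed to carry its array design, its hybridization date, and a label-to-sample mapping.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class RawDataSet:
    """Hypothetical view of one raw data set assigned to a study experiment."""
    name: str
    array_design: Optional[str] = None
    hybridization_date: Optional[str] = None
    samples_by_label: Dict[str, str] = field(default_factory=dict)  # label -> sample


def check_study_experiment(
    raw_data_sets: List[RawDataSet], bioassayset_names: List[str]
) -> Tuple[List[str], List[str]]:
    """Return (warnings, hard_violations) for one study experiment."""
    warnings, hard = [], []

    # WARNING: every raw data set has an array design, and designs differ pairwise.
    designs = [r.array_design for r in raw_data_sets]
    if any(d is None for d in designs):
        warnings.append("every Raw Data Set should have an associated Array Design")
    known = [d for d in designs if d is not None]
    if len(set(known)) != len(known):
        warnings.append("Array Designs should be pairwise different")

    # WARNING: no more than one sample per label across the whole experiment.
    samples = defaultdict(set)
    for r in raw_data_sets:
        for label, sample in r.samples_by_label.items():
            samples[label].add(sample)
    if any(len(s) > 1 for s in samples.values()):
        warnings.append("there should be no more than one Sample per Label")

    # HARD constraint: all hybridization dates must be present.
    if any(r.hybridization_date is None for r in raw_data_sets):
        hard.append("the dates of all Hybridizations must be inserted")

    # WARNING: BioAssaySet names must be distinct.
    if len(set(bioassayset_names)) != len(bioassayset_names):
        warnings.append("BioAssaySets should have distinct names")

    return warnings, hard
```

In such a design, a caller would presumably display the warnings and still accept the experiment, while refusing to save it if the second list is non-empty.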


Figure 1: Creation of a new extract

Figure 2: Creation of a new labeled extract


Figure 3: Creation of a new hybridization

Figure 4: Raw Data Set fields


Figure 5: Reporter annotations

Figure 6: Reporter GO annotations


Figure 7: Creation of a new experiment

Figure 8: “List” view of experiments

Acknowledgements: The development of PrognoChip-BASE was supported by National and EU funds within the contexts of the PrognoChip project (GSRT-EPAN) and the ACGT project (FP6-IST-026996).


References

1. Anastasia Analyti, Haridimos Kondylakis, Dimitris Manakanatas, Manos Kalaitzakis, Dimitris Plexousakis, George Potamias, Integrating Clinical and Genomic Information Through the PrognoChip Mediator, Procs. of the 7th International Symposium on Biological and Medical Data Analysis (ISBMDA-2006), 250-261, 2006, Springer-Verlag.

2. George Potamias, Anastasia Analyti, Dimitris Kafetzopoulos, Maria Kafousi, Thanassis Margaritis, Dimitris Plexousakis, Panagiota Poirazi, Martin Reczko, Yiannis Tollis, Elias Sanidas, Efstathios Stathopoulos, Manolis Tsiknakis, Stamatis Vassilaros, Breast Cancer and Biomedical Informatics: The PrognoChip Project, Proceedings of the 17th IMACS World Congress Scientific Computation, Applied Mathematics and Simulation, Paris, France, 2005, ISBN 2-915913-02-1.

3. L. H. Saal, C. Troein, J. Vallon-Christersson, S. Gruvberger, Å. Borg, and C. Peterson (2002), BioArray Software Environment: A Platform for Comprehensive Management and Analysis of Microarray Data, Genome Biology, 3(8): software0003.1-0003.6.