Analysis and comparison of metabolic pathway databases

17
Ulrike Wittig works as a research associate in the Scientific Database and Visualisation group of the European Media Laboratory GmbH (EML) Heidelberg, Germany. Her research interests include modelling and visualisation of biochemical pathways, information extraction from biological data sources and data integration from biological databases. Ann De Beuckelaer is a member of the Scientific Database and Visualisation Group of the European Media Laboratory GmbH (EML) in Heidelberg looking at the electronic representation of biochemical pathways. Her interest is the development of data models allowing the complete biochemical system to be represented for organisms of any kind. Keywords: database, pathway, metabolism, enzyme, genome, bioinformatics Ulrike Wittig, European Media Laboratory, Villa Bosch, Schloss-Wolfsbrunnenweg 33, 69118 Heidelberg, Germany Tel: 49 6221 533117 Fax: 49 6221 533298 E-mail: [email protected] Analysis and comparison of metabolic pathway databases Ulrike Wittig and Ann De Beuckelaer Date received: 10th January 2001 Abstract Enormous amounts of data result from genome sequencing projects and new experimental methods. Within this tremendous amount of genomic data 30–40 per cent of the genes being identified in an organism remain unknown in terms of their biological function. As a consequence of this lack of information the overall schema of all the biological functions occurring in a specific organism cannot be properly represented. To understand the functional properties of the genomic data more experimental data must be collected. A pathway database is an effort to handle the current knowledge of biochemical pathways and in addition can be used for interpretation of sequence data. Some of the existing pathway databases can be interpreted as detailed functional annotations of genomes because they are tightly integrated with genomic information. However, experimental data are often lacking in these databases. This paper summarises a list of pathway databases and some of their corresponding biological databases, and also focuses on information about the content and the structure of these databases, the organisation of the data and the reliability of stored information from a biological point of view. Moreover, information about the representation of the pathway data and tools to work with the data are given. Advantages and disadvantages of the analysed databases are pointed out, and an overview to biological scientists on how to use these pathway databases is given. INTRODUCTION Initially, biological databases were created to store information gained during a particular project. As a consequence, all the data are stored in large and widely distributed databases, with heterogeneous data formats and access mechanisms. In most of the review articles published on biological databases 1–3 a critical analysis of the data is missing. However, to extract and use data from heterogeneous data sources it is essential for a user to have information about the origin of the data, updates, redundancies and reliability. This paper primarily focuses on pathway databases and critically analyses their structure, the content of information being stored and especially the reliability of the data. Pathway databases are closely related to enzyme databases and this means that some of the enzyme databases were also checked according to their biological content. The fast-growing amount of data, creation of new databases and changes in existing databases on the Internet means that the analysis is inevitably incomplete. PATHWAY DATABASES Databases developed in the last decades can be classified into different categories, including genome databases, protein databases, enzyme databases, pathway databases, literature databases and some very specialised databases. This classification of databases is often based on their biological content. Although the content of the databases is mostly restricted to specific biochemical compounds or functions, a lot of overlap occurs. For example protein sequence information is listed in genome databases and functional data of corresponding enzymes can be found in protein and pathway databases. While analysing databases containing information about biochemical reactions a distinction must be made between 126 & HENRY STEWART PUBLICATIONS 1467-5463. BRIEFINGS IN BIOINFORMATICS. VOL 2. NO 2. 126–142. MAY 2001 Downloaded from https://academic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

Transcript of Analysis and comparison of metabolic pathway databases

Ulrike Wittig

works as a research associate

in the Scienti®c Database and

Visualisation group of the

European Media Laboratory

GmbH (EML) Heidelberg,

Germany. Her research

interests include modelling and

visualisation of biochemical

pathways, information

extraction from biological data

sources and data integration

from biological databases.

Ann De Beuckelaer

is a member of the Scienti®c

Database and Visualisation

Group of the European Media

Laboratory GmbH (EML) in

Heidelberg looking at the

electronic representation of

biochemical pathways. Her

interest is the development of

data models allowing the

complete biochemical system

to be represented for

organisms of any kind.

Keywords: database,pathway, metabolism,enzyme, genome,bioinformatics

Ulrike Wittig,

European Media Laboratory,

Villa Bosch,

Schloss-Wolfsbrunnenweg 33,

69118 Heidelberg,

Germany

Tel: 49 6221 533117

Fax: 49 6221 533298

E-mail:

[email protected]

Analysis and comparison ofmetabolic pathway databasesUlrike Wittig and Ann De BeuckelaerDate received: 10th January 2001

AbstractEnormous amounts of data result from genome sequencing projects and new experimental

methods. Within this tremendous amount of genomic data 30±40 per cent of the genes being

identi®ed in an organism remain unknown in terms of their biological function. As a

consequence of this lack of information the overall schema of all the biological functions

occurring in a speci®c organism cannot be properly represented. To understand the functional

properties of the genomic data more experimental data must be collected. A pathway database

is an effort to handle the current knowledge of biochemical pathways and in addition can be

used for interpretation of sequence data. Some of the existing pathway databases can be

interpreted as detailed functional annotations of genomes because they are tightly integrated

with genomic information. However, experimental data are often lacking in these databases.

This paper summarises a list of pathway databases and some of their corresponding biological

databases, and also focuses on information about the content and the structure of these

databases, the organisation of the data and the reliability of stored information from a

biological point of view. Moreover, information about the representation of the pathway data

and tools to work with the data are given.

Advantages and disadvantages of the analysed databases are pointed out, and an overview to

biological scientists on how to use these pathway databases is given.

INTRODUCTIONInitially, biological databases were created

to store information gained during a

particular project. As a consequence, all

the data are stored in large and widely

distributed databases, with heterogeneous

data formats and access mechanisms. In

most of the review articles published on

biological databases1±3 a critical analysis of

the data is missing. However, to extract

and use data from heterogeneous data

sources it is essential for a user to have

information about the origin of the data,

updates, redundancies and reliability. This

paper primarily focuses on pathway

databases and critically analyses their

structure, the content of information

being stored and especially the reliability

of the data. Pathway databases are closely

related to enzyme databases and this

means that some of the enzyme databases

were also checked according to their

biological content. The fast-growing

amount of data, creation of new databases

and changes in existing databases on the

Internet means that the analysis is

inevitably incomplete.

PATHWAY DATABASESDatabases developed in the last decades

can be classi®ed into different categories,

including genome databases, protein

databases, enzyme databases, pathway

databases, literature databases and some

very specialised databases. This

classi®cation of databases is often based on

their biological content. Although the

content of the databases is mostly

restricted to speci®c biochemical

compounds or functions, a lot of overlap

occurs. For example protein sequence

information is listed in genome databases

and functional data of corresponding

enzymes can be found in protein and

pathway databases.

While analysing databases containing

information about biochemical reactions a

distinction must be made between

1 2 6 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

enzyme databases and the more complex

pathway databases. Enzyme databases

mainly contain information about

enzymes and their properties. On the

other hand pathway databases re¯ect

information about reactions and pathways

in general, data related to organism-

speci®c information about genes, their

related gene products, protein functions,

expression data, data about enzymatic

activities, kinetic data, etc. In most cases

the above-mentioned enzyme databases

are closely connected with pathway

databases (Figure 1).

Pathway databases can also be

subclassi®ed into databases containing

metabolic and/or signalling pathways.

This review will mainly concentrate on

metabolic pathway databases because it is

dif®cult to compare the two types of

content. Currently, the most commonly

used pathway databases include WIT,

KEGG, EcoCyc, ExPASy±Biochemical

Pathways, PathDB and UM-BBD

(Table 1). These pathway databases were

selected and analysed particularly with

regard to their biological/biochemical

contents, data structures and data

representation.

One major advantage of pathway

databases over other biological databases is

the possibility of providing several types

of information in the context of the

graphical representation of pathways. For

example pathway databases are able to

represent the high complexity of all the

biochemical reactions within a single cell

or a complete organism. There are

different ways to show a graphical view of

a pathway. Graphical representations of

the pathways are drawn manually and are

non-dynamic (KEGG, WIT) or are

produced automatically on demand and

are user-dependent (PathDB, EcoCyc).

For a user-dependent interaction, ¯exible

representations will be more and more

important because they enable different

levels of information to be represented.

To avoid information overload, but on

the other hand to be able to differentiate

between general overviews and detailed

representations of reactions or pathways

containing structural formulas of chemical

compounds, users should be able to zoom

into their speci®c level of interest. A

challenge of pathways representation is to

combine a ¯exible drawing of pathways,

with the content behind all the elements

within that given pathway, without

loosing clarity. Therefore the following

pathway databases were checked for their

graphical representation, including issues

about the availability of automatic

drawings and the possibility of user-

dependent changes and interactions.

Furthermore, tools available to work with

the data were analysed.

WIT (WHAT IS THERE?)Content of the databaseThe WIT system connects data about

genes and genomes, enzymes, reactions

and pathways.4,5 The Enzyme and

Metabolic Pathways (EMP) database is

currently embedded within WIT. The

EMP data are derived from about 17,000

literature sources and are organised in a

number of tables. Organism-speci®c

information was extracted manually from

the literature, including protocols of

experiments (puri®cation steps, storage

conditions, etc.) and reaction mechanisms

what is there?

Figure 1: The connectivity of the main database types

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 2 7

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

(de®ned as a network of elementary

reactions).

Access to the dataWIT offers a general functional overview

as one possible starting point represented

as a classi®cation table of pathways.

Unfortunately this classi®cation is not

consistent with other pathway databases.

For instance WIT has many different

types of glycolysis dependent on the initial

substrate and/or ®nal product. To ®nd a

speci®c pathway some detailed

information about the pathway of interest

is necessary. Since pathway names contain

a mixture of abbreviations and systematic

names6 they are often very long and

incomprehensible.

Querying the EMP database for

enzymes by name or EC number can be

improved by de®ning compartments, cell

types, tissues or organisms. Pathways can

be searched by selecting compartments for

initial substrate, intermediate and ®nal

products, etc. Furthermore, organisms,

cell types or tissues can also be selected in

the general query system. Since the query

results contain internal accession numbers

for enzymes and abbreviated names for

pathways they are dif®cult to interpret.

Data representationThe `View Annotation' window gives

names, EC number and functional

description of an enzyme. Graphical

representations contain comments and

information on the compartments where

the reaction occurs and links to

information about the enzyme. The

descriptive page of enzymes is linked to

KEGG, EMP and Medline. However, the

enzyme information derived from KEGG

contains some slight differences from the

original KEGG project, eg links to KEGG

Table 1: Pathway databases and relevant databases for data retrieval in pathway databases

Classi®cation Content Example databases

Pathway databasesPathway databases Metabolic and/or signalling

pathways, reaction participants,relations between genes,enzymes and reactions

KEGGWITEcoCyc/MetaCycExPASy±Biochemical Pathways

CSNDBPathDBUM-BBDSPAD

http://www.genome.ad.jp/kegg/kegg.htmlhttp://wit.mcs.anl.gov/WIT2/http://ecocyc.pangeasystems.com/ecocyc/http://expasy.proteome.org.au/cgi-bin/search-biochem-

indexhttp://geo.nihs.go.jp/csndb/http://www.ncgr.org/software/pathdb/http://www.labmed.umn.edu/umbbd/index.htmlhttp://www.grt.kyushu-u.ac.jp/spad

Relevant databases often used for data retrieval in pathway databasesGenome databases Nucleotide sequences,

functional annotationsGenBankEMBL DatabaseDataBank of JapanTIGRMGDBFlyBase

http://www.ncbi.nlm.nih.gov/http://www.ebi.ac.uk/embl/index.htmlhttp://www.ddbj.nig.ac.jp/http://www.tigr.org/tdb/http://mbgd.genome.ad.jp/http://¯y.ebi.ac.uk:7081

Protein and protein-related databases

Protein sequences and theircharacteristics, functionalannotations

SWISS-PROT/TrEMBLPROSITESCOPPDBPedantTransfacPIR

http://www.ebi.ac.uk/swissprot/index.htmlhttp://www.expasy.ch/prosite/http://scop.mrc-lmb.cam.ac.uk/scop/http://www.rcsb.org/pdb/http://pedant.mips.biochem.mpg.de/http://transfac.gbf-braunschweig.de/TRANSFAC/index.htmlhttp://www-nbrf.georgetown.edu/pirwww

Enzyme databases Enzymatic properties ofproteins

ExPASy±ENZYMEBRENDAEMP

http://www.expasy.ch/enzyme/http://www.brenda.uni-koeln.de/http://wit.mcs.anl.gov/EMP

Chemical databases Chemical properties ofcompounds

BeilsteinChemFinder

http://www.beilstein.com/http://www.chem®nder.com

Literature database Abstracts of scienti®c literature Medline/PubMed http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db�PubMed

1 2 8 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

maps, 3D-structure information from

PDB (Protein Data Bank) and links to the

BRENDA enzyme database are removed,

but accession numbers to the PIR

(Protein Information Resource) database

are given. The link to the EMP database

allows the retrieval of literature-related

and organism-speci®c enzyme

information.7

The `Diagram Data' window contains

detailed information on the overall

balance of the reaction, every compound

involved in the reaction and the location

or compartment of the cell where the

reaction occurs. These data are used for

the graphical representation of the

pathways that are represented in the

`Diagram Picture' view. The `Assertions

Table' contains information about

existing enzymes of a given pathway in

complete sequenced organisms.

One major disadvantage of the WIT

system is that there is no graphical

representation of all pathways in a general

overview.

Reliability/inconsistencyWIT uses a very comprehensive schema

for classifying biochemical pathways.

Unfortunately this classi®cation system is

not comparable with classi®cations used

in other databases (Figure 2). Going into

the detailed list of pathways the user is

confronted with very long pathway names

that contain abbreviations of start and/or

®nal compound of the pathway. For

example the catabolism of glucose (from

glucose to pyruvate) is named

`GLCPYR.CAT' but names such as

`GLCPYRADPATPNAD.CAT' also

occur, where it is very dif®cult to

understand the meaning.

Additional toolsHomology searches for genes or enzymes

and a pathway search option are

implemented. The result of these queries

contains only enzymes and the number of

matches for each pathway name but no

graphical representation.

KEGGContent of the databaseKEGG (Kyoto Encyclopedia of Genes

and Genomes) contains all known

metabolic pathways and a limited number

of regulatory pathways and transport

mechanisms.8,9 The KEGG system

consists of three main databases which are

tightly connected: LIGAND with

information about compounds, enzymes

and reactions stored in ¯at ®les,10,11

PATHWAY, which contains the

graphical representations of the pathways

and lists of enzymes and reactions within

the pathways, and GENES, which

contains organism-related genome and

gene information and lists of genes within

an organism and pathway. Furthermore

KEGG provides many links to other

databases that are integrated within the

DBGET integrated database retrieval

system.12

KEGG's gene and genome data are

based on genome databases containing all

publicly available nucleotide sequences

and their functional assignments.

Biochemical data of pathways, reactions,

etc., for instance compound and enzyme

names, reactions or comments, are based

on IUPAC (International Union of Pure

and Applied Chemistry) and IUBMB

(International Union of Biochemistry and

Molecular Biology) recommendations.13

Chemical structures of compounds are

stored in two different formats: a GIF

image format and 2D coordinates stored

in an MDL-MOL ®le (®le format for

molecular structures from MDL

Information Systems Inc.), which can be

used to launch a proper ISIS/Draw

application.

Access to the dataKEGG's main menu offers several

different ways to enter the pathway

systems: one can select an organism from

a list of complete or partially sequenced

organisms, or select a pathway from the

classi®cation of pathways (subdivided into

metabolic and regulatory pathways), or

search for a speci®c compound or enzyme

Kyoto Encyclopedia ofGenes and Genomes

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 2 9

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

in the LIGAND database or for a gene in

the GENE database. The query interfaces

do not allow complex queries; for

example, a query for all known enzymes

that use a de®ned substrate.

The general overview of all pathways

using a colour-code according to the

classi®cation of pathways enables the

initial entry of the program. This allows

access to speci®c pathway classes, eg

carbohydrate metabolism. A window

opens which represents links to all speci®c

pathways within carbohydrate

metabolism.

Data representationPathways in KEGG are classi®ed

according to the chemical structures of

their main compounds, eg carbohydrates,

lipids, amino acids. All speci®c pathway

maps and overviews are manually drawn

pictures where pathway maps consist of

links to speci®c information about

compounds, enzymes and genes. Pathway

maps contain all known reactions

catalysed by proteins/enzymes derived

from gene products. Non-enzymatic

reactions are not always included in that

system. However, in some cases non-

enzymatic reactions are shown within a

pathway map but do not have links to

detailed information of this speci®c

reaction. Reactions within the pathway

maps do not represent side compounds,

eg ATP (adenosine triphosphate) or

NADH (reduced nicotinamide adenine

dinucleotide). Pathway maps are also

linked to some other related pathways

connected by their contributing

compounds. This allows the user to get an

overview about connections to other

pathways, eg one main ®nal product of

glycolysis ± pyruvate ± is used in the

different pathways of amino acid

metabolism. Unfortunately not all

possible connections are given.

As well as the reference pathway the

user can select pathway information of a

speci®c organism. Within the reference

pathway all the enzymes that are known

to be expressed in the speci®c organism

(found by homology search) are

highlighted in green. Furthermore other

selections for speci®c information are also

available,10 eg highlighting of all enzymes

with identi®ed 3D structure, enzymes

with sequences in the SWISS-PROT,

GenBank or PIR databases, and enzymes

that have a link to genetic diseases in the

OMIM (Online Mendelian Inheritance in

Man) database.

Genome maps are represented as

graphics that give information about gene

positions and their relationship with the

pathways. Gene catalogues contain

hierarchical texts that include all known

genes for each organism, listed according

to the pathway classi®cation. Information

about orthologous genes are stored in

tables containing the information of a

conserved, functional unit in a pathway, a

comparative list of genes for the

functional unit in different organisms, and

the positional information of genes

clustered in each genome.

Molecule catalogues are represented by

hierarchical texts containing

macromolecules (including proteins and

RNAs) and small chemical compounds.

These data are based on classi®cations of

enzymes and chemical compounds.

Enzymes are classi®ed based on IUPAC/

IUBMB recommendations and

additionally according to PIR

superfamilies, SCOP (Structural

Classi®cation of Proteins) 3D-folds or

PROSITE (database of protein families

and domains) motifs. The pathway query

result shows all pathways containing a

given enzyme or compound but offers no

graphical representation. Organism-

speci®c information is also not available

within such results. Searching for a

pathway by selecting a speci®c organism

does not provide information about the

enzymes available but only links to the

gene information related to that organism.

A comparison of pathways of two or

more organisms cannot be implemented.

Enzyme pages contain the following

information: names, EC classi®cation

according to IUBMB, reactions and their

participants (substrates, products,

cofactors, inhibitors, etc.), pathways,

1 3 0 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

related genes in organisms and links to

diseases (OMIM), motifs (PROSITE) and

3D structures (PDB).

Compounds are represented with

names, chemical and structural formula,

pathways in which the compound occurs

and enzymes that catalyse a reaction

containing the compound as participant.

Reliability/inconsistencyKEGG does not contain all possible

reactions catalysed by an enzyme. When

comparing reaction equations with the

lists of substrates and products,

inconsistencies sometimes arise, eg the

substrate list of pyruvate kinase (EC

2.7.1.40) contains more compounds than

those listed in the reaction equations.

Often reactions catalysed by an enzyme,

eg alcohol dehydrogenase (EC 1.1.1.1),

with a wide substrate speci®city are given

as general reaction equations, which

means loss of detailed information

concerning the correlating products of

each substrate. Moreover, enzymes are

represented with their substrates and

products, but reaction equations and

pathway information are missing, eg

peptidases.

Within compounds some errors and

inconsistencies arise. For example

sometimes one compound ID is given for

two different compounds (eg C00023 for

both Fe2� and Fe3�) or different

compound IDs are given for the same

compound (C02038 and C02156 for

glycyl-peptide or C04230 and C01174 for

1-acyl-glycero-3-phosphocholine among

others).

Additional toolsKEGG offers some tools for querying

within different maps and also methods to

compute with KEGG data. Tools for

searching enzymes (EC numbers ± names

not possible), compounds (compound ID

± names not possible) or genes (gene

name or accession number) are available.

Within genome maps queries for gene

positions or homologous gene clusters are

available as well as comparisons of two

genome maps for exhaustive search of

homologous gene clusters. Tools for

colouring and thereby highlighting the

genes in the pathway or genome map

make the system user-friendly and enable

a quick overview on the result of the

query.

One unique feature of the KEGG

system is the creation of a pathway

between two given compounds.10 As a

starting point, two compound IDs, the

parameter level and hierarchy of

relaxation have to be de®ned.

Unfortunately there is no help function or

documentation about the parameters.

Based on binary relations two algorithms

(Dijkstra/Floyd) are used to ®nd the

shortest way between two compounds. In

such cases where a reaction has two or

more substrates the implementation is not

able to distinguish between different

substrates and their correlating products

or compartments and transport

mechanisms. Pathways are created based

on one main substrate without paying

attention to side substrates or products,

compound/enzyme locations and

cofactors. Results of the pathway creation

contain only the compound ID and EC

number. For users who are not familiar

with the compound IDs used in KEGG it

is very time-consuming to understand the

resulting pathway since there is no

graphical representation of the search

results. This tool does not apply

organism-speci®c searches and creates

pathways containing all possible reactions

regardless of whether the speci®c enzyme

has been identi®ed in the speci®c

organism or not. Highlighting these

identi®ed enzymes and proven pathways

among them would be very helpful.

Automatic reconstruction of organism-

speci®c metabolic pathways can be

accomplished by matching the genes

corresponding to the enzymes in the gene

catalogues with the enzymes on the

reference pathway diagrams.

ECOCYC/METACYCContent of the databaseEcoCyc is a metabolic-pathway database

that describes the genome and the

EcoCyc/MetaCyc

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 3 1

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

biochemistry of Escherichia coli based on

information from the EcoGene database,14

SWISS-PROT and the scienti®c

literature.15,16 The database consists of all

sequences and functional annotations of

E. coli genes. Pathways of E. coli and its

reactions and enzymes are annotated with

references to the literature. In addition

MetaCyc describes pathways, reactions

and enzymes of a variety of organisms,

with a microbial focus. But MetaCyc does

not contain organism-speci®c genome or

protein information such as genomic

maps or sequences. MetaCyc uses the

same database schema and visualisation

software as EcoCyc.17

Access to the dataThe EcoCyc query interface provides

search options for E. coli genes, proteins,

reactions, compounds and pathways by

names, sub-strings, classi®cation

hierarchy, EC number or chemical

structure. In contrast to EcoCyc the

MetaCyc query interface only offers

searching for pathways, reactions and

compounds. Proteins or genes can be

selected for querying but no matches will

be found by searching, for example, EC

numbers. The EcoCyc user is also able to

start the system from an E. coli metabolic

overview in addition to the query page.

MetaCyc offers no general pathway

overview. The EcoCyc overview is a

bird's-eye view of the complete E. coli

biochemical pathways.3

The glycolysis and TCA (tricarboxylic

acid) cycles in the middle separate

predominately catabolic pathways on the

right from pathways of anabolism and

intermediary metabolism on the left.

Signalling pathways run along the bottom

of the ®gure. Individual reactions at the

right of the ®gure represent reactions not

currently assigned to any pathway.

However, these single reactions without a

classi®cation are not very helpful in the

overview. The picture uses different icons

(circles, squares, triangles, etc.) for

different compound classes. If a mouse

pointer is assigned to a circle in the

diagram, the name of the corresponding

metabolite and the name of the related

pathway are displayed at the bottom of

the www browser window. This

operation will work only for JavaScript-

enabled browsers. If such a system is not

available, the user needs biochemical

knowledge to recognise which graph can

represent the pathway. In that case the

metabolic overview will not help to get

detailed information. By clicking on a

compound the pathway containing the

selected compound will be displayed.

Unfortunately there is no way to get

compound information directly from the

overview window.

The metabolic overview is linked

directly to representations of speci®c

pathways or disconnected single reactions

but it is not possible to detect where the

pathway is located within the complex

overview. As in many other databases,

compartmentalisation and consequentially

transport mechanisms are missing.

Data representationPathways and reactions are drawn

automatically using graph-drawing

algorithms, and size may be adjusted. The

graphical representation of pathways

offers different levels of information. The

lowest level contains only the major

reactions of a given pathway. Information

about enzymes and genes, etc. is not

available. In the more detailed level all

enzymes with their corresponding EC

numbers and corresponding genes are

given. Some pathways show chemical

structures of the reactants on that level,

but this feature is not consistent. Some

pathways also include green arrows that

indicate regulatory mechanisms (dotted

lines) or connected pathways (solid lines).

Not all available connections between

pathways and reactions shown in the

metabolic overview picture are shown

within the pathway representation (eg

TCA cycle).

Within the graphical representation of

pathways, inhibitors and activators of

reactions are represented, but detailed

compound information is not accessible.

Names of the compounds are available in

1 3 2 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

the bottom line of the JavaScript-enabled

www browser window.

Enzyme information consists of names,

enzyme classi®cation, catalysed reactions

and pathways in which the enzyme

occurs. More data about the enzyme are

given by connecting to the ExPASy-

ENZYME database. Within EcoCyc a

graphical representation of the gene-

reaction schema shows the relation

between E. coli genes coding for enzymes

or subunits. It also includes reactions that

can be catalysed by more than one

enzyme. A linkage of the enzyme to the

related gene or vice versa is implemented.

The E. coli genes are characterised by

their names, sequences, classi®cation,

positions within the genome and

information about the corresponding

enzymes. Links to the E. coli Stock

Center Database (CGSC)18 and automatic

queries in GenBank are installed.

Compound properties are represented

with names, chemical formulas, molecular

weights, CAS (Chemical Abstracts Service)

number and structural formulas, but these

formulas are not based on uniform

drawings. In addition a list of reactions in

which the compound occurs is given.

Reliability/inconsistencyThe classi®cation of pathways refers to the

classi®cation of E. coli gene products,19,20

but is not used consistently. The

biosynthesis of amino acid families

includes, for example, the following

instances: `ile�val biosynthesis',

`leu�val�ala�ile biosynthesis' and

`leu�val�ile biosynthesis'. In such a case

a differentiation between these pathways

does not make sense.

Compounds are based on a de®ned

classi®cation, but this system of

compound classi®cation has some

discrepancies. For example it has the type

`unclassi®ed compounds' but within this

category compounds with known

classi®cation (eg phospholipids, steroids)

also occur. There is a category `animal

compounds' which includes hormones,

etc., but these compounds do not

participate in reactions of E. coli and are

therefore not included within speci®c

pathways. On the other hand, the

differences between animal compounds,

unclassi®ed and non-metabolic

compounds are unclear.

Additional toolsUnfortunately the query options are not

complex enough to search for reactions or

pathways by de®ning substrates, products,

etc.

One important feature of the EcoCyc

system is the possibility of importing

E. coli expression data and colour-coded

highlighting of reactions in the general

pathway overview according to the

relative or absolute amount of the gene

expressions.

PATHDBContent of the databasePathDB is a general metabolic pathway

project of NCGR (National Center for

Genome Resources) representing

compounds, metabolic proteins (enzymes

or transport proteins), metabolic steps

(reactions) and pathways.21 Information

about locations within cells or organisms

are also given if available. PathDB also

includes descriptions of the kinetic,

thermodynamic and physicochemical

properties of pathway components based

on the scienti®c literature. The

integration of these data into PathDB

offers a good basis for the connection

with simulation tools.

Access to the dataA query interface allows the user to ®nd

pathways, enzymes, reactions or

compounds in the database by searching

given attributes (all possible roles of

compounds, kinetic types, modi®ers,

taxonomy, citations, etc.). A Java

application can be downloaded to run

complex queries. This interface makes it

possible to navigate between the relations

in various entities.

Data representationThe graphical representation of pathways

is based on graph-drawing algorithms,

PathDB

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 3 3

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

including, hierarchical and circular views

using prede®ned positions for the

compounds. Zooming and scaling tools

are installed and the user can select speci®c

information. For instance enzymes can be

displayed either by their names, their EC

numbers or both. In addition the user can

choose whether primary and/or secondary

substrates and products are visible.

Depending on the amount of selected

information the representation can be very

complex and confusing because of

overlapping terms and arrows.

Compounds are characterised by their

names, chemical and structural formulas,

molecular masses and other

physicochemical data. The information

about chemical structures is stored as a

SMILES (Daylight Chemical Information

Systems, Inc.) string that can be used to

generate the 2D structure. Metabolic

proteins are characterised by their names,

modi®ers and physicochemical properties

and links are given to databases containing

sequence or structure information.

A reaction is de®ned by its substrates,

products and the stoichiometric equation.

They are displayed as metabolic steps that

also include transport mechanisms.

Pathways are built by sequences of

metabolic steps including their direction,

catalysts, modi®ers and location.

Modi®ers can also be subdivided into

activators and inhibitors.

Reliability/inconsistencyUnfortunately cross-references within the

pathway representation to other

connected pathways are missing. Besides

common and alternative names of a

compound, PathDB also contains a

speci®c compound label used for the

graphical representation. However, most

of the compound labels are not

commonly used abbreviations of

compound names.

Additional toolsOne feature of PathDB is a tool for a

user-dependent pathway generation

between two compounds, so the database

is not restricted to prede®ned pathways.

EXPASY±BIOCHEMICALPATHWAYSContent of the databaseExPASy (Expert Protein Analysis System)

of the Swiss Institute of Bioinformatics

consists of two databases related to

biochemical pathway information ±

ENZYME and Biochemical Pathways.22

ExPASy±Biochemical Pathways is the

electronic version of the Boehringer

Mannheim Biochemical Pathways

Chart.23 An electronic index allows the

localisation of any metabolite or enzyme

on the chart.

Access to the dataQuerying for compounds, enzymes or

pathways by names or substrings results in

a list of related positions (squares) on the

graphical map. If a reaction is not

complete within that resulting square the

reaction is cut and the corresponding

information is lost. One of the four

adjacent squares can be selected to expand

the view.

Most enzyme names on the chart act as

links to the ExPASy±ENZYME database.

Data representationUnfortunately, ExPASY does not

implement a general overview of all

pathways on the complete map.

ENZYME, like Biochemical Pathways,

does not include organism-speci®c

pathway information except for the usage

of three colours to distinguish pathways

related to higher plants (green), animals

(blue) or microorganisms (red). Black

arrows indicate general biochemical

pathways. The information about the

colour-coded representation of reactions

and compounds and other remarks related

to the pathway maps (eg explanations of

abbreviations) is only available on the

printed versions of the Boehringer

Mannheim wall chart.23,24 Since there is

no documentation within the electronic

version the user has to have one of the

printed versions for data interpretation in

the electronic version.

Enzymes on the electronic map are

ExPASy-biochemicalpathways

1 3 4 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

connected to the enzyme properties

stored in ExPASy±ENZYME containing

names, reaction properties and links to

other enzyme or pathway databases.

Compounds on the map are not linked to

any source of chemical database.

Reliability/inconsistencyWithin the graphical pathway

representation it is dif®cult to distinguish

the pathways according to their

classi®cation. Since only two squares of

the chart can be selected at the same time

it is not possible to view complete

pathways, eg glycolysis. On the electronic

as well as on the printed version the

pathway representation includes

redundancies because some enzymes or

compounds appear more than once on

the complete map. Links between

enzymes or compounds are not included,

which implies no connectivity between

reactions or pathways comprising the

same enzymes or compounds. This fact

disables an overview of the complexity

and connectivity of pathways.

Additional toolsThe ExPASy system does not offer any

tools to navigate on the pathway map or

to search for reactions or pathways by

de®ning substrates, etc.

UM-BBDContent of the databaseUM-BBD (University of Minnesota

Biocatalysis/Biodegradation Database)

contains information on reactions related

to biodegradation pathways of xenobiotic

compounds in microorganisms.25,26 It

provides information on pathways, single

reactions, chemical compounds including

intermediate states of the reactions,

organisms, enzymes and genes. The

system is connected to other resources

such as GenBank, ChemFinder, Medline,

etc.

Access to the dataA search function allows queries for

speci®c compounds or enzymes by

de®ning names or substrings, chemical

formula, CAS number or EC number

respectively. The search page also has

links to lists of pathways, compounds,

enzymes, reactions and organisms that can

be selected.

UM-BBD gives no general pathway

overview. Most of the pathways and

reactions are disconnected.

Data representationWhen appropriate citations in the

literature exist UM-BBD generates new

pathway, reaction and compound pages

based on the literature source. The

representation of a pathway contains the

names of compounds and reactions and

gives links to the more detailed

information of the compounds on their

corresponding pages. Starting compounds

are used to de®ne the name of a pathway.

Organisms are listed if they are known to

initiate this pathway or carry out later

steps. If both genus and species names are

known, the organism name links to a

search of a microbial database. The

pathway also contains possible cofactors.

If compounds occur that are parts of an

intermediary, metabolism pathway links

to the KEGG database are given. The

compound that links to KEGG is shown

as a red circle in the KEGG pathway map,

whereas KEGG does not link to the UM-

BBD on its pathway pages. A graphical

representation of a reaction contains the

reacting atoms and bonds highlighted in a

bold red font. If literature information

about the reaction mechanism is available

a link to a graphic showing the reaction

mechanism appears. There are two

different types of pathway representations:

a text format representation and a static

drawing, which is more restrictive.

Enzymes of a reaction are linked to

GenBank, when one or more hits have

been found. Furthermore enzyme pages

in the KEGG-LIGAND, in the ExPASy-

ENZYME and/or in the BRENDA

enzyme database are connected when a

four-digit EC number exists. If an

incomplete EC number occurs, ExPASy±

ENZYME gives a list of all possible

enzymes within that enzyme subclass. The

University of MinnesotaBiocatalysis/BiodegradationDatabase

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 3 5

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

enzyme pages contain a list of reactions

that are catalysed by the enzyme and links

to each of the static UM-BBD pathway

maps in which one or more reactions

catalysed by this enzyme appear.

The compound information includes

names and synonyms, chemical formula,

molecular mass, a SMILES format, CAS

number and links to ChemFinder. The

compound page also contains a list of all

reactions in the database which use or

produce this compound and links to the

UM-BBD pathway maps in which the

compound appears.

Reliability/inconsistencyOne important feature is the plausibility

of a reaction or pathway. Pathways with

less evidence or unknown intermediates

are indicated in several ways and

mentioned in the pathway introduction.

If compounds are plausible, but not

certain, or are intermediates, they are put

in brackets. Multiple arrows indicate

unknown intermediates. Within the

introduction of a pathway or reaction

page the signi®cance of the pathway is

given. The names and af®liations of the

people who contributed the pathway to

the UM-BBD are important. This feature

is unique for this database system and

allows the user to contact the person in

question directly.

Additional toolsA dynamically generated pathway map

starting from a de®ned reaction can be

generated and is shown only in text

format representation. The created

pathway maps include all possible

reactions from all organisms found in the

UM-BBD and link to each of the static

pathway maps in which the reactions

appear.

COMPARISON OF THEDATABASESA comparison of the analysed databases is

shown in Table 2. The simple molecule

`Ethanol' was selected as a criterion to

compare the database entries.

The result of this comparison points

out the following problems. WIT, KEGG

and ExPASy offer no speci®c queries for

compounds which, for example, results in

lists of enzymes or genes containing the

substring `ethanol'. Except for PathDB all

selected queries use the keyword as

substring and result in a list of entries.

From these lists `ethanol' can be selected

directly, except for WIT and ExPASy,

where the compound pages can only be

accessed by navigation, for example

through the enzyme pages. Unfortunately

in the case of `ethanol' ExPASy has no

implemented link to a corresponding

compound page. In general compound

information is available but not

consistently for all compounds.

The databases have different

preferences for contents within their

compound pages. Some of the databases

represent structural formulas or SMILES

strings. The databases also contain

different lists of names, which are written

with either capital or lower case letters.

Also within the same databases the types

of letters are used inconsistently. Only

KEGG and EcoCyc refer to pathways

containing this compound but the

pathways are assigned to different types of

classi®cations. The representation of

reactions containing the selected

compound is very inconsistent between

the databases. WIT has no reference to

reactions, KEGG refers to numbers of

enzymes that use the compound as

reactant (R), cofactor (C), inhibitor (I),

etc. and other databases refer to reaction

equations. One example of inconsistent

reactions within a database is shown in

EcoCyc. It contains reactions that use

`ethanol' as substrate and the resulting

product is `an aldehyde'. Ethanol is a

speci®c compound with molecular mass,

structural formula, etc. An aldehyde

represents a class of compounds that

contain the same functional group. Using

ethanol as the substrate only one possible

aldehyde (acetaldehyde) can be generated.

Note that, regarding the references on

the compound pages, UM-BBD gives

information about the author and date of

the entry. One piece of missing

Comparison ofdatabases

1 3 6 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

Tab

le2:

Com

par

ison

ofth

epat

hw

aydat

abas

es(W

IT,K

EG

G,E

xPA

Sy,E

coC

yc,P

athD

B,U

M-B

BD

)in

rega

rdto

the

sear

choptions

and

quer

yre

sults.

`Eth

anol'

isuse

das

keyw

ord

toac

cess

com

pound

info

rmat

ion

WIT

KE

GG

ExP

AS

yE

co

Cyc

Path

DB

UM

-BB

D

Searc

hfo

rco

mp

ou

nd

`eth

an

ol'

Quer

y®el

dan

dsp

eci®

cation

Sear

chal

ldat

aen

trie

s(n

oco

mpound

spec

i®c

quer

y)Q

uer

yth

eLI

GA

ND

dat

abas

eG

ener

alke

yword

sse

arch

(no

spec

i®ca

tion)

Quer

yco

mpound

by

nam

eQ

uer

yco

mpound

by

nam

eSe

arch

for

com

pound

whose

nam

eco

nta

ins

the

keyw

ord

Num

ber

ofh

its

35

genes

,46

pat

hw

ays,

23

enzy

mes

,21

topic

sin

funct

ional

ove

rvie

w

23

enzy

mes

,51

com

pounds

7en

zym

es,

8co

mpounds

23

com

pounds

1co

mpound

19

com

pounds

Acc

ess

toco

mpound

pag

eN

oq

uery

for

co

mp

ou

nd

savailab

le,a

cces

sto

com

pound

pag

esth

rough

nav

igat

ion

ofa

pat

hw

ay(e

get

han

olo

xid

atio

n)

and

dia

gram

pic

ture

Sele

ct`e

than

ol'

from

quer

yre

sults

No

acce

ssto

com

pound

pag

e(o

nly

posi

tion

on

the

pat

hw

aym

ap),

acce

ssto

com

pound

pag

esth

rough

enzy

me

pag

esw

her

eth

eco

mpound

isuse

din

the

reac

tion

No

tavailab

lefo

reth

an

ol!Se

lect

`eth

anol'

from

quer

yre

sults

Sele

ct`e

than

ol'

from

quer

yre

sults

Sele

ct`e

than

ol'

from

quer

yre

sults

Co

mp

ou

nd

page

info

:N

ames

Eth

anol,

ethyl

alco

hol,

met

hyl

carb

inol

Eth

anol,

ethyl

alco

hol,

met

hyl

carb

inol

±A

bso

lute

alco

hol,

alco

hol,

etO

H,e

th,e

thyl

alco

hol,

grai

nal

cohol

Alc

ohol,

ethan

ol,

ethyl

alco

hol,

ethyl

hyd

roxi

de,

grai

nal

cohol,

met

hyl

carb

inol,

pota

toal

coho

l,sp

irit,s

ynas

ol,

win

esp

irit,c

olo

gne

spir

it

Anhyd

rol,

alco

hol,

met

hyl

carb

inol,

den

ature

dal

coho

l,et

hyl

hyd

rate

,eth

ylhyd

roxi

de,

ethan

ol,

algr

ain,

colo

gne

spir

it,f

erm

enta

tion

alco

hol,

grai

nal

cohol,

jays

ol,

jays

ols

,mola

sses

alco

hol,

pota

toal

cohol,

spir

it,s

pir

its

of

win

e,te

csol,

alco

hol

deh

ydra

ted,e

than

ol2

00

pro

of,

colo

gne

spir

its

(alc

ohol),s

dal

coho

l23-h

ydro

gen,

Synas

ol

Chem

ical

form

ula

ÿC

2H

6O

±C

2H

6O

C2H

6O

C2H

6O

Stru

ctura

lform

ula

��

±�

ÿ�

SMIL

ES

stri

ng

CC

Oÿ

±ÿ

OC

CC

CO

CA

Snum

ber

ÿ64-1

7-5

±64-1

7-5

64-1

7-5

64-1

7-5

cont

inue

dov

erle

af

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 3 7

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

Tab

le2:

(Con

tinue

d) WIT

KE

GG

ExP

AS

yE

co

Cyc

Path

DB

UM

-BB

D

Mole

cula

rm

ass

ÿÿ

±46.0

69

46.0

846.0

7Pat

hw

ayÿ

Gly

coly

sis/

gluco

neo

gene

sis

±Fe

rmen

tation

ÿÿ

Enzy

me/

reac

tion

ÿ1.1

1.1

.6(R

)2.3

.1.1

52

(R)

3.1

.1.6

7(R

)3.5

.1.7

5(R

)

±A

ceta

ldeh

yde�

NA

DH�

ethan

ol�

NA

DN

AD�

ethan

ol�

NA

DH�

anal

deh

yde

or

keto

ne

NA

DP�

ethan

ol�

NA

DPH�

anal

deh

yde

NA

DP�

ethan

ol�

NA

DPH�

anal

deh

yde

H2O�

ure

than�

NH

3�

CO

2�

ethan

ol

H2O�

alo

ng

chai

n-a

cyl

ethyl

este

r�

ethan

ol

�a

fatt

yac

id

Eth

anol�

acet

aldeh

yde

Eth

anol�

aldeh

yde

Ure

than

e�

ethan

ol

�N

H3

Eth

ylac

etat

e�

acet

ate�

ethan

ol

Eth

anol�

acet

aldeh

yde

Ref

eren

ce/li

nk

NSC

num

ber

:85228

Day

lightT

±±

±C

hem

Finder

Nat

l.T

oxic

olo

gyPro

gram

Pag

eA

uthor�

Dat

e

1 3 8 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

information on these compound pages in

most of the databases is a reference to the

classi®cation of compounds used, similar

to the classi®cation of enzymes on most

enzyme pages.

ERRORS ANDINCONSISTENCIESAs can be seen within the above-

mentioned examples there are always

problems, errors and inconsistencies

within one single database and between

different databases.

Until now it has not been possible to

automate the detection of errors within

databases. For example errors such as

wrong structural formulas of compounds

or different compound names of the same

compound which also cause two different

compound entries in the database can

only be detected manually by checking

each single entry.

Typically manual databases will always

contain typing errors. In most cases such

errors do not change the meaning of the

content. But certainly there are some

cases where a typing error changes the

property of a protein and enzyme. The

enzyme L-ribulose-5-phosphate 4-

epimerase (EC 5.1.3.4) catalyses the

reaction between L-ribulose-5-phosphate

and D-xylulose-5-phosphate, according to

IUBMB recommendations.13 Databases

such as BRENDA, KEGG, PathDB and

WIT/EMP use this equation with the

above-mentioned substrate and product.

In contrast, the protein database SWISS-

PROT and ExPASy±ENZYME associate

the same enzyme to a reaction between L-

ribulose-5-phosphate and D-xylose-5-

phosphate. Obviously the product name

D-xy(lu)lose-5-phosphate contains a

typing error that changes the catalytic

property of the catalysing enzyme. Since

several other databases extract their data

from this source, the wrong relation

between the compound and the enzyme,

including all the detailed information on

the reaction, is transferred and spread

throughout other database systems.

Copying or linking data causes the

transfer of errors from one database to

another. Such errors are relatively easy to

detect for an expert user. In contrast

typing errors within gene or protein

sequences are not usually so obvious.

Other problems that occur within

databases are errors from the original

source or experimental problems. These

errors include sequencing errors, errors

from homology searches, structure

prediction, etc. and incorrect, incomplete

or wrong transfers from literature

information into the database. The last

case occurs by incorrect or incomplete

interpretation of experimental results or

by drawing wrong conclusions from the

literature sources that are incorporated

manually. The risk here is the decrease in

reliability in the statements and contents

of such systems. One way to avoid such

problems is to add the source of data and

the reference, which gives the database

user the chance to see the original results

from a homology search, etc. The

comparisons of predicted data between

databases are very error-prone because of

the use of different tools for prediction.

This especially affects data about 3D

structures, motifs, molecular mass or

isoelectric points of protein sequences.

Both database user and database

developer must remember that

information is always context-

dependent.27 Data cannot be generalised

because of the dependence on

environmental conditions. An important

example is the recent expression data. In

this case the environmental conditions

and methods used are essential for the

interpretation of the data and are often

missing in the database. In the worst case

these data will never be trusted by expert

users and the database therefore virtually

useless.

Inconsistencies can be found by

comparing database entries from different

databases. Typically problems of

inconsistencies arise from names and

abbreviations used in different databases.

There is no consistent nomenclature for

genes and proteins unlike the enzyme

nomenclature recommended by the

IUBMB.13 Databases refer to different

errors andinconsistencies

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 3 9

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

names, synonyms and abbreviations,

depending on the source of the data

including different types of spelling.

In addition to the names no consistent

nomenclature for functional descriptions

of genes and proteins exists. This

inconsistency is also transferred to

inconsistent classi®cations of pathways

(Figure 2). One example for the

classi®cation of functions came up with

the classi®cation of E. coli gene products

established by Monica Riley.19,20 A

de®ned nomenclature of functions similar

to the enzyme classi®cation is certainly a

great advantage.28 That should also imply

a standard de®nition of keywords within

comment lines ± understandable and

usable for different types of users. In this

context a biochemical classi®cation of

chemical compounds has to be installed.

There are different ways to classify

chemicals: either by the chemical

properties recommended by the IUPAC

or by functional properties. The last

option is being employed by most of the

databases described above and causes a

large degree of variability. For systems

representing biochemical pathways a

¯exible classi®cation is needed which

ideally combines chemical and functional

characteristics of the compound but still

has to be consistent throughout the

system.

A demonstrative representation of the

connectivity of the main database types is

given in Figure 1. Genome and chemical

databases can be seen as the basis for all

other databases. Data from literature

databases enter other types of database.

Pathway databases are located above other

databases containing parts of data from the

databases below. Consequently pathway

databases might also contain the errors

made in the other databases. This

indicates the need for pathway database

developers to be especially aware of errors

and inconsistencies.

Expert knowledge is needed to avoid

or to detect errors and inconsistencies

within databases: a time-consuming and

expensive aspect.

SUMMARYIn general all analysed databases are far

from being complete. In terms of user-

friendly access and the amount of data

incorporated in the database the KEGG

system clearly deserves the highest praise.

The more or less consistent nomenclature

of the compounds based on international

agreements such as the IUPAC and

IUBMB, as well as the easy-to-understand

classi®cation of compounds and reactions,

makes this system an easy tool for ®rst

access to pathway databases. However, for

users familiar with the biochemical

pathways of E. coli and with a JavaScript-

enabled www browser, the EcoCyc

system offers the most information on that

speci®c organism since it also incorporates

direct links to the literature and

experimental data being used to set up the

system. For users interested in the

creation of new biochemical pathways

between different compounds the KEGG,

PathDB and the UM-BBD systems are

the most useful ones. Concerning

graphical representation of biochemical

pathways, those users already familiar with

the Boehringer Mannheim chart will

easily ®nd their way through the

ExPASy±Biochemical Pathways, whereas

absolute beginners will de®nitely prefer

the KEGG graphics.

One disadvantage of all these pathwayFigure 2: Comparison of the classi®cation of Glycolysis (Embden±Meyerhoff±Parnas pathway ± EMP) used in KEGG, EcoCyc and WIT

1 4 0 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

systems is the absence of help functions or

windows describing, for example, the

concepts and/or restrictions that one has

to de®ne when querying the data. In most

of the databases meta-information, for

example about the source or origin of the

data (eg experiments, homology search,

literature), is missing.29 The user is not

provided with information as to whether

the data are experimentally proven or not.

Since most of the data are predicted by

homology search, information about

algorithms and their parameters being

used for the homology search or the

degree of homology would be very useful

for data interpretation.

As regards problems of errors, sequence

databases were already analysed some

years ago.29 Recommendations were

given on how to solve problems of errors

and missing meta-information. These

suggestions made for sequence databases

should now be applied to pathway

databases.

A comparison of data of pathway

databases is complicated because of

different classi®cations of compounds,

genes, proteins, pathways and gene/

protein functions. Only the classi®cation

of enzymes recommended by the IUBMB

is used as a standard. One suggestion

would be to have more standards for

nomenclature. In database development a

standardisation of data structures and ®le

formats would make further applications

more powerful.

In conclusion, pathway database users

should be aware of many possible errors

within the databases and the large number

of inconsistencies between the databases.

The inconsistencies would cause problems

for example during the extraction of data

from different data sources. Developers of

pathway databases should focus more on

curation of the data. One suggestion

would also be to try to standardise their

database entries, for example, in the case

of classi®cation.

Acknowledgements

This work was supported by the Klaus Tschira

Foundation gGmbH (KTF) and the BMBF

(Project BioRegio 0312212). The authors thank

Andreas Kohlbecker and Isabel Rojas for

discussions and reviewing the manuscript.

References

1. Frishman, D., Heumann, K., Lesk, A. andMewes, H. (1998), `Comprehensive,comprehensible, distributed and intelligentdatabases: Current status', Bioinformatics, Vol.14 (7), pp. 551±561.

2. Karp, P. D. (1998), `Metabolic databases',Trends Biochem. Sci., Vol. 23 (3), pp. 114±116.

3. Karp, P. D., Krummenacker, M., Paley, S. andWagg, J. (1999), `Integrated pathway/genomedatabases and their role in drug discovery',Trends Biotechnol., Vol. 17 (7), pp. 275±281.

4. Overbeek, R., Larsen, N., Smith, W. et al.(1997), `Representation of function: The nextstep', Gene, Vol. 191, pp. GC1±GC9.

5. Overbeek, R., Larsen, N., Pusch, G. D. et al.(2000), `WIT: Integrated system for high-throughput genome sequence analysis andmetabolic reconstruction', Nucleic Acids Res.,Vol. 28 (1), pp. 123±125.

6. Selkov, E. Jr, Grechkin, Y., Mikhailova, N.and Selkov, E. (1998), `MPW: The MetabolicPathways Database', Nucleic Acids Res., Vol. 26(1), pp. 43±45.

7. Selkov, E., Basmanova, S., Gaasterland, T.et al. (1996), `The metabolic pathwaycollection from EMP: The enzymes andmetabolic pathways database', Nucleic AcidsRes., Vol. 24 (1), pp. 26±28.

8. Kanehisa, M. and Goto, S. (2000), `KEGG:Kyoto Encyclopedia of Genes and Genomes',Nucleic Acids Res., Vol. 28 (1), pp. 27±30.

9. Ogata, H., Goto, S., Sato, K. et al. (1999),`KEGG: Kyoto Encyclopedia of Genes andGenomes', Nucleic Acids Res., Vol. 27 (1),pp. 29±34.

10. Goto, S., Nishioka, T. and Kanehisa, M.(1998), `LIGAND: Chemical database forenzyme reactions', Bioinformatics, Vol. 14 (7),pp. 591±599.

11. Goto, S., Nishioka, T. and Kanehisa, M.(2000), `LIGAND: Chemical database forenzyme reactions', Nucleic Acids Res., Vol. 28(1), pp. 380±382.

12. DBGET/LinkDB ± Integrated databaseretrieval system (2000), URL:http://www.genome.ad.jp/dbget/

13. IUBMB ± International Union ofBiochemistry and Molecular Biology (1992),`Enzyme Nomenclature: Recommendations ofthe Nomenclature Committee of theIUBMB', Academic Press, New York.

14. EcoGene database (2000), URL: http://bmb.med.miami.edu/EcoGene/index.html

15. Karp, P. D., Riley, M., Paley, S. M. et al.

& HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001 1 4 1

Analysis and comparison of metabolic pathway databases

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022

(1997), `EcoCyc: Encyclopedia of Escherichiacoli genes and metabolism', Nucleic Acids Res.,Vol. 25 (1), pp. 43±50.

16. Karp, P. D. and Riley, M. (1999), `EcoCyc:The resource and the lessons learned' inLetovsky, S. Ed., `Bioinformatics Databasesand Systems', Kluwer Academic, Boston, MA,pp. 47±62.

17. Karp, P. D., Riley, M., Saier, M. et al. (2000),`EcoCyc and MetaCyc databases', Nucleic AcidsRes., Vol. 28 (1), pp. 56±59.

18. CGSC ± E. coli Stock Center Database(2000), URL: http://cgsc.biology.yale.edu/cgsc.html

19. GenProtEC ± E. coli genome and proteomedatabase (2000), URL:http://genprotec.mbl.edu:80/cat1.htf

20. Riley, M. (1993), `Functions of the geneproducts of Escherichia coli', Microbiol. Rev.,Vol. 57 (4), pp 862±952.

21. PathDB (2000), URL: http://www.ncgr.org/software/pathdb/

22. ExPASy ± Biochemical Pathways (2000),URL: http://expasy.ch/cgi-bin/search-biochem-index

23. Michal, G. (1993), `Biochemical Pathways',Boehringer Mannheim GmbH ± Biochemica.

24. Michal, G. (1999), `Biochemical Pathways',Spektrum Akad. Verlag, Heidelberg.

25. Ellis, L. B. M., Hershberger, C. D. andWackett, L. P. (2000), `The University ofMinnesota Biocatalysis/BiodegradationDatabase: Microorganisms, genomics andprediction', Nucleic Acids Res., Vol. 28 (1), pp.377± 379.

26. Ellis, L. B. M., Speedie, S. M. and McLeish,R. (1998), `Representing metabolic pathwayinformation: an object-oriented approach',Bioinformatics, Vol. 14 (9), pp. 803±806.

27. Bork, P. (2000), `Powers and pitfalls insequence analysis: The 70% hurdle', GenomeRes., Vol. 10, pp. 398±400.

28. Karp, P. D. (2000), `An ontology forbiological function based on molecularinteractions', Bioinformatics, Vol. 16 (3),pp. 269± 285.

29. Karp, P. D. (1998), `What we do not knowabout sequence analysis and sequencedatabases', Bioinformatics, Vol. 14 (9),pp. 753±754.

1 4 2 & HENRY STEWART PUBLICATIONS 1467-5463. B R I E F I N G S I N B I O I N F O R M A T I C S . VOL 2. NO 2. 126±142. MAY 2001

Wittig and De Beuckelaer

Dow

nloaded from https://academ

ic.oup.com/bib/article/2/2/126/239777 by guest on 31 January 2022