Software tools and databases for bacterial systematics and their dissemination via global networks


Antonie van Leeuwenhoek 64: 205-229, 1993. © 1993 Kluwer Academic Publishers. Printed in the Netherlands.

Vanderlei Perez Canhos 1, Gilson Paulo Manfio 2 & Lois D. Blaine 3

1 Tropical Data Base (BDT); 2 Tropical Culture Collection (CCT), Rua Latino Coelho 1301, Campinas, SP, 13087-010, Brazil; 3 American Type Culture Collection, 12301 Parklawn Drive, Rockville, MD 20852, USA

Received 25 August 1993; accepted 6 September 1993

Key words: bacterial systematics, databases, electronic networks, information resources, software tools

Abstract

The dynamic expansion of the taxonomic knowledge base is fundamental to further developments in biotechnology and sustainable conservation strategies. The vast array of software tools for numerical taxonomy and probabilistic identification, in conjunction with automated systems for data generation, is allowing the construction of large computerised strain databases. New techniques available for the generation of chemical and molecular data, associated with new software tools for data analysis, are leading to a quantum leap in bacterial systematics. The easy exchange of data through an interactive and highly distributed global computer network, such as the Internet, is facilitating the dissemination of taxonomic data. Relevant information for comparative sequence analysis, ribotyping, and protein and DNA electrophoretic pattern analysis is available on-line through computerised networks. Several software packages are available for the analysis of molecular data. Nomenclatural and taxonomic 'Authority Files' are available from different sources together with strain specific information. The increasing availability of public domain software is leading to the establishment and integration of public domain databases all over the world, and promoting co-operative research projects on a scale never seen before.

Introduction

The development of new approaches and techniques for acquisition and utilisation of taxonomic data has sparked fundamental changes in the questions that now may be pursued in the study of interrelationships among microorganisms. Molecular data are leading to many reinterpretations of evolutionary relationships at all taxonomic levels. The study of the connections between the evolution of the genome and the phenotype is an important area of scientific investigation, rich in its implications for all biology.

Technological developments in information/computer technology and molecular biology are leading to new strategies for molecular microbial ecological studies and target screening programmes for microorganisms of industrial and environmental importance (Bull et al. 1992; Hall 1989). The molecular information not only provides new ways of classifying living organisms in terms of directly-measured genomic similarity, but also brings systematics into the front line of the quest to explain the relationships between molecular and phenotypic diversity (Liesack et al. 1991). The computer technology revolution is having a great impact on systematics and should have an even greater impact in the future. Now, for the first time, the technology is there to handle, sort, analyse and disseminate complex taxonomic data.

Exploitation of taxonomic knowledge in microbiology is leading to significant benefits to public and commercial activities in the fields of agriculture, medicine and environmental monitoring (Hawksworth 1992). Applications in the biological control of pests, bioremediation and environmental restoration offer particular promise (Bull et al. 1992). Systematics plays a central role in understanding biodiversity, documenting it and helping in the elucidation of the underlying nature of environmental processes and the structure of natural communities (Krebs 1992). The design of sustainable conservation strategies will depend heavily on taxonomic knowledge and the availability of accurate, up to date and relevant information from computerised databases and other sources (Hawksworth & Colwell 1992a,b).

Bacteriologists now have an enormous array of tools and methods from which to choose for classifying and identifying organisms (Goodfellow & O'Donnell 1993). The study of molecular evolution has come to the forefront and often seems to dwarf the efforts of traditional systematists. However, the 'polyphasic' approach (Colwell 1970) is rapidly becoming the method of choice for bacterial systematics. One of the primary problems facing today's microbiologist is how to keep up with the rapidly proliferating technologies that can assist in the study of natural relationships among organisms. The development of comprehensive taxonomic databases associated with image databanks and geographic information systems will play an important role in ecology and biodiversity programs. Efforts are required to stimulate the integration of sequence databanks with more traditional information resources, such as bibliographic, strain data, compendia of species descriptions, and metabolic products databases. This paper presents an overview of the state-of-the-art with examples of software tools and databases for bacterial systematics and the communication mechanisms available to microbiologists for accessing these resources.

Dissemination of taxonomic information

Central to progress and development in systematics, and indeed any other scientific discipline, is the provision and ready availability of accurate, up to date, relevant and timely information. Moreover, this information must be communicated in a form which is appropriate for the intended audience.

Until recently the dissemination of information for bacterial systematics was done exclusively through hard copy publications, such as printed books, manuals and periodicals. The most extensive effort is that of Bergey's Manual of Systematic Bacteriology (Krieg 1984; Sneath 1986; Staley et al. 1989; Williams et al. 1989), a four volume compendium of physiological and molecular information on most of the recognised species of bacteria. An abbreviated version of this multi-volume work is the Bergey's Manual of Determinative Bacteriology (Buchanan & Gibbons 1974), the 9th edition of which is to be published in 1993. Another comprehensive treatise covering known genera of bacteria is the second edition of 'The Prokaryotes - A Handbook on the Biology of Bacteria: Ecophysiology, Isolation, Identification, and Applications.' This four volume work, edited by Balows et al. (1993), includes a broader spectrum of information than is covered by the Bergey's Manuals, but most of the characterisation data are in textual rather than tabular format. Nevertheless, it is an excellent tool for those seeking data on habitats, characteristics, metabolic pathways, and products of bacterial species.

The International Journal of Systematic Bacteriology (IJSB) is, at present, the only official publication medium for validly described bacterial names. All new names of bacteria must be either published or announced in this journal. The Approved List of Bacterial Names published by Skerman et al. (1980) appears in this journal and is the official document containing the names of all recognisable bacterial taxa as of January 1, 1980. The IJSB is also the vehicle for publicising recommendations of the Judicial Commission which considers amendments to the Code of Bacterial Nomenclature and exceptions to Rules of Nomenclature. The IJSB is an indispensable tool for the bacterial systematist.

The CD-ROM (from compact disc, read only memory) is a computer data storage peripheral device, which has evolved from Audio Compact Discs or CDs. CD-ROM technology allows the storage of very large databases on a 5.25 inch compact disc. The disc takes up little space and is inexpensive to produce. Good retrieval software is being generated, so that it is quite economical for the database producer wishing to make data available in this medium. CD-ROM disc drives are also dropping in cost and are increasingly becoming components of standard PC systems in libraries and laboratories. Following these developments in information technology, important publications such as the IJSB are now being offered on CD-ROM, thereby facilitating the search and retrieval of taxonomic information. The catalogues of the American Type Culture Collection, The Culture Collection of the Institute for Fermentation, Osaka, and the Japan Collection of Microorganisms are also available on CD-ROM from Hitachi Software on its 'CD-STRAINS' disk.

Cambridge Scientific Abstracts produces a number of bibliographic databases on CD-ROM that are of interest to bacteriologists, e.g., Compact Cambridge MEDLINE, Pollution Abstracts, Aquatic Sciences and Fisheries Abstracts, and the Life Sciences Collection. Other bibliographic resources available on CD-ROM include Current Contents, produced by the Institute for Scientific Information.

A recent addition to the repertoire of CD-ROM products is NCBI's Entrez Discs. This dual disc set contains literature references to molecular biology-related subjects, i.e., a subset of MEDLINE called Entrez References and a second disc called Entrez Sequences, containing sequence data from GenBank, PIR, and SWISS-PROT, plus MEDLINE references and abstracts associated with those sequences.

Two CD-ROM databases of special interest to molecular biology concern Escherichia coli. The first comprises the EcoSeq, EcoMap, and EcoGene files, which contain over 1.5 million base pairs of non-overlapping Escherichia coli DNA sequences in FASTA format, restriction map data, genetic map positions, and alignment information for over 1,000 E. coli genes. This database is also distributed by NCBI. The second, the E. coli Database (ECD) from the Institut für Mikrobiologie und Molekularbiologie in Frankfurt, Germany, contains descriptive information to supplement E. coli nucleotide sequence data in EMBL and GenBank. Gene names, number of non-redundant base pairs, overlapping sequences, citations, and EMBL and GenBank cross references are components of ECD. The EMBL Data Library in Heidelberg, Germany and the Protein Databank (PDB) in the USA also distribute sequence and structure data on CD-ROM.

With the proliferation of computer mediated communications systems, new horizons are opening for the dissemination of taxonomic information. The development of academic or research networks, in association with commercial networks, is increasing the availability of electronic mail, on-line databases and computer conferences (Cerf 1991). The networks relevant to the dissemination of taxonomic information are listed in Table 1. The Information Resources and Computerised Databases relevant to bacterial systematics are listed in Tables 2 and 3, respectively.

The Internet is the most rapidly growing network in the world. It is a network of networks, linking over 1 million computers together through common open protocols. The primary applications include electronic mail (e-mail), file transfer and remote login (Krol 1992). The Internet is an excellent place to look for information in many fields of biology. Important and frequently asked questions (FAQ) about the Internet are addressed in the revision 'A Biologist's Guide to the Internet Resources' (Smith 1993). This electronic publication describes Usenet newsgroups, Internet and Bitnet mailing lists, and information archives of special interest to biology. The number of users subscribing to the mailing lists for topics in biology and reading the newsgroups is increasing steadily (Reid 1993).

Other important tools widely used to access the Internet information resources are telnet, anonymous ftp, gopher and WAIS. Telnet is a protocol that allows a user on one computer on the Internet to log in to another computer. Anonymous ftp allows a user to connect to another computer on the Internet using the account 'anonymous' in order to retrieve files archived for public domain access. These archives include databases, mailing list and newsgroup logs, and public domain software (such as communication tools, sequence alignment and image analysis). WAIS is a tool for indexing files stored in the 'anonymous ftp sites' for easy searching and location. Gopher is one of the most powerful tools, combining features of all the others. It was developed at the University of Minnesota, and, through a standard user interface, allows a user to connect to other gophers, retrieve files, programs, sound and image archives, and search databases and library catalogues, without having to learn commands and the location of such archives.
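As an illustration of the anonymous ftp mechanism described above, the following minimal sketch uses the ftplib module of the Python standard library. The host name and file path are hypothetical placeholders, not addresses taken from this paper, and the helper names fetch_public_file and fetch_anonymous are introduced here only for the example.

```python
import io
from ftplib import FTP


def fetch_public_file(ftp, remote_path):
    """Retrieve one file over an already connected FTP session; return its bytes."""
    buffer = io.BytesIO()
    # RETR is the standard FTP command for downloading a file.
    ftp.retrbinary(f"RETR {remote_path}", buffer.write)
    return buffer.getvalue()


def fetch_anonymous(host, remote_path):
    """Connect to a public archive with the 'anonymous' account and fetch a file."""
    with FTP(host) as ftp:
        ftp.login()  # called without arguments, ftplib logs in as 'anonymous'
        return fetch_public_file(ftp, remote_path)


# Hypothetical usage (requires network access and a real public archive):
# data = fetch_anonymous("ftp.example.org", "pub/README")
```

Separating the transfer step from the connection step is a design choice made only so that the transfer logic can be exercised without a network connection.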

The rapid evolution of Internet connectivity is changing the way the linking of separate data resources is perceived and may solve some of the problems data providers have in reaching their audiences. Much of this evolution is occurring in software that is in the public domain and is accessible to anyone with an Internet connection. Of the Internet servers, Gophers are showing an outstanding growth rate. There are currently 1,200 Gopher servers around the world, including some 60 BioGophers containing information relevant to bacterial systematics. Gophers to some degree make it possible for anyone to become a data resource provider and allow any interested person to systematically browse that resource through the Internet. In fact, one can search GenBank, EMBL databases and several hundred other resources on servers throughout the world. Through the Internet Gophers it is possible to share information without imposing rigidity on the structure of the data or predefining the group with whom the information can be shared.

Table 1. Electronic networks relevant to bacterial systematics.

NETWORK, Secretariat & E-mail address, followed by Scope and Services

BIN21 - Biodiversity Information Network
  Secretariat: Interim Secretariat at the Tropical Database, Campinas, SP, Brasil. E-mail: [email protected]
  Scope and services: Aims at facilitating efficient access to information relating to all aspects of biodiversity. Services: Gopher server and discussion list (biodiv-L @bdt.ftpt.br)

BIOSCI - Biological Communication Network
  Secretariat: Two distribution sites: SERC Daresbury Laboratory, UK; IntelliGenetics, California, USA. E-mail: [email protected] (Americas, Pacific Rim); [email protected] (Europe, Africa, Asia)
  Scope and services: Organized as a set of newsgroups covering general and special interests; available free over Internet

EMBNet - European Molecular Biology Network
  Secretariat: Contact EMBL Data Library, Heidelberg, Germany. E-mail: [email protected]; [email protected]
  Scope and services: A molecular biology network linking a series of centers in Western European countries; distributes sequence data and software for molecular biology

GRIN - Genetic Resources Information Network
  Secretariat: USDA-ARS, Beltsville, MD, USA. E-mail: [email protected]
  Scope and services: Database of plant related materials including origin of samples and their useful characteristics

ICGEBNet - International Center of Genetic Engineering and Molecular Biology Network
  Secretariat: Trieste, Italy. E-mail: [email protected]
  Scope and services: Molecular biology related information, UNIDO's computer resources for molecular biology

MGD - Microbial Germplasm Database and Network
  Secretariat: Oregon State University, Corvallis, OR, USA. E-mail: [email protected]; [email protected]
  Scope and services: Data on plant pathogens, symbionts and biological control organisms

MINE - Microbial Information Network Europe
  Secretariat: CAB International Mycological Institute, Egham, UK; and DSM, Braunschweig, Germany. E-mail: [email protected]
  Scope and services: Integrated catalog project, incorporating a European network of culture collection databanks

MSDN - Microbial Strain Data Network
  Secretariat: Institute of Biotechnology, Cambridge, UK. E-mail: [email protected]
  Scope and services: Information on specific properties and cultured cells. Central Directory includes information on updated RKC codes

The existing technology is allowing the establishment of distributed Public Domain Databases (PDD) that can be co-ordinated and accessed from anywhere in the world. The main requirement is for participating sites to coordinate their activities and agree on the protocols they use. Public Domain Databases are proliferating in all areas of science. Their great advantage is that the set of potential contributors includes everyone working on the database theme, even people unknown to the co-ordinating agent. Existing Internet network utilities and public domain software provide all the resources necessary for running a PDD project. The greatest issues confronting the co-ordinator of a PDD are the costs of running the project, developing data standards and setting up quality control procedures. The establishment of public domain taxonomic databases makes economic and scientific sense, as the effort will allow the maximum use of every piece of available data and promote the co-ordination of microbial systematics research activities. The operational mechanisms, as well as the technological and biological issues related to biodiversity PDDs, were discussed by Green (1992).

Tools for data analysis and databases for bacterial systematics

There is a clear need for the construction of integrated, comprehensive, accurate and reliable taxonomic databases to answer the real questions related to microbial systematics. These databases should be widely available to maximise their impact.

Table 2. Information resources relevant to bacterial systematics.

INFORMATION RESOURCE, Secretariat & E-mail address, followed by Scope and Services

ATCC - American Type Culture Collection
  Secretariat: Rockville, MD, USA. E-mail: [email protected]
  Scope and services: Preservation and distribution of microbial strains and related data

Bacillus Genetic Stock Center
  Secretariat: Ohio State University, Columbus, OH, USA. E-mail: [email protected]
  Scope and services: Preservation and distribution of Bacillus strains and related genetic data

E. coli Stock Center
  Secretariat: Department of Biology, Yale University, New Haven, CT, USA. E-mail: [email protected]
  Scope and services: Preservation and distribution of E. coli strains and related data

Staphylococcus Stock Center
  Secretariat: Iowa State University
  Scope and services: Preservation and distribution of Staphylococcus strains and related data

EMBL - European Molecular Biology Laboratory
  Secretariat: Heidelberg, Germany. E-mail: [email protected]
  Scope and services: Collection, organization, and distribution of nucleotide sequence data

ICECC - Information Centre for European Culture Collections
  Secretariat: DSM, Braunschweig, Germany. E-mail: [email protected]; [email protected]
  Scope and services: Collection, organization and dissemination of microbial strain data

NCBI - National Center for Biotechnology Information
  Secretariat: NIH, Washington DC, USA. E-mail: info @NCBI.nlm.nih.gov
  Scope and services: Production and distribution of nucleotide and amino acid sequence data and software tools for sequence analysis

RDP - Ribosomal Databases Project
  Secretariat: University of Illinois, Urbana, IL, USA. E-mail: [email protected]
  Scope and services: Ribosomal sequence data and software for sequence analysis. Services: phylogenetic analysis of sequences and probe checking

WDC - World Data Center on Microorganisms
  Secretariat: RIKEN, Tokyo, Japan. E-mail: [email protected]
  Scope and services: Information on resource centers of microorganisms and cultured cells, including the following databases: World Directory of Culture Collections of Microbial Strains (CCINFO); Hybridoma Data Bank (HDB); World Catalogue of Algae (ALGAE). The information is searchable through a Gopher/WAIS INTERNET server.

Table 3. List of computerized databases with relevant information to bacterial systematics.

DATABASE, Producer & E-mail address, followed by Scope and Comments

BBRN - Biosis Register of Bacterial Nomenclature
  Producer: Phone: + 1 (800) 523-4806, (215) 507-4917. E-mail: [email protected]
  Scope: Complete list of valid bacterial names, synonyms and history

BKS - Biotechnology Knowledge Sources
  Producer: Phone: + 44 (753) 7-4201. E-mail: [email protected]
  Scope: Publications and events available on-line through the MSDN

DBIR - Directory of Biotechnology Information Resources
  Producer: E-mail: [email protected]; [email protected]
  Scope: Online directory of organizations, databases, networks, publications and nomenclature resources relevant to biotechnology

DDBJ - DNA Databank of Japan
  Producer: National Institute of Genetics, Yata, Mishima, Japan. E-mail: tgoj [email protected]
  Scope: Collection and dissemination of nucleotide sequence data generated in Japan

ECOLI - E. coli K-12 Database
  Producer: E-mail: [email protected]
  Scope: Genome and protein nucleotide and amino acid sequence, database gene name, and gene location data

GenBank
  Producer: E-mail: [email protected]
  Scope: Database of reported nucleotide sequences

CODATA/IUIS Hybridoma Data Bank (HDB)
  Producer: E-mail: [email protected]
  Scope: Database describing construction and reactivities of hybridomas and monoclonal antibodies

EMBL, EMBL-New, EMBL-Daily UEMBL
  Producer: European Molecular Biology Laboratory, Heidelberg, Germany. E-mail: [email protected]
  Scope: Nucleotide sequences databanks and related information

LiMB - List of Molecular Biology Databases
  Producer: Los Alamos National Laboratory, Los Alamos, NM, USA. E-mail: [email protected]
  Scope: A comprehensive listing of molecular biology databases

LYSIS
  Producer: E-mail: [email protected]
  Scope: Endopeptidase cleavage sites in PPR known protein sequences

NAPRALERT
  Producer: E-mail: [email protected]
  Scope: Secondary metabolites and chemosystematics of plants and bacteria

OLIGONUC
  Producer: E-mail: [email protected]
  Scope: Chemically synthesized oligonucleotide database

PDB - Protein Data Bank
  Producer: Brookhaven National Laboratory, Upton, NY, USA. E-mail: [email protected]
  Scope: Three dimensional structure of proteins and other biological macromolecules determined by X-ray crystallography or nuclear magnetic resonance

PIR - Protein Identification Resource
  Producer: National Biomedical Research Foundation, Washington DC, USA. E-mail: [email protected]
  Scope: Information on amino acid sequences

PPR - Plasmid Prefix Registry and PRCTR - Plasmid Reference Center Transposon Registry
  Producer: Stanford University, California, USA. Phone: + 1 (415) 723-1772
  Scope: List of existing plasmid names and transposon allocations

PROSITE
  Producer: E-mail: [email protected]
  Scope: Biologically significant protein sequence patterns

QTDGPD - Quest 2D Gel Protein
  Producer: E-mail: [email protected]
  Scope: Analysis of 2D gel samples database

RED - Restriction Enzymes Database
  Producer: E-mail: rober [email protected]
  Scope: Database on restriction endonucleases, recognition sequences and cleavage sites

RKC codes
  Producer: E-mail: [email protected]; [email protected]
  Scope: Standardized list of bacterial phenotypic characteristics

SEQ ANL REF Databank
  Producer: E-mail: [email protected]
  Scope: Listing of literature references relevant to sequence analysis

SIG PEP - Signal Peptide Sequences
  Producer: E-mail: [email protected]
  Scope: A database of secretory signal peptide sequences

SRRSD - Small Ribosomal RNA Sequences Database
  Producer: E-mail: dewachter @ccv.uia.ac.be
  Scope: Data on small ribosomal subunit RNA sequences

SWISS-PROT
  Producer: E-mail: [email protected]
  Scope: Amino acid sequence data

TFD - Transcription Factor Database
  Producer: E-mail: [email protected]
  Scope: Database on transcription factors

TRNAC - Transfer RNA Compilation
  Producer: Phone: + 49 (921) 552-668
  Scope: Nucleotide sequences of tRNA

All taxonomic activity is part of an international network of communication and information. Although research can be done, and many times is carried out, individually, the activities related to systematics depend on a series of agreed-upon rules and codes, publications, and up-to-date but scattered pieces of information and molecular data. Ideally the results of every taxonomic study should not only answer an immediate question, but also contribute data to a fundamental global taxonomic knowledge database.

The implementation of international programmes to document the importance and roles of systematic biology in sustainable development will contribute to the enhancement of the taxonomic knowledge base. The 'Sustainable Biosphere Initiative' (Anonymous 1991a), the 'Systematics Agenda 2000' (Anonymous 1991b) and the 'Microbial Diversity 21' (Hawksworth & Colwell 1992a) are already stimulating the debate on the importance of systematics in human affairs. Within the context of Diversitas, the International Union of Biological Sciences (IUBS) Biodiversity Programme, an international Biodiversity Information Network (BIN-21) is being established in order to facilitate the dissemination of biodiversity information worldwide (Canhos et al. 1992).

The development of integrated taxonomic databases is now feasible mainly because of the advances in instrumentation and information technologies. Laboratory instruments such as gas and liquid chromatographers, mass spectrometers, spectrophotometers, scintillation counters and microdilution analysers now include integrated computers or offer them as options (Kellogg 1989). Research operation is becoming even more computer intensive, and data logging and management of the resulting streams of microbial strain data are creating the conditions for the development of comprehensive databases.

New possibilities of linking together institutions and people throughout the world using global computer networks are leading to a new era of co-operative research on a scale never seen before. Significant outputs of this international co-operation in the biological sciences are the public domain databases fostered by GenBank and EMBL. With this increased co-ordination and improved access to molecular biology databases, thousands of researchers are using and contributing to the compilations of DNA and protein data. These developments in molecular biology provide a good model for the design of integrated networks for microbial systematics and construction of public domain taxonomic databases.

Inconsistencies in database organisation and data coding greatly complicate, and may even prevent, electronic exchange of information between scientists. The Committee on Data for Science and Technology, CODATA, which is concerned with the quality and accessibility of data (as well as the methods by which data are acquired, managed, analysed, and disseminated), has established a Commission on the Terminology and Nomenclature of Biology. This commission aims to provide resources for the development of standards for terminology and nomenclature in the fields of biological sciences and bioinformatics.

Microbiologists have traditionally adopted the artificial and natural approaches to the classification of bacteria. Artificial classifications are still heavily used in clinical microbiology (Cowan & Steel 1974). In such classifications the groups are monothetic, i.e., defined on the basis of a few selected characters. Natural or phenetic classifications use a large number of phenotypic and genotypic characters for defining groups. These have a high information content and can accommodate some degree of phenotypic variability in the strains being identified. Groups defined in this way are polythetic, i.e., defined on the basis of a large number of common properties.
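The distinction between monothetic and polythetic groups can be made concrete with a small sketch. The character names, the group profile and the 0.8 membership threshold below are invented for illustration and are not taken from any published scheme; a real polythetic group would be defined by cluster analysis rather than by a fixed profile.

```python
def monothetic_member(strain, key):
    """Monothetic rule: every one of a few selected characters must match."""
    return all(strain.get(test) == state for test, state in key.items())


def polythetic_member(strain, profile, min_fraction=0.8):
    """Polythetic rule: a large share of many characters must match.

    min_fraction is an assumed cutoff chosen only for this example.
    """
    shared = sum(1 for test, state in profile.items() if strain.get(test) == state)
    return shared / len(profile) >= min_fraction


# Hypothetical group profile over ten binary characters (1 = positive).
PROFILE = {f"test_{i}": 1 for i in range(10)}
# Hypothetical diagnostic key using only two of those characters.
KEY = {"test_0": 1, "test_1": 1}

# An atypical strain that is negative in one key character only.
atypical = dict(PROFILE, test_1=0)

print(monothetic_member(atypical, KEY))      # fails the two-character key
print(polythetic_member(atypical, PROFILE))  # still shares 9 of 10 characters
```

The atypical strain is rejected by the monothetic key because of a single aberrant result, whereas the polythetic rule tolerates it, which is the tolerance to phenotypic variability noted above.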

A third approach, phylogenetic classification, consists of the interpretation of molecular sequence information (mainly rRNA and conserved protein sequences) for expressing the evolutionary relationships between microorganisms. Several constraints in data analysis and interpretation are still unresolved (Sneath 1989; Williams 1992), but phylogenetic analysis has been used recently for proposing the division of microorganisms into the Domains Archaea, Bacteria and Eucarya (Woese et al. 1990; Winker & Woese 1991) and has had a great impact on the taxonomic structure of several microbial taxa (Rossler et al. 1991; Stackebrandt et al. 1981, 1991).

Phenetic and phylogenetic classifications have benefited from the developments in computer technology. The biggest impacts of the use of computers in bacterial systematics may be seen in the areas of numerical taxonomy, computer-assisted (probabilistic) identification, automated identification systems, chemosystematics, molecular data (nucleic acid and protein sequences), nomenclature and strain data (culture collections).

Tools for numerical taxonomy

Numerical taxonomy studies involve the analysis of a large number of strains representing a taxonomic group with regard to a large number of characters (Sneath & Sokal 1973). Characters vary to a great extent, depend on the microbial group under investigation and range from morphological, cultural and biochemical properties, degradation activity, utilisation of organic and inorganic compounds (carbon and nitrogen sources), resistance to chemicals and production of metabolites to chemical data, such as fatty acid and polar lipid patterns, DNA guanine/cytosine ratio, isoprenoid quinones and sugars. These are discussed in detail by O'Brien and Colwell (1987), Goodfellow et al. (1985) and Goodfellow and O'Donnell (1993). The resulting data are usually coded as binary variables (positive/negative) representing the individual test results or sometimes as multistate characters, using a few variables to cover all the range of possible responses (e.g., levels of antibiotic resistance). Continuous data (numeric values, e.g., spore diameter) are usually converted to multistate binary variables to facilitate the analysis, but a few analysis algorithms can cope with binary and continuous data in the same dataset. Duplicate strains are introduced for the assessment of the test error in the system, enabling the detection of variable and poorly reproducible tests (Sneath 1974; Sneath & Johnson 1972).
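The coding step described above can be sketched as follows. The strain records, the resistance levels and the diameter cut points are hypothetical, and the additive coding of ordered multistate characters shown here is only one of several possible conventions.

```python
# Hypothetical raw results for three strains; all names and values are invented.
RAW = {
    "strain_A": {"catalase": "+", "resistance": "high", "spore_diameter_um": 1.4},
    "strain_B": {"catalase": "-", "resistance": "low", "spore_diameter_um": 1.0},
    "strain_C": {"catalase": "+", "resistance": "none", "spore_diameter_um": 0.5},
}

LEVELS = ["none", "low", "high"]  # assumed ordered scale of antibiotic resistance
DIAMETER_CUTS = [0.8, 1.2]        # assumed cut points in micrometres


def encode_binary(result):
    """Code a positive/negative test result as 1/0."""
    return 1 if result == "+" else 0


def encode_multistate(value, levels):
    """Additive coding: one binary variable per level above the lowest."""
    rank = levels.index(value)
    return [1 if rank > i else 0 for i in range(len(levels) - 1)]


def encode_continuous(value, cuts):
    """Convert a measurement to binary variables, one per cut point exceeded."""
    return [1 if value > cut else 0 for cut in cuts]


def encode_strain(record):
    """Concatenate all coded characters into one binary vector."""
    return (
        [encode_binary(record["catalase"])]
        + encode_multistate(record["resistance"], LEVELS)
        + encode_continuous(record["spore_diameter_um"], DIAMETER_CUTS)
    )


MATRIX = {name: encode_strain(record) for name, record in RAW.items()}
```

Each strain is thereby reduced to a vector of five binary variables, the form expected by the similarity coefficients used in the subsequent analysis.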

The management of numerical taxonomic data may be accomplished on simple IBM-PC compatible microcomputers, using spreadsheets (e.g., Lotus 1-2-3, Quattro Pro), database software (e.g., dBASE, FoxPro, Paradox) or shareware (PC-File). Other platforms (e.g., Macintosh, Unix, VAX) are equally suited for data management and analysis. Regardless of the software or system used, it is always advisable to test a small simulated dataset to ensure that the software has all the desired features before embarking upon a time-consuming large-scale taxonomic project. Faster computers are desirable for large datasets and some software may require very specific computer configurations to run (e.g., numerical co-processors or large amounts of RAM).

The usual data analysis starts with the evaluation of the 'quality' of the data set by the estimation of the test error. This is done by comparing all test results for the pairs of duplicate strains (Sneath 1974; Sneath & Johnson 1972). Poorly reproducible tests may be removed from the dataset. Subsequent taxonomic analysis involves calculating similarity coefficients and performing hierarchical and non-hierarchical statistics for grouping the strains according to their overall similarity (Bryant 1987; Sackin & Jones 1993).
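The comparison of duplicate strains can be sketched as follows. The example counts, for each test, the pairs of duplicates that disagree and derives a within-strain variance and an error probability from that count. The formulas follow the form usually attributed to Sneath & Johnson (1972) and should be checked against that paper before use; the function name and the data are invented for illustration.

```python
from math import sqrt


def estimate_test_error(duplicate_pairs):
    """Per-test reproducibility statistics from duplicated strains.

    duplicate_pairs: list of (results_a, results_b); each element is a
    list of 0/1 results for the same tests on two cultures of one strain.
    Returns one (discrepancies, variance, error_probability) per test.
    """
    n_pairs = len(duplicate_pairs)
    n_tests = len(duplicate_pairs[0][0])
    stats = []
    for i in range(n_tests):
        # number of duplicate pairs giving different results for test i
        d = sum(1 for a, b in duplicate_pairs if a[i] != b[i])
        s2 = d / (2 * n_pairs)                       # assumed variance estimate
        p = 0.5 * (1 - sqrt(max(0.0, 1 - 4 * s2)))   # assumed error probability
        stats.append((d, s2, p))
    return stats


# Invented results for four duplicated strains and three tests
pairs = [
    ([1, 0, 1], [1, 0, 0]),
    ([0, 0, 1], [0, 0, 1]),
    ([1, 1, 0], [1, 1, 1]),
    ([1, 0, 1], [1, 0, 1]),
]
for i, (d, s2, p) in enumerate(estimate_test_error(pairs)):
    print(f"test {i}: discrepancies={d} s2={s2:.3f} p={p:.3f}")
```

In this invented dataset the third test disagrees in two of the four duplicate pairs and would be a candidate for removal, whereas the first two tests are fully reproducible.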

The coefficients most commonly used for similarity calculations are the Jaccard and simple matching coefficients, but several similarity/dissimilarity coefficients are available for the analysis of binary and continuous data (Wishart 1987). Usually, similarity data are expressed by a dendrogram, calculated using the unweighted pair group method with arithmetic means algorithm (UPGMA) (Sneath & Sokal 1973).
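The two coefficients and the UPGMA procedure can be sketched in a few lines. The simple matching coefficient counts joint negatives as matches, whereas the Jaccard coefficient ignores them. The code below is a didactic sketch using only the standard library, not the implementation used by any of the packages cited in this review; the strain names and test results are invented.

```python
def simple_matching(a, b):
    """Proportion of characters with the same state (0/0 and 1/1 count)."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def jaccard(a, b):
    """Shared positives over characters positive in at least one strain."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def upgma(names, similarity):
    """Agglomerate strains by average linkage; return the merges in order."""
    position = {name: i for i, name in enumerate(names)}
    clusters = {i: (name,) for i, name in enumerate(names)}

    def average(c1, c2):
        # UPGMA: unweighted mean over every between-cluster pair of strains
        values = [similarity[position[x]][position[y]] for x in c1 for y in c2]
        return sum(values) / len(values)

    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        ids = sorted(clusters)
        candidates = [(average(clusters[i], clusters[j]), i, j)
                      for k, i in enumerate(ids) for j in ids[k + 1:]]
        level, i, j = max(candidates, key=lambda c: c[0])
        merges.append((clusters[i], clusters[j], level))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Invented binary test results for four strains
strains = {
    "A": [1, 1, 0, 1, 0, 0],
    "B": [1, 1, 0, 1, 1, 0],
    "C": [0, 0, 1, 0, 1, 1],
    "D": [0, 0, 1, 1, 1, 1],
}
names = list(strains)
matrix = [[jaccard(strains[p], strains[q]) for q in names] for p in names]
for left, right, level in upgma(names, matrix):
    print(left, right, round(level, 3))
```

Each merge and its similarity level correspond to one node of the dendrogram. Replacing `jaccard` with `simple_matching` when building the matrix shows how the choice of coefficient can change the levels at which clusters join.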

The evaluation of numerical taxonomic classifications (groups) can be achieved using several practical and theoretical approaches. The use of different similarity and clustering algorithms provides some insight into the stability of the clusters (Sackin & Jones 1993). The cophenetic correlation index (Sneath & Sokal 1973) provides an estimate of the distortion introduced during the construction of dendrograms by comparing the final similarity values with the original data from the similarity matrices. Minimum spanning trees and shaded diagrams provide additional information on the composition, stability and relationships between strains and clusters. Calculation of cluster overlap statistics (Sneath 1977, 1979b,c) and detection of outlying strains (Sneath & Langham 1989) provide effective ways of assessing the overall classification.

Statistical procedures and calculations can be done using commercial packages (SAS; SPSS; Genstat) or specialised software (CLUSTAN, MVSP, TAXAN, TAXON). CLUSTAN (Wishart 1987), MVSP and TAXAN (ASM, USA, unpublished) have an extensive range of similarity coefficients and clustering algorithms, suitable for binary and continuous data, and are more user-friendly than statistical software packages. The use of integrated software for numerical taxonomic analysis may present several advantages over the use of separate database and analysis software. MICRO-IS (Bello 1989) has integrated modules for data management and analysis in the same software with minimal manipulation of data and files. TAXON (A. C. Ward, unpublished) has basic database facilities for strain data, and data analysis can be carried out in separate software using ASCII exported files. This software includes procedures for the calculation of test error, cluster centrotypes, construction of percentage positive tables and probabilistic identification. NTSYS-pc (Sackin 1987) and SYN-TAX IV (J. Podani, unpublished) have limited database management facilities for ASCII files but provide the most comprehensive suite of procedures for data analysis, including clustering, ordination, comparison and evaluation of classifications and several options for graphical display. Peter Sneath and colleagues have published several BASIC programs for numerical taxonomic analysis and probabilistic identification which can be implemented as user subroutines for taxonomic data analysis (Sneath 1974, 1977, 1979a-e, 1980a-c; Sneath & Langham 1989; Sneath & Sackin 1979). Additional information on computer programs can be found in Sackin (1987). The reader is referred to Sackin and Jones (1993) for an excellent up-to-date text on computer-assisted classification.

Analysis of reactivity patterns of monoclonal antibodies may be used as a tool for bacterial systematics. The CODATA/IUIS Hybridoma Data Bank can be used to study taxonomic relationships among strains of bacteria and other organisms (Walczak et al. 1988). Programs allowing for analysis of reactivity patterns of monoclonal antibodies are an integral part of the software used to manage the data bank. Biochemical substances will 'cluster' when subjected to taxonomic analysis based on their reactivity patterns with a defined set of monoclonals. Monoclonal antibodies exhibit complex reactivity patterns with various antigens, although they are specific to unique epitopes. Because similar epitopes can be found on diverse proteins and/or polysaccharides, the monoclonal antibodies may react with dissimilar cell types or biochemical substances. Cluster analysis can reveal epitopes common to related, and possibly non-related, organisms. By inverting the data, relationships among epitopes can be determined.

Cluster analysis seeks to group individual objects based on measured properties. These groupings can serve many purposes, e.g., to display relationships among similar objects and to predict unmeasured properties of objects. A cluster analysis of Leishmania strains based on reactivity with monoclonal antibodies demonstrated that certain strains were more closely related by geographic origin than by species, i.e., cross-species clustering of organisms from specific locations was observed (Bussard et al. 1985). Identification of a conserved epitope provides some information on genetic diversity among organisms.

The data analyses performed, while not definitive measures of structural (topological) similarities, can help to generate hypotheses. When combined with data from other molecular biology databases, such as PIR and GenBank (Table 3), the hypotheses can be further developed.

Probabilistic identification matrices as strain databases

Probabilistic identification matrices are one of the end products from numerical taxonomic studies (Willcox et al. 1980). Once a stable classification is achieved (i.e., the strains are arranged into groups according to the overall similarity), the individual clusters are assessed to produce a 'percentage positive table', i.e., the percentage of strains showing a positive state for each character studied. The percentage positive table is then reduced to an identification matrix, using statistical procedures to select a minimal combination of characters which are discriminatory for the groups defined previously by the numerical taxonomy (Sneath 1979e, 1980a). The matrix should contain several strongly diagnostic characters for each group. The final identification matrix is evaluated by a series of theoretical and practical trials (Langham et al. 1989a; Sneath 1974, 1979d, 1980b,c; Sneath & Sackin 1979) after which it can be used for the identification of unknown isolates.

Identification of unknown strains can be performed using computer software. MATIDEN (Sneath 1979d) is a BASIC routine which provides three identification scores: Willcox probability, taxonomic distance and standard error of taxonomic distance, all implemented in the TAXON software (A. C. Ward, unpublished). The Bacterial Identifier (Bryant 1991) provides matrices for Gram-positive and Gram-negative bacterial species. Facilities for adding organisms and creating additional identification matrices are provided but the output is restricted to one identification score. Matrix (S. Souza, unpublished) also provides one identification score and accepts user-defined identification matrices, being most suited for routine identification tasks.
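The construction of a percentage positive table and the calculation of a Willcox probability can be sketched as follows. The likelihood of an unknown against a taxon is taken as the product, over all tests, of the probability of the observed result, and the Willcox probability is that likelihood divided by the sum of the likelihoods for all taxa. The clamping of probabilities to the range 0.01 to 0.99 is an assumption of this sketch (it avoids zero likelihoods); the function names and data are invented and do not reproduce MATIDEN or any other program mentioned here.

```python
def percent_positive(clusters):
    """Fraction of strains positive for each test, per cluster.

    clusters: {cluster name: list of 0/1 result vectors, one per strain}
    """
    table = {}
    for name, members in clusters.items():
        n = len(members)
        table[name] = [sum(column) / n for column in zip(*members)]
    return table


def willcox_scores(matrix, unknown, floor=0.01):
    """Normalised likelihood of an unknown against every taxon.

    floor is an assumed lower bound on probabilities (and 1 - floor an
    upper bound), so that a single aberrant result cannot give zero.
    """
    likelihoods = {}
    for taxon, probabilities in matrix.items():
        likelihood = 1.0
        for p, result in zip(probabilities, unknown):
            p = min(max(p, floor), 1 - floor)
            likelihood *= p if result else (1 - p)
        likelihoods[taxon] = likelihood
    total = sum(likelihoods.values())
    return {taxon: value / total for taxon, value in likelihoods.items()}


# Invented clusters: four strains each, three tests
clusters = {
    "cluster 1": [[1, 0, 1], [1, 0, 1], [1, 0, 0], [1, 1, 1]],
    "cluster 2": [[0, 1, 1], [0, 1, 0], [0, 1, 1], [1, 1, 0]],
}
table = percent_positive(clusters)
scores = willcox_scores(table, unknown=[1, 0, 1])
for taxon, score in scores.items():
    print(taxon, round(score, 4))
```

An identification is normally accepted only when the best score exceeds a threshold chosen by the user, and the result should still be checked against the other scores and the known biology of the isolate.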

A review of probabilistic identification matrices is available (Bryant 1993). The several identification matrices for microbial taxa available in the literature, including Bacillus (Priest & Alexander 1988) and Streptomyces (Kämpfer & Kroppenstedt 1991; Langham et al. 1989b) amongst others, can be loaded into suitable software (Bacterial Identifier, MICRO-IS, MATIDEN, Matrix) and be used in data analysis.

Several laboratories and governmental organisations provide facilities for the use of specialised identification matrices, such as the National Institute of Dental Research of the U.S. National Institutes of Health, which holds large datasets on mycobacteria, clinical anaerobes, and Arctic marine bacteria. Probabilistic identification matrices were developed using data from large numbers of phenotypic clusters of Alaskan marine bacteria (Davis et al. 1983), Capnocytophaga (Walczak & Krichevsky 1982) and slow-growing mycobacteria (Wayne et al. 1980).

A consortium of U.S. government agencies has developed a suite of programs for management and analysis of microbial strain data. These programs, called MICRO-IS, have an Identification Module (McManus & Krichevsky 1992) that is used for probabilistic identification of unknown strains. Twenty-five identification matrices, listed below, are provided with the programs. Users can also develop their own matrices and import them to the system. It is important to note that automated identification of unknown strains must be tempered with common sense and background knowledge of the organisms being tested. Successful identification of unknowns is dependent upon using standardised methods to test the strains, i.e., methods used to develop the matrix must be known and used to test the characteristics of the unknowns. Documentation for the MICRO-IS Identification Module contains complete information on media and methods used to develop the matrices. Matrices without accompanying methods data have been removed from the module. Identification matrices for the following groups of organisms are currently included with the MICRO-IS programs:

Aerobic Gram negative fermentative bacilli
Aerobic Gram positive cocci
Aerobic Gram negative bacilli
Aerobic Gram negative nonfermentative bacilli
Aerobic Gram positive cocci
Bacillus species (two separate matrices)
Enterobacteriaceae and Vibrionaceae
Facultatively anaerobic Gram negative bacilli
Fastidious bacteria
Gram positive nonsporeforming bacilli
Lactobacillus species
Mycobacterium species
Oral Peptostreptococcus species
Oral Staphylococcus species
Oral Streptococcus species
Oral Gram negative bacilli
Pseudomonas species
Slow-growing Mycobacterium species - clinical matrix
Slow-growing Mycobacterium species - taxonomic matrix
Streptococcus species
Streptomyces species - minor clusters
Streptomyces species - major clusters
Streptoverticillium species
Vibrio and related genera

The MICRO-IS programs are public domain software and are available through the Secretariat of the Microbial Strain Data Network, described in Table 1.

Despite the wide availability of matrices and computer software, identification matrices should be evaluated for the different laboratory conditions using the same standardised identification tests and reference strains (Sneath 1974). The user should be aware that in most cases, particularly when dealing with environmental isolates, only a certain percentage of the organisms will be successfully identified by the matrix (Williams et al. 1985).

Automated systems as tools for database construction

Commercially available automated bacterial identification systems represent a recent tool for the routine identification of isolates in the laboratory and construction of databases (Mauchline & Keevil 1991). Identification tests are assembled in microtitre plates or disposable kits and the results may be read and collected automatically by a plate reader connected to a microcomputer or visually by the operator (Bochner 1989). Although rapid automated identification systems are used for the quick and reliable identification of clinical isolates, few systems can identify environmental isolates with the same success (Klingler et al. 1992).

The assay system is usually based on colour-developing reactions, where a change of colour in the reaction well indicates the utilisation of the substrate (e.g., sugar or nitrogen source) or enzymatic activity (e.g., phosphatase) (BIOLOG, API ZYM). Some systems use substrates chemically bonded to indicator molecules, such as methylcoumarins or methylumbelliferones, which become coloured or fluorescent when the link is cleaved (Manafi et al. 1991; Sensititre 1992). The reaction can be analysed either quantitatively or qualitatively. Advantages of this type of system are the high sensitivity, speed (the majority of the reactions can be read after a few hours of incubation) and high throughput of samples.

The identification of unknown isolates in automated systems may be performed manually by comparing results with identification tables or using computer identification matrices. A wide range of identification matrices for Gram-positive and Gram-negative organisms is currently available for automated and semi-automated systems (BIOLOG 1992; Kämpfer & Kroppenstedt 1991; Kämpfer et al. 1991; Sensititre 1992).

Automated systems have similar restrictions to probabilistic identification matrices derived from numerical taxonomic studies in that they may not successfully identify all the isolates and that they need to be evaluated for the specific laboratory conditions and samples under investigation (Klingler et al. 1992).

Chemosystematic data analysis

Chemical data are a valuable source of taxonomic information. Cell wall composition, lipid patterns, fatty acids and other cell components provide complementary information to phenetic data for the classification and identification of microorganisms (Goodfellow & Minnikin 1985).

The analysis of chemical data usually demands a refined statistical background from the taxonomist (James & McCulloch 1990). Complex qualitative and quantitative patterns of chemical compounds, such as fatty acid and sugar profiles, amongst other types of chemical profiles, are not as easy to analyse and interpret as numerical phenetic data (O'Donnell et al. 1985). The raw data may require normalisation, scale transformation and data reduction steps prior to statistical analysis. These procedures are not easily conducted without a computer and adequate software. Statistical analysis is sometimes complex, usually consisting of multivariate statistics, such as principal component analysis, canonical variates analysis and SIMCA (Saddler et al. 1987; Brondz et al. 1990). Several statistical software packages may be used, including Genstat (Payne et al. 1989), SAS (SAS Institute, USA) and SPSS (Norusis 1982), but users invariably have to program their own routines for data analysis in these systems. IDAMS is another data management and analysis software package that is distributed free of charge by UNESCO (Paris, France), with versions for IBM mainframes and PC platforms. The software has several options for elaborate multivariate statistics, such as factor analysis and multidimensional scaling, which are not found in other packages.
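The preprocessing steps mentioned above (normalisation, scale transformation and data reduction) can be illustrated with a minimal sketch for fatty acid profiles. It assumes that raw peak areas are converted to percentages of the total, that a logarithmic transformation is applied to reduce the dominance of major components, and that components never reaching a minimum percentage are dropped. These choices, the thresholds and the data are illustrative assumptions, not a recommended protocol.

```python
from math import log10


def normalise_profile(peak_areas):
    """Express each peak as a percentage of the total peak area."""
    total = sum(peak_areas)
    return [100.0 * area / total for area in peak_areas]


def log_transform(percentages, offset=1.0):
    """log10(x + offset); the offset (assumed here) keeps zeros defined."""
    return [log10(p + offset) for p in percentages]


def drop_rare_components(profiles, min_percent=1.0):
    """Keep only components reaching min_percent in at least one profile.

    Returns the indices kept and the reduced profiles.
    """
    n_components = len(profiles[0])
    keep = [j for j in range(n_components)
            if max(profile[j] for profile in profiles) >= min_percent]
    return keep, [[profile[j] for j in keep] for profile in profiles]


# Invented peak areas for two strains and four fatty acids
raw = [[500, 300, 198, 2], [250, 600, 149, 1]]
percent = [normalise_profile(profile) for profile in raw]
keep, reduced = drop_rare_components(percent)
transformed = [log_transform(profile) for profile in reduced]
print(keep)
print([[round(v, 3) for v in profile] for profile in transformed])
```

The transformed profiles would then be passed to a multivariate procedure such as principal component analysis in one of the statistical packages cited above.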

The MIDI-Hewlett Packard gas-chromatographic identification system is one example of an automated system for generating chemotaxonomic data. The chromatographic analysis of bacterial fatty acids can be fully automated and the system provides options for the identification of unknown strains against a library of fatty acid data as well as for the classification of unknown strains (MIDI 1993).

Curie point pyrolysis-mass spectrometry (PYMS) analysis of biological samples has been used for the classification and identification of microorganisms of clinical and industrial importance (Gutteridge et al. 1985; Sanglier et al. 1992). An up-to-date review of the use of PYMS can be found in Magee (1993). Horizon (Horizon Instruments, East Sussex, UK) provides analytical equipment with dedicated statistical software (PYMENU) for multivariate statistical analysis, including facilities for library building and identification of unknown strains (Horizon 1992). Other rapid whole-organism fingerprinting techniques, such as infra-red spectrometry (Helm et al. 1991), may also develop into useful tools for bacterial systematics.

Techniques, software tools and molecular information

Molecular techniques are adding a new dimension to bacterial systematics. Introduced in the 1960s, the determination of DNA base composition and nucleic acid hybridisation marked the beginning of a new era in bacterial systematics (Stackebrandt & Goodfellow 1991). Further developments, including the introduction of techniques based on the comparison of total DNA and sequences from homologous genes, are allowing the measurement of genealogical distances and the order in which the organisms evolved. New tools, such as oligonucleotide probing, the generation of taxon-specific profiles of DNA fragments and analysis of PCR gene products, are now available for the identification of taxa at all levels. A vast array of software is available for the analysis of molecular data and for the dissemination of molecular information via electronic networks.

Comparative sequence analysis, especially of ribosomal RNA sequences, has gained renewed interest from microbiologists as a useful tool for the classification and identification of bacteria. Hypervariable and conserved regions in the 23S, 16S and 5S ribosomal ribonucleic acid can be used for determining taxonomic relationships at suprageneric, generic, species and strain level.

Comparative 16S rRNA sequence analysis has been used to study phylogenetic diversity in Bacillus. Rossler et al. (1991) have determined that there are four major clusters of Bacillus species, based on rRNA sequence similarities. The designated clusters are: 'Bacillus subtilis', which includes B. stearothermophilus; 'Bacillus brevis', including B. laterosporus; 'Bacillus alvei', including B. polymyxa, B. macquariensis, and B. macerans; and a 'Bacillus cycloheptanicus' group that includes only B. cycloheptanicus. As further evidence of the distinct divisions among these Bacillus clusters, the authors point out that some genera of non-spore-forming Gram-positive bacteria, such as enterococci, lactobacilli, and leuconostocs, share more sequence similarity with B. subtilis and B. stearothermophilus than do any of the Bacillus species from either the 'B. alvei' or 'B. brevis' clusters. Based on these rRNA sequence comparisons, the authors conclude that there is enough genotypic diversity among the Bacillus species to justify the creation of new genera and to alter the current classification of the genus Bacillus.
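The notion of sequence similarity used in such comparisons can be illustrated with a minimal sketch that computes the percentage of identical positions between two aligned sequences. It assumes that the sequences have already been aligned to the same length and that positions with a gap in either sequence are ignored; published studies may treat gaps and ambiguous bases differently. The function name and the short sequences are invented for illustration and are not real rRNA data.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical bases over positions without gaps.

    Both sequences must come from the same alignment (equal length).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = identical = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # assumed convention: skip positions with a gap
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Invented aligned fragments
fragment_1 = "GAUC-GGA"
fragment_2 = "GAUCAGCA"
print(round(percent_identity(fragment_1, fragment_2), 1))
```

A matrix of such pairwise values for a set of aligned 16S rRNA sequences is the usual starting point for the clustering or tree-building methods provided by the sequence analysis packages discussed in this section.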

It has also been demonstrated (Jurtshuk et al. 1992) that in situ hybridisation techniques using 16S rRNA segments can differentiate between closely related bacteria. The use of DNA probes to 16S rRNA for rapid identification of bacteria has proved to be advantageous, primarily because of the large database of 16S rRNA sequences available for bacteria. Probes specific to an organism's 16S rRNA can be used to rapidly identify bacteria at the family or genus level in tissue, soil and water samples (Sayler & Layton 1990). However, the technology can also be used for taxonomic purposes, i.e., to differentiate between closely related bacterial species. Jurtshuk et al. (1992) have shown that two Bacillus species, B. polymyxa and B. macerans, having similar phenotypic characteristics, can be clearly differentiated by use of a rapid in situ hybridisation method. This in situ technique, which draws on information from the sequence databanks and on the use of software for sequence alignment and editing, can be applied to taxonomic studies of other bacterial genera. The growing body of bacterial sequence data and the development of software tools for the use of specialised subsets will be of great assistance to bacterial systematists as they realise the power of these tools for classification and differentiation of organisms.

Ribotyping is another molecular approach to characterisation of microorganisms. The technique is based on the fact that, although sequences of rRNA genes are highly conserved, some regions of the genes are variable and indicative of differences among the species. Ribotyping involves the analysis of RFLP patterns of rRNA operons. Genomic DNA is submitted to restriction endonuclease digestion using a suitable enzyme and the fragments separated by gel electrophoresis are transferred and immobilised on a membrane. The location of the rRNA operons is mapped using labelled 16S and 23S rRNA or a cloned rRNA operon from a related bacterial species or genus. The resulting patterns are indicative of the number and distribution of the rRNA operons on the bacterial genome, therefore reflecting genome organisation and genomic relatedness between different strains. Modern non-radioactive labelling and detection systems, such as biotin, digoxigenin and chemiluminescence, provide several advantages for this methodology, such as increased sensitivity, safe handling of probes and the possibility of reprobing the same membrane with other nucleic acid probes. Riboprinting involves the extraction and PCR-assisted amplification of DNA from the organism, purification of the gene, restriction enzyme digestion, and data analysis.

The combined power of molecular and computer techniques to solve taxonomic problems is illustrated in numerous publications describing the use of ribotyping for characterisation and classification of bacteria. Restriction patterns of rRNA genes have been used as taxonomic tools for classification of Aeromonas (Lucchini & Altwegg 1992; Moyer et al. 1992), Neisseria (Woods et al. 1992), Clostridium (Gurtler et al. 1991), Shigella (Hinojosa-Ahumada et al. 1991) and other bacterial genera. Ribotyping is an excellent tool for elucidating genetic relationships among bacteria, primarily because ribosomal genes are highly conserved and stable and multiple operons are usually present on the bacterial chromosome.

The results of restriction enzyme digests can be compared by cluster analysis and other statistical techniques. Using maximum parsimony, Gurtler et al. (1991) constructed a dendrogram that identified two distinct clusters of Clostridium based on restriction site differences. Hinojosa-Ahumada et al. (1991) have used the technique of riboprinting to subtype isolates of Shigella sonnei from different geographic regions and outbreaks of shigellosis, thus demonstrating the utility of ribotyping for epidemiological studies and the genetic relationships among the causative organisms. The technique is particularly useful for closely related organisms having limited phenotypic differentiating characteristics.

Protein and DNA electrophoretic patterns, particularly whole-cell proteins and restriction fragments of DNA, generate good quality data for taxonomic studies. Protein electrophoretic fingerprints are usually analysed using a densitometer, and DNA fragment patterns derived from simple restriction endonuclease digestion or from restriction fragment length polymorphism (RFLP) analysis can be subjected to image analysis. One of the consequences of the evolution of two-dimensional (2D) gel techniques has been the development of databases that contain master gels from a variety of bacterial sources. These databases are going to play an increasingly important role in the analysis of bacterial genomes. 2D gel databases generally contain one or more master images of the gels that correspond to the organism studied; spots on these images are attributed an identification code and a variable percentage of these spots are linked to known proteins.

The identification of a protein on a 2D gel is generally carried out using antibodies or by microsequencing, leading to the elucidation of partial sequences and physico-chemical data for a number of as yet uncharacterised proteins. SWISS-PROT has committed itself to work in close collaboration with a number of groups developing 2D gel databases. Since last year cross-references have been available to the gene-protein database of Escherichia coli K12, now called ECO2DBASE (van Bogelen et al. 1992), and symmetrically that database now contains cross-references to SWISS-PROT. A file server will be set up that will allow anyone with a network connection to obtain annotated graphic files containing the region of the gel that corresponds to a selected SWISS-PROT entry linked to SWISS-2DPAGE (Bairoch 1993a).

Quantitative or qualitative data may be analysed by comparing similarities between individual profiles using band matching or overall similarity algorithms. A classification can be achieved using hierarchical cluster analysis or multivariate statistical analysis (Ståhl et al. 1990). Generally speaking, software for the analysis of electrophoretic fingerprints can handle both DNA and protein electrophoresis patterns, with a few exceptions. GelManager, from BioSystematica, has features for data acquisition from densitometers and document scanners, and the data management provides facilities for classification (similarity matrix and cluster analysis) and identification of unknown organisms against a library of representative fingerprints (Manfio 1993). GelCompar (Kersters 1985) provides extensive facilities for data management, including hierarchical clustering and principal component analysis (PCA), a wide choice of outputs (dendrograms, ordination plots, shaded similarity matrices, band profiles) and facilities for identification of unknown strains against a database of fingerprints.
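Band matching between two electrophoretic profiles can be sketched as follows. The example pairs bands whose positions differ by no more than a tolerance and summarises the result with the Dice coefficient (twice the number of matched bands divided by the total number of bands). The Dice coefficient, the greedy pairing strategy, the tolerance value and the band positions are assumptions made for illustration; they are not a description of how GelManager, GelCompar or any other package performs the comparison.

```python
def match_bands(profile_a, profile_b, tolerance=1.0):
    """Count bands that pair up within a position tolerance.

    Profiles are lists of band positions (e.g., migration distances).
    Bands are paired greedily in order of position; each band is used once.
    """
    a, b = sorted(profile_a), sorted(profile_b)
    i = j = matched = 0
    while i < len(a) and j < len(b):
        if abs(a[i] - b[j]) <= tolerance:
            matched += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return matched


def dice(profile_a, profile_b, tolerance=1.0):
    """Dice similarity: 2 * matched bands / total number of bands."""
    total = len(profile_a) + len(profile_b)
    if total == 0:
        return 0.0
    return 2 * match_bands(profile_a, profile_b, tolerance) / total


# Invented band positions (mm) for two fingerprints
lane_1 = [10.0, 22.5, 35.0, 48.0]
lane_2 = [10.4, 22.0, 41.0, 48.6]
print(match_bands(lane_1, lane_2), dice(lane_1, lane_2))
```

The tolerance has a strong effect on the result: a value that is too small separates bands that differ only through gel-to-gel variation, and a value that is too large merges genuinely different bands. Overall similarity algorithms, which compare whole densitometric traces, avoid the band-calling step altogether.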

State-of-the-art data acquisition, database management and analysis of DNA and protein fingerprints are provided by the BioImage package (Millipore Corporation, USA), a dedicated image analysis system with software modules for the analysis of DNA sequencing, one- and two-dimensional protein electrophoresis, RFLP and whole band DNA electrophoresis gels. The system combines powerful hardware (SPARCstation, Sun Microsystems Inc.), extensive database capabilities (databases of up to 10,000 profiles) and statistical facilities. Further software developments may enable other taxonomic applications of this system.

Molecular information is being generated from the implementation of international programs that envision the formidable task of deciphering the sequences of several genomes. At present, what has emerged from these programs is extremely interesting and will lead to new molecular biology approaches to the study of biodiversity (Monolou 1992). The immediate benefit is the establishment of numerous molecular biology e-mail servers and


Fig. 1. Secondary structure of bacterial 16S ribosomal RNA (Escherichia coli 16S rRNA).


several bulletin boards for molecular biologists. Lists of e-mail servers and bulletin board services can be obtained in the files serv-ema.txt and serv-bbo.txt. A detailed list of molecular biology ftp servers for databases and software is available at bionews@net.bio.net (serv-ftp.txt). The document describes the various sources of databases and software for molecular biologists that are publicly available on the Internet computer network (Bairoch 1993b,c).

LiMB, a List of Molecular Biology Databases available on the Internet, provides the scientific community with a comprehensive overview of databases relevant to molecular biology and related data sets. Information on how to access databases, how to contribute information, as well as on general characteristics and availability of databases, is provided by LiMB (Burks et al. 1988). Established and maintained at the Los Alamos National Laboratory in the USA, LiMB is periodically updated (Lawton et al. 1989, 1992).

Nucleic acid sequences are usually deposited in sequence databanks, such as GenBank (USA), EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Databank of Japan) (Tables 2 and 3). A very large number of nucleic acid sequences from viruses, prokaryotic and eukaryotic organisms are deposited under unique accession numbers and may be searched by genus and species of origin and keywords in the commentary fields, as well as by homology with a nucleotide sequence provided by the user.

The Ribosomal RNA Database (RDP) is a project headed by Carl Woese (Olsen et al. 1991) and funded by the U.S. National Science Foundation. It provides rRNA sequence data along with software packages for managing, analysing, and viewing the data. Although the RDP aims to include sequence data from all types of organisms, the currently available database contains aligned and phylogenetically organised small subunit rRNA sequences from close to 500 prokaryotes. Among the best represented genera (by number of species) are Bifidobacterium, Clostridium, Cytophaga, Flavobacterium, Flexibacter, Fusobacterium, Lactobacillus, Mycobacterium, Mycoplasma, Spirochaeta and Spiroplasma. Sequences in the RDP are drawn from the public sequence databanks, e.g. GenBank and EMBL, but also include sequences from investigators who have not deposited with the other databanks. The RDP is available from the University of Illinois via the Internet. Software offered includes several sequence editors, phylogenetic analysis tools, a tree drawing program, and an editor for managing bibliographic data associated with rRNA sequences.

The RDP plans to accept sequence submissions which it will format and release for distribution to other databanks. It also provides services such as sequence alignment and secondary structural representation. An example of a secondary structure diagram provided by RDP and received over the Internet (courtesy of Carl Woese) is illustrated in Figure 1. The RDP is unique in that it offers a 'sequence assessment' system that can detect possible errors and anomalies in submitted sequences. It also offers several formats, including GenBank format, and various software packages and services. It is anticipated that the scope and coverage of the RDP will expand significantly in the near future.

Another important resource for bacterial systematists is the E. coli Genetic Stock Center and Database developed by Barbara Bachmann at Yale University (Berlyn & Letovsky 1992). Characteristics of over 7,000 mutant derivatives of E. coli K12 are comprehensively described in the database including alleles, structural mutations, mating types and plasmids. Gene names, properties, products and mutations are also listed as well as derivation, strain names, and references. The E. coli Stock Center has been a strain and data repository for the past 20 years and its staff have provided leadership in the development of standards for genetic nomenclature and data organisation. Recently, the Stock Center has announced on-line availability of the database (see Table 2). Although the database does not include sequence information, it is anticipated that it will be complementary to the sequence databanks in providing users with a mechanism to check on gene names before depositing sequence data.

Systematists studying Bacillus strains have access to a similar resource produced by the Bacillus Genetic Stock Center at the Ohio State University. The database, containing information on Bacillus mutant strains, gene names, literature references and availability, is distributed on floppy disc and in hard copy catalogue format. Similar information is available for Staphylococcus strains from the Staphylococcus Stock Center at Iowa State University.

Sequence alignment. There are a number of algorithms available for a wide scope of DNA and protein sequence alignments and comparisons. Rapid searching programs, such as FASTA (Pearson 1990) or BLAST (Altschul et al. 1990) have facilitated efficient comparisons of unidentified sequences with those existing in databanks. An excellent review (Doolittle 1990) of computerised sequence analysis for the study of molecular evolution provides in-depth examinations of the various algorithms and their attributes. It is clear that molecular biologists recognise the significance of their results, which are highly dependent on the algorithm used for analysis. An overview of a few of the more popular packages for rapid sequence alignment and comparison follows.

FASTP, developed by Lipman and Pearson (1985), can be used for both nucleic acid and protein sequence alignment but was designed primarily to identify protein sequences that have descended from a common ancestor (Pearson 1990), making the algorithms more suitable for protein sequence comparisons. Modifications of this program have been developed, and a package called FASTA is suitable for aligning nucleic acid sequences and also for comparing protein sequences to nucleic acid sequences. The programs are written in C and run on DOS, Macintosh, Unix, and VMS operating systems, and are highly portable. The number of searchable residues depends on the hardware, but there is virtually no limit if the library is scanned in overlapping pieces. FASTA limits the number of gaps that can be inserted into an alignment, while rigorous optimal alignments, though slower, may extend the alignment. The programs are freely available on the Internet and source code is available from the developer, William Pearson.
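To make the contrast with a "rigorous optimal alignment" concrete, the sketch below computes a Smith-Waterman local alignment score by dynamic programming. It is not the FASTA heuristic and not code from the FASTA package; the scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Runs in O(len(a) * len(b)) time, which is why rigorous methods are
    slower than heuristic database searches.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed scores, aligning `"ACGTACGT"` against `"ACGACGT"` gives 12: seven matches and one gap, an alignment that an ungapped comparison would miss.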

Another tool for rapid sequence comparison is BLAST, the Basic Local Alignment Search Tool. BLAST utilises the maximal segment pair (MSP) score, which can be used for both protein and nucleic acid sequence analyses. It is a rapid sequence similarity search program and may also be used for motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences (Altschul et al. 1990). BLAST does not allow gaps in the local regions that it reports.

MACAW, the Multiple Alignment Construction and Analysis Workbench, is useful for studying molecular evolution and also for analysing structure/function/sequence relationships (Schuler et al. 1991). The user of MACAW constructs multiple alignments by locating and combining blocks of aligned sequence segments. Blocks may be edited or linked to form a composite multiple alignment, i.e., the user may 'custom design' the set of sequence segments that are submitted for analyses.
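The maximal segment pair used by BLAST can be illustrated with a brute-force sketch: the highest-scoring pair of equal-length, ungapped segments taken from the two sequences. The code below scans every diagonal exhaustively, so it shows the quantity being scored rather than BLAST's fast word-based search; the match and mismatch scores (+5 and -4) are assumptions made for this example.

```python
def maximal_segment_pair(a, b, match=5, mismatch=-4):
    """Return (score, start_a, start_b, length) of the best ungapped segment pair.

    Each diagonal (fixed offset between positions in a and b) is scanned
    with a running maximum-subarray sum, so no gaps are ever introduced.
    """
    best = (0, 0, 0, 0)
    for offset in range(-(len(b) - 1), len(a)):
        i = max(offset, 0)
        j = i - offset
        run = 0
        start = (i, j)
        while i < len(a) and j < len(b):
            score = match if a[i] == b[j] else mismatch
            if run <= 0:
                # A non-positive prefix never helps: restart the segment here.
                run = score
                start = (i, j)
            else:
                run += score
            if run > best[0]:
                best = (run, start[0], start[1], i - start[0] + 1)
            i += 1
            j += 1
    return best
```

For `"ACGTACGT"` and `"ACGTTCGT"` the best segment spans all eight positions: the single mismatch costs 4 but is outweighed by the matches on either side, giving a score of 31.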

The U.S. National Center for Biotechnology Information is using the Internet to provide access to public domain software such as BLAST and MACAW. The programs and the source code may be obtained via anonymous ftp to ncbi.nlm.nih.gov. The pub directory contains these and other computational tools that may be transferred and used without restriction.

Physical/genetic mapping. Colibri (Medigue et al. 1990) is a software system for Macintosh computers that provides the ability to localise unknown DNA fragments from the Escherichia coli chromosome on the restriction map established by Kohara et al. (1987). The software, a runtime version of 4th Dimension, is available on the Internet via anonymous FTP to radium.jussier.fr. Correlation of physical and genetic maps allows for analysis of polymorphism data. Mapping positions can be assigned to most genes for which sequences are available. Detection of inaccuracies in the map is facilitated through the retrieval of appropriate data from the sequence databanks in combination with the use of Colibri. The software also permits placement on the map of any cloned gene for which restriction fragments are provided, allowing for rapid mapping of new genes. Colibri, written in PASCAL, provides a good model for software to correlate physical and genetic maps of other bacteria.
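The idea of placing a cloned fragment on a restriction map can be sketched as a search for a run of consecutive map fragments whose lengths agree with the measured ones. This is only a simplified illustration under stated assumptions, not Colibri's actual method: the map is taken as a sorted list of site coordinates for a single enzyme, the tolerance is an arbitrary 10%, and all coordinates are invented.

```python
def locate_fragments(map_sites, query_lengths, tolerance=0.1):
    """Find map intervals whose consecutive fragment lengths match the query.

    map_sites      sorted restriction site coordinates (e.g. in kb)
    query_lengths  measured lengths of adjacent fragments, in map order
    tolerance      allowed relative error on each fragment length
    Returns a list of (start, end) coordinates on the map.
    """
    map_lengths = [right - left for left, right in zip(map_sites, map_sites[1:])]
    n = len(query_lengths)
    hits = []
    for k in range(len(map_lengths) - n + 1):
        window = map_lengths[k:k + n]
        if all(abs(m - q) <= tolerance * m for m, q in zip(window, query_lengths)):
            hits.append((map_sites[k], map_sites[k + n]))
    return hits


# Invented coordinates for illustration only.
EXAMPLE_MAP = [0, 4, 10, 13, 21, 25]
```

A query of three adjacent fragments, `[6.2, 2.9, 8.1]`, matches only the interval from 4 to 21, whereas a single fragment of length 4 matches two places. Ambiguity of this kind is why several adjacent fragments, or several enzymes, are normally needed.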

Electronic submission of sequence data. Sequence data submission policies have been designed by agreement among the nucleotide and protein sequence databanks. Researchers should submit nucleotide sequence data directly to GenBank or EMBL for assignment of an accession number prior to publication. Authors are strongly urged to use the sequence submission software package AUTHORIN to submit their sequence data to the databanks; a free copy (for either the IBM PC or Macintosh) can be obtained by sending the request to [email protected] (Garavelli 1993). The U.S. National Center for Biotechnology Information (NCBI) has also assumed responsibility for distribution of the AUTHORIN software, which can be obtained via e-mail to [email protected] .nih.gov. Data submissions can be sent directly to Los Alamos National Laboratory at gb-sub@genome.lanl.gov.

Multifunctional software for PCs. The array of multifunctional software packages for sequence alignment and analysis is overwhelming for the molecular biologist having to choose which provides the best tools for solving a particular problem. Among software packages available for the Macintosh, GeneWorks, LaserGene, Mac DNA/Pronasis, and MacMolly are used for accessing PIR, EMBL, GenBank, and SWISS-PROT, for performing sequence alignment, predicting PCR primers, and other analytic functions (Ahern 1993). Several also provide the ability to draw phylogenetic trees based on sequence relatedness from multiple sequence alignments. It is practically impossible to identify and describe each and every set of programs used for this purpose, but the examples below provide an overview of the types of software available to the molecular taxonomist. For more detailed information on these and other programs, Biotechnology Software: The Journal of Computational Biology, edited by Kevin Ahern and published bimonthly by Mary Ann Liebert Publishers, is recommended.

LaserGene for the Macintosh from DNA Star contains modules that carry out sequence editing, restriction mapping, sequence alignment, database analysis, and protein analysis. The editor, EditSeq, is used for entering sequences and importing/exporting sequences in various formats. Open reading frames may be found by defining start and stop codons. MapDraw provides restriction mapping functions and recognises both linear and circular sequences and can draw circular maps. Enzyme sites with which to scan the sequence are selected in MapDraw. The module called Align allows for alignment of multiple DNA or protein sequences. Several algorithms may be used for alignment. Sub alignments can also be done. The program also provides for database analysis through the GeneMan module from which one is able to query the sequence databases, e.g. GenBank, EMBL, SWISS-PROT and PIR. Protean, the protein analysis module, produces two dimensional plots based on common sequence algorithms. The versatility of this suite of programs makes it attractive to molecular systematists.
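The distinction between linear and circular sequences in restriction mapping can be shown with a short sketch. It is a generic illustration, not MapDraw's implementation: on a circular molecule a recognition site may span the point where the sequence record begins and ends, so the scan must wrap around. The function name and the test sequence are assumptions.

```python
def find_sites(sequence, site, circular=False):
    """Return 0-based start positions of a recognition site in a sequence.

    For circular molecules the first len(site) - 1 bases are appended to the
    end so that sites spanning the origin are also detected.
    """
    seq = sequence.upper()
    site = site.upper()
    search = seq + seq[:len(site) - 1] if circular else seq
    positions = []
    start = search.find(site)
    while start != -1:
        positions.append(start)
        start = search.find(site, start + 1)
    return positions


# Invented 14-base sequence; GGATCC (BamHI) and GAATTC (EcoRI) are real sites.
EXAMPLE_SEQUENCE = "AATTCGGGGATCCG"
```

In this example the BamHI site is found at position 7 either way, while the EcoRI site is found (at position 13) only when the sequence is treated as circular.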

GeneWorks by Intelligenetics is another popular set of sequence analysis programs for the Macintosh. GeneWorks provides a number of options for analysis, including algorithms for restriction and protease analysis, PCR primer prediction, pair and multiple homology determination, phylogenetic relationships, and motif identification. For restriction analysis, the package provides a comprehensive list of enzymes and allows the user to add new enzymes. Reports are in text or graphic form. For finding open reading frames (ORF), GeneWorks provides options such as genetic code and minimum number of bases, sequence of initiator codon, location and sequence of upstream sequences. Information about the location of the ORF, length, number of occurrences of each amino acid and each codon may be viewed on the output screen. Sequence alignment may be done in pairs or in groups. Dot matrix analyses can also be used to examine the relatedness of sequences. Intelligenetics has also developed PC/GENE for sequence analysis on DOS based systems. This software has capabilities similar to those of GeneWorks. A new module is available for PCR primer design that gives the user more flexibility than previously available.
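The ORF search options mentioned above (initiator codon, stop codons, minimum number of bases) map naturally onto a small function. The following is a minimal sketch, not the algorithm of GeneWorks or any other package: it scans the three forward reading frames only, and the parameter names and defaults are assumptions.

```python
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})  # stop codons of the standard code


def find_orfs(sequence, start_codon="ATG", stop_codons=STANDARD_STOPS, min_bases=30):
    """Return (start, end, frame) for ORFs on the forward strand.

    start is the 0-based index of the initiator codon, end is one past the
    stop codon, and frame is 0, 1 or 2. ORFs shorter than min_bases
    (stop codon included) are discarded.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == start_codon:
                start = pos
            elif start is not None and codon in stop_codons:
                end = pos + 3
                if end - start >= min_bases:
                    orfs.append((start, end, frame))
                start = None
    return orfs
```

For the short test sequence `"CCATGAAATTTGGGTAACC"`, calling `find_orfs(..., min_bases=12)` reports one ORF of 15 bases in frame 2; with the default minimum of 30 bases it is filtered out. A fuller tool would also scan the reverse complement and support alternative genetic codes.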

Gene Runner for Windows is a relative newcomer to the field of sequence analysis software. Gene Runner can be used with the usual sequence bank formats as well as with the National Center for Biotechnology Information Entrez retrieval software for GenBank access. Capabilities for ORF analysis, restriction analysis, fragment analysis, oligonucleotide analysis, PCR analysis, sequencing primer analysis, and reverse translation probe analysis are provided.

Nomenclatural and taxonomic 'authority files'

Inherently, systematics will always be subject to interpretation. In many cases there are 'competing taxonomies.' Molecular techniques are now augmenting classical taxonomic methodology. Debates among systematists are commonplace and must continue with the expansion of knowledge about the world's organisms and their relationships. For this reason, 'official' taxonomic treatments are rare. It has been perceived, however, that publicly available databases containing the most comprehensive information possible on the nomenclature, history, authorities, hierarchical placement, and competing opinions on the taxonomic status of organisms are needed.

In the case of bacteria, names at the genus and species levels are controlled by the International Code of Nomenclature of Bacteria (Sneath 1992). The correct name of a bacterial taxon is based on valid publication, legitimacy, priority of publication and effective publication. Since 1 January 1980, priority of bacterial names is based upon the Approved Lists of Bacterial Names (Skerman et al. 1980). Names that were not included in the Approved Lists lost standing in bacterial nomenclature.

The Bacterial Nomenclature Up-to-Date Database is compiled by the Information Centre for European Culture Collections (ICECC) together with the Deutsche Sammlung von Mikroorganismen und Zellkulturen GmbH (DSM). This database is available in a printed version or on floppy disk. The Bacterial Nomenclature Up-to-Date Database includes all valid bacterial names which have been published since 1 January 1980 and valid nomenclature changes which have been published since then. It is updated with the publication of each new issue of the IJSB and is available on-line through the Internet (Tropical Database Gopher, Brazil) or through the Microbial Strain Data Network (MSDN).

There are several computerised resources that provide authority references, synonyms, and taxonomic rank of bacterial names. While none of these resources is officially sanctioned by the International Committee on Systematic Bacteriology, several are extremely valuable for researching the origin and current status of genus and species names.

Bacteriologists are fortunate to have a widely accepted de facto authoritative resource for bacterial systematics. The Bergey's Manual Trust, headquartered at Michigan State University in East Lansing, Michigan, has, since 1923, published taxonomic treatments of bacterial species. In addition to the classifications, the Manuals contain descriptions of all genera and higher taxa, criteria for species differentiation, taxonomic comments, and references to authors of the names.

Future editions of Bergey's Manual of Determinative Bacteriology and Bergey's Manual of Systematic Bacteriology will be published both in hard copy and computer accessible media. A prototype data model for a computerised version of the Bergey's Manual of Determinative Bacteriology, 9th edition is being developed. This model is based on the use of the RKC Codes, a set of standardised descriptors for bacterial characteristics, created by Rogosa et al. (1986). An example of these codes with their numeric equivalents is illustrated below.

SECTION 24 FATTY ACIDS PRODUCED FROM D-GLUCOSE
024016: Acetic acid is produced from D-glucose.
024017: Acetic acid without carbon dioxide is produced from D-glucose.
024018: Butyric acid is produced from D-glucose.
024019: Isobutyric acid is produced from D-glucose.
024020: Caproic acid is produced from D-glucose.
024021: Isocaproic acid is produced from D-glucose.
024022: Caprylic acid is produced from D-glucose.
024023: Formic acid is produced from D-glucose.
024211: Fumaric acid is produced from D-glucose.
024024: Oenanthic (heptanoic) acid is produced from D-glucose.
024025: Propionic acid is produced from D-glucose.
024026: Valeric acid is produced from D-glucose.
024027: Isovaleric acid is produced from D-glucose.

RKC Codes applicable to each genus of bacteria are being supplied to the international editorial staff of Bergey's Manuals. Editors will select the codes that apply to each species within the genus and indicate a positive or negative value for each of the phenotypic characteristics to differentiate among the species. Tables will be constructed within the database that will allow for adding, editing, and retrieving data in response to specific queries. Standardised descriptors will facilitate interspecies comparisons and provide the mechanism for matching user-generated data with the Bergey's data.
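How coded positive/negative characters might be matched against user-generated data can be sketched as follows. This is an assumed, simplified data model and not the actual Bergey's database design: each species is a mapping from RKC code to a boolean, and the score is the fraction of shared characters that agree. The species names and their values are placeholders, not real taxonomic data; only the code numbers come from the list above.

```python
def match_profile(observed, reference):
    """Rank reference species by agreement with an observed character profile.

    observed   dict mapping RKC code -> True (positive) or False (negative)
    reference  dict mapping species name -> dict of RKC code -> bool
    Returns (species, fraction_of_agreeing_characters) pairs, best first.
    """
    scores = {}
    for species, profile in reference.items():
        shared = [code for code in observed if code in profile]
        if not shared:
            scores[species] = 0.0
            continue
        agreeing = sum(observed[code] == profile[code] for code in shared)
        scores[species] = agreeing / len(shared)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder species with invented values for three codes from Section 24.
EXAMPLE_REFERENCE = {
    "species_A": {"024016": True, "024018": True, "024025": False},
    "species_B": {"024016": True, "024018": False, "024025": True},
}
```

An isolate scored as acetic acid positive, butyric acid negative and propionic acid positive agrees fully with `species_B` and on one character in three with `species_A`. Probabilistic identification schemes weight characters by their frequency in each taxon instead of counting simple agreement.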

Another computerised taxonomic resource is the on-line BIOSIS Register of Bacterial Nomenclature (BRBN). This resource contains nomenclatural and taxonomic information on over 15,500 bacteria including the following data fields: bacterial name, authority name, synonyms of the bacterial name, International Committee on Systematic Bacteriology (ICSB) status, and taxonomic rank (including Bergey's Manual 7th and 8th editions). An effort is made to also cover all commonly occurring infra-specific ranks, the major portion of which are pathovars and serovars. BRBN is a component of the BIOSIS Taxonomic Reference File (TRF). This database contains taxonomic data on other classes of organisms and provides electronic mail, computer conferencing, and bulletin board services. A complete list of BRBN data elements is given below.

BRBN DATA ELEMENTS
TRFNUM: Unique identifying number of each record.
NAME: Bacterial name described in record.
LEVEL: Taxonomic level of bacterial name.
ICSBSTAT: Official ICSB status.
AUTHORITY: Author and date of first published description.
TYPE: Type culture collection identifier.
SOURCE: Publication first noted by TRF staff.
PREFNAME: Preferred name if synonyms exist.
REFERENCE: Bibliographic reference where name first published.
BERGEY7: Status in Bergey's Manual 7th edition.
BERGEY8: Status in Bergey's Manual 8th edition.
BSB: Coverage of name in Bergey's Manual of Systematic Bacteriology, Volumes 1-4.
NAMENOTE: Nomenclatural notes which help explain current status of name.
BIOSIS1: BIOSIS Biosystematic Code(s) pre 1979 (order level).
BIOSIS2: BIOSIS Biosystematic Code(s) post 1979 (Family level).
CODE: Four-letter abbreviations for genera on Approved Lists.
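A few of the data elements listed above are enough to show how a synonym could be resolved to its preferred name. The sketch below is a hypothetical in-memory model, not the BRBN schema or its query interface: it keeps only the record number, name, level, ICSB status, authority and preferred-name elements, and the example names and values are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NomenclatureRecord:
    """Simplified record holding a subset of the BRBN data elements."""
    trfnum: int
    name: str
    level: str
    icsb_status: str
    authority: str
    preferred: Optional[str] = None  # preferred name, if this name is a synonym


def resolve_preferred_name(records, name):
    """Follow preferred-name links until a name with no further link is reached."""
    by_name = {record.name: record for record in records}
    seen = set()
    current = name
    while current in by_name and by_name[current].preferred and current not in seen:
        seen.add(current)  # guards against circular synonym chains
        current = by_name[current].preferred
    return current


# Invented placeholder entries, for illustration only.
EXAMPLE_RECORDS = [
    NomenclatureRecord(1, "Exemplum vetus", "species", "synonym", "Author 1901",
                       preferred="Exemplum novum"),
    NomenclatureRecord(2, "Exemplum novum", "species", "approved", "Author 1985"),
]
```

Here `resolve_preferred_name(EXAMPLE_RECORDS, "Exemplum vetus")` returns `"Exemplum novum"`, and a name that is absent from the records is returned unchanged.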

Culture collections and strain databases

Large amounts of information on collection holdings are acquired as a result of routine activities and the research applied to the cultures. Information may relate to the phenotypic or genotypic features of the cultures, to their industrial applications, their taxonomic relatedness, their response to different preservation methods, their history and their patent status. This information is increasingly held in computers for ready access and updating (Canhos 1990). It is anticipated that in the near future, strain information will include reference to taxon-specific sequences, probes and primers. This information, together with the availability of strains and genetic material will lead to advances in the development and application of modern diagnostic tools in microbial systematics.

Established collections are collaborating nationally, regionally and internationally to make data more readily available to the scientific community through databases and networks. The functions and differences of the major computerised strain databases have been reviewed by Kirsop (1988a). Descriptions of culture collections and a species oriented directory are held at the World Data Center for Microorganisms (WDC) housed at RIKEN (Institute of Physical and Chemical Research) in Japan and are available on-line through the Internet GOPHER.

A number of major European culture collections, supported by the 'Biotechnology Action Programme' (BAP) of the European Community (EC) are involved in the establishment of the multinational 'Microbial Information Network Europe' (MINE). The project aims at the centralisation of data on strains in one database, in order to ensure easy and quick access to the data that will be available on-line at the German Institute for Medical Documentation and Information (DIMDI). In order to allow the electronic combination of data from different collections, a uniform system for computer storage and retrieval of strain data is being adopted. Formats for bacteria (Stalpers et al. 1990) and for yeasts and filamentous fungi (Gams et al. 1988) have been developed.

The Microbial Germplasm Data Network (MGD) is the USA National Network of scientists and collections containing microbial germplasm utilised primarily in plant related research. The MGD, which is available from Oregon State University, USA, has on the Internet almost a gigabyte of information provided by scientists describing their research culture collections. Accession data on plant pathogens, symbionts and biological control organisms are included and uploaded on a daily basis. Future developments at the MGD will include a spelling checker for taxonomic names, e-mail searches and responses for users not on the Internet, and image data relevant to microbial germplasm. Further information can be obtained through [email protected].

The Microbial Strain Data Network (MSDN) is an internationally sponsored information network that provides mechanisms for locating microorganisms and cultured cells with specific properties, through an electronic communication system (Kirsop 1988b). The MSDN is also a facilitating mechanism that builds links between databases. Many of the major collections are now making their catalogues available on-line through the MSDN network. The MSDN services make it possible to both search catalogues and order cultures on-line, and collections are increasingly taking advantage of these facilities and enriching their databases with taxonomic information.

Conclusions and outlook

The polyphasic approach to microbial systematics requires a broad spectrum of information, including access to primary molecular data, such as sequences, and to software tools for data analysis. The expansion of the taxonomic base is becoming very dynamic and is demanding a close monitoring of new developments in research methodology and data analysis. To facilitate the follow up of frontier developments, new mechanisms for the dissemination of information relevant to microbial systematics must be created.

The development of new taxonomic databases together with the creation of specialised services such as newsgroups, listservers, mailing lists and information archives are required for the advancement of microbial systematics. Long term planning and funding are fundamental for the success of the enterprise.

Considering its rapidly growing connectivity and relatively low cost to the end users, the Internet is the best route for the dissemination of taxonomic information at the present time. Initially, access to the Internet was a privilege only available to those at universities and governmental institutions able to finance organisational subscriptions to this worldwide network. However, new links have been established that provide access to individual users at very reasonable costs. The Internet is an attractive tool for research and public information programs and reaches regions of perceived remoteness such as Africa and Latin American countries, important stockholders of biological diversity. In industrialised countries high speed data transmission backbones are being established that will allow the transfer of massive amounts of data and images in real time. Developments in public domain software shared via the Internet will lead to a massive growth of public domain databases as end products of newly established National and International Biodiversity Programmes.

While it is true that there is much software and other information to be gathered on the Internet, it is also the case that software tools and databases are being developed that will be commercialised. Although these will not be free to the Internet users, a microbial systematics list/server/usergroup would serve as a locator for these software tools as well as linking people to databases and other resources of importance to taxonomic interests.

Daily, new ftp sites are being created, discussion lists are being established and biological databases are being set up on the Internet. There is a clear need for the development of co-ordinating and linking mechanisms in order to solve the problems related to the collection and analysis of primary data and their dissemination through electronic networks. Related international organisations should promote the development of common protocols, guidelines and standards to ensure data quality. Linking the existing computerised databases and information resources is important for making the data available worldwide. The existing models such as the MSDN and the Internet Gophers clearly indicate that the linkage of diverse distributed databases is feasible at low cost.

Following the developments in molecular biology and information technology it is anticipated that in the near future high quality and easily accessible taxonomic databases will be available and linked through common open protocols. These developments, associated with the creation of other information dissemination mechanisms such as listservers, newsgroups, ftp sites and information archives, will enlarge the knowledge base necessary for the elucidation of bacterial diversity and taxonomic complexity, allowing new adventures in the biotechnological exploitation of the microbial gene pool.

Acknowledgement

We would like to acknowledge the helpful reviews of the paper by Kevin Painting and Trevor Bryant.

References

Ahern K (1993) DNA Star/LaserGene. Biotechnology Software, 10:6-12

Altschul SF, Gish W, Miller W, Myers EW & Lipman DJ (1990) Basic local alignment search tool. J. Mol. Biol. 215:403-410

Anonymous (1991a) Sustainable biosphere initiative. Ecology 72:371-412

Anonymous (1991b) Systematics Agenda 2000: integrating biological diversity and societal needs. System. Zool. 40:520-523

Bairoch A (1993a) SWISS-PROT and 2D gel databases. Posted (May 1, 1993) at the [email protected]

Bairoch A (1993b) List of molecular biology e-mail servers. Version 1.50/July 30, 1993. Posted at [email protected]

Bairoch A (1993c) List of molecular biology FTP servers for databases and software. Version 1.50/July 30, 1993. Posted at bio- [email protected]

Balows A, Trüper HG, Dworkin M, Harder W & Schleifer K-H (Eds) (1993) The Prokaryotes - A Handbook on the Biology of Bacteria: Ecophysiology, Isolation, Identification, Applications. 2nd ed. Springer-Verlag, New York

Bello (1989) Computer applications in microbiology: making sense out of data. ASM News 55:71-73

Berlyn MB & Letovsky S (1992) Genome-related datasets within the E. coli Genetic Stock Center Database. Nucl. Acids Res. 20:6143-6151

BIOLOG (1992) BIOLOG Microstation: Automated Bacteria Identification System. BIOLOG Inc., Hayward, USA

Bochner BR (1989) Sleuthing out bacterial identities. Nature (London) 339:157-158

VanBogelen RA, Sankar P, Clark RL, Bogan JA & Neidhardt FC (1992) The gene-protein database of Escherichia coli: Edition 5. Electrophoresis 13:1014-1054

Bryant TN (1991) Bacterial Identifier: a Utility for Probabilistic Identification of Bacteria. Blackwell Scientific Publ., Oxford

Brondz I, Olsen I & Sjostrom M (1990) Multivariate analysis of quantitative chemical and enzymic characterization data in classification of Actinobacillus, Haemophilus and Pasteurella spp. J. Gen. Microbiol. 136:507-513

Bryant TN (1987) Programs for evaluating and characterising bacterial taxonomic data. CABIOS 3:45-48

Bryant TN (1993) A review of probabilistic identification matrices. A compilation of probabilistic bacterial identification matrices. Binary 5:207-210

Buchanan RE & Gibbons NE (Eds) (1974) Bergey's Manual of Determinative Bacteriology. 8th ed. Williams & Wilkins, Baltimore

Bull AT, Goodfellow M & Slater JH (1992) Biodiversity as a source of innovation in biotechnology. Ann. Rev. Microbiol. 46:219-252

Burks C, Lawton J & Bell G (1988) The LiMB database. Science 241:888

Bussard A, Krichevsky MI & Blaine LD (1985) An international hybridoma data bank: aims, structure, function. In: Macario AJL & Macario EC (Eds) Monoclonal Antibodies Against Bacteria (p. 287-311). Academic Press, Orlando

Canhos VP (1990) The impact of computers on culture collections. In: Sly I, Iijima I & Kirsop BE (Eds) 100 years of culture collections (p. 20-27). Institute for Fermentation, Osaka

Canhos VP, Lange DA, Kirsop BE, Ross E & Nandi S (Eds) (1992) Needs and Specifications for a Biodiversity Information Network (265 pp.). United Nations Environment Programme, Nairobi

Cerf V (1991) Networks. Sci. Amer. 265:72-84

Cinkosky M, Fickett JW, Gilna P & Burks C (1991) Electronic publishing and GenBank. Science 252:1273-1277

Colwell RR (1970) Polyphasic taxonomy of the genus Vibrio: Numerical taxonomy of Vibrio cholerae, Vibrio parahaemolyticus and related Vibrio species. J. Bacteriol. 104:410-433

Cowan ST & Steel KJ (1974) Manual for the Identification of Medical Bacteria. Cambridge University Press, Cambridge

Davis AW, Atlas RM & Krichevsky MI (1983) Development of probability matrices for identification of Alaskan marine bacteria. Int. J. System. Bacteriol. 33:803-810

Doolittle RF (Ed) (1990) Molecular evolution: computer analysis of protein and nucleic acid sequences. Meth. Enzymol. 183:1-736

Gams W, Hennebert GL, Stalpers JA, Janssens D, Schipper MA, Smith J, Yarrow D & Hawksworth DL (1988) Structuring strain data for storage and retrieval of information on fungi and yeast in MINE, the Microbial Information Network Europe. J. Gen. Microbiol. 134:1667-1689

Garavelli JS (1993) Announcements of the Protein Information Resource. Network Request Service. [email protected] (29 April 1993)

Goodfellow M & Minnikin DE (Eds) (1985) Chemical Methods in Bacterial Systematics. Academic Press, London

Goodfellow M & O'Donnell AG (Eds) (1993). Handbook of The New Bacterial Systematics. Academic Press, London

Goodfellow M, Jones D & Priest FG (Eds) (1985) Computer-assisted Bacterial Systematics. Academic Press, London

Green D (1992) Public Domain Databases for Networking Biodiversity. On-line contribution to [email protected], available through BDT gopher

Gurtler V, Wilson VA & Mayall BC (1991) Classification of medically important clostridia using restriction endonuclease site differences of PCR-amplified 16S rDNA. J. Gen. Microbiol. 137:2673-2679

Gutteridge CS, Vallis L & Macfie HJH (1985) Numerical methods in the classification of micro-organisms by pyrolysis mass spectrometry. In: Goodfellow M, Jones D & Priest F (Eds) Computer-assisted Bacterial Systematics (p. 369-401). Academic Press, London

Hall HJ (1989) Microbial product discovery in the biotech age. Bio/Technology 7:427-430

Hawksworth DL (1992) Biodiversity in microorganisms and its role in ecosystem function. In: Solbrig OT, van Emdem HM & Van Oordt PGWJ (Eds). Biodiversity and Global Change (p. 83-94). Monograph no. 8. International Union of Biological Sciences, Paris

Hawksworth DL & Colwell RR (1992a) Microbial Diversity 21: Biodiversity amongst microorganisms and its relevance. Biodiv. Conserv. 1:221-226

Hawksworth DL & Colwell RR (1992b) Biodiversity amongst microorganisms and its relevance. Biology International 24:11-15

Helm D, Labischinski H, Schallehn G & Naumann D (1991) Classification and identification of bacteria by Fourier-transform infrared spectroscopy. J. Gen. Microbiol. 137:69-79

Hinojosa-Ahumada M, Swaminathan B, Hunter SB, Cameron DN, Kiehlbauch JA, Wachsmuth IK & Strockbine NA (1991) Restriction fragment length polymorphisms in rRNA operons for subtyping Shigella sonnei. J. Clin. Microbiol. 29:2380-2384

James FC & McCulloch CE (1990) Multivariate analysis in ecology and systematics: panacea or Pandora's Box? Ann. Rev. Ecol. System. 21:129-166

Jurtshuk RJ, Blick M, Bresser J, Fox GE & Jurtshuk Jr P (1992) Rapid in situ hybridization technique using 16S rRNA segments for detecting and differentiating the closely related Gram-positive organisms Bacillus polymyxa and Bacillus macerans. Appl. Environ. Microbiol. 58:2571-2578

Kämpfer P & Kroppenstedt RM (1991) Probabilistic identification of streptomycetes using miniaturized physiological tests. J. Gen. Microbiol. 137:1893-1902

Kämpfer P, Kroppenstedt RM & Dott W (1991) A numerical classification of the genera Streptomyces and Streptoverticillium using miniaturized physiological tests. J. Gen. Microbiol. 137:1831-1891

Kellogg ST (1989) The state of computers. ASM News 55:22-25

Kersters K (1985) Numerical methods in the classification of bacteria by protein electrophoresis. In: Goodfellow M, Jones D & Priest F (Eds) Computer-assisted Bacterial Systematics (p. 337-368). Academic Press, London

Kirsop BE (1988a) Computerized databases for locating microorganisms: Functions and differences of major information resources. MIRCEN Journal 4:419-424

Kirsop BE (1988b) Microbial Strain Data Network: A service to biotechnology. Internat. Indust. Biotechnol. 8:24-27

Klinger JM, Stowe RP, Obenhuber DC, Grover TO, Mishra SK & Pierson DL (1992) Evaluation of the Biolog automated microbial identification system. Appl. Environ. Microbiol. 58:2089-2092

Kohara Y, Akiyama K & Isono K (1987) The physical map of the whole E. coli chromosome: Application of a new strategy for rapid analysis and sorting of a large genomic library. Cell 50:495-508

Krebs JR (1992) Evolution and Biodiversity: The New Taxonomy. Published by the Natural Environment Research Council

Krieg NR (Ed) (1984) Bergey's Manual of Systematic Bacteriology. Vol. 1. Williams & Wilkins, Baltimore

Krol E (1992) The whole Internet user's guide and catalog. 1st ed. O'Reilly & Associates, Inc., Sebastopol (California)

Langham CD, Sneath PHA, Williams ST & Mortimer AM (1989a) Detecting aberrant strains in bacterial groups as an aid to constructing databases for computer identification. J. Appl. Bacteriol. 66:339-352

Langham CD, Williams ST, Sneath PHA & Mortimer AM (1989b) New probability matrix for identification of Streptomyces. J. Gen. Microbiol. 135:121-133

Lawton J, Burks C & Martinez F (1989) Overview of LiMB database. Nucl. Acids Res. 17:5885-5899

Lawton J, Cinkosky M, Mishra S, Fickett J & Burks C (1992) Access to molecular biology databases. Math. Comput. Model. 16:93-101

Lennette EH, Balows A, Hausler Jr WJ & Shadomy HJ (Eds) (1985) Manual of Clinical Microbiology. American Society for Microbiology, USA

Liesack W, Ward N & Stackebrandt E (1991) Strategies for molecular microbial ecological studies. Actinomycetes 2:63-67

Lipman DJ & Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435-1441

Lucchini GM & Altwegg M (1992) rRNA gene restriction patterns as taxonomic tools for the genus Aeromonas. Int. J. System. Bacteriol. 42:384-389

Magee J (1993) Whole organisms fingerprinting. In: Goodfellow M & O'Donnell AG (Eds) Handbook of New Bacterial Systematics (p. 383-419). Academic Press, London

Manfio GP (1993) GelManager for DOS, GelManager for Windows: gel fingerprinting analysis. Binary 5:114-116

Manafi M, Kneifel W & Bascomb S (1991) Fluorogenic and chromogenic substrates used in bacterial diagnosis. Microbiol. Rev. 55:335-348

Mauchline WS & Keevil CW (1991) Development of the BIOLOG substrate utilization system for identification of Legionella spp. Appl. Environ. Microbiol. 57:3345-3349

McManus C & Krichevsky MI (1992) Self-Instruction Manual for MICRO-IS

Medigue C, Bouche JP, Henaut A & Danchin A (1990) Mapping of sequenced genes (700 kbp) in the restriction map of the Escherichia coli chromosome. Mol. Microbiol. 4:169-187

MIDI (1993) Microbial Identification System. MIDI, Newark, DE, USA

Monolou JC (1992) Biodiversity at the molecular level. In: Solbrig OT, van Emden HM & van Oordt PGWJ (Eds) Biodiversity and Global Change (p. 33-39). Monograph No. 8. International Union of Biological Sciences, Paris

Moyer NP, Martinetti G, Luthy-Hottenstein J & Altwegg M (1992) Value of rRNA gene restriction patterns of Aeromonas spp. for epidemiological investigations. Curr. Microbiol. 24:15-21

Norusis MJ (1982) SPSS introductory guide: basic statistics and operations. McGraw-Hill, Chicago

O'Brien M & Colwell RR (1987) Characterisation tests for numerical taxonomy studies. In: Colwell RR & Grigorova R (Eds) Methods in Microbiology: Current Methods for Classification and Identification of Microorganisms. Vol. 19 (p. 69-104). Academic Press, London

O'Donnell AG, Minnikin DE & Goodfellow M (1985) Integrated lipid and wall analysis of actinomycetes. In: Goodfellow M & Minnikin DE (Eds) Chemical Methods in Bacterial Systematics (p. 131-143). Academic Press, London

Olsen GJ, Larsen N & Woese CR (1991) The ribosomal RNA database project. Nucl. Acids Res. 19 (Suppl.): 2017-2021

Payne RW, Lane PW, Ainsley AE, Bicknell KE, Digby PGN, Harding SA, Leech PK, Simpson HR, Todd AD, Verrier PJ, White RP, Gower JC, Wilson GT & Paterson LJ (1989) Genstat 5 Reference Manual. Oxford University Press, Oxford

Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Meth. Enzymol. 183:63-98

Priest FG & Alexander B (1988) A frequency matrix for probabilistic identification of some bacilli. J. Gen. Microbiol. 134:3011-3018

Horizon (1992) RAPyD-400 User Manual. Horizon Instruments Ltd. Ghyll Industrial Estate, UK

Reid B (1993) Usenet readership report for January 1993. Usenet news lists

Rogosa M, Colwell RR & Krichevsky M (1986) Coding Microbiological Data for Computers. Springer-Verlag, New York

Rössler D, Ludwig W, Schleifer KH, Lin C, McGill TJ, Wisotzkey JD, Jurtshuk Jr P & Fox GE (1991) Phylogenetic diversity in the genus Bacillus as seen by 16S rRNA sequencing studies. System. Appl. Microbiol. 14:266-269

Sackin MJ (1987) Computer programs for classification and identification. In: Colwell RR & Grigorova R (Eds) Methods in Microbiology: Current Methods for Classification and Identification of Microorganisms. Vol. 19 (p. 459-494). Academic Press, London

Sackin MJ & Jones D (1993) Computer-assisted classification. In: Goodfellow M & O'Donnell AG (Eds) Handbook of New Bacterial Systematics (p. 281-313). Academic Press, London

Saddler GS, O'Donnell AG, Goodfellow M & Minnikin DE (1987) SIMCA pattern recognition in the analysis of streptomycete fatty acids. J. Gen. Microbiol. 133:1137-1147

Sanglier J-J, Whitehead D, Saddler GS, Ferguson EV & Goodfellow M (1992) Pyrolysis mass spectrometry as a method for the classification, identification and selection of actinomycetes. Gene 115:235-242

SAS (1992) SAS Introductory Guide for Personal Computers: release 6.03. SAS Corporation, USA

Sayler GS & Layton AC (1990) Environmental application of nucleic acid hybridization. Ann. Rev. Microbiol. 44:625-648

Schuler GD, Altschul SF & Lipman DJ (1991) A workbench for multiple alignment construction and analysis. Proteins 9:180-190

Sensititre (1992) Sensititre Operation Manual. Sensititre Corporation, UK

Skerman VDB, McGowan V & Sneath PHA (1980) Approved lists of bacterial names. Int. J. System. Bacteriol. 30:225-420

Smith UR (1993) A Biologist's Guide to Internet. Published monthly in the Usenet newsgroups sci.bio, bionet.general and news.answers, and archived as file 'biology/guide' in the anonymous ftp archive on pit-manager.mit.edu. '20 pages'

Sneath PHA (1974) Test reproducibility in relation to identification. Int. J. System. Bacteriol. 24:508-523

Sneath PHA (1977) A method for testing the distinctness of clusters: a test of the disjunction of two clusters in Euclidean space as measured by their overlap. J. Math. Geol. 9:123-143

Sneath PHA (1979a) BASIC program for significance test for clusters in UPGMA dendrograms obtained from squared Euclidean distances. Comp. Geosci. 5:127-137

Sneath PHA (1979b) BASIC program for a significance test for two clusters in Euclidean space as measured by their overlap. Comp. Geosci. 5:143-155

Sneath PHA (1979c) BASIC program for nonparametric significance of overlap between a pair of clusters using the Kolmogorov-Smirnov test. Comp. Geosci. 5:173-188

Sneath PHA (1979d) BASIC program for identification of an unknown with presence-absence data against an identification matrix of percent positive characters. Comp. Geosci. 5:195-213

Sneath PHA (1979e) BASIC program for character separation indices from an identification matrix of percent positive characters. Comp. Geosci. 5:349-357

Sneath PHA (1980a) BASIC program for the most diagnostic properties of groups from an identification matrix of percent positive characters. Comp. Geosci. 6:21-26

Sneath PHA (1980b) BASIC program for determining the best identification scores possible from the most typical examples when compared with an identification matrix of percent positive characters. Comp. Geosci. 6:27-34

Sneath PHA (1980c) BASIC program for determining overlap between groups in an identification matrix of percent positive characters. Comp. Geosci. 6:267-278

Sneath PHA (Ed) (1986) Bergey's Manual of Systematic Bacteriology. Vol. 2. Williams & Wilkins, Baltimore

Sneath PHA (1989) Analysis and interpretation of sequence data for bacterial systematics: the view of a numerical taxonomist. System. Appl. Microbiol. 12:15-31

Sneath PHA (1992) International Code of Nomenclature of Bacteria (1990 revision). American Society for Microbiology, Washington

Sneath PHA & Johnson R (1972) The influence on numerical taxonomic similarities of error in microbiological tests. J. Gen. Microbiol. 72:377-392

Sneath PHA & Langham CD (1989) OUTLIER: a BASIC program for detecting outlying members of multivariate clusters based on presence-absence data. Comp. Geosci. 15:939-964

Sneath PHA & Sackin MJ (1979) BASIC program for printing a coding sheet for unknowns that are to be identified against an identification matrix of percent positive characters. Comp. Geosci. 5:359-367

Sneath PHA & Sokal RR (1973) Numerical Taxonomy. Freeman, San Francisco

Stackebrandt E & Goodfellow M (Eds) (1991) Nucleic Acid Techniques in Bacterial Systematics. John Wiley & Sons, Chichester

Stackebrandt E, Wunner-Füssl B, Fowler VJ & Schleifer K-H (1981) Deoxyribonucleic acid homologies and ribosomal ribonucleic acid similarities among sporeforming members of the order Actinomycetales. Int. J. System. Bacteriol. 31:420-431

Stackebrandt E, Witt D, Kemmerling C, Kroppenstedt R & Liesack W (1991) Designation of streptomycete 16S and 23S rRNA-based target regions for oligonucleotide probes. Appl. Environ. Microbiol. 57:1468-1477

Ståhl M, Molin A, Ahrné S & Ståhl S (1990) Restriction endonuclease patterns and multivariate analysis as a classification tool for Lactobacillus spp. Int. J. System. Bacteriol. 40:189-193

Staley JT, Bryant MP, Pfennig N & Holt JG (Eds) (1989) Bergey's Manual of Systematic Bacteriology. Vol. 3. Williams & Wilkins, Baltimore

Stalpers JA, Kracht M, Janssens D, DeLey J, Van der Toorn J, Smith J, Claus D & Hippe D (1990) Structuring strain data for storage and retrieval of information on bacteria in MINE, the Microbial Information Network Europe. System. Appl. Microbiol. 13:92-103

Walczak CA & Krichevsky MI (1982) Computer-aided selection of efficient identification features and calculation of group descriptors as exemplified by data on Capnocytophaga species. Curr. Microbiol. 7:199-204

Walczak CA, Blaine L & Krichevsky MI (1988) The CODATA/IUIS Hybridoma Data Bank: development of a hybrid system to handle complex data relationships. Comp. Meth. Progr. Biomed. 88:275-285

Wayne LG, Krichevsky EJ, Love LL, Johnson R & Krichevsky MI (1980) Taxonomic probability matrix for use with slowly growing mycobacteria. Int. J. System. Bacteriol. 30:528-538

Willcox WR, Lapage SP & Holmes B (1980) A review of numerical methods in bacterial identification. Antonie van Leeuwenhoek 46:233-299

Williams DM (1992) DNA analysis: Theory. In: Forey PL, Humphries CH, Kitching IJ, Scotland RW, Siebert DJ & Williams DM (Eds) Cladistics: A Practical Course in Systematics (p. 89-101). Oxford University Press, Oxford

Williams ST, Sharpe ME & Holt JG (Eds) (1989) Bergey's Manual of Systematic Bacteriology. Vol. 4. Williams & Wilkins, Baltimore

Williams ST, Locci R, Vickers, Schofield GM, Sneath PHA & Mortimer AM (1985) Probabilistic identification of Streptoverticillium species. J. Gen. Microbiol. 131:1681-1689

Winker S & Woese CR (1991) A definition of the domains Archaea, Bacteria and Eucarya in terms of small ribosomal RNA characteristics. System. Appl. Microbiol. 14:305-310

Wishart D (1987) Clustan User Manual, 4th edition. Computing Laboratory, University of St. Andrews, Scotland, UK

Woese CR, Kandler O & Wheelis ML (1990) Towards a natural system of organisms: proposal for the domains Archaea, Bacteria and Eucarya. Proc. Nat. Acad. Sci. USA 87:4576-4579

Woods TC, Helsel LO & Swaminathan B (1992) Characterization of Neisseria meningitidis serogroup C by multilocus enzyme electrophoresis and ribosomal DNA restriction profiles (ribotyping). J. Clin. Microbiol. 30:132-137