
The Semantic Grid and Chemistry: Experiences with CombeChem

K. Taylor, J. W. Essex, J. G. Frey
School of Chemistry, University of Southampton, Southampton, SO17 1BJ, UK
Tel: +44 (0)23 8059 3209
{ktn1, j.w.essex, j.g.frey}@soton.ac.uk

H. R. Mills, G. Hughes, E. J. Zaluska
School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK
{hrm, ejz}@ecs.soton.ac.uk

Abstract

The CombeChem e-Science project has demonstrated the advantages of using Semantic Web technology, in particular RDF and the associated triplestores, to describe and link diverse and complex chemical information. This coverage spans the whole process of generating chemical knowledge: from its inception in the synthetic chemistry laboratory, through the analysis of the materials made (which generates physical measurements) and the computations based on these data that develop interpretations, to the subsequent dissemination of the knowledge gained. The RDF descriptions employed allow for a uniform description of chemical data in a wide variety of forms, including multimedia, and of chemical processes both in the laboratory and in model building. The project successfully adopted a strategy of capturing semantic annotations ‘at source’ and establishing schemas and ontologies based closely on current operational practice in order to facilitate implementation and adoption. We illustrate this in the contexts of the synthetic organic chemistry laboratory with chemists at the bench, computational chemistry for modelling data, and the linking of chemical publications to the underlying results and data to provide the appropriate provenance. The resulting ‘Semantic Data Grid’ comprises tens of millions of RDF triples across multiple stores, representing complex chains of derived data with associated provenance.

1. Introduction

The objective of Grid computing is to bring a variety of computational and data resources together to create new capabilities. Ten years ago there was an emphasis on combining the resources of supercomputers with high-speed wide-area networking to provide very-large-scale data processing. As Grid computing has evolved it continues to focus on bringing geographically-separated resources and services together, but the emphasis has now shifted onto Virtual Organisations (VOs) as defined in 2001 by Foster [14]: “The real and specific problem that underlies the Grid concept is coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations… A set of individuals and/or institutions defined by such sharing rules form what we call a virtual organization.” The VOs exist even if these rules are not specified or explicitly articulated; indeed this is a feature of many “real” VOs formed from existing communities in an ad hoc manner. These VOs must be able to evolve dynamically. In the past few years there has been an additional change in emphasis to provide the necessary knowledge and semantic framework to enable the automated management and sharing of complex resources, facilitating the realisation of VOs. The “Semantic Grid” [10,11,21] thus created provides an infrastructure where complex applications and services can be deployed with minimal manual intervention. This level of automation is required to deal with the increasing rate at which scientific data is being generated and needs to be processed, if the integration of the data and its transformation into information and knowledge is to keep pace with the generation of the data [23]. This increasing rate of data production will inevitably require more automation, and hence formalisation of the rules that underpin the production of acceptable data and information within the Chemistry community, so as to facilitate the increased use of automatic validation of data. This already occurs in some areas of Chemistry (e.g. there is a series of steps and checks required to publish crystallographic data which involve automated and human checks, followed by submission to a database prior to formal publication [7]). This paper reports on our experiences in building a Semantic Web [2,53] infrastructure for chemical research as part of the CombeChem project funded by the U.K. e-Science programme [24]. We set out to use (and extend) the available Grid and semantic technology to support the entire chemical research sequence. This typically starts from an experiment producing data, which is then searched, in the context of the available literature, for relevant patterns (in the limit of the very large amounts of data available from some combinatorial approaches this becomes a data mining exercise); these patterns lead to results, conclusions and publications, which in turn lead to further experiments. The majority of the progress depends on individual scientists and small research groups building on the results already produced by others. While in the past this process has served the science community well, it is now in danger of paralysis from the sheer quantity of data being produced. We considered it essential that semantic support be introduced at every stage in the process (“metadata@source”) to facilitate automated mechanisms for research support, providing an infrastructure where complex applications and services can be deployed with minimal manual intervention.
Techniques are needed to disseminate data that would otherwise be lost to the wider community due to lack of time or of immediate value to the producer, and the barrier to dissemination with high-quality metadata (and provenance) needs to be reduced. We describe our solution as a “Semantic Data Grid”, as seen from a Grid perspective in [50,51]. It is essential to build the right kind of software support that improves the human experience of conducting the research, so that the digital world actually facilitates the conduct of high-quality science and acts to improve the efficiency, reliability and quality of the investigations. Collectively we refer to the CombeChem support software, together with the supporting knowledge infrastructure (ontologies), as the ‘SmartLab architecture’. This architecture has been designed to enable effective capture of both data and metadata at the earliest opportunity during the scientific investigation, part of our vision of “Publication@Source” [16,17,42]. Once captured, the data and metadata material is maintained and organized as it traverses the virtual organization that represents the whole Chemistry community involved in converting a new piece of data into accepted chemical fact and knowledge. This includes incomplete, inconsistent and time-varying information. Our methodology has been to draw as far as possible on established chemistry practices and then augment them. This is both principled and pragmatic – the principle is that by constructing schemas and ontologies based on current operational practice we have a solution that is known to work, and we regard this as an essential first step to facilitate deployment and adoption. This bottom-up approach needs to be complemented with a top-down design to obtain full value from the use of ontologies. While the general field of chemical research provided the motivation for the CombeChem project, at every stage we have endeavoured to create more general solutions that would retain applicability beyond the domain of chemical research. The chemistry domain, for example, ranges from small-scale laboratory work to the use of large national and international facilities, from experimental studies through to purely computational and theoretical studies, and connects with chemical engineering on one side and with the biological arena on the other. The choice of such a wide-ranging discipline provides a very wide breadth of experience relevant to comprehensive scientific investigation. This ensures that the procedures and understanding developed in CombeChem are not confined to a narrow niche view but can encompass in a single environment all the features necessary for such investigations. We believe that we have demonstrated wide and generic applicability and that our experiences will be relevant to a wide variety of e-Science and general research. In Section 2 we describe the information environment encompassing chemistry research, what in Grid computing terms is called a “Data Grid”, and we explain why Semantic Web technologies are an appropriate solution, leading us to the “Semantic Data Grid”. We then go on to explain aspects of our Semantic Web design in Section 3 and of systems design in Section 4. Related work is outlined in Section 5, and evaluation is discussed in Section 6. The conclusions and future work form Section 7.

2. The Chemistry Data Grid

2.1 From Computational Grid to Data Grid

We believe that in the chemistry domain the advantages of a Grid approach lie not only in the traditional ‘Computational Grid’ [9] but rather more significantly in the ‘Data Grid’, by which we mean the systems required to support the integration of data arising from many diverse and geographically-distributed sources. The UK e-Science programme, largely due to many of the application-led projects, took on a significant Data Grid flavour from the start. No amount of computation, no matter how inexpensive or freely provided, will be useful if the data needed to drive the models are not available. In the chemistry community a major source of data is the individual scientist working in a traditional chemistry laboratory, supplemented today by the advent of automated high-throughput synthetic and analytical technology (especially in the industrial sector). Data is generated in laboratories spread throughout the entire worldwide community, each laboratory contributing significant discoveries, which are collated via the worldwide literature. Several different organizations correlate the various materials produced across this wide and diverse community, and potentially provide significant added value by evaluating the material. Fewer organizations perform this evaluation now than in the past, when governmental funding was available for the evaluation process; the task now falls to the user, and so more automated evaluation processes are required. This in turn requires that the information needed to support evaluation by people other than full domain experts be propagated along with the data. Parallel to the worldwide academic community there are the large islands of semi-isolated but heavily-loaded commercial chemical data sources, which represent the industrial component of the chemical market (particularly, over the last half century, the pharmaceutical sector). Recognising that the digital world necessary to implement the Data Grid vision must start as early as possible along the information chain, the CombeChem project set out to support the relatively small-scale needs of these everyday scientists in recording experiments. This led to a CombeChem interest in the Electronic Laboratory Notebook (ELN) research area as part of the overall concept for the ‘SmartLab Architecture’ to provide effective semantic support for experimental and computational science. The traditional paper laboratory notebook has in the past been the workhorse of scientific experimentation. While the flexible and robust paper-based lab book has been the de facto experimental recording tool literally for centuries, there are significant problems when it is used in the new world of digital science. When scientists wish to share data or move towards electronic publication of complete and detailed results, as in the Publication@Source vision discussed later, the paper lab book becomes an obstacle: data captured in the lab book is invisible to those who cannot access it physically. Transfer of the data from the laboratory notebook onto a computer adds an extra (potentially error-prone) step to the discovery process, and inevitably material will only be selectively transcribed. Transcription also makes accurate time-stamping more difficult, and thus the association with parallel records of the experimental environment more error-prone. These individual laboratory data sources are a perfect example of a Data Grid, or rather of the need to be able to handle one. Each site acts almost independently but adheres (more-or-less) to standard chemical practice. This means that considerable human effort is required to assemble, assign, correlate, document and standardise the data to enable the whole community to access it using conventional relational database mechanisms. While this has been achieved in several sub-domains of chemistry (for example, crystallography), as a whole the required human effort is simply too great and exceeds the resources available. The quantities of data generated can be quite considerable: the National Crystallography Service generates about 1 GB of data per day, and a bio-molecular simulation can run to terabytes in size. However, it is the complexity of the data describing experiments, processes and experimental values that has the most implications for the way the data is represented and stored.

2.2 From Data Grid to Semantic Grid

In all the stages of the chemical data pathway that we aim to support in the vision described above, data is being exchanged between people and computers, and increasingly and importantly between computers and computers. In current practice the subsequent ability to use the data in the interpretation of a chemical reaction mechanism or structure depends heavily on the scientist keeping explicit track of the metadata (for example, which sample compound is associated with a spectrum and under what conditions). Depending on the circumstances, the amount of metadata needed in the interpretation phase can be very considerable. All the relevant metadata, which is usually effectively “hidden” inside a written laboratory notebook, needs to be made accessible digitally if the full context of the data item is to be properly appreciated. In the CombeChem vision all of the constituent players in the relevant Virtual Organisation need access [8,16]. Making this information available in real situations, in which the data is owned and stored by different players, is a difficult problem. Access involves crossing administrative domains, for example from international journals, via institutional repositories, to individual research groups [42]. The effective representation of chemical data poses a significant problem for the relational database model, which is typically the established solution to large-scale data integration. Any representation must be multidimensional, and in addition a vast quantity of supplementary information is required to give any particular datum its meaning. The melting point of a particular compound is a common and useful property, but what do we mean when we say that compound X melts at such and such a temperature? The value is just a simple number, with units such as Celsius, but the truth of the matter is much more complicated; for example, given that compound X is a particular chemical species:

• How pure is it?
• What form has it crystallised in?
• Did it melt over a range of temperatures?
• How accurate was the apparatus used to measure the melting point?
• What if the compound sublimed and never went through the liquid phase?
• At what pressure was the measurement taken?
• What if it began to decompose from heating before or while it melted?

This results in a very long and thin data structure. Perhaps in a given context we neither care nor know, but that does not excuse the data store from needing to specify these things. Too many data sources gloss over these details, relying on users not requiring such information, or simply not being aware of it, or already knowing it because of expertise they already have. The reasons are obvious: this sort of data is intrinsically hard to describe and is easily forgotten or ignored, as well as being significantly more difficult to extract from original literature references.

All this (metadata) information is needed to describe the quantity fully. For the digital world it must be captured as automatically as possible in a computer-readable form, as this is the only realistic method to guarantee the quality of the description. This is simply another way of stating that the provenance of an item of information is essential in understanding its value and importance. The importance of provenance cannot be overstated, because an item of information is rendered almost useless if the details of its provenance are not known (in practice this leads to experiments being unnecessarily repeated). Humans can often infer provenance from the values (e.g. if the value is within a certain range, then it is probably a particular type of measurement), but it is not always clear and the rules cannot always give an unambiguous answer. This procedure should not be needed, since when the measurement was originally made it was known with certainty what was being measured (or at the very least what was doing the measuring, which is not always quite the same thing!). There are often many measurements of the same quantity in the literature, and choosing the most appropriate value, that is, evaluating the quality and relevance of the data, is greatly assisted by the provision of the provenance trail. Inference would not be needed if the descriptions had been given with the data, and such descriptions are essential if reliable automated analysis is to be undertaken (or, where full reliability is not possible, if the nature of the uncertainty is to be quantified). Rule-based approaches may be able to add back this information, but they have the usual problems associated with attempts to capture expert knowledge; nevertheless they will still be necessary to interpret legacy data and to check the consistency of the supplied metadata. However, even though there is no excuse for failing to adopt a modern approach to data capture, common practice often still results in the numeric values being stored in a file and a separate description of the data being written in a lab book. It is clearly essential to do more to tie these two facets of the same piece of information together from the very start, and moreover to enable the information to be read by all users (human and computer). However, the simple solution to scientific data input mentioned above (supplementing the numeric item with textual notes associated with a database entry) makes processing the retrieved numbers much harder, if not impossible, unless there is a consistent approach to the textual fields provided. Given that the main role of the database is to provide fast and easy access to the data, such a solution is far from satisfactory. Solving these problems using a more complex database design results in an exceedingly convoluted and unintuitive database structure which is difficult to maintain and modify. Outside of large commercial companies this is not a viable option, as not only does proper database design demand database development professionals, but the continued maintenance necessary will also amount to significant expense. The more complex the system, the harder it becomes to alter anything, and it is easily conceivable that the normal process of innovative scientific research will produce new data that cannot be stored in the database without a complete redesign. The CombeChem approach of using RDF to capture the semantic content and describe the scientific data provides a viable solution to many of these problems by providing flexibility as well as a variable structure.
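As an illustration of the kind of triple pattern involved, the following sketch uses Python and the rdflib library to record a melting-point measurement together with its uncertainty, units, conditions and source. The namespace and property names are invented for illustration and do not correspond to the actual CombeChem schema.

from rdflib import Graph, Literal, Namespace, RDF, URIRef
from rdflib.namespace import XSD

# Hypothetical vocabulary; the real CombeChem terms differ.
CC = Namespace("http://example.org/combechem#")

g = Graph()
compound = URIRef("http://example.org/compound/X")          # unique identifier node
measurement = URIRef("http://example.org/measurement/42")   # one observation of a property

# The measurement hangs off the compound rather than sitting in a fixed table column.
g.add((compound, RDF.type, CC.ChemicalSpecies))
g.add((compound, CC.hasProperty, measurement))

# The bare value is meaningless on its own; units, conditions and provenance travel with it.
g.add((measurement, RDF.type, CC.MeltingPoint))
g.add((measurement, CC.value, Literal(387.2, datatype=XSD.double)))
g.add((measurement, CC.units, CC.kelvin))
g.add((measurement, CC.uncertainty, Literal(0.5, datatype=XSD.double)))
g.add((measurement, CC.pressure, Literal("101.3 kPa")))
g.add((measurement, CC.measuredBy, URIRef("http://example.org/people/someone")))
g.add((measurement, CC.source, URIRef("http://example.org/experiment/2005-03-17")))

print(g.serialize(format="turtle"))

A second melting point for the same compound simply adds another measurement node with its own provenance; no schema change is needed.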
Provenance is a hard problem and an area of active research, and we aim to go some way towards supporting the use of provenance by elevating the provenance metadata to the same level as the data, holding it as individual steps, processes and so on rather than as bland text strings. The RDF approach stores all the information as a set of triples. A pattern of triples can reflect any number of dimensions for any number of data types with relative ease. A second instance of a property for a particular molecule merely requires another set of triples rather than another layer of abstraction in the database. Where the relational database demands that all data fits the schema, an RDF schema can be redefined to accommodate new data beyond the original scope. The explicit nature of the triple – every data point requiring a declaration of what it is and what it relates to – may appear hugely inefficient by defining things in such a verbose manner, but this is precisely what creates the flexibility. Nothing is assumed or left to the software to guess, so the limitation on data structures comes in the form of the software that deals with the output, and common sense for what should and should not be handled. If everything is logically spelled out in a net of interlinking data, the data is self-describing and can live on even without the database in which it is contained. This brings the added advantage that we can exchange our database software for a newer product while retaining the underlying knowledge. The need to address real-time data will in the future further stress these information systems. In exchange for the flexibility of RDF, we accept a potential loss in speed. Relational databases have been developed and optimised for commercial use over several decades and are presently the fastest way of storing and retrieving large volumes of data. In contrast, RDF triplestores are a relatively new tool, with neither a significant mass of software nor the abundant experience that make the deployment of such a technology easier. Although the mechanics of triplestores vary, processing the RDF graph is inherently slower than demanding particular data directly in a particular form from a relational database. It is worth stressing that the performance limitations of triplestores are not yet well known, although they cannot be expected to match the speed of relational databases with present technology, and so we must compromise one way or the other: speed and simplicity versus flexibility. The choice depends on the function of the database. For chemical data, it is possible to supply so much supplementary information of potential future importance that flexibility appears to be the clear winner. This trade-off needs to be viewed in the context of the acceleration of the scientific process afforded by the flexibility of this approach. In the future we can envisage using the flexibility of the triplestore to accumulate the rapidly growing types of information and, once these have become established and integrated with other data sets, generating optimised relational databases for frequent rapid access.

3. Semantic Web design aspects of the CombeChem Data Grid

We set out to provide comprehensive semantic support for the whole spectrum of chemistry research in as transparent a fashion as possible. A fundamental part of this strategy was to study established practice in the field and to introduce as few changes to normal everyday working practice as possible. This was part of the objective of capturing as much metadata as possible at source (i.e. as it is generated), completely automatically. We adopted the additional premise that it would be impossible to predict in advance the way that data would be accessed and used, hence flexibility of use was a fundamental objective. This led directly to the requirement that the information infrastructure to hold this data and metadata should be as general as possible. Our design approach adopted five principles:

• Grounding in established operational practice – our starting point is to study chemists at work;

• Capturing a rich set of associations between all types of things, expressed pervasively in RDF and hence explicitly addressing the sharing of identifiers;

• Metadata capture should be automated as far as possible – our goal is efficient augmentation not disruption;

• Information will be reused in both anticipated and unanticipated ways;

• The storage, maintenance and transport of metadata will be given consideration equal to that of the data, ensuring availability of accurate metadata, a dependable provenance record and a comprehensive understanding of the context of the data.

The next three sections describe the design of the principal schemas.

3.1 The RDF Graph for Molecular Properties

As mentioned previously, the traditional relational database model demands a sizeable development period prior to the operation of the database. Exact requirements must be deduced and a detailed database schema drawn up before the system can be implemented. Any mistakes or oversights made during this design process cause significant problems later on. If the schema cannot capture the desired data, the database must be rebuilt from the bottom up, with the old data accordingly modified to fit within the new requirements. Adding an additional column of information about particular records can be a difficult task, and is in no way as simple as a typical spreadsheet representation may imply. RDF, building on XML data descriptions in chemistry [15,36-8,43], offers a solution to many of these problems by providing flexibility and variable structure. To design the structure of the RDF graph we determined the identity of most importance, which for this work is the chemical species. It should be noted that the emphasis could be placed elsewhere. If one were concerned with the properties of mixtures, then the mixture identity might be the hub from which all other data stems, or one might choose to focus on individual measurements first, and describe what they relate to as a subsidiary. Correct choice of the central identity makes data access simpler and aids understanding of the data. In fact, due to the nature of various chemical processes, the specified chemical species of a physical sample may differ over time from the initial assertion, as more becomes known. The unique identifiers associated with real-world objects are good examples of invariant identifiers; e.g. a sample ID, or the ID of a piece of equipment. It is important to note that in a freeform RDF data structure, none of the objects are compulsory, nor are we limited to just one of each. Multiple entries for some data items are meaningless, while for others they are essential, as they represent the results of different observations of the same quantity made by different people; a molecule can have any number of properties assigned to it, and many measurements of those properties. Also, although no element is compulsory, it is important that most items are included to allow different items to be located easily. It would not be sensible to store molecular properties if the molecule had no unique identifier to locate it, or if there were no record of the place from which the property came.

FIGURE 1 HERE

Figure 1. The schema for the CombeChem data grid, based around chemical properties. Objects are marked as ellipses, the arrows show how predicates link objects together, and rectangles are literal values.

At the root of this schema, shown in Figure 1, is a node that identifies each molecule uniquely. Originally this was to be the InChI [26] for that molecule. An InChI is a character string that uniquely describes an organic molecule based on its structure. Unfortunately, there is a problem with this, as InChI strings can become very long and thus impractical as a primary index. To rectify this, an SHA1 hash is taken of the InChI string and used as the fixed-length identifier node. Extending from this central node are two distinct layers of information. The first contains information about the molecule that is independent of state and conditions, while the second consists of all the information about properties where these factors are important. The second layer (marked Physical Property) makes up the bulk of the data and contains the provision for complete tracking of the origins of any data, as well as the data itself. Every value is associated not only with its molecule, but also with an expression of uncertainty in that value, the units of the value, how trustworthy that value is (in case we should later discover a problem), the source of the data and the method by which it was obtained. This mechanism is designed to be sufficient to provide a trail of information by which individual data points can be traced back to their original source and reproduced if need be. Such point-by-point inspection has rarely been possible in existing databases, and even then verification has seldom proceeded past a simple journal reference. This scheme for describing chemical data is the product of several iterations of development in which the starting ideas were gradually extended until everything required could be described in a suitable form. The need to impose some structure on the underlying RDF statements in order to handle the input into the triplestore suggested that, in the first instance, all information should be grouped by the molecule to which it applies; i.e. the head node would be a unique identifier for the molecule. This is not as simple a choice as it appears, because what chemists view as the same molecule depends on the context. While the InChI provides a very useful method of generating a URI for a given molecule from its structure, it is in some ways too specific to allow sensible chemically-aware searching and indexing of molecules and their properties. A higher-level grouping was considered necessary to be able to generate a useful database, although other less-specific identities would also be useful. This highlights the fact that chemical equivalence is a context-dependent concept. To take a particular example of isomerism in molecular structure (the same chemical formula, hence the same atoms, but arranged differently in space), because of the pharmaceutical context for some of the work (e.g. drug design) we needed to consider the possibility of the enantiomeric forms of molecules (the pair of molecules that are simply mirror images of each other but have very different biological properties).
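A minimal sketch of this identifier scheme, using standard Python libraries; the URI layout and property name are illustrative rather than the actual CombeChem ones.

import hashlib

from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespaces for illustration only.
MOL = Namespace("http://example.org/molecule/")
CC = Namespace("http://example.org/combechem#")

def molecule_node(inchi: str) -> URIRef:
    """Return a fixed-length identifier node derived from an InChI string.

    InChIs can be arbitrarily long, so the SHA1 hash of the string is used
    as the primary index instead of the InChI itself.
    """
    digest = hashlib.sha1(inchi.encode("utf-8")).hexdigest()
    return MOL[digest]

g = Graph()
benzene_inchi = "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"
node = molecule_node(benzene_inchi)
g.add((node, CC.hasInChI, Literal(benzene_inchi)))  # keep the full InChI as a property
print(node)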
The end user searching for a particular chemical structure will probably not specify the complete stereochemistry of the target molecule if they are working from a drawn structure or from a chemical name, so ambiguities about which enantiomer or other type of isomer is meant could easily be present in the query. Considering materials as well as the constituent molecules means that polymorphism (where the same molecules can adopt different 3D packing structures when forming a crystal, which can dramatically alter their macroscopic properties) also needs to be considered. However, no molecule-based chemical identifier can currently capture polymorph information; nor indeed do many databases of measurements even provide information about the polymorph they refer to. We have adopted a system that directly associates the polymorph with the 3D crystal structure. The next consideration was the handling of properties. Several properties useful for indexing (molecular weight, reference codes of other databases such as those from the Cambridge Crystallographic Data Centre, CCDC) are completely independent of where they came from, and so it is unwieldy to build them into the same system that accommodates properties with the associated overhead of provenance that is needed for experimentally-determined quantities. It was decided that these properties that depend only on chemical structure should be separated out to allow rapid location of data. Three-dimensional structures were not included in this, as they are dependent on the method used to obtain the structure, such as which force field was applied to a calculation, or whether it was an X-ray or neutron structure. Associating 3D structures with methods allows several structures to be given for a molecule, such as those generated from X-ray diffraction data or those produced by high-level quantum simulations. Further calculations may be performed, in which case it is vital we know which structure they began from, should we need an explanation for an unusual result. Some structure files have an internationally agreed naming convention that gives away their application; e.g. CIF files are molecular structures from an X-ray crystallography diffraction experiment. This implied ontology is used to provide the meaning for these files without further explicit definition within the store. Yet another division that was considered relates to the phase to which the property applies (gas, liquid, liquid-crystal, solid, plasma, etc.). For example, one may have the density of a substance in any of the common states of matter, and computed structures may be created in vacuum or in a solvent shell. These must be differentiated somehow, and it would be natural to partition some of the physical properties by the phases they apply to (although this was not implemented here), while other properties can apply to many phases but with different values in each, depending on the nature of that phase (e.g. different solid structures have different melting points). This information is also implicit in the other data we would like to store about properties. Properties in which phase is important should be stored alongside the conditions that control the phase, such as temperature. When taking data from the literature (as opposed to determining it experimentally), if we do not already know the melting or boiling point of a compound, then we have no way of deciding which phase to put it in (for results that are viewed at room temperature), and hence we would be forced to leave this facet as unknown. Clearly the view of which phase to place a molecule in does depend on the context: a laboratory chemist will have a different view from someone studying the chemistry of the Earth’s cold upper atmosphere, or of the much hotter surface of Venus. This is an example of the multiple classifications that are needed and remains for future consideration. With this new approach applied to data capture and storage, it is possible to filter data based on author, data source, method, accuracy, conditions and molecular properties such as relative molecular mass. None of these filterable properties is particularly new, but in combination they far exceed the scope of presently available products.
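The kind of filtering this enables can be sketched as a SPARQL query over the illustrative vocabulary used earlier; the property names and the method URI are assumptions, not the project's actual terms.

from rdflib import Graph

g = Graph()
# g.parse("combechem-export.ttl")  # assumed local dump of the property graph

# Find melting points, with their units and sources, restricted to one measurement method.
query = """
PREFIX cc: <http://example.org/combechem#>
SELECT ?molecule ?value ?units ?source
WHERE {
    ?molecule cc:hasProperty ?m .
    ?m a cc:MeltingPoint ;
       cc:value ?value ;
       cc:units ?units ;
       cc:source ?source ;
       cc:method cc:DifferentialScanningCalorimetry .
}
ORDER BY ?molecule
"""

for row in g.query(query):
    print(row.molecule, row.value, row.units, row.source)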
Even more usefully, we can also isolate data that is unexpected and examine the supplementary information for possible reasons for its abnormality. If a value appears anomalous and the original data proves incorrect, we can mark it as untrustworthy so that others will know not to place too much faith in it. The experiment behind an overtly incorrect value might be repeated in the laboratory and the new, correct value placed in the database, but the old value is not lost, merely superseded. We maintain a trail of precedence that guarantees we keep the original data, although the newer value is selected by preference. This data then feeds into the scientific data processing. The creation of original data is accompanied by information about the experimental or computational conditions under which it was created. There then follows a chain of processing such as aggregation of experimental data, selection of a particular data subset, statistical analysis, or modelling and simulation. The handling of this information may include explicit annotation of a diagram or editing of a digital image. All manipulation of the data through the chain of processing is effectively an annotation upon it, and the provenance is explicit.
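The supersession mechanism above can likewise be expressed as a handful of triples; again the vocabulary is invented for the sketch.

from rdflib import Graph, Literal, Namespace, RDF, URIRef

CC = Namespace("http://example.org/combechem#")  # illustrative, as before
g = Graph()

old = URIRef("http://example.org/measurement/42")
new = URIRef("http://example.org/measurement/97")

# The suspect value is never deleted; it is flagged and superseded,
# so the trail of precedence back to the original data is preserved.
g.add((old, CC.trustworthy, Literal(False)))
g.add((new, RDF.type, CC.MeltingPoint))
g.add((new, CC.value, Literal(385.6)))
g.add((new, CC.supersedes, old))

print(g.serialize(format="turtle"))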

3.2 An RDF framework for Scientific Units

We have also found it necessary to develop a units system to capture the semantics of scientific units. RDF is used to create a network of units and quantities that can be effortlessly extended with new units and conversions without requiring any rewritten software. It has several advantages over the existing XML methods: it rigorously limits the ways in which units relate to each other, and it clearly addresses issues of dimensionality, convenience and functionality. The result is a system for which it is easier to write reliable software, as explained in Section 5.

All practical scientific research and development relies inherently on a well-understood framework of units to exchange quantities between people in different places and times. A quantity is a numerical value together with the unit of measurement. The unit is a crucial part of the metadata associated with the numerical value, so important that it should perhaps be considered part of the data item itself, rather than metadata. Understanding a quantity, and converting its numerical value between different units, requires appropriate standards, for example for the base units and for the conversion between quantities expressed in different systems (SI, Imperial). Indeed, an essential part of all scientific training is directed to the appreciation and understanding of this topic. While at first sight it appears that the potential issues should by now be well understood, and indeed handled automatically, closer investigation reveals that there are still many potential pitfalls.[1]

Exactly the same potential problems exist in all e-Science applications where computers exchange information. Whenever data is exchanged between different systems an unambiguous definition of the units is essential. Units range from simple definitions to complex constructions (almost infinite grouping of the base units is permissible), and it is these complex constructions that particularly require a compact yet unambiguous computer representation. While significant progress has been made using XML to describe scientific units, this approach falls well short of the functionality necessary for a Semantic Data Grid. e-Science applications must be able to import and export scientific data accurately and without any necessity for human interaction (or potential error). We have found that RDF provides a suitable and practical means to describe scientific units to enable their communication and inter-conversion, and thereby solve a major problem in the reliable exchange of scientific data. We might choose to store all numbers according to SI conventions (see, for example, the IUPAC Green Book for Quantities, Units and Symbols in Physical Chemistry [33]) and therefore confine ourselves to a single system of units and dimensions. Unfortunately many fields of science have units that predate SI and are conveniently scaled for analysis, and so storing these values in SI units would be inconvenient as well as requiring careful conversion. If on the other hand we decide to retain the original units, we face the prospect of other people choosing to use different units. This is almost inevitable given that different purposes may lead us to use any of SI, British Imperial, US Imperial, CGS, esu, emu, Gaussian and atomic units, or more ancient or country-specific systems. We may then have two measurements on different scales that might well be the same, and yet we have no way of comparing the two without reaching for a pocket calculator and tables of constants. Here we will concentrate on conversions within SI to illustrate the main principles of the proposed RDF representation of quantities (i.e. numerical value and unit). Inevitably units from different systems are encountered in the same setting and conversion must take place so that values may be compared or combined. The majority of conversions are exchanges of one unit for another, such as Celsius for Kelvin, or yards for metres. Such conversions, when recognised, present no real challenge, but not all conversions are so straightforward. Knowledge of the dimensions of a unit is vital to deciding what conversions are possible. For example, the knot is a recognised measure of speed, but it is a compound unit and must be separated into the dimensions of length and time in order to compare it with other measures of speed. By the same token, the watt is an approved SI measure of power, but it is also likely that data sources may report the same quantity in joules per second. How does a typical computer application equate these two concepts?
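One minimal way a program can equate such units is to reduce each one to a scale factor and a vector of base-dimension exponents; the sketch below is illustrative only and is not the CombeChem units software.

# Each unit is a scale factor to SI plus exponents over the base dimensions
# (L = length, M = mass, T = time); enough for this illustration.
UNITS = {
    "metre":    (1.0,             {"L": 1}),
    "second":   (1.0,             {"T": 1}),
    "kilogram": (1.0,             {"M": 1}),
    "mile":     (1609.344,        {"L": 1}),
    "hour":     (3600.0,          {"T": 1}),
    "knot":     (1852.0 / 3600.0, {"L": 1, "T": -1}),   # nautical mile per hour
    "watt":     (1.0,             {"M": 1, "L": 2, "T": -3}),
    "joule":    (1.0,             {"M": 1, "L": 2, "T": -2}),
}

def compound(terms):
    """Combine (unit, power) terms into an overall SI factor and dimension vector."""
    factor, dims = 1.0, {}
    for name, power in terms:
        scale, d = UNITS[name]
        factor *= scale ** power
        for dim, exp in d.items():
            dims[dim] = dims.get(dim, 0) + exp * power
    return factor, {k: v for k, v in dims.items() if v != 0}

# The watt and "joule per second" share the same dimension vector, so a program
# can recognise them as the same kind of quantity; likewise the knot decomposes
# into length/time and can be compared with miles per hour.
print(compound([("watt", 1)]))
print(compound([("joule", 1), ("second", -1)]))
print(compound([("knot", 1)]))
print(compound([("mile", 1), ("hour", -1)]))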

[1] There are a number of well-publicised case studies where human misinterpretation of unit information has resulted in significant problems. The ramifications of imperfect conversions are considerable, as NASA discovered to their cost when their Mars Climate Orbiter was destroyed in 1999, due in part to incorrect units of force being given to the software, in the communication between a section of code written using SI units (the engine control sub-system) and a section that assumed Imperial units (the flight software). A similar, but more directly human, problem occurred in 1983 in the "Gimli Glider" incident, in which a Boeing 767 ran out of fuel because of an invalid conversion of fuel weights: a chain of measurements resulted in a confusion, again between SI and Imperial units, in the conversion of fuel volume to weight, and the plane ran out of fuel before reaching its destination, but was able to glide to a landing.

A more complicated issue is encountered with older measurements of pressure based on a column of mercury. A column height in millimetres of mercury has been a long-established method of monitoring atmospheric pressure, but of course this is not a pressure at all, rather a length. If treated as a length for conversion purposes we may only convert from millimetres or inches to some other length of a mercury column. While this is entirely reasonable, it is not particularly useful. Some form of "bridge" is required to make the transition from a length to a pressure, but it is entirely dependent on the material. This type of conversion requires a greater knowledge of the physics behind the measurement and is not simply a matter of conversion between units or descriptions of unit systems. Several organisations are in the process of creating systems to make units transferable, including NIST's unitsML, and the units sections within the Geography Markup Language (GML) [30] and SWEET [40] ontologies from the Open Geospatial Consortium and NASA respectively. At this time there are no complete computerised units definitions endorsed by any standards organisation, and this means that there is no accepted procedure to express scientific units in electronic form. Wherever the problem arises, people have either created their own systems that can cope with their immediate needs or resorted to text, thereby condemning their data to digital obscurity. The XML schemas developed thus far tend to propose that one should have a definition for mol dm-3, and within that definition are descriptions of what units and powers compose this set of units. This makes description straightforward, as shown by the GML fragments below.

<measurement>
  <value>10</value>
  <gml:unit#mph/>
</measurement>

<DerivedUnit gml:id="mph">
  <name>miles per hour</name>
  <quantityType>speed</quantityType>
  <catalogSymbol>mph</catalogSymbol>
  <derivationUnitTerm uom="#mile" exponent="1"/>
  <derivationUnitTerm uom="#h" exponent="-1"/>
</DerivedUnit>

One potential problem with this approach lies in the vast number of permutations needed to accommodate the many different ways people use units in practice. In addition there are perhaps ten different prefixes in common use in science, so at least in principle we may have ten versions of each unit, and with compound units it might be common to talk of moles, millimoles or micromoles per decimetre, centimetre or metre cubed. We would then have around one hundred useful permutations and many more possibilities. Clearly such a description is more useful if it considers non-SI units such as inches and pounds as well as SI. Every combination of two units results in another definition, leading to a finite but practically endless list of definitions. This is exemplified by the units section of the SWEET ontology. SWEET presently addresses a relatively small set of units around the basic SI units, and already the list is many pages of definitions with a very precise syntax required to invoke them. In a long list, humans will have difficulty locating the correct entities, and both processing and validating the schema become increasingly difficult. There is considerable scope for typographical errors when writing programs to use ontologies, and the bigger and more complex the ontology the greater the problem becomes. It is better to provide a higher level of abstraction and to encode, in effect, rules for building viable units from the base units. A more viable alternative to the above approach is to explode the units when they are invoked, as follows:

<measurement>
  <value>10</value>
  <unit>
    <unitname>mile</unitname>
    <power>1</power>
  </unit>
  <unit>
    <unitname>hour</unitname>
    <power>-1</power>
  </unit>
</measurement>

or, in more condensed form:

<measurement>
  <value>10</value>
  <unit id="#mile" power="1"/>
  <unit id="#hour" power="-1"/>
</measurement>

Clearly this approach requires more data to describe the units for each measurement, but it brings two benefits. First, it dramatically reduces the size of the dictionary required to interpret the units: the cornucopia of distinct entities is reduced to two units and a definition of a prefix. Second, it reduces the number of logical steps required to transform the units, since they are already subdivided and we do not need to decompose them further using the schema. Figure 2 illustrates part of the ontology.
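In RDF the same exploded form might look like the following sketch (Python with rdflib); the vocabulary is illustrative only and is not the CombeChem units ontology itself.

from rdflib import BNode, Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Illustrative vocabulary only.
U = Namespace("http://example.org/units#")

g = Graph()

measurement = BNode()
g.add((measurement, RDF.type, U.Measurement))
g.add((measurement, U.value, Literal(10, datatype=XSD.integer)))

# Each unit term references a base unit definition plus a power,
# so "miles per hour" needs no dictionary entry of its own.
for base_unit, power in ((U.mile, 1), (U.hour, -1)):
    term = BNode()
    g.add((measurement, U.unitTerm, term))
    g.add((term, U.unit, base_unit))
    g.add((term, U.power, Literal(power, datatype=XSD.integer)))

print(g.serialize(format="turtle"))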

FIGURE 2 HERE

Figure 2. Part of the Units ontology

3.3 Materials and Processes

We investigated the representation and storage of human-scale experiment metadata and designed an ontology [19,20] to describe the record of an experiment, together with a novel storage system for the data from the electronic lab book. In the same way that the interfaces needed to be flexible to cope with whatever chemists wished to record, the back-end solutions also needed to be similarly flexible to store any metadata that might be created. Additionally, pervasive computing devices are used to capture laboratory conditions, and chemists are notified in real time about the progress of their experiment using pervasive devices. Our Web-based planner application allows the basic structure of an experiment to be generated from scratch or by copying an existing experiment and refining it. The main metadata for the experiment, such as the experimenter's name and a high-level description of the experiment, can be entered or updated, and more detailed data about the experiment, such as the lists of ingredients and steps, can be created or modified. The current planner interface does not allow the creation of a non-linear experiment plan (i.e. where more than one step can be done in parallel), but this is not a serious problem for our test environment of a synthetic organic research laboratory, where the individual experiments rarely have parallel execution paths. We have developed an ontology in RDFS [19,20] to encompass the major phases of an experiment: planning the ingredients, planning the procedural steps, and recording the experiment. The ontology uses a high-level representation of workflow, modelling processes, materials, and data. Information recorded using the ontology represents the human-scale activities of scientists performing experiments. The requirements have been gathered from our work with chemists in the laboratory. We have also incorporated additional requirements from the chemistry aspects of CombeChem, so that concepts such as a process can mean either an act performed by a human or the running of a piece of software. The initial design was inspired by analysing the experiments performed by chemists during the trials of the tablet PC interface, as well as some straightforward procedures such as the preparation of aspirin. We quickly noted that there was a spine of activity running through the record, in which a process led to a product which in turn formed the input to the next process step. In the laboratory this maps to the adding of ingredients to a reaction vessel and the actions performed on it, resulting finally in an end product. This process-product pairing is important, both in terms of the physical reality of an experiment, and also in software, where each computational process results in output files. Recording these intermediate outputs allows us to link to each outcome from the final report and provides far greater opportunities for other scientists to reproduce experiments. Previously all that remained from an experiment would typically be the analysis of the final product. Our ontology is designed to make it easier for systems such as eBank [12] to retrieve all of the intermediate data and results vital to reproducing a procedure. Our ontology [20] separates a plan from the record of an execution of that plan, as illustrated in Figure 3. The separation between plan and record is an important one, since the two may not match exactly. An experimenter may perform additional steps over and above those in the plan, for example if something unexpected or unplanned happens (say, an interim result is a solid, rather than a liquid as expected, and must be dissolved to continue). There is no reason, from the point of view of the ontology, why a plan cannot have more than one record associated with it (for, say, repeated attempts at an experiment), although we do not currently allow for that in our software.

FIGURE 3 HERE

Figure 3. Example of the plan and the process record. The shaded circles depict processes, hollow circles depict substances, triangles represent the making of observations and squares represent literal values. Each circle, triangle or square is a node in the RDF graph, with an abbreviated URI represented in bold, black text. The class name of each node is given in italics and arrows represent RDF relationships.

The two main concepts which the ontology models are Materials and Processes. A “Material”, in our nomenclature, may be either a data set or a physical sample of a chemical (possibly some anonymous and indeterminate mixture partway through a reaction scheme). A Process may be a purely in silico process (a computation); a purely in vitro process (a chemical reaction); or a hybrid process such as the measurement of a spectrum, which takes a substance as input and produces a piece of data. There are currently no processes known which take only data as input and produce a substance. The ontology has a hierarchy of different process types; for example, it divides in vitro processes into the broad classes of "Separate", "Mix" and "React", with each of these being subdivided into different types; e.g. Separate has subclasses including Filter, LiquidLiquidExtract and ColumnChromatography. These may be further subdivided to give more detailed descriptions of the processes being used, as required. Extending the ontology to include additional types of process is a job for the domain experts (in this case, chemists), who are much better able to identify what classifications are useful. Extending the ontology in this way does not require a high degree of expertise in ontology creation, as it is primarily a matter of classification of processes. There are tools available which can aid the domain expert in modifying an ontology. To use the ontology's Materials and Processes, observe that every Material may be the input to, or the result of, a Process. Some Materials will be used as input to several processes (if they are data, or if they are a large sample of a substance split up into smaller samples). Some Processes may have many Materials as inputs (or as outputs – consider a filtration or fractional distillation process). Thus, the main part of the experiment consists of a network of nodes, always alternating between Materials and Processes. A measurement, which may be of some property of a substance, or of the state of a process, or even an arbitrary annotation made by the experimenter, has three parts to it: a measurement process, which is the act of making the measurement; a measurement material, which is the URI representing the result; and optionally a type/value pair, representing the data of the measurement in the case that it is a simple (one-value) observation with a unit, such as a weight or a temperature.
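A minimal sketch of this process-material spine, again with invented class and property names standing in for the actual experiment ontology:

from rdflib import Graph, Literal, Namespace, RDF

# Illustrative terms only; the CombeChem experiment ontology defines its own classes.
EXP = Namespace("http://example.org/experiment#")
EX = Namespace("http://example.org/run42/")

g = Graph()

# Two input Materials are mixed by a Process to yield a new Material,
# alternating Material -> Process -> Material along the spine of the record.
for sample in (EX.salicylic_acid, EX.acetic_anhydride):
    g.add((sample, RDF.type, EXP.Material))
    g.add((EX.mix1, EXP.hasInput, sample))

g.add((EX.mix1, RDF.type, EXP.Mix))
g.add((EX.mix1, EXP.hasOutput, EX.reaction_mixture))
g.add((EX.reaction_mixture, RDF.type, EXP.Material))

# A measurement is itself a process that takes a Material as input and produces
# a data Material, here carrying a simple type/value pair (a weight in grams).
g.add((EX.weigh1, RDF.type, EXP.MeasurementProcess))
g.add((EX.weigh1, EXP.hasInput, EX.reaction_mixture))
g.add((EX.weigh1, EXP.hasOutput, EX.weight_obs))
g.add((EX.weight_obs, RDF.type, EXP.Material))
g.add((EX.weight_obs, EXP.unit, Literal("g")))
g.add((EX.weight_obs, EXP.value, Literal(12.4)))

print(g.serialize(format="turtle"))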

4. Systems design aspects

In this section we take a look at two parts of the system – the “Smart Lab” and the large store of chemical descriptions.

4.1 Smart Lab

There are a large number of issues involved in making the SmartLab systems useful beyond the limited environments in which they have been tested to date. These cover two main aspects: the presentation to the user, and the back-end data and services. On the presentation side, the main issue is how arbitrary experiment records are presented to the user. There may be many different preferred forms of presentation, varying from print journal publication, through on-line browsing, to a detailed view of the whole experiment and its position within a group of other experiments. In addition, methods of manipulating the material recorded by the SmartLab system are required, to allow (re-)contextualization of data gathered by services which do not (or cannot) deliver their data with the appropriate metadata context to constitute a full record. Developing data recording services which are sufficiently lightweight in their user interface (UI) to be unobtrusive in operation, but which are capable of giving full context to their measurements, is also an important piece of research in pervasive computing and UI design. Our methodological approach to this aspect of the project is presented in [45]. The SmartLab system is intended to support the chemist through the whole life-cycle of an experiment. We can roughly break this lifetime into four parts, with the “PPPP” mnemonic: Plan, Perform, Ponder and Publish. The experiment cycle starts with some form of planning. In the UK the chemist has to produce a plan of the experiment as a list of the reagents to be used, and any associated hazards, as part of the COSHH (Control Of Substances Hazardous to Health) assessment. Often the experiment planned is based on a previous procedure with a variation in reagents. The plan will be authorized by a supervisor once the amounts of reagents have been calculated and the relevant safety information researched and noted for COSHH. At this point, work generally moves into the laboratory. The experiment is carried out in the laboratory and in other locations, generally those providing specialist services such as the mass spectroscopy laboratory. The chemists will then analyse and write up their experiments back at their desks. After writing up the experiment, the optional final stage is publication of the experiment results as a peer-reviewed paper. There are three main components to the SmartLab architecture design. These result from a careful consideration of the whole context of the experiment, not simply the historical aspect of recording the observations in a laboratory. The three stages are:

(i) preparation and planning capture,
(ii) results capture,
(iii) data presentation in multiple contexts,

with an overall ontology and architecture to integrate the three components. Our system consists of three main user-visible systems, supporting the chemist for the first three parts of the process described above. It also contains software development libraries and back-end data services on which the user-visible systems are built. The architecture is shown in Figure 4, with the main user interface components on the right. We have developed three primary applications: a planning tool, which is used to set up the plan and ingredients for the experiment; a weigh-station/liquid-measure application, used for recording the quantities of ingredients actually used, as an example of a measurement device; and a ‘bench’ application, used for making notes and annotations on the plan while performing the experiment. The latter two applications have been implemented on a Tablet PC, to be carried around in the laboratory. The current prototype planner application is implemented as a set of dynamic, form-based web pages. The SmartLab system is modular: other measurement devices, such as a digital camera for recording TLC plates, or a formatter for adding mass spectrograph recordings, can also be added to the system in the same way as the weigh-station application. We describe the planning application first, followed by the Tablet applications; the latter were developed first, since they replace the lab book and are the most critical components of the system.

FIGURE 4 HERE Figure 4. The Smart Lab architecture.
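To make the kind of record captured by these measurement applications concrete, the following is a minimal sketch of how a single weighing observation might be asserted as RDF, using the Jena API [29] on which the back-end triplestore described below is built (the package name corresponds to the Jena releases of that era). The namespace, property names and URIs are invented for illustration; they are not the actual SmartLab ontology, and in the real system the URIs would be minted by the SURIG URI generator.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class RecordWeighing {
    // Hypothetical namespace standing in for the SmartLab ontology.
    static final String LAB = "http://example.org/smartlab#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();

        Property observationOf = model.createProperty(LAB, "observationOf");
        Property value         = model.createProperty(LAB, "value");
        Property units         = model.createProperty(LAB, "units");

        // Placeholder URIs for an experiment step and the observation made at it.
        Resource step = model.createResource("http://example.org/experiment/42/step/3");
        Resource obs  = model.createResource("http://example.org/experiment/42/obs/1");

        // The weigh-station would assert a small subgraph like this for each measurement.
        obs.addProperty(observationOf, step);
        obs.addProperty(value, "2.31");
        obs.addProperty(units, "g");

        model.write(System.out, "N-TRIPLE"); // serialise the subgraph for submission
    }
}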

The back-end of the Smart Lab [25] system consists of a database which stores the details of the experiments and presents a query interface for interrogating the data store. The connection between the back-end data storage and the user interface applications runs over the network. The data for the experiment is stored in an RDF triplestore based on Jena [29] and is accessed by the front-end applications through a SOAP-based query interface. This "back-end" for the laboratory systems, recording the processes undertaken, uses the same RDF technology as the recording of information about the molecular species highlighted above. This shows the way to integrate the process information captured by the pervasive technology with the knowledge base about the materials, all using the Semantic Grid approach.

The functional core of the SmartLab software is the libtea library. This library provides an abstraction over the low-level RDF data structures kept in the triplestore, which are queried and served by the ModelServer. In more detail, it:

• Handles the network connection to the ModelServer, requesting subgraphs from the triplestore through the SOAP interface;
• Parses the RDF graph for an experiment, identifying the main flows of overview, plan and record;
• Presents the experiment structure as a set of objects which can be manipulated without requiring knowledge of the underlying RDF;
• Ensures the structural integrity of the main RDF graph;
• Requests, where appropriate, new unique URIs from the URI generator (the Southampton URI Generator, SURIG, which is used by several of the CombeChem projects), again through SOAP.

libtea API. The programming interface (API) for libtea abstracts away some of the details of the underlying RDF structures from the programmer and presents a simpler interface for the "standard" structures and properties encoded in the base SmartLab ontology. It does not, however, explicitly hide the RDF from the programmer, so if it is necessary to add new features or structures to the schema, this can be done in a flexible and ad hoc manner without having to add support directly to libtea. The libtea API presents a set of objects to the programmer, each object representing a different concept within the RDF structure and encapsulating a subgraph of one or more triples. As the library requests RDF subgraphs from the triplestore, it constructs an RDF graph of the experiment in a local cache of triples. It then generates objects which maintain references into the cache and which can be used to read the "well-known" properties of the concept that each object represents. For example, a request for an observation from the ModelServer will retrieve a subgraph and add it to the current triple cache. The library will then create an Observation object and return it to the programmer, who can then read or change the units and value for the observation using the getunits/setunits and getvalue/setvalue methods respectively.
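The following is a small, hypothetical sketch of how such an object might be realised over a local Jena triple cache. The Observation class name and the getvalue/setvalue and getunits/setunits accessors follow the description above, but the internal structure and the property URIs are assumptions for illustration, not the actual libtea implementation.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

/** Sketch of a libtea-style wrapper object: it keeps a reference into the
 *  local triple cache (a Jena Model holding the subgraphs fetched from the
 *  ModelServer) and exposes the "well-known" properties of an observation
 *  without exposing raw RDF to the caller. */
class Observation {
    // Hypothetical namespace; the real SmartLab ontology URIs are not reproduced here.
    private static final String LAB = "http://example.org/smartlab#";

    private final Resource node;   // the observation node in the cached graph
    private final Property value;
    private final Property units;

    Observation(Model cache, String observationUri) {
        this.node  = cache.createResource(observationUri);
        this.value = cache.createProperty(LAB, "value");
        this.units = cache.createProperty(LAB, "units");
    }

    public String getvalue()       { return node.getProperty(value).getString(); }
    public String getunits()       { return node.getProperty(units).getString(); }

    // Writes replace the existing triple so the graph stays structurally consistent.
    public void setvalue(String v) { node.removeAll(value).addProperty(value, v); }
    public void setunits(String u) { node.removeAll(units).addProperty(units, u); }
}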

4.2 Scalable triplestore

The SmartLab architecture is a crucial source of RDF at the beginning of the lifecycle of the chemical information. Following our strategy, this digital record is enriched and interlinked by a variety of annotations such as data from sensors, records of use, or explicit interaction. The annotations are required to be machine processable, and useful both for their anticipated purpose and for interoperability to facilitate subsequent unanticipated reuse. The aim is to record the complete provenance record. One of the variety of possible interfaces to this is a Web page from which it is possible to chase back to the original data, as illustrated in Figure 5.

The project has used multiple triplestores, and although the intent was to use them to 'glue' together the information in a variety of relational databases in operational use, the stores have also been used to provide a uniform description of a variety of chemical information harvested from multiple sources. For practical purposes much of this data has been held in one large store. At the time of writing there are 50 million RDF triples in the CombeChem triplestore. The chemical data used was obtained from a range of publicly available databases including the ZINC database [28], and information from the National Institutes of Health (NIH), in particular the National Cancer Institute (NCI) [4] chemical data. The current target is 200 million triples, representing a substantial Semantic Web deployment.

We evaluated several triplestores (3store [22], Jena [29], Sesame [46], Redland [41], Kowari [52]) and adopted 3store because it has good scaling properties [31] (see Figure 6 for a comparison of 3store and Kowari). Additionally it is easily batch- or Perl-scriptable, supports RDFS, and can use standard RDBMS tools for maintenance of the data (e.g. backups and migration), since, in contrast to Kowari, all application state is held in the database. 3store uses an independent database schema for flexibility and is based around a three-level architecture. The top level (an Apache server module) passes an RDQL [44] query to the middle layer (a C library), which compiles the query down to SQL and executes it. MySQL provides the low-level indexing, query execution and persistent storage, which constitutes the bottom layer. This design allows the system to perform query optimisations at each level of abstraction, and the final query is translated into a single SQL query that can be executed by the database engine in a conventional RDBMS manner, rather than as fragmented queries. There is a trade-off between complexity at query time and at store time, which 3store optimises. This design brings the execution time of typical RDQL queries down to a few milliseconds, and allows RDF(S) data files to be asserted at a rate of around 1000 triples/second on a commodity x86-based server, even with large knowledge bases.
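As an illustration of the kind of query that flows down this stack, the snippet below simply constructs an RDQL [44] query string of the sort the top (HTTP) layer would hand to 3store. The chem: namespace and the molecularWeight property are invented for the example; the actual CombeChem property URIs may differ.

public class ExampleRdqlQuery {
    // Hypothetical namespace and property, used only to show the RDQL shape.
    public static final String RDQL =
          "SELECT ?mol, ?mass\n"
        + "WHERE (?mol, chem:molecularWeight, ?mass)\n"
        + "AND ?mass < 500\n"
        + "USING chem FOR <http://example.org/chem#>";

    public static void main(String[] args) {
        // 3store would compile a query of this form into a single SQL statement
        // executed by MySQL, rather than issuing many fragmented queries.
        System.out.println(RDQL);
    }
}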

FIGURE 5 HERE

Figure 5. The eCrystals interface [7]. The information contained within an entry in this archive is all the underlying data generated during the course of a structure determination from a single-crystal X-ray diffraction experiment. An individual entry consists of three parts: core bibliographic data, such as authors, affiliation and a number of chemical identifiers; data collection parameters that allow the reader to assess at a glance certain aspects of the crystallographic dataset; and files available for download (visualisations of the raw data, the raw data itself, experimental conditions, outputs from stages of the structure determination, the final structural result and the validation report of the derived structure).

With 50 million triples in 3store the queries remain responsive, but data import performance has begun to degrade, i.e. reassertions of the RDF schema are taking a long time. Write performance on large stores is known to be a challenging issue, and we can envisage how a single large triplestore with frequent insertions would be unable to cope with potential demand. In a newer version, with more properties for each molecule, 60 million triples equates to a reasonably sized chemical dataset of ca. 1 million molecules. The number of properties is set to grow much larger as we begin adding more computed rather than experimental data. Hence we are now contemplating alternative ways of partitioning and maintaining the triples across multiple stores. Progress is now being made in linking the RDF structures for the molecular properties and those describing the experiments from the ELN, tied together by the molecule's URI.

FIGURE 6 HERE

Figure 6. A comparison between the triplestore assertion time for up to 10 million triples for 3store and Kowari.

RDF has been shown to be an effective method for capturing highly detailed chemical data, allowing it to be indexed in a persistent triplestore so that it can be searched and data-mined in useful ways. The triplestore has now reached a viable state, with the further addition of chemical properties continuing as an ongoing process. We are now beginning to develop automated calculations using the many available structures and to store the results alongside all the details of the computations that produced them. Beyond that, we can achieve high-throughput data processing and begin to develop new models based on those computations.

5. Related Work

The digital support for modern chemistry requires both the information agenda discussed above and the high-performance calculations needed for modelling complex systems. A number of existing Grid projects address these needs either jointly or separately. Similarly, the links between chemistry and the life sciences are of major academic and industrial concern, and Grid-based projects in the bioinformatics area are beginning to make contact with the chemical informatics investigations.

At present most chemical data is not published in machine-processable form, but is made available via conventional peer-reviewed journals (distributed on paper, as PDF or sometimes as HTML). In practice, the secondary publishers (e.g. Chemical Abstracts Service) employ large numbers of people to extract and re-key much of this data into more organised secondary publications and to assign suitable identifiers. Although these might in principle constitute part of a semantic network, the resulting services are currently available only under commercial licence and are often restricted to interactive online searches rather than automated searching. In contrast, the Biosciences community publishes a significant part of its research data directly into data repositories as part of the primary publication process.

The BioSimGrid project [3, 48] draws together and formalises relationships within the biomolecular simulation community in order to enable optimum use of resources (based on distributed, shared data structures) and to establish a biomolecular simulation database. The aim is to exploit the Grid so that the database will exist in a distributed form but can be curated and interrogated centrally, with software tools developed by the project for interrogation and data-mining across the entire distributed database. The added value from taking the Grid approach, with its associated metadata facilities, comes from enabling data-mining across all simulations in the database and facilitating access to simulation results by non-experts (e.g. in the structural biology and genomics communities). Biomolecular simulations play a key role in enabling the study and understanding of biological processes at a microscopic level, which is not always accessible to experiment. Simulations are thus essential for the interpretation and extension of experimental information, and are becoming increasingly important as methods and computer technology advance. Building on the BioSimGrid work, the IntBioSim project [27] is exploring an integrated approach to computational systems biology, spanning from the chemical to the subcellular level of simulation. The outputs of several parts of the CombeChem project are being entered into the BioSimGrid system to facilitate the processing and dissemination of these results.

The myGrid project [39, 47] exploits Grid technology, with an emphasis on the Semantic Grid, to provide middleware layers that make it applicable to the immediate needs of bioinformatics, building high-level services for data and application resource integration such as resource discovery, workflow enactment and distributed query processing. These services enable experiments to be formed and executed; additional services are needed to support the e-Science-based scientific method and the best practice found at the bench but often neglected at the workstation, notably provenance management, change notification and personalisation.

The Collaboratory for Multiscale Chemical Science (CMCS) [6], mentioned above in the context of electronic laboratory notebooks, is developing an informatics-based approach to synthesising multi-scale information to create knowledge. The underlying systems in CMCS use a triple-based approach, encoded in WebDAV, and are in essence very similar to the system that has grown up in CombeChem. The adoption of this semantic approach in CMCS was driven by the need to support the wide range of data and information observed in the chemical community it hopes to serve.
The requirements capture, just as in CombeChem, led to this approach being considered the most suitable representation.

The World Wide Molecular Matrix (WWMM) [35] is generating a peer-to-peer XML repository for molecules and properties. Scientists and researchers can directly publish and upload their data into the matrix, in CML [36,37,38] and XML. All entries have extensive metadata. Each molecule has a unique identifier based on the connection table (chemical structure diagram), which provides rapid exact matching. Searches are also possible on molecular properties, since there is a controlled ontology with unique identifiers. Xindice and other open-source (Java) tools are used to prototype an XML repository. In some ways the use of XML together with unique identifiers, and the linking of items via these identifiers, produces the same effect as the CombeChem use of RDF. However, the use of Semantic Web and Grid technology provides these links directly and without further explicit work: the linking up of data comes naturally from the use of the semantic technology. Future work is looking at combining the WWMM material with the CombeChem Data Grid.

In many analytical laboratories the need to keep track of a significant quantity of data about the large number of samples flowing through the laboratory has led to the widespread use of relational-database-driven Laboratory Information Management Systems (LIMS). These are beginning to be constructed in a way that allows them to be used with many different manufacturers' analytical equipment, rather than being tied to a particular hardware and software solution. However, they are complex pieces of software and are not typically able to interface to the heterogeneous data sources in the wider world.

Sharing of scientific photographic data provides an interesting example of a community approach to providing exciting images. The "Molecular Expressions" website [34] contains several photo galleries that explore the world of optical microscopy, offering one of the Web's largest collections of colour photographs taken through an optical microscope (commonly referred to as photomicrographs), together with plenty of associated metadata about the images. One gallery contains images of many of the important pharmaceutical drugs: the Molecular Expressions Pharmaceuticals Collection contains over 100 drugs that have been recrystallized and photographed under the microscope.

There are many systems for recording experiments electronically, across a number of different fields of experimentation, and the field of Electronic Laboratory Journals and Notebooks is growing rapidly. Few of them adopt any form of semantic technology for their representation of information. We review here a few of the more prominent systems; a more complete review of electronic logbook research (both academic and commercial) may be found in [45].

The Microarray Gene Expression Data Society [32] is concerned with describing samples and processes used in microarray gene experiments. Its first achievement was a format for the minimal annotation of an experiment, followed by an object model to describe experiments. This is accompanied by a tool set to aid developers in converting outputs from their systems into these formats, enabling data exchange. These formats are now becoming the accepted de facto standard for the publication of data in this field.

In some application domains, experiments are conducted using computers only (in silico research), and managing the results of such experiments is becoming of increasing importance in Grid computing. The Earth System Grid project [13] is a major collaboration to create an ontology that describes the processes and data found in such large-scale distributed software systems. The major high-level concepts include pedigree and scientific use, together with other concepts such as datasets, services and details of access to services.

CMCS is a very large project building a toolkit of resources to aid multi-scale combustion research. The portal-based project encompasses a vast amount of work in individual systems to share resources and results in the community, and is a powerful approach to managing the diverse data formats produced by different analysis systems and tools such as electronic notebooks. One of the major components of the CMCS project is SAM (Scientific Annotation Middleware) [39]. SAM is a set of middleware components and services designed to aid the storage of data or metadata in its native form and to provide tools to map metadata from one form to another. Mappings are written in XML so that an arbitrary combination of analysis systems, underlying data stores and electronic notebooks can be used together through the one portal. The system is based upon a WebDAV implementation, adding metadata management and notebook services layers. The CMCS and SAM projects are currently some of the most advanced work in electronic notebooks and the distributed storage of diverse experimental data. Where their approach differs from the work described here is in our use of an ontology to describe the nature of an experiment in detail and in our design of a client/server API explicitly to build a structured RDF graph of an experiment plan and record.

6. Evaluation

Informal testing and evaluation of the parts of the CombeChem project have been carried out with Chemistry staff and students in Southampton, with users of the National Crystallography Service (NCS) and the wider crystallographic community, and with the digital library community.

The Electronic Laboratory Notebook (ELN) was tested in two phases. First, the user interfaces were developed for a tablet PC within the laboratory and proved easy to use; indeed, observations showed that the research students felt as much at ease with the tablet as they did with their paper lab books. The security offered by recording the data in a more permanent, digital form outside the laboratory was something they all commented on. Subsequent testing of the back-end, the aspect of the project of more concern to this paper, demonstrated the complexity of this task. The current version, which allows for planning, performing and producing a record of the experiment, supports most of the functionality that the students expect from a lab notebook and has proved very usable. The areas currently being enhanced are the links to other materials (reference data, spectra, etc.), which is exactly the issue of linking the experiment triplestores to the molecular data triplestore; this has also been demonstrated within the project but needs to be made more robust.

The units ontology has been evaluated in the context of the molecular properties triplestore and successfully tested by developing a conversion program driven by the ontology (a minimal sketch of such an ontology-driven conversion is given below). The responsiveness of the properties store, referred to previously, does indicate the need for further triplestore development. A new version of 3store is now available and will be tested soon. Trials with users who also have access to the ELN have indicated that safety information is an area where tying the two systems together would be of great benefit. Pulling all these parts together is the first step in creating suitable software support for a smart 21st-century laboratory [18].
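As an indication of what "driven by the ontology" means in practice, the following is a minimal sketch of such a conversion step. It assumes, hypothetically, that each unit in the RDF units ontology carries a conversion factor to a base SI unit; the namespace and property name are illustrative and are not the ontology actually used in the project.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.Property;

/** Minimal sketch of ontology-driven unit conversion: the factors are read
 *  from the RDF units ontology rather than being hard-coded, so adding a new
 *  unit to the ontology makes it convertible without changing the program. */
public class UnitConverter {
    // Hypothetical namespace and property for the illustration.
    private static final String U = "http://example.org/units#";

    private final Model ontology;
    private final Property factorToSI;

    public UnitConverter(Model ontology) {
        this.ontology   = ontology;
        this.factorToSI = ontology.createProperty(U, "conversionFactorToSI");
    }

    /** Convert a value between two compatible units (e.g. kJ/mol to kcal/mol). */
    public double convert(double value, String fromUnitUri, String toUnitUri) {
        double from = ontology.createResource(fromUnitUri).getProperty(factorToSI).getDouble();
        double to   = ontology.createResource(toUnitUri).getProperty(factorToSI).getDouble();
        return value * from / to;
    }
}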

7. Conclusions and Future Work

The "Publish@Source" principle is essential in speeding the dissemination of a much wider range of information about experiments than has traditionally been available. We have demonstrated this in parts, but tying the smart laboratory systems together with the data repository concept will bring it to fruition in a way that can be readily adopted by other areas of experimental and computational science beyond the borders of chemistry. This will be further promoted by the automatic provision of links between the written reports and publications and the data, stepping back through the analysis to the raw data, with the possibility of recalculation of these steps, and with all the links derived from the RDF description of the workflow.

More work is required to build up the chemical ontology to a level comparable with the XML structure provided by CML. In the InChI we have a computable URI for organic molecules, but this leaves large areas of compounds (inorganics, mixtures, materials, etc.) without adequate URIs of this form. We could of course make use of the commercially provided CAS numbers (Chemical Abstracts Service numbers, which are allocated to all registered compounds), but these are a problem for virtual compounds, i.e. those envisaged but not yet made. We have used a simple URI generator to provide a URI for the different processes and samples used in a laboratory framework, and we are extending this to the computational processes used in analysis, modelling and theoretical predictions. However, a framework for allowing re-use of the ontology is needed. We have seen, in the need to provide an RDF structure for units, that the process of describing a piece of scientific data with all the necessary descriptions and provenance propagates requirements out in an extensive net, eventually meeting other domains where we can link up with other semantic descriptions. Filling in more of the gaps in this descriptive web is going to be an essential part of the project to bring our vision of the chemical Semantic Grid into operation.

One area which demands rapid attention, because of the importance of unhindered and accurate information flow between different knowledge domains, is the integration of the chemical ontology with, for example, the Life Science Identifiers (LSIDs). This is part of the wider process of linking bio- and chemical informatics, to aid, for example, drug modelling for sample and model selection in determining Quantitative Structure Activity Relationships (QSAR), and the linking and mapping of this data onto the larger spatial scale of the environment. All of these are major purposes of our current investigations, with a view to early detection or advance prediction of unexpected environmental problems in our highly interlinked world.

The ELN has proved to be an excellent model system for the meeting of the semantic chemical Grid descriptions of materials and services with the pervasive environment needed to capture the information within the source laboratory. We are currently conducting further investigations into the use of the smart lab systems in an active synthetic organic chemistry laboratory, looking at the use and re-use of information captured with our systems. These studies will soon be extended to investigate the pervasive aspects of the Grid, looking at the use of handheld systems (e.g. PDAs, tablets) versus distributed computers in different positions within the laboratory.
We have also commenced an exploration of capturing scientific discourse within the Data Grid, fully linked in following the Publish@Source approach. This includes materials from meetings and videoconferences, and is achieved through the use of meeting support tools which capture semantic annotation following a similar approach to the smart lab [19]. We have moved from storing all available triples in a single triplestore to loading just what is needed for the task at hand, which might be regarded as a 'caching' approach; hence we are exploring the issues of working with large quantities of metadata and multiple distributed stores. Currently the user interfaces are custom web pages, triplestore browsers or visualization tools; much more work is needed to establish general-purpose query interfaces. Our principle of pragmatism has brought us a long way: this is a significant piece of the Semantic Web. In a sense, we are now ready to begin! We have benefited from flexible associations, from the sharing of identifiers, and from graph queries and chaining in the triplestore. Much of the power of the Semantic Web that comes from ontologies and reasoning is yet to be explored. Also at this level we see opportunities for the use of rules as these solutions emerge within the Semantic Web stack.

With the integrated system now coming into place to provide the digital support for the chemistry information pipeline, the Semantic Data Grid will to some extent grow itself. The separate parts of the chemical information placed on the Grid are linked by common URIs, and thus the network is formed. Having established this procedure within one scientific, laboratory-based regime, we are looking at ways to extend it, first to other sciences, physics being an obvious choice, and then to the life sciences arena, where we will integrate with the bioinformatics projects.

Acknowledgements CombeChem was funded under the UK e-Science programme EPSRC grants GR/R67729/01 and EP/C008863/1. The authors extend thanks to the entire CombeChem team, especially David De Roure for the Semantic Grid vision, and to colleagues in the Advanced Knowledge Technologies IRC (GR/N15764/01), especially Stephen Harris and Nicholas Gibbins, the Equator IRC (GR/N15986/01) especially Danius Michaelides, and CoAKTinG (GR/R85143/01), especially Jessica Chen-Burger. Nicholas Humfrey assisted in live multimedia capture and Michael Streatfield assisted with triplestore comparisons. eBank is funded by JISC under the Semantic Grid and Autonomic Computing Programme.

References

[1] M.S. Bachler, S.J. Buckingham Shum, D.C. De Roure, D.T. Michaelides, and K.R. Page, Ontological Mediation of Meeting Structure: Argumentation, Annotation, and Navigation. In Proceedings of the 1st International Workshop on Hypermedia and the Semantic Web (HTSW2003), Nottingham, 2003.
[2] T. Berners-Lee, J. Hendler, O. Lassila, The Semantic Web. Sci. Am. 284 (2001) 34-43.
[3] BioSimGrid: a distributed database for biomolecular simulations, http://globus.biosimgrid.org/
[4] CACTUS, National Cancer Institute, Frederick and Bethesda Data and Online Services; National Institutes of Health: Bethesda, Maryland. http://cactus.nci.nih.gov/, 2004.
[5] CIF, http://journals.iucr.org/iucr-top/cif
[6] Collaboratory for Multi-Scale Chemical Science (CMCS), http://cmcs.org/
[7] S.J. Coles, J.G. Frey, M.B. Hursthouse, M.E. Light, A.J. Milsted, L.A. Carr, D. De Roure, C.J. Gutteridge, H.R. Mills, K.E. Meacham, M. Surridge, E. Lyon, R. Heery, M. Duke, and M. Day, An E-Science Environment for Service Crystallography - from Submission to Dissemination. J. Chem. Inf. & Modeling, Web Release Date: 24-Feb-2006; DOI: 10.1021/ci050362w.
[8] CombeChem, www.combechem.org, UK e-Science Pilot Project 2001-2005.
[9] P. Coveney, Scientific Grid Computing. Phil. Trans. R. Soc. A 363 (2005) 1707-1713, doi:10.1098/rsta.2005.1632.
[10] D. De Roure, N. Jennings, N.R. Shadbolt, Research Agenda for the Semantic Grid: A Future e-Science Infrastructure. Technical Report UKeS-2002-02, National e-Science Centre, Edinburgh, U.K., 2001.
[11] D. De Roure, N.R. Jennings, N.R. Shadbolt, The Semantic Grid: Past, Present, and Future. Proc. IEEE 93 (2005) 669-681.
[12] M. Duke, M. Day, R. Heery, L.A. Carr and S.J. Coles, "Enhancing access to research data: the challenge of crystallography". In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, Denver, CO, USA, 2005, pp. 46-55.
[13] Earth System Grid project, https://www.earthsystemgrid.org/
[14] I. Foster, C. Kesselman, and S. Tuecke, The Anatomy of the Grid: Enabling Scalable Virtual Organizations. International Journal of Supercomputer Applications, 2001.
[15] M. Frenkel, R.D. Chirico, V.V. Diky, Q. Dong, S. Frenkel, P.R. Franchois, D.L. Embry, T.L. Teague, K.N. Marsh, R.C. Wilhoit, ThermoML - An XML-based approach for the storage and exchange of experimental and critically evaluated thermophysical and thermochemical property data. 1. Experimental Data. J. Chem. Eng. Data 48 (2003) 2-13.
[16] J.G. Frey, M. Bradley, J.W. Essex, M.B. Hursthouse, S.M. Lewis, M.M. Luck, L. Moreau, D. De Roure, M. Surridge and A. Welsh, 'Combinatorial Chemistry and the Grid'. In 'Grid Computing: Making the Global Infrastructure a Reality', F. Berman, G. Fox and T. Hey, Eds., Wiley, 2003.
[17] J.G. Frey, D. De Roure, L.A. Carr, Publication at Source: Scientific Communication from a Publication Web to a Data Grid. In Euroweb 2002 Conference, The Web and the Grid: From e-Science to e-Business, Oxford, U.K., December 17-18, 2002; F.R.A. Hopgood, B. Matthews, M.D. Wilson, Eds.; British Computer Society: Swindon, U.K.; Electronic Workshops in Computing.
[18] J.G. Frey, Dark Lab or Smart Lab: The Challenges for 21st Century Laboratory Software. Organic Process Research and Development 8 (2004) 1024-1035.
[19] J.G. Frey, D. De Roure, m.c. schraefel, H. Mills, H. Fu, S. Peppe, G. Hughes, G. Smith, and T.R. Payne, Context Slicing the Chemical Aether. In Proceedings of the First International Workshop on Hypermedia and the Semantic Web, Nottingham, UK, 2003; D. Millard, Ed. http://eprints.ecs.soton.ac.uk/8790/
[20] J.G. Frey, G.V. Hughes, H.R. Mills, m.c. schraefel, G.M. Smith, D. De Roure, Less is More: Lightweight Ontologies and User Interfaces for Smart Labs. In Proceedings of the UK e-Science All Hands Meeting, Nottingham, EPSRC, 2004, 8 pp., ISBN 1-904425-21-6. http://www.allhands.org.uk/proceedings/papers/187.pdf
[21] C.A. Goble, D. De Roure, N.R. Shadbolt, and A.A.A. Fernandes, "Enhancing Services and Applications with Knowledge and Semantics". In The Grid 2: Blueprint for a New Computing Infrastructure, I. Foster and C. Kesselman, Eds., Morgan-Kaufmann, 2004, pp. 431-458.
[22] S. Harris and N. Gibbins, 3store: Efficient Bulk RDF Storage. In Proceedings of the First International Workshop on Practical and Scalable Semantic Web Systems (PSSS2003), Sanibel Island, Florida, USA, 2003.
[23] T. Hey and A. Trefethen, Cyberinfrastructure for e-Science. Science 308 (2005) 817-821.
[24] T. Hey, A.E. Trefethen, The UK e-Science Core Programme and the Grid. Future Gener. Comput. Syst. 18 (2002) 1017-1031.
[25] G. Hughes, H. Mills, D. De Roure, J.G. Frey, L. Moreau, m.c. schraefel, G. Smith and E. Zaluska, 'The Semantic Smart Laboratory: A system for supporting the chemical e-Scientist'. Org. Biomol. Chem. 2 (2004) 3284-3293.
[26] InChI International Chemical Identifier, http://www.iupac.org/inchi/
[27] IntBioSim: An Integrated Approach to Multi-Level Biomolecular Simulations, http://intbiosim.org/
[28] J.J. Irwin, B.K. Shoichet, ZINC - A Free Database of Commercially Available Compounds for Virtual Screening. J. Chem. Inf. Comput. Sci. 45 (2005) 177-182.
[29] Jena - A Semantic Web Framework for Java, http://jena.sourceforge.net/
[30] R. Lake, The application of geography markup language (GML) to the geological sciences. Computers & Geosciences 31 (2005) 1081-1094, in "Application of XML in the Geosciences", doi:10.1016/j.cageo.2004.12.005.
[31] R. Lee, Scalability Report on Triple Store Applications. http://simile.mit.edu/reports/stores/ (accessed 2005).
[32] Microarray Gene Expression Data Society, http://www.mged.org/
[33] I. Mills, T. Cvitas, K. Homann, N. Kallay, K. Kuchitsu, Quantities, Units and Symbols in Physical Chemistry, 2nd Ed., Blackwell Science, IUPAC, 1993.
[34] Molecular Expressions, http://micro.magnet.fsu.edu/
[35] P. Murray-Rust, "The World Wide Molecular Matrix - a peer-to-peer XML repository for molecules and properties". Presented at EuroWeb2002, Oxford, UK, 2002.
[36] P. Murray-Rust, H.S. Rzepa, M.J. Williamson, E.L. Willighagen, Chemical Markup, XML, and the World Wide Web. 5. Applications of Chemical Metadata in RSS Aggregators. J. Chem. Inf. Comput. Sci. 44 (2004) 462-469.
[37] P. Murray-Rust, H.S. Rzepa, Chemical Markup, XML, and the Worldwide Web. 1. Basic Principles. J. Chem. Inf. Comput. Sci. 39 (1999) 928-942.
[38] P. Murray-Rust, H.S. Rzepa, M.J. Williamson, E.L. Willighagen, Chemical Markup, XML, and the World Wide Web. 5. Applications of Chemical Metadata in RSS Aggregators. J. Chem. Inf. Comput. Sci. 44 (2004) 462-469.
[39] myGrid, http://www.mygrid.org.uk/ (accessed 2005).
[40] R.G. Raskin and M.J. Pan, Knowledge representation in the semantic web for Earth and environmental terminology (SWEET). Computers & Geosciences 31(9) (2005) 1119-1125, in "Application of XML in the Geosciences", doi:10.1016/j.cageo.2004.12.004.
[41] Redland, http://librdf.org (accessed 2005).
[42] E.R. Rousay, H. Fu, J.M. Robinson, J.W. Essex and J.G. Frey, Grid-based dynamic electronic publication: A case study using combined experiment and simulation studies of crown ethers at the air/water interface. Phil. Trans. R. Soc. (London) 363 (2005) 2075-2095.
[43] M.A. Ruhl, G.W. Kramer, R. Schafer, SpectroML - A Markup Language for Molecular Spectrometry Data. J. Assoc. Lab. Automation 6 (2001) 76-82.
[44] A. Seaborne, RDQL - A Query Language for RDF. W3C Member Submission, http://www.w3.org/Submission/RDQL/ (accessed 2005).
[45] m.c. schraefel, G. Hughes, H. Mills, G. Smith, T. Payne and J. Frey, 'Breaking the Book: Translating the Chemistry Lab Book to a Pervasive Computing Environment'. In Proceedings of the Conference on Human Factors in Computing Systems (CHI), 2004.
[46] Sesame, http://www.openrdf.org (accessed 2005).
[47] R. Stevens, A. Robinson, and C.A. Goble, myGrid: Personalised Bioinformatics on the Information Grid. In Proceedings of the 11th International Conference on Intelligent Systems in Molecular Biology, 29 June - 3 July 2003, Brisbane, Australia; Bioinformatics 19, Suppl. 1 (2003) i302-i304.
[48] K. Tai, S. Murdock, B. Wu, M.H. Ng, S. Johnston, H. Fangohr, S.J. Cox, P. Jeffreys, J.W. Essex, M.S.P. Sansom, BioSimGrid: towards a worldwide repository for biomolecular simulations. Org. Biomol. Chem. 2 (2004) 3219-3221. DOI: 10.1039/b411352g.
[49] T. Talbott, M. Peterson, J. Schwidder, and J.D. Myers, Adapting the Electronic Laboratory Notebook for the Semantic Era. In Proceedings of the 2005 International Symposium on Collaborative Technologies and Systems (CTS 2005), May 15-20, 2005, St. Louis, MO.
[50] K. Taylor, R. Gledhill, J.W. Essex, J.G. Frey, S.W. Harris, and D. De Roure, "A Semantic Datagrid for Combinatorial Chemistry". In Proceedings of the IEEE Grid Computing Workshop at SC05, IEEE, Seattle, WA, November 2005 (8 pages).
[51] K.R. Taylor, R.J. Gledhill, J.W. Essex, J.G. Frey, S.W. Harris, and D.C. De Roure, Bringing Chemical Data onto the Semantic Web. J. Chem. Inf. & Modeling, Web Release Date: 26-Jan-2006; DOI: 10.1021/ci050378m.
[52] Tucana Technologies Inc., Kowari Metadata Store. http://www.kowari.org/ (accessed 2005).
[53] World Wide Web Consortium (http://www.w3.org/), W3C Semantic Web Activity, Semantic Web Activity Statement. http://www.w3.org/2001/sw/Activity (accessed 2005).