Providing Support for Integrated Scientific Computing: Metacomputing Meets the Semantic Web and the Grid

Spyros Lalis1,2, Catherine Houstis1,2, Marios Pitikakis1,2, George Vasilakis1,2 and Manolis Vavalis2

1 Department of Computer and Communications Engineering, University of Thessaly
2 Informatics and Telematics Institute, Center for Research and Technology Hellas

{lalis,houstis}@inf.uth.gr {pitikak,vasilak,mav}@iti.gr

Abstract

In this paper, we present an integrated system architecture for metacomputing on top of distributed scientific resources via active semantic-driven information management. Our approach is to consider both data and programs as objects of an application-specific ontology, describing them via corresponding metadata schemata that can be exploited to search for and to generate information in an efficient and automated way. Metacomputations are expressed in the form of workflows, thereby enabling a flexible combination of data and program objects that can reside on different servers or grid resources. Workflow descriptions are associated with ontology concepts and can be activated as a side-effect of searching for information, to produce new data on demand. We also report on the current status of an implementation that provides support for several aspects of this architecture.

1. Introduction

Researchers are increasingly relying on computers and communication networks for their scientific collaboration, not only within the same group or institute but also between different organizations. Indeed, the concept of a co-laboratory or virtual enterprise, where each participant contributes its own resources to form an environment that is more than the sum of its parts, is very appealing. One of the great technological challenges is thus to support and enhance such activities through mechanisms that enable a global, secure, flexible and efficient exploitation of remote resources, distributed in separate locations and managed by different organizations.

This goal is pursued via the grid computing effort [9, 10, 11, 13]. Following the same principles that drove the development of the Internet years ago, the aim is to extend the concept of resource sharing to support high-performance computing at a global scale. Towards this objective, a reference architecture with corresponding protocols and mechanisms has been proposed, and is being further developed. This enables institutions to export their local resources to other parties, as well as to access remote resources made available by other members in a controlled and transparent way. Basic services include resource directories, resource allocation, invocation and execution management, data transfer, and secure authentication and access control.

The grid was initially conceived along the paradigm of submitting and running individual programs. However, it was not long before the need emerged to run computations that involve many programs, each possibly executing in parallel and on a different supercomputer, cluster or server. This form of computing is also referred to as metacomputing, to indicate that the desired functionality results from a combination of many separate, autonomous system components. Based on and extending previous work done to support intra-organizational processes, metacomputing in the context of the Grid is performed using so-called workflow systems [3, 5, 21]. From an abstract perspective, the grid or a collection of grids is essentially a global, distributed and heterogeneous architecture, and a grid workflow system is a language and a runtime for writing and executing multi-component computations on top of it.

But where is the concept of information management to be found in this computing model? It seems that most work on grid infrastructures and workflow systems does not address this issue at a sufficiently high level. As a consequence, the user must manually locate and specify the data sources that contain the right input data for each program execution. Even worse, every single result of a (meta)computation must be documented extensively so that it can be located at a later point in time, for example to be used as input for another processing step. This can be cumbersome and time consuming because scientific data and metadata (!) tend to be complex; searching for the right data set could be like trying to find a needle in a haystack. Supporting the efficient documentation and discovery of data is therefore of key importance for global grid computing systems.

In this paper, we present our work towards a system architecture and implementation that combines the principles of metacomputing and the Semantic Web [26] to support computing for a given application domain. Our approach complements other work on grid scheduling and grid metacomputing in that we view the organization of individual data objects and programs as a central element of a global computing environment. This is achieved via a structured information space where scientific resources are documented with reference to a domain-specific ontology, enabling efficient discovery and access both for people and computer programs (agents). Metacomputing is integrated in this information space by linking workflow definitions back to the ontology. It is thus possible not only to find data and program objects that can be combined into workflows but also to find workflows that will produce data objects of a given class. Data generation can be performed in an interactive or background (silent) mode, and with automatic annotation of the results with metadata according to the ontology.

The rest of the paper is structured as follows. We start by giving an overview of the envisioned system model and architecture. We then describe key aspects of our current implementation. Finally, we compare with existing work and point towards future directions.

2. Information and computation integration

Users of high-performance computing systems typically run simulations or so-called virtual experiments in order to generate information. They do not care about, and thus should not be bothered with, the internal operation of the system. Providing information, rather than computing, at the scientist's fingertips is ultimately what counts, so it is important to provide corresponding system abstractions. Our approach is to support the mundane tasks of resource discovery and computing within the framework of a metadata-aware information system, as a combination of horizontal and vertical integration (Figure 1).

Horizontal integration is implemented via ontologies and metadata schemata pertaining to the semantics of a particular application domain. For the case of scientific computing, an ontology contains concepts (or classes) and instances (or objects) of scientific resources such as data sets, simulation models or data filtering and aggregation codes. The properties of these resources are described via metadata, which also contain information about the location and the protocols/interfaces through which they can be accessed or invoked. Such an organization greatly simplifies the task of navigation and searching in a large information space. The ontology and metadata, being encoded in machine-readable form, can also be queried by a computer program (agent) to locate and access available resource objects.
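To make the horizontal organization more concrete, the following minimal sketch (in Java, with hypothetical class and field names that do not appear in our implementation) illustrates the kind of information kept for each resource object: the ontology concept it instantiates, its metadata attributes, and the location and access protocol through which it can be reached.

```java
import java.net.URI;
import java.util.Map;

// Hypothetical data model: an ontology concept and a described resource object.
// Names and fields are illustrative; the actual system stores this as RDF metadata.
record Concept(String name, Concept parent) {}

record ResourceObject(
        String id,
        Concept concept,              // ontology class this object instantiates
        Map<String, String> metadata, // attribute/value pairs from the concept's schema
        URI location,                 // where the data set or program resides
        String accessProtocol         // how it can be retrieved or invoked
) {}

class ResourceModelExample {
    public static void main(String[] args) {
        Concept dataSet = new Concept("DataSet", null);
        Concept waveField = new Concept("WaveHeightField", dataSet);
        ResourceObject obj = new ResourceObject(
                "wave-2004-02-aegean",
                waveField,
                Map.of("region", "Aegean Sea", "resolution", "0.1deg"),
                URI.create("ftp://example.org/waves/aegean-2004-02.nc"),
                "ftp");
        System.out.println(obj.concept().name() + " at " + obj.location());
    }
}
```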

Vertical integration serves the purpose of extending the information aspect to include the, typically dominant, computational aspect found in scientific environments. This is achieved by associating data concepts whose instances can be produced by combining data and program objects with the corresponding metacomputing descriptions, or workflows1. Hence, when searching for information and arriving at that particular data concept, it is possible not only to discover and access the currently available objects but also to generate new ones by launching a metacomputation. It must be noted that the user does not have to be aware of the existence or internal details of metacomputing; in the ideal case, the system may infer the input parameters of the computation as a result of the controlled searching environment. Equally important, when a computation finishes, the produced data can be automatically annotated with metadata that capture its provenance, and can be properly registered with the searching facilities of the system.

1 These descriptions need not be workflows, but this seems to be the most popular way of expressing scientific metacomputations.

Figure 1. Horizontal and vertical integration for knowledge-centric scientific computing

This integrated approach allows users (and computer programs) to interact with and explore a large but well-organized information space in an accurate way. Even complex data objects can be promptly found or generated on demand, without dealing with the technical details of file systems, databases or metacomputing. With a fast enough computing and communication infrastructure (or lightweight enough computations) it would be hard to distinguish between retrieving available data from a repository and generating it via metacomputing. Last but not least, efficient searching is not limited to data objects alone. It becomes possible to search for program objects based on the data objects required as input and produced as output, as well as for metacomputation descriptions based on their data and program components and results.

3. Proposed system architecture

With this vision of integrated scientific computing in mind we propose an architecture (Figure 2) consisting of two main subsystems, the information subsystem and the computing subsystem, which interact with each other in a loosely coupled fashion. Each subsystem can be implemented in a distributed fashion, with various service components installed on separate machines that reside in different organizations2.

2 This paper does not focus on the distribution aspect of the information and computing subsystems.

The information subsystem is responsible for managing the metadata of concepts and objects according to the domain ontology. It offers corresponding querying and mutation facilities to client programs, and provides operations for searching and editing metacomputing descriptions. There can be several information servers that maintain a local repository with (part of) the system information, which communicate with each other to exchange/update data and forward queries. Such a distributed scheme may be implemented by employing protocols and mechanisms that have already been studied in the context of federated digital libraries [20].

The computing subsystem comprises one or more servers responsible for managing the execution of metacomputations. A metacomputation can be initiated on any server and results in a distributed, possibly long-running, computation that may involve several different data and program objects residing on different locations and grid systems. Any metacomputing or workflow system that has been developed for the grid can in principle be used to provide this functionality; see the section on related work for an overview. In any case, the computing subsystem should cooperate seamlessly with the information subsystem in order to discover data and program objects at runtime, if required, and to register appropriate metadata entries for newly generated data objects when a computation finishes. Conversely, the information subsystem invokes the computing subsystem to launch new metacomputations to produce data requested by users. It is exactly this “ecosystem-like” interaction between the two subsystems that achieves the desired level of integration.
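The loose coupling between the two subsystems can be summarized by a pair of small interfaces. The sketch below is only illustrative; the method names are assumptions rather than the actual API of our servers. It captures the call pattern described above: the information side can ask the computing side to launch a metacomputation, and the computing side reports newly produced objects back for registration.

```java
import java.util.List;
import java.util.Map;

// Hypothetical facade of the information subsystem.
interface InformationSubsystem {
    // Search for objects of a given ontology concept matching metadata constraints.
    List<String> findObjects(String concept, Map<String, String> constraints);

    // Register metadata for a newly generated data object (called by the computing side).
    void registerObject(String concept, Map<String, String> metadata);

    // Retrieve the workflow description linked to a data concept, if any.
    String workflowFor(String concept);
}

// Hypothetical facade of the computing subsystem.
interface ComputingSubsystem {
    // Launch a metacomputation from a workflow description; returns an execution handle.
    String launch(String workflowDescription, Map<String, String> inputParameters);
}
```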

4. Prototype implementation and status

We are developing a system along the lines of the architecture presented in the previous sections. The current implementation supports a significant part of the envisioned functionality, and has been used to build a prototype collaboratory for ocean engineering, focused on the monitoring, modelling and prediction of waves for geographical areas of interest. The most relevant features of the implementation are discussed below; a detailed system description can be found in [17].

4.1 Information management

The information system, implemented on a single server, allows programs to search and add data and program concepts and objects via a domain-specific ontology [15, 28]. We have designed a scientific (ocean wave) ontology that comprises different facets, which describe data sets and data processing codes, including simulation models and visual data analysis tools. Facet-based engineering of an ontology scales well to large scientific ontologies because new information may be appended as the needs of different users or organizations evolve over time. The ontology contains an “is-a” hierarchy of domain concepts, relationships between concepts and properties of concepts. There are two main concepts, each consisting of different facets that describe the data and program objects.

Figure 2. System architecture

The representation of resource metadata and ontologies is implemented using RDF [24] and RDF schemas [4]. Objects, classes, and properties are described using a standardized syntax with a set of modeling primitives such as “instance-of” and “subclass-of” relationships. Metadata description is ontology-driven and hence performed in a top-down fashion. In other words, when populating the ontology with a new object, data or program, it is described via attributes (properties) inherited from its parent class and, optionally, additional ones that are native to that instance.

This knowledge representation is combined with the query and storage capabilities of RDFSuite [1], which consists of three main components: an RDF validating parser (VRP), an RDF schema-specific storage database (RSSDB) and a query language (RQL) [18]. We exploit the mechanisms of RDFSuite to manipulate schema information in a dynamic and flexible way, and to perform consistency checks when adding new metadata. Furthermore, it is possible to issue multi-level queries as a combination of expressions on ontological relationships, metadata attributes and metadata values. The information server can be invoked locally via a Java API and remotely via JSP and servlets. It is quite straightforward to support different remote interfaces if required; for example, this functionality could be exported as a grid or web service for more standardized access.
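As an illustration of the kind of multi-level query supported, the following sketch combines an ontological relationship (the subclass hierarchy) with a constraint on a metadata attribute. It is written with Apache Jena and SPARQL purely for presentation purposes; our implementation relies on RDFSuite and RQL instead, and the namespace and property names are invented.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.vocabulary.RDF;
import org.apache.jena.vocabulary.RDFS;

public class OntologyQuerySketch {
    public static void main(String[] args) {
        // Build a tiny in-memory RDF description base (stand-in for the actual store).
        Model model = ModelFactory.createDefaultModel();
        String ns = "http://example.org/ocean#";
        Resource dataSet   = model.createResource(ns + "DataSet");
        Resource waveField = model.createResource(ns + "WaveHeightField")
                                  .addProperty(RDFS.subClassOf, dataSet);
        model.createResource(ns + "wave-2004-02-aegean")
             .addProperty(RDF.type, waveField)
             .addProperty(model.createProperty(ns, "region"), "Aegean Sea");

        // A "multi-level" query: follow the subclass hierarchy of the ontology
        // while filtering on a metadata attribute value at the same time.
        String q = """
            PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            PREFIX ex:   <http://example.org/ocean#>
            SELECT ?obj WHERE {
              ?cls rdfs:subClassOf* ex:DataSet .
              ?obj rdf:type ?cls ;
                   ex:region "Aegean Sea" .
            }
            """;
        try (QueryExecution qe = QueryExecutionFactory.create(QueryFactory.create(q), model)) {
            qe.execSelect().forEachRemaining(sol -> System.out.println(sol.get("obj")));
        }
    }
}
```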

4.2 Workflow management

Metacomputations are defined in the form of workflows. A workflow specification can be constructed using elements such as: the program objects employed; the external data objects accessed; the intermediate, yet possibly persistent, data objects created and accessed; the sequence of processing (control flow); and the data flow between objects. The data objects used as components in a workflow are specified with reference to the system ontology. In turn, a workflow specification is registered with the information system, indirectly, via the data objects produced as results.

Our workflow specification language is based on XRL [29], a routing language in XML with substantial expressive power and mapping to formal semantics. Control flow is supported via basic constructs, such as “sequence”, “parallel-sync”, “conditional if-then-else”, and “while-do”, which can be combined to build complex descriptions. We also introduce a “rollback” primitive. This is similar to a “while-do” but requires explicit user interaction to decide whether to repeat a given step or proceed with the computation. With these primitives we have been able to express a wide range of workflow scenarios. A corresponding DTD is used to facilitate the editing and validation of workflow specifications.
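The following fragment suggests what such a routing description might look like for a simple wave-prediction scenario; the element and attribute names are only indicative, since the exact DTD is not reproduced here, and the Java wrapper merely checks that the description is well-formed XML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class WorkflowSketch {
    // Hypothetical XRL-style routing description; the element and attribute
    // names are illustrative, not the exact DTD used in our system.
    static final String WORKFLOW = """
        <route name="wave-prediction">
          <sequence>
            <task name="fetch-wind-field" object="data:WindField"/>
            <parallel_sync>
              <task name="run-wave-model" object="program:WaveModel"/>
              <task name="run-tide-model" object="program:TideModel"/>
            </parallel_sync>
            <while_do condition="error-above-tolerance">
              <task name="recalibrate-and-rerun" object="program:WaveModel"/>
            </while_do>
          </sequence>
        </route>
        """;

    public static void main(String[] args) throws Exception {
        // A DTD would normally be attached for validation; here we only check well-formedness.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(WORKFLOW)));
        System.out.println("Parsed workflow: " + doc.getDocumentElement().getAttribute("name"));
    }
}
```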

4.3 Workflow execution

Workflow execution is requested by a user (it could just as well be requested by a program) to generate data for particular concept(s) of the ontology. Workflows are executed using mobile agents on top of the Grasshopper system [14]. Each workflow execution request is assigned to a coordinator agent that is created for the purpose of managing, and acting as a point of reference for, the metacomputation. The coordinator agent subsequently creates one or more slave agents, each assigned the task of accessing a data object or invoking a program object, and monitors their execution. When the computation finishes, the coordinator registers the results with the information system by adding corresponding metadata entries, and terminates.
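The division of labour between coordinator and slave agents can be sketched as follows. This is not the Grasshopper-based code of our prototype; it is a plain-Java approximation using threads, with hypothetical task and callback types, intended only to show the coordinator's responsibilities of launching slaves, collecting their outcomes and registering the produced data.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical stand-in for the work handled by one slave agent: access a data
// object or invoke a program object, returning metadata about the outcome.
interface SlaveTask extends Callable<Map<String, String>> {}

class CoordinatorAgent {
    private final ExecutorService slaves = Executors.newCachedThreadPool();

    // Launches one batch of slave tasks (e.g. the branches of a parallel-sync step),
    // waits for them, and hands each produced result to the information subsystem
    // (reduced here to a callback) before terminating.
    void runStep(List<SlaveTask> tasks, Consumer<Map<String, String>> registerResult)
            throws InterruptedException, ExecutionException {
        List<Future<Map<String, String>>> outcomes = slaves.invokeAll(tasks);
        for (Future<Map<String, String>> outcome : outcomes) {
            registerResult.accept(outcome.get()); // annotate and register produced data
        }
        slaves.shutdown();
    }
}
```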

The slave agents act as proxies for the data and program objects assigned to them, allowing them to be controlled and monitored remotely. They also take care of data transportation between these objects according to the plan of the workflow. Slave agents can be deployed in various ways to achieve the desired functionality and efficiency. Figure 3 shows indicative implementations for an input-output combination of two objects residing on different machines: (a) remote object access with local data exchange between agents, (b) local object access with remote (possibly optimized) data exchange between agents, (c) a combination of the previous two options, and (d) using a single agent for local access and data transfer. These options can be intermixed to form arbitrarily flexible communication structures, for instance depending on the physical network topology. Our implementation currently supports only data and program objects that are accessible through local APIs3; hence slave agents have to migrate to the machines where their target objects reside, and only options (b) and (d) are employed.

Figure 3. Mobile agent deployment

3 Command-line execution mode for program objects and conventional file-based access for data objects.

4.4 User interaction

Users access the information server through a web-based front-end that allows them to navigate the ontology and issue search queries in order to find concepts and objects of interest (Figure 4). It is possible to inspect the descriptions of (remote) data and program objects, and data can be downloaded and visualized on any computer (Figure 5). Users may add new data objects at any point in time, by selecting the appropriate concept in the ontology and filling in the required metadata. Workflows are defined and edited interactively using a graphical editor or can be supplied directly as an ASCII/XML file in the proper format. These descriptions are stored in the information system as first-class objects and are linked to the data concepts of the ontology for which they can generate new instances.

Data generation is straightforward. The user simply searches for a data object that is an instance of a given concept of the ontology, with the desired metadata. If such an instance does not already exist, the system retrieves the corresponding workflow description and initiates its execution. A workflow can be executed in the background, and the user may request to interactively monitor the computation at any point in time. Also, the system may prompt the user when additional input parameters and decisions are needed to continue with the workflow execution.
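This “find it or else generate it” behaviour can be summarized in a few lines, reusing the hypothetical subsystem interfaces sketched in Section 3 (again an illustration, not the actual implementation):

```java
import java.util.List;
import java.util.Map;

class DataOnDemand {
    private final InformationSubsystem info;
    private final ComputingSubsystem computing;

    DataOnDemand(InformationSubsystem info, ComputingSubsystem computing) {
        this.info = info;
        this.computing = computing;
    }

    // Return matching objects if they exist; otherwise launch the workflow
    // linked to the concept and report the execution handle instead.
    List<String> findOrGenerate(String concept, Map<String, String> constraints) {
        List<String> hits = info.findObjects(concept, constraints);
        if (!hits.isEmpty()) {
            return hits; // data already available in some repository
        }
        String workflow = info.workflowFor(concept);
        String execution = computing.launch(workflow, constraints);
        return List.of("pending:" + execution); // result is registered on completion
    }
}
```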

The ability to monitor and interact with an ongoing computation proved to be very useful during the first test executions of a newly introduced workflow, where several attempts are needed to determine appropriate calibration values for critical input parameters and boundary conditions of simulation codes. Once a workflow has been properly parameterized, these values can be fixed as defaults in its description, relieving the user from having to supply them from scratch for each execution. It turned out that this is a convenient way to gradually move from workflow prototyping, performed by experts only, to fully automated data generation suitable for casual users.

4.5 Extensibility

Even though the prototype system focuses on a particular application environment, it is possible to “port” it to other scientific domains. In order to do this it would be necessary to define and populate a new ontology with different data and program concepts, and with different metadata schemata and metadata values for its objects. Nevertheless, the system architecture is ontology- and metadata-independent, so the information subsystem, the computing subsystem and the various support tools can be reused with little or no modification.

Any program may access the information subsystem in order to search for data objects that are instances of a given ontology concept and have certain metadata attributes and values. The workflow editor exploits this to support the programmer in the task of finding the components of a metacomputation. The current version of the computing subsystem is not as flexible, since agents do not perform dynamic discovery of data objects. As a consequence, it is not possible to execute workflows with abstract data components that must be resolved at runtime. Still, this late-binding ability can be introduced in a straightforward and modular way by augmenting the agent logic with the corresponding code4 that is already part of the workflow editor.

4 This functionality is not required in our application scenarios, which is the reason why our prototype does not support it.

Figure 4. The ontology browser

Figure 5. Data/result visualization

5. Related work

We view our work as being complementary to the existing Grid architecture [10, 11, 13]. This is because the Grid information services are designed primarily with the objective of registering and discovering general-purpose computing and storage resources. However, there is little support at the level of the individual, concrete data sets and program codes that may be available in the various repositories. Our work focuses on this particular issue and proposes a corresponding information management approach. It would be possible to extend the Grid information services in this direction, or to have a third-party system, along the lines of this paper, operating on top of several different, perhaps even incompatible, grid infrastructures; each object registered with the information subsystem would then contain a reference to the generic grid resource where it is hosted, along with grid-specific access directives for homing in on the specific object within that generic resource.

Considerable work has already been done in the area of metacomputing. WebFlow [3] was probably the first system to support the development of distributed applications via a visual programming interface. The Condor DAGMan metascheduler [8] allows the user to express dependencies between jobs to be executed using Condor pools of workstations [22]. The GridFlow system [5] provides a flexible mechanism for the dynamic scheduling of composite computations in a global network of different grids. Each grid is represented (proxied) by an agent which monitors local resource availability and collaborates with other agents to find suitable execution schedules at the level of metacomputations. GridAnt [21] supports coordinated job execution on top of different grids, with emphasis on the visual, interactive specification of workflows by casual users. It is based on the Ant [2] dependency-tracking tool, extended to communicate with remote grid resources via CoG [6]. We have implemented our own workflow engine using mobile agents mainly because the data and program objects of our application were installed on separate machines in an ad-hoc fashion and without any support for remote access as part of an organized grid infrastructure. Thus we require a mobile agent runtime platform (in our case Grasshopper) on each node that hosts or acts as a proxy for remote scientific resources. Moreover, our focus is on achieving efficient object access via migration; we do not address the problem of resource scheduling as in other work [5, 23]. It is however possible to substitute the entire computing subsystem with a different implementation that interfaces with standard grid infrastructure, along the lines of the aforementioned work. In conjunction with any such system, our information subsystem would still simplify the task of building a workflow by allowing users and agents to discover and inspect data and program objects that can be combined into a metacomputation.

The issue of elaborate resource documentation for the purpose of dynamic resource management is addressed by the Semantic Web [26]. The vision is that of a global environment where intelligent agents will not only discover and access individual resources but also perform sophisticated and combined information processing on behalf of users [16]. Notably, both grid computing and metacomputing are now being positioned in the context of web technologies. On the one hand, the Semantic Grid [25] is an effort to recast Grid services in the form of more knowledge-driven and metadata-rich web services. On the other hand, metacomputing is being explored in the context of the dynamic composition of web services [19, 27]. Along the lines of the Semantic Web, we view both data and programs as first-class resources to be documented in terms of their properties, semantics and access protocols. This is accomplished using metadata schemata with reference to an application specific ontology. In turn these descriptions can be exploited to search for appropriate resources, data or programs in an efficient way. We also consider metacomputation descriptions as objects that are discoverable in terms of their components.

A system with a similar vision is Chimera [12], which also addresses computation and data management in a holistic fashion. The integration model of Chimera is to capture data through schemata, which can be thought of as non-materialized database views, while programs are viewed as transformations used to perform a specific data derivation. Our view is more knowledge-centric, rather than database-oriented, with data sources and programs being documented via an information ontology that can be openly exploited by users or programs. Still, as in Chimera, new data can be generated, annotated with proper metadata and registered with the information subsystem in an automated fashion. A key difference, however, is that in our approach a metacomputation is defined by the hand of a domain expert in the form of a workflow, and linked to data concepts of the ontology. In Chimera, a metacomputing script is generated on the fly based on data dependencies of a given virtual schema. We have explored a similar approach in the past, using a Prolog-based system [7]. We argue that, due to the complexity of scientific resources, flexibility is enhanced by allowing metacomputations to be explicitly introduced by experts. For example, one may wish to experiment with a workflow that has formal or syntactic input/output mismatches which can be repaired via custom “glue code”; this can subsequently be added to the system ontology as a reusable first-class object. Of course, given accurate enough metadata for data and program objects, our approach also allows workflow descriptions to be derived without human guidance, via programs (agents) that perform the required matchmaking automatically.

6. Conclusions and future work

We have presented a system architecture that combines grid-like metacomputing with ontology-driven resource documentation in order to achieve integrated and efficient information and computation management. Casual users are practically relieved from having to deal with the tasks of data organization and computing, which are supported through mechanisms built around application-specific ontologies and a metadata-aware system. To paraphrase the vision of the Grid through the prism of the Semantic Web: discovery and access to information, which can also be produced on demand through metacomputing involving many different resources all over the world, should be just as natural and ubiquitous as access to electricity. Indeed, the servers of our architecture and system can be thought of as information access plugs for a particular application domain.

Our prototype system features several design and functional aspects of the presented architecture. We plan to extend our implementation to consider data and program objects exported as web services, as well as to investigate the use of existing grid workflow systems in tandem with our information system. We also wish to research in more depth the potential of the coordinator agent in terms of gathering data from previous executions to take more informed planning decisions for subsequent executions.

Acknowledgements

This work was funded in part by the European Union under the 5th and 6th Framework Programmes and IST research project contracts ARION IST-2000-25289 and AIM@SHAPE IST-506766, respectively.

References

1. S. Alexaki, V. Christophides, G. Karvounarakis, D. Plexousakis and K. Tolle, “The ICS-FORTH RDFSuite: Managing Voluminous RDF Description Bases”, Proc. 2nd International Workshop on the Semantic Web, WWW10.

2. The Apache Ant project, http://ant.apache.org
3. D. Bhatia, V. Burzevski, M. Camuseva, G. Fox, W. Furmanski and G. Premchandran, “WebFlow – A Visual Programming Paradigm for Web/Java Based Coarse Grain Distributed Computing”, Proc. Workshop on Java for Computational Science and Engineering, 1996.
4. D. Brickley and R.V. Guha, “Resource Description Framework Schema Specification”, Technical Report, W3C, 2001. http://www.w3c.org/TR/PR-rdf-schema
5. J. Cao, S. Jarvis, S. Saini and G. Nudd, “GridFlow: Workflow Management for Grid Computing”, Proc. 3rd IEEE/ACM International Symposium on Cluster Computing and the Grid, 2003.
6. The CoG Kits, http://www.cogkit.org
7. V. Christophides, C. Houstis, S. Lalis and H. Tsalapata, “Ontology-driven Integration of Scientific Repositories”, Proc. 4th Workshop on Next Generation Information Technologies and Systems, in R. Pinter and S. Tsur (eds), Lecture Notes in Computer Science 1649, Springer, 1999.
8. The Condor DAGMan (Directed Acyclic Graph Manager), http://www.cs.wisc.edu/condor/dagman
9. I. Foster and C. Kesselman, “Globus: A Metacomputing Infrastructure Toolkit”, International Journal on Supercomputer Applications, 11(2), 1997.
10. I. Foster, C. Kesselman, J. Nick and S. Tuecke, “The Physiology of the Grid: An Open Grid Services Architecture for Distributed Systems Integration”, Open Grid Service Infrastructure Working Group, Global Grid Forum, 2002.
11. I. Foster, C. Kesselman and S. Tuecke, “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, International Journal of Supercomputer Applications, 15(3), 2001.
12. I. Foster, J. Voeckler, M. Wilde and Y. Zhao, “Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation”, Proc. 14th Conference on Scientific and Statistical Database Management, 2002.
13. Global Grid Forum, http://www.gridforum.org
14. The Grasshopper 2 Agent Platform, http://grasshopper.de
15. T.R. Gruber, “The Role of Common Ontology in Achieving Sharable, Reusable Knowledge Bases”, in Allen, Fikes and Sandewall (eds), Principles of Knowledge Representation and Reasoning, Morgan Kaufmann, 1991.
16. J. Hendler, “Agents and the Semantic Web”, IEEE Intelligent Systems, 16(2), 2001.
17. C. Houstis, S. Lalis, M. Pitikakis, G. Vasilakis, K. Kritikos and A. Smardas, “A Grid-based Infrastructure for Accessing Scientific Applications”, International Journal of High Performance Computing Applications, 17(3), 2003.
18. G. Karvounarakis, S. Alexaki, V. Christophides, D. Plexousakis and M. Scholl, “RQL: A Declarative Query Language for RDF”, Proc. 11th International WWW Conference, 2002.
19. R. Khalaf and F. Leymann, “On Web Services Aggregation”, in B. Benatallah and M.-C. Shan (eds), TES 2003, Springer, 2003.
20. C. Lagoze, D. Fielding and S. Payette, “Making Global Digital Libraries Work: Collection Service, Connectivity Regions, and Collection Views”, Proc. ACM Digital Libraries, 1998.
21. G. von Laszewski, K. Amin, M. Hategan, N. Zaluzec, S. Hampton and A. Rossi, “GridAnt: A Client-Controllable Grid Workflow System”, Proc. 37th Hawaii International Conference on System Sciences, 2004.
22. M. Litzkow, M. Livny and M. Mutka, “Condor – A Hunter of Idle Workstations”, Proc. 8th International Conference on Distributed Computing Systems, 1988.
23. R. Raman, M. Livny and M. Solomon, “Resource Management through Multilateral Matchmaking”, Proc. 9th IEEE International Symposium on High Performance Distributed Computing, 2000.
24. The Resource Description Framework, http://www.w3.org/RDF
25. The Semantic Grid, http://semanticgrid.org
26. The Semantic Web, http://www.w3.org/2001/sw
27. S. Thatte, “XLANG: Web Services for Business Process Design”, Technical Report, Microsoft, http://www.gotdotnet.com/team/xml_wsspecs/xlang-c/default.htm
28. M. Uschold, M. Healy, K. Williamson, P. Clark and S. Woods, “Ontology Reuse and Application”, in N. Guarino (ed), Formal Ontology in Information Systems, 1998.
29. W. van der Aalst, H.M.W. Verbeek and A. Kumar, “Verification of XRL: An XML-based Workflow Language”, in W. Shen, Z. Lin, J.P. Barthes and M. Kamel (eds), Proc. 6th International Conference on CSCW in Design, 2001.