CHRONOPOLIS - federated digital preservation across time and space


CHRONOPOLIS Federated Digital Preservation Across Time and Space*

Reagan W. Moore
moore@sdsc.edu
San Diego Supercomputer Center

Fran Berman
berman@sdsc.edu
San Diego Supercomputer Center

Don Middleton
don@ucar.edu
National Center for Atmospheric Research

Brian Schottlaender
becs@library.ucsd.edu
UCSD Libraries

Joseph JaJa
[email protected]
University of Maryland

Arcot Rajasekar
[email protected]
San Diego Supercomputer Center

Abstract

There is a critical need to organize, preserve, and make accessible the increasing number of digital holdings that represent intellectual capital. This intellectual capital contains scientific records that are the basis for current research, future scientific advances, and education source materials for use by the public, educators, scientists and engineers now and for the foreseeable future. Chronopolis is a proposed model facility that will enable long-term support of irreplaceable and important national data collections, ensuring that: 1) Standard reference datasets remain available to provide critical science reference material; 2) Collections can expand and evolve over time, as well as weather evolution in the underlying technologies; and 3) Preservation "of last resort" is available for critical disciplinary and interdisciplinary digital resources at risk of being lost.

1. Introduction

As early as a decade ago [1], a Scientific American article warned a broad audience of the need to address the issue of preserving "the content and historical value of thousands of records, databases and personal documents [that] may be irretrievably lost to future generations if we do not take steps to preserve them now." Over the last 10 years, a considerable body of literature has been generated by scientific and library communities enumerating the complex issues that need to be solved [2,3,4]. Congress has mandated that data and information generated through public funding be openly accessible and preserved for the long term, and the National Archives and Records Administration (NARA) launched an Electronic Records Archives (ERA) initiative in order to fulfill its mandate to preserve "essential evidence" that is being increasingly created, transmitted, and stored electronically [5].

* These concepts are based on research by the NSF NPACI ACI-9619020 (NARA supplement), the NSF Digital Library Initiative Phase II Interlib project, the NSF NSDL/UCAR Subaward S02-36645, the NSF National Virtual Observatory, the NHPRC Persistent Archive Testbed (PAT), and the NASA Information Power Grid. Views and conclusions contained in this report are the authors' and should not be interpreted as representing the official opinion or policies, either expressed or implied, of the Government, or any person or agency connected with them.

However, science and engineering communities have yet to develop trusted digital preservation repositories for their important intellectual assets. At an April 2002 workshop sponsored by the National Science Foundation (NSF) and The Library of Congress (LC), participants urged: "Solutions are urgently needed to prevent further loss of valuable digital information.... these problems are urgent;... action is needed now, not some time in the future; .... everyone-from creators to custodians-must contribute to the solution..." [4].

Chronopolis is a proposed facility for the preservation of valuable data collections that represent the intellectual holdings of science and engineering disciplines as well as valued national collections that must persist for 100 years or more. These include scientific data collections from academic disciplines (such as earth science, bio-informatics, astronomy, ...), historical collections (digital federal records, historical society material), and digital library collections (art images, audio collections, web crawls).

The preservation architecture is based on an integration of digital library, data grid, and persistent archive technology, currently used at the San Diego Supercomputer Center to support scientific data collections. The preservation architecture integrates:

0-7803-9228-0/05/$20.00 ©2005 IEEE

* A scalable production digital library system for life cycle management of meritorious collections and provision of user access services;

* Data Grid technology for federation of collections across primary resource sites;

* A research and development (R&D) laboratory for developing and evaluating data preservation technologies for enhancement and evolution of the production system;

* An administration and policy effort for management of federated facilities and the investigation of sustainable "cost models" for long-term data preservation.

The components of a preservation facility are represented in the diagram in Fig. 1 below. The top half of the circle shows the set of services and infrastructure that are assembled through the integration of digital library, data grid, and persistent archives. The bottom half shows the activities that are essential for the support and evolution of both technologies and collections over the long term. The center of the figure indicates the knowledge generation that will be enabled through high-performance access to the data collections.


Figure 1. Chronopolis Components

1.1. Collection Life-Cycle Management and Scaling

Collections proceed through a life-cycle, from creation, to maintenance and update, to preservation and access, and finally to dissolution. A preservation facility must support not only the life-cycle of the collections, but also the life-cycle of the facility itself. Digital library life-cycle management consists of a combination of curation and preservation processes that are applied to each digital entity. A preservation environment has a well-defined set of archival processes that include:

appraisal - the determination of whether a collection is appropriate for preservation;

accession - the controlled process by which collections are imported and evaluated for completeness and correspondence to a submission agreement;

description - the process of assembling the context that will be used to describe provenance, integrity, structural, and behavioral characteristics of the data;

arrangement - the process of structuring the metadata into a collection hierarchy, and aggregating digital entities into containers for storage management;

storage - the process of replicating digital entities and metadata context onto at least three independent data grids to ensure that collections survive local disasters (e.g., fire, flood, earthquake, and operational errors);

preservation - the process of managing technology evolution and maintaining integrity by migrating to new media, new encoding formats, new information syntax, and new storage technologies as more cost-effective systems become available; and

access - the process of supporting discovery, manipulation, bulk retrieval, and knowledge generation, including implementing new access standards as well as access standards used by each research group.

Bulk operations are needed for registering (assigning logical names to each digital entity), loading digital entities (making a copy in the archive), and accessing the preservation facility.
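The archival processes and bulk operations listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the strictly sequential ordering, the names `ArchivalProcess`, `next_process`, `bulk_register`, and `storage_complete`, and the logical-name scheme are hypothetical and are not taken from Chronopolis or SRB.

```python
from enum import IntEnum


class ArchivalProcess(IntEnum):
    """Archival processes, numbered in the order the text lists them."""
    APPRAISAL = 1
    ACCESSION = 2
    DESCRIPTION = 3
    ARRANGEMENT = 4
    STORAGE = 5
    PRESERVATION = 6
    ACCESS = 7


def next_process(completed):
    """Return the first process not yet applied, or None when all are done.

    Assumes the processes are applied strictly in listed order.
    """
    for process in ArchivalProcess:
        if process not in completed:
            return process
    return None


def bulk_register(collection, filenames):
    """Assign a logical name to each digital entity (hypothetical scheme)."""
    return {name: f"/{collection}/{name}" for name in filenames}


def storage_complete(replica_grids, minimum=3):
    """The storage step asks for replicas on at least three independent grids."""
    return len(set(replica_grids)) >= minimum
```

Counting distinct grid names with `set` means that two copies on the same grid do not count as independent replicas.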

Chronopolis has been designed to support these activities using federated data grids. In the next sections, we describe the Chronopolis approach.

2. Chronopolis Model

The key concept underlying Chronopolis is a phased approach to the development of long-term preservation Cyberinfrastructure that can be scaled and evolved over time. Such an approach must provide:

* A model production system for collection management and preservation that is stable, can evolve with use and technology, and scale with expansion of individual and aggregate collections.

* A plan for the smooth integration of new technologies as they are developed and tested, so that the facility can increase capability and functionality without disruption to users.

* Well-managed administration of the facility which includes the integration of scientific, national, and commercial policies and procedures governing the authenticity and integrity of data, availability of data, security, retention periods, collection selection, and metadata standards.

* The exploration of policies and cost models for long-term preservation that ensure the protection of critical data collections beyond the life-time of the projects and efforts which generated them, and provide a plan for future maintenance, curation and use.

* The involvement of scientific communities in setting up and managing such a cyberinfrastructure. Preservation of scientific collections requires a community-wide consensus on metadata standards and access services.

2.1. Expanding Capacity and Minimizing Risk through Federation

A viable digital repository federates three independent data grids to minimize risk of data loss. The types of risk include media failure, operational procedural error, systemic vendor hardware or software failure, natural disasters, and malicious users. A three-site design provides a minimal number of replicas needed to protect against reasonable risk.

The federated preservation facility utilized by Chronopolis will provide three functionalities: that of a Core Center, a Replication Center and a Deep Archive. The Core Center (CC) and the Replication Center (RC) support user access to the digital holdings. The Core Center serves as the primary access resource. Initially, the Replication Center has mirror copies of the digital holdings to ensure user access if the Core Center is unavailable. Together these two sites can handle four of the risks:

* Second copy at the RC handles media failure

* Having the RC geographically remote from the CC handles natural disasters

* Having the RC administered by a separate team handles local operational errors

* Having the RC constructed using different vendor products handles systemic system failure.

To handle malicious users, the third site needs to be a Deep Archive (DA). This is a facility that stages all data that is submitted through archivist-controlled processes, allows no access by remote users, and manages all changes to the digital holdings as versions. The DA holds copies of all older versions and formats of the digital holdings. In this sense, the DA is an historical archival site.
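The mapping from site properties to the five risks can be expressed as a small check. The sketch below is an illustration only: the `Site` fields, the `covered_risks` function, and the example site values are hypothetical and simply restate the rules given in the text.

```python
from dataclasses import dataclass

RISKS = (
    "media failure",
    "natural disaster",
    "operational error",
    "systemic vendor failure",
    "malicious users",
)


@dataclass(frozen=True)
class Site:
    name: str
    region: str          # geographic location
    admin_team: str      # team that operates the site
    vendor: str          # hardware/software product family
    remote_access: bool  # False for a deep archive
    versioned: bool      # True if every change is kept as a version


def covered_risks(sites):
    """Return the set of risks that the given sites jointly protect against."""
    covered = set()
    if len(sites) >= 2:
        covered.add("media failure")            # a second copy exists
    if len({s.region for s in sites}) >= 2:
        covered.add("natural disaster")         # geographically remote copy
    if len({s.admin_team for s in sites}) >= 2:
        covered.add("operational error")        # separate administration
    if len({s.vendor for s in sites}) >= 2:
        covered.add("systemic vendor failure")  # different vendor products
    if any(not s.remote_access and s.versioned for s in sites):
        covered.add("malicious users")          # deep archive present
    return covered


# Hypothetical example sites.
cc = Site("CC", "west", "team-a", "vendor-x", remote_access=True, versioned=False)
rc = Site("RC", "east", "team-b", "vendor-y", remote_access=True, versioned=False)
da = Site("DA", "central", "team-c", "vendor-z", remote_access=False, versioned=True)
```

With these example values, `covered_risks([cc, rc])` reports the four risks handled by the Core and Replication Centers, and adding `da` covers the fifth.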

As the holdings increase, the federated facility can be extended to include more sites. The Replication Center can evolve into an autonomous center with digital holdings replicated onto other Chronopolis sites. Adequate networking is important for data transfer. For example, given current plans for high-speed national networking, sites on the Light Rail network and the TeraGrid network are viable candidates for inclusion in the national preservation backbone for supporting massive data analysis.

Federation technology makes it possible to replicate not only the digital holdings and the authenticity and integrity metadata, but also the name spaces used to identify archivists, identify storage resources, and manage access constraints. The replication of the name spaces means that the preserved digital holdings can be completely described independently of the choice of the storage technology. This concept of infrastructure independence is a key component of a preservation facility. All digital holdings must be exportable onto different choices of preservation technology to ensure that there are no unfortunate dependencies on technology that may become obsolete.
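Infrastructure independence can be illustrated with a catalog keyed by logical name, in which physical storage locations are ordinary attributes that can change. This is a minimal sketch, not the SRB metadata catalog: the catalog layout, the resource names, and the functions `migrate` and `export_description` are all hypothetical.

```python
import json

# Hypothetical catalog: logical names map to metadata and physical replicas.
catalog = {
    "/chronopolis/demo/survey.dat": {
        "owner": "archivist-1",
        "replicas": [
            {"resource": "cc-disk", "path": "/vault/0001"},
            {"resource": "rc-tape", "path": "/tape/A17/0001"},
        ],
    }
}


def migrate(catalog, logical_name, old_resource, new_resource, new_path):
    """Repoint one replica at new storage; the logical name never changes."""
    for replica in catalog[logical_name]["replicas"]:
        if replica["resource"] == old_resource:
            replica["resource"] = new_resource
            replica["path"] = new_path
            return
    raise KeyError(f"no replica of {logical_name} on {old_resource}")


def export_description(catalog):
    """Serialize the full description so it can be loaded at another site."""
    return json.dumps(catalog, sort_keys=True, indent=2)


# Move the tape replica to a newer storage system, then export and reload.
migrate(catalog, "/chronopolis/demo/survey.dat", "rc-tape", "rc-disk", "/pool/0001")
restored = json.loads(export_description(catalog))
```

Because users and services refer only to the logical name, the migration changes nothing they depend on, and the exported description does not rely on any particular storage product.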

Federation also means that it is easy to add new sites to the preservation facility. The SRB data grid enables two or more autonomous systems with different levels of trust to be federated in order to exchange and synchronize collections. Each federated system (called a Zone) is independently administered with its own metadata catalog and resources.

The CC, RC, and DA grids are independently administered Zones. Under federation, digital data and metadata exchange and synchronization are supported within and across Zones. For interactions between the core and replication Zones, the Chronopolis design: 1) replicates and maintains collection and user data and related metadata across Zones; 2) allows users to go across Zones and use data when they operate in a particular Zone; and 3) maintains independent ingest systems, in which the replication Zone is not allowed to ingest any new data beyond what is synchronized from the core Zone. For interactions between the core and deep archive Zones, the Chronopolis design: 1) establishes multiple Zones within the back-up or archive Zone; 2) stores holdings of the core Zone on a set of Zones within the deep archive Zone; 3) automates the ongoing archiving of core collections, including all data and metadata, in the deep archive Zone; and 4) prohibits exchange of user information.
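The interaction rules between Zones can be sketched as simple policy checks. The code below is a hypothetical model, not the SRB federation API: the `Zone` class and the `ingest` and `synchronize` functions are assumptions, and the sketch omits the versioning and the multiple sub-Zones of the deep archive.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    name: str
    accepts_new_ingest: bool  # False for the replication and deep archive Zones
    shares_user_info: bool    # False for the deep archive Zone
    holdings: dict = field(default_factory=dict)  # logical name -> metadata
    users: set = field(default_factory=set)


def ingest(zone, logical_name, metadata):
    """Add new data to a Zone, if that Zone is allowed to ingest."""
    if not zone.accepts_new_ingest:
        raise PermissionError(
            f"{zone.name} only receives data synchronized from the core Zone"
        )
    zone.holdings[logical_name] = metadata


def synchronize(core, target):
    """Copy core holdings to the target Zone.

    User information is exchanged only if the target Zone allows it.
    """
    target.holdings.update(core.holdings)
    if target.shares_user_info:
        target.users |= core.users


core = Zone("core", accepts_new_ingest=True, shares_user_info=True)
replication = Zone("replication", accepts_new_ingest=False, shares_user_info=True)
deep_archive = Zone("deep-archive", accepts_new_ingest=False, shares_user_info=False)

core.users.add("alice")
ingest(core, "/demo/file1", {"size": 10})
synchronize(core, replication)
synchronize(core, deep_archive)
```

After synchronization the replication Zone holds both the data and the user entry, while the deep archive holds the data only.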

2.2. Chronopolis and SRB

At the heart of a preservation environment is a robust production system that incorporates new, proven, cost-effective technologies that take into account present standards, scalability, and technology evolution. A preservation environment uses data grid technology to federate collections that are distributed across geographically remote sites, and to replicate and back up valued and irreproducible collections. The management of federated data collections involves procedures for the ingest of material, the coordinated management of collections with the originating site, and the development of both union catalogs and integrated concept spaces.

Grid technologies, and in particular data grid technologies, have been evolving rapidly under pressure from national research projects to provide mechanisms for the management of resources, users, files, and metadata [6]. The data grid evolution is enabling a convergence between the data management capabilities needed for digital libraries and the data management capabilities needed for job execution. Chronopolis will utilize these technologies and integrate them with capabilities from the Storage Resource Broker data grid [7], digital library life-cycle management systems, and other data federation and preservation software to create the preservation system.

The Storage Resource Broker (SRB), developed at SDSC, is a powerful data management technology that implements the virtualization mechanisms needed to incorporate new technology. SRB technology at SDSC currently supports more than 430 Terabytes of data for scientific collaborations including SCEC [8], BIRN [9], NVO [10], NSDL [11], TeraGrid [12], etc., as well as federal electronic records in a collaboration with NARA [13].

SRB data grid technologies provide the preservation mechanisms needed to control and track the authenticity and integrity of the archived collections. This includes consistent and persistent management of both administrative metadata about each file (location, access controls, checksums, usage audit trails, ownership, versions, creation time) and descriptive metadata that may be specific to a collection. The SRB supports a standard set of access operations for file and metadata manipulation that can be used to support new digital library interfaces, such as FEDORA [14]. The SRB supports organization of digital entities into collection hierarchies, making it possible to manage each preserved collection separately. Scientific communities are porting standard services on top of data grids to view and manipulate each data collection based on the attributes appropriate for their discipline.

SRB technologies have been integrated with DSpace, a freely available, open source system developed by the MIT Libraries and Hewlett-Packard Labs [15]. The DSpace environment includes a submission system that supports complex, flexible workflows, as well as access control and delivery of complex digital content. Although originally designed for use by individual academic research institutions to capture, archive, preserve, and publish scholarly research material, the integration with data grids will extend the system to support federation of collections across sites.

A viable preservation system will require implementing infrastructure-independent software-support mechanisms for all layers of the preservation architecture. Such a system is in production use within the National Science Digital Library, the National Virtual Observatory, the Biomedical Informatics Research Network, the NHPRC Persistent Archive Testbed [16], and the UCSD Libraries Digital Asset Management Service that is building upon DSpace. The preservation architecture components and production software systems include the following technologies:

User portals: provide access, manage collection formation, and manage collection life-cycle.

Process management systems: automate application of curation and preservation processes.

Web Services: implement infrastructure-independent archival and curation processes based on SOAP [17] and WSDL [18] standards.

Data Access: manipulate and display digital entities. Existing digital library metadata schema include METS [19] and the OCLC/RLG [20] Preservation Metadata Implementation Strategies, records-keeping preservation standards based on the International Standard Archival Authority Record (ISAAR) [21] and DOD 5015.2 [22], and archival description standards based on the International Standard for Archival Description (ISAD) [23], and Encoded Archival Description (EAD) [24].

Data Grids: provide support for shared collections distributed across multiple sites and storage repositories.

Grid Technologies: provide support for data analysis on distributed resources.

Persistent Archives: provide virtualization mechanisms for storage, data, information, and access; manage technology evolution; and support porting of new interfaces.

Persistent disk systems: provide commodity-based disk cache for on-line access to large collections.

High Performance Disk Caches: support massive data analysis. The TeraGrid SAN environment [12] has demonstrated access rates in excess of 5 GB/sec to the TeraGrid Linux cluster at SDSC.

Petabyte Archive: Long-term preservation will rely initially upon tape, as total operational costs for tape are still up to a factor of three lower than for commodity disk cache.

A preservation facility is a living entity that aggressively tracks technology evolution. New technology is incorporated when it can be done without diminishing stability, and when it offers greater capacity or functionality, and leads to reduced cost. The Chronopolis Research and Development Laboratory will provide a venue for evaluation of new technology for its suitability for production use and integration when such technology is deemed an improvement to the production system.
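This section lists checksums among the administrative metadata that SRB manages for each file. As a minimal sketch of how a recorded checksum supports an integrity audit across replicas, the following uses SHA-256 from the Python standard library. The choice of SHA-256 and the names `checksum` and `audit` are assumptions for illustration; the source does not say which algorithm or interface SRB uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def audit(replicas, recorded):
    """Return the sorted names of sites whose copy no longer matches.

    replicas maps a site name to the bytes currently stored there;
    recorded is the checksum saved in the metadata at ingest time.
    """
    return sorted(
        site for site, data in replicas.items() if checksum(data) != recorded
    )


# Hypothetical example: the copy at the third site has been damaged.
original = b"observation record 0001"
recorded = checksum(original)
replicas = {"CC": original, "RC": original, "DA": b"observation record 0O01"}
damaged = audit(replicas, recorded)
```

A damaged copy found this way can be replaced from any site whose copy still matches the recorded checksum, which is one reason multiple independent replicas matter.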

2.3. Collection Selection and Community Building

To determine collections preserved within Chronopolis, we will initiate a "Chronopolis Collections RFP" open to the broader science, engineering, academic and government community. This will allow us to select and preserve collections of demonstrated value to key communities and to select collections which exercise distinct "usage scenarios". A Selection Committee representing Chronopolis' broad constituencies will be formed in advance of the Collection RFP.

In addition, we plan to hold an annual Chronopolis Preservation Workshop whose focus is discussion of best practices, opportunities and challenges in long-term preservation. We expect the first Chronopolis Preservation Workshop to be held in the Spring of 2006.

3. Conclusion

In Navigating the Knowledge Infrastructure, Bauer, Cook, and Sullins ask, and not completely rhetorically, "Why create archives?" They answer their own question by noting that archives ". . . have traditionally served as a way for public memory, or record, to be stored and protected ... Public record is the essential ingredient of society . . . Without archives each society big or small would be constantly reinventing strategies for coping with everyday problems." The ambition manifest in Chronopolis is, simply, the development and deployment of the knowledge management Cyberinfrastructure in which society can place its archives (its "public memory"), the tools for getting them there, the standards for encoding/describing them, the tools for discovering them, the tools for manipulating them, the tools for sharing them, and the tools for protecting them across space and time.

References

[1] Rothenberg, J. "Ensuring the Longevity of Digital Documents." Scientific American. (January 1995).

[2] Atkins, D. E., et al. Revolutionizing Science and Engineering Through Cyberinfrastructure: Report of the National Science Foundation Blue-Ribbon Advisory Panel on Cyberinfrastructure. Washington, D.C.: National Science Foundation, January 2003.

[3] Library of Congress. Preserving our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program. Washington, D.C., October 2002.

[4] Hedstrom, M., et al. It's About Time: Final Report of the Workshop on Research Challenges in Digital Archiving and Long-term Preservation, April 12-13, 2002. Washington, D.C.: National Science Foundation and The Library of Congress, August 2003.

[5] National Research Council. Building an Electronic Records Archive at the National Archives and Records Administration. Washington, D.C.: National Academy Press, 2003.

[6] Moore, R. "Evolution of Data Grid Concepts", Global Grid Forum Data Area Workshop, January, 2004.

[7] SRB, Storage Resource Broker, Version 3.1, http://www.npaci.edu/dice/srb, 2004.

[8] SCEC, Southern California Earthquake Center, http://www.scec.org/.

[9] BIRN, The Biomedical Informatics Research Network, http://www.nbirn.net, 2002.

[10] NVO, National Virtual Observatory, http://www.srl.caltech.edu/nvo/, 2001.

[11] NSDL, National Science Digital Library, http://www.nsdl.nsf.gov/indexl.html.

[12] TeraGrid, http://www.teragrid.org/.

[13] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta (2000), "Collection-Based Persistent Digital Archives - Parts 1 & 2", D-Lib Magazine, March/April 2000, http://www.dlib.org/.

[14] FEDORA, Flexible Extensible Digital Object Repository, http://www.fedora.info/.

[15] DSpace Federation, http://libraries.mit.edu/dspace-mit/, http://www.dspace.org/.

[16] PAT, Persistent Archive Testbed, http://www.sdsc.edu/PAT.

[17] SOAP, Simple Object Access Protocol, http://www.w3.org/TR/2000/NOTE-SOAP-20000508/.

[18] WSDL, Web Services Description Language, http://www.w3.org/TR/wsdl.

[19] METS, Metadata Encoding and Transmission Standard, http://www.loc.gov/standards/mets/.

[20] PREMIS, PREservation Metadata: Implementation Strategies, from the Online Computer Library Center and RLG, http://www.oclc.org/.

[21] ISAAR, International Standard Archival Authority Record for Corporate Bodies, Persons and Families, from the International Council on Archives Committee on Descriptive Standards, http://www.hmc.gov.uk/icacds/eng/standards.htm.

[22] DOD 5015.2, Design Criteria Standard for Electronic Records Management Software Applications, http://www.hmc.gov.uk/icacds/eng/standards.htm.

[23] ISAD, International Standard for Archival Description, from the International Council on Archives Committee on Descriptive Standards, http://www.hmc.gov.uk/icacds/eng/standards.htm.

[24] EAD, Encoded Archival Description, http://www.loc.gov/ead/.