
From compliance to curation: ORA-Data at the University of Oxford Burgess et al

From compliance to curation: ORA-Data at the University of Oxford

Lucie Burgess¹, Neil Jefferies¹, Sally Rumsey¹, John Southall¹, David Tomkins¹, James Wilson²

1. Bodleian Libraries, University of Oxford

2. IT Services, University of Oxford

V1.2 24 May 2016

Abstract

The University of Oxford’s institutional repository for data, ORA-Data, was launched in May 2015. This paper describes a current-state assessment of ORA-Data in relation to the EPSRC policy framework on research data, and offers some future directions for implementing a more systemic lifecycle approach to research data curation at the University of Oxford, with reference to the Digital Curation Centre’s Curation Lifecycle Model. Insights from a survey we conducted into the attitudes and requirements of EPSRC-funded researchers at the University in relation to research data management are explored. Descriptive metadata and identifiers, preservation actions, access, use and re-use, and community engagement are considered. ORA-Data is positioned within the wider context of Oxford’s research data management roadmap.

Introduction

Data management has always been a key component of academic work. Researchers have always organized, filed, documented and stored their data in such a way as to enable its retrieval and analysis to underpin their research. Two-thirds of the researchers at the University of Oxford responding to a survey conducted in 2012 [1] indicated that such data management was an ‘essential’ aspect of their research, and that their research ‘would suffer significantly if my data were not properly managed’ (only one of the 314 people surveyed thought that it was ‘not important’). In recent years, however, expectations regarding how research data should be managed and, increasingly, published and disseminated have changed. Research Councils UK published its Common Principles on Data Policy in April 2011, and in 2012 the Royal Society’s report Science as an Open Enterprise [2] made ten key recommendations for universities, repositories, learned societies, researchers and funders relating to the use, curation, dissemination and openness of data for the public good.

In the more quantitative disciplines there is growing recognition that journal articles alone are insufficient to enable the reproducibility of research, leading to a requirement for mechanisms that enable data publication, dissemination and discovery with suitable accompanying metadata (see, for example, [3], [4] and [5]). Furthermore, concern has grown that without high standards of data curation, data-driven research can be hard to replicate, and that the value of data cannot be fully realized unless it is made available in such a manner that others can re-use it with confidence.

The University of Oxford ratified its Policy on the Management of Research Data and Records in July 2012 [6]. This policy declares that the University is responsible for ‘providing access to services and facilities for the storage, backup, deposit and retention of research data and records that allow researchers to meet their requirements under this policy and those of the funders of their research’; for ‘Providing researchers with access to training, support and advice in research data and records management’; and for ‘Providing the necessary resources to those operational units charged with the provision of these services, facilities and training’, which include the Bodleian Libraries and IT Services.

Since 2012, research data management activities at the University of Oxford have been heavily driven by the RCUK Common Principles on Data Policy [7] and in particular the EPSRC policy framework on research data [8]. One response by the University is ORA-Data, Oxford University’s digital registry, repository and service for the long-term preservation of and access to research data. The development of the service has been largely driven by the compliance requirements of EPSRC, although it is underpinned by a recognition of whole-lifecycle research data initiatives such as the Digital Curation Centre’s (DCC) Curation Lifecycle Model [9] and the LERU Roadmap for Research Data [10]. Although Oxford has contributed energetically to the development of these full-lifecycle initiatives, they have not to date fully underpinned the development of ORA-Data. Engagement with such models nevertheless provides an opportunity to embed best practice developed through significant community engagement.

Fragmented beginnings

The University has been offering services relating to research data management for many years. The Oxford Text Archive has been looking after digital texts in the humanities since 1976, IT Services’ HFS¹ client has been backing up (and to a limited extent archiving) research data since 1995, and the Bodleian Libraries have hosted a number of very important data collections, mainly in the humanities. Research Services and the Bodleian Libraries have been helping researchers navigate the requirements of funders for many years; the Bodleian Libraries and IT Services have advised researchers on how to complete technical plans and appendices; and library staff have offered consultations regarding standards and data organisation. The academic departments and IT Services both offer disk space and help with hosting data. Training that touches on elements of research data management is available from various sources. Furthermore, researchers have always made use of commercially available software, Web services, specialist data repositories, and other external support. However, these efforts were fragmented and not integrated into a single cohesive service.

Data management practices vary considerably between disciplines [11], [12], [13], and this means that different researchers may have quite different support requirements from their institution. Astronomers, particle physicists, computational biologists, geneticists, archaeologists, and others have developed their own international standards and infrastructures for data management and sharing (see, for example, the BioSharing Project [14] in computational biology, part of the Elixir initiative [15]), and in general researchers in such disciplines are used to preparing material to meet established standards. In many other disciplines attitudes and practices are far more variable. In some, there is little awareness of what is possible, and it is not uncommon for researchers to struggle to recall important contextual information about the data they themselves generated or compiled a few years earlier. At Oxford, the 2012 survey indicated considerable variability of data management practices in different disciplines and departments, from data stored on ageing hard drives or portable media where they are at risk of loss through deterioration or obsolescence, or within departmental infrastructures where they may not be widely visible or available for re-use; to use of data standards and shared community repositories; to more embedded data curation practices.

1 Hierarchical File Service, a petabyte scale tape back-up store.

In recent years, as research data management has risen up the agendas of funding agencies and investments have been made in developing research data management infrastructure, groups within the central services and academic departments of the University have undertaken projects to create new tools and services. At Oxford, the Managing Research Data Programme was created in 2013, and many projects were funded from it (see [16], [17], [18]). Notable projects included DataStage – a software application enabling research groups to set up shared drives with different levels of access for collaborators, facilitating the deposit of data and metadata into repositories; the Online Research Database Service (ORDS) – a cloud-hosted service, now in active service, enabling researchers to collaboratively create, edit, and share relational databases; and DataBank and DataFinder – an early repository and catalogue of research data produced at Oxford, the precursors to ORA-Data.

More recently, the central support of research data management activities at Oxford has been undergoing a transformation from a secondary to a core activity, and today our aspiration is to deliver research data curation as a single service through a collaboration between the Bodleian Libraries, IT Services and Research Services. This change in support structure is shared with many Russell Group universities in the UK (in particular Cambridge, UCL, Imperial, Edinburgh and Manchester), which are investing more in research data management activities in alignment with funder mandates and researcher requirements.

ORA-Data: long-term digital repository, registry and service

ORA-Data is developed and maintained by the Bodleian Libraries [19] on behalf of the wider University. Following an ‘alpha’ pilot period from December 2014, it was launched as a working beta pilot on 1 May 2015. It was developed largely in response to the EPSRC’s data policy which came into force on the same day, providing one illustration of the impact of a policy intervention in research data curation.

ORA-Data offers a service to archive, preserve and enable the discovery and sharing of data produced by Oxford researchers. Any type of digital research data, from across all academic disciplines, can be deposited in ORA-Data in any file format (although in practice standard file formats such as Excel, plain text or CSV are preferred for access reasons – see “Preservation Planning and Preservation Actions” below). The service uses the same technical platform as the long-established Oxford University Research Archive (ORA) for publications such as journal articles, conference proceedings and theses, so that data can be linked easily to, and browsed alongside, related publications.

The service is primarily aimed at:

• Oxford researchers who wish to include an entry for their dataset in the University's registry of research data, irrespective of where that data is archived;

• Oxford researchers who need a repository to deposit their research data, especially data that underpins publication, and data where the funding body requires that data to be discoverable, accessible and preserved; or

• Oxford researchers who wish their data to be openly accessible, following an embargo period.

However, ORA-Data is not intended to store:

• Data still in active (live) use by research projects;

• Data that has been created exclusively outside the University of Oxford; or

• Sensitive, confidential or personal data which has not been anonymised, or any other data which should not be made accessible during the period of archiving.

Furthermore, the use of ORA-Data is by no means obligatory for Oxford researchers – there may exist a more appropriate discipline-oriented repository, or certain funders may encourage use of a different service (for example, the Economic and Social Research Council (ESRC) encourages deposit of data in the UK Data Archive [20], which it partly funds). Nevertheless, Oxford researchers are strongly encouraged to at least create a data record in ORA-Data if their datasets have been deposited elsewhere and provide Digital Object Identifiers (DOIs²) or persistent URLs so that they can be linked. In that context, ORA-Data primarily offers a service for so-called ‘small science’ [21].
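As an illustration of how a registry entry might link to a dataset held elsewhere, the following minimal Python sketch normalises a DOI supplied in one of several common forms into a resolvable https://doi.org/ link. The function names, the record fields and the example DOI are hypothetical and are not taken from the ORA-Data platform; the pattern is a deliberately loose check rather than a full DOI validator.

```python
import re

# Loose shape check (an assumption, not a full validator):
# "10.", a registrant code of 4-9 digits, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that people commonly paste in front of a bare DOI.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ resolver link for a DOI."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognisable DOI: {doi!r}")
    return "https://doi.org/" + cleaned


def registry_record(title: str, external_doi: str) -> dict:
    """Build a hypothetical registry-only record for data archived elsewhere."""
    return {
        "title": title,
        "held_in_repository": False,  # the files live in another repository
        "link": doi_to_url(external_doi),
    }


# Hypothetical example DOI, used only to show the normalisation.
record = registry_record("Example dataset", "doi:10.1234/example.5678")
print(record["link"])
```

Storing the normalised link rather than the string as typed keeps registry entries consistent however the depositor pasted the identifier.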

Some of the key features that ORA-Data offers include:

• Assignment of a DOI for each dataset stored in ORA-Data on request, so that the data can be cited (this is particularly important for data that underpins publications).

• A curation service staffed by experienced personnel, helping researchers to describe their dataset appropriately, identify a suitable data steward, and comply with funder requirements.

• Rigorous review of metadata describing datasets before publication, complying at least with the DataCite minimum standard.

• Long-term preservation of data, so that it remains findable, accessible and reusable.

• Appropriate guidance for users via an ORA-Data-specific LibGuide [22] and a dedicated helpdesk for users.

• A licence is automatically assigned to datasets – currently CC0³, but other licences can be assigned if desired. However, there is some technical work to do to implement this full functionality in the ORA-Data platform, and at present generic terms of use are in force⁴.

• Data and metadata can be embargoed or made freely available as required.

• ORA-Data is listed in Re3Data.org⁵, the international registry of research data repositories provided by DataCite.

In addition to ORA-Data, the Bodleian Digital Library works closely with other University services to encourage good data curation practice across the whole research data lifecycle, which is discussed further below.

2 A DOI is a digital object identifier, in other words a persistent URL for a dataset.

3 Creative Commons Zero/public domain licence – see https://creativecommons.org/publicdomain/zero/1.0/

4 http://ora.ox.ac.uk/information/termsOfUse, accessed 30 January 2016.

5 http://re3data.org, accessed 30 January 2016.

EPSRC Readiness Assessment

Research Councils UK published its Common Principles on Data Policy for the management and curation of research data in April 2011 and updated them in July 2015 [7]. Since then, all UK Research Councils have published research data management policies (for example see [8], [23]), although they vary quite considerably [24]. The most stringent of these in terms of requirements is the EPSRC Policy Framework on Research Data (hereafter referred to as ‘the EPSRC expectations’), which was agreed in 2011 and came into force in May 2015 [8]. This policy became the main impetus for the development, functional design and service design of ORA-Data. The EPSRC policy differed from previous policies relating to research data management in that it placed the responsibility to ensure compliance not only on funded researchers but also on the research institution that employed them. It quickly became necessary for institutions to consider what infrastructure they needed to provide to ensure that their researchers could meet the new expectations, as they were referred to, or face unspecified sanctions.

The EPSRC placed an initial deadline of May 2012 on institutions to develop a ‘clear roadmap to align their policies and processes with EPSRC’s expectations’. As a major beneficiary of EPSRC funding, the University of Oxford swiftly put together just such a roadmap. Professor Paul Jeffreys, Director of IT at the time, arranged a workshop at which existing service providers and stakeholders in research data management from across the University met to map out and prioritize what would need to be implemented over the next three years. This exercise saw the existing and required future services mapped to an RDM lifecycle model for the first time at the University of Oxford. The completed roadmap featured a set of different infrastructure tracks, with rough development timetables against each. Although the initial timetable changed as understanding of requirements, priorities, and possibilities improved over the next three years, the general outline remained recognizable.

Figure 1. The original Oxford RDM Roadmap, from 2012. Note that what has become ORA-Data was at that time considered to consist of two component parts – DataBank and DataFinder.

With the roadmap in place, the next major deadline set by the EPSRC was the 1st May 2015. It was by this point that institutions in receipt of EPSRC funding were supposed to be fully compliant with the expectations. Thanks to the early investment in building the research data repository and supporting services, the University was in a strong position by the end of 2014 with an alpha version of ORA-Data in place, but most researchers remained blissfully unaware of the changes they were required to implement.

The first of the nine EPSRC expectations was that, “Research organisations will promote internal awareness of these principles and expectations and ensure that their researchers and research students have a general awareness of the regulatory environment and of the available exemptions which may be used, should the need arise, to justify the withholding of research data.”[8] Up until that point, the main channel for raising awareness of data curation issues had been the university’s research data management website [25], by now officially termed ‘Research Data Oxford’ or ‘RDO’. Whilst the RDO website contained a wealth of information about funder requirements and guidance as to how these could be met, it was not clear how many EPSRC-funded researchers knew to look there. Therefore we surveyed EPSRC-funded PIs to gauge whether more work needed to be done in engaging with researchers so that they understood their responsibilities.

The survey was designed and developed by Amanda Flynn, Digital Scholarship Support Officer at the Bodleian Libraries, and circulated to all 149 PIs in receipt of EPSRC funding at the University at that time. 48 responses were received (32%). Only 18% of the respondents were fully aware of the EPSRC’s policy on research data management, although half knew of its existence but not the content; 32% knew nothing about it. The survey also established that consideration of data archiving left much to be desired, with 69% not having a clear idea of where to deposit their research data at the end of their project. Most respondents cautiously supported the principles of open data. The majority (60%) felt that all of their research data could be made openly available at the end of their projects, while 37.5% felt that some but not all elements would require an embargo. Only 2.5% felt that none of their data could be made open. The most common reasons for wanting to restrict access were related to intellectual property, in particular the fear that making data openly available would reduce opportunities for commercial exploitation. This is of course of broader concern to the University, with the University’s commercialization service ‘Isis Innovation’ having a particular stake.⁶

Nearly half of responding PIs (49%) indicated that they would like more support from the university to meet the EPSRC’s requirements. Commonly cited reasons for this were: a need for better guidance from both the university and the EPSRC as to what actually constitutes research data; uncertainty over whether every piece of raw data should be archived; and a desire for more advice about how data should be formatted for archiving and sharing.

Although we were already minded to conduct a short awareness-raising project relating to the EPSRC expectations before carrying out the survey, the results reinforced the need for us to do so. This work began in early 2015, sponsored by the Pro Vice-Chancellor for Research, Professor Ian Walmsley, himself in receipt of EPSRC funding, and targeted at colleagues in the Mathematical, Physical and Life Sciences Division. Project staff drawn from IT Services, the Bodleian Libraries, and Research Services held structured discussions with PIs, although in practice many PIs preferred to hold group or departmental workshops, which enabled the project staff to reach a wider audience.

Reactions to the EPSRC Expectations expressed during the interview and group sessions were mixed but generally negative. Most researchers felt that the Expectations would add to their already busy workloads and be of little value either to themselves or their wider disciplinary communities. Even when informed of the kinds of benefits the EPSRC envisaged arising from the Expectations, many researchers remained sceptical.

The Expectations were greeted more positively by researchers working in the fields of crystallography and aspects of biochemistry, where it is already common practice to share data and where the infrastructure to assist with this is already established at the supra-institutional level. This suggests that the benefits of data preservation and sharing start to become more obvious as the practice becomes more prevalent.

A large number of issues and concerns were raised by researchers over the course of the EPSRC Readiness Project. Some could be addressed immediately, others required exegesis of the EPSRC pronouncements and further discussion, and a few required consideration by the Research Data Management and Open Data Working Group – the committee responsible for governance of research data management-related issues at the University of Oxford. Many of the specific questions asked were incorporated into the EPSRC Frequently Asked Questions section of the Research Data Oxford website.⁷

6 See http://isis-innovation.com/about/

Speaking directly to the researchers helped inform decisions relating to the design of ORA-Data. Firstly, it became apparent that the initial process that had been envisaged for assigning DOIs would need to be adapted to ensure it could fit into researchers’ workflows. Secondly, the requirement for metadata harvesting from third-party repositories was confirmed, so that researchers would not have to describe their datasets twice when they had already been deposited elsewhere; this had already been successfully explored during a pilot project between the Bodleian Libraries and the Archaeology Data Service. This requirement is on our roadmap for the future, although it remains complex due to the significant variability in how datasets are described in third-party repositories (a situation which would be ameliorated if all data repositories adopted the DataCite minimum standard).

A third area of feedback from researchers concerned the proposed cost-recovery model. Initially, this was to consist of a flat fee for any data deposit to cover the cost of metadata review, with a per-gigabyte fee on top to cover storage costs. Several researchers complained that if they had only a single Excel spreadsheet to deposit, charging a flat fee of over a hundred pounds was unreasonable (even though the metadata review process would inevitably be the same whatever the size of data deposited). Use of ORA-Data is currently free to Oxford researchers and the future pricing model is under review. Other suggestions included better integration with the University’s Symplectic system for managing research information, which is on the development roadmap, and improvements to the interface, which have now largely been implemented.
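The originally proposed charging structure can be written as charge = flat fee + (per-gigabyte fee × size in gigabytes). A minimal Python sketch of that formula follows. The fee values are purely illustrative assumptions (the text states only that the flat fee was over a hundred pounds and gives no per-gigabyte rate), and the function name is hypothetical.

```python
from decimal import Decimal


def deposit_charge(size_gb: Decimal, flat_fee: Decimal, per_gb_fee: Decimal) -> Decimal:
    """Flat fee for metadata review plus a storage fee proportional to size."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return flat_fee + per_gb_fee * size_gb


# Illustrative values only; neither figure comes from an ORA-Data price list.
FLAT_FEE = Decimal("100")   # pounds, covers metadata review
PER_GB_FEE = Decimal("5")   # pounds per gigabyte, covers storage

# A single spreadsheet of about one megabyte (0.001 GB).
spreadsheet = deposit_charge(Decimal("0.001"), FLAT_FEE, PER_GB_FEE)

# A large image dataset of 500 GB.
image_set = deposit_charge(Decimal("500"), FLAT_FEE, PER_GB_FEE)

print(spreadsheet, image_set)
```

With these illustrative values the flat fee accounts for almost all of the charge for a single small spreadsheet, which is the concern the researchers raised, while for a large deposit the storage component dominates.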

Whilst further enhancements to ORA-Data are planned to meet researcher feedback, the existing service meets EPSRC expectations. We intend to carry out a similar project for researchers in receipt of ESRC funding during 2016.

However, our desire is to further develop ORA-Data to support whole-lifecycle research data curation activities, which is discussed further in the later sections of this paper.

Whole life-cycle models for research data

The Digital Curation Centre’s Curation Lifecycle Model [9] provides a high-level graphical overview of the full research data lifecycle and was developed through a community effort. The model enables the mapping of functionality onto a series of practical activities, which allows data creators, curators, and re-users of data to identify where they fit into the bigger picture. Through application of the model to working practices, organisations and individuals can identify whether there are missing or redundant steps in their data curation processes, and better define roles and responsibilities. The discussion and application of this model below focuses on the ‘whole lifecycle’ activities – curate and preserve, community watch and participation, preservation planning, and description/representation information. The sequential activities from conceptualise to dispose are also a consideration for us, but a full discussion of these is outside the scope of this paper.

7 http://researchdata.ox.ac.uk/epsrc-data-requirements-and-what-you-need-to-do/frequently-asked-questions/

Figure 2. The DCC Curation Lifecycle Model [9]

The LERU Roadmap for Research Data[10] takes a more institutional view of data curation challenges in addition to the detailed practical issues faced by data creators, curators and re-users. It focuses on six challenges: Policy and Leadership; Advocacy; Selection and Collection, Curation, Description, Citation, Legal Issues; Research Data Infrastructure; Costs; Roles, Responsibilities and Skills. Oxford intends to implement the LERU Roadmap and to update the published case study illustrated in the report. While we refer to the LERU Roadmap and would like to highlight its existence, a full consideration of Oxford’s response is outside the scope of this paper.

Create or receive; appraise and select; description and representation information

Although our efforts to develop ORA-Data have been largely driven by the EPSRC expectations, we are making efforts to implement more systemic curation activities indicated by the ‘Description/ Representation Information’ circle in the DCC Curation Lifecycle Model. In that context, review is a fundamental aspect of the service.

One of the full lifecycle actions in the model relates to description and representation information (the yellow innermost circle in Figure 3). Description and representation information about datasets held in ORA-Data supports primarily the corresponding sequential actions: Create or Receive, Appraise and Select, and Access, Use and Re-Use, although it plays a role in all.

Oxford researchers are encouraged to archive any digital data which might be deemed to underpin research outputs, especially where those outputs are published; this might range from small datasets, including data underpinning basic charts, through to larger quantitative data used to create data visualisations or large image datasets, for example the results of hyperspectral imaging.

Those depositing in ORA-Data are required to be members of the University of Oxford (access is authenticated using Oxford Single Sign-On), or are required to have been members of the University of Oxford at the point of creating their research outputs and research data (in which case retrospective deposit is facilitated by ORA-Data staff). Basic advice for data producers is provided via the Bodleian Digital Library’s data archiving web-pages [26] and more detailed guidance is provided from the ORA-Data LibGuide [22].

Datasets and metadata are deposited using a clearly presented 6-step online deposit form, with a combination of mandatory fields in accordance with the DataCite standard [27], optional fields and concise pop-up guidance for each. The first of these steps requires the depositor to agree to the ORA-Data Deposit Agreement [28]. Submission of data is governed by our policy on submission, preservation and withdrawal [29].

All deposited datasets and metadata are subject to review by ORA-Data staff. Quality metadata is paramount in aiding discoverability both by manual searching and by search engines; where mandatory fields are not completed or contextual information is deemed to be insufficient, the data producer will be asked to supply further information by ORA-Data staff. In addition to mandatory metadata fields in compliance with the DataCite schema, we deem other metadata fields to be mandatory in ORA-Data in order to facilitate on-going curation or to ensure accessibility. Depositors are required to nominate a ‘Data Steward’, someone who is not already listed as a creator but who can act as a point of contact in the event of any queries arising about the dataset should the creator/s be unavailable. Depositors are required to include sufficient information about the dataset to make it easily intelligible to others, i.e. to document (either within the metadata or in a deposited ReadMe file) how, why and when their data was created and whether any specific software was used to create, edit or process the files, in support of long-term access and re-use.

Depositors are requested to provide funder details where their data was produced as a result of a grant award, both to acknowledge the funder and to enable ORA-Data staff to monitor which deposits comply with funder policies.

Further work to implement the DCC Curation Lifecycle Model approach: Create or receive; appraise and select; description and representation information

The following steps are in progress to fully implement a whole lifecycle curation approach:

• The Open Researcher and Contributor ID, or ORCID, is a persistent digital identifier that unambiguously distinguishes researchers and, through integration with research workflows such as manuscript and grant submission, supports automated linkages to improve citation. In September 2015 we launched the ORCID at Oxford service, which enables Oxford researchers to link their ORCID to their Oxford authentication credentials (single sign-on via Shibboleth). As of the end of February 2016, 2,372 Oxford researchers had obtained and linked their ORCIDs using the service. We are investigating how ORCIDs could be integrated into ORA-Data through an integration scoping study, contacting comparable institutions and stakeholders to understand how use of ORCIDs might be most beneficial to them – for example, enabling depositors to automatically link their ORCID to a data deposit or search by ORCID. Once we understand the requirements, we will make the requisite technical enhancements.

• Further refinement of licensing and intellectual property policies is planned to support access, use and re-use. Depositors are currently able to select a standard licence from a drop-down list, or choose an option 'bespoke licence' and enter the text of any licence, access terms and conditions (though it reverts to CC0 if an alternative is not selected); the online guidance directs depositors to the DCC for guidance on how to license research data. Meanwhile, depositors are able to identify a ‘Rights Holder’ other than the ‘author/s’ and include a rights statement ‘as required by the publisher or other legal entity’. There is often uncertainty regarding data ownership and rights which needs to be clarified in formal policy.

• A number of technical enhancements are anticipated, including integration of data management plans into ORA-Data (especially where such plans are a stated requirement of a funder) and improved metadata harvesting capability (both to and from other repositories, and from existing citation files such as BibTeX or EndNote). We are also working on a more structured approach to embargoes; at present, depositors select their own embargo option irrespective of their funder, yet the funder information entered elsewhere in the deposit process could be used to provide tailored advice on the length of embargo permitted, or indeed to set it as the default automatically.

• Updated versions of an existing dataset currently require deposit with a further individual data record. While each different record can be version-numbered and linked to one another, this compromises the integrity of the deposit and distorts context. Again, it is anticipated that transfer to a more robust infrastructure in 2016 will facilitate versioning.

Preservation planning and preservation actions

Preservation planning is a critical aspect of the Curation Lifecycle Model. Preservation plays a key role in the Bodleian Digital Library and this is reflected in both its own policy statement [30] and the University’s Research Data Management policy [6]. The Curation Lifecycle Model makes a distinction between preservation planning (for example, the development of data management plans or storage options) and preservation actions (for example, bit-level preservation or file-format migration). The corresponding sequential actions in the model are Conceptualise, Preservation Action, and Store. ORA-Data contributes both to preservation planning, through collection of appropriate metadata (for example, in mandating that depositors name a data steward), and to preservation action, through provision of secure long-term storage. More specifically, ORA-Data’s retention and preservation policy asserts that:

1. Every reasonable effort will be made to retain items indefinitely.
2. The repository will try to ensure continued readability and accessibility.
   o Items will be migrated to new file formats where possible and if deemed necessary.
   o It may not be possible to guarantee the on-going readability of all file formats.
3. The repository regularly backs up its files according to current best practice.
4. The original bit stream is retained for all items, in addition to any upgraded formats.
5. In the event of the repository being closed down, the database will be transferred to another appropriate archive in the control or management of the University of Oxford.

The first of these statements complies as far as possible with EPSRC Expectation 7, which states that research data should be ‘securely preserved for a minimum of 10 years from the date that any researcher ‘privileged access’ period expires or, if others have accessed the data, from the last date on which access to the data was requested by a third party’. While we may not be able to guarantee preservation in ‘perpetuity’, we do not intend to dispose of any of the datasets with which we are entrusted unless requested to do so by the data steward; although some datasets are under embargo, a disposal request has not yet been tested.

ORA-Data is technically agnostic with respect to file types and metadata formats as a result of a very simple digital object model based on that used by the Fedora repository platform8, although in practice certain formats are preferred for access. An object comprises a number of files, which may be data or metadata in any format, with a manifest (also a file) providing a description of each part. As a result, any file type can be deposited and stored, as can any metadata in any format, and objects can comprise multiple, related files. ORA-Data itself only provides bit-stream preservation of research data; it does not attempt to handle or express the internal structure of complex objects such as the structure of a relational database or the user interface features of a website, although this can be captured by making use of extended metadata by including, for example, a database schema file. There is one exception: data objects can also be deposited in the form of zipped files that can be automatically unzipped if required, providing support for a hierarchical file structure within a dataset.
Prospective depositors are advised to consider which formats will ensure the broadest possible accessibility by others, both now and in the future and, if in doubt, to consider depositing more than one format of the same item. Where possible, we encourage the use of plain text files (such as .txt; .csv; .html; .xml) which are both human and machine readable, and can be opened in any operating system by a wide range of applications unlike some proprietary software formats. If datasets are created using proprietary software or instruments then continued access and reuse may cause issues in the future as some manufacturers protect the internal structure of their files and others do not make future upgrades back-compatible. Datasets deposited in ORA-Data have an access copy maintained on redundant disk storage spread across two data centres, with a preservation copy stored on off-site archival tape. At present, explicit data integrity checking is undertaken on the version of the data backed up on tape, while the disk copies rely on the intrinsic integrity checking of the file system. A new storage platform (for ORA-Data and other services) in 2016 will facilitate integrity checking (e.g. checksums) and cross-checking between the multiple copies.
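As an illustration of the checksum-based integrity checking and cross-checking between copies mentioned above, the following Python sketch computes a SHA-256 digest for each copy of a file and reports any copy that no longer matches the digest recorded at deposit. The choice of SHA-256, the function names and the idea of a single recorded digest per file are assumptions for the example; the paper does not say how the new storage platform will implement this.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cross_check(recorded_digest, copies):
    """Return the paths of copies whose digest differs from the recorded one."""
    return [str(path) for path in copies if sha256_of(path) != recorded_digest]
```

In such a scheme, a non-empty result would identify which of the disk or tape copies needs to be repaired from a copy that still matches.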

8 http://fedorarepository.org - accessed 21-Jan-2016


Meanwhile, ORA-Data staff also monitor deposits for outmoded (or soon-to-be outmoded) file formats at the review stage of the deposit process, and contact depositors to advise them as necessary. These actions are governed by our policy on submission, preservation and withdrawal [29]. At this stage we have not needed to migrate any deposited files to new formats.

Further work to implement the DCC Curation Lifecycle Model approach: Preservation planning and preservation actions

The migration of ORA and ORA-Data to a new hardware infrastructure in 2016 will ensure stability in the long-term and facilitate a number of service improvements, including faster networking and a considerable increase in storage capacity. For the time being, active preservation techniques such as file format migration, format validation and automated metadata extraction are not yet in place, and are the subject of future work. In practice, the diversity of formats encountered in research data means that the automatic application of these techniques may be somewhat less effective than when applied to more controlled corpora.

Improvements are planned to our preservation policies. The current preservation policy [29] states that the Bodleian will make every reasonable effort to retain data indefinitely; this is probably unrealistic and goes beyond funders’ expectations. Our suite of preservation policies needs to be aligned, in full consultation with stakeholders at Oxford. This will be a significant task.

Community watch

The Curation Lifecycle Model contains a full lifecycle action of ‘Community Watch’, which corresponds to ‘Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software’. Our work in Oxford has focused on community activities around the research communities at the University, with participation in standards development but much less so on tools and software, because the latter are so diverse and ORA-Data does not have the capability to preserve software.

Elements of a wider advocacy and communications infrastructure were already in place before the development of ORA-Data due to previous work on individual projects. Some of this was located in IT Services, Research Services and within the Bodleian Libraries. For example, IT Services had already appointed a Senior Project Manager to lead projects, integrate policy development and support the University’s RDM and Open Data Working group, a committee comprised of senior academics and professional support staff. Since 2013 we have made further efforts to integrate this activity more systematically and give it focus. In addition to the Head of Scholarly Communications and RDM and the ORA-Data team within the digital library, the Bodleian Libraries appointed a Data Librarian with the specific role of developing support and training services within the libraries, establishing a growing network of subject librarians; and appointed an ORA-Data Service Manager to lead policy development and service design and implementation.

The establishment of a single research data service website was another important step in developing services overall. ‘Research Data Oxford’ (RDO) is a single coherent source of information and guidance for all Oxford researchers. The site provides guidance on data curation activities across the full lifecycle, from information on funding applications, an Oxford-specific version of the DCC’s DMPOnline tool, through to issues of digital preservation and intellectual property rights. This web-based support is underpinned by a virtual helpdesk, staffed by a behind-the-scenes network of support staff drawn from all the divisions. Questions emailed to the RDO contact address are filtered and managed using a ticketing system by around 20 staff in this virtual network located all over Oxford. These interactions provide an excellent way of keeping in touch with research communities and evolving our plans. The virtual helpdesk also informs additional training or consultancy sessions to be delivered in small workshops through the iSkills training programme or at a departmental level.

Significant advocacy efforts take place beyond the University at workshops and conferences, allowing exchange of ideas and a consideration of the approaches other institutions are taking. Outreach work has been important in communicating to the outside world what is being achieved at Oxford and to share the results of our work with others and vice-versa, and to further reflect on aims and goals. For example, the Bodleian Libraries in partnership with the Universities of Oxford, Cambridge, Edinburgh, Warwick and UCL hosted a working group of the Alan Turing Institute joint venture partners to develop a research data management policy for the UK’s leading data science institute, and led a symposium “Reproducibility for Data Intensive Research” in partnership with the same universities plus the universities of Manchester and Newcastle, which was attended by international research leaders in April 2016.

Further work to implement the DCC Curation Lifecycle Model approach: Community watch

With the Economic and Social Research Council (ESRC) Research Data Policy [23] following on swiftly from the publication of the EPSRC’s expectations, a review of current awareness of ESRC expectations within the University is planned for 2016. The ESRC’s philosophy regarding the role and responsibilities of the data producers it funds is similar to that of the EPSRC, but there are some key differences which may necessitate a different institutional response. In particular, the implementation guidance for Principle 7 of the policy states that “The use of public funds to support the management and sharing of publicly-funded research data is achieved by grant applicants including the costs of data management and data preparation for sharing in their grant proposal, so that data can be made available for re-use. ESRC funds its data service providers to guarantee (curation and) long-term preservation of all research data deposited by grant holders. Such costs for long-term preservation can therefore not be included in grant proposals.” While Oxford provides institutional support for data management and preparation in a number of ways, particularly through its ‘Research Data Oxford’ (RDO) web presence and the Oxford-specific version of the DCC’s DMPOnline tool, the role of ORA-Data as a repository for ESRC-funded research data becomes less clear when the funder provides its own data service provider and institutional preservation costs cannot be reclaimed. Therefore, it seems unlikely that ESRC-funded data deposits in ORA-Data will match those of the EPSRC.

Meanwhile, data deposits from the Humanities remain under-represented in ORA-Data, though some preliminary survey work will be revisited over the coming months to identify requirements in this area and to inform service development. The DHARMa (Digital Humanities Archives for Research Materials) project ran from 2013 to 2014 and worked with a number of Oxford-based Digital Humanities projects to investigate the research data they used and/or created; it produced a number of key requirements and recommendations, as well as a number of sample datasets for future ingest into ORA-Data [31].

The Bodleian works with agencies such as the Digital Preservation Coalition9 and the Digital Curation Centre10 to ensure best practice and shape future preservation strategy, and has recently commenced a joint project with Cambridge University Library, generously funded by the Polonsky Foundation, to identify and refine future preservation requirements. Although this project is focused on digitised and born-digital cultural heritage collections we intend to apply the findings more broadly to humanities datasets.

Oxford is participating in the Jisc-funded UK Research Data Discovery Service[32], which will improve our OAI-PMH endpoint to enable metadata harvesting from and into other repositories.
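For readers unfamiliar with OAI-PMH, harvesting is driven by simple HTTP requests in which the operation is given by a `verb` parameter. The Python sketch below builds a `ListRecords` request URL of the kind a harvester would send; the base URL used in the usage note is a placeholder and not the ORA-Data endpoint, and the function name is an assumption for the example.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Selective harvesting: only records changed on or after this date.
        params["from"] = from_date
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("https://repository.example.org/oai", from_date="2016-01-01")` asks a hypothetical endpoint for Dublin Core records changed since the start of 2016.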

Curation: Access, Use and Re-use

The fourth full lifecycle action corresponds to “Curate and Preserve” with a sequential action of Access, Use and Reuse, which states inter alia “Ensure that data is accessible to both designated users and re-users, on a day-to-day basis”. Since the full-lifecycle and sequential actions relating to preservation have already been considered above, here we focus on curation in the context of access, use and re-use.

DOI assignment

All records submitted in ORA-Data are assigned a ‘universal unique identifier’ (UUID) which is used to create a persistent URL. Digital Object Identifiers (DOIs) can be issued for all datasets deposited in ORA-Data, or for datasets deposited in secure data stores elsewhere in Oxford University according to the Bodleian's policy for assigning DOIs to datasets [33].

Once a DOI has been issued, the terms of the University's DOI Allocation Agreement with the British Library (acting as the UK allocation agent of DataCite11) do not allow either the dataset itself or five mandatory metadata fields in the DataCite schema (identifier; creator; title; publisher; year published) to be amended. Any subsequent version of a dataset which has been assigned a DOI therefore needs to be considered as a new deposit and, at present, this necessitates duplication of metadata entry in each new record for a new version of the dataset. However, we intend to incorporate new functionality following an infrastructure upgrade scheduled for early summer 2016, which will allow multiple versions of the same dataset to be attached to the original data record in ORA-Data, with newer versions of the dataset retaining the same DOI string but each with a different version suffix12. At the time of writing, ORA-Data staff advise enquirers about the permanency of both the DOI and the digital object, and depositors will often choose not to request a DOI as part of the online deposit process and instead wait until they are absolutely sure that their dataset is the ‘final version’ (often determined by the publisher of the journal article which the data underpins) and then request a DOI retrospectively.

9 http://www.dpconline.org/ - accessed 21-Jan-2016
10 http://www.dcc.ac.uk/ - accessed 21-Jan-2016
11 https://www.datacite.org - accessed 18-Jan-2016
12 The DOI syntax is based on the Dryad model at http://wiki.datadryad.org/DOI_Usage - accessed 18-Jan-2016.

While researchers are increasingly aware of the importance of DOIs for citation, there remains some confusion as to how they should be used. We have encountered two issues in particular: first, reserving DOIs in advance; second, the way in which DOIs should be referenced.

• To ease reciprocal citation between published articles and the data supporting them, ORA-Data originally provided depositors with a reserved (but not yet minted, and therefore inactive) DOI upon request during the online deposit process. However, some researchers disengaged from the deposit process as soon as they were given a reserved DOI and neglected to complete the submission, meaning their reserved DOI was never actually minted (and remained inactive). Subsequently, we decided to withdraw this facility.

• DOIs are not URLs as such, but need a DOI resolver in order to work properly. An Oxford University DOI might read 10.5287/bodleian:000000000 but this string in itself will not provide direct access to the digital object it is identifying without some sort of resolver, although it will be found by search engines. In order to resolve directly, the DOI needs to be prefixed by a resolver (e.g. http://dx.doi.org), although this prefix does not specifically form part of the DOI. Because citation behaviour differs for online and print publications, ORA-Data staff differentiate between the two when offering advice on citing DOIs. For print publications, the full resolver-DOI string is suggested (e.g. http://dx.doi.org/10.5287/bodleian:000000000) while for an online publication the DOI alone is recommended (e.g. doi: 10.5287/bodleian:000000000) although its underlying link should include the DOI resolver in the full string.
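The citation advice in the second point can be expressed as a small Python sketch. It uses the resolver and the example DOI given in the text; the function name and the returned (display text, link target) pair are assumptions made for the illustration.

```python
RESOLVER = "http://dx.doi.org"  # resolver prefix named in the text


def doi_citation(doi, medium):
    """Return (display_text, link_target) for citing a DOI in print or online."""
    resolved = f"{RESOLVER}/{doi}"
    if medium == "print":
        # Print has no hyperlink, so the full resolver-DOI string is shown.
        return resolved, None
    if medium == "online":
        # Online, the bare DOI is shown but the link still goes via the resolver.
        return f"doi: {doi}", resolved
    raise ValueError(f"unknown medium: {medium!r}")
```

Calling `doi_citation("10.5287/bodleian:000000000", "online")` gives the short display form together with the resolver URL to use as its underlying link.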

Access and use

Deposits of data records and datasets in ORA-Data have been steady since the service was launched in May 2015. As of 30 April 2016, there were 119 ORA-Data datasets at different stages of the submission process (ranging in size from 0 to 22 GB): 100 published, 2 DOI-registered, 8 claimed, 5 with system issues (verified/failed), 3 referred, 0 submitted, and one deletion.


Figure 3: Monthly deposit and publication of datasets since the launch of ORA-Data

Monthly statistics are produced by the ORA-Data team to measure uptake of the service and which funders’ projects are being deposited. Since the EPSRC Readiness project has been completed, EPSRC-funded researchers have deposited significantly more than those funded by other bodies, suggesting the approach taken to informing researchers is effective, if labour-intensive.


Figure 4: Illustrating the impact of the EPSRC Readiness Project on deposit in ORA-Data

Figure 4 illustrates that 128 ‘standard’ funding grants had been declared by depositors of which 110 from ‘standard’ UK funders are shown (58 EPSRC, 11 John Fell Fund, 10 Heritage Lottery Fund, 8 European Research Council, 5 European Commission, 4 NERC, 2 JISC, 1 ESRC, 1 BBSRC, 1 STFC, 1 British Academy, 1 Wellcome Trust, 1 Leverhulme, 1 Royal Society, and a grant from the Government of Estonia administered through the European Social Fund).

It is not surprising to note that over half of the funding grants declared originated from the EPSRC, as it was that council’s publication of policy requirements regarding data that represented one of the main drivers for provision of the ORA-Data service in its present form. The trend was replicated in terms of divisional breakdown within the University, with 64 out of a total of 119 deposits originating from the Mathematical, Physical and Life Sciences (for which the EPSRC is a major funder), 24 from independent departments and units, 15 from the Medical Sciences, 10 from the Social Sciences, and 6 from the Humanities.

Further work to implement the DCC Curation Lifecycle Model approach: Access, use and re-use

At present, there is a limit on the size of datasets which can be uploaded online; individual files cannot exceed 2GB, and multiple files cannot total more than 5GB. Datasets which are larger than this require transfer from the depositor to ORA-Data staff through file-sharing facilities such as OxFile (the University’s file-sharing service, which facilitates transfer of up to 25GB) or commercial services such as Dropbox (which are not secure), or physical handover of USB drives or hard drives for particularly large datasets, and these are then manually uploaded by Bodleian staff. This process is burdensome and time-consuming for the depositor and uneconomical for the Library. Transfer of the repository to a new infrastructure will allow for online upload of far larger datasets in 2016.
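The current online upload rule can be stated precisely in a short Python sketch. The 2GB and 5GB limits come from the text; treating a gigabyte as 1024³ bytes and the function name are assumptions for the example.

```python
GB = 1024 ** 3  # assumption: binary gigabytes; the paper does not specify

MAX_FILE_BYTES = 2 * GB   # limit for any individual file
MAX_TOTAL_BYTES = 5 * GB  # limit for all files in one deposit


def can_upload_online(file_sizes):
    """Return True if a deposit with these file sizes (in bytes) fits the online limits."""
    within_file_limit = all(size <= MAX_FILE_BYTES for size in file_sizes)
    within_total_limit = sum(file_sizes) <= MAX_TOTAL_BYTES
    return within_file_limit and within_total_limit
```

A deposit that fails this check is one that would currently need to be transferred to ORA-Data staff by one of the manual routes described above.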

Statistical information on access and use in ORA-Data is limited to the number of times a data record has been viewed and the number of times a dataset has been downloaded. Meyer and Schroeder (2015) [34] argue that ‘A major issue standing in the way of widespread data sharing is the unresolved issue of how academic researchers can be assigned credit and rewards for contributing scientific and other data to a public archive’. Anecdotal feedback from researchers suggests that provision of more sophisticated metrics (e.g. temporal, geographical, citation counts, etc.) would allow them to assess the impact of their research outputs more effectively and provide greater incentive for sharing their research data in the future. Commercial data services are currently more advanced in the provision of such information, but the potential for using plug-ins and/or apps for generating metrics at this level within an HEI context will be explored.

The Bodleian Libraries was recently awarded funding by Jisc for the third phase of its project Giving Researchers Credit for their Data, a highly collaborative, community-based project with partners including publishers F1000 Research, Elsevier, Oxford University Press and Nature; data repositories ORA-Data, Figshare and Mendeley Data; and City University. The project team is developing a prototype ‘helper app’ to streamline the submission process for data papers and to reference datasets held in a wide range of repositories.

Costs, funding and business model


Oxford’s case study published in the LERU Roadmap for Research Data [10] was partly based upon DataFinder and DataBank as the precursor to ORA-Data. Establishing a workable cost model for ORA-Data is still very much an on-going process and, at present, the service remains free at point of use until a sustainable cost model is agreed and adopted. At the time of writing, the Bodleian Libraries has been awarded two years of additional operational funding from the University, partly to enable cost-recovery from research grants to be implemented.

Comparison with HEI and commercial providers regarding service/storage costs revealed considerable variance in charging models and in November 2014 Oxford drafted a model comprising the cost of storage space (£5 per gigabyte, based on the Bodleian’s established charge of £5,000 per terabyte) on top of a one-off charge of £140 per deposit to cover the associated technical and administrative costs of maintaining the service. For some months following the launch of ORA-Data, this draft cost model was disseminated to potential users of the service with the explanation that costs might be reclaimed through a Small Research Facility (SRF) or directly from funders as part of the grant application (where applicable). The reaction, however, was largely negative, especially as many respondents anticipated depositing data comprising far less than a gigabyte and therefore felt that an up-front charge of £140 was not only prohibitively expensive but essentially unfair. The few who responded positively to the draft model tended to be those who anticipated depositing datasets comprising several gigabytes or terabytes. A number of alternative models are currently being explored whereby the per-gigabyte cost might be increased slightly to include a scaled service cost.
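The arithmetic behind the negative reaction is easy to reproduce. The Python sketch below applies the November 2014 draft model (£140 per deposit plus £5 per gigabyte); charging fractions of a gigabyte pro rata is an assumption, as are the function names.

```python
DEPOSIT_CHARGE_GBP = 140.0   # one-off charge per deposit (draft model)
STORAGE_GBP_PER_GB = 5.0     # storage charge per gigabyte (draft model)


def draft_cost(size_gb):
    """Total charge in pounds for a deposit of size_gb gigabytes (pro rata assumed)."""
    return DEPOSIT_CHARGE_GBP + STORAGE_GBP_PER_GB * size_gb


def flat_charge_share(size_gb):
    """Fraction of the total charge accounted for by the one-off deposit charge."""
    return DEPOSIT_CHARGE_GBP / draft_cost(size_gb)
```

Under these assumptions a 0.5 GB deposit would cost £142.50, of which more than 98% is the flat charge, whereas for a 1,000 GB deposit (£5,140) the flat charge is under 3% of the total. This is consistent with the feedback that the model felt unfair to depositors of small datasets and acceptable to depositors of large ones.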

Table 1. Options for sustainable funding of ORA-Data (taken from RDM and Open Data Working Group discussion document, University of Oxford working paper, unpublished)

Option A: Central funding
Description: Service supported solely by central funding, such as an increase to the Bodleian Libraries baseline funding or the 123 infrastructure charge.
Pros:
• Enables deposit by any researcher who requires a data archiving service.
Cons:
• Requires approval and agreement of funding source
• Expense falls wholly on University

Option B: Central funding plus offset
Description: Service supported by central funding. Costs recovered where funding resources are available; if no source of funds is available, deposit is free.
Pros:
• Costs recovered from grants where permitted
• Enables deposit by any researcher who requires a data archiving service.
• Covers unanticipated shortfall for data quantity estimation errors.
Cons:
• Requires approval and agreement of funding source
• Unknown quantity of ‘free’ deposits
• Cost recovery may not be economically viable
• Grant income may not cover costs
• Grant income may be expensive to administer

Option C: Funded deposit only
Description: Data files may be deposited in ORA-Data only if funding is attached and can be recovered from an award or other source.
Pros:
• Only service users are charged
Cons:
• Direct charges related to usage have been shown to have a negative impact on user behaviour, resulting in an increase in cost per item, which may rise to a point unacceptable to funders.
• Prohibits deposit by researchers whose research is unfunded and by PGRs
• Cost recovery may not be economically viable
• Requires pump-priming or underwriting as the service develops before charges can be recouped
• Grant income may be expensive to administer

Conclusions: From compliance to full life-cycle curation

Launched in May 2015, ORA-Data is a relatively new service which is compliant with funders’ policy requirements for research data management, and particularly those of the EPSRC. Use of the DCC Curation Lifecycle Model is helping us to take a more strategic approach to planning for the future. The table below summarises an assessment of ORA-Data against the DCC Curation Lifecycle Model and activities that we intend to implement within the next two years to support a whole life-cycle approach to research data management. Oxford’s decision to host an institutional data repository internally was based on two key factors: the Bodleian Library had an existing technology platform which already served a similar purpose, and the EPSRC, the first of the UK funding councils to specify research data preservation requirements, emphasised the responsibility of research organisations themselves to ensure effective stewardship of research data. Commercial providers were contemplated but this was considered, at the time, to represent an undue surrender of control and risk for a business-critical service. Commercial agreements are subject to change, corporate objectives may diverge and providers may undergo changes of ownership or even cease to trade. As the market has developed, it should be noted that commercial solutions have emerged that address these concerns, although at the expense of some complexity. Some UK universities such as Loughborough and Salford have adopted institutionally-tailored research data services by working in partnership with Figshare13 and Arkivum14.

ORA-Data complements other data archives by providing a local archive for researchers who are not able to, or do not want to, deposit their data elsewhere. It is not intended to replace national, subject, or other established data collections but these facilities do not, by any measure, provide for all subjects and data formats. Data producers in Oxford are actively encouraged to create a record in ORA-Data linked to the dataset’s location, even if they consider an alternative repository a more appropriate place to actually deposit the material. As such, ORA-Data aims to be a catalogue of Oxford related data resources even if it does not hold all the data itself, and can be used to aid the repatriation of datasets should an external provider fail.

At the time of writing, ORA-Data has celebrated its 100th deposit, and in December it was awarded the Data Seal of Approval15. Establishing the service has provided an opportunity to understand far more closely the behaviour and requirements of researchers for curation of research data in one of the world’s leading research-intensive universities.

13 https://figshare.com/ - accessed 22-Jan-2016
14 http://arkivum.com/ - accessed 22-Jan-2016
15 http://datasealofapproval.org/ - accessed 22-Jan-2016


Table 2. Activities planned in the next two years to support a whole-lifecycle curation approach to research data management within ORA-Data

| DCC Curation Lifecycle Model action | Existing functionality within ORA-Data service | Improvements planned within the next 2 years |
| --- | --- | --- |
| Sequential activities | | |
| Create or receive | Supports online download up to 2 GB for a single file, 5 GB for multiple files. Limited support for versioning. | FTP, cloud services and other file transfer options under consideration. Improved versioning support, so that each updated version of a dataset can be given a version number and a related DOI. |
| Appraise and select | Review by expert metadata reviewers. All datasets accepted, but must not contain sensitive or confidential data; personal data must be anonymised. Confidential data out of scope within ORA-Data. Separate ‘Digital Safe’ service for confidential data and records under consideration. | Review process will be streamlined. Automated upload of existing metadata, e.g. from BibTeX files. Improved OAI-PMH harvest of metadata, allowing interoperability between ORA-Data and other repositories. Confidential data will continue to be out of scope. Improved embargo management. |
| Ingest | Automated ingest after review. | Other services are in place or planned to address the pre-ingest phases of the data lifecycle, including the Oxford DMP Online template and integration with the Online Research Database Service (ORDS).16 |
| Preservation action | Limited preservation actions (file format validation, virus checking). | Preservation workflow software under consideration for digital cultural heritage collections; learning will be applied to research data. |
| Store | Dual-site online storage and tape back-up in Bodleian-owned and maintained infrastructure. Bit-level fixity checking. | Major hardware infrastructure upgrade planned for summer 2016, with increased storage capacity (600 TB per node). |
| Access, use and re-use | DOIs are minted for datasets that point to a URL linking to the dataset in question. Each dataset in ORA-Data is described with a licence that governs its terms of use. KPIs and metrics describing download, citation and use are minimal. | Improved versioning support. Review of licence terms for access and use and technical enhancements to facilitate better display of licence terms. Significantly improved metrics on access and use to ensure that researchers are credited for data deposit. Integration of ORCID identifiers. Integration of FundRef and other identifiers under consideration. Jisc-funded project, Giving Researchers Credit for Their Data, will be completed and disseminated. |
| Transform | Not in place. | To be reviewed, but unlikely to be required within this timeframe. |
| Whole-lifecycle, ongoing activities | | |
| Data and databases | Any dataset can be held in ORA-Data, agnostic of file format. Databases cannot be preserved at present, although their outputs can be. | No change. |
| Description and representation information | 5 basic DataCite standard metadata fields required from researchers. Other metadata and description collected to enable access and use (e.g. a ReadMe file). ORCIDs, FundRef and other identifiers apart from DOIs not supported. Limited provenance information. | See ‘Create or receive’ and ‘Access, use and re-use’ above. Potential integration of provenance information using the W3C standard, PROV. |
| Preservation planning | Preservation policies in place, but need alignment with each other. Archival storage planning in place. Broad programme of advocacy and engagement with the digital preservation community. | Review, alignment and updating of preservation policies. Preservation workflows documented. Preservation software appraised, e.g. Archivematica. |
| Community watch and participation | Advocacy of RDM as a methodology relevant throughout the full research lifecycle. Regular promotion and training in key RDM issues: data management planning, effective metadata, repository and data-sharing options. RDO website developed and maintained as the key resource for RDM at the University of Oxford. Expectations of funding bodies regarding RDM clarified and disseminated (focus on EPSRC). | Expectations of funding bodies regarding RDM clarified and disseminated (focus on ESRC and other major funders who update or publish data sharing policies). Close collaboration with the Jisc RDM shared service. |
| Curate and preserve | Ongoing planning, review and improvements in place under governance of RDM and Open Data working group. | |
| Occasional actions | | |
| Conceptualise | Ongoing planning, review and improvements in place under governance of RDM and Open Data working group. | Improved strategic planning through application of the LERU Roadmap for Research Data. Improved operational planning through the RDM Delivery Group. Transition to a long-term funded, sustainable service with cost recovery through research grants. |
| Dispose | Current preservation policy prevents disposal: data will be retained indefinitely. This is perhaps unrealistic. | Draw up and consult on a disposals policy for datasets > 10 years old. |

16 For more information see http://researchdata.ox.ac.uk/2015/11/30/dmp-online-oxford-guidance-now-available/ and http://ords.ox.ac.uk/ respectively.
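The ‘bit-level fixity checking’ listed against the Store action in Table 2 is not specified further in this paper. As an illustration only, the sketch below shows one common way such a check can work: a checksum is recorded for every file at ingest and later recomputed and compared. The choice of SHA-256, the manifest structure and the function names are assumptions made for this example, not a description of the ORA-Data implementation.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetically, at ingest)."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Compare current checksums with the stored manifest.

    Returns a mapping of relative path -> "missing" or "changed".
    An empty mapping means every file still matches its recorded checksum.
    """
    problems = {}
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(target) != expected:
            problems[rel_path] = "changed"
    return problems
```

In a dual-site arrangement such as the one described in Table 2, a check of this kind could in principle be run against each copy, so that a damaged file at one site can be identified and restored from the other.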


Annex: Technical specification

The current ORA-Data service represents a hybrid technical solution based on pragmatic considerations. Originally, the data archive was planned as a separate service from the Institutional Repository, but user consultation and an awareness of the increasing diversity of research outputs that could be considered for impact and assessment exercises prompted a rethink: a single route for depositing or registering any research output, regardless of its form, made more sense. Oxford's Institutional Repository, ORA, already accepted theses, book chapters, conference papers and grey literature as well as peer-reviewed publications in a system based on the Fedora 3 repository platform. These deposit workflows were already in the process of being migrated from a home-grown code base to the Sufia Hydra framework17. We therefore considered that implementing a dataset deposit workflow would not significantly complicate the process, as much of the metadata handling (funders, researchers etc.) would be common.

Hydra is a standard set of Ruby-on-Rails code libraries and APIs for developing applications based on Fedora, and Sufia is an institutional repository application built on Hydra. Both are open-source projects that have developed significant communities for support, development and sustainability, which did not exist when ORA was first developed. The migration to Sufia was also used to move from a MODS18 representation of the core metadata to an RDF representation based on the W3C PROV-O ontology, with MODS generated programmatically if required, to integrate better with emerging Linked Open Data frameworks such as LD4L19 and Vivo. However, as a result of its heritage as a traditional IR, the Bodleian's Fedora 3 instance is not built for handling large volumes of data.
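The paper does not list the properties used in ORA's RDF metadata. As an illustration only, the sketch below expresses a few core fields of a dataset record as PROV-O and Dublin Core Terms statements and serialises them as N-Triples using the Python standard library. The example IRIs, the function names and the selection of properties are assumptions made for this example; only the vocabulary terms themselves (prov:Entity, prov:Agent, prov:wasAttributedTo, dcterms:title) come from the published vocabularies.

```python
PROV = "http://www.w3.org/ns/prov#"
DCTERMS = "http://purl.org/dc/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def iri(value: str) -> str:
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def dataset_triples(dataset_iri: str, title: str, creator_iri: str) -> list:
    """Describe a dataset as a prov:Entity attributed to a prov:Agent.

    The choice of properties is hypothetical, not ORA's actual mapping.
    """
    return [
        (iri(dataset_iri), iri(RDF_TYPE), iri(PROV + "Entity")),
        (iri(dataset_iri), iri(DCTERMS + "title"), literal(title)),
        (iri(creator_iri), iri(RDF_TYPE), iri(PROV + "Agent")),
        (iri(dataset_iri), iri(PROV + "wasAttributedTo"), iri(creator_iri)),
    ]


def to_ntriples(triples: list) -> str:
    """Serialise (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(dataset_triples(
        "https://example.org/dataset/1",   # hypothetical record IRI
        "Example dataset",
        "https://example.org/person/1",    # hypothetical researcher IRI
    )))
```

Holding the core record as statements of this kind is what would allow a legacy format such as MODS to be generated from it on demand rather than stored alongside it.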
Since 2008, a parallel Fedora-like file storage service, DataBank, has existed alongside ORA and held the small number of datasets (until 2014) that had been deposited alongside conventional publications. DataBank is accessed via a REST API with a minimal user interface and has grown primarily through the bulk scripted ingest of Bodleian digitised collections, consequently acquiring significant storage capacity. The deposit workflow therefore stores metadata in Fedora and routes the data file(s) to DataBank, replacing them with links, effectively treating the local data store like an external repository. In due course, when ORA migrates to Fedora 4 underpinnings, it is expected that data and metadata will be held in a single location. This will be supported by much more scalable storage and will also provide a coherent target for active preservation activities discussed later in this section.

ORA-Data shares a discovery and access interface with ORA itself, again in the interests of providing a unified experience. Currently, this is also a custom application, written in Python and using Lucene Solr to provide full-text indexing and faceted search. It will also be migrated to a Hydra-based application, using Blacklight to provide the Solr-derived discovery interface. Hydra applications generally separate deposit from discovery and access, so that it is simpler to implement, for example, tablet and mobile interfaces, which work better for content consumption than for creation and editing.

17
18 http://www.loc.gov/standards/mods/ - accessed 30 January 2016
19 https://www.ld4l.org - accessed 30 January 2016
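The deposit routing described in this annex (metadata held in Fedora, data files sent to DataBank, links left in their place) can be illustrated with a small sketch. This is not the ORA code: the two stores are hypothetical in-memory stand-ins, the base URL is invented for the example, and no real Fedora or DataBank API is called.

```python
from dataclasses import dataclass, field


@dataclass
class DataStore:
    """In-memory stand-in for the bulk file store (DataBank in the paper)."""
    base_url: str
    files: dict = field(default_factory=dict)

    def put(self, record_id: str, name: str, payload: bytes) -> str:
        """Store one file and return the link that will replace it."""
        key = f"{record_id}/{name}"
        self.files[key] = payload
        return f"{self.base_url}/{key}"


@dataclass
class MetadataStore:
    """In-memory stand-in for the metadata repository (Fedora in the paper)."""
    records: dict = field(default_factory=dict)

    def save(self, record_id: str, record: dict) -> None:
        self.records[record_id] = record


def deposit(record_id, metadata, files, metadata_store, data_store):
    """Route file payloads to the data store and keep only links with the metadata."""
    links = [
        data_store.put(record_id, name, payload)
        for name, payload in files.items()
    ]
    record = {**metadata, "files": links}
    metadata_store.save(record_id, record)
    return record


if __name__ == "__main__":
    fedora = MetadataStore()
    databank = DataStore(base_url="https://databank.example.org")  # hypothetical URL
    record = deposit(
        "uuid-0001",
        {"title": "Example dataset"},
        {"readings.csv": b"t,v\n0,1\n"},
        fedora,
        databank,
    )
    print(record["files"])
```

Because the metadata record holds only links, the same record shape would serve equally for a dataset held in an external repository, which is the sense in which the local data store is treated like an external one.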


References

[1] J. Wilson, ‘University of Oxford Research Data Management Survey, 2012’, 2012. [Online]. Available: http://ora.ox.ac.uk/objects/uuid:73070f0b-e4ba-42ba-912e-a246d70aba8e. [Accessed: 30-Jan-2016].

[2] G. Boulton, P. Campbell, B. Collins, P. Elias, W. Hall, G. Laurie, O. O’Neill, M. Rawlins, J. Thornton, P. Vallance, and M. Walport, ‘Science as an open enterprise’, The Royal Society, Jun. 2012.

[3] D. De Roure, ‘Pages of History: eResearch blog.’ [Online]. Available: http://www.scilogs.com/eresearch/pages-of-history/. [Accessed: 30-Jan-2016].

[4] S. Bechhofer, I. Buchan, D. De Roure, P. Missier, J. Ainsworth, J. Bhagat, P. Couch, D. Cruickshank, M. Delderfield, I. Dunlop, M. Gamble, D. Michaelides, S. Owen, D. Newman, S. Sufi, and C. Goble, ‘Why linked data is not enough for scientists’, Futur. Gener. Comput. Syst., vol. 29, no. 2, pp. 599–611, Feb. 2013.

[5] S. Bechhofer, D. De Roure, M. Gamble, C. Goble, and I. Buchan, ‘Research Objects: Towards Exchange and Reuse of Digital Knowledge’, Nat. Preced., Jul. 2010.

[6] ‘University of Oxford Policy on the Management of Research Data and Records.’ [Online]. Available: http://researchdata.ox.ac.uk/files/2014/01/Policy_on_the_Management_of_Research_Data_and_Records.pdf. [Accessed: 16-Jan-2016].

[7] ‘RCUK Common Principles on Data Policy’, 2015. [Online]. Available: http://www.rcuk.ac.uk/research/datapolicy/. [Accessed: 30-Jan-2016].

[8] ‘EPSRC policy framework on research data’, 2015. [Online]. Available: https://www.epsrc.ac.uk/about/standards/researchdata/. [Accessed: 30-Jan-2016].

[9] Digital Curation Centre, ‘DCC Curation Lifecycle Model’, 2015. [Online]. Available: http://www.dcc.ac.uk/sites/default/files/documents/publications/DCCLifecycle.pdf. [Accessed: 30-Jan-2016].

[10] P. Achard, P. Ayris, S. Fdida, S. Gradmann, W. Horstmann, I. Labastida, L. Lyon, K. Maes, S. Reilly, and A. Smit, ‘LERU Roadmap for Research Data’, 2013. [Online]. Available: http://www.leru.org/files/publications/AP14_LERU_Roadmap_for_Research_data_final.pdf. [Accessed: 26-Feb-2016].

[11] Digital Curation Centre/Key Perspectives Ltd, ‘Data Dimensions: Disciplinary Differences in Research Data Sharing, Reuse and Long Term Viability. A comparative review based on sixteen case studies’, pp. 1–31, Jan. 2010. [Online]. Available: http://www.dcc.ac.uk/sites/default/files/documents/publications/SCARP-Synthesis.pdf.

[12] C. L. Borgman, ‘The conundrum of sharing research data’, J. Am. Soc. Inf. Sci. Technol., vol. 63, no. 6, pp. 1059–1078, Jun. 2012.

[13] K. G. Akers and J. Doty, ‘Disciplinary differences in faculty research data management practices and perspectives’, Int. J. Digit. Curation, vol. 8, no. 2, pp. 5–26, Nov. 2013.

[14] ‘BioSharing.org.’ [Online]. Available: https://biosharing.org/. [Accessed: 31-Jan-2016].

[15] Elixir (various authors), ‘ELIXIR Scientific Programme 2014 - 2018’, 2014. [Online]. Available: https://www.elixir-europe.org/sites/default/files/documents/elixir_scientific_programme_final.pdf. [Accessed: 26-Feb-2016].

[16] J. A. J. Wilson, M. A. Fraser, L. Martinez-Uribe, P. Jeffreys, M. Patrick, A. Akram, and T. Mansoori, ‘Developing Infrastructure for Research Data Management at the University of Oxford’, Ariadne, no. 65, 2010.

[17] J. A. J. Wilson and P. Jeffreys, ‘Towards a Unified University Infrastructure: The Data Management Roll-Out at the University of Oxford’, Int. J. Digit. Curation, vol. 8, no. 2, pp. 235–246, Nov. 2013.

[18] J. A. J. Wilson, L. Martinez-Uribe, M. A. Fraser, and P. Jeffreys, ‘An Institutional Approach to Developing Research Data Management Infrastructure’, Int. J. Digit. Curation, vol. 6, no. 2, pp. 274–287, Oct. 2011.

[19] ‘Bodleian Digital Library Systems and Services.’ [Online]. Available: http://www.bodleian.ox.ac.uk/bdlss. [Accessed: 16-Jan-2016].

[20] ‘UK Data Archive.’ [Online]. Available: http://www.data-archive.ac.uk/. [Accessed: 30-Jan-2016].

[21] M. H. Cragin, C. L. Palmer, J. R. Carlson, and M. Witt, ‘Data sharing, small science and institutional repositories’, Philos. Trans. A. Math. Phys. Eng. Sci., vol. 368, no. 1926, pp. 4023–4038, Sep. 2010.

[22] A. Flynn, ‘Oxford LibGuides: Oxford University Research Archive for Data: Getting Started.’ [Online]. Available: http://ox.libguides.com/ora-data/. [Accessed: 26-Feb-2016].

[23] ‘ESRC Research Data Policy’, 2014. [Online]. Available: https://www.epsrc.ac.uk/about/standards/researchdata/. [Accessed: 26-Feb-2016].

[24] ‘Overview of funders’ data policies - Digital Curation Centre.’ [Online]. Available: http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies. [Accessed: 30-Jan-2016].

[25] ‘Research Data Oxford website.’ [Online]. Available: http://researchdata.ox.ac.uk/. [Accessed: 30-Jan-2016].

[26] ‘Bodleian Digital Library Systems and Services | Data Archiving (ORA-Data).’ [Online]. Available: http://www.bodleian.ox.ac.uk/bdlss/digital-services/data-archiving. [Accessed: 30-Jan-2016].

[27] J. Starr, J. Ashton, J. Brase, P. Bracke, A. Gastl, and F. Ziedorn, ‘DataCite Metadata Schema for the Publication and Citation of Research Data (Version 3.1)’, 2015. [Online]. Available: https://doi.org/10.5438/0010. [Accessed: 26-Feb-2016].

[28] ‘ORA Data deposit agreement’, 2014. [Online]. Available: http://www.bodleian.ox.ac.uk/__data/assets/pdf_file/0019/190009/ORA-Data-Deposit-Agreement.pdf. [Accessed: 30-Jan-2016].

[29] ‘ORA-Data policy on submission, preservation and withdrawal’, 2014. [Online]. Available: http://www.bodleian.ox.ac.uk/__data/assets/pdf_file/0012/190011/ORA-Data-Submission-Preservation-and-Withdrawal-Policies.pdf. [Accessed: 31-Jan-2016].

[30] ‘Bodleian Libraries digital policies: Preservation.’ [Online]. Available: http://www.bodleian.ox.ac.uk/bodley/about-us/policies/preservation. [Accessed: 02-Mar-2016].

[31] J. McKnight, C. Madsen, and J. Prag, ‘DHARMa Project Final Report.’ [Online]. Available: http://blogs.bodleian.ox.ac.uk/wp-content/uploads/sites/114/2015/04/DHARMa_Final.pdf. [Accessed: 02-Mar-2016].

[32] ‘UK research data discovery - Jisc.’ [Online]. Available: https://www.jisc.ac.uk/rd/projects/uk-research-data-discovery. [Accessed: 02-Mar-2016].

[33] ‘University of Oxford policy for assigning DOIs to datasets’, 2014. [Online]. Available: http://www.bodleian.ox.ac.uk/__data/assets/pdf_file/0014/190013/Oxford-University-DOI-Policy-ORA-Data.pdf. [Accessed: 30-Jan-2016].

[34] E. T. Meyer and R. Schroeder, Knowledge Machines: Digital Transformations of the Sciences and Humanities. MIT Press, 2015.