DESY Data Preservation Project - CERN Indico

K. Wichmann, 24.05.12, CHEP 2012, DESY Data Preservation Project
DESY Data Preservation Project
Katarzyna Wichmann on behalf of DESY Data Preservation Group:
DESY-IT: Y. Kemp, D. Ozerov
DESY-Library / INSPIRE: Z. Akopov
H1: S. Baghdasaryan, V. Dodonov, S. Levonian, B. Lobodzinski, J. Olsson, D. South (coordinator), Michael Steder*
HERMES: E. Avetisyan, G. Schnell
ZEUS: A. Ausheva, V. Bokhonov, A. Geiser, J. Malka, KW
* big thanks for help with preparing this talk

Transcript of DESY Data Preservation Project - CERN Indico

K. Wichmann, 24.05.12, CHEP 2012, DESY Data Preservation Project

DESY Data Preservation Project

Katarzyna Wichmann

on behalf of DESY Data Preservation Group:

DESY-IT: Y. Kemp, D. Ozerov
DESY-Library / INSPIRE: Z. Akopov

H1: S. Baghdasaryan, V. Dodonov, S. Levonian, B. Lobodzinski, J. Olsson, D. South (coordinator), Michael Steder*

HERMES: E. Avetisyan, G. Schnell
ZEUS: A. Ausheva, V. Bokhonov, A. Geiser, J. Malka, KW

* big thanks for help with preparing this talk


HEP Data Preservation Project

● Global DPHEP initiative was launched in 2009 at DESY
● DESY Data Preservation Group established soon after
  – 17 people involved
● Many HEP data sets are unique and retain their scientific potential
● No clear model of long term preservation before DPHEP
● Physics cases for data preservation
  – Long-term data analysis
  – Re-using and re-analyzing data
  – Combining results between experiments
  – Education, training and outreach

● Four models of preservation are defined:


DESY Data Preservation Project

● Big effort directed into DESY Data Preservation Project
  – Digital and non-digital documentation
  – Software preservation
  – Data archiving
● Lots of progress in all fields since the beginning of the project
● Very fruitful collaboration between H1, HERMES, ZEUS, DESY Library & Inspire and DESY IT
● Regular meetings


Non-Digital Documentation

Great care taken to preserve as much as possible of the various documentation collected over years of running of the experiment
● Much material exists from pre-web days
● All kinds of web applications were used
● Requires quite some management, cataloguing and new archives
● Non-digital documentation sorted out and safely stored in a dedicated library archive
● Some part of non-digital documentation digitized

● theses
● talks
● minutes
● log book
● internal notes


Digital Documentation

Digital documentation:
● Former online monitoring and shift tools
● Web-based documentation, electronic logbooks, presentations in meetings, minutes...
● Digital documentation needs to be improved and modernized
  ➔ missing or unavailable documentation restored
  ➔ providing new condensed “tutorials” on topics most important for future analysis

ZEUS Primer: http://www-zeus.desy.de/ZEUS_ONLY/analysis/primer/


INSPIRE

● Inspire: new dedicated effort to create an improved info storage like Spires
● Inspire offers many convenient options for archiving digitized documents
  – collaborations' internal notes submitted to INSPIRE (password protected)


● Other documents from the HERA collaborations are under discussion: preliminary results, theses, conference talks, proceedings, paper history...

Inspire gives a unique opportunity to conserve documentation, wikis, news forums and even data outside collaboration resources, and to keep it available and undisturbed “forever”


Long-Term Archiving of Web Pages

● Most of the essential digital information for analysis is stored on web servers: a huge amount of various types of data to be safely stored
● Investigating and refining collaboration web pages
● All webservers are now running in DESY-IT central environment
● Hardware renewal and failures handled by IT
● Consolidation of web pages still requires quite some work


Analysis Software & Data Preservation

The interface to access and handle the data has to be fully functional

The integrity of data has to be guaranteed (without frequent user access)


Data Analysis Models @ DESY

● H1
  – preservation level 4
  – full chain from compilation of simulation, reconstruction and analysis code
  – full flexibility in the future for data and MC
● HERMES
  – preservation level 4
  – ADAMO-based micro-DST files for data and MC
● ZEUS
  – preservation level between 3 & 4
  – data and MC preserved in form of ROOT-based Common Ntuples
  – in addition maintain the ability of simulation of small samples of new MC in the future using existing executables


sp-system - Software Preservation System

● Automated validation system to facilitate future software and OS transitions
● Clear separation of tasks between IT and experiments
● Utilisation of virtual machines offers great flexibility
  – OS and configuration is chosen by parameter
● Development system allows interactive coding and debugging
● Computing resources of all experiments end in the next few years
● Access to DESY-IT infrastructure given for future analysis & file production
● Requires an installation similar to Grid/IT cluster nodes
  – No administrative actions, no pre-installed software, no afs access, ...

[Figure labels: Computing Centre, Experiment]
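
The idea that the OS and configuration of a virtual machine are "chosen by parameter" can be illustrated with a small sketch. The class `VMConfig`, the function `build_targets` and their field names are hypothetical and are not part of the sp-system; the OS and architecture labels are just the ones named in this talk.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class VMConfig:
    """One virtual-machine target (hypothetical structure, not the sp-system's)."""
    os_release: str  # e.g. "SL5" or "SL6"
    arch: str        # "32bit" or "64bit"

    @property
    def label(self) -> str:
        # Same notation as used on the slides, e.g. "SL5/32bit".
        return f"{self.os_release}/{self.arch}"


def build_targets(os_releases, archs):
    """Enumerate every OS/architecture combination to be validated."""
    return [VMConfig(o, a) for o, a in product(os_releases, archs)]


targets = build_targets(["SL5", "SL6"], ["32bit", "64bit"])
for target in targets:
    print(target.label)
```

Keeping the target description as plain parameters means a new OS release only adds an entry to the list and leaves the validation code itself unchanged.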


Example Structure of a Validation Project

[Figure: structure of an H1 validation project, with a boundary between "outside" and "within sp-system". Labels: Simulation / Reconstruction (Fortran); Analysis Software (h1oo), which requires ROOT, FastJet, Neurobayes, ...; data base snapshot; h1oo snapshot; create tar balls of H1 sw; Fortran Executables; h1oo Executables; DST production (h1simrec); HAT/μODS (dst2all); Event Display; Physics Analyses; periods 1996, 2007 and "all periods"; test counts ~5x, ~10x, ~40x, 10-20x; MC generators and compilation of MC generators (to come); common storage; software repository (currently cvs); needs access to running VM.]


Compilation of reconstruction and simulation software (+ external dependencies)

Sequential testing of
a) MC generators
b) DST production
c) Analysis Level Data Production

Parallel testing of
a) Physics Analyses
b) Event display
c) Tools and Binaries


Running jobs in the sp-system

[Figure labels: build step, running step, validation step; tgz, ${ID}, source code]

Initial step
● Compilation of analysis (level 3) and sim/rec (level 4) software
● OR: use tar-balls of pre-compiled software
● Provide access to software

Copy tar-balls to persistent storage
● All output kept in directory with unique name
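
A minimal sketch of the "directory with unique name" idea follows. The function `archive_build`, its parameters and the file names are assumptions made for illustration; the talk does not describe how the sp-system generates its `${ID}` or lays out its storage.

```python
import shutil
import tempfile
import uuid
from pathlib import Path


def archive_build(tarballs, persistent_root, run_id=None):
    """Copy tar-balls into a uniquely named directory under persistent_root.

    run_id plays the role of ${ID}; a random one is generated if none is given.
    """
    run_id = run_id or uuid.uuid4().hex
    out_dir = Path(persistent_root) / run_id
    # exist_ok=False: never overwrite the output of an earlier run.
    out_dir.mkdir(parents=True, exist_ok=False)
    for tarball in tarballs:
        shutil.copy2(tarball, out_dir / Path(tarball).name)
    return out_dir


# Demo in a scratch area; the .tgz here is a stand-in file, not a real archive.
with tempfile.TemporaryDirectory() as scratch:
    scratch = Path(scratch)
    tarball = scratch / "h1oo.tgz"
    tarball.write_bytes(b"placeholder")
    out = archive_build([tarball], scratch / "persistent", run_id="demo-0001")
    print(sorted(p.name for p in out.iterdir()))
```

Refusing to reuse an existing directory keeps every build's output separate, so a later validation run can always be traced back to the exact tar-balls it used.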


Run tests in parallel
● Set up software environment
● Validate binaries with persistent input, e.g. event display, DB access, ...

Run sequential tests
● Set up software environment
● Validate file production
  1. MC generation (-> generator files)
  2. Reconstruction (generator files -> DSTs)
  3. Analysis level (DSTs -> ROOT files)
● Each test uses the output of the previous test as input

-> Results remain accessible or can be reproduced with identical results
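
The two test modes above can be sketched as follows: independent tests run concurrently, while the file-production chain passes each stage's output to the next and stops at the first failure. All function names, stage implementations and file names here are hypothetical stand-ins; the real tests call experiment software that is not shown in the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel(tests):
    """Run independent tests concurrently; returns {name: result}."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(func) for name, func in tests.items()}
        return {name: future.result() for name, future in futures.items()}


def run_chain(stages, first_input=None):
    """Run stages in order, feeding each stage's output into the next one.

    Returns (per-stage status list, final output); the final output is None
    if a stage raised, in which case later stages are not run.
    """
    results = []
    data = first_input
    for name, stage in stages:
        try:
            data = stage(data)
        except Exception as exc:
            results.append((name, f"failed: {exc}"))
            return results, None
        results.append((name, "ok"))
    return results, data


# Stand-in stages that only rename files, to show the data flow.
def generate(_):
    return ["gen_001.dat"]


def reconstruct(generator_files):
    return [name.replace("gen", "dst") for name in generator_files]


def analysis_level(dst_files):
    return [name.replace("dst", "ana").replace(".dat", ".root") for name in dst_files]


CHAIN = [
    ("MC generation", generate),
    ("Reconstruction", reconstruct),
    ("Analysis level", analysis_level),
]

chain_status, final_files = run_chain(CHAIN)
print(chain_status, final_files)
print(run_parallel({"event display": lambda: True, "DB access": lambda: True}))
```

Stopping the chain at the first failing stage avoids reporting misleading errors from later stages that never received valid input.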


Bookkeeping of Validation Results

● Display validation results in a comprehensible way
● Provide links to additional information
  – plots, root files, …
● Similar task for all collaborations
  – Profit by synergies
● Allow different levels of complexity
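
One way to picture such bookkeeping is a common record type shared by all collaborations, rendered either as a short summary or with per-test detail. This is only a sketch: `ValidationRecord`, `summarise` and the demo entries are invented for illustration, and the status names are taken from the legend used on the status slides of this talk.

```python
from collections import Counter
from dataclasses import dataclass

STATUSES = ("ok", "ongoing", "to be done", "problem")


@dataclass(frozen=True)
class ValidationRecord:
    """One validation result (hypothetical structure)."""
    experiment: str
    test: str
    status: str
    link: str = ""  # optional pointer to plots or ROOT files

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def summarise(records, detailed=False):
    """Return text lines: one summary per experiment, plus per-test lines if detailed."""
    lines = []
    for experiment in sorted({r.experiment for r in records}):
        own = [r for r in records if r.experiment == experiment]
        counts = Counter(r.status for r in own)
        summary = ", ".join(f"{counts[s]} {s}" for s in STATUSES if counts[s])
        lines.append(f"{experiment}: {summary}")
        if detailed:
            for r in own:
                suffix = f" ({r.link})" if r.link else ""
                lines.append(f"  {r.test}: {r.status}{suffix}")
    return lines


# Made-up demo entries, not actual validation results.
records = [
    ValidationRecord("H1", "compilation", "ok"),
    ValidationRecord("H1", "DST production", "ongoing", "plots/dst.png"),
    ValidationRecord("ZEUS", "ntuple access", "ok"),
]
print("\n".join(summarise(records, detailed=True)))
```

The `detailed` switch is one simple reading of "different levels of complexity": the same records serve both a quick overview and a drill-down view.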


Example of Validation of Physics Analysis

● Test stability of analysis against changing software environment
● Validate results of ZEUS Z0 analysis
● Check access to data/MC in common ntuple format (very important)
● Check results of the Z0 analysis (event list, cross section, acceptance & invariant mass calculations) against various possible future changes
  – 32 – 64 bit machines
  – New ROOT versions
  – Speed
  – New data access schemes
  – New operating systems
  – ...
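
A check of this kind can be sketched as a comparison of a new run against stored reference values: the event list must match exactly, while floating-point quantities are compared within a tolerance, since small numerical differences can appear when the platform changes. The function `compare_results`, the dictionary keys and all numbers below are placeholders for illustration; they are not taken from the ZEUS Z0 analysis.

```python
import math


def compare_results(reference, candidate, float_keys=("cross_section", "acceptance"),
                    rel_tol=1e-9):
    """Return a list of human-readable differences; an empty list means agreement."""
    problems = []
    ref_events = set(reference["event_list"])
    new_events = set(candidate["event_list"])
    if ref_events != new_events:
        missing = sorted(ref_events - new_events)
        extra = sorted(new_events - ref_events)
        problems.append(f"event list differs: missing={missing}, extra={extra}")
    for key in float_keys:
        if not math.isclose(reference[key], candidate[key], rel_tol=rel_tol):
            problems.append(
                f"{key}: reference={reference[key]!r}, candidate={candidate[key]!r}"
            )
    return problems


# Placeholder values: events are (run, event) pairs, numbers are arbitrary.
reference = {"event_list": [(1, 10), (1, 11)], "cross_section": 0.25, "acceptance": 0.5}
new_run = {"event_list": [(1, 10), (1, 11)], "cross_section": 0.25000001, "acceptance": 0.5}

print(compare_results(reference, new_run))                 # strict tolerance
print(compare_results(reference, new_run, rel_tol=1e-6))   # looser tolerance
```

The tolerance is the main design choice: too strict and every compiler or library change raises an alarm, too loose and a real regression could pass unnoticed.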


Status of Experiments’ Software

[Figure: status table; legend: ok, ongoing, to be done, problem]


Status of H1 Software

- All software compiles in SL5/64bit without problems
- DST and analysis level production started
- Implementation of validation scripts for full chain ongoing
- First tests for other binaries implemented, e.g. db integrity


Status of HERMES Software

● HERMES:
  - All experiment software successfully compiled
  - Validation of results ongoing (small differences wrt SL3 spotted)
  - Production of ADAMO-based μDST requires cernlib2005@64bit
  - Archival mode has to be used already from 2013 on


Status of ZEUS Software

● ZEUS:
  - Pre-compiled SL5/32bit software runs without problems
  - Validation of stand-alone MC and common ntuple production started
  - First physics validation scripts implemented


Status of Experiments’ Software

● All experiments run software inside sp-system
  – Reference OS for experiments is SL5/32bit
● All experiments plan to finalise validation scheme on SL5 this year
● Very different requirements - flexibility and scalability of the system
● Validation of 64-bit systems: major step towards migrations to future OS
● Next step will be migration to SL6
● sp-system has already proven to be useful
  – Database snapshot file production was corrupted (fixed)
  – Bug in analysis level filling code has been identified (fixed)


Summary

● DESY Data Preservation Group very active
  – Very good collaboration between the experiments, DESY library, Inspire and DESY-IT
● (Non-)digital documentation
  – Huge effort has been made by DESY library and experiments to find, digitise and catalog all documents
  – Unique possibilities offered by Inspire are pursued by the DESY collaborations
● Software preservation system
  – All three experiments use the sp-system and its development system
  – Successful compilation and/or running of the experiments’ software
  – Necessary tests have been identified, implementation is in full swing