Guidelines for the Digitisation of Paper Records

Guideline for the Digitisation of Paper Records © Queensland State Archives 2006 Version 2 April 2006


Contents

1: Introduction
    1.1 Authority
    1.2 Scope
    1.3 Exclusions
    1.4 Acknowledgments

2: Management of Original Records and Scanned Images
    2.1 Responsibilities for recordkeeping
    2.2 Management of original paper records
    2.3 Management of imaged records

3: Issues to Consider Before Commencing Digitisation
    3.1 Why digitise?
    3.2 Which records will be digitised?
    3.3 How will digitisation integrate to the existing workflow?
    3.4 How will the digitised records be managed?
    3.5 Authorised disposal of paper originals
    3.6 How will the digitised records be used?
    3.7 Who is responsible for the digitisation project?
    3.8 Will digitisation be outsourced?

4: Components of a Digitisation Program
    4.1 Computer Hardware
    4.2 Computer Software
    4.3 Procedures and Standards
    4.4 Staff

5: Authorisation for Early Disposal
    5.1 Introduction
    5.2 Overview of process
    5.3 Determining whether records are eligible
    5.4 Ensuring appropriate systems and procedures
    5.5 Obtaining authorisation
    5.6 Monitoring and review

6: Technical considerations
    6.1 Resolution
        Recommended Resolutions
    6.2 Bit Depth
        Recommended Bit Depths
    6.3 Compression and File Size
        Recommended Compression
    6.4 File Formats
        Recommended File Formats
    6.5 Quality Control
        Recommended Quality Checks
    6.6 Master Files and Derivatives
        Recommended Derivatives

7: Metadata
    7.1 Metadata Types
    7.2 Capturing Metadata
    7.3 File Naming Conventions
        Recommended Metadata

8: Storage and Media Options
    8.1 On-line, Near-line, and Off-line Storage
    8.2 Media Types
    8.3 Media Lifecycle
        Recommended Storage Options

Appendix 1: Glossary of Terms and Acronyms
Appendix 2: Scanner Types
Appendix 3: Table of Technical Recommendations
Appendix 4: Related Standards
Appendix 5: Reference List


1: Introduction

Digitisation is the process of converting any physical or analogue item into an electronic representation [1]. In the context of this guideline, digitisation refers to the creation of digital images from paper documents by such means as scanning or digital photography.

Queensland State Archives has produced this guideline to provide information to public authorities about digitisation, to recommend suitable digitisation parameters and to raise awareness of the recordkeeping factors associated with digitisation.

Many Queensland public authorities have implemented or are considering implementing systems to digitise their paper records. In most cases, these projects are undertaken with the goals of:

• achieving faster retrieval of information;

• improving access to information;

• greater sharing of information; and

• the reduction of the storage space required for paper records.

There are many aspects of digitisation. While the acquisition of scanners and associated computer hardware may be the initial action that comes to mind when digitisation is discussed, successful digitisation requires several components, including procedures, standards, computer software, and appropriately skilled staff.

1.1 Authority

This guideline is issued under Section 25 of the Public Records Act 2002 (the Act). The guideline is a resource for Queensland public authorities to help them achieve best practice recordkeeping and information management.

This publication is intended to serve as a guide to public authorities undertaking or considering undertaking digitisation. Information and recommendations provided in this guideline are considered to be accurate at the time of publication. Queensland State Archives reserves the right to withdraw, amend or replace this guideline at any time as technology and the needs of public authorities change.

1.2 Scope

Paper records exist in a variety of formats including maps, plans, photographs and other documents of various colours, paper types and sizes. This guideline provides digitisation recommendations that are broad enough to apply to the majority of paper records applicable to most public authorities. In some cases, particular characteristics of different types of paper records may call for different technical parameters and approaches from those included here. Public authorities should combine the guidance provided in this document and advice from digitisation-related computer hardware and software vendors with their own testing to determine the optimum parameters for their organisation.

This guideline also provides information on how public authorities can apply for authorisation for the early destruction of certain temporary records that have been digitised, following the Digitisation Disposal Policy: Policy on the authorisation of the early disposal of original paper records after digitisation.

Key technical terms have been explained and illustrated with examples and a comprehensive glossary has been included in Appendix 1: Glossary of Terms and Acronyms. Definitions of records management terms can be found in Queensland State Archives’ Glossary of Archival and Recordkeeping Terms [2].

[1] Tanner, S. From Vision to Implementation – strategic and management issues for digital collections. 2000. The Electronic Library – strategic, policy and management issues seminar. Accessed March 2005 at http://heds.herts.ac.uk/resources/papers/Lboro2000.pdf

1.3 Exclusions

The conversion of other analogue records, such as video or audio recordings, into a digital form is outside the scope of this document. Likewise, the management of information that originates in a digital form, such as word processing documents, e-mails, and other born-digital items, is not included. This guideline does not provide advice on high-quality digitisation of historical documents for preservation purposes [3].

While advice will be provided on some generic features that should be possessed by computer hardware and software used in the digitisation process, this guideline will not provide recommendations for particular models of computer equipment or software titles. Queensland State Archives is unable to provide advice on systems and network architecture issues relating to digitisation. Public authorities should refer to their existing computer systems administration and implementation procedures for technical and systems issues.

1.4 Acknowledgments

Queensland State Archives would like to acknowledge the public authorities which participated in the development of the Guideline for the Digitisation of Paper Records.

[2] Glossary of Archival and Recordkeeping Terms. 2004. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/downloads/GlossaryOfArchivalRKTerms.pdf

[3] For information on preservation digitisation of archival or historical documents, please contact Queensland State Archives.

2: Management of Original Records and Scanned Images

While the digitisation of paper records can bring clear benefits, it is important that public authorities are aware of the related recordkeeping issues. There are a number of challenges in ensuring that digitised paper records remain accessible and useable. Meeting these challenges should be a key focus when considering digitisation technical requirements.

Digitised files can be considered accessible if they can be easily identified, retrieved, used and maintained. To be considered useable and accessible within an organisation, descriptions of the files and procedures on how they can be accessed must be established and published. Any digitisation program should be carefully planned to meet appropriate standards and avoid the need to repeat work.

Consideration must also be given to the categorisation and storage of the original paper documents that are digitised. Once digitised, the paper records still need to be kept for their respective retention periods unless disposal authorisation is given by the State Archivist. The conditions for early destruction of paper originals are outlined in the Digitisation Disposal Policy, and section 5 of this guideline provides information for public authorities on seeking authorisation in accordance with the policy.

This guideline examines important digitisation issues regarding accessibility and usability of digitised paper records, including file formats, image qualities, the way the image files are stored and the process that is adopted to accomplish the digitisation. Issues associated with usability, such as the information that is recorded both to describe the record and to allow it to be accessed, are also considered.

2.1 Responsibilities for recordkeeping

Under Section 7 of the Public Records Act 2002 (the Act), a public authority must make and keep full and accurate records of its activities, and have regard to any relevant policy, standards and guidelines made by the State Archivist about the making and keeping of public records. Queensland Government legislation and standards relevant to digitisation include:

• Public Records Act 2002

• Evidence Act 1977

• Information Standard 31: Retention and Disposal of Public Records

• Information Standard 40: Recordkeeping

• Information Standard 41: Managing technology dependent records

These standards and legislation should be considered in conjunction with the guidance provided in this document. Additional legislation, standards and policies which apply to individual public authorities or industries may also need to be consulted. Reference should also be made to the Australian Standard on Records Management, AS ISO 15489.

2.2 Management of original paper records

Original paper records must be kept for the retention periods set out in an approved retention and disposal schedule unless authorisation has been obtained from the State Archivist for their earlier disposal. Under Section 13 of the Act, no public record can be disposed of without the permission of the State Archivist.

In considering whether to seek authorisation for early disposal, public authorities should be aware of any legislative or regulatory requirements to maintain records in a particular format. Public authorities should also assess the need to maintain records in their original form for legal purposes and should seek legal advice if unsure of any requirement. By scanning original records to a digital format, and retaining only the digital version, public authorities may be disadvantaged if called upon to authenticate certain records.

The Digitisation Disposal Policy sets out what records are eligible for authorisation for early disposal and under what conditions. For more information on the authorisation process, see section 5 of these guidelines.

Day batching

Some public authorities have adopted the practice of day batching, which involves filing the paper originals of imaged records in batches based on date received or scanned. Batching places a heavy reliance on the system used to manage the digitised records and introduces a number of issues including:

• the risk of losing vital contextual information about the business the records document and their relationship with other records,

• the inability to effectively implement a disposal program, since records batched together may have different retention periods, and

• the refusal of Queensland State Archives to accept for transfer into its custody temporary records contained in a batch also holding permanent public records.

Batching is usually associated with the digitisation of new records. It should be noted that any records which are removed from structured files for any purpose, including digitisation, must be returned to the file from which they were removed. Further information can be found in the Public Records Alert, Day batching of records [4].
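The disposal and transfer issues listed above can be illustrated with a short sketch. The Python example below is illustrative only and does not form part of the guideline: it assumes a hypothetical list of (batch identifier, retention period) pairs, and the function name and data layout are assumptions made for the illustration. It reports those day batches that hold records with differing retention periods, which are the batches that cannot be disposed of or transferred as a unit.

```python
from collections import defaultdict


def batches_with_mixed_retention(records):
    """Return the identifiers of batches holding more than one retention period.

    records: iterable of (batch_id, retention) pairs, where retention is a
    number of years or the string "permanent" (hypothetical layout).
    """
    retention_by_batch = defaultdict(set)
    for batch_id, retention in records:
        retention_by_batch[batch_id].add(retention)
    # A batch is problematic when its records do not share one retention period.
    return sorted(b for b, periods in retention_by_batch.items() if len(periods) > 1)


# Hypothetical day batches, keyed by the date the records were scanned.
scanned = [
    ("2005-03-01", 7),
    ("2005-03-01", "permanent"),
    ("2005-03-02", 7),
    ("2005-03-02", 7),
]
mixed = batches_with_mixed_retention(scanned)
```

In this hypothetical data only the first batch mixes a temporary and a permanent record, so it alone would be flagged for review.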

2.3 Management of imaged records

Imaged records require careful management. There is a high risk of technical obsolescence of hardware and software needed to retrieve information from electronic storage media. A public authority needs to ensure that its recordkeeping system can maintain authentic, accurate, complete and accessible imaged records for as long as they need to be retained. A management plan dealing with the procedures for migration of data is required to cater for systems being replaced and equipment becoming obsolete.

Some general principles apply to the retention of original paper records and their digital copies:

• The paper original should be kept for the full period specified in an approved retention and disposal schedule, unless an early disposal authorisation is granted (see section 5: Authorisation for early disposal).

• If the image becomes part of a file with other records, for example, in an eDRMS environment, it should be kept in accordance with the retention and disposal period given to the parent file. Records should never be removed or ‘culled’ from files.

• An image made purely for access or reference purposes can be destroyed when reference ceases in accordance with the General Retention and Disposal Schedule (GRDS) for Administrative Records, class 6.1.2 for duplicate copies of records.

There are some exceptions to these general principles. For example, if key business decisions, approvals or comments are closely associated with the image copy of a record, such as in a workflow system, the image and the associated information should be kept for the full retention period.
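The general principles and the exception described above can be read as a small set of ordered rules. The following Python sketch is one possible reading, offered for illustration only: the class, field and function names are hypothetical, and treating the full retention period as the default for any other image is an assumption rather than a statement of policy. It does not replace an approved retention and disposal schedule.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ImageCopy:
    """Hypothetical description of one digitised image."""
    full_retention_until: date                          # period set by the approved schedule
    parent_file_retention_until: Optional[date] = None  # set when the image sits on a file
    reference_only: bool = False                        # made purely for access or reference
    linked_to_business_decisions: bool = False          # e.g. approvals in a workflow system


def image_keep_until(image: ImageCopy) -> Optional[date]:
    """Return the date until which the image is kept, or None when it may be
    destroyed once reference ceases."""
    # Exception: decisions, approvals or comments are tied to the image copy.
    if image.linked_to_business_decisions:
        return image.full_retention_until
    # The image forms part of a file: follow the parent file's retention period.
    if image.parent_file_retention_until is not None:
        return image.parent_file_retention_until
    # A pure access or reference copy carries no fixed retention date.
    if image.reference_only:
        return None
    # Assumed default: keep the image for the full retention period.
    return image.full_retention_until
```

The order of the checks matters: the exception is tested first so that a reference copy that has acquired associated business decisions is not treated as a disposable duplicate.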

[4] Public Records Alert No 1/05: Day batching of records. 2005. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/publications/PublicRecordsAlert/PRA105.pdf

The paper original may only be disposed of before the digitised image with explicit permission from the State Archivist through an approved retention and disposal schedule. It should be noted that retention and disposal schedules set minimum periods for retention. Public authorities occasionally have a need to keep records for longer than the approved retention period. In this situation, if authorisation has not been given for the early destruction of the original records, the principle of not disposing of the paper original before the digitised image still applies.

Information Standard 40: Recordkeeping and the Australian Standard AS ISO 15489: Information and Documentation: Records Management should be consulted for general advice on the principles and practices for the management of scanned images as a public record.

Authenticity

Authentic records are those that can be proven and trusted to be what they purport to be and to have been created, used, transmitted or held by an agency or person to whom these actions have been attributed [5]. Public authorities will need to be able to verify the authenticity and accuracy of the images of business transactions captured by scanning. The original records must remain readily accessible long enough to allow for verification that procedures related to the capture of records have been followed.

Measures should be in place to protect the authenticity of the scanned records throughout their lifecycle. Information about the scanning processes should be maintained, including documentation about the business processes and the maintenance of systems, to demonstrate that public records were created and captured in the normal course of business with reliable systems and equipment. Documentation about the records that are scanned should be maintained to describe the structure and content of the records and the business context in which they are created and captured.

Copyright, Intellectual Property and Privacy

Most public authorities will digitise records for ease of information sharing within the organisation. However, once digitised, information is in a form that makes it easier to distribute to a wider audience. Any public authority intending to make information available to a broader audience, for example, publishing images to a website, should be aware of any copyright, intellectual property and privacy implications.

[5] Glossary of Archival and Recordkeeping Terms. 2004. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/downloads/GlossaryOfArchivalRKTerms.pdf

3: Issues to Consider Before Commencing Digitisation

Careful consideration should be given to all aspects of digitisation before committing to the project through actions such as the acquisition of equipment or the recruitment of staff. There are several questions that should be addressed to assess the value and effectiveness of digitisation. These are raised and discussed in the following sections.

3.1 Why digitise?

Digitisation of records is typically undertaken with the aim of achieving faster retrieval of information, easier transmission of information, greater sharing of information and the reduction of the storage space required for paper records. Driving factors behind digitisation projects may include:

• to provide easier and improved access to the records;

• to improve the internal transfer or dissemination of the records;

• to integrate digitised records with an electronic documents and records management system (eDRMS), other systems, applications or websites; and

• to reduce management and access costs.

Benefits from the process, such as improvements to productivity resulting from the better access to records and the information they contain, should be clearly defined and will need to be quantified and communicated to management and relevant staff.

Costs of digitisation should be compared with the cost of “doing nothing”, and the issues of continuing to use only paper records, including lack of accessibility, poor integration into modern business systems and inconvenience, should be clearly understood.

Available resources, including personnel, technology and money, should be assessed. The organisation’s technical infrastructure requires the ability to cope with the ongoing operation of the system, and funds need to be allocated to the regular maintenance and update of the systems implemented.

3.2 Which records will be digitised?

The type of equipment required, the number of staff, and the financial resources are dependent upon the volume of records that will be digitised. By monitoring the amount of paper records that are generated by and sent to the public authority, and investigating the need to digitise existing paper records, an estimate can be made as to the number of pages per day that the organisation can expect to be digitised.

The physical characteristics of the typical paper records should also be assessed to assist in determining the required specifications for digitisation equipment. For example, large format maps and plans, double sided correspondence, and colour documents may not be able to be fully and accurately captured using equipment designed for scanning black and white documents. The proportion of different physical record types that could be digitised should be analysed and consideration given to the acquisition of specialist equipment.

For a public authority to make the investment in digitisation, there must be a drive or business need to access paper records in a digital format. Unless the size of the paper records collection is very small and static, it will not usually be feasible or warranted to scan all existing records. Consideration should be given to the purpose of the digitisation program. Generally, digitisation is undertaken to improve access to records and to integrate paper records into an otherwise electronic business process.
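The volume estimate described in this section can be sketched as simple arithmetic. The Python example below is illustrative only: the function name, the choice to spread existing records evenly over a fixed number of working days, and all figures are assumptions made for the example and are not recommendations of this guideline.

```python
from statistics import mean


def estimate_pages_per_day(daily_incoming_pages, backlog_pages=0, backlog_working_days=1):
    """Estimate the number of pages per day to be digitised.

    daily_incoming_pages: observed page counts of new paper records per working day.
    backlog_pages: pages of existing records selected for digitisation (assumed input).
    backlog_working_days: working days over which the existing records are spread.
    """
    if not daily_incoming_pages:
        raise ValueError("at least one day of observations is required")
    if backlog_working_days <= 0:
        raise ValueError("backlog_working_days must be positive")
    new_records = mean(daily_incoming_pages)         # average daily intake
    existing = backlog_pages / backlog_working_days  # daily share of existing records
    return new_records + existing


# Hypothetical monitoring figures for one working week.
observed = [400, 520, 480, 610, 490]
estimate = estimate_pages_per_day(observed, backlog_pages=50_000, backlog_working_days=250)
print(round(estimate))  # prints 700
```

An estimate of this kind is only a starting point for sizing equipment and staffing; the physical characteristics of the records, discussed above, may reduce the throughput actually achieved.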


Decisions on whether or not to digitise a particular record will need to be made on a day to day basis once digitisation is available to the organisation. A number of questions should be asked when selecting records for possible digitisation:

• Is there a benefit in digitising this record?

• Is the original suitable for digitising?

• Is the equipment able to fully capture the content of the record?

• Is the original part of a series that also needs to be digitised?

• Are there any special characteristics of the record, e.g. colour, double sided, faded?

Some records are physically less suitable for scanning than others. For example, large format records, bound volumes, photographs, plans and maps, records with reflective surfaces or fragile material require specialised scanning equipment and techniques. Other records, such as those handwritten in coloured ink, on coloured paper, or double sided paper may be accommodated by the available scanning equipment, but need to be separated and scanned in a different batch with modified settings. There may be a business decision made not to scan some records.

Deciding to digitise existing paper records in addition to new records can be a large undertaking, and careful analysis should be conducted to gauge the benefit of doing this. Ideally, existing paper records that are frequently accessed should be digitised, maximising the benefits of digitisation. Some records may have such short retention periods that the expense of digitising them is not warranted.

It is important that staff are made aware of what has been digitised and what has not, so that they will benefit from the convenience of faster access to digitised records without searching in vain for digitised copies of records that have not been scanned.

3.3 How will digitisation integrate to the existing workflow?

It is essential to the success of a digitisation program that the existing records management procedures are investigated prior to the introduction of new techniques. Decisions need to be made on a number of aspects including when the records will be scanned, how end users will be presented with the records, and how the original paper records will be managed after scanning.

A good understanding of existing practices will not only present the opportunity to integrate digitisation at the most appropriate stage, but will also provide a point of reference to measure the performance of digitisation. The introduction of digitisation may also provide the impetus to streamline business processes around digitised records.

3.4 How will the digitised records be managed?

As described later in section 4: Components of a Digitisation Program, a system to manage the digitised records is arguably the most important component of the digitisation system. It is essential that a system is in place that enables access by appropriate authorised personnel, allows digitised records to be easily found, includes measures to preserve the authenticity of the records, and provides information about the record and its context. This descriptive information, known as metadata, is discussed further in section 7: Metadata.

Time related factors encountered in records management, such as record retention periods and destruction dates, also apply to digitised records. Additionally, the obsolescence of technology and the deterioration of storage media are time related factors that are introduced by digitisation and should be addressed in management plans.


3.5 Authorised disposal of paper originals

Agencies should also consider whether they have authorisation, or intend seeking authorisation, for the early disposal of original paper records after digitisation. If so, digitisation procedures and processes should be developed in accordance with the Digitisation Disposal Policy. For more information, see section 5: Authorisation for early disposal.

In addition, as authorisation is only possible for certain classes of records, workflows will need to be reviewed to ensure that the appropriate disposal classes for records are allocated at the point of digitisation, and that strategies are in place to identify those records eligible for early disposal.

3.6 How will the digitised records be used?

Digital objects, such as scanned records, are easier to distribute than their analogue equivalents. This has the potential to provide larger audiences with increased access to digitised records than by physical access to paper records. If digitised records are to be made publicly available, there may be a need to investigate any security, copyright, or intellectual property issues that this raises.

Consideration should also be given to the mode of use of the digitised records. For example, there are different image quality requirements for on screen viewing of images than for printed images. Images which are to be accessed on a low bandwidth connection may include less information and be created at a lower quality than those viewed directly from a CD-ROM or local area network.

3.7 Who is responsible for the digitisation project?

Like any large project, there should be clear ownership of the digitisation program by an individual or work unit within an organisation. If considerable resources are allocated to establishing an ongoing digitisation program, it makes sense for the program to run for a considerable period. For this to occur, sufficient funding should be secured not only for the implementation of digitisation, but also for the ongoing maintenance, routine costs, and system upgrades and replacement.

3.8 Will digitisation be outsourced?

Public authorities should carefully consider the pros and cons of either outsourcing digitisation projects or conducting them in-house. The types of records being digitised, the digitisation requirements of the public authority and the geographic location may all have an effect on the suitability of outsourcing and the availability of external parties to carry out the work. There are several benefits of carrying out digitisation within the organisation including:

• Control over the entire imaging process, how the digital copies of the records are arranged and stored, and the handling and storage of the original paper records,

• Security of the records is controlled by restricting the access to known staff,

• Experience in project management and digital imaging and exposure to technology and techniques gained which may be transferable to other projects, and

• Flexibility to alter project requirements and digitisation parameters as the project develops, rather than having them locked in a contract.

However, there are drawbacks of this approach that need to be considered including:

• Large investment of financial, IT and human resources, both initially and throughout the project lifecycle,

• Time needed to implement a digitisation process and associated technical infrastructure, with initial production levels and efficiency typically limited,

• Staffing expertise not always available, and any training investment will be wasted if staff leave, and

• Responsibility for network downtime, equipment failure and obsolescence, training of staff, and adherence to standards and best practices.

In cases where digitisation processes are outsourced, external parties should be made aware of relevant standards. This should include this guideline, IS40 and IS41, and any standards or practices that may apply to individual public authorities. Outsourcing arrangements made as part of a digitisation project should also comply with any existing contracting or purchasing policies that may be in place. The benefits of engaging an external party for digitisation include:

• Costs are more predictable as the digitisation of the paper record is what is paid for, usually as a cost per page, not equipment or staffing,

• High production levels and fast completion as equipment and staff are tested and already in place,

• Expertise and experience of the specialist can be drawn upon,

• Risk is lower, as the vendor accepts the costs of technology obsolescence, failure, downtime, staff changes, etc, and

• Economies of scale can be realised as specialist scanning bureaux will usually be carrying out digitisation on behalf of a number of clients. Unless a public authority has a huge volume of records to digitise over a long period, outsourcing will usually be more cost effective.
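The point about volume can be made concrete with a simple break-even calculation. The following is a minimal sketch only. The cost model (a fixed in-house setup cost plus a running cost per page, compared with a flat vendor price per page) and every figure in it are hypothetical and are not drawn from this guideline.

```python
def in_house_cost(pages: int, setup_cost: float, cost_per_page: float) -> float:
    """Total in-house cost: a fixed setup cost plus a running cost per page."""
    return setup_cost + cost_per_page * pages


def outsourced_cost(pages: int, price_per_page: float) -> float:
    """Total outsourced cost when the vendor charges a flat price per page."""
    return price_per_page * pages


def break_even_pages(setup_cost: float, in_house_per_page: float,
                     vendor_per_page: float) -> float:
    """Page volume above which in-house digitisation becomes cheaper.

    Returns infinity when the in-house running cost is not lower than the
    vendor price, because the setup cost can then never be recovered.
    """
    saving_per_page = vendor_per_page - in_house_per_page
    if saving_per_page <= 0:
        return float("inf")
    return setup_cost / saving_per_page


# Hypothetical figures, for illustration only.
SETUP_COST = 50_000.0
IN_HOUSE_PER_PAGE = 0.05
VENDOR_PER_PAGE = 0.15

threshold = break_even_pages(SETUP_COST, IN_HOUSE_PER_PAGE, VENDOR_PER_PAGE)
print(f"In-house becomes cheaper above about {threshold:,.0f} pages")
```

A real comparison would also need to allow for the ongoing costs listed among the drawbacks of in-house digitisation, such as staff training and equipment replacement, which this sketch folds into a single per-page figure.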

These benefits are offset by some negative aspects of outsourcing including:

• Limited control over how the digitisation is carried out,

• Complex contractual process of determining the specifications for the digitised records, research, negotiation, and communication with the vendor will take some resources, so outsourcing cannot be seen as a totally “hands off” approach,

• Knowledge gap between the vendor and the client may cause delays and confusion. The vendor will have experience and skills in scanning technology and related practices, but will not know the business of the client,

• Risk of the vendor going out of business or altering their practices,

• Quality control duplication as it is required by both parties,

• Transportation and handling of the original records introduces avenues of possible loss or damage to the records, and

• Security and privacy issues of the vendor’s staff having access to records which may be private or confidential.[6]

The outsourcing of digitisation should not be thought of as an easy solution that simply requires financial resources. Establishing the requirements and specifications for digitisation, progressing through to a tender program or purchase decision, and then liaising with the vendor and monitoring their work will require time and staff allocation.

[6] Adapted from Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf

4: Components of a Digitisation Program

Given the wide variation in size of public authorities, there is not a universal specification for a paper records digitisation solution. However, the components described below, implemented at various scales, should form part of any digitisation program.

4.1 Computer Hardware

Scanners or other digital imaging devices

The physical process of converting a paper document into a digital image requires a scanner or other digital imaging device (collectively referred to as “scanners” for the remainder of this document) to capture the image of the document, a computer to control the scanner and to provide the processing required for the conversion, and computer file storage of some form to keep the resulting image. Typically, a single computer is required for each scanner operator. Some common scanner types are briefly described in appendix 2: Scanner types.

Computers

Computers are an integral part of the digitisation process, required for input, management, storage, and distribution. The scanner will usually need to be connected to a computer, and often this computer will be used for setting the parameters of the scanning and carrying out initial quality control. Computers will also be required for the entry of metadata for scanned documents, often referred to as profiling. Computers used for these interactive tasks, which require the operator to continuously visually check the screen, should be equipped with a quality graphics adapter and a large monitor that allows the full page to be viewed.

Computers will also be required to carry out less interactive tasks including file storage, indexing, copying and backup. These roles may best be carried out using file servers or other computers that do not necessarily require the intervention of an operator.

When choosing computers to use for digitisation, emphasis should be placed on selecting equipment that is capable of efficiently handling the demands of digitisation, while still being compatible with the public authority’s existing computer equipment, support arrangements and standards. An organisation’s ICT specialists should be consulted to check this prior to the purchase of computer equipment.
In a small digitisation implementation, all of the tasks described here may be performed by a single computer, but as the implementation grows, or for large scale digitisation, the tasks can be separated out to various computers on the network. Scanning and other image related tasks are typically demanding of computer memory and processing, but most current model PCs would be suitable for digitising. To avoid interruptions to the workflow of the digitisation process, it is recommended that even in small scale implementations, a computer be dedicated solely to the digitisation process.

Computer storage, backup devices, and storage media

Scanned records, their descriptions and other associated information need to be stored in a manner which promotes access, security, and longevity. For small digitisation programs, storage on the scanning computer’s local hard drive and back-up to a writeable CD or DVD drive may fulfil this purpose. For larger implementations, file servers may be used for the storage of digitised records, with tape drives used for back-up.

So that the integrity of the digitised records can be guaranteed, a mechanism should be in place to prevent the unauthorised deletion or modification of stored information. The write-once nature of some optical media is an inexpensive means of doing this; however, as the amount of digitised material grows, the slow access times and handling required to use removable media may not be suitable. In these situations, software security solutions, such as those included in many eDRMS, should be implemented to provide a similar assurance of integrity for files stored online using hard drives and file servers. Section 8: Storage and media options covers these and other related issues in detail.
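Where image files are held on rewritable storage, one common way to support the integrity assurance discussed above is to record a checksum for each file at capture time and recheck it later. The following is a minimal sketch under stated assumptions: the function names and the idea of a per-folder manifest are illustrative and not prescribed by this guideline. A checksum detects changes and deletions but does not prevent them, so it would complement access controls rather than replace them.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Record a digest for every file under the folder (hypothetical helper)."""
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def find_changes(folder: Path, manifest: dict[str, str]) -> list[str]:
    """List manifest entries that are now missing or no longer match."""
    changed = []
    for name, expected in manifest.items():
        target = folder / name
        if not target.is_file() or sha256_of(target) != expected:
            changed.append(name)
    return changed
```

In use, `build_manifest` would be run once the images have passed quality control, with the resulting manifest stored separately from the images, and `find_changes` run periodically against it.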

4.2 Computer Software

A system for describing and managing the digitised records

A system to manage the digitised records is arguably the most crucial component of a digitisation program. The successful implementation of this software should ideally be completed prior to the commencement of scanning of paper records and the acquisition of such a system should be a high priority task. There is little use in commencing the scanning of paper records if there is no established method of recording what has been scanned or storing descriptions of those scanned records.

For the management of digital images as records, systems that are designed to fulfil records management needs should ideally be used. The effort and resources required to adapt a system not designed for the management of digital records should be analysed and a business decision made as to whether the introduction of a new system with the required capabilities would be a more effective use of resources.

If new systems, such as eDRMS, are to be introduced into an organisation to manage digitised paper records, the system should comply with the guidance provided in this document in such areas as metadata and file formats, and integrate into the organisation’s existing ICT infrastructure. Resources will need to be allocated to training of staff and technical support of any systems that are introduced.

Existing systems employed by public authorities to manage paper records may also have the capability to link to digital images of the records they describe. If this is not a feature of the available records management software, provided that the software is able to provide descriptions of individual items within a file, a file path to the digital images and other information could be recorded as comments.
As an entry level solution, or as a temporary measure pending the introduction of a specialised system, a record of what has been scanned, where the digital copies can be found, and relevant details of the scanning process can be stored in a spreadsheet or small database. There are also many free or low cost image cataloguing applications designed principally for management of digital photographs that could be used for managing digital copies of paper records. The features of such systems should be closely examined prior to implementation, even if only as a temporary or interim measure. A large effort may be required to adapt non-recordkeeping systems to document scanned records, and much of this will need to be repeated when a more appropriate longer term solution is employed.

Imaging software for capture and manipulation

Scanners are bundled with software (known as a driver) that is required for the controlling computer and the scanner to communicate. Additional software is typically included with the scanner which allows scanning, calibration and some post scanning image processing operations to be performed. This software will usually be thoroughly tested by the scanner manufacturer to work optimally with their hardware, with features appropriate to the type of scanner purchased. For example, software bundled with high speed sheet fed scanners would likely include features which would allow a choice to be made between single sided and duplex scanning, and recognition of barcodes, while the software that comes with slide scanners would typically include features for magnifying the originals and reversing the colours of negatives.

If the bundled software does not fulfil a public authority’s scanning needs it may need to be supplemented or replaced by software which is purchased separately. Additional image processing software may also be used for such tasks as the conversion of file formats, deriving of related files, and modification or enhancement of images. Some degree of interoperability usually exists between most image processing and scanning software so that images can be acquired directly into the image processing software. To enable full text searching of scanned paper documents, optical character recognition (OCR) software would also be required. This is described and discussed further in section 6.6: Master files and derivatives.

Security and access control

Just as the access to many paper records is monitored and typically restricted to authorised staff, access to digitised copies should also be controlled in a secure digital environment. As it would be a relatively straightforward exercise for a skilled operator using free or inexpensive image processing software to change the appearance of digitised records, security measures should be in place to prevent this type of unauthorised tampering. As described above and detailed in section 8: Storage and media options, some computer storage scenarios, such as the use of write-once optical media, inherently prevent modification of the image files once they have been stored. However, if other types of computer storage are used, additional security provided by software with such features as encryption, access control, and auditing should be employed.
For large digitisation programs that may be widely distributed throughout an organisation, where several staff need to add to and modify the collection of digitised records, a system such as an eDRMS may be used to manage access to the information and to provide an audit of system access and modification. In small scale digitising implementations, security and access control may be provided through the use of a password protected system by a single operator, with other authorised staff given read-only access. This could be accomplished using the built-in security features of most current computer operating systems.
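Earlier in this section, a spreadsheet or small database is suggested as an entry level way to record what has been scanned, where the digital copies can be found, and relevant details of the scanning process. The following is a minimal sketch of such a register kept as a CSV file, which can be opened in any spreadsheet application. The field names, identifier scheme and file layout are hypothetical and are not the metadata set this guideline recommends elsewhere.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class ScanEntry:
    """One row of the interim register (all field names are illustrative)."""
    record_id: str       # identifier of the paper record that was scanned
    image_path: str      # where the digital copy can be found
    scanned_on: str      # date of scanning, e.g. in ISO form YYYY-MM-DD
    operator: str        # who carried out the scan
    resolution_dpi: int  # one example of a scanning process detail
    notes: str = ""


def append_entry(register: Path, entry: ScanEntry) -> None:
    """Append one entry, writing the header row if the file is new."""
    names = [f.name for f in fields(ScanEntry)]
    is_new = not register.exists()
    with register.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=names)
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(entry))


def find_by_record(register: Path, record_id: str) -> list[dict]:
    """Return every register row for the given paper record."""
    with register.open(newline="", encoding="utf-8") as handle:
        return [row for row in csv.DictReader(handle)
                if row["record_id"] == record_id]
```

As the text above notes, effort spent on an interim register of this kind may need to be repeated when a recordkeeping system is introduced, so the columns are best chosen with later import into that system in mind.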

4.3 Procedures and Standards

It is important to fully document decisions made about the digitisation process, including technical, procedural and quality considerations. This is particularly important when seeking authorisation for early disposal. The Digitisation Disposal Policy provides information on the particular procedures required in this situation.

A method to identify which records are to be digitised

It is unlikely that all paper records within a public authority will be digitised. Identifying those that are to be digitised prior to the implementation of digitisation may assist in setting the requirements for a digitisation program and also determine the parameters for subsequent digitisation. Internal policies and procedures should be developed and relevant staff made aware of the criteria for deciding what paper records will be digitised.

A workflow for the digitisation process

To help ensure that all of the relevant steps in the process of creating an accessible digital copy of a paper record are carried out, a workflow should be developed and relevant staff provided with instruction on the process. The workflow may be adapted from the organisation’s existing records management practices or implemented as a new process. The development of a workflow prior to implementation may assist in the estimation of resources required for the digitisation process.

Standards

There are many international (ISO), Australian and industry standards that may be applied to the digitisation process which are listed at appendix 4: Related standards. When a public authority is considering implementing a digitisation project, existing standards should be adopted and adhered to where possible, rather than establishing in-house standards. This will promote consistency, increase the relevancy of research and advice from other organisations which have implemented digitisation to the same standards, and prepare the organisation for any potential collaboration. The adoption of published standards will also provide vendors and potential outsourcing partners with an unambiguous specification of what is required and will also provide a means of measuring performance.

For some aspects of a public authority’s digitisation program, a recognised standard may not be available. For these aspects, an internal standard or service level should be established with its details made available to appropriate staff. As with the adoption of an external standard, this will not only assist in setting clear targets for digitisation, but will also provide a measure of performance.

A means of determining if the records were digitised correctly

A crucial part of the digitisation workflow is the verification that the images of the paper records have been captured effectively to the required standard.
Public authorities should implement a checklist and establish which stages of the digitisation process are evaluated and how often or what proportion of records are checked. Effective checking and control will help avoid time consuming redundant activities such as the re-scanning of records that have not been scanned correctly or retrospectively adding metadata for documents that were not correctly profiled at the first attempt. This issue is covered in depth in section 6.5: Quality control, which includes a sample checklist.

Documentation on system maintenance

The performance of routine maintenance on the systems being used for digitisation is crucial to the success of the program and should be documented and linked to the digitisation workflow. This documentation should include a description of the system backup strategy, contact details of system administrators and system technicians and procedures for calibrating equipment. It is recommended that a disaster recovery plan also be in place to provide for preservation of digital records and alternative access paths to them in the event of unexpected failure of the primary systems.

Research and case studies of a variety of computer based systems suggest that approximately one fifth of the resources required to implement a system should be allocated annually to maintenance, upgrades and training.[7] Allowance for this should be made as part of the planning for digitisation.
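The one fifth rule of thumb above can be turned into a rough planning figure. The following is a minimal sketch, assuming a hypothetical implementation cost and planning horizon. Only the one fifth fraction comes from the text.

```python
def annual_maintenance(implementation_cost: float, fraction: float = 0.2) -> float:
    """Yearly allowance for maintenance, upgrades and training."""
    return implementation_cost * fraction


def planned_total(implementation_cost: float, years: int,
                  fraction: float = 0.2) -> float:
    """Implementation cost plus the yearly allowance over a planning horizon."""
    return implementation_cost + years * annual_maintenance(implementation_cost, fraction)


# Hypothetical figures, for illustration only.
IMPLEMENTATION_COST = 100_000.0
HORIZON_YEARS = 5

print(f"Annual allowance: {annual_maintenance(IMPLEMENTATION_COST):,.0f}")
print(f"Total over {HORIZON_YEARS} years: {planned_total(IMPLEMENTATION_COST, HORIZON_YEARS):,.0f}")
```

On this rule of thumb, the ongoing allowance equals the original implementation cost after five years, which is why funding for maintenance needs to be secured at the outset.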

4.4 Staff

Project management

Staff with business analysis and project management skills will be required to determine the need for digitisation. They should also examine the workflow of current and new processes to ensure that benefits of digitising are realised with minimal interruption to business and effective communication and change management. Staff with these skills may also manage the financial resources, negotiate with equipment and service suppliers and prepare for the continued support, maintenance, and lifecycle management.

Technical experts

Digitisation involves the integration of computer hardware, imaging equipment and various software packages to produce a managed collection of digitised records. Staff with technical skills will need to investigate the various hardware and software options, and work to bring the aims of the project to reality within time and budget constraints. These staff may need to liaise with vendors, test different combinations and configurations of equipment, and be responsible for acquisition, support, integration and maintenance of equipment. If an organisation’s IT help desk is expected to provide ongoing support to the digitisation program, they should be made aware of any non-standard configurations required for computers and contact details for the vendors of digitisation specific equipment.

Records management

Recordkeeping best practice applies to all records, whether in digital or paper format. Recordkeeping controls and processes such as registration, classification or profiling, and appraisal and disposal will have to be applied to digitised records. Records managers should be involved in the development of a digitisation program to ensure these matters are addressed. On a routine basis once digitisation is underway, records management staff will be involved in profiling and sentencing digitised images and also in the retrieval and storage of paper records which are being digitised.
Records management staff without previous experience in managing digitised or other technology dependent records should consult contemporary information management research and guidelines to examine how digitisation will affect the management of records within their organisation.

Equipment / computer operators

Personnel will be required to obtain the source paper records, operate the scanners, carry out quality checks on scanned records and add metadata / profile information for the digitised material. These staff should have a clear understanding of their task and a workflow should be developed so that digitisation is regular and routine and meets appropriate standards.

[7] Revised Digital Imaging Guidelines for State of Ohio Executive Agencies and Local Governments. 2003. Ohio Electronic Records Committee. Accessed March 2004 at http://www.ohiojunction.net/erc/imagingrevision/revisedimaging2003.html

5: Authorisation for Early Disposal

5.1 Introduction

While public authorities may digitise records for a variety of reasons, some may also wish to dispose of the originals after digitisation to save on storage and processing costs. Disposing of the originals after digitisation and before the authorised retention period for that class of record has expired is referred to as ‘early destruction’. Authorisation from the State Archivist is required for the early destruction of records after digitisation and the Digitisation Disposal Policy sets out what records are eligible for authorisation and under what conditions.

This section of the digitisation guidelines complements the policy statement by providing advice on seeking authorisation for early destruction, including assessing whether the records meet the policy criteria. Please ensure you are familiar with the Digitisation Disposal Policy before reading this section.

5.2 Overview of process

As required by the policy, each public authority will need to seek authorisation from the State Archivist. This is done in writing, proposing particular classes of public records, as identified in an approved retention and disposal schedule, for early destruction. Depending on the scope of the digitisation process, this may be a few records classes or many.

The public authority has initial and ongoing responsibility for assessing whether the classes of records meet the criteria proposed in the policy, and for certifying that appropriate systems and procedures are in place for generating and managing the digital image so that it is an accurate, reliable and authentic copy of the original. Queensland State Archives is responsible for assessing applications in accordance with the policy and providing authorisation in accordance with the Public Records Act 2002.

5.3 Determining whether records are eligible

In accordance with the parameters of the policy statement, public authorities should assess whether the classes being proposed for early destruction meet all of the following requirements:

• have a total retention period of ten years or less, in accordance with an approved retention and disposal schedule

• are not subject to format specific retention requirements, and

• have a low risk of the original records being required for legal proceedings or similar purposes.

Record classes being proposed for early destruction must meet all three requirements.

Retention period restriction

The total length of time a record is retained for comprises two components:

• the amount of time from creation of the record, or the file it belongs to, to the disposal trigger,[8] and

• the amount of time which the record has to be retained after a disposal trigger.

[8] A ‘disposal trigger’ is the event or action, specified in a Retention and Disposal Schedule, from which the disposal date is calculated. Common disposal triggers include ‘after last action’, ‘after contract / agreement expires’ or ‘after end of financial year’. A disposal trigger may occur many years after the creation of a record; for example, a trigger ‘after sale of building’ may occur more than 50 years after the creation of a building purchase record.

It is important to note that the ten-year retention period restriction on authorisation for early disposal applies to this total period, not simply the number of years after the disposal trigger occurs. A retention and disposal schedule clearly indicates the disposal trigger and the period of time a record needs to be retained after the trigger. However, to determine the total retention period for the purpose of authorisation under the Digitisation Disposal Policy, it will also be necessary to determine the average period that elapses before the disposal trigger occurs.

Determining the total retention period may involve discussions with the relevant business areas to determine how long records remain active before the disposal trigger is activated. As many disposal classes use ‘after last action’ as a disposal trigger, it would be necessary to determine for how long a file is usually active. For example:

• if files relating to business planning are routinely closed at the end of the financial year in accordance with the planning cycle, and

• the retention period is five years after last action,

• then these records would be eligible for authorisation, as the total retention period would be six years.

In contrast, if a record is usually active for three years, and the retention period is ten years after last action, then this record would not be eligible for early destruction.

For example:

• An application was made for a licence, and approved.

• The disposal action for approved licence records is retain for five years after expiration of licence.

• Licences are valid for two years.

The total retention period would be approximately seven years (assuming that it is only a short time from receipt of application to approval of licence) and this class of records would therefore be eligible for authorisation.

Queensland State Archives acknowledges that determining the total retention period for some record classes will rest on the ‘balance of probabilities’, as particular records within a record class may be retained for longer. In addition, some records change class during their life span. For example, if some records within a class are subject to a Freedom of Information request, a different record class also applies and their retention period is potentially extended. Please note: if it is difficult to apply a record class with any certainty to some types of records at creation or early in the life of the record, then these records are not eligible for early destruction.
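The worked examples in this section reduce to adding the typical period before the disposal trigger to the retention period after it, then comparing the total with the ten-year limit. The following is a minimal sketch of that arithmetic. The class and field names are hypothetical, and the check covers only the retention period criterion, not the format-specific or legal risk criteria.

```python
from dataclasses import dataclass

# "Ten years or less" total retention, as stated in section 5.3.
MAX_TOTAL_RETENTION_YEARS = 10


@dataclass
class RecordClass:
    """A record class described by its two retention components."""
    name: str
    years_before_trigger: float  # typical time from creation to the disposal trigger
    years_after_trigger: float   # retention period set out in the schedule

    @property
    def total_retention_years(self) -> float:
        return self.years_before_trigger + self.years_after_trigger

    def meets_retention_criterion(self) -> bool:
        return self.total_retention_years <= MAX_TOTAL_RETENTION_YEARS


# The three cases discussed in the text.
examples = [
    RecordClass("Business planning files", 1, 5),
    RecordClass("Record usually active for three years", 3, 10),
    RecordClass("Approved licence records", 2, 5),
]

for record_class in examples:
    outcome = "meets" if record_class.meets_retention_criterion() else "does not meet"
    print(f"{record_class.name}: {record_class.total_retention_years} years, "
          f"{outcome} the retention period criterion")
```

Because the period before the trigger is an estimate of typical practice, the result should be read as an indication to be confirmed with the relevant business area rather than a determination.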

Queensland State Archives’ Appraisal Archivists can assist in determining the total retention period and will review these determinations in assessing any application for authorisation.

Risk assessment

Once it has been determined that the class of records meets the retention criteria, a risk assessment should be undertaken which examines the likelihood of the records in each class proposed for early destruction being required for legal proceedings and challenged in court. In undertaking this assessment, a review of the agency’s litigation history and consultation with legal staff are essential.

For example:

• An agency may process 1000 claims files a year.

• Of these, 2% are subject to dispute and of these only 10% proceed to full litigation.

• In the agency’s experience of litigation, the validity of the records or parts of them has not been challenged.

Therefore, there may be a low risk of the original records being required and the Chief Executive may be happy to propose this record class for authorisation for early destruction.
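The figures in this example can be worked through as follows. This is a minimal sketch in which the function names are illustrative. The number of litigated files is only one input to the assessment, alongside whether the validity of records has been challenged in past litigation.

```python
def expected_disputes(files_per_year: int, dispute_rate: float) -> float:
    """Files per year expected to be subject to dispute."""
    return files_per_year * dispute_rate


def expected_litigated(files_per_year: int, dispute_rate: float,
                       litigation_rate: float) -> float:
    """Files per year expected to proceed to full litigation."""
    return expected_disputes(files_per_year, dispute_rate) * litigation_rate


# Figures from the example above.
FILES_PER_YEAR = 1000
DISPUTE_RATE = 0.02      # 2% of files are subject to dispute
LITIGATION_RATE = 0.10   # 10% of disputed files proceed to full litigation

disputed = expected_disputes(FILES_PER_YEAR, DISPUTE_RATE)
litigated = expected_litigated(FILES_PER_YEAR, DISPUTE_RATE, LITIGATION_RATE)
share = litigated / FILES_PER_YEAR

print(f"Disputed files per year: {round(disputed)}")
print(f"Litigated files per year: {round(litigated)}")
print(f"Share of all files litigated: {share:.1%}")
```

On these figures about 20 files a year are disputed and about 2 proceed to full litigation, which is 0.2% of the files processed.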

In addition to assessing the risk of the original record being required in legal proceedings, there may be other factors to consider, such as whether it is a vital record or whether there are any business needs for the original format. In many cases these other risks may be appropriately treated and minimised through good digitisation procedures and appropriate management of the image. Table 1 (below) includes examples of some risks and potential actions to minimise risk.

Risk: Electronic copy not legible
Mitigation / Treatment:
• Develop and implement quality assurance procedures.
• Procedure for retention of original and inclusion of metadata / explanatory notes if poor quality original means that a legible image cannot be generated.

Risk: Loss of hand-written annotations on originals
Mitigation / Treatment:
• Quality assurance procedures.
• Raise awareness of staff that if they make extensive annotations on a print-out of an image, the annotated copy should be rescanned as a new record.

Risk: Electronic copy cannot be found / lost
Mitigation / Treatment:
• Capture and management of image in recordkeeping system.
• Regular backups and other system maintenance.

Risk: Unauthorised access to record
Mitigation / Treatment:
• Use of access controls in recordkeeping system.

Risk: Lack of capability to retrieve digitised image over time
Mitigation / Treatment:
• Minimised through retention period restriction in policy.
• Development of agency migration strategy.

Risk: Unauthorised manipulation of image
Mitigation / Treatment:
• Only authorised staff undertake scanning.
• Develop and implement procedures specifying what manipulation is acceptable.
• Security of image through recordkeeping system.

Risk: Loss of vital records in a disaster
Mitigation / Treatment:
• Off-site backup.

Table 1: Identifying and treating risk

Formal risk assessment processes should be used to identify risks and plan mitigation strategies. These may be either agency-endorsed procedures or AS 4360: Risk Management. Appendix 11 of the DIRKS (Designing and Implementing Recordkeeping Systems) manual provides advice on how the AS 4360 risk management process can be adapted to recordkeeping risks.[9]

Format-specific retention requirements

Records which are subject to format-specific retention requirements which are not overridden by the Electronic Transactions Act 2001 are not eligible for early destruction after digitisation. Format-specific requirements relate to the need to retain a record in its original, paper form and commonly relate to witnessed or signed documents.

Many format-specific requirements in legislation were overridden by the Electronic Transactions Act 2001. However, some requirements were specifically excluded from the coverage of the Act and these exclusions are noted in Schedule 1 of the Act. Other requirements may be found in regulations or standards; for example, the Financial Management Standard 1997 requires financial information to be kept in its original form for one year after the date of the audit report for the financial year.

Other format-specific requirements may require electronic forms of documents to be retained on a particular storage device. If this is the case, the systems and procedures for digitisation (see next section) should comply with these requirements. Consultation with legal staff should be undertaken to determine the extent of format-specific requirements affecting the records of an agency.

5.4 Ensuring appropriate systems and procedures

Section 2.3 of the policy statement specifies a range of conditions public authorities must meet before authorisation can be granted. Table 2 (below) provides references to advice on meeting these requirements. Many of these requirements are discussed further in these guidelines.

[9] National Archives of Australia (2003) The DIRKS Manual: A Strategic Approach to Managing Business Information. Available online: http://www.naa.gov.au/recordkeeping/dirks/dirksman/dirks.html.


Requirement Further Advice / Information

Policies and procedures covering:

• Roles and responsibilities for the selection, digitisation and management of digitised records and secure and documented destruction of originals

Section 3.2: Which records will be digitised? Section 4.3: Procedures and standards

Public authorities may also need to review their procedures for registering and sentencing records to ensure that disposal decisions can be made at the point of digitisation.

• Technical specifications for digitisation Section 6: Technical considerations

While these guidelines provide recommendations, agencies are responsible for determining specific requirements for the type of records they intend to digitise.

• Capture of technical imaging metadata Section 7: Metadata

• Quality assurance procedures Section 6: Technical considerations

Quality assurance procedures must cover the calibration of equipment as well as the examination of images captured.

Public authorities should also determine an appropriate ‘cooling off’ period after digitisation and before the destruction of the originals to allow for all quality assurance processes to take place (minimum 1 month).

System requirements See Information Standard 40: Recordkeeping for general advice on the management of records.

• Designed with adequate physical and other security safeguards to ensure the records remain inviolate and can only be changed in an authorised manner.

For an explanation of recordkeeping systems, see “Characteristics and functionality of recordkeeping systems” in State Records NSW DIRKS Manual.

For specific information on ensuring the security of systems, see Information Standard 18: Information Security and Best Practice Supplement.

• Appropriate metadata is retained, including:

o Mandatory recordkeeping metadata elements are captured and maintained in accordance with the Recordkeeping Metadata Standard for Commonwealth Agencies, including the allocation of retention and disposal actions.

National Archives of Australia, Recordkeeping Metadata Standard for Commonwealth Agencies

Information Standard 31: Retention and Disposal of Public Records principle 2 specifies what retention and disposal information must be captured.

o Captures appropriate audit trails and maintains these as recordkeeping metadata

See elements “Management History” and “Use History” in National Archives of Australia, Recordkeeping Metadata Standard for Commonwealth Agencies.

o Captures technical imaging metadata at the point of digitisation

Section 7: Metadata


• Is covered by business continuity and disaster recovery plans

See Principle 9 in Information Standard 18: Information Security and Best Practice Supplement.

• Has a migration strategy to ensure that public records are not placed at risk of loss through technological obsolescence.

See Section 10: “Preserving Digital Records for the Long Term” in National Archives of Australia Digital Recordkeeping Guidelines.

• That there is no regulation requiring the electronic form of the record to be kept on a particular kind of data storage device as referred to in section 20(2)(c) of the Electronic Transactions Act 2001 OR that the regulation has been complied with.

For advice on identifying recordkeeping requirements, see Step C in State Records NSW DIRKS Manual.

Section 8: Storage and media options.

Table 2: Sources of advice on meeting requirements

5.5 Obtaining authorisation

The Chief Executive Officer of a public authority may write to the State Archivist to seek authorisation once the public authority has assessed that the classes of records being proposed for authorisation meet the criteria, and that appropriate systems and procedures are in place for generating and managing the digital image. This written request should include form 1 (a signed declaration of compliance with the policy), form 2 (a list of the record classes, as identified in an approved retention and disposal schedule, proposed for early disposal), and the name of an appropriate contact officer within the public authority. The forms are in appendix 1 of the Digitisation Disposal Policy.

Following receipt of this request, Queensland State Archives’ Appraisal Archivists will assess the request against the policy. This may involve requesting copies of the risk assessment or determinations of total retention period for some classes where there may be doubts regarding their eligibility for authorisation. The Appraisal Archivists will then liaise with the contact officer to:

• develop and insert the required classes in an existing public-authority specific retention and disposal schedule, or

• develop a new schedule if the permissions relate to a comprehensive sector-specific schedule.

If necessary, the authorisation can be limited to specific business units rather than applying to the whole public authority. Table 3 shows examples of disposal classes providing authorisation for the early destruction of public records.


Where the records are covered by an agency-specific retention and disposal schedule:

| Reference | Description of records | Status | Disposal Action |
|---|---|---|---|
| Class number | Original licence records which have been digitised | Temporary | Retain for 2 months after digitisation and completion of quality checks |
| Class number | Digitised images of licence records | Temporary | Retain for 5 years after last action |

Where the records are covered by a General Retention and Disposal Schedule or sector-specific schedule:

| Reference | Description of records | Status | Disposal Action |
|---|---|---|---|
| Class number | Original records sentenced under classes [insert class numbers and summary description] of the [name, number and version of schedule], where full and accurate digitised images are retained for the authorised retention period. | Temporary | Retain for 2 months after digitisation and completion of quality checks |

Table 3: Examples of disposal classes

5.6 Monitoring and review

Public authorities are responsible for monitoring their recordkeeping practices to ensure requirements are met. As part of this responsibility, public authorities should ensure that digitisation policies and practices continue to meet the requirements of the Digitisation Disposal Policy. In particular, any proposed change to digitisation procedures or recordkeeping practices should be assessed to ensure the terms and conditions of the authorisation continue to be met. As with other disposal authorisations, it is recommended that authorisations be reviewed every five years, or sooner if one of the following occurs:

• Machinery of Government (MoG) change.

• Plan to extend or reduce the range of records proposed for early destruction.

• Any change affecting the risk assessment undertaken to gain authorisation (for example, increase in litigation affecting a particular class of records).


6: Technical considerations

Prior to implementing a digitisation program, there should be a high level of understanding of the technical aspects of scanning within the organisation. Whether an organisation outsources its digitisation or performs the work in house, familiarity with key technical aspects of digitising will assist relevant staff to gain an understanding of the process. Most software and hardware that will be used in a digitisation program will provide a range of variable parameters such as image resolution and output file format, and informed choices need to be made on each of these. Establishing appropriate technical standards for digitisation before implementation will promote consistency and accountability.

As detailed in section 3: Issues to consider before commencing digitisation, there are a number of business considerations that need to be assessed by your organisation prior to implementing a digitisation program and also when deciding which records will be digitised. The key technical considerations of:

• resolution,

• bit depth,

• compression, and

• file format

are described in detail in the following sections with recommendations provided where warranted. A summary table of recommendations is provided in appendix 3: Table of technical recommendations. Also included in this chapter is a discussion of the quality control procedures that can be put into place to check that the image files created meet the specified standards.

6.1 Resolution

Picture elements, or pixels, can be considered the building blocks of all digital images. They are square cells of a single colour or shade that, when arranged in a regular grid pattern, form the digital image. The resolution of a digital image is the density of pixels that make up the image. Pixels per inch (PPI) is used to describe image resolution. Figure 1 shows a piece of text scanned at various resolutions.

Figure 1: 100PPI, 200PPI and 300PPI examples showing the effect of resolution on image clarity

For example, the image produced by scanning an A4 (8.27” x 11.69”) page at 100 PPI would have 827 pixels in the horizontal by 1169 pixels in the vertical direction, or a total of 966,763 pixels. If the same A4 image were scanned at a resolution of 300 PPI, it would be made up of 2481 x 3507 pixels or a total of 8,700,867 pixels (8.7 megapixels). Similarly, a 4” x 6” photograph digitised at 300 PPI would result in a 1200 x 1800 pixel image with a total of 2.16 megapixels. As seen in these examples, the pixel density combined with the dimensions of the source material provides an accurate assessment of the total number of pixels that will make up the resultant image.

Occasionally, an image will be described by using its pixel dimensions rather than pixel density. For example, images intended only for viewing on a computer screen may be described as “800 x 600 pixels”, “1024 x 768 pixels”, etc. By determining the source material dimensions in inches and using the provided horizontal and vertical pixel totals, the pixel density of the image can be discovered. For example, a 1024 x 768 image displayed full screen on a 17” monitor (viewing size 13” x 10”) has a resolution of approximately 80 PPI.

Scanning hardware limitations

As the resolution of an image cannot be increased beyond that at which it was originally digitised without recapture, it is crucial that an appropriate resolution is selected prior to image capture. Even current model entry level scanners targeted at home users can scan at a resolution of 600 PPI, meeting or exceeding the recommendations given in this guideline. Many scanning devices currently available have maximum resolutions of 4800 PPI. Higher PPI settings will result in images which are able to contain more detail per inch while increasing the file size of the resultant image. It should also be noted that increasing resolution beyond certain thresholds will not provide a more useful image with existing viewing and printing technology10.

Convenience

The time and effort required to locate a paper record, prepare it for scanning, and return it to storage need to be considered when determining what resolution will be used. To exploit future technology, such as high resolution viewing and printing capabilities not yet commonly available, it may be warranted to set the resolution of the capture device at its highest level for the best possible quality11, thus avoiding the need to rescan the paper record. In this scenario, a lower resolution version of the image, which can be derived from the high resolution image, may be more appropriate for contemporary use. Refer to section 6.6: Master files and derivatives, for more information.

What about dots per inch (DPI)?

DPI is a measure of printing resolution, in particular the number of individual dots of ink a printer or toner can produce within a linear one-inch space.

Due to the similarity with other measurements of graphical resolution, the DPI measurement is frequently misused, for instance, to specify a scanner's sampling resolution or the number of pixels per inch in a computer display.

Using DPI measurement in these cases is generally considered to be inaccurate and misleading, though the intended meaning is usually clear based on context. In these cases, a measure given in DPI can be taken as the number of pixels per inch.

Figure 2: A 10×10-pixel image on a computer display may require more than 10×10 printer dots to accurately reproduce, due to limitations of available ink colours in the printer. Adapted from: http://en.wikipedia.org/wiki/Dots_per_inch

10 General Guidelines for Scanning. 1999. Colorado Digitization Project. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/std_scanning.pdf

11 Digital Imaging for Archival Preservation and Online Presentation: Best Practices. 2001. Michigan State University. Accessed March 2004 at http://www.historicalvoices.org/papers/image_digitization2.pdf

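The pixel arithmetic used in the A4, photograph and monitor examples of section 6.1 can be reproduced with a few lines of code. The following is a minimal sketch only; the function names are hypothetical and are not part of any scanning software, and rounding to the nearest whole pixel is an assumption.

```python
def pixel_dimensions(width_in, height_in, ppi):
    """Return (horizontal, vertical) pixel counts for a page scanned at `ppi`.

    Dimensions are in inches; results are rounded to whole pixels
    (an assumption made for this sketch).
    """
    return round(width_in * ppi), round(height_in * ppi)


def total_pixels(width_in, height_in, ppi):
    """Return the total number of pixels in the scanned image."""
    horizontal, vertical = pixel_dimensions(width_in, height_in, ppi)
    return horizontal * vertical


def effective_ppi(pixels, inches):
    """Return the pixel density along one axis of a displayed image."""
    return pixels / inches


# A4 page (8.27" x 11.69") at 100 PPI and 300 PPI, and a 4" x 6" photograph.
print(pixel_dimensions(8.27, 11.69, 100), total_pixels(8.27, 11.69, 100))
print(pixel_dimensions(8.27, 11.69, 300), total_pixels(8.27, 11.69, 300))
print(pixel_dimensions(4, 6, 300), total_pixels(4, 6, 300))

# 1024 x 768 image shown on a 13" x 10" viewing area.
print(effective_ppi(1024, 13), effective_ppi(768, 10))
```

Because total pixels grow with the square of the resolution, doubling the PPI quadruples the pixel count, which is why the choice of resolution has such a large effect on file size.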

Mode of use

How the digitised documents will be used needs to be considered when making a decision about resolution. The resolution of the typical output should be considered. As a general guide, source documents that are generally magnified for viewing and printing require digitising at a higher resolution, while source documents that are reduced for viewing and printing can be digitised at lower resolutions.

In the case of large documents, the intended viewing or reproduction size needs to be considered, but there can be logistical and practical difficulties if using too high a resolution for large documents. For example, digitisation of an A0-sized (33”x47”) poster at 300 PPI could produce a file over 400Mb in size. While the storage of a file of this size may be accommodated, the processing power required to view and print such files is beyond many systems. For a large map or plan that is only ever to be viewed as an A4-sized image, the reduced size of the output means that the input resolution may be quite low. However, a high resolution may be required to legibly capture the fine line work and small text that is often present on large format maps and plans. The resolution selected to digitise documents may be a compromise between detail and file size.

Source documents that are typically enlarged for viewing or are of a small size and require magnification for use should be digitised at high resolutions. The best illustration of this would be the digitisation of a slide, microfilm or photographic negative which would normally be viewed at several times its actual size. It is common practice for these types of originals to be digitised at many thousands of pixels per inch to produce useable output at viewing size.

On the other hand, considering that even the most modern computer monitors typically have resolutions less than 100 PPI, if a document is digitised purely for on screen viewing at the original scale, digitising at high resolutions will not provide any benefit12.

Recommended Resolutions

Table 4 shows the minimum recommended PPI resolutions for digitising paper records.

| Document Type | Page Size | Resolution |
|---|---|---|
| Standard text documents | Up to A3 | 200 PPI |
| Oversized documents, e.g. maps | Larger than A3 | 200 PPI |
| Photographs | 6” x 4” | 600 PPI |
| Photographs | 7” x 5” | 430 PPI |
| Photographs | 9” x 6” | 300 PPI |

Table 4: Resolution recommendations

Digitising at a higher resolution than recommended may be necessary if there is a requirement to enlarge the image for use or to capture highly detailed paper originals. Public authorities should combine reference to these guidelines with their own testing on typically digitised documents prior to selecting which resolutions to use.

12 Scanning Tips and Techniques. Jasc Software Inc. 1999. Accessed October 2004 at http://www.jasc.com/tutorials/scantip.asp


6.2 Bit Depth

A “bit” is the fundamental unit of computer information having just two possible values, either 0 or 1. Bit depth is the number of bits used to describe the colour of each pixel. Greater bit depth allows a greater range of colours or shades of grey to be represented by a pixel13. Using multiple bits increases choice and variety, at the expense of increased file size. For example, using only 1-bit pixels gives 2 colours, usually either black or white. Using 4 bits gives 16 colour14 choices (i.e. 2 x 2 x 2 x 2). Typical bit depths are described below.

Bi-tonal (1-bit)

Bi-tonal images are made up of a foreground and a background colour, typically black in the foreground and white as the background. Because this does not allow for shading, bi-tonal depth is recommended primarily for black and white text documents without illustrations, or with simple line drawings which have no shading.

Palettised (4-bit: 16 colours / greyscales and 8-bit: 256 colours / greyscales)

Images with these bit depths can be classified as palettised since each pixel in the image is assigned a value that relates to a specific colour in the palette. Colour 4- and 8-bit images should be used to capture colour drawings and illustrations. 8-bit is more commonly used than 4-bit. Using a palettised image for continuous colour changes such as those found in photographs will give poor results and is not appropriate.

Greyscale 8-bit images are most useful for black and white photographs, half-tone illustrations, other types of continuous tone illustrations, handwritten and typed manuscript and archival materials that are nominally black and white, but which actually contain shading and varieties of ink density and paper tonality. For older documents, where for example the paper may be coloured with age or ink may have faded, colour rather than greyscale may be appropriate15.

High colour (16-bit: 65536 colours / greyscales)

For the most continuous greyscale images requiring more than 256 shades of grey, 16-bit greyscale may be used. It provides these additional shades, which are used in such applications as medical imaging, with increased file size. 16-bit colour is a compromise between palettised colour and true colour offered in the higher bit-depths. 16-bit colour is used in many video and animation applications, but its use is limited for still images. If used for continuous images such as photographs, some colour changes may be noticed, and for discrete colour drawings, 16-bit could be considered excessive and inefficient.

True colour (24-bit: 16.7 million colours and 48-bit: billions of colours)

When digitising full colour images and photographs, 24-bit images should be used. These true colour images consist of three colour bands – red, green and blue (RGB) each of which is 8-bit for 24-bit colour or 16-bit for 48-bit colour. Each pixel in a 24-bit colour image will have a red value, a green value and a blue value, each between 0 and 255. For example, a citrus green colour pixel would have an RGB value of 201,254,40.

13 Creating and Managing Digital Content – Glossary. 2002. Canadian Heritage Information Network. Accessed March 2005 at http://www.chin.gc.ca/English/Digital_Content/Small_Museum/glossary.html#c

14 Standard 4-bit colours are black, dark red, dark green, dark yellow, dark blue, dark purple, dark cyan, pale grey, mid grey, red, green, yellow, blue, magenta, cyan and white.

15 Technical Recommendations for Digital Imaging Projects. 1997. Image Quality Working Group of ArchivesCom. Accessed March 2005 at http://www.columbia.edu/acis/dl/imagespec.html
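The relationship between bit depth and the number of colours a pixel can represent, described in section 6.2, is simply two raised to the power of the bit depth. The sketch below illustrates this, together with one common way of storing the three 8-bit bands of a 24-bit pixel in a single number. The function names are hypothetical, and the red-green-blue packing order is an assumption chosen for illustration rather than a requirement of any file format.

```python
def colours_for_bit_depth(bits):
    """Return how many distinct values a pixel of `bits` bits can hold."""
    return 2 ** bits


def pack_rgb(red, green, blue):
    """Pack three 8-bit bands (each 0-255) into one 24-bit integer.

    The red-green-blue ordering is an assumption made for this sketch.
    """
    for band in (red, green, blue):
        if not 0 <= band <= 255:
            raise ValueError("each band must be between 0 and 255")
    return (red << 16) | (green << 8) | blue


for bits in (1, 4, 8, 16, 24, 48):
    print(f"{bits:>2}-bit: {colours_for_bit_depth(bits):,} colours or shades")

# The citrus green pixel used as an example in the text.
print(hex(pack_rgb(201, 254, 40)))
```

Each extra bit doubles the number of available colours but adds only one more bit per pixel to the file, which is why the step from 8-bit to 24-bit triples the uncompressed file size while multiplying the colour range by 65,536.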


True colour depths are recommended for any materials with colour where colour conveys essential information. For colour photographs the minimum recommended bit depth is 24-bit (true colour). With current viewing and printing equipment, 48-bit colour does not provide any meaningful advantages over 24-bit colour. However, if documents are captured using a 48-bit capture device now, the benefits may be able to be exploited in the future as technology develops. 32-bit colour is 24-bit colour with an additional 8-bit channel providing 256 levels of transparency and is used mainly for digital video and animation applications.

Selecting an appropriate bit-depth

The nature of the documents being digitised should be the main factor dictating the bit depth used for the images produced. For the digitisation of black and white text documents, bi-tonal colour depth will usually capture the information most efficiently. However, for documents that contain greyscales or colours, a bi-tonal image will not capture all of the information and may produce an illegible image. Palettised colour depth is typically suitable for line drawings, colour documents and diagrams, while continuous tone images, such as photographs, are best captured in true colour.

Capturing a document at a lower than recommended bit depth will possibly result in an image that is visibly different from the original record. In some situations this visible difference and loss of information will be acceptable – for example when digitising a document with black and white content, but a coloured letterhead, the loss of colour in the letterhead may be acceptable.

Choosing a higher than recommended colour depth, such as 24-bit colour for a black and white document, will not provide any benefits, but will result in an increase in the file size of the image produced and may even introduce small areas of extra colours not present in the original document. The conversion of a colour drawing, such as a simple business graphic, into a 24-bit colour image would not only result in an inefficient file size but also introduce many extra colours into the image. For example, the original document digitised to produce the image shown in Figure 3 had three colours – black, white and grey. However, during the process of scanning this document as a 24-bit image, 17,898 colours including pixels with shades of brown and pink were generated! If the image were printed using a monochrome printer, the general appearance may be similar to the original; however, the introduction of additional colours may affect post-digitisation image processing operations. In this case using 4- or 8-bit grey for the output image would be more appropriate.

Figure 3: Greyscale text captured in 24-bit colour showing that using a higher than recommended colour depth may introduce extra colours into the image
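The effect described for Figure 3, where a three-colour original gains extra colours when captured in 24-bit colour, can be illustrated by counting the distinct pixel values in an image before and after reducing it to 4-bit grey. This is a sketch only. The sample pixel values are invented for illustration, the function names are hypothetical, and the red, green and blue weightings (0.299, 0.587, 0.114, from ITU-R BT.601) are one common choice that scanning software may or may not use.

```python
def count_colours(pixels):
    """Return the number of distinct values in a sequence of pixels."""
    return len(set(pixels))


def to_grey8(rgb):
    """Convert an (R, G, B) pixel to an 8-bit grey value (0-255).

    Uses the ITU-R BT.601 weightings in integer arithmetic; the choice of
    weightings is an assumption made for this sketch.
    """
    red, green, blue = rgb
    return (299 * red + 587 * green + 114 * blue + 500) // 1000


def to_grey4(rgb):
    """Convert an (R, G, B) pixel to the nearest of 16 grey levels (0-15)."""
    return (to_grey8(rgb) * 15 + 127) // 255


# Invented sample: a black, white and grey original whose scan has picked up
# slight colour casts, so near-identical pixels count as different colours.
scanned = [
    (0, 0, 0), (255, 255, 255), (136, 136, 136),
    (138, 134, 137), (135, 137, 139), (254, 255, 253),
]

print(count_colours(scanned))                          # colours in 24-bit scan
print(count_colours([to_grey4(p) for p in scanned]))   # levels in 4-bit grey
```

Reducing the bit depth after capture discards the colour casts, but it is generally simpler to select the appropriate bit depth at the time of scanning.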


As is the case when determining the resolution to use, the mode of use of the digital images should be considered when deciding upon an appropriate bit-depth. If imaged pages will most often be viewed on computer screens, then the use of a higher than normal bit-depth may be warranted. As seen in Figure 4, increasing the colour depth may enhance the on-screen readability of a low resolution image. If, however, digitised copies of records will only ever be made available as monochrome print outs, then the use of colour could be considered superfluous.

Capturing a document that contains a watermark, highlighting, or hand written annotations into a bi-tonal image may cause text to be obscured leading to a loss of information. An example of this information loss is shown in Figure 5. Once again, a palettised grey or palettised colour output image would capture the text of the document as well as the extra information in the watermark or annotations.

Figure 4: 8-bit and 1-bit versions of a 72ppi image, showing how an increase in bit depth allows anti-aliasing which may improve readability

Halftones

In printing, halftones are evenly spaced spots of varying diameter used to produce apparent shades of grey with a single colour ink. The darker the shade at a particular point in the image, the larger the corresponding spot in the printed halftone. In traditional publishing, halftones are created by photographing an image through a screen. In order to simulate variable-sized halftone dots in digital imaging, dithering is used, which creates clusters of pixels in a "halftone cell". The more black pixels in the “cell”, the darker the grey.

Bi-tonal images utilising halftones may be considered as an alternative to using 4- or 8-bit grey to represent greyscales on digitised documents. This technique may provide some advantages over using palettised images including wider format compatibility and reduced file size.
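The idea of a halftone cell, where the number of black pixels in the cell determines the apparent shade of grey, can be sketched as follows. This is an illustration under stated assumptions and not a description of how any particular scanner implements halftoning: the 2 x 2 cell size, the threshold layout and the function names are all hypothetical, and real systems typically use larger cells.

```python
# Hypothetical 2 x 2 threshold matrix. A position in the cell is printed
# black once the darkness level of the grey shade exceeds its threshold.
THRESHOLDS = (
    (0, 2),
    (3, 1),
)


def halftone_cell(grey):
    """Return a 2 x 2 cell of 0/1 values (1 = black) for an 8-bit grey value.

    `grey` runs from 0 (black) to 255 (white).
    """
    if not 0 <= grey <= 255:
        raise ValueError("grey must be between 0 and 255")
    positions = len(THRESHOLDS) * len(THRESHOLDS[0])
    # Darkness level from 0 (white) to 4 (black), rounded to nearest.
    darkness = ((255 - grey) * positions + 127) // 255
    return tuple(
        tuple(1 if darkness > threshold else 0 for threshold in row)
        for row in THRESHOLDS
    )


def black_pixels(cell):
    """Count the black pixels in a halftone cell."""
    return sum(sum(row) for row in cell)


for grey in (255, 192, 128, 64, 0):
    cell = halftone_cell(grey)
    rows = " / ".join("".join("#" if p else "." for p in row) for row in cell)
    print(f"grey {grey:>3}: {rows}  ({black_pixels(cell)} black)")
```

A 2 x 2 cell can only represent five shades (0 to 4 black pixels), which illustrates why halftones need a sufficiently high resolution to be useful.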

Figure 5: 8-bit vs. 1-bit vs. 1-bit with halftone for watermarked documents

However, use of halftones may also introduce a speckled effect to areas of the image that should be white. At too low a resolution, halftones will not be beneficial, and halftones at high resolutions may produce a large number of halftone pixels where there should be white space. Some other image processing, notably optical character recognition (refer to section 6.6: Master files and derivatives for more information), may also be negatively affected if using halftones in text documents. Public authorities considering using halftones for digitised records should carry out thorough testing to ensure the end results are suitable.

When paper documents that contain halftone images are digitised, a distracting pattern of lines called "moiré" is often produced. To avoid this unwanted effect, most scanning systems have a “de-screen” function to remove the moiré during the scanning process. Post-capture image processing software can also be used to correct these images. Alternatively, halftones may be captured by scanning the source document at a high enough resolution to isolate each of the dots making up the halftone, typically 600 PPI or above, and then using software to reduce the image to the standard resolution16.

Recommended Bit Depths


The table below shows the recommended bit depth for digitising paper records.

| Document type | Bit Depth |
|---|---|
| Black and white text only | 1-bit bi-tonal |
| Text with some colour | 8-bit colour |
| Text with shades of grey | 8-bit grey |
| Colour drawings / presentations / graphics | 8-bit colour |
| Black and white photographs | 8-bit grey |
| Colour photographs | 24-bit colour |

Table 5: Bit depth recommendations

If a document containing a mix of the above is being imaged, the highest colour depth should be used to capture it. For example, an otherwise black and white page which includes a colour photograph should be captured in 24-bit colour.

Public authorities should combine reference to these guidelines with their own testing on typically digitised documents prior to selecting which bit depths to use.
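The rule that a mixed document should be captured at the highest applicable colour depth can be expressed as a simple lookup over the recommendations in Table 5. The sketch below is illustrative only. The dictionary keys and function name are hypothetical, and preferring colour over grey when two 8-bit recommendations tie is an assumption that is not stated in these guidelines.

```python
# Recommendations from Table 5 as (bits, mode) pairs.
RECOMMENDED_BIT_DEPTH = {
    "black and white text only": (1, "bi-tonal"),
    "text with some colour": (8, "colour"),
    "text with shades of grey": (8, "grey"),
    "colour drawings / presentations / graphics": (8, "colour"),
    "black and white photographs": (8, "grey"),
    "colour photographs": (24, "colour"),
}


def recommended_bit_depth(content_types):
    """Return the (bits, mode) to use for a page containing `content_types`.

    The highest bit depth wins. When bit depths tie, colour is preferred
    over grey (an assumption made for this sketch).
    """
    depths = [RECOMMENDED_BIT_DEPTH[content] for content in content_types]
    if not depths:
        raise ValueError("at least one content type is required")
    return max(depths, key=lambda depth: (depth[0], depth[1] == "colour"))


# An otherwise black and white page that includes a colour photograph.
print(recommended_bit_depth(["black and white text only", "colour photographs"]))
```

Such a lookup is only a starting point; the results should still be confirmed by testing on representative documents.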

6.3 Compression and File Size

Calculating file size

As indicated in the previous sections, the total number of pixels used to make up an image affects file size. Additionally, the colour depth of each of those pixels has a multiplying effect on the file size. In the example used earlier, an A4 page was digitised at 300 PPI giving a total of 8 700 867 pixels. The following table shows the number of bits that make up this image at varying colour depths and resolutions, and shows approximate file sizes17.

Using the information in Table 6, it can be seen that if an organisation were to scan all of their A4 sized documents as 600 PPI 24-bit colour images, even a moderate collection of digitised documents would create large file storage requirements. In addition to choosing appropriate image resolution and colour depth, a number of compression methods can be adopted to reduce the file size of digital images.

16 How To Fix Bad Scans. 2004. Dixie State College of Utah. Accessed March 2005 at http://cit.dixie.edu/vt/vt2600/bad_scans.asp

17 1 byte contains 8 bits. 1024 bytes = 1Kb. 1024Kb = 1Mb


| Colour depth | Resolution (PPI) | Total bits | Uncompressed file size (Mb) |
|---|---|---|---|
| 1 bit bi-tonal | 300 | 8 700 867 | 1.04 |
| 1 bit bi-tonal | 600 | 34 803 468 | 4.15 |
| 8 bit grey or colour | 300 | 69 606 936 | 8.30 |
| 8 bit grey or colour | 600 | 278 427 744 | 33.19 |
| 24 bit colour | 300 | 208 820 808 | 24.89 |
| 24 bit colour | 600 | 835 283 232 | 99.57 |

Table 6: Uncompressed file sizes for an A4 page digitised at different pixel depths and resolutions
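The figures in a table of this kind follow directly from the page size, resolution and bit depth. The sketch below shows the calculation for an A4 page, using the units defined in the footnote (8 bits per byte, 1024 bytes per Kb, 1024 Kb per Mb). The function name is hypothetical and the calculation ignores any file header or metadata overhead, which is an assumption of this sketch.

```python
A4_INCHES = (8.27, 11.69)
BITS_PER_BYTE = 8
BYTES_PER_MB = 1024 * 1024


def uncompressed_size(width_in, height_in, ppi, bit_depth):
    """Return (total_bits, size_in_mb) for an uncompressed scanned image.

    Header and metadata overhead are ignored in this sketch.
    """
    pixels = round(width_in * ppi) * round(height_in * ppi)
    total_bits = pixels * bit_depth
    return total_bits, total_bits / BITS_PER_BYTE / BYTES_PER_MB


for bit_depth in (1, 8, 24):
    for ppi in (300, 600):
        bits, size_mb = uncompressed_size(*A4_INCHES, ppi, bit_depth)
        print(f"{bit_depth:>2}-bit at {ppi} PPI: {bits:>11,} bits, {size_mb:6.2f} Mb")
```

Because doubling the resolution quadruples the number of pixels, each 600 PPI figure is four times the corresponding 300 PPI figure.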

Compression

Compression reduces storage space requirements, saves on backup and transfer media, lessens the impact on the network of accessing image files and provides shorter file transfer times. Mainstream compression techniques in widespread use today are tried and tested and can be used with the confidence that images will continue to be accessible once compressed. Compression used for images can be categorised into lossless and lossy compression.

Lossless compressions reduce the size of a file without discarding any information. An example of a lossless compression technique is substitution. As a very simplistic example, if the A4 page of text described in Table 6 consists of 90% white space and 10% black text, then by simply substituting a 4-bit symbol for each white pixel’s 24-bit RGB value, the image size would reduce from approximately 25 Mb to around 6Mb. The substitution table is stored within the image file, allowing the exact image to be viewed and printed, while still having a small file size.

Lossy compressions, however, are irreversible; file information is lost when a lossy compression process is applied. When the file is viewed or printed, the resultant image will therefore be different from the original. The degree of difference between the original and compressed files is sometimes related to the amount of compression required. Appropriately applied, the human eye should not be able to readily differentiate between the original file and the compressed version. One of the most commonly used lossy compression processes is known as quantisation. Colour values are simplified and rounded - discarding real information. The extent of compression is variable with the level of output quality specified governing how much simplification occurs. Greater simplification leads to a smaller file size, but with greater loss of information.

The effects of file compression can depend on the file format, the file contents and the compression method used. There is not a fixed file size reduction that can be expected from every image that is compressed. For example, the commonly used JPEG compression works well on colour photographic images, but poorly compresses images containing drawings, letters or simple graphics. Therefore, if compression is to be applied, a method appropriate to the digital image and its intended use needs to be selected.
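The simplistic substitution example and the idea of quantisation discussed under Compression can both be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the substitution estimate ignores the size of the substitution table and any marker needed to tell symbols from literal values, and the quantisation step of 16 is an arbitrary choice. Real compression schemes are considerably more sophisticated.

```python
BITS_PER_BYTE = 8
BYTES_PER_MB = 1024 * 1024


def substituted_size_mb(total_pixels, white_fraction, symbol_bits=4, pixel_bits=24):
    """Estimate file size when each white pixel is replaced by a short symbol.

    Substitution table and marker overheads are ignored (an assumption).
    """
    white = round(total_pixels * white_fraction)
    other = total_pixels - white
    total_bits = white * symbol_bits + other * pixel_bits
    return total_bits / BITS_PER_BYTE / BYTES_PER_MB


def quantise(value, step):
    """Round an 8-bit colour value down to a multiple of `step` (lossy)."""
    return (value // step) * step


def quantise_pixel(rgb, step):
    """Apply quantisation to each band of an (R, G, B) pixel."""
    return tuple(quantise(band, step) for band in rgb)


# Lossless: A4 page at 300 PPI in 24-bit colour, 90% white space.
print(round(substituted_size_mb(8700867, 0.9), 2))

# Lossy: the citrus green pixel from section 6.2, quantised in steps of 16.
print(quantise_pixel((201, 254, 40), 16))

# Different inputs collapse to the same output, so the original values
# cannot be recovered after quantisation.
print(quantise(200, 16) == quantise(201, 16))
```

The substitution estimate can be reversed exactly because each symbol maps back to one known colour, whereas quantisation maps several original values to one output and so cannot be undone.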


Recommended Compression


Some form of compression should be applied to digitised records to enable storage and access in an efficient manner. Lossless compression provides file size reduction while being able to reproduce an exact, true and accurate digital copy of the image created at time of digitisation. Where possible, lossless compression should be employed. Lossy compression is not suitable when original paper records are authorised for early disposal as the accuracy of the image may be called into question. However, when originals are being retained, the additional file size reduction that lossy compression provides can mean that a small, perhaps indistinguishable, loss of data may be acceptable for some file types. When employing lossy compression techniques, the resulting image should not appear noticeably different from the original paper record.

.4 File Formats ile formats encode information into a form which is intended for processing and use by pecific combinations of hardware and software18. Fortunately, the current technology rends of interoperability and compatibility have led to many file formats being supported n a variety of hardware and software platforms. This trend applies to image file formats ith many image processing and viewing programs available for Windows, UNIX, and pple computer systems. Many free or relatively inexpensive image manipulation rograms support the creation, editing, viewing and printing of images in dozens of ifferent formats. he five file formats most commonly used for digitisation are described below and ummarised in Table 7. oint Photographic Experts Group (JPEG) File Interchange Format (JFIF) FIF was developed by the Joint Photographic Expert Group and was standardised by the nternational Standards Organisation (ISO) in August 1990. The JFIF format is platform ndependent and can be exchanged between a variety of different applications. his format of images is commonly used on the World Wide Web (WWW) and in digital hotographic equipment as the JPEG compression scheme inherent to JFIF lends best to hotographs and complex graphics with continuous tones to minimise file sizes19. The FIF format uses the JPEG lossy compression technique which, by simplifying certain olour information, can reduce file size at the expense of some image quality. The level of ompression is adjustable by the operator at the time of creation, allowing a compromise o be found between the amount of information loss and the file size reduction. JFIF upports only 8-bit grey and 24-bit colour, not 1-bit bi-tonal. This format will be referred to y its common name, JPEG, through the remainder of this document. new related format, JPEG2000, has recently been developed which is substantially dvanced with new compression techniques, additional bit depths and a lossless ompression option20. 
While JPEG2000 is related to the original format, it is not

18 Brown A. Digital Preservation Guidance Note 1: Selecting File Formats for Long-Term Preservation. 2003. National Archives (UK). Accessed March 2005 at http://www.nationalarchives.gov.uk/preservation/advice/pdf/selecting_file_formats.pdf

19 Horton S. Web Style Guide 2nd Edition: JPEG Graphics. 2004. Lynch and Horton. Accessed March 2005 at http://www.webstyleguide.com/graphics/jpegs.html

20 Mendham S. JPEG 2000. 2005. IDG Communications. Accessed March 2005 at http://www.pcworld.idg.com.au/index.php/id;1170029196;fp;2;fpid;1585691688


compatible with many mainstream image processing, scanning, and viewing programs or web browsers. These compatibility issues may be alleviated once the format is more established in the sector and software and hardware vendors have assessed the format and the market’s demand for its use.

Tagged Image File Format (TIFF)

The TIFF format was developed in 1986 by Microsoft and Aldus and is currently maintained by Adobe21. Despite being an older file format, TIFF is widely supported and is seen by many as a de facto standard for image files. TIFF files are commonly used in desktop publishing, faxing, 3-D applications and medical imaging applications.

There are several sub-formats within the TIFF specification. TIFF CCITT22 Group 3 and Group 4 are the most widely used formats in document imaging – most fax transmissions are in TIFF Group 3 format. Other sub-formats of TIFF support greyscale, colour depths of up to 64-bit and offer compression choices including uncompressed, lossless LZW, and run length compression23.

The most recent release, TIFF 6.0, was launched in 1992. While the baseline version of TIFF 6.0 is fully compatible with applications designed to read earlier TIFF images, a number of additional features were added that require software to be specifically tailored to support the newer version. JPEG compression was included in the TIFF 6.0 specifications, and despite a technical revision in 1995 to overcome serious design flaws24, there still remain problems with the use of this lossy compression within TIFF files, and this option is not widely used. The TIFF version 7.0 specification, which appeared in draft format in 1997 but is still to be released, is expected to feature a more stable implementation of JPEG compression amongst other new features.

Various extensions to the TIFF specification have been implemented for specialised purposes.
Care should be taken when using these extended versions of TIFF, as the application support for viewing and manipulating them may be limited.

Graphics Interchange Format (GIF)

Graphics Interchange Format (GIF) is a widely used image format introduced in 1987 by CompuServe. In the early years of the WWW, developers adopted GIF for its efficiency and widespread familiarity. A large proportion of the images on the Web are presented in GIF format, and virtually all Web browsers that support graphics can display GIF files.

The GIF format supports a maximum of 256 palettised colours or shades of grey, so it is most suited to discrete images such as illustrations, black and white images, logos and line drawings rather than photographs. GIF files are compressed using a lossless compression technique, LZW. Although GIF has a free and open specification, the Unisys Corporation holds patents on LZW, and its commercial use may require licensing and royalty payments25. While the generation and use of GIF files can generally be done without requiring a licence, and many of the patents that relate to GIF have expired or are soon to expire, the royalty free PNG format (outlined below), which was developed largely because of this patent issue, has taken over from GIF in many applications.

21 TIFF Revision 6.0. 1992. Adobe Systems Inc. Accessed March 2005 at http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf

22 Comite Consultatif International Telegraphique et Telephonique (International Telegraph and Telephone Consultative Committee)

23 Leurs L. The TIFF file format. 2001. Laurens Leurs. Accessed March 2005 at http://www.prepressure.com/formats/tiff/fileformat.htm

24 JPEG Image Coding Standard. 1998. Centre for Telecommunications and Information Engineering, Monash University. Accessed March 2005 at http://www.ctie.monash.edu.au/EMERGE/multimedia/JPEG/COMM03.HTM

25 LZW Patent Information. 2005. Unisys Corporation. Accessed March 2005 at http://www.unisys.com/about__unisys/lzw/


Portable Network Graphics (PNG)

Portable Network Graphics (PNG) is a lossless, portable, well-compressed storage format for images. The open-source and patent free PNG format was designed to replace the proprietary GIF format and, to some extent, the much more complex TIFF format. The second edition of PNG is an ISO standard - ISO/IEC 15948:2003 (E)26. The PNG format was designed specifically for use in online viewing applications such as the WWW, and the format offers a range of attractive features that should eventually make PNG the most common graphic format27. At this stage, however, the lack of universal support for PNG in scanning and imaging applications is the format’s main weakness.

Portable Document Format (PDF)

Portable Document Format (PDF) is a widely used proprietary file format developed and maintained by Adobe Systems. The PDF format was released in 1993 and is based on the Adobe Postscript printing language. The PDF format was created by Adobe to provide a standard for storing and editing documents. It is important to note that PDF is an encapsulating format, not a raster image format as are the other formats in this section. Encapsulation formats attempt to ensure that files are displayed and used consistently across computer programs and platforms28. As such, PDF cannot be used to capture digitised records directly, although some scanning systems may convert to PDF seamlessly from a native image format without further user interaction.

While the PDF format is widely used, because a PDF file is not an image, most image processing software titles cannot open a PDF file. Operations including the deriving of related files through resolution reduction, optical character recognition, and image correction will not usually be able to be carried out on a PDF file. If the original images are not retained separately, any of these image processing operations should be performed prior to encapsulation into a PDF file.
The strength of the PDF format is its ability to capture text and images in their original format, preserving fonts, graphics and layouts. Viewing PDF files requires Adobe Reader, a free application distributed by Adobe Systems29, or other licensed software which embeds this capability. Because documents in PDF format can easily be seen and printed by users on a variety of computer and platform types, they are very common on the WWW. The PDF format supports several compression methods; the most commonly used are the CCITT, LZW and ZIP lossless compression schemes and JPEG lossy compression. It supports up to 8-bit palettised, 16-bit grey and 48-bit colour and uses lossless compression, based upon the Zip compression techniques used in PKZip and WinZip.

Summary

As seen in Table 7, each format has its strengths and weaknesses. Identifying the best format requires knowledge of how the digitised documents will be used, the characteristics of the materials that will be digitised and the way digitised documents will be delivered.

26 PNG (Portable Network Graphics). 2004. World Wide Web Consortium. Accessed March 2005 at http://www.w3.org/Graphics/PNG/

27 Horton S. Web Style Guide 2nd Edition: PNG Graphics. 2004. Lynch and Horton. Accessed March 2004 at http://www.webstyleguide.com/graphics/pngs.html

28 File Formats and Compression. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/creating/fformat.html#ff2

29 Adobe PDF. 2005. Adobe Systems Inc. Accessed March 2005 at http://www.adobe.com/products/acrobat/adobepdf.html


The five formats described in this guideline are widely used for a number of digital imaging applications. JPEG and PNG are non-proprietary formats while TIFF and PDF are proprietary formats which have freely available specifications. This provides system developers with a cost effective and readily available means to incorporate support for these formats. If a system is adopted which requires the use of another format, especially a system-specific proprietary format, the ability of that system to convert images into at least one of the common formats discussed here is paramount.

As PDF is an encapsulation format, it is difficult to provide a direct comparison between PDF and the image formats described here. Scanning or imaging systems that provide PDF as an output format option will always need to use a true image format for an intermediary, usually temporary, file that is converted into PDF. Therefore, the characteristics of this intermediary file need to be considered prior to implementation.

To take advantage of the characteristics of different file formats, multiple copies of a digitised document may be stored. Refer to section 6.6: Master files and derivatives, where derived files that may form part of a digitisation program are discussed in more detail, along with suggested file formats for the different types of derived files.

Considerations for digitising multi-page documents

TIFF is the only image format described here that is able to capture more than one image in a single file. This enables storage of individually scanned pages of a multi-page document in a single file. The other image formats described here can only deliver a single image per file. If these formats are used for digitisation of multi-page documents, image management software or other systems, such as an eDRMS, are required to provide the linkage and sequencing needed to represent the multiple images making up the pages of a digitised document as a single entity.
Some systems, including widely used eDRMS and high volume scanning software, require that a multi-page format, typically TIFF or PDF, be used for image storage so that a single file is capable of representing several scanned pages of a document. This should be viewed as a software limitation rather than best practice, and if organisations are forced to use TIFF or PDF solely due to their support for multiple paged documents, they should do so with caution after thoroughly investigating all options.

While TIFF is widely regarded as the standard file format to use when capturing documents as bi-tonal images, it lacks the compression and bit depth combinations to suit other document types, particularly greyscale and colour documents. If no compression, or the inefficient packbits compression, is used to capture multiple page greyscale or colour documents as a single TIFF file, file sizes can become very large, affecting the accessibility and storage of the file. To overcome this file size issue, some vendors have chosen to implement JPEG compression within the TIFF file format, providing a higher rate of compression, but with the data loss inherent in this compression scheme. As mentioned previously in this section, there is no agreed standard for the implementation of JPEG compression within the TIFF format. Using non-standard formats for the storage of digitised records may create compatibility problems with other software, perhaps preventing the images from being viewed or printed.
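To illustrate how quickly uncompressed multi-page greyscale or colour files grow, the raw size of an image can be estimated from its physical dimensions, resolution and bit depth. This is a rough sketch only: the A4 page size in inches, the 300 dpi resolution and the ten-page document are assumptions chosen for illustration, and the estimate ignores file headers and any row padding a real format may add.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel, pages=1):
    """Estimate raw image data size: pixels x bit depth x pages, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel * pages
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed example: a ten-page A4 document (8.27 x 11.69 inches) at 300 dpi.
for label, depth in (("bi-tonal", 1), ("greyscale", 8), ("colour", 24)):
    size = uncompressed_size_bytes(8.27, 11.69, 300, depth, pages=10)
    print("%-9s %6.1f MiB" % (label, size / (1024 * 1024)))
```

The colour estimate is 24 times the bi-tonal one, which is why the choice of compression matters far more for greyscale and colour documents than for bi-tonal text.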


Table 7: Image formats compared. Adapted from http://www.library.cornell.edu/preservation/tutorial/presentation/table7-1.html

| Name and Current Version | TIFF 6.0 | GIF 89a | JPEG JFIF | PNG 1.2 | PDF 1.4 |
|---|---|---|---|---|---|
| Extension | .tif | .gif | .jpg | .png | .pdf |
| Bit-depth(s) | 1-bit bi-tonal; 4- or 8-bit greyscale or palette colour; up to 64-bit colour | 1-8 bit bi-tonal, greyscale, or colour | 8-bit greyscale; 24-bit colour | 1/2/4/8-bit palette colour or greyscale; 16-bit greyscale; 24/48-bit true colour | 4- or 8-bit greyscale or palette colour; up to 64-bit colour support |
| Compression | Uncompressed; Lossless: CCITT G3/G4, LZW, Packbits; Lossy: JPEG | Lossless: LZW | Lossy: JPEG | Lossless: Deflate, an LZ77 derivative | Uncompressed; Lossless: CCITT, LZW, JBIG; Lossy: JPEG |
| Standard / Proprietary | De facto standard, LZW compression may require licence | De facto standard, LZW compression may require licence | JPEG: ISO 10918-1/2; JFIF: de facto standard | ISO 15948 | ISO 15930-1:2001. De facto standard |
| Web Support | Plug-in or external application | Native since Microsoft® Internet Explorer 3 | Native since Microsoft® Internet Explorer 2 | Native since Microsoft® Internet Explorer 4 | Plug-in or external application |
| Metadata Support | Basic set of labelled tags | Free-text comment field | Free-text comment field | Basic set of labelled tags plus user-defined tags | Basic set of labelled tags |
| Strengths | Long history and widespread support as the premier format for high quality digital images and faxes; lossless compression; multiple pages | Widespread support through web browsers and imaging applications; lossless compression | Widespread use on websites; native support in nearly all imaging applications and web browsers; variable compression rates | Native support in web browsers; royalty free; supports all commonly used bit depths; lossless compression | Widely used; viewer freely available; can be used to distribute documents with images and text that will print consistently; multiple pages |
| Limitations | Limited options for compression of colour and greyscale images; not natively supported by browsers; LZW licensing restrictions; JPEG version not widely used | Not suitable for continuous tone images or photographs due to 8-bit maximum colour depth; LZW licensing restrictions | Used for photographs or images that have continuous tone, rather than text documents or graphics; only lossy compression; not suitable for bi-tonal | Limited uptake; not supported within all imaging and scanning applications | Not an image format; requires post-processing for encapsulation; not natively supported by browsers; limited support for file creation in non-proprietary applications |
| Specification | http://partners.adobe.com/public/developer/tiff/index.html | http://www.w3.org/Graphics/GIF/spec-gif89a.txt | http://www.w3.org/Graphics/JPEG/ | http://www.w3.org/TR/PNG/ | http://partners.adobe.com/public/developer/pdf/index_reference.html |


TIFF has many sub-formats that have been developed from the original specification, baseline TIFF. It is not uncommon for software to be branded as capable of viewing TIFF files, but actually only be able to view baseline TIFF. As a “baseline TIFF reader is not required to read any images beyond the first one”30, many image viewing applications are unable to view multi-page TIFF files, and instead show only the first page.

Public authorities who decide to use any non-standard image format should investigate carefully the impact this may have on the longevity of the digitised records. Migrating non-standard images to standard formats will require additional planning and resources. In addition, public authorities risk being locked into using products from the same vendor and also risk the continued accessibility of their imaged records if the vendor leaves the marketplace.

PDF may be considered to be an alternative to TIFF for the storage of multi-page files in a single document. However, as PDF is an encapsulation format, images that are encapsulated into a PDF file cannot be manipulated, easily extracted, or have other image files derived from them. As an alternative means of accessing images, PDF is an appropriate format. However, careful consideration should be given to the likely need to manipulate, extract, or derive information from the original image before retaining only a PDF version of a digitised record.
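Because a viewer that shows only the first page gives no warning that further pages exist, it can be useful to check independently how many images a TIFF file holds. The sketch below walks the chain of image file directories (IFDs) defined in the TIFF 6.0 specification, using only the Python standard library. It assumes a classic TIFF (not BigTIFF) and treats every IFD in the main chain as one page, which will over-count files that also store thumbnails or reduced-resolution copies in that chain. The function name is hypothetical.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count the IFDs in the main chain of a classic TIFF file."""
    if data[:2] == b"II":
        endian = "<"  # little-endian byte order
    elif data[:2] == b"MM":
        endian = ">"  # big-endian byte order
    else:
        raise ValueError("not a TIFF file")
    magic, offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")

    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen or offset + 2 > len(data):
            raise ValueError("corrupt IFD chain")
        seen.add(offset)
        (entry_count,) = struct.unpack_from(endian + "H", data, offset)
        # Each directory entry is 12 bytes; the next-IFD offset follows them.
        next_pointer = offset + 2 + 12 * entry_count
        if next_pointer + 4 > len(data):
            raise ValueError("truncated IFD")
        (offset,) = struct.unpack_from(endian + "I", data, next_pointer)
        pages += 1
    return pages
```

In use, the bytes would come from reading the file in binary mode, for example `count_tiff_pages(open(path, "rb").read())`, where `path` is whichever file is being checked.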

30 TIFF Revision 6.0. 1992. Adobe Systems Inc. Accessed March 2005 at http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf

Recommended File Formats

The table below lists the file formats recommended for digitising different document types.

| Document type | File Format |
|---|---|
| Black and white text only | TIFF G3 / G4; PNG bi-tonal |
| Text with some colour | GIF colour; PNG 4- or 8-bit colour; TIFF (LZW) |
| Text with shades of grey | GIF grey; PNG 4- or 8-bit grey; TIFF (LZW) |
| Colour drawings / presentations / graphics | GIF; PNG 4- or 8-bit; TIFF (LZW) |
| Black and white photographs | JPEG 8-bit grey; PNG 8- or 16-bit grey; TIFF (JPEG) |
| Colour photographs | JPEG 24-bit (high quality compression 10:1); PNG 24-bit; TIFF (high quality JPEG compression 10:1) |

Table 8: File format recommendations

Public authorities should combine reference to these guidelines with their own testing, and refer to the specifications of the systems used to manage imaged records, before determining which file formats are compatible and most suited to their use.
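As part of such testing, it can help to confirm that a file really is in the format its extension claims. Each of the five formats discussed in this section begins with a well-known byte signature, so a first-pass check needs only the opening bytes of the file. The following is an illustrative sketch, not a validator: a matching signature does not prove the rest of the file is well formed, and the function name is hypothetical.

```python
# Leading byte signatures of the five formats described in this section.
SIGNATURES = [
    (b"II*\x00", "TIFF"),           # little-endian TIFF
    (b"MM\x00*", "TIFF"),           # big-endian TIFF
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"%PDF-", "PDF"),
]


def identify_format(header: bytes) -> str:
    """Return the format name matching the first bytes of a file."""
    for signature, name in SIGNATURES:
        if header.startswith(signature):
            return name
    return "unknown"


print(identify_format(b"\x89PNG\r\n\x1a\n" + b"\x00" * 8))
print(identify_format(b"%PDF-1.4\n"))
```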


6.5 Quality Control

Decisions relating to the technical aspects of the digitisation program described in the preceding sections should be documented and subject to quality control measures. Without this, there is no assurance that digitised records will meet stated technical specifications or fulfil their intended use.

Appropriate quality assurance procedures and guidelines are needed to ensure that digitised records meet the requirements for their use. All stakeholders in a digitisation project – sponsor, staff, users, and IT support – should be consulted to determine appropriate quality measures31. Quality measures should be agreed on and tested before image capture commences, to ensure that they can be implemented and produce acceptable results32. Periodical revision of the quality measures should occur to ensure they remain relevant to the project goals and keep up with technology, legislation, and industry trends.

Quality Baselines

Baselines for acceptable and unacceptable characteristics need to be established for digitised records so that a consistent level of quality can be maintained. These may be general, perhaps simply requiring that each digital file be visually compared to the original paper record, or complex, involving quantitative analysis of digital images using computer equipment to ensure that the properties of a digital file meet accepted international standards.

The complexity and detail of quality baselines will depend on the project’s aims and the nature of records involved. Strict and detailed quality control should be applied to digital images if the project intends to dispose of the original paper records, convert an important collection for long-term access, build or be compatible with a digital archive initiative, or make high quality reproductions of a paper document. However, quality tolerances need to be set at a level that is achievable with the staffing, time, equipment and technology resources available.
For example, if staff are required to spend hours every day fine-tuning a scanner to achieve quality baseline standards, then perhaps the equipment or the baselines need to be changed.

Quality baselines should be established for the output device that a digital image is intended for and be verified using that device33. If a digital image is intended for printing, then the digital file should be printed and then checked against quality baselines for printed images. If a digital image is intended for display on a computer monitor, quality baselines should be verified on a computer monitor.

Calibration

Digitisation projects rely on correctly calibrated hardware and software to produce high quality images that meet quality baselines. Ideally, calibration parameters should be recorded, automatically if possible, at the beginning and end of each batch of documents that are digitised. If the calibration parameters are outside predefined bounds, remedial action should be taken and the calibration process repeated until the parameters are within limits.

31 Quality Assurance. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/creating/quality.html

32 Moving Theory into Practice: Digital Imaging Tutorial. 2003. Cornell University Library/Research Department. Accessed March 2005 at http://www.library.cornell.edu/preservation/tutorial/quality/quality-01.html

33 Frey F. Guides to Quality in Visual Resource Imaging: 4. Measuring Quality Of Digital Masters. 2000. Council on Library and Information Resources. Accessed March 2005 at http://www.rlg.org/visguides/visguide4.html#4.1


The calibration settings for some equipment may need to be checked and recorded at the beginning and end of a day’s digitisation work. Other equipment may not have any calibration settings that are user-adjustable, and may only need calibration following servicing or maintenance. Exact parameters and suggested intervals for calibration should be determined with input from the hardware and software suppliers, and should be documented with other quality controls.

To establish acceptable levels of quality for digital image capture, the scanning hardware system should be tested by the use of scanner test targets or charts such as those shown in Figure 6. These can contain a wide range of material that makes it possible to judge output in carefully measured increments for such aspects as resolution, text, fonts, line widths, colour, tonal range, handwriting, and halftone.

Figure 6: Standard “targets” can be used to test the functionality of digitising equipment
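The practice of recording calibration parameters for each batch and comparing them with predefined bounds could be automated along the following lines. This is a sketch under stated assumptions: the parameter names ("gamma", "white_level") and their limits are invented for illustration, and real values would come from the hardware and software suppliers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bound:
    low: float
    high: float


def out_of_bounds(readings: dict, bounds: dict) -> list:
    """Return the names of parameters that are missing or outside their bounds."""
    failures = []
    for name, bound in bounds.items():
        value = readings.get(name)
        if value is None or not (bound.low <= value <= bound.high):
            failures.append(name)
    return failures


# Hypothetical limits and one batch's start-of-batch readings.
limits = {"gamma": Bound(2.1, 2.3), "white_level": Bound(240, 255)}
batch_start = {"gamma": 2.2, "white_level": 236}

failed = out_of_bounds(batch_start, limits)
if failed:
    print("Recalibrate before scanning; out of bounds:", ", ".join(failed))
```

Storing each set of readings alongside the batch identifier would also provide the documented record of calibration that the quality controls call for.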

Environment

A controlled environment is required to consistently apply quality baselines. In an uncontrolled environment, for example with excessive glare, reflections or an improperly set up computer system, a high quality image may be incorrectly deemed to have not met quality baselines34. While calibration of hardware and software is one part of ensuring the controlled environment that is necessary to evaluate digital images, viewing conditions also need to be considered, as a computer monitor is best viewed in lower light conditions than a paper based record. The size of the viewing screen, plus the speed, processing power and memory of the computer need to be considered to enable the retrieval and manipulation of large image files.

Workstation monitors used for scanning or quality control should be set at appropriate colour depth, gamma and colour temperature settings and a high refresh rate to avoid a flickering display. These settings will need to be set for each workstation, and also within any image manipulation software where the option to adjust settings is available. A monitor adjustment target, such as the one shown in Figure 7, can be displayed on screen when brightness and contrast adjustments are made, so that all the relevant shades and steps in the target are distinguishable from the adjacent similar shades.

User perceived image quality will depend on the capabilities of the display hardware being used, the screen size and pixel dimension capabilities. The common pixel dimensions supported by monitors range from a low of 640 x 480 to a high of 1600 x 1200, referring to the number of horizontal and vertical pixels on the screen. What area of an image can be seen on a monitor depends on the image pixel dimensions and the desktop resolution. The area of an image displayed can be increased by increasing the screen resolution or by decreasing the image resolution.
When viewing digitised text documents, typical checks can be made by examining the image at actual

34 Moving Theory into Practice: Digital Imaging Tutorial. 2003. Cornell University Library/Research Department. Accessed March 2005 at http://www.library.cornell.edu/preservation/tutorial/quality/quality-02.html


size35. However, it may be necessary to enlarge other types of digitised records such as photographs and maps to ensure details have been captured appropriately. As the number of pixels displayed increases, more of the image area can be viewed, but without also increasing the size of the monitor, details may be too small to see without zooming or magnifying.
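The relationship between image pixel dimensions and desktop resolution can be made concrete with a small calculation. The sketch below assumes the image is shown with one image pixel per screen pixel and that the whole screen is available to the viewer. The 300 dpi A4 scan (2481 x 3507 pixels) is an assumed example, not a recommendation.

```python
def visible_fraction(image_px, screen_px):
    """Fraction of an image's area visible at one image pixel per screen pixel."""
    image_w, image_h = image_px
    screen_w, screen_h = screen_px
    visible_w = min(image_w, screen_w)
    visible_h = min(image_h, screen_h)
    return (visible_w * visible_h) / (image_w * image_h)


# Assumed example: an A4 page scanned at 300 dpi is about 2481 x 3507 pixels.
a4_at_300dpi = (2481, 3507)
for screen in ((640, 480), (1280, 1024), (1600, 1200)):
    share = visible_fraction(a4_at_300dpi, screen)
    print("%4d x %4d screen shows %.0f%% of the page" % (screen[0], screen[1], share * 100))
```

Even at the highest desktop resolution listed above, only part of such a scan fits on screen at once, so quality checkers either pan across the image or view a reduced-resolution derivative.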

Scope of Quality Assessment

An important aspect of quality control is determining which characteristics will be subject to inspection and the proportion of digitised images that will be checked. In addition to comparing a digitised record to the original, the condition and features of the original may need to be checked prior to digitisation.

Almost all of the individual steps involved in converting a paper record into a readable and accessible digital image can have quality assessments placed against them. For example, following the capture of the paper record as an image, the descriptive information or metadata should be checked for completeness and correctness, the computer media that the images are stored on can be periodically checked for readability, and any computer security measures put into place to control access to digitised records can be tested. These types of checks would be common to many records managed by the organisation, not just digital images, and should be integrated into standard records management processes.

All digital images can be tested, or a representative sample of digitised documents may be selected. Testing of all digital images will ensure that all images meet the minimum required quality levels, but can be very time and resource intensive. If, however, only a sample is tested, care must be taken to ensure that the sample is representative of the range of records digitised. Many organisations and scanning bureaux routinely check a 10% sample of scanned images, chosen randomly. In some cases, such as following equipment repairs, or when using new staff or a new outsourcing vendor, each image is checked until there is confidence that the standard is being met36. However, testing only a sample of digital images gives a lower degree of certainty that all images have met quality baselines.

Qualitative and Quantitative Assessment

The quality of images may also be evaluated using software to examine technical aspects of images.
For example, noise in images is caused by random pixel fluctuations, and may

35 A 21” CRT monitor with a display resolution of 1280 x 1024 is able to display an A4 sized page at 1:1 scale.

36 Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf

Figure 7: A monitor adjustment target can be used to ensure shades of grey can be distinguished


make images appear grainy37. Software can be used to measure the level of noise in images, to check that it is minimised to an acceptable level. Measuring technical aspects gives a consistent and repeatable measure of image quality. For instance, if noise in an image is measured twice, the same levels should be detected. However, software tools to measure all aspects of quality and image appearance may not be widely used or available. Instead, some quality control relies on human judgement. Human judgement is often subjective and therefore results of visual inspections may vary from person to person. Ideally if a number of staff are responsible for visual inspections, training should be provided to communicate qualitative information effectively38.
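As a simple illustration of a repeatable, software-based measurement, the spread of pixel values in an area that should be a single flat tone (for example a blank margin or a grey patch on a test target) can serve as a rough indicator of noise. This is only one possible proxy, chosen here for brevity. The pixel values are invented, and the guideline does not prescribe a particular noise metric.

```python
import statistics


def patch_noise(pixels):
    """Standard deviation of pixel values in a nominally uniform patch."""
    return statistics.pstdev(pixels)


# Hypothetical 8-bit grey values sampled from a patch that should be flat.
clean_patch = [128] * 16
noisy_patch = [126, 131, 127, 129, 133, 124, 128, 130] * 2

print("clean:", patch_noise(clean_patch))
print("noisy:", round(patch_noise(noisy_patch), 2))

# Measuring the same patch twice gives the same figure, unlike visual judgement.
assert patch_noise(noisy_patch) == patch_noise(noisy_patch)
```

A project could then set a maximum acceptable figure as part of its quality baselines, with the threshold itself determined by testing on its own equipment.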

Recommended Quality Checks

Digital images should be inspected and checked against the attributes listed below, in addition to standard records management quality checks.

Has a true and accurate copy been made of the original record? Attributes to check include:

• image size
• image resolution
• bit depth: bitonal, greyscale or appropriate colour depth
• too light or too dark
• too low or too high contrast
• lack of sharpness
• too much sharpening, unnatural appearance and halos around dark edges
• image orientation
• skewed or not centred
• images cropped or incomplete
• missing pixels or scan lines
• poor quality dithering
• obvious use of lossy compression

Has the image file been stored correctly? Issues to check include:

• file format
• file size
• incomplete or incorrect profile information / metadata
• appropriate security applied
• necessary derivatives produced

Have procedures for the disposal of the original paper record been followed?

Has the equipment been calibrated correctly?
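Where only a sample of images is to be checked against these attributes, as with the random 10% sample described under Scope of Quality Assessment, the selection itself can be scripted so that it is unbiased and can be documented. The following is a minimal sketch: the image identifiers are hypothetical, and the optional seed is included only so that a particular selection can be reproduced for audit purposes.

```python
import random


def select_sample(image_ids, percent=10, seed=None):
    """Randomly choose at least `percent` per cent of the images for checking."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be between 1 and 100")
    ids = list(image_ids)
    if not ids:
        return []
    size = max(1, (len(ids) * percent + 99) // 100)  # round up, never zero
    rng = random.Random(seed)
    return sorted(rng.sample(ids, size))


# Hypothetical batch of 250 scanned images.
batch = ["IMG%04d" % n for n in range(1, 251)]
to_check = select_sample(batch, percent=10, seed=2006)
print(len(to_check), "of", len(batch), "images selected for inspection")
```

Setting `percent=100` covers the cases where every image is checked, such as after equipment repairs or when a new operator or vendor starts work.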

6.6 Master Files and Derivatives

The process of initial capture of a paper record as an image is often a very resource and time intensive process. The majority of this time and effort could be spent on such tasks

37 Sharma A. Digital Noise, Film Grain. 2001. Digital Photo Techniques. Accessed March 2005 at http://www.phototechmag.com/sample/sharma.pdf

38 Moving Theory into Practice: Digital Imaging Tutorial. 2003. Cornell University Library/Research Department. Accessed March 2005 at http://www.library.cornell.edu/preservation/tutorial/quality/quality-02.html


as locating the paper file, obtaining it from storage, separating the pages, and reassembling and storing the paper file after capture. The time taken to actually scan the record to the highest quality that the device allows and save it in a lossless compression format may pale in significance when compared to the time taken to handle the paper. Usually, if digitisation is undertaken at a high level of quality with good quality control measures, the original paper file may not need to be routinely accessed.

A single file format may not be appropriate for all intended uses of a digitised document. Consequently, a master file can be complemented by a number of derivatives to meet business and user needs. As outlined in section 6.4: File formats, different file formats have different characteristics and applications. TIFF files, although capturing an image without compromise, may not be suited for use on the internet. GIF and PNG files are both commonly used on the internet, but may not be suitable for high-quality printing of certain types of images. JPEG, GIF, and PNG files are often of a small enough size to be deliverable via a network, but the network delivery of TIFF and PDF files depends on their content and intended use.

Master Files

The goal of a master image file is to provide a high-quality, unedited, information-rich copy and to prevent the need for re-digitisation in the future39. A master file should capture as much information as possible from the original document, and should be of the highest quality possible. The quality available to derived files depends entirely on the quality of the master files – a poor quality master file can only result in poor quality images derived from it. The creation, use and storage of master files should be subject to strict quality control.

The image resolution is the main variable factor in controlling the quality of images from scanning equipment.
Using higher resolutions than those recommended in this document will provide a higher quality image. However, as described earlier, colour depth should be selected based on the characteristics of the paper record, and there is no real benefit in increasing the colour depth beyond the recommendations provided earlier in this chapter when creating a master file.

Master files should be protected from damage through excessive handling or overuse of media and also kept secure from deliberate change or deletion. In many ways, the traditional process of microfilming records is similar to this aspect of the digitisation process. Just as several copies of microfilm are made for a variety of purposes, several copies of an imaged record may be made to preserve the image while providing access.

Derived files

Several types of files that can be derived from the master image file are described below. The process of creating these derivatives will vary, but should be implemented as a routine part of the digitisation process. Derived files typically have a smaller file size than the original images and the storage of these extra files should be considered when determining the system requirements for storage of digitised records.

It is possible for derived files to be generated by the system when they are requested. For example, the Microsoft Windows XP operating system automatically generates thumbnails when a user views a directory of images. However, for a multi-user system such as an eDRMS, the one time generation and then storage of derived files will probably be more efficient. The relatively small amount of computer storage required for the storage of

39

Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf

43

Queensland State Archives: Guideline for the Digitisation of Paper Records

derivatives is generally preferred to using the processing required to generate derivatives on the fly. Any stored derivatives should be managed, so that they are updated or deleted in sync with the master file. Some image processing and management software may have the ability to modify the appearance of a digitised record by adding information such as the date or organisation name. Two such techniques are watermarking and fingerprinting. Watermarking is the inclusion of static information on an image at time of storage, perhaps the name of the organisation and date of capture. Fingerprinting typically includes information generated when the image is accessed, such as login name of the end user and date / time information. While this information may be useful, and the inclusion of it as part of the image convenient, it may be considered that these modified images are no longer a true and accurate copy of the original paper records. This may be especially relevant where this added information, such as a large watermark through the text, makes the content of the record difficult to read. Public authorities should consider retaining the master image as an unmodified representation of the paper record or captured this additional information as metadata rather than part of the image. Any allowed modifications must be documented in policies and procedures. Combined with establishing appropriate processes for accessing files, using derived copies of a master file, and leaving the master itself in off-line secure storage can help reduce the risk of corruption, alteration or loss of the master files. Derived images should be managed, usually in an eDRMS or other system, to ensure they are accessible with similar constraints and permissions as the master image and paper record. Both master and any derived versions of a scanned record should be destroyed when the paper record is authorised for destruction, unless there is a business need for on-going retention.

• Backup copies

Exact copies of the master file may be made and stored in separate locations to avoid a total loss of information in the event of a disaster such as a fire or computer virus attack.

• Access copies

The most commonly used version of the scanned record should be a complete and accurate representation of the paper record, while having characteristics that allow it to be easily distributed, printed and viewed. For digitisation programs with the aim of providing improved access to records or integrating with electronic systems, the access version may be the only derivative required. Access files should meet the technical recommendations for file format and resolution provided earlier in this chapter. PDF is recommended for use as an access file format, provided the images encapsulated in the PDF file are retained separately.

Access images are typically of a lower quality than master images, in a format that is more accessible to a wider audience. Access copies normally compromise on quality, perhaps with a more severe compression ratio, so that files can be easily accessed through on-line storage, over a network connection or via a web page. The purpose of an access image should dictate its characteristics. If an access image is provided for viewing on a computer monitor, it should be sized to fit within the viewing area of an average monitor. If the file is to be accessed through a low bandwidth network connection, minimising file size is critical.

• Thumbnails

Thumbnail images are very small images designed to display instantaneously on web pages and file management software, allowing users to determine whether they want to view an access image. Thumbnails are best used when dealing with a collection of pictorial images, but they are not very useful for images of text documents due to the difficulty in determining the textual content within a very small image40.

• Text

Using optical character recognition (OCR), the text depicted in a scanned paper document can be extracted as a text file or word processor document. OCR software is required to recognise the text contained in the image and usually provides search and export capabilities. OCR is rarely a fully automated process and may require operator intervention to assist in obtaining an accurate transcription of the scanned record’s text. Documents containing handwriting, serif fonts41, halftones, and background text or images or those that are damaged or dirty may not be suited to the OCR process. As with other derived files, if OCR is used to derive a text document from an imaged record, this additional file should be managed by the system.
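The requirement that backup copies be exact copies of the master file can be checked mechanically. The following is a minimal sketch, assuming a checksum comparison (SHA-256 here) is an acceptable test of exactness; the function names and directory layout are hypothetical and are not prescribed by this guideline.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_verified_backup(master: Path, backup_dir: Path) -> Path:
    """Copy a master file into backup_dir and confirm the copy is bit-identical."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup = backup_dir / master.name
    shutil.copy2(master, backup)  # copy2 also preserves timestamps where possible
    if sha256_of(master) != sha256_of(backup):
        raise IOError(f"Backup of {master.name} does not match the master file")
    return backup
```

Storing the digest alongside the record's metadata would also allow the same comparison to be repeated later, for example when media are refreshed.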

Recommended Derivatives

Public authorities digitising records to improve access and integrate into other business systems should aim to create images that meet the technical recommendations provided earlier in this chapter. Those organisations with additional capacity, wishing to take advantage of future technical advances without having to rescan paper records, should consider creating a high quality master copy in addition to an access copy. Any modification of the appearance of the image should be performed using a copy, with an unmodified original also retained. The systems used to manage the digitised paper records should be tested for their ability to manage multiple images if the decision to create derived versions of imaged records is made.

40 Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf

41 A small decorative line added as embellishment to the basic form of a character. Typefaces are often described as being serif or sans serif (without serifs). The most common serif typeface is Times New Roman. A common sans serif typeface is Arial.

7: Metadata

Metadata is often described simply as 'data about data'. A more useful definition of metadata is “structured information that describes and/or allows us to find, manage, control, understand or preserve other information over time”42. While metadata has always been utilised in the recordkeeping and archival professions, it has only been described as 'metadata' for the past decade. Many routine operations such as the profiling of records and files, cataloguing of library resources, and describing archival items can all be described as metadata collection. Metadata describes information objects or resources and may be used for many purposes, including the management, control and discovery of records43. As outlined in Information Standard 34 – Metadata (IS34), metadata must be used to maintain the context, content and structure of records in electronic environments. The creation, retention and preservation of metadata is integral to the concept of records as evidence44. Ideally the collection of metadata for digitised documents will be part of an agency-wide metadata strategy which is consistent with the requirements of IS34. A metadata standard (also known as a schema) provides a list of the elements that define the individual pieces of information that should be captured to describe the record. Use of a metadata standard as opposed to a locally developed set of metadata elements will:

• encourage best practice;
• assist the end users;
• avoid redoing work that has already been done elsewhere;
• provide system vendors with certainty; and
• support interoperability between applications.45

System developers, vendors and records management staff should have a good understanding of metadata standards to facilitate their implementation in an organisation. Vendors and system developers providing business information systems and applications used to manage digitised records of public authorities should ensure the capability exists to record metadata to the appropriate standard.46 Records managers should be involved in liaising with these parties to certify that appropriate metadata is able to be captured, stored and managed in the system they are purchasing. Tools such as templates and data entry forms which facilitate the entry of metadata in a user friendly manner may be supplied by the vendor or developed in house. Additionally, records managers should develop in-house metadata procedures and policies to suit their particular business requirements.

Users of locally developed systems, or those who manually record metadata in a spreadsheet or records ledger, will need to be aware of the elements that make up the relevant metadata standards and aim to record them correctly.

42 DIRKS – Glossary. 2001. National Archives of Australia. Accessed March 2005 at http://www.naa.gov.au/recordkeeping/dirks/dirksman/glossary.html

43 Glossary of Archival and Recordkeeping Terms. 2004. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/downloads/GlossaryOfArchivalRKTerms.pdf

44 Information Standard 34, Metadata. 2004. Office of Government ICT. Accessed March 2005 at http://www.governmentict.qld.gov.au/02_infostand/standards/is34.htm

45 Cunningham A. Metadata Standards in Australia – An Overview. 2005. Presentation at Queensland State Archives March 2005. National Archives of Australia.

46 Overview of Classification Tools for Records Management. 2003. National Archives of Australia. Accessed March 2005 at http://www.naa.gov.au/recordkeeping/control/tools.pdf

Capturing and maintaining accurate metadata for digitised paper records is essential, as information cannot be effectively managed or used without metadata. The scope of this document does not extend to providing a full description of recordkeeping metadata and its use, which applies to all records and recordkeeping systems. Instead, the focus is placed on the recording and management of specific metadata pertaining to digitised records.

7.1 Metadata Types

There are many types of metadata tailored to data types including geographic information, financial applications, and rights management. Three main metadata types apply to the digitisation process and subsequent management of records, namely resource discovery, recordkeeping, and technical imaging metadata.

Resource Discovery Metadata

Resource discovery metadata contributes to enabling the discovery and management of online information resources. Information Standard 34: Metadata mandates the use of the Australian Government Locator Service (AGLS) metadata standard, or other standards compatible with it, for all information resources, including records.47 Information Standard 34: Metadata requires that public authorities, at a minimum, adopt metadata schemes that are interoperable with the AGLS Metadata Element set and are consistent with the Queensland Government AGLS Element Implementation Standard. AGLS is a resource discovery standard and does not include elements required for records management processes, such as disposal.

Recordkeeping Metadata

Information Standard 40, Recordkeeping, recommends the use of the National Archives of Australia’s Recordkeeping Metadata Standard for Commonwealth Agencies. This standard is compatible with AGLS and extends AGLS by including elements required for managing records. The recording and maintenance of recordkeeping metadata for digitised records can assist with:

• A means of searching and identification,
• Authentication,
• Preservation of the content and context,
• Information on retention and disposal,
• Auditing and restriction of use, and
• Interoperability with other systems.48

An Australian recordkeeping metadata standard is currently under development. It is likely that the National Archives of Australia’s metadata standard will be revised in the future to align with the national standard. It is expected this national standard will include guidance on implementation. Further information on recordkeeping metadata can be found in the Public Records Alert, Understanding and Applying Recordkeeping Metadata49.

47 For further information on AGLS including the list of elements see National Archives of Australia’s AGLS implementation manual at http://www.naa.gov.au/recordkeeping/gov_online/agls/cim/cim_examples.html

48 Cunningham A. Metadata Standards in Australia – An Overview. 2005. Presentation at Queensland State Archives March 2005. National Archives of Australia.

49 Public Records Alert No 2/05: Understanding and Applying Recordkeeping Metadata. 2005. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/publications/PublicRecordsAlert/PRA205.pdf

Technical Imaging Metadata

Information about the digitisation process such as the operator’s login name, date of scanning, technical scanning parameters and equipment used should be recorded for digitised records. The National Information Standards Organization (NISO) has developed a detailed draft metadata standard for still images. The draft standard for trial use was released in June 2002, and is awaiting approval. The draft standard contains four categories of metadata elements, described below50.

• Image parameters: items that are fundamental to the reconstruction of the digital file as a viewable image on electronically interfaced displays.

• Image creation: descriptive technical metadata; information with respect to the logistics and administrative conditions surrounding digital image data capture.

• Imaging performance assessment: attributes of the image inherent to its quality.

• Change history: documenting processes applied to image data over the lifecycle of an image.

While the above NISO draft standard would comprehensively describe a digital image, it does not include other important information relating to recordkeeping which will be covered by the recordkeeping metadata elements. Overlap of metadata elements when multiple schemas are used should be addressed to ensure that the information is only recorded once, but referenced in all schemas. The following table shows an example of technical metadata collected by the State Library of Queensland to facilitate management, storage and preservation of digital files51.

| Element | Example |
|---|---|
| Accession number | 99-5-4 |
| File name | 2468.tif |
| Digital image creator | Image Production Unit |
| Link date | 22 May 2002 (automatically generated by ImageServer) |
| Capture device | Heidelberg Nexscan |
| Capture software | Silver Fast |
| Manipulation | Photoshop 6 |
| Digital source | Copy negative |
| Scan resolution | 600dpi |
| File bit depth | 8bit |
| Format | tiff |
| Colour | Greyscale |
| Image manipulation | (level contrast unsharpmask) |
| Colour profiles | No |

Table 9: State Library Queensland Technical Metadata Elements
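Scan resolution and bit depth, both recorded in the table above, largely determine how much storage an uncompressed master image needs. The following is a rough sketch of that arithmetic; the A4 page size and the function name are assumptions made for illustration, and compressed formats will usually produce smaller files.

```python
MM_PER_INCH = 25.4


def uncompressed_size_bytes(width_mm: float, height_mm: float,
                            dpi: int, bits_per_pixel: int) -> int:
    """Estimate the uncompressed size of a scanned page in bytes."""
    width_px = round(width_mm / MM_PER_INCH * dpi)    # pixels across the page
    height_px = round(height_mm / MM_PER_INCH * dpi)  # pixels down the page
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8                      # round up to whole bytes


# Assumed example: an A4 page (210 mm x 297 mm) at 600 dpi, 8-bit greyscale.
a4_greyscale = uncompressed_size_bytes(210, 297, 600, 8)
print(f"{a4_greyscale / 1_000_000:.1f} MB")  # about 34.8 MB before compression
```

Because both pixel dimensions scale with resolution, doubling the dpi roughly quadruples the uncompressed size.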

Other metadata elements that may be considered include file size, compression technique, creation hardware, creation software and creation methodology52.

Appropriate metadata for digitised paper records will depend on the agency and the rationale of the digitisation project. Choosing a small number of metadata elements may cut down on data entry time, but may result in vital information missing from the collection and limit resource discovery. Deciding to record an excessive number of metadata elements will create a metadata database that is very large. A large number of manually entered metadata elements will also burden staff who are required to enter the information, potentially leading to a lack of attention to detail and a poor quality collection. Selecting the metadata elements to record, and determining which of these are mandatory or optional to record should be part of the early phases of the digitisation project.

50 Data Dictionary – Technical Metadata for Digital Still Images. 2003. National Information Standards Organization and AIIM International. Accessed March 2005 at http://www.niso.org/standards/resources/Z39_87_trial_use.pdf

51 Digital Standard 1 – Cataloguing and Metadata for Digital Images. 2003. State Library of Queensland. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/file/5449/sd1_meta_v1.2.doc

52 Metadata and Digital Images. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://tasi.ac.uk/advice/delivering/metadata.html and Suggested Technical Metadata Elements. 2004. Indiana Digital Library. Accessed March 2005 at www.statelib.lib.in.us/www/isl/diglibin/techmeta.pdf.
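Once the mandatory and optional elements have been chosen, the decision can be written down in a form that software can check. The sketch below assumes a simple dictionary of element names; the names loosely follow the State Library of Queensland example, but which elements are marked mandatory here is purely illustrative and not a recommendation of this guideline.

```python
# Hypothetical element set: True marks an element as mandatory, False as optional.
ELEMENTS = {
    "accession_number": True,
    "file_name": True,
    "capture_device": True,
    "scan_resolution": True,
    "file_bit_depth": True,
    "format": True,
    "capture_software": False,
    "colour_profiles": False,
}


def check_record(record: dict) -> dict:
    """Report mandatory elements that are missing or empty, and unknown elements."""
    missing = sorted(
        name for name, mandatory in ELEMENTS.items()
        if mandatory and not record.get(name)
    )
    unknown = sorted(name for name in record if name not in ELEMENTS)
    return {"missing": missing, "unknown": unknown}


result = check_record({
    "accession_number": "99-5-4",
    "file_name": "2468.tif",
    "scan_resolution": "600dpi",
    "operator": "jsmith",  # not part of the agreed element set
})
print(result)
```

A check of this kind could be run at data entry or as part of periodic quality control.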

7.2 Capturing Metadata

As the manual collection and entry of metadata is a mundane task, automating the entry of as many elements as possible should be a priority. Some metadata, such as titles and comments, needs to be manually collected and linked to the digitised file. Other metadata, such as recording the technical properties of a scanned image, can be collected automatically. Table 10 shows a small sample of the metadata elements that can be automatically captured.

| Element | Source | Comment |
|---|---|---|
| Operator name | Computer operating system | Name can be extracted from the login account of the user |
| Capture device | Computer operating system | Including hardware, software, driver versions |
| Date of capture | Computer operating system | |
| Device calibration results | Device driver | |
| Time since calibration | Device driver / Computer operating system | |
| Image resolution, colour depth, compression, file format & sub-formats | Imaging software | |
| Image file name | Imaging software | |
| Image’s parent collection details | Imaging software | Entered initially by the operator and retained for subsequent images until changed |

Table 10: Some metadata elements that may be automatically captured

The automation of the entry of these technical details will allow staff to concentrate on entering the details that cannot be automatically captured. Any metadata entry, including those elements captured automatically, needs to be periodically or randomly checked to ensure it is being captured correctly. Creating and maintaining metadata requires on-going resources, staff and perseverance. Collection of metadata is most cost-effective if it occurs at the time a digital file is created, as retrospectively collecting metadata is a large and tedious task. The processes for metadata collection should be easy to use and follow and appropriate support (such as manuals) should be provided53. Training for staff involved in the creation and maintenance of metadata is critical to the successful collection of metadata.
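Several of the elements listed in Table 10 can be read from the operating system without operator input. The following is a minimal sketch using only the Python standard library; the field names are hypothetical, and device-specific details such as calibration results would have to come from the scanner driver or imaging software, which this sketch does not attempt.

```python
import getpass
import platform
from datetime import datetime, timezone
from pathlib import Path


def capture_automatic_metadata(image_path: Path) -> dict:
    """Collect metadata that the operating system can supply for an image file."""
    try:
        operator = getpass.getuser()  # login account of the current user
    except (KeyError, OSError, ImportError):
        operator = "unknown"          # no account information available
    return {
        "operator_name": operator,
        "capture_host": platform.node(),
        "operating_system": f"{platform.system()} {platform.release()}",
        # Time this metadata was recorded; assumed to be at or near capture.
        "date_recorded": datetime.now(timezone.utc).isoformat(),
        "file_name": image_path.name,
        "file_format": image_path.suffix.lstrip(".").lower(),
        "file_size_bytes": image_path.stat().st_size,
    }
```

Values collected this way should still be included in the periodic checks described above, since an incorrect system clock or a shared login account would silently produce misleading metadata.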

53 Thornely J. The How of Metadata: Metadata Creation and Standards. 1999. 13th National Cataloguing Conference, October 1999. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/file/6289/How_of_Metadata.doc

7.3 File Naming Conventions

In many systems, a file name is the primary identifier for electronic files and places electronic records in context with other electronic records and activities54. Consideration of file naming conventions is important in a digitisation project as consistent file naming allows easy management, retrieval and use of digital records. Ideally, file naming within digitisation projects should be part of an agency-wide file naming policy for electronic records.

In larger digitisation implementations, where a system such as an eDRMS is used to manage the imaged records, end users may not have any interaction whatsoever with the actual image files. Unlike these large systems where the file names of the image files are essentially transparent to the users, smaller systems and manual digitisation processes may require the naming of files in a consistent and understandable manner.

Descriptive File Names

General guidelines for file naming conventions suggest that file names should be unique, follow a predictable pattern, provide a meaningful title and be structured and easy to follow. File names should use standard extensions such as .jpg and .tif. In addition, file names should avoid spaces, tabs and characters used by the system (such as “/”, “?”, and “|”) that might not work across platforms or might be incompatible with the operating system and storage media used for digital files.

The creator of an electronic file is responsible for naming the file so that the file can be communicated to other people. Inconsistent naming of files can make locating files problematic, lead to frustrating searches and wasted time, and may result in information being unavailable when it is needed. These principles apply to descriptive file names that are manually generated and non-descriptive file names that may be generated automatically by software.

A number of different elements may be included in a file naming convention. Some suggested elements may include the following.55

| Item | Example | Filename Component |
|---|---|---|
| Version Number | version 1 | v1, vers1 |
| Date Of Creation | February 24, 2001 | 022401, 02_24_01 |
| Name Of Creator | Rupert B. Smith | RBSmith, RBS |
| Description Of Content | media kit | medkit, mk |
| Name Of Intended Audience | general public | pub |
| Name Of Group Associated With The Record | Committee ABC | CommABC |
| Release Date | released on June 11, 2001 at 8:00 a.m. central time | 61101_0800CT |
| Publication Date | published on December 24, 2003 | pub122403 |
| Project Number | project number 739 | PN739 |
| Department Number | Department 140 | Dept140 |
| Records Series | SeriesX | s_x |

Table 11: Inclusion of file characteristics as parts of the file name
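The components in Table 11 can be assembled by a small routine so that every file name follows the same pattern and avoids problem characters. The following is a sketch only: the component order, the underscore separator and the function names are assumptions, the date uses the month-day-year form shown in the table, and the set of removed characters extends the examples given in the text with others commonly reserved by file systems.

```python
import re
from datetime import date

# Whitespace plus characters that commonly cause cross-platform problems.
_UNSAFE = re.compile(r'[\s/?|\\:*"<>]+')


def clean_component(text: str) -> str:
    """Remove spaces, tabs and reserved characters from one name component."""
    return _UNSAFE.sub("", text)


def build_file_name(components, created: date, version: int, extension: str) -> str:
    """Join cleaned components, a creation date and a version into a file name."""
    parts = [clean_component(component) for component in components]
    parts.append(created.strftime("%m%d%y"))  # e.g. 24 February 2001 -> 022401
    parts.append(f"v{version}")
    stem = "_".join(part for part in parts if part)
    return f"{stem}.{extension.lstrip('.').lower()}"


print(build_file_name(["med kit", "RBS"], date(2001, 2, 24), 1, ".TIF"))
# medkit_RBS_022401_v1.tif
```

An agency that wants file names to sort chronologically might prefer a year-first date component; that would be a local policy choice rather than something shown in the table.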

Another consideration in file naming conventions is how derivatives are distinguished from master files. The State Library of Queensland56 provides detailed naming conventions where a qualifier is added to the filename to indicate if the file is a master, preview, research or thumbnail image, as shown in Table 12.

54 Electronic Records Management Guidelines: File Naming. 2004. Minnesota State Archives. Accessed March 2005 at http://www.mnhs.org/preserve/records/electronicrecords/erfnaming.html

55 Electronic Records Management Guidelines: File Naming. 2004. Minnesota State Archives. Accessed March 2005 at http://www.mnhs.org/preserve/records/electronicrecords/erfnaming.html

56 Digital Standard 2 – Digital capture, format & preservation. 2003. State Library of Queensland. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/word_doc/12788/sd2_digcapture1.doc

If an embedded file naming convention is used, the derivative information could be included in the folder structure. In some cases, the file format extension may be sufficient to distinguish between derivatives. It may be beneficial for derived versions of a file to be able to be linked back to the master file through the file name57. If the file name of a derivative is going to be used to retrieve higher quality versions of a file, then the derivative file name must include enough information to link it to other versions as well as the master file.

| File Type | Filename Suffix |
|---|---|
| Master | .tif |
| Preview | p.jpg |
| Research | r.jpg |
| Thumbnail | b.jpg |

Table 12: Derivative Qualifiers used by the State Library of Queensland
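The qualifiers in Table 12 make it possible to move between a master file name and its derivatives in both directions. The sketch below assumes that each derivative name is formed by appending the qualifier to the master's base name; the function names are hypothetical.

```python
from pathlib import PurePath

# Suffixes taken from Table 12; masters are assumed to use the .tif extension.
SUFFIXES = {"preview": "p.jpg", "research": "r.jpg", "thumbnail": "b.jpg"}
MASTER_EXTENSION = ".tif"


def derivative_name(master_name: str, kind: str) -> str:
    """Return the derivative file name for a master, e.g. 2468.tif -> 2468p.jpg."""
    return PurePath(master_name).stem + SUFFIXES[kind]


def master_name_for(derivative: str) -> str:
    """Recover the master file name from a derivative file name."""
    for suffix in SUFFIXES.values():
        if derivative.endswith(suffix) and len(derivative) > len(suffix):
            return derivative[: -len(suffix)] + MASTER_EXTENSION
    raise ValueError(f"{derivative} does not carry a known derivative suffix")


print(derivative_name("2468.tif", "thumbnail"))  # 2468b.jpg
print(master_name_for("2468r.jpg"))              # 2468.tif
```

The reverse lookup is only reliable when every .jpg file in the collection is a derivative named this way; a .jpg file named by some other rule could be misread.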

Non-descriptive File Names

The file naming conventions discussed so far have provided meaningful names for digital files. An alternative to meaningful names is to have non-descriptive file names. Non-descriptive file names are often computer or machine generated numerical strings, which have little or no correspondence to the content of the file58. Many software packages designed for the rapid digitisation of a large amount of paper records will use sequential numeric or alphanumeric filenames – the operator setting the starting point for the file name and the software automatically incrementing as subsequent files are digitised. If non-descriptive file names are used, the files must be associated with file metadata to allow the file to be identified. It is crucial that the link between the image file and the descriptive information is maintained, as the file names will probably be meaningless to the human eye.
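The sequential scheme described above, together with the link to descriptive information, can be sketched as follows. This is an illustration under assumed conventions (six-digit zero-padded numbers, a .tif extension and an in-memory index); a real system would hold the index in its metadata store.

```python
from itertools import count


def sequential_names(start: int, extension: str = ".tif", width: int = 6):
    """Yield zero-padded sequential file names beginning at the operator's start number."""
    for number in count(start):
        yield f"{number:0{width}d}{extension}"


def register_batch(descriptions, start: int, index: dict) -> dict:
    """Assign the next file names to a batch and record what each file contains."""
    names = sequential_names(start)
    for description in descriptions:
        name = next(names)
        if name in index:
            raise ValueError(f"{name} is already registered")
        index[name] = description  # the link between file name and description
    return index


index = register_batch(["Committee minutes, page 1", "Committee minutes, page 2"], 2468, {})
print(index)
```

Refusing to reuse a name already in the index guards against an operator restarting the sequence at a number that has been used before.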

57 Technical Guidelines for Digitizing Archival Materials for Electronic Access. 2004. National Archives and Records Administration (US). Accessed March 2005 at http://www.archives.gov/research_room/arc/arc_info/techguide_raster_june2004.pdf

58 Guidelines for management, appraisal and preservation of electronic records. 1999. Public Record Office, The National Archives (UK). Accessed March 2005 at http://www.nationalarchives.gov.uk/electronicrecords/advice/pdf/procedures2.pdf

Recommended Metadata

In terms of metadata, imaged records should be described in a similar fashion to their paper originals, with the additional recording of information about the digitisation process. A standard, such as the Recordkeeping Metadata Standard for Commonwealth Agencies, should be used, which also fulfils the Information Standard 34 requirement for compliance with AGLS. Additional metadata should be recorded to document the image using an appropriate technical image metadata schema. In addition, disposal metadata should capture information about the disposal of both the original paper record and the image. If acquiring a new system for the management of digitised records, system vendors should be approached regarding the capability of their system to record metadata to defined standards in an unobtrusive and convenient manner. File naming conventions need only be considered for small systems where the image files are accessible to the end users. When descriptive filenames are used, they should avoid incompatibilities with operating systems and be understood by appropriate staff.

8: Storage and Media Options

Digital records can be stored in a variety of ways, each having an impact on the convenience, security and longevity of the records. Paper records may be secure and preserved if stored in a climate controlled archive repository, but can be inconvenient to access. On the other hand, while a file sitting on a desktop can be accessed almost instantly, its security and integrity can easily be compromised. As the characteristics of access, security and preservation are often given as the impetus of a record digitisation program, the appropriate storage of digital records is crucial to the program’s success.

[Figure not reproduced. Labels recoverable from the figure: Cache; Hard Disks; Secondary hard disks / optical library; Tapes; Cost per Mb; Access Time; Capacity.]

Figure 8: The relationship between access time, cost and capacity of various computer storage types.

8.1 On-line, Near-line, and Off-line Storage

For computer files, including digitised paper records, options for storage can be categorised as on-line, near-line and off-line. A digitisation program will normally make use of a combination of storage options, for instance off-line storage for master copies of digitised images with on-line storage for derived versions of a file. Ideally, at least the master file will be duplicated in a number of locations to protect against accidental damage, corruption or media deterioration, and to provide disaster recovery options.

On-line storage refers to the storage of files on networks or hard disks so that files are immediately available to a computer system. On-line storage provides the fastest and easiest access to files, but is often expensive. If the files are large, high bandwidth networks may be needed to provide fast and easy access. On-line storage is normally the most limited in the amount of storage space that is offered.

Near-line storage offers lower costs than on-line storage at the expense of access speed. Near-line storage media include magnetic disks, magneto-optical storage and robotically accessed optical media and tape libraries. Files that are stored using near-line storage should still be accessible without human intervention, but take longer to access than on-line files, so are best suited to infrequently accessed materials. Near-line storage capacity can normally be easily increased to meet demand.

On- and near-line storage allows digital files to be quickly supplied to users. Increased accessibility does increase the risk of unauthorised access to or tampering with digital files. Bandwidth and security issues need to be considered for on- and near-line storage to be successfully deployed59.

Off-line storage is often used to store rarely accessed records or backup copies. It is typically inexpensive, but access costs to files can be very high since the media requires manual location and loading60. Access to files is normally accompanied by delays as the files are located and the sequential nature of tape storage adds further delays.

59 Frey F. Guides to Quality in Visual Resource Imaging: 5. File Formats for Digital Masters. 2000. Council on Library and Information Resources. Accessed March 2005 at http://www.rlg.org/visguides/visguide5.html

8.2 Media Types

Magnetic Media

Magnetic media includes magnetic tapes and magnetic disks such as hard drives and floppy drives61. Magnetic media usually has a lifespan of 10 to 20 years, although the lifespan of magnetic media can be extended if appropriate storage conditions are used62. Magnetic media can be used to provide off-line, near-line or on-line access to files.

Magnetic tapes have a relatively low cost with large storage capacities of up to several hundred gigabytes per tape. Tapes are sequential media and as such cannot be considered as an alternative to random access media such as disks for routine access to information. Instead magnetic tapes are widely used for long term, off-line file storage or backup63, where their low cost per megabyte is most appropriate. Robotically controlled tape libraries can provide for a huge volume of information to be available near-line.

Magnetic disks have a higher cost than magnetic tape, but with the benefit of providing faster access to information. Personal computer hard disks and server disk arrays are examples of magnetic disks. Magnetic disks are generally used for on-line storage; however, it is becoming commonplace for cheaper, lower performance disks to be used as an on-line backup or spare in a near-line capacity.

Optical Media

Optical media use lasers to read data from a metallic coating on a disk. Optical media include Compact Discs (CDs) and Digital Versatile Discs (DVDs). CDs and DVDs may be read only (e.g. CD-ROM), writeable once (such as CD-Rs) or writeable many times (e.g. CD-RW, DVD+RW). Optical disks are a common medium for digital file storage, transportation and publication. The main advantage that optical media have over magnetic media is that their life expectancy is more predictable, as their longevity is determined by the properties of the optical material rather than wear and tear on the media64. Optical media can provide near-line or off-line storage.

It is not possible to alter or delete information from write once, read many (WORM) optical media. This provides an assurance that the imaged records that they store have not been altered or deleted. Conversely, the implementation of disposal decisions for digitised records stored on WORM media can be complicated if imaged records of differing retention periods are stored on the same disk. The nature of this type of media means that it cannot be reused.

60 Creating and Managing Digital Content – Capture Your Collections. 2002. Canadian Heritage Information Network. Accessed March 2005 at http://www.chin.gc.ca/English/Digital_Content/Capture_Collections/maintenance.html

61 The Preservation Management of Digital Material Handbook, Chapter 5: Media and Formats. 2002. Digital Preservation Coalition. Accessed March 2005 at http://www.dpconline.org/graphics/medfor/media.html

62 Frey F. Guides to Quality in Visual Resource Imaging: 5. File Formats for Digital Masters. 2000. Council on Library and Information Resources. Accessed March 2005 at http://www.rlg.org/visguides/visguide5.html

63 Electronic Records Management Guidelines: Digital Media. 2004. Minnesota State Archives. Accessed March 2005 at http://www.mnhs.org/preserve/records/electronicrecords/erdigital.html

64 Frey F. Guides to Quality in Visual Resource Imaging: 5. File Formats for Digital Masters. 2000. Council on Library and Information Resources. Accessed March 2005 at http://www.rlg.org/visguides/visguide5.html


Optical media are normally a cost effective option for storage. Media may have a unit cost of only a few cents when purchased in reasonable quantities. However, recording and retrieving files from optical media can often be slow, and locating the disc that a file is stored on may complicate access to digitised documents. Optical media work well as a storage option for small projects. Using optical media in large projects may lead to high storage and retrieval costs [65].
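Section 8.2 notes that disposal becomes complicated when records with differing retention periods share one WORM disc. One way to reduce this difficulty is to batch records by disposal year before they are written. The following Python sketch illustrates the idea only: the `ImagedRecord` class, the `disposal_year` field, the record identifiers and the `batch_by_disposal_year` helper are all hypothetical and do not correspond to any particular eDRMS.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImagedRecord:
    """Hypothetical description of one digitised record awaiting storage."""
    record_id: str
    disposal_year: int  # year the retention period ends (assumed already known)


def batch_by_disposal_year(records):
    """Group record identifiers so each WORM disc holds a single disposal year."""
    batches = defaultdict(list)
    for record in records:
        batches[record.disposal_year].append(record.record_id)
    # Sort years and identifiers so the writing order is predictable.
    return {year: sorted(ids) for year, ids in sorted(batches.items())}


records = [
    ImagedRecord("A-001", 2013),
    ImagedRecord("B-107", 2026),
    ImagedRecord("A-002", 2013),
]
batches = batch_by_disposal_year(records)
for year, ids in batches.items():
    print(year, ids)
```

With batches of this kind, a whole disc can be destroyed when its disposal year is reached, without first copying records that must be retained.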

8.3 Media Lifecycle

A range of media is available for storing digital files in on-line, near-line or off-line capacities. Choosing an appropriate medium for the storage of digital files is important to ensure on-going accessibility of the files. The rate of media obsolescence, and the reliance on hardware and software for access to media, require that careful consideration is given to the media used in digitisation projects. Media types may lose popularity and become difficult to read over time due to a lack of available equipment. For example, reading Beta video cassettes or 5.25” disks is problematic because the hardware required is no longer readily available. Digitised documents will need to be copied to new media if they are to remain accessible.

Generally, the lifetime of hardware and software is shorter than the lifetime of digital media. A five year timeframe has been suggested for data refreshing (copying files to new media) [66].

Media Life and Deterioration

Unlike paper documents, digital files cannot be easily examined to determine whether they are still legible. Most digital media become obsolete or lose information faster than words on paper. Hardware and software are also required to interpret and display a digital file so that its legibility can be checked. Hence the storage of digital files requires on-going, regular maintenance to ensure that files remain readable by contemporary hardware and software, and that the media the files are stored on do not decay.

Different media have varying life expectancies. For example, microfilm is expected to have a shelf life of 500 years, whereas Compact Disc (CD) life may be as short as 2 years. Recently some manufacturers have released “archive quality” CD and Digital Versatile Disc (DVD) media which are said to have a shelf life of up to 50 years, but many CDs, DVDs and tapes lose data within a short period after their creation (2-30 years [67]).

Figure 9: Media Life Expectancy. From http://www.caps-project.org/cache/DigitalMediaLifeExpectancyAndCare.html
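The five year refresh timeframe suggested in this section can be turned into a simple date check. The Python sketch below is an illustration only: the function names and the idea of recording a "written on" date for each piece of media are assumptions, and the interval should follow whatever schedule the agency actually adopts.

```python
from datetime import date


def next_refresh_due(written_on: date, interval_years: int = 5) -> date:
    """Return the date by which files should be copied to new media."""
    try:
        return written_on.replace(year=written_on.year + interval_years)
    except ValueError:
        # 29 February has no counterpart in a non-leap year; use 28 February.
        return written_on.replace(year=written_on.year + interval_years, day=28)


def is_refresh_overdue(written_on: date, today: date, interval_years: int = 5) -> bool:
    """True when the refresh date for a piece of media has already passed."""
    return today > next_refresh_due(written_on, interval_years)


print(next_refresh_due(date(2006, 4, 1)))  # 2011-04-01
print(is_refresh_overdue(date(2000, 1, 1), today=date(2006, 4, 1)))  # True
```

A check of this kind only flags media by age. It does not replace testing whether the files can still be read.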

65 Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf
66 Rothenberg, J. Ensuring the Longevity of Digital Information. 1999. Council on Library and Information Resources. Accessed March 2005 at http://www.clir.org/pubs/archives/ensuring.pdf
67 The Preservation Management of Digital Material Handbook, Chapter 5: Media and Formats. 2002. Digital Preservation Coalition. Accessed March 2005 at http://www.dpconline.org/graphics/medfor/media.html


Often, the longevity of storage media is less important than adequate plans to migrate and refresh files for compatibility with contemporary hardware and software [68].

Media Refreshing

To ensure the continued accessibility of digitised records and to prevent information loss, a testing and re-mastering schedule should be implemented and a strategy drawn up for migrating the images and metadata to new media and new formats when necessary [69]. Known as refreshing, this may involve copying the contents from one type of technology to another in order to prevent records from being left on media which can no longer be read. Alternatively, refreshing may be from one piece of media to another of the same technology, ensuring that pieces of media are replaced before they fail [70]. The records should be verified following the refresh process.

Expungement of digitised records

When the retention period of a record has been reached and it is scheduled for destruction, any digital copies should be destroyed at the same time as the paper record. Care should be taken to destroy, overwrite, or securely delete computer storage media and devices used in the storage of records. As discussed above, this can be difficult to achieve when using WORM media; in this case, files to be retained could be copied onto new media and the old disc destroyed. Organisational IT policies, such as those for system backup, should also be examined to ensure that digital copies of records are no longer preserved following their destruction.
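The guideline asks that records be verified following a refresh but does not prescribe a method. One common approach, assumed here, is to compare a checksum of each file before and after copying. The Python sketch below uses SHA-256 from the standard library. The function names, the file name and the temporary directories standing in for old and new media are all hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh_and_verify(source: Path, destination: Path) -> bool:
    """Copy a file to new media and confirm the copy is bit-for-bit identical."""
    shutil.copyfile(source, destination)
    return sha256_of(source) == sha256_of(destination)


# Demonstration: temporary directories stand in for the old and new media.
with tempfile.TemporaryDirectory() as old_media, tempfile.TemporaryDirectory() as new_media:
    master = Path(old_media) / "image_0001.tif"
    master.write_bytes(b"example image bytes")
    copy = Path(new_media) / master.name
    verified = refresh_and_verify(master, copy)
    print("verified:", verified)
```

Storing the digest alongside the record's metadata would also allow later integrity checks without access to the original media.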

Recommended Storage Options

If digitisation programs are established to improve access to information, online storage is the most appropriate option. In addition, disposal of digitised records needs to occur in an online eDRMS environment. In other circumstances, digitised documents may be stored using magnetic or optical media. Regardless of storage arrangements, media must always be handled appropriately and stored in environmental conditions recommended by manufacturers. When choosing storage media, it is important to determine the manufacturer’s life expectancy for the particular media. Where possible, the highest quality media should be used. The choice of digital media will be influenced by factors such as media lifespan, cost and ease of access to files stored using the media. It is also important to ensure that digital files, particularly master copies, are safe from tampering so the original image cannot be changed. If the system used to manage the digitised records does not provide adequate assurance of the authenticity and integrity of the record, consideration should be given to the use of WORM media, such as some compact discs, to prevent alteration of the original file. The use of WORM media must be planned carefully to ensure that retention and disposal decisions can still be implemented.

68 Western States Digital Imaging Best Practices Version 1.0. 2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/resource/scanning/documents/WSDIBP_v1.pdf
69 Digital Preservation and Storage. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/delivering/digital.html


Appendix 1: Glossary of Terms and Acronyms

Acronyms

AGLS Australian Government Locator Service
BMP Bitmap
CD Compact Disc
DPI Dots Per Inch
DVD Digital Versatile Disc
GIF Graphics Interchange Format
JFIF JPEG File Interchange Format
JPEG Joint Photographic Experts Group
PCX PC Paintbrush Format
PDF Portable Document Format
PNG Portable Network Graphics
PPI Pixels Per Inch
RAM Random Access Memory
ROM Read Only Memory
TIFF Tagged Image File Format
WORM Write Once Read Many

Glossary

Anti-Aliasing Improves the appearance of greyscale images by adding grey pixels at the border of black and white areas, smoothing the transition from black to white. Also used in colour images to smooth transitions between colours.

Bit Depth The number of bits used to describe the colour of each pixel. Greater bit depth allows more colours to be used in the colour palette for the image.

Bi-tonal Images containing only black and white pixels. Bi-tonal images are often used to represent modern, non-illustrated text documents.

Colour Depth

The colour or bit depth of an image refers to the number of bits used to describe the colour of each pixel. Greater bit depth allows more colours to be displayed in an image. Colour depths can range from 1 bit per pixel for bi-tonal images to 24 bits per pixel or greater in high quality colour images.

Continuous Colour

An image, such as an original photographic transparency or print, in which the tones or colours blend smoothly from one to another. Continuous colour images have a virtually unlimited range of colour or shades of greys.

Discrete Colour Instances when the colours in an image are separate and distinct. Discrete colour images do not blend smoothly from one colour to the next and lack the many shades of colour seen in photographs.

70 VERS Advice 10: System Requirements for Preserving Electronic Records. 2004. Public Record Office Victoria. Accessed March 2005 at http://www.prov.vic.gov.au/vers/standard/advice_10/3-8.htm

Dithering The computer graphics equivalent to printed halftones, this technique creates the illusion of colour depth in images with a limited colour palette. This is done by interspersing pixels of different colours over the required area to give the appearance of a third colour. For example, white and black pixels allocated over an area will provide a grey appearance to that area.

Dots per Inch

A measure of the resolution of a printer. It refers to the number of dots the printer is able to place in a linear one-inch space. The more dots per inch, the higher the resolution and the higher the printing quality.

File format The specific way that data is arranged in a file. Some file formats can be used by a range of applications (such as text files or some image files) while others may only be used by a specific application (usually the same application used to create the file). Most applications can save documents in one or more standard formats as well as in their native format (e.g. a document produced in Microsoft Word can be saved as a Word document, in rich text format, or in WordPerfect format). File formats may be proprietary or non-proprietary.

Greyscale Greyscale images use only black, white and a range of shades of grey. The number of grey shades available depends on the colour depth of the image.

Half-tone A printed image in which the density and pattern of black and white dots are varied, giving the appearance of a continuous tone image when viewed from an appropriate distance. Half-tone images are used extensively in magazines and newspapers.

Lossless compression

The compression of data that guarantees the original data can be restored exactly. A file that is compressed using a lossless method and then retrieved is exactly the same as the original, uncompressed file.

Lossy compression

The compression of data that may result in some data being changed or lost. A file that is compressed using a lossy method and then retrieved may be different from the original file, but is "close enough" to be useful in some way.

LZW Lempel-Ziv-Welch, a proprietary lossless data-compression algorithm developed by Abraham Lempel, Jacob Ziv and Terry Welch, and used in GIF files. The patent to the LZW algorithm is owned by Unisys Corporation.

Naming conventions

A standardised approach to naming computer files.

Near-line storage

Storage of files, normally on magnetic or optical media, so that files can be accessed if needed. Accessing files in near-line storage should not require human intervention, as is the case with off-line storage, but will usually be slower than accessing on-line storage. Robotically controlled tape libraries and CD/DVD jukeboxes are applications of near-line storage.

Non-proprietary

Refers to a technological design or architecture whose configuration is available for use by the public. Use of non-proprietary technology is not restricted by licences or patents. Software is considered non-proprietary once it is released with a licence that would permit others to modify the software and release their own versions without restrictions. Non-proprietary technology allows individuals or organisations to copy, modify and study the technology.

Off-line storage

Storage of files, normally on magnetic or optical media, in a manner where the files are separate from and not directly accessible by the computer system. Human intervention, such as loading a tape into a tape drive, is required for the file to be accessed by the computer system.

On-line storage

Storage of files, normally on networks or hard disks, so that files are immediately available to the computer system.

Palette A palette is the set of available colours that may be used to display an image. Each pixel in the image is assigned a value that relates to a specific colour in the palette. The number of entries in the palette is the total number of colours which can appear simultaneously on screen.

Palettised A type of image that is composed of a distinct set of colours from a palette. Standard palettised images are made up of 16 or 256 colour palettes.

Pixel The smallest element of a digital image; short for picture element. Pixels are the many tiny squares that make up the representation of a digital picture. Usually the squares are so small and so numerous that, when displayed on a computer monitor or printed, they appear to merge into a smooth image. Pixels per inch (PPI) is a commonly used measure for digital images. Each pixel can represent a number of different shades or colours, depending on how much storage space is allocated for it.

PPI A measure of the resolution of an image. The more pixels per inch, the finer the resolution. PPI is used to describe the resolution of an image in a virtual state, or on a monitor. ‘PPI’ is often confused with ‘DPI’, which is used to describe the resolution of a printing device.

Proprietary A technological design or architecture whose configuration is unavailable to the public and may not be duplicated without permission from the designer or architect. Proprietary technology is created for a given company's purposes. For example, Microsoft Word stores documents in a proprietary format, namely Microsoft Word format. Proprietary technology may be legally used only by a person or entity purchasing an explicit license. Proprietary means "privately owned and controlled", and hence software can remain proprietary even when source code is made publicly available, if control over use, distribution, or modification is retained.

Raster A category of digital still images. Raster images are the most common images created and used within digitisation projects. Raster images take the form of a grid or matrix of pixels. Each pixel has a defined value that precisely identifies its specific colour, size and place within the image. Examples of raster image file formats are TIFF, GIF and JPEG. The other category of digital still images is vector images.

Resolution Resolution is the amount of picture data in a specific area of an image. Resolution is usually measured in pixels per inch (PPI). The higher the resolution, the sharper and clearer an image will be.

Vector A category of digital still images. Vector images are defined by mathematical equations and are used for drawings and diagrams that can be constructed from points, lines and area shapes. Vector images are resolution independent, meaning they can be scaled up to large sizes with no loss of quality. Examples of vector file formats are CAD drawings, Corel Draw files, and SVG files. The other category of digital still images is raster images.
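The glossary's distinction between lossless and lossy compression can be demonstrated with a few lines of Python. The sketch below is illustrative only. It uses zlib (DEFLATE) from the standard library as the lossless method, which is an assumption made for convenience rather than a format recommended by this guideline. The lossy step is a deliberately crude reduction from 8-bit to 4-bit grey values, not JPEG. The pixel values are invented.

```python
import zlib

# Hypothetical 8-bit greyscale pixel values (0 = black, 255 = white).
pixels = bytes([0, 17, 34, 120, 121, 122, 200, 255] * 4)

# Lossless: decompressing restores the original data exactly.
restored = zlib.decompress(zlib.compress(pixels))
print("lossless round trip identical:", restored == pixels)


def quantise_to_4_bits(data: bytes) -> bytes:
    """Keep only 16 grey levels, then scale back to the 0-255 range."""
    return bytes((value >> 4) * 17 for value in data)


# Lossy (illustrative): detail discarded here cannot be recovered.
approximate = quantise_to_4_bits(pixels)
largest_error = max(abs(a - b) for a, b in zip(pixels, approximate))
print("lossy result identical:", approximate == pixels)
print("largest per-pixel error:", largest_error)
```

The lossy result is close to the original but not identical. This is why master copies are normally kept in uncompressed or lossless form.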


Appendix 2: Scanner Types

Flatbed scanner – Provides a flat glass area for scanning, which allows thick documents or books to be scanned. Flatbed scanners are at the entry level end of the market. Some models have sheet-feeders, to increase the throughput of single sheets, and transparent media adaptors, for scanning slides or negatives, available as accessories.

Sheet-fed scanner – Dedicated to scanning separated pages, typically at a much higher rate than a flatbed scanner with a sheet feeder attached. This type of scanner is designed for a high paper throughput and most models can scan both sides of the page. Because pages are fed through the device, this scanner is not suited to scanning fragile or outsized documents or books.

Slide scanner – Designed specifically for digitising transparent materials such as slides and negatives. This scanner type typically provides higher throughput and improved quality over a flatbed scanner with a transparent media adaptor, particularly for scanning high volumes of slides and negatives.

Drum scanner – Used in graphic design and in publication, these expensive scanners use different technology from the other scanner types described here to produce a higher quality image. The page being scanned is attached to a high speed rotating drum, which makes this type of scanner unsuitable for scanning fragile documents or large volumes of records.

Wide-Format scanner – Used for scanning maps, plans and other paper documents that are larger than typical office documents. While standard sized documents are usually accommodated, these scanners are usually manually fed, meaning that they are not suited to high volume scanning of standard office documents.

Overhead scanner – Also known as a book eye scanner, this type of device captures the reader’s eye view of a book or document. These are often used for capturing documents of historical or cultural significance that cannot be laid face down on a flatbed scanner or fed through a sheet-fed scanner. Overhead scanners are not suited to high volume scanning.

Digital Camera – It is possible for a standard digital camera to capture a digital copy of a paper record. Digital cameras should be used in macro mode for photographing objects that are close to the camera. A stable mount should be used to ensure the camera is steady enough to accurately capture the object. It should be noted that photographic effects, such as barrel distortion and fall off, which affect the edges of objects captured by a camera lens at close range, will be present in records captured using a digital camera.


Appendix 3: Table of Technical Recommendations

Document type | Resolution (note 1) | Bit depth | File format
Text document with only one colour of text | No less than 200 PPI | 1-bit (bi-tonal) | TIFF G3/G4; PNG
Document with watermarks, grey shading, grey graphics, etc. | No less than 200 PPI | 4-bit or 8-bit grey | PNG; GIF (note 3); TIFF (LZW) (notes 2, 3)
Document with discrete colour used in text or diagrams, etc. | No less than 200 PPI | 4-bit or 8-bit colour | PNG; GIF (note 3); TIFF (LZW) (notes 2, 3)
Black and white photographs | 9” x 6”: 300 PPI; 7” x 5”: 430 PPI; 6” x 4”: 600 PPI | 8-bit grey | PNG; JPEG (note 4); TIFF (JPEG) (notes 2, 4)
Colour photographs | 9” x 6”: 300 PPI; 7” x 5”: 430 PPI; 6” x 4”: 600 PPI | 24-bit colour | PNG; JPEG (note 4); TIFF (JPEG) (notes 2, 4)

Notes:

1. Resolution may be reduced for images only used for on-screen viewing and should be increased for documents that require enlargement for use. For documents larger than A3, a resolution of 200 PPI is generally accepted to provide a reasonable file size. The clarity of fine line work and small text at this reduced resolution should be assessed.

2. For storing multi-page documents as a single file, TIFF may be considered as an alternative.

3. The ability of software to manage any licensing required for LZW compression should be checked.

4. The compression ratio used for JPEG compressed images should not exceed 10:1.
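The resolution and bit depth recommendations above determine how large each image will be before compression. The Python sketch below works through the arithmetic. The function names are hypothetical, and the calculation ignores file headers, metadata and row padding, so it should be read as an estimate rather than an exact file size.

```python
def pixel_dimensions(width_in: float, height_in: float, ppi: int) -> tuple:
    """Pixel width and height of a scan of the given physical size in inches."""
    return round(width_in * ppi), round(height_in * ppi)


def uncompressed_bytes(width_in: float, height_in: float, ppi: int, bit_depth: int) -> int:
    """Approximate uncompressed image size in bytes."""
    width_px, height_px = pixel_dimensions(width_in, height_in, ppi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


# A 6" x 4" colour photograph scanned at 600 PPI in 24-bit colour.
size = uncompressed_bytes(6, 4, 600, 24)
print(pixel_dimensions(6, 4, 600))  # (3600, 2400)
print(size)                         # 25920000
# Note 4 limits JPEG compression to 10:1, so the compressed file
# should be no smaller than about one tenth of the uncompressed size.
print(size / 10)
```

By the same arithmetic, the three photograph sizes in the table give long edges of 2700, 3010 and 3600 pixels, so smaller originals are captured with at least as many pixels as larger ones.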


Appendix 4: Related Standards

Queensland Information Standards

Available from http://www.governmentict.qld.gov.au/02_infostand/infostand.htm

Information Standard 18: Information Security
Information Standard 31: Retention and Disposal of Public Records
Information Standard 34: Metadata
Information Standard 40: Recordkeeping
Information Standard 41: Managing Technology Dependant Records

File Formats

ISO 12639:2004: Graphic technology - Prepress digital data exchange - Tag image file format for image technology (TIFF/IT). International Organization for Standardization
ANSI/AIIM MS53-1993: Standard Recommended Practice - File Format for Storage and Exchange of Images - Bi-Level Image File Format: Part 1. American National Standards Institute
AS ISO/IEC 15444.2-2004: Information technology - JPEG 2000 image coding system - Extensions. Standards Australia
ISO/IEC 15948:2004: Information technology - Computer graphics and image processing - Portable Network Graphics (PNG): Functional specification. International Organization for Standardization
ISO/DIS 19005-1: Document management - Electronic document file format for long-term preservation - Part 1: Use of PDF 1.4 (PDF/A-1). International Organization for Standardization

Image Management

ISO 10196:2003: Document imaging applications - Recommendations for the creation of original documents. International Organization for Standardization
ISO/TS 12029 (ATS 5084-2003): Electronic Imaging - Forms design for optimisation for electronic image management. International Organization for Standardization
ISO/TS 12033:2001 (ATS 5083-2003): Electronic imaging - Guidance for selection of document image compression methods. Standards Australia
ISO/TR 14105 (HB 177-2003): Electronic imaging - Human and organizational issues for successful Electronic Image Management (EIM) implementation. International Organization for Standardization

Imaging

ISO 12651-1999: Electronic imaging - Vocabulary. Standards Australia
JIS Z 6016:2003: Electronic imaging process of paper documents and microfilmed documents. Japanese Standards Association
ANSI/AIIM TR26-1993: Resolution as it Relates to Photographic and Electronic Imaging. American National Standards Institute
ANSI/AIIM TR2-1998: Glossary of Document Technologies. American National Standards Institute

Metadata

ISO/TS 23081-1:2004: Information and documentation - Records management processes - Metadata for records - Part 1: Principles. International Organization for Standardization
ISO 15836:2003: Information and documentation - The Dublin Core metadata element set. International Organization for Standardization
AS 5044.1-2002: AGLS Metadata element set - Part 1: Reference description. Standards Australia
AS 5044.2-2002: AGLS Metadata element set - Usage guide. Standards Australia

Records Management

AS ISO 15489.1-2002: Records management - General. Standards Australia
AS ISO 15489.2-2002: Records management - Guidelines. Standards Australia
AS ISO 23081.1-2004: Information and documentation - Records management processes - Metadata for records - Principles. Standards Australia

Scanners

ISO 12653-1:2000: Electronic imaging - Test target for the black-and-white scanning of office documents - Part 1: Characteristics. International Organization for Standardization
ISO 12653-2:2000: Electronic imaging - Test target for the black-and-white scanning of office documents - Part 2: Method of use. International Organization for Standardization
ISO/IEC 14473:1999: Information technology - Office equipment - Minimum information to be specified for image scanners. International Organization for Standardization
ISO 16067-1:2003: Photography - Spatial resolution measurements of electronic scanners for photographic images - Part 1: Scanners for reflective media. International Organization for Standardization
ISO 16067-2:2004: Photography - Electronic scanners for photographic images - Spatial resolution measurements - Part 2: Film scanners. International Organization for Standardization
ANSI/AIIM MS44-1988: Recommended Practice for Quality Control of Image Scanners. American National Standards Institute
ANSI/AIIM MS52-1991: Recommended Practice for the Requirements and Characteristics of Original Documents Intended for Optical Scanning. American National Standards Institute

Storage

ISO/TR 15801:2004: Electronic imaging - Information stored electronically - Recommendations for trustworthiness and reliability. International Organization for Standardization
ISO/TR 12037:1998 (HB 179-2003): Electronic imaging - Recommendations for the expungement of information recorded on write-once optical media. Standards Australia
ISO/TR 12654:1997 (HB 178-2003): Electronic imaging - Recommendations for the management of electronic recording systems for the recording of documents that may be required as evidence, on WORM optical disk. International Organization for Standardization
ISO 18927:2002: Imaging materials - Recordable compact disc systems - Method for estimating the life expectancy based on the effects of temperature and relative humidity. International Organization for Standardization
ANSI/AIIM MS59-1996: Media Error Monitoring and Reporting Techniques for Verification of Stored Data on Optical Digital Data Disks. American National Standards Institute
ISO 22028-1:2004: Photography and graphic technology - Extended colour encodings for digital image storage, manipulation and interchange - Part 1: Architecture and requirements. International Organization for Standardization
ANSI/AIIM TR25-1995: The use of Optical Disks for Public Records. American National Standards Institute

There are a total of 80 standards covering Optical Media and a further 36 on the topic of Data Storage available from Standards Australia.


Appendix 5: Reference List

Queensland State Archives Documents

Digitisation Disposal Policy: Policy on the authorisation of the early disposal of original paper records after digitisation. 2006. Queensland State Archives. Accessed April 2006 at www.archives.qld.gov.au/government/ddp.asp

Glossary of Archival and Recordkeeping Terms. 2004. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/downloads/GlossaryOfArchivalRKTerms.pdf

Public Records Alert No 1/05: Day batching of records. 2005. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/publications/PublicRecordsAlert/PRA105.pdf

Public Records Alert No 2/05: Understanding and applying recordkeeping metadata. 2005. Queensland State Archives. Accessed March 2005 at http://www.archives.qld.gov.au/publications/PublicRecordsAlert/PRA205.pdf

Other Documents

Adobe PDF. 2005. Adobe Systems Inc. Accessed March 2005 at http://www.adobe.com/products/acrobat/adobepdf.html

Brown A. Digital Preservation Guidance Note 1: Selecting File Formats for Long-Term Preservation. 2003. National Archives (UK). Accessed March 2005 at http://www.nationalarchives.gov.uk/preservation/advice/pdf/selecting_file_formats.pdf

Cunningham A. Metadata Standards in Australia – An Overview. 2005. Presentation at Queensland State Archives March 2005. National Archives of Australia.

Creating and Managing Digital Content. 2002. Canadian Heritage Information Network. Accessed March 2005 at http://www.chin.gc.ca/English/Digital_Content

Data Dictionary – Technical Metadata for Digital Still Images. 2003. National Information Standards Organization and AIIM International. Accessed March 2005 at http://www.niso.org/standards/resources/Z39_87_trial_use.pdf

Digital Imaging for Archival Preservation and Online Presentation: Best Practices. 2001. Michigan State University. Accessed March 2004 at http://www.historicalvoices.org/papers/image_digitization2.pdf

Digital Preservation and Storage. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/delivering/digital.html

Digital Standard 1 – Cataloguing and Metadata for Digital Images. 2003. State Library of Queensland. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/file/5449/sd1_meta_v1.2.doc

Digital Standard 2 – Digital capture, format & preservation. 2003. State Library of Queensland. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/word_doc/32645/sd2_current.doc

The DIRKS Manual: A Strategic Approach to Managing Business Information. 2003. National Archives of Australia. Accessed December 2005 at http://www.naa.gov.au/recordkeeping/dirks/dirksman/dirks.html


Electronic Records Management Guidelines. 2004. Minnesota State Archives. Accessed March 2005 at http://www.mnhs.org/preserve/records/electronicrecords/erguidelinestoc.html

File Formats and Compression. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/creating/fformat.html#ff2

Frey F. Guides to Quality in Visual Resource Imaging. 2000. Council on Library and Information Resources. Accessed March 2005 at http://lyra2.rlg.org/visguides/

General Guidelines for Scanning. 1999. Colorado Digitization Project. Accessed March 2005 at http://chnm.gmu.edu/digitalhistory/links/cached/chapter3/link3.45.CDPscanningguidelines.html

Gillespie J., Fair P., Lawrence A., Vaile D. Coping when Everything is Digital? Digital Documents and Issues in Document Retention – White Paper. 2004. Baker & McKenzie Cyberspace Law and Policy Centre, University of New South Wales. Accessed March 2005 at http://www.bakercyberlawcentre.org/ddr/

Guidelines for management, appraisal and preservation of electronic records. 1999. Public Record Office, The National Archives (UK). Accessed March 2005 at http://www.nationalarchives.gov.uk/electronicrecords/advice/pdf/procedures2.pdf

Hilton D. & Warr P. Unlocking Queensland’s Picture Heritage – Picture Queensland Digital Imaging Workshop Course Notes. 2004. State Library of Queensland.

Horton S. Web Style Guide 2nd Edition: PNG Graphics. 2004. Lynch and Horton. Accessed March 2004 at http://www.webstyleguide.com/graphics/pngs.html

How To Fix Bad Scans. 2004. Dixie State College of Utah. Accessed March 2005 at http://cit.dixie.edu/vt/vt2600/bad_scans.asp

Imaging Best Practices. 2003. University of California, Berkeley. Accessed March 2005 at http://www.lib.berkeley.edu/digicoll/bestpractices/image_bp.html

JPEG Image Coding Standard. 1998. Centre for Telecommunications and Information Engineering, Monash University. Accessed March 2005 at http://www.ctie.monash.edu.au/EMERGE/multimedia/JPEG/COMM03.HTM

Leurs L. The TIFF file format. 2001. Laurens Leurs. Accessed March 2005 at http://www.prepressure.com/formats/tiff/fileformat.htm

Ling T. Taking it to the streets: why the National Archives of Australia embraced digitisation on demand. 2002. National Archives of Australia. Accessed March 2005 at http://www.naa.gov.au/Publications/corporate_publications/digitising_TLing.pdf

LZW Patent Information. 2005. Unisys Corporation. Accessed March 2005 at http://www.unisys.com/about__unisys/lzw/

Management of Electronic Records PROS 99/007 (Version 2). 2004. Public Record Office Victoria. Accessed March 2005 at http://www.prov.vic.gov.au/vers/standard/standard

Manuscript Digitization Demonstration Project. 1998. Library of Congress.

Mendham S. JPEG 2000. 2005. IDG Communications. Accessed March 2005 at http://www.pcworld.idg.com.au/index.php/id;1170029196;fp;2;fpid;1585691688

Metadata and Digital Images. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://tasi.ac.uk/advice/delivering/metadata.html


Moving Theory into Practice: Digital Imaging Tutorial. 2003. Cornell University Library/Research Department. Accessed March 2005 at http://www.library.cornell.edu/preservation/tutorial/quality/quality-01.html
Moving Theory into Practice: Digital Imaging Tutorial. 2003. Cornell University Library/Research Department. Accessed March 2005 at http://www.library.cornell.edu/preservation/tutorial/quality/quality-02.html
PNG (Portable Network Graphics). 2004. World Wide Web Consortium. Accessed March 2005 at http://www.w3.org/Graphics/PNG/
The Preservation Management of Digital Material Handbook, Chapter 5: Media and Formats. 2002. Digital Preservation Coalition. Accessed March 2005 at http://www.dpconline.org/graphics/medfor/media.html
Quality Assurance. 2004. Technical Advisory Service for Images. Accessed March 2005 at http://www.tasi.ac.uk/advice/creating/quality.html
Recordkeeping in Brief No. 11: Digital Imaging and Recordkeeping. 2003. State Records New South Wales. Accessed March 2005 at www.records.nsw.gov.au/publicsector/rk/rib/rib11.htm
Revised Digital Imaging Guidelines for State of Ohio Executive Agencies and Local Governments. 2003. Ohio Electronic Records Committee. Accessed March 2004 at http://www.ohiojunction.net/erc/imagingrevision/revisedimaging2003.html
Roelofs G. Multiple-image Network Graphics. 2005. Greg Roelofs. Accessed March 2005 at http://www.libpng.org/pub/mng
Rothenberg J. Ensuring the Longevity of Digital Information. 1999. Council on Library and Information Resources. Accessed March 2005 at http://www.clir.org/PUBS/archives/ensuring.pdf
Scanning Tips and Techniques. 1999. Jasc Software Inc. Accessed October 2004 at http://www.jasc.com/tutorials/scantip.asp
Sharma A. Digital Noise, Film Grain. 2001. Digital Photo Techniques. Accessed March 2005 at http://www.phototechmag.com/sample/sharma.pdf
Suggested Technical Metadata Elements. 2004. Indiana Digital Library. Accessed March 2005 at www.statelib.lib.in.us/www/isl/diglibin/techmeta.pdf
Tanner S. From Vision to Implementation – strategic and management issues for digital collections. 2000. The Electronic Library – strategic, policy and management issues seminar. Accessed March 2005 at http://heds.herts.ac.uk/resources/papers/Lboro2000.pdf
Technical Guidelines for Digitizing Archival Materials for Electronic Access. 2004. National Archives and Records Administration (US). Accessed March 2005 at http://www.archives.gov/research_room/arc/arc_info/techguide_raster_june2004.pdf
Technical Recommendations for Digital Imaging Projects. 1997. Image Quality Working Group of ArchivesCom. Accessed March 2005 at http://www.columbia.edu/acis/dl/imagespec.html
Thornely J. The How of Metadata: Metadata Creation and Standards. 1999. 13th National Cataloguing Conference, October 1999. Accessed March 2005 at http://www.slq.qld.gov.au/__data/assets/file/6289/How_of_Metadata.doc
TIFF Revision 6.0. 1992. Adobe Systems Inc. Accessed March 2005 at http://partners.adobe.com/asn/developer/pdfs/tn/TIFF6.pdf


2003. Western States Digital Standards Group. Accessed March 2005 at http://www.cdpheritage.org/digital/scanning/documents/WSDIBP_v1.pdf.
