Digital preservation is a topic with many faces.
This introductory guide provides background information on the theory and practice of digital preservation. As traditional preservation and conservation are to books, so digital preservation is to digital assets. Mould can destroy a book, but the very technology used to create and store our digital assets is also a threat to them.
This guide covers the many faces of digital preservation, including the terms, concepts, models, standards, actions, risks, and tools. Digital preservation is a broad field that encompasses everything from project management to technical skills. Not everyone working in digital preservation can possess every skill; it is the combination of teams with complementary skills that makes a successful digital preservation programme in an organization possible. Having an awareness of the theories behind digital preservation and the risks to digital assets is perhaps the most important universal skill. This guide contains enough information to provide a general awareness, but it also contains resources for further study.

Image Source: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/, CC-BY-NC 3.0
Questions? Need help or advice about digital content? Contact the Digital Preservation team at:
digitalpreservation@bodleian.ox.ac.uk
Throughout this guide and the wider digital preservation literature, you will see references to data, digital objects, digital files and digital assets. What do they all mean? Are they any different?
Essentially, no. They are all ways to describe the "digital things" we are trying to preserve. In many cases, the "digital things" can be made of multiple bitstreams, including the digital file(s) and associated metadata.
DATA. A binary object of any kind. It could be a single bitstream or several bitstreams. It may or may not have an identifiable file format type.
DIGITAL OBJECT. A conceptual term that describes an aggregated unit of digital content comprised of one or more related digital files. These related files might include metadata, master files and/or a wrapper to bind the pieces together.
DIGITAL FILE. Binary information that is available to a computer program. In other words, a digital file contains information that tells a computer program how to open and treat it. This might include a file header, file signature, embedded metadata, container signature or file extension.
DIGITAL ASSET. Can mean any of the things above. It is often used to describe an individual digital file, though it is not limited to that. It is also used to assign further value to digital objects, by referring to them as assets to an organization.
Be aware that every organization might use these terms differently. It is important to understand what someone else means when they refer to a digital asset or data. Sometimes it is a very generic idea and other times it can be extremely specific. When it comes to policies and procedures, be extremely clear about the terms you are using. It may be useful to refer back to the Glossary while reading the rest of the digital preservation LibGuide.
Digital preservation at Bodleian Libraries is defined as:
The formal activity of ensuring access to digital information for as long as necessary. It requires policies, planning, resource allocation (funds, time, people) and appropriate technologies and actions to ensure accessibility, accurate rendering and authenticity of digital objects.
A “lifecycle management” approach to digital preservation is taken, where action occurs at regular intervals and future activity is planned. This includes policies and recommendations for appraising and selecting digital information to preserve, acknowledging that resources are finite.
There are two different kinds of digital preservation: bit level preservation and logical preservation. Bit-level preservation is not full preservation, but it is the foundational building block necessary before logical digital preservation can take place.
Bit Level Preservation: A term used to denote a very basic level of preservation of the digital object as it was submitted (literally preserving the bits forming a digital object).
Bit preservation is not digital preservation but it does provide one building block for the more complete set of digital preservation practices and processes that ensure the survival of digital material and also its usability, display, context and interpretation over time.
Logical Preservation: The aspect of preservation management concerned with ensuring the continued usability of meaningful information content, by ensuring the existence of a usable manifestation of the digital object. Sometimes referred to as format preservation or active preservation. It comprises three stages:
Characterize: understanding what digital materials are in the repository
Digital curation involves maintaining, preserving and adding value to digital files throughout their lifecycle—not just at the end of their active lives. This active management of digital files reduces threats to their long-term value and mitigates the risk of digital obsolescence. Digital curation includes digital preservation, but the term adds the curatorial aspects of: selection, appraisal and ongoing enhancement of the object for reuse.
It is commonly used in the sciences and social sciences for research data, and is increasingly being replaced by the term research data management, especially when referring to active digital files.
Digital archiving is often used interchangeably with digital preservation in archives. It has two main definitions:
The process of storage, backup and ongoing maintenance as opposed to strategies for long-term digital preservation (DPC Handbook). This definition is often used by computing professionals.
The long-term storage, preservation and access to information that is "born digital" (created and disseminated primarily in electronic form) or for which the digital version is considered to be the primary archive (D-Lib Magazine). This is the definition primarily used by archivists and librarians.
It is important to recognize and understand both definitions of the term, as well as be aware of the audiences that use this term differently. Knowing your audience will help you understand which definition to follow - when in doubt, ask.
Digital stewardship is commonly used in the US more than in the UK. Its definition essentially combines both curation and preservation: the active life of a digital asset and its continual preservation afterwards for long-term use. This school of thought splits digital curation and digital preservation into two separate categories and then uses digital stewardship as the umbrella term.
At Bodleian Libraries, we consider digital preservation to be a holistic term that includes aspects of digital curation and stewardship. We aim to work with creators to help them organise and manage their digital objects in ways that enhance the ability to preserve them. As creators ourselves, we aim to follow best practice for creating and managing our active files, so that when they are ingested into our digital repository they will be easier to manage and provide access to in the long term.
There are a number of reasons why digital preservation matters to Bodleian Libraries.
Even with our best intentions, however, loss can still occur if we are not proactive through continual planning, auditing and managing. Digital preservation is an ongoing process that can help prevent loss.
Digital preservation also matters because there are many risks to our digital materials. It matters because we need to provide access to these digital materials now and into the future. We need ongoing work to make this happen; we need digital preservation.
For more information on why digital preservation matters, please visit: https://dpconline.org/handbook/digital-preservation/why-digital-preservation-matters
The Lunar Orbiter Image Recovery Project (LOIRP) is a project funded by NASA, SkyCorp, SpaceRef Interactive, and private individuals to digitize the original analog data tapes from the five Lunar Orbiter spacecraft that were sent to the Moon in 1966 and 1967.

By chance, the original tapes and an original tape drive survived until the early 2000s, when funding was finally secured for a recovery project. This was after over 20 years of trying to secure funding to check the contents of the tapes and digitize any recoverable content. Fortunately, the project was able to recover data from all 1,500 tapes, though not without considerable time, expertise and funds.
This highlights the high cost of recovering digital data from obsolete hardware and software. It took an enormous number of volunteers, including people who had come out of retirement, to restore the images. The project also had to rely on one company to help restore the original tape drives -- at a high cost. Had the value of restoring the images not been recognized in the first place, it is likely that the high-quality images of the Earth from the Moon would have been lost forever. Prior to the LOIRP initiative, only poor-quality images that had been broadcast on TV in the 1960s were available.
Image: Earthrise, top image TV broadcast image from the 60s; bottom image recovered from a data tape during LOIRP
To learn more about this project:
Risks to digital materials range from technological to organizational and cultural. There are internal and external risks to consider. The rapid pace of technological change and the growing rate of digital information (sometimes called the data deluge) are urgent risks that digital preservation must mitigate. While not an exhaustive list, below are some of the risks to our digital materials at Bodleian Libraries:
Hardware, software and storage media are constantly being superseded by newer and better models. This means that digital materials are at risk of no longer being usable when the software and hardware they rely on become obsolete. When the storage media that digital materials are saved on becomes obsolete, recovery of the material can become difficult and costly.
Transfer failure is a high risk, and digital materials are transferred many times during their lives. Often a transfer failure is not discovered until it is too late to recover. Transfer failure can occur at any stage, including the copying of digital materials to backup tapes. System failures, not just transfer failures, can also lead to the corruption of digital materials.
New and complex file formats can be tricky to make preservation decisions about. Some file formats might not be supported accurately by most available software, or they might not be well formed and this can cause issues in the future.
Storage media often fails due to a lack of refreshment policies. All storage media has a shelf life and must be refreshed regularly or failures will happen. This not only leads to corruption or complete loss of digital materials, but can cause downtime in services while data is restored from tape backups.
These are mistakes we cause ourselves: inappropriate access can lead to accidents such as alteration or deletion. This can be damaging when considering compliance and data protection issues. It is important to control access to content internally, not just externally, in order to protect against human error and compliance breaches.
What do we have to comply with? It is a risk if we do not meet our legal obligations. This includes laws, policies (internal and external), and standards. Non-compliance may result in financial loss and a loss of reputation and trust, which can make us an undesirable repository for material, so that donors and vendors may look elsewhere.
Lack of funding and the political climate can be risks. There is also a risk in using third parties for preservation services, as they can go out of business.
For more information on risks to digital materials, please visit: https://dpconline.org/handbook/digital-preservation/preservation-issues
Image credit: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/
Digital preservation is standards-based. The field uses both theoretical models and standards as the basis for digital preservation activities. The language used in the field can be confusing, but it is based on these standards and models. Understanding the various models and standards will help to fill in the gaps and provide a starting point.
Image source: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/, CC-BY-NC 3.0
The Open Archival Information System (OAIS) Reference Model is a conceptual model for a digital archive. Much of the terminology used in digital preservation comes from this model. The OAIS Reference Model describes the environment, functional components, and information objects associated with a system responsible for the long-term preservation of digital information. As a reference model, its primary purpose is to provide a common set of concepts and definitions that can assist discussion across sectors and professional groups and facilitate the specification of archives and digital preservation systems. It has a very basic set of conformance requirements that should be seen as minimalist. OAIS was first approved as ISO Standard 14721 in 2002 and a second edition was published in 2012. Although produced under the leadership of the Consultative Committee for Space Data Systems (CCSDS), it had major input from libraries and archives.

For further reading:
Technology Watch Report 14-02: The Open Archival Information System (OAIS) Reference Model: Introductory Guide (2nd Edition) by Brian Lavoie 2014
Meeting the Challenges of Digital Preservation: The OAIS Reference Model. By Brian Lavoie, OCLC (2000)
There are six functional entities within the OAIS Archive. They each perform functions that carry out the sustainable operation of the archive. The six functional entities are:
Ingest: The processes that accept digital materials from Producers and prepare them for inclusion in the Archival Store.
Archival Store: The long-term storage and maintenance of digital materials. It ensures that archived digital materials reside in the right form of storage and that the bit streams of the digital materials remain complete and renderable over time.
Data Management: The maintenance of databases of descriptive metadata that identify and describe the archived digital materials. It also manages the administrative data supporting the OAIS system operation, such as performance data or access statistics.
Access: The processes and services that Consumers use to locate, request and receive digital materials held in the Archival Store.
Administration: The day-to-day management of operations and the coordination of the other functional entities. It is also the central communications hub with the external environment: management, producers and consumers.
Preservation Planning: Monitoring of the external environment for changes or threats to the digital materials in the Archival Store. It maps out the preservation strategy and recommends appropriate revisions to it in line with changing conditions. This is the safeguard against a constantly evolving user and technological environment.
The OAIS archive operates within an external environment. There are three main stakeholders involved:
Designated Community: An identified group of potential consumers who should be able to understand a particular set of information from an archive. These consumers may consist of multiple communities, are designated by the archive, and may change over time.

There are three different types of information packages that move within the OAIS. At each stage they change, based on the digital files added (or removed) and the metadata created.
Submission Information Package (SIP): An Information Package that is delivered by the Producer to the OAIS for use in the construction or update of one or more Archival Information Packages (AIPs) and/or the associated Descriptive Information.
Archival Information Package (AIP): An Information Package, consisting of the complete set of digital files and a complete set of metadata for the AIP (to support preservation and access) that is preserved within an OAIS archive.
Dissemination Information Package (DIP): An Information Package, derived from one or more Archival Information Packages (AIPs), and sent by Archives to the Consumer in response to a request to the OAIS archive.
These information packages are what is known conceptually as a digital object. A digital object is an aggregated unit of digital content comprised of one or more related digital files. These related files might include metadata, master files and/or a wrapper to bind the pieces together.
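In practice, many repositories structure information packages on disk using a packaging convention such as BagIt, which bundles payload files with checksum manifests. Below is a minimal sketch using the Library of Congress bagit library for Python; the directory name and metadata values are illustrative assumptions, not OAIS requirements.

```python
# A minimal sketch of packaging a directory of files as a SIP-like transfer
# unit with the "bagit" library (pip install bagit). Directory name and
# bag-info fields are hypothetical examples.
import bagit

# make_bag moves the payload into a data/ subdirectory and writes
# checksum manifests alongside it.
bag = bagit.make_bag(
    "accession_2024_001",  # hypothetical directory of digital files
    {"Source-Organization": "Example Library",
     "External-Description": "Scanned correspondence, TIFF masters"},
    checksums=["sha256"],
)

# Later (e.g. after transfer to the archive), verify that the payload
# and the manifests still agree.
bag.validate()
print("Bag is complete and checksums match")
```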
The DCC Lifecycle Model further expands on the idea that digital preservation requires actions throughout the lifecycle to ensure sustained long-term access to digital assets. The model is often used in the digital curation and research data management sectors, but it reflects the same actions and workflow used throughout the digital preservation field.

There are a number of key elements to the Lifecycle, starting from the data in the centre; this is what we are trying to preserve. The model then moves outwards through the various dependencies and actions needed to preserve and provide access to that data. These actions include full lifecycle, sequential and occasional actions.
The first element in the centre of the Lifecycle is DATA, which is any information in binary form, including:
Digital Objects: defined in the Lifecycle model as either simple digital objects (discrete digital items such as text files, image files or sound files, along with their related identifiers and metadata) or complex digital objects (discrete digital objects made by combining a number of other digital objects, such as websites).
Databases: structured collections of records or data stored in a computer system.
Full lifecycle actions occur continuously throughout the life of the data being preserved. These actions include:
Description and Representation Information
Assign administrative, descriptive, technical, structural and preservation metadata, using appropriate standards, to ensure adequate description and control over the long-term. Collect and assign representation information required to understand and render both the digital material and the associated metadata.
Preservation Planning
Plan for preservation throughout the curation lifecycle of digital material. This would include plans for management and administration of all curation lifecycle actions.
Community Watch and Participation
Maintain a watch on appropriate community activities, and participate in the development of shared standards, tools and suitable software.
Curate and Preserve
Be aware of, and undertake management and administrative actions planned to promote curation and preservation throughout the curation lifecycle.
These actions are shown in red on the Lifecycle model and, as the name suggests, they occur in sequential order. As the Lifecycle also suggests, they are continual, repeating as the data is transformed or reappraised over time. In order, the sequential actions are:
Conceptualise
Conceive and plan the creation of data, including capture method and storage options.
Create or Receive
Create data including administrative, descriptive, structural and technical metadata. Preservation metadata may also be added at the time of creation.
Receive data, in accordance with documented collecting policies, from data creators, other archives, repositories or data centres, and if required assign appropriate metadata.
Appraise and Select
Evaluate data and select for long-term curation and preservation. Adhere to documented guidance, policies or legal requirements.
Ingest
Transfer data to an archive, repository, data centre or other custodian. Adhere to documented guidance, policies or legal requirements.
Preservation Action
Undertake actions to ensure long-term preservation and retention of the authoritative nature of data. Preservation actions should ensure that data remains authentic, reliable and usable while maintaining its integrity. Actions include data cleaning, validation, assigning preservation metadata, assigning representation information and ensuring acceptable data structures or file formats.
Store
Store the data in a secure manner adhering to relevant standards.
Access, Use and Reuse
Ensure that data is accessible to both designated users and reusers, on a day-to-day basis. This may be in the form of publicly available published information. Robust access controls and authentication procedures may be applicable.
Transform
Create new data from the original, for example by migrating it into a different format, or by creating a subset (by selection or query) to produce newly derived results, perhaps for publication.
Dispose
Dispose of data, which has not been selected for long-term curation and preservation in accordance with documented policies, guidance or legal requirements.
Typically data may be transferred to another archive, repository, data centre or other custodian. In some instances data is destroyed. The data's nature may, for legal reasons, necessitate secure destruction.
Reappraise
Return data which fails validation procedures for further appraisal and re-selection.
Migrate
Migrate data to a different format. This may be done to accord with the storage environment or to ensure the data's immunity from hardware or software obsolescence.

Created by Anne Kenney and Nancy McGovern for the Digital Preservation Management Workshops in 2003-2006, this three-legged stool represents the three aspects of a successful and sustainable digital preservation programme:
What the model demonstrates is that without considering and maintaining each of these components (or "legs"), a digital preservation programme will ultimately collapse. These three components need to be considered together in order to sustain digital preservation activity.
It is important that a programme does not consider technology as the only solution and that a balance is struck between the technology, the people, the funding and the organizational policies.
The technology leg represents the necessary hardware, software and secure environments required to sustain a digital preservation programme. The technology leg also acknowledges changing technology and is prepared to respond accordingly.
The areas the technology leg covers in digital preservation includes:

The organization leg of the stool looks at the elements required to address the organizational needs and practices of a digital preservation programme. This leg not only maps the parameters of a programme, but helps drive the organizational change required for a successful digital preservation programme.
The organizational components include:

The resources leg looks at the time, money and people requirements of a digital preservation programme. These are the resources required to create and maintain a sustainable programme.
The resources required include:

PREservation Metadata: Implementation Strategies (PREMIS) is a metadata standard for recording information required for the preservation of digital objects. The standard's documentation and metadata schema are hosted by the Library of Congress. Their website states:
"The PREMIS Data Dictionary for Preservation Metadata is the international standard for metadata to support the preservation of digital objects and ensure their long-term usability. Developed by an international team of experts, PREMIS is implemented in digital preservation projects around the world, and support for PREMIS is incorporated into a number of commercial and open-source digital preservation tools and systems. The PREMIS Editorial Committee coordinates revisions and implementation of the standard."
Aside from associated documentation, there are two main components to the PREMIS standard:
To learn more about PREMIS, please visit: http://www.loc.gov/standards/premis/
The data model consists of four Entities:

Objects: An object is a discrete unit of information that is made up of four potential levels/types: Intellectual Entity, Representation, File or Bitstream. The PREMIS object model describes the nature of and relationships between these four types of objects and how they can be expressed in PREMIS. It is up to the organization to decide how to define and model the digital objects in its collections. Objects are central to the PREMIS standard, and the Object entity is the only one of the four that is mandatory (meaning it must be included in all PREMIS records).
Events: Events describe actions which have happened to objects, particularly during their management within a digital repository. Types of events that can be recorded in PREMIS include: virus scanning, file format validation, and file format migration.
Agents: Agents are the actors that perform events on the object. This can include staff working with the digital repository or the organization as a whole, but it can also include the software used on the object to perform the event (such as virus-scanning software).
Rights: Rights relates to copyright, licenses and any other restrictions on what a repository can do to an object. For example, a PREMIS rights statement might be from a donor agreement that states that a particular agent entity is allowed to make a particular object available online.
Preservation metadata is defined as:
"Things that most working preservation repositories are likely to need to know in order to support digital preservation." - PREMIS Data Dictionary
"Preservation metadata is intended to store technical details on the format, structure and use of the digital content, the history of all actions performed on the resource including changes and decisions [...]" - PADI, The National Library of Australia
Preservation metadata is often used as an umbrella term for various categories of metadata which all enable the ongoing preservation and management of digital objects. These types of metadata include:
More about preservation metadata can also be found in the Glossary section of this Libguide.
What is the benefit of recording preservation metadata?
There are a number of reasons why preservation metadata is important to an organization. It supports the ability to preserve digital objects and helps to maintain access to them in the long-term. But other benefits include:
The PREMIS Data Dictionary (currently in v.3) defines semantic units. Each semantic unit is mapped to one of the four entities (objects, events, agents, and rights), which means that a semantic unit is a property of an entity.
Each entry in the Data Dictionary defines how a semantic unit may be used, and provides practical examples to guide a repository when making implementation decisions for using PREMIS.

A repository can use different options for recording PREMIS metadata. Common approaches are:
This section covers information on file formats for digital preservation and best practice around identifying and validating file format types to aid with long-term access and availability.
Image source: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/, CC-BY-NC 3.0
File format identification is an important part of digital preservation. Knowing what type of file format you have and what version it is will assist with preservation planning for that digital object. It will also provide information on the types of software programs that can open and render the digital object. It is important to note that a program may be able to open a particular file format but not render it correctly. This means that the look and feel could be altered, sometimes slightly and sometimes enough to make the content difficult to interpret. This is particularly true for older file formats that were created with legacy software programs. Be aware that "legacy" can mean only 10 years!
Knowing the file format and version of a digital object also means you can plan for its future. Does it need to be normalised on ingest? Does it need to be migrated to a new file format? Would emulation be a better fit? This is all part of preservation planning.
File format identification tools and methods are constantly improving and developing. File format identification should not be seen as a one-off activity run only when a digital object is first given to a repository; it is good practice to regularly re-run identification software over collections to benefit from new tool developments.
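Signature-based identification works by matching the leading bytes of a file against known patterns. The toy Python sketch below illustrates the principle with a handful of well-known "magic numbers"; real tools such as DROID and Siegfried consult the full PRONOM registry and should be used in practice.

```python
# A toy illustration of signature-based file format identification.
# It only knows a few common magic numbers and is no substitute for
# DROID or Siegfried, which draw on the PRONOM registry.
from pathlib import Path

SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF": "PDF document",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image (87a)",
    b"GIF89a": "GIF image (89a)",
}

def identify(path: Path) -> str:
    header = path.read_bytes()[:16]  # signatures sit at the start of the file
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown (a file extension alone is not reliable evidence)"

print(identify(Path("Heubach_cat.jpg")))  # expected: JPEG image
```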
DROID (Digital Record Object Identification) is a tool for automated batch identification of file formats. DROID uses the PRONOM registry to identify file formats based on file format signatures, file extensions and other technical information contained in PRONOM. It can export reports to CSV files for querying and generating statistics.
DROID is a free and open source digital preservation tool. The newest version can be downloaded here.
PRONOM is a technical registry of file formats that has been created and maintained by The National Archives. It contains information about file formats and supporting software products or technical components. It is a resource to support ingest and long-term digital preservation.
It is regularly maintained and updated by The National Archives. While it is not a comprehensive list of file formats, submissions are encouraged. Researchers working with rare and proprietary file formats, as well as research data managers and archivists, have made submissions to PRONOM. Information on how to submit can be found here.
Siegfried is a file format identification tool that also uses the PRONOM registry; it is available to use in the web browser as well as for download and installation.
FIDO is available from the Open Preservation Foundation and also uses the PRONOM registry.
File format validation performs a number of functions that help to confirm a file format is well-formed and valid. Validation will:
For these reasons, file format validation is important. It is an especially useful tool for digitization workflows as it will ensure that digital objects are being created correctly. When you are in control of creating a digital object, validation is an important step. However, it is important to know that file format validation has the following limitations:
This is why fixity is equally important in digital preservation. It can help detect corruption of files through the early generation of what is known as a checksum. The section on fixity goes into greater detail on creating and confirming checksums and their uses in digital preservation.
The most common validation tool is JHOVE, maintained by the Open Preservation Foundation. It is an open source validation tool that can validate the following file formats:
JHOVE stands for JSTOR/Harvard Object Validation Environment. It was a joint project between JSTOR and Harvard University to create a tool to validate files and extract metadata. In 2015, the maintenance of the software was transferred to the Open Preservation Foundation.
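As a rough illustration, JHOVE can be driven from a script and its report inspected. The sketch below assumes a local JHOVE installation with the jhove launcher on the PATH; the "-h xml" output-handler flag is taken from the JHOVE documentation as I recall it, so verify against your installed version.

```python
# A hedged sketch of calling a locally installed JHOVE and checking its
# report. Assumes "jhove" is on PATH; the file name is hypothetical.
import subprocess

result = subprocess.run(
    ["jhove", "-h", "xml", "master_0001.tif"],  # hypothetical TIFF master
    capture_output=True, text=True, check=True,
)
report = result.stdout

# JHOVE reports a fully conformant file with the status
# "Well-Formed and valid".
if "Well-Formed and valid" in report:
    print("well-formed and valid")
else:
    print("check the full report for details")
```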

MediaConch is an implementation checker, policy checker, reporter, and fixer that targets preservation-level audiovisual files (specifically Matroska, Linear Pulse Code Modulation (LPCM) and FF Video Codec 1 (FFV1)) for use in memory institutions, providing detailed and batch-level conformance checking. It can be accessed via the command line, a graphical user interface, or a web interface. While it validates several audiovisual file types, it does not validate every file format type.
The policy checker part of the tool is useful, but it is complex and requires a certain level of knowledge about the different file formats.

Jpylyzer is a validation tool for JPEG2000 (JP2) images. It also reports on the image's technical characteristics, or technical metadata (called feature extraction). It is an open source tool maintained by the Open Preservation Foundation, and its creation was made possible by partial funding from the EU FP7 project SCAPE. It is commonly used in digitization workflows where TIFF files are migrated to JPEG2000 for storage and access reasons.
Unlike JHOVE, jpylyzer will only validate one file format, but it has a richer set of validation rules for JPEG2000 than JHOVE and is therefore preferred for validating this file type.

EpubCheck validates EPUB files and will extract technical and other embedded metadata. It checks things such as:
It was largely developed by Adobe Systems and is currently supported by the International Digital Publishing Forum (IDPF).
An online version of EpubCheck is available at: http://validator.idpf.org/
veraPDF validates all PDF/A parts and conformance levels. PDF/A is a version of PDF intended for the long-term preservation and archiving of electronic documents. PDF/A prohibits features that are not suitable for long-term preservation, including font linking (fonts must instead be embedded in the document), encryption and annotations. However, it does not work for every document, and creating a valid PDF/A can be labour intensive. Conformance levels include A (Accessible), B (Basic) and U (Unicode); U was created to deal with specialized fonts and character sets such as Greek, Arabic and Chinese. On top of conformance levels, there are also three versions of PDF/A, which means a PDF/A document has both a version number and a conformance level associated with it.
veraPDF will help to validate the various versions and conformance levels of PDF/A, but will not be able to validate any other version of PDF -- JHOVE will be required for that. It is good practice to validate a PDF/A file using both veraPDF and JHOVE as both validate different aspects of the PDF file.

There are several other file format validation tools available. These include, but are not limited to:
The COPTR registry of digital preservation tools has a list of further file format validation tools.
Not all digital formats are suited, or indeed designed, for archiving or preservation. Any preservation policy should therefore recognize the requirements of the collection content and decide upon a file format which best preserves those qualities: pairing content with a suitable choice of preservation format or access format, and identifying what is important in the content.
Below we suggest some factors to consider in selecting your preferred file formats:
Open source formats, such as JPEG2000, are very popular due to their non-proprietary nature and the sense of ownership that stakeholders can attain with their use. However, the choice of open source versus proprietary formats is not that simple and needs to be looked at closely. Proprietary formats, such as TIFF, are seen as being very robust; however, these formats will ultimately be susceptible to upgrade issues and obsolescence if the owner goes out of business or develops a new alternative. Similarly, open source formats can be seen as technologically neutral, being non-reliant on business models for their development; however, they can also be seen as vulnerable to the susceptibilities of the communities that support them.
Although non-proprietary formats can be selected for many resource types, this is not universally the case. For many new areas and applications, e.g. Geographical Information Systems or Virtual Reality, only proprietary formats are available. In such cases a crucial factor will be the export formats supported, to allow data to be moved out of (or into) these proprietary environments.
The availability of documentation - for example, published specifications - is an important factor in selecting a file format. Documentation may exist in the form of vendor’s specifications, an international standard, or may be created and maintained within the context of a user community. Look for a standard which is well-documented and widely implemented. Make sure the standard is listed in the PRONOM file format registry.
A file format which is relied upon by a large user group creates many more options for its users. It is worth bearing in mind levels of use and support for formats in the wider world, but also finding out what organizations similar to you are doing and sharing best practice in the selection of formats. Wide adoption of a format can give you more confidence in its preservation.
Lossy formats are those where data is compressed, or thrown away, as part of the encoding. The MP3 format is widely used for commercial distribution of music files over the web, because the lossy encoding process results in smaller file sizes.
TIFF is one example of an image format capable of supporting lossless data; it could hold a high-resolution image. JPEG is an example of a lossy image file format. Its versatility and small file size make it a suitable choice for creating a smaller access copy of an image for transmission over a network. It would not be appropriate to store the JPEG image as both the access and archival format because of the irretrievable data loss this would involve.
One rule of thumb could be to choose lossless formats for the creation and storage of "archival masters"; lossy formats should only be used for delivery / access purposes, and not considered to be archival. A rule like this is particularly suitable for a digitization project, particularly still images.
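The irreversibility of lossy encoding is easy to demonstrate. The hedged sketch below uses the Pillow imaging library and a hypothetical TIFF master to show that a JPEG derivative, once decoded again, no longer matches the original pixels.

```python
# A small demonstration of why a lossy access copy cannot replace a
# lossless archival master: saving to JPEG discards data, so the decoded
# pixels no longer match the original. File names are hypothetical.
from PIL import Image

master = Image.open("master.tif").convert("RGB")  # lossless archival master
master.save("access.jpg", "JPEG", quality=75)     # smaller, lossy access copy

reopened = Image.open("access.jpg").convert("RGB")
identical = list(master.getdata()) == list(reopened.getdata())
print("pixel-identical after JPEG round trip?", identical)  # almost always False
```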
Some file formats have support for metadata. This means that some metadata can be inscribed directly into an instance of a file (for example, JPEG2000 supports some rights metadata fields). This can be a consideration, depending on your approach to metadata management.
This is a complex area. One view regards significant properties as the "essence" of file content; a strategy that gets to the heart of "what to preserve". What does the user community expect from the rendition? What aspects of the original are you trying to preserve? This strategy could mean you don’t have to commit to preserving all aspects of a file format, only those that have the most meaning and value to the user.
Significant properties may also refer to a very specific range of technical metadata that is required to be present in order for a file to be rendered (e.g. image width). Some migration tools may strip out this metadata, or it may become lost through other curation actions in the repository. The preservation strategy needs to prevent this loss from happening. It thus becomes important to identify, extract, store and preserve significant properties at an early stage of the preservation process.
Source: Digital Preservation Coalition Handbook, 2nd Edition

This section is about the actions that can be taken to mitigate the technical challenges to digital materials over time. These actions include maintaining fixity, digital forensics, storage, migration and emulation. These various technical strategies can help to ensure long-term access to digital objects.
Image source: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/, CC-BY-NC 3.0
Digital forensics is associated in many people’s minds primarily with the investigation of wrongdoing. However, it has also emerged in recent years as a promising source of tools and approaches for facilitating digital preservation and curation, specifically for protecting and investigating evidence from the past.
Institutional repositories and professionals with responsibilities for personal archives and other digital collections can benefit from forensics in addressing digital authenticity, accountability and accessibility. Digital personal information must be handled with due sensitivity and security while demonstrably protecting its evidential value.
Forensic technology makes it possible to: identify privacy issues; establish a chain of custody for provenance; employ write protection for capture and transfer; and detect forgery or manipulation. It can extract and mine relevant metadata and content; enable efficient indexing and searching by curators; and facilitate audit control and granular access privileges. Advancing capabilities promise increasingly effective automation in the handling of ever higher volumes of personal digital information. With the right policies in place, the judicious use of forensic technologies will continue to offer theoretical models, practical solutions and analytical insights.
There are three basic and essential principles in digital forensics: that the evidence is acquired without altering it; that this is demonstrably so; and that analysis is conducted in an accountable and repeatable way. Digital forensic processes, hardware and software have been designed to ensure compliance with these requirements.
Information assurance is critical. Writeblockers ensure that information is captured without altering it, while chains of custody in terms of evidence handling, process control, information audit, digital signatures and watermarking protect the historical evidence from future alteration and uncertain provenance.
Selective redaction, anonymization and encryption, malware sandbox containment and other mechanisms for security and fine-tuned control are required to assure that privacy is fully protected and inadvertent information leakage is prevented. Family computers, portable devices and shareable cloud services all harbour considerable personal information and consequently raise issues of privacy. Digital archivists and forensic practitioners share the need to handle the ensuing personal information responsibly.
The current emphasis on automation in digital forensic research is of particular significance to the curation of cultural heritage, where this capability is increasingly essential in a digital universe that continues to expand exponentially. Current research is directed at handling large volumes efficiently and effectively using a variety of analytical techniques. Parallel processing, for example, through purpose-designed Graphics Processing Units (GPUs), and high performance computing can assist processor-intensive activities such as full search and indexing, filtering and hashing, secure deletion, mining, fusion and visualization.
Especially noteworthy for digital preservation and curation is the way that digital forensics directs attention towards the digital media item as a whole – typically the forensic disk image, the file that represents everything on the original disk.
Forensic technologies vary greatly in their capability, cost and complexity. Some equipment is expensive, but some is free. Some techniques are very straightforward to use, others have to be applied with great care and sophistication. The BitCurator Consortium has been an important development bringing together a community of archival users of open source digital forensic tools (Lee et al, 2014). There is an increasingly rich set of open source forensic tools that are free to obtain and use – most significantly for archivists, BitCurator. These are a wonderful introduction to the ins-and-outs of digital forensics, and can be used to compare and cross-check the outputs of commercial or other open source tools.
Digital archivists and forensic specialists share a common need to monitor and understand how technology is used to create, store, and manage digital information. Additionally, there is a mutual need to manage that information responsibly in conformance with relevant standards and best practice. New forensic techniques are furthering the handling of digital information from mobile devices, networks, live data on remote computers, flash media, virtual machines, cloud services, and encrypted sources. The use of encryption is beginning to present significant challenges for digital preservation. It is not only a matter of decryption but of identifying encryption in the first place. Digital forensics offers some solutions.
Forensic and archival methodology must retain the ability both to retrospectively interpret events represented on digital devices, and to react quickly to the changing digital landscape by the rapid institution of certifiable and responsible policies, procedures and facilities. The pace of change also has implications for ongoing training of curators and archivists, and there are digital forensics courses endorsed by archival, scholarly and preservation institutions.
Conclusion
In conclusion, there are some deep challenges ahead for cultural heritage and archives, but the forensic perspective is undoubtedly among the most promising sources of insights and solutions. Equally, digital forensics can benefit from the advances being made in the curation and preservation of digital information. This brief overview has been based on short excerpts from The Digital Preservation Technology Watch Report on Digital Forensics and Preservation (John, 2012), with additional material kindly provided by Jeremy Leighton John, the author of the report.
Source: Digital Preservation Coalition Handbook, 2nd Edition

The BitCurator project put together an open source suite of digital forensics tools specifically to be used in library and archives born-digital workflows. It contains a range of tools that can be run from a Linux environment. The available tools include:
FRED is a digital forensics workstation sold by Digital Intelligence. It has a number of ports, media readers and built-in writeblockers. FRED also has internal RAID storage. This system uses FTK imager software for creating and reading disk images.

KryoFlux is a floppy disk controller used in creating disk images of floppy disks. Its advantages over USB floppy disk readers are that the KryoFlux can:


Fixity is a term commonly used in digital preservation when talking about digital files and bitstreams. Fixity means the state of being unchanged or permanent. Confirming a digital file's fixity means that it has remained the same over time. Often this process of confirming is called fixity checking or integrity checking. This process will verify that a digital object has not been altered or corrupted.
The most common way to confirm the fixity of a digital object is to create what is known as a checksum or hash for each individual file or in some cases, bitstream (mainly for audiovisual works). A checksum is a string of numbers and letters generated using a mathematical algorithm. A checksum is like a digital fingerprint for a file, because it will be unique for each file.
The most common checksum algorithms used in digital preservation are MD5, SHA-256 and SHA-1. However, there are others, and they go in and out of use over time. It is important to know what algorithm was used to generate the checksum for a digital file, as checksums from different algorithms are not interoperable.
By monitoring a file's integrity from as early on as possible, any loss or corruption to that file may be detected. However, a checksum has its limits. While a mismatch of checksums during fixity checking may flag that a file's checksum has changed, it cannot diagnose the problem with the file. It can only say there was one. It will be up to you to investigate further.
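Generating checksums requires no special tooling. The minimal sketch below uses Python's standard hashlib module; the file name reuses the example image shown further down this page.

```python
# A minimal sketch of generating fixity values with the standard library.
# Reading in chunks keeps memory use flat for large files.
import hashlib

def checksum(path: str, algorithm: str = "sha256") -> str:
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the value -- and the algorithm used -- as early as possible, then
# recompute later: a mismatch flags a change but does not diagnose it.
print(checksum("Heubach_cat.jpg", "md5"))
print(checksum("Heubach_cat.jpg", "sha256"))
```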
For more on Fixity and checksums, please read the DPC Handbook section on Fixity.
Image source: Jørgen Stamp, CC-BY 2.5
There are a number of programs listed in the COPTR tool registry that can generate checksums and verify file fixity. Some of the common tools are:
Aside from verifying that file fixity has been maintained while the file is being stored, checksums have three other main uses:
Below are some examples of what various checksums look like for the following image.

Image By Walter Heubach (German, 1865–1923) (Upload: User:Jarlhelm) [Public domain], via Wikimedia Commons
File name: Heubach_cat.jpg
Md5: 6d5b04d33455ac13a2291216e5b552a2
Sha-1: 1a26f9ce33857a5c742877aa8de982968d87f67b
Sha-256: 06a67229b29321064ab6b83cd3fce40bc8079666a1197d324e8f2ce28dd24dff
Data integrity is important to digital objects. It is about ensuring the maintenance and consistency of the data throughout its lifecycle. Maintaining fixity is a critical part of data integrity.
Other aspects include managing relationships between data and maintaining metadata for contextual purposes.
Fixity can be recorded using the PREMIS metadata standard. It is referred to as a message digest, which is just another term for a checksum. It can record not only the checksum, but the algorithm that created it as well as the software and version. Any subsequent fixity checks can also be recorded using PREMIS, including the outcome of the check.
Recording this type of preservation metadata is crucial for confirming and establishing a digital object's "chain of custody".
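As a hedged illustration of how this looks in practice, the sketch below uses Python's standard XML library to build a PREMIS v3 fixity unit. The element names come from the PREMIS Data Dictionary; a complete object record would require further mandatory units, such as an objectIdentifier, which are omitted here.

```python
# An abridged sketch of expressing fixity as a PREMIS "fixity" semantic
# unit in XML. The SHA-256 value reuses the Heubach_cat.jpg example above.
import xml.etree.ElementTree as ET

NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", NS)

fixity = ET.Element(f"{{{NS}}}fixity")
ET.SubElement(fixity, f"{{{NS}}}messageDigestAlgorithm").text = "SHA-256"
ET.SubElement(fixity, f"{{{NS}}}messageDigest").text = (
    "06a67229b29321064ab6b83cd3fce40bc8079666a1197d324e8f2ce28dd24dff")
ET.SubElement(fixity, f"{{{NS}}}messageDigestOriginator").text = "hashlib (Python 3)"

print(ET.tostring(fixity, encoding="unicode"))
```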
Also known as file format migration, or sometimes file format conversion, migration is different from storage media migration and software refresh. It involves transferring, or migrating, data from an aging or obsolete file format into a new file format, possibly using new application systems at each stage to interpret the information. Moving from one version of a file format to a later version is a standard migration practice. This preservation action is particularly useful when the software used to render the file format is obsolete and modern software cannot render it correctly. This is the case with older word processing file formats, such as those created by obsolete software like WordPerfect or WordStar.
For more information, please visit the DPC Handbook on Preservation Actions.
Migration is important. Context is lost when modern software cannot render a file format type correctly. The example below shows what a Corel WordPerfect 7 file looks like in modern software.

Shown above: the file opened in LibreOffice Writer, and the same file opened in Microsoft Word 2007.
Neither of these versions accurately represents what was intended by the original file, which can be seen rendered correctly in Corel WordPerfect version 7 below:

In migration, this file would be changed to another file format type, perhaps a Microsoft Word 2007 (.docx) file or an Adobe PDF document, depending on which is determined to maintain the most significant properties of the file, taking into account the policies of the organization. Migration is not done without significant research first.
Image source: Euan Cochrane, CC-BY 2.0
When software used to render a file format is obsolete, one method of accessing the data is through file format migration.
It also allows the data to be opened with modern, up-to-date software, thus making access easier for users.
However, when undertaking file format migration it is important to always retain the original file alongside the new migrated file. There may very well be a loss of certain properties of the original file when file format migration is undertaken.
It is important to assess what is considered an acceptable loss (if any). This will require understanding what is most important about the file and why it is part of the collection.
It is also important to consider how often to migrate a file format, as it may not be practicable to migrate with every new file format version; it may be better to wait until there is an entirely new generation of file format.
Waiting too long, however, may mean the software becomes obsolete before migration, making the task of migration harder to perform.
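The hedged sketch below shows the shape of a migration step that retains the original: the source file is copied to safe storage before LibreOffice's documented headless conversion mode produces the migrated copy. File and directory names are hypothetical, and any real migration would follow the research and policy checks described above.

```python
# A hedged sketch of a migration step that always keeps the original.
# "--headless" and "--convert-to" are documented LibreOffice options;
# conversion fidelity must still be reviewed against the file's
# significant properties.
import shutil
import subprocess
from pathlib import Path

original = Path("letter_1994.wpd")              # hypothetical WordPerfect source
preserved = Path("originals") / original.name

preserved.parent.mkdir(exist_ok=True)
shutil.copy2(original, preserved)               # retain the original, untouched

subprocess.run(
    ["soffice", "--headless", "--convert-to", "docx",
     "--outdir", "migrated", str(original)],
    check=True,
)
```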
An "emulator" is a software which mimics the behaviour of another computer environment. It is used in digital preservation to access software and digital files which require obsolete technological environments to run. For example, an organization could use a Windows 3.1 emulator to access a WordPerfect file from 1994 on the document editing software which originally created it (Corel WordPerfect version 7.x).
Emulation software has been developed by gaming enthusiasts since the early 1990s, but has also sparked debate and interest within the digital preservation community since the early 2000s. While emulation environments were originally seen as complex and time consuming to set up, new developments such as in-browser-emulation has lowered the barrier to use. Today, one of the biggest obstacles to using emulation software is instead around legal concerns. The licensing landscape for obsolete software and Operating Systems required for emulation is still complex.
There are a number of benefits to using emulation for accessing file formats and obsolete software. These include:
Emulation as a Service (EaaS) is a scalable service model to allow organizations to access emulated environments and deliver them to users. The EaaS architecture simplifies access to preserved digital objects and provides original environments to users. It utilizes ready-made emulation components and a flexible web service API for tailored digital preservation workflows.
There are some demo use cases for EaaS available here.
One project which used EaaS in the cloud was Rhizome's presentation of Theresa Duncan's art game CD-ROMs from the 1990s. By providing access to them on the web, users can play the games without needing to download and install any additional software or disk images on their computers. The games Chop Suey, Smarty, and Zero Zero are available to play using an emulator here.

There are a number of drawbacks to emulation, which do not always make it a practicable choice:
Game emulation remains a very popular example of using emulators to provide access to original files. It does strip the game of its original hardware, but it creates a software environment where the original game can be played on a variety of devices, from small computers (like the Raspberry Pi) to mobiles, special devices (the mini NES and SNES systems) and the web. The Internet Archive provides access to a number of emulated games, such as those for MS-DOS, and emulated computer environments like Mac OS 7.0.1 from 1991.

Storage is often the first thing people think about in digital preservation. While it is foundational to a digital preservation programme, it is only one component of it.
When it comes to storage, you ideally want to follow these main principles, though there is no single solution for every organisation:

Tiered storage for different types of digital data is popular when taking into account costs and usage. Many large research data sets are stored only on tape, rather than also on disk, due to the size and cost of keeping enough disk space. It is important to make those decisions deliberately and have them well documented. This is just the basics of preservation storage; it does not include the preservation systems and associated software that should sit on top of it in order to ingest, manage, and audit the digital objects.
We often refer to storage as intelligent storage because it needs to be flexible, it needs to be scalable and it needs humans to manage it. This is the same with our preservation systems. For example, if we keep copies on tape, then we should occasionally check those tapes, and we need a person to be responsible for this activity and to document the results. Do the checksums still match (integrity checking)? Can the data on the tape be read? Can we restore the data to spinning disk without any issue? It is important to test storage systems and monitor the data you are storing on them, otherwise you are not protecting your digital objects well. Essentially, you are exposing them to risk.
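A scheduled audit of this kind can be largely automated, with a person responsible for reviewing and documenting the results. The sketch below recomputes checksums for one stored copy and compares them against a manifest written at ingest; the manifest location and its "digest, then path" line format are assumptions for illustration.

```python
# A minimal sketch of a scheduled integrity audit: recompute checksums for
# stored copies and compare them to a manifest written at ingest.
# Manifest format assumed here: "<sha256>  <relative path>" per line.
import hashlib
from pathlib import Path

def sha256(path: Path) -> str:
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

store = Path("/archive/copy1")                  # hypothetical storage location
for line in Path("manifest-sha256.txt").read_text().splitlines():
    expected, name = line.split(maxsplit=1)
    if sha256(store / name) != expected:
        # A mismatch flags a change but does not diagnose it: investigate,
        # document, and restore this file from another copy.
        print(f"MISMATCH: {name}")
```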
It is also important to check how often hardware and software are replaced because that is also important to protect against failure; these things have life spans and will eventually fail. In these scenarios, the human element is as important as the technology that underpins storage systems - humans will be the ones to test the system, check that everything is working and will make sure refreshment happens regularly.
It is important to understand that backup alone is not digital preservation. It is only a part of it.
More importantly, how good are your backups if you never practice recovery? If you have not tried restoring any of your backups, then assume they will likely fail. It is important to occasionally restore from tape and to check digital materials at random. Relying solely on third party backup systems to do the work for you is NOT digital preservation.
Remember:
Digital archiving in the technological world is often seen as the process of backing up: if something is archived, then it is considered to be in long-term storage. In digital preservation, digital archiving is not enough! While it will reduce some of the risk, it is not actively managing digital objects to protect them against other technological and cultural risks. Often in the tech world, digital archiving is considered sufficient, but it is important that this myth is debunked and the entire process of digital preservation is explained.
Tip!
To make things more confusing, digital archiving can also be an archival term to mean working with digital objects in an archival setting. It is important to keep this distinction in mind when speaking to colleagues, to ensure that you understand which of the meanings they are intending when using the term.
Many types of digital material selected for long-term preservation may contain confidential and sensitive information that must be protected to ensure it is not accessed by non-authorised users. In many cases there may be legal or regulatory obligations on the organization. These materials must be managed in accordance with the organization's Information Security Policy to protect against security breaches. ISO 27001 describes the manner in which security procedures can be codified and monitored (ISO, 2013a). ISO 27002 provides guidelines on the implementation of ISO 27001-compliant security procedures (ISO, 2013b). Conforming organizations can be externally accredited and validated. In some cases your own organization's Information Security Policy may also impact on digital preservation activities, and you may need to enlist the support of your Information Governance and ICT teams to facilitate your processes.
Information security methods such as encryption add to the complexity of the preservation process and should be avoided if possible for archival copies. Other security approaches may therefore need to be more rigorously applied for sensitive unencrypted files; these might include restricting access to locked-down terminals in controlled locations (secure rooms), or strong user authentication requirements for remote access. However, these alternative approaches may not always be sufficient or feasible. Encryption may also be present on files that are received on ingest from a depositor, so it is important to be aware of information security options such as encryption, the management of encryption keys, and their implications for digital preservation.
Techniques for protecting information
Several information security techniques may be applied to protect digital material, though this list is not exhaustive:
Encryption is a cryptographic technique which protects digital material by converting it into a scrambled form. Encryption may be applied at many levels, from a single file to an entire disk. Many encryption algorithms exist, each of which scrambles information in a different way. Decryption requires the use of a key to unscramble the data and convert it back to its original form. The strength of the encryption method is influenced by the key size: for example, 256-bit encryption will be more secure than 128-bit encryption.
It should be noted that encryption is only effective when a third party does not have access to the encryption key in use. A user who has entered the password for an encrypted drive and left their machine powered on and unattended will provide third parties with an opportunity to access data held in the encrypted area, which may result in its release.
Similarly encryption security measures (if used) can lose their effectiveness over time in a repository: there is effectively an arms race between encryption techniques and computational methods to break them. Hence, if used, all encryption by a repository must be actively managed and updated over time to remain secure.
Encrypted digital material can only be accessed over time in a repository if the organization manages its keys. The loss or destruction of these keys will result in data becoming inaccessible.
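To illustrate the key-management point above, here is a minimal sketch of symmetric file encryption using the third-party Python cryptography package (an assumed choice; any comparable library would serve, and the file names are placeholders). Note that if the key file is lost, the encrypted copy becomes permanently unreadable:

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Generate a key once and store it securely; losing it means losing the data.
key = Fernet.generate_key()
with open("secret.key", "wb") as key_file:
    key_file.write(key)

fernet = Fernet(key)
with open("report.pdf", "rb") as f:
    ciphertext = fernet.encrypt(f.read())  # the scrambled form
with open("report.pdf.enc", "wb") as f:
    f.write(ciphertext)

# Decryption only works with the same key, which must be actively managed.
plaintext = Fernet(key).decrypt(ciphertext)
```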
Access controls allow an administrator to specify who is allowed to access digital material and the type of access that is permitted (for example, read-only or write access).
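A minimal sketch of the idea, using a hypothetical role-to-permissions mapping rather than any particular repository system's access-control model:

```python
# Hypothetical roles and permitted actions, for illustration only.
PERMISSIONS = {
    "curator": {"read", "write"},
    "reader": {"read"},
}

def is_allowed(role: str, action: str) -> bool:
    """Return True if the role is permitted to perform the action."""
    return action in PERMISSIONS.get(role, set())

print(is_allowed("reader", "read"))   # True
print(is_allowed("reader", "write"))  # False
```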
Redaction refers to the process of analyzing a digital resource, identifying confidential or sensitive information, and removing or replacing it. Common techniques applied include anonymization and pseudonymization to remove personally identifiable information, as well as cleaning of authorship information. When related to datasets this is usually carried out by the removal of information while retaining the structure of the record in the version being released. You should always carry out redaction on a copy of the original, never on the original itself.
The majority of digital materials created using office systems, such as Microsoft Office, are stored in proprietary, binary-encoded formats. Binary formats may contain significant information which is not displayed, and its presence may therefore not be apparent. They may incorporate change histories, audit trails, or embedded metadata, by means of which deleted information can be recovered or simple redaction processes otherwise circumvented. Digital materials may be redacted through a combination of information deletion and conversion to a different format. Certain formats, such as plain ASCII text files, contain displayable information only. Conversion to this format will therefore eliminate any information that may be hidden in non-displayable portions of a bit stream.
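As a hedged illustration of redacting a copy rather than the original, the Python sketch below operates on plain text (i.e. after a conversion of the kind described above) and uses hypothetical patterns; real redaction always requires careful human review:

```python
import re
import shutil
from pathlib import Path

# Illustrative patterns only; real projects need vetted, case-specific rules.
PATTERNS = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),  # email addresses
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),    # dd/mm/yyyy dates
]

def redact_copy(original: Path, redacted: Path) -> None:
    """Redact a copy of the file; the original is never modified."""
    shutil.copy2(original, redacted)
    text = redacted.read_text(encoding="utf-8")
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    redacted.write_text(text, encoding="utf-8")

redact_copy(Path("interview.txt"), Path("interview_redacted.txt"))
```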
Source: Digital Preservation Coalition Handbook, 2nd Edition

This LibGuide is merely a starting point for learning the basics of digital preservation. There is much more to learn and, more importantly, there is putting theory into practice. This section lists materials that have been made available through the Bodleian Staff Library and online library resources; more are being added as they are purchased by the library. The next section contains a variety of web resources for further reading, tools to look up and trial, and resources to help put your skills to the test. The final section deals with further development resources, from conferences to attend to professional qualifications and other development courses.
Digital preservation is an ever-evolving field, and so skills must constantly evolve to meet its needs. Consequently, this is a field where continual learning is crucial and where collaboration is vital to achieving long-term preservation of our digital objects. There are always people to learn from, organizations to ask questions of, and plenty of forums in which to do so. For preservation systems from Archivematica to Preservica, there are user groups to consult. For many of the open source tools, from JHOVE to AVP's Exactly transfer tool, there are user groups and web forums, and there is always Twitter. Just ask a digital preservation question on Twitter using one of the following hashtags - #digitalpreservation #digipres #digpres - and you are sure to get a response from someone, with plenty of retweets from others to amplify it.
And as always, if you have digital preservation questions, you can ask the team at Bodleian Libraries by sending an email to digitalpreservation@bodleian.ox.ac.uk - we're here to help!
Image source: Digital Preservation Business Case Toolkit http://wiki.dpconline.org/, CC-BY-NC 3.0
This section lists books and other materials available at Bodleian Libraries on digital preservation. Most of these are available through the staff library.
If there is a book that is not in the staff library but that you think would be good to add to it, please contact Staff Development at staff-dev@bodleian.ox.ac.uk with suggestions.
A listing of technical registries that contain links to tools, as well as wikis that identify various media carriers, audio/video cables, and other useful resources to help with digital preservation activities.
Two of the major conferences in the field of digital preservation are iPres and PASIG. iPres publishes open access, peer-reviewed papers every year; PASIG puts all slides from its presentations online for future reference. A number of other conferences cover digital preservation as part of their programme, but these are the two largest international ones.
Background: This glossary was created for the GLAM Digital Preservation Project. The application of digital preservation terminology varies greatly across organizations and information professions, so there is scope for misunderstanding and a lack of clarity when using these terms. The purpose of the glossary is to ensure that GLAM uses a common language when defining system functionalities, policy, and standards documentation, enabling the organizations to better collaborate and exchange knowledge.
The glossary is divided into three sections:
Section 1: Storage, copies, and backups
Section 2: Digital Preservation terms and concepts
Section 3: Abbreviations
Acknowledgements. The glossary was amalgamated and adapted from: NDSA glossary, OCLC Trusted Digital Repositories Report, APARSEN glossary, Digital Preservation at Oxford and Cambridge Project report: Bodleian Libraries’ digitized image assets (2016), DPC Digital Preservation Handbook (2nd edition), and Adrian Brown (2013) Practical Digital Preservation: A How-to Guide for Organizations of Any Size
___________________________________________________________________________________
Archival copy - A copy of digital material made at a particular point in time that can be used as a reference if the original disappears or is temporarily unavailable. Usually stored on long-term storage, and could be considered a primary copy of the data
Backup - A copy of digital material saved to a storage device for the purpose of preventing loss of data in the event of equipment failure or destruction of the original material. Backups can be considered a secondary convenience copy of the digital material. A backup may be kept for only 30 days, for example: it is not retained indefinitely and will periodically be overwritten by a newer backup. Alternatively, a backup may occur only when files are altered, so that the altered files are backed up again while all other files remain untouched in the backup. A backup may cover just the data, the entire file system, or the entire computer system
Clone - A copy of a data structure such as a file or disk image; a duplicate of the original data
Cloning - The process of copying the contents of data structures (such as files)
Deep copy - A copy of a data structure which includes all associated data, including deleted files
Geographically separated - Identical copies of the data are not stored in the same physical location but, even if held on different storage media, are kept a fair distance apart. What matters is that the copies are not subject to the same environmental risks (such as fire and flood), the same infrastructure risks (such as power or internet failure), or the same human risks (such as arson, bombing, or other malicious events). The physical distance required will therefore vary between regions and may be constrained by legal requirements
Logical copy - A copy of a data structure which includes all associated active data such as multimedia. The copy will retain the hierarchical organisation (folder/directory structure) and the full path of file names. Deleted files are not included
Shallow copy - A copy of a data structure which contains references to the original structure, e.g. to a variable, file, folder, or other object. In contrast to a "deep copy", which is an actual duplicate of the data, a shallow copy is not meant for direct use, and copying a shallow copy does not copy the original contents. Examples of shallow copies include Windows shortcuts, symbolic links, and programming pointers - objects that contain an address and simply point to the data structure, but do not themselves contain the data
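The distinction between shallow and deep copies is easiest to see in code. The Python sketch below illustrates the general concept (it is not specific to any file system):

```python
import copy

record = {"title": "Diary", "files": ["page1.tif", "page2.tif"]}

shallow = copy.copy(record)   # copies the outer dict; the inner list is shared
deep = copy.deepcopy(record)  # duplicates everything, including the list

record["files"].append("page3.tif")
print(shallow["files"])  # ['page1.tif', 'page2.tif', 'page3.tif'] - shared
print(deep["files"])     # ['page1.tif', 'page2.tif'] - independent duplicate
```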
Snapshot - A copy of a data structure (e.g. computer hard drive, virtual machine) at a specific moment in time. Snapshots are useful for backing up data at different intervals, which allows information to be recovered from different periods of time
Spinning disk - Refers to a hard drive with physical spinning disk platter(s). However, this term is often used to mean online storage (instantly accessible), even though many drives now use solid-state technology rather than mechanical platters
Storage - Archival - Storage for digital material which is rarely accessed, often used for storing an archival copy of digital material. Usually kept off site and distanced from the original copy. Tape is often seen as a good medium for archival storage
Storage - Nearline - A storage system where access to the data is not immediately available, but the data being stored can be brought online quickly without human intervention. Tape libraries that can automatically load and access tapes are considered nearline storage, but if a tape must be loaded manually then it is considered offline storage. *see also "Storage - Online" and "Storage - Offline"
Storage - Offline - A storage system where access to data is not immediately available and human intervention is required to bring the data online. *see also "Storage - Online"
Storage - Online - Online storage supports frequent, rapid access to data by being immediately available to users all the time. It often involves a series of either spinning disks, flash storage or a combination of both
Synchronized replication - Data is written to primary storage and replicated to additional storage or backups simultaneously. This keeps multiple copies of the data up to date in real time. *see also "Replication"
Replication - The process of copying data from one location to another so that there are multiple identical copies in different locations. This helps keep copies of data up to date and mitigates the risk of data loss or corruption
Resilient storage - A term which can have several meanings; it often refers to storage which uses a redundant array of independent disks (RAID), meaning that disks in the system can fail without affecting any of the data stored on it
___________________________________________________________________________________
3D scanning - The act of collecting data about a physical object, in order to reconstruct it as a digital three-dimensional model
Accessioning - The process of bringing digital objects under the physical and intellectual control of an organisation
Administrative metadata - This term is sometimes used to refer to subtypes of digital preservation metadata. However, because administrative metadata is also used by ISAD(G) for archival description aids, it is ambiguous and is not a preferred term across GLAM. See instead Preservation Metadata.
Asset register (digital) - A record of an organization’s digital information assets/digital materials, which quantifies the value and risk of loss in each case
Authenticity - A digital object is authentic if it “is what it purports to be”. In the case of digital materials, it refers to the fact that whatever is being cited is the same as it was when it was first created, unless the accompanying metadata indicates any changes. Confidence in the authenticity of digital materials over time is particularly crucial owing to the ease with which alterations can be made *see also “provenance metadata”
Bag - A package of digital material that conforms to the BagIt Specification (specification available at http://www.digitalpreservation.gov/documents/bagitspec.pdf). Under the specification, a bag consists of a base directory containing a small amount of machine-readable text to help automate the material's receipt, storage and retrieval and a subdirectory that holds the files
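As an illustration, bags can be created and validated programmatically. The sketch below assumes the third-party bagit Python library (the Library of Congress implementation; other tools exist) and a placeholder directory name:

```python
import bagit  # pip install bagit

# Turn an existing directory into a bag in place: payload files move into
# data/, and manifest and tag files are written to the base directory.
bag = bagit.make_bag("my_collection", {"Contact-Name": "Digital Preservation Team"})

# Later, verify completeness and fixity against the stored manifests.
bag = bagit.Bag("my_collection")
print(bag.is_valid())  # True if all checksums match and no files are missing
```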
Bit level preservation – A term used to denote a very basic level of preservation of a digital object as it was submitted (literally, preservation of the bits forming a digital object). It may include maintaining onsite and offsite backup copies, virus checking, fixity checking, and periodic refreshment to new storage media. Bit preservation is not digital preservation, but it does provide one building block for the more complete set of digital preservation practices and processes that ensure the survival of digital material, along with its usability, display, context, and interpretation over time
Bit stream - A stream of data in binary form. A bit stream may be a digital file or a component of a digital file. The term bit stream is particularly important in fields such as audiovisual archiving *see also “digital file”, “digital object”, and “digital material”
Born digital - Digital materials which are not intended to have an analogue equivalent. This differentiates born digital material from digitized material, as it has not been created from an analogue source
Canonical metadata – A metadata record which is to be regarded as the most up to date and correct source of information. Knowing which metadata record is the “canonical” source of information for a digital object is important, as metadata may become out of sync between (for example) catalogues, delivery websites, and metadata embedded in digital files
Canonical file – * see “master file”
Capture standard (digitization) - The settings, formats, and quality levels used for digitization of physical objects
Chain of custody - A process used to maintain and document the chronological history of the handling, including the transfer of ownership, of any arbitrary digital file from its creation to a final state version. * See also "provenance"
Characterisation - The identification and description of what a file is and of its defining technical characteristics. Characterisation may include the identification of file formats and of technical attributes such as creating software and hardware, file size, and bit depth. Characterisation is often captured as technical metadata *see also "technical metadata"
Checksum – An algorithmically-computed numeric value for a bitstream, file, or set of files. Checksums are used to monitor files in order to detect accidental errors that may have been introduced during transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and comparing it with the stored one. If the checksums match, the data was almost certainly not altered *See also "Fixity Check"
Derivative image file – A version of a file which has been derived from a master image file, often for access purposes *see also “Master file”
Descriptive metadata – Metadata created for discovery and identification. Examples of descriptive metadata include: shelfmark, date, and creator
Digital file - Binary information that is available to a computer program
Digital material - A generic term which can refer to either a Digital File or to a Digital Object *see also “Digital file” and “Digital object”
Digital object - A conceptual term that describes an aggregated unit of digital content comprised of one or more related digital files. These related files might include metadata, master files and/or a wrapper to bind the pieces together *see also “digital file”, “digital material” and “bitstream”

Digital preservation – A series of activities, processes and policies used for ensuring continued access, usability and reliability of digital information
Digital signature - A method to authenticate digital materials that consists of an encrypted digest of the file being signed. The digest is an algorithmically-computed numeric value based on the contents of the file. It is then encrypted with the private part of a public/private key pair. To prove that the file was not tampered with, the recipient uses the public key to decrypt the signature back into the original digest, recomputes a new digest from the transmitted file and compares the two to see if they match. If they do, the file has not been altered in transit by an attacker
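A sketch of this sign-and-verify flow, assuming the third-party Python cryptography package; the file name is a placeholder:

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

data = open("record.xml", "rb").read()
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

# Signing: digest the file, then encrypt the digest with the private key.
signature = private_key.sign(data, pss, hashes.SHA256())

# Verification: raises InvalidSignature if the file was altered in transit.
public_key.verify(signature, data, pss, hashes.SHA256())
```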
Digitization - The act of creating a binary representation of an analogue source object
Digitized - Digital file(s) generated from an analogue equivalent. *see also "born digital"
Disk image - A copy of the entire contents of a storage device, such as a hard drive, DVD, or CD. The disk image represents the content exactly as it is on the original storage device, including both data and structure information
Emulation - A means of overcoming technological obsolescence of hardware and software by developing techniques for imitating obsolete systems on contemporary generations of computers
File format conversion - *see "file format migration"
File format migration - A means of overcoming software obsolescence, by converting files into formats which the hosting institution is able to support and render. File format migration is also referred to as “file format conversion” by some groups within the University of Oxford and can be used interchangeably
File store – A system for delineating pieces of information, and controlling how digital material is stored and retrieved
Fixity check - The process of ensuring that digital files have not been changed without prior authorization. Changes to files may occur due to human error or transmission errors *see also “Checksum”
Handle –*see “persistent identifier”
Hash – *see checksum
Image hashing - A way of creating a fingerprint of an image based on its visual appearance, using a mathematical algorithm. The outcome is a pixel hash. Visually similar image files will have similar pixel hashes *see also "pixel hash"
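A brief sketch, assuming the third-party ImageHash and Pillow Python packages (one of several perceptual-hashing toolchains) and placeholder file names:

```python
from PIL import Image  # pip install Pillow
import imagehash       # pip install ImageHash

# Perceptual hashes of visually similar images differ in only a few bits,
# even when the underlying bit streams (and checksums) are entirely different.
hash_tiff = imagehash.average_hash(Image.open("master.tif"))
hash_jpeg = imagehash.average_hash(Image.open("access.jpg"))

print(hash_tiff - hash_jpeg)  # Hamming distance; small means visually similar
```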
Ingest - The process through which digital objects are added into a managed environment *see also “repository system”
Intellectual entity – A set of content which is considered a single intellectual unit for the purposes of management and description. An intellectual entity may have several representations. *see also "Representation"
Long-term - A period long enough to raise concern about the effect of changing technologies, including support for new media and data formats, and of changing user needs
Long-term preservation - The act of maintaining correct and independently understandable information over the long term *see also "long term"
Lossless compression – A compression method which allows reconstruction of the original data without any quality loss
Lossy compression – A compression method which discards some data. The original data cannot be restored without some quality loss
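The defining property of lossless compression, a bit-for-bit reversible round trip, can be demonstrated with Python's standard zlib module; lossy codecs such as JPEG offer no equivalent guarantee:

```python
import zlib

original = b"Digital preservation " * 100
compressed = zlib.compress(original)   # lossless DEFLATE compression
restored = zlib.decompress(compressed)

assert restored == original            # bit-for-bit identical round trip
print(len(original), len(compressed))  # original size vs compressed size
```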
Master file - A source file from which subsequent file versions can be created. An example of a master file could be a high-quality TIFF file used to derive JPEG access copies. For this reason, preservation effort is generally targeted at master files rather than at derivative files, which can be regenerated from the source
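As an illustration of deriving an access copy from a master, the sketch below assumes the Pillow Python imaging library and placeholder file names:

```python
from PIL import Image  # pip install Pillow

# Derive a JPEG access copy from a TIFF master; the master is never altered,
# and the derivative can be regenerated from it at any time.
with Image.open("master.tif") as master:
    master.convert("RGB").save("access.jpg", quality=85)
```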
Media carrier (handheld) - A type of storage which is not networked or part of a “managed storage system”. Examples of handheld media carriers are: tape, cassettes, CDs, DVDs, and USB sticks
Metadata -The set of information required to enable content to be discovered, managed and used by both humans and automated systems. *see also: “descriptive metadata”, “technical metadata”, “rights metadata”, “preservation metadata”, “provenance metadata”, “tracking metadata”, and “structural metadata”
Normalization (file formats) – The process of migrating digital files to new file formats at the point of ingest into a managed preservation environment. The purpose of normalization is to minimize the number of formats managed by an organization. *see also "ingest" and "file format migration"
Object management – Management of digital objects, their relationships and intellectual integrity *see also “Digital object”
Persistent identifier - A long-lasting/persistent set of characters used to uniquely identify a digital file or a digital object *see also “UUID”
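For example, a UUID (one common form of identifier) can be generated with Python's standard library; note that persistence comes from the organizational commitment to maintain the identifier, not from the identifier itself:

```python
import uuid

# Generate a random (version 4) UUID to assign to a new digital object.
identifier = uuid.uuid4()
print(identifier)  # e.g. '9f2c4e9e-5b3a-4f5e-9a1d-2c7e8b1f0a6d'
```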
Pixel hash - A hash value created by image hashing, using the visual content of an image file rather than the bit stream. *see also “image hashing” and “hash”
Preservation metadata – Information which supports and records digital preservation processes. In the context of the GLAM digital preservation project, preservation metadata is an umbrella term which refers to four subsets of metadata: provenance metadata, rights metadata, technical metadata, and structural metadata
*see also the entries for “Provenance metadata”, “Rights metadata”, “Technical metadata”, and “Structural metadata”
Process documentation - Step-by-step documentation about how an action or activity is undertaken. Process documentation is often updated to keep it relevant as technologies and systems change
Provenance metadata - Information about the origin of a digital object and about any changes to it that have occurred while under the management of the digital repository. Provenance metadata includes (but is not limited to) information about file format migration, date of creation, the generation of checksums, and file format validation *see also "versioning" and "preservation metadata"
Refreshing - Copying information content from one storage media to the same or another storage media
Representation - A representation is a distinct manifestation of an Intellectual Entity. An Intellectual Entity could be a student thesis which has two representations (Representation 1: a PDF document. Representation 2: a DOCX file version of the same document). *see also “Intellectual Entity”
Repository system - A system in which digital objects are stored for possible subsequent access, retrieval and management.
Rights metadata - In the context of digital preservation, rights metadata records information about the intellectual rights to a digital object and system rights to access, view, and edit content
Schema - A formal description of a data structure. For XML, for example, a schema is a common way of defining the structure, elements, and attributes that are available for use in an XML document that complies with the schema *see also "XML"
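As an illustration of validating a document against an XML schema, the sketch below assumes the third-party lxml Python library and placeholder file names:

```python
from lxml import etree  # pip install lxml

# Validate a metadata document against an XML Schema (XSD).
schema = etree.XMLSchema(etree.parse("metadata-profile.xsd"))
document = etree.parse("record.xml")

print(schema.validate(document))  # True if the document conforms
for error in schema.error_log:    # details of any violations
    print(error.message)
```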
Storage migration - The process of copying content from one generation or configuration of digital data storage onto an updated generation or configuration
Structural metadata – Metadata used to describe relationships between digital files or other digital material comprising a complex digital object. A simple example of structural metadata is the mapping of page numbers within a digitized manuscript to the corresponding image files
Technical metadata - The term technical metadata is contested and has many different definitions (sometimes being used synonymously with provenance metadata and "digital preservation metadata"). However, it has been given a narrower definition in the context of GLAM to make metadata modelling simpler. When referring to technical metadata in GLAM, we refer exclusively to information which can be automatically extracted from a digital file, i.e. metadata which has been embedded into a digital file (such as EXIF metadata for 2D image files), as well as attributes such as the length and size of a digital file. As such, it is closely related to, and complements, provenance metadata collected by the digital repository. *see also "Preservation metadata" and "characterisation"
Tracking metadata – Administrative information for tracking and managing physical material, digital material, as well as processes during digitization projects. Tracking metadata may include information about the current location of physical collection items and the progress of current workflow steps
Transformation metadata - *see “provenance metadata”
Validation – The process of ensuring that data is correct and useful when checked against a set of data validation rules. These might include rules for package or file structure or specific file format profiles
Versioning (file system) - A versioning file system allows a computer file to exist in several versions at the same time, by keeping old copies of a file which has been edited. Versioning supports a repository’s ability to trace the provenance of a digital file *see also “provenance” and “file store”
Workflow - A defined sequence of tasks performed by either humans or software agents
Wrapper - A data structure or software that encapsulates (“wraps around”) other data or software objects, appends code or other software for the purposes of improving user convenience, hardware or software compatibility, or enhancing data security, transmission or storage
_____________________________________________________________________________
DCC - Digital Curation Centre
COPTR - Community Owned digital Preservation Tool Registry
DOI - Digital Object Identifier
DPC - The Digital Preservation Coalition - http://www.dpconline.org/
DROID - Digital Record Object Identification
EXIF – Exchangeable Image File Format
ISAD(G) - General International Standard Archival Description
JHOVE - JSTOR/Harvard Object Validation Environment (http://jhove.openpreservation.org/)
METS – Metadata Encoding and Transmission Standard
MIX - Metadata for Images in XML (NISO technical metadata standard for digital still images)
OPF - Open Preservation Foundation
PREMIS - Preservation Metadata Implementation Strategies
PRONOM - Technical registry service created by The National Archives (UK)
UUID – Universally Unique Identifier
XML - eXtensible Markup Language
XMP – Extensible Metadata Platform (ISO 16684-1:2012)