Digital preservation is not my area of specialization, but I'm currently taking a course on it towards my Data Curation specialization. As part of a class project, I decided to explore the preservation issues surrounding institutional repositories and storage architecture, since they are closely related to my current job at Scholarly Communication and Repository Services. The outcome is an analysis of existing preservation systems and challenges, based on a literature review. Since I've put some solid hours into writing this article, I thought I might as well post it on my blog. It may help an institution identify the preservation issues it would like to focus on and the frameworks that might prove helpful for its particular case. The main article follows below.
--------------------------------------------------------------------------------------------------------------------------
As the use of digital media and digital documents for scholarly communication has become more common during the past two decades, concern about preserving these different forms of digital content has grown alongside it. Hedstrom (1997) referred to digital preservation as “the planning, resource allocation, and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and usable.” The author further noted that the field of digital preservation at that time lacked standards and was shaped more by the needs and strategies of repositories than by the accessibility requirements of current and future users [3]. Since then, the types of institutional repositories and storage architecture have evolved to accommodate the needs of users, and different repository frameworks are integrating more advanced technologies to preserve digital content and provide better access over longer periods. Chowdhury (2010) has referred to modern digital libraries as “a space – a centre of intellectual activities – with content, available in different forms and formats in a distributed network environment, as well as tools and facilities for user-centric access, use, interactions, collaborations and sharing.” Hence the focus has now shifted from systems and content to interactive use and sharing in a networked environment.
To understand the transition in the field of digital preservation, this paper focuses on the following topics:
- Role of repositories in digital preservation
- Different types of repository architecture
- Current challenges and suggestions
- Preservation approaches by different institutions
Role of Repositories:
The term “institutional repository” refers to digital collections for identifying and preserving the intellectual output of a single- or multi-university community. Even though most repositories are based on e-prints, they can potentially include research data, learning material, image collections and many other types of content [4]. The role of repositories in long-term preservation has been a matter of debate. Some argue that the purposes of open access institutional repositories are chiefly access, usage and impact, while preservation of institutions’ published journal articles is already in other hands, such as the publishers and the legal deposit libraries. Others, however, strongly argue that institutional repositories play a major role in managing and preserving an institution’s knowledge base.
According to Wheatley (2004), the key requirements and aims for effective preservation by repositories are as follows:
1. Data can be maintained in the repository without being damaged, lost or maliciously altered.
2. Data can be found, extracted from the archive and served to a user.
3. Data can be interpreted and understood by the user.
4. Goals 1, 2 and 3 can be achieved in the long term.
To achieve these goals, he proposed the following design considerations for institutional repositories to provide long-term preservation [12].
- Unique identifier: to help locate the object
- Ingest: automatically capture metadata of the object to lower cost and effort
- Representation Information: provide information on how to gain access to the object
- Technology watch: a function that monitors Representation Information and related rendering capabilities
- Rendering: to turn a bytestream into meaningful information, or to gain access to the intellectual content encapsulated in the raw data through methods such as migration and emulation
- Overall repository structure: to ensure the repository survives technological challenges
Many of these proposed design elements have yet to be implemented, but putting these structural components in place would make repositories well suited to long-term preservation.
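To make the first goal (data is not damaged, lost or altered) and the "Unique identifier" and "Ingest" considerations a little more concrete, here is a minimal sketch in Python. It is only an illustration under my own assumptions, not something taken from Wheatley's paper: the record layout and the function names `ingest` and `is_intact` are hypothetical, and SHA-256 is just one common choice of checksum.

```python
import hashlib
import uuid
from pathlib import Path


def ingest(path):
    """Build a minimal ingest record for one file (hypothetical record layout)."""
    data = Path(path).read_bytes()
    return {
        # Unique identifier, so the object can be located later.
        "identifier": str(uuid.uuid4()),
        # Metadata captured automatically at ingest, with no manual entry.
        "filename": Path(path).name,
        "size_bytes": len(data),
        # Fixity value: a checksum of the bytes as they were at ingest time.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def is_intact(path, record):
    """Return True if the file still matches the checksum stored at ingest."""
    current = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return current == record["sha256"]
```

A repository could run a check like `is_intact` on a schedule and flag any object whose checksum no longer matches. Note that this only detects damage or alteration; recovering from it still requires additional copies stored elsewhere.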
A fine example of how such repositories or hybrid libraries can benefit preservation by increasing accessibility is given by Malinconico (2002). Oxford University holds a ballad collection consisting of 30,000 ballads, including digitized broadside ballads. Broadside ballads are single-sheet songs, popular and inexpensive, that were sold on the streets of Britain between the 16th and 20th centuries. However, the same song was often issued under different titles. Therefore, a researcher needs access to the sheet music to match a particular reference against the songs. By providing access to the digital images of these broadside ballads, Oxford University provides an important verification tool in addition to preserving the digital surrogates for the long term.
Repository Frameworks:
Most of the commonly used repository software packages are open source, though they vary in functionality and scalability. Madalli et al. (2012) conducted a comparative analysis of nine such frameworks:
• CDS-Invenio (Switzerland)
• DoKS (Belgium)
• DSpace (USA)
• EPrints (UK)
• FEDORA (USA)
• Greenstone (New Zealand)
• MyCoRe (Germany)
• OPUS (Germany)
• SciX (Slovenia)
Some of the more recent developments include Hydra and Islandora, which build on top of the Fedora model. DSpace and EPrints are widely used because they are easy to implement and require less developer effort, but they are mainly built to support research publications or e-prints rather than to preserve high volumes of digitized images. For example, while DSpace and EPrints support only Dublin Core metadata, Fedora supports more metadata formats, such as METS, MPEG-21 DIDL, IEEE LOM, MARC, FOXML, and ATOM. Fedora is also currently changing its metadata model to a linked data format to provide better accessibility and the ability to link a collection with other existing collections on the Web. Most importantly, it provides better scalability and storage options, although it requires more effort from developers. As mentioned by the authors, “To a large extent FEDORA supports more features that are essential from a digital preservation point of view, but it lacks a user-friendly interface; hence, there are not many installations of FEDORA” [7].
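For readers who have not seen Dublin Core before, the sketch below shows roughly what a simple Dublin Core description looks like when serialized as XML. This is my own illustration, not the export format of DSpace, EPrints or Fedora: the `record` wrapper element and the `dublin_core_record` function are hypothetical, and only the Dublin Core element namespace and element names (`title`, `creator`, `date`, `type`) come from the standard. The sample values are taken from the Hedstrom (1997) entry in the reference list.

```python
import xml.etree.ElementTree as ET

# Namespace of the 15 basic Dublin Core elements.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize a dict of Dublin Core element names and values to XML text."""
    root = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


xml_text = dublin_core_record({
    "title": "Digital preservation: a time bomb for digital libraries",
    "creator": "Hedstrom, M.",
    "date": "1997",
    "type": "Text",
})
print(xml_text)
```

The flat list of elements is what makes Dublin Core easy to adopt, and it also hints at why richer formats such as METS or PREMIS are used when structural or preservation metadata has to be recorded.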
The project by Schumacher et al. (2014) addresses this issue by identifying appropriate digital preservation platforms for under-resourced cultural heritage institutions. They developed an evaluative rubric based on the intersection of the Digital Curation Centre’s digital curation lifecycle and the OAIS Reference Model, and selected six tools, including both freely available open-source solutions and vendor-based applications:
- Archivematica
- Curator’s Workbench
- DuraCloud
- Internet Archive
- MetaArchive
- Preservica
Figure 1: List of preservation solutions
Figure 2: Evaluation rubrics by Schumacher et al. (2014)
Current Challenges and Suggestions:
Though important, current digital preservation practice faces several challenges in real-life implementation. Some of the main challenges observed in existing studies are as follows:
- Lack of financial and technical resources to support preservation and management of digital objects is a major issue for small and medium-sized institutions. As mentioned by Kay Rinehart et al. (2014), “When DP is not part of existing position descriptions, and there is no funding to hire an expert, how does an institution begin to tackle the challenge? The pressure to start working on it immediately, before any content is lost, just adds to the impression of impossibility.” More collaborative work and strategic development between larger and smaller institutions is required for the greater benefit of scholarly communication. Studies similar to the one by Schumacher et al. (2014) will also prove beneficial; there the authors identified different preservation solutions for different groups of institutions based on the availability of financial and technical resources. The authors recommend that the smallest institutions begin with Data Accessioner for triage and upload public domain materials to the Internet Archive for public access and long-term storage, that those with some financial resources explore Preservica, and that those with fewer financial resources but decent technical resources explore Archivematica.
- The creators of the widely used LOCKSS system [9] noted that “the key problem in the design of digital preservation systems is that the period of time is very long, much longer than the lifetime of individual storage media, hardware and software components, and the formats in which the information is encoded”. To address this issue, they suggest taking a bottom-up approach rather than a top-down one, focusing on “what the system should not do, in terms of losing data or delaying access under specific types of failures.” [10]
- Maintaining the authenticity and security of materials can be challenging for multi-institutional repositories such as HathiTrust [13]. The same material is often uploaded by multiple organizations, resulting in redundancy. Further technical advancements should allow duplicate checks that avoid such problems in large-scale digital libraries.
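As a rough illustration of what a duplicate check in the last point could look like, here is a minimal Python sketch that groups deposits by a checksum of their content. It is my own assumption of one simple approach, not a description of how HathiTrust or any other repository actually works, and the `find_duplicates` function and the sample institution names are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_duplicates(deposits):
    """Group deposits by content hash and return only the groups with repeats.

    `deposits` is an iterable of (institution, filename, content_bytes) tuples.
    """
    by_hash = defaultdict(list)
    for institution, filename, content in deposits:
        digest = hashlib.sha256(content).hexdigest()
        by_hash[digest].append((institution, filename))
    # Keep only hashes that were deposited more than once.
    return {h: items for h, items in by_hash.items() if len(items) > 1}


deposits = [
    ("Library A", "report.pdf", b"same bytes"),
    ("Library B", "annual-report.pdf", b"same bytes"),
    ("Library C", "thesis.pdf", b"different bytes"),
]
duplicates = find_duplicates(deposits)
```

A check like this only catches byte-identical files. Two separate scans of the same physical book would produce different checksums, so detecting that kind of redundancy needs metadata matching or content comparison on top of it.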
Preservation Approaches by Different Institutions:
At the University of Illinois at Urbana-Champaign, the library has an institutional repository built upon DSpace to preserve research publications by the campus community. Besides that, it supports a more robust digital preservation platform, Medusa [5], which uses Fedora for the repository layer, Apache Solr for full-text indexing, and Hydra tools (ActiveFedora, Solrizer, and OM) for creating and managing objects in the codebase. Developed solely by Scholarly Communication and Repository Services at the Main Library, the Medusa workflow model is based on the Archivematica project but is written entirely in Ruby and designed to be deployed in a distributed environment. It uses PREMIS as its metadata schema.
The Bodleian Libraries at the University of Oxford have recently launched their new digital library, Digital Bodleian, to make their extraordinary collections available online for the very first time [2]. As prominent institutions give more attention to digital preservation, the current challenges are expected to be overcome in the near future.
References:
1. Chowdhury, G. (2010). From digital libraries to digital preservation research: the importance of users and context. Journal of Documentation, 66(2), 207-223.
2. Digital Bodleian. (n.d.). Retrieved March 14, 2016, from http://digital.bodleian.ox.ac.uk/
3. Hedstrom, M. (1997). Digital preservation: a time bomb for digital libraries. Computers and the Humanities, 31(3), 189-202.
4. Hockx-Yu, H. (2006). Digital preservation in the context of institutional repositories. Program, 40(3), 232-243.
5. Ingram, B. (n.d.). Medusa - Hydra - DuraSpace Wiki. Retrieved March 14, 2016, from https://wiki.duraspace.org/display/hydra/Medusa
6. Kay Rinehart, A., Prud'homme, P. A., & Reid Huot, A. (2014). Overwhelmed to action: digital preservation challenges at the under-resourced institution. OCLC Systems & Services, 30(1), 28-42.
7. Madalli, D. P., Barve, S., & Amin, S. (2012). Digital preservation in open-source digital library software. The Journal of Academic Librarianship, 38(3), 161-164.
8. Malinconico, S. M. (2002). Digital preservation technologies and hybrid libraries. Information Services and Use, 22(4), 159-174.
9. Reich, V., & Rosenthal, D. S. (2001). LOCKSS: A permanent web publishing and access system. D-Lib Magazine, 7(6), 14.
10. Rosenthal, D. S., Robertson, T. S., Lipkis, T., Reich, V., & Morabito, S. (2005). Requirements for digital preservation systems: A bottom-up approach. arXiv preprint cs/0509018.
11. Schumacher, J., Thomas, L. M., VandeCreek, D., Erdman, S., Hancks, J., Haykal, A., & Spalenka, D. (2014). From Theory to Action: “Good Enough” Digital Preservation Solutions for Under-Resourced Cultural Heritage Institutions.
12. Wheatley, P. (2004). Institutional repositories in the context of digital preservation. Microform & Imaging Review, 33(3), 135-146.
13. York, J. (2009, January). This library never forgets: preservation, cooperation, and the making of HathiTrust digital library. In Archiving Conference (Vol. 2009, No. 1, pp. 5-10). Society for Imaging Science and Technology.