Tuesday, March 15, 2016

Digital Preservation Issues Surrounding Institutional Repositories

Digital preservation is not my area of specialization, but I'm currently taking a course on it towards my Data Curation specialization. As part of a class project I decided to explore the preservation issues surrounding institutional repositories and storage architecture, since the topic is closely related to my current job at the Scholarly Communication and Repository Services. The outcome is an analysis of existing preservation systems and challenges based on a literature review. Since I've put some solid hours into writing this article, I thought I might just post it on my blog as well. It may help an institution identify the preservation issues it would like to focus on and the frameworks that might prove helpful for its particular case. The main article is posted below.
--------------------------------------------------------------------------------------------------------------------------

As the use of digital media and digital documents for scholarly communication has become more common over the past two decades, concern about preserving these different forms of digital content has grown with it. Hedstrom (1997) defined digital preservation as “the planning, resource allocation, and application of preservation methods and technologies necessary to ensure that digital information of continuing value remains accessible and usable.” The author further noted that the field of digital preservation at that time lacked standards and was shaped more by the needs and strategies of repositories than by the accessibility requirements of current and future users [3]. Since then, institutional repositories and storage architectures have evolved to accommodate the needs of users, and repository frameworks are integrating more advanced technologies to preserve digital content and provide better access over longer periods. Chowdhury (2010) describes the modern digital library as “a space – a centre of intellectual activities – with content, available in different forms and formats in a distributed network environment, as well as tools and facilities for user-centric access, use, interactions, collaborations and sharing.” [1] Hence the focus has shifted from systems and content to interactive use and sharing in a networked environment.
            To understand this transition in the field of digital preservation, this paper focuses on the following topics:
  • Role of repositories in digital preservation
  • Different types of repository architecture
  • Current challenges and suggestions
  • Preservation approaches by different institutions 
Role of Repositories:
The term “institutional repository” refers to a digital collection that identifies and preserves the intellectual output of a single- or multi-university community. Even though most repositories are based on e-prints, they can potentially include research data, learning material, image collections and many other types of content [4]. The role of repositories in long-term preservation has been a matter of debate. Some argue that the purposes of open access institutional repositories are chiefly access, usage and impact, while preservation of an institution's published journal articles is already in other hands, such as publishers and legal deposit libraries. Others hold just as strongly that institutional repositories play a major role in managing and preserving an institution's knowledge base.
            According to Wheatley (2004), the key requirements and aims for effective preservation by repositories are as follows –
  1. Data can be maintained in the repository without being damaged, lost or maliciously altered.
  2. Data can be found, extracted from the archive and served to a user.
  3. Data can be interpreted and understood by the user.
  4. Goals 1, 2 and 3 can be achieved in the long term.

To achieve these goals he proposed the following design considerations for institutional repositories to provide long-term preservation [12].
  • Unique identifier – to help locate the object
  • Ingest – automatically capture metadata of the object to lower cost and effort
  • Representation Information – provide information on how to gain access to the object
  • Technology watch – a function that monitors Representation Information and related rendering capabilities
  • Rendering – to turn a bytestream into meaningful information, or to gain access to the intellectual content encapsulated in the raw data through methods such as migration and emulation
  • Overall repository structure – to ensure the repository survives technological challenges.

Many of these proposed designs are yet to be implemented, but putting these structural components in place would make repositories well suited to long-term preservation.
            A fine example of how such repositories or hybrid libraries can benefit preservation by increasing accessibility is given by Malinconico (2002) [8]. Oxford University holds a collection of 30,000 ballads, including digitized broadside ballads. Broadside ballads are popular, inexpensive songs that were printed on single sheets and sold on the streets of Britain between the 16th and 20th centuries. However, the same song was often issued under different titles, so a researcher needs access to the sheets themselves to match a particular reference to the right song. By providing access to digital images of these broadside ballads, Oxford University offers an important verification tool in addition to preserving the digital surrogates for the long term.
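To make the first two design considerations more concrete, here is a minimal sketch of an ingest step that assigns a unique identifier, captures basic metadata automatically, and records a checksum so that damage or alteration (goal 1) can be detected later. It is only an illustration and is not taken from Wheatley's paper: the names `IngestRecord`, `ingest` and `is_intact` are hypothetical, and a real repository would also store the record somewhere durable.

```python
import hashlib
import mimetypes
import uuid
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class IngestRecord:
    """Hypothetical record captured automatically at ingest."""
    identifier: str   # unique identifier used to locate the object
    path: str
    size_bytes: int
    media_type: str
    sha256: str       # fixity value recorded at ingest time


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large objects need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest(path: Path) -> IngestRecord:
    """Capture identifier, size, guessed media type and checksum."""
    media_type, _ = mimetypes.guess_type(path.name)
    return IngestRecord(
        identifier=uuid.uuid4().hex,
        path=str(path),
        size_bytes=path.stat().st_size,
        media_type=media_type or "application/octet-stream",
        sha256=sha256_of(path),
    )


def is_intact(record: IngestRecord) -> bool:
    """Fixity check: recompute the checksum and compare with the stored one."""
    return sha256_of(Path(record.path)) == record.sha256
```

Running `is_intact` on a schedule is one simple way to notice silent corruption; it does not by itself address rendering or technology watch, which need information about formats rather than bytes.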

Repository Frameworks:
Most commonly used repository software is open source, though the packages vary in functionality and scalability. Madalli et al. (2012) conducted a comparative analysis of nine such frameworks:
• CDS-Invenio (Switzerland)
• DoKS (Belgium)
• DSpace (USA)
• EPrints (UK)
• FEDORA (USA)
• Greenstone (New Zealand)
• MyCoRe (Germany)
• OPUS (Germany)
• SciX (Slovenia).
Some of the more recent developments include Hydra and Islandora, which build on top of Fedora. DSpace and EPrints are widely used because they are easy to implement and require less developer work, but they are mainly built to support research publications or e-prints rather than to preserve high volumes of digitized images. For example, while DSpace and EPrints only support Dublin Core metadata, Fedora supports more metadata formats, such as METS, MPEG-21 DIDL, IEEE LOM, MARC, FOXML, and ATOM. Fedora is also currently changing its metadata model to a linked data format to provide better accessibility and the ability to link a collection with other existing collections on the Web. Most importantly, it provides better scalability and storage options, although it requires more effort from developers. As mentioned by the authors, “To a large extent FEDORA supports more features that are essential from a digital preservation point of view, but it lacks a user-friendly interface; hence, there are not many installations of FEDORA” [7].
            The project by Schumacher et al. (2014) addresses a related question: which digital preservation platforms are appropriate for under-resourced cultural heritage institutions [11]. The authors developed an evaluative rubric based on the intersection of the Digital Curation Centre’s digital curation lifecycle and the OAIS Reference Model, and selected six solutions, including both freely available open-source tools and vendor-based applications:
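For readers who have not seen Dublin Core, the sketch below builds a small record using only the Python standard library. The element names (`title`, `creator`, `date`, `type`) and the namespace URI are from the Dublin Core element set; the `record` wrapper element and the `dublin_core_record` helper are assumptions made for illustration and do not reflect how DSpace, EPrints or Fedora store metadata internally. The sample values come from reference 3 of this article.

```python
import xml.etree.ElementTree as ET

# Namespace of the fifteen-element Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields: dict[str, list[str]]) -> ET.Element:
    """Build a flat record; 'record' is a hypothetical wrapper element."""
    record = ET.Element("record")
    for name, values in fields.items():
        for value in values:  # Dublin Core elements are repeatable
            element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
            element.text = value
    return record


record = dublin_core_record({
    "title": ["Digital preservation: a time bomb for digital libraries"],
    "creator": ["Hedstrom, M."],
    "date": ["1997"],
    "type": ["Text"],
})
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

The flat, repeatable structure is what makes Dublin Core easy to adopt, and also why richer formats such as METS are used when structural or technical metadata has to be recorded.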

Archivematica
Curator’s Workbench
DuraCloud
Internet Archive
MetaArchive
Preservica

Figure 1 List of preservation solutions




Figure 2 Evaluation rubrics by Schumacher et al. (2014)

Current Challenges and Suggestions:
Though important, current digital preservation practice faces a number of challenges in real-life implementation. Some of the main challenges identified in existing studies are as follows:

  • Lack of financial and technical resources to support the preservation and management of digital objects is a major issue for small and medium-sized institutions. As mentioned by Kay Rinehart et al. (2014), “When DP is not part of existing position descriptions, and there is no funding to hire an expert, how does an institution begin to tackle the challenge? The pressure to start working on it immediately, before any content is lost, just adds to the impression of impossibility.” [6] More collaborative work and strategic development between larger and smaller institutions is required for the greater benefit of scholarly communication. Studies similar to the one by Schumacher et al. (2014) will also prove beneficial; there the authors identified different preservation solutions for different groups of institutions based on the availability of financial and technical resources. They recommend that the smallest institutions begin with Data Accessioner for triage and upload public domain materials to Internet Archive for public access and long-term storage, that those with some financial resources explore Preservica, and that those with fewer financial resources but decent technical resources explore Archivematica.
  •  The creators of the widely used LOCKSS system [9] note that “the key problem in the design of digital preservation systems is that the period of time is very long, much longer than the lifetime of individual storage media, hardware and software components, and the formats in which the information is encoded”. To address this issue they suggest taking a bottom-up rather than a top-down approach by focusing on “what the system should not do, in terms of losing data or delaying access under specific types of failures.” [10]
  •  Maintaining the authenticity and security of materials can be challenging for multi-institutional repositories such as HathiTrust [13]. The same material is often uploaded by multiple organizations, resulting in redundancy. Further technical advances should allow duplicate detection to reduce this problem in large-scale digital libraries.
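As a rough illustration of the duplicate detection mentioned in the last point, the sketch below groups deposits by a checksum of their content. This is an assumption about how such a check could work, not a description of what HathiTrust does, and the function name `find_duplicates` is hypothetical. It only catches byte-identical copies: two different scans of the same book would produce different checksums and would need a comparison based on metadata or content instead.

```python
import hashlib
from collections import defaultdict


def find_duplicates(deposits):
    """Group deposits that have identical content.

    deposits: iterable of (institution, filename, content_bytes) tuples.
    Returns a dict mapping each SHA-256 digest that occurs more than once
    to the list of (institution, filename) pairs sharing that content.
    """
    by_digest = defaultdict(list)
    for institution, filename, content in deposits:
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append((institution, filename))
    return {d: copies for d, copies in by_digest.items() if len(copies) > 1}


deposits = [
    ("Library A", "volume1.pdf", b"same scanned volume"),
    ("Library B", "vol_1.pdf", b"same scanned volume"),
    ("Library C", "volume2.pdf", b"a different volume"),
]
duplicates = find_duplicates(deposits)
```

Here the files from Library A and Library B end up in the same group even though their filenames differ, because only the content is hashed.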


Preservation Approaches by Different Institutions:
At the University of Illinois at Urbana-Champaign, the library has an institutional repository built upon DSpace to preserve research publications by the campus community. Besides that, it supports a more robust digital preservation platform, Medusa [5], which uses Fedora for the repository layer, Apache Solr for full-text indexing, and Hydra tools (ActiveFedora, Solrizer, and OM) for creating and managing objects in the codebase. Developed solely by the Scholarly Communication and Repository Services at the Main Library, the Medusa workflow model is based on the Archivematica project but is written entirely in Ruby and designed to be deployed in a distributed environment. It uses PREMIS as its metadata schema.
            The Bodleian Libraries at the University of Oxford have recently launched their new digital library, Digital Bodleian, to make their extraordinary collections available online for the very first time [2]. As prominent institutions give more attention to digital preservation, the current challenges can be expected to ease in the near future.


References:
  1. Chowdhury, G. (2010). From digital libraries to digital preservation research: the importance of users and context. Journal of Documentation, 66(2), 207-223.
  2. Digital Bodleian. (n.d.). Retrieved March 14, 2016, from http://digital.bodleian.ox.ac.uk/
  3. Hedstrom, M. (1997). Digital preservation: a time bomb for digital libraries. Computers and the Humanities, 31(3), 189-202.
  4. Hockx-Yu, H. (2006). Digital preservation in the context of institutional repositories. Program, 40(3), 232-243.
  5. Ingram, B. (n.d.). Medusa - Hydra - DuraSpace Wiki. Retrieved March 14, 2016, from https://wiki.duraspace.org/display/hydra/Medusa
  6. Kay Rinehart, A., Prud'homme, P. A., & Reid Huot, A. (2014). Overwhelmed to action: digital preservation challenges at the under-resourced institution. OCLC Systems & Services, 30(1), 28-42.
  7. Madalli, D. P., Barve, S., & Amin, S. (2012). Digital preservation in open-source digital library software. The Journal of Academic Librarianship, 38(3), 161-164.
  8. Malinconico, S. M. (2002). Digital preservation technologies and hybrid libraries. Information Services and Use, 22(4), 159-174.
  9. Reich, V., & Rosenthal, D. S. (2001). LOCKSS: A permanent web publishing and access system. D-Lib Magazine, 7(6), 14.
  10. Rosenthal, D. S., Robertson, T. S., Lipkis, T., Reich, V., & Morabito, S. (2005). Requirements for digital preservation systems: A bottom-up approach. arXiv preprint cs/0509018.
  11. Schumacher, J., Thomas, L. M., VandeCreek, D., Erdman, S., Hancks, J., Haykal, A., & Spalenka, D. (2014). From Theory to Action: “Good Enough” Digital Preservation Solutions for Under-Resourced Cultural Heritage Institutions.
  12. Wheatley, P. (2004). Institutional repositories in the context of digital preservation. Microform & Imaging Review, 33(3), 135-146.
  13. York, J. (2009, January). This library never forgets: preservation, cooperation, and the making of HathiTrust digital library. In Archiving Conference (Vol. 2009, No. 1, pp. 5-10). Society for Imaging Science and Technology.
