Authors: Bernhard Haslhofer, Universität Wien, firstname.lastname@example.org; Prof. Dr. Franz Wirl, Research Studios Austria, email@example.com
In digital long-term preservation, the question of a media object's file format comes up again and again. Depending on the format, strategies must be developed that make it possible to access archived content even in the distant future. This report presents JHOVE, a Java tool for identifying, validating, and characterizing file formats.
“Policy and processing decisions regarding object ingest, storage, access, and preservation are frequently conditioned on a per-format basis. In order to achieve necessary operational efficiencies, repositories need to be able to automate these procedures to the fullest extent possible” [1]. JHOVE [1] is a joint project of JSTOR [2] and the Harvard University Library [3] to develop an “extensible framework for format validation” [1]. The software is released under the GNU Lesser General Public License (LGPL) [4]. Coming from the field of digital libraries, JHOVE was designed to answer questions that often arise in preservation scenarios, such as “I have a digital object; what format is it?” or “I got a digital object that should be of a certain format; is it?”.
To answer these questions, JHOVE provides the following functions:
- Format Identification: the process of retrieving information about the format of a digital object.
- Format Validation: determining whether a digital object conforms to its purported format. This process has two stages: well-formedness and validity. A digital object is well-formed if it meets the syntactic requirements of its format. It is valid if it is well-formed and also meets the format’s semantic requirements.
- Format Characterization: reporting a digital object’s format-specific properties. JHOVE delivers this information as representation information, a concept introduced by the Open Archival Information System (OAIS) reference model [5]. A representation information response includes properties such as MIME type, last modification date, and format profiles, as well as integrity checks such as CRC32 or MD5 [6].
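The three functions above can be illustrated with a small sketch for GIF data. It is not JHOVE code: the class `FormatSketch` and its methods are hypothetical names, and the well-formedness and validity rules are deliberately simplified toy rules, assumed here for illustration and far weaker than a real format module. Only the GIF signature and the standard Java classes `java.util.zip.CRC32` and `java.security.MessageDigest` are taken as given.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Hypothetical helper, NOT part of JHOVE: toy identification, two-stage
// validation, and characterization of a GIF byte stream.
class FormatSketch {

    // Identification: GIF data starts with the ASCII signature "GIF87a" or "GIF89a".
    static String identify(byte[] data) {
        if (data.length >= 6) {
            String signature = new String(data, 0, 6, StandardCharsets.US_ASCII);
            if (signature.equals("GIF87a") || signature.equals("GIF89a")) {
                return "image/gif";
            }
        }
        return "application/octet-stream";
    }

    // Stage 1, well-formedness (toy syntactic rule, an assumption): signature,
    // 7-byte logical screen descriptor, and a final trailer byte 0x3B are present.
    static boolean isWellFormed(byte[] data) {
        return identify(data).equals("image/gif")
                && data.length >= 14
                && (data[data.length - 1] & 0xFF) == 0x3B;
    }

    // Stage 2, validity (toy semantic rule, an assumption): the object is
    // well-formed and the little-endian width and height are both nonzero.
    static boolean isValid(byte[] data) {
        if (!isWellFormed(data)) {
            return false;
        }
        int width = (data[6] & 0xFF) | ((data[7] & 0xFF) << 8);
        int height = (data[8] & 0xFF) | ((data[9] & 0xFF) << 8);
        return width > 0 && height > 0;
    }

    // Characterization: CRC32 checksum of the whole byte stream.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data);
        return crc.getValue();
    }

    // Characterization: MD5 digest of the whole byte stream as lowercase hex.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest(data)) {
                hex.append(String.format("%02x", b & 0xFF));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }
}
```

For a 14-byte stream made of the `GIF89a` signature, a 1-by-1 logical screen descriptor, and the trailer byte, `identify` returns `image/gif` and both checks pass. Setting the width to zero still satisfies the toy well-formedness check but fails the validity check, which is the distinction between the two stages.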
JHOVE for digital preservation
Since JHOVE covers functions commonly required for preserving digital objects and is fairly easy to integrate with any Java application, it has become a component of many preservation tools. Products such as PRONOM [7] from the UK National Archives [8] or the Metadata Extraction Tool [9] from the National Library of New Zealand/Te Puna Mātauranga o Aotearoa (NLNZ) [10] use JHOVE, especially for format validation. JHOVE can be integrated into repository workflows in line with the OAIS reference model, using the Submission Information Package (SIP), as shown in Figure 1.
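One way such an integration could look is an ingest gate that checks every file of a SIP before the package is accepted. The sketch below is hypothetical: `FormatValidator`, `Sip`, and `IngestGate` are names invented for illustration and belong neither to JHOVE nor to the OAIS model. A real workflow would delegate the validity check to JHOVE instead of the placeholder interface.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a format validator such as JHOVE (not its real API).
interface FormatValidator {
    boolean isValid(String fileName, byte[] content);
}

// Hypothetical, minimal Submission Information Package: an id plus named files.
class Sip {
    final String id;
    final Map<String, byte[]> files = new LinkedHashMap<>();

    Sip(String id) {
        this.id = id;
    }

    // Adds a file and returns this package so calls can be chained.
    Sip add(String fileName, byte[] content) {
        files.put(fileName, content);
        return this;
    }
}

class IngestGate {
    // Returns the names of all files that fail validation, in submission order.
    // An empty list means the SIP may proceed to the next ingest step.
    static List<String> rejectedFiles(Sip sip, FormatValidator validator) {
        List<String> rejected = new ArrayList<>();
        for (Map.Entry<String, byte[]> entry : sip.files.entrySet()) {
            if (!validator.isValid(entry.getKey(), entry.getValue())) {
                rejected.add(entry.getKey());
            }
        }
        return rejected;
    }
}
```

Keeping the validator behind a small interface is a design choice of this sketch: the repository code stays independent of the validation tool, so the tool can be swapped or upgraded without touching the ingest logic.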
Architecture and Implementation
JHOVE is implemented in Java (version 1.4) and is therefore platform independent. It ships with both a command-line interface and a graphical user interface. JHOVE includes modules for arbitrary byte streams; ASCII and UTF-8 encoded text; PDF, HTML, and XML documents; GIF, JPEG, JPEG 2000, and TIFF images; and AIFF and WAVE audio. It also includes text and XML output handlers. A full list of available modules can be found on the JHOVE website.
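The split between format modules and output handlers can be sketched as a small plug-in design. Everything in this example is an assumption made for illustration: `SketchModule`, `SketchHandler`, `Dispatcher`, and the concrete classes are hypothetical names, the ASCII check is a toy rule, and JHOVE's real module and handler interfaces differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical interfaces for illustration; JHOVE's real APIs differ.
interface SketchModule {
    String formatName();
    boolean accepts(byte[] data); // true if the bytes look like this format
}

interface SketchHandler {
    String render(Map<String, String> properties);
}

// Fallback module: every byte stream is acceptable.
class BytestreamModule implements SketchModule {
    public String formatName() { return "bytestream"; }
    public boolean accepts(byte[] data) { return true; }
}

// Toy rule: accepts data whose bytes all have the high bit clear (7-bit values).
class AsciiModule implements SketchModule {
    public String formatName() { return "ASCII"; }
    public boolean accepts(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false; // high bit set, so not 7-bit ASCII
            }
        }
        return true;
    }
}

// Renders properties as "key: value" lines.
class TextHandler implements SketchHandler {
    public String render(Map<String, String> properties) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, String> e : properties.entrySet()) {
            out.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return out.toString();
    }
}

// Renders properties as XML elements; keys are assumed to be valid element names.
class XmlHandler implements SketchHandler {
    public String render(Map<String, String> properties) {
        StringBuilder out = new StringBuilder("<object>");
        for (Map.Entry<String, String> e : properties.entrySet()) {
            out.append('<').append(e.getKey()).append('>')
               .append(escape(e.getValue()))
               .append("</").append(e.getKey()).append('>');
        }
        return out.append("</object>").toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}

class Dispatcher {
    // Tries the modules in the given order and reports the first one that accepts.
    static Map<String, String> characterize(byte[] data, SketchModule... modules) {
        Map<String, String> properties = new LinkedHashMap<>();
        String format = "unknown";
        for (SketchModule module : modules) {
            if (module.accepts(data)) {
                format = module.formatName();
                break;
            }
        }
        properties.put("format", format);
        properties.put("size", String.valueOf(data.length));
        return properties;
    }
}
```

Because modules only produce properties and handlers only render them, the same result can be written as plain text or XML without changing any format-specific code. Listing the most specific module first and the byte stream module last makes the fallback explicit.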
The future of JHOVE
Since JHOVE was first released three years ago, a good deal of user experience (e.g. from the Library of Congress) has been gathered. The insights gained will be implemented in JHOVE 2 [11], a project proposed in 2006. Future JHOVE releases will provide a more open API for generic processing modules, as well as enhancements to existing JHOVE functionality, such as reports in the METS [12] schema. Furthermore, it is planned to use the open source project DROID [13] for format identification, in order to identify “a significantly wider range of formats than it can validate” [14].
- [1] JHOVE, http://hul.harvard.edu/jhove/
- [2] JSTOR (Journal Storage), http://www.jstor.org/
- [3] Harvard University Library, http://hul.harvard.edu/
- [4] LGPL, http://www.gnu.org/licenses/licenses.html#LGPL
- [5] ISO/IEC 14721:2002, Space data and information transfer systems - Open archival information system - Reference Model; http://www.ccsds.org/documents/650x0b1.pdf
- [6] R. Rivest, The MD5 Message-Digest Algorithm, RFC 1321, April 1992; http://www.ietf.org/rfc/rfc1321.txt
- [7] PRONOM, http://www.nationalarchives.gov.uk/pronom/
- [8] The National Archives, http://www.nationalarchives.gov.uk/
- [9] Metadata Extraction Tool, http://meta-extractor.sourceforge.net/
- [10] National Library of New Zealand, http://www.natlib.govt.nz/
- [11] JHOVE2 proposal, http://hul.harvard.edu/jhove/JHOVE2-proposal.doc, 2006
- [12] Metadata Encoding & Transmission Standard (METS), http://www.loc.gov/standards/mets/
- [13] DROID, http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf
- [14] The National Archives, Automated Format Identification Using PRONOM and DROID, DPTP-01, September 17, 2005