Automated Metadata Extraction and the Long Way to Significant Properties
DIMAG and IngestList at the Landesarchiv Baden-Württemberg
Author: Dr. Christian Keitel
In digital archiving, doing something is better than doing nothing at all. Unfortunately, even very basic steps can run into problems. Many of these problems stem from the short lifespan of formats and data carriers. In other words, how can we transfer a logical representation with all its significant properties to another physical form? Is it possible to validate the new form in an automated way? This article reports how the Landesarchiv Baden-Württemberg tries to address these questions.
The Digital Magazine DIMAG
The Landesarchiv Baden-Württemberg is a regional state archive in Baden-Württemberg, a region in the south-western corner of Germany. As an administrative unit, the Landesarchiv comprises six state archives. These archives collect the older records of the state administration, which its agencies have to submit after a given period (mostly 10 years, in some cases up to 30 years) if the records are judged to have historical value. Most of the records are still written on paper, but since 2002 an increasing number of born-digital objects have been submitted to the Landesarchiv, starting with the 1970 census and now reaching back to the 1961 census.
Digital records are archived using DIMAG (for “Digital Magazine”), a repository software for archives. A project group of the Landesarchiv developed DIMAG between 2006 and 2008. The system is built on a LAMP web server architecture (Linux, Apache, MySQL, and PHP). DIMAG stores both records and metadata and maintains the arrangement of the records in collections. Storing new objects in the system is easy, because only a minimal set of metadata is mandatory. The metadata are based on the representation model of PREMIS and the standard ISAD(G). DIMAG has flexible rights management and records major actions in an XML protocol. Currently, the system can export METS-compliant objects, objects with minimal metadata, and pure metadata objects.
At the end of 2008, DIMAG comprised nearly 17,000 digital objects with more than 57 million datasets. Databases are by far the most common digital objects among the digital records offered to the Landesarchiv. However, the system is not restricted to a certain type of digital object.
Errors and metadata
To process databases, the Landesarchiv follows a migration strategy in which software is not archived and important data is exported from the original database for subsequent transfer to the archives. There are two successive kinds of transfer: one from the original database system in an agency into files (serialisation) and a subsequent one from the agency into the archive. Both transfers can produce quite a number of errors:
- Not all files were transferred.
- Not all data sets were exported.
- Some fields were not exported completely.
- The exported file contains syntax errors with respect to the chosen file format.
- The exported file contains invalid characters.
- Some characters were not exported (mainly umlauts).
- The exported file uses multiple encoding schemes.
In the past, archive staff members detected these faults “manually”. Unfortunately, we cannot continue with this method because it does not scale. So we asked ourselves: What do we need to detect errors more or less automatically?
We presume that errors are observable only in contrast to something that declares the state of faultlessness. Therefore, declaring an error is a consequence of declaring faultlessness. The desired state of a digital object, i.e. this object in the state of faultlessness, needs to be captured before a transfer. Later on, we need to compare the transferred object with the object before the transfer. Metadata give us an opportunity to do such comparisons. Hence, it is important to collect crucial metadata before and after each transfer, for a subsequent comparison.
These were our theoretical considerations. For the practical implementation, we had two questions. Which metadata should be taken and compared? And how can we check the consistency of the transferred data? To answer the first question, some metadata about the number of datasets and fields are necessary. The Landesarchiv prefers to take the metadata in XML format and the primary information of the data tables in CSV (character-separated values) format. CSV files occur with a variety of field and data set delimiters. Therefore, we have programmed the small tool IngestList to count the most common delimiters.
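Counting delimiters in this way can be illustrated with a minimal Python sketch. It is not the IngestList implementation: the function names `profile_csv` and `guess_delimiter` and the list of candidate delimiters are assumptions made for illustration, and the counting is deliberately naive (it does not handle quoted fields that contain a delimiter).

```python
from collections import Counter

# Candidate field delimiters; this selection is an assumption.
CANDIDATES = (";", ",", "\t", "|")


def profile_csv(text, delimiters=CANDIDATES):
    """Count data sets (non-empty lines) and delimiter occurrences per line."""
    lines = [line for line in text.splitlines() if line]
    profile = {"records": len(lines), "delimiters": {}}
    for delimiter in delimiters:
        # Maps "occurrences on a line" to "number of lines with that count".
        per_line = Counter(line.count(delimiter) for line in lines)
        profile["delimiters"][delimiter] = {
            "total": sum(n * lines_with_n for n, lines_with_n in per_line.items()),
            "per_line": dict(per_line),
        }
    return profile


def guess_delimiter(profile):
    """Return the one delimiter that occurs equally often on every line, if any."""
    consistent = [
        delimiter
        for delimiter, info in profile["delimiters"].items()
        if info["total"] > 0 and len(info["per_line"]) == 1
    ]
    return consistent[0] if len(consistent) == 1 else None


if __name__ == "__main__":
    sample = "id;name;city\n1;Müller;Stuttgart\n2;Schäfer, K.;Ulm\n"
    profile = profile_csv(sample)
    print(profile["records"], repr(guess_delimiter(profile)))
```

A delimiter whose count differs from line to line hints either at a different field delimiter or at incompletely exported fields, which is why the per-line distribution is kept alongside the total.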
IngestList and significant properties
IngestList also uses routines from the well-known tools JHOVE and DROID and collects a large amount of metadata, which can be compared with the metadata extracted from the digital object at a later date. Hence, this tool also answers the second question of how to enable automatic consistency checks.
As this tool can also be run directly from a USB stick by external agencies, the relevant data can be captured both before and after the data transfer to the archive, which helps to evaluate the quality of these transfers. It enables archivists to determine the completeness of a data submission. Finally, IngestList helps to describe the provenance of submitted data: it creates a corresponding protocol, which is first filled in at the agency and to which the archive then adds a description of the main actions of the ingest process. This protocol is then continually updated throughout the whole life cycle of the digital object in DIMAG.
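The idea of capturing metadata before and after a transfer and comparing the two states can be sketched in a few lines of Python. This is a hedged illustration only, not the way IngestList works internally: the functions `snapshot` and `compare`, the choice of file size and SHA-256 checksum as the recorded metadata, and the message wording are all assumptions.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Record size and SHA-256 checksum for every file below `directory`."""
    root = Path(directory)
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            result[path.relative_to(root).as_posix()] = {
                "size": len(data),
                "sha256": hashlib.sha256(data).hexdigest(),
            }
    return result


def compare(before, after):
    """List the differences between the state before and after a transfer."""
    problems = []
    for name in sorted(before.keys() - after.keys()):
        problems.append(f"missing after transfer: {name}")
    for name in sorted(after.keys() - before.keys()):
        problems.append(f"unexpected after transfer: {name}")
    for name in sorted(before.keys() & after.keys()):
        for key, expected in before[name].items():
            if after[name].get(key) != expected:
                problems.append(f"{name}: {key} changed")
    return problems
```

In such a setup, `snapshot` would be run once at the agency and once at the archive; an empty list from `compare` indicates that, with respect to the recorded metadata, the transfer was complete and unaltered.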
However, IngestList does not solve all the issues related to trustworthy transfers and migrations. Currently, we do not know all the properties that are important for the preservation and future use of these databases. Nor do we know how to automatically extract some of the necessary properties of the datasets. IngestList is therefore only a first attempt to implement our criteria for successful data transfers and a very first step towards the automatic comparison of significant properties of digital objects of various kinds.
The Landesarchiv Baden-Württemberg has published the tool IngestList free of charge on sourceforge.net (http://sourceforge.net/projects/ingestlist/). All comments are welcome.
email@example.com (Digital Archiving)