Andrew Weaver

I am an Archivist/Librarian

A blog about stuff. And things. Sometimes both.



Testing For Archivematica Migration

Diving into AIPs: ContentDM ➜ Archivematica

Picture of Diving Operation

Underwater diving operations, ca. 1900. Source

My institution has a LOT of amazing content preserved digitally in its storage. It also has a LOT of amazing metadata about this content, much of it contained within a hosted CONTENTdm instance, that surely represents thousands of hours of labor. One of my ongoing goals is to start gathering all of this information into archival packages built around the OAIS model. This would make it not only easier to handle in terms of performing preservation actions, it also would help ensure that digital items never become disassociated from their respective descriptive and administrative metadata.

Recently, the main tool I have been experimenting with is Archivematica. So far, Archivematica seems to have a lot of potential for assisting with what I am trying to do. It has a logical microservice based approach that should open the door to a lot of customization. Additionally, I like the fact that Archivematica is designed to leverage the power of the open-source community and is working to integrate other amazing projects such as MediaConch.

All this being said, I was curious to see what a hypothetical process of wrangling our existing data into Archivematica would consist of. Conceptually, I need to export the metadata out of CONTENTdm, associate it with preservation files in our storage and then arrange the metadata/items in a manner that is palatable to Archivematica. I was also curious to see how well I could automate this process, as should we decide to do a migration of this sort for real, I would rather not spend months and months copy-pasting metadata and files by hand!

To evaluate the viability of this process, I created a test script (current form available here on github). Using the Ruby CSV class, this script attempts to take the TSV (tab separated vales) metadata exports provided by CONTENTdm and parse it into the CSV (comma separated values) Archivematica allows for Dublin Core ingest. Although the two formats use different column headers, since they map to the same Dublin Core fields, this was a relatively smooth process. Since the CONTENTdm metadata also included information about original item IDs , I was also able to parse this column and have the script recursively search directories in our digital storage to find archival master files that corresponded to the access files in CONTENTdm.

Once the script has located the master files and parsed the metadata, it is then able to assemble a package suitable for Archivematica ingest. Archivematica calls for a basic structure involving an objects directory and a metadata directory. For every master file that is located by the script, it makes a copy in an output/objects directory and then compares checksums for the two files to make sure that no data was corrupted. It also exports the newly created metadata csv (including the relative file paths to the objects that Archivematica requires for metadata ingest) into output/metadata.

In tests so far, this has worked surprisingly well! I tried pointing the script at the Charles Pratch Collection of early 1900s Grays Harbor photographs, and not only did it perform entirely as expected, the resulting CSV file required very little manual inspection prior to ingest into my trial instance of Archivematica.

I was able to go from a whole lot of this:

Title   Manning Hill and Eagle bike on road to Hoquiam, 1893.
Creator     Pratsch, Charles R.
Date    1893
Subject     Bicycles & tricycles--Washington (State); Men--Washington (State); Roads--Washington (State); Trees--Washington (State)
Type    Image
Genre   Glass negatives
Identifier  pc018b02n001
Source  Is found in PC 18, Charles R. Pratsch Photographs http://libraries.wsu.edu/masc/finders/pratsch.htm at Washington State University Libraries' Manuscripts, Archives, and Special Collections (MASC) http://libraries.wsu.edu/masc
Publisher   Manuscripts, Archives, and Special Collections, Washington State University Libraries: http://www.libraries.wsu.edu/masc/masc.htm
Coverage    Hoquiam, Washington
Rights  http://rightsstatements.org/vocab/NKC/1.0/
Rights Notes    No known copyright. Item went into public domain 70 years after the 1937 death of the author.
Format  Original photographic prints were scanned as 300 dpi TIFF files on a Microtek 9600XL scanner. 72 dpi JPEG files were then added to the CONTENTdm database at the WSU Libraries.
Language    English 

to a whole lot of this:

      <mets:xmlData>
        <dcterms:dublincore xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xsi:schemaLocation="http://purl.org/dc/terms/ http://dublincore.org/schemas/xmls/qdc/2008/02/11/dcterms.xsd">
          <dc:title>Manning Hill and Eagle bike on road to Hoquiam, 1893.</dc:title>
          <dc:creator>Pratsch, Charles R.</dc:creator>
          <dc:description></dc:description>
          <dc:date>1893</dc:date>
          <dc:subject>Bicycles &amp;amp; tricycles--Washington (State); Men--Washington (State); Roads--Washington (State); Trees--Washington (State)</dc:subject>
          <dc:type>Image</dc:type>
          <dc:identifer>pc018b02n001</dc:identifer>
          <dc:source>Is found in PC 18, Charles R. Pratsch Photographs http://libraries.wsu.edu/masc/finders/pratsch.htm at Washington State University Libraries' Manuscripts, Archives, and Special Collections (MASC) http://libraries.wsu.edu/masc</dc:source>
          <dc:publisher>Manuscripts, Archives, and Special Collections, Washington State University Libraries: http://www.libraries.wsu.edu/masc/masc.htm</dc:publisher>
          <dc:rights>http://rightsstatements.org/vocab/NKC/1.0/</dc:rights>
          <dc:format>Original photographic prints were scanned as 300 dpi TIFF files on a Microtek 9600XL scanner. 72 dpi JPEG files were then added to the CONTENTdm database at the WSU Libraries.</dc:format>
          <dc:language>English</dc:language>
        </dcterms:dublincore>
      </mets:xmlData>
    </mets:mdWrap>
  </mets:dmdSec>

while performing digital preservation actions on a whole lot of amazing things like this!

Man with bike in woods

Manning Hill and Eagle bike on road to Hoquiam. Source

I am still evaluating the different options for moving our digital storage into a package based preservation system, but I have been very heartened by the results of this process! While this system is only effective so far for digital items that have been cataloged, it shows that an Archivematica migration is definitely a viable possibility for us. I look forward to seeing how much more of the process could potentially be automated.

Home