Underwater diving operations, ca. 1900. Source
My institution has a LOT of amazing content preserved digitally in its storage. It also has a LOT of amazing metadata about this content, much of it contained within a hosted CONTENTdm instance, that surely represents thousands of hours of labor. One of my ongoing goals is to start gathering all of this information into archival packages built around the OAIS model. This would not only make the content easier to handle when performing preservation actions, it would also help ensure that digital items never become disassociated from their respective descriptive and administrative metadata.
Recently, the main tool I have been experimenting with is Archivematica. So far, Archivematica seems to have a lot of potential for assisting with what I am trying to do. It has a logical, microservice-based approach that should open the door to a lot of customization. Additionally, I like the fact that Archivematica is designed to leverage the power of the open-source community and is working to integrate other amazing projects such as MediaConch.
All this being said, I was curious to see what a hypothetical process of wrangling our existing data into Archivematica would consist of. Conceptually, I need to export the metadata out of CONTENTdm, associate it with the preservation files in our storage, and then arrange the metadata and items in a manner that is palatable to Archivematica. I was also curious to see how much of this process I could automate; should we decide to do a migration of this sort for real, I would rather not spend months and months copy-pasting metadata and files by hand!
To evaluate the viability of this process, I created a test script (current form available here on github). Using the Ruby CSV class, this script attempts to take the TSV (tab-separated values) metadata exports provided by CONTENTdm and parse them into the CSV (comma-separated values) format Archivematica accepts for Dublin Core ingest. Although the two formats use different column headers, since they map to the same Dublin Core fields this was a relatively smooth process. Since the CONTENTdm metadata also included information about original item IDs, I was also able to parse this column and have the script recursively search directories in our digital storage to find the archival master files that correspond to the access files in CONTENTdm.
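The script itself has the real details, but a minimal sketch of the idea looks something like this (the header mapping, file names, "Item ID" column, and find_master helper are hypothetical stand-ins for illustration, not the actual values my script uses):

    require 'csv'
    require 'find'

    # Hypothetical mapping from CONTENTdm export column labels to the
    # Dublin Core column names Archivematica expects in its metadata CSV.
    HEADER_MAP = {
      'Title'       => 'dc.title',
      'Creator'     => 'dc.creator',
      'Date'        => 'dc.date',
      'Description' => 'dc.description',
      'Subject'     => 'dc.subject'
    }.freeze

    # Read the CONTENTdm export as tab-separated values with a header row.
    rows = CSV.read('contentdm_export.tsv', col_sep: "\t", headers: true)

    # Walk the digital storage tree looking for a master file whose name
    # contains the original item ID recorded in the CONTENTdm metadata.
    def find_master(storage_root, item_id)
      Find.find(storage_root) do |path|
        return path if File.file?(path) && File.basename(path).include?(item_id)
      end
      nil
    end

    # Remap each row's headers to Dublin Core and pair it with its master file.
    records = rows.map do |row|
      dc = HEADER_MAP.each_with_object({}) { |(from, to), h| h[to] = row[from] }
      { dc: dc, master: find_master('/digital_storage', row['Item ID'].to_s) }
    end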
Once the script has located the master files and parsed the metadata, it is then able to assemble a package suitable for Archivematica ingest. Archivematica calls for a basic structure involving an objects directory and a metadata directory. For every master file located by the script, it makes a copy in an output/objects directory and then compares checksums for the two files to make sure that no data was corrupted. It also exports the newly created metadata CSV (including the relative file paths to the objects that Archivematica requires for metadata ingest) into output/metadata.
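Sketched out under the same assumptions as the earlier snippet (the copy_and_verify helper, the "filename" column name, and the records array of dc/master pairs are illustrative), the packaging step might look roughly like this:

    require 'csv'
    require 'fileutils'
    require 'digest'

    objects_dir  = File.join('output', 'objects')
    metadata_dir = File.join('output', 'metadata')
    FileUtils.mkdir_p([objects_dir, metadata_dir])

    # Copy a master file into output/objects, then compare SHA-256 checksums
    # of the source and the copy to confirm nothing was corrupted in transit.
    def copy_and_verify(master_path, objects_dir)
      dest = File.join(objects_dir, File.basename(master_path))
      FileUtils.cp(master_path, dest)
      unless Digest::SHA256.file(master_path) == Digest::SHA256.file(dest)
        raise "Checksum mismatch after copying #{master_path}"
      end
      dest
    end

    # Write the Dublin Core CSV into output/metadata, recording each object's
    # path relative to the package root so Archivematica can match rows to files.
    # 'records' is the array built in the earlier sketch (dc metadata + master path).
    CSV.open(File.join(metadata_dir, 'metadata.csv'), 'w') do |csv|
      dc_columns = records.first[:dc].keys
      csv << ['filename'] + dc_columns
      records.each do |rec|
        copied = copy_and_verify(rec[:master], objects_dir)
        csv << [File.join('objects', File.basename(copied))] +
               dc_columns.map { |c| rec[:dc][c] }
      end
    end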
In tests so far, this has worked surprisingly well! I tried pointing the script at the Charles Pratch Collection of early 1900s Grays Harbor photographs, and not only did it perform entirely as expected, but the resulting CSV file also required very little manual inspection prior to ingest into my trial instance of Archivematica.
I was able to go from a whole lot of this:
to a whole lot of this:
while performing digital preservation actions on a whole lot of amazing things like this!
Manning Hill and Eagle bike on road to Hoquiam. Source
I am still evaluating the different options for moving our digital storage into a package-based preservation system, but I have been very heartened by the results of this process! While this approach is only effective so far for digital items that have already been cataloged, it shows that an Archivematica migration is definitely a viable possibility for us. I look forward to seeing how much more of the process could potentially be automated.