Thursday, August 11, 2016

The Canadiana preservation network and access platform

Canadiana is sending me to Access 2016.  While I expect to learn quite a bit and meet new people, I wanted to make an additional introduction online ahead of time.

I have participated in many changes to the technical platform at Canadiana since I started in January 2011.  As always, what I write here reflects my own thoughts and isn't an official statement from my employer.

Software platform

In early 2011 Canadiana was transitioning away from an access platform simply called "ECO", for Early Canadiana Online.  This was a mod_perl 1 application that was tightly tied to a legacy version of Apache, had been written by an outside consultant, and was in need of upgrading.  I did my best to keep it running as long as possible, while keeping our machines secure, by running the mod_perl 1 application inside a chroot() environment on a newer version of Debian that no longer supported mod_perl 1.  Today we would use Linux containers for this, but they weren't ready in 2011.
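To give a rough idea of what that looked like, here is a minimal sketch of the chroot() approach in Python. The paths and the Apache invocation are illustrative only, not our actual configuration; the real setup involved unpacking an older Debian userland (and its init scripts) into the chroot directory.

#!/usr/bin/env python3
# Minimal sketch: run a legacy Apache/mod_perl 1 stack from an older Debian
# tree unpacked under /srv/eco-chroot, on a host whose native Debian no
# longer ships mod_perl 1.  Paths are illustrative only.  Must run as root.
import os

CHROOT_DIR = "/srv/eco-chroot"   # old Debian userland with Apache + mod_perl 1
APACHE = "/usr/sbin/apache"      # path as seen from *inside* the chroot
APACHE_ARGS = [APACHE, "-f", "/etc/apache/httpd.conf"]

os.chroot(CHROOT_DIR)            # confine the legacy stack to the old userland
os.chdir("/")                    # make sure the working directory is inside it
os.execv(APACHE, APACHE_ARGS)    # replace this process with the legacy Apache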


The new platform, called CAP (Canadiana Access Platform), was also written in Perl, but based on the Catalyst MVC framework and written by Canadiana staff.  The new platform allowed for multiple "portals", each with its own theme and content collections (subsets of the full set of AIPs in our repository).  CAP used MySQL for most indexes (users and institutions, logging, content metadata) and Solr for search.
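To make the "portal" idea concrete, here is a rough sketch of what a portal-scoped search against Solr could look like. The Solr URL, field names and portal codes are hypothetical, not our actual schema; the point is simply that each portal only sees its own subset of the repository.

# Sketch: restrict a Solr search to the collections belonging to one portal.
# URL, field names and portal codes are hypothetical.
import requests

SOLR = "http://localhost:8983/solr/cap/select"

def portal_search(query, portal, rows=10):
    params = {
        "q": query,
        "fq": "collection:%s" % portal,  # only this portal's subset of AIPs
        "wt": "json",
        "rows": rows,
    }
    response = requests.get(SOLR, params=params)
    response.raise_for_status()
    return response.json()["response"]["docs"]

# e.g. portal_search("fur trade", "eco")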

Early Canadiana Online remains an important collection, but now shares a platform with other collections.

Between 2012 and 2015 there were major upgrades to the back-end software to allow us to become a CRL-certified Trustworthy Digital Repository (TDR).  My part was to upgrade and implement new processes for the automated validation and replication of our file repository.  When I started, we were using rsync to copy a comparatively small (a few hundred GB) cmr/ (Canadiana Metadata Repository) directory to a few machines.  The size grew to the point where rsync took a large portion of the day just to determine that no AIPs had been updated.  We also ran into a bug in rsync where it couldn't handle the number of files we were trying to synchronize.

A MySQL-based TDR metadata system had been started by a colleague, which I moved to CouchDB so that the data about our AIPs could be reliably replicated across data centers.  It was now the database that indicated that an AIP needed to be replicated to a specific repository, with rsync only used to reliably transfer the contents of a specifically identified AIP.  We created processes where a subset of the AIPs get full MD5 checks of their contents, so over time we had validation of every AIP copy that had been replicated to each individual server.
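The flow is easier to see with a sketch. The database, view and field names below are hypothetical and the real tools are more involved, but the shape is the same: ask CouchDB which AIPs this repository node is missing, rsync each identified AIP individually, and record the new copy back in the database.

# Sketch of the replication flow driven by the CouchDB TDR metadata database.
# Database, view and field names are hypothetical.
import subprocess
import requests

COUCH = "http://localhost:5984/tdr"
THIS_REPO = "ottawa-1"
SOURCE = "rsync://primary.example.org/aip/"
DEST = "/tdr/pool0/aip/"

def aips_needing_replication():
    # A (hypothetical) view keyed on repository name, emitting the AIPs that
    # are not yet present on that repository.
    r = requests.get(COUCH + "/_design/replicate/_view/missing",
                     params={"key": '"%s"' % THIS_REPO, "include_docs": "true"})
    r.raise_for_status()
    return [row["doc"] for row in r.json()["rows"]]

def replicate(doc):
    aip_id = doc["_id"]
    # rsync transfers only the contents of this one identified AIP,
    # never the whole tree.
    subprocess.check_call(["rsync", "-a",
                           SOURCE + aip_id + "/", DEST + aip_id + "/"])
    # Record the new copy so other nodes and the validation passes know it exists.
    doc.setdefault("repositories", []).append(THIS_REPO)
    requests.put(COUCH + "/" + aip_id, json=doc).raise_for_status()

for doc in aips_needing_replication():
    replicate(doc)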

Around this time we also moved from LVM over hardware RAID to ZFS for the TDR file repository, with multiple ZFS disk pools individually managed in the TDR file metadata database to optimize storage usage and the speed of validation (reducing SAS cabling as a bottleneck).
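Because the database knows which pool each AIP copy lives on, a validation pass can be limited to one pool (and one set of disks) at a time and to a rotating subset of AIPs. Here is a sketch of the checking step itself; the manifest format and paths are assumptions for illustration, not a description of our actual AIP layout.

# Sketch: verify one AIP copy against a BagIt-style MD5 manifest.
# The manifest filename and layout are assumptions for illustration.
import hashlib
import os

def md5_file(path, blocksize=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            digest.update(block)
    return digest.hexdigest()

def validate_aip(aip_dir):
    ok = True
    with open(os.path.join(aip_dir, "manifest-md5.txt")) as manifest:
        for line in manifest:
            expected, relpath = line.split(None, 1)
            actual = md5_file(os.path.join(aip_dir, relpath.strip()))
            if actual != expected:
                ok = False   # a failed check means this copy needs repair
    return ok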

The new repository system gives us flexibility in how we manage storage, as disk pools can be of any size.  A given repository node can be instructed to store only a subset of AIPs as disk space allows, with other nodes having much more storage.  We are not locked in like systems where the usable storage of the whole repository is limited by the size of the smallest node.


In late July 2016 we deployed a major upgrade to CAP to make use of a more Service Oriented Architecture (SOA) for metadata processing, something we call our "metadata bus".  Working closely with the lead developer and our metadata architect, I was the primary on the design and implementation of the metadata bus, which allowed our lead developer to focus on the upgrades to CAP required to integrate with it.  While I can write back-end software, I am poor when it comes to front-end and user interface software.

The core of the metadata bus is a series of CouchDB databases, building on the one we created for TDR replication and validation, against which we run microservices to transform and combine data.  This allows individual microservices to be designed, implemented and tested more easily than the more monolithic and manual processing we did previously.  It also allows us to easily precompute data which our access platform requires, increasing performance for access.  We continue to use Solr for search, but are now able to harness features we weren't able to with the older revision of the platform.
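A typical microservice on the bus is small enough to sketch. The one below follows the _changes feed of a source CouchDB database, transforms each updated document, and writes the derived record into a second database that the access platform and Solr indexing read from. The database names, fields and the transform itself are hypothetical; the point is the pattern of small, independently testable stages.

# Sketch of a metadata-bus microservice: follow a CouchDB _changes feed and
# write precomputed records into a second database.  Names and fields are
# hypothetical.
import requests

SOURCE_DB = "http://localhost:5984/tdr"
DERIVED_DB = "http://localhost:5984/access"

def transform(doc):
    # Precompute what the access platform needs at request time.
    return {
        "_id": doc["_id"],
        "label": doc.get("label", doc["_id"]),
        "portals": doc.get("collections", []),
    }

def run(since=0):
    while True:
        r = requests.get(SOURCE_DB + "/_changes",
                         params={"since": since, "include_docs": "true",
                                 "feed": "longpoll", "timeout": 60000})
        r.raise_for_status()
        changes = r.json()
        for change in changes.get("results", []):
            if change.get("deleted"):
                continue
            derived = transform(change["doc"])
            # Carry forward the existing revision so the update doesn't conflict.
            existing = requests.get(DERIVED_DB + "/" + derived["_id"])
            if existing.status_code == 200:
                derived["_rev"] = existing.json()["_rev"]
            requests.put(DERIVED_DB + "/" + derived["_id"],
                         json=derived).raise_for_status()
        since = changes.get("last_seq", since)

if __name__ == "__main__":
    run()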

With this major upgrade fully deployed, I expect smaller incremental upgrades to be deployed more often.

Hardware


In 2011 all our servers were in Ottawa, with some on Wellington Street, where we work, and some at a commercial hosting service on Baseline Road.

During the transition from the "ECO" platform we had 2 servers (primary and backup) for ECO and 2 servers (primary and backup) for CAP.  While at the same co-location company, they were in two different cabinets (I only showed one in the image, in green, as this was the publicly accessible server).

On Wellington we had a workflow server (where we processed images from scanning through to ingest into our file repository), a backup server (black), and a CMR/TDR server (red).


In 2012 we moved the servers from Baseline Road to a commercial service in Montreal. We had 2 copies of our image data in Montreal (green for publicly accessible, red for standby), and 2 copies in Ottawa (red for standby, and as part of the backup server marked in black).

In 2014/2015 we added a publicly accessible copy of the TDR image data at our partner the University of Toronto, and an additional working copy of the TDR on Wellington (1 publicly accessible copy, 3 standby copies, and 1 copy as part of the backup server).


In 2015 we added the University of Alberta as a partner, first shipping a copy of the TDR image data and later, in 2016, adding an application server.

We also added an application server and a custom projects server at our partner the University of Toronto.

This year we separated the hardware configuration for an application server (which needs RAM, CPU, and fast disk for database access and search) from that of the TDR image data/content server (which needs a large amount of storage, plus CPU/RAM for making image derivatives).  At the same time we designed and documented a WIP (Work In Progress, marked orange in Ottawa) server to replace the older workflow concepts, and the concept of a "custom projects" server for special things (marked blue).  In Ottawa the custom projects server houses development and operations tools such as Redmine, Subversion, and Icinga.  In Toronto it houses the CHIN Artefacts Canada linked open data project, the legacy Canadiana Discovery Portal which is no longer part of our platform, the one-off Aboriginal Veterans mini-site, and the Drupal-based corporate site we plan to integrate into our platform.



Most recently we decided to move away from commercial hosting providers and to rely entirely on partners.  We picked up the copy of the TDR file repository that remained at the Montreal commercial provider, and it now waits in Ottawa as we plan for our next partner to join the preservation network.

If you are at a member institution that would like to join the preservation network, please get in touch.  While we are interested in hearing from any potential partner, it would be ideal to have more provinces represented in the network.  BAnQ?  UVic?  Memorial University of Newfoundland?

I just spent some vacation time in Québec, and would love another excuse to visit Université Laval!  :-)

There are some conversations underway about a new partner, and I am excited to see how things move forward.



