Friday, October 6, 2023

Still spinning? My time at the Canadian Research Knowledge Network

(See Part 1, Part 2)

After 5 years I felt I needed to leave CRKN.


I want to provide clarity for the following: While I am critical of some policies, procedures, and work practices that I felt delayed or blocked productive work being done, I am not being critical of individual people. I have noticed over the years that critique of policy is regularly misinterpreted as a critique of a person.

I never knew why in the past, but I have since learned this is a common miscommunication between Autistic and Allistic (non-Autistic) people. The same goes for the question “why?”: Autistic people ask it out of a constant desire to learn, while Allistic people apparently tend to hear it as a challenge or an argument.

While the treatment I received from management because I was “Autistic At Work” was the final straw, I felt I was constantly having to fight with the management team to be allowed to do productive work. While there was agreement in theory, in practice there was always pushback against moving away from the DIY (Do It Yourself), NIH (Not Invented Here) attitudes.

I generally did not feel my gifts or contributions were being recognized or harnessed.


Differences in what I was told compared to what actually happened

Early in 2018 two different priorities were set for the small technology team: Archivematica adoption and reducing technological debt.

Managing custom software needs to be thought of as technological debt, so reducing technological debt includes moving away from custom software. Over the past 5 years there was minimal movement on the Archivematica project, and there is now more custom software and more CRKN owned hardware in member data centers than there was in 2018.

I’ll focus on only two specific areas to illustrate the problem.

Archivematica

From a UBC Campus tour, Archivematica Camp
While Scholars Portal launched OLRC in 2015, CRKN is still using a custom OAIS packaging system and its own SWIFT object storage cluster running on servers that CRKN owns and manages. Even with an Archivematica Migration project being approved in 2018 and confirmed in 2019, there were always other projects granted higher priority such that resources weren’t available to the Archivematica Migration project. Projects that only started in 2022 were able to derail the Archivematica Migration project and prerequisites such as what we called “Preservation-Access Split”.

3 CRKN staff people were sent to Archivematica Camp 2019, but were never able to make use of what was learned.


2 racks of CRKN servers at UTL
Many Scholars Portal servers are close by.
I kept being told that moving packaging and storage services to Scholars Portal’s OLRC was “too expensive”, but I never understood how that could be possible. They would have economies of scale, and better redundancy for staffing, training, and more. Canadiana/CRKN wasn’t a single OLRC user, but an organization offering a competing service that had many depositors that would be separate OLRC entities (separate Archivematica pipelines, separate Swift storage containers for AIPs, etc). CRKN did scanning and packaging on behalf of partners, meaning OLRC didn’t need to offer training/etc for these depositors directly.

Canadiana working with OLRC would have doubled OLRC’s object storage, so this was not a simple client relationship but a partnership. While managing technical services was new for CRKN, organizing broad cross-organizational collaborations and partnerships was exactly what CRKN was known to be good at.


Custom Cataloging Rules

One of the largest areas of push-back against adopting FLOSS community software was the use of custom cataloging rules by Canadiana and later CRKN’s cataloging team.

Understanding this requires understanding some of the history.

1978: CIHM
Many drawers of Microfiche...

Canadian Institute for Historical Microreproductions (CIHM) launched in 1978, and created the CIHM/ICMH Microfiche series.

MARC records were created for this series, using the MARC 490 field to indicate which specific Microfiche in that series was being described.

  • 490$a would say something like “CIHM/ICMH Microfiche series = CIHM/ICMH collection de microfiches”
  • 490$v would say something like “99411” or “no. 99411”

The identifiers were all in the CaOOCIHM namespace (MARC 003 = “CaOOCIHM”), with MARC 001 indicating “99411” as well.

1999: ECO platform

When the ECO (Early Canadiana Online) platform launched in 1999, the existing Microfiche descriptive records and the existing CIHM data model were used.

That platform was decommissioned in 2012, and decommissioning that older service was one of the earlier projects I was involved in.

2012: CAP (Canadiana Access Platform)

In 2012, new software was launched which had a different data model and used a different schema for identifiers. Canadiana moved beyond offering online access to scanned images from the CIHM Microfiche collection to offering access to other collections as well.

  • Identifiers were expanded to have a prefix indicating a depositor. That meant “99411” needed to become “oocihm.99411”. CIHM Numbers were deprecated, and should all have been quickly replaced with complete identifiers.
  • Not everything, and not even all the images from the Microfiche collection, would be considered part of the same collection as was the case for CIHM. This meant that CIHM’s way of using the 490 field was deprecated, with the intention being to use that field in the more common way in the future (to describe collections and volumes/issues of those collections).
  • We could have used MARC 001 for the full CAP identifier (not CIHM numbers), but we wanted to move from using a transparent identifier in that field to using a machine generated opaque identifier. The purpose of these records was to describe an online resource, so MARC 856 was the obvious choice to put the transparent identifier (within a full URL such as https://www.canadiana.ca/view/oocihm.99411 )

While not ideal, CAP made use of IssueInfo (Issueinfo.xsd), a custom schema inspired by Dublin Core, for issues of series. This is how the CAP system knew the difference between a “Title” (Series record or “Monograph” which could be described using Dublin Core or MARC) and an issue of a series (which must be cataloged using an IssueInfo record).

CAP did not make use of the 490 field itself. However, because the Cataloguing team pushed back against using MARC 856, we still needed to support looking for CIHM-era identifiers stuffed into 490$v when records were being loaded into the databases.
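To make that loading rule concrete, here is a minimal Python sketch of the lookup order: prefer the full identifier inside an 856 URL, and only fall back to a CIHM-era number in 490$v. This is not CAP's actual code. The dictionary-based record layout, the function name cap_identifier, and the default depositor argument are assumptions made for illustration. Only the URL pattern, the 490$v examples, and the “oocihm” prefix come from the description above; putting the URL in subfield $u is standard MARC 856 usage.

```python
import re
from typing import Optional

# Hypothetical in-memory record: MARC tag -> list of {subfield code: value}.
Record = dict[str, list[dict[str, str]]]

# Matches the view URL form shown above and captures the full identifier.
VIEW_URL = re.compile(r"^https?://www\.canadiana\.ca/view/([^/?#\s]+)")
# Matches a CIHM-era number such as "99411" or "no. 99411".
CIHM_NUMBER = re.compile(r"^(?:no\.\s*)?(\d+)$", re.IGNORECASE)


def cap_identifier(record: Record, depositor: str = "oocihm") -> Optional[str]:
    """Return a full identifier such as 'oocihm.99411', or None if none is found."""
    # Preferred: the transparent identifier embedded in an 856$u URL.
    for field in record.get("856", []):
        match = VIEW_URL.match(field.get("u", "").strip())
        if match:
            return match.group(1)
    # Fallback: a bare CIHM number in 490$v, expanded with the depositor prefix.
    for field in record.get("490", []):
        match = CIHM_NUMBER.match(field.get("v", "").strip())
        if match:
            return f"{depositor}.{match.group(1)}"
    return None
```

With this sketch, a legacy record carrying only 490$v = “no. 99411” and a newer record carrying the 856 URL both resolve to the same identifier, which is the behaviour the fallback was meant to preserve.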

2022: Preservation-Access Split

After years of delay due to pushback and other projects being given priority, this was launched in April 2022.

There were now two independent descriptive metadata databases: one for Preservation and one for Access. The same packaging tools used to manage OAIS packages were used to download as well as update Preservation descriptive metadata records.

In the past, Preservation records needed to match what was needed by Access. This was no longer the case, allowing records to slowly migrate to use the same encoding standards used by Archivematica. Splitting these databases and the identifiers they use was a prerequisite for adopting Archivematica, with Preservation records now needing to be in Dublin Core with an eye towards migrating all custom Canadiana AIPs to Archivematica AIPs.

On the Access side, CAP and the metadatabus were enhanced to support some features of the IIIF data model. Relationships between documents (including whether a document would be displayed to patrons as a monograph, series, or issue of a series) would be encoded within databases using the IIIF data model.

This meant Access descriptive records could now all be in MARC, deprecating both IssueInfo and Dublin Core records. While working with Julienne Pascoe I became very excited by Linked Open Data (LOD), and was following the Bibliographic Framework Initiative (BIBFRAME) closely. While some of Canadiana/CRKN’s developers favored moving all Access records to a custom Dublin Core derived schema, I always favored the LOD aspects of BIBFRAME.

One of the many migration paths was to enhance records using MARC as an intermediary step. The Library of Congress itself set up a project to use FOLIO as part of their transition, with some CRKN staff also becoming interested in FOLIO.

Status of the move away from custom software and custom cataloging?

Moving away from custom software requires cataloging staff to move away from custom cataloging rules, and adopt the Metadata Application Profile (MAP) and data model used by relevant community software. CRKN staff would no longer be creating or imposing their own custom encoding, but working with larger communities and stakeholders.

Migration away from custom software involves transforming/refining all existing records (using automated processes) away from legacy custom MAPs/models.


  • Adopting Archivematica involves dropping the CIHM and CAP encoding rules and adopting the Archivematica encoding rules and data model for Preservation.
  • Adopting FOLIO (for records management and publication via OAI-PMH and likely later SPARQL for BIBFRAME) requires dropping the CIHM and CAP encoding rules and adopting the FOLIO encoding rules and data model. The data model is focused on concepts from MARC and BIBFRAME, so this involves migrating all Dublin Core and IssueInfo records to MARC (and encoding document relationships using the FOLIO data model, so series, issues and monographs are understood correctly). CAP’s data model is a small subset of the data model that FOLIO uses, so enhancement of relationship data becomes possible.
  • Adopting Blacklight-marc requires either custom software that would have to be maintained indefinitely, or adopting MARC for all searchable records (easily sourced from FOLIO using OAI-PMH for indexing).
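As a rough illustration of the indexing path in the last bullet, the following Python sketch builds the OAI-PMH ListRecords requests an indexer could use to harvest MARC records. The base URL is hypothetical, and the "marc21" metadata prefix is an assumption about how the repository would be configured. The verb and argument names (metadataPrefix, from, resumptionToken) come from the OAI-PMH 2.0 protocol.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str,
                     metadata_prefix: str = "marc21",  # assumed prefix, depends on the repository
                     from_date: Optional[str] = None) -> str:
    """Build the first ListRecords request of a harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Restrict the harvest to records changed on or after this date.
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url: str, token: str) -> str:
    """Build a follow-up request; resumptionToken must not be combined with other arguments."""
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


if __name__ == "__main__":
    # Hypothetical endpoint, for illustration only.
    print(list_records_url("https://folio.example.org/oai", from_date="2023-05-01"))
```

An indexer would fetch the first URL, index the returned records, and keep following resumption tokens until the repository stops returning one.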


As of my last day in May 2023:


  • The cataloguing team were still treating the CIHM encoding rules and data model (deprecated in 2012) as current.
  • "Updates" to descriptive metadata records were being sourced from other places (some from spreadsheets, some from Inmagic DB/TextWorks databases using CIHM era schemas, and only containing a subset of records) rather than from the Preservation or Access metadata databases.
    • This meant any changes made directly to the Preservation or Access metadata databases were being overwritten.
    • A very old problem: "Document A" is edited to become "Document B" which is then edited to become "Document C". Then someone comes along and edits "Document A" to create "Document D", meaning all the changes made for B and C are lost.
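One common guard against this kind of lost update is optimistic concurrency: every record carries a revision number, and an edit is rejected if it was based on an older revision than the one currently stored. The sketch below is a toy in-memory illustration in Python, not a description of how the CRKN databases worked. The class and method names are hypothetical.

```python
class StaleRevisionError(Exception):
    """Raised when an update is based on an out-of-date revision."""


class RecordStore:
    """Toy in-memory store that keeps a revision number alongside each record."""

    def __init__(self) -> None:
        self._records: dict[str, tuple[int, str]] = {}

    def create(self, key: str, body: str) -> int:
        """Store a new record and return its first revision number."""
        self._records[key] = (1, body)
        return 1

    def read(self, key: str) -> tuple[int, str]:
        """Return (revision, body) for the record."""
        return self._records[key]

    def update(self, key: str, based_on: int, body: str) -> int:
        """Accept the edit only if it was made from the current revision."""
        current, _ = self._records[key]
        if based_on != current:
            raise StaleRevisionError(
                f"{key}: edit based on revision {based_on}, store is at {current}"
            )
        self._records[key] = (current + 1, body)
        return current + 1


if __name__ == "__main__":
    store = RecordStore()
    rev_a = store.create("oocihm.99411", "Document A")
    rev_b = store.update("oocihm.99411", rev_a, "Document B")
    rev_c = store.update("oocihm.99411", rev_b, "Document C")
    try:
        # Someone edits a stale copy of "Document A" to create "Document D".
        store.update("oocihm.99411", rev_a, "Document D")
    except StaleRevisionError as err:
        print("rejected:", err)
```

The stale edit is refused instead of silently replacing "Document C", so the person making it has to fetch the current record and reapply their change on top of it.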



Wednesday, October 4, 2023

Still spinning? The merger of Canadiana.org with the Canadian Research Knowledge Network

(See Part 1, Part 3)
My cubicle on my last day at Canadiana.

If you read earlier articles, you can tell I was excited about the merger possibility when I first heard about it. I looked at CRKN and it had the communications infrastructure that I felt was needed to get out of Canadiana’s DIY/NIH mindset. There were committees to help make key decisions, and there were partnerships with other organizations.

My first meeting with the new CEO, and several other meetings after, indicated exactly what I wanted to hear. What I understood was CRKN’s desire to move the technical team currently providing lower-level services (what I took to mean owning and managing hardware, maintaining custom software, etc) to being involved in cross-sector collaborations (members, other consortia, etc), participating in standards setting organizations, and other activities that were much higher up the technology stack.

The new CEO spoke about overseas trips to participate at standards organizations that I might be interested in. I was very interested, and eager to transition away from DIY/NIH to free up the time to make that possible.

While I was advocating for OpenStack SWIFT and Archivematica at Canadiana, my longer-term hope was that Canadiana (and later CRKN) wouldn’t be trying to duplicate the services of Canadiana and/or CRKN members and partners. I noticed Scholars Portal, the technological service provider for OCUL (one of the 4 Canadian regional library consortia), launched the Ontario Library Research Cloud in 2015 (See OCUL history). 

It seemed obvious to me that, while Canadiana/CRKN needed to create a transition plan, the goal of the plan would be to move these services to Scholars Portal and not continue to manage that duplicate service (Archivematica packaging, large OpenStack Swift clusters, staff training, etc).

I envisioned CRKN coordinating other technological services, possibly with COPPUL offering backup Object storage, and using COPPUL, OCUL, CAUL/CBUA and BCI cloud services for hosting all other services rather than Canadiana/CRKN owning and managing physical hardware in member data centers (Currently Dalhousie University, University of Toronto, University of Alberta, University of Victoria).

During the merger talks there was documentation for the CRKN/Canadiana merger which I followed closely. Some of that documentation became part of a Journal article: “Spinning In”: the merger of Canadiana.org with the Canadian Research Knowledge Network

Excerpt:

Care will need to be taken to ensure that any work supports existing initiatives and players such as CARL, CUCCIO, Confederation of Open Access Repositories, Scholars Portal, and Research Data Canada. It is important to note that Canadiana does not compete with Scholars Portal, but provides complementary capacity focused on documentary heritage content. Given similar preservation models and the ongoing interest in coordinated Canadian digital research infrastructure, there may be emerging opportunities for future collaboration, such as linking data and supporting common TDR nodes for mutual redundant backup and access load balancing.

This specific sentence concerned me: “It is important to note that Canadiana does not compete with Scholars Portal”.

At an administrative level this may appear true, but given OLRC was launched in 2015 and Canadiana was providing a duplicate (if inferior) technological service, that statement wasn’t strictly true.

I wrote the following to Jonathan Bengtson and forwarded it to other members of the Canadiana and CRKN boards in summer 2017. I also sent a copy to Clare Appavoo, CRKN’s Executive Director, prior to us meeting for the first time in November 2017.

(Google Docs link)


Technological infrastructure: CRKN, Canadiana, OCUL, Scholars Portal

Introduction

As CRKN and Canadiana plan for a merged organization, it is useful to look more closely at the components of Canadiana. While Canadiana is a charity and CRKN is a nonprofit, they exist within a larger context of services offered to overlapping institutions by other nonprofits and consortia. We need to do a competitive landscape analysis to avoid conflict.

I (Russell McOrmond) am the Lead Systems Engineer for Canadiana.org. I am concerned that the relationship between Canadiana’s team of technological infrastructure providers and the technological infrastructure providers for OCUL (Also known as Scholars Portal) is unclear, and that unexpected consequences will result if we don’t create clarity. The merged organization already has plans to help roll out Scholars Portal services across the country.

Teams within Canadiana

From the outside Canadiana might be seen as a single entity, but internally it has a series of departments which have a focus. Understanding these departments is helpful to place them in the larger context.

  • Officer: We currently have a single officer, the acting CEO (Previously CIO)
  • Production: Currently a team of 5 people work on digitizing and describing (including cataloguing) the resulting images, and managing other processes such as OCR and ingest of that content into Canadiana’s TDR.
  • Administration: Currently a team of 2 who handle office management, payroll, and other financial work
  • Communications & Partnerships: Currently a team of 1
  • DevOps (software DEVelopment, metadata architecture, information technology OPerationS): Currently a team of 3 people that provides the technological infrastructure for Canadiana’s services.

As the Lead Systems Engineer, one of the 3 people in DevOps, I will remain focused on our team.

What does Canadiana’s DevOps team do

Canadiana’s DevOps team researches, creates and/or manages the technological infrastructure used to provide Canadiana’s services.

While we have historically been focused entirely on online publishing the outputs of the production team, a few years ago we started a multi-year project to modernise our platform such that manual intervention by the DevOps team would not be required for most operations. We would be adopting modern platform techniques (Microservices, Docker), open standards (Such as http://iiif.io/ ), and more collaborative development with stakeholders ( https://github.com/c7a ).

The longer term plan was to free up time within the DevOps team to allow us to expand into offering other services for our members: services beyond online publishing of scanned/described images.

What is Scholars Portal

Scholars Portal is the technological infrastructure provider for The Ontario Council of University Libraries (OCUL) http://ocul.on.ca/node/135 . Scholars Portal is to OCUL as the Canadiana DevOps team is to Canadiana.

While providing TDR services is a big part of what Canadiana does, and thus what Canadiana’s DevOps team has been focused on, it is a small part of what Scholars Portal does for OCUL members. Many of the areas of expansion that Canadiana’s DevOps team have contemplated or proposed are already being rolled out by Scholars Portal, including Cloud storage and computing services (OLRC), Geospatial services (Scholars GeoPortal), research data deposit (Dataverse), born digital books and journals.

Clarifying relationship with Scholars Portal

Reading the “Appendix C - Reading Material Merger Background Documents” providing summaries of Canadiana.org and CRKN, you might think Scholars Portal is a Journal TDR. This would be similar to thinking that Canadiana is a microfiche scanner.

Documentation about the merger suggests the new organization doesn’t intend to compete with Scholars Portal. As Scholars Portal is larger than described so far, this will require close attention to the capabilities of both technological infrastructure providers to ensure we aren’t seen by our overlapping membership as offering competing infrastructure. This will be critical as the new organization plans to work with OCUL to offer Scholars Portal services across the country, so will be marketing services of both teams of technological infrastructure providers.

A public directory of Scholars Portal staff http://ocul.on.ca/spstaff lists 28 people, and they appear to be expanding. While not transparent to our members as we have no public staff directory, Canadiana’s DevOps team at one time had 8 people (3 in operations, 2 in software development, 1 metadata architect, one manager, and one coop student). We currently only have 3 people (1 software, 1 metadata architect, 1 operations).

If the multi-year project to modernize Canadiana’s infrastructure is successful, the technology will be much easier to manage. This could free up resources to allow Canadiana to expand into new service offerings, or it could be used as a justification to reduce the size of the team or outsource the management of the technological infrastructure (Including to Scholars Portal itself).

In the “merger considerations and opportunities backgrounder” section of the “Appendix C” document, there is discussion of expansion of Canadiana’s TDR platform. We need to ensure when discussing Scholars Portal that we don’t define their TDR narrowly by discounting their expansion of services, while presuming that any new services that Canadiana offers will be considered part of our TDR.

There are features of the technological infrastructure Canadiana is using to offer our TDR services that may not exist within the infrastructure that Scholars Portal is offering. How these enhancements are offered to our overlapping membership will need to be given adequate consideration. This could involve Canadiana expanding our technology platform to handle new data types, or could be Canadiana working with Scholars Portal to enhance their technology platform to have features that make it more trustworthy.

In an ideal scenario the technological infrastructure teams at Canadiana and OCUL would be working closely together to roll out new services to our joint pan-Canadian membership.

Summary

The opportunities described in the “Appendix C” document are all opportunities which a merged CRKN/Canadiana would be well placed to pursue. What is uncertain from the documentation, and thus a concern to the staff providing Canadiana’s technological infrastructure, is what role we will be playing in the future given the potential overlap with Scholars Portal staff and services.


Tuesday, October 3, 2023

Still spinning? My time at Canadiana.org prior to the CRKN merger

(See: Part 2, Part 3)
This series of articles is inspired by the journal article: “Spinning In”: the merger of Canadiana.org with the Canadian Research Knowledge Network

Leslie Weir giving me 5 year recognition
I started working at Canadiana in 2011, and was one of the Canadiana staff that transitioned to CRKN in 2018. Prior to Canadiana I spent most of my career in the Free/Libre and Open Source Software (FLOSS) movement, which I’ve been involved in since 1992.

Canadiana had quite a bit of custom software, and that always made me uncomfortable. There wasn’t existing FLOSS software to fill the requirements when the custom ECO platform was launched in 1999 (decommissioned in 2012) or when the CAP platform was launched in 2012 (still being used). That changed over time.

AlouetteCanada, one of the organizations that merged to form Canadiana in 2008, had created an Open Source “Digital Collection Builder”. Artefactual was involved in that project, and they continued to build Open Source software for the community: the most visible and widely used being Access to Memory (AtoM) and Archivematica.


As well as custom Access software, Canadiana was managing a custom OAIS packaging system. (Disclosure: I was the primary developer/maintainer of that software and related infrastructure from 2014 until I left CRKN earlier this year. I had hoped all that custom software would have been decommissioned before I left).

While authoring custom OAIS packaging software might have felt necessary in 2012, Artefactual started working on Archivematica around the same time (See release notes). Archivematica almost immediately surpassed the functionality of the Canadiana OAIS packaging system. With a focus on digital preservation, being Open Source, and with archivists (rather than only librarians) involved, Artefactual received well deserved grants for, and community involvement with, their software.


I had a strong and constant urge to get rid of all the custom software I had authored and/or was maintaining as soon as possible. I started to advocate in 2015 within Canadiana to migrate to using Archivematica. Independently (mostly? I don't remember exactly.) Canadiana's Metadata Architect did an environmental scan in 2017 and also concluded we should migrate to Archivematica.


Canadiana had a custom REST API for accessing objects (images, etc) from storage. While the API was inspired by Amazon S3, I was aware OpenStack Swift had a module which provided an S3 compatibility layer. It seemed obvious to me that we should move away from any custom API to actually using a common API, enabling interoperability with other software without always having to customize.


Unfortunately, Canadiana generally had a “Do It Yourself” (DIY) and “Not Invented Here” (NIH) attitude, and thought of itself as a vendor. There was often push-back from colleagues against moving away from using custom software, custom data models, custom cataloging rules, etc. It took some time to convince colleagues to move from a private Subversion repository to being more open on GitHub.


After years of internal advocacy, there was finally agreement on a few components:

  • Move from custom Image server to a IIIF Image server. We picked Cantaloupe, and that functionality was launched in 2017.
  • Move from custom object API to OpenStack Swift. In 2019 at CRKN there was a project to set up a temporary cluster using the SwiftStack management console and consulting services.
  • Move from custom OAIS packaging system to Archivematica


There was still no agreement among staff on Access software, but at least if we could complete the above there would be a better understanding organization-wide of why Canadiana should move away from other custom software (and related custom data models, custom metadata application profiles, and custom cataloging rules).

I believed the best option for Access software to match the need was to use IIIF APIs for image/manifest/collection view/navigation and page search (possibly starting a new FLOSS project given the scale of the image repository), and Blacklight-marc for document search. Blacklight closely matched the style of services that CAP offered, allowing the transition to not be as jarring to patrons as some of the other options.


By July 2016 the conversation about a merger with CRKN had been made public. At the time I saw this as an opportunity for positive policy change, including away from DIY/NIH towards coordinating enhancement of community software projects and coordinating cloud services among Library consortiums.


Around late 2016 or early 2017 I became aware of Scholars Portal's OLRC (Ontario Library Research Cloud) that was launched in 2015. This meant that Archivematica and Swift object storage services already existed in the larger community, and that partnering with them, rather than Canadiana (or later CRKN) doing any DIY/NIH, was possible.