Friday, June 9, 2017

IIIF.io : the hardest part will be saying "no".

Back in April I noted Canadiana is working on adopting APIs from IIIF, the International Image Interoperability Framework. We did a small demo in May as part of our participation at Code4Lib North.  Today is the final day of the 2017 IIIF Conference hosted at The Vatican, and this is an update on our progress.

What have we done so far?

We have a Cantaloupe Docker configuration on GitHub that we used for the demo.  It includes the Ruby delegate script, which finds the requested image within the AIP stored on the TDR Repository node where Cantaloupe is running.

We have created a pull request for OpenJPEG to resolve an incompatibility between OpenJPEG and Cantaloupe. The fix allows Cantaloupe to offer access to our JPEG2000 images.

We will be integrating the OpenJPEG fix and some cleaner configuration into our Cantaloupe Docker configuration soon, bringing this Docker image closer to being worthy of being installed on production servers.

Our lead Application Developer, Sascha, created an application (with associated Docker configuration) that offers the IIIF Presentation API.  This reads data from the CouchDB presentation database used by our existing platform.  We expect to adopt the IIIF structures for data within CouchDB at a later date, but this is a good intermediary step.

With these two Docker images running, accessing data from TDR repository pools and CouchDB, we are able to use any existing IIIF viewer to access Canadiana hosted content.

What is the largest stumbling block we've discovered?

We had already discovered the problem on our own, but the recent IIIF Adopters Survey made it clear.

Of the 70 institutions that completed the survey, 51 are currently using the IIIF Image API and 42 have adopted IIIF Presentation, but The British Library and the Wellcome Trust are the only known institutions currently using the IIIF Authentication API.

Canadiana has both sponsored collections (where the depositor or other entity sponsored the collection which is then freely available to access) and subscription collections (where the funders have required we restrict access only to others who are financially contributing).  Making the sponsored collections available via IIIF will be much easier than the additional software we will have to author (including possibly having to help existing projects offering IIIF access tools) in order to support denying access to subscription collections.

Said another way: denying access will take far more of Canadiana's resources (staff and computing) than granting access.  Ideally all our collections would be sponsored, but that is not the environment we currently operate in.  At the moment a large portion of this charity's funding comes in the form of subscriptions, and this is already a topic of discussion within our board and membership.

This was not a total surprise.

We knew the move to a more modern distributed platform, which we were already planning before we decided to adopt IIIF, would involve a change in how we did authentication and authorization.  Implementing authorization rules is already a significant part of our technology platform.

Currently the CAP platform is based on a "deny unless permit" model, and there are only two public-facing software components: CAP, which internally handles its own authorization, and COS, which receives a signed access token from CAP for each content request (specific source file, specific rotation, specific zoom, limited amount of time, etc.).  Only a few specific zoom levels are allowed, and there is no built-in image cropping or similar manipulation.
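To illustrate what a per-request signed token in a "deny unless permit" model can look like, here is a minimal Python sketch.  It is not the actual CAP/COS token format: the shared secret, the field order and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac
import time

# Hypothetical key shared by the portal (signer) and the content server (verifier).
SECRET = b"demo-shared-secret"


def sign_request(source: str, rotation: int, zoom: int, expires: int) -> str:
    """Return a hex HMAC-SHA256 signature covering one specific content request."""
    message = f"{source}|{rotation}|{zoom}|{expires}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def verify_request(source, rotation, zoom, expires, signature, now=None) -> bool:
    """Deny unless the signature matches these exact parameters and is unexpired."""
    now = time.time() if now is None else now
    if now > expires:
        return False
    expected = sign_request(source, rotation, zoom, expires)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)
```

Because the signature covers the file, rotation, zoom and expiry together, changing any one of them invalidates the token, which is why every distinct content request needs its own token under this model.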


Implementing the same model for IIIF would have been extremely inefficient, even if it were possible to go through the multi-request Authentication API for each individual content request.

IIIF image access isn't done as one request for a single completed image but as multiple requests for tiles representing parts of the image (and at a variety of zoom levels).  For efficiency we needed to move to a more liberal "grant unless denied" model where the access tokens are far more generic in what type of requests they would facilitate.
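To give a sense of how many requests a viewer makes, the following Python sketch lists the tile URLs for one image at one zoom level, using the IIIF Image API URL pattern of region/size/rotation/quality.format.  The base URL, identifier, image dimensions and tile size are made-up values, and real viewers read the tile sizes and scale factors from the image's info.json rather than assuming them.

```python
import math
from urllib.parse import quote


def tile_urls(base, identifier, width, height, tile=512, scale=1):
    """List IIIF Image API URLs for every tile of an image at one scale factor.

    `tile` is the tile edge in output pixels; `scale` is how many source
    pixels map onto one output pixel (1 = full resolution).
    """
    ident = quote(identifier, safe="")  # identifiers must be URL-encoded
    step = tile * scale                 # source pixels covered by one tile
    urls = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            w = min(step, width - x)    # clip tiles at the right edge
            h = min(step, height - y)   # clip tiles at the bottom edge
            out_w = math.ceil(w / scale)
            urls.append(f"{base}/{ident}/{x},{y},{w},{h}/{out_w},/0/default.jpg")
    return urls


# Hypothetical server and identifier, for illustration only.
for url in tile_urls("https://image.example.org/iiif/2", "example-id", 1000, 600):
    print(url)
```

Even this small 1000 by 600 pixel example produces four requests at full resolution, and a scanned page at several zoom levels produces many more, so a token tied to one exact request does not fit this access pattern.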

There are also several APIs that can (and should) be offered as different distributed web services. A service offering Presentation API data is likely to be deployed into multiple server rooms across the country, just as the Image API will be offered from multiple server rooms.   We may have fewer servers offering authentication, but that won't create a bottleneck as once a user has authenticated they won't need to go back to that service often (only when access has expired, or they need to pass access tokens to a new service).


We will be separating authorization from authentication, only checking the authentication token if required.  A new CouchDB authorization database would be needed that has records for every AIP (to indicate if it is sponsored or what subscription is required, and what level of access is granted), every user (what subscriptions they have purchased, or other types of access -- such as administrators) and every institution (subscriptions, other access).   Each content server request would involve consulting that database and determining if we had to deny access, with this data being replicated so it is local to each application which needs to use this data.
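As a rough sketch of that decision, the Python below models the three kinds of records and checks whether a request has to be denied.  The record shapes and field names are assumptions for illustration and not our CouchDB schema; the point is only that sponsored content is granted without looking at any authentication token.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class AipRecord:
    aip_id: str
    # None means the collection is sponsored and open to everyone.
    required_subscription: Optional[str] = None


@dataclass
class Principal:
    """A user or an institution, as identified by an authentication token."""
    subscriptions: Set[str] = field(default_factory=set)
    is_admin: bool = False


def is_denied(aip: AipRecord,
              user: Optional[Principal] = None,
              institution: Optional[Principal] = None) -> bool:
    """Return True only when access to this AIP must be refused."""
    if aip.required_subscription is None:
        return False  # sponsored: no token needs to be checked at all
    for principal in (user, institution):
        if principal is None:
            continue
        if principal.is_admin or aip.required_subscription in principal.subscriptions:
            return False
    return True
```

A content server would call something like `is_denied(record, user, institution)` against its local replica of the database, so the check adds no network round trip to each tile request.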

Where are we in our plan?

The plan was to migrate away from our existing Content Server first (See: The Canadiana preservation network and access platform for details on the current platform).  This would involve:

  • Adopting Cantaloupe for the IIIF Image API, including authorization.
  • Implementing the Authentication API, to set the Access cookie from the access token offered by the main platform website.
  • Implementing an IIIF Presentation API demonstration sufficient to test our implementation of the Authentication API with existing IIIF client applications.
  • Offering direct access to TDR files using the same Access cookie/token (needed for PDF downloads as a minimum, also used by various internal microservices to access METS and other metadata records).
  • Retrofitting our existing CAP portal service to use the Authentication API, as well as use Cantaloupe for all image serving.
  • Decommissioning the older ContentServer software on each repository node.

With the Authentication API not as established as we thought, we may go a different route.


One possibility might be for Cantaloupe to grant full access to sponsored collections, and use a separate token similar to our existing token for subscription collections.   This would effectively disable most of the utility of IIIF for subscription content, other than allowing us to use the same ContentServer software for both types of content.

We haven't made decisions, only come to the realization that there is much more work to be done.   My hope is that we can push forward with making sponsored collections accessible via IIIF, even if we simply deny IIIF access to subscription collections in the interim (i.e., CAP portal access only) while we figure out how to grant access to subscribers via IIIF.

IIIF isn't the only place we have this consideration

This consideration isn't unique to our IIIF implementation, and we come up against it regularly.

With the Heritage project the funding institutions sponsored public access to all the images from those LAC reels, but more advanced search capability was required to be a subscription service.   We implemented this in the shorter term by disabling (for non-subscribers) page-level search on the Heritage portal which hosts this content.

Some researchers and other external projects (some funded by Canadiana as part of the Heritage project, but without the involvement of Canadiana technical staff) have been collecting additional metadata for these LAC reels in the form of tags, categorization, and in some cases transcriptions of specific pages.  This data is being offered to us using project-specific data designs that don't conform to any of the standards we plan on adopting in the future within the primary platform (See: IIIF annotations, with us likely extending our TDR preservation documentation to support encoding of specific open annotations).

Our platform doesn't yet have the capability to accept, preserve and provide search on this data. When we start a project to accept some of this data we will also have to figure out how to implement a mixture of funding models.  It is expected that most researchers will want the data they have funded to be open access, and would be unhappy if we restricted search on their data to subscribers.  This means we'll need to separate the subscription-required data funded by some groups from the open access search data provided by other groups.

It is likely we will end up with multiple search engines housing different types of data (with search fields different from the common ones used within our primary platform), searchable by different groups of people, with a search front-end needing to collate the results and display them in a useful way.
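A very rough Python sketch of such a collating front-end follows.  Nothing here reflects a design we have chosen: the engine names, the subscription label and the idea of merging on a raw score are all assumptions, and in practice scores from different engines are not directly comparable without some normalization.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Result = Tuple[float, str]                 # (score, document id)
SearchFn = Callable[[str], List[Result]]   # query -> results
# Each engine: (subscription required, or None if open access; search function)
Engines = Dict[str, Tuple[Optional[str], SearchFn]]


def collate(engines: Engines, subscriptions: Set[str], query: str,
            limit: int = 10) -> List[Tuple[float, str, str]]:
    """Query every engine this user may search and merge results by score."""
    merged = []
    for name, (required, search) in engines.items():
        if required is not None and required not in subscriptions:
            continue  # skip engines the user has no access to
        for score, doc_id in search(query):
            merged.append((score, doc_id, name))
    merged.sort(key=lambda r: r[0], reverse=True)
    return merged[:limit]


# Hypothetical engines returning canned results, for illustration only.
engines = {
    "open_tags": (None, lambda q: [(0.5, "tag-1")]),
    "page_text": ("heritage", lambda q: [(0.9, "page-1")]),
}
print(collate(engines, set(), "ottawa"))         # open access results only
print(collate(engines, {"heritage"}, "ottawa"))  # both engines
```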

Moving more of Canadiana's software projects to GitHub

As some of the links in recent articles suggest, we have started moving more of our software from an internal source control and issue tracker towards public GitHub projects.  While this has value as additional transparency to our membership, I also hope it will enable better collaboration with members, researchers, and others who have an interest in Canadiana's work.

For the longest time the Archive::BagIt Perl module was the only GitHub project associated with Canadiana.  Robert Schmidt became the primary maintainer of this module when he was still at Canadiana, and this module is still critical to our infrastructure.


Added to the two IIIF-related Docker images that I'll discuss more later are two Perl modules:

  • CIHM::METS::App is a tool to convert metadata from a variety of formats (CSV, DB/Text, MARC) to the 3 XML formats we use as descriptive metadata within our METS records (MARCXML, Dublin Core, Issueinfo).  This is used in the production process we use to generate or update AIPs within our TDR.
  • CIHM::METS::parse is the library used to read the METS records within the AIPs in the TDR and present normalized data to other parts of our access platform.  For more technical people this provides an example of how to read our METS records, as well as documenting exactly which fields we use within our access platform (for search and presentation).

My hope is that by the end of the summer all the software we use for a TDR Repository node will have moved to GitHub.  This provides additional transparency to the partner institutions who are hosting repository servers, clarifying exactly what software is running on that hardware.

We are a small team (currently 3 people) working within a Canadian charity, and would be very interested in having more collaborations.  We know we can't do all of this alone, which is a big part of why we are joining others in the GLAM community with IIIF. Even for the parts which are not IIIF, collaboration will be possible.

If you work at or attend one of our member institutions, or otherwise want to know more about what our technical team is doing, consider going to our GitHub organization page and clicking watch for sub-projects that interest you. Feel free to submit issue requests whether it be noticing a bug, suggesting a new feature (maybe someone with funding will agree and launch a collaboration), suggesting we take a closer look at some existing technology, or just asking questions (of the technical team -- we have other people who answer questions for subscribers/etc).

If not on GitHub, please feel free to open a conversation in the comments section of this blog.