Thursday, December 21, 2017

Canadiana DevOps 2017 year review and look to the future

Many ongoing changes for the DevOps team this year.

CRKN update

The CRKN December 2017 Newsbrief provides some updates about Canadiana itself.  The short-form is that there will be a Canadiana membership meeting mid-January to vote on an offer to consolidate the operations of the organizations.

I'm excited about the possibility of being a CRKN employee in the next few months.  As an organization they maintain close ties with their educational sector members, and they don't get confused with being a 'vendor' as Canadiana has been.  I look forward to not only the new employer, but the closer relationship with other people working in library technology across Canada.

In the new year we'll also be meeting some new staff.  As well as the existing CRKN staff, there are two positions we will need to fill in DevOps.

  • A Metadata Architect, as Julienne left for a job at LAC.
  • A System Administrator, as we have more work than the 2 of us remaining can handle.

IIIF Update

Our custom content server has been replaced with Cantaloupe and a few support applications.  We are using our existing authentication model, which requires a signature for each specific request.  This means that regular IIIF clients won't work yet without separate authentication.

We set up a demonstration for Sascha's talk at Access 2017, which allows specific logged-in users to set a cookie that will tell the content server to allow access to everything.  For these users the "About" tab for any document on any of our portals also has a "IIIF Presentation Document" section which includes a URL that can be cut-and-pasted into any IIIF viewer.   If you wish access to this demo, please get in touch with Sascha.

We have plans for a new authentication system and adopting more of the IIIF APIs in the future, but we need more work done on other aspects of the platform before we can do that.

Future of Canadiana's preservation platform

Back in April I wrote about JHOVE, and how we would be integrating format identification and validation into our preservation platform.

After we started some of that work in the spring we decided to explore some alternatives.  Our metadata architect Julienne did an environmental scan and evaluated some of the available tools.  We concluded that rather than continuing to update and maintain our current OAIS platform, we would adopt an existing and already maintained platform.  The plan is to migrate from our custom OAIS platform to Archivematica.

This will involve providing a clean separation between our preservation platform and our access platform.

As well as changing the OAIS platform, we also plan to upgrade from using our own custom replication and validation services on top of ZFS to using OpenStack Swift.
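To give a sense of what a validation (fixity) pass involves, here is a minimal sketch in Python.  It is not our actual service: the function names, the use of SHA-256, and the manifest shape (relative path mapped to an expected hex digest) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(root, manifest):
    """Compare files under root against a hypothetical manifest.

    manifest maps a relative path to its expected SHA-256 hex digest.
    Returns {relative_path: "ok" | "missing" | "mismatch"}.
    """
    report = {}
    for relative_path, expected in manifest.items():
        target = Path(root) / relative_path
        if not target.is_file():
            report[relative_path] = "missing"
        elif sha256_of(target) != expected.lower():
            report[relative_path] = "mismatch"
        else:
            report[relative_path] = "ok"
    return report
```

A replication service could run the same check on each copy and only trust copies that report "ok" for every file.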

When we will be able to make these changes depends on the new staff, and on how familiar they already are with these new technologies (or how quickly they can become familiar).

Future of Canadiana's access platform

As part of the environmental scan we also evaluated access tools such as Access to Memory, also primarily developed by Artefactual Systems.  While a great tool for archival description and access, Canadiana exists in that mushy-middle where we aren't exactly a library, exactly an archive, or exactly like any specific part of the GLAM community.

Before Julienne left she did some data modeling work and pointed us in the right direction for next steps.  We look forward next year to exploring the Portland Common Data Model, Fedora, and components of Samvera.

Docker and GitHub

More of our software and configuration is up on our c7a GitHub, an initiative started in the summer.  While we maintain a local private Subversion repository, we are slowly moving everything to public repositories on GitHub.

At the same time as the move to GitHub we started to deploy more of our software via Docker, with most of our software and configurations now moved.

Repository servers

A repository server node, which currently also provides public access to repository content, has the following docker images:

  • cihm-public-cos has the Apache image that sits in front of all related web services.
  • cihm-cookie-authorizer is used to verify JWTs and set related cookies.
  • cihm-file-access verifies JWTs and provides direct access to files, such as for PDF downloads.
  • cihm-cantaloupe is our Cantaloupe configuration; Cantaloupe provides derivative images using the IIIF Image API.
  • cihm-repomanage has tools for managing the file repository, such as replication and validation (fixity).
  • I am currently switching over to using the official Docker CouchDB image (tag 1.7.1 for the time being).
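As an illustration of the kind of check that cihm-cookie-authorizer and cihm-file-access perform, here is a minimal sketch of JWT verification using only the Python standard library.  It is not the code in those images: the HS256 algorithm, the shared secret, and the function names are assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def b64url_encode(raw):
    """Base64url without padding, as used for JWT segments."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(segment):
    """Restore the padding that JWTs strip, then decode."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def sign_hs256(claims, secret):
    """Build a token; included only so the example is self-contained."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        b64url_encode(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    mac = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256)
    return signing_input + "." + b64url_encode(mac.digest())


def verify_hs256(token, secret, now=None):
    """Return the claims if the signature and expiry check out, else None."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(b64url_decode(header_b64))
        claims = json.loads(b64url_decode(payload_b64))
        signature = b64url_decode(signature_b64)
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
    except ValueError:
        return None  # wrong segment count, bad base64, or bad JSON
    # Refuse any algorithm other than the one we expect (assumed: HS256).
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, signature):
        return None
    if not isinstance(claims, dict):
        return None
    exp = claims.get("exp")
    current = time.time() if now is None else now
    if isinstance(exp, (int, float)) and current >= exp:
        return None  # token has expired
    return claims
```

A service in front of the files would call something like verify_hs256 on each request and only serve the file (or set the cookie) when claims come back.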

Application servers

An Application server node has the following images:

  • cihm-public-cap has the Apache image which is in front of all the web-facing application services.
  • 'cap', which is currently only in our Subversion repository and only deployed to development servers.  In production we still use the older deployment mechanism (Capistrano).
  • cihm-metadatabus has scripting to stream search data distributed by CouchDB to Solr.
  • We are using the official Docker Solr image (tag 5.5.5)
  • These servers also have CouchDB servers
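The CouchDB-to-Solr streaming done by cihm-metadatabus can be pictured with the sketch below.  It is an illustration rather than our scripting: it assumes a CouchDB _changes response fetched with include_docs=true, assumes the Solr uniqueKey field is "id", and uses hypothetical function names.  It only builds the JSON bodies and leaves the HTTP calls out.

```python
import json


def to_solr_doc(couch_doc):
    """Copy a CouchDB document, dropping internal fields such as _id and _rev."""
    solr_doc = {key: value for key, value in couch_doc.items()
                if not key.startswith("_")}
    solr_doc["id"] = couch_doc["_id"]  # assumption: Solr uniqueKey is "id"
    return solr_doc


def changes_to_solr(changes):
    """Split one _changes response into documents to add and ids to delete.

    Returns (adds, deletes, last_seq); last_seq is what a caller would pass
    as the "since" parameter on the next poll.
    """
    adds, deletes = [], []
    for row in changes.get("results", []):
        if row["id"].startswith("_design/"):
            continue  # design documents are not search data
        if row.get("deleted"):
            deletes.append(row["id"])
        elif "doc" in row:
            adds.append(to_solr_doc(row["doc"]))
    return adds, deletes, changes.get("last_seq")


def solr_payloads(adds, deletes):
    """JSON bodies for Solr's update handler: an array of docs, then deletes."""
    payloads = []
    if adds:
        payloads.append(json.dumps(adds))
    if deletes:
        payloads.append(json.dumps({"delete": deletes}))
    return payloads
```

Keeping the transformation separate from the network code makes it easy to replay a saved _changes response when debugging what ended up in the search index.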

Demo server

We have a demo server which mostly runs legacy applications like the CDP, rdf and AV which we don't maintain, but it also hosts a current demo:
  • cihm-iiif-presentation offers the IIIF Presentation API demonstration, which reads our current CouchDB presentation documents and provides them in a IIIF-compatible way.
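To show roughly what "providing them in a IIIF-compatible way" means, here is a minimal sketch that turns a simplified document into a IIIF Presentation API 2.x manifest.  The input shape (id, label, and a list of pages with image, width and height), the example.org URLs, and the function name are all hypothetical; our real CouchDB presentation documents are structured differently.

```python
from urllib.parse import quote


def build_manifest(doc, base_url, image_service_base):
    """Build a IIIF Presentation 2.x manifest dict from a simplified document."""
    doc_url = f"{base_url}/{quote(doc['id'], safe='')}"
    canvases = []
    for index, page in enumerate(doc["pages"], start=1):
        canvas_id = f"{doc_url}/canvas/p{index}"
        # Image API identifiers must have "/" percent-encoded.
        service_id = f"{image_service_base}/{quote(page['image'], safe='')}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": page.get("label", f"Image {index}"),
            "height": page["height"],
            "width": page["width"],
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{service_id}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "format": "image/jpeg",
                    "height": page["height"],
                    "width": page["width"],
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": service_id,
                        "profile": "http://iiif.io/api/image/2/level2.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{doc_url}/manifest",
        "@type": "sc:Manifest",
        "label": doc["label"],
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }
```

Serialized with json.dumps, the result is the kind of document whose URL can be pasted into a IIIF viewer, provided the viewer is also allowed to fetch the images.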

Other servers that don't provide public access

We have other servers that are used internally by staff to manage processes and that don't have publicly accessible interfaces.

  • The servers that build SIPs, and the servers we use to ingest SIPs into AIPs to be stored on repository servers, use the cihm-ingestwip image.
  • The server used to manage our metadata bus databases uses the cihm-metadatabus image (extracts metadata from SIPs, produces documents for presentation and search), alongside a server running the official Docker Solr image for local search.
  • We also have CouchDB running on many servers, as we use it as a reliable multi-datacenter replication service for most of our metadata.

Best wishes of the season, and Happy New Year.

Wednesday, December 20, 2017

Does public transit "prove" private vehicle ownership and driving is inappropriate?

If you haven't already read it, please read my earlier article where I discuss a layered model for road transportation, and I try to clarify that "technological protection measures" (TPMs) are actually a restriction on who is allowed to drive (IE: author software for), or choose drivers for, communications technology.  I strongly believe there are conversations that wouldn't even happen if we were talking about cars rather than computers.

Once you have a similar understanding of the communications technology being discussed, and the most appropriate transportation technology analogy, you can begin to see just how inappropriate some of the statements made about communications technology sound.

When I was a witness in front of the C-32 committee I gave a version of my "I'm holding up four things" talk I had already given in multiple settings (See: Protecting property rights in a digital world).  The intention is to clarify that when discussing TPMs there are potentially 4 things that have owners (the media, the copyrighted work stored on media, the access device, and the software on the device), and that focusing only on one of them (the non-software copyright owner) risks inducing infringement or effectively abolishing the property and other rights of the other 3 owners.

I am an example of someone who has all 4 ownership interests: I own media, I am a copyright holder for non-software works, I own devices, and I author software.

One of the most vocal opponents of my attempt to protect the rights of all 4 classes of owners is John Degen (See: Making a living as an author vs. off of authors.)  When he was a witness in front of the Senate committee studying the renumbered C-11 he discussed technology as well, but in a way that sounds quite silly to those of us who understand the technology and the relationship to creators.

Mr. Degen: This morning Mr. Henderson referenced a couple of times real world situations and a lot of the panic that goes into extreme situations that might happen. This is a Kobo eReader — not a commercial for Kobo — and I have a bunch of books on it. Let us say I was studying these books in a university environment. I have Moby Dick, that great Canadian classic up here. Let us say I was studying Moby Dick. On this piece of technology, Moby Dick is locked. It is within the Kobo propriety locked system. It cannot be transferred to a Kindle, for instance. They do that for definition within the marketplace. There are fears out there that were I to be studying in a classroom environment, the lock would impede my fair-dealing rights to research and private study. I get around that completely legally, and without breaking any locks, by using paper and a pen. I read what is on the electronic device and I make my notes for research and private studying. I am, in effect, copying what is in the text and I do that perfectly legally. That is more likely what will be happening in classrooms. The extreme fears about digital locks locking students away from information are completely unfounded.

If I provide a transportation technology translation of this intervention, you will see why what Mr Degen said makes no sense.

Fictional person: This morning people expressed panic about what might happen if individuals are no longer allowed to have the keys to the locks on their cars, choose drivers, or drive vehicles themselves.  I came to this committee this morning by OCTranspo.  This is a locked system where the vehicles are owned by the city, and the city employs all drivers.  There are other privately run systems such as Greyhound Canada, a subsidiary of British transport company FirstGroup, that own the vehicles and hire all the drivers.  There are fears that if individuals couldn't drive vehicles or choose who drives their vehicles, it would restrict their travel. The fact I got here by OCTranspo is proof this is not the case.  In fact, if these private and public sector transportation systems didn't exist I could have walked to the committee hearings. The extreme fears about non-owner locks on vehicles or prohibitions against choosing drivers or driving one's own vehicle are completely unfounded.

There are many reasons to be dismissive of what Mr Degen claims.

While he makes his living elsewhere (staff at Professional Writers Association and later Writers Union, and at the Ontario Arts Council when he spoke to committee), he is focused near exclusively on textual literary works.  His suggestion he could read the text on screen and do fair dealing research using pen and paper sounds as silly as someone suggesting all witnesses to all committees could have walked there.   While I live within walking distance of the federal parliament, most Canadians (including Mr. Degen) do not -- and while some creative works are only text, others are not.   His words were dismissive of the rights and interests of the vast majority of Canadian creators.  The Copyright Act regulates activities for works which are nothing at all like text literary works, and it is for these other works that many of the worst controversies arise.

Rather than a reason to dismiss concerns about technological measures, his comments are actually a reason to dismiss claims about the alleged effectiveness of technological measures at reducing copyright infringement.  For the works in which fair dealing research doesn't require unlocking, copyright infringement also doesn't require unlocking.  Someone who actually wanted to infringe the copyright on a textual work only has to re-type it.

Mr. Degen doesn't have an interest in driving his technology, or having any say in who does the driving.  I personally don't have a driver's license, but I still care about who is doing the driving when it comes to transportation technology. I think there is a big difference between a privately run transit system where a private corporation decides all the policy, and a publicly managed transit system.  I believe all passenger transportation systems, public or private, should be government regulated.  The fact Mr. Degen held up a device with unaccountable and non-transparent private policy suggests he might not even care about these important distinctions.

While it is his right to not care who controls technology, it is not valid for him to claim his lack of personal interest is a reason to dismiss other people's interests or seek to diminish or abolish their rights.

What Mr. Degen describes is different from the OCTranspo example because passengers haven't been misled to believe they own the bus.  In the case of the Kobo, people are being dishonestly led to believe they are "purchasing" something, but they are not given the keys or allowed to change the locks on what they have been told they "own".  If this were an honest business relationship where the vendor wanted to retain control then they would have retained ownership, and Mr Degen's Kobo would have been rented.  There would have been a transparent rental agreement laying out all the conditions. Whether it is the enforceability of the rental agreements for things you don't own, or the legal protection of digital locks you apply to things you do own, it is dishonest and possibly unconstitutional to claim this is a matter of federal copyright law rather than provincial contract and property law.  Without clearly understanding the relationship is closer to a rental than purchase, privacy and other rights aren't being appropriately protected.  Far from being the subject matter of copyright law, technological measures are being abused to bypass many other laws and regulations.

The communications technology we are discussing is the same technology used to create and disseminate works.  Revoking the ability of owners to independently control or have a say in who controls their technology doesn't only impact audiences, but greatly impacts creators.   If some unaccountable and non-transparent third party has the ability to disallow in software (what controls the devices) specific creative works to be distributed, or even created in the first place, this can have a critical impact on culture.

This is why I believe that protecting technology property rights is a prerequisite for protecting creators' rights, and also why I consider those who are opponents to technology property rights to be opponents of creators' rights.

We wouldn't even be having this conversation if we were talking about cars rather than computers.  If Mr. Degen were talking about transportation technology he would have been appropriately laughed out of the committee.

I am a long time creators' rights advocate, focused on technology property rights. I believe fellow creators need to take a closer look at how communications technology works so that they can tell who are allies and who are opponents to protecting their rights.

Friday, December 15, 2017

Why I don't subscribe to the pizza metaphor of net neutrality

Mike Richardson posted a reference to the pizza metaphor of net neutrality. (Edit: Previous link no longer works, so try The pizza metaphor of net neutrality)

I believe this analogy is part of the problem we are dealing with. If we think of "Internet Service Providers" (ISPs) as a flat service, then what I consider to be the core problems won't ever be able to be addressed.

The flat network discussions will continue to confuse more fiscally conservative or libertarian-minded politicians like Maxime Bernier into believing that removing Net Neutrality legislation is a reduction of government control, rather than recognizing that removing the legislation amplifies the harmful impacts of other government interventions.

If you separate the physical layers (OSI model layers 1 through "2.5") from the various layers built over-the-top such as TCP/IP layers, then you get an entirely different picture.

Right-of-way is a government imposed limit on property rights used to put infrastructure (wires, fiber, sewers, roads, etc) above and below public and private property without needing to ask for permission or make payments to the property owners. This is a major government imposition that should come with strong conditions if allowed on behalf of private sector entities at all.

With analog services there might have been a temporary justification for granting this privilege to a tiny set of private sector entities (one in each jurisdiction for each of the most common specialty analog services: telephone and cable television), but with converged digital communications following the OSI layered approach this government intervention on behalf of private sector entities is no longer appropriate.  All the other municipal infrastructure requiring right-of-way, including wiring used for electrical distribution, is already public and we must recognise it is past time to end the communications exception.

Much of the physical communications infrastructure was already paid for or massively subsidized by taxpayers, and if we focused any new public money narrowly on the physical layer I believe this would reduce rather than increase overall government expenses.  Like road infrastructure, the physical infrastructure is what much of the economy is built on.  I believe it is inappropriate for a small number of private sector companies to have the ability to manipulate the economy and society through any manipulation of that infrastructure.

If the existing private sector entities don't like the idea of ending the communications exception, or claim it is an expropriation of their property, then governments can revoke the right-of-way privilege and see how long these companies last.  While they may think their cabling has value, I suspect the legal fees alone in trying to negotiate with all the land owners to pay rent would wipe out any theoretical value of this cabling.

It is only the physical layers that need strong government intervention. The services that are built on top of that, whether that be Internet transit or other over-the-top (layers 3+) services, would benefit from more competition -- not more regulation.

Switching ISPs should be as easy as choosing to go to one retailer or another, going to a friend's house rather than a park, using UPS rather than FedEx or Canada Post, or choosing between any other competitive product or service.

If we had properly managed municipal physical layer infrastructure we could have individual devices in the home connecting to multiple different providers: Maybe Mom and Dad subscribe to a filter-free Internet service on their devices, while the devices the under-aged kids use connect to a specifically filtered service of their parents' choosing.  "Television" devices would directly connect to multiple broadcast and content catalog services of the audience's choice, with what was previously called "cable TV" no longer being relevant for aggregation.  With a proper free market in OTT services there might even be a special service which "Internet of Things" devices can connect to for software updates and talking to specific servers, and which would still protect them from unauthorized access because they wouldn't need publicly routed "Internet" addresses at all -- not every homeowner should be expected to know how to manage the filters in a firewall, even if some homeowners should have the protected right to manage their own.

If your two-way voice service doesn't connect you to the correct pizza company, then you immediately switch to any of the large number of competitors who will provide the service you are demanding.  Market forces will quickly wipe out corrupt companies, with the only government intervention needed being number portability (addressing).  This pizza analogy only works if you stay within the old analog way of thinking that two-way voice service, previously called telephone, is a near-monopoly rather than having as many competitive options as there are choices for eating.

IPv4 and IPv6 should not be the only protocols being considered, and I believe that the municipal communications infrastructure shouldn't be imposing protocols or standing in the way of developing and deploying future protocols.  That is the role of private sector service providers, including ISPs which operate a layer 3 transit service.  Services at these layers should never be thought of as the same thing as the underlying physical infrastructure.

ISPs are providing a service similar to international shipping.  While international shipping is an important service, and a service that is appropriate to have common carriage (network neutrality) rules applied to it, we need to understand this as only one service among many and not equate ISP services with digital communications networks.  Common carriage doesn't apply to all road usage, and network neutrality shouldn't apply to all municipal data infrastructure usage.

I see Network Neutrality legislation as a temporary flawed answer to problems caused by governments allowing vertically integrated "retailers" to own the physical infrastructure.   While I strongly believe network neutrality legislation is a necessary evil, I still consider it evil and not something that should be confused as being a long-term solution.

Thursday, December 14, 2017

We wouldn't even be having this conversation if it was cars rather than computers!

Section 92 of Canada's copyright act indicates:

Review of Act

92 Five years after the day on which this section comes into force and at the end of each subsequent period of five years, a committee of the Senate, of the House of Commons or of both Houses of Parliament is to be designated or established for the purpose of reviewing this Act.

1997, c. 24, s. 50; 2012, c. 20, s. 58.

On December 13th the following was included in a motion by Bardish Chagger, Leader of the Government in the House of Commons and Minister of Small Business and Tourism:

(c) the Standing Committee on Industry, Science and Technology be the committee designated for the purposes of section 92 of the Copyright Act; and

This indicates that in the new year the INDU committee will be reviewing the Copyright Act.

I was actively involved in the process last round.  I joined the process in the summer of 2001 when I heard that Canada was contemplating adding "technological protection measures" to our Copyright Act.  Software authors already understood the harm from the anti-circumvention aspects of the USA's Digital Millennium Copyright Act (DMCA).

In an earlier article I discuss a layered model for road transportation, and that "technological protection measures" (TPMs) are actually a restriction on who is allowed to drive (IE: author software for), or choose drivers for, communications technology.

While I don't have a driver's license or drive a car, I have been driving computers since 1981.  The rights of computer owners to drive their own computers, or choose their own drivers if they don't have the skills themselves, should be understood to be as fundamental a right as driving automobiles already is to the rest of society.

I had to get involved in this policy discussion, even though it never made sense to me that we were even having the discussion.  I acknowledge that technology, whether transportation or communications technology, can be abused in breaking the law.  While this has always been true of vehicles, there has never been a serious policy discussion about denying vehicle owners the right to drive their own vehicles, or the right to choose their own drivers if they don't have those skills.  The only reason we were having this conversation is that policy makers, like the general public, lack the literacy in communications technology that we all take for granted in transportation technology.

While the section 92 review was announced in the summer of 2001, it wasn't until October 2003 that submissions were due.  My first formal submission to that process is available through my website.  In the summer of 2001 I started a discussion forum called "canada-dmca-opponents" which grew into the Digital Copyright Canada website.

I spent considerable time from 2001 through the passage of Bill C-11 in 2012 active in that area of policy.  This included only accepting part-time jobs so that I could participate.  When I started my current job in 2011 I only accepted an 80% contract so that I could attend every Bill C-32 and Bill C-11 committee meeting.

While my focus was on the rights of technology owners, false claims were often made about my views on copyright.  It was frequently suggested that if I was opposed to TPMs, the only possible reason could be because I didn't believe authors should get paid.   As a software author myself this was a ludicrous suggestion, and yet even some of the most sympathetic journalists would falsely claim I was an "anti-copyright crusader".

This would never have happened if we were talking about cars rather than computers.  Someone claiming that the only reason someone wants to drive their own car is because they wish to break the law or are a criminal would be appropriately laughed out of the room.   Unfortunately when it comes to technological measures, few recognize just how ridiculous it is.

It took me years to realize just how low technology literacy is within policy circles.  Most of the conversations about TPMs come from the belief that it is something applied to copyrighted works, and that these measures allow decisions to be made (can copies be made, under what conditions, etc).  This is similar to believing that a paperback book is sentient, and can come alive and autonomously run away if the reader of the book tries to do something the book doesn't like.  I have come to refer to this as the "Harry Potter" understanding of TPMs.  While purely based on fiction, this is the most common misunderstanding of TPMs.

If we were talking about cars rather than computers, people with such a low literacy of the relevant subject matter would not be considered experts or be allowed to dominate the debate.

I wish the review of the Copyright Act would be about Copyright law.

I've learned quite a bit by speaking with fellow creators and creator groups, and have knowledge of the wide variety of market changes each group is facing.  In nearly all cases there are legitimate changes in which intermediaries are involved in the relationships between creators and their audiences.  While there are many intermediaries crying foul at these advancements, the vast majority of the changes I've observed are positive for creators and should be encouraged.  In many cases when there are infringements, these are infringements induced by the harmful business practices of specific intermediaries: they are infringements that could be handled with an "inducement" regime for contributory infringements, rather than the incorrect focus of the "enablement" policy that was added as part of C-11.

I am forced again to focus on TPMs this round of Copyright Act review.  While it may be true that some copyright holders use TPMs, it has no more place in Copyright law than a National Energy Program has simply because some copyright holders use electricity.

I look forward to a future when the Copyright Act only has Copyright related provisions in it, and we can finally have a proper conversation about modernizing copyright law that isn't tainted by being dominated by non-copyright related discussion.

Monday, December 11, 2017

Hiding OSI layers leading to policy failures: Net Neutrality, Encrypted Media, ...

I've regularly written about the OSI model for digital communications.  I'm increasingly concerned that policy discussions that hide these layers are leading to policy failure.  I see this in the odd rhetoric coming out of the USA on network neutrality, which is the same country that originated the policy failures around technological protection measures.

When I learned about digital networks in the late 1980s this included the Open Systems Interconnection (OSI) model, which describes the different interoperable layers that digital communications enables.  This is in contrast with analog networks which were purpose built and inflexible.  Each layer in a digital network is built upon other layers, starting with the physical layer that describes specific electronics all the way up to the application layer which is the layer closest to the user.

When discussing communications technology people often make analogies to transportation technology, since most people have a greater understanding of transportation technology.  The problem is that a poor and inflexible analogy has become dominant.

A comparison is often made to shipping via boats and rail, where a primary policy is common carriage.  This analogy suggests digital networks only have one layer much like the older analog networks, wiping out the flexibility both in terms of technology and policy which digital networks enable.  This flattening of layers also causes policy confusions which wouldn't happen if the layers were exposed through a better transportation analogy.

A layered model for road transportation

In 1994 the federal government formed the Information Highway Advisory Council (IHAC). Discussing roads and highways is an appropriate analogy to communications technology as it exposed the layers and complexity of the network, even though road transportation is still less flexible than digital communications networks.

A simplification of layers built on road infrastructure might be:

  • Road infrastructure.  This is comparable to the physical network layers.
  • Vehicles run "Over The Top" of those roads.  This is comparable to physical devices connected to the communications network.
  • Drivers control the vehicles.  This would be comparable to software authors, where software is the instructions that drive digital devices. (Note: It is software that differentiates between TCP/IP and other networking protocols.  ISPs are businesses that run their own devices and provide transport of packets encapsulated within TCP/IP.)
  • Passengers and parcels which would be placed in/on the vehicles for transport.  This is comparable to the applications which use the network (two way or one-way audio/video/text/etc communication) 

With transportation the roads are a mixture of municipal, provincial and federal management.  Private roads including driveways connect to publicly managed infrastructure.   While publicly owned vehicles exist, private (corporate and individual) vehicle ownership far exceeds public.  Individual citizens are allowed (in many ways actively encouraged) to personally own and drive vehicles.

If we use this road transportation analogy to go through various policy discussions the failures become more obvious.

Technological Protection Measures

While Canada formed IHAC, the USA formed the National Information Infrastructure working groups. Bruce A. Lehman chaired the Working Group on Intellectual Property Rights which during 1995 came up with a disastrous concept:  if it was possible for vehicles to be used to transport something illegally, then private citizens should not be allowed to drive vehicles or choose drivers.  Since immediately outlawing private drivers would have been too controversial to pass, a mixture of law and market forces would be used to indirectly achieve the goal.

  • Vehicle manufacturers would be granted the right to impose drivers, and it would be made illegal for the vehicle owner to fire that driver and choose their own. While a private citizen might be allowed to "own" a vehicle, they are not given the keys to the locks and it is made illegal for the owner to change the locks.
  • Destinations would be given the legally protected right to deny access to any person who did not provide proof that they arrived using an "authorized" vehicle with the manufacturer imposed driver.  The ability to access these destinations would serve as a market force to impose manufacturer determined drivers onto the majority of the population.

Mr. Lehman and his supporters may claim they were only trying to reduce unlawful activity, but it should be obvious that the harm to the economy and society as a whole of this type of policy greatly outweighs the alleged harm he was claiming to reduce.

This is the essence of the policy which Lehman tabled, which was policy laundered through WIPO in 1996 when the 1995 bill didn't pass within the USA, and which later became the controversial part of the DMCA in 1998.  Canada included this harmful policy in Bill C-11 which inappropriately provided legal protection for "access controls" (IE: ties between content and specific devices/software, and non-owner locks on devices), even though this was not required by the WIPO treaties.  An even worse variation of this harmful policy was included in the TPP, and I will be surprised if the USA doesn't try to push this as part of the NAFTA renegotiation.

Network Neutrality

Imagine a country where a tiny subset of retail outlets owned all the roads. Governments and lobbyists would claim that there was "competition" in road infrastructure if home-owners in a specific city were able to choose between connecting their driveway to the Walmart roads or the Loblaws roads, where these roads favoured in subtle and not-so-subtle ways the ability of people to access some destinations over other destinations.

Companies like DHL, FedEx and UPS might be allowed to exist, but would be disparagingly called a "wholesale" market of the services of Walmart and Loblaws, rather than recognizing shipping as a different type of service than vertically integrated road owners which might also own their own vehicles and do their own shipping.  (Digital Example: Companies like TekSavvy are claimed by the CRTC to be part of a "wholesale" market, even though their TCP/IP routing service is built on top of the same physical infrastructure).

While all surface transportation related services are built "over the top" of the physical layers, the "over the top" terminology would be abused to refer only to competing services.  Even if you wanted to buy the identical item from Metro or Loblaws, the purchasing from Metro would be called "over the top" while the purchase from Loblaws would not.  The nearly identical service offered by Loblaws/Walmart would be regulated differently (or prohibited) if offered by a competitor (Digital Example: Bell's IPTV service branded as FibeTV is regulated as a cable service, even though our Copyright Act explicitly disallows this type of new media retransmission without separate permission/payment).

In a road neutrality debate the US Department of Transport chairman might claim the fact that so many people get entertained at Cineplex theaters is somehow "proof" that road neutrality already doesn't exist, so what tiny amount of minimal regulation currently exists should be repealed.

This may read as utter nonsense that no government would ever allow, but this is essentially the situation we are in today with communications networks. Incorrectly regulated convergence allowed the incumbent phone and cable companies to gain all the benefits of the OSI layered digital networking.  It now doesn't matter which physical connection (coax or twisted pair) comes into the home or business, the same services can be built on top such as two-way voice (previously called telephone), one-way video (previously called Cable TV), and many other applications.  Just as Walmart now sells food and Loblaws sells clothing, "phone" companies sell streaming video and "cable" companies sell two-way voice.

Policy Solutions

While common carriage has a place in the policy mix, it needs to be thought of as one small policy lever among many -- and only applied to services at specific layers of the communications stack.

As we have nearly always done with transportation technology, structural separation of communications technology is required.  I've come to the conclusion over recent decades that anything less than structural separation will be as effective as deck-chair rearranging on the Titanic.

The model we use for roads was created at a time when it was understood that roads were the infrastructure upon which much of the economy and society was built.  It is entirely inappropriate for the ownership or control of the similarly critical infrastructure for the new economy to be in the hands of a small group of private sector entities.   As with roads the different levels of government might hire private sector contractors to do much of the work, but the ownership and control must rest with the public sector.

We need open competition in the other layers.  The need for foreign ownership rules only applies to the physical infrastructure, which I've already suggested should be managed by the public sector. Other layers are already recognized with transportation to not need those restrictions: while there are Canadian automobile manufacturers, people are allowed to purchase foreign designed and manufactured vehicles.  The same should apply to digital communications products and services that run "over the top" of the government managed infrastructure.

Private citizens must have the right to own their own vehicles, and choose their own drivers for these vehicles including being their own driver if they have the skills.  With digital technology the equivalent is the right to own their own devices, to choose their own software, and to author their own software if they have those skills.   Laws which legally protect non-owner locks on devices, or allow content providers to impose specific device manufacturers/software, should be repealed immediately.

The Ministry of Communications needs to be restored to federally mirror for communications what the federal Ministry of Transport handles for transportation.  This is a ministry that was abolished in 1995, at the time when convergence was being mismanaged.  The CRTC is currently inappropriately administered through the Minister of Heritage, who is in a conflict of interest with specific types of communication entities.  The pre-convergence Telecommunications and Broadcasting Acts are in critical need of modernization or replacement.

We as a society have always subsidized arts and culture, which are not always able to be adequately privately funded. While public arts funding should clearly exist for works created to be distributed by communications networks (such as scripted video programming), this should be done through direct accountable public funding and not through cross-subsidies between layers of the communications network.  There are too many ways to get cross-subsidies wrong and for governments to be manipulated: much of the current discussion around a so-called "Netflix tax" is a dishonest misinformation campaign initiated by vertically integrated companies like Bell, Telus and Rogers (often through the TV stations/studios and other media they own).

It should be understood that incumbent vendors will not be happy with any policy corrections. The increasingly extreme policy proposals coming from Bell, Telus and Rogers are to be expected as they are dealing with an existential crisis.  Required structural separation and free market competition would put many of these outmoded companies out of business.  This should be understood as a good thing, not something to minimize or delay, as structural separation will lead to a more innovative economy and society.  It really is a win-win scenario for nearly everyone, and is as critical to our future as governments building and maintaining the road infrastructure has been for the industrial economy.

Google Doc version (Which you can print or download a PDF from).

Thursday, October 5, 2017

Yes, CBC, I'm waiting for Alias Grace to be on Netflix.

CBC runs InCanada, an "online" Canadian Media Panel. I put "online" in quotations because while the panel is online, the CBC's broadcaster bias is always visible in how they ask questions. The latest survey is no exception.

The survey was essentially about Alias Grace, a Canadian-American miniseries that will air on CBC on September 25, 2017, and on Netflix on November 3, 2017.

The survey typically conflates Netflix with broadcasters, when Netflix is not a broadcaster. This is about as nonsensical as confusing a radio station with a record store when discussing music, and yet the legacy broadcasters continue to try to push this nonsense.

I sometimes make the comparison to the difference between an outhouse and indoor plumbing: Like broadcasting, people made use of outhouses before modern conveniences like indoor plumbing came along. And like indoor plumbing, people aren't likely to want to go backwards once they get used to online streaming.

While outhouses still exist in places where indoor plumbing is not available, it is not the predominant way that people "do their business". Unlike with an outhouse, there is no sense of urgency to use the outmoded platform to watch Alias Grace.

The survey asked if I saw the American series The Handmaid's Tale (TV series). While this was distributed by Hulu starting in April 2017, the series was blocked from Canadian access by Bell until they made it available on CraveTV in late July. Bell blocking, hiding and/or delaying lawful access to content is typical, and I consider them to be the largest Canadian contributory copyright infringer for their ongoing inducement of infringement.

If the NAFTA negotiations were intended to modernize trade relations within North America, the trade barriers disallowing cross-border shopping for telecommunications services and creative content would be a top priority. I believe we could massively reduce copyright infringement in North America if we moved to a single content market, where creators from all of North America had unrestricted access to the audiences of North America. That includes the content distribution services. North American audiences should also have the right to subscribe to any North American streaming service, and regional content restrictions within North America would be prohibited.

The concept of Canadians not being able to view content at the same time as US audiences, including having the option to subscribe to the same online distribution services, must quickly become a distant memory.

Canadian Content policy should be focused on content, not on outdated distribution mechanisms. Hopefully a pro-free trade agenda will be part of the current Heritage Minister's thinking: you can't promote Canadian production capabilities and wide global distribution of Canadian content while still allowing regional content blocking.

Bell's anti-free trade agenda is trying to push policy in the opposite direction, including asking for mandated blocking when Canadians wish to access content that is not lawfully streamed in Canada. Bell is asking for mandated blocking because they want competitors to have to block the same competing distribution sites Bell already wants to block, which is also why they oppose VPNs (apparently the technology itself, not only the perfectly legitimate cross-border-shopping use).

If I wanted to watch The Handmaid's Tale at the same time as US viewers (or as those who can tolerate the smell of an outhouse/broadcaster), or on the devices of my choosing, I would be forced to infringe copyright (easiest) or use a VPN (less convenient, but currently more lawful).

There was no sense of urgency to watch The Handmaid's Tale. While there are shows that are important enough to me that would warrant finding alternative streaming options, none of these TV series based on Margaret Atwood novels are of sufficient interest.

My wife and I watched Handmaid's Tale on CraveTV. CraveTV is a horrible streaming service: there is a difference between the indoor plumbing at a 5-star hotel and an out-of-the-way truck stop. We only watch programming on CraveTV when it is not available anywhere else. The CraveTV Android App crashes fairly regularly. CraveTV works on few of my devices, compared to Netflix which pretty much always works -- and Netflix even has a simple app built into the SmartTV such that my wife and in-laws can also use it (CraveTV is too messy for less technical people to put up with).

While CBC isn't as bad as Bell when it comes to policies, I believe their outdated broadcaster-era thinking is harmful to Canadian creators and taxpayers.

Thursday, September 28, 2017

Copyright Board, Copyright Collectives, and the myth that “Fair use decimated educational publishing in Canada”

(This is a letter in an ongoing dialog with a few members of federal parliament. This email was added to the list of submissions for the Copyright Board consultation.)

David McGuinty, my MP in Ottawa South,

David Graham, MP (Laurentides — Labelle),

The Honourable Mélanie Joly, Minister of Canadian Heritage,

The Honourable Navdeep Bains, Minister of Innovation, Science and Economic Development,

Copyright Board Consultations

I would like to thank David McGuinty for forwarding the September 8, 2017 letter from Minister Joly. This was a response to my May 1, 2017 letter titled “Myth: Fair use decimated educational publishing in Canada”. My letter highlighted some of what might colloquially be referred to as “fake news” being spread globally, primarily sourced from Access Copyright, a Canadian Collective Society. The National Copyright Unit of Australia felt the spread of this myth required a response[1].

As this myth primarily relates to an ongoing dispute between a collective society and provincially funded educational institutions, it ties in directly with the current consultation on the Copyright Board of Canada[2].

The consultation paper recognises that there has been an “explosive growth of media and related technologies worldwide”. This specific incarnation of the Copyright Board was created in 1989, the same year that development of HTTP, one of the key technologies underlying the World Wide Web, was initiated by Tim Berners-Lee at CERN.

We live in a world where advanced content recognition, search and online media distribution enables audiences to find and access any content that they want. Sometimes, when copyright owners allow, we are offered a variety of competing access and licensing services to choose from. Modern information and communications technologies have made redundant a sizeable portion of what the Copyright Board was historically envisioned to accomplish.

While the discussion paper suggests we can speed up processes at the board by “Reducing the Number of Matters Coming Before the Board Annually”, the paper does not discuss the need to reverse the historical proliferation of collective societies. At a time when many collectives should be recognised as decreasing in relevance, they continue to increase in political and economic influence.

I will use a few specific problematic areas to illustrate.

Orphaned Works

The incentives behind the current “Unlocatable Copyright Owners” regime administered by the copyright board are counterproductive. The purpose of the regime should be both to encourage copyright holders to be discoverable and negotiate licenses, and to provide copyright users protection from a previously hidden copyright holder who later surfaces. Creators, copyright holders, copyright intermediaries and commercial copyright users should all have economic incentives to make copyright holders discoverable.

Modern ICT has caused some technology vendors and governments to declare “privacy is dead”, so it is inconceivable that a copyright holder who wants to be found is unable to be found. Some responsibility should be presumed on anyone who wishes to harness the privileges which copyright offers.

  • Creators, copyright owners, collective societies, or other intermediaries should never receive proceeds from the unlocatable copyright owners regime. Fees should be kept with the board to fund its own operations and support services to increase discoverability, with any surplus returned to general revenue. There should be a clear economic incentive for these groups to make all copyright holders more easily discoverable.
  • Fees levied against commercial copyright users should be sufficiently higher than what would normally be offered by a copyright holder, to further encourage commercial users to help make copyright holders more easily discoverable.
  • Fair Dealings should be clearly expanded to cover non-commercial uses of works for which licenses cannot be easily obtained, including for reasons of unlocatable copyright holders. There can’t be a negative impact on the market for a work when no such market exists.
  • If a copyright owner is unlocatable, but the creator is locatable, then copyright should revert to the creator.
  • Fees previously distributed to collective societies, but never disbursed to later-located creators or copyright owners, should be returned to the copyright board.

It has been claimed that the “no formalities” requirement of the Berne convention prohibits mandating registration for exercise of any copyright related rights. The reality is that if a copyright owner wishes to get paid they must make themselves known to someone, so it is illogical to suggest that requiring copyright owners do something to make themselves discoverable is a “formality”.

What this failed regime has allowed is for entities like the Access Copyright Foundation to take money from the orphan works regime as well as other fees extracted from authors as excessive transaction fees by Access Copyright, and create their own unaccountable arts funding program[3]. With this entity perceived as doing “good works”, the incentive to make copyright holders easily discoverable and able to receive greater direct payments for their works is diminished. This is a net-reduction in funding for authors, marketed as if it were a benefit to authors.

Educational use of copyrighted works

Nearly all uses of copyrighted works by provincially funded educational institutions are licensed with copyright owners, and not through collective societies. This includes the global growth of Open Access, as well as online databases offering subscription and/or transaction fees.

There is then a thin layer between where the use of a work is already licensed, and where the use of the work does not require a license, that is under dispute between collective societies and educational institutions. This is the dispute underlying the myth that fair dealings decimated educational publishing in Canada.

In this case the relevant parties are not educational institutions or collective societies, but provincial taxpayers and authors. I believe if provincial taxpayers were asked if they were willing to help fund creativity used in the classroom in this thin disputed area they would agree, as long as the funding was accountable and efficiently distributed. Unfortunately, with all the middle-men taking their cut (Access Copyright is said to take 30% for itself), the current regime is inappropriate.

We already have a model for a far more efficient regime active in Canada. The Public Lending Right (PLR)[4] program funds authors directly for the lending of their works in libraries. This funding program is far superior to having this activity covered by the Copyright Act. It is better for taxpayers as the money more efficiently funds authors, rather than all the unnecessary intermediaries and all their lawyers. If applied to educational uses this would not only provide considerably more funds to authors, it would end the expensive decades-long disputes launched by unnecessary intermediaries in front of the copyright board.

The PLR is an example of using the right tool for the right job. There is a harmful misconception held by some policy makers that copyright is a valid substitute for stable arts funding.  Arts funding can be accountably targeted at creators, where the benefit of copyright tends to go to unnecessary intermediaries -- or leaves the country entirely.

As well as initiating a Public Education Right (PER) funding program, copyright law should be amended to clarify as fair dealings the current thin disputed layer of uses.

This clarity should, however, have responsibilities attached to it. Some educational institutions want to have their cake and eat ours too by having exceptions to copyright on their inputs, but charging royalties on their outputs.  The ability of institutions to use any institutional exceptions to copyright, as well as what has been clarified under the PER regime, should be conditioned on the institution adopting an Open Access publishing regime at least on par with the Tri-Agency Open Access Policy on Publications[5].

Lobbying by Collective Societies

Collective societies provide a specific financial service to copyright holders and copyright users. As noted by Copyright Board expert Howard Knopf, “Collectives are an exception from the basic antitrust and competition law abhorrence of price fixing and conspiracies”[6]. As such, they are not optional to copyright holders who want to get paid for some specific uses of their works. Given this, collectives should not ever be able to claim to politically “represent” repertoire members any more than a bank should be able to claim to politically “represent” me simply because I have a bank account.

Collectives have been allowed to present themselves as proxies for the interests of creators - even when they are lobbying government for policies which benefit collectives at the expense of creators.

The operation of collectives should be scrutinized far more closely by government. This should include disallowing collectives from disbursing funds for purposes other than payment to creators for uses of their works. They should not be allowed to directly lobby government or fund foundations. It should never be seen as their money to spend: if authors wish to fund such activities they can voluntarily do so with their own money, including through optional member funded associations. They should never essentially have their money be “taxed” by a collective society intermediary.

More money to authors, more efficient copyright board

With Access Copyright no longer initiating disputes, resource constraints on the Copyright Board will decrease considerably at the same time as we will see increased funding for authors.

While I used Access Copyright as an example, the same will be true of several other collective societies. Better harnessing of modern ICT and modernizing the outdated thinking in our Copyright Act will greatly reduce the number of collective societies still in operation.

There will always be a need for some small number of collective societies, and a need for the copyright board to impose rates when normal commercial negotiations fail, but we should be providing legal and economic incentives to ensure these exceptions become rare.

[1] Myth: Fair use decimated educational publishing in Canada



[4] Public Lending Right program

[5] Tri-Agency Open Access Policy on Publications

[6] Canadian Copyright Collectives and the Copyright Board: a snap shot in 2008

Saturday, September 16, 2017

Taxpayers should pay authors for educational uses of works, not intermediaries

Replying to a Letter to the Editor in The Varsity.

It is taxpayers and authors that are paying the costs of this ongoing dispute, one way or the other.

What we are effectively discussing is a government funding program masquerading as copyright, and because of the misdirection that this is a copyright issue we are allowing intermediaries like educational institutions, collective societies, foreign publishers, and all their lawyers, to extract the bulk of the money.

If Mr. Degen was focused on Canadian authors getting paid he would be agreeing with me that we need to redirect taxpayer money misspent with the current regime towards a program similar to the Public Lending Right. The existing Public Lending Right funds authors based on their works being loaned by libraries, and a "Public Education Right" could directly fund authors based on specific uses of their works in publicly funded educational institutions. This would be applied only to that very narrow area of dispute between what educational institutions (IE: taxpayers) are already paying, and the clear and indisputable limitations of copyright.

Nearly all of what educational institutions use is already paid for, through payments via modern databases and other established systems. This includes the ongoing growth of Open Access. It is Access Copyright that has refused to allow the payment of transactional fees for the narrow area under dispute.

While Access Copyright had a victory with this specific lower court case, they will lose on appeal as they have lost other related cases. This area of law is quite clear and, contrary to Mr Degen's misdirection, the courts have not been on side with Access Copyright's interpretation of the law. This specific case is the outlier.

While the majority of the blame for this costly dispute lies with Access Copyright, that doesn't mean taxpayers or governments should be siding with educational institutions. We should be removing all of these unnecessary intermediaries from the debate entirely.

By fighting for Access Copyright's conflicting interests rather than authors, Mr Degen is pushing for policies which continue to reduce the revenues of authors. My hope is that he will eventually side with authors.

Friday, June 9, 2017 : the hardest part will be saying "no".

Back in April I noted Canadiana is working on adopting APIs from IIIF, the International Image Interoperability Framework. We did a small demo in May as part of our participation at Code4Lib North.  Today is the final day of the 2017 IIIF Conference hosted at The Vatican, and this is an update on our progress.

What have we done so far?

We have a Cantaloupe Docker configuration on GitHub that we used for the demo.  This has the delegates Ruby script which finds the requested image within the AIP stored on the TDR Repository node which Cantaloupe is running on.

We have created a pull request for OpenJPEG to resolve an incompatibility between OpenJPEG and Cantaloupe. The fix allows Cantaloupe to offer access to our JPEG2000 images.

We will be integrating the OpenJPEG fix and some cleaner configuration into our Cantaloupe Docker configuration soon, bringing this Docker image closer to being worthy of being installed on production servers.

Our lead Application Developer, Sascha, created an application (with associated Docker configuration) that offers the IIIF Presentation API.  This reads data from the CouchDB presentation database used by our existing platform.  It is expected that we will be adopting the IIIF structures for data within CouchDB at a later date, but this is a good intermediary step.
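To give a sense of what a Presentation API service has to produce, here is a minimal sketch of turning a document record into an IIIF Presentation 2.x manifest. It is not Sascha's application: the record layout (`id`, `label`, `pages`, `image_id`), the `build_manifest` function and the URL layout are all assumptions made for illustration, and our real CouchDB documents look different.

```python
def build_manifest(base_url, doc):
    """Build a minimal IIIF Presentation 2.x manifest from a simplified record.

    The record layout is hypothetical; it is not the real CouchDB schema.
    """
    doc_url = f"{base_url}/presentation/{doc['id']}"
    canvases = []
    for index, page in enumerate(doc["pages"], start=1):
        canvas_id = f"{doc_url}/canvas/p{index}"
        # Base URL of the Image API service (e.g. Cantaloupe) for this page.
        image_service = f"{base_url}/image/{page['image_id']}"
        canvases.append({
            "@id": canvas_id,
            "@type": "sc:Canvas",
            "label": page["label"],
            "width": page["width"],
            "height": page["height"],
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "on": canvas_id,
                "resource": {
                    "@id": f"{image_service}/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                    "width": page["width"],
                    "height": page["height"],
                    "service": {
                        "@context": "http://iiif.io/api/image/2/context.json",
                        "@id": image_service,
                        "profile": "http://iiif.io/api/image/2/level2.json",
                    },
                },
            }],
        })
    return {
        "@context": "http://iiif.io/api/presentation/2/context.json",
        "@id": f"{doc_url}/manifest",
        "@type": "sc:Manifest",
        "label": doc["label"],
        "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
    }


# Hypothetical record standing in for a presentation database document.
record = {
    "id": "example.00001",
    "label": "Example document",
    "pages": [
        {"label": "Page 1", "image_id": "example.00001.1", "width": 2000, "height": 3000},
        {"label": "Page 2", "image_id": "example.00001.2", "width": 2000, "height": 3000},
    ],
}
manifest = build_manifest("https://iiif.example.org", record)
```

A viewer only needs the manifest URL; everything else (canvases, image services) is discovered from the JSON.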

With these two Docker images running, accessing data from TDR repository pools and CouchDB, we are able to use any existing IIIF viewer to access Canadiana hosted content.

What is the largest stumbling block we've discovered?

We had already discovered the problem on our own, but the recent IIIF Adopters Survey made it clear.

Of the 70 institutions who completed the survey, 51 are currently using the IIIF Image API, 42 adopted IIIF Presentation, but The British Library and the Wellcome Trust are the only known institutions currently using the IIIF Authentication API.

Canadiana has both sponsored collections (where the depositor or other entity sponsored the collection which is then freely available to access) and subscription collections (where the funders have required we restrict access only to others who are financially contributing).  Making the sponsored collections available via IIIF will be much easier than the additional software we will have to author (including possibly having to help existing projects offering IIIF access tools) in order to support denying access to subscription collections.

Said another way: denying access will take far more of Canadiana's resources (staff and computing) than granting access.  Ideal would be if all our collections were sponsored, but that is not the environment we currently operate in.  At the moment a large portion of this charity's funding comes in the form of subscriptions, and this is already a topic of discussion within our board and membership.

This was not a total surprise.

We knew the move to a more modern distributed platform, which we were already planning before we decided to adopt IIIF, would involve a change in how we did authentication and authorization.  Implementing authorization rules is already a significant part of our technology platform.

Currently the CAP platform is based on a "deny unless permit" model, and there are only two public-facing software components: CAP which internally handles its own authorization, and COS which receives a signed access token from CAP for each content request (Specific source file, specific rotation, specific zoom, limited amount of time, etc).  Only a few specific zoom levels are allowed, and there is no built-in image cropping/etc.
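The per-request signing can be sketched roughly as follows. This is an illustration of the general "deny unless permit" idea only: the message layout, the shared secret, and the `sign_request` / `is_permitted` names are assumptions, not the actual CAP/COS token format.

```python
import hashlib
import hmac
import time

# Hypothetical secret shared between the portal (signer) and the content server.
SECRET = b"example-shared-secret"


def sign_request(path, rotation, zoom, expires, secret=SECRET):
    """Sign one specific request: this file, this rotation, this zoom, until `expires`."""
    message = f"{path}|{rotation}|{zoom}|{expires}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_permitted(path, rotation, zoom, expires, signature, now=None, secret=SECRET):
    """Deny unless the signature matches these exact parameters and has not expired."""
    now = time.time() if now is None else now
    if now > expires:
        return False
    expected = sign_request(path, rotation, zoom, expires, secret)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, signature)


# The portal signs a request that is valid until (hypothetical) time 1000.
token = sign_request("example/page1.jp2", 0, 2, 1000)
```

Because every parameter is part of the signed message, changing the rotation or zoom invalidates the token, which is exactly why this model fits poorly with a viewer that wants to issue many different requests on its own.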

Implementing the same model for IIIF would have been extremely inefficient, even if it were possible to go through the multi-request Authentication API for each individual content request.

IIIF image access isn't done as one request for a single completed image but as multiple requests for tiles representing parts of the image (and at a variety of zoom levels).  For efficiency we needed to move to a more liberal "grant unless denied" model where the access tokens are far more generic in what type of requests they would facilitate.
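To illustrate how quickly the request count grows, here is a small sketch that lists the tile requests a viewer might make for one image at one zoom level, using the IIIF Image API 2.x URL pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The base URL, identifier, image dimensions and the `tile_urls` helper are assumptions for illustration; real viewers take tile sizes and scale factors from the image's `info.json`.

```python
import math


def tile_urls(base, identifier, width, height, tile_size=512, scale=1):
    """List Image API tile URLs covering an image at one scale factor.

    Each tile covers tile_size * scale source pixels and is delivered
    scaled down by `scale`.
    """
    step = tile_size * scale
    urls = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            # Tiles on the right and bottom edges are clipped to the image.
            w = min(step, width - x)
            h = min(step, height - y)
            out_w = math.ceil(w / scale)
            urls.append(f"{base}/{identifier}/{x},{y},{w},{h}/{out_w},/0/default.jpg")
    return urls


# A hypothetical 1000x600 image: four requests at full resolution.
full_res = tile_urls("https://iiif.example.org/image", "example.1", 1000, 600)
# The same image at half resolution fits in a single tile.
half_res = tile_urls("https://iiif.example.org/image", "example.1", 1000, 600, scale=2)
```

A typical page scan is far larger than this example, so a single page view can mean dozens of requests; signing each one individually is not practical.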

There are also several APIs that can (and should) be offered as different distributed web services. A service offering Presentation API data is likely to be deployed into multiple server rooms across the country, just as the Image API will be offered from multiple server rooms.   We may have fewer servers offering authentication, but that won't create a bottleneck as once a user has authenticated they won't need to go back to that service often (only when access has expired, or they need to pass access tokens to a new service).

We will be separating authorization from authentication, only checking the authentication token if required.  A new CouchDB authorization database would be needed that has records for every AIP (to indicate if it is sponsored or what subscription is required, and what level of access is granted), every user (what subscriptions they have purchased, or other types of access -- such as administrators) and every institution (subscriptions, other access).   Each content server request would involve consulting that database and determining if we had to deny access, with this data being replicated so it is local to each application which needs to use this data.
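A rough sketch of the "grant unless denied" decision described above, using in-memory records in place of the replicated CouchDB authorization database. The record fields, the `Aip` / `Principal` classes and the `must_deny` function are assumptions for illustration; the actual database design has not been decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Aip:
    """Hypothetical authorization record for one AIP."""
    sponsored: bool = False
    required_subscription: Optional[str] = None


@dataclass
class Principal:
    """Hypothetical authorization record for a user or an institution."""
    subscriptions: frozenset = frozenset()
    is_admin: bool = False


def must_deny(aip, user=None, institution=None):
    """Return True only when access to the AIP has to be denied."""
    if aip.sponsored:
        # Sponsored content is open: no authentication token needs checking.
        return False
    for principal in (user, institution):
        if principal is None:
            continue
        if principal.is_admin or aip.required_subscription in principal.subscriptions:
            return False
    return True


sponsored_aip = Aip(sponsored=True)
subscription_aip = Aip(required_subscription="example-subscription")
subscriber = Principal(subscriptions=frozenset({"example-subscription"}))
other_institution = Principal(subscriptions=frozenset({"another-subscription"}))
administrator = Principal(is_admin=True)
```

Note that the sponsored case returns before any user or institution record is consulted, which is the point of only checking the authentication token when required.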

Where are we in our plan?

The plan was to migrate away from our existing Content Server first (See: The Canadiana preservation network and access platform for details on the current platform).  This would involve:

  • Adopting Cantaloupe for the IIIF Image API, including authorization.
  • Implementing the Authentication API, to set the Access cookie from the access token offered by the main platform website.
  • Implementing an IIIF Presentation API demonstration sufficient to test our implementation of the Authentication API with existing IIIF client applications.
  • Offering direct access to TDR files using the same Access cookie/token (Needed for PDF downloads as a minimum, also used by various internal microservices to access METS and other metadata records).
  • Retrofitting our existing CAP portal service to use the Authentication API, as well as to use Cantaloupe for all image serving.
  • Decommissioning the older ContentServer software on each repository node.

With the Authentication API not as established as we thought, we may go a different route.

One possibility might be for Cantaloupe to grant full access to sponsored collections, and use a separate token similar to our existing token for subscription collections.   This would effectively disable most of the utility of IIIF for subscription content, other than allowing us to use the same ContentServer software for both types of content.

We haven't made decisions, only come to the realization that there is much more work to be done.   My hope is that we can push forward with making sponsored collections accessible via IIIF, even if we simply deny IIIF access to subscription collections in the interim (i.e. CAP portal access only) while we figure out how to grant access to subscribers via IIIF.

IIIF isn't the only place we have this consideration

This consideration isn't unique to our IIIF implementation, and we come up against it regularly.

With the Heritage project the funding institutions sponsored public access to all the images from those LAC reels, but more advanced search capability was required to be a subscription service.   We implemented this in the shorter term by disabling (for non-subscribers) page-level search on the Heritage portal which hosts this content.

Some researchers and other external projects (some funded by Canadiana as part of the Heritage project, but that Canadiana technical staff were not involved in) have been collecting additional metadata for these LAC reels in the form of tags, categorization, and in some cases transcriptions of specific pages.  This data is being offered to us using project-specific data designs that don't conform to any of the standards we plan on adopting in the future within the primary platform (See: IIIF annotations, with us likely extending our TDR preservation documentation to support encoding of specific open annotations).

Our platform doesn't yet have the capability to accept, preserve and provide search on this data. When we start a project to accept some of this data we will also have to figure out how to implement a mixture of funding models.  It is expected that most researchers will want the data they have funded to be open access, and would be unhappy if we restricted search on their data to subscribers.  This means we'll need to separate the subscription-required data funded by some groups from the open access search data provided by other groups.

It is likely we will end up with multiple search engines housing different types of data (with search fields different from the common ones used within our primary platform), searchable by different groups of people, with a search front-end needing to collate the results and display them in a useful way.

Moving more of Canadiana's software projects to GitHub

As some of the links in recent articles suggest, we have started moving more of our software from an internal source control and issue tracker towards public GitHub projects.  While this has value as additional transparency to our membership, I also hope it will enable better collaboration with members, researchers, and others who have an interest in Canadiana's work.

For the longest time the Archive::BagIt Perl module was the only GitHub project associated with Canadiana.  Robert Schmidt became the primary maintainer of this module when he was still at Canadiana, and this module is still critical to our infrastructure.

In addition to the two IIIF-related Docker images that I'll discuss more later, there are two Perl modules:

  • CIHM::METS::App is a tool to convert metadata from a variety of formats (CSV, DB/Text, MARC) to the 3 XML formats we use as descriptive metadata within our METS records (MARCXML, Dublin Core, Issueinfo).  This is used in the production process we use to generate or update AIPs within our TDR.
  • CIHM::METS::parse is the library used to read the METS records within the AIPs in the TDR and present normalized data to other parts of our access platform.  For more technical people this provides an example of how to read our METS records, as well as documenting exactly which fields we use within our access platform (for search and presentation).

My hope is that by the end of the summer all the software we use for a TDR Repository node will have moved to GitHub.  This provides additional transparency to the partner institutions who are hosting repository servers, clarifying exactly what software is running on that hardware.

We are a small team (currently 3 people) working within a Canadian charity, and would be very interested in having more collaborations.  We know we can't do all of this alone, which is a big part of why we are joining others in the GLAM community with IIIF. Even for the parts which are not IIIF, collaboration will be possible.

If you work at or attend one of our member institutions, or otherwise want to know more about what our technical team is doing, consider going to our GitHub organization page and clicking watch for sub-projects that interest you. Feel free to submit issue requests whether it be noticing a bug, suggesting a new feature (maybe someone with funding will agree and launch a collaboration), suggesting we take a closer look at some existing technology, or just asking questions (of the technical team -- we have other people who answer questions for subscribers/etc).

If not on GitHub, please feel free to open a conversation in the comments section of this blog.

Thursday, May 11, 2017

Canadiana JHOVE report

This article is based on a document written to be used at Code4Lib North on May 11th, and discusses what we’ve learned so far with our use of JHOVE.

What is JHOVE?

The original project was a collaboration between JSTOR and Harvard University Library, with JHOVE being an acronym for JSTOR/Harvard Object Validation Environment.  It provides functions to perform format-specific identification, validation, and characterization of digital objects.

JHOVE is currently maintained by the non-profit Open Preservation Foundation, operating out of the UK (Associated with the British Library in West Yorkshire).


What is Canadiana doing with JHOVE?

As of the last week of April we generate XML reports from JHOVE and include them within AIP revisions in our TDR.  At this stage we are not rejecting or flagging files based on the reports, only providing reports as additional data.  We will be further integrating JHOVE as part of our production process in the future.

Some terminology

What did Canadiana do before using JHOVE?

Prior to the TDR Certification process we made assumptions about files based on their file extensions: a .pdf was presumed to be a PDF file, a .tif a TIFF file, .jpg a JPEG file, and .jp2 a JPEG 2000 file.  We only allowed those 4 types of files into our repository.

As a first step we used ImageMagick’s ‘identify’ feature to identify and confirm that files matched the file types.  This meant that any files added since 2015 had data that matched the file type.
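
To illustrate the difference between trusting an extension and looking at the data, here is a small sketch that compares a file's extension with the signature bytes at the start of the file.  This is a much simpler technique than what ImageMagick’s ‘identify’ does, and is not the code we used; the function names are made up for the example.

```python
# Leading "magic" bytes for the 4 file types we allow into the repository.
SIGNATURES = {
    "pdf": (b"%PDF-",),
    "tif": (b"II*\x00", b"MM\x00*"),           # little-endian and big-endian TIFF
    "jpg": (b"\xff\xd8\xff",),
    "jp2": (b"\x00\x00\x00\x0cjP  \r\n\x87\n",),  # JPEG 2000 signature box
}


def sniff(header):
    """Return the extension suggested by the first bytes of a file, or None."""
    for ext, signatures in SIGNATURES.items():
        if header.startswith(signatures):
            return ext
    return None


def extension_matches(filename, header):
    """True when the file extension agrees with the signature bytes."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return sniff(header) == ext


# A TIFF that was saved with a .jpg extension is caught by the comparison.
tiff_header = b"II*\x00\x08\x00\x00\x00"
print(sniff(tiff_header), extension_matches("0001.jpg", tiff_header))
```

A signature check like this only confirms the first few bytes, so it cannot detect a truncated file, which is part of why a validator such as JHOVE is still needed.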

At that time we did not go back and check previously ingested files, as we knew we would eventually be adopting something like JHOVE.

Generating a report for all existing files

As of May 9, 2017 we have 61,829,569 files in the most recent revisions of the AIPs in our repository.  This does not include METS records, past revisions, or files related to the BagIt archive structure we use within the TDR.

I quickly wrote some scripts that would loop through all of our AIPs and generate reports for all the files in the files/ directory of the most recent AIP revision within each AIP.  We dedicated one of our TDR Repository nodes to generating reports for a full month to get the bulk of the reports, with some PDF files still being processed.

Top level report from scan

  • Total files: 61,829,569
  • Not well-formed: 941,875 (1.5%)
  • Not yet scanned
  • Well-Formed and valid: 60,828,836 (98.4%)
  • Well-Formed, but not valid: 58,605 (0.09%)

JHOVE offers a STATUS for files which is one of:

  • “Not well-formed” - the file fails the purely syntactic requirements of the format
  • “Well-Formed, but not valid” - the file meets the syntactic requirements but fails the higher-level semantic requirements for format validity
  • “Well-Formed and valid” - the file passed both the well-formedness and validity tests
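
A tally of these statuses can be produced from the XML reports with a few lines of code.  The sketch below is not the script we ran.  The sample report is hand-written and heavily trimmed, keeping only the repInfo and status elements, and the file names in it are invented.  Element names are matched without their namespace, since the namespace URI may differ between JHOVE releases.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written, trimmed stand-in for a JHOVE XML report (illustration only).
SAMPLE_REPORT = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="0001.tif">
    <status>Well-Formed and valid</status>
  </repInfo>
  <repInfo uri="0002.tif">
    <status>Not well-formed</status>
    <messages><message>Value offset not word-aligned</message></messages>
  </repInfo>
  <repInfo uri="0003.tif">
    <status>Well-Formed and valid</status>
  </repInfo>
</jhove>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def tally_statuses(xml_text):
    """Count how many repInfo entries carry each JHOVE status."""
    counts = Counter()
    root = ET.fromstring(xml_text)
    for element in root.iter():
        if local_name(element.tag) != "repInfo":
            continue
        status = "unknown"
        for child in element:
            if local_name(child.tag) == "status" and child.text:
                status = child.text.strip()
        counts[status] += 1
    return counts


for status, count in tally_statuses(SAMPLE_REPORT).most_common():
    print(f"{status}: {count}")
```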

Issues with .jpg files

Not well-formed
Well-Formed and valid
Well-Formed and valid TIFF

We had 10+14=24 .jpg files, ingested prior to adopting the ‘identify’ functionality, that turned out to be broken (truncated files, 0 length files) or to have the wrong file extension.  9 of the “Not well-formed” files were from LAC reels, where we were ingesting 1000 to 2000 images per reel.

Issues with .jp2 files

Well-Formed and valid

JHOVE didn’t report any issues with our JPEG 2000 files.

Issues with .tif files

Not well-formed, Tag 296 out of sequence
Not well-formed, Value offset not word-aligned
Not well-formed, IFD offset not word-aligned
Well-Formed and valid
Well-Formed, but not valid, Invalid DateTime separator: 28/09/2016 16:53:17
Well-Formed, but not valid, Invalid DateTime digit
Well-Formed, but not valid, Invalid DateTime length
Well-Formed, but not valid, PhotometricInterpretation not defined

  • Word alignment (offsets being even, i.e. divisible by 2 bytes, since a TIFF word is 16 bits) is the largest issue for structure, but it is something that will be easy to fix.  We are able to view these images so the data inside isn’t corrupted.
  • Validity of DateTime values is the next largest issue.  The format should be "YYYY:MM:DD HH:MM:SS", so something that says “2004: 6:24 08:10:11” will be invalid (the blank is an Invalid DateTime digit) and “Mon Nov 06 22:00:08 2000” or “2000:10:31 07:37:08%09” will be invalid (Invalid DateTime length).
  • PhotometricInterpretation indicates the colour space of the image data (WhiteIsZero/BlackIsZero for grayscale, RGB, CMYK, YCbCr, etc).  The specification has no default, but we’ll be able to fix the files by making and checking some assumptions.
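
The DateTime rule is simple enough to express as a pattern.  This is a rough sketch of the format check only, not JHOVE's implementation: it does not distinguish between the separator, digit and length messages, and it does not check that the value is a real calendar date.  The function name is made up for the example.

```python
import re

# "YYYY:MM:DD HH:MM:SS": exactly 19 characters, digits and fixed separators.
TIFF_DATETIME = re.compile(r"[0-9]{4}:[0-9]{2}:[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}")


def is_valid_tiff_datetime(value):
    """True when the value has the shape required for the TIFF DateTime tag."""
    return TIFF_DATETIME.fullmatch(value) is not None


samples = [
    "2016:09:28 16:53:17",       # expected shape
    "28/09/2016 16:53:17",       # wrong separators and field order
    "2004: 6:24 08:10:11",       # blank where a digit should be
    "Mon Nov 06 22:00:08 2000",  # wrong length
    "2000:10:31 07:37:08%09",    # trailing characters, wrong length
]
for value in samples:
    print(repr(value), is_valid_tiff_datetime(value))
```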

Issues with .pdf files

Not well-formed, No document catalog dictionary
Not well-formed, Invalid cross-reference table, No document catalog dictionary
Not well-formed, Missing startxref keyword or value
Not well-formed, Invalid ID in trailer, No document catalog dictionary
Not yet scanned
Well-Formed and valid
Well-Formed, but not valid, Missing expected element in page number dictionary
Well-Formed, but not valid, Improperly formed date
Well-Formed, but not valid, Invalid destination object

One of the board members of the Open Preservation Foundation, the organization currently maintaining JHOVE, wrote a longer article on the JHOVE PDF module titled “Testing JHOVE PDF Module: the good, the bad, and the not well-formed” which might be of interest.  Generally, PDF is a hard format to deal with and there is more work that can be done with the module to ensure that the errors it is reporting are problems in the PDF file and not the module.

  • “No document catalog dictionary” -- The root tree node of a PDF is the ‘Document Catalog’, and it has a dictionary object.  This exposed a problem with an update to our production processes where we switched from using ‘pdftk’ to using ‘poppler’ from the FreeDesktop project for joining multiple single-page PDF files into a single multi-page PDF file.  While ‘pdftk’ generated Well-Formed and valid PDFs, poppler did not.

    When I asked on the Poppler forum they pointed to JHOVE as the problem, so at this point I don’t know where the problem is.

    I documented this issue at:
  • “Missing startxref keyword or value” - PDF files should have a header, document body, xref cross-reference table, and a trailer which includes a startxref.  I haven’t dissected the files yet, but these may be truncated.
  • “Missing expected element in page number dictionary”.  I’ll need to do more investigation.
  • “Not yet scanned”.  We have a series of multi-page PDF files generated by ABBYY Recognition Server which take a long time to validate.  Eventually it indicates the files are recognized with a PDF/A-1 profile.  I documented this issue at:
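
As a quick way to test the truncation theory before dissecting files by hand, something like the following heuristic could be used.  It only looks for the header at the start and the startxref keyword and %%EOF marker near the end, which is far less than what the JHOVE PDF module checks; the function name and the 1024 byte window are arbitrary choices for this sketch.

```python
def pdf_structure_problems(data, tail_size=1024):
    """Return a list of obvious structural problems in the bytes of a PDF file."""
    problems = []
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")
    # The trailer, startxref and %%EOF are expected at the very end of the file.
    tail = data[-tail_size:]
    if b"startxref" not in tail:
        problems.append("missing startxref keyword")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker")
    return problems


# Tiny stand-in for a PDF, only detailed enough to exercise the checks above.
complete = (
    b"%PDF-1.4\n1 0 obj\n<<>>\nendobj\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<<>>\nstartxref\n30\n%%EOF\n"
)
truncated = complete[:40]

print(pdf_structure_problems(complete))
print(pdf_structure_problems(truncated))
```

A file that keeps its header but has lost both end markers is a strong hint that it was cut short somewhere in our processing.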

Our longer term strategy is to no longer modify files as part of the ingest process.  If single-page PDF files are generated from OCR (as is normally the case) we will ingest those single-page PDF files.  If we wish to provide a multi-page PDF to download, this will be done as part of our access platform where long-term preservation requirements aren’t an issue. In the experiments we have done so far we have found that the single-page PDF output of ABBYY Recognition Server and PrimeOCR validates without errors, and it is the transformations we have done over the years that were the source of the errors.