Thursday, October 5, 2017

Yes, CBC, I'm waiting for Alias Grace to be on Netflix.

CBC runs InCanada, an "online" Canadian Media Panel. I put "online" in quotation marks because, while the panel is conducted online, the CBC's broadcaster bias is always visible in how the questions are asked. The latest survey is no exception.

The survey was essentially about Alias Grace, a Canadian-American miniseries that will air on CBC on September 25, 2017, and on Netflix on November 3, 2017.

The survey, as usual, conflates Netflix with broadcasters, when Netflix is not a broadcaster. This is about as nonsensical as confusing a radio station with a record store when discussing music, and yet the legacy broadcasters continue to push this nonsense.

I sometimes make the comparison to the difference between an outhouse and indoor plumbing: Like broadcasting, people made use of outhouses before modern conveniences like indoor plumbing came along. And like indoor plumbing, people aren't likely to want to go backwards once they get used to online streaming.

While outhouses still exist in places where indoor plumbing is not available, they are not the predominant way that people "do their business". Unlike with an outhouse, there is no sense of urgency pushing me to use the outmoded platform to watch Alias Grace.


The survey asked if I had seen the American TV series The Handmaid's Tale. While Hulu began distributing it in April 2017, Bell blocked Canadian access to the series until they made it available on CraveTV in late July. Bell blocking, hiding and/or delaying lawful access to content is typical, and I consider them to be the largest Canadian contributory copyright infringer for their ongoing inducement of infringement.


If the NAFTA negotiations were intended to modernize trade relations within North America, the trade barriers disallowing cross-border shopping for telecommunications services and creative content would be a top priority. I believe we could massively reduce copyright infringement in North America if we moved to a single content market, where creators from all of North America had unrestricted access to the audiences of North America. That includes the content distribution services: North American audiences should also have the right to subscribe to any North American streaming service, and regional content restrictions within North America would be prohibited.

The concept of Canadians not being able to view content at the same time as US audiences, including having the option to subscribe to the same online distribution services, must quickly become a distant memory.


Canadian Content policy should be focused on content, not on outdated distribution mechanisms. Hopefully a pro-free trade agenda will be part of the current Heritage Minister's thinking: you can't promote Canadian production capabilities and wide global distribution of Canadian content while still allowing regional content blocking.


Bell's anti-free trade agenda is trying to push policy in the opposite direction, including asking for mandated blocking when Canadians wish to access content that is not lawfully streamed in Canada. Bell is asking for mandated blocking because they want competitors to have to block the same competing distribution sites Bell already wants to block, which is also why they oppose VPNs (apparently the technology itself, not only the perfectly legitimate cross-border-shopping use).

If I had wanted to watch The Handmaid's Tale at the same time as US viewers (or as those who can tolerate the smell of an outhouse/broadcaster), or on the devices of my choosing, I would have been forced to infringe copyright (easiest) or use a VPN (less convenient, but currently more lawful).

There was no sense of urgency to watch The Handmaid's Tale. While there are shows important enough to me to warrant finding alternative streaming options, neither of these TV series based on Margaret Atwood novels is of sufficient interest.


My wife and I watched The Handmaid's Tale on CraveTV. CraveTV is a horrible streaming service: there is a difference between the indoor plumbing at a 5-star hotel and at an out-of-the-way truck stop. We only watch programming on CraveTV when it is not available anywhere else. The CraveTV Android app crashes fairly regularly, and CraveTV works on few of my devices, compared to Netflix, which pretty much always works -- Netflix even has a simple app built into the smart TV so that my wife and in-laws can also use it (CraveTV is too messy for less technical people to put up with).


While CBC isn't as bad as Bell when it comes to policies, I believe their outdated broadcaster-era thinking is harmful to Canadian creators and taxpayers.

Thursday, September 28, 2017

Copyright Board, Copyright Collectives, and the myth that "Fair use decimated educational publishing in Canada"

(This is a letter in an ongoing dialog with a few members of federal parliament)
 
David McGuinty, my MP in Ottawa South,
David Graham, MP (Laurentides — Labelle),
The Honourable Mélanie Joly, Minister of Canadian Heritage,
The Honourable Navdeep Bains, Minister of Innovation, Science and Economic Development,
Copyright Board Consultations
I would like to thank David McGuinty for forwarding the September 8, 2017 letter from Minister Joly. This was a response to my May 1, 2017 letter titled “Myth: Fair use decimated educational publishing in Canada”.  My letter highlighted some of what might colloquially be referred to as “fake news” being spread globally, primarily sourced from Access Copyright, a Canadian collective society. The National Copyright Unit of Australia felt the spread of this myth required a response[1].
As this myth primarily relates to an ongoing dispute between a collective society and provincially funded educational institutions, it ties in directly with the current consultation on the Copyright Board of Canada[2].
The consultation paper recognises that there has been an “explosive growth of media and related technologies worldwide”.  This specific incarnation of the Copyright Board was created in 1989, the same year that development of HTTP, one of the key technologies underlying the World Wide Web, was initiated by Tim Berners-Lee at CERN.
We live in a world where advanced content recognition, search and online media distribution enables audiences to find and access any content that they want. Sometimes, when copyright owners allow, we are offered a variety of competing access and licensing services to choose from.  Modern information and communications technologies have made redundant a sizeable portion of what the Copyright Board was historically envisioned to accomplish.
While the discussion paper suggests we can speed up processes at the board by “Reducing the Number of Matters Coming Before the Board Annually”, the paper does not discuss the need to reverse the historical proliferation of collective societies.  At a time when many collectives should be recognised as decreasing in relevance, they continue to increase in political and economic influence.
I will use a few specific problematic areas to illustrate.

Orphaned Works

The incentives behind the current “Unlocatable Copyright Owners” regime administered by the copyright board are counterproductive.  The purpose of the regime should be both to encourage copyright holders to be discoverable and negotiate licenses, as well as to provide copyright users protection from a previously hidden copyright holder who later surfaces.  Creators, copyright holders, copyright intermediaries and commercial copyright users should all have economic incentives to make copyright holders discoverable.  
Modern ICT has caused some technology vendors and governments to declare “privacy is dead”, so it is inconceivable that a copyright holder who wants to be found is unable to be found.  Some responsibility should be presumed on anyone who wishes to harness the privileges which copyright offers.
  • Creators, copyright owners, collective societies, or other intermediaries should never receive proceeds from the unlocatable copyright owners regime.  Fees should be kept with the board to fund its own operations and support services to increase discoverability, with any surplus returned to general revenue.  There should be a clear economic incentive for these groups to make all copyright holders more easily discoverable.
  • Fees levied against commercial copyright users should be sufficiently higher than what would normally be offered by a copyright holder, to further encourage commercial users to help make copyright holders more easily discoverable.
  • Fair Dealings should be clearly expanded to cover non-commercial uses of works for which licenses cannot be easily obtained, including for reasons of unlocatable copyright holders.  There can’t be a negative impact on the market for a work when no such market exists.
  • If a copyright owner is unlocatable, but the creator is locatable, then copyright should revert to the creator.
  • Fees previously distributed to collective societies, but never disbursed to later-located creators or copyright owners, should be returned to the copyright board.
It has been claimed that the “no formalities” requirement of the Berne Convention prohibits mandating registration for the exercise of any copyright related rights. The reality is that if a copyright owner wishes to get paid they must make themselves known to someone, so it is illogical to suggest that requiring copyright owners to do something to make themselves discoverable is a “formality”.
What this failed regime has allowed is for entities like the Access Copyright Foundation to take money from the orphan works regime, along with other fees extracted from authors through Access Copyright's excessive transaction fees, and create their own unaccountable arts funding program[3].  With this entity perceived as doing “good works”, the incentive to make copyright holders easily discoverable and able to receive greater direct payments for their works is diminished.  This is a net reduction in funding for authors, marketed as if it were a benefit to authors.

Educational use of copyrighted works

Nearly all uses of copyrighted works by provincially funded educational institutions are licensed directly with copyright owners, not through collective societies.  This includes the global growth of Open Access, as well as online databases charging subscription and/or transactional fees.
There is a thin layer, between uses of a work that are already licensed and uses that do not require a license, that is under dispute between collective societies and educational institutions. This is the dispute underlying the myth that fair dealings decimated educational publishing in Canada.
In this case the relevant parties are not educational institutions or collective societies, but provincial taxpayers and authors.   I believe if provincial taxpayers were asked if they were willing to help fund creativity used in the classroom in this thin disputed area they would agree, as long as the funding was accountable and efficiently distributed.  Unfortunately, with all the middle-men taking their cut (Access Copyright is said to take 30% for itself), the current regime is inappropriate.
We already have a model for a far more efficient regime active in Canada. The Public Lending Right (PLR)[4] program funds authors directly for the lending of their works in libraries.  This funding program is far superior to having this activity covered by the Copyright Act. It is better for taxpayers as the money more efficiently funds authors, rather than all the unnecessary intermediaries and all their lawyers.   If applied to educational uses this would not only provide considerably more funds to authors, it would end the expensive decades-long disputes launched by unnecessary intermediaries in front of the copyright board.
The PLR is an example of using the right tool for the right job. There is a harmful misconception held by some policy makers that copyright is a valid substitute for stable arts funding.  Arts funding can be accountably targeted at creators, where the benefit of copyright tends to go to unnecessary intermediaries -- or leaves the country entirely.
As well as initiating a Public Education Right (PER) funding program, copyright law should be amended to clarify as fair dealings the current thin disputed layer of uses.
This clarity should, however, have responsibilities attached to it.  Some education institutions want to have their cake and eat ours too by having exceptions to copyright on their inputs, but royalty bearing on their outputs.  The ability of institutions to use any institutional exceptions to copyright, as well as what has been clarified under the PER regime, should be conditioned on the institution adopting an Open Access publishing regime at least on par with the Tri-Agency Open Access Policy on Publications[5].

Lobbying by Collective Societies

Collective societies provide a specific financial service to copyright holders and copyright users. As noted by Copyright Board expert Howard Knopf, “Collectives are an exception from the basic antitrust and competition law abhorrence of price fixing and conspiracies”[6].  As such, they are not optional to copyright holders who want to get paid for some specific uses of their works.  Given this, collectives should not ever be able to claim to politically “represent” repertoire members any more than a bank should be able to claim to politically “represent” me simply because I have a bank account.
Collectives have been allowed to present themselves as proxies for the interests of creators -- even when they are lobbying government for policies which benefit collectives at the expense of creators.
The operation of collectives should be scrutinised far more closely by government.  This should include disallowing collectives from disbursing funds for purposes other than payment to creators for uses of their works.  They should not be allowed to directly lobby government or fund foundations.  It should never be seen as their money to spend: if authors wish to fund such activities they can voluntarily do so with their own money, including through optional member-funded associations. Authors should never, in effect, have their money “taxed” by a collective society intermediary.

More money to authors, more efficient copyright board

With Access Copyright no longer initiating disputes, resource constraints on the Copyright Board will decrease considerably at the same time as we will see increased funding for authors.

While I used Access Copyright as an example, the same will be true of several other collective societies.  Better harnessing of modern ICT and modernising the outdated thinking in our Copyright Act will greatly reduce the number of collective societies still in operation.
There will always be a need for some small number of collective societies, and a need for the copyright board to impose rates when normal commercial negotiations fail, but we should be providing legal and economic incentives to ensure these exceptions become rare.


[4] Public Lending Right program http://www.plr-dpp.ca/PLR/ 

[5] Tri-Agency Open Access Policy on Publications http://www.science.gc.ca/eic/site/063.nsf/eng/h_F6765465.html?OpenDocument 

[6] Canadian Copyright Collectives and the Copyright Board: a snap shot in 2008 http://www.macerajarzyna.com/pages/publications/Knopf_Canadian_Copyright_Collectives_Copyright_Board_Feb2008.pdf

Saturday, September 16, 2017

Taxpayers should pay authors for educational uses of works, not intermediaries

Replying to a Letter to the Editor in The Varsity.


It is taxpayers and authors that are paying the costs of this ongoing dispute, one way or the other.

What we are effectively discussing is a government funding program masquerading as copyright, and because of the misdirection that this is a copyright issue we are allowing intermediaries like educational institutions, collective societies, foreign publishers, and all their lawyers, to extract the bulk of the money.

If Mr. Degen was focused on Canadian authors getting paid he would be agreeing with me that we need to redirect taxpayer money misspent with the current regime towards a program similar to the Public Lending Right. The existing Public Lending Right funds authors based on their works being loaned by libraries, and a "Public Education Right" could directly fund authors based on specific uses of their works in publicly funded educational institutions. This would be applied only to that very narrow area of dispute between what educational institutions (IE: taxpayers) are already paying, and the clear and indisputable limitations of copyright.

Nearly all of what educational institutions use is already paid for, through payments via modern databases and other established systems. This includes the ongoing growth of Open Access. It is Access Copyright that has refused to allow the payment of transactional fees for the narrow area under dispute.

While Access Copyright had a victory with this specific lower court case, they will lose on appeal as they have lost other related cases. This area of law is quite clear, and contrary to Mr. Degen's misdirection, past decisions have not been on side with Access Copyright's interpretation of the law. This specific case is the outlier.

While the majority of the blame for this costly dispute lies with Access Copyright, that doesn't mean taxpayers or governments should be siding with educational institutions. We should be removing all of these unnecessary intermediaries from the debate entirely.

By fighting for Access Copyright's conflicting interests rather than authors, Mr Degen is pushing for policies which continue to reduce the revenues of authors. My hope is that he will eventually side with authors.

Friday, June 9, 2017

IIIF.io : the hardest part will be saying "no".

Back in April I noted Canadiana is working on adopting APIs from IIIF, the International Image Interoperability Framework. We did a small demo in May as part of our participation at Code4Lib North.  Today is the final day of the 2017 IIIF Conference hosted at The Vatican, and this is an update on our progress.

What have we done so far?

We have a Cantaloupe Docker configuration on GitHub that we used for the demo.  This includes the delegates Ruby script which finds the requested image within the AIP stored on the TDR Repository node that Cantaloupe is running on.

We have created a pull request for OpenJPEG to resolve an incompatibility between OpenJPEG and Cantaloupe. The fix allows Cantaloupe to offer access to our JPEG2000 images.

We will be integrating the OpenJPEG fix and some cleaner configuration into our Cantaloupe Docker configuration soon, bringing this Docker image closer to being ready to install on production servers.

Our lead Application Developer, Sascha, created an application (with associated Docker configuration) that offers the IIIF Presentation API.  This reads data from the CouchDB presentation database used by our existing platform.  We expect to adopt the IIIF structures for data within CouchDB at a later date, but this is a good intermediate step.
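
For readers less familiar with the Presentation API, the service essentially translates our existing CouchDB documents into IIIF manifest JSON. The sketch below shows the general shape of that translation in Python; the field names on the incoming record and the hostnames are invented for illustration and are not our actual CouchDB schema or servers.

    # Minimal sketch: turn a (hypothetical) CouchDB presentation record into a
    # IIIF Presentation API 2.x manifest. Field names on "record" are invented
    # for illustration; they are not Canadiana's actual schema.
    import json

    def make_manifest(record, image_api_base, presentation_base):
        manifest_id = "%s/%s/manifest" % (presentation_base, record["id"])
        canvases = []
        for seq, page in enumerate(record["pages"], start=1):
            image_id = "%s/%s" % (image_api_base, page["image"])
            canvas_id = "%s/%s/canvas/p%d" % (presentation_base, record["id"], seq)
            canvases.append({
                "@id": canvas_id,
                "@type": "sc:Canvas",
                "label": page.get("label", "Page %d" % seq),
                "width": page["width"],
                "height": page["height"],
                "images": [{
                    "@type": "oa:Annotation",
                    "motivation": "sc:painting",
                    "on": canvas_id,
                    "resource": {
                        "@id": image_id + "/full/full/0/default.jpg",
                        "@type": "dctypes:Image",
                        "service": {
                            "@context": "http://iiif.io/api/image/2/context.json",
                            "@id": image_id,
                            "profile": "http://iiif.io/api/image/2/level2.json",
                        },
                    },
                }],
            })
        return {
            "@context": "http://iiif.io/api/presentation/2/context.json",
            "@id": manifest_id,
            "@type": "sc:Manifest",
            "label": record["label"],
            "sequences": [{"@type": "sc:Sequence", "canvases": canvases}],
        }

    # Example usage with a made-up record:
    record = {
        "id": "example.12345",
        "label": "Example item",
        "pages": [{"image": "example.12345%2Fpage0001.jp2", "width": 2000, "height": 3000}],
    }
    print(json.dumps(make_manifest(record,
                                   "https://image.example.org/iiif",
                                   "https://presentation.example.org/iiif"), indent=2))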

With these two Docker images running, accessing data from TDR repository pools and CouchDB, we are able to use any existing IIIF viewer to access Canadiana hosted content.

What is the largest stumbling block we've discovered?

We had already discovered the problem on our own, but the recent IIIF Adopters Survey made it clear.

Of the 70 institutions that completed the survey, 51 are currently using the IIIF Image API and 42 have adopted the IIIF Presentation API, but The British Library and the Wellcome Trust are the only known institutions currently using the IIIF Authentication API.

Canadiana has both sponsored collections (where the depositor or other entity sponsored the collection which is then freely available to access) and subscription collections (where the funders have required we restrict access only to others who are financially contributing).  Making the sponsored collections available via IIIF will be much easier than the additional software we will have to author (including possibly having to help existing projects offering IIIF access tools) in order to support denying access to subscription collections.

Said another way: denying access will take far more of Canadiana's resources (staff and computing) than granting access.  Ideal would be if all our collections were sponsored, but that is not the environment we currently operate in.  At the moment a large portion of this charity's funding comes in the form of subscriptions, and this is already a topic of discussion within our board and membership.

This was not a total surprise.

We knew the move to a more modern distributed platform, which we were already planning before we decided to adopt IIIF, would involve a change in how we did authentication and authorization.  Implementing authorization rules is already a significant part of our technology platform.

Currently the CAP platform is based on a "deny unless permit" model, and there are only two public-facing software components: CAP, which internally handles its own authorization, and COS, which receives a signed access token from CAP for each content request (specific source file, specific rotation, specific zoom, limited amount of time, etc.).  Only a few specific zoom levels are allowed, and there is no built-in image cropping.
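
The exact token format doesn't matter here, but the general idea of a signed, time-limited access token can be sketched as follows. This is only an illustration of the concept (an HMAC over a few request parameters plus an expiry), with invented field names and a made-up shared secret; it is not the actual CAP/COS implementation.

    # Sketch of a signed, time-limited access token of the kind CAP hands to COS.
    # The fields and signing scheme are illustrative assumptions, not the real code.
    import hashlib, hmac, time

    SECRET = b"shared-secret-between-cap-and-cos"   # hypothetical shared key

    def make_token(source_file, rotation, zoom, ttl_seconds=300):
        expires = int(time.time()) + ttl_seconds
        payload = "%s|%s|%s|%d" % (source_file, rotation, zoom, expires)
        sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        return payload + "|" + sig

    def check_token(token):
        payload, _, sig = token.rpartition("|")
        expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected):
            return False                      # tampered with, or wrongly signed
        expires = int(payload.rsplit("|", 1)[1])
        return time.time() < expires          # deny once the token has expired

    token = make_token("aip-id/page0001.jp2", rotation="0", zoom="2")
    print(check_token(token))                 # True until the token expires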


Implementing the same model for IIIF would have been extremely inefficient, even if it were possible to go through the multi-request Authentication API for each individual content request.

IIIF image access isn't done as one request for a single completed image but as multiple requests for tiles representing parts of the image (and at a variety of zoom levels).  For efficiency we needed to move to a more liberal "grant unless denied" model where the access tokens are far more generic in the types of requests they facilitate.
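
To make the difference concrete: under the IIIF Image API a viewer requests many small regions per image, each with a URL of the form {identifier}/{region}/{size}/{rotation}/{quality}.{format}. The rough sketch below just counts how many requests one full-resolution zoom level can generate; the server name, identifier and tile size are made-up values.

    # Rough illustration of how a IIIF viewer turns one image into many requests.
    # Values (server, identifier, image size, tile size) are made up for the example.
    BASE = "https://image.example.org/iiif/some-identifier"

    def tile_urls(width, height, tile=512):
        urls = []
        for y in range(0, height, tile):
            for x in range(0, width, tile):
                w = min(tile, width - x)
                h = min(tile, height - y)
                region = "%d,%d,%d,%d" % (x, y, w, h)
                # Image API 2.x URL: {base}/{region}/{size}/{rotation}/{quality}.{format}
                urls.append("%s/%s/full/0/default.jpg" % (BASE, region))
        return urls

    # A single 4000x3000 scan at full resolution alone yields dozens of requests:
    print(len(tile_urls(4000, 3000)))   # 48 tile requests, before other zoom levels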

There are also several APIs that can (and should) be offered as different distributed web services. A service offering Presentation API data is likely to be deployed into multiple server rooms across the country, just as the Image API will be offered from multiple server rooms.   We may have fewer servers offering authentication, but that won't create a bottleneck as once a user has authenticated they won't need to go back to that service often (only when access has expired, or they need to pass access tokens to a new service).


We will be separating authorization from authentication, only checking the authentication token when required.  A new CouchDB authorization database would be needed that has records for every AIP (to indicate whether it is sponsored or what subscription is required, and what level of access is granted), every user (what subscriptions they have purchased, or other types of access -- such as administrators) and every institution (subscriptions, other access).   Each content server request would involve consulting that database to determine whether we have to deny access, with this data being replicated so it is local to each application that needs it.
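
A minimal sketch of what such a "grant unless denied" check might look like once the authorization records are replicated next to each service. The document shapes here are hypothetical placeholders, not a final schema.

    # Sketch of a "grant unless denied" check against a local replica of the
    # proposed authorization database. Document shapes are hypothetical.

    def allowed(aip_doc, user_doc=None, institution_doc=None):
        # Sponsored AIPs are open: grant unless something explicitly restricts them.
        if aip_doc.get("sponsored", False):
            return True
        needed = aip_doc.get("subscription")       # e.g. "heritage"
        if needed is None:
            return True                            # nothing marks this AIP as restricted
        for doc in (user_doc, institution_doc):
            if doc and needed in doc.get("subscriptions", []):
                return True                        # user or their institution subscribes
        return False                               # fall through: deny

    # Example: a subscription AIP, an anonymous user, an institution that subscribes.
    aip = {"subscription": "heritage"}
    institution = {"subscriptions": ["heritage"]}
    print(allowed(aip, None, institution))         # True
    print(allowed(aip))                            # False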

Where are we in our plan?

The plan was to migrate away from our existing Content Server first (See: The Canadiana preservation network and access platform for details on the current platform).  This would involve:

  • Adopting Cantaloupe for the IIIF Image API, including authorization.
  • Implementing the Authentication API, to set the Access cookie from the access token offered by the main platform website.
  • Implementing an IIIF Presentation API demonstration sufficient to test our implementation of the Authentication API with existing IIIF client applications.
  • Offer direct access to TDR files using the same Access cookie/token (Needed for PDF downloads as a minimum, also used by various internal microservices to access METS and other metadata records).
  • Retrofit our existing CAP portal service to use the Authentication API, as well as use Cantaloupe for all image serving.
  • Decommission the older ContentServer software on each repository node.
 With the Authentication API not as established as we thought, we may go a different route.


One possibility might be for Cantaloupe to grant full access to sponsored collections, and use a separate token similar to our existing token for subscription collections.   This would effectively disable most of the utility of IIIF for subscription content, other than allowing us to use the same ContentServer software for both types of content.

We haven't made decisions, only come to the realization that there is much more work to be done.   My hope is that we can push forward with making sponsored collections accessible via IIIF, even if we simply deny IIIF access to subscription collections in the interim (IE: CAP portal access only) while we figure out how to grant access to subscribers via IIIF.

IIIF isn't the only place we have this consideration

This consideration isn't unique to our IIIF implementation, and we come up against it regularly.

With the Heritage project the funding institutions sponsored public access to all the images from those LAC reels, but more advanced search capability was required to be a subscription service.   We implemented this in the shorter term by disabling (for non-subscribers) page-level search on the Heritage portal which hosts this content.

Some researchers and other external projects (some funded by Canadiana as part of the Heritage project, but that Canadiana technical staff were not involved in) have been collecting additional metadata for these LAC reels in the form of tags, categorization, and in some cases transcriptions of specific pages.  This data is being offered to us using project-specific data design that doesn't conform to any of the standards we plan on adopting in the future within the primary platform (See: IIIF annotations, with us likely extending our TDR preservation documentation to support encoding of specific open annotations).

Our platform doesn't yet have the capability to accept, preserve and provide search on this data. When we start a project to accept some of this data we will also have to figure out how to implement a mixture of funding models.  It is expected that most researchers will want the data they have funded to be open access, and would be unhappy if we restricted search on their data to subscribers.  This means we'll need to separate the subscription-required data funded by some groups from the open access search data provided by other groups.

It is likely we will end up with multiple search engines housing different types of data (search fields different from the common ones used within our primary platform), searchable by different groups of people, with a search front-end needing to collate results and display them in a useful way.

Moving more of Canadiana's software projects to GitHub

As some of the links in recent articles suggest, we have started moving more of our software from an internal source control and issue tracker towards public GitHub projects.  While this has value as additional transparency to our membership, I also hope it will enable better collaboration with members, researchers, and others who have an interest in Canadiana's work.

For the longest time the Archive::BagIt Perl module was the only GitHub project associated with Canadiana.  Robert Schmidt became the primary maintainer of this module when he was still at Canadiana, and this module is still critical to our infrastructure.


Added to the two IIIF-related Docker images that I'll discuss more later are two Perl modules:

  • CIHM::METS::App is a tool to convert metadata from a variety of formats (CSV, DB/Text, MARC) to the 3 XML formats we use as descriptive metadata within our METS records (MARCXML, Dublin Core, Issueinfo).  This is used in the production process we use to generate or update AIPs within our TDR.
  • CIHM::METS::parse is the library used to read the METS records within the AIPs in the TDR and present normalized data to other parts of our access platform.  For more technical people this provides an example of how to read our METS records, as well as documenting exactly which fields we use within our access platform (for search and presentation).
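
For those curious what "reading the METS records" involves at a low level, here is a rough, self-contained illustration of pulling the descriptive metadata sections out of a METS file. It uses only the standard METS and Dublin Core namespaces; it is not CIHM::METS::parse itself (which is Perl, and remains the authoritative example of the fields we actually use), and the file name is a placeholder.

    # Rough sketch of extracting descriptive metadata sections from a METS file.
    import xml.etree.ElementTree as ET

    METS = {"mets": "http://www.loc.gov/METS/"}
    DC_TITLE = "{http://purl.org/dc/elements/1.1/}title"

    def dmd_sections(path):
        """Yield (dmdSec ID, metadata type, wrapped XML element) for each dmdSec."""
        root = ET.parse(path).getroot()
        for dmd in root.findall("mets:dmdSec", METS):
            wrap = dmd.find("mets:mdWrap", METS)
            if wrap is None:          # a dmdSec could use mdRef instead of mdWrap
                continue
            data = wrap.find("mets:xmlData", METS)
            yield dmd.get("ID"), wrap.get("MDTYPE"), data

    # Example: print any Dublin Core titles found in each descriptive metadata section.
    if __name__ == "__main__":
        for dmd_id, mdtype, data in dmd_sections("example-mets.xml"):
            for title in data.iter(DC_TITLE):
                print(dmd_id, mdtype, title.text)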

My hope is that by the end of the summer all the software we use for a TDR Repository node will have moved to GitHub.  This provides additional transparency to the partner institutions who are hosting repository servers, clarifying exactly what software is running on that hardware.

We are a small team (currently 3 people) working within a Canadian charity, and would be very interested in having more collaborations.  We know we can't do all of this alone, which is a big part of why we are joining others in the GLAM community with IIIF. Even for the parts which are not IIIF, collaboration will be possible.

If you work at or attend one of our member institutions, or otherwise want to know more about what our technical team is doing, consider going to our GitHub organization page and clicking "watch" for sub-projects that interest you. Feel free to open issues, whether to report a bug, suggest a new feature (maybe someone with funding will agree and launch a collaboration), suggest we take a closer look at some existing technology, or just ask questions (of the technical team -- we have other people who answer questions for subscribers, etc.).

If not on GitHub, please feel free to open a conversation in the comments section of this blog.

Thursday, May 11, 2017

Canadiana JHOVE report

This article is based on a document written to be used at Code4Lib North on May 11th, and discusses what we’ve learned so far with our use of JHOVE.

What is JHOVE?



The original project was a collaboration between JSTOR and Harvard University Library, with JHOVE being an acronym for JSTOR/Harvard Object Validation Environment.  It provides functions to perform format-specific identification, validation, and characterization of digital objects.


JHOVE is currently maintained by the non-profit Open Preservation Foundation, operating out of the UK (Associated with the British Library in West Yorkshire).




Standard JHOVE modules exist for AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF8, WAVE, XML, MP3, and ZIP.

What is Canadiana doing with JHOVE?



As of the last week of April we generate XML reports from JHOVE and include them within AIP revisions in our TDR.  At this stage we are not rejecting or flagging files based on the reports, only providing reports as additional data.  We will be further integrating JHOVE as part of our production process in the future.
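
As a rough illustration of the per-file step (not our production scripts), JHOVE can be invoked with its XML output handler and the resulting status extracted. The wrapper below is a Python sketch: it assumes the standard "-h XML" handler option and that the report goes to standard output when no output file is given, and it matches the status element loosely rather than against a specific schema version.

    # Sketch: run JHOVE's XML output handler on one file and pull out the status.
    import subprocess
    import xml.etree.ElementTree as ET

    def jhove_status(path, jhove="jhove"):
        # "-h XML" selects JHOVE's XML output handler; the report is read from stdout.
        out = subprocess.run([jhove, "-h", "XML", path],
                             capture_output=True, text=True, check=True).stdout
        root = ET.fromstring(out)
        # Match <status> without hard-coding the JHOVE namespace version.
        for elem in root.iter():
            if elem.tag.endswith("}status") or elem.tag == "status":
                return elem.text      # e.g. "Well-Formed and valid"
        return None

    if __name__ == "__main__":
        print(jhove_status("example.tif"))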

Some terminology




What did Canadiana do before using JHOVE?



Prior to the TDR Certification process we made assumptions about files based on their file extensions: a .pdf was presumed to be a PDF file, a .tif a TIFF file, .jpg a JPEG file, and .jp2 a JPEG 2000 file.  We only allowed those 4 types of files into our repository.


As a first step we used ImageMagick’s ‘identify’ feature to identify and confirm that files matched the file types.  This meant that any files added since 2015 had data that matched the file type.


At that time we did not go back and check previously ingested files, as we knew we would eventually be adopting something like JHOVE.


Generating a report for all existing files
As of May 9, 2017 we have 61,829,569 files in the most recent revisions of the AIPs in our repository.  This does not include METS records, past revisions, or files related to the BagIt archive structure we use within the TDR.


I quickly wrote some scripts that would loop through all of our AIPs and generate reports for all the files in the files/ directory of the most recent AIP revision within each AIP.  We dedicated one of our TDR Repository nodes to generating reports for a full month to get the bulk of the reports, with some PDF files still being processed.

Top level report from scan



Total files: 61,829,569
Not well-formed: 941,875 (1.5%)
Not yet scanned: 253
Well-Formed and valid: 60,828,836 (98.4%)
Well-Formed, but not valid: 58,605 (0.09%)


JHOVE offers a STATUS for files which is one of:


  • “Not well-formed” - the file fails the purely syntactic requirements of the format
  • “Well-Formed, but not valid” - the file meets the syntactic requirements but fails the higher-level semantic requirements for format validity
  • “Well-Formed and valid” - the file passes both the well-formedness and validity tests
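
The top-level numbers above were produced by tallying that status field across all of the stored reports. A simplified sketch of that kind of tally follows; the report directory layout and file naming are hypothetical.

    # Simplified sketch of tallying JHOVE statuses across a directory of stored
    # XML reports. Paths and report naming are hypothetical.
    from collections import Counter
    from pathlib import Path
    import xml.etree.ElementTree as ET

    def tally(report_dir):
        counts = Counter()
        for report in Path(report_dir).rglob("*.jhove.xml"):   # hypothetical naming
            try:
                root = ET.parse(report).getroot()
            except ET.ParseError:
                counts["unreadable report"] += 1
                continue
            status = next((e.text for e in root.iter()
                           if e.tag.endswith("}status") or e.tag == "status"), None)
            counts[status or "no status found"] += 1
        return counts

    if __name__ == "__main__":
        for status, n in tally("/path/to/reports").most_common():
            print(f"{status}: {n}")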

Issues with .jpg files



Not well-formed: 10
Well-Formed and valid: 44,743,051
Well-Formed and valid TIFF: 14


We had 10+14=24 .jpg files, ingested prior to adopting the ‘identify’ functionality, that turned out to be broken (truncated files, 0-length files) or that had the wrong file extension.  Nine of the “Not well-formed” files were from LAC reels, where we were ingesting 1000 to 2000 images per reel.

Issues with .jp2 files



Well-Formed and valid: 11,286,315


JHOVE didn’t report any issues with our JPEG 2000 files.

Issues with .tif files



Not well-formed, Tag 296 out of sequence: 1
Not well-formed, Value offset not word-aligned: 503,575
Not well-formed, IFD offset not word-aligned: 435,197
Well-Formed and valid: 4,608,048
Well-Formed, but not valid, Invalid DateTime separator (28/09/2016 16:53:17): 1
Well-Formed, but not valid, Invalid DateTime digit: 21,004
Well-Formed, but not valid, Invalid DateTime length: 3,483
Well-Formed, but not valid, PhotometricInterpretation not defined: 202


  • Word alignment (offsets being evenly divisible by 4 bytes) is the largest structural issue, but it is something that will be easy to fix.  We are able to view these images, so the data inside isn’t corrupted.
  • Validity of DateTime values is the next largest issue.  The format should be "YYYY:MM:DD HH:MM:SS", so a value like “2004: 6:24 08:10:11” is invalid (the blank is an invalid DateTime digit) and values like “Mon Nov 06 22:00:08 2000” or “2000:10:31 07:37:08%09” are invalid (invalid DateTime length).  A quick validation sketch follows this list.
  • PhotometricInterpretation indicates the colour space of the image data (WhiteIsZero/BlackIsZero for grayscale, RGB, CMYK, YCbCr, etc.).  The specification has no default, but we’ll be able to fix the files by making and checking some assumptions.
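
Since the required DateTime layout is exactly "YYYY:MM:DD HH:MM:SS", a simple pattern check is enough to classify the values above; a quick sketch using the examples from our scan:

    # Quick sketch: check whether a TIFF DateTime string matches the required
    # "YYYY:MM:DD HH:MM:SS" layout (19 characters, colon- and space-separated).
    import re

    TIFF_DATETIME = re.compile(r"^\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}$")

    def valid_tiff_datetime(value):
        return bool(TIFF_DATETIME.match(value))

    # The examples from the scan above:
    for value in ("2016:09:28 16:53:17",        # valid layout
                  "28/09/2016 16:53:17",        # invalid separators
                  "2004: 6:24 08:10:11",        # invalid digit (space where a digit belongs)
                  "Mon Nov 06 22:00:08 2000",   # invalid length
                  "2000:10:31 07:37:08%09"):    # invalid length (trailing junk)
        print(repr(value), valid_tiff_datetime(value))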

Issues with .pdf files



Not well-formed, No document catalog dictionary: 3,081
Not well-formed, Invalid cross-reference table, No document catalog dictionary: 2
Not well-formed, Missing startxref keyword or value: 8
Not well-formed, Invalid ID in trailer, No document catalog dictionary: 1
Not yet scanned: 253
Well-Formed and valid: 191,408
Well-Formed, but not valid, Missing expected element in page number dictionary: 33,881
Well-Formed, but not valid, Improperly formed date: 33
Well-Formed, but not valid, Invalid destination object: 1



One of the board members of the Open Preservation Foundation, the organization currently maintaining JHOVE, wrote a longer article on the JHOVE PDF module titled “Testing JHOVE PDF Module: the good, the bad, and the not well-formed” which might be of interest.  Generally, PDF is a hard format to deal with and there is more work that can be done with the module to ensure that the errors it is reporting are problems in the PDF file and not the module.


  • “No document catalog dictionary” -- The root tree node of a PDF is the ‘Document Catalog’, and it has a dictionary object.  This exposed a problem with an update to our production processes where we switched from using ‘pdftk’ to using ‘poppler’ from the FreeDesktop project for joining multiple single-page PDF files into a single multi-page PDF file.  While ‘pdftk’ generated Well-Formed and valid PDFs, poppler did not.

    When I asked on the Poppler forum they pointed to JHOVE as the problem, so at this point I don’t know where the problem is.

    I documented this issue at: https://github.com/openpreserve/jhove/issues/248
  • “Missing startxref keyword or value” - PDF files should have a header, document body, xref cross-reference table, and a trailer which includes a startxref.  I haven’t dissected the files yet, but these may be truncated.
  • “Missing expected element in page number dictionary”.  I’ll need to do more investigation.
  • “Not yet scanned”.  We have a series of multi-page PDF files generated by ABBYY Recognition Server which take a long time to validate.  Eventually it indicates the files are recognized with a PDF/A-1 profile.  I documented this issue at: https://github.com/openpreserve/jhove/issues/161


Our longer term strategy is to no longer modify files as part of the ingest process.  If single-page PDF files are generated from OCR (as is normally the case) we will ingest those single-page PDF files.  If we wish to provide a multi-page PDF to download, this will be done as part of our access platform, where long-term preservation requirements aren’t an issue. In the experiments we have done so far we have found that the single-page PDF output of ABBYY Recognition Server and PrimeOCR validates without errors, and it is the transformations we have done over the years that were the source of the errors.