Thursday, May 11, 2017

Canadiana JHOVE report

This article is based on a document written for Code4Lib North on May 11th, and discusses what we’ve learned so far from our use of JHOVE.

What is JHOVE?



The original project was a collaboration between JSTOR and Harvard University Library, with JHOVE being an acronym for JSTOR/Harvard Object Validation Environment.  It provides functions to perform format-specific identification, validation, and characterization of digital objects.


JHOVE is currently maintained by the non-profit Open Preservation Foundation, which operates out of the UK (associated with the British Library in West Yorkshire).




JHOVE includes standard modules for AIFF, ASCII, BYTESTREAM, GIF, HTML, JPEG, JPEG2000, PDF, TIFF, UTF8, WAVE, XML, MP3, and ZIP.

What is Canadiana doing with JHOVE?



As of the last week of April we generate XML reports from JHOVE and include them within AIP revisions in our TDR.  At this stage we are not rejecting or flagging files based on the reports, only providing reports as additional data.  We will be further integrating JHOVE as part of our production process in the future.
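As a rough illustration of what generating one of these reports involves, the sketch below builds a JHOVE command line for a single file. It assumes the `jhove` launcher script is on the PATH and relies on JHOVE's `-m` (module), `-h` (output handler) and `-o` (output file) options. The extension-to-module table and the function names are made up for this example and are not our production code.

```python
import subprocess
from pathlib import Path

# Assumed mapping from our four accepted extensions to JHOVE module names.
MODULE_BY_EXT = {
    ".jpg": "JPEG-hul",
    ".jp2": "JPEG2000-hul",
    ".tif": "TIFF-hul",
    ".pdf": "PDF-hul",
}


def jhove_command(source: Path, report: Path) -> list[str]:
    """Return the argv that writes an XML report about `source` to `report`."""
    module = MODULE_BY_EXT[source.suffix.lower()]
    return ["jhove", "-m", module, "-h", "xml", "-o", str(report), str(source)]


def write_report(source: Path, report: Path) -> None:
    """Run JHOVE; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(jhove_command(source, report), check=True)
```

Forcing a module by extension is a design choice for this sketch: it asks "is this a valid TIFF?" rather than "what is this file?", which is why a TIFF with a .jpg extension would need a second pass with a different module.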

Some terminology




What did Canadiana do before using JHOVE?



Prior to the TDR Certification process we made assumptions about files based on their file extensions: a .pdf was presumed to be a PDF file, a .tif a TIFF file, .jpg a JPEG file, and .jp2 a JPEG 2000 file.  We only allowed those 4 types of files into our repository.


As a first step we used ImageMagick’s ‘identify’ feature to confirm that the contents of each file matched the type implied by its extension.  This meant that any files added since 2015 had data that matched the file type.


At that time we did not go back and check previously ingested files, as we knew we would eventually be adopting something like JHOVE.
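To show the idea of checking content against an extension, here is a minimal sketch that compares a file's leading bytes with well-known format signatures. This is not what we ran (we used ImageMagick's ‘identify’, which inspects far more than the first few bytes); the table and function names are hypothetical.

```python
import os
from typing import Optional

# Leading-byte signatures for the four formats we accept.
SIGNATURES = [
    ("JPEG", b"\xff\xd8\xff"),
    ("TIFF", b"II*\x00"),  # little-endian TIFF
    ("TIFF", b"MM\x00*"),  # big-endian TIFF
    ("PDF", b"%PDF-"),
    ("JPEG 2000", bytes.fromhex("0000000c6a5020200d0a870a")),  # JP2 signature box
]

EXPECTED = {".jpg": "JPEG", ".tif": "TIFF", ".pdf": "PDF", ".jp2": "JPEG 2000"}


def sniff(head: bytes) -> Optional[str]:
    """Return the format whose signature starts `head`, or None."""
    for name, magic in SIGNATURES:
        if head.startswith(magic):
            return name
    return None


def extension_matches(filename: str, head: bytes) -> bool:
    """True only if the extension is one we accept and the bytes agree with it."""
    expected = EXPECTED.get(os.path.splitext(filename)[1].lower())
    return expected is not None and sniff(head) == expected
```

A zero-length file has no signature at all, so it fails this check whatever its extension is.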


Generating a report for all existing files
As of May 9, 2017 we have 61,829,569 files in the most recent revisions of the AIPs in our repository.  This does not include METS records, past revisions, or files related to the BagIt archive structure we use within the TDR.


I quickly wrote some scripts that would loop through all of our AIPs and generate reports for all the files in the files/ directory of the most recent AIP revision within each AIP.  We dedicated one of our TDR Repository nodes to generating reports for a full month to get the bulk of the reports, with some PDF files still being processed.
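The looping part of such a script can be sketched as follows. The directory layout, the `.jhove.xml` report naming, and the choice to skip files that already have a report (so a month-long run can be stopped and resumed) are all assumptions made for this example, not a description of our actual scripts.

```python
from pathlib import Path
from typing import Iterator, Tuple


def pending_reports(files_dir: Path, report_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (source, report) pairs for files under `files_dir` with no report yet."""
    for source in sorted(files_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(files_dir)
        # Hypothetical naming: mirror the tree and append ".jhove.xml".
        report = report_dir / relative.parent / (relative.name + ".jhove.xml")
        if not report.exists():
            yield source, report
```

Each pair can then be handed to whatever runs JHOVE, one file at a time or spread across worker processes.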

Top level report from scan



| Status | Files |
| --- | --- |
| Total files | 61,829,569 |
| Not well-formed | 941,875 (1.5%) |
| Not yet scanned | 253 |
| Well-Formed and valid | 60,828,836 (98.4%) |
| Well-Formed, but not valid | 58,605 (0.09%) |
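The figures in this report can be cross-checked with a few lines of arithmetic. The counts below are copied from the report; the variable and function names are just for this sketch.

```python
# Counts copied from the top-level report.
counts = {
    "Not well-formed": 941_875,
    "Not yet scanned": 253,
    "Well-Formed and valid": 60_828_836,
    "Well-Formed, but not valid": 58_605,
}
total = 61_829_569

# The four categories should account for every file.
assert sum(counts.values()) == total


def share(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`."""
    return 100 * count / total


for status, count in counts.items():
    print(f"{status}: {count:,} ({share(count, total):.2f}%)")
```

The four categories add up exactly to the total, and the computed shares round to the percentages quoted in the report.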


JHOVE offers a STATUS for files which is one of:


  • “Not well-formed” - fails the purely syntactic requirements for the format
  • “Well-Formed, but not valid” - meets the syntactic requirements, but fails the higher-level semantic requirements for format validity
  • “Well-Formed and valid” - passed both the well-formedness and validity tests
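Pulling the status and messages out of an XML report can be sketched as below. The sample document is hand-written to resemble a JHOVE report (real reports carry many more fields), and the namespace shown is an assumption, which is why the code matches elements by local name only.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-written sample shaped like a JHOVE XML report; not real output.
SAMPLE = """<jhove xmlns="http://hul.harvard.edu/ois/xml/ns/jhove">
  <repInfo uri="0001.tif">
    <format>TIFF</format>
    <status>Well-Formed, but not valid</status>
    <messages>
      <message severity="error">Invalid DateTime digit</message>
    </messages>
  </repInfo>
</jhove>"""


def local(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize(report_xml: str) -> list[tuple[str, str, list[str]]]:
    """Return (uri, status, messages) for each repInfo element."""
    rows = []
    for elem in ET.fromstring(report_xml).iter():
        if local(elem.tag) != "repInfo":
            continue
        status, messages = "", []
        for child in elem.iter():
            if local(child.tag) == "status":
                status = (child.text or "").strip()
            elif local(child.tag) == "message":
                messages.append((child.text or "").strip())
        rows.append((elem.get("uri", ""), status, messages))
    return rows


# Tally statuses the same way the tables in this article are built.
tally = Counter(status for _, status, _ in summarize(SAMPLE))
```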

Issues with .jpg files



| Status | Files |
| --- | --- |
| Not well-formed | 10 |
| Well-Formed and valid | 44,743,051 |
| Well-Formed and valid TIFF | 14 |


We had 10+14=24 .jpg files, ingested prior to adopting the ‘identify’ functionality, that turned out to be broken (truncated files, 0 length files) or that had the wrong file extension.  9 of the “Not well-formed” files were from LAC reels, where we were ingesting 1000 to 2000 images per reel.

Issues with .jp2 files



| Status | Files |
| --- | --- |
| Well-Formed and valid | 11,286,315 |


JHOVE didn’t report any issues with our JPEG 2000 files.

Issues with .tif files



| Status | Message | Files |
| --- | --- | --- |
| Not well-formed | Tag 296 out of sequence | 1 |
| Not well-formed | Value offset not word-aligned | 503,575 |
| Not well-formed | IFD offset not word-aligned | 435,197 |
| Well-Formed and valid | | 4,608,048 |
| Well-Formed, but not valid | Invalid DateTime separator: 28/09/2016 16:53:17 | 1 |
| Well-Formed, but not valid | Invalid DateTime digit | 21,004 |
| Well-Formed, but not valid | Invalid DateTime length | 3,483 |
| Well-Formed, but not valid | PhotometricInterpretation not defined | 202 |


  • Word alignment (offsets being even, i.e. evenly divisible by 2 bytes) is the largest structural issue, but it is something that will be easy to fix.  We are able to view these images, so the data inside isn’t corrupted.
  • Validity of DateTime values is the next largest issue.  The format should be "YYYY:MM:DD HH:MM:SS", so “2004: 6:24 08:10:11” is invalid (the blank is an Invalid DateTime digit), and “Mon Nov 06 22:00:08 2000” or “2000:10:31 07:37:08%09” are invalid (Invalid DateTime length).
  • PhotometricInterpretation indicates the colour space of the image data (WhiteIsZero/BlackIsZero for grayscale, RGB, CMYK, YCbCr, etc.).  The specification has no default, but we’ll be able to fix the files by making and checking some assumptions.
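The first two checks can be sketched in a few lines. This is not JHOVE's code: it is a minimal reimplementation that reuses the message names from the table, assumes that a TIFF "word" boundary means an even byte offset (as in the TIFF 6.0 specification), and only looks at the first IFD offset in the 8-byte header. Function names are hypothetical.

```python
import struct
from typing import Optional


def datetime_problem(value: str) -> Optional[str]:
    """Return None if `value` is shaped like 'YYYY:MM:DD HH:MM:SS', else a reason."""
    if len(value) != 19:
        return "Invalid DateTime length"
    separators = {4: ":", 7: ":", 10: " ", 13: ":", 16: ":"}
    for index, expected in separators.items():
        if value[index] != expected:
            return "Invalid DateTime separator"
    for index, char in enumerate(value):
        if index not in separators and char not in "0123456789":
            return "Invalid DateTime digit"
    return None  # shape is fine; ranges (month 13, hour 25) are not checked


def first_ifd_offset(header: bytes) -> int:
    """Read the offset of the first IFD from the 8-byte TIFF header."""
    order = {b"II": "<", b"MM": ">"}[header[:2]]  # byte-order mark
    magic, offset = struct.unpack(order + "HI", header[2:8])
    if magic != 42:
        raise ValueError("not a TIFF header")
    return offset


def is_word_aligned(offset: int) -> bool:
    """Assumption: word-aligned means the offset is an even number."""
    return offset % 2 == 0
```

Run against the examples above, `datetime_problem("2004: 6:24 08:10:11")` reports a digit problem because the length and separators are fine but position 5 holds a blank, while “Mon Nov 06 22:00:08 2000” already fails on length.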

Issues with .pdf files



| Status | Message | Files |
| --- | --- | --- |
| Not well-formed | No document catalog dictionary | 3,081 |
| Not well-formed | Invalid cross-reference table, No document catalog dictionary | 2 |
| Not well-formed | Missing startxref keyword or value | 8 |
| Not well-formed | Invalid ID in trailer, No document catalog dictionary | 1 |
| Not yet scanned | | 253 |
| Well-Formed and valid | | 191,408 |
| Well-Formed, but not valid | Missing expected element in page number dictionary | 33,881 |
| Well-Formed, but not valid | Improperly formed date | 33 |
| Well-Formed, but not valid | Invalid destination object | 1 |



One of the board members of the Open Preservation Foundation, the organization currently maintaining JHOVE, wrote a longer article on the JHOVE PDF module titled “Testing JHOVE PDF Module: the good, the bad, and the not well-formed” which might be of interest.  Generally, PDF is a hard format to deal with and there is more work that can be done with the module to ensure that the errors it is reporting are problems in the PDF file and not the module.


  • “No document catalog dictionary” -- The root tree node of a PDF is the ‘Document Catalog’, and it has a dictionary object.  This exposed a problem with an update to our production processes where we switched from using ‘pdftk’ to using ‘poppler’ from the FreeDesktop project for joining multiple single-page PDF files into a single multi-page PDF file.  While ‘pdftk’ generated Well-Formed and valid PDFs, poppler did not.

    When I asked on the Poppler forum they pointed to JHOVE as the problem, so at this point I don’t know where the problem is.

    I documented this issue at: https://github.com/openpreserve/jhove/issues/248
  • “Missing startxref keyword or value” - PDF files should have a header, document body, xref cross-reference table, and a trailer which includes a startxref.  I haven’t dissected the files yet, but these may be truncated.
  • “Missing expected element in page number dictionary”.  I’ll need to do more investigation.
  • “Not yet scanned”.  We have a series of multi-page PDF files generated by ABBYY Recognition Server which take a long time to validate.  Eventually it indicates the files are recognized with a PDF/A-1 profile.  I documented this issue at: https://github.com/openpreserve/jhove/issues/161
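For the “Missing startxref keyword or value” case, a quick way to test the truncation theory is to look at the tail of the file, where a complete PDF ends with `startxref`, a byte offset, and `%%EOF`. The sketch below is a rough heuristic of my own, not the check JHOVE performs; the function name and the 1024-byte window are assumptions.

```python
import re
from typing import Optional


def startxref_offset(pdf_bytes: bytes, tail_size: int = 1024) -> Optional[int]:
    """Return the offset following the last 'startxref' in the file tail, or None."""
    tail = pdf_bytes[-tail_size:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    return int(matches[-1]) if matches else None
```

A file for which this returns `None` has lost its trailer, which would be consistent with truncation during copying or processing.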


Our longer-term strategy is to stop modifying files as part of the ingest process.  If single-page PDF files are generated from OCR (as is normally the case), we will ingest those single-page PDF files.  If we wish to provide a multi-page PDF for download, this will be done as part of our access platform, where long-term preservation requirements aren’t an issue.  In the experiments we have done so far, we have found that the single-page PDF output of ABBYY Recognition Server and PrimeOCR validates without errors, and that the transformations we have done over the years were the source of the errors.
