The PAWDOC Preservation story

In May 2018 the inaugural digital preservation work on the PAWDOC collection was completed. The work that was done, and the lessons that were learnt, are documented in the following paper which can be downloaded from this site subject to Creative Commons conditions:

The Application of Preservation Planning Templates to a Personal Digital Collection

Instances of the populated preservation planning templates that were used to control the work are also provided:

A summary of the work done and the lessons learned has been published as a Blog Post on the Digital Preservation Coalition (DPC) website.

The preservation planning templates were updated as a result of insights gained in the work and these are available as embedded files in the above ‘Application of Preservation Planning Templates’ paper and also in the DPC website.

March: Long and Plans

It looks like the blog post describing the Digital Preservation work undertaken last year on the PAWDOC collection will be published next month on the DPC website. It will refer to the full paper describing the work in more detail, which will be published here within pwofc.com. At the same time, the preservation planning document templates will be replaced by updated versions in the DPC website. The publication of all these materials will be a fitting end to the preservation planning activities that are described in previous entries in this site. However, there will still be one thing to do before the topic can be considered complete, and that is to review the effectiveness of the Preservation Maintenance Plan template when an instance of it is used in the PAWDOC Preservation maintenance exercise scheduled for September 2021.

Clear Blue Calm Water

Unfortunately, the paper summarising the PAWDOC digital preservation work has not progressed in the last few months because the DPC has too much work on at the moment to deal with it. I’m hoping this might change in the early part of 2019.

In the meantime, I have just completed another important aspect of digital preservation work on the PAWDOC collection. I have long been concerned that the collection resides on a laptop running Windows 7 – an operating system for which Microsoft have said they will withdraw support in 2020. At the same time, the battery in my existing laptop no longer functions, so it has to be mains-connected at all times. So, about a week ago I acquired a Chillblast Leggera i7 Ultrabook with 8GB of RAM and a 1TB Samsung Solid State Drive (SSD). I listed a set of conversion activities and started working my way through attaching peripherals (keyboard, mouse, scanner) and loading software (Anti-virus, Scanning, Filemaker, MS Office, Cloud backup). All went well until nearly at the end when I hit the wall of connecting the external Dell 2405FP monitor which I bought in 2006, and which has worked fine ever since with at least three different laptops.

I had planned to use the laptop’s HDMI port and had acquired an HDMI to DVI adapter to enable an HDMI cable to be plugged into the Dell monitor’s DVI port. Unfortunately, the connection only worked for a few minutes. After that the monitor’s DVI interface went into Power Save Mode and, no matter what I tried, I couldn’t get it out of that mode. I then tried searching the net for a fix and discovered a huge number of entries about this problem for several different models of Dell monitors stretching back to 2005 – with no definitive fix emerging. I decided to try using the VGA port on the Dell monitor and duly purchased an Amazon next day delivery of an HDMI to VGA converter. Unfortunately, this simply had the same effect of putting the monitor’s VGA interface into Power Save Mode.

However, a ray of hope did appear when I plugged the VGA lead back into my old laptop, and the Dell monitor immediately came out of Power Save Mode and the screen image was displayed. I was able to obtain the monitor menu while it was attached to the old laptop and reset the monitor to factory settings – but this didn’t make any difference – every time I attached the new laptop’s HDMI port to either the monitor’s DVI or VGA interface it returned to Power Save Mode.

My last ditch effort to resolve the problem was to try using the laptop’s Mini DisplayPort (MD) port, and, in a state of some depression and resignation, yesterday I duly purchased an Amazon same day delivery of an MD to VGA adapter plug. It cost £5.99, was ordered around 9am and was delivered around 8pm (really…). With the laptop switched off, I put the adapter into the laptop’s MD port and plugged in the monitor’s VGA cable. The buttons on the monitor went orange (signifying Power Save Mode) and I thought, ‘here we go again’ and switched on the laptop; and suddenly after a few seconds I saw a bright light out of the corner of my eye and, blow me down, there was the laptop screen on the monitor! I used it for a while and then, trepidatiously, tried closing the laptop lid and it kept on displaying on the monitor. Later, I shut the laptop down and subsequently fired it up again – but still no problem – up it came on the monitor. So it looks like this is now working OK. Phew.

This morning I reorganised my physical desktop and placed the new smaller laptop in a new position immediately next to my scanner so that the problem of making the scanner cable reach the laptop port was eliminated. With the conversion process complete and my desk back in some sort of order, I began to feel more in control of things and much more relaxed. I had sailed into clear blue calm water in the sheltered bay of an up to date operating system and a modern laptop.

DPC Publication Plans

A few days ago I agreed a way forward with the Digital Preservation Coalition (DPC) regarding the publication of the paper describing the PAWDOC digital preservation work: I will create a post summarising the learnings from the work, and the DPC will attach an edited PDF version of the whole paper, as well as the updated templates, to the original Case Note describing how I derived the preservation process that I applied to the PAWDOC collection. I’m hoping this will all be achieved by the end of October.

Under the paper wait

The paper describing the PAWDOC digital preservation work was submitted to the Digital Preservation Coalition (DPC) on 31st May and the organisation responded saying it was interested in the paper but was currently unable to provide a timescale for dealing with it due to a busy work schedule. I guess it might be several months before hearing whether the DPC will want to publish a version of the paper.

Paper written – Maint Plan test to do

The follow up paper describing my recently completed preservation project is now ready for submission to the Digital Preservation Coalition (DPC). I’m hoping that, since they published my paper describing how I derived the Preservation Planning Templates in the first place, they might be interested in taking a paper describing how they have been used in practice. We’ll see. In any case it’s good to have been able to create a summarised account of what happened while it’s fresh in my mind.

Writing the first draft of the paper only took about a week. However, that piece of work made me realise that the details of what got done when appear in five main documents – the paper I was writing, the Scoping document, the Plan DESCRIPTION, the Plan CHART, and section 2 of the Preservation Maintenance Plan (Previous preservation actions taken); and that the base data for all these documents was being derived from the three major control sheets – the DROID analysis spreadsheet, the Files-that-won’t-open spreadsheet, and the Physical Disks spreadsheet. Although the facts were roughly consistent across the documents, there were several anomalies that would be apparent to readers, and the sheer number of files and types of conversions that had been performed made it difficult to check and make revisions. I decided that the only way to achieve true consistency and traceability across all the documents would be to specify columns in the control spreadsheets for all the categories I wanted to describe, and to have the spreadsheets add up the counts automatically. This is what I spent the following two weeks doing – and a very slow and tortuous exercise it was. Which is why the paper makes several mentions of the need to set up control sheets correctly in the first place to facilitate downstream needs for control and for statistical information about what’s been done….
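For anyone keeping their control sheet as a CSV export rather than a spreadsheet, the idea of having the counts add themselves up can be sketched in a few lines of Python. This is only an illustration: the column names (file, original_format, action) and the rows are hypothetical and are not the columns of the actual PAWDOC control sheets.

```python
from collections import Counter

# Hypothetical control-sheet rows. The column names and category labels are
# illustrative assumptions, not those used in the real PAWDOC spreadsheets.
control_sheet = [
    {"file": "report1.doc", "original_format": "Word for Mac 4.0", "action": "converted to PDF"},
    {"file": "report2.doc", "original_format": "Word for Mac 4.0", "action": "converted to DOCX"},
    {"file": "slides1.ppt", "original_format": "PowerPoint 97-2003", "action": "converted to PPTX"},
    {"file": "memo1.doc", "original_format": "Word for Windows 2.0", "action": "converted to PDF"},
]


def count_by(rows, column):
    """Count how many rows fall into each value of the given column."""
    return Counter(row[column] for row in rows)


# One set of totals, derived from one place, that every document can quote.
for action, total in sorted(count_by(control_sheet, "action").items()):
    print(f"{action}: {total}")
```

Because every figure is derived from the same rows, the paper, the Plan and the Maintenance Plan would all quote the same totals, which is the consistency the control-sheet columns were meant to provide.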

I was given a lot of very useful feedback on the drafts of the paper by Ross Spencer, including suggestions to include a summary timeline for the project at the beginning of the paper, to provide more details about the DROID tool, and to include some additional references. Ross also advised making it clear that this is a personal collection with preservation decisions being made that the owners were comfortable with; and that different decisions might have been made by other people from the perspective of who the future users of the Collection might be. This prompted me to include an extra paragraph in the Conclusions section to the effect that no attempt has been made to convert some files (such as old versions of the Indexing software, or a Visio stencil file) because they don’t have content and their mere presence in the collection tells its own story. However, it’s got me thinking that there is a wider point here about what collections are for, and just how much detail of the digital form needs to be preserved. I’ll probably explore this issue further in the Personal Document Management topic in this Blog.

Writing the paper also prompted me to realise that, unfortunately, my Digital Preservation Journey can’t be completed until I’ve tested out the application of a Preservation Maintenance Plan. It’s one thing to fill in a Maintenance Plan (which was relatively quick and easy), but quite another to have it initiate and direct a full blown Preservation project. Only by using it in practice will it be known if it is an effective and useful tool; and, no doubt, its use will lead to some refinements being made to its contents. I shall explore whether I could use the Maintenance Plans I produced for photos and for mementos which were created in the course of the trials conducted when putting together the first versions of the Preservation Planning Templates. If they won’t provide an adequate test, I’ll have to wait until the date specified in the PAWDOC Preservation Maintenance Plan for the next Maintenance exercise – September 2021.

PawdocDP Preservation Project Put to Bed

Last Thursday (3 May) I completed the preservation project on my document collection – quite a relief to know that it is now in reasonably good shape for a few more years. To finish off this work I intend to write a follow up paper recounting how the processes and templates I developed in the earlier stages of this exercise fared when applied to a substantial body of files. Looking back I see that I started this Preservation Planning topic nearly four years ago, so it’s been a long haul and very labour intensive – I’m looking forward to being able to move it to the Journeys Completed section of this blog so that I can concentrate again on more creative and exciting forays!

Disk, Reordering, and Maintenance Plan Insights

Although my last post reported that I’d got through the long slog of the conversion aspects of this preservation project, in fact there was still more slog of other sorts to go. A lot more slog in fact: there was the transfer of the contents of 126 cd/dvd disks to the laptop; and there was the reordering of pages in 881 files to rectify the page order produced by scanning all front sides first and then turning over the stack of pages to scan the reverse sides at a time in the 1990s when I didn’t have a double sided scanner. In fact this exercise involved yet more conversion (from multi-page TIF file to PDF) before the reordering could be done.

This latter task really took a huge amount of time and effort and was yet another reminder of how easy it is to specify tasks in a preservation project without really appreciating how much hard graft they will entail. Having said that, it’s worth noting that my PDF application – eCopy PDF Pro – had two functions which made this task a whole lot easier. First, the option to have eCopy convert a file to PDF is available in the menu brought up by right clicking on any file; it automatically suggests a file title (based on the title of the original file) for the new PDF in the Save As dialogue box, and then automatically displays the newly created file – all of which is relatively quick and easy. Second, eCopy has a function whereby thumbnails of all the pages in a document can be displayed on the screen and each page can be dragged and dropped to a new position. I soon worked out that the front-sides-then-reverse-sides scan produces a standard order in which the last page in the file is actually page 2 of the document; and that if you drag that page to be the second page in the document, then the new last page will actually be page 4 of the document and can be dragged to just before the 4th page in the document. In effect, to reorder simply means progressively dragging the last page in the file into the 2nd position, then the 4th position, then the 6th position and so on until the end of the file is reached. Both these functions (to be able to click on a file title to get it converted, and to drag and drop pages around a screenful of thumbnails) are well worth looking for in a PDF application.
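The drag-the-last-page routine described above can be written down as a small Python sketch. It is not something eCopy does for you, and it assumes that every sheet had both sides scanned, so that the file holds an even number of pages; the function name is made up for the illustration.

```python
def reorder_fronts_then_backs(pages):
    """Put a fronts-first, then reversed-backs scan into reading order.

    Mirrors the manual routine: repeatedly take the last page in the file
    and drop it into the 2nd, then 4th, then 6th position, and so on.
    Assumes one front and one back per sheet (an even number of pages).
    """
    if len(pages) % 2 != 0:
        raise ValueError("expected an even number of pages (one front and one back per sheet)")
    ordered = list(pages)
    sheets = len(ordered) // 2
    for k in range(sheets):
        last = ordered.pop()             # the last page is the next even-numbered page
        ordered.insert(2 * k + 1, last)  # positions 2, 4, 6, ... (zero-based 1, 3, 5, ...)
    return ordered


# A three-sheet document scanned fronts first, then the stack turned over.
scanned = [1, 3, 5, 6, 4, 2]
print(reorder_fronts_then_backs(scanned))  # [1, 2, 3, 4, 5, 6]
```

The list items here are just page numbers; in practice they would stand for page objects or page indexes in whatever PDF tool is being used.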

Regarding the disks, I was expecting to have trouble with some of the older ones since, during the scoping work, I had encountered a few which the laptop failed to recognise. I did try cleaning such disks with a cloth without much success. However, what did seem to work was to select ‘Computer’ on the left side of the Windows Explorer Window, which displays the laptop’s own drive on the right side of the window together with any external disks that are present. For some reason, disks which kept on whirring without seeming to be recognised just appeared on this right side of the window. I don’t profess to understand why this was happening – but was just glad that, in the end, there was only one disk that I couldn’t get the machine to display so that its contents could be copied.

I’m now in the much more relaxed final stages of the project, defining backup arrangements and creating the Maintenance Plan and User Guide documents. The construction of the Maintenance Plan has thrown up a couple of interesting points. First, since it requires a summary of what preservation actions have been completed and what preservation issues are to be addressed next time, it would have made life easier to construct the preservation working documents in such a way that the information for the Preservation Maintenance Plan is effectively pre-specified – an obvious point really but easy to overlook – and I did overlook it…. The second point is a more serious issue. The Maintenance Plan is designed to define a schedule of work to be undertaken every few years; it’s certainly not something I want to be doing very often – I’ve got other things I want to do with my time. However, some of the problem files I have specified in the ‘Possible future preservation issues’ section in the Maintenance Plan could really do with being addressed straight away – or at least sooner than 2021 when I have specified the next Maintenance exercise should be carried out. I guess this is a dilemma which has to be addressed on a case by case basis. In THIS case, I’ve decided to just leave the points as they are in the Maintenance Plan so that they don’t get forgotten; but to possibly take a look at a few of them in the shorter term if I feel motivated enough.

The Conversion Slog

I’m glad to say I’ve nearly finished the long slog through the file conversion aspects of this digital preservation project. After dealing with about 900 files I just have another 50 or so Powerpoints and a few Visios to get through. It’s been a salutary reminder of how easily large quantities of digital material could be lost simply because the sheer volume of files makes for a very daunting task to retrieve them.

Below are a few of the things I’ve learnt as I’ve been ploughing through the files.

Email .eml files: These are mail messages which opened up fine in Windows Live Mail when I did the scoping work for this project. Unfortunately, since then I’ve had a system crash and Live Mail was not loaded onto my rebuilt machine; and Microsoft removed all Live Mail support and downloads at the end of 2017. On searching for a solution on the net, I found several suggestions to change the extension to .mht to get the message to open in a browser. This works well, but unfortunately the message header (From, To, Subject, Date) is not reproduced. I ended up downloading the Mozilla Thunderbird email application, opening each email in turn in it, taking screenshots of each screenful of message and copying them into Powerpoint, saving each one as a JPG, and then inserting the JPGs for all the emails in a particular category into a PDF document. A bit tortuous and maybe there are better ways of doing it – but at least I ended up with the PDFs I was aiming for.
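One possible alternative, which is not the route I took, is to read the .eml files with the email module in Python's standard library, which does expose the From, To, Subject and Date header fields that the .mht trick loses. The sketch below is a minimal illustration under that assumption; the sample message and the function name are invented, and it only handles a simple text body, not attachments.

```python
from email import message_from_bytes, policy


def summarise_eml(raw_bytes):
    """Return the main header fields and the text body of an .eml message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    headers = {}
    for name in ("From", "To", "Subject", "Date"):
        value = msg[name]
        headers[name] = str(value) if value is not None else ""
    # Prefer a plain-text body, falling back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return headers, body


# Invented sample message; a real file would be read with
# pathlib.Path("message.eml").read_bytes() (hypothetical file name).
sample = (
    b"From: alice@example.com\r\n"
    b"To: bob@example.com\r\n"
    b"Subject: Test message\r\n"
    b"Date: Mon, 01 Jan 2018 10:00:00 +0000\r\n"
    b"\r\n"
    b"Hello Bob.\r\n"
)

headers, body = summarise_eml(sample)
for name, value in headers.items():
    print(f"{name}: {value}")
print(body)
```

The header fields and body text could then be written into whatever document format is wanted, rather than being captured as screenshots.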

Word for Mac 3.0 files: These files did open in MS Word 2007 – but only as continuous streams of text without any formatting. After some experimentation, I discovered that doing a carriage return towards the end of the file magically re-instated most of the formatting – though some spurious text was left at the end of the file. I saved these as DOCX files.

Word for Mac 4.0 & 5.0 and Word for Windows 1.0 & 2.0: These documents all opened up OK in Word 2007. However, I found that in longer documents which had been structured as reports with contents list, the paging had got slightly out of sync so that headings, paragraphs and bullets were left orphaned on different pages. I converted such files to DOCX format in order to have the option to reinstate the correct format in the future. Files without pagination problems, or which I had been able to fix without too much effort, were all converted to PDF.

PDF-A-1b: I had previously elected to store my PDF files in the PDF-A-1b format (designed to facilitate the long term storage of documents). However, on using the conformance checker in my PDF application (eCopy PDF Pro) I discovered that they possessed several non-conformances; and, furthermore, the first use of eCopy PDF Pro’s ‘FIX’ facility does not resolve all of them. I decided that trying to make each new PDF I created conform to PDF-A-1b would take up too much time and would jeopardise the project as a whole. So, I included the following statement in the Preservation Maintenance Plan that will be produced at the end of the project: “PDF files created in the previous digital preservation exercise were not conformant to the PDF-A-1b standard, and the eCopy PDF Pro ‘FIX’ facility was unable to rectify all of the non-conformances. Consideration needs to be given as to whether it is necessary to undertake work to ensure that all PDF files in the collection comply fully with the PDF-A-1b standard.”

PowerPoint – for Mac 4.0, Presentation 4.0, and 97-2003: All of these failed to open with Powerpoint 2007, so I used Zamzar to convert them. Interestingly Zamzar wouldn’t convert to PPTX – only to Powerpoint 1997-2003 which I was subsequently able to open with Powerpoint 2007. So far, it has converted over 100 Powerpoints and failed with only four (two Mac 4.0 and two Presentation 4.0). The conversions have mostly been perfect with the small exception that, in some of the files, some of the slides include a spurious ‘Click to insert title’ text box. I can’t be sure that these have been inserted during the conversion process, but I think it unlikely that I would have left so many of them in place when preparing the slides. Zamzar’s overall Powerpoint conversion capability is very good – but I have experienced a couple of irritating characteristics: first, on several occasions it has sent me an email saying the conversion has been successful but then failed to provide the converted file, implying that it wasn’t able to convert the file; and second, the download screen enables five or more files to be specified for conversion but if several files are included it only converts alternate files – the other files are reported to have been converted but no converted file is provided. This problem goes away if each file is specified on its own in its own download screen. The other small constraint is that the free service will only convert a maximum of 50 files in any 24 hour period – but that seems a fair limit for what is a really useful service (at the time of writing, the fee for the cheapest level of service was $9 a month).

UPDATED and ORIGINAL: I am including UPDATED in the file title of the latest version of a file, and ORIGINAL in earlier versions of the same file, because all files relating to a specific Reference No are stored in the same Windows Explorer Folder and users need to be able to pick out the correct preserved file to open. There will be only one UPDATED file – all earlier versions will have ORIGINAL in the file title. Another way of dealing with this issue of multiple file versions would be to remove all ORIGINAL versions to separate folders. However, this would make the earlier versions invisible and harder to get at, which may not be desirable. I believe this needs further thought – and the input of requirements from future users of the collection – before the best approach can be specified.

DOCX, PPTX and XLSX: When converting MS Office documents, unless I was converting to PDF, I elected to convert to the DOCX, PPTX and XLSX formats for two reasons – they are Microsoft’s future-facing formats, and – for the time being – they provide another way of distinguishing between files that have been UPDATED and those that haven’t.

Many of these experiences came as a surprise despite the amount of scoping work that was undertaken; and that is probably inevitable. To be able to nail down every aspect of each activity would take an inordinate amount of time. There will always be a trade off between time spent planning and the amount of certainty that can be built into a plan; and it will always be necessary to be pragmatic and flexible when executing a plan.

Retrospective Preservation Observations

Yesterday I reached a major milestone. I completed the conversion of the storage of my document collection from a Document Management System (DMS) to files in Windows Folders. It feels a huge release not to have the stress of maintaining two complicated systems – a DMS and the underlying SQL database – in order to access the documents.

From a preservation perspective, a stark conclusion has to be drawn from this particular experience: the collection started using a DMS some 22 years ago, and in that time I have undergone 5 changes of hardware, one laptop theft and a major system crash. In order to keep the DMS and SQL Db going I have had to try and configure and maintain complex systems I had no in-depth knowledge of; engage with support staff over phone, email, screen sharing and in person for many, many hours to overcome problems; and backup and nurture large amounts of data regularly and reliably. If I had done nothing to the DMS and SQL Db over those years I would long ago have ceased to be able to access the files they contained. In contrast, if they had been in Windows folders I would still be able to access them. So, from a digital preservation perspective there can be no doubt that having the files in Windows Folders will be a hugely more durable solution.

When considering moving away from a DMS I was concerned it might be difficult to search for and find particular documents. I needn’t have worried. Over the last week or so I’ve done a huge amount of checking to ensure the export from the DMS into Windows Folders had been error free. This entailed constant searching of the 16,000 Windows Folders and I’ve found it surprisingly easy and quick to find what I need. The collection has an Index with each index entry having a Reference Number. There is a Folder for each Ref No within which there can be one or more separate files, as illustrated below.

Initially, I tried using the Windows Explorer search function to look for the Ref Nos, but I soon realised it was just as easy – and probably quicker – to scroll through the Folders to spot the Ref No I was looking for. The search function on the other hand will come in useful when searching for particular text strings within non-image documents such as Word and PDF – a facility built into Windows as standard.

I performed three main types of check to ensure the integrity of the converted collection: a check of the documents that the utility said it was unable to export; a check of the DMS files that remained after the export had finished (the utility deleted the DMS version of a file after it had exported it); and, finally, a check of all the Folder Ref Nos against the Ref Nos in the Index. These checks are described in more detail below.

Unable to export: The utility was unable to export only 13 of the 27,000 documents and most of these were due to missing files or missing pages of multi-page documents.

Remaining files: About 1400 files remained after the export had finished. About 1150 of these were found to be duplicates with contents that were present in files that had been successfully exported. The duplications probably occurred in a variety of ways over the 22 year life of the DMS including human error in backing up and in moving files from off-line media to on-line media as Laptops started to acquire more storage. 70 of the files were used to recreate missing files or to augment or replace files that had been exported. Most of the rest were pages of blank or poor scans which I assume I had discovered and replaced at the point of scanning but which somehow had been retained in the system. I was unable to identify only 7 of the files.

Cross-check of Ref Nos in Index and Folders: This cross-check revealed the following problems with the exported material from the DMS:

  • 9 instances in which a DMS entry was created without an Index entry being created,
  • 9 cases in which incorrect Ref Nos had been created in the DMS,
  • 6 instances in which the final digit of a longer than usual Ref No had been omitted (eg PAW-BIT-Nov2014-33-11-1148 was exported as PAW-BIT-Nov2014-33-11-114),
  • 3 cases in which documents had been marked as removed in the Index but not removed from the DMS,
  • 2 cases in which documents were missing from the DMS export.

It also revealed a number of problems and errors within the 17,000 index entries. These included 12 instances in which incorrect Filemaker Doc Refs had been created, and 6 cases in which duplicated Filemaker entries were identified.
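A cross-check of this kind comes down to comparing two sets of Ref Nos, and can be sketched in Python as below. This is an illustration rather than a record of how I actually did the check: apart from the PAW-BIT example quoted above, the Ref Nos are invented, and the truncation test simply assumes that exactly one final character has been lost.

```python
def cross_check(index_refs, folder_refs):
    """Compare Ref Nos in the Index with the names of the exported Folders."""
    index_set = set(index_refs)
    folder_set = set(folder_refs)
    no_folder = sorted(index_set - folder_set)  # indexed, but no Folder found
    no_index = sorted(folder_set - index_set)   # Folder found, but not indexed
    # Flag Folders whose name matches an unmatched Index Ref No with its
    # final character missing (assumes only one character was dropped).
    truncated = {
        folder: ref
        for folder in no_index
        for ref in no_folder
        if ref[:-1] == folder
    }
    return no_folder, no_index, truncated


# Only the PAW-BIT Ref No comes from the collection; the rest are made up.
index_refs = ["PAW-BIT-Nov2014-33-11-1148", "PAW-DOC-0001", "PAW-DOC-0002"]
folder_refs = ["PAW-BIT-Nov2014-33-11-114", "PAW-DOC-0001", "PAW-DOC-0003"]

no_folder, no_index, truncated = cross_check(index_refs, folder_refs)
print("In Index but no Folder:", no_folder)
print("Folder but not in Index:", no_index)
print("Probably truncated:", truncated)
```

Each mismatch still needs a human to work out which of the causes listed above applies; the sketch only narrows down where to look.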

The overall conclusion from this review of the integrity of the systems managing the document collection over some 37 years is that a substantial amount of human error has crept in, unobtrusively, over the years. Experience tells me that this is not specific to this particular system, but a general characteristic of all systems which are manipulated in some way or other by humans. From a digital preservation standpoint this is a specific risk in its own right since, as time goes by, as memories fade, and as people come and go, the knowledge about how and why these errors were made just disappears, making it harder to identify and rectify them.