The UK Web Archive

It’s been over a year since I wrote about this journey, so I’ll start this entry with a short recap of where I’m up to. Back in March 2019, I decided I would explore three different ways of archiving this pwofc website: first, by using tools provided by the company I pay to host the site; second, by using a tool called HTTrack; and third, by submitting the site for inclusion in the British Library’s UK Web Archive (UKWA).

My experience with the hosting site tools was less than satisfactory, and is documented in a post on 28 April 2019 entitled ‘A Backup Hosting Story’. My use of HTTrack was much more rewarding; it produced a complete backup of the whole of the site which could be navigated on my laptop screen with near instantaneous movement between pages, and which could be easily zipped into a single file for archiving. This is written up in the 30 April 2019 post titled ‘Getting an HTTrack copy’.

I’ve had to wait till now to relate my experience of submitting the site to the British Library’s UK Web Archive (UKWA), because the inclusion in the archive has been a little problematic. Here’s what happened: following a suggestion from Sara Thomson of the DPC, I filled in the form at https://beta.webarchive.org.uk/en/ukwa/info/nominate offering pwofc.com for archiving. Within about three weeks I received an email saying that the British Library would like to archive the site and requesting that I fill in the on-line licence form which I duly completed. A couple of days later, on 16th March 2019, I got an email confirming that the licence form had been submitted successfully and advising that: “Your website may not be available to view in the public archive for some time as we archive many thousands of websites and perform quality assurance checks on each instance. Due to the high number of submissions we receive, regrettably we cannot inform you when individual websites will be available to view in the archive at http://www.webarchive.org.uk/ but please do check the archive regularly as new sites are added every day.”

From then on I used the search facility at http://www.webarchive.org.uk/ every month or so to look for pwofc.com, but with no success. Over a year later, on 21st April 2020, I replied to the licence confirmation email and asked if it was normal to wait for over a year for a site to be archived or if something had gone wrong. The very prompt reply said, “Unfortunately there is a delay between the time we index our content and when it can be searched through the public interface. We aim to update our indexes as soon as possible and this is an issue we are trying to fix, please bear with us as we do have limited resources. Your site has been archived and it can be accessed through this link: https://www.webarchive.org.uk/wayback/archive/*/http://www.pwofc.com/.”

Sure enough, the link took me to a calendar of archiving activity, which showed that the site had been archived three times – twice on 1 July 2019 (both of which seemed to be complete and to work OK); and once on 13 March 2020 (which, when clicked, seemed to produce an endless cycle of loading). I reported this back to the Archivist, who scheduled some further runs and who, after these too were unsuccessful, asked if I could supply a site map. I duly installed the Google XML Sitemaps plugin on my pwofc.com WordPress site, provided the Archivist with the site map URL, https://www.pwofc.com/ofc/sitemap.xml, and the archive crawler conducted some more runs. The 13th run of 2020, on 22nd June, seemed to have been successful: the archived site looked just as it should. I then set about doing a full check of the archived site against the current live site to ensure that all the images were present, and that the links were all in place and working. The findings are listed below:

  • External links not collected: Generally speaking, the UKWA archive had not included web pages external to pwofc.com. Instead, when such a link is selected in the archive one of the following two messages is displayed: either “The url XXX could not be found in this collection” (where XXX is the URL of the external site); or “Available in Legal Deposit Library Reading Rooms only”. However, in at least two instances the link does actually open the live external web page. I don’t know what parameters produce these different results.
  • Link doesn’t work: For one particular link (with the URL ‘http://www.dpconline.org/advice/case-notes’), which appears in two separate places in the archive, there is no response at all when the link is clicked.
  • Home link doesn’t work on linked internal pages: Links to internal pages within pwofc.com all work fine in the archive. However, the Home button on the pages that are displayed after selecting such links doesn’t produce any response.
  • Image with a link on it not displayed: The pwofc.com site has two instances of an image with a link overlaid on it. The archive displays the title of the image instead of the image itself.
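A check like this could be partly automated. The following is only a minimal sketch, not the method used to produce the findings above: it takes the HTML of one page and lists its internal links, external links and images, so that the same list could be worked through on both the live site and the archived copy. The names LinkCollector and classify_links, and the example path /ofc/about, are hypothetical and chosen purely for illustration; fetching the pages themselves is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag and the src of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def classify_links(html, base_url, site_host="pwofc.com"):
    """Split a page's links into internal and external, and list its images.

    site_host is an assumption: any host equal to it, or ending in
    "." + site_host (e.g. www.pwofc.com), is treated as internal.
    """
    collector = LinkCollector()
    collector.feed(html)

    internal, external = [], []
    for href in collector.links:
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        host = parts.hostname or ""
        if host == site_host or host.endswith("." + site_host):
            internal.append(absolute)
        else:
            external.append(absolute)

    images = [urljoin(base_url, src) for src in collector.images]
    return {"internal": internal, "external": external, "images": images}
```

Counting the "external" entries across all pages would give a figure comparable to the number of external links on the site, and each "images" entry is a URL that should also display in the archived copy.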

On the whole, the archive provides quite a faithful reproduction of the site. However, the fact that no information was collected for most external web pages, and no link to the external live web pages is provided either, is quite a serious shortcoming for a site like pwofc.com which has at least 26 such links. Having said that, the archive aims to collect all the web sites on its books at least once a year, and all the different versions appear to be accessible from a calendared list of copies; so, should one be able to get on the UKWA roster, this would appear to be quite an effective way to back up or archive a blog.

A Story Board a Day Evaluation

Yesterday I started an evaluation of my Electronic Story Boards. It’s been over a year and a half since I first put them together, and since then I’ve looked at them occasionally; referred to them when I needed some specific information; and even forgotten that some information I knew I had was actually on one of them. However, I haven’t yet made a methodical assessment of how interesting, useful or effective they are. I’m going to try to do that by looking at a different story board every day, starting with No 1 and working my way through to the final one – No 35.

No 1 is the Levinson book on Pragmatics, and its story board effectively summarises my involvement in the Cosmos project. After looking at it, two words immediately came to mind – Rich, and Personal. Rich – because that one single page is full of content, every element bringing back powerful memories; and Personal – because all the content is to do with me.

Later on yesterday, I took a look at the electronic version on the iPad. It was simple to find – all 35 story boards are represented as thumbnails on a single Sidebooks screen on the iPad. Selecting the Pragmatics Story Board brought up a full screen image that looked exactly like the laminated version I’d been looking at on the side of my bookcase. It was just as rich and personal, and it also enabled me to click the arrows and bring up further pages of related material. But, interestingly, those further pages didn’t add a great deal to the experience. The sense of wonder and the powerful feelings that I experienced were generated by the material on the main story board: the additional material didn’t really augment them. However, I thought, those supporting pages would certainly be useful if you were specifically looking for detailed information.

That was my initial experience in this 35-day evaluation. I’ll make notes as I go, and summarise my conclusions in 5 or 6 weeks’ time.

New version 2.5 of the Maintenance Plan Template

A couple of days ago I completed an experiment to use the Maintenance Plan template to undertake initial Digital Preservation work on a collection instead of using the Scoping document. It proved to be very successful. The collection is relatively small with only 840 digital files of either jpg, pdf or MS Office format, so there were few complications and I was able to proceed through the Maintenance Plan process steps without any serious holdups. The whole exercise took just over a week with the majority of the time being taken up by the inventory check of the digital files and of about 300 associated physical artefacts. I used the structure of the Maintenance Plan to document what I was doing and to keep a handle on where I was up to.
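For anyone doing a similar inventory check, the first pass over the digital files can be helped along by a simple count of files per format. What follows is a rough sketch under my own assumptions, not the procedure in the Maintenance Plan template: the function names list_files and count_by_extension are hypothetical, and it assumes the whole collection sits under a single folder.

```python
from collections import Counter
from pathlib import Path


def list_files(root):
    """Return every file (not folder) found under root, at any depth."""
    return [p for p in Path(root).rglob("*") if p.is_file()]


def count_by_extension(paths):
    """Count files per extension, ignoring case (so .JPG and .jpg match).

    Files with no extension are counted under the label "(none)".
    """
    counts = Counter()
    for p in paths:
        suffix = Path(p).suffix.lower() or "(none)"
        counts[suffix] += 1
    return counts
```

Calling count_by_extension(list_files("path/to/collection")) (the folder name here is just a placeholder) gives totals that can be compared with the collection’s index, and any unexpected extension stands out straight away.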

As a result of this exercise I’ve now added the following guidance to the beginning of the Maintenance Plan template, and equivalent text to the beginning of the Scoping document template:

If this is the first time that Digital Preservation work has been done on a collection

EITHER use the Scoping template to get started (best for large, complex collections)

OR use this Maintenance Plan template to get started (can be effective for smaller, simpler collections – retitle it to ‘Initial Digital Preservation work on the @@@ collection’ and ignore sections Schedule, 3, 4 and 7)

This concludes the interim testing and revision of the Maintenance Plan template. It has resulted in some substantial changes in the latest version, 2.5, of the document (an equivalent version 2.5 of the Scoping document template has also been produced). The final and most substantial test of the Maintenance Plan template will take place in September 2021, when the large and complex PAWDOC collection is due to undergo its first maintenance exercise.