The UK Web Archive

It’s been over a year since I wrote about this journey, so I’ll start this entry with a short recap of where I’m up to. Back in March 2019, I decided I would explore three different ways of archiving this pwofc website: first, by using tools provided by the company I pay to host the site; second, by using a tool called HTTrack; and third, by submitting the site for inclusion in the British Library’s UK Web Archive (UKWA).

My experience with the hosting site tools was less than satisfactory, and is documented in a post on 28 April 2019 entitled ‘A Backup Hosting Story’. My use of HTTrack was much more rewarding: it produced a complete backup of the whole site which could be navigated on my laptop screen with near-instantaneous movement between pages, and which could easily be zipped into a single file for archiving. This is written up in the 30 April 2019 post titled ‘Getting an HTTrack copy’.

I’ve had to wait till now to relate my experience of submitting the site to the British Library’s UK Web Archive (UKWA), because the inclusion in the archive has been a little problematic. Here’s what happened: following a suggestion from Sara Thomson of the DPC, I filled in the form at https://beta.webarchive.org.uk/en/ukwa/info/nominate offering pwofc.com for archiving. Within about three weeks I received an email saying that the British Library would like to archive the site and requesting that I fill in the on-line licence form which I duly completed. A couple of days later, on 16th March 2019, I got an email confirming that the licence form had been submitted successfully and advising that: “Your website may not be available to view in the public archive for some time as we archive many thousands of websites and perform quality assurance checks on each instance. Due to the high number of submissions we receive, regrettably we cannot inform you when individual websites will be available to view in the archive at http://www.webarchive.org.uk/ but please do check the archive regularly as new sites are added every day.”

From then on I used the search facility at http://www.webarchive.org.uk/ every month or so to look for pwofc.com, but with no success. Over a year later, on 21st April 2020, I replied to the licence confirmation email and asked if it was normal to wait for over a year for a site to be archived or if something had gone wrong. The very prompt reply said, “Unfortunately there is a delay between the time we index our content and when it can be searched through the public interface. We aim to update our indexes as soon as possible and this is an issue we are trying to fix, please bear with us as we do have limited resources. Your site has been archived and it can be accessed through this link: https://www.webarchive.org.uk/wayback/archive/*/http://www.pwofc.com/.”

Sure enough, the link took me to a calendar of archiving activity, which showed that the site had been archived three times – twice on 1 July 2019 (both of which seemed to be complete and to work OK); and once on 13 March 2020 (which, when clicked, seemed to produce an endless cycle of loading). I reported this back to the Archivist, who scheduled some further runs and who, after these too were unsuccessful, asked if I could supply a site map. I duly installed the Google XML Sitemaps plugin on my pwofc.com WordPress site and provided the Archivist with the site map URL, http://www.pwofc.com/ofc/sitemap.xml, and the archive crawler conducted some more runs. The 13th run of 2020, on 22nd June, seemed to have been successful: the archived site looked just as it should. I then set about doing a full check of the archived site against the current live site to ensure that all the images were present, and that the links were all in place and working. The findings are listed below:

  • External links not collected: Generally speaking, the UKWA archive had not included web pages external to pwofc.com. Instead, when such a link is selected in the archive one of the following two messages is displayed: either “The url XXX could not be found in this collection” (where XXX is the URL of the external site); or “Available in Legal Deposit Library Reading Rooms only”. However, in at least two instances the link does actually open the live external web page. I don’t know what parameters produce these different results.
  • Link doesn’t work: For one particular link (with the URL ‘http://www.dpconline.org/advice/case-notes’), which appears in two separate places in the archive, there is no response at all when the link is clicked.
  • Home link doesn’t work on linked internal pages: Links to internal pages within pwofc.com all work fine in the archive. However, the Home button on the pages that are displayed after selecting such links doesn’t produce any response.
  • Image with a link on it not displayed: The pwofc.com site has two instances of an image with a link overlaid on it. The archive displays the title of the image instead of the image itself.
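
For anyone wanting to repeat this kind of check with less clicking, here is a minimal Python sketch of how the page list and the internal/external link split could be prepared. It is not what I actually did (my check was manual), and it makes several assumptions: the function names are my own invention, the archive calendar address is simply the prefix from the Archivist’s link with a page URL appended, and a link counts as internal if its host matches the page’s host once any leading “www.” is ignored. It does not fetch anything; it only works on site map and page text you have already downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import xml.etree.ElementTree as ET

# Assumption: prefix copied from the calendar link quoted in the Archivist's reply.
ARCHIVE_PREFIX = "https://www.webarchive.org.uk/wayback/archive/*/"


def archive_calendar_url(live_url):
    """Build the (assumed) archive calendar address for a live page URL."""
    return ARCHIVE_PREFIX + live_url


def sitemap_urls(sitemap_xml):
    """Return every <loc> value found in a sitemap XML string."""
    root = ET.fromstring(sitemap_xml)
    # Tags usually carry the sitemap namespace, e.g. "{http://...}loc".
    return [el.text.strip() for el in root.iter()
            if el.tag.rsplit("}", 1)[-1] == "loc" and el.text]


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def _host(url):
    # Treat www.example.com and example.com as the same site.
    return urlparse(url).netloc.lower().removeprefix("www.")


def split_links(page_url, html):
    """Split a page's links into (internal, external) lists of absolute URLs."""
    collector = _LinkCollector()
    collector.feed(html)
    internal, external = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        bucket = internal if _host(absolute) == _host(page_url) else external
        if absolute not in bucket:
            bucket.append(absolute)
    return internal, external
```

Feeding each site map entry to archive_calendar_url gives a list of archive addresses to visit, and the external list from split_links gives the links most likely to show one of the “not collected” messages described above.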

On the whole, the archive provides quite a faithful reproduction of the site. However, the fact that no information was collected for most external web pages, and that no link to the live external pages is provided either, is quite a serious shortcoming for a site like pwofc.com, which has at least 26 such links. Having said that, the archive aims to collect all the websites on its books at least once a year, and all the different versions appear to be accessible from a calendared list of copies; so, should one be able to get on the UKWA roster, this would appear to be quite an effective way to back up or archive a blog.
