About 6 weeks ago (on 6th March), Sara Thomson of the Digital Preservation Coalition kindly spent some time on the phone with me discussing the archiving of web sites. I wanted to find out if there were any solutions other than the ones I had stumbled across in my brief internet search some 16 months ago. Sara suggested 3 approaches which were new to me and described them as follows in a subsequent email:
- UK Web Archive (UKWA) ‘Save a UK Website’: https://beta.webarchive.org.uk/en/ukwa/info/nominate Related to this – two web curators from the British Library (Nicola Bingham and Helena Byrne) presented at a DPC event last year discussing the UKWA, including the Save a UK Website function. A video recording of their talk along with their slides (and the other talks from the day) are here: https://dpconline.org/events/past-events/web-social-media-archiving-for-community-individual-archives
- HTTrack: https://www.httrack.com/ I gave a brief overview of HTTrack at that same DPC event last year that I linked to above. I have also included my slides as an attachment here – the HTTrack demo starts on slide 15.
- Webrecorder: https://webrecorder.io/ by Rhizome. Their website is great and really informative, but let me know if you have any questions about how it works.
Shortly after this, I followed the link that Sara had provided to the UKWA nomination site and filled in the form for pwofc.com. On 14th March I got a response saying that the British Library would like to archive pwofc.com and requesting that I fill in an on-line licence form, which I duly completed. On 16th March I decided to explore the contents of the UKWA service and found it collects ‘millions of websites each year and billions of individual assets (pages, images, videos, pdfs etc.)’. I started looking at some of the blogs. The first one I came across was called ‘Thirteen days in May’ and was about a cycling tour – but it seemed to lack some of the photos that were supposed to be there. The next two I looked at, however, did seem to have their full complement of photos; and one of them (called ‘A Common Reader’) had a strangely coincidental entry about ‘Instapaper’, which provides what sounds like a very useful service for saving web sites for later reading. It looks like the UKWA does an automated trawl of all the websites under its wing at least once a year, so I guess that, as a backup, it should never be more than a year out of date.
An hour after completing this exploration, I got an email confirming that the licence form had been submitted successfully and advising that the archiving of pwofc.com would proceed as soon as possible but that it may not be available to view in the archive for some time due to the many thousands of web sites being processed and the need to do quality assurance checks on each. Since then, I’ve been checking the archive every now and again, but pwofc.com hasn’t emerged yet. When it does, it’ll be interesting to see how faithfully it has been captured.
Regarding the other two suggestions that Sara made, I’ve decided to discount Webrecorder, as it entails visiting every page and link in a website, which would just take too much time and effort for pwofc.com. However, I’m going to have a go at using HTTrack, and I’m also going to try to get a backup of pwofc.com from my web hosting service. Having experienced all these various archiving solutions, there’ll be an opportunity to compare the various approaches and reach some conclusions.