A cursory tour of web archiving

Web archiving isn’t a simple proposition because not only do web sites keep changing, but they also have links to other sites. So, I guess I should have expected that my search for web archiving tools would come up with a disparate array of answers. It seems that the gold-plated solution is to pay a service such as Smarsh or PageFreezer to periodically take a snapshot of a website and to store it in their cloud. The period is user-definable and can be anything from every few hours to every month or year. Smarsh was advertising its basic service at $129 a month at the time of writing.

A more basic, do-it-yourself option is wget, the Unix command-line tool, for which a downloadable Windows version is available. It allows all sorts of operations to be specified, including downloading part or all of a site and, when paired with the operating system's scheduler, running downloads at set times. However, as you might expect with a Unix tool, it requires the user to type programming-style commands and to be aware of a large number of options.
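To give a flavour of what those commands look like, here is a minimal sketch in Python that assembles a typical wget site-mirroring command. It is not something I have run against this blog: the function name, the example.com address and the output folder are all made up for illustration, and the particular set of options is just one plausible choice among many.

```python
import shlex


def build_wget_mirror_command(url, dest_dir, wait_seconds=1):
    """Return the argument list for a wget run that mirrors `url` into `dest_dir`.

    The function name and defaults are hypothetical; the flags are standard wget options.
    """
    return [
        "wget",
        "--mirror",             # recursive download with timestamping
        "--convert-links",      # rewrite links so the copy can be browsed offline
        "--adjust-extension",   # save pages with an .html extension where needed
        "--page-requisites",    # also fetch the images and stylesheets pages need
        "--no-parent",          # never climb above the starting directory
        f"--wait={wait_seconds}",            # pause between requests
        f"--directory-prefix={dest_dir}",    # folder in which to save the copy
        url,
    ]


if __name__ == "__main__":
    # Placeholder address and folder, used only to show the resulting command line.
    command = build_wget_mirror_command("https://example.com/", "site-mirror")
    print(shlex.join(command))
    # To actually run it (wget must be installed):
    #   import subprocess; subprocess.run(command, check=True)
```

The sketch only prints the command rather than running it, so it can be inspected and adjusted first. Repeating it on a schedule would be a job for cron on Unix or Task Scheduler on Windows.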

More limited services such as Archive.is are available to capture, save and download individual pages – and some of these are free to use.

Regarding formats in which web archives can be saved, the Library of Congress’ preferred format is the ISO WARC (Web ARChive) file format. However, I was unable to find any tools or services which purport to store files in this format: it sounds like WARC is being used in the background by large institutions who are trying to preserve large volumes of web content. Interestingly, the web hosting service I use for this blog actually offers backups in various forms of zip files; and indeed, it is zip files that I have used in the past to store web sites that are included in my document collection.
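For anyone who would rather script the zip approach than do it by hand, the following is a minimal sketch using Python's standard library. It assumes the site's files have already been saved into a local folder; the function name and the folder and file names in the demonstration are invented for illustration and are not how my hosting service produces its backups.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_site(site_dir, archive_stem):
    """Pack everything under `site_dir` into `<archive_stem>.zip` and return its path.

    `archive_stem` should point outside `site_dir` so that the archive
    does not end up trying to include itself.
    """
    archive_path = shutil.make_archive(str(archive_stem), "zip", root_dir=str(site_dir))
    return Path(archive_path)


if __name__ == "__main__":
    # Demonstration with a throwaway folder standing in for a saved web site.
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp) / "old-site"
        site.mkdir()
        (site / "index.html").write_text("<html><body>Hello</body></html>")

        archive = zip_site(site, Path(tmp) / "old-site-backup")
        with zipfile.ZipFile(archive) as zf:
            print(zf.namelist())
```

Paths inside the archive are stored relative to the site folder, so unzipping it later recreates the same layout.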

Based on this very quick and certainly incomplete tour of the topic of Web Archiving, I’ve decided I won’t be trying to do anything fancy or different in the way I use technology to archive my old web sites. The zip format has worked well up to now and I see no reason to change that approach. As for a non-technological solution to web archiving, the notion of creating and binding a physical book of the first five years of this OFC web site is becoming more and more attractive. There’s something very solid and immutable about a book on a bookshelf. I’m definitely going to do that, and have set the end of 2017 as the cut-off date for its contents – I’m busy trying to make sure that the Journeys are all at appropriate stages by the 31st December.

The Printing Solution

Pwofc.com was born 5 years ago and, as it has covered more topics and grown in size, the likelihood of being able to reconstitute it, should some disaster occur, seems to be becoming increasingly remote. So, when I started to systematically go through every entry in the blog to tease out OFC insights, it occurred to me that I could, at the same time, copy the contents into a Word document which could subsequently be printed and bound into a hardcopy book in just the same way as the Sounds for Alexa book has been produced. That’s what I did, and I now have a 227-page document containing the main contents of the site. I now need to add in the 40 Appendix documents which have links from the main text. The final book may well have around 400 pages or more – but that shouldn’t present a bookbinding problem.

I haven’t established yet whether there is a standard website archiving solution which makes it easy to reconstitute and access a site; however, even if there is one, I think I shall feel more comfortable knowing that I actually have all the content in a single backed-up file. I shall feel even more comfortable when I have the book of pwofc.com in my bookcase.

A Blog is a Fragile Beast

A blog is quite a fragile beast when compared to a physical notebook; it requires reliable storage for a whole suite of files as well as appropriate software and hardware to display its contents. As this OFC blog has grown in size, I have started to wonder what would happen if, for some reason, the backend files were destroyed, or if I simply stopped paying the hosting charges. Right now I have no way of being able to reconstitute the blog outside my hosting company’s infrastructure. This is what I shall be exploring in subsequent entries.

In many ways this is a digital preservation exercise. However, I’ve decided to address it as a topic in its own right because of the special characteristics of blogs; they are large integrated entities which have to be accessed as a whole. This OFC blog is nothing special; it is built on the WordPress platform, which enables text and images to be created directly without having to deal with HTML code; and it is hosted by a specialist company. I pay an annual fee for the domain name – pwofc.com – and for the hosting service; and as part of that service I can ask the hosting company to take a backup of the site whenever I wish. WordPress is free and gets updated from time to time. In other words, this is a very standard configuration of the type probably used by millions of bloggers. I’m hoping, therefore, that a practical archiving solution which enables a site to be easily reconstituted will be readily available. We will see.