Checking the Collection

Two of the remaining things to be done with my lifetime document collection are to:

a) scan the remaining paper (documents not yet scanned because they were labelled as artefacts to be retained in both their paper and electronic form); and

b) go through all the index entries making sure they contain valid information and that there is an equivalent scan in the Document Management System.

For a), some of the paper documents have comb bindings and will require a binding machine if they are to be scanned using a sheet feeder and then reassembled in the comb binding. I acquired a very cheap comb binding machine on eBay some three weeks ago (though it seems it was a false economy – it stopped functioning properly and I had to send it back yesterday…) and have made a start on scanning the remaining paper. I’m addressing b) in parallel, and recording any issues or key points I find using the following notation in the ‘Movement Status’ field:

OK = The Index entry is as complete as possible and there is an equivalent scanned version

XX = There is a serious issue with this item.

Should the index entry and scans be present but there are some points to be recorded about them, the ‘OK’ notation is qualified within brackets as follows (multiple qualifications can be recorded within the brackets as necessary, separated by commas):

  • OK(multi): one or more of the equivalent scanned files in the FISH Document Management System are in the form of multiple TIF files – one for each page. FISH obscures the fact that there is a separate file for each page – but that is how the scan is actually stored.
  • OK(n docs): This identifies when there is more than one scanned document associated with this index entry – where n is the number of separate documents (this is a feature of this approach to electronic filing – multiple documents can be stored under a single Index entry).
  • OK(poor): the quality of some or all of the scanned electronic pages is poor.
  • OK(dbl): one or more of the associated scanned files came from documents with double sided pages which have been scanned all of one side first and then the pile turned over and the other side scanned. When this has been done the scanned pages are out of order. This was done with the first two scanners I had which were not able to handle double sided pages.
  • OK(ord): the pages of one or more of the scanned files are out of order for a reason other than the ‘dbl’ reason above.
  • OK(left): the original document was deliberately left at the location of the employer concerned when I moved jobs.
  • OK(A5): one of the scanners I had was not able to handle A5 pages reliably and sometimes recorded a line as an image dragged down the page for an inch or more.

Should an XX notation be applied to an Index entry, the reason it is being noted as such is recorded in brackets with one or more of the following notations:

  • XX(lost): the paper document was lost before a scan could be taken, so the Index entry is the only trace left of this document.
  • XX(ref): The Reference No is duplicated or incorrect in some other way.
  • XX(pap): The document is still only in paper form because its form is such that it has not yet been possible to digitise it effectively.

The fact that such points and issues are present in the collection in noticeable numbers simply reflects the fact that, when dealing with such large volumes of material in the course of performing busy jobs across many years, it is inevitable that things will go wrong and mistakes will be made. Once I have been through the whole of the index, I’ll have statistics about the overall prevalence of such issues in this particular collection.
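As an illustration, the Movement Status notation described above could be parsed and tallied with a few lines of Python. This is only a sketch under stated assumptions: the function names and the sample values are made up for the example, and it assumes the field holds exactly the codes listed above, with the ‘n docs’ qualifier written with an actual number (for example ‘3 docs’).

```python
import re
from collections import Counter

# Qualifier names taken from the notation listed above.
OK_QUALIFIERS = {"multi", "poor", "dbl", "ord", "left", "A5"}
XX_QUALIFIERS = {"lost", "ref", "pap"}

# "OK" or "XX", optionally followed by bracketed qualifiers.
STATUS_RE = re.compile(r"^(OK|XX)\s*(?:\((.*)\))?$")
# Assumed spelling of the "n docs" qualifier, e.g. "3 docs".
N_DOCS_RE = re.compile(r"^(\d+)\s*docs$")


def parse_status(value):
    """Return (code, qualifiers) for one field value, or raise ValueError."""
    match = STATUS_RE.match(value.strip())
    if not match:
        raise ValueError(f"unrecognised status: {value!r}")
    code, inner = match.group(1), match.group(2)
    qualifiers = []
    if inner is not None:
        allowed = OK_QUALIFIERS if code == "OK" else XX_QUALIFIERS
        for part in inner.split(","):
            part = part.strip()
            if code == "OK" and N_DOCS_RE.match(part):
                # Fold "2 docs", "3 docs", ... into one category.
                qualifiers.append("n docs")
            elif part in allowed:
                qualifiers.append(part)
            else:
                raise ValueError(f"unknown qualifier {part!r} in {value!r}")
    return code, qualifiers


def tally(values):
    """Count each code, and each code(qualifier) pair, across many field values."""
    counts = Counter()
    for value in values:
        code, qualifiers = parse_status(value)
        counts[code] += 1
        for qualifier in qualifiers:
            counts[f"{code}({qualifier})"] += 1
    return counts


# Hypothetical field values, invented for this example.
sample = ["OK", "OK(multi, dbl)", "OK(3 docs)", "XX(lost)"]
summary = tally(sample)
```

Raising an error on anything unrecognised is a deliberate choice here: a mistyped qualifier then shows up as a problem to fix rather than silently distorting the statistics.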

Search Status

In my last entry I said I’d contacted six potential repositories for my lifetime document collection. This is where those communications are up to:

Loughborough University’s Centre for Information Management: My email was forwarded to the University Library which did not respond. I followed this up on 20th September with an email to the Director of Library Services, and am waiting for a reply.

Manchester University’s Computer Science Dept Research Office: My email was forwarded to a researcher with an interest in the history of computing, but that person replied saying that her work in that area had been put on hold. On 20th September I used the University library general enquiry form to enquire if the library would be interested. The library advised me to contact the Head of School administration in the School of Computer Science, whom I duly emailed on 2nd October, and I am awaiting a reply.

City University’s Cass Business School: My contact said he would pass my email on to his colleagues, and I have heard nothing further.

UCL’s Dept of Information Studies: My contact said she would look out for interested people at conferences.

The National Archives: I was advised by my contact to direct my question to the Archive Sector Development team which, while it does not have any direct provision for taking private collections, should be well placed to provide advice. I emailed the Development Team on 16th September and am waiting for a reply.

The Science Museum Wroughton Library and Archives: The Library asked me a number of questions about the collection but finally responded by saying: “Thank you for allowing us the time to consider your collection which we have now discussed with the Archive Collections Manager, Science Museum’s Keeper of Technologies and Engineering, and Head of Library and Archives. We have concluded that whilst we find this a most interesting idea, we do not think that the content fits within our current collecting policy criteria. You may have already contacted them, but we suggest that the National Computing Collection might be a more appropriate repository for your collection.”

This is all pretty much as expected: I know it’s going to be hard to find a repository that’s interested. However, should there be no interest from any of the above organisations, I plan to rely on interest being generated by the publication of my paper on Digital Preservation Planning.

Repository sought

In the last few months I’ve been making good progress on figuring out how to undertake a Digital Preservation project. Since I’m getting close to being ready to undertake digital preservation work on the PAW/DOC collection, I decided to make an attempt to find a home for the collection before I start. That way, I can tailor the digital preservation work to the requirements of the receiving repository – should I be lucky enough to find anyone who is interested. Anyway, I now have a short two pager to send to repositories which might be interested. This is the second version. Dave Thompson of the Wellcome Foundation (who I met on the UCL online Digital Curation course) was kind enough to comment on the first version and his observations resulted in a substantial rewrite. I’ve sent it to six organisations – Loughborough University’s Centre for Information Management, Manchester University’s Computer Science Dept, City University’s Cass Business School, UCL’s Dept of Information Studies, the National Archives, and The Science Museum Wroughton Library and Archives. If I get a positive response from any of these, all well and good. If I do not, I shall proceed with the Digital Preservation work as planned.

Some three years ago I made a list of activities I wanted to undertake with the PAW/DOC collection, and this seems a good moment to summarise where I’m up to – the activities and their status are described below:

  • Scan the remaining 4 boxes of paper. Take the opportunity to explore scanning in colour and using PDF. Possibly also using OCR – though this is of much lower priority. DONE (but not OCRd)
  • Write a paper on “The paper artefact in the digital age” using an analysis of the contents of PAW/DOC as the basis for the paper. DONE
  • Explore the issues of longevity and survivability of file formats and of digital indexing and file management systems, using PAW/DOC as the basis for the work. This could also include moving the material from FISH and even Filemaker. STILL TO BE DONE
  • Revisit all the requirements listed in my 2001 BIT paper to identify current status and opportunities for further work. STILL TO BE DONE
  • Scan all remaining PAW/DOC paper i.e. all those items in the three archive boxes (most of which have been identified as artefacts to be retained in their physical form). STILL TO BE DONE – but next on list – I’m trying to find a binding machine to be able to sheet feed the documents with comb binding
  • Check that all index entries are valid (i.e. not blank and with an appropriate Movement Field entry) and have an associated populated FISH entry. STILL TO BE DONE
  • Write up a guide to the material and to the technology supporting it. STILL TO BE DONE
  • Hand over PAW/DOC and its supporting technology to the new owner and provide training for the people who will be managing it going forward. STILL TO BE DONE

An Update – This Work on Hold

This work has lain dormant for a little while now – but only because I’ve been focusing on other supporting activities. In particular, I’m exploring the field of Digital Preservation with the aim of undertaking work to ensure that the contents of my work document collection are long lasting. In the process of doing that I’m also trying to publicise the existence of the collection in order to find someone who might be interested in giving it a long term home. So, I don’t intend to do any further work on Personal Document Management until I’ve finished the Digital Preservation investigation.

For the record, I did actually go and talk to Jenny Bunn’s Digital Curation students at UCL on 27Feb2014. I talked for about 20 minutes, provided a handout (the odd layout is because it is designed to be printed double sided), and there was some Q&A at the end. I also had an interesting conversation afterwards with Jenny. However, it prompted no further interest in the work document collection.

Finally, a word about Ann O’Brien of Loughborough University, who I started collaborating with on this topic in early 2013. The last contact I had with her was in September of that year, and I had heard nothing more from her or about her until I read in the November 2014 issue of the Loughborough University Alumni magazine that she had died in May 2014. Tom Jackson of Loughborough’s Centre for Information Management, where she worked, confirmed in an email that she had died of a heart attack and that her death had come as a huge shock. I’d like to record here that, in our brief collaboration, Ann was very helpful to me and gave me a number of substantial steers which moved the work I was doing forward both in terms of content and contacts.

Intrinsic Value of Artefacts

One of the people Neil Beagrie suggested I get in touch with was Elizabeth Shepherd, an Archivist and Records Management specialist in UCL’s Department of Information Studies. I duly emailed her early in Dec2013 and she asked Jenny Bunn, a Lecturer in the Department who is initiating a new teaching module on Digital Curation in January 2014, to contact me. Since then, Jenny and I have had a number of exchanges and we have agreed that there is potential for her students to make use of my document collection as a resource – though there is too little time to sandwich it into the early 2014 syllabus. Instead, I may go down to speak with her students in February or March.

Jenny also alerted me to a report on the Intrinsic Value of documents produced by the US National Archives and Records Service (NARS) in 1980. This is highly relevant to the work I am doing on the artefact in the digital age. So much so, that it has inspired me to define a clear set of research activities to establish if the NARS Intrinsic Value characteristics are relevant in Personal Information Management practices. Since this is now a distinct piece of work with clear objectives I shall continue to report on it under the separate heading of Digital Age Artefacts.

New Scanner – Canon DR-2020U

Last Friday my new scanner – a Canon DR-2020U ADF + Flatbed – was delivered, and I have spent the last few days trying to integrate it into my system and exploring its functions. I ordered it through Tradescanners who have an excellent web site enabling comparisons to be made between a wide range of products. The scanner arrived within 24 hours of me placing the order, which was excellent. Unfortunately, I’ve experienced two different sets of problems: first, my BT Digital Vault software seems to interrupt the scanner software significantly (a problem widely reported on the net – the underlying software, FSHosting, just hogs the CPU); and secondly, my existing scanner and Document Management software, which could use an ISIS driver but doesn’t because I haven’t got one for it, seems to interfere with the ISIS driver that came with the Canon scanner. Other than that, the new scanner seems to do everything it’s supposed to – full duplex scanning of both sides of the paper as it goes through, paper size detection, blank page detection and elimination, and saving to PDF, JPG or TIF as required. I’m pleased – but am having to work through the problems.

The field has exploded in the last 15 years

In an effort to understand what is going on in the world of Personal Electronic Filing, a few weeks ago I emailed some people I had identified from papers and web searches. The results have been very rewarding.

It is now clear to me that what was a niche area in the 1990s has expanded hugely to become a topic in its own right with a large body of literature and a worldwide community of interest. The rise of personal computing, email, social media and the mobile phone has effectively made most individuals – whether they know it or not – personal information managers; personal information is now considered to extend to photos, calendar entries, text messages, social media material etc.; and the ubiquity of electronic media has necessitated the development of the field of data forensics to capture and identify evidence. The field of Data Preservation is of particular interest to Libraries and Museums which are grappling with the practical problems of curating collections which include digital material. There appear to be many initiatives underway in all these areas, of which various EEC-funded projects, the UK Data Preservation Coalition, the US Library of Congress guidance notes, and William Jones’ Personal Information Management workshops are probably just the tip of the iceberg. I’m grateful to Neil Beagrie for linking me into much of this material.

With this new awareness I have begun to try and understand the role that my personal collection might have. In particular, I’m wondering if it could become a Test Set for exploring Data Preservation issues rather than the original aim of being a Test Set for Personal Indexing and Retrieval (an objective which seems to have become defunct since the rise of the Search Engine). This could be a useful focal point in my continuing search to find people to collaborate with.


A Second Column for Facets

I’ve been giving the Excel Index that I developed last year a lot of use – mainly for the Memento Management activity – and I’ve decided that having just one column for Facet is not enough. Inevitably there are cases where you want to specify two facets (for example, Loughborough and Rugby) and this is easily done by just putting one after the other with a comma between them in the single Excel cell. The trouble is that Excel’s filter facility lists things alphabetically so, in the example above, if you look for Loughborough the entry “Loughborough, Rugby” appears in the appropriate position. However, if you are looking up “Rugby”, the “Loughborough, Rugby” entry does not appear in that position, so you may miss that particular item related to Rugby.

I’ve addressed the problem by including a second column for Facet, and by including both entries in both columns but with one in reverse order to the other, for example, in Column 1 “Loughborough, Rugby” and in column 2 “Rugby, Loughborough”. This ensures that, provided a search is done in both columns for a particular facet, you will find every instance of that facet and all secondary facets used with the facet being searched for.
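The effect of the second column can be sketched in a few lines of Python. This is a minimal illustration rather than anything taken from the actual Excel index: the function names, the dictionary keys and the sample items are all made up, and it assumes an entry has either one or two facets.

```python
def facet_columns(facets):
    """Return the (column 1, column 2) cell text for one index entry.

    Column 2 holds the same facets as column 1 but in reverse order.
    """
    if not 1 <= len(facets) <= 2:
        raise ValueError("this sketch handles one or two facets only")
    return ", ".join(facets), ", ".join(reversed(facets))


def make_row(title, facets):
    """Build one hypothetical index row with both facet columns filled in."""
    col1, col2 = facet_columns(facets)
    return {"title": title, "col1": col1, "col2": col2}


def leading_facet(cell):
    """The facet a cell is listed under alphabetically (the text before any comma)."""
    return cell.split(",")[0].strip()


def rows_listed_under(rows, facet):
    """Rows that appear at the facet's alphabetical position in either column."""
    return [
        row for row in rows
        if leading_facet(row["col1"]) == facet or leading_facet(row["col2"]) == facet
    ]


# Hypothetical entries, invented for this example.
rows = [
    make_row("Item A", ["Loughborough", "Rugby"]),
    make_row("Item B", ["Rugby"]),
    make_row("Item C", ["Loughborough"]),
]

rugby_titles = [row["title"] for row in rows_listed_under(rows, "Rugby")]
```

Looking in column 1 alone, “Item A” would be listed only under Loughborough; checking both columns picks it up under Rugby as well.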

Reasons for Keeping Hardcopy

I’ve been doing some preliminary practical work for the study of ‘the artefact in the digital age’ that I’m doing with Ann O’Brien. To gain an idea of the range of reasons for keeping hardcopy rather than just having a digitised version, I’ve reviewed the 357 items that I have chosen to keep rather than scan and throw away. Nineteen categories emerged. Ann and I will use this initial insight to plan in detail the practical work I am going to do in scanning four boxes of material that have not yet been scanned.

Storing Large Movie Files

In the last week or so I’ve been exploring the use of video editing and conversion tools – primarily to deal with the conversion and storage of personal cine film. However, since I also have about 15 pieces of video indexed in my document management system, I decided to try and use the same tools on them. The video is all on DVDs for two main reasons. First, most of it is in DVD video format (multiple files with extensions of either VOB, IFO or BUP) which is not conducive to storing in a document management system except in a zip file; and second, up to now I haven’t had sufficient storage in my PC to cope with the sizeable volumes of video files. Both of these constraints can now be overcome. I have the tools to convert multiple DVD video files into a single MP4 file; and my current laptop has 750 GB of storage of which two thirds is currently empty.

Overall, the exercise of moving the material on DVDs into the FISH document management system has been successful. All but one item has been transferred, and all the movies play directly from FISH when selected. However, a number of experiences are worth recounting:

  • Inaccessible DVD files: One DVD was transferred from VHS video to DVD by IC Video Ltd in the same way as several of the others. However, although it usually plays OK when put into the DVD slot, neither the FreeStudio nor the Movie Maker software I’m using was able to convert successfully from the DVD. Indeed, I’m not even able to copy it from the DVD to my laptop. So that material will have to stay on the DVD until I find a solution.
  • Movie file formats: I also have a number of TV programmes downloaded from the net with .MPEG extensions which play successfully on my laptop. However, my FISH document management system does not seem to support that extension. I did try changing the extension to MP4 (which FISH does support) but found the quality was much reduced. In the end I discovered that FISH does support a .M1V extension with which the files do play successfully. This prompted me to read up about MPEG and I discovered it has many versions with .M1V predictably being MPEG1. I don’t plan to spend a great deal of time trying to understand all about movie formats, and am just happy to have found an extension that plays on my laptop and can be stored in my document management system. However, it has reminded me that there is much greater complexity in all these standards than meets the eye and consequently I shall be keeping the physical DVDs in case I encounter problems downstream.
  • DVDs with data files: A few of the physical DVDs contain not video files but collections of ordinary files. For example, one is the installation disk for an old version of my FISH document management system. Another is the installation disk for my Home Use versions of Microsoft Word and Excel. With these I simply zipped up the files and stored the zip file in FISH.
  • Large file sizes: Although file size is no longer such a problem on the laptop as previously, it still needs to be taken into account for backup purposes. All the material in my document management system is placed into so-called ‘bins’ – standard Windows folders specially configured by FISH. Over the years I have limited the size of the bins to around 200 MB to facilitate manageability and the taking of backups. These days I’m able to back up to 4.7 GB DVDs – though I still try and keep the bin size to around 200 MB. With these movie files, however, some of the file sizes are well over 1 GB, so I have created specific bins for them to go in and ensured that the sum of the files within each bin does not exceed 4.7 GB (which is the advertised size of the DVDs) so that I will still be able to take the necessary backups. Unfortunately, as I discovered when trying to burn to disk, the usable size of blank DVDs is somewhat less at 4.37 GB. Consequently, I had to move some of the files around the bins I had created to keep the bins under the 4.37 GB limit. After over 30 years of personal computing, things are still never easy….
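The gap between the advertised 4.7 GB and the usable 4.37 GB is most likely a matter of units: disc capacities are advertised in decimal gigabytes (10^9 bytes), while operating systems commonly report sizes in binary gigabytes (2^30 bytes). The arithmetic, and a simple check of bin totals against the disc capacity, can be sketched as follows. This is an illustration only: the function names, bin names and file sizes are invented, and the round figure of 4,700,000,000 bytes is a nominal assumption, since the exact capacity varies slightly by disc type.

```python
GB = 10**9    # decimal gigabyte, the unit behind the advertised "4.7 GB"
GIB = 2**30   # binary gigabyte, the unit many operating systems display as "GB"

# Nominal advertised capacity of a single-layer DVD (an assumed round figure).
ADVERTISED_BYTES = 4_700_000_000


def in_binary_gigabytes(n_bytes):
    """Convert a byte count to binary gigabytes (GiB)."""
    return n_bytes / GIB


def bins_over_limit(bins, limit_bytes=ADVERTISED_BYTES):
    """Return the names of bins whose files total more bytes than the disc holds.

    `bins` maps a bin name to a list of file sizes in bytes.
    """
    return [name for name, sizes in bins.items() if sum(sizes) > limit_bytes]


# The advertised capacity expressed in the units the operating system shows.
usable_as_displayed = in_binary_gigabytes(ADVERTISED_BYTES)

# Hypothetical bins: sizes written the way the operating system would show them.
example_bins = {
    "movies-1": [int(2.0 * GIB), int(2.2 * GIB)],  # displays as 4.2 "GB" in total
    "movies-2": [int(2.5 * GIB), int(2.0 * GIB)],  # displays as 4.5 "GB" in total
}
too_big = bins_over_limit(example_bins)
```

In this made-up example the second bin displays as 4.5 GB, which looks as if it is under the advertised 4.7 GB, yet it holds more bytes than the disc can take. Comparing byte counts directly avoids the ambiguity.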