Repository sought

In the last few months I’ve been making good progress on figuring out how to undertake a Digital Preservation project. Since I’m getting close to being ready to undertake digital preservation work on the PAW/DOC collection, I decided to make an attempt to find a home for the collection before I start. That way, I can tailor the digital preservation work to the requirements of the receiving repository – should I be lucky enough to find anyone who is interested. Anyway, I now have a short two-pager to send to repositories which might be interested. This is the second version. Dave Thompson of the Wellcome Foundation (who I met on the UCL online Digital Curation course) was kind enough to comment on the first version and his observations resulted in a substantial rewrite. I’ve sent it to 6 organisations – Loughborough University’s Centre for Information Management, Manchester University’s Computer Science Dept, City University’s Cass Business School, UCL’s Dept of Information Studies, the National Archives, and The Science Museum Wroughton Library and Archives. If I get a positive response from any of these, all well and good. If I do not, I shall proceed with the Digital Preservation work as planned.

Some three years ago I made a list of activities I wanted to undertake with the PAW/DOC collection, and this seems a good moment to summarise where I’m up to – the activities and their status are described below:

  • Scan the remaining 4 boxes of paper. Take the opportunity to explore scanning in colour and using PDF. Possibly also using OCR – though this is of much lower priority. DONE (but not OCRd)
  • Write a paper on “The paper artefact in the digital age” using an analysis of the contents of PAW/DOC as the basis for the paper. DONE
  • Explore the issues of longevity and survivability of file formats and of digital indexing and file management systems, using PAW/DOC as the basis for the work. This could also include moving the material from FISH and even Filemaker. STILL TO BE DONE
  • Revisit all the requirements listed in my 2001 BIT paper to identify current status and opportunities for further work. STILL TO BE DONE
  • Scan all remaining PAW/DOC paper i.e. all those items in the three archive boxes (most of which have been identified as artefacts to be retained in their physical form). STILL TO BE DONE – but next on list – I’m trying to find a binding machine to be able to sheet feed the documents with comb binding
  • Check that all index entries are valid (i.e. not blank and with an appropriate Movement Field entry) and have an associated populated FISH entry. STILL TO BE DONE
  • Write up a guide to the material and to the technology supporting it. STILL TO BE DONE
  • Hand over PAW/DOC and its supporting technology to the new owner and provide training for the people who will be managing it going forward. STILL TO BE DONE

An Update – This Work on Hold

This work has lain dormant for a little while now – but only because I’ve been focusing on other supporting activities. In particular, I’m exploring the field of Digital Preservation with the aim of undertaking work to ensure that the contents of my work document collection are long lasting. In the process of doing that I’m also trying to publicise the existence of the collection in order to find someone who might be interested in giving it a long term home. So, I don’t intend to do any further work on Personal Document Management until I’ve finished the Digital Preservation investigation.

For the record, I did actually go and talk to Jenny Bunn’s Digital Curation students at UCL on 27Feb2014. I talked for about 20 minutes, provided a handout (the odd layout is because it is designed to be printed double sided), and there was some Q&A at the end. I also had an interesting conversation afterwards with Jenny. However, it prompted no further interest in the work document collection.

Finally, a word about Ann O’Brien of Loughborough University, with whom I started collaborating on this topic in early 2013. The last contact I had with her was in September of that year, and I had heard nothing more from her or about her until I read in the November 2014 issue of the Loughborough University Alumni magazine that she had died in May 2014. Tom Jackson of Loughborough’s Centre for Information Management, where she worked, confirmed in an email that she had died of a heart attack and that her death had come as a huge shock. I’d like to record here that, in our brief collaboration, Ann was very helpful to me and gave me a number of substantial steers which moved the work I was doing forward both in terms of content and contacts.

Intrinsic Value of Artefacts

One of the people Neil Beagrie suggested I get in touch with was Elizabeth Shepherd, an Archivist and Records Management specialist in UCL’s Department of Information Studies. I duly emailed her early in Dec2013 and she asked Jenny Bunn, a Lecturer in the Department who is initiating a new teaching module on Digital Curation in January 2014, to contact me. Since then, Jenny and I have had a number of exchanges and we have agreed that there is potential for her students to make use of my document collection as a resource – though there is too little time to sandwich it into the early 2014 syllabus. Instead, I may go down to speak with her students in February or March.

Jenny also alerted me to a report on the Intrinsic Value of documents produced by the US National Archives and Records Service (NARS) in 1980. This is highly relevant to the work I am doing on the artefact in the digital age. So much so, that it has inspired me to define a clear set of research activities to establish if the NARS Intrinsic Value characteristics are relevant in Personal Information Management practices. Since this is now a distinct piece of work with clear objectives I shall continue to report on it under the separate heading of Digital Age Artefacts.

New Scanner – Canon DR-2020U

Last Friday my new scanner – a Canon DR-2020U ADF + Flatbed – was delivered, and I have spent the last few days trying to integrate it into my system and exploring its functions. I ordered it through Tradescanners who have an excellent web site enabling comparisons to be made between a wide range of products. The scanner arrived within 24 hours of me placing the order which was excellent. Unfortunately, I’ve experienced two different sets of problems – first my BT Digital Vault software seems to interrupt the scanner software significantly (a problem widely reported on the net – the underlying software, FSHosting, just hogs the CPU); and secondly my existing scanner and Document Management software, which could use an ISIS driver but doesn’t because I haven’t got one for it, seems to interfere with the ISIS driver that came with the Canon scanner. Other than that, the new scanner seems to do everything it’s supposed to – full duplex scanning of both sides of the paper as it goes through, paper size detection, blank page detection and elimination, and saving to PDF, JPG or TIF as required. I’m pleased – but am having to work through the problems.

The field has exploded in the last 15 years

In an effort to understand what is going on in the world of Personal Electronic Filing, a few weeks ago I emailed some people I had identified from papers and web searches. The results have been very rewarding.

It is now clear to me that what was a niche area in the 1990s has expanded hugely to become a topic in its own right with a large body of literature and a worldwide community of interest. The rise of personal computing, email, social media and the mobile phone has effectively made most individuals – whether they know it or not – personal information managers; personal information is now considered to extend to photos, calendar entries, text messages, social media material etc.;  and the ubiquity of electronic media has necessitated the development of the field of data forensics to capture and identify evidence. The field of Data Preservation is of particular interest to Libraries and Museums which are grappling with the practical problems of curating collections which include digital material. There appear to be many initiatives underway in all these areas, of which various EEC-funded projects, the UK Data Preservation Coalition, the US Library of Congress guidance notes, and William Jones’ Personal Information Management workshops are probably just the tip of the iceberg. I’m grateful to Neil Beagrie for linking me into much of this material.

With this new awareness I have begun to try and understand the role that my personal collection might have. In particular, I’m wondering if it could become a Test Set for exploring Data Preservation issues rather than the original aim of being a Test Set for Personal Indexing and Retrieval (an objective which seems to have become defunct since the rise of the Search Engine). This could be a useful focal point in my continuing search to find people to collaborate with.


A Second Column for Facets

I’ve been giving the Excel Index that I developed last year a lot of use – mainly for the Memento Management activity – and I’ve decided that having just one column for Facet is not enough. Inevitably there are cases where you want to specify two facets (for example, Loughborough and Rugby) and this is easily done by just putting one after the other, separated by a comma, in the single Excel cell. The trouble is that Excel’s filter facility lists entries alphabetically so, in the example above, if you look for Loughborough the entry “Loughborough, Rugby” appears in the appropriate position. However, if you are looking up “Rugby”, the “Loughborough, Rugby” entry does not appear among the entries beginning with R, so you may miss that particular item related to Rugby.

I’ve addressed the problem by including a second column for Facet, and by including both entries in both columns but with one in reverse order to the other, for example, in Column 1 “Loughborough, Rugby” and in column 2 “Rugby, Loughborough”. This ensures that, provided a search is done in both columns for a particular facet, you will find every instance of that facet and all secondary facets used with the facet being searched for.
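The two-column arrangement described above can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the Excel index itself: the function names, the row layout and the item titles are hypothetical, and repeating a lone facet in both columns is an assumption.

```python
def facet_cells(first, second=None):
    """Return the text for Facet column 1 and Facet column 2.

    With two facets the second column holds them in reverse order.
    Repeating a single facet in both columns is an assumption.
    """
    if second is None:
        return first, first
    return f"{first}, {second}", f"{second}, {first}"


def find_facet(rows, facet):
    """Mimic looking up a facet alphabetically in both filter lists:
    a row is found if either column's text begins with the facet."""
    hits = []
    for row in rows:
        if row["facet_1"].startswith(facet) or row["facet_2"].startswith(facet):
            hits.append(row["title"])
    return hits


# Hypothetical index rows, for illustration only.
col_1, col_2 = facet_cells("Loughborough", "Rugby")
rows = [
    {"title": "Hypothetical item A", "facet_1": col_1, "facet_2": col_2},
    {"title": "Hypothetical item B", "facet_1": "Rugby", "facet_2": "Rugby"},
    {"title": "Hypothetical item C", "facet_1": "Holidays", "facet_2": "Holidays"},
]

print(find_facet(rows, "Rugby"))
print(find_facet(rows, "Loughborough"))
```

Item A is found under both Rugby and Loughborough because each facet leads in one of the two columns, which is the whole point of the reversed second entry.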

Reasons for Keeping Hardcopy

I’ve been doing some preliminary practical work for the study of ‘the artefact in the digital age’ that I’m doing with Ann O’Brien. To gain an idea of the range of reasons for keeping hardcopy rather than just having a digitised version, I’ve reviewed the 357 items that I have chosen to keep rather than scan and throw away. Nineteen categories emerged. Ann and I will use this initial insight to plan in detail the practical work I am going to do in scanning four boxes of material that have not yet been scanned.

Storing Large Movie Files

In the last week or so I’ve been exploring the use of video editing and conversion tools – primarily to deal with the conversion and storage of personal cine film. However, since I also have about 15 pieces of video indexed in my document management system, I decided to try and use the same tools on them. The video is all on DVDs for two main reasons. First, most of it is in DVD video format (multiple files with extensions of either VOB, IFO or BUP) which is not conducive to storing in a document management system except in a zip file; and second, up to now I haven’t had sufficient storage in my PC to cope with the sizeable volumes of video files. Both of these constraints can now be overcome. I have the tools to convert multiple DVD video files into a single MP4 file; and my current laptop has 750 GB of storage of which two thirds is currently empty.

Overall, the exercise of moving the material on DVDs into the Fish document management system has been successful. All but one item has been transferred, and all the movies play directly from Fish when selected. However, a number of experiences are worth recounting:

  • Inaccessible DVD files: One DVD was transferred from VHS video to DVD by IC Video Ltd in the same way as several of the others. However, although it usually plays OK when put into the DVD slot, neither the FreeStudio nor the Movie Maker software I’m using was able to convert successfully from the DVD. Indeed, I’m not even able to copy it from the DVD to my laptop. So that material will have to stay on the DVD until I find a solution.
  • Movie file formats: I also have a number of TV programmes downloaded from the net with .MPEG extensions and which play successfully on my laptop. However, my FISH document management system does not seem to support that extension. I did try changing the extension to MP4 (which FISH does support) but found the quality was much reduced. In the end I discovered that FISH does support a .M1V extension with which the files do play successfully. This prompted me to read up about MPEG and I discovered it has many versions with .M1V predictably being MPEG1.  I don’t plan to spend a great deal of time trying to understand all about movie formats, and am just happy to have found an extension that plays on my laptop and can be stored in my document management system. However, it has reminded me that there is much greater complexity in all these standards than meets the eye and consequently I shall be keeping the physical DVDs in case I encounter problems downstream.
  • DVDs with data files: A few of the physical DVDs contain not video files but collections of ordinary files. For example, one is the installation disk for an old version of my FISH document management system. Another is the installation disk for my Home Use versions of Microsoft Word and Excel. With these I simply zipped up the files and stored the zip file in Fish.
  • Large file sizes: Although file size is no longer such a problem on the laptop as previously, it still needs to be taken into account for backup purposes. All the material in my document management system is placed into so-called ‘bins’ – standard Windows folders specially configured by FISH. Over the years I have limited the size of the bins to around 200 MB to facilitate manageability and the taking of backups. These days I’m able to back up to 4.7 GB DVDs – though I still try and keep the bin size to around 200 MB. With these movie files, however, some of the file sizes are well over 1 GB, so I have created specific bins for them to go in and ensured that the sum of the files within each bin does not exceed 4.7 GB (which is the advertised size of the DVDs) so that I will still be able to take the necessary backups. Unfortunately, as I discovered when trying to burn to disk, the usable size of blank DVDs is somewhat less at 4.37 GB. Consequently, I had to move some of the files around the bins I had created to keep the bins under the 4.37 GB limit. After over 30 years of personal computing, things are still never easy….
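The shuffling of large files between bins described in the last point can be sketched as a simple packing exercise. This is a minimal illustration rather than anything FISH does: it uses a "first-fit decreasing" approach (largest files first, each placed in the first bin with room), sizes are in whole megabytes, the 4.37 GB usable limit is taken as roughly 4,370 MB for convenience, and the function name and file names are hypothetical.

```python
USABLE_DVD_MB = 4370  # assumption: the 4.37 GB usable size expressed as ~4,370 MB


def pack_into_bins(file_sizes_mb, capacity_mb=USABLE_DVD_MB):
    """Group files into bins so that no bin exceeds one DVD's usable size.

    file_sizes_mb maps a file name to its size in MB. Files are taken
    largest first and placed in the first bin that still has room.
    """
    bins = []
    by_size = sorted(file_sizes_mb.items(), key=lambda item: item[1], reverse=True)
    for name, size in by_size:
        if size > capacity_mb:
            raise ValueError(f"{name} ({size} MB) will not fit on a single DVD")
        for contents in bins:
            if sum(s for _, s in contents) + size <= capacity_mb:
                contents.append((name, size))
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([(name, size)])
    return bins


# Hypothetical movie files, for illustration only.
movies = {"film_a.mp4": 2500, "film_b.mp4": 2000, "film_c.mp4": 1800, "film_d.mp4": 1200}
for number, contents in enumerate(pack_into_bins(movies), start=1):
    total = sum(size for _, size in contents)
    print(f"Bin {number}: {total} MB", [name for name, _ in contents])
```

Working from the usable size rather than the advertised 4.7 GB avoids having to reshuffle the bins at the point of burning the disc.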


A New, Simpler, Cheaper Filing System

Most of the electronic filing that I’ve done up to now has been for business documents, but now I’m starting to focus more on personal documents. The electronic system I’ve been using can cater for all kinds of documents, but I’ve decided to have a separate system for my personal material. This is mainly because the business documents may have research and historical uses and may be taken elsewhere, while the personal documents are for the use of myself and my family.

In setting up a filing system for personal documents I could re-use the one I’ve used for business documents since 1981. However, even though it uses relatively small scale software products, I have found it relatively costly and complex to maintain over the years as new versions have had to be obtained and installed to keep pace with new operating systems and other developments. Most upgrades to new versions have been disruptive and time consuming. Furthermore, the need to perform such upgrades to ensure continuity of operation imposes additional uncomfortable pressure. I want to minimise these problems in my personal filing system.

The solution I’m going to try out will have an index in Excel, and the base material will be stored in the Windows filing system. At present, Windows and Excel are ubiquitous and are integral parts of the basic computer system that I maintain for my own use. Hence the system will not incur additional cost, and there will be no special software functionality to make things more complicated. However, this simplicity is only achieved by sacrificing some flexibility in index searching. With the Filemaker index that I use for my business documents, multiple search terms and options (all, one or the other, not present etc.) will deliver a list of all records that match the criteria. This will not be possible to replicate in Excel, as the FIND command simply steps through matching entries one at a time – though the filter facility in conjunction with a ‘Facet’ field will go some way towards it. I’m prepared to forgo this functionality to achieve the substantial long term benefits of simplicity.
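To make clear what is being given up, the kind of combined search described above (all of these terms, one or other of those, and none of the others) can be sketched as follows. This is not how Filemaker works internally and is not part of the Excel system: the function names and the sample titles are hypothetical, and matching is a plain case-insensitive text search.

```python
def matches(text, all_of=(), any_of=(), none_of=()):
    """Return True if text satisfies the combined search criteria."""
    lowered = text.lower()

    def has(term):
        return term.lower() in lowered

    return (
        all(has(term) for term in all_of)
        and (not any_of or any(has(term) for term in any_of))
        and not any(has(term) for term in none_of)
    )


def search(titles, **criteria):
    """Return every title that meets the criteria, as one list."""
    return [title for title in titles if matches(title, **criteria)]


# Hypothetical index titles, for illustration only.
titles = [
    "Loughborough rugby fixture list",
    "Rugby club dinner menu",
    "Loughborough degree certificate",
]

print(search(titles, all_of=["rugby"], none_of=["dinner"]))
print(search(titles, any_of=["dinner", "degree"]))
```

The difference from FIND is that every matching record comes back together as a list, rather than being visited one at a time.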

I will use the same index fields as for my business documents with a few additions to cope with the limitations of Excel. The full list of fields will be as follows:

Reference number: mandatory for each entry and having four parts: an Owner identifier (PAW for Paul Wilson), a Set identifier (PERS), a Serial number (e.g. 1817) and a Sub-Serial number (e.g. 01). So, a typical Reference No looks like this: PAW-PERS-1817-01. The purpose of the Serial number is to enable new documents to be given the next number on the list, i.e. the number signifies nothing other than the physical location of the document in the file. The purpose of the sub-serial number is to enable two or more documents to be kept physically together in a file even if the later one is logged after the subsequent serial number has been allocated. The Reference No will be written/attached to the relevant physical documents/artefacts, and will be included at the beginning of the names of the equivalent electronic file(s). Note that, for my business documents, the separator in the Reference No is a slash (PAW/PERS/1817/01); however, a slash is not a valid character in Windows file names so it has been replaced with a dash (PAW-PERS-1817-01).

Title: mandatory for each entry, unlimited in length (subject to the size limitation of an Excel cell) and the contents to be at the owner’s discretion (i.e. it could be different to the actual title on a document). Note that up to the first 100 characters will be copied and pasted after the Ref No in the file names of associated electronic files; therefore Titles should be constructed to make these first 100 characters as informative as possible. The Title may also include Keywords or Phrases as described below.

Keywords or phrases: optional for each entry and unlimited in number, specified entirely at the owner’s discretion, separated by commas and added to the end of the Title after three dots.

Facet: optional for each entry and consisting of a single word or phrase which can be used as a broad search term by being selected from the Excel Filter list.

Physical Location: optional for each entry and used to indicate if there is a physical item associated with this entry and, if so, where it is located.

Electronic version: optional for each entry and used to indicate if there is an electronic item associated with this entry.

Publication Date: optional for each entry and used to specify the exact date (ddmmmyyyy) on which the material concerned came into being.

Year: mandatory for each entry in full form (yyyy) and used to indicate the year in which the material concerned first came into being (this is needed in addition to the Publication Date because it is not possible to specify an exact date for some items and an Excel field can only have a single date format).

Creation Date: mandatory for each entry and used to specify when the Index entry was created.
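The conventions in the field list above can be tried out with a short Python sketch. It is only an illustration under several assumptions that are not in the description: a two-digit Sub-Serial number, a single space between the Reference No and the Title in a file name, a space after the three dots that introduce keywords, and the replacement of any character Windows rejects in file names with a dash. All function names and sample titles are hypothetical.

```python
import re
from datetime import datetime

# Characters that Windows does not allow in file names.
WINDOWS_INVALID = re.compile(r'[\\/:*?"<>|]')


def make_reference(owner, set_id, serial, sub_serial=1):
    """Build a Reference No such as PAW-PERS-1817-01 (two-digit sub-serial assumed)."""
    return f"{owner}-{set_id}-{serial}-{sub_serial:02d}"


def index_title(title, keywords=()):
    """Append comma-separated keywords to the Title after three dots."""
    if not keywords:
        return title
    return f"{title}... {', '.join(keywords)}"


def file_name(reference, title, extension, max_title_chars=100):
    """Reference No, then up to the first 100 characters of the Title."""
    safe_title = WINDOWS_INVALID.sub("-", title[:max_title_chars])
    return f"{reference} {safe_title}.{extension}"


def year_from_publication_date(text):
    """Derive the Year field from a ddmmmyyyy date such as 27Feb2014.

    Month abbreviations are read using the current locale, so this
    assumes English month names.
    """
    return datetime.strptime(text, "%d%b%Y").year


reference = make_reference("PAW", "PERS", 1817, 1)
title = index_title("Hypothetical school report", ["Rugby", "Loughborough"])
print(reference)
print(title)
print(file_name(reference, title, "pdf"))
print(year_from_publication_date("27Feb2014"))
```

Where no exact Publication Date is known, the Year would simply be typed in by hand, which is why it is kept as a separate field.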

It is the scanning of various documents for a photobook of my work experiences (see my journal on memento management) that has prompted me to do this right now. So I shall start using this new system for those items immediately. After it has accumulated a substantial amount of material I will report again on how effective it is.

Working with Ann O’Brien

I asked Tom Jackson if he could suggest anyone who might be interested in working with me on Personal Document Management and he introduced me to Ann O’Brien of Loughborough University’s Department of Information Science. Ann and I duly spoke and discussed what I have been doing and what topics could be explored. A couple of Ann’s immediate off-the-cuff thoughts were:

Transferable Index: Based on my experience and system, would it be possible to develop a simple index and framework taxonomy which other individuals would find easy and useful to use?

A life in industry: Perhaps I could write a ‘book’ illustrated by references out to some of the contents of the filing system. This could be an ordinary book or an eBook.

In the short term, we agreed to explore if there would be any benefit in addressing the topic “The artefact in the digital age” using those documents I had resisted destroying, for whatever reason, after scanning. Ann will investigate if any research has been done on this topic and if it is likely that such a paper would be accepted in a journal; and I will provide Ann with a number of examples of such documents.