UCL’s Online Digital Curation Course

A couple of weeks ago I joined UCL’s free – and excellent – 8 week online Digital Curation course. It has several hundred participants from all over the world – many of them professionals and students in the Archiving and Curation field. The course covers what digital curation is, how it is performed, and its major activities and communities worldwide, as well as leading participants through some practical digital curation work on their own files. This latter activity is a perfect fit with the trial I am currently performing of an approach to creating and planning a Preservation Plan.  The course also encourages participants to discuss what is being taught and, although I’m actually doing digital curation work, I’m an amateur with no training, so I’m finding it very valuable to listen to the perspectives of specialists in the area.

Last week in the course we were asked to write about our digital mindset – our early experiences with computers and any turning points where we suddenly became more aware of the digital world we are now in. This was my (slightly augmented) contribution:

I first came across computers at university where we handed our punched card programs into the Computer Dept and collected the results a day or two later. In my first job in Kodak I experienced computerised stock control, sales estimating and factory production planning, and was fascinated. I became a Needs Analyst. However, it wasn’t till I joined the National Computing Centre’s newly formed Office Systems division in 1980 that the digital penny really dropped. The job was to seek out best practice and spread it to UK organisations. It was a time when Word Processing was gaining ground, personal computers were being introduced and electronic mail was just emerging. Within a year I knew that the future for the individual, both in the office and at home, was digital. I plunged in enthusiastically. I started filing all my documents using an index knowing that eventually the index would be computerised and that the documents themselves would be digitised; I replaced my pocket and desk diaries with a constantly updated folded A4 page that I kept in my wallet; and I rushed to work early in the morning to furiously communicate with distant colleagues in the British Library electronic journal project BLEND. By the time I took my next job in 1984 my path was set and the remaining 26 years of my career were spent harnessing the increasing power and lowering costs of computers to augment my digital visions. At home, we started budgeting on the EazyCalc spreadsheet, our addresses were held in a database, and I started indexing and scanning every family photo. At work, my wallet diary was eventually replaced by an Organiser and then mobile phone (though my wallet diary sheets are the best diary records I have); and I immersed myself in email, Computer Conferencing services, and research in configurable message systems. My file index was computerised on a Mac and eventually I started scanning my documents into a document management system. 
Shortly afterwards I started to experience preservation anxiety when I realised that this ever expanding, increasingly precious collection of all my work knowledge was utterly dependent on the next 30 years of effective back-up procedures and flawless migrations through many upgrades of three software products, the operating system, and my laptop.

When I retired in 2012 and was released from the overload hell that email had become, I had time to digitise the boxes of mementos accumulated since 1958. So, now I have a 33Gb digital collection of all my work documents (approx 180,000 scanned pages) which is in serious need of a preservation plan and a final destination. I also have a 44Gb collection of 17,000 family photos, and a 7Gb collection of 1600 digitised family mementos – both of which have a destination (my offspring) but which also require a preservation plan and a mechanism for informing, and handing them over to, the unsuspecting recipients. My digital vision for the workplace has long since been achieved; but there is much left to explore in the home – how to show, share and bring to life our physical and digital objects, and how to ensure they are reliably passed through the generations; and of course, ways to allay the ever-present preservation anxiety associated with such precious collections.

PDF/A Flavours and Error Messages

A week ago I acquired an updated version of my eCopy PDF Pro Office software with much more comprehensive facilities for creating PDF/A documents. Since then I’ve been exploring what those facilities are and using them on the files I’m converting in my PAW-PERS collection. The updated eCopy software provides support for PDF/A-1a, PDF/A-1b, PDF/A-2a, PDF/A-2b, and PDF/A-2u. Broadly speaking, PDF/A-1b seems to be the most basic level of conformance required and aims to achieve a reliably rendered visual appearance. PDF/A-1a supports additional features such as Tags and Language, while PDF/A-2 (which was published after PDF/A-1) also ensures that layers, transparency and embedded files are preserved. eCopy enables you to check whether a document conforms to any one of these standards, and I used this facility to check that the documents I was converting to PDF from other formats complied with PDF/A-1b. On almost every occasion, even though I was using the eCopy software to convert the documents into PDF, the compliance check threw up errors – below are examples of some of the most common ones.

  • xmp: CreateDate Bad XMP Date: ‘2015-01-27T09:56:01Z:P’ Page;1 Number;1 (The XMP metadata stream should conform to XMP specification);
  • Mismatch between xmp:CreateDate (‘2015-01-27T09:56:01Z:P’) and CreationDate (‘D:20150127095601Z’) (The XMP metadata stream should conform to XMP specification);
  • DeviceRGB used in image but no output intent (Device-specific colour space used, but no Output Intent is defined for the file)
  • Output Intent missing (Non-Device-independent colour space is used but no OutputIntent is not defined).
  • Missing PDF/A identifier (the PDF/A version and conformance level of a file shall be specified using the PDF/A identification extension schema in the XMP packet)
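The first two errors above both concern the XMP creation date. As a rough illustration of what a checker might be comparing, here is a minimal Python sketch. It is an assumption about the kind of test involved, not a description of how eCopy works. The function names are hypothetical, and it only handles the full date-and-time forms that appear in the messages (both date formats also allow shorter variants, which this sketch ignores).

```python
import re
from datetime import datetime, timedelta, timezone

# XMP dates use an ISO 8601 style, e.g. 2015-01-27T09:56:01Z or ...+01:00
XMP_DATE = re.compile(
    r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(Z|[+-]\d{2}:\d{2})$"
)
# PDF Info dates look like D:YYYYMMDDHHmmSS followed by Z or +HH'mm'
PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})(Z|[+-]\d{2}'\d{2}'?)?$"
)


def _zone(text):
    """Turn 'Z', '+01:00' or "+01'00'" into a timezone object."""
    if text is None or text == "Z":
        return timezone.utc  # assumption: a missing zone is treated as UTC
    sign = 1 if text[0] == "+" else -1
    return timezone(sign * timedelta(hours=int(text[1:3]), minutes=int(text[4:6])))


def _parse(pattern, value):
    """Return an aware datetime, or None if the value is malformed."""
    match = pattern.match(value)
    if match is None:
        return None
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    return datetime(year, month, day, hour, minute, second, tzinfo=_zone(match.group(7)))


def parse_xmp_date(value):
    return _parse(XMP_DATE, value)


def parse_pdf_date(value):
    return _parse(PDF_DATE, value)


def dates_agree(xmp_value, pdf_value):
    """True only if both dates are well formed and name the same instant."""
    xmp, pdf = parse_xmp_date(xmp_value), parse_pdf_date(pdf_value)
    return xmp is not None and pdf is not None and xmp == pdf
```

With the values from the error messages, `parse_xmp_date("2015-01-27T09:56:01Z:P")` returns `None` because of the stray `:P` suffix, so `dates_agree` fails for that pair. With the suffix removed, the two dates name the same instant.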

eCopy also provides a “Fix” facility which in most cases cleared the errors – though only if the resulting file was saved with a different file name. In some cases however, even the Fixed file still had errors in it which were only cleared by a further “Fix” and saving the file to yet another file name.

This turned out to be a rather tortuous process so I decided that I was only going to check and ensure PDF/A-1b compliance for the files that, at the start of this exercise, were not in PDF format at all. The remaining 800+ files which were already in PDF format at the start of this exercise will have to stay as they are for now. I’ve checked a few of them and they all have several compliance errors, but ensuring that they all comply with PDF/A-1b would consume more time than I am prepared to spend right now.

The key findings from this phase of the work so far are that it is vital to fully understand the file formats you are targeting, and to become very familiar with the software you intend to use, before creating the Preservation Plan. Without that knowledge the Plan is likely to be unrealistic and almost impossible to stick to.

A first attempt at a Preservation Plan

Shortly after my last entry, I set about creating a Preservation Plan for my PAW-PERS collection of personal documents and mementos. I combined elements of Project Planning that I had experienced while working as an IT professional together with aspects of the preservation planning concepts already documented in the Scoping Document. The Project Plan consists of two documents – a Project Plan Description and a Project Plan Chart.

I sent drafts of these two documents to Chris Hilton, William Kilbride and Neil Beagrie asking for comments. Chris Hilton of the Wellcome Foundation very kindly sent me back his views just before Christmas – in summary, he thought the plans were thorough and that the decision to convert most documents to PDF or PDF/A was a good one. He also suggested keeping the original versions of any documents containing some processing components (such as spreadsheets) which may not be captured within the PDF format; and he endorsed keeping off-site copies.

Neil Beagrie put me in touch with Gabriela Redwine of Yale University who is doing work on Personal Digital Archiving for the DPC (Digital Preservation Coalition). She too provided a positive reaction to the Preservation Plan. So with these two endorsements I set about implementing the plan – the first part of which requires that those documents that need to be retained in their original form are identified; and that the remaining files are converted to PDF/A.

Unfortunately the rigours of Christmas, and a subsequent call on my time to help my son and his wife do initial renovation work on their new house, have interrupted progress. However, even the little I have done so far has identified a number of issues: a) conversion of an htm document into a PDF document using my PDF package (eCopy PDF Pro Office) did not produce a faithful rendition of the original. The most reliable rendition was achieved by copying the htm screen into a Word document and then turning that into a PDF; b) a 2010 article from Ohio State University alerts readers that Word 2007 only produces a so-called PDF/A-1b version which does not include tags and mark-ups and which is suitable for documents which are primarily image-based and do not have alternate text. The more complete PDF/A-1a version enables screen reader technology to correctly read the document to disabled persons; c) it seems that even if you have software that can convert to PDF/A format, it still only places the “PDF” extension at the end of the file name, thereby providing no explicit confirmation of whether the file has been converted successfully to PDF/A or whether a file is or is not PDF/A compliant.
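On point c), a PDF/A file is expected to declare its version and conformance level inside its XMP metadata (the "PDF/A identifier") rather than in the file name. The Python sketch below is one rough way to read that declaration. It rests on two assumptions: that the XMP packet is stored as uncompressed text in the file, and that the identifier uses the usual `pdfaid:part` and `pdfaid:conformance` properties. The function names are hypothetical. It reports only what a file claims about itself and is no substitute for a proper compliance check.

```python
import re

# The identifier may appear as XML elements (<pdfaid:part>1</pdfaid:part>)
# or as attributes (pdfaid:part="1"); both forms are matched here.
PART = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
CONFORMANCE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def claimed_pdfa_level(data: bytes):
    """Return e.g. 'PDF/A-1b' if the bytes declare a PDF/A level, else None."""
    part = PART.search(data)
    if part is None:
        return None
    level = part.group(1).decode("ascii")
    conformance = CONFORMANCE.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return "PDF/A-" + level


def claimed_pdfa_level_of_file(path):
    """Read a file from disk and report the PDF/A level it claims, if any."""
    with open(path, "rb") as handle:
        return claimed_pdfa_level(handle.read())
```

A result of `None` means only that no identifier was found. A file that does carry an identifier may still fail a full check, as the errors reported by eCopy show.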

PDF/A

Since creating the test Scoping template in July, I’ve been trying to find someone to give me feedback on it – but with no success yet. Consequently, I have decided that I must press on with or without feedback. To that end, today I researched PDF/A  on the net and discovered that it is a standard which specifies certain features which will make PDF/A files more independent and self-contained and therefore more likely to be readable in the future. This is clearly a better format than ordinary PDF to store files in for the long term. Apparently, a more recent version of my PDF software (eCopy PDF Pro Office from Nuance) does support PDF/A and is available as a free download. I plan to obtain the upgrade, check out its PDF/A capabilities and then, armed with that knowledge, I shall follow up the Scoping document previously created for the Mementos collection with a Preservation Plan document.

Final Observations and Frame Works

After spending 6 weeks with the four different sizes of bookshelf posters on my wall (40×30 in – full size, 30×20, 18×12, 15×10), last week I came to these conclusions:

  • While the full size version is easiest to see and read, the next size down – 30×20 in – is still perfectly usable;
  • Even the two smallest sizes provide sufficient detail to be able to distinguish between books and to find them in the iPad.
  • Hanging the posters vertically so it appears as if the books are stacked one on top of the other doesn’t present a problem – in fact it makes it easier to read the book titles; however it might be better to remove the edge of the shelf running vertically down the poster and perhaps replace it with a shelf at the bottom of the stack.
  • As with ordinary books, it’s more convenient to view the bookshelf posters at head height; and it’s interesting to anticipate that a system displaying digital versions of the posters would enable shelves to be switched to the preferred height at will.
  • The posters can be presented together in different combinations and arrangements, for example, the poster of a particular shelf can be horizontal or vertical and can be placed at the top or bottom of a group of posters. This would be easy to replicate in a system managing digital versions of the bookshelves.
  • Like the posters, digital versions of the bookshelves could be duplicated and displayed in other rooms or locations.

With these points in mind I decided to put the four different sets of posters to the following uses:

  • The second biggest size posters have become my permanent visible images of the books I have scanned and which I no longer have physical copies of. They are arranged in a 40 x 30 in IKEA RIBBA frame which is on my study wall directly ahead of me as I sit at my desk. Being able to use the smaller-than-full-size posters has made it much more feasible to do this – the full size posters would have taken up too much of the wall space. I now have a much more constantly visible view of the spines than I ever had before when they were on bookshelves behind my desk amongst a lot of other material.
  • The third biggest size posters have now been arranged on a sheet of white paper and placed underneath the plastic desk pad on which my keyboard and mouse sit and on which I write longhand on occasion. This provides an unobtrusive decoration and demonstrates the reproducibility of the electronic bookshelf. The picture below shows the framed electronic bookshelf posters, the version under the desk mat, the book PDF files and an opened file on the adjacent computer screen, and the iPad showing thumbnails of the same PDF files.
    IMG_3622
  • The smallest size posters have been arranged in a 20 x 16in Wilko frame (see below) and given to my son and his wife as a housewarming present for the library area of their new house (not sure how much they will enjoy this but it had to go somewhere…!).
    Elec Bookshelf Picture Small
  • The largest, full size posters have been stored at the back of a large picture frame that I have in my study (see Poster Management journey) in case I should want to use them in future.

In assembling the sets of posters as described above, I took the opportunity to vary the way the individual posters were displayed and to think about how they might appear on a large scale display or roll of electronic paper. Given that many arrangements are possible I included a tag line at the bottom of each one to identify the title of the collection (‘Col’) and the particular arrangement of that collection (‘Rig’). An example is shown below:

Col and Rig example

There is undoubtedly some synergy between some of the points that have emerged from this electronic bookshelf exercise and in the way that mementos might be displayed, and I intend to think about these when I start the next phase of the Memento Management work described elsewhere in this site. In the meantime, however, my current exploration of the Electronic Bookshelf has come to an end. Perhaps when electronic paper becomes sufficiently cheap, and when an App is available to create, manipulate and arrange the images of book spines and covers, I’ll attempt to replace my framed poster version with the real thing.

Done and Digitised – 1980-2011 !

I’ve just finished digitising the third tranche of my mementos – the material we have kept in separate pocket folders for each year since we got married in 1980. This was an even bigger job than the two previous tranches (one for work related materials, and the other for my own mementos from 1958 – 1979), since it involved so much material of such a diverse nature. The end result is 575 index entries, and 611 electronic files taking up 2.5Gb of storage. About 220 physical items were retained in either 40-Pocket Presentation Folders, Clear Foolscap Plastic Wallets, or a Display Cabinet.

Overall the whole exercise has taken about six weeks of at least a couple of hours work every day – often a lot more. The most time-consuming part of the exercise was the initial sorting and organisation of the material. Scanning the items was relatively quick – though some couldn’t be scanned and had to be photographed and this added time to the process. I photographed three types of items: a) all the Birthday/Anniversary/Easter cards etc. that we had kept – these were photographed as groups – first the fronts and then the insides with the writing on – rather than scanning each one individually; b) large formats such as magazines, newspaper articles and some theatre programmes that were simply too big to fit on the scanner; and c) 3D physical objects such as a winner’s medal.

I attempted to identify the set of index terms (facets) as I went along, but inevitably requirements for new terms identified half way through affected the allocations made earlier. I also attempted to store the physical artefacts in a coherent way as I went along, but this too is difficult to finalise until the end of the process when you can see the full extent of the amount and type of material to be dealt with. To have any hope of keeping things under control it’s necessary to decide on an initial ordering criterion, such as date, and then to leave plenty of spaces to enable additional items that are encountered later on in the exercise to be slotted in. I failed to do that sufficiently well in this exercise and consequently now have most of the material in reasonable order but also a substantial number of items stored separately which need to be interleaved with the main set.

I’ve stored all the digitised items as PDF files for three reasons: a) PDF enables you to collect up several related individual scans or photographed images so that they can be accessed as a coherent set of items; b) The SideBooks App which I am using to display the items on my iPad, will only accept PDFs, ZIP, CBZ, RAR and CBR formats; c) There seems to be some consensus that PDFs are a good ‘data preservation’ format for enabling files to be read in the long term.

As with my work and pre-marriage mementos, I’ve loaded this new set into the SideBooks App on my iPad. I continue to be impressed at how easy that process is – just a matter of copying the files you want to move and pasting them into Dropbox. I tended to copy over groups of ten or twenty files at a time which take only a few seconds each to load into Dropbox. After that, the Dropbox page in SideBooks can be opened and a tap on the file concerned starts the downloading process. A few seconds later it’s all done, and the first page of the PDF file is displayed as a thumbnail. I feel it is a startlingly effective way of bringing material to life that has been trapped in files and boxes. Since this set of items is as much my wife’s as mine, she too has the set of items in SideBooks on her iPad, so it will be very interesting to see if she feels the same way after she’s used it for a while.

Miniature representations

In the last post but one, I described how it was pleasing to have full size poster replicas (40×30 inches) of the shelves of books I have scanned, in easy to see positions on the wall in front of my desk. Since then I have begun to wonder just how small these poster replicas could be to provide the same experience. Therefore, as a final phase in this journey, I’ve had the poster set reprinted in three smaller sizes (30×20 in, 18×12 in, 15×10 in) and positioned them in the remaining wall space in my study as shown in the pictures below. Over the next couple of weeks I’ll mull over how the different sizes compare and try to come up with a view as to whether a miniature representation can provide a similar experience to that provided by a full size representation.

30 x 20 in poster

18 x 12 in posters

Smallest size posters 15x10

Virtual Display —> Full Text Integration

The original aim of this investigation was to display virtual images of books on a wall and to be able to call up the full text of any one of them to read. The two ends of this objective – display of a virtual bookshelf, and the ability to read the full text of any of the books – have been achieved. However, the integration of the two ends is, as yet, a manual process requiring the user to choose a title from the virtual display, to open up the iPad SideBooks application, and to find and open the required title.

I’ve briefly thought about a variety of ways that this process might be automated. The original notion was to use e-Paper and to be able to touch the image of a book spine to bring up the full text on a separate screen. In the absence of e-Paper, I toyed with the notion of using Image recognition via an iPhone App, but I couldn’t find an App that would use the text recognised to open up another application. I then started thinking about voice recognition and tried out the Apple Siri voice recognition facility on the iPad. Frustratingly, although it will very effectively open up the SideBooks application just by the user saying “SideBooks”, Siri is not yet able to search for and open up files within applications. This was confirmed to me by an Apple Chat support person – though he did advise me that I should inform Apple of my requirement via a Feedback form, which I duly did.

That’s as far as I’ve got. Voice recognition does seem to be the most promising approach – and I guess it’s quite possible that Apple may enhance Siri to call up files, sometime in the future. In the meantime, I expect I’ll be able to just about get by in manual mode!

Displaying the virtual bookshelf

Before scanning all the books I took photos of them on their bookshelves and had full size posters made of the images. The poster print of the shelf of Paperback  books had two images of about 88 cm (37 inches) in length and roughly 24 cm (10 inches) in height (slightly more than the largest book). After completing the scan of the Paperbacks back in July, I cut out the two poster images, fixed them on some stiff cardboard using miniature bulldog clips and found a couple of spaces on the study wall facing my desk as shown in the image below.

IMG_2961

Having had several weeks to ponder them, I’ve found it pleasing to have them there. I have much better visibility of these books than I had previously as I now look at them every time I sit down at my desk; and it took relatively little effort to achieve and has not taken up any valuable space other than areas of previously blank wall. Of course an electronic and customisable display of the virtual books would have been preferable – but this low-tech version serves much the same purpose.

After completing the scans of the Work and University books last Sunday, I duly cut out the poster prints of those two sets of books. The University books poster is relatively small – some 39 x 32 cm (15.5 x 12.5 inches) and was relatively easy to fit onto the now increasingly crowded wall facing my desk – see below bottom left.

IMG_2966

However, the Work books poster, at 114 x 31 cm (47.5 x 12 inches), was far more difficult to place. I was even considering putting it on the wall behind my desk under the bookshelves – or even on the empty bookshelf where the books originally sat – until I had a lightbulb moment and realised the poster didn’t have to be horizontal. Since the titles are normally printed down the spine, they will appear horizontal – and be easier to read – when the poster is turned vertically! Obvious really – but I just hadn’t seen it up to that point. Anyway that made finding a space a whole lot easier and I finally selected a very visible spot between the window and the existing bookshelf as shown below.

IMG_2964

So now I still have my books around me, and I can access their contents very easily on the iPad; and I also have two empty bookshelves which I can use for other things – a welcome benefit since I have very little spare storage space left in my study.

Hardcopies Digitised – e-textbooks are Best!

After a wonderful family wedding in Italy, I restarted the scanning work on the 19th of August. With everything I’d learned doing the paperbacks, I was able to work much faster and I completed all 75 hardbacks (some 21,000 pages) in just 10 days.

Of course there were differences: hardbacks are constructed differently – typically with a strip of gauze being glued onto both the spine and the thick cardboard covers. This has to be cut to remove the pages of the book from the hardback covers. Unlike the paperback covers, which were mostly small enough for the front, spine and back to be scanned all at once, most of the hardback covers were bigger and the fronts and backs had to be scanned separately. Some of the hardbacks also had dust jackets whose fronts and backs also had to be scanned separately. To acquire full images of the full front, spine and back of both the covers and the dust jackets, I took photos of each and trimmed them down using the cropping tool in the PDF PRO software that I’m using (I included these images for completeness in case I want to do further electronic manipulations or displays in the future).

For every book, two PDF files were produced. The first was for the complete book with dust jacket front cover, inside dust jacket front, hard cover front, inside hard cover front, book pages, inside hard cover back, hard cover back, inside dust jacket back, dust jacket back (or similar for paperbacks but without the dust jackets). The other file was for the cover components and included all the items included in the first file but without the book pages and with full images of the complete cover and dust jacket. The cover and dust jacket images were cropped and finalised in the second file before being pulled into the first file to complete the working PDF file of the whole book which was downloaded into SideBooks on the iPad via Dropbox. The master versions of the two PDF files for each book are stored in a separate folder on my laptop with an offline backup in the cloud.

The hardbacks were books I acquired for University and for Work – i.e. they were for study and reference. Having got them all into electronic form and onto the iPad, I really cannot see why anyone would bother with a hardcopy version of such textbooks. The iPad version is lighter, smaller, more portable, quicker to access, easier to search and far easier to store. I shall make a point to ask my friends in academia and publishing if there is a noticeable trend away from hardcopy textbooks.

Now that the digitisation work has been completed I shall spend a day or two thinking about what further work to do on this particular Journey.