PAWDOC: Technology requirements and problems

To operate a personal electronic filing system you need a computer with a screen, a scanner, software to manage an Index and the documents, and a general approach. My colleague, John Pritchard, and I decided to explore what it would be like to operate such a system after visiting Amoco in the USA, and we followed the approach that we had seen there: every document was given a reference number and an index entry and was then stored in reference number order. Searches were performed on the index and retrieval was achieved by using the reference number.

We were able to apply the approach immediately using index cards. However, the technology to support the approach took a long time to become powerful and cheap enough for an individual to apply it, and it took many more years before it could be considered to fully support personal electronic filing systems. Consequently, much of the experience gained in using hardware and software to support PAWDOC has been in how to manage imperfect technology solutions. This has been particularly the case with computer storage, which was insufficient and expensive when I first started scanning PAWDOC documents in 1996. The bulk of scanned documents had to be held offline on Magneto-Optical disks, and this not only imposed a whole set of management requirements but also constrained the portability of the system. Today, however, storage is plentiful and cheap and the whole of the digitised PAWDOC collection is held on my laptop.

Scanners too have become better and cheaper since 1996. The first one I had was only capable of scanning in Black & White and one side of the paper at a time. Consequently, scanning large documents took a long time, and any colour on documents I scanned at that time has been lost. The scanner I have today takes less time to scan a page despite the fact that it is also scanning in colour and capturing both sides of the paper as it goes through the machine.

In many ways the software to support personal filing has always been in place, but its performance has been constrained by computing power. For example, the indexing software I use took over three minutes to conduct a complex search on less than 4,000 records in 1988, whilst my current version of the same software takes less than one second to conduct the same search on over 17,000 records.

The software to manage the stored documents has also been constrained by computer power – but in a rather unexpected way. In the 1980s and 90s when I first started using the PAWDOC system the conventional thinking was that a dedicated Document Management System was needed for the purpose. Such software applications were large complex beasts with numerous features and they relied on an underlying database application. Today, PAWDOC documents are stored in Windows folders labelled with a Reference Number. My laptop and the Windows 10 operating system are more than powerful enough to be able to display and search over 17,000 folders in just a few seconds. Such a solution would not have been feasible in the mid 90s, but today’s power has enabled a very complicated and constraining element of the personal electronic filing system architecture to be dispensed with.
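
To illustrate how little machinery a folder-per-Reference-Number store needs, here is a minimal sketch of looking up the files held for one reference. It is not the actual PAWDOC setup: the function names are hypothetical, and the folder naming (slashes in the reference replaced by hyphens, because a slash cannot appear in a Windows folder name) is an assumption.

```python
from pathlib import Path


def folder_name(reference: str) -> str:
    """Turn a reference such as 'PAW/DOC/7653/01' into a legal folder name.

    Assumption: slashes are replaced with hyphens, since '/' cannot appear
    in a Windows folder name. The real naming convention may differ.
    """
    return reference.replace("/", "-")


def files_for_reference(store: Path, reference: str) -> list[str]:
    """Return the sorted names of the files held in the folder for one reference."""
    folder = store / folder_name(reference)
    if not folder.is_dir():
        return []  # no folder has been created for this reference
    return sorted(p.name for p in folder.iterdir() if p.is_file())
```

A call such as files_for_reference(Path("D:/PAWDOC"), "PAW/DOC/7653/01") would list whatever sits in that folder; the drive and path here are purely illustrative.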

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q38. What additional software functionality is required?

2001 Answer: Partially answered: A system which eliminates the need for two systems by combining simple and flexible indexing and searching functionality, and file management functionality which keeps track of the thousands of electronic files (Wilson 1996a: 3).

  • Facilities to detect low usage and to automatically recommend the destruction of paper (after scanning).
  • Intelligent synonym functionality that can recognize relationships between frequently used abbreviations and terms, and which requests the user to confirm possible synonym relationships (Wilson 1990: 96 – 97).
  • The ability to automatically manage multi-part reference numbers of the type PAW/DOC/7653/01 and to be able to present the next unused number.
  • The ability to produce a KWIC (Key Words In Context) or KWOC (Key Words Out of Context) index (Wilson 1992a: 29,30).
  • The ability to store a set of web pages without losing the links between them (the FISH Document Management System is unable to do this because it stores each individual file with a new file name consisting of a combination of alphanumerics) (Wilson 1995b: 131).
  • Functionality to support the assembly, development and use of knowledge (Wilson 1997: 3 – 4).
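
For readers unfamiliar with the term, a KWIC index lists every significant word of a title in a central column, with the rest of the title shown as context either side. The following is a minimal sketch only: the stop-word list, the column width, and the sample title and reference used in the note afterwards are all illustrative assumptions, not taken from the PAWDOC Index.

```python
# Words too common to be worth indexing (an illustrative, incomplete list).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def kwic(titles: dict[str, str], width: int = 30) -> list[str]:
    """Build KWIC lines: each keyword is shown in capitals between its left and right context."""
    entries = []
    for reference, title in titles.items():
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue
            left = " ".join(words[:i])[-width:]       # context before the keyword
            right = " ".join(words[i + 1:])[:width]   # context after the keyword
            line = f"{left:>{width}}  {word.upper()}  {right:<{width}}  {reference}"
            entries.append((word.lower(), line))
    entries.sort(key=lambda entry: entry[0])          # alphabetical by keyword
    return [line for _, line in entries]
```

Given a hypothetical entry titled "Design of the electronic office", this produces one line each for DESIGN, ELECTRONIC and OFFICE, each ending with the reference number.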

2019 Answer: Fully answered: My current views on the additional functionality listed in the 2001 answer are as follows:

  • Combined Indexing and file storage: Now that I have eliminated the Document Management System and replaced it with Windows folders, I no longer feel this is needed. However, although retrieval is simple and quick, it could be made even more effective if the files associated with a particular Reference Number could be automatically listed under the Index entry for that number, and if the file you require could be selected and opened from that list.
  • Low usage detection: Now that all documents are digitised and paper is no longer taking up valuable space, there is no need to identify which hardcopies are not being accessed and could therefore be digitised and removed. Consequently this requirement is no longer needed.
  • Intelligent synonym functionality: Terminology continues to change, so this is still required.
  • Management of multi-part Reference Numbers: This is still a requirement. It would make it quicker and easier to create new index entries.
  • Production of a KWIC index: I no longer produce paper backups of the index, so this is no longer required.
  • Store web pages without losing links: I now use zip functionality to combine and store the multiple files making up a single web site, so this is no longer required.
  • Nugget/knowledge management: I never clearly ascertained if this would be worthwhile or not (see more detailed discussion in the answers to questions 27 – 29, and also in the topic ‘Knowledge Development‘ elsewhere in this web site).
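
The multi-part Reference Number requirement can be sketched quite simply. The code below is an illustration under stated assumptions, not a description of any existing Index feature: it assumes references always take the form PAW/DOC/number with an optional two-digit part number, and the function names are hypothetical.

```python
import re

# Assumed shape: PAW/DOC/<main number>[/<part number>], e.g. PAW/DOC/7653/01
REFERENCE = re.compile(r"^PAW/DOC/(\d+)(?:/(\d+))?$")


def parse(reference: str) -> tuple[int, int]:
    """Split 'PAW/DOC/7653/01' into (7653, 1); a missing part number counts as 0."""
    match = REFERENCE.match(reference)
    if match is None:
        raise ValueError(f"not a PAW/DOC reference: {reference!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def next_main_number(existing: list[str]) -> str:
    """Suggest the next unused main number, one above the highest in use."""
    highest = max((parse(r)[0] for r in existing), default=0)
    return f"PAW/DOC/{highest + 1}"


def next_part_number(existing: list[str], main: int) -> str:
    """Suggest the next unused part number under one main number."""
    parts = [part for number, part in map(parse, existing) if number == main]
    return f"PAW/DOC/{main}/{max(parts, default=0) + 1:02d}"
```

The choice to take "highest in use plus one" rather than filling gaps is deliberate in this sketch: it keeps new entries in chronological order of creation.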

In addition I would add the following:

  • Use of flexible Date formats: This is required to be able to specify BOTH exact dates (for, say, the date a document gets created or a letter is sent – dd/mm/yyyy); AND partial dates (for, say, the year a book is published – yyyy – or the month and year of publication of a journal or magazine – mm/yyyy).
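
One way such a flexible date field could behave is sketched below. This is an assumption about how the requirement might be met, not a feature of any existing indexing package: each of the three formats is tried in turn, and the precision is returned alongside the date so that a partial date is never mistaken for an exact one.

```python
from datetime import datetime

# Formats tried in order, from most to least precise.
FORMATS = [("%d/%m/%Y", "day"), ("%m/%Y", "month"), ("%Y", "year")]


def parse_flexible_date(text: str) -> tuple[datetime, str]:
    """Parse dd/mm/yyyy, mm/yyyy or yyyy and report which precision was given.

    Missing components default to 1 (first day, first month), which keeps
    partial dates sortable alongside exact ones.
    """
    cleaned = text.strip()
    for fmt, precision in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt), precision
        except ValueError:
            continue  # try the next, less precise format
    raise ValueError(f"unrecognised date: {text!r}")
```

Keeping the precision as a separate value means the Index could display "2019" for a book and "14/03/2019" for a letter while still sorting both in one date column.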

Q39. What technology problems have been experienced while operating the electronic filing system?

2001 Answer: Experience gained:

  • Replacing the PC requires the re-installation of all the software, which has been problematic on the last three occasions.
  • Upgrades of software can require a complex conversion process.
  • The index software (Filemaker Pro) crashes from time to time, but the Filemaker recovery function has always been able to deal with it except for an occasion over 10 years ago when the backup file had to be used (Wilson 1992a: 65, 72, 79).
  • About 30 of the files were lost in the document management system – probably in the course of moving them to off-line storage or moving them back onto the PC’s hard disk (Wilson 1995b: 137, 139).
  • One of the Magneto-Optical disks became corrupted and the backup files had to be used (Wilson 1995b: 141).

2019 Answer: Fully answered: Over the last 20 years most of the technology problems I’ve had seem to relate to four main areas – storage, specialist software, upgrades, and obsolescence:

  • Issues associated with the lack of cheap reliable storage: This problem has largely disappeared. When I first started scanning in 1996 I had to use external Magneto-Optical disks attached to my laptop, and I did suffer some data transfer and disk corruption problems. Today I have more than enough fast storage with the 1Tb SSD in my laptop.
  • The management and cost of specialist software: I had to deal with a wide variety of issues over the years with my document management software and its associated Sybase, and subsequently SQL, database. So much so that I have concluded that it is far better to avoid all specialist software if at all possible. It introduces complexity and is costly to buy, upgrade and support. While general purpose software may have fewer features, overall it is likely to be much easier to manage and use, and is likely to be a much more viable long term solution for the individual. I am very pleased to have eliminated the document management system and associated database from the PAWDOC architecture, and to now be using the much more familiar and straightforward Windows folders to store PAWDOC files in. I still use Filemaker for the Index but I regard this also as specialist software. Although it is very reliable and presents few management issues, it still has to be upgraded every three years at a cost of over £200 a time; whereas I know that I could still operate the Index if I exported the data to an Excel spreadsheet. In conclusion, I would recommend anyone setting up a personal electronic filing system to use standard multi-purpose software, preferably software which you are already using, and to avoid specialist software if at all possible.
  • The complexities associated with upgrading platforms and operating systems: Moving systems from old to new computers, or upgrading operating systems, are major changes with associated risks. That’s not to say that it will necessarily be difficult – but over the years I have encountered issues and have found the more complex the systems being used the greater the challenges. The document management system had to be totally reinstalled from scratch when it was moved to a new laptop and that was something I only ever achieved once by myself without any supplier support. Now that PAWDOC only uses a Filemaker Index and Windows folders, the risks and difficulties associated with upgrades are much lower.
  • Obsolescence: As files are accumulated over the years, they may become unreadable because you no longer have the appropriate application software running on your machine. When I conducted a Digital Preservation exercise on the PAWDOC system in 2016-2018 I discovered many examples of such files, and it took a considerable effort to deal with the problems and achieve readable files again. Similar problems can affect hardware such as disks and memory sticks – though I feel less vulnerable on this front as I have sufficient storage on my laptop to cope with all PAWDOC requirements. However, anyone operating a long term filing system is going to have to undertake periodic Digital Preservation work of one sort or another to ensure that their documents continue to be readable.

Q40. What contingency arrangements can be made to minimize and overcome technology problems?

2001 Answer: Ideas formed:

  • Make clear notes on little used technology procedures and fixes.
  • Document system components and configuration settings.
  • Assemble support phone numbers.
  • Keep all of the above in hardcopy and in a place that does not require the filing system to find them.

2019 Answer: Fully answered: In addition to the four points made in the 2001 answer (make notes on procedures and fixes, document components and configurations, document support numbers, keep such documentation outside the system in hardcopy), I would add:

  • Be diligent about regularly backing up.
  • Ensure you know how to use backup data to reinstall applications.
  • If you have a specialist Index application, consider regularly exporting the data to a spreadsheet application so that, if the application fails, you still have immediate access to the Index.
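
The last point, keeping a spreadsheet-readable copy of the Index, might look something like the sketch below. It assumes the index records have already been obtained from the Index application as a list of dictionaries (how that is done depends on the application and is not shown), and the function name is hypothetical.

```python
import csv
from pathlib import Path


def export_index(records: list[dict[str, str]], target: Path) -> int:
    """Write index records to a CSV file that a spreadsheet application can open.

    Returns the number of records written. The column order follows the
    field order of the first record.
    """
    if not records:
        return 0
    fields = list(records[0].keys())
    # 'utf-8-sig' adds a byte-order mark, which helps Excel detect the encoding.
    with target.open("w", newline="", encoding="utf-8-sig") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(records)
    return len(records)
```

Plain CSV is used here rather than a native spreadsheet format because it needs no extra libraries and is likely to remain readable for a very long time.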

Q41. What equipment is needed to operate a filing system and what are the key criteria by which it should be selected?

2001 Answer: Experience gained:

  • A high resolution monitor preferably capable of displaying a whole A4 page in a magnification you can read.
  • A laptop computer with sufficient hard disk to store all the electronic files and scanned images in the collection, and with room for the growth of the collection.
  • An off-line storage system that can be used to make backups of all the collection’s electronic files and scanned images, as well as the electronic filing software application’s configuration, control and data files.
  • The equipment should not be too noisy.

2019 Answer: Fully answered:

  • A large screen high resolution colour monitor big enough to display a whole portrait page sufficiently large as to be roughly readable without magnification, and/or capable of being turned into a portrait monitor as required.
  • A light weight laptop computer small enough to be transported in hand luggage, with a high resolution colour screen, sufficient fast SSD storage to accommodate the whole of the personal filing collection, and which makes a minimal amount of noise.
  • A colour scanner with both a sheet feeder and a flatbed capable of scanning documents at least A4 in size, which is reasonably fast, and is small enough to fit on or next to your desk. Its software should be capable of automatically adjusting the scan to the size of the document and automatically adjusting sloping originals to produce a vertical scan. It should also provide easy to use facilities for adjusting contrast and brightness to deal with poor originals, and for resetting after sheet feeder jams.
  • Two or more external hard disks or flash drives with sufficient capacity to store the whole of the personal filing collection, for use as a) a local backup, b) a remote in-country backup, and c) if required, a remote out-of-country backup.
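
Having several backup drives is only useful if each copy is known to be complete. As a hedged illustration (the function names are hypothetical and this is not part of the PAWDOC procedures), the sketch below compares a master folder tree with a backup copy using SHA-256 checksums and reports any file that is missing from the backup or differs from the master.

```python
import hashlib
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 digest of its contents."""
    result = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # read_bytes loads the whole file; fine for documents, slow for huge files
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            result[path.relative_to(root).as_posix()] = digest
    return result


def backup_problems(master: Path, backup: Path) -> list[str]:
    """List files that are missing from the backup or whose contents differ."""
    source, copy = checksums(master), checksums(backup)
    return [name for name, digest in source.items() if copy.get(name) != digest]
```

An empty list from backup_problems means every file in the master has an identical counterpart on the backup drive; it does not report extra files that exist only on the backup.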

Q42. What considerations should be taken into account when physically laying out the filing system?

2001 Answer: Partially answered:

  • Paper files should be placed so they are accessible while sitting at the desk (Wilson 1990: 94)
  • The scanner should be placed so it can be operated while sitting at the desk.

2019 Answer: Fully answered: Over the years I’ve had to cope with a variety of company offices and a long period of operating out of my home study. In all these situations I have tried to arrange the physical layout so I could conduct all my filing activities while seated at my desk. I’ve found this to be feasible and effective.  Hardcopy files can be placed in an upright filing cabinet (or cardboard boxes) alongside or behind one’s desk; and a scanner can be placed on the right hand edge of the desk. Backup external drives can be placed in a pedestal drawer.

Q43. What criteria should be used to select an electronic filing system software package?

2001 Answer: Experience gained:

  • Ability to support the desired filing schema.
  • Ability to manage both hardcopy and electronic files.
  • Enables the rapid input of new items.
  • Enables easy and quick searching.

2019 Answer: Fully answered: In addition to the points made in the 2001 answer (support for the filing schema, management of both hardcopy and electronic files, rapid input of new items, and easy and quick searching), I would add:

  • Simplicity and understandability of the architecture of the system.
  • Ease of installation.

Q44. Is it feasible to construct a filing system out of multiple different software packages?

2001 Answer: Experience gained: Yes. However, provided all the requirements are met, it would be more efficient and easier to manage if only a single package was required.

2019 Answer: Fully answered: Yes, it is feasible, provided effective integration between the packages can be achieved, and provided not too much effort is required to set up and maintain the integration. However, it undoubtedly complicates matters and requires more effort to manage and maintain; therefore, the simpler the packages to be integrated the better. On balance I would not recommend it if a single piece of software will do the job.

Q45. How much file space do you need to store an individual’s personal files?

2001 Answer: Experience gained: Assuming only black and white scanning, no digitizing of journals or books, and no video material, a collection built up over 70 years would require approximately 53 GB (Wilson 2001a). Until experience is gained of colour scanning and digitized video, a more realistic figure cannot be estimated.

2019 Answer: Partially answered: The current PAWDOC collection can’t be considered a total lifetime collection because:

  • A substantial number of the colour hardcopies were scanned in B&W.
  • For about 10 years when I was working in Bid Management with highly confidential and fast moving documents, the number of documents I was putting into the collection was much reduced.
  • The collection only includes about 30 years of my 40 years of work.

It should also be remembered that about half the collection was assembled under business conditions that were in transition from paper only to paper + electronic – very different from today’s environment. Furthermore, the type of work I did and my overlapping interest in technology research, dictated my coming into contact with a particular range of documents; different types of jobs and interests will dictate different numbers of documents of different types.

Having said that, all the items in PAWDOC have been digitised and the overall digital collection takes up about 46Gb.

Q46. How much file space is taken up by the average document?

2001 Answer: Partially answered: Chan’s results showed the following sizes for an A4 page: line art 87 kb; black and white 91 kb; halftone 181 kb; and colour 3347 kb (Chan 1993: 28). In practice, initial black and white scans at 240 dpi were producing an average file size of about 40 kb (Wilson 1995c: 1).

2019 Answer: Fully answered: File sizes vary depending on what application they have been created in and on whether they are scanned as colour or B&W documents. Therefore, file sizes for a number of these combinations were established using my current scanner (a Canon DR-2020U) to scan at 300dpi a single full page of typed text for the B&W document and a single page containing 5 colour photos of various sizes for the colour document.

  • B&W page created in Word 2007: 13 Kb
  • B&W page scanned in B&W to PDF: 105 Kb
  • B&W page scanned in Greyscale to JPG (the scanner would not scan in B&W to JPG): 579 Kb
  • B&W page scanned in 24 bit colour to JPG: 584 Kb
  • B&W page scanned in B&W to TIF: 69 Kb
  • Colour page created in Powerpoint 2007: 1,100 Kb
  • Colour page scanned in 24 bit colour to PDF: 808 Kb
  • Colour page scanned in 24 bit colour to JPG: 750 Kb
  • Colour page scanned in 24 bit colour to TIF: 25,389 Kb
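
The per-page figures above can be turned into a rough storage estimate. The sketch below is illustrative arithmetic only: it assumes "Kb" means kilobytes, that every page is about the size of the single test page measured, and that 1 GB is 1024 × 1024 Kb. Only the scanned formats are included, and the page count in the note afterwards is hypothetical.

```python
# Per-page sizes in Kb for scanned pages, taken from the measurements listed above.
PAGE_SIZE_KB = {
    "bw_pdf": 105,
    "bw_tif": 69,
    "greyscale_jpg": 579,
    "colour_pdf": 808,
    "colour_jpg": 750,
    "colour_tif": 25_389,
}


def estimated_gb(pages: int, page_format: str) -> float:
    """Estimate the storage needed, in GB, for a number of single-page scans."""
    total_kb = pages * PAGE_SIZE_KB[page_format]
    return total_kb / (1024 * 1024)
```

On these assumptions, 100,000 pages scanned in B&W to PDF would need roughly 10 GB, whereas the same number of pages scanned in colour to TIF would need well over 2,000 GB, which shows how much the choice of format matters.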

Q47. What’s the best type of storage media to keep electronic files on?

2001 Answer: Experience gained: A hard disk in the laptop is best because it is so quick and easy to use. CDs are good because CD writers are cheap and CD drives are available in most laptops. Having said that, this does not preclude other media with similar characteristics.

2019 Answer: Fully answered: It’s best to keep your files with you on a laptop – or on your mobile phone provided you have all the necessary applications on the phone and you feel the screen is big enough to be able to read the documents. However, since both laptops and phones are portable and therefore at higher risk of being lost or stolen, adequate measures must be taken to protect the data should the equipment fall into the wrong hands. Another possibility is to store the master set of files in a cloud-based service; however, I believe that would be unwise due to the risks of the service failing or being subject to viruses or hacking. A cloud-based service may be suitable for backup, though external hard drives or SSD flash drives are cheap and effective enough for the purpose.

PAWDOC: Relationship between Work Patterns and Filing Activities

Dealing with documents is an inevitable part of office work, so filing work practices are not unusual – everyone has them. It’s just that if you operate a Reference Number-based filing system, those work practices may be a little different. In particular, every document has to be digitised, recorded in the Index, and placed in a single store.

Getting digitised documents is very much easier than it used to be. To start with, most documents are created and distributed electronically, so the amount of hardcopy is much reduced; and scanners these days are much cheaper and faster – and available in most of today’s offices. So, even if you are away from base, it will usually be possible to digitise hardcopy.

Recording a document in the Index is relatively quick and simple to do, provided that the number of Index fields is kept to a minimum. Then it is just a matter of creating a folder for the newly created Index entry; renaming the file to include the Reference Number and a short descriptive title similar to that specified in the Index Title field; and moving the renamed file to the newly created folder.
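
The three filing steps described above (create the folder, rename the file, move it) are simple enough to script. The following is a minimal sketch rather than the procedure actually used for PAWDOC: the function name is hypothetical, and replacing slashes in the Reference Number with hyphens is an assumption made because a slash cannot appear in a Windows file or folder name.

```python
import shutil
from pathlib import Path


def file_document(source: Path, store: Path, reference: str, title: str) -> Path:
    """Create the folder for a reference, then rename and move the file into it."""
    label = reference.replace("/", "-")   # assumption: '/' is not legal in names
    folder = store / label
    folder.mkdir(parents=True, exist_ok=True)
    # New name: Reference Number, then the short title, keeping the file extension.
    target = folder / f"{label} {title}{source.suffix}"
    if target.exists():
        raise FileExistsError(f"already filed: {target}")
    shutil.move(str(source), str(target))
    return target
```

Refusing to overwrite an existing file is a deliberate choice in this sketch: in a long-lived collection, silently replacing a document is a worse outcome than having to pick a different title.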

As with most regularly occurring activities, the less often you do it, the more the backlog builds up. I am firmly of the opinion that to operate this kind of filing system effectively, new documents should be put into the system as soon as they are received – or at least as soon as possible after that. However, whatever approach is taken, you will inevitably have to spend some time on the filing activity. If you want to reduce the time you spend, two possible strategies are, a) decide not to collect everything but only to collect documents on certain topics, or from certain people etc., so that the number of documents you need to file and manage is reduced; b) expand the scope of some index entries so that they will accommodate a greater number of documents, thereby reducing the number of new index entries and new folders that have to be created.

The benefit of having a digital filing collection is that, with today’s modern laptops and high capacity, cheap, storage, you can carry your information around with you and access it wherever and whenever you want. The commensurate downside of this is that your digital file store becomes a very precious commodity which needs to be protected and regularly backed-up in the event of loss or theft.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q36. How does this approach to filing affect work patterns?

2001 Answer: Experience gained: There are two main impacts. Firstly, it entails the regular indexing of new items as they are received and/or read. However, this is not an absolute necessity since, as with any type of filing system, the new items can be piled up to be input in bulk sometime later. It is just much easier and more effective to do a little often (Wilson 1990: 97). Secondly, it offers the opportunity to be much more sophisticated about capturing nuggets and developing knowledge. This almost certainly would affect work patterns, though it is not known how at present (Wilson 1997: 3 – 4).

2019 Answer: Fully answered: I am even more convinced than I was in 2001 that it is far better to include new items in the collection as soon as they are received and not to let documents accumulate. Scanning is no longer such a problem if you are away from your home base – a networked scanning capability can probably be found in most offices. If not, however, hardcopy documents have to be kept in a folder until you get back to your scanner.

It is probably best to keep a stock of the most recently acquired hardcopy (which you will already have digitised) in case you need them as working documents. This will entail having a designated box or drawer and managing it by eliminating the oldest when space becomes short. It is probably not worth recording in the index the existence of such a small specific stock of hardcopy.

Given that every item in the filing system will be digitised and stored on your laptop, it will be possible to carry around your entire collection and access it in any location. Such an accessible and useful store will inevitably become very precious, therefore you may need to take special measures to protect access in the case of loss or theft of the laptop; and you will need to maintain constant and effective backup arrangements.

Q37. What strategies can be employed to minimize user effort and maximize user motivation?

2001 Answer: Experience gained:

  • Don’t attempt backfile conversion.
  • File little and often, not lots infrequently.
  • Minimize the number of fields to input when creating index entries.

2019 Answer: Fully answered: In addition to the strategies identified in 2001 (no backfile conversion, file little and often, and minimise the number of fields) I would add the following:

  • Identify the categories of the documents you receive that you could do without and don’t file them.
  • Expand the scope of your Reference Numbers so that you can put more files in those folders without having to create new index entries.
  • Store just URL references for certain less important material so that you don’t have to copy text, create documents and save them to the digital store.

Note that the three strategies above involve making reductions in the overall set of documents that you file in the course of your work. This is dangerous because it’s difficult to predict which documents you will or won’t need to refer to in the long term. However, it’s a calculated risk based on your knowledge of what information you want and need to keep; and it may be a worthwhile risk if it reduces the filing load and gives you more motivation to keep abreast of the filing work.

PAWDOC: Hardcopy/Electronic mix

The intention of my colleague, John Pritchard, and myself when setting up our filing systems to be Reference Number-based, was to explore what it would be like to operate in an electronic office. Unfortunately we didn’t have the electronic tools – indexing application, scanner, and document management system – to do the job properly. So my PAWDOC filing system started off being totally paper-based.  I moved the Index into a database in 1986, but it wasn’t until 1996 that I acquired a scanner and a Document Management System (DMS); so up until then my operating practices had been moulded around the needs of paper and even the electronic files I was creating and receiving were being indexed and stored in paper form (though usually with an electronic copy being kept in the Windows Explorer folder system).

This equilibrium started to change when I got the scanner and DMS. Gradually the emphasis started to move towards digital files. Whereas before I wanted to put paper into PAWDOC (giving rise to significant space problems) now I increasingly wanted to digitise and destroy paper. This change was also stimulated by the growing use of computers in the office, the rise in production of born digital documents, and their distribution by increasingly popular email systems.

In both these eras, however, I had had to manage a mix of paper and electronic files. Pre-1996 before I got the DMS, I mainly used paper and the electronic files were held in the background. From 1996, when I started using the DMS, I used a field in my Index to indicate whether I had a paper or electronic file, which enabled me to look in the correct place for one or the other. This worked well and I did operate effectively using a mix of the two media.

By the time I retired in 2012, I considered that the way paper was being used had changed completely. It was fast becoming just a secondary working medium, with the primary medium for creating and storing documents being electronic. I believe the transition to this third era is now just about complete. It looks as though we’ll be using paper for the foreseeable future, and we will need to be capable of working with and managing both electronic and hardcopy documents; but, for the purposes of filing, there are only three types of material that need to be catered for:

  • Electronic copies of every item in the filing system
  • Hardcopy working documents – kept for a relatively short period
  • Hardcopy artefacts – significant or unusual documents that are wanted in their original physical form

The latter two categories are relatively small subsets of material – the actual size of which will be dictated by the inclinations of the filing system owner.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q33. How do you manage items that can exist in both hardcopy and electronic form?

2001 Answer: Experience gained:

  • Ensure there is an explicit marker in the index that indicates if there is an electronic file or hardcopy or both of a particular item.
  • Before throwing the paper away check that any annotations, post-it notes etc are already included in the electronic document. If they are not, then either type or scan them in (Wilson 1995c).

2019 Answer: Fully answered: Electronic and hardcopy files can be managed in the same filing system by diligent use of the Index. In the PAWDOC system, those documents that are being retained long-term in hardcopy form (because they are significant or have unusual characteristics) have the abbreviation PHYS (for physical) in the ‘Movement Status’ field. It is also advisable to include a similar indicator in the file title of the equivalent digital document so that there is clarity throughout the system about what hardcopies exist.
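
The clarity this convention aims for can be checked mechanically. The sketch below is an illustration under assumptions: it supposes the index records are available as dictionaries, that the digital file's title is held in a field here called 'File Title' (a hypothetical name), and that the indicator in the file title is the same PHYS abbreviation used in the ‘Movement Status’ field.

```python
def hardcopy_mismatches(records: list[dict[str, str]]) -> list[str]:
    """Return the Reference Numbers where index and file title disagree about hardcopy."""
    mismatches = []
    for record in records:
        in_index = "PHYS" in record.get("Movement Status", "")
        in_title = "PHYS" in record.get("File Title", "")   # hypothetical field name
        if in_index != in_title:
            mismatches.append(record["Reference Number"])
    return mismatches
```

Any reference returned is one where either the Index or the file title needs correcting so that both tell the same story about whether a hardcopy exists.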

Should an individual wish to keep some working documents in hardcopy form for a short period for use in meetings or to annotate them etc., then it is feasible to just keep them in a file or box after digitising them either without making any entry in the index (because they should only be a small subset of material which the individual should be familiar with); or with an indication in the index that a hardcopy is also being retained (though this imposes an additional management overhead to insert the indicator and to remove it when the hardcopy is destroyed).

Q34. Is it effective to manage electronic and paper files together?

2001 Answer: Partially answered: Definitely. It is much easier and faster than having separate indexes for each medium (Wilson 1995c). In any case, since an individual has only one overall knowledge base, integrated support for that knowledge base should be provided regardless of the different media that parts of it are stored on.

2019 Answer: Fully answered: Yes, electronic and hardcopy files can be effectively managed in the same filing system. I have had extensive experience of doing so over the last 20+ years; it just requires diligent use of the Index. Everything is digitised so the default status is that there is a digital file but no hardcopy. If a hardcopy is retained as well, that information is recorded in the index.

Q35. Is it necessary to keep paper if an equivalent electronic file is available?

2001 Answer: Partially answered: Paper is still needed – even if an equivalent electronic file exists – in at least two circumstances. First, when the paper is to be used in meetings or in other situations in which it is not convenient to use a laptop computer (Wilson 1995b: 113, 114); and second, when you want to keep artefacts in their original form – be that paper, CD, videotape etc. If neither of these reasons applies, the paper can be destroyed and the ultimate scanning prize can be won – the freeing up of a large amount of physical space (Wilson 1997: 3).

2019 Answer: Fully answered: It isn’t absolutely necessary to keep paper if an electronic version is available, but individuals may wish to have both versions for two types of material: documents that you may need to use in the near future and that you prefer to work with in hardcopy format; and documents which are not easily replicated in the electronic environment or which you believe are so significant or unusual as to merit being worth keeping in their original format. The specific documents that fall into each category (if any), and the size of the resulting subsets, will depend on the needs and inclinations of each individual.

PAWDOC: Archiving

For the first fifteen years of using the PAWDOC system, I didn’t have the capability to scan documents or to manage electronic documents. Hence, in those years the filing system was oriented around hardcopy which is inherently bulky when it builds up over time. It wasn’t long before I ran out of space in the upright filing cabinet next to my desk and I was forced to select a subset of documents to put in boxes and store elsewhere in my office. As time went by I began to run out of space for the archive boxes, so I started putting the oldest ones into the company store.

Over this period I established an archiving routine and honed it until I had got it down to a standard procedure. I used a field in the Index to record when I accessed a document, and if there was no entry in that field I considered that document to be a candidate for archiving. I also put an indicator in the Index when a document was archived so that I knew where to look if I needed it. In the same way that documents were stored in the filing cabinet in Reference Number order, the archive documents were also stored in the boxes in Reference Number order, to provide a reliable way of finding an archived document.

When I got a scanner and a Document Management System in 1996, my modus operandi changed. I started to scan every new document as I included it in the PAWDOC system at the same time as attempting to scan the huge backlog of hardcopy documents. After each scan I took a decision as to whether to destroy the hardcopy or to keep it – and more often than not I chose to destroy it. This then was a significant turning point when archiving was replaced by digitising. Of course, the digital route was still not that straightforward because computer systems were relatively slow, and digital storage was limited and expensive. However, as time went by and technology improved these shortcomings were minimised. Today, it’s possible to buy a memory stick with sufficient storage for the whole of the PAWDOC collection – a lifetime of work documents – for less than £15.

Today there is still a need for hardcopy documents, but only for special or working documents. Digital versions are sufficient for the bulk of a personal collection. Consequently, there is no longer a need for bulky and growing sets of hardcopy material, and no longer a need to do any more archiving.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q31. Is it necessary to archive paper documents?

2001 Answer: Experience gained: It is only necessary to archive paper documents if:

  • You cannot scan them
  • You have not got enough space for the artefacts you want to keep in their original form.

2019 Answer: Fully answered: For paper-based collections, archiving was essential unless one had a very, very large office or study. However, digitisation now provides a store of virtually unlimited size, and, in my experience, only a very small subset of documents will be deemed to be sufficiently significant or unusual as to require being kept for the long term in both hardcopy and digital form. These subsets of precious hardcopy should be small enough to be kept in the average office or study. In practice, therefore, with modern systems, archiving is no longer necessary.

Q32. If it is necessary to archive, how do you do it and how long does it take?

2001 Answer: Fully answered: Implement a ‘date last accessed’ facility, whereby whenever an index entry reference number is accessed (to obtain the hardcopy or electronic document) you automatically record the current date. Archiving can then be performed on those index entries that have no date in the ‘date last accessed’ field. The whole process of using the index to select items to be archived, marking the selected index entries as ‘archived’, removing the items from the physical file and boxing them up takes approximately two minutes per item archived (Wilson 1992b: 2.10).

2019 Answer: Fully answered: With modern systems, I believe there is no longer a need to archive. However, if I did have to do so, I would probably use the method I developed in the 1980s and 90s – maintain a Date Last Accessed field and use that to identify documents I’m not using and which can therefore be digitised and/or archived.
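
The ‘date last accessed’ method can be sketched in a few lines of Python. This is only an illustration, not the Filemaker implementation used for PAWDOC: the IndexEntry class, its field names and the sample reference numbers are all hypothetical, and the two-minutes-per-item figure is simply the estimate quoted in the 2001 answer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class IndexEntry:
    ref_no: str                                  # hypothetical reference number format
    title: str
    date_last_accessed: Optional[date] = None    # empty until the document is retrieved
    archived: bool = False                       # indicator showing the item is in an archive box


def record_access(entry: IndexEntry, today: date) -> None:
    """Stamp the entry whenever its document is retrieved."""
    entry.date_last_accessed = today


def archive_candidates(index: list[IndexEntry]) -> list[IndexEntry]:
    """Entries never accessed and not yet archived, in reference number order."""
    candidates = [e for e in index if e.date_last_accessed is None and not e.archived]
    return sorted(candidates, key=lambda e: e.ref_no)


# Made-up sample data
index = [
    IndexEntry("0003", "Conference notes"),
    IndexEntry("0001", "Project plan"),
    IndexEntry("0002", "Supplier brochure"),
]
record_access(index[1], date(1992, 5, 14))       # the project plan has been used

to_archive = archive_candidates(index)
for entry in to_archive:
    entry.archived = True                        # mark the index entry as archived

archived_refs = [e.ref_no for e in to_archive]
estimated_minutes = 2 * len(to_archive)          # roughly two minutes per item (2001 answer)
print(archived_refs, estimated_minutes)
```

Sorting the candidates by reference number mirrors the way the archive boxes themselves were kept in Reference Number order.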

PAWDOC: Use of Information

The practice of sidelining text in articles, papers, and books is not uncommon and is something I started doing in the late 1970s – primarily to assist the writing of technical books when I was working at the National Computing Centre. I started to use my PAWDOC filing system in 1981, so, by the late 80s I was aware that a) there were a lot of documents in my filing system that I hadn’t looked at for several years, and b) inside all these documents were a large number of sidelined significant points. I started to think about these points as nuggets of information, some of which perhaps were the bedrock of my developing ideas, but others of which perhaps I had simply forgotten. I wondered also if an explicit examination of all these nuggets might prompt inter-relationships to be identified and new ideas to be developed. I considered this to be a potentially valuable spin-off from all the effort expended in operating a very comprehensive personal filing system. The notion of actually making use of the information that PAWDOC contained instead of having most of it just lie there statically, was very attractive.

Consequently, I looked for some simple and inexpensive tools to try out these ideas, and came across Mind Mapping software on the free disks issued with PC magazines in the late 1990s. I started experimenting with one of them but found it was too bitty just pulling nuggets out of the odd paper – I felt I needed a whole set of material to work with. So I created Mind Maps for 19 esoteric books (on subjects such as The Great Pyramid), but found it too difficult to inter-relate the different Mind Maps. That’s as far as I got.

I’m still not sure if there is any merit in explicitly managing nuggets to either just cement them in one’s mind, or to inter-relate them and develop new ideas; and, I must admit, I’ve never done any serious book review to see what other work, if any, has been done on this subject. The retrospective work that Peter Tolmie is planning to do with me may throw a little light on the impact that such nuggets may have had on me – but that’s as far as it goes. My current view is that explicitly managing nuggets is probably not worth doing, and that adding extra tasks to the job of managing a personal filing system may well be the straw that breaks the camel’s back.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q27. How can an electronic filing system be used to develop and use knowledge?

2001 Answer: Ideas formed:

  • Include substantive information in the index entries, for example phone numbers, book references, and expense claim amounts.
  • Identify the nuggets of information (i.e. the valuable bits) when you first read a document (Wilson 1997: 3 – 4).
  • Capture and structure the nuggets into the overall nugget-base at the same time as indexing the item (Wilson 1997: 3 – 4).

2019 Answer: Ideas formed: The first point in the 2001 answer to this question (‘include substantive information in the index entries’) has proved useful: so much so that I started including substantive information in the file names of digital documents (for example, total amounts claimed in the file names of expense claim spreadsheets).

With respect to nuggets, those ideas emerged from my practice of sidelining text that I thought significant. I developed the idea that these pieces of information (which I called nuggets) might be picked out, recorded and combined with other nuggets to produce novel ideas and concepts. I explored technology options that might assist in this process and decided Mind Mapping software might be worth trying, and tried it out with a variety of esoteric books that I was reading at the time. Although I Mind Mapped 19 books I never took it to the next stage of combining them to see if I could develop any new concepts – there seemed to be no easy and effective way of doing so. That is as far as I got with this notion. I’m hoping that the experimental work I’m planning to do with Peter Tolmie on this subject might indicate if there is any merit in exploring these notions further or not.

Q28. What is the best way to capture and structure information nuggets?

2001 Answer: Ideas formed: By using a Concept Development tool. Some initial prototyping has been done using the Visual Concepts package and the eMindMaps package.

2019 Answer: Experience gained: I explored the use of Concept Development tools for this purpose by using the eMindMaps tool to capture nuggets from 19 separate esoteric books (on subjects such as The Great Pyramid). I found that, although it was probably quite a good way of summarising a book (or article or paper) on one page, there was no easy or effective way to combine several mind maps together or to relate an item on one mind map to an item on another mind map. So I concluded that such tools were not going to be an effective nugget management solution. I’m not intending to explore this any further; however, if I did, I would look into the collaborative concept development tools that I know were being explored by the CSCW community in the 1990s and from which commercially available software might have emerged by now. An alternative, much more feasible, solution might be simply to accumulate the nuggets in a spreadsheet. I guess the question of what tool to use is very much tied up with what one wants to do with the nuggets and what benefits can be achieved by working with them.

Q29. Is it feasible and practical to capture and structure information nuggets as well as indexing items?

2001 Answer: Not started

2019 Answer: Not started

Q30. Is it worthwhile building and developing an information nugget base?

2001 Answer: Not started

2019 Answer: Ideas formed: I’ve always thought there was value in the nuggets I sidelined in articles and papers – which is why I started exploring this topic in the first place. However, whether there is any value in working with them in any way at all (either just accumulating them in a spreadsheet to cement each point in one’s mind, or inter-relating them in a specialist tool to develop new ideas) is still an unknown.

PAWDOC: Searching

The ability to search for and find a document is just about the most important aspect of a filing system, and that capability has undoubtedly been improved by increases in the power of the modern computer. For example, when the index for the PAWDOC collection was computerised in 1986 it took 228 seconds to conduct a standard search on 3200 records. By 2001, that standard search conducted on over 14,000 records took 7 seconds. That same search, now performed on over 17,000 records, takes less than a second on my current laptop – in fact it’s virtually instantaneous.

Of course, search speed is only half the story, since a targeted document also has to be selected and retrieved. The fact that whole digitised collections can be held on a modern laptop means that this second part of the process can also be very quick. In fact the total end-to-end search and retrieval time for the current PAWDOC system is typically between 15 and 30 seconds.

However, speed is not the most important element of a successful search. Instead it is the ability to find what you are looking for – whether you are after a specific document or just doing a general search to see what you have on a specific subject. In the PAWDOC system, searches are conducted on the collection’s Index, so success is critically dependent on there being a match between the words specified in the search query and the words in the index entry of the targeted document(s). In a personal system, both elements – index entry and search query – come from a single individual’s mind, so more often than not a match is achieved. However, inevitably there are cases where a match isn’t achieved first time – and, sometimes, not at all. There are a variety of reasons for this including the passage of time (I placed my first document into PAWDOC some 38 years ago), and changing terminology as one becomes more familiar with a topic or as technology develops. To cope with these problems, strategies such as using terminology appearing in unsuccessful searches, and adding keywords to index entries when an item is eventually found, can be helpful.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q22. How long does it take to find items in the filing system?

2001 Answer: Partially answered:

  • Using the ‘t f o a d r’ test (find every record with the combination of t, f, o, a, d and r somewhere in it) on the 1988 Filemaker Index with 3200 records running on a Macintosh computer, 27 hits were found in 228 seconds. The equivalent test on the 2001 Filemaker Index with 14111 records running under Windows ’98 on a Pentium II PC, took seven seconds to identify 612 hits. A search for all records containing the syllable ‘man’ across the same two systems took four seconds to identify 211 hits in the 1988 system, and one second to identify 1137 records in the 2001 system (Wilson 1992a: 38, 85).
  • Having identified the correct index entry (in the 2001 system) it takes between 10 and 20 seconds to obtain the reference number, go to the hardcopy cabinet/box, find the item and pull it out.
  • Should the index entry concerned refer to an electronic file, it takes about 6 – 10 seconds in the 2001 system to have the document management software display the relevant folder, to double click the required item and to have it open up in the relevant application.
  • The total end-to-end retrieval time for the 2001 system is about 10 – 30 seconds for hardcopy and 10 – 20 seconds for electronic files.
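
For readers who want to try a comparable benchmark on their own index, here is a minimal Python sketch of the ‘t f o a d r’ test. It assumes the test means "find every record that contains all six letters somewhere in it", which is my reading of the description above; the sample records are made up, and the timings it produces will say nothing about Filemaker's own performance.

```python
import time


def contains_all(record: str, fragments: list[str]) -> bool:
    """True if every fragment occurs somewhere in the record (case-insensitive)."""
    text = record.lower()
    return all(fragment.lower() in text for fragment in fragments)


def timed_search(records: list[str], fragments: list[str]) -> tuple[list[str], float]:
    """Return the matching records and the elapsed time in seconds."""
    start = time.perf_counter()
    hits = [r for r in records if contains_all(r, fragments)]
    return hits, time.perf_counter() - start


# Made-up index entries standing in for a real index
records = [
    "Draft report on office automation strategy",
    "Meeting notes",
    "Human factors study of desktop terminals",
]

hits, seconds = timed_search(records, list("tfoadr"))   # the six single letters
man_hits, _ = timed_search(records, ["man"])            # the syllable test

print(len(hits), "hits in", round(seconds, 6), "seconds")
print(len(man_hits), "hit(s) for 'man'")
```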

2019 Answer: Fully Answered: Using the ‘t f o a d r’ test (find every record with the combination of t, f, o, a, d and r somewhere in it) on the 2019 Filemaker Pro 15 Index with 17294 records running on a Chillblast Intel i7 computer with 8Gb of RAM, 1224 hits were found in less than a second – in fact, almost instantaneously. A search for all records containing the syllable ‘man’ across the same system took less than a second to identify 1538 hits (in fact it too was almost instantaneous).

Having identified the correct index entry, it takes between 10 and 15 seconds to copy the reference number, go to the Windows File Explorer screen, select the main PAWDOC folder, paste the number into the search field, press enter, and double click the folder when it appears. Since there may or may not be multiple files in the folder, and since different files may require different applications which probably open at different speeds, it is difficult to provide a reliable figure for selecting a file and opening it. However, as a very rough guide it is likely to take between 3 and 15 seconds.

Therefore the total end to end retrieval time in the current system is approximately 13 – 30 seconds.

Q23. What can you do to speed up retrieval?

2001 Answer: Not started

2019 Answer: Fully Answered: There are two key factors that affect retrieval times – Physical Proximity and System Integration. Physical Proximity relates to how close you are physically to the system and the hardcopy and/or digital documents. If you haven’t got the system with you then you can’t even identify the document you require, let alone retrieve it regardless of whether it is a hardcopy or digital document. If you are able to identify the document you require, and it is a hardcopy document, then the closer you are to the hardcopy documents the faster retrieval is likely to be (for example, retrieval will be faster if the hardcopies are in the same room that you are in as opposed to in a room down the corridor). If the document you require is an electronic document, then it will probably be a little faster to retrieve if it is on the same system you are using to search for documents, than if it is on some remote server elsewhere. Therefore, from a Physical Proximity perspective, retrieval can be speeded up by making sure that the Index and the digital store and any hardcopies are all as close to the user as possible.

System Integration refers to the linkage between the searchable Index and a collection’s digital files. Zero integration requires the user to remember the Reference Number selected in the search process, to go to the database of digital files, and to use the Reference Number to open the relevant folder. In contrast, a very high level of integration might be achieved by having the files stored under a particular Reference Number appear somewhere in the Index screen for that Reference Number, and being able to open a particular file from there. A halfway house might be to have a macro which will use a Reference Number identified in the index to automatically open up the folder of that particular Reference Number. Therefore, from a System Integration perspective, retrieval can be speeded up by reducing the keystrokes required to go from selecting an Index entry to viewing the files associated with that Index entry.
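
The "halfway house" macro could look something like the following Python sketch. It is an illustration under stated assumptions rather than anything that exists in the PAWDOC system: it assumes each Reference Number has a folder whose name is exactly that number somewhere under a main collection folder, and the function names and the "REF-0042" style of number are hypothetical.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Optional


def find_reference_folder(root: Path, ref_no: str) -> Optional[Path]:
    """Return the folder under root whose name equals ref_no, or None if absent."""
    for path in sorted(root.rglob("*")):
        if path.is_dir() and path.name == ref_no:
            return path
    return None


def open_folder(folder: Path) -> None:
    """Open the folder in the platform's file browser."""
    if sys.platform.startswith("win"):
        os.startfile(folder)                              # available on Windows only
    elif sys.platform == "darwin":
        subprocess.run(["open", str(folder)], check=False)
    else:
        subprocess.run(["xdg-open", str(folder)], check=False)


# Demonstration on a temporary folder tree (nothing is opened on screen here)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "batch-01" / "REF-0042").mkdir(parents=True)
    found = find_reference_folder(root, "REF-0042")
    found_name = found.name if found else None
    missing = find_reference_folder(root, "REF-9999")

print(found_name, missing)
```

In real use the macro would call open_folder on the result, replacing the copy, paste, search and double-click steps with a single action.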

Q24. In what circumstances are searches conducted?

2001 Answer: Partially answered:

  • ‘Start Work’: focused assembly of information while under no pressure.
  • ‘Mid Work’: a search for a specific piece of information while pre-occupied with the interrupted activity.
  • ‘Visitor’: a search while you are talking to someone at your desk.
  • ‘Phone Call’: a search while you are on the phone to someone.

2019 Answer: Fully Answered: There are probably three intersecting dimensions to the circumstances in which searches are conducted: Activity (what you’re doing at the time you conduct the search); Work Content (the topic you are working on when the search is conducted); and Location (the type of place in which the search is conducted).

Four different types of Activity are described in the original BIT answer in 2001 (Start Work, Mid Work, Visitor, Phone Call) – though I have no data on the relative frequency of each of those. However, from experience I would guess that Mid Work occurred most often with the frequency of Start Work, Phone Call and Visitor occurring in that descending order.

Work Content might be the subject you are looking into, or the project you are working on, or the organisation you are working for, or any other categorisation that summarises the type of work being undertaken. Again, no data is available to identify what types of work content have been most associated with the searches made on the PAWDOC collection.

Location can be categorised as Employer’s Office, Other Organisation’s Premises, Travelling, and Home. I know that, over the years, I have indeed conducted many searches at all these types of location.

Q25. What are the most common types of searches?

2001 Answer: Partially answered:

  • The ‘Familiar Item’ search for an item that you have accessed several times before.
  • The ‘Long Lost Friend’ search for an item you are sure is there but have not accessed recently.
  • The ‘Shot in the Dark’ search to see if there is any material on a subject.
  • The ‘Literature Search’ to find everything you have on a subject.

2019 Answer: Partially answered: The 2001 BIT answer provides one perspective on the most common types of searches conducted on the PAWDOC collection (the ‘Familiar Item’ search; the ‘Long Lost Friend’ search; the ‘Shot in the Dark’ search; and the ‘Literature Search’). However, another perspective might be to categorise the types of document content most frequently searched for. This analysis might be feasible to perform using the ‘Date Last Accessed’ field in the Filemaker Index. Although this may not be an entirely accurate record of which items have or haven’t ever been searched for, it nevertheless does provide some sort of indication. Therefore, by categorising the 4551 records which have an entry in the Date Last Accessed field (out of a total of 17,294 records) and ranking the categories by number of occurrences, some indication will be gained of the types of documents that have been searched for and their relative frequency.
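
The proposed analysis could be sketched as follows in Python. The category labels, field names and sample records are invented for illustration (the real categorisation has not been done), and the sketch assumes the Index has been exported as a list of records; only the two record counts come from the text above.

```python
from collections import Counter

# Made-up export of index records; an empty date means "never accessed"
records = [
    {"ref_no": "0001", "category": "Project report", "date_last_accessed": "2004-03-02"},
    {"ref_no": "0002", "category": "Journal article", "date_last_accessed": "2011-07-19"},
    {"ref_no": "0003", "category": "Project report", "date_last_accessed": "1999-11-30"},
    {"ref_no": "0004", "category": "Meeting notes", "date_last_accessed": None},
    {"ref_no": "0005", "category": "Project report", "date_last_accessed": None},
    {"ref_no": "0006", "category": "Journal article", "date_last_accessed": "2015-01-08"},
    {"ref_no": "0007", "category": "Project report", "date_last_accessed": "2018-06-21"},
]

# Keep only records that have an entry in the Date Last Accessed field
accessed = [r for r in records if r["date_last_accessed"]]

# Rank the categories by number of occurrences, most frequent first
ranking = Counter(r["category"] for r in accessed).most_common()

# Share of the real index with an entry in the field (counts quoted in the answer above)
share_accessed = 4551 / 17294

for category, count in ranking:
    print(f"{category}: {count}")
print(f"{share_accessed:.1%} of records have a Date Last Accessed entry")
```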

Q26. What are the most effective search strategies?

2001 Answer: Ideas formed:

  • Get into the habit of searching the filing system when you need some information even when you don’t think you have anything relevant; after several years you forget what you have (Wilson 1992a: 4, 25).
  • Let your mind roam freely when selecting search words; you are more likely to come up with words you originally specified as keywords (Wilson 1992a: 4).
  • Specify searches with minimal parts of words to avoid problems where spelling errors have been made.

2019 Answer: Fully Answered: In addition to the suggestions made in the 2001 BIT article (search just in case, let your mind roam freely, and use minimal parts of search words), I would add:

  • Use any older terminology you can think of if your current terminology isn’t coming up with the goods;
  • Check the results of unsuccessful searches to see if there are any terms which you might try in subsequent searches;
  • If eventually you are successful in a search that has taken some time, consider adding some additional search terms to the index entry to give yourself a better chance of quicker success in the future.
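
Two of these strategies (searching on minimal parts of words, and adding search terms to an entry once it has finally been found) can be illustrated with a small Python sketch. The index contents, reference numbers and function names are hypothetical, and simple substring matching is assumed as the search mechanism.

```python
def search(index: dict[str, str], *fragments: str) -> list[str]:
    """Reference numbers of entries whose text contains every fragment."""
    wanted = [f.lower() for f in fragments]
    return sorted(
        ref for ref, text in index.items()
        if all(f in text.lower() for f in wanted)
    )


def add_keywords(index: dict[str, str], ref_no: str, *keywords: str) -> None:
    """Append extra search terms to an existing index entry."""
    index[ref_no] = index[ref_no] + " " + " ".join(keywords)


index = {
    "0101": "Teleconferncing trial results",   # deliberate spelling error in the entry
    "0102": "Office automation survey",
}

full_word = search(index, "teleconferencing")  # misses because of the spelling error
partial = search(index, "telecon")             # a minimal part of the word still matches

before = search(index, "video")                # newer terminology finds nothing
add_keywords(index, "0101", "videoconferencing")
after = search(index, "video")                 # found first time from now on

print(full_word, partial, before, after)
```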

PAWDOC: Scanning

Despite today’s widespread use of computers and email to create and distribute electronic documents, there has been little, if any, reduction in the use of paper in business. So, office workers are likely to receive, annotate or even handwrite paper documents for the foreseeable future; and scanning remains an integral part of any filing regime. However, it is a much more straightforward activity than it used to be: scanners are now fast, effective and cheap and capable of digitising most documents. Contrast and brightness settings can deal with faint or over-emboldened documents; and colour scanning is no longer problematic – nearly all scanners can scan colour and the larger file sizes produced are no longer an issue as storage has become cheap and plentiful. Even large document sizes are no longer the issue they used to be because, if they are too large to be scanned on the equipment you have, they can simply be photographed with today’s high resolution cameras in mobile phones to produce easily readable images which can be enlarged at will.

The time it takes to scan a document is highly dependent on the model of scanner that is used and the size of document being scanned, so I can only provide timings for the particular scanner that I use – a 5 year old Canon DR-2020U A4 scanner. It takes roughly 1.5 – 2 minutes for a 1 page colour document; and roughly 2.5 – 3 minutes for a 5 page double sided colour document. These are overall times which include preparing the paper, specifying the name of the file containing the scanned documents, and saving the file to a particular Windows folder.

Experience has shown that working with scanned documents is perfectly doable – the image quality on today’s screens is good and, in any case, the image can be enlarged if necessary. Of course, it is much easier to do so at a fixed workstation – perhaps with a large screen; having to work with scans on a laptop in meetings or while travelling is inevitably a little less comfortable, especially as it isn’t really possible to fit the whole of an A4 page in portrait mode on a typical laptop screen and be able to read it easily. More often than not users will enlarge the image so that only half a page is showing on the screen and then scroll up and down. Having said that, I have found that the iPad – and presumably other tablets – overcomes this problem. Whole A4 scanned pages can be presented in full on an iPad in portrait mode, and are very clear and fully readable. The iPad is light to hold and it is easy to navigate through multi-page documents; and it can be purchased with sufficient storage as to hold all the digital documents in a collection such as PAWDOC. This would seem to be the best way to put a digitised filing system to use – though great care would have to be taken to look after it as the very characteristics that make it so useable also make it extremely vulnerable to loss or theft.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q15. How long does it take to scan documents?

2001 Answer: Fully answered:

  • About 30 seconds for a black & white page. (Chan 1993: 28)
  • 40 – 50 seconds for a black & white page (Wilson 1997: 3)
  • 30 – 50 seconds a page for dual journal pages on a flatbed scanner (Wilson 1996b)

2019 Answer: Fully Answered: Using a Canon DR-2020U scanner purchased in 2013 and a Chillblast i5 computer with 8Gb of RAM and 1Tb SSD drive purchased in 2018, scanning time for a 5 page double sided colour document is about 70 seconds; and for a single colour page about 25 seconds. However, this doesn’t tell the whole story because the scanning process also involves preparing the pages and storing the output. Approximate timings for all of the different process elements are as follows:

  a) Prepare the paper, i.e. remove any staples or Post-it notes, get pages aligned etc. (5 seconds if there are no problems – but could take much longer);
  b) Start the scan – place the pages in the scanner, select the appropriate scanning job – Colour or B&W etc. – and press START (10 seconds);
  c) Perform the scan (70 seconds for 5 double sided colour pages, or 25 seconds for a single colour page);
  d) Create the file title – paste the Ref No (which had been copied in the indexing process) into the file title box and type in the rest of the file title (15 – 30 seconds);
  e) Save the file – select the folder into which the document is to be placed and press SAVE (20 – 30 seconds);
  f) Check the folder – check that the file has been stored in the correct place with the correct title (15 seconds);
  g) Check the file – open the file to check it has been scanned correctly (only for multi-page documents or for documents where there may be an issue) (15 seconds).

In summary, overall scanning time for a 1 page A4 colour document is roughly 1.5 – 2 minutes; and for a 5 page double sided A4 colour document about 2.5 – 3 minutes.
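
The overall figures can be checked by adding up the step timings given in the 2019 answer. The short Python sketch below does this; the step names and the decision to count the optional file check only for the multi-page document are my own assumptions about how the totals were reached.

```python
# Step timings in seconds as (minimum, maximum), taken from the 2019 answer
COMMON_STEPS = {
    "prepare the paper": (5, 5),
    "start the scan": (10, 10),
    "create the file title": (15, 30),
    "save the file": (20, 30),
    "check the folder": (15, 15),
}
CHECK_FILE = (15, 15)   # only for multi-page or doubtful documents


def total_seconds(scan_seconds: int, check_file: bool) -> tuple[int, int]:
    """Minimum and maximum end-to-end time for one scanning job."""
    steps = list(COMMON_STEPS.values()) + [(scan_seconds, scan_seconds)]
    if check_file:
        steps.append(CHECK_FILE)
    return sum(low for low, _ in steps), sum(high for _, high in steps)


single_page = total_seconds(25, check_file=False)   # 1 page colour
five_pages = total_seconds(70, check_file=True)     # 5 pages double sided colour

for label, (low, high) in [("1 page", single_page), ("5 pages", five_pages)]:
    print(f"{label}: {low}-{high} seconds ({low / 60:.1f}-{high / 60:.1f} minutes)")
```

On these assumptions the totals come to 90 to 115 seconds for the single page and 150 to 175 seconds for the five page document, which is consistent with the rounded figures quoted above.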

Q16. What are the practical problems associated with scanning?

2001 Answer: Partially answered:

  • Scanning double-sided pages is time-consuming and error prone (Wilson 1995b: 121, 1995c: 1).
  • Paper jams and misfeeds sometime occur when using the scanner sheet feeder (Wilson 1995c: 1).
  • Some documents need preparation work prior to scanning, for example, removing staples or guillotining bound documents (Chan 1993: 34 – 37).

2019 Answer: Fully Answered: Practical scanning problems are of two types – those that are inherent to the scanning activity; and those that are due to shortcomings in the scanner being used. The main inherent scanning problem is the need to ensure that the papers being scanned are free from staples, Post-it notes and any other additional items which will jam as the paper passes through the scanner. Problems due to scanner shortcomings may be due to a lack of capability in the scanner concerned such as no sheet feeder; the inability to scan large pages such as A3 (A3 scanners are available but they’re big and more expensive); and inability to scan double sided pages (a particular problem I had when I started scanning but I’m glad to say that my current scanner scans both sides of a page at the same time as it makes a single pass through the scanner). Scanner shortcomings may also include capabilities in which technology developments are continually making inroads, for example, sheet feeder technology (hugely better in my current scanner – but multiple pages are still pulled through occasionally); and dealing with very faint text (I would hope that future software will automatically adjust brightness and contrast settings to deal with this). Finally there may be shortcomings in a particular model of scanner. I believe this is the case in the software driving my Canon DR-2020U which seems incapable of scanning a full page which has continuous black or very dark areas on the edge of the page – the scan truncates the page to remove those areas. To deal with this I have to cover the relevant edges with white paper. This may have been resolved in a later release of the software – though I haven’t checked for some time.

Q17. What are the practical problems of scanning documents containing colour?

2001 Answer: Ideas formed:

  • Some colours scan black when using a black and white scanner (Wilson 1995b: 76, 77, 98)
  • Colour scans take more time (Wilson 1997: 2).
  • Colour scans produce files 10 or 20 times the size of a black and white scan (Chan 1993:43, Wilson 1997:2)
  • Colour scans (which are made up of red, green and blue) may not work well on printers (which use cyan, magenta, yellow and black) (Chan 1993: 44)

2019 Answer: Fully Answered: Colour scanning no longer presents a particular problem because today’s scanners all have a colour capability, and even though colour scans take up more space, storage is now plentiful and cheap (for comparison purposes, with my current scanner, a single page with 5 colour photos on it scanned at 300dpi B&W to a PDF of 126 Kb; and at 300dpi Colour to a PDF of 801 Kb).

For the most part, colour scans are normally sharp and clear. The only problem I have had with colour scanning seems to be specific to the software used by my current 5 year old Canon DR-2020U scanner. I’m very pleased with the performance of this scanner except for the fact that it truncates those parts of the edges of pages which have black or very dark colour. To deal with this, I have to place white paper just over the edges of the page – which can be difficult for larger pages which go up to the edge of the scanner platen. I have never experienced this problem before and hope that it has been fixed in a software upgrade.

Q18. Can poor originals be successfully scanned?

2001 Answer: Experience gained: Modifying the scanner settings can dramatically improve the quality of scans of poor originals.

2019 Answer: Partially Answered: Originals can be considered poor for a variety of reasons including skewed images, holes punched in the sides for binding purposes, and faint or heavily smudged images. My current scanner has on/off features to deal with skew (works well) and punched holes (I haven’t really tried to use this); and has settings to adjust contrast and brightness (these also work well but I’ve found it difficult to remember which combination of settings works best – perhaps in the future more automated assistance will be provided for this). In summary, usable scans can often be achieved from poor originals.

Q19. Can all types and sizes of documents be scanned successfully?

2001 Answer: Experience gained: Yes, provided you have a flatbed scanner with a sheet feeder. If the flatbed is not big enough, many of the larger documents can be cut or folded to the necessary size. The spines of gummed or stapled newsletters and booklets can be cut off (Wilson 1995b: 99). Post-it notes stuck on a page go through the sheet feeder without a problem (Wilson 1995b: 13, 39, 140). Journals can be scanned successfully on a flatbed scanner (Wilson 1995b: 124).

2019 Answer: Fully Answered: As described in the 2001 answer, the vast majority of documents can be scanned successfully provided you have a scanner which can accommodate the physical size of the document. Of course, if you can’t lay a document flat on the scanning platen, as is the case when trying to scan a book or a journal, there will always be some distortion at the centre of the dual pages. Most people are likely to have an A4 or, in some cases, an A3 scanner (though an A3 scanner does take up more space), so they will be unable to scan documents above these sizes. However, that is of no consequence these days because digital photography is now so cheap, powerful, and prevalent on mobile phones that it can be used to photograph documents that are too big or difficult to scan. The digital images produced are good enough to read and can be enlarged as necessary. A good example of this is the increasingly common practice of taking photos of flipcharts, of which there are a few in the PAWDOC collection. However, there are two potential problems with photographing documents or other flat objects: one is that, unless the camera is held in exactly the same plane as the document page, the image of the document will appear shorter on one side than on the other; and the other is that, if the document will not lie flat, the document image will appear equivalently distorted in the photograph.

Q20. How can you minimize the amount of scanning that needs to be done?

2001 Answer: Experience gained: Possible strategies include:

  • Stick with electronic files as much as possible
  • Minimize the printing of paper (Wilson 1995b: 82, 90).
  • Elect to receive electronic versions of newsletters, magazines and journals (Wilson 1995b: 126).
  • Go out of your way to obtain electronic versions of paper you get from other people (Wilson 1997: 3).

2019 Answer: Fully Answered: Despite paper still being very widely used (the paperless office never actually arrived), most documents are now created and distributed in digital form – very little paper arrives in the overland or office mail today. Hence, scanning can be minimised by simply storing the electronic versions of documents you create and that you receive by email; and by seeking out the electronic versions of other hardcopy documents that you receive. For example, I get the quarterly newsletter of the professional body that I belong to in hardcopy form; but I go to their website to get a digital version of it. Having said that, journal and magazine publishers continue to protect their papers and articles behind paywalls; and newspapers seem to be following their example. Consequently, filing system owners may have to scan some hardcopy papers and articles for the foreseeable future.

Q21. Can scanned images be used successfully in day-to-day work?

2001 Answer: Ideas formed: Relatively few of the scanned images have been used to date. However, experience so far has not indicated any major problems with either reading them on screen or printing them out. An A4 sized display would certainly make it easier to read scanned images on screen (Wilson 1995b: 104).

2019 Answer: Fully Answered: Although I don’t access many of the PAWDOC scanned images these days, over the years I have used them and worked with them and have had no trouble doing so. Nowadays I view all my scanned documents (in both multi-page TIFF and PDF formats) in my eCopy PDF Pro application and this works fine – though it would undoubtedly be better if the screen could be in Portrait mode so that the whole of an A4 page could be displayed in the way one reads the hardcopy. Recently I transferred several scanned documents onto my iPad and I’ve found I can read them all without any difficulty with the iPad in Portrait mode and without needing to enlarge them. This has been a bit of a revelation since it is clearly a much easier and more pleasing experience to read these documents on the iPad than it is on my 24in desktop screen or my laptop. The iPad’s high resolution screen, its light weight, and the fact that you can hold it as close as you want make it a winning combination for scanned work documents. This could be the way to store digital filing systems in the future – or perhaps large screen phones could serve the same purpose?

PAWDOC: Indexing (and the Accession process)

Indexing is a key aspect of a filing system because it is one of the mechanisms used to find documents. An alternative mechanism is to search the full text of documents, but the PAWDOC system has not explored this approach because a) the PAWDOC system does not recognise the full text of all the documents it contains, and b) because I have always believed this approach would provide too many hits thereby reducing search effectiveness.

Indexing within the PAWDOC collection is performed at two levels: first, each entry (uniquely identified by a Reference Number) in the main index file contains a Title field (in which a free format description and any number of uncontrolled keywords can be specified) and a few other fields including publication date. Second, each electronic file (be it a scan of a hardcopy document, or a born-digital document) associated with a particular Reference Number contains a unique file title. The contents of these two sets of indexing information provide the means to search for a document, to decide if you have found what you are looking for, and to get a better idea of what the chosen Reference Number and/or associated electronic file(s) contain. These potential uses are worth bearing in mind when indexing information is being specified.
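
As a rough illustration of the two indexing levels described above, the following Python sketch searches both the main index Titles and the file titles. The dictionary layout, the `REF-0001` style Reference Numbers and the sample entries are all assumptions made for illustration only; the real index is held in Filemaker Pro and is not structured this way.

```python
# Hypothetical sample entries: Reference Number -> (index Title, [file titles]).
INDEX = {
    "REF-0001": (
        "Quarterly progress report, office automation pilot",
        ["REF-0001 Progress report Q1"],
    ),
    "REF-0002": (
        "Brochure describing a colour scanner",
        ["REF-0002 Scanner brochure"],
    ),
}


def search(index, *terms):
    """Return the Reference Numbers whose Title or file titles contain every term."""
    wanted = [t.lower() for t in terms]
    hits = []
    for ref_no, (title, file_titles) in index.items():
        # Both indexing levels are searched together, case-insensitively.
        text = " ".join([title, *file_titles]).lower()
        if all(t in text for t in wanted):
            hits.append(ref_no)
    return hits
```

Calling `search(INDEX, "progress", "q1")` matches the first entry because the term `q1` appears only in its file title, which shows how the second level can rescue a search that the main Title alone would miss.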

Inevitably mistakes are made when specifying Title information and so it is essential that index entries can be changed over time to correct typos, grammar and factual errors as necessary. Changes may also be inspired by experiencing difficulty in finding a document, and in these cases some additional text may be included to reflect the search terms that first came to mind, or to reflect the changes in terminology that occur regularly in language and in today’s fast moving technical and cultural environment.

Indexing is closely integrated into the process of including a new document in the PAWDOC collection, not least because adding a new file involves creating a file title which in turn includes a Reference Number and possibly some text from the associated Index entry. Hence, timings for how long indexing takes are embedded within the overall process of Accessioning (which includes scanning for hardcopy documents). For the PAWDOC collection, which uses a scanner acquired in 2012 and a laptop acquired in 2018, it takes approximately 2.5 – 4 minutes to accession a single colour A4 hardcopy page; and about 3.5 – 5 minutes for a 5 page double sided colour hardcopy document. An electronic document of any type and number of pages takes approximately 2 – 3.5 minutes to accession. These are appreciable amounts of time for an overhead administration activity amidst a busy working day. Therefore, it is worth exploring any means of reducing these timings. The obvious way is to keep the number of index fields to an absolute minimum. Experience with the PAWDOC system indicates that integrating keywords into the Title field (as opposed to having a separate keywords field) has been very successful. Conversely, the Publication Date field has not proved to be very useful and I believe could be dispensed with. Experience with the other fields used in the PAWDOC system is summarised below:

  • Reference Number – essential
  • Creation Date – a useful control
  • Movement Status – very useful for recording the whereabouts and status of documents
  • Date Last Accessed – only useful if you particularly need it.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q10. How long does it take to index a new document?

2001 Answer: Fully answered: It takes 1 – 2 minutes to make an entry in the main index and to write the Reference Number on the document. A further 30 seconds is needed to create an entry in the document management system for either an electronic file or a document that is to be scanned immediately (Wilson 1995c: 1). Scanning a document will add a further 40 – 50 seconds for a single page – though the per-page time is reduced considerably for multi-page documents that are put through the sheet feeder (Wilson 1997: 3).

2019 Answer: Fully Answered: This question should really be ‘How long does it take to index a new document and include it in the collection?’ because the indexing process is closely integrated with the process of saving the file. The figures in the 2001 Answer for making a new entry in the main index and writing the Reference Number on the document (1 – 2 minutes), and for creating a new entry in the document management system (an extra 30 seconds) still stand. Scanning a document using the Canon DR-2020U scanner I bought in 2013 and the new Chillblast laptop I bought in 2018 with a 1 Terabyte solid state drive takes 25 seconds for one colour page or 70 seconds for 5 double sided colour pages. However, the Document Management system was eliminated from the PAWDOC architecture in 2018 and replaced by Windows folders, so the subsequent set of actions has changed to the following:

a) Create the main index entry (1 – 2 mins);

b) Create a sub-folder within the main PAWDOC folder in the Windows file system, with the new Reference Number, and copy the Ref No (15 seconds);

c) Create the file title (Ref No   Title,  Date scanned) in the scanner control application, starting by pasting in the copied Reference Number (15 – 30 seconds);

d) Save the file – use the scanner control application to select the new folder just created to specify where the file should be stored and press save (20 – 30 seconds);

e) Check the file – open the file to check that it has been stored in the correct place with the correct title (15 seconds);

f) Check the scan – open the file to check it has been scanned correctly (only for multi-page docs or for docs where there may be an issue) (15 seconds).

The process for electronic documents has also changed as follows:

i) Create the main index entry (1 – 2 mins);

ii) Create a sub-folder within the main PAWDOC folder in the Windows file system, with the new Reference Number, and copy the number (15 seconds);

iii) Create the file title – open the Save As dialogue box and give the file a new file name (Ref No   Title,  Date saved) by first pasting in the copied Reference Number (15 – 30 seconds);

iv) Save the file – navigate to the folder into which the file is to be saved and press Save (20 – 30 seconds);

v) Check the file – open the file to check that it has been stored in the correct place with the correct title (15 seconds).

Of course, electronic documents which are placed into an existing Ref No do not require a new index entry or a new folder.
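
The folder and file title steps above are performed by hand in PAWDOC, but they could in principle be scripted. The following Python sketch is one possible way to do it, not the method actually used: the function name `accession_path`, the `REF-0001` style Reference Number, the single-space separator and the ISO date format are all assumptions made for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path


def accession_path(collection_root, ref_no, title, saved_on, extension):
    """Create the Reference Number sub-folder and return the full path for the new file.

    The file title follows the 'Ref No  Title, Date' pattern; the exact separator
    and the date format are assumptions for this sketch.
    """
    folder = Path(collection_root) / ref_no
    folder.mkdir(parents=True, exist_ok=True)  # sub-folder named after the Ref No
    file_title = f"{ref_no} {title}, {saved_on.isoformat()}{extension}"
    return folder / file_title


# Usage with a throwaway directory standing in for the main PAWDOC folder.
with tempfile.TemporaryDirectory() as root:
    target = accession_path(root, "REF-0001", "Progress report", date(2019, 6, 1), ".pdf")
    target.write_bytes(b"")        # stand-in for saving the scanned or electronic file
    saved_ok = target.is_file()    # check: stored in the right place with the right title
    saved_name = target.name
```

Because `mkdir` is called with `exist_ok=True`, the same function also covers the case where a file is added to an existing Reference Number.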

In summary, overall timings for including new documents in the collection are as follows: for a 5 page double sided hardcopy document using a new Ref No – approximately 3.5 – 5 minutes; this may reduce to about 2.5 – 3 minutes if an existing Ref No is being used. For an equivalent 5 page electronic document using a new Ref No – approx 2 – 3.5 mins; this may reduce to 1 – 1.5 mins if an existing Ref No is being used. Timings for a hardcopy document with only 1 page of contents are reduced by about 45 seconds due to reduced scanning time; electronic document timings stay the same regardless of page count.
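
The overall figures for a new Ref No can be checked by adding up the individual step timings quoted in the 2019 answer. The short Python sketch below does that arithmetic; the constant names are made up for illustration, and the step durations are copied from the text above (in seconds, as minimum and maximum).

```python
# (min_seconds, max_seconds) for each step, as quoted in the 2019 answer.
INDEX_ENTRY = (60, 120)
NEW_FOLDER = (15, 15)
FILE_TITLE = (15, 30)
SAVE_FILE = (20, 30)
CHECK_LOCATION = (15, 15)
CHECK_SCAN = (15, 15)       # only for multi-page or doubtful scans
SCAN_1_PAGE = (25, 25)
SCAN_5_PAGES = (70, 70)


def total(*steps):
    """Sum the minimum and maximum durations of the given steps."""
    return sum(s[0] for s in steps), sum(s[1] for s in steps)


def as_minutes(seconds_range):
    """Convert a (min, max) range in seconds to minutes, rounded to one decimal."""
    return tuple(round(s / 60, 1) for s in seconds_range)


common = (INDEX_ENTRY, NEW_FOLDER, FILE_TITLE, SAVE_FILE, CHECK_LOCATION)
hardcopy_5_pages = total(*common, CHECK_SCAN, SCAN_5_PAGES)
hardcopy_1_page = total(*common, SCAN_1_PAGE)
electronic = total(*common)
```

Here `as_minutes(hardcopy_5_pages)` gives (3.5, 4.9), `as_minutes(hardcopy_1_page)` gives (2.5, 3.9) and `as_minutes(electronic)` gives (2.1, 3.5), which agree with the rounded ranges quoted for a new Ref No.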

Q11. What can you do to speed up indexing?

2001 Answer: Ideas formed: Keep the number of index fields to a minimum.

2019 Answer: Fully Answered: Minimise the number of fields. Automate the generation of the Creation Date. Eliminate the Date Last Accessed field if there is no good reason for recording that information. To speed up the overall accession process, a) store more rather than fewer documents in existing Reference Nos (using existing Ref Nos avoids having to create new index entries); b) use a faster scanner; c) reduce the time it takes application programs to page through 17,000+ sub-folder names in the Save As function (for some reason, in my Windows 10 system, the application programs take far longer to do this than Windows File Explorer).

Q12. What is the most effective set of index fields?

2001 Answer: Experience gained: The smallest number you can manage with (Wilson 1990: 95). Reference number, title and keywords, date of creation of the record, movement status, date last accessed.

2019 Answer: Fully Answered: Keeping the number of fields to the absolute smallest number you think you can manage with will minimise the time and effort spent on putting new items into the system and on managing the system. The minimum set of fields I could manage with is Reference No, Title, Creation Date, and Movement Status. The other two fields that I use, but which I believe I could do without, are Publication Date (I rarely refer to this) and Date Last Accessed (which was included mainly for research purposes).
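
To make the minimum field set concrete, here is a small Python sketch of an index record holding only those four fields, with the Creation Date generated automatically as suggested in the answer to Q11. The class name, the field names and the sample values are assumptions for illustration; they do not describe the actual Filemaker Pro index.

```python
from dataclasses import dataclass, field, fields
from datetime import date


@dataclass
class MinimalIndexRecord:
    ref_no: str                 # Reference No (essential)
    title: str                  # free-format description with keywords integrated
    movement_status: str = ""   # whereabouts and status of the document
    # Creation Date is filled in automatically, so it costs no indexing time.
    creation_date: date = field(default_factory=date.today)


# Hypothetical example entry; only ref_no and title have to be typed in.
record = MinimalIndexRecord("REF-0001", "Quarterly progress report OA pilot")
FIELD_NAMES = [f.name for f in fields(MinimalIndexRecord)]
```

Publication Date and Date Last Accessed are deliberately left out, reflecting the view above that they could be dispensed with.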

Q13. What criteria should be employed when defining titles and keywords?

2001 Answer: Ideas formed:

  • Remember there is no need to use the actual title of the document.
  • Define titles/keywords in accordance with what the document means to you – that way you are more likely to be able to retrieve it.
  • Bear in mind that title/keywords serve multiple purposes (Wilson 1992a: 8):
    • To enable search and retrieval
    • To enable the user to decide if a retrieved record is what was being searched for
    • To provide the user with an understanding of what is contained in the item referenced by a retrieved record.

2019 Answer: Fully Answered: Do not feel constrained to use the actual title of the document – you are far more likely to be able to specify successful searches for items if you define Titles and Keywords which convey what the document means to you. If you are aware terminology is changing, use the latest terms you know as you are likely to become less familiar with the old terminology as time passes. Since there are no length constraints on the Index entry, use as many words as necessary to describe what the document is about. The second level of indexing – the text in the Titles of individual files – is constrained by length and for that reason might well be shorter than the main index entry. However, care should be taken not to truncate unnecessarily and not to just take the quick and easy route to create the file title by cutting and pasting a generic part of the Index entry. Certainly, where there are two or more files associated with the same Reference No, the titles of the 2nd and subsequent files should clearly distinguish them from all the other files in that Reference No. Also, bear in mind that Title fields (which include Keywords) and File Titles serve at least the following three purposes: a) to enable search and retrieval; b) to enable the user to decide if a retrieved record is what was being searched for; c) to provide the user with an understanding of what is contained in the item referenced by a retrieved record.

Q14. Are there circumstances in which keywords and titles should be changed over time?

2001 Answer: Experience gained: It is very necessary to be able to alter keywords and titles for the following reasons:

  • To accommodate the user’s growing familiarity with a topic. For example, the term ‘OA Human Aspects’ became an abbreviation ‘OA HCI’ in later entries, so it needed to be placed in previous entries to ensure they would be retrieved if only ‘OA HCI’ was specified in the search term (Wilson 1992a: 22).
  • To accommodate changing language. For example, the term ‘DTI pilot’ became ‘DOI pilot’ when DTI changed its name (Wilson 1992a: 12).
  • To accommodate additional material being added to an existing hardcopy file, or being added to the associated Folder in the Document Management System.

2019 Answer: Fully Answered: Users should be able to change Titles and Keywords at will to correct mistakes or to try and improve the search success rate. Circumstances in which changes might be made for the latter reason include:

a) to accommodate the user’s growing familiarity with a topic. For example, the term ‘OA Human Aspects’ became an abbreviation ‘OA HCI’ in later entries, so it needed to be placed in previous entries to ensure they would be retrieved if only ‘OA HCI’ was specified in the search term;

b) to accommodate changing language. For example, the term ‘DTI pilot’ became ‘DOI pilot’ when DTI changed its name; and

c) to accommodate additional files being added to the Reference No.
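
The first kind of change, adding a newer term to earlier entries so that they are still retrieved, can be sketched in a few lines of Python. The function name `add_alias` and the sample titles are hypothetical; only the two terms come from the example above.

```python
def add_alias(titles, old_term, new_term):
    """Append new_term to every title that contains old_term but not yet new_term."""
    updated = []
    for title in titles:
        lowered = title.lower()
        if old_term.lower() in lowered and new_term.lower() not in lowered:
            title = f"{title} {new_term}"
        updated.append(title)
    return updated


# Hypothetical titles, for illustration only.
TITLES = ["Report on OA Human Aspects", "OA HCI workshop notes", "Scanner brochure"]
REVISED = add_alias(TITLES, "OA Human Aspects", "OA HCI")
```

The old term is kept rather than replaced, so a search on either term finds the entry, and running the function a second time changes nothing.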

PAWDOC: Deciding what to file

Most types of things can be managed in a digital filing system. Documents, brochures and books can be either scanned or photographed; electronic documents are, of course, already in the right form; and even physical artefacts such as art works can be photographed.

Once it’s been decided to establish a filing system, the reasons why the items concerned are being kept, and what is going to be done with them, should be written down. This should help in defining the following three key points: i) how long items are to be kept for; ii) whether any physical originals are to be retained after they have been digitised; and iii) whether every item is to be filed or just a particular subset. Point iii) is important because, if only a subset is required, the amount of time and effort that will be spent on adding items to the filing system, and managing it, will be reduced. Against this time saving must be weighed the extra effort – and potential errors that may be made – in deciding whether each particular object is to be included or not. Conversely, collecting every instance of a type of item makes life easy and requires no decision making.

If the filing system is to replace or augment an existing collection, a decision must be taken about whether to exclude the existing items, or to undertake ‘backfile conversion’ to include some or all of them in the digital collection. Backfile conversion is often a lengthy and tortuous process and can act as a drag on proceeding with the new system; therefore, it should be avoided if at all possible. If, however, it is thought to be needed, the reasons for doing it should be clearly understood before proceeding.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q5. What types of information and media can be managed by an electronic filing system?

2001 Answer: Fully answered: Just about anything can be managed by an electronic filing system including single sheets and stapled sets of hardcopy documents, scientific journals, volumes of bound material, ring binders, books, 35 mm slides, videos, electronic files and e-mail messages.

2019 Answer: Fully answered: Any kind of item can be managed by an EFS – though it may be necessary to also keep the original physical artefact to experience the full impact of certain types of items such as specially folded brochures, or significant documents with a tactile presence. Types of items that I have succeeded in managing in the PAWDOC collection include hardcopy documents, ring binders, books, journal papers, magazines, brochures, overhead slides, flipchart pages, CAD/CAM drawings, photos, videos, web pages, and email messages.

Q6. Can all types of electronic files be managed in an electronic filing system?

2001 Answer: Partially answered: Yes, certainly all types of application files can be. I have not stored executable files yet but have been advised that it can be done and how to do it.

2019 Answer: Fully Answered: Any type of file – including executables – can be managed in this way. For example, file types in the PAWDOC system include Word, Excel, Powerpoint, PDF, TIFF, JPG, MP4, MPP, ZIP, and HTML. However, the real issue is whether you will be able to open them in future years if the particular application programmes are lost or not kept up to date. For this reason, it may be worthwhile converting some files to more common formats, such as PDF, before storing them. In the future, these so-called ‘Digital Preservation’ issues may get addressed automatically by AI programmes.

Q7. What criteria should be employed when deciding what to file?

2001 Answer: Not started

2019 Answer: Fully Answered: Individuals should file as much or as little information as they think they will need in their jobs – otherwise they will not have the motivation to spend the time introducing new material and managing the filing system. If a robust, all embracing, file facility is available within a particular system (such as in many electronic mail systems) such systems should be used to their full capability (i.e. retain everything up to any limitations set by the system administrator). Inevitably, whatever an individual decides to file, there will always be some material which is never subsequently accessed, and there will always be some required information that can’t be found in the system. This seems to be a regularly occurring phenomenon and should not deter individuals from filing – though it may influence the filing criteria that an individual applies over time.

Q8. Is it worth doing backfile conversion?

2001 Answer: Experience gained: Backfile conversion is certainly not essential, and is probably not worth doing, except for material you definitely want to keep for many years.

2019 Answer: Fully Answered: Backfile conversion usually takes a huge amount of time and effort, so a detailed analysis should be undertaken before deciding whether to do it or not. Alternative approaches to dealing with the old items should be investigated. Backfile conversion should only be carried out if there are compelling reasons for doing so.

Q9. What are the major considerations when carrying out backfile conversion?

2001 Answer: Experience gained: Backfile conversion requires a huge amount of time and effort and is not to be undertaken lightly.

2019 Answer: Fully Answered: Considerations to be taken into account when conducting backfile conversion include:

  1. Ensure the physical backfile material does not interfere with any space required by the replacement filing system.
  2. Decide if any of the backfiles being converted will need to be retained in their original form and, if so, define clear criteria for selecting which ones to keep.
  3. Decide what quality the converted backfiles need to be and then put in place the appropriate equipment, software and procedures to achieve that quality.
  4. Decide how users will be informed of which backfiles have been converted and which are yet to be converted.
  5. Establish a clear and doable schedule for the conversion process.
  6. Put in place motivational aids for sticking to the schedule. For example, setting a target of a certain number of conversions each day; and creating a visible progress sheet on which achievements can be ticked off.

PAWDOC: Collection content, size, growth rate and usage patterns

The PAWDOC collection was set up in 1981 to explore the application of office technology. In 2001 a paper was published in the journal Behaviour & Information Technology called ‘20 years in the life of a long term empirical personal electronic filing study’. This described PAWDOC and summarised findings about its use up to that point under the following 15 headings:

Two extra considerations are ‘Architecture’ and ‘Requirements and Objectives’. Now, a further 18 years on, the findings will be further reviewed in this and subsequent posts. The first of these follows below.

Collection content, size, growth rate and usage patterns

The precursor to PAWDOC was a conventional filing system in an upright cabinet using hanging folders and crystal tabs to store documents only (not journals, books etc.). The contents were specified in an index taxonomy with entries of the type, for example, 1.6.8 – Quarterly Progress Reports, which I was constantly adding to. Consequently much space was taken up in the cabinet by folders with crystal tabs housing only a few pages. This changed after the PAWDOC system came into use: it needed far fewer hanging folders because each folder was filled to capacity with as many of the serially numbered documents as it could take.

The PAWDOC schema was explicitly defined to support the management of multiple sets of different material owned by multiple different owners. This capability was used to manage a variety of sets of my own material in addition to documents, for example, several different journals, 35mm slides (for making presentations), ring binders, and books; and, in the early years of the system, material owned by other people and organisations – though, as time went by, I made fewer and fewer such entries not least because of the uncertainty of being able to access such material.

The system soon became an integral part of my working life, and I used it just about every day. Several hundred new items were being added each year and this steady rate of acquisition soon ran up against the physical limits of the upright cabinet, so it became necessary to archive material in boxes and then to put some of the boxes in store. This went on until I started digitising newly acquired documents in 1996 and disposing of most originals. At this point, I also started to store born-digital documents regardless of the applications they were created in.

In 2001 I started a new job in Bid Management which precluded personal storage of its associated highly confidential and fast moving documentation; so my usage of the PAWDOC system did reduce from then onwards. Nevertheless I continued to use it regularly on most days, and was still adding over 200 new items each year up to when I retired in 2012.

The 45 boxes of hardcopy that I had acquired eventually came out of store around 2001 and were stored in my garden shed. Their number was overwhelming and I began to doubt I would ever get them all digitised. However, I stuck at the task, sometimes going at it solidly for several days at a time when my wife was away. By the time I retired in 2012 only 4 boxes remained and these had all been scanned by 2014: it was a great relief. The huge amount of physical space taken up by the collection had been reduced to just two archive boxes of significant hardcopy documents; and, after I had conducted a digital preservation exercise on the collection in 2018, the digital footprint of the collection amounted to some 115 GB. I have ended up with a fairly complete digitised archive of all the non-highly confidential materials that I had encountered throughout my working life, including substantial amounts of material from my earlier career from 1972 with Kodak and then CPC. Since retiring I’m continuing to add a few documents (around 40 up to 2019) in three categories: significant articles relating to the work I used to do; documents relating to the digital preservation of the PAWDOC collection; and material relating to work I am doing to investigate and document the findings from the PAWDOC collection’s 38 years of existence.

Specific questions relating to this aspect are answered below. Note that the status of each answer will fall into one of the following 5 categories: Not Started, Ideas Formed, Experience Gained, Partially Answered, Fully Answered.

Q1. What are the contents of the collection?

2001 Answer: Fully answered: At the beginning of July 2001, the collection consisted of 14 100 index entries representing approximately 185 000 pages of paper, 50 000 scanned pages, over 30 scientific journals (including Behaviour & Information Technology from 1982), around 30 books and conference proceedings, 3700 MS Word files, 400 MS Excel files, 250 MS PowerPoint files, 150 other electronic files of various types, and 10 CDs.

2019 Answer: Fully answered: The PAWDOC User Guide created in 2018 says: “All types of documents were stored including letters, internal memos, circulars, reports, specifications, minutes, overhead slides, 35mm slides, notes, training materials, brochures, manuals, maps, emails, computer magazines, journal articles, conference proceedings, and videos. As Office Technology became more versatile, electronic documents such as word processor files, spreadsheets, presentations and web sites were also filed.” In June 2019, the collection consisted of 17,293 Index entries representing 29,610 electronic files in 16,067 Windows folders, and about 340 physical hardcopy documents in two archive boxes. A further 384 old electronic backup files are also stored in a separate folder. The Checking exercise performed in 2016 identified the following numbers of different types of files in the collection: Word – 6380; Powerpoint – 466; Excel – 625; HTML – 382; Help – 90; Zip – 92; 11 other apps – 88; Scanned documents – 28,418. The collection was primarily a work collection and therefore the number of new items being included reduced to a trickle when I retired in 2012.
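
Counts of file types like those above can be produced by walking the folder tree and tallying file extensions. The Python sketch below shows one way this could be done; it is not the method used in the 2016 Checking exercise, and the function name and the demonstration file names are made up.

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count the files under root (recursively) by lower-cased extension."""
    return Counter(
        p.suffix.lower() or "(no extension)"
        for p in Path(root).rglob("*")
        if p.is_file()
    )


# Demonstration on a throwaway folder tree standing in for the collection.
with tempfile.TemporaryDirectory() as root:
    for name in ("REF-0001/a.pdf", "REF-0001/b.PDF", "REF-0002/c.docx", "REF-0002/notes"):
        path = Path(root) / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(b"")
    demo_counts = count_by_extension(root)
```

Extensions are lower-cased so that `.pdf` and `.PDF` are counted together. Note that an extension only indicates a file type; it would not by itself distinguish a scanned PDF from a born-digital one.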

Q2. How much space does the collection take up?

2001 Answer: Fully answered: The paper takes up about 4.7 sq. metres of floor space and 1.7 metres of shelf space. The scanned images and electronic files take up 2.9 GB. The scanner, magneto-optical drive and CD Writer take up about 0.25 sq. metres of desk space. The Filemaker Pro index is about 8.1 MB in size and the FISH data file is 11.2 MB. The Filemaker Pro, FISH, SQL Anywhere and Easy CD Creator software packages take up approximately 38 MB.

2019 Answer: Fully answered: The two archive boxes stand one upon the other and take up 0.2 sq. metres of floor space. The laptop in which the electronic files reside takes up 0.08 sq. metres of desk space. The electronic files of the main collection take up 45.9 GB of storage space; and the backup files take up 66.6 GB. The Filemaker Pro software used for the index takes up 336 MB of file space.

Q3. What is the growth rate of the collection?

2001 Answer: Partially answered: Between 1981 and 1993 an average of 543 index entries were created each year – estimated to consist of an average 29.4 pages per day (Chan 1993: 25). Over the whole 20-year life of the system, the growth rate has been an average of 705 entries per year with a range of 210 – 1202 new entries a year. In 1993, it was estimated that the collection was increasing at the rate of 3.8 MB per day.

2019 Answer: Fully answered: The growth rate of the collection is shown in the chart below.

There are three distinct phases: 1981 – 2000; 2001 – 2011 when I was working on highly confidential bids; and 2012 – 2019 when I was retired. The average growth rates during these periods were:

  • 1981-2000 – 696 index entries per year
  • 2001-2011 – 272 index entries per year
  • 2012-2019 – 46 index entries per year
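
As a sanity check, the three average growth rates listed above can be multiplied out and compared with the size of the index. The Python sketch below assumes that each phase covers whole calendar years counted inclusively, which is an assumption of this sketch rather than something stated in the text.

```python
# (first_year, last_year, average new index entries per year), from the list above.
PHASES = [
    (1981, 2000, 696),
    (2001, 2011, 272),
    (2012, 2019, 46),
]


def implied_total(phases):
    """Total entries implied by the per-phase averages, counting years inclusively."""
    return sum((last - first + 1) * rate for first, last, rate in phases)


IMPLIED = implied_total(PHASES)   # 20*696 + 11*272 + 8*46
```

This gives 17,280 entries, within a handful of the 17,293 Index entries reported for June 2019; the small difference is consistent with the averages having been rounded.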

Q4. How often are the contents accessed?

2001 Answer: Experience gained: Between 1987 and 1993, an average of 363 records were being accessed each year (Chan 1993: 25).

2019 Answer: Experience gained: The only data that has been collected on this question is in the date last accessed field, and unfortunately that only records the latest date an item was accessed – there may have been any number of earlier accesses. Furthermore, some items may have been accessed without the date last accessed field being updated. Having said that, 4,551 items have an entry in the date last accessed field, implying that 12,742 items have never been looked at for work purposes after they had been included in the collection (the date last accessed field was never updated when items were looked at for the purposes of controlling and writing about the collection). For the period from 2001 onwards, when I moved jobs into Bid Management, there were 751 index items with dates of 2001 or later in the date last accessed field (only 15 of these were in 2012 and only a further 15 of these were from 2013 onwards).

NB. The various references in the texts above to Chan 1993 relate to the following reference at the end of the 2001 paper in Behaviour & Information Technology:

CHAN, S. C. 1993, Feasibility of Paperless Office, Submitted in partial fulfilment of the requirement for the degree of MSc in Information Systems and Technology in The Information Science Department at City University, London (Supervisor: Dr David Bawden) [PAW/DOC/4012/08].