Phase 4 Results

The Phase 4 objective was to explore whether AI can be used to investigate the contents of a large archive through a combination of its Index entries and the names of the associated files. Having completed the work, I can report that it is indeed feasible – and with some spectacular results. However, there are some caveats, which are highlighted in the following paragraphs.

The issue with a large archive is that there is too much information to hold in the Context Window of an LLM. To get round that problem, I split up the Index and File Name information into small subsets, and instructed the LLM first to produce answers for each subset, and then to merge the subset answers into an overall answer for the whole archive. The practicalities of splitting up the data and establishing subsets are described in a previous post. Suffice it to say that it is best a) to produce subsets that combine a number of Index entries together with their associated file names (rather than having separate Index Entry and File Name subsets), and b) to keep the number of characters in every individual subset file well under the limits specified by the LLM, to avoid truncation.
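The subset-building advice above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the `IndexEntry` shape, the `build_subsets` function, the reference numbers, and the 150-character budget are all hypothetical, and a real budget would depend on the limits of the LLM being used.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """One Index entry plus the names of its associated files (hypothetical shape)."""
    reference: str
    description: str
    file_names: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the entry and its file names as one block of text."""
        lines = [f"{self.reference}: {self.description}"]
        lines += [f"  file: {name}" for name in self.file_names]
        return "\n".join(lines) + "\n"


def build_subsets(entries: list[IndexEntry], max_chars: int) -> list[str]:
    """Pack whole entries into subsets of at most max_chars characters each."""
    subsets: list[str] = []
    current: list[str] = []
    size = 0
    for entry in entries:
        text = entry.render()
        if len(text) > max_chars:
            raise ValueError(f"entry {entry.reference} exceeds the budget on its own")
        # Start a new subset rather than separate an entry from its file names.
        if current and size + len(text) > max_chars:
            subsets.append("".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        subsets.append("".join(current))
    return subsets


# Made-up entries, purely for illustration.
entries = [
    IndexEntry(f"REF-{i:04d}", "Example description", [f"REF-{i:04d}-01.pdf"])
    for i in range(1, 8)
]
subsets = build_subsets(entries, max_chars=150)
print(len(subsets), [len(s) for s in subsets])
```

Keeping each entry together with its file names means no subset ever holds an entry without its files, and setting the budget comfortably below the model's stated limit leaves room for the prompt text itself.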

I tested this subset strategy by asking the following 6 questions of my PAWDOC work archive of some 17,000 Index entries and 31,000 associated files:

  1. List all the people named in this part of the Index and its associated files, and the organisation they belong to if any.
  2. Describe Paul Wilson’s career over the period covered by this part of the Index and its associated files.
  3. What significant changes in Information Technology occurred during the period covered by this part of the Index and its associated files?
  4. Document all the travel undertaken by Paul Wilson over the period covered by this part of the Index and its associated files.
  5. What training was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, and how important were particular elements to his subsequent career?
  6. What are the strangest or most unusual things to be found within this part of the Index and its associated files, including unlikely coincidences, and events with unexpected outcomes?

As described in a previous post, to get the answers you want, you need to provide clear and detailed instructions to the AI in the Prompt that you submit: the bare questions as phrased above are not sufficient, so I used far more detailed versions.

Another factor that makes a difference to the results you can get is the LLM that you choose to use. For this investigation I used ChatGPT and Claude Sonnet 4.6 (I wanted to use Copilot as well but found that my subset files contained too many characters for Copilot and were being truncated). The results from ChatGPT and Claude are compared below. However, it’s worth saying here that different LLMs have different abilities, and it’s good to be clear about what sort of task you want the AI to perform and which LLMs are best for that job. The Hugging Face website is one of the primary repositories for locally run models, each being categorised as suitable for one or more of over 50 different tasks – though support for archives isn’t one of them. Similar advice about which cloud platforms are best for particular tasks can be found in review articles with titles like “10 best AI….”. However, I have found nothing specific to archives in these texts either. So, as yet, I have no definitive listing of the AI tasks that might be required to support archives, nor of which LLMs are best suited to perform those tasks. I hope to have more information on this when I get to Phase 7 of this work.

Getting back to the Phase 4 tests, I ended up with 10 subsets of PAWDOC Index entries and associated file names. So, for each of the six questions, I got 10 answers from ChatGPT and 10 from Claude – 120 answers in all. Then I asked ChatGPT and Claude respectively to combine the 10 answers they had produced for each question into a single merged answer, which produced 6 merged answers from ChatGPT and 6 from Claude. I reviewed each of the 132 answers and gave each of the merged answers a score out of 10; the results are summarised in the table below.
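The two-stage process just described (one answer per subset, then one merged answer per question) can be sketched as follows. This is a minimal sketch and not the prompts or tooling actually used: `map_then_merge` and `stub_model` are hypothetical names, the prompt wording is invented, and the stub simply stands in for a real call to an LLM. Its only purpose here is to show the flow and confirm the answer counts.

```python
from typing import Callable


def map_then_merge(
    question: str, subsets: list[str], ask: Callable[[str], str]
) -> tuple[list[str], str]:
    """Ask the question of each subset, then ask for one merged answer."""
    subset_answers = [
        ask(f"{question}\n\n--- SUBSET {n} ---\n{subset}")
        for n, subset in enumerate(subsets, start=1)
    ]
    merge_prompt = (
        "Combine the following answers into a single answer "
        f"to this question: {question}\n\n" + "\n\n".join(subset_answers)
    )
    return subset_answers, ask(merge_prompt)


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call; returns a placeholder string."""
    return f"[answer to a {len(prompt)}-character prompt]"


questions = [f"Question {n}" for n in range(1, 7)]       # six questions
subsets = [f"subset text {n}" for n in range(1, 11)]     # ten subsets
models = {"model A": stub_model, "model B": stub_model}  # two models

total_subset_answers = 0
total_merged_answers = 0
for name, ask in models.items():
    for question in questions:
        answers, merged = map_then_merge(question, subsets, ask)
        total_subset_answers += len(answers)
        total_merged_answers += 1

print(total_subset_answers, total_merged_answers,
      total_subset_answers + total_merged_answers)  # 120 12 132
```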

| Question | ChatGPT (score out of 10) | Claude (score out of 10) |
|---|---|---|
| Q1 People | Poor result. Unable to distinguish people names from adjacent words, e.g. ‘Do, To’. Time to produce: People spreadsheet 34s; Organisation spreadsheet 65s. Score: 2 | Over 2,440 names were listed (all looking valid), with 1,245 being allocated to one of 600+ organisations. Impressive data collection in the subsets and an excellent consolidation. Time to produce: People spreadsheet 62s; Organisation spreadsheet 111s. Score: 9 |
| Q2 Career | Pretty good 6-page answer but constructed around general activities, not organisations or key projects. Time to produce: 45s. Score: 7 | An impressive 13-page report. A few errors, probably due to limited data in the Index entries and File Names. A hugely informative, comprehensive, and highly readable piece. Time to produce: 343s. Score: 8.5 |
| Q3 IT changes | Quite good 11-page answer identifying 14 major categories of IT change with details within; but the Reference Numbers are listed separately and not related to specific changes. Time to produce: 62s. Score: 7 | An impressive 21-page report with 22 categories of IT change, a summary timeline table, and a conclusion with Cross-Cutting Observations. A comprehensive and coherent overview. Time to produce: 830s. Score: 9 |
| Q4 Travel | A 5-page report listing 62 travel events (though the subset reports listed a total of 122). Little detailed analysis. Not a very good document. Time to produce: 59s. Score: 3 | An exceptionally comprehensive 48-page report with a table of contents and detailing 326 confirmed trips, 331 destination visits, 135 unique destinations, and 137,018 total one-way miles. Hugely impressive. Time to produce: 2502s. Score: 9.5 |
| Q5 Training | A 10-page report with too little data and statistics. Many events are listed under a general category. The subset reports were better, listing 93 specific events. Time to produce: 15s. Score: 3 | A 23-page formal report detailing 129 training events and a discussion on their relevance to Wilson’s career. A thoroughly competent and authoritative document. Time to produce: 1712s. Score: 9 |
| Q6 Strange items | An 8-page report listing the top 22 strange and unusual items, rather than all 100+ items identified in the subsets. Clearly presented and readable. Time to produce: 30s. Score: 8 | A 29-page report detailing 172 instances of strange and unusual events sorted into 12 categories. A table at the end lists all 172 instances graded from 1 (least strange) to 10 (most strange) in ascending order of Strangeness. A very good clear answer, well formatted and easy to read. Time to produce: 1175s. Score: 9 |

Claude has clearly produced the better results – and perhaps the ‘time to produce’ numbers indicate why: averaged over the seven merged outputs each model produced (counting the two Q1 spreadsheets separately), ChatGPT took 44 seconds and Claude took 962 seconds. Indeed, in some cases, Claude’s merged reports are so detailed and so well formatted that they are too believable – and this is their downfall: to reproduce such comprehensive results, or to verify the answers, would probably take weeks of manual work, and so, despite their potential for error (more on this below), it is very tempting to just assume they are totally correct.
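As a check on the arithmetic, the two averages can be reproduced from the ‘Time to produce’ figures in the table. The figures only work out if the two Q1 spreadsheets are counted as separate outputs, giving seven timings per model; that is an inference from the numbers, and the variable names below are simply illustrative.

```python
from statistics import mean

# 'Time to produce' figures in seconds, in table order:
# Q1 People, Q1 Organisation, Q2, Q3, Q4, Q5, Q6.
chatgpt_seconds = [34, 65, 45, 62, 59, 15, 30]
claude_seconds = [62, 111, 343, 830, 2502, 1712, 1175]

chatgpt_avg = mean(chatgpt_seconds)
claude_avg = mean(claude_seconds)

print(round(chatgpt_avg), round(claude_avg))  # 44 962
print(f"{claude_avg / chatgpt_avg:.1f}")      # 21.7
```

On these figures Claude spent roughly 22 times as long as ChatGPT on each merged output.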

Although I didn’t diligently check every aspect of these extremely detailed reports, I did spot a few errors, which suggests there may be several more across this body of material. They are summarised below to give a flavour of what can go wrong.

  • ChatGPT Q1 subset answers: Large numbers of incorrect People and Organisations are put forward by ChatGPT. A sensible rationale for identifying the difference between any two adjacent words and a Person or Organisation’s name is just not there. For example, two of the people names put forward by ChatGPT are ‘Do, To’ and ‘Taxation, Oil’.
  • Claude Q2 subset answers: The subset B answer says “CSC – Bid Management and Internal Systems work, 1984–1990s… a transition into bid management roles by the early 1990s”; but I didn’t start bid management work until 2001.
  • Claude Q2 subset answers: A subset C section heading, ‘Joining CSC and the Cosmos / Amigo research project (1986–1989)’, is wrong because we never joined the Amigo research project – we just had a seminar to share what each group was doing.
  • Claude Q3 subset answers: In the subset A answer, the PAW-ACMOIS-Jul88-p277 (1988) review of “wireless Intraoffice network technologies” appears incorrectly in section 5 (CSCW) instead of section 4 (Local Area Networks).
  • Claude Q5 subset answers: Claude incorrectly assumes that I explored doing a ‘UCL Graduate Diploma in HCI with Ergonomics’ but actually the index entry concerned a letter from UCL asking me to publicise their course.
  • Claude Q2 merged answer: The summary trajectory section starts in 1977 with my time at CPC and doesn’t mention Kodak [where I worked prior to CPC] at all despite Kodak appearing in its subset B answer.
  • Claude Q2 merged answer: Claude says that, in a CHOTS paper, I specified the ‘four-part reference-number scheme later used in PAWDOC itself’ whereas the scheme was already being used in PAWDOC when I was doing the CHOTS work.
  • Claude Q2 merged answer: The CSC start date is specified as 1986 but actually was 1984.
  • Claude Q3 merged answer: The text says “Doug Engelbart’s bootstrap seminars on Dynamic Knowledge Repositories, which CSC adopted (PAW-DOC-7385-01)”. However, the only reason I can see for claiming that ‘CSC adopted’ is the inclusion of a username and password for access to the Bootstrap Institute in the 7385-01 Index entry.
  • Claude Q5 merged answer: Three CPC courses are wrongly ascribed to Kodak in sub-section 3.2.
  • Claude Q5 merged answer: Claude assumes that an Index entry about a letter concerning Nottingham’s proposed MSc in Human-Computer Interaction is “evidence that he was seriously weighing a formal postgraduate qualification” whereas, in fact, I was simply responding to a request for support for the establishment of such a course.

These errors may have been due to lack of material in the subset, or to a misinterpretation of the information provided, or to limited information in Index entries without access to the associated files; but however they were caused they provide an important reminder that all LLM material needs to be checked if you want to rely on its veracity.

Overall, then, I believe it is feasible to use AI to explore large archives via their Index entries and associated File Names. It may need careful planning and preparation and be time-consuming to carry out, but the results can be very informative – perhaps providing insights which would be just too time-consuming and expensive to obtain in any other way. Certainly, there’s no way I, or anyone else, would ever have produced the in-depth material on the information contained in PAWDOC on people, organisations, travel, and training that Claude has produced in these tests. However, researchers who are going to make more than just casual enquiries of an archive would be well advised to develop verification strategies as an integral part of the exercise. These need not be comprehensive and definitive checks, but instead may involve sampling, using catalogue searches, or even using the AI itself, to get a sense of how much, if anything, is amiss. The research need not stop there. There are probably optimum strategies for using such AI answers, and awareness of their flaws, as a starting point for research. However, at present, I am not aware of any such strategies having been documented. This is another aspect I hope to know more about by the time I get to Phase 7 of this work.
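Of the verification options mentioned above, sampling is the easiest to illustrate. The sketch below is only one possible approach, not a documented strategy: `draw_sample` and `observed_error_rate` are hypothetical helpers, the item labels are placeholders, and the verdicts are invented, since in practice they would come from manual checks against the archive.

```python
import random


def draw_sample(items: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Pick a reproducible random sample of reported items to check by hand."""
    rng = random.Random(seed)
    return rng.sample(items, min(sample_size, len(items)))


def observed_error_rate(verdicts: dict[str, bool]) -> float:
    """Fraction of checked items judged wrong (True means the item was correct)."""
    if not verdicts:
        raise ValueError("no items were checked")
    wrong = sum(1 for is_correct in verdicts.values() if not is_correct)
    return wrong / len(verdicts)


# Placeholder labels standing in for the items listed in a merged report.
reported_items = [f"item {n}" for n in range(1, 301)]
to_check = draw_sample(reported_items, sample_size=30, seed=42)

# Invented verdicts: pretend one of the thirty sampled items turned out wrong.
verdicts = {item: True for item in to_check}
verdicts[to_check[0]] = False
print(f"{observed_error_rate(verdicts):.1%} of the sample was wrong")
```

A fixed seed makes the sample repeatable, so the same items can be rechecked later or by someone else. The error rate of a small sample is only a rough indication of how much may be amiss in the full report.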

The table below records the breakdown of the time I spent on Phase 4 and across all phases.

| Activity | No. of tasks or task breakdown | Elapsed time | Time spent |
|---|---|---|---|
| Phase 1 | 70 (started 05Mar2026) | 43 days | 105 hrs |
| Phase 2 | 8 | 4 days | 11 hrs |
| Phase 3 | Create test files, test, analyse results | 3 days | 15 hrs |
| | Research & draft pwofc.com posts | 4 days | 12 hrs |
| Phase 4 | Create test files, test, analyse results | 14 days | 80 hrs |
| | Research & draft pwofc.com posts | 11 days | 31 hrs |
| Totals | | 111 days | 254 hrs |
