Phase 4 Results

The Phase 4 objective was to explore if it is possible to use AI to investigate the contents of a large archive by using a combination of its Index entries and the names of the associated files. Having completed the work I can report that it is indeed feasible to do just that – and with some spectacular results. However, there are some caveats which are highlighted in the following paragraphs.

The issue with a large archive is that there is too much information to hold in the Context Window of an LLM. To get round that problem, I split up the Index and File Name information into small subsets, and instructed the LLM to, first, produce answers for each subset, and then to merge the subset answers into an overall answer for the whole archive. The practicalities of splitting up the data and establishing subsets are described in a previous post. Suffice it to say that it is best to a) produce subsets that combine a number of index entries together with their associated file names (rather than having separate Index Entry and File Name subsets), and b) to keep the number of characters in every individual subset file well under the limits specified by the LLM to avoid truncation.
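The two-step approach described above (answer each subset, then merge the subset answers) can be sketched in Python as follows. This is only an illustration: the work reported here was done by uploading files in the chat interfaces, and the `ask` parameter is a hypothetical stand-in for whatever mechanism sends a prompt to an LLM and returns its reply. The prompt wording in the sketch is also invented for illustration.

```python
from typing import Callable, Sequence


def answer_in_subsets(
    question: str,
    subsets: Sequence[str],
    ask: Callable[[str], str],
) -> str:
    """Ask `question` of each subset, then ask for one merged answer.

    `ask` is a hypothetical callable that sends a prompt to an LLM
    and returns the reply as text.
    """
    # Step 1: one answer per subset of Index entries and file names.
    partial_answers = [
        ask(f"{question}\n\nDATA:\n{subset}") for subset in subsets
    ]
    # Step 2: a single merge request covering all the subset answers.
    numbered = "\n\n".join(
        f"ANSWER {i}:\n{text}"
        for i, text in enumerate(partial_answers, start=1)
    )
    return ask(
        f"Merge these answers into one answer to: {question}\n\n{numbered}"
    )
```

With 10 subsets this makes 11 requests per question: 10 subset answers plus one merge.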

I tested this subset strategy by asking the following 6 questions of my PAWDOC work archive of some 17,000 Index entries and 31,000 associated files:

  1. List all the people named in this part of the Index and its associated files, and the organisation they belong to if any.
  2. Describe Paul Wilson’s career over the period covered by this part of the Index and its associated files.
  3. What significant changes in Information Technology occurred during the period covered by this part of the Index and its associated files?
  4. Document all the travel undertaken by Paul Wilson over the period covered by this part of the Index and its associated files.
  5. What training was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, and how important were particular elements to his subsequent career?
  6. What are the strangest or most unusual things to be found within this part of the Index and its associated files, including unlikely coincidences, and events with unexpected outcomes?

As described in a previous post, to get the answers you want, you need to provide clear and detailed instructions to the AI in the Prompt that you submit: the bare questions as phrased above are not sufficient, so I used far more detailed versions.

Another factor that affects the results is the choice of LLM. For this investigation I used ChatGPT and Claude Sonnet 4.6 (I wanted to use Copilot as well but found that my subset files contained too many characters for Copilot and were being truncated). The results from ChatGPT and Claude are compared below. However, it’s worth saying here that different LLMs have different abilities, and it’s good to be clear about what sort of task you want the AI to perform and which LLMs are best suited to that job. The Hugging Face website is one of the primary repositories for locally run models, each being categorised as suitable for one or more of over 50 different tasks – though support for archives isn’t one of them. Similar advice about which cloud platforms are best for particular tasks can be found in review articles with titles like “10 best AI….”. However, I have found nothing specific to archives in these texts either. So, as yet, I have no definitive listing of the AI tasks that might be required to support archives, nor of which LLMs are best suited to perform those tasks. I hope to have more information on this when I get to Phase 7 of this work.

Getting back to the Phase 4 tests, I ended up with 10 Subsets of PAWDOC Index entries and associated file names. So, for each of the six questions I got 10 answers for ChatGPT and 10 for Claude – 120 answers in all. Then I asked ChatGPT and Claude respectively to combine the 10 answers they had produced for each question into a single merged answer, which produced 6 merged answers from ChatGPT and 6 from Claude. I reviewed each of the 132 answers and gave each of the merged answers a score out of 10; the results are summarised in the table below.

Question | ChatGPT (score out of 10) | Claude (score out of 10)
Q1 People | Poor result. Unable to distinguish people names from adjacent words, e.g. ‘Do, To’. Time to produce: People spreadsheet 34s; Organisation spreadsheet 65s. Score: 2 | Over 2,440 names were listed (all looking valid) with 1,245 being allocated to one of 600+ organisations. Impressive data collection in the subsets and an excellent consolidation. Time to produce: People spreadsheet 62s; Organisation spreadsheet 111s. Score: 9
Q2 Career | Pretty good 6-page answer but constructed around general activities, not organisations or key projects. Time to produce: 45s. Score: 7 | An impressive 13-page report. A few errors, probably due to limited data in the Index entries and File Names. A hugely informative, comprehensive, and highly readable piece. Time to produce: 343s. Score: 8.5
Q3 IT changes | Quite good 11-page answer identifying 14 major categories of IT change with details within; but the Reference Numbers are listed separately and not related to specific changes. Time to produce: 62s. Score: 7 | An impressive 21-page report with 22 categories of IT change, a summary timeline table, and a conclusion with Cross-Cutting Observations. A comprehensive and coherent overview. Time to produce: 830s. Score: 9
Q4 Travel | A 5-page report listing 62 travel events (though the subset reports listed a total of 122). Little detailed analysis. Not a very good document. Time to produce: 59s. Score: 3 | An exceptionally comprehensive 48-page report with a table of contents and detailing 326 confirmed trips, 331 destination visits, 135 unique destinations, and 137,018 total one-way miles. Hugely impressive. Time to produce: 2502s. Score: 9.5
Q5 Training | A 10-page report with too little data and statistics. Many events are listed under a general category. The subset reports were better, listing 93 specific events. Time to produce: 15s. Score: 3 | A 23-page formal report detailing 129 training events and a discussion on their relevance to Wilson’s career. A thoroughly competent and authoritative document. Time to produce: 1712s. Score: 9
Q6 Strange items | An 8-page report listing the top 22 strange and unusual items, rather than all 100+ items identified in the subsets. Clearly presented and readable. Time to produce: 30s. Score: 8 | A 29-page report detailing 172 instances of strange and unusual events sorted into 12 categories. A table at the end lists all 172 instances graded from 1 (least strange) to 10 (most strange), in ascending order of Strangeness. A very good clear answer, well formatted and easy to read. Time to produce: 1175s. Score: 9

Claude has clearly produced the better results – and perhaps the ‘time to produce’ numbers indicate why: across the seven merged outputs (Q1 produced two spreadsheets), ChatGPT took an average of 44 seconds and Claude an average of 962 seconds. Indeed, in some cases, Claude’s merged reports are so detailed and so well formatted that they are too believable – and this is their downfall. To reproduce such comprehensive results, or to verify the answers, would probably take weeks of manual work; so, despite their potential for error (more of this below), it is very tempting to just assume they are totally correct.
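The averages quoted above can be reproduced from the ‘Time to produce’ figures in the results table. The short Python check below treats the two Q1 spreadsheets as separate outputs, which is the reading that yields the quoted averages; the variable names are just for illustration.

```python
from statistics import mean

# Seconds per merged output, copied from the results table.
# Q1 produced two spreadsheets per model, so each list has seven entries.
chatgpt_seconds = [34, 65, 45, 62, 59, 15, 30]
claude_seconds = [62, 111, 343, 830, 2502, 1712, 1175]

chatgpt_avg = mean(chatgpt_seconds)
claude_avg = mean(claude_seconds)

# Rounded averages, and how many times longer Claude took on average.
print(round(chatgpt_avg), round(claude_avg), round(claude_avg / chatgpt_avg))
# prints: 44 962 22
```

On these figures Claude spent roughly 22 times as long as ChatGPT on each merged output.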

Although I didn’t diligently check every aspect of some of these extremely detailed reports, I did spot a few errors which suggests there may be several more across this body of material. They are summarised below to give a flavour of what can go wrong.

  • ChatGPT Q1 subset answers: Large numbers of incorrect People and Organisations are put forward by ChatGPT. A sensible rationale for identifying the difference between any two adjacent words and a Person or Organisation’s name is just not there. For example, two of the people names put forward by ChatGPT are ‘Do, To’ and ‘Taxation, Oil’.
  • Claude Q2 subset answers: The subset B answer says “CSC – Bid Management and Internal Systems work, 1984–1990s… a transition into bid management roles by the early 1990s”; but I didn’t start bid management work until 2001.
  • Claude Q2 subset answers: A subset C section heading – ‘Joining CSC and the Cosmos / Amigo research project (1986–1989)’ is wrong because we never joined the Amigo Research project – we just had a seminar to share what each group was doing.
  • Claude Q3 subset answers: In the subset A answer, the PAW-ACMOIS-Jul88-p277 (1988) review of “wireless Intraoffice network technologies” appears incorrectly in section 5 (CSCW) instead of section 4 (Local Area Networks).
  • Claude Q5 subset answers: Claude incorrectly assumes that I explored doing a ‘UCL Graduate Diploma in HCI with Ergonomics’ but actually the index entry concerned a letter from UCL asking me to publicise their course.
  • Claude Q2 merged answer: The summary trajectory section starts in 1977 with my time at CPC and doesn’t mention Kodak [where I worked prior to CPC] at all despite Kodak appearing in its subset B answer.
  • Claude Q2 merged answer: Claude says that, in a CHOTS paper, I specified the ‘four-part reference-number scheme later used in PAWDOC itself’ whereas the scheme was already being used in PAWDOC when I was doing the CHOTS work.
  • Claude Q2 merged answer: The CSC start date is specified as 1986 but actually was 1984.
  • Claude Q3 merged answer: The text says “Doug Engelbart’s bootstrap seminars on Dynamic Knowledge Repositories, which CSC adopted (PAW-DOC-7385-01)”. However, the only reason I can see for claiming that ‘CSC adopted’ is the inclusion of a username and password for access to the Bootstrap Institute in the 7385-01 Index entry.
  • Claude Q5 merged answer: Three CPC courses are wrongly ascribed to Kodak in sub-section 3.2.
  • Claude Q5 merged answer: Claude assumes that an Index entry about a letter concerning Nottingham’s proposed MSc in Human-Computer Interaction is “evidence that he was seriously weighing a formal postgraduate qualification” whereas, in fact, I was simply responding to a request for support for the establishment of such a course.

These errors may have been due to lack of material in the subset, or to a misinterpretation of the information provided, or to limited information in Index entries without access to the associated files; but however they were caused they provide an important reminder that all LLM material needs to be checked if you want to rely on its veracity.

Overall, then, I believe it is feasible to use AI to explore large archives via their Index entries and associated File Names. It may need careful planning and preparation and be time-consuming to carry out, but the results can be very informative – perhaps providing insights which would be just too time-consuming and expensive to obtain in any other way. Certainly, there’s no way I, or anyone else, would ever have produced the in-depth material on people, organisations, travel, and training in PAWDOC that Claude has produced in these tests. However, researchers who intend to make more than just casual enquiries of an archive would be well advised to develop verification strategies as an integral part of the exercise. These need not be comprehensive and definitive checks, but instead may involve sampling, using catalogue searches, or even using the AI itself, to get a sense of how much, if anything, is amiss. The research need not stop there. There are probably optimum strategies for using such AI answers, and awareness of their flaws, as a starting point for research. However, at present, I am not aware of any such strategies having been documented. This is another aspect I hope to know more about by the time I get to Phase 7 of this work.
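As one illustration of the sampling idea mentioned above, the Python sketch below draws a reproducible random sample of claims from an AI report for manual checking and then computes the error rate found in that sample. It is not a documented verification strategy, just a minimal example; the function names and the idea of treating each cited Reference Number as one ‘claim’ are assumptions made for illustration.

```python
import random


def sample_for_checking(claims, k, seed=0):
    """Return up to k claims chosen at random for manual checking.

    A fixed seed makes the sample reproducible, so the same items
    can be re-checked later.
    """
    rng = random.Random(seed)
    items = list(claims)
    return rng.sample(items, min(k, len(items)))


def observed_error_rate(checked):
    """`checked` maps each sampled claim to True if it was found to be wrong."""
    if not checked:
        return 0.0
    return sum(checked.values()) / len(checked)
```

For example, if 2 of 40 sampled claims turn out to be wrong, the observed error rate is 0.05. That figure only gives a sense of the scale of the problem, not a guarantee about the claims that were not checked.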

The table below records the breakdown of the time I spent on Phase 4 and across all phases.

Activity | No of Tasks or task breakdown | Elapsed time | Time spent
Phase 1 | 70 (started 05Mar2026) | 43 days | 105 hrs
Phase 2 | 8 | 4 days | 11 hrs
Phase 3 | Create test files, test, analyse results | 3 days | 15 hrs
Phase 3 | Research & draft pwofc.com posts | 4 days | 12 hrs
Phase 4 | Create test files, test, analyse results | 14 days | 80 hrs
Phase 4 | Research & draft pwofc.com posts | 11 days | 31 hrs
Totals | | 111 days | 254 hrs

File Splitting and Truncation

As described in the previous post, the Index and file information in my PAWDOC collection is too big to be ingested and used in a single prompt in today’s AI systems. So, I’m using it to explore how AI might be able to analyse a very large collection in pieces and then to stitch the results together. The first step in such an undertaking is to decide what information is to be provided to the AI and how small the individual subsets (i.e. files) of that information need to be to guarantee that they will be taken into account in full in the AI’s analysis (files that are too large will simply have some of their contents truncated after they are uploaded but before the AI starts its analysis).

I decided to provide the following fields from the PAWDOC index to the AI:

  • Reference Number: every document in the collection has a Reference Number which is made up of four parts – an Owner Identifier, a Set Identifier, a Serial Number, and a Sub-Serial Number. For example, PAW-DOC-4046-01.
  • Title: this contains free format text describing the document(s) concerned. Three dots (…) in the middle of the text denote that what follows are Keywords/Phrases.
  • Publication date: this is the publication date of the oldest document relating to that Reference Number.
  • Creation date: the date the index entry was created.

I also decided to provide the file title for every file associated with each Index entry. File titles have a general structure of Reference Number, Description, filename extension. For example:

PAW-DOC-1104-39   Planning the ITUSA document Interchange group X.400 User Test.tif

I was able to produce a file of the relevant Index fields by exporting them out of FileMaker. The export contained 17,381 entries composed of just over 2,400,000 characters (I calculated character numbers by using Excel’s ‘LEN’ formula) and sized at 1.4Mb.

Obtaining a list of the associated Filenames was a little more difficult as the files reside inside separate folders for each Reference Number. However, a search of the net established that the 7z file compression utility will provide a list of all the files within a higher-level directory structure; so, I downloaded and installed 7z, and created a file of file titles containing 31,270 entries composed of 2,568,000 characters and sized at 1Mb.
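I used 7z for this, but the same kind of listing can also be produced with a few lines of Python. The sketch below is an alternative, not the method described above: it walks a top-level folder and returns the name of every file found in any sub-folder. The function name and the example folder path are hypothetical.

```python
import os


def list_file_names(root: str) -> list[str]:
    """Return the names of all files under `root`, sorted alphabetically."""
    names = []
    # os.walk visits `root` and every folder beneath it.
    for _dirpath, _dirnames, filenames in os.walk(root):
        names.extend(filenames)
    return sorted(names)


# Example use (the path is hypothetical):
# for name in list_file_names("PAWDOC files"):
#     print(name)
```

Only the file names are kept, not the folder paths, since each file title already begins with its Reference Number.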

To assess the number and size of the subset files that needed to be produced I turned to the insights I recorded in the Phase 1 and Phase 2 Summary results:

  • ChatGPT’s usable context window is about 80k-100k tokens (approximately 160k-300k characters – assuming 2-3 characters/token). ChatGPT limits uploads to 3 files a day but this can be circumvented by putting multiple files in a zip file.
  • Copilot doesn’t have a fixed context window – its design means that its effective context window is much larger and more flexible than a single token number would suggest. It limits uploads to about 7 batches of up to 20 files. Approximately 30k characters per file should work fine.
  • Claude’s context window is 200k tokens (roughly 500 pages of text or approximately 800,000 characters). Claude caps file uploads at 30Mb per file and 20 files per conversation.

Copilot and Claude’s limits suggested that, for each prompt, I should load no more than 20 files containing a maximum of 30k characters each. This would exceed ChatGPT’s limits, but I thought that I could do without ChatGPT if its results were poor.
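A rough lower bound on the amount of splitting involved follows from the character counts given earlier. The Python arithmetic below uses 2,400,000 as a round figure for the Index export (described as ‘just over’ that number) and assumes every file could be filled exactly to 30k characters, which is not achievable in practice, so the real number of files and prompts will be somewhat higher.

```python
import math

index_chars = 2_400_000        # Index export: "just over 2,400,000" characters
file_title_chars = 2_568_000   # list of file titles
max_chars_per_file = 30_000
max_files_per_prompt = 20

total_chars = index_chars + file_title_chars
# Fewest files needed if every file were filled exactly to the limit.
min_files = math.ceil(total_chars / max_chars_per_file)
# Fewest prompts needed if every prompt carried the maximum number of files.
min_prompts = math.ceil(min_files / max_files_per_prompt)

print(total_chars, min_files, min_prompts)
# prints: 4968000 166 9
```

Any files that have to accompany every prompt reduce the number of slots left for data, which pushes the prompt count up further.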

I had already decided that I would upload a copy of the ‘PAWDOC Guide’ file in every prompt, and that I would count the text request as a file in its own right; so that left a maximum of 18 files of Index and Filename information in each prompt. I duly set about splitting the Index file into subsets of around 30k characters, and then creating associated files of file titles (ensuring that none of the files exceeded 30k characters), and assembled the combination of Index files and File Title files into groups of 18 files or fewer. It was a very laborious task. If I’d realised how time-consuming it was going to be, I would have found a way of splitting out the Reference Number from the File titles into a separate field and merging the overall Index file and overall File Title file into a single file sorted by Reference Number, which would have made the splitting task much simpler and quicker – a lesson worth remembering (in actual fact, I’d achieved a similar feat a few months earlier simply by asking ChatGPT to provide me with an Excel function for a similar task – I’d just forgotten it was that easy). Anyway, after the job was completed, I found I had 10 subsets of either 17 or 18 files each, which I named Subsets A-J. Subset B’s files are shown below.

  • Test B files\PAWDOC Index 12.csv
  • Test B files\PAWDOC Index 13.csv
  • Test B files\PAWDOC Index 14.csv
  • Test B files\PAWDOC Index 15.csv
  • Test B files\PAWDOC Index 16.csv
  • Test B files\PAWDOC Index 17.csv
  • Test B files\PAWDOC Index 18.csv
  • Test B files\PAWDOC File Names 7.csv
  • Test B files\PAWDOC File Names 8.csv
  • Test B files\PAWDOC File Names 9.csv
  • Test B files\PAWDOC File Names 10.csv
  • Test B files\PAWDOC File Names 11.csv
  • Test B files\PAWDOC File Names 12.csv
  • Test B files\PAWDOC File Names 13.csv
  • Test B files\PAWDOC File Names 14.csv
  • Test B files\PAWDOC File Names 15.csv
  • Test B files\PAWDOC File Names 16.csv
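The simpler approach mentioned above (pulling the Reference Number out of each file title, merging the Index and file titles by Reference Number, and then splitting the merged material) might look something like the Python sketch below. It is not what I actually did. It assumes that every file title starts with its Reference Number followed by whitespace, that Reference Numbers contain no spaces, and that Index rows are available as (Reference Number, text) pairs; the function names are hypothetical.

```python
from collections import defaultdict


def reference_of(file_title):
    """Return the leading Reference Number of a file title,
    e.g. 'PAW-DOC-1104-39' (assumes it contains no spaces)."""
    return file_title.split(maxsplit=1)[0]


def merge_by_reference(index_rows, file_titles):
    """Return one text block per Index row: the row, then its file titles.

    `index_rows` is an iterable of (reference_number, index_text) pairs.
    Blocks are ordered by a plain text sort of the Reference Number.
    """
    titles = defaultdict(list)
    for title in file_titles:
        if title.strip():  # skip blank lines
            titles[reference_of(title)].append(title)
    blocks = []
    for ref, text in sorted(index_rows):
        lines = [f"{ref},{text}"] + titles.get(ref, [])
        blocks.append("\n".join(lines))
    return blocks


def split_into_files(blocks, max_chars=30_000):
    """Pack whole blocks into chunks of at most `max_chars` characters.

    A single block longer than `max_chars` is emitted on its own,
    oversized, rather than being cut in the middle.
    """
    chunks, current = [], ""
    for block in blocks:
        candidate = block if not current else current + "\n" + block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because each chunk holds complete Index entries together with their own file titles, every subset file is self-contained, and the chunks can then be grouped into prompts of up to 18 files.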

I then set about getting each of the AIs to answer question 1 with Subset A. As it happened, there seemed to be no problem with ChatGPT: it happily accepted all the files in a zip file, and advised that no files were truncated when it analysed them. This didn’t seem to concur with the ChatGPT context window information I had established earlier, but the results seemed reasonable so I decided to use ChatGPT in the testing. So too with Claude, which also was happy to ingest all the files compressed into a single zip file. Copilot, however, produced inferior results which, I established after some Q&A with the AI, were due to the files being truncated. Copilot explained the problem very clearly:

Your file size (≈33 KB) and character count (≈29,950) were both within your rule of thumb — but that rule is no longer reliable. The truncation happened because:

  1. The ingestion limit is based on tokens, not characters or kilobytes: A token is roughly 3–4 characters of English text, but CSV files with long lines, punctuation, quotes, and irregular structure tokenize very inefficiently. Your file contains long filenames, commas, quotes, and mixed punctuation dramatically increase token count. So even though the file is only ~30k characters, it may be 10k–15k tokens, which is enough to trigger truncation.
  2. CSVs with hundreds of long rows are the worst-case format: Each line is treated as a dense block of text. Tokenization cost spikes.
  3. The system applies a per‑file token cap: This cap is not published, but from your repeated tests, the effective safe zone is:
  • ≈15,000–18,000 characters for CSV-like content
  • ≈20,000–22,000 characters for plain text
  • ≈30,000 characters only for very simple text

Your file was near 30k characters but had extremely token-heavy content → truncated.

It was clear that I wasn’t going to be able to conduct the tests with Copilot using the same subsets that I had already created. However, ChatGPT and Claude seemed able to deal with the subsets, and would enable me to make some sort of comparison of results. Furthermore, I wasn’t prepared to redo the subsets, so I decided simply to exclude Copilot from the tests. This experience suggests that it’s a good idea to test some sample file sizes with all the models to be used, before undertaking the final splitting of files into subsets.
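A simple pre-flight check along these lines can be sketched in Python. The character budgets below are the lower ends of the ranges that Copilot itself reported in the explanation quoted above; they are unpublished estimates from an AI, not documented limits, and would need confirming by testing sample files with each model. The names `SAFE_CHARS` and `oversized` are hypothetical.

```python
# Conservative per-file character budgets, taken from the lower ends of the
# ranges Copilot reported. These are estimates, not documented limits.
SAFE_CHARS = {"csv": 15_000, "plain": 20_000, "simple": 30_000}


def oversized(files, kind="csv"):
    """Return the names of files whose text exceeds the budget for `kind`.

    `files` maps a file name to its text content.
    """
    limit = SAFE_CHARS[kind]
    return [name for name, text in files.items() if len(text) > limit]
```

On these figures, a CSV file of about 29,950 characters would be flagged, whereas the same amount of very simple text would not.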

A prompt about Prompts

The Index to my PAWDOC collection of work documents is in a FileMaker database, and I’d recently been told that FileMaker had added some AI functionality into its product, and that it also had a number of AI-related videos for users to learn from. So, at the start of this investigation into providing AI support for larger collections like PAWDOC, I took to watching some of the videos. One was entitled ‘Learn about AI prompting basics (30May2024)’ and it outlined some of the key things to remember when constructing an AI prompt. In particular:

  • Clearly define the audience.
  • Define your goals.
  • Provide a step-by-step approach (do a, b, c, etc).
  • Include examples of what you want and say why they work.

At the time, I was thinking about putting a protective covering on the top of books to stop dust settling on the top of the text block, and I thought I could get AI to produce an outline which I could print on card and cut out. So, I set about developing a prompt using the guidelines listed above. It turned out to be an iterative process which took me at least a dozen attempts before I was satisfied with the results – illustrated in the image below.

The way the cover works is that you score a line where the image ends, cut away the two small rectangles on either side, and fold down the resulting tab and place it inside the spine of the book. The rest of the cut-out then sits on the top of the text block as shown in the top-down view of some of the books below. When reading the book, the cover can act as a bookmark.

Perfecting the prompt was a matter of getting the cut-out lines to be drawn in the right places; of getting the text boxes to be of the right size with minimal empty space; of getting the whole of the specified area to be covered with interesting illustrations; of figuring out whether it’s best to have it produced in a Word or PowerPoint document (I ended up with Word); and of realising that Claude AI would produce more interesting illustrations than ChatGPT (particular AI models are better at some things than others). Getting all of these things right took rounds of experimentation until I came up with the final prompt.

I had the same experience when investigating AI support for my large PAWDOC document collection, for which I devised 6 evaluation questions, each of which was to be asked of a subset of the PAWDOC Index and file titles. Subsets were needed because, unlike the Mementos collection which I have written about in earlier posts, the PAWDOC Index and file titles have some 16+ times the number of characters – a size which could not be ingested and used in a single prompt in today’s AI systems. So, I decided to see what sort of results could be obtained by splitting the information into 10 subsets, getting answers for each subset, and then stitching the 10 answers together. The 6 questions I specified were:

  1. List all the people named in this part of the Index and its associated files, and the organisation they belong to if any.
  2. Describe Paul Wilson’s career over the period covered by this part of the Index and its associated files.
  3. What significant changes in Information Technology occurred during the period covered by this part of the Index and its associated files?
  4. Document all the travel undertaken by Paul Wilson over the period covered by this part of the Index and its associated files.
  5. What training was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, and how important were particular elements to his subsequent career?
  6. What are the strangest or most unusual things to be found within this part of the Index and its associated files, including unlikely coincidences, and events with unexpected outcomes?

These all look fairly straightforward but the actual prompts I ended up with were considerably more extensive. For example, the prompt for the fifth question was:

PAWDOC is a work document collection built up from 1981 to the present day. The attached Guide outlines how PAWDOC is constructed and what it contains. Also attached is a subset of PAWDOC’s Index and the digital files associated with those index entries. Note: None of the actual PAWDOC documents have been provided. The following request is to be undertaken using just the information in the Index entries and the File Names provided in the attached files: identify the training that was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, including the objectives and composition of the training, how it was delivered, how much time was spent on it, how relevant it was to his subsequent career, and citing the Reference Numbers from which the information was obtained. NB. Be sure to inspect all supplied data first, then analyse.

All the prompts required a number of iterations before they were producing good results. They are reproduced in the attached document together with additional prompts to stitch the 10 responses to each question together. Note the prompt for the first question which was especially demanding.
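The training prompt quoted above has a recognisable shape: background on the collection, a statement of what is attached and what is not, the specific request, and a closing instruction. The Python sketch below shows one way such prompts could be assembled from shared parts. It assumes, purely for illustration, that the opening and closing wording is the same for every question, which is only confirmed here for the one prompt quoted; `build_prompt` is a hypothetical helper.

```python
# Opening context and closing instruction, copied from the training prompt
# quoted above. Reusing them for other questions is an assumption.
PREAMBLE = (
    "PAWDOC is a work document collection built up from 1981 to the present "
    "day. The attached Guide outlines how PAWDOC is constructed and what it "
    "contains. Also attached is a subset of PAWDOC's Index and the digital "
    "files associated with those index entries. Note: None of the actual "
    "PAWDOC documents have been provided. The following request is to be "
    "undertaken using just the information in the Index entries and the "
    "File Names provided in the attached files: "
)
CLOSING = " NB. Be sure to inspect all supplied data first, then analyse."


def build_prompt(request: str) -> str:
    """Wrap a question-specific request in the shared context and closing."""
    body = request.strip().rstrip(".") + "."
    return PREAMBLE + body + CLOSING
```

Keeping the shared wording in one place means that a refinement discovered while iterating on one question carries over to the others.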

These experiences are not novel – they merely reflect a well-known fact that to get the best out of an AI capability the prompts need to be well thought out and will likely require several iterations before they are deemed satisfactory.

Rephasing

At the start of this journey, I said I would conduct my investigations in 9 Phases:

  1. AI support for the Memento collection’s index entries.
  2. AI support for the Memento collection’s combined Index entries and file titles.
  3. AI support for the Memento collection’s index entries, file titles and textual items.
  4. AI support for PAWDOC’s index entries.
  5. AI support for PAWDOC’s combined index entries and file titles.
  6. AI support for PAWDOC’s combined index entries, file titles, and some or all of the born digital items.
  7. AI support for a subset of PAWDOC’s scanned items.
  8. AI support for a combination of index entries, file titles, some born digital material and some scanned items.
  9. AI support for the whole of PAWDOC.

I’ve completed the first three Phases and reported on them in previous posts. However, given what I have learned up to now, I’ve decided to change my plan in the following ways:

Combine Phases 4 and 5: It’s a feasible and better approach to obtain AI support for a combination of Index entries and associated File names.

Combine Phases 6, 7, and 8: I’ve discovered that AI can deal with scanned documents in much the same way as born digital documents, so they can both be investigated together. Furthermore, given the limits on context windows, and file numbers and sizes, it’s not really practical to try to obtain AI support for combinations of the whole index and large numbers of content files. Instead, I shall investigate the approach of applying AI to a subset of content files and their associated index entries.

Change Phase 9: AI support for the whole of PAWDOC will have been explored in the new Phases outlined above. Instead, the final Phase will assess all my findings about how AI can support personal archives.

Add another Phase: Since setting out the Phases back in March this year, I’ve discovered that the FileMaker application I use to manage the PAWDOC Index has been incorporating progressively more AI support since the version I currently have, which came out in 2019. Furthermore, FileMaker is currently offering a 45-day free trial of its latest version – FileMaker Pro 2025. I intend to take up this offer once I have completed the revised PAWDOC Phases. This will include watching various training videos about FileMaker’s AI capabilities in addition to performing tests on the PAWDOC Index and Files. I anticipate that the way FileMaker has addressed the opportunities afforded by AI will extend my AI knowledge as well as providing an insight into how any Database/Indexing/Repository software might try to exploit AI.

In summary the Phases I now intend to follow are as follows:

1-3: Complete.

4. AI support for PAWDOC’s combined index entries and file titles.

5. AI support for selected subsets of PAWDOC born digital/scanned documents and their associated index entries.

6. AI Support provided by FileMaker for the PAWDOC Index.

7. Assessing all my findings about how AI can support personal archives.

Phase 3 Results – Text & Image Content

Phase 3’s objective was to assess how well AI could support not only an Index and its file titles, but also the textual content of the files themselves. However, after doing a bit of digging, I discovered that AIs could work not only with textual content, but also with scanned text, and with images in general (of course, such material still has to be presented to the AI in the form of uploaded files or RAG Chunks). With this awareness in mind, I set about devising some tests to find out just how well AIs can perform when presented with real content as opposed to just metadata. I came up with the following:

Tests of Machine-readable text

  • Test 1 – describe and summarise the contents of three years of diary entries in Word format.
  • Test 2 – discuss any relationships that can be found between three Word files with diverse contents: my library loan history for 2004-2012; an account of the petitioning of a school’s teachers to make a change to daily activities; some thoughts about university life while in the infirmary recovering from German Measles.

Tests of Image-only scanned text

  • Test 3 – summarise Friends of the Earth activities in Harrow as documented in three image-only scanned text documents from 1976-1979 in PDF format.

Tests of Text in images

  • Test 4 – list all the events and activities described in three documents of events, tickets, membership cards etc.
  • Test 5 – describe and summarise the contents of all the images in three pages of Italy stamps which also include cutouts from the relevant parts of stamp catalogues.
  • Test 6 – catalogue the contents of the three pages of Italy stamps images using the following fields: Reference Number, Country, Year, Value, Notes.

Tests of collections of objects in images

  • Test 7 – describe and summarise the contents of all the images in three photos of chocolate wrappers, with each photo showing a) a double page of a chocolate wrapper scrapbook (in an unused 1967 A4 diary); and b) 3 plastic wallets containing loose wrappers.
  • Test 8 – catalogue the contents of the three chocolate wrapper photos using the following fields: Reference Number, Name, Manufacturer, Type, and Size.

  • Test 9 – describe and summarise the contents of three photos of household ornaments showing a) 10 pieces of Wedgwood; b) 30 small display items including porcelain (cups, saucers, plates, vases, jug, trinket boxes, flowers), glassware (bowl, vase, jug, flower, bird), stoneware (lighthouse, ashtrays, barometer), and wood (bowl, elephants); and c) 13 sundry items including silver trays, bowls, coasters and shoehorn; large shells; letter rack; pen holder; and decorative tray and plate.
  • Test 10 – catalogue the contents of the household ornament photos using the following fields: Reference Number, Name, Type, Colour.

As may be apparent from the above descriptions, each test involved attaching three files to the AI Prompt together with a request, for example, “Using the three files I have just uploaded, catalogue the contents of the images using the following fields: Reference Number, Name, Type, Colour.”

All 10 tests were applied to the four AIs that had been used in the previous phase – AnythingLLM with Mistral, ChatGPT, Copilot and Claude. I did explore the possibility of using two other widely used products – Llama from Meta, and Gemini from Google. However, both require that you create an account before you can use them and I didn’t want to do that because, among other reasons, I’m trying to limit my exposure to data collection and advertising which are central to both of those organisations’ operations. Indeed, during the process of opening a Meta account, I was actually informed that I would be consenting to being shown adverts: I stopped at that point. So, for these tests I stuck with the four AIs previously mentioned. I have been using the free versions of ChatGPT, Copilot and Claude up to now. However, when I started doing these tests ChatGPT suddenly changed the number of files it was allowing me to upload each day from 3 to 2. Since all the tests involve 3 files I elected to upgrade to ChatGPT-Go which enables you to “usually upload far more than the Free tier’s 3 files/day, but there is still a rate limit, and OpenAI hasn’t publicly stated the exact number.” The cost was £7 a month with the ability to cancel anytime. I encountered no limits when I was conducting these tests with the free versions of Copilot and Claude.

Before discussing the test results, it’s worth being clear about the image recognition and text-reading capabilities of the AIs concerned. First, AnythingLLM is not capable of interpreting images so, unsurprisingly, its results in these tests are very poor. However, I included it anyway just to see how it would react. Second, ChatGPT, Copilot and Claude (like most other Large Language Models) don’t apply separate conventional OCR (Optical Character Recognition) techniques to interpret text in scans, photos or other images. Instead, they undertake text recognition as part of their general image understanding capabilities, which include the combined assessment of visual patterns, language, spatial relationships, and context. Consequently, their text recognition capabilities often depend on the type and volume of training data they have been given. As ChatGPT put it: the image recognition capabilities of AI models “are deeply dependent on the training data, but also on model design, compute scale, optimization methods, and how visual and language information are combined during training.”

With this in mind, we can now consider the results of the 10 tests. Each individual test was evaluated in two ways: first, I looked at the answer provided and wrote a short paragraph reflecting what I thought about it; and, second, I awarded a score out of 10 according to what I’d written.  As is obvious, this is a completely subjective, and a potentially highly inaccurate, process. However, it does have the advantage that it uses a single observer with a consistent mind-set. Hopefully, that consistency will be reflected throughout all the results.

All the evaluation paragraphs and scores for all four AIs are included in this Word document. However, I also produced an overall summary paragraph, and overall average score, for each AI, and these are provided below. They are accompanied by an equivalent paragraph produced by ChatGPT after it was provided with each AI’s detailed paragraphs and scores for each test.

AnythingLLM (overall average score 1.3)

Human overall evaluation: AnythingLLM made it clear in the later tests that it is unable to interpret images. Given this fact it is odd that for two of the questions concerning images it gave answers that were completely nonsensical (circuit boards for stamps; and a photo of ‘a person wearing a white lab coat’) – it could have just said it couldn’t interpret the images. Regarding scanned text, AnythingLLM was able to interpret both the scanned typed text and the scans of tickets – however sometimes incorrectly and sometimes hallucinating parts of the resulting answer. Whatever answers were provided by AnythingLLM were typically very brief and lacking in supporting material.

ChatGPT’s overall evaluation: Across the 10 tests, AnythingLLM demonstrated consistently weak performance, particularly when interpreting images and scanned documents. In six image-based tests involving ornaments, chocolate wrappers and stamps, the AI failed completely, often producing wildly inaccurate descriptions such as circuit boards, robots or software code instead of the actual subjects shown. These responses were considered entirely unusable and all received scores of zero. Performance improved only slightly when handling text-based or scanned documentary material. In the events and activities test, the AI identified a few correct details but omitted most of the content and introduced several errors, resulting in a middling score. Responses to the Friends of the Earth documents and the diary summaries were similarly superficial, inaccurate or overly narrow. Overall, the AI showed limited comprehension, poor contextual understanding and an inability to extract detailed or reliable information from uploaded materials, achieving an average score of just 1.3 out of 10.

ChatGPT (overall average score 8.7)

Human overall evaluation: ChatGPT gave a very good answer to the question in which text was provided in Word, but a little less so when assessing scanned typed text. It performed extremely well when assessing objects in images and in picking text out of those images, suggesting that it had been given some training data in the subjects covered (stamps, chocolate wrappers, and household goods). The one thing that wasn’t very helpful was the images which it displayed in answers to the image questions – they were not of items in the supplied images but of items similar to those in the supplied images, which was just generally confusing. Most answers were generally very clear and complete, and there were only a few errors. The catalogues it produced were clear and with good descriptive information.

ChatGPT’s overall evaluation: Across the 10 tests, the AI demonstrated consistently strong performance in analysing, cataloguing, summarising, and interpreting a wide variety of uploaded documents and images. Scores ranged from 7.5 to 9.7 out of 10, with an overall average of approximately 8.7, indicating a high level of competence. The AI performed particularly well when extracting structured information from complex visual material such as stamp collections, chocolate wrappers, ornaments, and historical documents, often identifying nearly all relevant items and presenting the results clearly in tables or organised summaries. Its ability to read embedded text and recognise themes across multiple files was repeatedly praised. The main weaknesses noted were occasional omissions, minor reading inaccuracies, over-generalisation in crowded images, and the insertion of misleading illustrative images not present in the source material. Nevertheless, the tester consistently regarded the outputs as comprehensive, useful, readable, and well-structured overall.

Copilot (overall average score 7.5)

Human overall evaluation: Copilot produced excellent answers when reading from text in Word format – comprehensive, well structured, and insightful, with no noticeable errors. It was a little less competent when reading the scanned typed text. It read text in images without a problem – including vertical text. However, it performed less well when identifying objects in images – often missing out several items from the descriptive text or grouping several items together; and the catalogue listings were similarly incomplete. This might reflect a lack of training data in the areas concerned.

ChatGPT’s overall evaluation: Across the 10 tests, the AI demonstrated strong overall performance, particularly in tasks involving summarisation, thematic interpretation, and extracting information from structured or semi-structured documents. Its best results came from diary summaries, document relationship analysis, and chronological event extraction, where it showed impressive comprehension, organisation, and the ability to interpret handwritten and vertically oriented text. Scores in these areas were consistently high, often above 9/10. Performance was more mixed when detailed cataloguing or precise item identification was required from complex or cluttered images. In several image-based catalogue tasks, the AI missed items, misinterpreted grouped objects, or introduced incorrect assumptions about how materials had been arranged or curated. Stamp and wrapper cataloguing proved especially challenging, with omissions and occasional transcription errors reducing reliability. Nevertheless, even weaker responses were generally well-presented and useful in part, indicating a capable system whose strengths lie more in synthesis and interpretation than exhaustive visual inventory work.

Claude (overall average score 9.3)

Human overall evaluation: Claude’s overall evaluation score of 9.3 reflects the fact that it performed extremely well in all but one of the tests. Its answers were comprehensive, full of detail and easy to read. It displayed competency in all four areas being tested – Word text, scanned typed text, text in images, and images of objects. Very few facts or items were missed. In one of the image files it identified the base of a lampstand from an image of just a small piece of its base. Its capabilities suggest it has had a broad range of training data. Its catalogue listings were good with two of the four being produced in a useful Excel format. The only thing that let it down was some numerical errors in the answer cataloguing household items: it incorrectly included the legend row and an empty base row in the total number of items it reported (i.e. it said there were 55 items instead of 53); and it reported that there were 17 items listed in the spreadsheet for image 3 whereas it had only actually listed 15 items in the spreadsheet.

ChatGPT’s overall evaluation: Across the 10 tests, the AI demonstrated consistently high performance, achieving scores between 8.7 and 9.9 out of 10, with an overall average of approximately 9.3. Its strongest capabilities were in extracting, cataloguing, and summarising information from complex images and scanned documents, often identifying nearly every visible item and adding insightful contextual observations. The AI showed particular skill in recognising handwritten text, interpreting historical or archival material, and producing structured outputs such as Excel spreadsheets. Reviewers repeatedly praised the clarity, comprehensiveness, and readability of the responses, as well as the AI’s ability to infer broader themes and relationships across documents. Minor weaknesses included occasional misidentifications of objects, over-interpretation of details, and small numerical inconsistencies in summaries or item counts. Nevertheless, these errors were generally isolated and did not significantly detract from the overall quality. The results indicate an AI with excellent analytical and descriptive abilities across diverse document and image-processing tasks.

Claude comes out a clear winner in these tests, with ChatGPT coming in second. Copilot, while performing excellently with text, appears to have had less relevant image training. At a general level, however, the results illustrate very clearly that AIs can work extremely well with both text and images; and could be very useful to collectors in identifying items, describing them, cataloguing them, and creating indexes for them.

For completeness, the table below records the breakdown of the time I spent on Phase 3 and across all phases.

Activity | No of Tasks or task breakdown | Elapsed time | Time spent
Phase 1 | 70 | 43 days | 105 hrs
Phase 2 | 8 | 4 days | 11 hrs
Phase 3 | Create test files | 3 days | 15 hrs
Phase 3 | Research & drafting pwofc.com post | 4 days | 12 hrs
Totals | 80 | 54 days | 143 hrs

Creating an Audio Book

Not all books have to be in print format. If you have a collection of audio material, then it’s perfectly feasible to create an audio book. That is what I did in 2008 when I assembled the material from two cassette tapes I had recorded when walking from Harrow to Trafalgar Square in London in 1975. I used the excellent open source, free-to-use Audacity software to break the recordings into discrete sections; stored the files on a CD; and created a cover for the CD case which included some of the (very poor) photos I had taken on the way.

In the last month I’ve just undertaken a similar exercise. A friend who used to work for BBC Radio, and who knows the value of audio memoirs, had suggested that I record my ageing mother speaking about her early life at school, her experiences during World War II and in the WRENS, and the many years she spent in Singapore. I took the advice and duly recorded some 8 hours of conversation with my mother when she was in her late nineties. The recordings were made using one of the utility apps in an Apple iPhone which produced 24 very clear files in the m4a format. I decided to combine these with another 8 short files of my mother describing particular photos, which I had recorded when sorting out her photo collection back in 2012.

I wanted to accompany the audio files with a Word document summarising the contents, but realised that, in order to do that, I would need to listen to each file in detail and then create a summary paragraph. That sounded like hard work, so I decided to enlist the help of transcription services to produce written versions of each audio file. I tried to find a free service but couldn’t find one that produced good transcriptions in the volumes I required. In the end I chose to subscribe to Otter.ai’s business service for one month at a cost of $30. This enabled an unlimited number of files to be uploaded to the Otter site which then produced transcriptions in three parts:

  • An Overview Summary paragraph (broadly good but with the occasional error)
  • An Outline in the form of separate headings containing a series of bullet points (these seemed to contain quite a lot of errors in the bullet points)
  • A Detailed Transcript (pretty good at capturing almost every word, but sometimes getting the odd word wrong. The transcript was just a series of paragraphs without the identification of who was speaking each paragraph, and sometimes not splitting the paragraphs into separate speakers).

Taking Otter’s capabilities into account, I decided to edit the Detailed Transcript to get it as correct as I could and to insert the relevant speaker at the beginning of each paragraph (which was either my mother or myself). This entailed actually listening to the audio as I edited the transcript – a much easier task, I would imagine, than just listening to the audio while making notes. I also edited the Overview Summary paragraph to ensure it was correct and informative so that I could just lift that out to include in the overall summary document; and I ignored the Outline but left it intact in the transcript files. So, I ended up with the following set of materials:

  • 1 overall ‘Introduction and Summaries’ Word document
  • 32 audio files in m4a format
  • 1 transcript Word document for each audio file
  • 1 or more jpg image files for each of the audio files describing particular photos.

This package of material includes quite a significant element of our family’s history, so I wanted to give a copy of this audio book to each of my children in a form that would be accessible in the long term, and would stand a chance of not just getting lost. Consequently, I decided to use a USB memory stick to store the audio files on, and to box it up in a clear USB case designed to take a cover insert. I obtained both from River Solutions Group: 5 Twister Silver White USB Sticks 4GB at £2.25 each; and a pack of 10 Rivermedia Premium USB Case Clear 14mm for £8.50.

I produced the cover in PowerPoint and included on it a picture of my mother and a list of the 32 audio files headed by the ‘Introduction and Summaries’ file (10 on the front cover and 22 on the back). I’m hoping that this overall package will look significant enough to ensure its survival for a few decades at least.

 

A New Loft Management App

Addenda to ‘Loft Management’

Over the 20+ years I’ve been operating a Loft Index I’ve used 3 different computer applications: FileMaker on a laptop, uGrokIT on a phone, and Memento Database on a phone. In the March 2025 post describing the move to the Memento Database I noted that scrolling was hesitant and that the records moved up the screen on top of the field headers. The Memento Help person had told me that ‘this happens because Memento stores photos in their original size, which can be quite large’ and that they would ‘try to add a photo compression feature in future releases’. I decided to ignore these problems for three reasons: a) they weren’t interfering with creating new entries or searching, b) the product was free to use locally, and c) I’d gone to a lot of effort transferring the data and photos into the application. I realise now that this was a mistake.

I should have seen the writing on the wall. When I wanted to add something to the Index last week, I discovered the interface was a little different (there’d obviously been an upgrade) and my loft database just locked up when I opened it. I wasn’t even able to get it to produce a csv export file of the current contents.

So, I’ve spent a couple of days moving the index to a new app. I’m now using the Collections database from Risolvi Productions for which I’ve paid a single one-off fee of £7.99. Luckily there haven’t been too many changes to my loft index over the last year so I was able to use an October 2025 export from the Memento app for the data, and a March 2025 folder of the associated photos, and to deduce what changes had occurred since October 2025 from the first frozen screen of the malfunctioning Memento app. With the new database fully set up in the Collections app and all the associated photos loaded in as well, I’m glad to say it all seems to be working perfectly.

The lesson here is clear: if an app doesn’t seem to be working properly, don’t use it; and be prepared to pay a little rather than just always going for a free-to-use product. Obvious really….

Phase 2 results – and enter Claude

The object of Phase 2 was to explore how AI can support the combination of a collection’s Index and the titles of the associated files. The answer is straightforward: the assembled file titles need to be provided to the AI model in one or more files, in the same way that the Index is delivered. In Windows, file titles can be collected by highlighting the files you are interested in and selecting ‘Copy as path’ from the Right-Click drop down menu. These can then be pasted into either Excel or Word, and the path in front of the file name can be deleted by using the Find and Replace function: specify the path in the Find box and specify blank in the Replace box, highlight the file names, then select Replace-All. No doubt other operating systems have similar capabilities.
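The same path-stripping step can also be scripted. The sketch below is only an illustration: the function names and the example paths are my own inventions, and it assumes that ‘Copy as path’ puts each full path on its own line wrapped in double quotes.

```python
from pathlib import Path, PureWindowsPath


def strip_paths(copied_paths):
    """Return just the file names from text pasted after 'Copy as path'."""
    names = []
    for line in copied_paths.splitlines():
        # Assumption: each path is wrapped in double quotes; remove them.
        cleaned = line.strip().strip('"')
        if cleaned:
            # PureWindowsPath understands backslash separators on any OS.
            names.append(PureWindowsPath(cleaned).name)
    return names


def list_file_names(folder):
    """Alternative: read the names of all files in a folder and its subfolders."""
    return sorted(p.name for p in Path(folder).rglob("*") if p.is_file())


# Hypothetical example paths, invented for illustration only.
example = '"C:\\Mementos\\0001 - School report.pdf"\n"C:\\Mementos\\0002 - Concert ticket.jpg"'
print(strip_paths(example))
```

The resulting list of names can then be saved to a text file and uploaded alongside the Index in the usual way.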

I tried this out with the names of 2065 files associated with the Mementos index, along with the 1.62 version of the Mementos Index (the version I used for the final set of tests in Phase 1). I retained the 5 evaluation questions used in Phase 1, but I also added three more to further test the capabilities of the AI models concerned (more of these at the end of this post). I performed tests using AnythingLLM/Mistral and Copilot. ChatGPT wasn’t used because it has a limit of around 250kb and the file titles together with the Index would have significantly exceeded that. I also tried out Anthropic’s Claude model for the first time. The results are shown in the table below together with the results of the final test in Phase 1 for comparison purposes.

AnythingLLM/Mistral

Phase 1 – no file titles, 5 evaluation questions: 4.4 out of 10: Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.

Phase 2 – with file titles, 5 evaluation questions: 2.3 out of 10: This was a generally very poor result. Four of the answers were virtually worthless; while 2 of the answers weren’t too bad at all. Quite apart from the contents of the answers, they were all relatively short and with none of the embellishments and rationale that seem to be a standard part of the responses of AIs like ChatGPT and Copilot.

Copilot

Phase 1 – no file titles, 5 evaluation questions: 8.7 out of 10: Three of these five answers were exceptionally good, and all the responses were well-illustrated with rationale and plenty of examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.

Phase 2 – with file titles, 5 evaluation questions: 8.9 out of 10: Overall this is an outstanding result. Comprehensive answers with good introductions and summaries and lots of examples with reference numbers included. Nearly everything was correct and complete – it was bemusing to read things like the series of homes we had been in or the different companies I’d worked for – an experience like being amazed that someone you are talking to seems to know your life history. The one concern is that in one answer the AI suggests it had taken facts from a document (Barlborough prospectus) which it hadn’t got access to and the only conclusion is that it hallucinated the information (indeed it subsequently admitted it had).

Claude

Phase 1 – no file titles, 5 evaluation questions: not tested.

Phase 2 – with file titles, 5 evaluation questions: 8.2 out of 10: Claude’s answers overall were pretty good – but a bit patchy with scores ranging from 6 to 9.8 (this latter score being ‘first class’). The poorer scores were largely the result of misinterpretations and errors.

 

From these results, it’s not possible to assess if the inclusion of File Titles has made a difference: AnythingLLM’s performance was generally very poor across both tests; Copilot’s two scores were too similar to draw any conclusion; and Claude did not have any earlier test result to compare against. However, it seems reasonable to assume that if there is additional information in the file titles, over and above that in the Index, there will be a better outcome.

Since this was the first time I had used Claude, I enquired about its operational parameters, and discovered that Claude’s Context Window is 200K tokens across all models and paid plans (except for Enterprise plans). That is roughly 500 pages of text or approximately 800,000 characters. Claude caps file uploads at 30MB per file and 20 files per conversation. I found the quality of Claude’s answers to be at the same sort of high level as Copilot – though the results indicate that, on this particular set of uploaded material, Copilot has a bit of an edge. I also noticed that Claude differs from Copilot in two ways: first, Claude, unlike Copilot, tells you what it’s doing in the course of responding to a question; and, second, Claude took three times longer to respond to the same set of 8 evaluation questions (59.3 vs 19.6 seconds).

The three extra evaluation questions that I added were:

  • Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.
  • Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.
  • Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

These extra questions were deliberately designed to explore the AIs’ ability to make more broadly-based connections and inferences, as opposed to identifying related items in a narrow subject area. This is where AI can excel – and sometimes fail – as indicated in the detailed results for each question provided below.

Question: Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.

Copilot result: 7 out of 10: This is a comprehensive summary of my time at Barlborough – though surprisingly lacking in info about sports activities. The answer claims to have details of the contents of the school prospectus and of letters to parents – but it didn’t and I think it just hallucinated the info.

Claude result: 9.8 out of 10: This is an excellent, comprehensive answer utilising all the Barlborough material in the collection so far as I could see. I didn’t spot any errors. The answer was easy to read and included all the relevant Memento Reference Numbers. First class.

Question: Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.

Copilot result: 9 out of 10: I’m quite bemused by this answer: I don’t really know what I was expecting – but Copilot delivered a whole bunch of unusual stuff with good rationale for their inclusion. Extraordinary!

Claude result: 6 out of 10: Claude identified 12 different strange entries – but actually only about 5 are really valid. The so-called errors in the birthday card collection arose simply because I entered birthday cards for different years in the same year; the so-called error in the Miss Saigon Programme date was because I also bought a special Souvenir programme with the date of the first performance. Item 5 is also covered by Item 1. So this is an interesting set of material – but not entirely correct or useful.

Question: Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

Copilot result: 9 out of 10: Another very comprehensive answer replete with examples and a good summary at the end. I haven’t checked in detail whether it’s all correct but nothing is standing out as being wrong. I’m finding it hard to fault these answers.

Claude result: 8.8 out of 10: This is a very complete account of our time in the 1990s written in a very readable style spiced with occasional wry comments and humour. Each detail is accompanied by the relevant Reference Number. It is highly informative but does include a few misinterpretations – in particular claiming that my wife was at home during the 90s when actually she was back to teaching; that I was seconded to the IR in ‘mid-1990s’ whereas it was the beginning of the 90s; that the EDS chocolate bar had been kept whereas it was just the box it was in; that my son went to Exeter to do a 2nd degree whereas it was his first; that I was a devoted Aston Villa supporter whereas it was my other son; and that my son had a band called Phases which was stretching it since it was just he and a friend making a cassette recording. However, these are relatively small points amidst the huge array of correct facts that are presented in this highly readable piece.

 

Finally, to come back to the objective for Phase 2, I have concluded that file titles can certainly be included in the material delivered to an AI model, and that this is worth doing if they include additional material to that which is contained in an Index – or if an Index does not exist.

For completeness, the table below records the breakdown of the time I spent on Phase 2.

Activity | Time spent – hours
8 Tasks (total elapsed time: 4 days) | 11
Time spent assembling files of File Titles and testing | 8
Time spent researching and drafting post for pwofc.com | 3

Phase 1 Summary Results

Phase 1 Description: AI support for the Memento collection’s index entries.

Conclusions

  • Different LLMs are trained with different amounts and types of training data, thereby making them more or less suitable for operation within specific domains. Therefore, there may be some LLMs available that are more suitable than others for use with Collection indexes.
  • To enable LLMs to provide answers to questions related to specific information in documents or databases, a process called Retrieval Augmented Generation (RAG) is used whereby relevant parts of the information are provided to the LLM alongside the question.
  • The RAG process performed by particular products typically includes a multiplicity of different techniques to identify an appropriate set of Chunks for submission with the question to the LLM. It is likely that particular combinations of these techniques will be more effective than others for use with collection indexes.
  • Collection Indexes may contain information which is more unevenly spread throughout their contents than a written document with an ordered set of contents. Consequently, such indexes may require more chunks to be sent to the LLM, and may require the LLM to have larger Context Windows, than ordinary documents in order to obtain satisfactory answers.
  • Out of the 4 models/systems I have tried out so far, Copilot is by far the best LLM for use with a collection index of around 2,500 entries.
  • The choice of systems, models, configurations, and Index adjustments in this first Phase was heavily influenced by ChatGPT, and therefore may have been based on some inaccuracies or hallucinations. This needs to be borne in mind when considering the findings from Phase 1 and taking them forward into subsequent phases.
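To make the RAG idea in these conclusions more concrete, here is a deliberately simplified sketch. It is not how AnythingLLM, Copilot or any other product actually works: real systems typically rank Chunks using embeddings and vector similarity, whereas this toy version just counts shared words. The function names and the three index entries are invented for illustration.

```python
import re


def words(text):
    """Lower-case a text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks sharing the most words with the question."""
    q = words(question)
    # sorted() is stable, so chunks with equal scores keep their original order.
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks, top_k=2):
    """Assemble the text sent to the LLM: retrieved chunks, then the question."""
    context = "\n".join(retrieve(question, chunks, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"


# Hypothetical index entries, invented for illustration only.
index_chunks = [
    "Ref 101: School prospectus, 1960",
    "Ref 102: Theatre programme, 1989",
    "Ref 103: School sports day photo, 1961",
]
print(build_prompt("Which items relate to school?", index_chunks))
```

With these entries only the two school-related Chunks are sent with the question. This also shows why an unevenly spread index may need a larger `top_k`: any relevant entry that falls outside the selected Chunks is simply never seen by the LLM.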

These conclusions were reached in the course of undertaking the activities summarised below.

Tasks and Timescales

Activity Time spent – hours
70 Tasks (Total Elapsed time: 43 days) 105
Time spent addressing points in AI preparedness document 7
Time spent installing software 8
Time spent adjusting indexes and testing 52
Time spent researching and drafting posts for pwofc.com 38

Reference Sources

  • A Survey on Retrieval-Augmented Text Generation for Large Language Models by Yizheng Huang, Jimmy Huang, v2, arXiv:2404.10981, 23Aug2024.
  • I obtained lots of answers from ChatGPT to questions I asked about RAG, AI software products, and how to adjust the Memento index to get better AI results.

Concepts and Terminology encountered

  • LLM stands for Large Language Model. Hundreds of thousands of these have been built and new ones emerge daily, but many are variants of major models like GPT-4 from OpenAI (the company behind ChatGPT), which can be accessed online. Other LLMs can be downloaded from websites like Hugging Face, to use locally on a stand-alone computer.
  • RAG stands for Retrieval Augmented Generation whereby the LLM is not trained on the archive, but instead a relevant subset of the archive data is provided to the LLM alongside the question.
  • Chunk is a smaller, manageable segment of a larger document or dataset. In RAG, one or more Chunks are provided to the LLM along with the question.
  • Token is the fundamental unit of data used by LLMs. Models convert questions and Chunks into tokens, process them, produce an answer in tokens, and then convert those tokens back into text. Generally, one token equals approximately 0.75 English words.
  • Vector Databases store mathematical representations of Chunks as vectors – lists of numbers – in such a way that related items are clustered together thereby enabling capabilities like similarity searching.
  • Embedding is the process by which an embedding model converts each Chunk into such a vector, which is then stored in a Vector Database.
  • Context Window is the maximum number of Tokens that can be handled by the LLM’s working memory (which contains both the input prompt and the answer). If the inputs to the LLM exceed its Context Window, then some content may be simply left out and the answer may be less complete.
  • Hallucination is a phenomenon where LLMs generate false, misleading, or nonsensical information confidently. It happens when an LLM predicts text based on patterns rather than facts, often due to poor training data, ambiguous prompts, or a lack of understanding of reality.
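The relationship between Chunks, Tokens and the Context Window can be illustrated with a little arithmetic. The sketch below uses the rule of thumb quoted above (one token is roughly 0.75 English words) to estimate token counts; real tokenisers differ from model to model, so the figures are only approximate, and the function names are my own.

```python
def estimate_tokens(text, words_per_token=0.75):
    """Roughly estimate the token count of a text from its word count."""
    return round(len(text.split()) / words_per_token)


def fits_context(text, context_window, reserved_for_answer=0):
    """Check whether the text, plus room for the answer, fits the window."""
    return estimate_tokens(text) + reserved_for_answer <= context_window


def split_into_chunks(text, max_words):
    """Split a text into consecutive chunks of at most max_words words."""
    all_words = text.split()
    return [
        " ".join(all_words[i:i + max_words])
        for i in range(0, len(all_words), max_words)
    ]


sample = "one two three four five six"
print(estimate_tokens(sample))        # 6 words / 0.75 = 8 tokens
print(split_into_chunks(sample, 4))   # two chunks: 4 words, then 2 words
```

Because the Context Window has to hold the answer as well as the prompt, `reserved_for_answer` lets you set aside some of the window before deciding whether an index file is likely to fit.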

Test results – 2nd set of evaluation criteria (scores out of 10)

| Index variant | AnythingLLM Mistral | AnythingLLM Mixtral | Copilot – MS LLM | ChatGPT – GPT-5.3 |
|---|---|---|---|---|
| Less some columns + AI Context short | 4.8 | 5 | | |
| Less all extraneous fields, no Guide to the Index | 2.2 | 2.9 | 8 | |
| Combined into 1 field and with Guide | 3.1 | 4.8 | 8.6 | |
| Less all extraneous fields, no Guide | | | | 5.9 |
| Less all extraneous fields and with Guide | | | 9.1 | |
| Index with Set removed and with Guide | 4.4 | 3.6 | 8.7 | 6.3 |

 

Enter Copilot and ChatGPT

In the previous post I explained why I changed the evaluation criteria I was using to the following:

  • What are the main themes that run through the entire index?
  • Are there distinct phases or periods in the collection?
  • Which items suggest important life events or transitions?
  • What patterns or motifs repeat across the collection?
  • What are the top 5 most important entries, and why?

I tried using them with a version of the Index which had extraneous fields such as ‘Physical Location’ and ‘No of digital files’ removed leaving just Ref No, Description, Item Label, Set, Facets, AI Context, and Year all in a single column. This produced a result of 4.8 out of 10 using the Mistral model and 5 out of 10 for Mixtral – and I proved to myself that I was able to apply the new evaluation criteria, albeit with rather more subjective scoring.

At this point it dawned on me that it might be worth trying to use the MS Copilot AI provided as an integral part of my Windows operating system. After making some initial enquiries through the net about its possible use with Excel, I got the impression it could only be used with MS Office 365 which I do not have (I have the home version of Office); but Copilot itself set me straight explaining that the 365 version enabled support for the functionality within the Office applications, whereas the free-to-use version of Copilot simply ingests documents up to the cloud where it works out the answer to the question it has been asked. I tried it out using a very simple version of the Mementos index with just the Ref No, Description, Set, Facets, and Year fields in their own columns, and was excited by the result which I described as follows:

“This is a strong result (8 out of 10): comprehensive answers with just a few poor interpretations, but with no obvious hallucinations. Importantly, Copilot was able to deal with the whole attached Index in one go which delivers far better answers than the RAG approach which can only work on a pre-selected subset of the material. The response time (average 4.4 seconds) was very quick considering that the whole document had to be sent to, and analysed in, the cloud before delivering its answer.”

I then tried to compare the AnythingLLM and the Copilot systems by using yet another version of the Mementos index based on suggestions from ChatGPT. It still contained just Ref No, Description, Set, Facets, and Year with all the fields in a single cell, but this time with the Description replaced by keywords derived automatically using an Excel formula supplied by ChatGPT. The results were very clear: Mistral scored 3.1, Mixtral 4.8, and Copilot 8.6. For the Copilot test I wasn’t sure if the size of the file would cause a problem so I split it into three files of between 200kb and 300kb each, and these seemed to have been ingested successfully. However, I subsequently discovered that the files had been truncated so that only about the first 120 lines of each were ingested – making about 360 entries in all out of the total 2393 (I guess Copilot must have truncated the file I used in the previous test as well). Given this, Copilot’s 8.6 score was even more impressive.

Finding that Copilot had truncated the files prompted me to do some digging and experimentation to find out just what its limits are. I established that it will ingest up to 20 whole documents of up to about 30,000 characters/30kb file size each in a single turn, and will work across all those documents to construct its answer. Furthermore, more batches of 20 can be submitted in subsequent turns up to a total of about 150 after which ‘the conversation becomes unwieldy’ (I’m not sure what is meant by this). I duly split the Mementos Index into 17 files, all between 20,000 – 30,000 characters and conducted the test again, this time including a version of the Guide document. This produced an even better score of 9.1.
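
The splitting step described above can be sketched in Python as follows. This is a minimal illustration rather than the procedure I actually used: the function names are hypothetical, it assumes the Index has been exported as lines of CSV text, and it takes the limits of about 30,000 characters per file and 20 files per turn from my experiments above. It does not repeat the header row in each part.

```python
def split_rows_by_size(rows: list[str], max_chars: int = 30_000) -> list[list[str]]:
    """Group index rows into parts whose text stays within max_chars.

    Rows are never cut in half. A single row longer than max_chars
    ends up in a part of its own, which will then exceed the limit.
    """
    parts: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for row in rows:
        row_len = len(row) + 1  # +1 for the newline that ends the row
        if current and current_len + row_len > max_chars:
            parts.append(current)
            current, current_len = [], 0
        current.append(row)
        current_len += row_len
    if current:
        parts.append(current)
    return parts


def batch(items: list, batch_size: int = 20) -> list[list]:
    """Group items (e.g. file names) into batches for successive turns."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```

Each part returned by split_rows_by_size can be written to its own file, and batch then shows how many turns would be needed to submit them all; 17 files, for example, fit in a single batch of 20.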

Discovering Copilot’s capabilities made me wonder what ChatGPT could do. I found that the free version which I was using allows you to upload a maximum of 3 files of up to about 512MB each in any one day. However, to be sure that all the contents of all the files will be taken into account in answering a question, the total of the 3 files needs to be a maximum of about 300k characters, with 250k being a safe limit. I put ChatGPT to the test with the same index used in the first Copilot test but broken into three files of between 111k and 130k characters each. This produced a result of 5.9, which was probably to be expected given that the file size limits had been exceeded.

As I was exploring the file size limits for Copilot and ChatGPT, it became clear that the ultimate determinant of the amount of text that can be reliably reasoned about at once is the AI model’s maximum ‘Context Window’. This is effectively the model’s working memory which contains both the input prompt and the answer. If the inputs exceed the Context Window then some content may be simply left out and the answer may be less complete. The table below summarises the Context Window limits for the different AI models I’ve been using.

| AI System | Model | Maximum Context Window |
|---|---|---|
| AnythingLLM | Mistral 7B Instruct (32k context), Q4_K_M quantization | 32k tokens (~65k–100k characters in csv files, assuming 2-3 characters/token) |
| AnythingLLM | Mixtral 8x7B Instruct (32k context), Q4_0 quantization | 32k tokens (~65k–100k characters in csv files, assuming 2-3 characters/token) |
| Copilot | MS LLM (Microsoft does not publish the names of its models) | The MS LLM doesn’t have a single fixed Context Window; it’s designed around task‑adaptive context management, and hence the effective context it can use is much larger and more flexible than a single token number would suggest. |
| ChatGPT (free version) | GPT-5.3 | The maximum Context Window is 128k tokens, but because this includes system instructions, conversation history, and output tokens, the usable Context Window is about 80k – 100k tokens (160k – 300k characters in csv files, assuming 2-3 characters/token) |
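
The conversion between characters and tokens used in the table can be turned into a rough check of whether a file is likely to fit. The Python sketch below is only an estimate under the same assumption of 2-3 characters per token for csv files; the function names and the three verdict labels are my own.

```python
import math


def token_range(num_chars: int, min_cpt: float = 2.0, max_cpt: float = 3.0) -> tuple[int, int]:
    """Return a (low, high) token estimate for num_chars of csv text.

    min_cpt and max_cpt are the assumed characters per token.
    """
    low = math.ceil(num_chars / max_cpt)   # fewest tokens: 3 characters each
    high = math.ceil(num_chars / min_cpt)  # most tokens: 2 characters each
    return low, high


def fits(num_chars: int, usable_tokens: int) -> str:
    """Say whether text of num_chars is likely to fit in usable_tokens."""
    low, high = token_range(num_chars)
    if high <= usable_tokens:
        return "fits"
    if low > usable_tokens:
        return "too large"
    return "borderline"
```

For example, fits(60_000, 32_000) gives "fits", fits(120_000, 32_000) gives "too large", and fits(250_000, 100_000) gives "borderline", since 250k characters could be anywhere between about 83k and 125k tokens.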

 If one wishes Index files to be considered in full by an AI system, the primary requirement is to ensure that the whole set fits into the Context Window. However, when it comes to assembling and submitting those files there are additional considerations to be taken into account, as summarised below.

| AI system | Approach | Max total file size | Max No of files | Max size of each file |
|---|---|---|---|---|
| AnythingLLM | Embed (RAG) | No limit. The limit is the size of the Chunks the files are divided into. The default is set at 256 tokens (800-1000 characters). | No limit, though more than 50K Chunks may cause retrieval problems. | No limit. |
| AnythingLLM | Attach function | Default is set to 30k-50k characters for csv files. Can be adjusted within AnythingLLM settings. | No limit. | No limit. |
| Copilot | Attach function | No absolute limit, but above about 150 files (4,500k characters) the conversation becomes unwieldy. | About 7 batches of up to 20 files in a single prompt. | No absolute limit, but for csv files 10,000 rows or approximately 30k characters should work fine. |
| ChatGPT (free version) | Attach function | To be confident that the model will read everything and not overlook anything, the total should be kept to about 250k characters. | 3 within any one day. This limit can be circumvented by putting multiple files in a zip file, or by putting the contents of a file into the prompt with the question. | 512MB is the absolute limit, but to be sure all the contents will be acted on, keep it to 200k-300k characters. If 3 files are to be used, keep their total to 200k-300k characters. |

With these constraints in mind, I conducted a final set of tests to compare the three AI Systems – AnythingLLM’s RAG approach, Copilot, and ChatGPT. I assembled a cut-down version of the Index (by removing the MW set) which was small enough (around 274k characters, 270kb file size) to fit within the limits of all three systems. As can be seen in the table below, there was a clear winner.

| System | Average Evaluation Score out of 10 | Performance Summary | Average time taken to start responding (seconds) |
|---|---|---|---|
| AnythingLLM (Mistral) | 4.4 | Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth. | 6.6 |
| AnythingLLM (Mixtral) | 3.6 | This score was a little bit worse than Mistral’s result despite it taking a lot longer to start printing out its answers. Two of the five answers seemed to mainly regurgitate the different sets that were described in the Index; and some of the answers were of rather dubious relevance. However, of most concern is that there were at least 4 instances of hallucinations – in two cases, Reference Numbers which don’t exist were cited. | 36.4 |
| ChatGPT (GPT-5.3) | 6.3 | All the responses were comprehensive with extensive rationale and good summation. However, the content wasn’t always appropriate. It wasn’t necessarily wrong but was sometimes just a little dubious. Furthermore, there was rather too much emphasis on the way the collection was organised rather than on its contents. On two occasions, examples of individual entries were specifically asked for but generalisations were delivered. | 6 |
| Copilot (MS LLM) | 8.7 | Three of these five answers were exceptionally good, and all the responses were well illustrated with rationale and examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted. | 16.4 |

Copilot was a clear winner with ChatGPT following on behind. Both provided substantial answers which included rationale, several examples, and a summary. In contrast, the AnythingLLM RAG answers were sparse, sometimes not very good, and occasionally included complete hallucinations. The RAG approach just doesn’t seem very effective for Index material.