Phase 5 Completed

Posted on 27Jun2026 by admin

Phase 5 in this journey was to assess the possibility of investigating the contents of a large archive by providing an AI with not only the Index, but also the actual digital files relating to the Index entries. As explained in the previous post, AIs are only able to hold a limited amount of information in their memory; so, for large archives I’m employing a strategy whereby the AI is asked to provide an answer to a subset of the archive’s information; and once that answer has been provided, another subset is presented for an answer. When all the subsets have been exhausted, the AI is then asked to stitch all the answers together to provide an overall answer for the archive. That is the approach I’ve taken here: I assembled a whole set of digital files, divided them up into groups which the AI is capable of handling, got answers from the AI for each group, and then got the AI to stitch the answers together. To highlight the practicalities of this exercise, the whole process is described in some detail below.

As in the previous phase, the archive I’ve used for this test is my PAWDOC work archive of some 17,000 Index entries and 31,000 associated files. I decided I’d do two distinct tests: one with files about a very specific subject with a limited life-span; and one with files related to a very general topic which might appear anywhere in the lifetime of the archive. For the first I chose to investigate the two years (1977-78) that I worked for a company called ‘CPC’; and for the second I chose the word ‘measurement’. In each case, I performed a search for the relevant term (‘CPC’ or ‘measurement’) on my PAWDOC index in the Filemaker application. These searches produced lists of 21 and 40 index entries respectively. PAWDOC index entries can have any number of files associated with them, and in these cases, it turned out that ‘CPC’ had 48 files, and ‘measurement’ had 72 files.

The next part of the process was to split the files into sufficiently small subsets such that each subset would be accepted in full in a single request to the AI. The two AIs I used were Claude and Copilot, both of which limit file uploads to 20 per conversation; Claude limits an individual file to 30Mb and Copilot to 50Mb. So, I chose to limit my subsets to the following:

Number of files: 19 (this to include the ‘PAWDOC Guide’ file and a file of the Index entries related to the files in this subset: leaving room for 17 content files)
Max individual file size: 28Mb
Max Total file size: 40Mb
Text files to be no bigger than 18k characters

Armed with these criteria I set about dividing up the content files into their subsets and ended up with 5 subsets for each test:

Subset	No of files (less Guide & Index Entries files)	Largest file	Total subset file size
A (CPC)	16	1.4 Mb	7.7 Mb
B (CPC)	15	4.6 Mb	15.4 Mb
C (CPC)	13	5.2 Mb	20.3 Mb
D (CPC)	2	25 Mb	35 Mb
E (CPC)	3	20.8 Mb	37.9 Mb
F (Measurement)	17	5.6 Mb	14.5 Mb
G (Measurement)	17	2.8 Mb	14.1 Mb
H (Measurement)	16	20 Mb	33.4 Mb
I (Measurement)	17	7.2 Mb	18 Mb
J (Measurement)	5	19.5 Mb	25.1 Mb
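The subset criteria listed earlier could also be applied mechanically. The following is only a minimal sketch of one way to do that, not the manual method actually used for the table above: it walks through the files in their existing order and starts a new subset whenever adding the next file would break the file-count or total-size limit. The function name `pack_subsets` and the `(name, size_mb)` input format are assumptions made for illustration, and the sizes of the Guide and Index-entries files are assumed to be small enough to ignore.

```python
# Limits taken from the criteria above; the Guide and Index-entries files
# are assumed to occupy the other two of the 19 file slots.
MAX_CONTENT_FILES = 17
MAX_FILE_MB = 28.0
MAX_TOTAL_MB = 40.0


def pack_subsets(files):
    """Group (name, size_mb) pairs into subsets, preserving their order.

    A new subset is started whenever the next file would exceed either
    the file-count limit or the total-size limit of the current subset.
    """
    subsets = [[]]
    total_mb = 0.0
    for name, size_mb in files:
        if size_mb > MAX_FILE_MB:
            # Oversized files have to be split by hand before packing.
            raise ValueError(f"{name} is {size_mb} Mb; split it first")
        current = subsets[-1]
        if len(current) >= MAX_CONTENT_FILES or total_mb + size_mb > MAX_TOTAL_MB:
            current = []
            subsets.append(current)
            total_mb = 0.0
        current.append((name, size_mb))
        total_mb += size_mb
    # Drop the initial empty subset if no files were supplied.
    return [subset for subset in subsets if subset]
```

Keeping the files in their original order means that files belonging to the same Index entry tend to stay together, at the cost of some subsets being less full than they could be.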

I encountered three issues in the course of creating the subsets:

TIF files: Most of the PAWDOC files are scans of hardcopy documents in multi-page TIF files. However, the free version of Copilot does not support TIF files, and Claude only supports single-page TIF files, so I had to turn all the TIF files into PDFs. As it turned out, that was a quick process: select a file; right click the mouse and choose ‘Open with [PDF App]’; when it opens in the PDF app, select ‘Save As’ and save it as a PDF in the relevant Subset folder.
One of the files (a scan of a desk diary) was 35.7 Mb – over my individual file size limit of 28 Mb. So, I split it into two files of 21 Mb and 15 Mb respectively.
One of the files was an old PowerPoint file with a ‘ppt’ extension. I believe Claude and Copilot only accept pptx extensions, so I produced a pptx version for the subset.

With the subsets prepared, the final thing to do was to specify the following prompts:

For the CPC subsets: PAWDOC is a work document collection built up from 1981 to the present day. The attached files include a Guide outlining how PAWDOC is constructed and what it contains. Also attached is a subset of PAWDOC’s Index and the actual digital files associated with those index entries. The following request is to be undertaken using the information in the attached files: Outline everything that you can find out about CPC in the attached files, including its origins, operations, locations, products, finances, people and culture; and describe the contributions that Paul Wilson made while he worked for the company. Present the results in a word document with an executive summary at the beginning.

For the Measurement subsets: PAWDOC is a work document collection built up from 1981 to the present day. The attached files include a Guide outlining how PAWDOC is constructed and what it contains. Also attached is a subset of PAWDOC’s Index, and the actual digital files associated with those index entries. The following request is to be undertaken using the information in the attached files: Describe everything that you can find out about ‘Measurement’ in the attached files, under a set of appropriate category headings which should include the philosophy of measurement, attitudes towards measurement, and the pros and cons of measurement. In a final section suggest interesting further research that could be undertaken to extend the findings reported here. Present the results in a word document with an executive summary at the beginning.

Running the subsets through the AIs was relatively quick for Claude. Indeed, I’d completed the whole exercise – definition of the questions, creation of the subsets and runs through both the CPC and Measurement tests – within the space of two days. I’m not sure whether or not that was helped by the fact that I was still operating under the Claude Pro plan for which I had paid £18 for a one-month subscription. For Copilot, however, it was a different kettle of fish. I got through the CPC tests and the first of the Measurement subsets quickly enough. But at that point I came up against something I hadn’t encountered before – Copilot’s daily upload quota. Apparently, this has nothing to do with the number of files you upload, but rather the number of upload events you initiate in a 24-hour period (an event can be an upload of anything from 1 to 20 files). When you exceed this, you have to wait a rolling 24 hours before you can upload anything again – but there is nowhere to look to see when you can start uploading again. So, I ground to a halt on the Copilot Measurement tests, and wasted a lot of time trying to find out what the problem was. Having found out, I decided to abandon the Copilot Measurement tests for reasons that will become obvious below.

Copilot’s individual subset answers to the CPC question were all reasonably good, 6-8 page answers with a fair amount of detail – though four out of the five were mainly in bullet point format within headings and sub-headings. The merged document was in a similar style and I gave it 7 out of 10 with the following comments:

This is a 10-page report with 10 separate sections and lots of sub-headings with the text being primarily in the form of bullet points. There’s plenty of detail in the bullet points, but no citations back to the specific Reference Numbers from which they came. The bullet point format means that there is little additional commentary or embellishment and inevitably makes it a less informative read. It took Copilot just 8 seconds to produce this merged report as compared to an average of 38 seconds to produce the subset answers.

However, the Claude answers to the CPC questions were a significant cut above the Copilot versions, with 8-13 pages of discursive text, and a great deal of detail. I scored Claude’s CPC merged document 9.7 with these comments:

This 21-page merged document is of a very high standard with the following contents: Exec Summary; 1. Origins and Corporate Structure; 2. Locations; 3. Products; 4. Finances; 5. People; 6. Culture and Working Environment; 7. Paul Wilson’s Contributions at CPC; 8. Quick Reference Summary; 9. Further Work Recommended for Researchers. In each of these sections there is a huge amount of detail described in discursive text. Claude has clearly inspected and interpreted many, if not all, of the 49 files provided – all of which were scans in PDF documents. I noticed a few doubtful assumptions: a) In section 2.4 it is suggested that Perivale was a CPC location (I don’t think it was – though I couldn’t find the ‘Wray memorandum of December 1976’ to check, which is interesting); b) section 6.5 suggests that ‘Wilson’s formal 11-stage systems methodology paper, produced c.1978’ was actually in operation (but it was only a proposal); and c) some items in the Project Portfolio list in section 7.5 were taken from a scan of my desk diary and may not have actually been as significant as they sound, e.g. ‘Corn Shipment Simulation’ and ‘Forecasting’ and ‘Factory Open Day Coordination’. There may well be other misinterpretations I didn’t spot, but despite these, this is a hugely impressive, comprehensive, thorough, and highly detailed report. It took Claude just over 6 minutes to produce as compared to an average of 7 and a half minutes to produce the subset answers.

The Claude Measurement answers were of the same ilk. In fact, because the topic is so broad, each of the individual answers made good reading in their own right. I scored Claude’s merged Measurement document 9.8 out of 10 with the following comments:

This is a very comprehensive 23-page report on a wide variety of aspects of ‘measurement’. There is a 4-page discussion on ‘The philosophy of measurement’ under 6 sub-headings. The ‘Attitudes to measurement’ section has 5 sub-headings; and the pros and cons of measurement are addressed in similar detail (10 pros and 11 cons). The section on ‘Measurement Instruments and Frameworks’ lists 22 different mechanisms. Section 7 provides 11 suggestions for further work; and a full listing of all the documents used in the analysis rounds off the report. As with the CPC report, it appears that Claude inspected and interpreted most, if not all, of the 79 files provided, of which 75 were scans in PDF documents, three were Word documents, and one was a PowerPoint file. This merged report is excellent. How accurate it is, is something I wouldn’t be able to say without doing many days if not weeks of work. It took Claude 8 minutes to produce this report as compared to an average of 5 and a half minutes to produce the subset answers.

One very practical point emerges from these answers: the difference in the time it took for Copilot and Claude to arrive at their answers is striking. Copilot operates in seconds, whilst Claude operates in minutes. There is a distinct possibility that this has something to do with the quality of the results they produced; and it is another reminder that some LLMs are better than others for particular tasks.

Regarding the overall objective of these Phase 5 tests, these results clearly indicate that it is indeed feasible to have AI investigate an archive through its digital files. Furthermore, the results are likely to be even more impressive than those that can be obtained when providing the AI with just index entries and file names. Consequently, there is even greater likelihood that researchers will employ these techniques to explore archives, and even less likelihood that they will spend time verifying the results. This is a serious long-term issue for archives, for researchers, and for the integrity of the global information canon.

In these tests I did try to get the AI to suggest further work to mitigate these problems, and, indeed, several suggestions were forthcoming. Although I’m not convinced that they directly addressed the accuracy issue, I do think they indicate that better prompts, more focused on identifying potential issues and mitigations, could be developed.

The table below records the breakdown of the time I spent on Phase 5 and across all phases.

Activity	No of Tasks or task breakdown	Elapsed time	Time spent
Phase 1	70 (started 05Mar2026)	43 days	105 hrs
Phase 2	8	4 days	11 hrs
Phase 3	Create test files, test, analyse results	3 days	15 hrs
Phase 3	Research & draft pwofc.com posts	4 days	12 hrs
Phase 4	Create test files, test, analyse results	14 days	80 hrs
Phase 4	Research & draft pwofc.com posts	11 days	31 hrs
Phase 5	Create test files, test, analyse results	4 days	13 hrs
Phase 5	Research & draft pwofc.com posts	2 days	8 hrs
Totals		85 days	275 hrs

Phase 4 Results

Posted on 24Jun2026 by admin

The Phase 4 objective was to explore if it is possible to use AI to investigate the contents of a large archive by using a combination of its Index entries and the names of the associated files. Having completed the work I can report that it is indeed feasible to do just that – and with some spectacular results. However, there are some caveats which are highlighted in the following paragraphs.

The issue with a large archive is that there is too much information to hold in the Context Window of an LLM. To get round that problem, I split up the Index and File Name information into small subsets, and instructed the LLM first to produce answers for each subset, and then to merge the subset answers into an overall answer for the whole archive. The practicalities of splitting up the data and establishing subsets are described in a previous post. Suffice it to say that it is best a) to produce subsets that combine a number of index entries together with their associated file names (rather than having separate Index Entry and File Name subsets), and b) to keep the number of characters in every individual subset file well under the limits specified by the LLM to avoid truncation.
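The subset-then-merge strategy can be expressed in a few lines of code. The sketch below is illustrative only: in these tests the files were uploaded by hand through each AI’s chat interface, so the `ask` function here is a hypothetical stand-in for whatever mechanism sends a prompt and a set of files to a model and returns its answer. The wording of the merge prompt is likewise an assumption, not the prompt actually used.

```python
from typing import Callable, Sequence

# Hypothetical signature: ask(prompt, files) -> answer text.
AskFn = Callable[[str, Sequence[str]], str]


def investigate(question: str,
                subsets: Sequence[Sequence[str]],
                ask: AskFn) -> str:
    """Answer a question over an archive too large for one prompt.

    Step 1: ask the question of each subset of files separately.
    Step 2: ask for the subset answers to be merged into one answer.
    """
    subset_answers = [ask(question, files) for files in subsets]
    merge_prompt = (
        "Merge the following subset answers into one overall answer "
        "to this question: " + question + "\n\n"
        + "\n\n".join(
            f"Subset {number}: {answer}"
            for number, answer in enumerate(subset_answers, start=1)
        )
    )
    # The merge step needs no archive files, only the earlier answers.
    return ask(merge_prompt, [])
```

With 10 subsets, this makes 11 requests per question: ten subset answers and one merged answer.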

I tested this subset strategy by asking the following 6 questions of my PAWDOC work archive of some 17,000 Index entries and 31,000 associated files:

List all the people named in this part of the Index and its associated files, and the organisation they belong to if any.
Describe Paul Wilson’s career over the period covered by this part of the Index and its associated files.
What significant changes in Information Technology occurred during the period covered by this part of the Index and its associated files?
Document all the travel undertaken by Paul Wilson over the period covered by this part of the Index and its associated files.
What training was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, and how important were particular elements to his subsequent career?
What are the strangest or most unusual things to be found within this part of the Index and its associated files, including unlikely coincidences, and events with unexpected outcomes?

As described in a previous post, to get the answers you want, you need to provide clear and detailed instructions to the AI in the Prompt that you submit: the bare questions as phrased above are not sufficient, so I used far more detailed versions.

Another factor that makes a difference in the results you can get is the LLM that you choose to use. For this investigation I used ChatGPT and Claude Sonnet 4.6 (I wanted to use Copilot as well but found that my subset files contained too many characters for Copilot and were being truncated). The results from ChatGPT and Claude are compared below. However, it’s worth saying here that different LLMs have different abilities and it’s good to be clear about what sort of task you want the AI to perform and which LLMs are best for that job. The Hugging Face website is one of the primary LLM repositories for localised models, each being categorised as being suitable for one or more of over 50 different tasks – though support for archives isn’t one of them. Similar advice about which cloud platforms are best for particular tasks can be found in review articles with titles like “10 best AI….”. However, I have found nothing specific to archives in these texts either. So, as yet, I have no definitive listing of the AI tasks that might be required to support archives, nor of which LLMs are best suited to perform those tasks. I hope to have more information on this when I get to Phase 7 of this work.

Getting back to the Phase 4 tests, I ended up with 10 Subsets of PAWDOC Index entries and associated file names. So, for each of the six questions I got 10 answers for ChatGPT and 10 for Claude – 120 answers in all. Then I asked ChatGPT and Claude respectively to combine the 10 answers they had produced for each question into a single merged answer, which produced 6 merged answers from ChatGPT and 6 from Claude. I reviewed each of the 132 answers and gave each of the merged answers a score out of 10; the results are summarised in the table below.

Question	ChatGPT (score out of 10)	Claude (score out of 10)
Q1 People	Poor result. Unable to distinguish people names from adjacent words e.g. ‘Do, To’ Time to produce: People spreadsheet 34s; Organisation spreadsheet 65s Score: 2	Over 2,440 names were listed (all looking valid) with 1,245 being allocated to one of 600+ organisations. Impressive data collection in the subsets and an excellent consolidation. Time to produce: People spreadsheet 62s; Organisation spreadsheet 111s Score: 9
Q2 Career	Pretty good 6-page answer but constructed around general activities not organisations or key projects. Time to produce: 45s Score: 7	An impressive 13-page report. A few errors probably due to limited data in the Index entries and File Names. A hugely informative, comprehensive, and highly readable piece. Time to produce: 343s Score: 8.5
Q3 IT changes	Quite good 11-page answer identifying 14 major categories of IT change with details within; but the Reference Numbers are listed separately and not related to specific changes. Time to produce: 62s Score: 7	An impressive 21-page report with 22 categories of IT change, a summary timeline table, and a conclusion with Cross-Cutting Observations. A comprehensive and coherent overview. Time to produce: 830s Score: 9
Q4 Travel	A 5-page report listing 62 travel events (though the subset reports listed a total of 122). Little detailed analysis. Not a very good document. Time to produce: 59s Score: 3	An exceptionally comprehensive 48-page report with a table of contents and detailing 326 confirmed trips, 331 destination visits, 135 unique destinations, and 137,018 total one-way miles. Hugely impressive. Time to produce: 2502s Score: 9.5
Q5 Training	A 10-page report with too little data and statistics. Many events are listed under a general category. The subset reports were better listing 93 specific events. Time to produce: 15s Score: 3	A 23-page formal report detailing 129 training events and a discussion on their relevance to Wilson’s career. A thoroughly competent and authoritative document. Time to produce: 1712s Score: 9
Q6 Strange items	An 8-page report listing the top 22 strange and unusual items, rather than all 100+ items identified in the subsets. Clearly presented and readable. Time to produce: 30s Score: 8	A 29-page report detailing 172 instances of strange and unusual events sorted into 12 categories. A table at the end lists all 172 instances graded from 1 (least strange) to 10 (most strange) ordered in ascending order of Strangeness. A very good clear answer, well formatted and easy to read. Time to produce: 1175s Score: 9

Claude has clearly produced the best results – and perhaps the ‘time to produce’ numbers indicate why that should be: ChatGPT took an average of 44 seconds to produce its merged answers, and Claude took an average of 962 seconds. Indeed, in some cases, Claude’s merged reports are so detailed and so well formatted that they are too believable – and this is their downfall: to reproduce such comprehensive results, or to verify the answers, would probably take weeks of manual work, and so, in view of this and despite their potential for error (more of this below), it is very tempting to just assume they are totally correct.
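The two averages quoted above can be reproduced from the ‘Time to produce’ figures in the table, on the assumption that each of the two Q1 spreadsheets is counted as a separate merged output (giving seven timings per AI rather than six). A short check:

```python
from statistics import mean

# 'Time to produce' figures in seconds, in table order:
# Q1 People, Q1 Organisation, Q2, Q3, Q4, Q5, Q6.
chatgpt_seconds = [34, 65, 45, 62, 59, 15, 30]
claude_seconds = [62, 111, 343, 830, 2502, 1712, 1175]

chatgpt_average = round(mean(chatgpt_seconds))
claude_average = round(mean(claude_seconds))

print(chatgpt_average)  # 44
print(claude_average)   # 962
```

On these figures Claude spent roughly 22 times as long as ChatGPT on each merged output.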

Although I didn’t diligently check every aspect of some of these extremely detailed reports, I did spot a few errors which suggests there may be several more across this body of material. They are summarised below to give a flavour of what can go wrong.

ChatGPT Q1 subset answers: Large numbers of incorrect People and Organisations are put forward by ChatGPT. A sensible rationale for identifying the difference between any two adjacent words and a Person or Organisation’s name is just not there. For example, two of the people names put forward by ChatGPT are ‘Do, To’ and ‘Taxation, Oil’.
Claude Q2 subset answers: The subset B answer says “CSC – Bid Management and Internal Systems work, 1984–1990s… a transition into bid management roles by the early 1990s”; but I didn’t start bid management work until 2001.
Claude Q2 subset answers: A subset C section heading – ‘Joining CSC and the Cosmos / Amigo research project (1986–1989)’ – is wrong because we never joined the Amigo Research project – we just had a seminar to share what each group was doing.
Claude Q3 subset answers: In the subset A answer, the PAW-ACMOIS-Jul88-p277 (1988) review of ‘wireless Intraoffice network technologies’ appears incorrectly in section 5 (CSCW) instead of section 4 (Local Area Networks).
Claude Q5 subset answers: Claude incorrectly assumes that I explored doing a ‘UCL Graduate Diploma in HCI with Ergonomics’ but actually the index entry concerned a letter from UCL asking me to publicise their course.
Claude Q2 merged answer: The summary trajectory section starts in 1977 with my time at CPC and doesn’t mention Kodak [where I worked prior to CPC] at all despite Kodak appearing in its subset B answer.
Claude Q2 merged answer: Claude says that, in a CHOTS paper, I specified the ‘four-part reference-number scheme later used in PAWDOC itself’ whereas the scheme was already being used in PAWDOC when I was doing the CHOTS work.
Claude Q2 merged answer: The CSC start date is specified as 1986 but actually was 1984.
Claude Q3 merged answer: The text says “Doug Engelbart’s bootstrap seminars on Dynamic Knowledge Repositories, which CSC adopted (PAW-DOC-7385-01)”. However, the only reason I can see for claiming that ‘CSC adopted’ is the inclusion of a username and password for access to the Bootstrap Institute in the 7385-01 Index entry.
Claude Q5 merged answer: Three CPC courses are wrongly ascribed to Kodak in sub-section 3.2.
Claude Q5 merged answer: Claude assumes that an Index entry about a letter concerning Nottingham’s proposed MSc in Human-Computer Interaction is “evidence that he was seriously weighing a formal postgraduate qualification” whereas, in fact, I was simply responding to a request for support for the establishment of such a course.

These errors may have been due to lack of material in the subset, or to a misinterpretation of the information provided, or to limited information in Index entries without access to the associated files; but however they were caused they provide an important reminder that all LLM material needs to be checked if you want to rely on its veracity.

Overall, then, I believe it is feasible to use AI to explore large archives via their Index entries and associated File Names. It may need careful planning and preparation and be time-consuming to carry out, but the results can be very informative – perhaps providing insights which would be just too time-consuming and expensive to obtain in any other way. Certainly, there’s no way I, or anyone else, would ever have produced the in-depth material on the information contained in PAWDOC on people, organisations, travel, and training that Claude has produced in these tests. However, researchers who intend to make more than just casual enquiries of an archive would be well advised to develop verification strategies as an integral part of the exercise. These need not be comprehensive and definitive checks, but instead may involve sampling, using catalogue searches, or even using the AI itself, to get a sense of how much, if anything, is amiss. The research need not stop there. There are probably optimum strategies for using such AI answers, and awareness of their flaws, as a starting point for research. However, at present, I am not aware of any such strategies having been documented. This is another aspect I hope to know more about by the time I get to Phase 7 of this work.

The table below records the breakdown of the time I spent on Phase 4 and across all phases.

Activity	No of Tasks or task breakdown	Elapsed time	Time spent
Phase 1	70 (started 05Mar2026)	43 days	105 hrs
Phase 2	8	4 days	11 hrs
Phase 3	Create test files, test, analyse results	3 days	15 hrs
Phase 3	Research & draft pwofc.com posts	4 days	12 hrs
Phase 4	Create test files, test, analyse results	14 days	80 hrs
Phase 4	Research & draft pwofc.com posts	11 days	31 hrs
Totals		79 days	254 hrs

File Splitting and Truncation

Posted on 15Jun2026 by admin

As described in the previous post, the Index and file information in my PAWDOC collection is too big to be ingested and used in a single prompt in today’s AI systems. So, I’m using it to explore how AI might be able to analyse a very large collection in pieces and then to stitch the results together. The first step in such an undertaking is to decide what information is to be provided to the AI and how small the individual subsets (i.e. files) of that information need to be to guarantee that they will be taken into account in full in the AI’s analysis (files that are too large will simply have some of their contents truncated after they are uploaded but before the AI starts its analysis).

I decided to provide the following fields from the PAWDOC index to the AI:

Reference Number: every document in the collection has a Reference Number which is made up of four parts – an Owner Identifier, a Set Identifier, a Serial Number, and a Sub-Serial Number. For example, PAW-DOC-4046-01.
Title: this contains free format text describing the document(s) concerned. Three dots (…) in the middle of the text denotes that what follows are Keywords/Phrases.
Publication date: this is the publication date of the oldest document relating to that Reference Number.
Creation date: the date the index entry was created.

I also decided to provide the file title for every file associated with each Index entry. File titles have a general structure of Reference Number, Description, filename extension. For example:

PAW-DOC-1104-39 Planning the ITUSA document Interchange group X.400 User Test.tif
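Because every file title begins with its Reference Number, the two can be separated automatically. The following is a minimal sketch under the assumption that a Reference Number is always four hyphen-separated parts containing no spaces, and that it is followed by a space and then the description; the function names are illustrative only. The parts are kept as text so that values such as ‘01’ keep their leading zero.

```python
import re

# Four hyphen-separated parts with no whitespace, e.g. PAW-DOC-4046-01.
REFERENCE = re.compile(r"^([^\s-]+)-([^\s-]+)-([^\s-]+)-([^\s-]+)")


def parse_reference(text):
    """Return the four parts of the Reference Number at the start of text."""
    match = REFERENCE.match(text)
    if match is None:
        raise ValueError(f"No Reference Number found in: {text!r}")
    owner, set_id, serial, sub_serial = match.groups()
    return {"owner": owner, "set": set_id,
            "serial": serial, "sub_serial": sub_serial}


def split_file_title(title):
    """Split a file title into (reference, description, extension)."""
    match = REFERENCE.match(title)
    if match is None:
        raise ValueError(f"No Reference Number found in: {title!r}")
    rest = title[match.end():].strip()
    # Split on the last dot only, so 'X.400' stays in the description.
    description, dot, extension = rest.rpartition(".")
    if not dot:
        description, extension = rest, ""
    return match.group(0), description, extension
```

Applied to the example above, `split_file_title` returns the Reference Number `PAW-DOC-1104-39`, the description ending in ‘X.400 User Test’, and the extension `tif`.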

I was able to produce a file of the relevant Index fields by exporting them out of Filemaker. The export contained 17,381 entries composed of just over 2,400,000 characters (I calculated character numbers by using Excel’s ‘LEN’ formula) and sized at 1.4Mb.

Obtaining a list of the associated Filenames was a little more difficult as the files reside inside separate folders for each Reference Number. However, a search of the net established that the 7z file compression utility will provide a list of all the files within a higher-level directory structure; so, I downloaded and installed 7z, and created a file of file titles containing 31,270 entries composed of 2,568,000 characters and sized at 1Mb.
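For anyone without 7z to hand, a similar listing and character count can be produced with a short script. This is an alternative sketch rather than the method described above: it uses Python’s standard `os.walk` to collect the file names beneath a top-level folder, and `len` in place of Excel’s ‘LEN’ formula. The function names are illustrative.

```python
import os


def list_file_titles(root):
    """Return the name of every file found beneath root, sorted."""
    titles = []
    for _folder, _subfolders, filenames in os.walk(root):
        titles.extend(filenames)
    return sorted(titles)


def total_characters(lines):
    """Count the characters across a list of text lines."""
    return sum(len(line) for line in lines)
```

Calling `total_characters(list_file_titles(root))` on the top-level archive folder gives a figure comparable to the character count quoted above, though it excludes line breaks and any folder paths that a 7z listing might include.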

To assess the number and size of the subset files that needed to be produced I turned to the insights I recorded in the Phase 1 and Phase 2 Summary results:

ChatGPT’s usable context window is about 80k-100k tokens (approximately 160k-300k characters – assuming 2-3 characters/token). ChatGPT limits uploads to 3 files a day but this can be circumvented by putting multiple files in a zip file.
Copilot doesn’t have a fixed context window – its design means that its effective context window is much larger and more flexible than a single token number would suggest. It limits uploads to about 7 batches of up to 20 files. Approximately 30k characters per file should work fine.
Claude’s context window is 200k tokens (roughly 500 pages of text or approximately 800,000 characters). Claude caps file uploads at 30Mb per file and 20 files per conversation.

Copilot’s and Claude’s limits suggested that, for each prompt, I should load no more than 20 files containing a maximum of 30k characters each. This would exceed ChatGPT’s limits, but I thought that I could do without ChatGPT if its results were poor.

I had already decided that I would upload a copy of the ‘PAWDOC Guide’ file in every prompt, and that I would count the text request as a file in its own right; so that left a maximum of 18 files of Index and Filename information in each prompt. I duly set about splitting the Index file into subsets of around 30k characters, and then creating associated files of file titles (ensuring that none of the files exceeded 30k characters), and assembled the combination of Index files and File Title files into groups of 18 files or fewer. It was a very laborious task. If I’d realised how time-consuming it was going to be, I would have found a way of splitting out the Reference Number from the File titles into a separate field, and merged the overall Index file and overall File Title file into a single file sorted by Reference Number, which would have made the splitting task a much simpler and quicker operation – a lesson worth remembering (in actual fact, I’d achieved a similar feat a few months earlier simply by asking ChatGPT to provide me with an Excel function for a similar task – I’d just forgotten it was that easy). Anyway, after the job was completed, I found I had 10 subsets of either 17 or 18 files each, which I named Subsets A-J. Subset B’s files are shown below.

Test B files\PAWDOC Index 12.csv
Test B files\PAWDOC Index 13.csv
Test B files\PAWDOC Index 14.csv
Test B files\PAWDOC Index 15.csv
Test B files\PAWDOC Index 16.csv
Test B files\PAWDOC Index 17.csv
Test B files\PAWDOC Index 18.csv
Test B files\PAWDOC File Names 7.csv
Test B files\PAWDOC File Names 8.csv
Test B files\PAWDOC File Names 9.csv
Test B files\PAWDOC File Names 10.csv
Test B files\PAWDOC File Names 11.csv
Test B files\PAWDOC File Names 12.csv
Test B files\PAWDOC File Names 13.csv
Test B files\PAWDOC File Names 14.csv
Test B files\PAWDOC File Names 15.csv
Test B files\PAWDOC File Names 16.csv
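The simpler approach mentioned above, merging the Index and the file titles by Reference Number before splitting, might look something like the sketch below. It is not what was actually done for these subsets. It assumes the index rows are supplied as `(reference, entry_text)` pairs already in the desired order, that each file title starts with its Reference Number followed by a space, and that the character limit is a parameter chosen by the user; all names are illustrative.

```python
from collections import defaultdict


def merge_by_reference(index_rows, file_titles):
    """Attach each file title to the index entry that shares its reference.

    Returns one text block per index entry: the entry text followed by
    its file titles, one per line.
    """
    titles_by_reference = defaultdict(list)
    for title in file_titles:
        reference = title.split(" ", 1)[0]
        titles_by_reference[reference].append(title)
    return [
        "\n".join([entry_text] + titles_by_reference.get(reference, []))
        for reference, entry_text in index_rows
    ]


def split_into_chunks(blocks, max_chars=30_000):
    """Group blocks into chunks of at most max_chars characters each.

    Blocks are never divided, so a single block longer than max_chars
    still ends up as one (oversized) chunk.
    """
    chunks, current, size = [], [], 0
    for block in blocks:
        length = len(block) + 1  # allow for the joining line break
        if current and size + length > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(block)
        size += length
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk then contains complete index entries together with their own file titles, so no separate Index and File Names files are needed.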

I then set about getting each of the AIs to answer question 1 with Subset A. As it happened, there seemed to be no problem with ChatGPT: it happily accepted all the files in a zip file, and advised that no files were truncated when it analysed them. This didn’t seem to concur with the ChatGPT context window information I had established earlier, but the results seemed reasonable so I decided to use ChatGPT in the testing. So too with Claude, which also was happy to ingest all the files compressed into a single zip file. Copilot, however, produced inferior results which I established, after some Q&A with the AI, was due to the files being truncated. Copilot explained the problem very clearly:

Your file size (≈33 KB) and character count (≈29,950) were both within your rule of thumb — but that rule is no longer reliable. The truncation happened because:

The ingestion limit is based on tokens, not characters or kilobytes: A token is roughly 3–4 characters of English text, but CSV files with long lines, punctuation, quotes, and irregular structure tokenize very inefficiently. Your file contains long filenames, commas, quotes, and mixed punctuation dramatically increase token count. So even though the file is only ~30k characters, it may be 10k–15k tokens, which is enough to trigger truncation.
CSVs with hundreds of long rows are the worst-case format: Each line is treated as a dense block of text. Tokenization cost spikes.
The system applies a per‑file token cap: This cap is not published, but from your repeated tests, the effective safe zone is:

≈15,000–18,000 characters for CSV-like content
≈20,000–22,000 characters for plain text
≈30,000 characters only for very simple text

Your file was near 30k characters but had extremely token-heavy content → truncated.

It was clear that I wasn’t going to be able to conduct the tests with Copilot using the same subsets that I had already created. However, ChatGPT and Claude seemed able to deal with the subsets, and would enable me to make some sort of comparison of results. Furthermore, I wasn’t prepared to redo the subsets, so I decided simply to exclude Copilot from the tests. This experience suggests that it’s a good idea to test some sample file sizes with all the models to be used, before undertaking the final splitting of files into subsets.
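A pre-check of this kind could be sketched as follows. This is a minimal illustration, not a description of how any of the AIs actually count tokens: the 15,000-character budget is simply the lower end of the range Copilot quoted for CSV-like content, the 3-characters-per-token figure is the pessimistic end of its "3–4 characters" rule of thumb, and the function names are hypothetical.

```python
from pathlib import Path

# Assumptions taken from Copilot's explanation above, not from any published limit.
CSV_CHAR_BUDGET = 15_000  # lower end of the "safe zone" quoted for CSV-like content
CHARS_PER_TOKEN = 3       # pessimistic end of the "3-4 characters per token" rule of thumb


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Rough token estimate: character count divided by chars_per_token, rounded up."""
    return -(-len(text) // chars_per_token)


def check_text(name, text, char_budget=CSV_CHAR_BUDGET):
    """Report the size of one file's text and whether it exceeds the character budget."""
    chars = len(text)
    return {
        "name": name,
        "chars": chars,
        "est_tokens": estimate_tokens(text),
        "over_budget": chars > char_budget,
    }


def check_folder(folder, pattern="*.csv", char_budget=CSV_CHAR_BUDGET):
    """Run check_text over every matching file in a folder, in name order."""
    reports = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding="utf-8", errors="replace")
        reports.append(check_text(path.name, text, char_budget))
    return reports
```

Running `check_folder` on a folder of candidate subset files before uploading anything would flag files like the ≈29,950-character one mentioned above, which is well over this budget even though it looked acceptable by file size alone.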

A prompt about Prompts

Posted on 14Jun2026 by admin

The Index to my PAWDOC collection of work documents is in a FileMaker database, and I’d recently been told that FileMaker had added some AI functionality into its product, and that it also had a number of AI-related videos for users to learn from. So, at the start of this investigation into providing AI support for larger collections like PAWDOC, I took to watching some of the videos. One was entitled ‘Learn about AI prompting basics (30May2024)’ and it outlined some of the key things to remember when constructing an AI prompt. In particular:

Clearly define the audience.
Define your goals.
Provide a step-by-step approach (do a, b, c, etc).
Include examples of what you want and say why they work.

At the time, I was thinking about putting a protective covering on the top of books to stop dust settling on the top of the text block, and I thought I could get AI to produce an outline which I could print on card and cut out. So, I set about developing a prompt using the guidelines listed above. It turned out to be an iterative process which took me at least a dozen attempts before I was satisfied with the results – illustrated in the image below.

The way the cover works is that you score a line where the image ends, cut away the two small rectangles on either side, and fold down the resulting tab and place it inside the spine of the book. The rest of the cut-out then sits on the top of the text block as shown in the top-down view of some of the books below. When reading the book, the cover can act as a bookmark.

Perfecting the prompt was a matter of getting the cut-out lines to be drawn in the right places; of getting the text boxes to be of the right size with minimal empty space; of getting the whole of the specified area to be covered with interesting illustrations; of figuring out whether it’s best to have it produced in a Word or PowerPoint document (I ended up with Word); and of realising that Claude AI would produce more interesting illustrations than ChatGPT (particular AI models are better at some things than others). Getting all of these things right took rounds of experimentation until I came up with the final prompt.

I had the same experience when investigating AI support for my large PAWDOC document collection, for which I devised 6 evaluation questions, each of which was to be asked of a subset of the PAWDOC Index and file titles. Subsets were needed because, unlike the Mementos collection which I have written about in earlier posts, the PAWDOC Index and file titles have some 16+ times the number of characters – a size which could not be ingested and used in a single prompt in today’s AI systems. So, I decided to see what sort of results could be obtained by splitting the information into 10 subsets, getting answers for each subset, and then stitching the 10 answers together. The 6 questions I specified were:

List all the people named in this part of the Index and its associated files, and the organisation they belong to if any.
Describe Paul Wilson’s career over the period covered by this part of the Index and its associated files.
What significant changes in Information Technology occurred during the period covered by this part of the Index and its associated files?
Document all the travel undertaken by Paul Wilson over the period covered by this part of the Index and its associated files.
What training was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, and how important were particular elements to his subsequent career?
What are the strangest or most unusual things to be found within this part of the Index and its associated files, including unlikely coincidences, and events with unexpected outcomes?

These all look fairly straightforward, but the actual prompts I ended up with were considerably more extensive. For example, the prompt for the fifth question was:

PAWDOC is a work document collection built up from 1981 to the present day. The attached Guide outlines how PAWDOC is constructed and what it contains. Also attached is a subset of PAWDOC’s Index and the digital files associated with those index entries. Note: None of the actual PAWDOC documents have been provided. The following request is to be undertaken using just the information in the Index entries and the File Names provided in the attached files: identify the training that was undertaken by Paul Wilson over the period covered by this part of the Index and its associated files, including the objectives and composition of the training, how it was delivered, how much time was spent on it, how relevant it was to his subsequent career, and citing the Reference Numbers from which the information was obtained. NB. Be sure to inspect all supplied data first, then analyse.

All the prompts required a number of iterations before they were producing good results. They are reproduced in the attached document together with additional prompts to stitch the 10 responses to each question together. Note the prompt for the first question which was especially demanding.
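The split-then-stitch approach described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the subsets are cut as equal contiguous slices of the rows (the posts do not say exactly how the 10 subsets were cut), and the stitch wording is a placeholder rather than the wording in the attached document.

```python
def split_into_subsets(rows, n_subsets=10):
    """Split rows into n_subsets contiguous groups whose sizes differ by at most one."""
    size, extra = divmod(len(rows), n_subsets)
    subsets, start = [], 0
    for i in range(n_subsets):
        # The first `extra` subsets take one additional row each.
        end = start + size + (1 if i < extra else 0)
        subsets.append(rows[start:end])
        start = end
    return subsets


def build_stitch_prompt(question, subset_answers):
    """Assemble one prompt asking the AI to merge the per-subset answers.

    The wording is illustrative only; it is not the wording used in the posts.
    """
    parts = [
        f"Combine the following {len(subset_answers)} partial answers "
        f"into one overall answer to: {question}"
    ]
    for number, answer in enumerate(subset_answers, start=1):
        parts.append(f"--- Answer for subset {number} ---\n{answer}")
    return "\n\n".join(parts)
```

Keeping the subsets contiguous means each one covers a continuous stretch of the Index, which suits questions phrased as "over the period covered by this part of the Index".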

These experiences are not novel – they merely reflect a well-known fact that to get the best out of an AI capability the prompts need to be well thought out and will likely require several iterations before they are deemed satisfactory.

Rephasing

Posted on 09Jun2026 by admin

At the start of this journey, I said I would conduct my investigations in 9 Phases:

AI support for the Memento collection’s index entries.
AI support for the Memento collection’s combined Index entries and file titles.
AI support for the Memento collection’s index entries, file titles and textual items.
AI support for PAWDOC’s index entries.
AI support for PAWDOC’s combined index entries and file titles.
AI support for PAWDOC’s combined index entries, file titles, and some or all of the born digital items.
AI support for a subset of PAWDOC’s scanned items.
AI support for a combination of index entries, file titles, some born digital material and some scanned items.
AI support for the whole of PAWDOC.

I’ve completed the first three Phases and reported on them in previous posts. However, given what I have learned up to now, I’ve decided to change my plan in the following ways:

Combine Phases 4 and 5: It’s a feasible and better approach to obtain AI support for a combination of Index entries and associated File names.

Combine Phases 6, 7, and 8: I’ve discovered that AI can deal with scanned documents in much the same way as born digital documents, so they can both be investigated together. Furthermore, given the limits on context windows, and file numbers and sizes, it’s not really practical to try to obtain AI support for combinations of the whole index and large numbers of content files. Instead, I shall investigate the approach of applying AI to a subset of content files and their associated index entries.

Change Phase 9: AI support for the whole of PAWDOC will have been explored in the new Phases outlined above. Instead, the final Phase will assess all my findings about how AI can support personal archives.

Add another Phase: Since setting out the Phases back in March this year, I’ve discovered that the FileMaker application I use to manage the PAWDOC Index has been incorporating progressively more AI support since the version I currently have, which came out in 2019. Furthermore, FileMaker is currently offering a 45-day free trial of its latest version – FileMaker Pro 2025. I intend to take up this offer once I have completed the revised PAWDOC Phases. This will include watching various training videos about FileMaker’s AI capabilities in addition to performing tests on the PAWDOC Index and Files. I anticipate that the way FileMaker has addressed the opportunities afforded by AI will extend my AI knowledge as well as providing an insight into how any Database/Indexing/Repository software might try to exploit AI.

In summary the Phases I now intend to follow are as follows:

1-3: Complete.

4. AI support for PAWDOC’s combined index entries and file titles.

5. AI support for selected subsets of PAWDOC born digital/scanned documents and their associated index entries.

6. AI Support provided by FileMaker for the PAWDOC Index.

7. Assessing all my findings about how AI can support personal archives.

Phase 3 Results – Text & Image Content

Posted on 20May2026 by admin

Phase 3’s objective was to assess how well AI could support not only an Index and its file titles, but also the textual content of the files themselves. However, after doing a bit of digging, I discovered that AIs could work not only with textual content, but also with scanned text, and with images in general (of course, such material still has to be presented to the AI in the form of uploaded files or RAG Chunks). With this awareness in mind, I set about devising some tests to find out just how well AIs can perform when presented with real content as opposed to just metadata. I came up with the following:

Tests of Machine-readable text

Test 1 – describe and summarise the contents of three years of diary entries in Word format
Test 2 – discuss any relationships that can be found between three Word files with diverse contents: my library loan history for 2004-2012; an account of the petitioning of a school’s teachers to make a change to daily activities; some thoughts about university life while in the infirmary recovering from German Measles.

Tests of Image-only scanned text

Test 3 – summarise Friends of the Earth activities in Harrow as documented in three image-only scanned text documents from 1976-1979 in PDF format.

Tests of Text in images

Test 4 – list all the events and activities described in three documents of events, tickets, membership cards etc.
Test 5 – describe and summarise the contents of all the images in three pages of Italy stamps which also include cutouts from the relevant parts of stamp catalogues.
Test 6 – catalogue the contents of the three pages of Italy stamps images using the following fields: Reference Number, Country, Year, Value, Notes.

Tests of collections of objects in images

Test 7 – describe and summarise the contents of all the images in three photos of chocolate wrappers, with each photo showing a) a double page of a chocolate wrapper scrapbook (in an unused 1967 A4 diary); and b) 3 plastic wallets containing loose wrappers.
Test 8 – catalogue the contents of the three chocolate wrapper photos using the following fields: Reference Number, Name, Manufacturer, Type, and Size.

Test 9 – describe and summarise the contents of three photos of household ornaments showing a) 10 pieces of Wedgwood, b) 30 small display items including porcelain (cups, saucers, plates, vases, jug, trinket boxes, flowers), glassware (bowl, vase, jug, flower, bird), stoneware (lighthouse, ashtrays, barometer), and wood (bowl, elephants); and c) 13 sundry items including silver trays, bowls, coasters and shoehorn; large shells, letter rack; pen holder; and decorative tray and plate.
Test 10 – catalogue the contents of the household ornament photos using the following fields: Reference Number, Name, Type, Colour.

As may be apparent from the above descriptions, each test involved attaching three files to the AI Prompt together with a request, for example, “Using the three files I have just uploaded, catalogue the contents of the images using the following fields: Reference Number, Name, Type, Colour.”

All 10 tests were applied to the four AIs that had been used in the previous phase – AnythingLLM with Mistral, ChatGPT, Copilot and Claude. I did explore the possibility of using two other widely used products – Llama from Meta, and Gemini from Google. However, both require that you create an account before you can use them and I didn’t want to do that because, among other reasons, I’m trying to limit my exposure to data collection and advertising which are central to both of those organisations’ operations. Indeed, during the process of opening a Meta account, I was actually informed that I would be consenting to being shown adverts: I stopped at that point. So, for these tests I stuck with the four AIs previously mentioned. I have been using the free versions of ChatGPT, Copilot and Claude up to now. However, when I started doing these tests ChatGPT suddenly changed the number of files it was allowing me to upload each day from 3 to 2. Since all the tests involve 3 files I elected to upgrade to ChatGPT-Go which enables you to “usually upload far more than the Free tier’s 3 files/day, but there is still a rate limit, and OpenAI hasn’t publicly stated the exact number.” The cost was £7 a month with the ability to cancel anytime. I encountered no limits when I was conducting these tests with the free versions of Copilot and Claude.

Before discussing the test results, it’s worth being clear about the image recognition and text-reading capabilities of the AIs concerned. First, AnythingLLM is not capable of interpreting images so, unsurprisingly, its results in these tests are very poor. However, I included it anyway just to see how it would react. Second, ChatGPT, Copilot and Claude (like most other Large Language Models) don’t apply separate conventional OCR (Optical Character Recognition) techniques to interpret text in scans, photos or other images. Instead, they undertake text recognition as part of their general image understanding capabilities, which include the combined assessment of visual patterns, language, spatial relationships, and context. Consequently, their text recognition capabilities often depend on the type and volume of training data they have been given. As ChatGPT put it: The image recognition capabilities of AI models “are deeply dependent on the training data, but also on model design, compute scale, optimization methods, and how visual and language information are combined during training.”

With this in mind, we can now consider the results of the 10 tests. Each individual test was evaluated in two ways: first, I looked at the answer provided and wrote a short paragraph reflecting what I thought about it; and, second, I awarded a score out of 10 according to what I’d written. As is obvious, this is a completely subjective, and a potentially highly inaccurate, process. However, it does have the advantage that it uses a single observer with a consistent mind-set. Hopefully, that consistency will be reflected throughout all the results.

All the evaluation paragraphs and scores for all four AIs are included in this Word document. However, I also produced an overall summary paragraph, and overall average score, for each AI, and these are provided below. They are accompanied by an equivalent paragraph produced by ChatGPT after it was provided with each AI’s detailed paragraphs and scores for each test.

AI (Overall Average Score)	Human Overall evaluation	ChatGPT’s overall evaluation
AnythingLLM (1.3)	AnythingLLM made it clear in the later tests that it is unable to interpret images. Given this fact it is odd that for two of the questions concerning images it gave answers that were completely nonsensical (circuit boards for stamps; and a photo of ‘a person wearing a white lab coat’) – it could have just said it couldn’t interpret the images. Regarding scanned text, AnythingLLM was able to interpret both the scanned typed text and the scans of tickets – however sometimes incorrectly and sometimes hallucinating parts of the resulting answer. Whatever answers were provided by AnythingLLM were typically very brief and lacking in supporting material.	Across the 10 tests, AnythingLLM demonstrated consistently weak performance, particularly when interpreting images and scanned documents. In six image-based tests involving ornaments, chocolate wrappers and stamps, the AI failed completely, often producing wildly inaccurate descriptions such as circuit boards, robots or software code instead of the actual subjects shown. These responses were considered entirely unusable and all received scores of zero. Performance improved only slightly when handling text-based or scanned documentary material. In the events and activities test, the AI identified a few correct details but omitted most of the content and introduced several errors, resulting in a middling score. Responses to the Friends of the Earth documents and the diary summaries were similarly superficial, inaccurate or overly narrow. Overall, the AI showed limited comprehension, poor contextual understanding and an inability to extract detailed or reliable information from uploaded materials, achieving an average score of just 1.3 out of 10.
ChatGPT (8.7)	ChatGPT gave a very good answer to the question in which text was provided in Word, but a little less so when assessing scanned typed text. It performed extremely well when assessing objects in images and in picking text out of those images, suggesting that it had been given some training data in the subjects covered (stamps, chocolate wrappers, and household goods). The one thing that wasn’t very helpful were the images which it displayed in answers to the image questions – they were not of items in the supplied images but items similar to those in the supplied images which was just generally confusing. Most answers were generally very clear and complete, and there were only a few errors. The catalogues it produced were clear and with good descriptive information.	Across the 10 tests, the AI demonstrated consistently strong performance in analysing, cataloguing, summarising, and interpreting a wide variety of uploaded documents and images. Scores ranged from 7.5 to 9.7 out of 10, with an overall average of approximately 8.7, indicating a high level of competence. The AI performed particularly well when extracting structured information from complex visual material such as stamp collections, chocolate wrappers, ornaments, and historical documents, often identifying nearly all relevant items and presenting the results clearly in tables or organised summaries. Its ability to read embedded text and recognise themes across multiple files was repeatedly praised. The main weaknesses noted were occasional omissions, minor reading inaccuracies, over-generalisation in crowded images, and the insertion of misleading illustrative images not present in the source material. Nevertheless, the tester consistently regarded the outputs as comprehensive, useful, readable, and well-structured overall.
Copilot (7.5)	Copilot produced excellent answers when reading from text in Word format – comprehensive, well structured, and insightful, with no noticeable errors. It was a little less competent when reading the scanned typed text. It read text in images without a problem – including vertical text. However, it performed less well when identifying objects in images – often missing out several items from the descriptive text or grouping several items together; and the catalogue listings were similarly incomplete. This might reflect a lack of training data in the areas concerned.	Across the 10 tests, the AI demonstrated strong overall performance, particularly in tasks involving summarisation, thematic interpretation, and extracting information from structured or semi-structured documents. Its best results came from diary summaries, document relationship analysis, and chronological event extraction, where it showed impressive comprehension, organisation, and the ability to interpret handwritten and vertically oriented text. Scores in these areas were consistently high, often above 9/10. Performance was more mixed when detailed cataloguing or precise item identification was required from complex or cluttered images. In several image-based catalogue tasks, the AI missed items, misinterpreted grouped objects, or introduced incorrect assumptions about how materials had been arranged or curated. Stamp and wrapper cataloguing proved especially challenging, with omissions and occasional transcription errors reducing reliability. Nevertheless, even weaker responses were generally well-presented and useful in part, indicating a capable system whose strengths lie more in synthesis and interpretation than exhaustive visual inventory work.
Claude (9.3)	Claude’s overall evaluation score of 9.3 reflects the fact that it performed extremely well in all but one of the tests. Its answers were comprehensive, full of detail and easy to read. It displayed competency in all four areas being tested – Word text, scanned typed text, text in images, and images of objects. Very few facts or items were missed. In one of the image files it identified the base of a lampstand from an image of just a small piece of its base. Its capabilities suggest it has had a broad range of training data. Its catalogue listings were good with two of the four being produced in a useful Excel format. The only thing that let it down was some numerical errors in the answer cataloguing household items: it incorrectly included the legend row and an empty base row in the total number of items it reported (i.e. it said there were 55 items instead of 53); and it reported that there were 17 items listed in the spreadsheet for image 3 whereas it had only actually listed 15 items in the spreadsheet.	Across the 10 tests, the AI demonstrated consistently high performance, achieving scores between 8.7 and 9.9 out of 10, with an overall average of approximately 9.3. Its strongest capabilities were in extracting, cataloguing, and summarising information from complex images and scanned documents, often identifying nearly every visible item and adding insightful contextual observations. The AI showed particular skill in recognising handwritten text, interpreting historical or archival material, and producing structured outputs such as Excel spreadsheets. Reviewers repeatedly praised the clarity, comprehensiveness, and readability of the responses, as well as the AI’s ability to infer broader themes and relationships across documents. Minor weaknesses included occasional misidentifications of objects, over-interpretation of details, and small numerical inconsistencies in summaries or item counts. Nevertheless, these errors were generally isolated and did not significantly detract from the overall quality. The results indicate an AI with excellent analytical and descriptive abilities across diverse document and image-processing tasks.

Claude comes out a clear winner in these tests, with ChatGPT coming in second. Copilot, while performing excellently with text, appears to have had less relevant image training. At a general level, however, the results illustrate very clearly that AIs can work extremely well with both text and images; and could be very useful to collectors in identifying items, describing them, cataloguing them, and creating indexes for them.

For completeness, the table below records the breakdown of the time I spent on Phase 3 and across all phases.

Activity	No of Tasks or task breakdown	Elapsed time	Time spent
Phase 1	70	43 days	105 hrs
Phase 2	8	4 days	11 hrs
Phase 3	Create test files	3 days	15 hrs
Phase 3	Research & drafting pwofc.com post	4 days	12 hrs
Totals	80	54 days	143 hrs

Phase 2 results – and enter Claude

Posted on 06May2026 by admin

The object of Phase 2 was to explore how AI can support the combination of a collection’s Index and the titles of the associated files. The answer is straightforward: the assembled file titles need to be provided to the AI model in one or more files, in the same way that the Index is delivered. In Windows, file titles can be collected by highlighting the files you are interested in and selecting ‘Copy as path’ from the Right-Click drop down menu. These can then be pasted into either Excel or Word, and the path in front of the file name can be deleted by using the Find and Replace function: specify the path in the Find box and specify blank in the Replace box, highlight the file names, then select Replace-All. No doubt other operating systems have similar capabilities.
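For anyone who would rather script this step than use Find and Replace, the same clean-up can be sketched in Python. This is a minimal illustration, assuming the pasted text has one path per line, optionally wrapped in double quotes; the function name and the example paths are hypothetical.

```python
from pathlib import PureWindowsPath


def file_names_from_copied_paths(pasted_text):
    """Return just the file names from text pasted after using 'Copy as path'."""
    names = []
    for line in pasted_text.splitlines():
        cleaned = line.strip().strip('"')  # drop surrounding whitespace and quotes
        if cleaned:                        # skip blank lines
            # PureWindowsPath parses backslash-separated paths on any operating system.
            names.append(PureWindowsPath(cleaned).name)
    return names


# Hypothetical example paths, for illustration only.
example = "\n".join([
    r'"C:\Mementos\A 001 - Letter.pdf"',
    r'"C:\Mementos\Photos\B 002.jpg"',
])
print(file_names_from_copied_paths(example))
# prints ['A 001 - Letter.pdf', 'B 002.jpg']
```

Unlike Find and Replace with a single fixed path, this also copes with files that sit in different sub-folders, because it keeps only the last component of each path.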

I tried this out with the names of 2065 files associated with the Mementos index, along with the 1.62 version of the Mementos Index (the version I used for the final set of tests in Phase 1). I retained the 5 evaluation questions used in Phase 1, but I also added three more to further test the capabilities of the AI models concerned (more of these at the end of this post). I performed tests using AnythingLLM/Mistral and Copilot. ChatGPT wasn’t used because it has a limit of around 250 KB and the file titles together with the Index would have significantly exceeded that. I also tried out Anthropic’s Claude model for the first time. The results are shown in the table below together with the results of the final test in Phase 1 for comparison purposes.

System	Phase1 – no file titles 5 evaluation questions	Phase 2 with file titles 5 evaluation questions
AnythingLLM/ Mistral	4.4 out of 10: Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.	2.3 out of 10: This was a generally very poor result. Four of the answers were virtually worthless; while 2 of the answers weren’t too bad at all. Quite apart from the contents of the answers, they were all relatively short and with none of the embellishments and rationale that seem to be a standard part of the responses of AIs like ChatGPT and Copilot.
Copilot	8.7 out of 10: Three of these five answers were exceptionally good, and all the responses were well-illustrated with rationale and plenty of examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.	8.9 out of 10: Overall this is an outstanding result. Comprehensive answers with good introductions and summaries and lots of examples with reference numbers included. Nearly everything was correct and complete – it was bemusing to read things like the series of homes we had been in or the different companies I’d worked for – an experience like being amazed that someone you are talking to seems to know your life history. The one concern is that in one answer the AI suggests it had taken facts from a document (Barlborough prospectus) which it hadn’t got access to and the only conclusion is that it hallucinated the information (Indeed it subsequently admitted it had).
Claude		8.2 out of 10: Claude’s answers overall were pretty good – but a bit patchy with scores ranging from 6 to 9.8 (this latter score being ‘first class’). The poorer scores were largely the result of misinterpretations and errors.

From these results, it’s not possible to assess if the inclusion of File Titles has made a difference: AnythingLLM’s performance was generally very poor across both tests; Copilot’s two scores were too similar to draw any conclusion; and Claude did not have any earlier test result to compare against. However, it seems reasonable to assume that if there is additional information in the file titles, over and above that in the Index, there will be a better outcome.

Since this was the first time I had used Claude, I enquired about its operational parameters, and discovered that Claude’s Context Window is 200K tokens across all models and paid plans (except for Enterprise plans). That is roughly 500 pages of text or approximately 800,000 characters. Claude caps file uploads at 30MB per file and 20 files per conversation. I found the quality of Claude’s answers to be at the same sort of high level as Copilot – though the results indicate that, on this particular set of uploaded material, Copilot has a bit of an edge. I also noticed that Claude differs from Copilot in two ways: first, Claude, unlike Copilot, tells you what it’s doing in the course of responding to a question; and, second, Claude took three times longer to respond to the same set of 8 evaluation questions (59.3 vs 19.6 seconds).
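The figures quoted above hang together arithmetically, as the short sketch below shows. The 4-characters-per-token ratio is an assumption implied by the quoted numbers (200K tokens ≈ 800,000 characters), not a published constant, and the helper function is hypothetical.

```python
CONTEXT_TOKENS = 200_000  # Claude context window quoted above
CHARS_PER_TOKEN = 4       # assumption implied by "approximately 800,000 characters"
PAGES = 500               # "roughly 500 pages of text"

context_chars = CONTEXT_TOKENS * CHARS_PER_TOKEN  # 800,000 characters
chars_per_page = context_chars // PAGES           # 1,600 characters per page
slowdown = 59.3 / 19.6                            # Claude vs Copilot response time


def fits_in_context(n_chars, reserve_tokens=0):
    """Rough check: would n_chars of text fit, leaving reserve_tokens for the answer?"""
    return n_chars / CHARS_PER_TOKEN <= CONTEXT_TOKENS - reserve_tokens


print(context_chars, chars_per_page, round(slowdown, 1))
```

The ratio of the two response times comes out at just over 3, matching the "three times" figure. Dense material such as CSV files may well use fewer than 4 characters per token, so a check like `fits_in_context` should be treated as optimistic for Index exports.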

The three extra evaluation questions that I added were:

Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.
Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.
Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

These extra questions were deliberately designed to explore the AIs’ ability to make more broadly-based connections and inferences, as opposed to identifying related items in a narrow subject area. This is where AI can excel – and sometimes fail – as indicated in the detailed results for each question provided below.

Question	Copilot Result	Claude Result
Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.	7 out of 10: This is a comprehensive summary of my time at Barlborough – though surprisingly lacking in info about sports activities. The answer claims to have details of the contents of the school prospectus and of letters to parents – but it didn’t and I think it just hallucinated the info.	9.8 out of 10: This is an excellent, comprehensive answer utilising all the Barlborough material in the collection so far as I could see. I didn’t spot any errors. The answer was easy to read and included all the relevant Memento Reference Numbers. First class.
Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.	9 out of 10: I’m quite bemused by this answer: I don’t really know what I was expecting – but Copilot delivered a whole bunch of unusual stuff with good rationale for their inclusion. Extraordinary!	6 out of 10: Claude identified 12 different strange entries – but actually only about 5 are really valid. The so-called errors in the birthday card collection arose simply because I entered birthday cards for different years in the same year; the so-called error in the Miss Saigon Programme date was because I also bought a special Souvenir programme with the date of the first performance. Item 5 is also covered by Item 1. So this is an interesting set of material – but not entirely correct or useful.
Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.	9 out of 10: Another very comprehensive answer replete with examples and a good summary at the end. I haven’t checked in detail whether it’s all correct but nothing is standing out as being wrong. I’m finding it hard to fault these answers.	8.8 out of 10: This is a very complete account of our time in the 1990s written in a very readable style spiced with occasional wry comments and humour. Each detail is accompanied by the relevant Reference Number. It is highly informative but does include a few misinterpretations – in particular claiming that my wife was at home during the 90’s when actually she was back to teaching; that I was seconded to the IR in ‘mid-1990s’ whereas it was the beginning of the 90s; that the EDS chocolate bar had been kept whereas it was just the box it was in; that my son went to Exeter to do a 2nd degree whereas it was his first; that I was a devoted Aston Villa supporter whereas it was my other son; and that my son had a band called Phases which was stretching it since it was just he and a friend making a cassette recording. However, these are relatively small points amidst the huge array of correct facts that are presented in this highly readable piece.

Finally, to come back to the objective for Phase 2, I have concluded that file titles can certainly be included in the material delivered to an AI model, and that this is worth doing if they contain information beyond what is in an Index – or if an Index does not exist.

For completeness, the table below records the breakdown of the time I spent on Phase 2.

Activity	Time spent – hours
8 Tasks (total elapsed time: 4 days)	11
Time spent assembling files of File Titles and testing	8
Time spent researching and drafting post for pwofc.com	3

Phase 1 Summary Results

Posted on 17Apr2026 by admin

Phase 1 Description: AI support for the Memento collection’s index entries.

Conclusions

Different LLMs are trained with different amounts and types of training data, thereby making them more or less suitable for operation within specific domains. Therefore, there may be some LLMs available that are more suitable than others for use with Collection indexes.
To enable LLMs to provide answers to questions related to specific information in documents or databases, a process called Retrieval Augmented Generation (RAG) is used whereby relevant parts of the information are provided to the LLM alongside the question.
The RAG process performed by particular products typically includes a multiplicity of different techniques to identify an appropriate set of Chunks for submission with the question to the LLM. It is likely that particular combinations of these techniques will be more effective than others for use with collection indexes.
Collection Indexes may contain information which is more unevenly spread throughout their contents than a written document with an ordered set of contents. Consequently, such indexes may require more chunks to be sent to the LLM, and may require the LLM to have larger Context Windows, than ordinary documents in order to obtain satisfactory answers.
Out of the 4 models/systems I have tried out so far, Copilot is by far the best LLM for use with a collection index of around 2,500 entries.
The choices of systems, models, configurations, and Index adjustments in this first Phase were heavily influenced by ChatGPT, and therefore may have been based on some inaccuracies or hallucinations. This needs to be borne in mind when considering the findings from Phase 1 and taking them forward into subsequent phases.

These conclusions were reached in the course of undertaking the activities summarised below.

Tasks and Timescales

Activity	Time spent – hours
70 Tasks (Total Elapsed time: 43 days)	105
Time spent addressing points in AI preparedness document	7
Time spent installing software	8
Time spent adjusting indexes and testing	52
Time spent researching and drafting posts for pwofc.com	38

Reference Sources

A Survey on Retrieval-Augmented Text Generation for Large Language Models by Yizheng Huang, Jimmy Huang, v2, arXiv:2404.10981, 23Aug2024.
I obtained lots of answers from ChatGPT to questions I asked about RAG, AI software products, and how to adjust the Memento index to get better AI results.

Concepts and Terminology encountered

LLM stands for Large Language Model. Hundreds of thousands of these have been built and new ones emerge daily, but many are variants of major models like OpenAI’s GPT-4, which can be accessed online through ChatGPT. Other LLMs can be downloaded from websites like Hugging Face, to use locally on a stand-alone computer.
RAG stands for Retrieval Augmented Generation whereby the LLM is not trained on the archive, but instead a relevant subset of the archive data is provided to the LLM alongside the question.
Chunk is a smaller, manageable segment of a larger document or dataset. In RAG, one or more Chunks are provided to the LLM along with the question.
Token is the fundamental unit of data used by LLMs. Models convert questions and Chunks into tokens, process them, produce an answer in tokens, and then convert those tokens back into text. Generally, one token equals approximately 0.75 English words.
Vector Databases store mathematical representations of Tokens as vectors – lists of numbers – in such a way that related items are clustered together thereby enabling capabilities like similarity searching.
Embedding: Chunks are broken down into Tokens which are converted into mathematical representations and embedded in a Vector Database.
Context Window is the maximum number of Tokens that can be handled by the LLM’s working memory (which contains both the input prompt and the answer). If the inputs to the LLM exceed its Context Window, then some content may be simply left out and the answer may be less complete.
Hallucination is a phenomenon where LLMs generate false, misleading, or nonsensical information confidently. It happens when an LLM predicts text based on patterns rather than facts, often due to poor training data, ambiguous prompts, or a lack of understanding of reality.

Test results – 2nd set of evaluation criteria (scores out of 10)

Index variant / Type of LLM	AnythingLLM Mistral	AnythingLLM Mixtral	Copilot – MS LLM	ChatGPT – GPT-5.3
Less some columns + AI Context short	4.8	5
Less all extraneous fields, no Guide to the Index	2.2	2.9	8
Combined into 1 field and with Guide	3.1	4.8	8.6
Less all extraneous fields, no Guide				5.9
Less all extraneous fields and with Guide			9.1
Index with Set removed and with Guide	4.4	3.6	8.7	6.3

Enter Copilot and ChatGPT

Posted on 15Apr2026 by admin

In the previous post I explained why I changed the evaluation criteria I was using to the following:

What are the main themes that run through the entire index?
Are there distinct phases or periods in the collection?
Which items suggest important life events or transitions?
What patterns or motifs repeat across the collection?
What are the top 5 most important entries, and why?

I tried using them with a version of the Index which had extraneous fields such as ‘Physical Location’ and ‘No of digital files’ removed leaving just Ref No, Description, Item Label, Set, Facets, AI Context, and Year all in a single column. This produced a result of 4.8 out of 10 using the Mistral model and 5 out of 10 for Mixtral – and I proved to myself that I was able to apply the new evaluation criteria, albeit with rather more subjective scoring.

At this point it dawned on me that it might be worth trying to use the MS Copilot AI provided as an integral part of my Windows operating system. After making some initial enquiries on the net about its possible use with Excel, I got the impression it could only be used with MS Office 365 which I do not have (I have the home version of Office); but Copilot itself set me straight, explaining that the 365 version enables support for Copilot functionality within the Office applications, whereas the free-to-use version of Copilot simply uploads documents to the cloud, where it works out the answer to the question it has been asked. I tried it out using a very simple version of the Mementos index with just the Ref No, Description, Set, Facets, and Year fields in their own columns, and was excited by the result which I described as follows:

“This is a strong result (8 out of 10): comprehensive answers with just a few poor interpretations, but with no obvious hallucinations. Importantly, Copilot was able to deal with the whole attached Index in one go which delivers far better answers than the RAG approach which can only work on a pre-selected subset of the material. The response time (average 4.4 seconds) was very quick considering that the whole document had to be sent to, and analysed in, the cloud before delivering its answer.”

I then tried to compare the AnythingLLM and the Copilot systems by using yet another version of the Mementos index based on suggestions from ChatGPT. It still contained just Ref No, Description, Set, Facets, and Year with all the fields in a single cell, but this time with the Description replaced by keywords derived automatically using an Excel formula supplied by ChatGPT. The results were very clear: Mistral scored 3.1, Mixtral 4.8, and Copilot 8.6. For the Copilot test I wasn’t sure if the size of the file would cause a problem so I split it into three files of between 200kb and 300kb each, and these seemed to have been ingested successfully. However, I subsequently discovered that the files had been truncated so that only about the first 120 lines of each were ingested – making about 360 entries in all out of the total 2393 (I guess Copilot must have truncated the file I used in the previous test as well). Given this, Copilot’s 8.6 score was even more impressive.

Finding that Copilot had truncated the files prompted me to do some digging and experimentation to find out just what its limits are. I established that it will ingest up to 20 whole documents of up to about 30,000 characters/30kb file size each in a single turn, and will work across all those documents to construct its answer. Furthermore, more batches of 20 can be submitted in subsequent turns up to a total of about 150 after which ‘the conversation becomes unwieldy’ (I’m not sure what is meant by this). I duly split the Mementos Index into 17 files, all between 20,000 – 30,000 characters and conducted the test again, this time including a version of the Guide document. This produced an even better score of 9.1.
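
For anyone wanting to automate this kind of split, a minimal Python sketch follows. It is only an illustration and not a record of how the files for these tests were produced: the function names `split_entries` and `batch_files` are hypothetical, the sample entries are made up, and the 30,000-character and 20-file limits are simply the figures quoted above passed in as defaults.

```python
def split_entries(entries, max_chars=30_000):
    """Group whole index entries into files of at most max_chars characters.

    An entry is never cut in two. (A single entry longer than max_chars
    would still end up in a file of its own that exceeds the limit.)
    """
    files, current, size = [], [], 0
    for entry in entries:
        length = len(entry) + 1  # +1 for the newline that ends the entry
        if current and size + length > max_chars:
            files.append(current)
            current, size = [], 0
        current.append(entry)
        size += length
    if current:
        files.append(current)
    return files


def batch_files(files, batch_size=20):
    """Group files into batches small enough to attach in a single turn."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]


# Made-up entries standing in for real index rows.
entries = [f"REF-{n:04d} | Example description {n}" for n in range(1, 2394)]
files = split_entries(entries)
batches = batch_files(files)
print(len(files), "files in", len(batches), "batch(es)")
```

Keeping each entry whole means a file boundary never falls in the middle of a description, which matters if the AI is expected to cite Reference Numbers correctly.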

Discovering Copilot’s capabilities made me wonder what ChatGPT could do. I discovered that the free version which I was using allows you to upload a maximum of 3 files of up to about 512Mb in any one day. However, to be sure that all the contents of all the files will be taken into account in answering a question, the total of the 3 files needs to be a maximum of about 300k characters with 250k being a safe limit. I put ChatGPT to the test with the same index used in the first Copilot test but broken into three files of between 111k and 130k characters each. This produced a result of 5.9 which was probably to be expected given that the file size limits had been exceeded.

As I was exploring the file size limits for Copilot and ChatGPT, it became clear that the ultimate determinant of the amount of text that can be reliably reasoned about at once is the AI model’s maximum ‘Context Window’. This is effectively the model’s working memory which contains both the input prompt and the answer. If the inputs exceed the Context Window then some content may be simply left out and the answer may be less complete. The table below summarises the Context Window limits for the different AI models I’ve been using.

AI System	Model	Maximum Context Window
AnythingLLM	Mistral 7B Instruct (32k context), Q4_K_M quantization	32k tokens ( ~65k–100k characters in csv files) (assuming 2-3 characters/token)
AnythingLLM	Mixtral 8x7B Instruct (32k context), Q4_0 quantization	32k tokens ( ~65k–100k characters in csv files) (assuming 2-3 characters/token)
Copilot	MS LLM (Microsoft does not publish the names of its models)	The MS LLM doesn’t have a single fixed Context Window; it’s designed around task‑adaptive context management, and hence the effective context it can use is much larger and more flexible than a single token number would suggest.
ChatGPT (free version)	GPT-5.3	The maximum Context Window is 128k tokens, but because this includes system instructions, conversation history, and output tokens, the usable Context Window is about 80k – 100k tokens (160k – 300k characters in csv files) (assuming 2-3 characters/token)
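
The character figures in the table above follow from simple multiplication of the token limit by the assumed 2-3 characters per token. A small Python sketch of that arithmetic is given below; the function names `char_capacity` and `fits` are hypothetical, and the 250,000-character index in the last line is an invented example rather than a real file.

```python
def char_capacity(tokens, chars_per_token=(2, 3)):
    """Return the (low, high) number of characters that fit in `tokens`."""
    low, high = chars_per_token
    return tokens * low, tokens * high


def fits(total_chars, tokens, chars_per_token=(2, 3)):
    """True only if the text fits under the pessimistic (low) estimate."""
    low, _ = char_capacity(tokens, chars_per_token)
    return total_chars <= low


print(char_capacity(32_000))       # (64000, 96000)
print(char_capacity(80_000)[0])    # 160000
print(char_capacity(100_000)[1])   # 300000
print(fits(250_000, 32_000))       # False
```

Using the low end of the range for the check is a deliberately cautious choice: the real ratio depends on the tokenizer and on the text, so a file that only fits under the optimistic estimate may still be truncated.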

If one wishes Index files to be considered in full by an AI system, the primary requirement is to ensure that the whole set fits into the Context Window. However, when it comes to assembling and submitting those files there are additional considerations to be taken into account, as summarised below.

AI system	Approach	Max total file size	Max No of files	Max size of each file
AnythingLLM	Embed (RAG)	No limit. The limit is the size of the Chunks the files are divided into. The default is set at 256 tokens (800-1000 characters).	No limit. Though more than 50K Chunks may cause retrieval problems.	No limit.
AnythingLLM	Attach function	Default is set to 30k-50k characters for csv files. Can be adjusted within AnythingLLM settings.	No Limit	No limit.
Copilot	Attach function	No absolute limit, but above about 150 files (4,500k characters) the conversation becomes unwieldy.	About 7 batches of up to 20 files in a single prompt.	No absolute limit, but for csv files 10,000 rows or approximately 30k characters, should work fine.
ChatGPT (free version)	Attach function	To be confident that the model will read everything and not overlook anything the total should be kept to about 250k characters.	3 within any one day. This limit can be circumvented by putting multiple files in a zip file, or by putting the contents of a file into the prompt with the question.	512Mb is the absolute limit but to be sure all the contents will be acted on, keep it to 200k-300k characters. If 3 files are to be used, keep their total to 200k-300k characters.

With these constraints in mind, I conducted a final set of tests to compare the three AI Systems – AnythingLLM’s RAG approach, Copilot, and ChatGPT. I assembled a cut-down version of the Index (by removing the MW set) which was small enough (around 274k characters, 270kb file size) to fit within the limits of all three systems. As can be seen in the table below, there was a clear winner.

System	Average Evaluation Score out of 10	Performance Summary	Average time taken to start responding (seconds)
AnythingLLM (Mistral)	4.4	Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.	6.6
AnythingLLM (Mixtral)	3.6	This score was a little bit worse than Mistral’s result despite it taking a lot longer to start printing out its answers. Two of the five answers seemed to mainly regurgitate the different sets that were described in the Index; and some of the answers were of rather dubious relevance. However, of most concern is that there were at least 4 instances of hallucinations – in two cases, Reference Numbers which don’t exist were cited.	36.4
ChatGPT (GPT-5.3)	6.3	All the responses were comprehensive with extensive rationale and good summation. However, the content wasn’t always appropriate. It wasn’t necessarily wrong but was sometimes just a little dubious. Furthermore, there was rather too much emphasis on the way the collection was organised rather than on its contents. On two occasions, examples of individual entries were specifically asked for but generalisations were delivered.	6
Copilot (MS LLM)	8.7	Three of these five answers were exceptionally good, and all the responses were well illustrated with rationale and examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.	16.4

Copilot was a clear winner with ChatGPT following on behind. Both provided substantial answers which included rationale, several examples, and a summary. In contrast the AnythingLLM RAG answers were sparse, sometimes not very good, and occasionally included complete hallucinations. The RAG approach just doesn’t seem very effective for Index material.

Embedding Parameters and new Evaluation Questions

Posted on 10Apr2026 by admin

The tests described in the previous post highlighted a number of questions which are discussed below:

The Text Chunk Overlap is described as ‘the maximum overlap of characters that occurs during chunking between two adjacent text Chunks’. Under what circumstances would an overlap occur, and if it occurs, does exactly the same set of overlap characters appear in the two Chunks?
ChatGPT’s Answer: Overlap occurs when you intentionally reuse part of the previous chunk to maintain continuity. And in typical character-based chunking, the overlap is exactly the same sequence of characters in both chunks.
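
To make the idea of overlap concrete, here is a minimal Python sketch of character-based chunking. It is not AnythingLLM’s own code, and the function name `chunk_with_overlap` is hypothetical; it simply shows that the last few characters of one chunk are repeated exactly at the start of the next.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into chunks of chunk_size characters, each sharing its
    first `overlap` characters with the end of the previous chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text has been reached
    return chunks


chunks = chunk_with_overlap("ABCDEFGHIJKLMNOPQRSTUVWXYZ", chunk_size=10, overlap=3)
print(chunks)  # ['ABCDEFGHIJ', 'HIJKLMNOPQ', 'OPQRSTUVWX', 'VWXYZ']
```

In this example ‘HIJ’ ends the first chunk and begins the second, so a phrase that straddles a chunk boundary still appears intact in at least one chunk.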

1 cell = 1 Chunk: In my Mementos Index I have collapsed all the columns for a single entry (i.e. in one row) into a single cell [as advised by ChatGPT]. Hence the whole index is now just a series of cells in a single column. Will the AnythingLLM embedding function attempt to put each cell in a single Chunk provided the number of characters in a cell is less than the Text Chunk Size? What happens if the number of characters exceeds the Text Chunk Size?
ChatGPT’s answer: Yes, a cell will usually become a single chunk if it fits within the chunk size. If it doesn’t, it will be split into multiple overlapping chunks, each embedded separately. If you’ve set an overlap, adjacent chunks will share overlapping text (as discussed in the answer to Question 1).

Chunks starting with the beginning of an entry: When embedding my Mementos Index with all the entries in single cells in a single column, how can I ensure that each Chunk starts with the beginning of an entry regardless of whether the previous entry exceeded the Text Chunk Size or not?
ChatGPT’s answer: To ensure every chunk starts at the beginning of an entry: Best: ingest each cell as a separate document
Next best: split by delimiter-based chunking
Avoid: relying on chunk size alone.
[I didn’t think this was a helpful answer – it seems there’s not much practically that can be done about this issue]
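
For what it is worth, the delimiter idea can be sketched in a few lines of Python. This is an illustration only: the function name `entry_aligned_chunks` and the sample entries are invented, and whether AnythingLLM can be configured to chunk in this way is not something established here.

```python
def entry_aligned_chunks(entries, chunk_size):
    """Return chunks in which every entry begins at the start of a chunk.

    An entry longer than chunk_size is split into several pieces, but the
    next entry still starts a fresh chunk.
    """
    chunks = []
    for entry in entries:
        for start in range(0, max(len(entry), 1), chunk_size):
            chunks.append(entry[start:start + chunk_size])
    return chunks


# Made-up entries: the second one is longer than the chunk size.
entries = ["M0001 short entry", "M0002 " + "x" * 40, "M0003 another"]
chunks = entry_aligned_chunks(entries, chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

The cost of this approach is that short entries produce small chunks, so more chunks are needed to cover the same amount of the index.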

More Chunks than Snippet limit: What would the impact be on AI’s answer to a question which concerns more Chunks than the maximum allowable number of Context Snippets?
ChatGPT’s answer: The AI can only reason over the chunks it’s given. If relevant information exceeds the snippet limit answers become partial, biased, or occasionally incorrect. The fix is a balance of: chunk design, retrieval quality, and context size.
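
The effect of a snippet limit can be shown with a toy Python example. Everything in it is invented for illustration: the function name `top_k_chunks` is hypothetical, the chunks are made up, and counting shared words is only a crude stand-in for the vector similarity that a real RAG system would use.

```python
def top_k_chunks(question, chunks, k):
    """Rank chunks by the number of words shared with the question and
    keep only the best k (word overlap stands in for vector similarity)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Made-up chunks: four mention holidays, one does not.
chunks = [
    "1991 holiday postcard from Cornwall",
    "1993 holiday brochure for Brittany",
    "1995 school report",
    "1997 holiday photographs from Wales",
    "1998 holiday itinerary for Spain",
]
selected = top_k_chunks("which holiday items are there", chunks, k=2)
relevant = [c for c in chunks if "holiday" in c]
print(len(relevant), "relevant chunks, but only", len(selected), "reach the model")
```

With a limit of 2, half of the relevant chunks never reach the model, which is one way an answer can end up partial without containing any outright error.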

Duplicate Descriptions: There are two instances of the ‘Description’ field in my revised Mementos Index – one in the original ‘Description’ field, and one in the new ‘Item Label’ field which combines the ‘Reference Number’ field with the ‘Description’ field [as advised by ChatGPT]. For longer descriptions, this second instance of the ‘Description’ field may be resulting in the Maximum Number of Characters in a Chunk being exceeded. What would the impact on AI performance be if the ‘Item Label’ field was removed?
ChatGPT’s answer: Removing the duplicated Description from the Item Label will likely:
– reduce chunk splitting
– improve retrieval accuracy
– increase effective context usage
– produce more complete answers
and has little to no downside if the reference number is still present elsewhere.

Attaching files: I’m using AnythingLLM with Ollama and the Mixtral model on a laptop with Intel 9 core and 64Gb RAM. I’d prefer to have the model work on all parts of my Mementos Index with 2390 entries, rather than a limited number of Chunks. I believe that the way to do that is to not embed the index but to ATTACH the index (now adjusted so that all entries are in single cells in a single column) in .txt format. Will that work on my configuration? And roughly how long will it take for the AI to start printing out its answers?
ChatGPT’s answer: In AnythingLLM (even when using Ollama with Mixtral): Attaching a file does not bypass context limits. The model still has a maximum context window (how much text it can “see” at once). So: The model cannot process all 2,390 entries simultaneously unless they fit inside the context window. Mixtral typically supports around: ~32k tokens context (varies by version). Rough conversion: 1 token ≈ 3–4 characters, so 32k tokens ≈ 100k–120k characters. Your dataset reality – If you have 2,390 entries, even at a modest 200 characters per entry, that’s: 2390 × 200 ≈ 478,000 characters – that’s 4–5× larger than the entire context window. So, the system will truncate the file and only the first portion is seen by the model. In the worst case:
– System becomes slow or unresponsive
– Very long delay before generation starts
– Possible failure or timeout
The bottom line is:
❌ Attaching the full .txt will not let the model process all 2,390 entries
❌ It will likely truncate or slow down significantly
⏱️ You may wait 10–60+ seconds before any output—and still get incomplete results
✅ Embeddings + retrieval is the correct and scalable approach
✅ Your restructuring (one entry per cell) is exactly the right direction

Having acquired these learnings, I was ready to continue testing. First, however I knew I needed a more appropriate set of evaluation criteria. So, I put this final question to ChatGPT:

“When applying AI to an index to a collection in an Excel spreadsheet, what generally applicable evaluation questions could be asked which could not be answered directly by using the spreadsheet Filter facility?”

The answer provided 4 questions in each of 11 categories, and then one extra question i.e. 45 questions in all. From these I selected the following five:

What are the main themes that run through the entire index?
Are there distinct phases or periods in the collection?
Which items suggest important life events or transitions?
What patterns or motifs repeat across the collection?
What are the top 5 most important entries, and why?

Armed with these new evaluation questions, and a better understanding of the settings and parameters associated with embedding, I set out to undertake another round of tests using the Mementos Index.

OFC

Order from Chaos, Digitisation, and their intersection

Category Archives: AI for Personal Archives