Phase 3 Results – Text & Image Content

Posted on 20May2026 by admin

Phase 3’s objective was to assess how well AI could support not only an Index and its file titles, but also the textual content of the files themselves. However, after doing a bit of digging, I discovered that AIs can work not only with textual content, but also with scanned text and with images in general (of course, such material still has to be presented to the AI in the form of uploaded files or RAG Chunks). With this in mind, I set about devising some tests to find out just how well AIs perform when presented with real content as opposed to just metadata. I came up with the following:

Tests of Machine-readable text

Test 1 – describe and summarise the contents of three years of diary entries in Word format
Test 2 – discuss any relationships that can be found between three Word files with diverse contents: my library loan history for 2004-2012; an account of the petitioning of a school’s teachers to make a change to daily activities; some thoughts about university life while in the infirmary recovering from German Measles.

Tests of Image-only scanned text

Test 3 – summarise Friends of the Earth activities in Harrow as documented in three image-only scanned text documents from 1976-1979 in PDF format.

Tests of Text in images

Test 4 – list all the events and activities described in three documents of events, tickets, membership cards etc.
Test 5 – describe and summarise the contents of all the images in three pages of Italy stamps which also include cutouts from the relevant parts of stamp catalogues.
Test 6 – catalogue the contents of the three pages of Italy stamps images using the following fields: Reference Number, Country, Year, Value, Notes.

Tests of collections of objects in images

Test 7 – describe and summarise the contents of all the images in three photos of chocolate wrappers, with each photo showing a) a double page of a chocolate wrapper scrapbook (in an unused 1967 A4 diary); and b) 3 plastic wallets containing loose wrappers.
Test 8 – catalogue the contents of the three chocolate wrapper photos using the following fields: Reference Number, Name, Manufacturer, Type, and Size.

Test 9 – describe and summarise the contents of three photos of household ornaments showing a) 10 pieces of Wedgwood, b) 30 small display items including porcelain (cups, saucers, plates, vases, jug, trinket boxes, flowers), glassware (bowl, vase, jug, flower, bird), stoneware (lighthouse, ashtrays, barometer), and wood (bowl, elephants); and c) 13 sundry items including silver trays, bowls, coasters and shoehorn; large shells, letter rack; pen holder; and decorative tray and plate.
Test 10 – catalogue the contents of the household ornament photos using the following fields: Reference Number, Name, Type, Colour.

As may be apparent from the above descriptions, each test involved attaching three files to the AI Prompt together with a request, for example, “Using the three files I have just uploaded, catalogue the contents of the images using the following fields: Reference Number, Name, Type, Colour.”

All 10 tests were applied to the four AIs that had been used in the previous phase – AnythingLLM with Mistral, ChatGPT, Copilot and Claude. I did explore the possibility of using two other widely used products – Llama from Meta, and Gemini from Google. However, both require that you create an account before you can use them and I didn’t want to do that because, among other reasons, I’m trying to limit my exposure to data collection and advertising, which are central to both of those organisations’ operations. Indeed, during the process of opening a Meta account, I was actually informed that I would be consenting to being shown adverts: I stopped at that point. So, for these tests I stuck with the four AIs previously mentioned. I have been using the free versions of ChatGPT, Copilot and Claude up to now. However, when I started doing these tests ChatGPT suddenly changed the number of files it was allowing me to upload each day from 3 to 2. Since all the tests involve 3 files I elected to upgrade to ChatGPT-Go which enables you to “usually upload far more than the Free tier’s 3 files/day, but there is still a rate limit, and OpenAI hasn’t publicly stated the exact number.” The cost was £7 a month with the ability to cancel anytime. I encountered no limits when I was conducting these tests with the free versions of Copilot and Claude.

Before discussing the test results, it’s worth being clear about the image recognition and text-reading capabilities of the AIs concerned. First, AnythingLLM is not capable of interpreting images so, unsurprisingly, its results in these tests are very poor. However, I included it anyway just to see how it would react. Second, ChatGPT, Copilot and Claude (like most other Large Language Models) don’t apply separate conventional OCR (Optical Character Recognition) techniques to interpret text in scans, photos or other images. Instead, they undertake text recognition as part of their general image understanding capabilities, which include the combined assessment of visual patterns, language, spatial relationships, and context. Consequently, their text recognition capabilities often depend on the type and volume of training data they have been given. As ChatGPT put it: The image recognition capabilities of AI models “are deeply dependent on the training data, but also on model design, compute scale, optimization methods, and how visual and language information are combined during training.”

With this in mind, we can now consider the results of the 10 tests. Each individual test was evaluated in two ways: first, I looked at the answer provided and wrote a short paragraph reflecting what I thought about it; and, second, I awarded a score out of 10 according to what I’d written. As is obvious, this is a completely subjective, and a potentially highly inaccurate, process. However, it does have the advantage that it uses a single observer with a consistent mind-set. Hopefully, that consistency will be reflected throughout all the results.

All the evaluation paragraphs and scores for all four AIs are included in this Word document. However, I also produced an overall summary paragraph, and overall average score, for each AI, and these are provided below. They are accompanied by an equivalent paragraph produced by ChatGPT after it was provided with each AI’s detailed paragraphs and scores for each test.

AI (Overall Average Score)	Human Overall evaluation	ChatGPT’s overall evaluation
AnythingLLM (1.3)	AnythingLLM made it clear in the later tests that it is unable to interpret images. Given this fact it is odd that for two of the questions concerning images it gave answers that were completely nonsensical (circuit boards for stamps; and a photo of ‘a person wearing a white lab coat’) – it could have just said it couldn’t interpret the images. Regarding scanned text, AnythingLLM was able to interpret both the scanned typed text and the scans of tickets – however sometimes incorrectly and sometimes hallucinating parts of the resulting answer. Whatever answers were provided by AnythingLLM were typically very brief and lacking in supporting material.	Across the 10 tests, AnythingLLM demonstrated consistently weak performance, particularly when interpreting images and scanned documents. In six image-based tests involving ornaments, chocolate wrappers and stamps, the AI failed completely, often producing wildly inaccurate descriptions such as circuit boards, robots or software code instead of the actual subjects shown. These responses were considered entirely unusable and all received scores of zero. Performance improved only slightly when handling text-based or scanned documentary material. In the events and activities test, the AI identified a few correct details but omitted most of the content and introduced several errors, resulting in a middling score. Responses to the Friends of the Earth documents and the diary summaries were similarly superficial, inaccurate or overly narrow. Overall, the AI showed limited comprehension, poor contextual understanding and an inability to extract detailed or reliable information from uploaded materials, achieving an average score of just 1.3 out of 10.
ChatGPT (8.7)	ChatGPT gave a very good answer to the question in which text was provided in Word, but a little less so when assessing scanned typed text. It performed extremely well when assessing objects in images and in picking text out of those images, suggesting that it had been given some training data in the subjects covered (stamps, chocolate wrappers, and household goods). The one thing that wasn’t very helpful were the images which it displayed in answers to the image questions – they were not of items in the supplied images but items similar to those in the supplied images which was just generally confusing. Most answers were generally very clear and complete, and there were only a few errors. The catalogues it produced were clear and with good descriptive information.	Across the 10 tests, the AI demonstrated consistently strong performance in analysing, cataloguing, summarising, and interpreting a wide variety of uploaded documents and images. Scores ranged from 7.5 to 9.7 out of 10, with an overall average of approximately 8.7, indicating a high level of competence. The AI performed particularly well when extracting structured information from complex visual material such as stamp collections, chocolate wrappers, ornaments, and historical documents, often identifying nearly all relevant items and presenting the results clearly in tables or organised summaries. Its ability to read embedded text and recognise themes across multiple files was repeatedly praised. The main weaknesses noted were occasional omissions, minor reading inaccuracies, over-generalisation in crowded images, and the insertion of misleading illustrative images not present in the source material. Nevertheless, the tester consistently regarded the outputs as comprehensive, useful, readable, and well-structured overall.
Copilot (7.5)	Copilot produced excellent answers when reading from text in Word format – comprehensive, well structured, and insightful, with no noticeable errors. It was a little less competent when reading the scanned typed text. It read text in images without a problem – including vertical text. However, it performed less well when identifying objects in images – often missing out several items from the descriptive text or grouping several items together; and the catalogue listings were similarly incomplete. This might reflect a lack of training data in the areas concerned.	Across the 10 tests, the AI demonstrated strong overall performance, particularly in tasks involving summarisation, thematic interpretation, and extracting information from structured or semi-structured documents. Its best results came from diary summaries, document relationship analysis, and chronological event extraction, where it showed impressive comprehension, organisation, and the ability to interpret handwritten and vertically oriented text. Scores in these areas were consistently high, often above 9/10. Performance was more mixed when detailed cataloguing or precise item identification was required from complex or cluttered images. In several image-based catalogue tasks, the AI missed items, misinterpreted grouped objects, or introduced incorrect assumptions about how materials had been arranged or curated. Stamp and wrapper cataloguing proved especially challenging, with omissions and occasional transcription errors reducing reliability. Nevertheless, even weaker responses were generally well-presented and useful in part, indicating a capable system whose strengths lie more in synthesis and interpretation than exhaustive visual inventory work.
Claude (9.3)	Claude’s overall evaluation score of 9.3 reflects the fact that it performed extremely well in all but one of the tests. Its answers were comprehensive, full of detail and easy to read. It displayed competency in all four areas being tested – Word text, scanned typed text, text in images, and images of objects. Very few facts or items were missed. In one of the image files it identified the base of a lampstand from an image of just a small piece of its base. Its capabilities suggest it has had a broad range of training data. Its catalogue listings were good with two of the four being produced in a useful Excel format. The only thing that let it down was some numerical errors in the answer cataloguing household items: it incorrectly included the legend row and an empty base row in the total number of items it reported (i.e. it said there were 55 items instead of 53); and it reported that there were 17 items listed in the spreadsheet for image 3 whereas it had only actually listed 15 items in the spreadsheet.	Across the 10 tests, the AI demonstrated consistently high performance, achieving scores between 8.7 and 9.9 out of 10, with an overall average of approximately 9.3. Its strongest capabilities were in extracting, cataloguing, and summarising information from complex images and scanned documents, often identifying nearly every visible item and adding insightful contextual observations. The AI showed particular skill in recognising handwritten text, interpreting historical or archival material, and producing structured outputs such as Excel spreadsheets. Reviewers repeatedly praised the clarity, comprehensiveness, and readability of the responses, as well as the AI’s ability to infer broader themes and relationships across documents. Minor weaknesses included occasional misidentifications of objects, over-interpretation of details, and small numerical inconsistencies in summaries or item counts. Nevertheless, these errors were generally isolated and did not significantly detract from the overall quality. The results indicate an AI with excellent analytical and descriptive abilities across diverse document and image-processing tasks.

Claude comes out a clear winner in these tests, with ChatGPT coming in second. Copilot, while performing excellently with text, appears to have had less relevant image training. At a general level, however, the results illustrate very clearly that AIs can work extremely well with both text and images; and could be very useful to collectors in identifying items, describing them, cataloguing them, and creating indexes for them.

For completeness, the table below records the breakdown of the time I spent on Phase 3 and across all phases.

Activity	No of Tasks or task breakdown	Elapsed time	Time spent
Phase 1	70	43 days	105 hrs
Phase 2	8	4 days	11 hrs
Phase 3	Create test files	3 days	15 hrs
Phase 3	Research & drafting pwofc.com post	4 days	12 hrs
Totals	80	54 days	143 hrs
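
The Totals row can be checked with a few lines of Python. This is only a sketch of the arithmetic: the figures are copied from the table, and counting Phase 3 as two tasks (the two Phase 3 tasks listed in the table) is my assumption about how the total of 80 was reached.

```python
# Figures copied from the time breakdown table.
# Assumption: Phase 3 counts as 2 tasks (creating test files; research and drafting).
phases = {
    "Phase 1": {"tasks": 70, "days": 43, "hours": 105},
    "Phase 2": {"tasks": 8, "days": 4, "hours": 11},
    "Phase 3": {"tasks": 2, "days": 3 + 4, "hours": 15 + 12},
}


def column_total(column: str) -> int:
    """Add up one column across all phases."""
    return sum(row[column] for row in phases.values())


totals = {column: column_total(column) for column in ("tasks", "days", "hours")}
print(totals)  # {'tasks': 80, 'days': 54, 'hours': 143}
```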

Creating an Audio Book

Posted on 12May2026 by admin

Not all books have to be in print format. If you have a collection of audio material, then it’s perfectly feasible to create an audio book. That is what I did in 2008 when I assembled the material from two cassette tapes I had recorded when walking from Harrow to Trafalgar Square in London in 1975. I used the excellent open source, free-to-use Audacity software to break the recordings into discrete sections; stored the files on a CD; and created a cover for the CD case which included some of the (very poor) photos I had taken on the way.

In the last month I’ve just undertaken a similar exercise. A friend who used to work for BBC Radio, and who knows the value of audio memoirs, had suggested that I record my ageing mother speaking about her early life at school, her experiences during World War II and in the WRENS, and the many years she spent in Singapore. I took the advice and duly recorded some 8 hours of conversation with my mother when she was in her late nineties. The recordings were made using one of the utility apps in an Apple iPhone which produced 24 very clear files in the m4a format. I decided to combine these with another 8 short files of my mother describing particular photos, which I had recorded when sorting out her photo collection back in 2012.

I wanted to accompany the audio files with a Word document summarising the contents, but realised that, in order to do that, I would need to listen to each file in detail and then create a summary paragraph. That sounded like hard work, so I decided to enlist the help of transcription services to produce written versions of each audio file. I tried to find a free service but couldn’t find one that produced good transcriptions in the volumes I required. In the end I chose to subscribe to Otter.ai’s business service for one month at a cost of $30. This enabled an unlimited number of files to be uploaded to the Otter site which then produced transcriptions in three parts:

An Overview Summary paragraph (broadly good but with the occasional error)
An Outline in the form of separate headings containing a series of bullet points (these seemed to contain quite a lot of errors in the bullet points)
A Detailed Transcript (pretty good at capturing almost every word, but sometimes getting the odd word wrong. The transcript was just a series of paragraphs without the identification of who was speaking each paragraph, and sometimes not splitting the paragraphs into separate speakers).

Taking Otter’s capabilities into account, I decided to edit the Detailed Transcript to get it as correct as I could and to insert the relevant speaker at the beginning of each paragraph (which was either my mother or myself). This entailed actually listening to the audio as I edited the transcript – a much easier task, I would imagine, than just listening to the audio while making notes. I also edited the Overview Summary paragraph to ensure it was correct and informative so that I could just lift that out to include in the overall summary document; and I ignored the Outline but left it intact in the transcript files. So, I ended up with the following set of materials:

1 overall ‘Introduction and Summaries’ Word document
32 audio files in m4a format
1 transcript Word document for each audio file
1 or more jpg image files for each of the audio files describing particular photos.

This package of material includes quite a significant element of our family’s history, so I wanted to give a copy of this audio book to each of my children in a form that would be accessible in the long term, and would stand a chance of not just getting lost. Consequently, I decided to use a USB memory stick to store the audio files on, and to box it up in a clear USB case designed to take a cover insert. I obtained both from River Solutions Group: 5 Twister Silver White USB Sticks 4GB at £2.25 each; and a pack of 10 Rivermedia Premium USB Case Clear 14mm for £8.50.

I produced the cover in PowerPoint and included on it a picture of my mother and a list of the 32 audio files headed by the ‘Introduction and Summaries’ file (10 on the front cover and 22 on the back). I’m hoping that this overall package will look significant enough to ensure its survival for a few decades at least.

A New Loft Management App

Posted on 10May2026 by admin

Addenda to ‘Loft Management’

Over the 20+ years I’ve been operating a Loft Index I’ve used 3 different computer applications: Filemaker on a laptop, uGrokIT on a phone, and Memento Database on a phone. In the March 2025 post describing the move to the Memento Database I noted that scrolling was hesitant and that the records moved up the screen on top of the field headers. The Memento Help person had told me that ‘this happens because Memento stores photos in their original size, which can be quite large’ and that they would ‘try to add a photo compression feature in future releases’. I decided to ignore these problems for three reasons: a) they weren’t interfering with creating new entries or searching, b) the product was free to use locally, and c) I’d gone to a lot of effort transferring the data and photos into the application. I realise now, that was a mistake.

I should have seen the writing on the wall. When I wanted to add something to the Index last week, I discovered the interface was a little different (there’d obviously been an upgrade) and my loft database just locked up when I opened it. I wasn’t even able to get it to produce a csv export file of the current contents.

So, I’ve spent a couple of days moving the index to a new app. I’m now using the Collections database from Risolvi Productions for which I’ve paid a single one-off fee of £7.99. Luckily there haven’t been too many changes to my loft index over the last year so I was able to use an October 2025 export from the Memento app for the data, and a March 2025 folder of the associated photos, and to deduce what changes had occurred since Oct25 from the first frozen screen of the malfunctioning Memento app. With the new database fully set up in the Collections app and all the associated photos loaded in as well, I’m glad to say it all seems to be working perfectly.

The lesson here is clear: if an app doesn’t seem to be working properly, don’t use it; and be prepared to pay a little rather than just always going for a free-to-use product. Obvious really….

Phase 2 results – and enter Claude

Posted on 06May2026 by admin

The object of Phase 2 was to explore how AI can support the combination of a collection’s Index and the titles of the associated files. The answer is straightforward: the assembled file titles need to be provided to the AI model in one or more files, in the same way that the Index is delivered. In Windows, file titles can be collected by highlighting the files you are interested in and selecting ‘Copy as path’ from the Right-Click drop down menu. These can then be pasted into either Excel or Word, and the path in front of the file name can be deleted by using the Find and Replace function: specify the path in the Find box and specify blank in the Replace box, highlight the file names, then select Replace-All. No doubt other operating systems have similar capabilities.
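
For anyone who prefers a script to the copy-and-paste route, the same list of file titles can be produced with a short Python sketch. The function name `collect_file_titles`, the folder name `Mementos` and the sample file names are all hypothetical, chosen purely for illustration; the demonstration runs on a throwaway temporary folder so that it doesn’t touch any real files.

```python
import tempfile
from pathlib import Path


def collect_file_titles(folder: Path, output_file: Path) -> list[str]:
    """Write the name (without its path) of every file under `folder`,
    including subfolders, to `output_file`, one name per line."""
    titles = sorted(p.name for p in folder.rglob("*") if p.is_file())
    output_file.write_text("\n".join(titles), encoding="utf-8")
    return titles


# Demonstration on a temporary folder containing two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Mementos"            # hypothetical collection folder
    (folder / "1990s").mkdir(parents=True)
    (folder / "school-report.pdf").write_text("")
    (folder / "1990s" / "holiday-photo.jpg").write_text("")

    listing = Path(tmp) / "file_titles.txt"    # kept outside the scanned folder
    demo_titles = collect_file_titles(folder, listing)
    print(demo_titles)  # ['holiday-photo.jpg', 'school-report.pdf']
```

To use it on a real collection, call `collect_file_titles` with the collection’s folder and an output file located somewhere outside that folder; the resulting text file can then be uploaded alongside the Index.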

I tried this out with the names of 2065 files associated with the Mementos index, along with the 1.62 version of the Mementos Index (the version I used for the final set of tests in Phase 1). I retained the 5 evaluation questions used in Phase 1, but I also added three more to further test the capabilities of the AI models concerned (more of these at the end of this post). I performed tests using AnythingLLM/Mistral and Copilot. ChatGPT wasn’t used because it has a limit of around 250 KB and the file titles together with the Index would have significantly exceeded that. I also tried out Anthropic’s Claude model for the first time. The results are shown in the table below together with the results of the final test in Phase 1 for comparison purposes.

System	Phase 1 – no file titles (5 evaluation questions)	Phase 2 – with file titles (5 evaluation questions)
AnythingLLM/ Mistral	4.4 out of 10: Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.	2.3 out of 10: This was a generally very poor result. Four of the answers were virtually worthless; while 2 of the answers weren’t too bad at all. Quite apart from the contents of the answers, they were all relatively short and with none of the embellishments and rationale that seem to be a standard part of the responses of AIs like ChatGPT and Copilot.
Copilot	8.7 out of 10: Three of these five answers were exceptionally good, and all the responses were well-illustrated with rationale and plenty of examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.	8.9 out of 10: Overall this is an outstanding result. Comprehensive answers with good introductions and summaries and lots of examples with reference numbers included. Nearly everything was correct and complete – it was bemusing to read things like the series of homes we had been in or the different companies I’d worked for – an experience like being amazed that someone you are talking to seems to know your life history. The one concern is that in one answer the AI suggests it had taken facts from a document (Barlborough prospectus) which it hadn’t got access to and the only conclusion is that it hallucinated the information (Indeed it subsequently admitted it had).
Claude		8.2 out of 10: Claude’s answers overall were pretty good – but a bit patchy with scores ranging from 6 to 9.8 (this latter score being ‘first class’). The poorer scores were largely the result of misinterpretations and errors.

From these results, it’s not possible to assess if the inclusion of File Titles has made a difference: AnythingLLM’s performance was generally very poor across both tests; Copilot’s two scores were too similar to draw any conclusion; and Claude did not have any earlier test result to compare against. However, it seems reasonable to assume that if there is additional information in the file titles, over and above that in the Index, there will be a better outcome.

Since this was the first time I had used Claude, I enquired about its operational parameters, and discovered that Claude’s Context Window is 200K tokens across all models and paid plans (except for Enterprise plans). That is roughly 500 pages of text or approximately 800,000 characters. Claude caps file uploads at 30MB per file and 20 files per conversation. I found the quality of Claude’s answers to be at the same sort of high level as Copilot – though the results indicate that, on this particular set of uploaded material, Copilot has a bit of an edge. I also noticed that Claude differs from Copilot in two ways: first, Claude, unlike Copilot, tells you what it’s doing in the course of responding to a question; and, second, Claude took three times longer to respond to the same set of 8 evaluation questions (59.3 vs 19.6 seconds).
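
The figures quoted above can be related to one another with some simple arithmetic. The sketch below is only an illustration: the 0.75 words per token is the commonly quoted rule of thumb rather than an exact figure, the characters-per-token value is simply what the quoted numbers imply, and both `fits_in_context` and its `reserve_for_answer` default of 4,000 tokens are hypothetical.

```python
CONTEXT_WINDOW_TOKENS = 200_000  # context window figure quoted above
QUOTED_CHARACTERS = 800_000      # "approximately 800,000 characters"
QUOTED_PAGES = 500               # "roughly 500 pages of text"
WORDS_PER_TOKEN = 0.75           # rule of thumb, an approximation

# What the quoted figures imply.
chars_per_token = QUOTED_CHARACTERS / CONTEXT_WINDOW_TOKENS  # 4.0
words = CONTEXT_WINDOW_TOKENS * WORDS_PER_TOKEN              # 150000.0
words_per_page = words / QUOTED_PAGES                        # 300.0


def fits_in_context(text: str, reserve_for_answer: int = 4_000) -> bool:
    """Rough check that `text` fits in the context window, leaving some
    room for the answer (hypothetical reserve; real token counts vary)."""
    estimated_tokens = len(text) / chars_per_token
    return estimated_tokens + reserve_for_answer <= CONTEXT_WINDOW_TOKENS


# Claude's response time relative to Copilot's for the 8 questions.
response_ratio = 59.3 / 19.6
print(chars_per_token, words, words_per_page, round(response_ratio, 1))
```

On these assumptions the context window holds about 150,000 words at around 300 words per page, and the timing ratio rounds to 3.0, which matches the “three times longer” observation.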

The three extra evaluation questions that I added were:

Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.
Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.
Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

These extra questions were deliberately designed to explore the AIs’ ability to make more broadly-based connections and inferences, as opposed to identifying related items in a narrow subject area. This is where AI can excel – and sometimes fail – as indicated in the detailed results for each question provided below.

Question	Copilot Result	Claude Result
Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.	7 out of 10: This is a comprehensive summary of my time at Barlborough – though surprisingly lacking in info about sports activities. The answer claims to have details of the contents of the school prospectus and of letters to parents – but it didn’t and I think it just hallucinated the info.	9.8 out of 10: This is an excellent, comprehensive answer utilising all the Barlborough material in the collection so far as I could see. I didn’t spot any errors. The answer was easy to read and included all the relevant Memento Reference Numbers. First class.
Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.	9 out of 10: I’m quite bemused by this answer: I don’t really know what I was expecting – but Copilot delivered a whole bunch of unusual stuff with good rationale for their inclusion. Extraordinary!	6 out of 10: Claude identified 12 different strange entries – but actually only about 5 are really valid. The so-called errors in the birthday card collection arose simply because I entered birthday cards for different years in the same year; the so-called error in the Miss Saigon Programme date was because I also bought a special Souvenir programme with the date of the first performance. Item 5 is also covered by Item 1. So this is an interesting set of material – but not entirely correct or useful.
Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.	9 out of 10: Another very comprehensive answer replete with examples and a good summary at the end. I haven’t checked in detail whether it’s all correct but nothing is standing out as being wrong. I’m finding it hard to fault these answers.	8.8 out of 10: This is a very complete account of our time in the 1990s written in a very readable style spiced with occasional wry comments and humour. Each detail is accompanied by the relevant Reference Number. It is highly informative but does include a few misinterpretations – in particular claiming that my wife was at home during the 90s when actually she was back to teaching; that I was seconded to the IR in ‘mid-1990s’ whereas it was the beginning of the 90s; that the EDS chocolate bar had been kept whereas it was just the box it was in; that my son went to Exeter to do a 2nd degree whereas it was his first; that I was a devoted Aston Villa supporter whereas it was my other son; and that my son had a band called Phases which was stretching it since it was just he and a friend making a cassette recording. However, these are relatively small points amidst the huge array of correct facts that are presented in this highly readable piece.

Finally, to come back to the objective for Phase 2, I have concluded that file titles can certainly be included in the material delivered to an AI model, and this is certainly worth doing if they include additional material to that which is contained in an Index – or if an Index does not exist.

For completeness, the table below records the breakdown of the time I spent on Phase 2.

Activity	Time spent – hours
8 Tasks (total elapsed time: 4 days)	11
Time spent assembling files of File Titles and testing	8
Time spent researching and drafting post for pwofc.com	3

Phase 1 Summary Results

Posted on 17Apr2026 by admin

Phase 1 Description: AI support for the Memento collection’s index entries.

Conclusions

Different LLMs are trained with different amounts and types of training data, thereby making them more or less suitable for operation within specific domains. Therefore, there may be some LLMs available that are more suitable than others for use with Collection indexes.
To enable LLMs to provide answers to questions related to specific information in documents or databases, a process called Retrieval Augmented Generation (RAG) is used whereby relevant parts of the information are provided to the LLM alongside the question.
The RAG process performed by particular products typically includes a multiplicity of different techniques to identify an appropriate set of Chunks for submission with the question to the LLM. It is likely that particular combinations of these techniques will be more effective than others for use with collection indexes.
Collection Indexes may contain information which is more unevenly spread throughout their contents than a written document with an ordered set of contents. Consequently, such indexes may require more chunks to be sent to the LLM, and may require the LLM to have larger Context Windows, than ordinary documents in order to obtain satisfactory answers.
Out of the 4 models/systems I have tried out so far, Copilot is by far the best LLM for use with a collection index of around 2,500 entries.
The choices of systems, models, configurations, and Index adjustments in this first Phase were heavily influenced by ChatGPT, and therefore may have been based on some inaccuracies or hallucinations. This needs to be borne in mind when considering the findings from Phase 1 and taking them forward into subsequent phases.
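
The interplay between Chunks and the Context Window described in these conclusions can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: it ranks Chunks by simple keyword overlap, which is a stand-in for the more sophisticated techniques that real RAG products use; it estimates tokens at roughly 4 characters each; and the index entries and function names are made up for the example.

```python
import re


def split_into_chunks(index_lines: list[str], lines_per_chunk: int) -> list[str]:
    """Group consecutive index entries into Chunks."""
    return ["\n".join(index_lines[i:i + lines_per_chunk])
            for i in range(0, len(index_lines), lines_per_chunk)]


def words(text: str) -> set[str]:
    """Lower-cased set of the words and numbers in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def select_chunks(question: str, chunks: list[str], token_budget: int) -> list[str]:
    """Pick the Chunks sharing most words with the question, best first,
    skipping any Chunk that would push the total over the token budget."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    selected, used = [], 0
    for chunk in ranked:
        if not q & words(chunk):
            break  # the remaining Chunks share no words with the question
        cost = estimate_tokens(chunk)
        if used + cost <= token_budget:
            selected.append(chunk)
            used += cost
    return selected


# Hypothetical index entries, one Chunk per entry.
index_lines = [
    "0001 School report, Barlborough Hall",
    "0002 Theatre programme, Miss Saigon",
    "0003 Birthday card from parents",
    "0004 Barlborough Hall school photograph",
]
chunks = split_into_chunks(index_lines, lines_per_chunk=1)
question = "What happened at Barlborough Hall school?"
print(select_chunks(question, chunks, token_budget=50))
```

With a generous budget both Barlborough entries are sent to the LLM; with a tight budget only the first one fits. That is the effect described above: when relevant entries are scattered through an index, a small Context Window forces some of them to be left out.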

These conclusions were reached in the course of undertaking the activities summarised below.

Tasks and Timescales

Activity	Time spent – hours
70 Tasks (Total Elapsed time: 43 days)	105
Time spent addressing points in AI preparedness document	7
Time spent installing software	8
Time spent adjusting indexes and testing	52
Time spent researching and drafting posts for pwofc.com	38

Reference Sources

A Survey on Retrieval-Augmented Text Generation for Large Language Models by Yizheng Huang, Jimmy Huang, v2, arXiv:2404.10981, 23Aug2024.
I obtained lots of answers from ChatGPT to questions I asked about RAG, AI software products, and how to adjust the Memento index to get better AI results.

Concepts and Terminology encountered

LLM stands for Large Language Model. Hundreds of thousands of these have been built and new ones emerge daily, but many are variants of major models like GPT-4 from OpenAI (the company behind ChatGPT), which can be accessed online. Other LLMs can be downloaded from websites like Hugging Face, to use locally on a stand-alone computer.
RAG stands for Retrieval Augmented Generation whereby the LLM is not trained on the archive, but instead a relevant subset of the archive data is provided to the LLM alongside the question.
Chunk is a smaller, manageable segment of a larger document or dataset. In RAG, one or more Chunks are provided to the LLM along with the question.
Token is the fundamental unit of data used by LLMs. Models convert questions and Chunks into tokens, process them, produce an answer in tokens, and then convert those tokens back into text. Generally, one token equals approximately 0.75 English words.
Vector Databases store mathematical representations of Tokens as vectors – lists of numbers – in such a way that related items are clustered together thereby enabling capabilities like similarity searching.
Embedding: Chunks are broken down into Tokens which are converted into mathematical representations and embedded in a Vector Database.
Context Window is the maximum number of Tokens that can be handled by the LLM’s working memory (which contains both the input prompt and the answer). If the inputs to the LLM exceed its Context Window, then some content may be simply left out and the answer may be less complete.
Hallucination is a phenomenon where LLMs generate false, misleading, or nonsensical information confidently. It happens when an LLM predicts text based on patterns rather than facts, often due to poor training data, ambiguous prompts, or a lack of understanding of reality.

Test results – 2nd set of evaluation criteria (scores out of 10)

Index variant	AnythingLLM Mistral	AnythingLLM Mixtral	Copilot – MS LLM	ChatGPT – GPT-5.3
Less some columns + AI Context short	4.8	5
Less all extraneous fields, no Guide to the Index	2.2	2.9	8
Combined into 1 field and with Guide	3.1	4.8	8.6
Less all extraneous fields, no Guide				5.9
Less all extraneous fields and with Guide			9.1
Index with Set removed and with Guide	4.4	3.6	8.7	6.3

Enter Copilot and ChatGPT

Posted on 15Apr2026 by admin

In the previous post I explained why I changed the evaluation criteria I was using to the following:

What are the main themes that run through the entire index?
Are there distinct phases or periods in the collection?
Which items suggest important life events or transitions?
What patterns or motifs repeat across the collection?
What are the top 5 most important entries, and why?

I tried using them with a version of the Index which had extraneous fields such as ‘Physical Location’ and ‘No of digital files’ removed leaving just Ref No, Description, Item Label, Set, Facets, AI Context, and Year all in a single column. This produced a result of 4.8 out of 10 using the Mistral model and 5 out of 10 for Mixtral – and I proved to myself that I was able to apply the new evaluation criteria, albeit with rather more subjective scoring.

At this point it dawned on me that it might be worth trying to use the MS Copilot AI provided as an integral part of my Windows operating system. After making some initial enquiries through the net about its possible use with Excel, I got the impression it could only be used with MS Office 365 which I do not have (I have the home version of Office); but Copilot itself set me straight explaining that the 365 version enabled support for the functionality within the Office applications, whereas the free-to-use version of Copilot simply ingests documents up to the cloud where it works out the answer to the question it has been asked. I tried it out using a very simple version of the Mementos index with just the Ref No, Description, Set, Facets, and Year fields in their own columns, and was excited by the result which I described as follows:

“This is a strong result (8 out of 10): comprehensive answers with just a few poor interpretations, but with no obvious hallucinations. Importantly, Copilot was able to deal with the whole attached Index in one go which delivers far better answers than the RAG approach which can only work on a pre-selected subset of the material. The response time (average 4.4 seconds) was very quick considering that the whole document had to be sent to, and analysed in, the cloud before delivering its answer.”

I then tried to compare the AnythingLLM and the Copilot systems by using yet another version of the Mementos index based on suggestions from ChatGPT. It still contained just Ref No, Description, Set, Facets, and Year with all the fields in a single cell, but this time with the Description replaced by keywords derived automatically using an Excel formula supplied by ChatGPT. The results were very clear: Mistral scored 3.1, Mixtral 4.8, and Copilot 8.6. For the Copilot Test I wasn’t sure if the size of the file would cause a problem so I split it into three files of between 200kb and 300kb each, and these seemed to have been ingested successfully. However, I subsequently discovered that the files had been truncated so that only about the first 120 lines of each were ingested – making about 360 entries in all out of the total 2393 (I guess Copilot must have truncated the file I used in the previous test as well). Given this, Copilot’s 8.6 score was even more impressive.

Finding that Copilot had truncated the files prompted me to do some digging and experimentation to find out just what its limits are. I established that it will ingest up to 20 whole documents of up to about 30,000 characters/30kb file size each in a single turn, and will work across all those documents to construct its answer. Furthermore, more batches of 20 can be submitted in subsequent turns up to a total of about 150 after which ‘the conversation becomes unwieldy’ (I’m not sure what is meant by this). I duly split the Mementos Index into 17 files, all between 20,000 – 30,000 characters and conducted the test again, this time including a version of the Guide document. This produced an even better score of 9.1.
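The splitting step can be sketched in a few lines of Python. This is only an illustration of the idea, not the method actually used for the test: the function name `split_into_batches` and the stand-in index rows are hypothetical, and the 30,000-character ceiling is simply the approximate figure quoted above. Entries are kept whole, so no file ever ends part-way through a row.

```python
def split_into_batches(lines, max_chars=30_000):
    """Group whole lines into batches whose joined text stays within max_chars."""
    batches, current, size = [], [], 0
    for line in lines:
        cost = len(line) + 1  # +1 allows for the newline that separates entries
        if current and size + cost > max_chars:
            batches.append(current)  # close the current file and start a new one
            current, size = [], 0
        current.append(line)
        size += cost
    if current:
        batches.append(current)
    return batches


# Hypothetical stand-in for the real index: 2,393 rows of 120 characters each.
index_lines = [f"REF-{i:04d}, " + "x" * 110 for i in range(2393)]
batches = split_into_batches(index_lines)

largest = max(len("\n".join(batch)) for batch in batches)
print(len(batches), "files; largest is", largest, "characters")
```

Each batch could then be written to its own text file before uploading. A single row longer than the ceiling would still end up in a file of its own that exceeds the limit, so very long entries would need separate handling.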

Discovering Copilot’s capabilities made me wonder what ChatGPT could do. I discovered that the free version which I was using allows you to upload a maximum of 3 files of up to about 512Mb in any one day. However, to be sure that all the contents of all the files will be taken into account in answering a question, the total of the 3 files needs to be a maximum of about 300k characters with 250k being a safe limit. I put ChatGPT to the test with the same index used in the first Copilot test but broken into three files of between 111k and 130k characters each. This produced a result of 5.9 which was probably to be expected given that the file size limits had been exceeded.

As I was exploring the file size limits for Copilot and ChatGPT, it became clear that the ultimate determinant of the amount of text that can be reliably reasoned about at once is the AI model’s maximum ‘Context Window’. This is effectively the model’s working memory which contains both the input prompt and the answer. If the inputs exceed the Context Window then some content may be simply left out and the answer may be less complete. The table below summarises the Context Window limits for the different AI models I’ve been using.

AI System	Model	Maximum Context Window
AnythingLLM	Mistral 7B Instruct (32k context), Q4_K_M quantization	32k tokens ( ~65k–100k characters in csv files) (assuming 2-3 characters/token)
AnythingLLM	Mixtral 8x7B Instruct (32k context), Q4_0 quantization	32k tokens ( ~65k–100k characters in csv files) (assuming 2-3 characters/token)
Copilot	MS LLM (Microsoft does not publish the names of its models)	The MS LLM doesn’t have a single fixed Context Window; it’s designed around task‑adaptive context management, and hence the effective context it can use is much larger and more flexible than a single token number would suggest.
ChatGPT (free version)	GPT-5.3	The maximum Context Window is 128k tokens, but because this includes system instructions, conversation history, and output tokens, the usable Context Window is about 80k – 100k tokens (160k – 300k characters in csv files) (assuming 2-3 characters/token)
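The character-to-token arithmetic used in the table can be turned into a rough pre-flight check. This is a minimal sketch that assumes the same 2-3 characters per token for csv-style text; the function names are hypothetical, and a real tokenizer would give different counts, so the result is only an estimate.

```python
def token_range(num_chars, low_chars_per_token=2.0, high_chars_per_token=3.0):
    """Return (fewest, most) tokens a text of num_chars characters might need."""
    return num_chars / high_chars_per_token, num_chars / low_chars_per_token


def fits_window(num_chars, window_tokens):
    """Classify a text against a Context Window using the estimated token range."""
    fewest, most = token_range(num_chars)
    if most <= window_tokens:
        return "fits"        # fits even at the pessimistic 2 characters/token
    if fewest > window_tokens:
        return "too large"   # does not fit even at the optimistic 3 characters/token
    return "borderline"      # depends on how the text actually tokenises


# Example sizes checked against a 32k-token window (the Mistral/Mixtral figure above).
for size in (60_000, 80_000, 120_000):
    print(size, "characters:", fits_window(size, 32_000))
```

The window figure passed in should be the usable window rather than the headline one, since the prompt, conversation history, and answer all share the same space.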

If one wishes Index files to be considered in full by an AI system, the primary requirement is to ensure that the whole set fits into the Context Window. However, when it comes to assembling and submitting those files there are additional considerations to be taken into account, as summarised below.

AI system	Approach	Max total file size	Max No of files	Max size of each file
AnythingLLM	Embed (RAG)	No limit. The limit is the size of the Chunks the files are divided into. The default is set at 256 tokens (800-1000 characters).	No limit. Though more than 50K Chunks may cause retrieval problems.	No limit.
AnythingLLM	Attach function	Default is set to 30k-50k characters for csv files. Can be adjusted within AnythingLLM settings.	No Limit	No limit.
Copilot	Attach function	No absolute limit, but above about 150 files (4,500k characters) the conversation becomes unwieldy.	About 7 batches of up to 20 files in a single prompt.	No absolute limit, but for csv files 10,000 rows or approximately 30k characters, should work fine.
ChatGPT (free version)	Attach function	To be confident that the model will read everything and not overlook anything the total should be kept to about 250k characters.	3 within any one day. This limit can be circumvented by putting multiple files in a zip file, or by putting the contents of a file into the prompt with the question.	512Mb is the absolute limit but to be sure all the contents will be acted on, keep it to 200k-300k characters. If 3 files are to be used, keep their total to 200k-300k characters.

With these constraints in mind, I conducted a final set of tests to compare the three AI Systems – AnythingLLM’s RAG approach, Copilot, and ChatGPT. I assembled a cut-down version of the Index (by removing the MW set) which was small enough (around 274k characters, 270kb file size) to fit within the limits of all three systems. As can be seen in the table below, there was a clear winner.

System	Average Evaluation Score out of 10	Performance Summary	Average time taken to start responding (seconds)
AnythingLLM (Mistral)	4.4	Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.	6.6
AnythingLLM (Mixtral)	3.6	This score was a little bit worse than Mistral’s result despite it taking a lot longer to start printing out its answers. Two of the five answers seemed to mainly regurgitate the different sets that were described in the Index; and some of the answers were of rather dubious relevance. However, of most concern is that there were at least 4 instances of hallucinations – in two cases, Reference Numbers which don’t exist were cited.	36.4
ChatGPT (GPT-5.3)	6.3	All the responses were comprehensive with extensive rationale and good summation. However, the content wasn’t always appropriate. It wasn’t necessarily wrong but was sometimes just a little dubious. Furthermore, there was rather too much emphasis on the way the collection was organised rather than on its contents. On two occasions, examples of individual entries were specifically asked for but generalisations were delivered.	6
Copilot (MS LLM)	8.7	Three of these five answers were exceptionally good, and all the responses were well illustrated with rationale and examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.	16.4

Copilot was a clear winner with ChatGPT following on behind. Both provided substantial answers which included rationale, several examples, and a summary. In contrast the AnythingLLM RAG answers were sparse, sometimes not very good, and occasionally included complete hallucinations. The RAG approach just doesn’t seem very effective for Index material.

Embedding Parameters and new Evaluation Questions

Posted on 10Apr2026 by admin

The tests described in the previous post highlighted a number of questions which are discussed below:

The Text Chunk Overlap is described as ‘the maximum overlap of characters that occurs during chunking between two adjacent text Chunks’. Under what circumstances would an overlap occur, and if it occurs, does exactly the same set of overlap characters appear in the two Chunks?
ChatGPT’s Answer: Overlap occurs when you intentionally reuse part of the previous chunk to maintain continuity. And in typical character-based chunking, the overlap is exactly the same sequence of characters in both chunks.

1 cell = 1 Chunk: In my Mementos Index I have collapsed all the columns for a single entry (i.e. in one row) into a single cell [as advised by ChatGPT]. Hence the whole index is now just a series of cells in a single column. Will the AnythingLLM embedding function attempt to put each cell in a single Chunk provided the number of characters in a cell is less than the Text Chunk Size? What happens if the number of characters exceeds the Text Chunk Size? ChatGPT’s answer: Yes, a cell will usually become a single chunk if it fits within the chunk size. If it doesn’t, it will be split into multiple overlapping chunks, each embedded separately. If you’ve set an overlap, adjacent chunks will share overlapping text (as discussed in the answer to Question 1.)

Chunks starting with the beginning of an entry: When embedding my Mementos Index with all the entries in single cells in a single column, how can I ensure that each Chunk starts with the beginning of an entry regardless of whether the previous entry exceeded the Text Chunk Size or not?
ChatGPT’s answer: To ensure every chunk starts at the beginning of an entry: Best: ingest each cell as a separate document
Next best: split by a delimiter chunking
Avoid: relying on chunk size alone.
[I didn’t think this was a helpful answer – it seems there’s not much practically that can be done about this issue]

More Chunks than Snippet limit: What would the impact be on AI’s answer to a question which concerns more Chunks than the maximum allowable number of Context Snippets?
ChatGPT’s answer: The AI can only reason over the chunks it’s given. If relevant information exceeds the snippet limit answers become partial, biased, or occasionally incorrect. The fix is a balance of: chunk design, retrieval quality, and context size.

Duplicate Descriptions: There are two instances of the ‘Description’ field in my revised Mementos Index – one in the original ‘Description’ field, and one in the new ‘Item Label’ field which combines the ‘Reference Number’ field with the ‘Description’ field [as advised by ChatGPT]. For longer descriptions, this second instance of the ‘Description’ field may be resulting in the Maximum Number of Characters in a Chunk being exceeded. What would the impact on AI performance be if the ‘Item Label’ field was removed?
ChatGPT’s answer: Removing the duplicated Description from the Item Label will likely:
– reduce chunk splitting
– improve retrieval accuracy
– increase effective context usage
– produce more complete answers
and has little to no downside if the reference number is still present elsewhere.

Attaching files: I’m using AnythingLLM with Ollama and the Mixtral model on a laptop with Intel 9 core and 64Gb RAM. I’d prefer to have the model work on all parts of my Mementos Index with 2390 entries, rather than a limited number of Chunks. I believe that the way to do that is to not embed the index but to ATTACH the index (now adjusted so that all entries are in single cells in a single column) in .txt format. Will that work on my configuration? And roughly how long will it take for the AI to start printing out its answers?
ChatGPT’s answer: In AnythingLLM (even when using Ollama with Mixtral): Attaching a file does not bypass context limits. The model still has a maximum context window (how much text it can “see” at once). So: The model cannot process all 2,390 entries simultaneously unless they fit inside the context window. Mixtral typically supports around: ~32k tokens context (varies by version). Rough conversion: 1 token ≈ 3–4 characters 32k tokens ≈ 100k–120k characters. Your dataset reality – If you have 2,390 entries, even modest 200 characters per entry, that’s: 2390 × 200 ≈ 478,000 characters – that’s 4–5× larger than the entire context window. So, the system will truncate the file and only the first portion is seen by the model. In the worst case:
– System becomes slow or unresponsive
– Very long delay before generation starts
– Possible failure or timeout
The bottom line is:
❌ Attaching the full .txt will not let the model process all 2,390 entries
❌ It will likely truncate or slow down significantly
⏱️ You may wait 10–60+ seconds before any output—and still get incomplete results
✅ Embeddings + retrieval is the correct and scalable approach
✅ Your restructuring (one entry per cell) is exactly the right direction
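The chunking behaviour described in the first three answers above can be sketched in Python. This is a simplified illustration under stated assumptions, not AnythingLLM's actual splitter: both function names are hypothetical, and chunk sizes are counted in characters. The first function shows that, with character-based chunking, the overlap is the identical run of characters at the end of one Chunk and the start of the next. The second shows delimiter-style chunking in which whole entries are packed together, so every Chunk begins at the start of an entry.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


def chunk_by_entry(entries, chunk_size, sep="\n"):
    """Pack whole entries into chunks; an oversize entry gets a chunk to itself."""
    chunks, current = [], ""
    for entry in entries:
        candidate = entry if not current else current + sep + entry
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # close the chunk before it would overflow
            current = entry
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(chunk_with_overlap("abcdefghij", chunk_size=5, overlap=2))
print(chunk_by_entry(["aaaa", "bbbb", "cccccccccccc", "dd"], chunk_size=10))
```

In the second function an entry longer than the chunk size is left whole rather than split, which trades a strict size limit for the guarantee that no Chunk mixes the tail of one entry with the head of another.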

Having acquired these learnings, I was ready to continue testing. First, however I knew I needed a more appropriate set of evaluation criteria. So, I put this final question to ChatGPT:

“When applying AI to an index to a collection in an Excel spreadsheet, what generally applicable evaluation questions could be asked which could not be answered directly by using the spreadsheet Filter facility?”

The answer provided 4 questions in each of 11 categories, and then one extra question i.e. 45 questions in all. From these I selected the following five:

What are the main themes that run through the entire index?
Are there distinct phases or periods in the collection?
Which items suggest important life events or transitions?
What patterns or motifs repeat across the collection?
What are the top 5 most important entries, and why?

Armed with these new evaluation questions, and a better understanding of the settings and parameters associated with embedding, I set out to undertake another round of tests using the Mementos Index.

Initial Phase 1 Test Results

Posted on 27Mar2026 by admin

Over the last 10 days or so I’ve been running operational tests on the Mementos Index using the AnythingLLM front-end/embedding tool, and the Mistral and Mixtral models. The process I’ve been through has typically been a) to make some adjustment to the index (as advised by ChatGPT) to enable the AI system to produce better results; and then b) to run the 6 standard test queries with first the 4.2Gb Mistral model and then the much bigger (25.8Gb) Mixtral model.

The main changes made to the Index and what they were supposed to achieve are listed below:

Action	Changes	Benefits
Added an ‘Item Label’ column	Combined the Reference Number with the Description in the format:[Reference No] — [Description] and placed it in the new column.	It is descriptive, human readable, and AI readable. It is the primary semantic identity of each object.
Included the Guide	Changed the Guide to the Collection and its Index into a text (.txt) document and embedded it into the AnythingLLM workspace along with the Index itself.	It teaches the model how to interpret the data in the Index.
Normalised the Facets	Changes made to Facet contents to eliminate capital first letters (except for proper names), and minimise plurals.	Reduces duplication and improves matching
Added a ‘Primary Facet’ column	Took the first Facet in the Facet 1 column (which I have always regarded as the primary facet) and placed it into the new Primary Facet column.	AI tries to detect clusters, but without a dominant signal, clustering can become messy. Once a primary facet exists, AI can start discovering higher-level themes.
Added an ‘AI Context’ column	Combined Item Label with Facet 1 in the format: [Item Label]. Facets: [Facet 1 keywords] and placed it in the new column.	It combines all the key semantic signals into the same chunk of text, enabling the AI to retrieve by conceptual meaning rather than descriptive text.
Created a ‘Search Keywords’ column	Identified specific meaningful searchable words within the ‘Description’ field by using a 10-line Excel formula supplied by ChatGPT. The formula also filtered out Stopwords (such as ‘the’) and retained ALL-CAPS words (like KRS).	Providing specific words to search enables the AI to dramatically improve recall.
Restructured all the index rows into single cells	Combined all 18 columns of information for each item into a single cell using a 19-line Excel formula provided by ChatGPT. The formula was placed into the 19th column of the Index and pulled down. This column was then copied and placed into a Notepad file with a .txt extension.	Each item can now be embedded as a separate Chunk (provided Chunk size limits are not exceeded), enabling cleaner semantic matching, more accurate retrieval, more complete answers, and fewer errors.
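To illustrate what the ‘Search Keywords’ step in the table does, here is a minimal Python sketch. It is not the Excel formula supplied by ChatGPT (which is not reproduced here); the stopword list, the function name, and the sample description are all hypothetical. It drops Stopwords, keeps ALL-CAPS words such as KRS exactly as written, and lowercases everything else.

```python
import re

# Illustrative stopword list only; a real list would be considerably longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for", "with", "at", "by"}


def search_keywords(description):
    """Return the distinct searchable words in a description, in original order."""
    keywords, seen = [], set()
    for word in re.findall(r"[A-Za-z0-9']+", description):
        is_acronym = len(word) > 1 and word.isupper()  # e.g. KRS
        if word.lower() in STOPWORDS and not is_acronym:
            continue
        key = word if is_acronym else word.lower()
        if key not in seen:
            seen.add(key)
            keywords.append(key)
    return keywords


# Hypothetical description, not an actual entry from the Mementos Index.
print(search_keywords("Programme for the KRS annual dinner at the hotel"))
```

Keeping the original word order means the keywords still read roughly like the description they came from, which may help when checking the output by eye.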

I’ve also adjusted two different variables in the course of the testing I’ve carried out:

The Large Language Models (LLMs): I’ve performed all the tests using first the 4.2Gb Mistral model, and then the 25.8Gb Mixtral model.
The number of ‘Context snippets’ passed to the LLM: I started with the default of 4 Context snippets, and towards the end of the testing I upped it to 40 (NB. Usually 1 Context snippet = 1 Chunk).

The Test Questions I asked the AI are listed below, and the measurements I took are as specified in a previous post.

What items are to do with the KRS? [KRS stands for Kodak Recreational Society]
What happened on the 20th?
List the items relating to exam results
What linen is in the collection?
Are there any items relating to Aston Martin cars?
What documents are there about finances?

For each of the questions, I recorded the time it took the AI to start displaying the answer, and the percentage of correct answers produced. For each set of 6 questions, I also calculated the overall average time and overall average % correct. The summary results are in the table below.

Use of the Mistral model (with 4 Context snippets except for the last two tests)
*Progressive additions to base Index*	*Average time to start responding (seconds)*	*Average % of correct answers*	*Result Assessment*
Item Label	2.5	43%	The AI identified a number of additional items that didn’t contain the keywords in the questions – Bank accounts, Receipt slips, and pension for the Finance question; Table cloths for the Linen question; and ‘results of…’ for the exam results question.
Guide document	3.5	49%	Adding the Guide as an embedded document made very little difference. It identified all the additional items that were identified before the Guide was added – as well as one extra item – suit cloth for the linen question (which is what increased its performance to 49%)
Normalised Facets	3.3	43%	This produced a poorer overall result than the un-normalised index (43% vs 49%) though this was largely a result of a completely incorrect answer to the KRS question. Other noticeable differences were a poorer performance on the finance question and a better performance on the linen question.
Primary Facet and AI Context	4.2	38%	This produced a poorer overall result than the normalised version (43%) which in turn was poorer than the un-normalised index (49%). Of note was that it recognised ‘debt’ as being to do with finances; but it failed to spot KRS (again).
Context snippets delivered increased to 40	4	39%	This produced a result almost exactly the same as with only 4 Context snippets – 39% vs 38%. However, it did identify 20-23Aug1993 as an event on the 20th. Unfortunately, it also said that a car had been sold at the Discount Bedding Centre whereas in fact it had crashed there.
One item in a single Excel cell in .txt format	3	33%	This produced a poorer result (33%) than all the tests delivering just 4 Context Snippets; though it did pick up a Finance item which hadn’t been identified in any of the previous tests.

Use of the Mixtral model (with 4 Context snippets except for the last two tests)
*Progressive additions to base Index*	*Average time to start responding (seconds)*	*Average % of correct answers*	*Result Assessment*
Item Label	No test	No test	No test
Guide document	22.7	50%	The use of the much larger Mixtral model had very little impact. The major differences were a) the average time to start responding increased from 3 seconds to 23 seconds, and b) for the Finance question it identified 2 extra items (house sale/purchases). Very strangely, it followed its negative answer for the Aston Martin question by listing all 25 items that were in the 4 Context Snippets provided.
Normalised Facets	22	57%	This produced a slightly better result than the previous un-normalised test (57% vs 50%) due to improvements in the Finance, Linen and Exam Result questions. There was a strange result for the ‘20th’ question in which the AI listed all the context Snippets it had been provided for the exam result, linen, and Aston Martin questions.
Primary Facet and AI Context	20.7	49%	Overall, this expanded version of the Index performed pretty much the same as the earlier version without the new fields apart from the particularly noticeable 4 hallucinations the AI produced for the question about finances – complete with reference numbers already occupied by other items. This hadn’t happened before.
Chunks delivered increased to 40	23.2	38%	Mixtral’s performance was significantly worse with the 40 Context Snippets delivered than it was with just 4 Context Snippets (38% vs 49%); on top of which it completely made-up descriptions for 4 reference numbers. Interestingly, Mistral’s performance was 39% with the 40 Context Snippets, so, on this evidence, there was little advantage to be had with Mixtral despite it taking significantly longer to respond.
One item in a single Excel cell in .txt format	22.3	26%	Not only did this produce a poorer result than the previous unformatted index using the Mixtral model (26% vs 38%), but this result was actually the worst in this whole series of Phase 1 tests. It also completely hallucinated an Aston Martin purchase. Perhaps one mitigating factor is that individual items often exceeded the max number of characters in a chunk (1000) resulting in chunks often consisting of bits of one item and bits of another.

There are two conclusions to be drawn from these results. First, the AI is a lot worse at word search than an Excel spreadsheet; and, second, making various changes to the Index being tested seemed to make the AI perform worse not better. On the plus side, however, there are several instances of the AI correctly identifying relevant items without the exact words in the question being present in the item’s record. Unfortunately, a number of instances also occurred in which the AI hallucinated and made things up.

There may be good reason for these occurrences. As ChatGPT made clear to me, Indexes are often short, compressed, and keyword-based, whereas AI embeddings work best with descriptive sentences; and AI is probably at its best when asked exploratory questions. Furthermore, the questions I’ve been asking are not very prescriptive; ChatGPT suggests using the following text constructs when needed:

“Using ONLY the provided documents” to reduce hallucination
“Do not invent information” to force restraint
“If unsure, say…” to prevent guessing

Another reason why performance appears to have been poor may have been because of the Chunking parameters that were used. Most of the tests were run with Chunks that combined multiple items – which I believe is not ideal. Even when each item was confined to a single cell in the final test, the resulting Chunks were not limited to one item – probably because Chunk size was limited to 1000 characters and some items overran this limit. I need to explore this issue some more.

A second parameter that may have affected performance is the maximum number of Context Snippets that could be delivered with a question to the model. Most of the tests were limited to 4 Context Snippets, though this was increased to 40 in the final test. This clearly would have an impact on the answers provided – particularly if the number of relevant items exceeds the maximum allowable number of Context Snippets. This too requires further exploration.

This is where I’m up to in this first phase exercise. The tests I’ve done have undoubtedly increased my familiarity with the software being used and with the outputs that can be expected. However, I need to do some further investigations as described above to fully understand how things are actually working; and it seems I need to rethink what I actually want the AI system to inform me about so that I can specify more appropriate evaluation measures. I shall attempt to address these points before drawing this first phase of the overall investigation to a close.

A software stack for 64Gb

Posted on 20Mar2026 by admin

When I wrote the previous post just over a week ago, I thought I had an operational AI configuration; but it turned out not to be the case. I was getting “Ollama not responding” more often than not when I sent in my prompts, and I eventually concluded that my 8-year-old laptop just wasn’t up to the job. I had already planned to upgrade it later this year, so I decided to bring that forward and do it right away. I elected to buy an HP Omen laptop with 64Gb RAM and it duly arrived on Monday 16th March. There followed an intense period of installing applications and transferring data from my old laptop. There were some problems – there always are when you get a new machine – but by the following afternoon I was ready to restart my AI journey.

I started again by asking ChatGPT what tools and models it would suggest for my new 64Gb laptop, and it recommended LM Studio running the “Mistral 7B Instruct” model with AnythingLLM providing the front end and RAG capability. I duly downloaded and installed all this software, but hit a problem when I entered my first query: AnythingLLM is set up to provide a variety of system prompts (instructions that shape the AI’s responses and behaviour) which are not recognised by LM Studio and the Mistral model. ChatGPT first advised me to run another model, and when that didn’t work either, it suggested disabling AnythingLLM’s System Prompts. Unfortunately, AnythingLLM wouldn’t let me do that. Eventually, after about two and a half hours, I gave up trying to troubleshoot the problem and took up another of ChatGPT’s suggestions to replace LM Studio with Ollama running another Mistral model. This change only took about 15 minutes – and it worked! I started running my test questions through the new configuration and was getting answers back in 2-6 seconds – every time!

Now, throughout this process I was following ChatGPT’s guidance. I simply don’t have the knowledge to do any of this on my own, and, I must say, ChatGPT has been very clear and helpful; most answers provide options, a rationale for its suggestions, and a final summary of what should be done. However, as demonstrated by my above experiences, ChatGPT is not necessarily familiar with all aspects of all available products, nor fully aware of all potential problems. If it was, it wouldn’t have suggested the initial pairing of LM Studio with Mistral and AnythingLLM. Furthermore, when asked about functionality in a particular product it often offers various possibilities depending on which version is being used, suggesting a general knowledge rather than specific expertise. Of course, this is exactly what should be expected from an AI system. After all it is only predicting the next word based on a whole load of training data.

Let me be clear: the guidance I’ve already received from ChatGPT has enabled me to make considerable progress in a relatively short period of time, and I plan to continue to rely on it to guide my future steps in this journey – after all, I have no other option. However, I will remain alert to the possibility of its advice being incomplete, unsound or even wrong, and I will rely on my actual experiences with the software itself to draw my own conclusions.

Some simple Evaluation Metrics

Posted on 11Mar2026 by admin

Once I got the software installed and working, I started to ask questions about my Memento collection. However, the answers that came back were not very encouraging: some were incomplete and others were just completely wrong. I hoped that the suggestions ChatGPT had previously made about structuring the CSV file would improve the results, but I realised that in order to check if that was indeed the case, I would need some way of evaluating how well the system was performing – just as had been suggested in the Preparedness guidelines.

I deliberately decided not to go overboard with my evaluation metrics at this stage: what I needed was a small, simple set that could be applied relatively quickly and that would produce some numbers I could compare across different versions of the input documents and the system configuration. I came up with the following six questions (the percentages are the first set of results, as described further below):

What items are to do with the KRS? [KRS standing for Kodak Recreational Society] (0%)
What happened on the 20th? (0%)
List the items relating to exam results (25%)
What linen is in the collection? (50%)
Are there any items relating to Aston Martin cars? [there are some individuals called Martin in the Index] (100%)
What documents are there about finances? (50%)

For each question I knew that, when I opened the CSV file in Excel, I could use the filter facility to get a definitive number of items that answered the question. So, to assess the answers provided by the AI, I calculated two totals. The total of correct answers is the number of items identified by the filter that the AI reported correctly, plus the number of additional correct answers identified by the AI. The total number of answers overall is the sum of a) the number of items identified by the filter, b) the number of additional correct answers the AI identified, and c) the number of incorrect answers the AI identified. The score for the question is the first total divided by the second.

For example, for the question about listing the items relating to exam results, the filter identified 2 items (2 FILTER) but the AI didn’t report either of them (0 CORRECT). However, it did report two items in which the words exam and results appeared separately (2 ADDITIONAL CORRECT). It also reported 3 items in which just the word exam or exams appeared (3 INCORRECT), and another item concerning an assessment in which neither the word exam nor the word results is present (1 INCORRECT). This produced a result of (0+2)/(2+2+3+1) = 2/8 = 25%.
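For anyone who would rather not do this arithmetic by hand, the calculation can be sketched in a few lines of Python. This is only an illustration of the formula described above: the function name and parameter names are my own invention, not part of any tool mentioned in this post.

```python
def answer_score(filter_total, filter_found, additional_correct, incorrect):
    """Score one question as total correct answers / total answers overall.

    filter_total       -- items the Excel filter identified (the definitive set)
    filter_found       -- how many of those items the AI actually reported
    additional_correct -- correct items the AI reported that the filter missed
    incorrect          -- items the AI reported that do not answer the question
    (All names here are hypothetical, chosen for this sketch.)
    """
    total_correct = filter_found + additional_correct
    total_answers = filter_total + additional_correct + incorrect
    if total_answers == 0:
        # Nothing to find and nothing reported: treat as a zero score.
        return 0.0
    return total_correct / total_answers


# The exam results question: 2 FILTER, 0 CORRECT, 2 ADDITIONAL CORRECT,
# and 3 + 1 INCORRECT.
score = answer_score(filter_total=2, filter_found=0,
                     additional_correct=2, incorrect=3 + 1)
print(f"{score:.0%}")  # 25%
```

Note that an item the filter found but the AI missed still counts in the bottom line, so missed items pull the score down just as incorrect ones do.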

The results for this rudimentary version of the CSV file were as shown against each of the questions listed above. The overall result was 38%. While this is in no way a definitive analysis, it nevertheless will enable a comparison to be made between different implementations. I intend to use it at least for the remainder of this first phase.
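As a quick check on the overall figure, here is a minimal Python sketch that averages the six per-question percentages listed above. It assumes the overall result is a plain average of the six scores, which is consistent with the 38% reported; the short labels for the questions are just for this illustration.

```python
# Per-question results from the first run, as percentages.
# The short labels are hypothetical names for the six questions.
question_scores = {
    "KRS": 0,
    "20th": 0,
    "exam results": 25,
    "linen": 50,
    "Aston Martin": 100,
    "finances": 50,
}

# Assumption: the overall result is the plain average of the six scores.
overall = sum(question_scores.values()) / len(question_scores)
print(f"{overall:.1f}%")  # 37.5%, i.e. 38% to the nearest whole number
```

Re-running the same six questions after each change to the CSV file or the configuration gives a new set of numbers to drop into this table, which makes the before-and-after comparison straightforward.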
