Creating an Audio Book

Not all books have to be in print format. If you have a collection of audio material, then it’s perfectly feasible to create an audio book. That is what I did in 2008 when I assembled the material from two cassette tapes I had recorded when walking from Harrow to Trafalgar Square in London in 1975. I used the excellent open source, free-to-use Audacity software to break the recordings into discrete sections; stored the files on a CD; and created a cover for the CD case which included some of the (very poor) photos I had taken on the way.

In the last month I’ve undertaken a similar exercise. A friend who used to work for BBC Radio, and who knows the value of audio memoirs, had suggested that I record my ageing mother speaking about her early life at school, her experiences during World War II and in the WRENS, and the many years she spent in Singapore. I took the advice and duly recorded some 8 hours of conversation with my mother when she was in her late nineties. The recordings were made using one of the utility apps in an Apple iPhone which produced 24 very clear files in the m4a format. I decided to combine these with another 8 short files of my mother describing particular photos, which I had recorded when sorting out her photo collection back in 2012.

I wanted to accompany the audio files with a Word document summarising the contents, but realised that, in order to do that, I would need to listen to each file in detail and then create a summary paragraph. That sounded like hard work, so I decided to enlist the help of transcription services to produce written versions of each audio file. I tried to find a free service but couldn’t find one that produced good transcriptions in the volumes I required. In the end I chose to subscribe to Otter.ai’s business service for one month at a cost of $30. This enabled an unlimited number of files to be uploaded to the Otter site, which then produced transcriptions in three parts:

  • An Overview Summary paragraph (broadly good but with the occasional error)
  • An Outline in the form of separate headings containing a series of bullet points (these seemed to contain quite a lot of errors in the bullet points)
  • A Detailed Transcript (pretty good at capturing almost every word, but sometimes getting the odd word wrong. The transcript was just a series of paragraphs without the identification of who was speaking each paragraph, and sometimes not splitting the paragraphs into separate speakers).

Taking Otter’s capabilities into account, I decided to edit the Detailed Transcript to get it as correct as I could and to insert the relevant speaker at the beginning of each paragraph (which was either my mother or myself). This entailed actually listening to the audio as I edited the transcript – a much easier task, I would imagine, than just listening to the audio while making notes. I also edited the Overview Summary paragraph to ensure it was correct and informative so that I could just lift that out to include in the overall summary document; and I ignored the Outline but left it intact in the transcript files. So, I ended up with the following set of materials:

  • 1 overall ‘Introduction and Summaries’ Word document
  • 32 audio files in m4a format
  • 1 transcript Word document for each audio file
  • 1 or more jpg image files for each of the 8 audio files describing particular photos.

This package of material includes quite a significant element of our family’s history, so I wanted to give a copy of this audio book to each of my children in a form that would be accessible in the long term, and would stand a chance of not just getting lost. Consequently, I decided to use a USB memory stick to store the audio files on, and to box it up in a clear USB case designed to take a cover insert. I obtained both from River Solutions Group: 5 Twister Silver White USB Sticks 4GB at £2.25 each; and a pack of 10 Rivermedia Premium USB Case Clear 14mm for £8.50.

I produced the cover in PowerPoint and included on it a picture of my mother and a list of the 32 audio files headed by the ‘Introduction and Summaries’ file (10 on the front cover and 22 on the back). I’m hoping that this overall package will look significant enough to ensure its survival for a few decades at least.

 

A New Loft Management App

Addenda to ‘Loft Management’ [First post]   [Last Post]

Over the 20+ years I’ve been operating a Loft Index I’ve used 3 different computer applications: Filemaker on a laptop, uGrokIT on a phone, and Memento Database on a phone. In the March 2025 post describing the move to the Memento Database I noted that scrolling was hesitant and that the records moved up the screen on top of the field headers. The Memento Help person had told me that ‘this happens because Memento stores photos in their original size, which can be quite large’ and that they would ‘try to add a photo compression feature in future releases’. I decided to ignore these problems for three reasons: a) they weren’t interfering with creating new entries or searching, b) the product was free to use locally, and c) I’d gone to a lot of effort transferring the data and photos into the application. I realise now, that was a mistake.

I should have seen the writing on the wall. When I wanted to add something to the Index last week, I discovered the interface was a little different (there’d obviously been an upgrade) and my loft database just locked up when I opened it. I wasn’t even able to get it to produce a csv export file of the current contents.

So, I’ve spent a couple of days moving the index to a new app. I’m now using the Collections database from Risolvi Productions for which I’ve paid a single one-off fee of £7.99. Luckily there haven’t been too many changes to my loft index over the last year so I was able to use an October 2025 export from the Memento app for the data, and a March 2025 folder of the associated photos, and to deduce what changes had occurred since Oct25 from the first frozen screen of the malfunctioning Memento app. With the new database fully set up in the Collections app and all the associated photos loaded in as well, I’m glad to say it all seems to be working perfectly.

The lesson here is clear: if an app doesn’t seem to be working properly, don’t use it; and be prepared to pay a little rather than just always going for a free-to-use product. Obvious really….

Phase 2 results – and enter Claude

The object of Phase 2 was to explore how AI can support the combination of a collection’s Index and the titles of the associated files. The answer is straightforward: the assembled file titles need to be provided to the AI model in one or more files, in the same way that the Index is delivered. In Windows, file titles can be collected by highlighting the files you are interested in and selecting ‘Copy as path’ from the Right-Click drop down menu. These can then be pasted into either Excel or Word, and the path in front of the file name can be deleted by using the Find and Replace function: specify the path in the Find box and specify blank in the Replace box, highlight the file names, then select Replace-All. No doubt other operating systems have similar capabilities.
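For anyone comfortable with a few lines of code, the same list can be produced directly. The sketch below is a hedged Python alternative to the ‘Copy as path’ plus Find-and-Replace route; the folder path and output file name are just placeholders, not anything referred to elsewhere in this post.

```python
# A minimal sketch of collecting bare file names into a text file.
# The folder path and output file name are illustrative assumptions.
from pathlib import Path

folder = Path(r"C:\Mementos\Files")   # hypothetical folder holding the collection's files
names = sorted(p.name for p in folder.iterdir() if p.is_file())

# One title per line, ready to paste into Word/Excel or attach alongside the Index
Path("file_titles.txt").write_text("\n".join(names), encoding="utf-8")
print(f"Collected {len(names)} file titles")
```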

I tried this out with the names of 2065 files associated with the Mementos index, along with the 1.62 version of the Mementos Index (the version I used for the final set of tests in Phase 1). I retained the 5 evaluation questions used in Phase 1, but I also added three more to further test the capabilities of the AI models concerned (more of these at the end of this post). I performed tests using AnythingLLM/Mistral and Copilot. ChatGPT wasn’t used because it has a limit of around 250kb and the file titles together with the Index would have significantly exceeded that. I also tried out Anthropic’s Claude model for the first time. The results are shown below together with the results of the final test in Phase 1 for comparison purposes.

AnythingLLM/Mistral

Phase 1 – no file titles (5 evaluation questions): 4.4 out of 10. Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.

Phase 2 – with file titles (5 evaluation questions): 2.3 out of 10. This was a generally very poor result. Four of the answers were virtually worthless, while two weren’t too bad at all. Quite apart from the contents of the answers, they were all relatively short and with none of the embellishments and rationale that seem to be a standard part of the responses of AIs like ChatGPT and Copilot.

Copilot

Phase 1 – no file titles (5 evaluation questions): 8.7 out of 10. Three of these five answers were exceptionally good, and all the responses were well-illustrated with rationale and plenty of examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.

Phase 2 – with file titles (5 evaluation questions): 8.9 out of 10. Overall this is an outstanding result. Comprehensive answers with good introductions and summaries and lots of examples with reference numbers included. Nearly everything was correct and complete – it was bemusing to read things like the series of homes we had been in, or the different companies I’d worked for – an experience like being amazed that someone you are talking to seems to know your life history. The one concern is that in one answer the AI suggests it had taken facts from a document (the Barlborough prospectus) which it hadn’t got access to, and the only conclusion is that it hallucinated the information (indeed, it subsequently admitted it had).

Claude

Phase 1 – no file titles: no test (Claude was tried for the first time in Phase 2).

Phase 2 – with file titles (5 evaluation questions): 8.2 out of 10. Claude’s answers overall were pretty good – but a bit patchy, with scores ranging from 6 to 9.8 (this latter score being ‘first class’). The poorer scores were largely the result of misinterpretations and errors.

 

From these results, it’s not possible to assess if the inclusion of File Titles has made a difference: AnythingLLM’s performance was generally very poor across both tests; Copilot’s two scores were too similar to draw any conclusion; and Claude did not have any earlier test result to compare against. However, it seems reasonable to assume that if there is additional information in the file titles, over and above that in the Index, there will be a better outcome.

Since this was the first time I had used Claude, I enquired about its operational parameters, and discovered that Claude’s Context Window is 200K tokens across all models and paid plans (except for Enterprise plans). That is roughly 500 pages of text or approximately 800,000 characters. Claude caps file uploads at 30MB per file and 20 files per conversation. I found the quality of Claude’s answers to be at the same sort of high level as Copilot – though the results indicate that, on this particular set of uploaded material, Copilot has a bit of an edge. I also noticed that Claude differs from Copilot in two ways: first, Claude, unlike Copilot, tells you what it’s doing in the course of responding to a question; and, second, Claude took three times longer to respond to the same set of 8 evaluation questions (59.3 vs 19.6 seconds).

The three extra evaluation questions that I added were:

  • Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.
  • Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.
  • Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

These extra questions were deliberately designed to explore the AIs’ ability to make more broadly-based connections and inferences, as opposed to identifying related items in a narrow subject area. This is where AI can excel – and sometimes fail – as indicated in the detailed results for each question provided below.

Question: Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.

Copilot – 7 out of 10: This is a comprehensive summary of my time at Barlborough – though surprisingly lacking in info about sports activities. The answer claims to have details of the contents of the school prospectus and of letters to parents – but it didn’t, and I think it just hallucinated the info.

Claude – 9.8 out of 10: This is an excellent, comprehensive answer utilising all the Barlborough material in the collection so far as I could see. I didn’t spot any errors. The answer was easy to read and included all the relevant Memento Reference Numbers. First class.

Question: Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.

Copilot – 9 out of 10: I’m quite bemused by this answer: I don’t really know what I was expecting – but Copilot delivered a whole bunch of unusual stuff with good rationale for their inclusion. Extraordinary!

Claude – 6 out of 10: Claude identified 12 different strange entries – but actually only about 5 are really valid. The so-called errors in the birthday card collection arose simply because I entered birthday cards for different years in the same year; the so-called error in the Miss Saigon Programme date was because I also bought a special Souvenir programme with the date of the first performance. Item 5 is also covered by Item 1. So this is an interesting set of material – but not entirely correct or useful.

Question: Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.

Copilot – 9 out of 10: Another very comprehensive answer replete with examples and a good summary at the end. I haven’t checked in detail whether it’s all correct but nothing is standing out as being wrong. I’m finding it hard to fault these answers.

Claude – 8.8 out of 10: This is a very complete account of our time in the 1990s written in a very readable style spiced with occasional wry comments and humour. Each detail is accompanied by the relevant Reference Number. It is highly informative but does include a few misinterpretations – in particular, claiming that my wife was at home during the 90s when actually she was back to teaching; that I was seconded to the IR in the ‘mid-1990s’ whereas it was the beginning of the 90s; that the EDS chocolate bar had been kept whereas it was just the box it was in; that my son went to Exeter to do a 2nd degree whereas it was his first; that I was a devoted Aston Villa supporter whereas it was my other son; and that my son had a band called Phases, which was stretching it since it was just he and a friend making a cassette recording. However, these are relatively small points amidst the huge array of correct facts that are presented in this highly readable piece.

 

Finally, to come back to the objective for Phase 2, I have concluded that file titles can certainly be included in the material delivered to an AI model, and this is well worth doing if they contain additional information beyond that held in an Index – or if an Index does not exist.

For completeness, the breakdown of the time I spent on Phase 2 is recorded below.

Activity – Time spent (hours)
8 Tasks (total elapsed time: 4 days) – 11
Time spent assembling files of File Titles and testing – 8
Time spent researching and drafting post for pwofc.com – 3

Phase 1 Summary Results

Phase 1 Description: AI support for the Memento collection’s index entries.

Conclusions

  • Different LLMs are trained with different amounts and types of training data, thereby making them more or less suitable for operation within specific domains. Therefore, there may be some LLMs available that are more suitable than others for use with Collection indexes.
  • To enable LLMs to provide answers to questions related to specific information in documents or databases, a process called Retrieval Augmented Generation (RAG) is used whereby relevant parts of the information are provided to the LLM alongside the question.
  • The RAG process performed by particular products typically includes a multiplicity of different techniques to identify an appropriate set of Chunks for submission with the question to the LLM. It is likely that particular combinations of these techniques will be more effective than others for use with collection indexes.
  • Collection Indexes may contain information which is more unevenly spread throughout their contents than a written document with an ordered set of contents. Consequently, such indexes may require more chunks to be sent to the LLM, and may require the LLM to have larger Context Windows, than ordinary documents in order to obtain satisfactory answers.
  • Out of the 4 models/systems I have tried out so far, Copilot is by far the best LLM for use with a collection index of around 2,500 entries.
  • The choice of systems, models, configurations, and Index adjustments in this first Phase was heavily influenced by ChatGPT, and therefore may have been based on some inaccuracies or hallucinations. This needs to be borne in mind when considering the findings from Phase 1 and taking them forward into subsequent phases.

These conclusions were reached in the course of undertaking the activities summarised below.

Tasks and Timescales

Activity – Time spent (hours)
70 Tasks (total elapsed time: 43 days) – 105
Time spent addressing points in AI preparedness document – 7
Time spent installing software – 8
Time spent adjusting indexes and testing – 52
Time spent researching and drafting posts for pwofc.com – 38

Reference Sources

  • A Survey on Retrieval-Augmented Text Generation for Large Language Models by Yizheng Huang, Jimmy Huang, v2, arXiv:2404.10981, 23Aug2024.
  • I obtained lots of answers from ChatGPT to questions I asked about RAG, AI software products, and how to adjust the Memento index to get better AI results.

Concepts and Terminology encountered

  • LLM stands for Large Language Model. Hundreds of thousands of these have been built and new ones emerge daily, but many are variants of major models such as OpenAI’s GPT-4, which can be accessed online through ChatGPT. Other LLMs can be downloaded from websites like Hugging Face to use locally on a stand-alone computer.
  • RAG stands for Retrieval Augmented Generation whereby the LLM is not trained on the archive, but instead a relevant subset of the archive data is provided to the LLM alongside the question.
  • Chunk is a smaller, manageable segment of a larger document or dataset. In RAG, one or more Chunks are provided to the LLM along with the question.
  • Token is the fundamental unit of data used by LLMs. Models convert questions and Chunks into tokens, process them, produce an answer in tokens, and then convert those tokens back into text. Generally, one token equals approximately 0.75 English words.
  • Vector Databases store mathematical representations of Tokens as vectors – lists of numbers – in such a way that related items are clustered together thereby enabling capabilities like similarity searching.
  • Embedding is the process of converting a Chunk’s text (via its Tokens) into one of those numerical vector representations and storing it in a Vector Database – illustrated in the sketch after this list.
  • Context Window is the maximum number of Tokens that can be handled by the LLM’s working memory (which contains both the input prompt and the answer). If the inputs to the LLM exceed its Context Window, then some content may be simply left out and the answer may be less complete.
  • Hallucination is a phenomenon where LLMs generate false, misleading, or nonsensical information confidently. It happens when an LLM predicts text based on patterns rather than facts, often due to poor training data, ambiguous prompts, or a lack of understanding of reality.
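To make the Embedding and Vector Database ideas above more concrete, here is a small, hedged Python sketch using the sentence-transformers library. The model name and the example entries are purely illustrative, not the ones used in my tests; it converts a few index-style entries into vectors and finds the one closest in meaning to a question.

```python
# Toy illustration of embeddings and similarity search; model name and
# example entries are assumptions, not the actual Mementos data.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "0123 — KRS tennis tournament programme, 1978",
    "0456 — School exam results letter, 1964",
    "0789 — Embroidered table cloth from Singapore",
]
chunk_vecs = model.encode(chunks)                 # one vector per chunk

question = "What linen is in the collection?"
q_vec = model.encode([question])[0]

# Cosine similarity: entries with related meaning cluster together in vector space
sims = chunk_vecs @ q_vec / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec))
print("Closest entry:", chunks[int(np.argmax(sims))])
```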

Test results – 2nd set of evaluation criteria (scores out of 10)

Scores by Index variant (AnythingLLM/Mistral, AnythingLLM/Mixtral, Copilot – MS LLM, ChatGPT – GPT-5.3):

  • Less some columns + AI Context short – Mistral: 4.8; Mixtral: 5
  • Less all extraneous fields, no Guide to the Index – Mistral: 2.2; Mixtral: 2.9; Copilot: 8
  • Combined into 1 field and with Guide – Mistral: 3.1; Mixtral: 4.8; Copilot: 8.6
  • Less all extraneous fields, no Guide – ChatGPT: 5.9
  • Less all extraneous fields and with Guide – Copilot: 9.1
  • Index with Set removed and with Guide – Mistral: 4.4; Mixtral: 3.6; Copilot: 8.7; ChatGPT: 6.3

 

Enter Copilot and ChatGPT

In the previous post I explained why I changed the evaluation criteria I was using to the following:

  • What are the main themes that run through the entire index?
  • Are there distinct phases or periods in the collection?
  • Which items suggest important life events or transitions?
  • What patterns or motifs repeat across the collection?
  • What are the top 5 most important entries, and why?

I tried using them with a version of the Index which had extraneous fields such as ‘Physical Location’ and ‘No of digital files’ removed leaving just Ref No, Description, Item Label, Set, Facets, AI Context, and Year all in a single column. This produced a result of 4.8 out of 10 using the Mistral model and 5 out of 10 for Mixtral – and I proved to myself that I was able to apply the new evaluation criteria, albeit with rather more subjective scoring.

At this point it dawned on me that it might be worth trying to use the MS Copilot AI provided as an integral part of my Windows operating system. After making some initial enquiries through the net about its possible use with Excel, I got the impression it could only be used with MS Office 365 which I do not have (I have the home version of Office); but Copilot itself set me straight explaining that the 365 version enabled support for the functionality within the Office applications, whereas the free-to-use version of Copilot simply ingests documents up to the cloud where it works out the answer to the question it has been asked. I tried it out using a very simple version of the Mementos index with just the Ref No, Description, Set, Facets, and Year fields in their own columns, and was excited by the result which I described as follows:

“This is a strong result (8 out of 10): comprehensive answers with just a few poor interpretations, but with no obvious hallucinations. Importantly, Copilot was able to deal with the whole attached Index in one go which delivers far better answers than the RAG approach which can only work on a pre-selected subset of the material. The response time (average 4.4 seconds) was very quick considering that the whole document had to be sent to, and analysed in, the cloud before delivering its answer.”

I then tried to compare the AnythingLLM and the Copilot systems by using yet another version of the Mementos index based on suggestions from ChatGPT. It still contained just Ref No, Description, Set, Facets, and Year with all the fields in a single cell, but this time with the Description replaced by keywords derived automatically using an Excel formula supplied by ChatGPT. The results were very clear: Mistral scored 3.1, Mixtral 4.8, and Copilot 8.6. For the Copilot test I wasn’t sure if the size of the file would cause a problem so I split it into three files of between 200kb and 300kb each, and these seemed to have been ingested successfully. However, I subsequently discovered that the files had been truncated so that only about the first 120 lines of each were ingested – making about 360 entries in all out of the total 2393 (I guess Copilot must have truncated the file I used in the previous test as well). Given this, Copilot’s 8.6 score was even more impressive.

Finding that Copilot had truncated the files prompted me to do some digging and experimentation to find out just what its limits are. I established that it will ingest up to 20 whole documents of up to about 30,000 characters/30kb file size each in a single turn, and will work across all those documents to construct its answer. Furthermore, more batches of 20 can be submitted in subsequent turns up to a total of about 150 after which ‘the conversation becomes unwieldy’ (I’m not sure what is meant by this). I duly split the Mementos Index into 17 files, all between 20,000 – 30,000 characters and conducted the test again, this time including a version of the Guide document. This produced an even better score of 9.1.
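For anyone wanting to do that splitting automatically rather than by hand, the sketch below is one rough way of doing it in Python. The 30,000-character limit comes from the experiment just described, but the file names are placeholders; it breaks only at line boundaries so no index entry is cut in half.

```python
# Rough sketch: split an index file into pieces below a character limit,
# breaking only at line boundaries. File names are assumptions.
from pathlib import Path

MAX_CHARS = 30_000
lines = Path("mementos_index.txt").read_text(encoding="utf-8").splitlines(keepends=True)

parts, current = [], ""
for line in lines:
    if current and len(current) + len(line) > MAX_CHARS:
        parts.append(current)          # start a new part before the limit is breached
        current = ""
    current += line
if current:
    parts.append(current)

for i, part in enumerate(parts, 1):
    Path(f"mementos_index_part{i:02d}.txt").write_text(part, encoding="utf-8")
print(f"Wrote {len(parts)} files")
```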

Discovering Copilot’s capabilities made me wonder what ChatGPT could do. I discovered that the free version which I was using allows you to upload a maximum of 3 files of up to about 512Mb in any one day. However, to be sure that all the contents of all the files will be taken into account in answering a question, the total of the 3 files needs to be a maximum of about 300k characters, with 250k being a safe limit. I put ChatGPT to the test with the same index used in the first Copilot test but broken into three files of between 111k and 130k characters each. This produced a result of 5.9, which was probably to be expected given that the file size limits had been exceeded.

As I was exploring the file size limits for Copilot and ChatGPT, it became clear that the ultimate determinant of the amount of text that can be reliably reasoned about at once is the AI model’s maximum ‘Context Window’. This is effectively the model’s working memory which contains both the input prompt and the answer. If the inputs exceed the Context Window then some content may be simply left out and the answer may be less complete. The table below summarises the Context Window limits for the different AI models I’ve been using.

AI System – Model – Maximum Context Window

  • AnythingLLM – Mistral 7B Instruct (32k context), Q4_K_M quantization – 32k tokens (~65k–100k characters in csv files, assuming 2-3 characters/token)
  • AnythingLLM – Mixtral 8x7B Instruct (32k context), Q4_0 quantization – 32k tokens (~65k–100k characters in csv files, assuming 2-3 characters/token)
  • Copilot – MS LLM (Microsoft does not publish the names of its models) – The MS LLM doesn’t have a single fixed Context Window; it’s designed around task-adaptive context management, and hence the effective context it can use is much larger and more flexible than a single token number would suggest.
  • ChatGPT (free version) – GPT-5.3 – The maximum Context Window is 128k tokens, but because this includes system instructions, conversation history, and output tokens, the usable Context Window is about 80k–100k tokens (160k–300k characters in csv files, assuming 2-3 characters/token)
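As a quick way of judging whether a given index file is likely to fit, the sketch below applies the rough 2-3 characters-per-token conversion quoted above. The file name and the token limits are illustrative figures taken from this table, nothing more.

```python
# Rough fit check: estimate tokens from character count (2-3 chars/token)
# and compare against an assumed Context Window. Figures are illustrative.
from pathlib import Path

def rough_tokens(path: str, chars_per_token: float = 2.5) -> int:
    return int(len(Path(path).read_text(encoding="utf-8")) / chars_per_token)

CONTEXT_WINDOWS = {"Mistral/Mixtral (32k)": 32_000, "ChatGPT usable (~90k)": 90_000}

estimate = rough_tokens("mementos_index.txt")
for name, limit in CONTEXT_WINDOWS.items():
    verdict = "should fit" if estimate < limit else "likely to be truncated"
    print(f"{name}: ~{estimate:,} tokens vs {limit:,} -> {verdict}")
```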

 If one wishes Index files to be considered in full by an AI system, the primary requirement is to ensure that the whole set fits into the Context Window. However, when it comes to assembling and submitting those files there are additional considerations to be taken into account, as summarised below.

AI system – Approach – Max total file size – Max No of files – Max size of each file

  • AnythingLLM – Embed (RAG) – No limit on total size; the operative limit is the size of the Chunks the files are divided into (the default is 256 tokens, i.e. 800-1000 characters) – No limit on the number of files, though more than 50K Chunks may cause retrieval problems – No limit per file
  • AnythingLLM – Attach function – Default is 30k-50k characters for csv files; can be adjusted within AnythingLLM settings – No limit – No limit
  • Copilot – Attach function – No absolute limit, but above about 150 files (4,500k characters) the conversation becomes unwieldy – About 7 batches of up to 20 files in a single prompt – No absolute limit, but for csv files 10,000 rows or approximately 30k characters should work fine
  • ChatGPT (free version) – Attach function – To be confident that the model will read everything and not overlook anything, the total should be kept to about 250k characters – 3 within any one day (this limit can be circumvented by putting multiple files in a zip file, or by putting the contents of a file into the prompt with the question) – 512Mb is the absolute limit, but to be sure all the contents will be acted on keep each file to 200k-300k characters; if 3 files are to be used, keep their total to 200k-300k characters

With these constraints in mind, I conducted a final set of tests to compare the three AI systems – AnythingLLM’s RAG approach, Copilot, and ChatGPT. I assembled a cut-down version of the Index (by removing the MW set) which was small enough (around 274k characters, 270kb file size) to fit within the limits of all three systems. As can be seen below, there was a clear winner.

  • AnythingLLM (Mistral) – average Evaluation Score 4.4 out of 10; average 6.6 seconds to start responding. Overall, this was a disappointing result. The answers were very sparse with little rationale or summation. Some of the answers given were of dubious worth.
  • AnythingLLM (Mixtral) – average Evaluation Score 3.6 out of 10; average 36.4 seconds to start responding. This score was a little bit worse than Mistral’s result despite it taking a lot longer to start printing out its answers. Two of the five answers seemed to mainly regurgitate the different sets that were described in the Index; and some of the answers were of rather dubious relevance. However, of most concern is that there were at least 4 instances of hallucinations – in two cases, Reference Numbers which don’t exist were cited.
  • ChatGPT (GPT-5.3) – average Evaluation Score 6.3 out of 10; average 6 seconds to start responding. All the responses were comprehensive with extensive rationale and good summation. However, the content wasn’t always appropriate. It wasn’t necessarily wrong but was sometimes just a little dubious. Furthermore, there was rather too much emphasis on the way the collection was organised rather than on its contents. On two occasions, examples of individual entries were specifically asked for but generalisations were delivered.
  • Copilot (MS LLM) – average Evaluation Score 8.7 out of 10; average 16.4 seconds to start responding. Three of these five answers were exceptionally good, and all the responses were well illustrated with rationale and examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted.

Copilot was a clear winner, with ChatGPT following on behind. Both provided substantial answers which included rationale, several examples, and a summary. In contrast, the AnythingLLM RAG answers were sparse, sometimes not very good, and occasionally included complete hallucinations. The RAG approach just doesn’t seem very effective for Index material.

Embedding Parameters and new Evaluation Questions

The tests described in the previous post highlighted a number of questions which are discussed below:

  1. The Text Chunk Overlap is described as ‘the maximum overlap of characters that occurs during chunking between two adjacent text Chunks’. Under what circumstances would an overlap occur, and if it occurs, does exactly the same set of overlap characters appear in the two Chunks?
    ChatGPT’s Answer: Overlap occurs when you intentionally reuse part of the previous chunk to maintain continuity. And in typical character-based chunking, the overlap is exactly the same sequence of characters in both chunks.
  2. 1 cell = 1 Chunk: In my Mementos Index I have collapsed all the columns for a single entry (i.e. in one row) into a single cell [as advised by ChatGPT]. Hence the whole index is now just a series of cells in a single column. Will the AnythingLLM embedding function attempt to put each cell in a single Chunk provided the number of characters in a cell is less than the Text Chunk Size? What happens if the number of characters exceeds the Text Chunk Size?
    ChatGPT’s answer: Yes, a cell will usually become a single chunk if it fits within the chunk size. If it doesn’t, it will be split into multiple overlapping chunks, each embedded separately. If you’ve set an overlap, adjacent chunks will share overlapping text (as discussed in the answer to Question 1). A small sketch of both behaviours appears after this list.
  3. Chunks starting with the beginning of an entry: When embedding my Mementos Index with all the entries in single cells in a single column, how can I ensure that each Chunk starts with the beginning of an entry regardless of whether the previous entry exceeded the Text Chunk Size or not?
    ChatGPT’s answer: To ensure every chunk starts at the beginning of an entry: Best: ingest each cell as a separate document
    Next best: split on a delimiter when chunking
    Avoid: relying on chunk size alone.
    [I didn’t think this was a helpful answer – it seems there’s not much practically that can be done about this issue]
  4. More Chunks than Snippet limit: What would the impact be on the AI’s answer to a question which concerns more Chunks than the maximum allowable number of Context Snippets?
    ChatGPT’s answer: The AI can only reason over the chunks it’s given. If relevant information exceeds the snippet limit answers become partial, biased, or occasionally incorrect. The fix is a balance of: chunk design, retrieval quality, and context size.
  5. Duplicate Descriptions: There are two instances of the ‘Description’ field in my revised Mementos Index – one in the original ‘Description’ field, and one in the new ‘Item Label’ field which combines the ‘Reference Number’ field with the ‘Description’ field [as advised by ChatGPT]. For longer descriptions, this second instance of the ‘Description’ field may be resulting in the Maximum Number of Characters in a Chunk being exceeded. What would the impact on AI performance be if the ‘Item Label’ field was removed?
    ChatGPT’s answer: Removing the duplicated Description from the Item Label will likely:
    – reduce chunk splitting
    – improve retrieval accuracy
    – increase effective context usage
    – produce more complete answers
    and has little to no downside if the reference number is still present elsewhere.
  6. Attaching files: I’m using AnythingLLM with Ollama and the Mixtral model on a laptop with Intel 9 core and 64Gb RAM. I’d prefer to have the model work on all parts of my Mementos Index with 2390 entries, rather than a limited number of Chunks. I believe that the way to do that is to not embed the index but to ATTACH the index (now adjusted so that all entries are in single cells in a single column) in .txt format. Will that work on my configuration? And roughly how long will it take for the AI to start printing out its answers?
    ChatGPT’s answer: In AnythingLLM (even when using Ollama with Mixtral), attaching a file does not bypass context limits. The model still has a maximum context window (how much text it can “see” at once), so it cannot process all 2,390 entries simultaneously unless they fit inside that window. Mixtral typically supports around 32k tokens of context (varies by version). Rough conversion: 1 token ≈ 3–4 characters, so 32k tokens ≈ 100k–120k characters. Your dataset reality: with 2,390 entries at even a modest 200 characters per entry, that’s 2390 × 200 ≈ 478,000 characters – 4–5× larger than the entire context window. So, the system will truncate the file and only the first portion is seen by the model. In the worst case:
    – System becomes slow or unresponsive
    – Very long delay before generation starts
    – Possible failure or timeout
    The bottom line is:
    ❌ Attaching the full .txt will not let the model process all 2,390 entries
    ❌ It will likely truncate or slow down significantly
    ⏱️ You may wait 10–60+ seconds before any output—and still get incomplete results
    ✅ Embeddings + retrieval is the correct and scalable approach
    ✅ Your restructuring (one entry per cell) is exactly the right direction
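The sketch referred to in the answer to Question 2 is below: a minimal Python illustration, with made-up sizes and file names rather than AnythingLLM’s real defaults, of the two behaviours being discussed – plain character chunking with an overlap, and entry-based chunking that keeps each index entry whole unless it exceeds the chunk size.

```python
# Minimal illustration of chunking with overlap (Question 1) and of keeping
# one index entry per chunk where possible (Questions 2 and 3).
# Sizes and file name are illustrative, not AnythingLLM's actual settings.
def chunk_with_overlap(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap        # the last `overlap` characters repeat in the next chunk
    return chunks

def chunk_by_entry(entries: list[str], size: int = 1000) -> list[str]:
    chunks = []
    for entry in entries:
        if len(entry) <= size:
            chunks.append(entry)                            # one entry, one chunk
        else:
            chunks.extend(chunk_with_overlap(entry, size))  # oversized entries get split
    return chunks

entries = open("mementos_index.txt", encoding="utf-8").read().splitlines()
print(len(chunk_by_entry(entries)), "chunks")
```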

Having acquired these learnings, I was ready to continue testing. First, however, I knew I needed a more appropriate set of evaluation criteria. So, I put this final question to ChatGPT:

“When applying AI to an index to a collection in an Excel spreadsheet, what generally applicable evaluation questions could be asked which could not be answered directly by using the spreadsheet Filter facility?”

The answer provided 4 questions in each of 11 categories, and then one extra question i.e. 45 questions in all. From these I selected the following five:

  • What are the main themes that run through the entire index?
  • Are there distinct phases or periods in the collection?
  • Which items suggest important life events or transitions?
  • What patterns or motifs repeat across the collection?
  • What are the top 5 most important entries, and why?

Armed with these new evaluation questions, and a better understanding of the settings and parameters associated with embedding, I set out to undertake another round of tests using the Mementos Index.

Initial Phase 1 Test Results

Over the last 10 days or so I’ve been running operational tests on the Mementos Index using the AnythingLLM front-end/embedding tool, and the Mistral and Mixtral models. The process I’ve been through has typically been a) to make some adjustment to the index (as advised by ChatGPT) to enable the AI system to produce better results; and then b) to run the 6 standard test queries with first the 4.2Gb Mistral model and then the much bigger (25.8Gb) Mixtral model.

The main changes made to the Index and what they were supposed to achieve are listed below:

Action – Changes made – Benefits

  • Added an ‘Item Label’ column – Combined the Reference Number with the Description in the format [Reference No] — [Description] and placed it in the new column. – It is descriptive, human readable, and AI readable. It is the primary semantic identity of each object.
  • Included the Guide – Changed the Guide to the Collection and its Index into a text (.txt) document and embedded it into the AnythingLLM workspace along with the Index itself. – It teaches the model how to interpret the data in the Index.
  • Normalised the Facets – Changes made to Facet contents to eliminate capital first letters (except for proper names), and minimise plurals. – Reduces duplication and improves matching.
  • Added a ‘Primary Facet’ column – Took the first Facet in the Facet 1 column (which I have always regarded as the primary facet) and placed it into the new Primary Facet column. – AI tries to detect clusters, but without a dominant signal, clustering can become messy. Once a primary facet exists, AI can start discovering higher-level themes.
  • Added an ‘AI Context’ column – Combined the Item Label with Facets 1 in the format [Item Label]. Facets: [Facet 1 keywords] and placed it in the new column. – It combines all the key semantic signals into the same chunk of text, enabling the AI to retrieve by conceptual meaning rather than by descriptive text.
  • Created a ‘Search Keywords’ column – Identified specific meaningful searchable words within the ‘Description’ field by using a 10-line Excel formula supplied by ChatGPT. The formula also filtered out Stopwords (such as ‘the’) and retained ALL-CAPS words (like KRS). – Providing specific words to search enables the AI to dramatically improve recall.
  • Restructured all the index rows into single cells – Combined all 18 columns of information for each item into a single cell using a 19-line Excel formula provided by ChatGPT. The formula was placed into the 19th column of the Index and pulled down. This column was then copied and placed into a Notepad file with a .txt extension. – Each item can now be embedded as a separate Chunk (provided Chunk size limits are not exceeded), enabling cleaner semantic matching, more accurate retrieval, more complete answers, and fewer errors. (A sketch of this restructuring, done in code rather than with Excel formulas, follows below.)
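I did all of this with Excel formulas, but for anyone more comfortable with a script, the sketch below shows roughly the same transformations in Python/pandas. The column names are guesses based on the fields mentioned above, and the output format is illustrative rather than a copy of the formulas I actually used.

```python
# Hedged sketch of the restructuring done above with Excel formulas:
# build 'Item Label' and 'AI Context', then collapse each row to one line
# of text for a .txt file. Column names are assumptions.
import pandas as pd
from pathlib import Path

df = pd.read_csv("mementos_index.csv")

df["Item Label"] = df["Reference Number"].astype(str) + " — " + df["Description"]
df["AI Context"] = df["Item Label"] + ". Facets: " + df["Facet 1"].fillna("")

# One entry per line, all fields run together, ready to embed as separate Chunks
lines = df.astype(str).apply(" | ".join, axis=1)
Path("mementos_index.txt").write_text("\n".join(lines), encoding="utf-8")
```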

I’ve also adjusted two different variables in the course of the testing I’ve carried out:

  • The Large Language Models (LLMs): I’ve performed all the tests using first the 4.2Gb Mistral model, and then the 25.8 Gb Mixtral model.
  • The number of ‘Context snippets’ passed to the LLM: I started with the default of 4 Context snippets, and towards the end of the testing I upped it to 40 (NB. Usually 1 Context snippet = 1 Chunk).

The Test Questions I asked the AI are listed below, and the measurements I took are as specified in a previous post.

  • What items are to do with the KRS? [KRS stands for Kodak Recreational Society]
  • What happened on the 20th?
  • List the items relating to exam results
  • What linen is in the collection?
  • Are there any items relating to Aston Martin cars?
  • What documents are there about finances?

For each of the questions, I recorded the time it took the AI to start displaying the answer, and the percentage of correct answers produced. For each set of 6 questions, I also calculated the overall average time and overall average % correct. The summary results are in the table below.

 Use of the Mistral model (with 4 Context snippets except for the last two tests)
Progressive additions to base Index – average time to start responding – average % of correct answers – result assessment:

  • Item Label – 2.5 seconds – 43%: The AI identified a number of additional items that didn’t contain the keywords in the questions – Bank accounts, Receipt slips, and pension for the Finance question; Table cloths for the Linen question; and ‘results of…’ for the exam results question.
  • Guide document – 3.5 seconds – 49%: Adding the Guide as an embedded document made very little difference. It identified all the additional items that were identified before the Guide was added – as well as one extra item, suit cloth, for the linen question (which is what increased its performance to 49%).
  • Normalised Facets – 3.3 seconds – 43%: This produced a poorer overall result than the un-normalised index (43% vs 49%), though this was largely a result of a completely incorrect answer to the KRS question. Other noticeable differences were a poorer performance on the finance question and a better performance on the linen question.
  • Primary Facet and AI Context – 4.2 seconds – 38%: This produced a poorer overall result than the normalised version (43%) which in turn was poorer than the un-normalised index (49%). Of note was that it recognised ‘debt’ as being to do with finances; but it failed to spot KRS (again).
  • Context snippets delivered increased to 40 – 4 seconds – 39%: This produced a result almost exactly the same as with only 4 Context snippets – 39% vs 38%. However, it did identify 20-23Aug1993 as an event on the 20th. Unfortunately, it also said that a car had been sold at the Discount Bedding Centre whereas in fact it had crashed there.
  • One item in a single Excel cell in .txt format – 3 seconds – 33%: This produced a poorer result (33%) than all the tests delivering just 4 Context Snippets; though it did pick up a Finance item which hadn’t been identified in any of the previous tests.

 

Use of the Mixtral model (with 4 Context snippets except for the last two tests)
Progressive additions to base Index – average time to start responding – average % of correct answers – result assessment:

  • Item Label – no test carried out.
  • Guide document – 22.7 seconds – 50%: The use of the much larger Mixtral model had very little impact. The major differences were a) the average time to start responding increased from 3 seconds to 23 seconds, and b) for the Finance question it identified 2 extra items (house sale/purchases). Very strangely, it followed its negative answer for the Aston Martin question by listing all 25 items that were in the 4 Context Snippets provided.
  • Normalised Facets – 22 seconds – 57%: This produced a slightly better result than the previous un-normalised test (57% vs 50%) due to improvements in the Finance, Linen and Exam Result questions. There was a strange result for the ’20th’ question in which the AI listed all the Context Snippets it had been provided for the exam result, linen, and Aston Martin questions.
  • Primary Facet and AI Context – 20.7 seconds – 49%: Overall, this expanded version of the Index performed pretty much the same as the earlier version without the new fields, apart from the particularly noticeable 4 hallucinations the AI produced for the question about finances – complete with reference numbers already occupied by other items. This hadn’t happened before.
  • Context snippets delivered increased to 40 – 23.2 seconds – 38%: Mixtral’s performance was significantly worse with the 40 Context Snippets delivered than it was with just 4 Context Snippets (38% vs 49%); on top of which it completely made up descriptions for 4 reference numbers. Interestingly, Mistral’s performance was 39% with the 40 Context Snippets, so, on this evidence, there was little advantage to be had with Mixtral despite it taking significantly longer to respond.
  • One item in a single Excel cell in .txt format – 22.3 seconds – 26%: Not only did this produce a poorer result than the previous unformatted index using the Mixtral model (26% vs 38%), but this result was actually the worst in this whole series of Phase 1 tests. It also completely hallucinated an Aston Martin purchase. Perhaps one mitigating factor is that individual items often exceeded the max number of characters in a chunk (1000), resulting in chunks often consisting of bits of one item and bits of another.

There are two conclusions to be drawn from these results. First, the AI is a lot worse at word search than an Excel spreadsheet; and, second, making various changes to the Index being tested seemed to make the AI perform worse not better. On the plus side, however, there are several instances of the AI correctly identifying relevant items without the exact words in the question being present in the item’s record. Unfortunately, a number of instances also occurred in which the AI hallucinated and made things up.

There may be good reason for these occurrences. As ChatGPT made clear to me, Indexes are often short, compressed, and keyword-based, whereas AI embeddings work best with descriptive sentences; and AI is probably at its best when asked exploratory questions. Furthermore, the questions I’ve been asking are not very prescriptive; ChatGPT suggests using the following text constructs when needed (a sketch of folding them into a prompt follows the list):

  • “Using ONLY the provided documents” to reduce hallucination
  • “Do not invent information” to force restraint
  • “If unsure, say…” to prevent guessing
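One simple way to apply these is to wrap every question in a fixed template before it is sent to the model along with the retrieved Context Snippets. The wording below is illustrative only, not a tested prompt.

```python
# Illustrative prompt template folding in the suggested constraints.
PROMPT_TEMPLATE = """Using ONLY the provided documents, answer the question.
Do not invent information. If unsure, say "I cannot find this in the index."

Documents:
{snippets}

Question: {question}
"""

def build_prompt(snippets: list[str], question: str) -> str:
    return PROMPT_TEMPLATE.format(snippets="\n\n".join(snippets), question=question)

print(build_prompt(["0123 — KRS tennis tournament programme, 1978"],
                   "What items are to do with the KRS?"))
```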

Another reason why performance appears to have been poor may have been because of the Chunking parameters that were used. Most of the tests were run with Chunks that combined multiple items – which I believe is not ideal. Even when each item was confined to a single cell in the final test, the resulting Chunks were not limited to one item – probably because Chunk size was limited to 1000 characters and some items overran this limit. I need to explore this issue some more.

A second parameter that may have affected performance is the maximum number of Context Snippets that could be delivered with a question to the model. Most of the tests were limited to 4 Context Snippets, though this was increased to 40 in the final test. This clearly would have an impact on the answers provided – particularly if the number of relevant items exceeds the maximum allowable number of Context Snippets. This too requires further exploration.

This is where I’m up to in this first phase exercise. The tests I’ve done have undoubtedly increased my familiarity with the software being used and with the outputs that can be expected. However, I need to do some further investigations as described above to fully understand how things are actually working; and it seems I need to rethink what I actually want the AI system to inform me about so that I can specify more appropriate evaluation measures. I shall attempt to address these points before drawing this first phase of the overall investigation to a close.

A software stack for 64Gb

When I wrote the previous post just over a week ago, I thought I had an operational AI configuration; but it turned out not to be the case. I was getting “Ollama not responding” more often than not when I sent in my prompts, and I eventually concluded that my 8-year-old laptop just wasn’t up to the job. I had already planned to upgrade it later this year, so I decided to bring that forward and do it right away. I elected to buy an HP Omen laptop with 64Gb RAM and it duly arrived on Monday 16th March. There followed an intense period of installing applications and transferring data from my old laptop. There were some problems – there always are when you get a new machine – but by the following afternoon I was ready to restart my AI journey.

I started again by asking ChatGPT what tools and models it would suggest for my new 64Gb laptop, and it recommended LM Studio running the “Mistral 7B Instruct” model with AnythingLLM providing the front end and RAG capability. I duly downloaded and installed all this software, but hit a problem when I entered my first query: AnythingLLM is set up to provide a variety of system prompts (instructions that shape the AI’s responses and behaviour) which are not recognised by LM Studio and the Mistral model. ChatGPT first advised me to run another model, and when that didn’t work either, it suggested disabling AnythingLLM’s System Prompts. Unfortunately, AnythingLLM wouldn’t let me do that. Eventually, after about two and a half hours, I gave up trying to troubleshoot the problem and took up another of ChatGPT’s suggestions to replace LM Studio with Ollama running another Mistral model. This change only took about 15 minutes – and it worked! I started running my test questions through the new configuration and was getting answers back in 2-6 seconds – every time!

Now, throughout this process I was following ChatGPT’s guidance. I simply don’t have the knowledge to do any of this on my own, and, I must say, ChatGPT has been very clear and helpful; most answers provide options, a rationale for its suggestions, and a final summary of what should be done. However, as demonstrated by my above experiences, ChatGPT is not necessarily familiar with all aspects of all available products, nor fully aware of all potential problems. If it was, it wouldn’t have suggested the initial pairing of LM Studio with Mistral and AnythingLLM. Furthermore, when asked about functionality in a particular product it often offers various possibilities depending on which version is being used, suggesting a general knowledge rather than specific expertise. Of course, this is exactly what should be expected from an AI system. After all it is only predicting the next word based on a whole load of training data.

Let me be clear: the guidance I’ve already received from ChatGPT has enabled me to make considerable progress in a relatively short period of time; and I plan to continue to rely on it to guide my future steps in this journey – after all I have no other option. However, I will remain alert to the possibility of its advice being incomplete or unsound or even wrong; and I will rely on the actual experiences I have with the software itself, to draw my own conclusions.

Some simple Evaluation Metrics

Once I got the software installed and working, I started to ask questions about my Memento collection. However, the answers that came back were not very encouraging: some were incomplete and others were just completely wrong. I hoped that the suggestions ChatGPT had previously made about structuring the CSV file would improve the results, but I realised that in order to check if that was indeed the case, I would need some way of evaluating how well the system was performing – just as had been suggested in the Preparedness guidelines.

I deliberately decided not to go overboard with my evaluation metrics at this stage: what I needed was a small simple set that could be applied relatively quickly and that would produce some numbers I could compare across different versions of the input documents and the system configuration. I came up with the following six questions (the percentages are the first set of results as described further on below):

  • What items are to do with the KRS? [KRS standing for Kodak Recreational Society] (0%)
  • What happened on the 20th? (0%)
  • List the items relating to exam results (25%)
  • What linen is in the collection? (50%)
  • Are there any items relating to Aston Martin cars? [there are some individuals called Martin in the Index] (100%)
  • What documents are there about finances? (50%)

For each question I knew that, when I opened the CSV file in Excel, I could use the filter facility to get a definitive number of items that answered the question. So, to assess the answers provided by AI I added the number of the items identified by the Filter that the AI had reported correctly, to the number of additional correct answers identified by the AI (Total correct answers); and then divided that number by the sum of a) the number of answers identified by the filter, b) the number of additional correct answers the AI identified, and c) the number of incorrect answers that the AI identified (Total number of answers overall).

For example, for the question about listing the items relating to exam results, the filter identified 2 items (2 FILTER) but the AI didn’t report either of them (0 CORRECT). However, it did report two items in which the words exam and results appeared separately (2 ADDITIONAL CORRECT). It also reported 3 items in which just the word exam or exams appeared (3 INCORRECT), and another item concerning an assessment in which neither the word exam nor results is present (1 INCORRECT). This produced a result of (0+2)/(2+2+3+1) = 2/8 = 25%.
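For reference, the same calculation written out as a small Python function (the figures in the example are the exam-results case above):

```python
# The scoring rule above as a function: correct answers (filtered items the
# AI reported plus additional correct ones) divided by all possible answers
# (filtered items plus additional correct plus incorrect ones).
def question_score(filter_reported: int, additional_correct: int,
                   filter_total: int, incorrect: int) -> float:
    total_correct = filter_reported + additional_correct
    total_answers = filter_total + additional_correct + incorrect
    return total_correct / total_answers

# Exam results example: 0 of the 2 filtered items reported, 2 additional
# correct items, 4 incorrect items -> (0 + 2) / (2 + 2 + 4) = 25%
print(f"{question_score(0, 2, 2, 4):.0%}")
```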

The results for this rudimentary version of the CSV file were as shown against each of the questions listed above. The overall result was 38%. While this is in no way a definitive analysis, it nevertheless will enable a comparison to be made between different implementations. I intend to use it at least for the remainder of this first phase.

Installing the AI Software

The software that ChatGPT had advised me to install was called AnythingLLM (LLM standing for, of course, Large Language Model). I duly opened its website (https://anythingllm.com/) and selected the ‘Download for desktop’ box. It took about 13 minutes to download the 370Mb programme and install it. On opening the application, I was told that a) it had selected the best model (Qwen3Vision2BInstruct) for my hardware (a 9-year-old Windows 11 laptop with 8Gb of RAM); b) that I was to use the LanceDB Vector Database; c) that these settings could be modified anytime; and d) that the model, chats, vectors and document text would all be stored privately on that instance of AnythingLLM and would only be accessible on that device.

I uploaded my mementos CSV file, and then got a Warning Message saying something like ‘the workspace is using 102,000 of its 3,500 available tokens. Choose how you want to proceed – Continue anyway or Embed file’. The associated help page says:

“Continue Anyway: Will continue to add the document full text to the chat window, but data will be lost in this process as AnythingLLM will automatically prune the context to fit. You should not do this as you will experience inaccurate LLM behaviour.

Embed: Will embed the document (RAG) and add it to the workspace. This will allow the LLM to use the document as a source of information, but it will not be able to use the full text of the document. This option may or may not be visible depending on your permissions on the workspace.”

I selected Embed and that process took about 5 minutes.

I then asked some questions such as ‘Show me items relating to xxx” but consistently got the reply: ‘Could not respond to message. Model requires more system memory (8.7 GiB) than is available (5.4 GiB)’. I asked ChatGPT how much memory AnythingLLM needed to run and it said that the model that had been selected wasn’t suitable for a machine with 8Gb of RAM. Instead, it said I should use the Ollama phi3:mini model and advised how to obtain it. However, that didn’t work, so ChatGPT said that meant that Ollama wasn’t on my machine and that I needed to download and install that first, and provided me a website link to do so.

I installed Ollama (which included installing a redistributable version of Visual C++) and restarted my laptop as instructed by ChatGPT. Then I installed phi3:mini by typing ‘ollama pull phi3:mini’ at the Command Line prompt as instructed by ChatGPT. Then I had to select the Ollama LLM in AnythingLLM by going into the Workspace settings (the little rose icon) and selecting Ollama. While in that Ollama section of the drop-down there was another settings rose icon which had to be clicked to access the Ollama-specific settings screen in which ChatGPT had advised me to place ‘http://localhost:11434’ in the ‘Ollama Base URL’ field.

At this point I noted that ‘phi3:mini’ was correctly displayed in the ‘Ollama Model’ field. Having done all this I was able to select the Mementos CSV document in AnythingLLM and have it embedded; after which I was able to ask some questions and to get some answers.

Now, what was going on in all of this? This is what I discovered after having a few exchanges with ChatGPT:

The software that is needed for AI has three layers:

  1. The model (Phi-3 Mini) → the “brain” that generates text.
  2. Ollama → the engine that runs the model locally.
  3. AnythingLLM → the interface and workflow tool you interact with.

The Model (Phi-3 Mini) is the AI brain – the trained neural network that produces answers by:

  • Predicting the next token in text
  • Generating responses to prompts
  • Using knowledge learned during training.

Ollama is the system that runs the AI model on the computer. It does the following:

  • Loads the model
  • Sends your prompt to it
  • Streams the response back
  • Performs other functions such as loading models into RAM and providing an API server for other applications.

AnythingLLM is the user interface and AI workflow platform which does the following:

  • Connects to Ollama
  • Sends prompts to the model
  • Displays responses
  • Manages workspaces
  • Embeds knowledge sources (RAG)
  • Keeps chat history
  • Handles embeddings and document search.

This architecture is flexible: it enables different products to be switched into any of the three components while keeping the other two the same.
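To make that layering concrete: the same Ollama engine that AnythingLLM talks to can be called directly over its local API, which is why the front end and the model can be swapped independently. The sketch below assumes Ollama is running locally with the phi3:mini model pulled; it is an illustration of the engine layer, not anything AnythingLLM itself requires you to do.

```python
# Hedged sketch: calling the Ollama engine directly at its local API,
# bypassing the AnythingLLM front end. Assumes Ollama is running and
# phi3:mini has been pulled.
import requests

reply = requests.post(
    "http://localhost:11434/api/generate",      # the same Base URL given to AnythingLLM
    json={
        "model": "phi3:mini",                   # the model layer
        "prompt": "In one sentence, what is a vector database?",
        "stream": False,                        # return the whole answer in one response
    },
    timeout=120,
)
print(reply.json()["response"])
```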

AnythingLLM embeds knowledge sources by retrieving information in external documents, encoding that information and placing it into a Vector Database (in this case, LanceDB). The steps it takes to do this are:

  • It turns text into numbers. This is known as Embedding in which text is converted into numerical vectors that represent meaning. For example, “The cat sat on the mat” becomes something like: [0.213, -0.551, 0.889, …]
  • Storing the embeddings in the Vector Database along with Chunks of text and References to the original documents. For example, Vector: [0.213, -0.551, 0.889 …]; Text: “The mitochondria is the powerhouse of the cell.”; Source: biology_notes.pdf.

When you ask a question, AnythingLLM finds the closest matching vectors. So, overall, the RAG (Retrieval-Augmented Generation) process works as follows:

Step 1 — Document chunking. When you upload a document to AnythingLLM, the document is split into small sections (PDF → paragraphs → chunks).

Step 2 — Embedding creation. Each chunk is converted to a vector.

Step 3 — Storage. The vectors are stored in the vector database.

Step 4 — Question time. When you ask a question such as “What causes tides?”, AnythingLLM:

  • converts the question into an embedding
  • searches the vector database
  • retrieves the most similar chunks.

Step 5 — Context injection. The retrieved chunks are added to the question and the combined prompt is sent to Phi-3 Mini.

Step 6 — AI generates answer. Now the model answers using your documents, not just training data; and the answer is shown in AnythingLLM.
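Steps 4 and 5 in miniature: the toy sketch below shows the retrieval and context-injection stages. In reality AnythingLLM and LanceDB do this work with real embeddings; the vectors here are made-up numbers standing in for them.

```python
# Toy version of Steps 4-5: pick the chunk whose vector is closest to the
# question vector, then prepend it to the prompt sent to the model.
# The vectors are pretend values, not real embeddings.
import numpy as np

chunk_texts = ["Tides are caused mainly by the Moon's gravity.",
               "The mitochondria is the powerhouse of the cell."]
chunk_vecs = np.array([[0.9, 0.1],
                       [0.1, 0.9]])             # pretend embeddings of the chunks
question = "What causes tides?"
question_vec = np.array([0.8, 0.2])             # pretend embedding of the question

sims = chunk_vecs @ question_vec                # similarity of each chunk to the question
top = sims.argsort()[::-1][:1]                  # indices of the most similar chunk(s)

prompt = "Context:\n" + "\n".join(chunk_texts[i] for i in top) + f"\n\nQuestion: {question}"
print(prompt)                                   # this combined prompt is what goes to Phi-3 Mini
```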

Unfortunately, even after I had successfully installed a working configuration, the system occasionally could not respond, and the results were often incomplete or incorrect. In ChatGPT’s opinion these problems are most likely being caused by:

  • The model temporarily exhausting RAM: I only have 8Gb of RAM and the AI components probably take up between 5 and 6GB (AnythingLLM app ~500 MB, Vector database + embeddings ~200–400 MB, Phi-3 Mini loaded in Ollama ~3–4 GB, Prompt + generation buffers ~0.5–1 GB). Adding the 2-3 GB taken up by the Operating System, means that every now and again I’m probably hitting a 7-8 GB total resulting in the Operating System occasionally swapping memory to disk, Ollama pausing or timing out, and AnythingLLM reporting “Ollama not responding”.
  • Poor data structure in the CSV file: RAG systems like AnythingLLM perform best with short natural-language passages, not table rows. When CSVs are embedded directly, column relationships are lost, retrieval becomes noisy, and the model guesses incorrectly. Hence ChatGPT’s suggestions in the previous post for how to refine the contents of the CSV file.
  • A limitation in the Phi-3 Mini model’s capability. Phi-3 Mini is optimized for low-memory environments, while larger models typically provide higher completeness and accuracy.

Still, I do at least now have a working system which I can experiment with – even if I have to occasionally put up with “Ollama not responding”. The following post documents how well or otherwise this initial configuration performed.