Phase 1 Description: AI support for the Memento collection’s index entries.
Conclusions
- Different LLMs are trained on different amounts and types of data, making them more or less suitable for specific domains. Some LLMs are therefore likely to be more suitable than others for use with Collection indexes.
- To enable LLMs to answer questions about specific information in documents or databases, a process called Retrieval Augmented Generation (RAG) is used, whereby relevant parts of the information are supplied to the LLM alongside the question.
- The RAG process in a given product typically combines several techniques to identify an appropriate set of Chunks to submit with the question to the LLM. Some combinations of these techniques are likely to be more effective than others for use with Collection indexes.
- Collection indexes may contain information that is spread more unevenly than in a written document with an ordered structure. Consequently, such indexes may need more Chunks to be sent to the LLM, and may need an LLM with a larger Context Window, in order to obtain satisfactory answers.
- Of the four models/systems I have tried so far, Copilot is by far the best LLM for use with a collection index of around 2,500 entries.
- The choice of systems, models, configurations, and Index adjustments in this first Phase was heavily influenced by ChatGPT, and therefore may have been based on some inaccuracies or hallucinations. This needs to be borne in mind when considering the findings from Phase 1 and taking them forward into subsequent phases.
These conclusions were reached in the course of undertaking the activities summarised below.
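The RAG retrieval step referred to in the conclusions can be sketched as follows. This is a toy illustration, not the method used by any of the products tested: the `embed` function here is just a bag-of-words counter standing in for a real embedding model, and real systems rank Chunks with learned vector embeddings rather than word counts.

```python
import math
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words frequency vector (a stand-in only)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]

def build_prompt(question, chunks):
    """Assemble the prompt sent to the LLM: retrieved chunks plus the question."""
    context = "\n".join(retrieve(question, chunks))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

The key point the sketch shows is that the LLM never sees the whole index: only the retrieved Chunks travel with the question, which is why the retrieval technique matters so much.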
Tasks and Timescales
| Activity | Time spent (hours) |
| --- | --- |
| 70 tasks in total (elapsed time: 43 days) | 105 |
| Addressing points in the AI preparedness document | 7 |
| Installing software | 8 |
| Adjusting indexes and testing | 52 |
| Researching and drafting posts for pwofc.com | 38 |
Reference Sources
- A Survey on Retrieval-Augmented Text Generation for Large Language Models, Yizheng Huang and Jimmy Huang, arXiv:2404.10981v2, 23 Aug 2024.
- Numerous answers obtained from ChatGPT to questions I asked about RAG, AI software products, and how to adjust the Memento index to get better AI results.
Concepts and Terminology encountered
- LLM stands for Large Language Model. Hundreds of thousands of these have been built and new ones emerge daily, but many are variants of a few major models, such as OpenAI’s GPT-4, which can be accessed online through ChatGPT. Other LLMs can be downloaded from websites like Hugging Face for local use on a stand-alone computer.
- RAG stands for Retrieval Augmented Generation, whereby the LLM is not trained on the archive; instead, a relevant subset of the archive data is provided to the LLM alongside the question.
- A Chunk is a smaller, manageable segment of a larger document or dataset. In RAG, one or more Chunks are provided to the LLM along with the question.
- A Token is the fundamental unit of data used by LLMs. Models convert questions and Chunks into tokens, process them, produce an answer in tokens, and then convert those tokens back into text. Generally, one token corresponds to approximately 0.75 English words.
- Vector Databases store mathematical representations of Tokens as vectors – lists of numbers – in such a way that related items are clustered together thereby enabling capabilities like similarity searching.
- Embedding is the process of converting text (Chunks broken down into Tokens) into those mathematical vector representations and storing them in a Vector Database.
- Context Window is the maximum number of Tokens that can be handled by the LLM’s working memory (which contains both the input prompt and the answer). If the inputs to the LLM exceed its Context Window, some content may simply be left out and the answer may be less complete.
- Hallucination is a phenomenon where LLMs generate false, misleading, or nonsensical information confidently. It happens when an LLM predicts text based on patterns rather than facts, often due to poor training data, ambiguous prompts, or a lack of understanding of reality.
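To make the Chunk, Token, and Context Window terms concrete, here is a small sketch in Python. The 0.75 words-per-token ratio is the rule of thumb quoted above; the 200-word chunk size and the 500-token answer reserve are illustrative values of my own, not settings taken from any of the products tested.

```python
def chunk_text(text, max_words=200):
    """Split a document into fixed-size word chunks (a naive chunking scheme)."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def estimate_tokens(text):
    """Rough token count using the ~0.75-words-per-token rule of thumb."""
    return round(len(text.split()) / 0.75)

def fit_chunks(ranked_chunks, context_window, reserve_for_answer=500):
    """Take ranked chunks in order until the Context Window budget is used up."""
    budget = context_window - reserve_for_answer
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # anything beyond this point would be silently dropped
        selected.append(chunk)
        used += cost
    return selected
```

This also illustrates the conclusion about unevenly spread information: if the relevant entries are scattered, more Chunks are needed, and a small Context Window forces some of them to be dropped.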
Test results – 2nd set of evaluation criteria (scores out of 10)
| Index variant \ Type of LLM | AnythingLLM – Mistral | AnythingLLM – Mixtral | Copilot – MS LLM | ChatGPT – GPT-5.3 |
| --- | --- | --- | --- | --- |
| Less some columns + AI Context short | 4.8 | 5 | | |
| Less all extraneous fields, no Guide to the Index | 2.2 | 2.9 | 8 | |
| Combined into 1 field and with Guide | 3.1 | 4.8 | 8.6 | |
| Less all extraneous fields, no Guide | 5.9 | | | |
| Less all extraneous fields and with Guide | 9.1 | | | |
| Index with Set removed and with Guide | 4.4 | 3.6 | 8.7 | 6.3 |