Initial Phase 1 Test Results

Over the last 10 days or so I’ve been running operational tests on the Mementos Index using the AnythingLLM front-end/embedding tool, and the Mistral and Mixtral models. The process I’ve been through has typically been a) to make some adjustment to the index (as advised by ChatGPT) to enable the AI system to produce better results; and then b) to run the 6 standard test queries with first the 4.2GB Mistral model and then the much bigger (25.8GB) Mixtral model.

The main changes made to the Index and what they were supposed to achieve are listed below:

  • Added an ‘Item Label’ column
    Changes: Combined the Reference Number with the Description in the format [Reference No] — [Description] and placed the result in the new column.
    Benefits: It is descriptive, human readable, and AI readable. It is the primary semantic identity of each object.
  • Included the Guide
    Changes: Converted the Guide to the Collection and its Index into a text (.txt) document and embedded it into the AnythingLLM workspace alongside the Index itself.
    Benefits: It teaches the model how to interpret the data in the Index.
  • Normalised the Facets
    Changes: Edited the Facet contents to eliminate capital first letters (except for proper names) and minimise plurals.
    Benefits: Reduces duplication and improves matching.
  • Added a ‘Primary Facet’ column
    Changes: Took the first Facet in the Facet 1 column (which I have always regarded as the primary facet) and placed it into the new Primary Facet column.
    Benefits: AI tries to detect clusters, but without a dominant signal, clustering can become messy. Once a primary facet exists, the AI can start discovering higher-level themes.
  • Added an ‘AI Context’ column
    Changes: Combined the Item Label with Facet 1 in the format [Item Label]. Facets: [Facet 1 keywords] and placed the result in the new column.
    Benefits: It combines all the key semantic signals into the same chunk of text, enabling the AI to retrieve by conceptual meaning rather than by descriptive text alone.
  • Created a ‘Search Keywords’ column
    Changes: Identified specific meaningful searchable words within the ‘Description’ field using a 10-line Excel formula supplied by ChatGPT. The formula also filtered out stopwords (such as ‘the’) and retained ALL-CAPS words (like KRS).
    Benefits: Providing specific words to search enables the AI to dramatically improve recall.
  • Restructured all the index rows into single cells
    Changes: Combined all 18 columns of information for each item into a single cell using a 19-line Excel formula provided by ChatGPT. The formula was placed in the 19th column of the Index and pulled down; this column was then copied into a Notepad file with a .txt extension.
    Benefits: Each item can now be embedded as a separate Chunk (provided Chunk size limits are not exceeded), enabling cleaner semantic matching, more accurate retrieval, more complete answers, and fewer errors.
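The column-building steps above can be sketched in plain Python. This is only an illustration of the logic, not the actual Excel formulas ChatGPT supplied; the column names, the semicolon facet separator, and the small stopword list are my own assumptions.

```python
# Illustrative sketch of the new-column logic described above.
# Column names, the ";" facet separator, and STOPWORDS are assumptions.

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "on", "at"}

def build_columns(item):
    """Derive the new columns for one index row (a plain dict)."""
    # 'Item Label': [Reference No] — [Description]
    item["Item Label"] = f"{item['Reference No']} — {item['Description']}"
    # 'Primary Facet': the first facet listed in Facet 1
    facets = [f.strip() for f in item["Facet 1"].split(";") if f.strip()]
    item["Primary Facet"] = facets[0] if facets else ""
    # 'AI Context': label plus facet keywords in one span of text
    item["AI Context"] = f"{item['Item Label']}. Facets: {', '.join(facets)}"
    # 'Search Keywords': meaningful Description words; drop stopwords,
    # keep ALL-CAPS tokens such as KRS
    words = item["Description"].replace(",", " ").split()
    keep = [w for w in words if w.isupper() or w.lower() not in STOPWORDS]
    item["Search Keywords"] = " ".join(keep)
    return item

def flatten_row(item, columns):
    """Combine the chosen columns into a single text line, one item per chunk."""
    return " | ".join(f"{c}: {item.get(c, '')}" for c in columns)

row = build_columns({
    "Reference No": "KRS-042",                              # hypothetical item
    "Description": "Programme of the KRS annual dinner",
    "Facet 1": "social events; dining; societies",
})
print(flatten_row(row, ["Item Label", "Primary Facet", "AI Context", "Search Keywords"]))
```

The final `flatten_row` call mirrors the last action in the table: everything about one item ends up on one line, so each line can become its own Chunk when the .txt file is embedded.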

I’ve also adjusted two different variables in the course of the testing I’ve carried out:

  • The Large Language Models (LLMs): I’ve performed all the tests using first the 4.2GB Mistral model, and then the 25.8GB Mixtral model.
  • The number of ‘Context snippets’ passed to the LLM: I started with the default of 4 Context snippets, and towards the end of the testing I upped it to 40 (NB. Usually 1 Context snippet = 1 Chunk).

The Test Questions I asked the AI are listed below, and the measurements I took are as specified in a previous post.

  • What items are to do with the KRS? [KRS stands for Kodak Recreational Society]
  • What happened on the 20th?
  • List the items relating to exam results
  • What linen is in the collection?
  • Are there any items relating to Aston Martin cars?
  • What documents are there about finances?

For each of the questions, I recorded the time it took the AI to start displaying the answer, and the percentage of correct answers produced. For each set of 6 questions, I also calculated the overall average time and overall average % correct. The summary results are in the table below.
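The measurement loop can be sketched as follows. `query_model` is a hypothetical stand-in for the call to AnythingLLM (which I timed by hand), and the scoring here simply checks which expected reference numbers appear in the answer text; the test data is invented for illustration.

```python
# A hedged sketch of the measurements: time to start responding and % correct.
# query_model() is a hypothetical stand-in for the AnythingLLM front-end.
import time

def score_answer(answer, expected_items):
    """Percentage of expected reference numbers that appear in the answer."""
    if not expected_items:
        return 0.0
    hits = sum(1 for ref in expected_items if ref in answer)
    return 100.0 * hits / len(expected_items)

def run_test(question, expected_items, query_model):
    start = time.perf_counter()
    answer = query_model(question)   # non-streaming call: timing covers the full answer
    delay = time.perf_counter() - start
    return delay, score_answer(answer, expected_items)

# Illustrative use with a canned 'model'
fake_model = lambda q: "Found KRS-041 and KRS-042 in the index."
delay, pct = run_test("What items are to do with the KRS?",
                      ["KRS-041", "KRS-042", "KRS-099"], fake_model)
print(f"{delay:.2f}s to respond, {pct:.0f}% correct")
```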

Use of the Mistral model (with 4 Context snippets except for the last two tests):

  • Item Label (avg. 2.5s to start responding; 43% correct): The AI identified a number of additional items that didn’t contain the keywords in the questions – Bank accounts, Receipt slips, and pension for the Finance question; Table cloths for the Linen question; and ‘results of…’ for the exam results question.
  • Guide document (avg. 3.5s; 49% correct): Adding the Guide as an embedded document made very little difference. It identified all the additional items found before the Guide was added, plus one extra item – suit cloth for the linen question (which is what increased its performance to 49%).
  • Normalised Facets (avg. 3.3s; 43% correct): This produced a poorer overall result than the un-normalised index (43% vs 49%), though this was largely the result of a completely incorrect answer to the KRS question. Other noticeable differences were a poorer performance on the finance question and a better performance on the linen question.
  • Primary Facet and AI Context (avg. 4.2s; 38% correct): This produced a poorer overall result than the normalised version (43%), which in turn was poorer than the un-normalised index (49%). Of note was that it recognised ‘debt’ as being to do with finances; but it failed to spot KRS (again).
  • Context snippets delivered increased to 40 (avg. 4s; 39% correct): This produced a result almost exactly the same as with only 4 Context snippets – 39% vs 38%. However, it did identify 20-23Aug1993 as an event on the 20th. Unfortunately, it also said that a car had been sold at the Discount Bedding Centre whereas in fact it had crashed there.
  • One item in a single Excel cell in .txt format (avg. 3s; 33% correct): This produced a poorer result (33%) than all the tests delivering just 4 Context Snippets, though it did pick up a Finance item which hadn’t been identified in any of the previous tests.

 

Use of the Mixtral model (with 4 Context snippets except for the last two tests):

  • Item Label: no test run.
  • Guide document (avg. 22.7s; 50% correct): The use of the much larger Mixtral model had very little impact. The major differences were a) the average time to start responding increased from 3 seconds to 23 seconds, and b) for the Finance question it identified 2 extra items (house sale/purchases). Very strangely, it followed its negative answer for the Aston Martin question by listing all 25 items that were in the 4 Context Snippets provided.
  • Normalised Facets (avg. 22s; 57% correct): This produced a slightly better result than the previous un-normalised test (57% vs 50%) due to improvements in the Finance, Linen and Exam Result questions. There was a strange result for the ’20th’ question, in which the AI listed all the Context Snippets it had been provided for the exam result, linen, and Aston Martin questions.
  • Primary Facet and AI Context (avg. 20.7s; 49% correct): Overall, this expanded version of the Index performed much the same as the earlier version without the new fields, apart from the particularly noticeable 4 hallucinations the AI produced for the question about finances – complete with reference numbers already occupied by other items. This hadn’t happened before.
  • Chunks delivered increased to 40 (avg. 23.2s; 38% correct): Mixtral’s performance was significantly worse with 40 Context Snippets delivered than with just 4 (38% vs 49%); on top of which it completely made up descriptions for 4 reference numbers. Interestingly, Mistral’s performance was 39% with 40 Context Snippets, so, on this evidence, there was little advantage to be had with Mixtral despite it taking significantly longer to respond.
  • One item in a single Excel cell in .txt format (avg. 22.3s; 26% correct): Not only did this produce a poorer result than the previous unformatted index using the Mixtral model (26% vs 38%), but this was actually the worst result in the whole series of Phase 1 tests. It also completely hallucinated an Aston Martin purchase. One mitigating factor may be that individual items often exceeded the maximum number of characters in a chunk (1000), resulting in chunks often consisting of bits of one item and bits of another.

There are two conclusions to be drawn from these results. First, the AI is a lot worse at word search than an Excel spreadsheet; and, second, making various changes to the Index being tested seemed to make the AI perform worse, not better. On the plus side, however, there were several instances of the AI correctly identifying relevant items even though the exact words in the question were not present in the item’s record. Unfortunately, there were also a number of instances in which the AI hallucinated and invented material.

There may be good reasons for these occurrences. As ChatGPT made clear to me, Indexes are often short, compressed, and keyword-based, whereas AI embeddings work best with descriptive sentences; and AI is probably at its best with exploratory questions. Furthermore, the questions I’ve been asking are not very prescriptive; ChatGPT suggests using the following text constructs when needed:

  • “Using ONLY the provided documents” to reduce hallucination
  • “Do not invent information” to force restraint
  • “If unsure, say…” to prevent guessing
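These guard phrases could be combined into a single prompt prefix along the following lines. The exact wording of the completed “If unsure” phrase and the function structure are my own illustration, assembled from ChatGPT’s suggestions above.

```python
# Sketch of combining the suggested anti-hallucination phrases into one prompt.
# The completed "If unsure" wording is an assumption, not ChatGPT's exact text.

GUARDS = [
    "Using ONLY the provided documents, answer the question.",  # reduce hallucination
    "Do not invent information.",                               # force restraint
    "If unsure, say 'I cannot find this in the index.'",        # prevent guessing
]

def build_prompt(question: str) -> str:
    """Prefix the question with the guard instructions."""
    return "\n".join(GUARDS) + "\n\nQuestion: " + question

print(build_prompt("What linen is in the collection?"))
```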

Another reason why performance appears to have been poor may be the Chunking parameters that were used. Most of the tests were run with Chunks that combined multiple items – which I believe is not ideal. Even when each item was confined to a single cell in the final test, the resulting Chunks were not limited to one item – probably because Chunk size was capped at 1000 characters and some items overran this limit. I need to explore this issue some more.
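The boundary problem can be illustrated with a toy chunker. A naive fixed-size splitter slices the embedded .txt file every 1000 characters regardless of where items start and end, whereas splitting on newlines keeps one item per chunk so long as no single item exceeds the limit. The item sizes below are invented; this is not how AnythingLLM’s chunker is actually implemented, just a sketch of the effect.

```python
# Toy illustration of why a 1000-character chunk limit can split items.

def fixed_size_chunks(text, limit=1000):
    """Chunking that ignores item boundaries: items can straddle chunks."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]

def one_item_per_chunk(text, limit=1000):
    """One chunk per line; flags items that overrun the limit."""
    chunks, oversized = [], []
    for line in text.splitlines():
        if len(line) > limit:
            oversized.append(line)   # would still be cut by the embedder
        chunks.append(line[:limit])
    return chunks, oversized

# Five invented items of ~400 characters each
items = "\n".join(f"Item {n}: " + "x" * 400 for n in range(5))
print(len(fixed_size_chunks(items)))        # fewer chunks, but items straddle boundaries
chunks, too_big = one_item_per_chunk(items)
print(len(chunks), len(too_big))            # one clean chunk per item
```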

A second parameter that may have affected performance is the maximum number of Context Snippets that could be delivered with a question to the model. Most of the tests were limited to 4 Context Snippets, though this was increased to 40 in the final test. This clearly would have an impact on the answers provided – particularly if the number of relevant items exceeds the maximum allowable number of Context Snippets. This too requires further exploration.

This is where I’m up to in this first phase exercise. The tests I’ve done have undoubtedly increased my familiarity with the software being used and with the outputs that can be expected. However, I need to do some further investigations as described above to fully understand how things are actually working; and it seems I need to rethink what I actually want the AI system to inform me about so that I can specify more appropriate evaluation measures. I shall attempt to address these points before drawing this first phase of the overall investigation to a close.
