Embedding Parameters and new Evaluation Questions

The tests described in the previous post highlighted a number of questions which are discussed below:

  1. The Text Chunk Overlap is described as ‘the maximum overlap of characters that occurs during chunking between two adjacent text Chunks’. Under what circumstances would an overlap occur, and if it occurs, does exactly the same set of overlap characters appear in the two Chunks?
    ChatGPT’s Answer: Overlap occurs when you intentionally reuse part of the previous chunk to maintain continuity. And in typical character-based chunking, the overlap is exactly the same sequence of characters in both chunks.
  2. 1 cell = 1 Chunk: In my Mementos Index I have collapsed all the columns for a single entry (i.e. in one row) into a single cell [as advised by ChatGPT]. Hence the whole index is now just a series of cells in a single column. Will the AnythingLLM embedding function attempt to put each cell in a single Chunk provided the number of characters in a cell is less than the Text Chunk Size? What happens if the number of characters exceeds the Text Chunk Size?
    ChatGPT’s answer: Yes, a cell will usually become a single chunk if it fits within the chunk size. If it doesn’t, it will be split into multiple overlapping chunks, each embedded separately. If you’ve set an overlap, adjacent chunks will share overlapping text (as discussed in the answer to Question 1).
  3. Chunks starting with the beginning of an entry: When embedding my Mementos Index with all the entries in single cells in a single column, how can I ensure that each Chunk starts with the beginning of an entry regardless of whether the previous entry exceeded the Text Chunk Size or not?
    ChatGPT’s answer: To ensure every chunk starts at the beginning of an entry:
    Best: ingest each cell as a separate document
    Next best: split on a delimiter before chunking
    Avoid: relying on chunk size alone.
    [I didn’t think this was a helpful answer – it seems there’s not much practically that can be done about this issue]
  4. More Chunks than Snippet limit: What would the impact be on the AI’s answer to a question which concerns more Chunks than the maximum allowable number of Context Snippets?
    ChatGPT’s answer: The AI can only reason over the chunks it’s given. If relevant information exceeds the snippet limit, answers become partial, biased, or occasionally incorrect. The fix is a balance of chunk design, retrieval quality, and context size.
  5. Duplicate Descriptions: There are two instances of the ‘Description’ field in my revised Mementos Index – one in the original ‘Description’ field, and one in the new ‘Item Label’ field which combines the ‘Reference Number’ field with the ‘Description’ field [as advised by ChatGPT]. For longer descriptions, this second instance of the ‘Description’ field may be resulting in the Maximum Number of Characters in a Chunk being exceeded. What would the impact on AI performance be if the ‘Item Label’ field was removed?
    ChatGPT’s answer: Removing the duplicated Description from the Item Label will likely:
    – reduce chunk splitting
    – improve retrieval accuracy
    – increase effective context usage
    – produce more complete answers
    and has little to no downside if the reference number is still present elsewhere.
  6. Attaching files: I’m using AnythingLLM with Ollama and the Mixtral model on a laptop with an Intel Core i9 and 64 GB RAM. I’d prefer to have the model work on all parts of my Mementos Index with 2,390 entries, rather than a limited number of Chunks. I believe that the way to do that is to not embed the index but to ATTACH the index (now adjusted so that all entries are in single cells in a single column) in .txt format. Will that work on my configuration? And roughly how long will it take for the AI to start printing out its answers?
    ChatGPT’s answer: In AnythingLLM (even when using Ollama with Mixtral), attaching a file does not bypass context limits. The model still has a maximum context window (how much text it can “see” at once), so it cannot process all 2,390 entries simultaneously unless they fit inside that window. Mixtral typically supports a context of ~32k tokens (varies by version). Rough conversion: 1 token ≈ 3–4 characters, so 32k tokens ≈ 100k–120k characters. Your dataset reality: with 2,390 entries at even a modest 200 characters per entry, that’s 2,390 × 200 ≈ 478,000 characters – 4–5× larger than the entire context window. So the system will truncate the file, and only the first portion is seen by the model. In the worst case:
    – System becomes slow or unresponsive
    – Very long delay before generation starts
    – Possible failure or timeout
    The bottom line is:
    ❌ Attaching the full .txt will not let the model process all 2,390 entries
    ❌ It will likely truncate or slow down significantly
    ⏱️ You may wait 10–60+ seconds before any output—and still get incomplete results
    ✅ Embeddings + retrieval is the correct and scalable approach
    ✅ Your restructuring (one entry per cell) is exactly the right direction
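To make the answers to Questions 1–3 concrete, here is a minimal sketch of character-based chunking with overlap. This is an illustration of the general technique ChatGPT describes, not AnythingLLM’s actual implementation; `chunk_text` and `chunk_entries` are hypothetical helper names, and the delimiter-first variant shows why splitting on entry boundaries guarantees each chunk starts at the beginning of an entry.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.
    Each chunk after the first repeats the final `overlap`
    characters of the previous chunk (the same sequence in both)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        # Step forward by less than a full chunk, creating the overlap.
        start += chunk_size - overlap
    return chunks


def chunk_entries(entries_text, chunk_size=1000, overlap=100):
    """Delimiter-first chunking: split on newlines (one entry per
    line) before applying size-based chunking, so every chunk
    starts at the beginning of an entry."""
    return [chunk
            for entry in entries_text.split("\n")
            for chunk in chunk_text(entry, chunk_size, overlap)]
```

With a 1,000-character chunk size and 100-character overlap, a 2,500-character text yields three chunks, and the last 100 characters of one chunk are identical to the first 100 of the next.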
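The back-of-envelope arithmetic in ChatGPT’s answer to Question 6 can be checked in a few lines. The figures below are the same assumed values from that answer (an average entry length of 200 characters, a 3–4 characters-per-token rule of thumb, and a ~32k-token window), not measured properties of Mixtral or my index.

```python
# Feasibility check: will the whole index fit in one context window?
ENTRIES = 2390
AVG_CHARS_PER_ENTRY = 200      # assumed average from the answer
CHARS_PER_TOKEN = 3.5          # rough rule of thumb (3–4 chars/token)
CONTEXT_TOKENS = 32_000        # typical Mixtral window (varies by version)

total_chars = ENTRIES * AVG_CHARS_PER_ENTRY      # 478,000 characters
total_tokens = total_chars / CHARS_PER_TOKEN     # ~137k tokens
windows_needed = total_tokens / CONTEXT_TOKENS   # ~4.3 windows

print(f"{total_chars:,} chars ≈ {total_tokens:,.0f} tokens "
      f"≈ {windows_needed:.1f} context windows")
```

The result, roughly four to five full context windows, matches the “4–5× larger than the entire context window” conclusion above.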

Having acquired these learnings, I was ready to continue testing. First, however, I knew I needed a more appropriate set of evaluation criteria. So, I put this final question to ChatGPT:

“When applying AI to an index to a collection in an Excel spreadsheet, what generally applicable evaluation questions could be asked which could not be answered directly by using the spreadsheet Filter facility?”

The answer provided 4 questions in each of 11 categories, and then one extra question, i.e. 45 questions in all. From these I selected the following five:

  • What are the main themes that run through the entire index?
  • Are there distinct phases or periods in the collection?
  • Which items suggest important life events or transitions?
  • What patterns or motifs repeat across the collection?
  • What are the top 5 most important entries, and why?

Armed with these new evaluation questions, and a better understanding of the settings and parameters associated with embedding, I set out to undertake another round of tests using the Mementos Index.
