{"id":2771,"date":"2026-04-10T07:39:42","date_gmt":"2026-04-10T06:39:42","guid":{"rendered":"https:\/\/www.pwofc.com\/ofc\/?p=2771"},"modified":"2026-04-10T08:00:47","modified_gmt":"2026-04-10T07:00:47","slug":"embedding-parameters-and-new-evaluation-questions","status":"publish","type":"post","link":"https:\/\/www.pwofc.com\/ofc\/2026\/04\/10\/embedding-parameters-and-new-evaluation-questions\/","title":{"rendered":"Embedding Parameters and new Evaluation Questions"},"content":{"rendered":"<p>The tests described in the previous post highlighted a number of questions which are discussed below:<\/p>\n<ol>\n<li><strong>The Text Chunk Overlap<\/strong> is described as \u2018the maximum overlap of characters that occurs during chunking between two adjacent text Chunks\u2019. Under what circumstances would an overlap occur, and if it occurs, does exactly the same set of overlap characters appear in the two Chunks?<br \/>\n<u>ChatGPT\u2019s Answer<\/u>: Overlap occurs when you intentionally reuse part of the previous chunk to maintain continuity. And in typical character-based chunking, the overlap is exactly the same sequence of characters in both chunks.<\/li>\n<\/ol>\n<ol start=\"2\">\n<li><strong>1 cell = 1 Chunk:<\/strong> In my Mementos Index I have collapsed all the columns for a single entry (i.e. in one row) into a single cell [as advised by ChatGPT]. Hence the whole index is now just a series of cells in a single column. Will the AnythingLLM embedding function attempt to put each cell in a single Chunk provided the number of characters in a cell is less than the Text Chunk Size? And what happens if the number of characters exceeds the Text Chunk Size?<br \/>\n<u>ChatGPT\u2019s answer:<\/u> Yes, a cell will usually become a single chunk if it fits within the chunk size. If it doesn\u2019t, it will be split into multiple overlapping chunks, each embedded separately. 
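The character-based chunking with overlap described in these two answers can be sketched in a few lines of Python. This is a minimal illustration of the general technique, not AnythingLLM's actual implementation, and the chunk_size and overlap values are arbitrary examples:

```python
def chunk_text(text, chunk_size=1000, overlap=20):
    """Split text into chunks of at most chunk_size characters.

    Each chunk after the first begins with the final `overlap`
    characters of the previous chunk, so the shared characters
    are exactly the same sequence in both adjacent chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # final chunk reached the end of the text
        start += chunk_size - overlap  # step back by `overlap` characters
    return chunks

# A cell shorter than chunk_size stays as one chunk;
# a longer one is split into overlapping pieces:
print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# -> ['abcd', 'cdef', 'efgh', 'ghij']
```

Note that with this scheme a chunk boundary can fall anywhere inside an entry, which is exactly the problem raised in Question 3 below.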
If you\u2019ve set an overlap, adjacent chunks will share overlapping text (as discussed in the answer to Question 1).<\/li>\n<\/ol>\n<ol start=\"3\">\n<li><strong>Chunks starting with the beginning of an entry:<\/strong> When embedding my Mementos Index with all the entries in single cells in a single column, how can I ensure that each Chunk starts with the beginning of an entry, regardless of whether the previous entry exceeded the Text Chunk Size or not?<br \/>\n<u>ChatGPT\u2019s answer:<\/u> To ensure every chunk starts at the beginning of an entry:<br \/>\nBest: ingest each cell as a separate document<br \/>\nNext best: use delimiter-based chunking<br \/>\nAvoid: relying on chunk size alone.<br \/>\n[I didn&#8217;t think this was a helpful answer &#8211; it seems there&#8217;s not much that can practically be done about this issue]<\/li>\n<\/ol>\n<ol start=\"4\">\n<li><strong>More Chunks than Snippet limit:<\/strong> What would the impact be on the AI\u2019s answer to a question which concerns more Chunks than the maximum allowable number of Context Snippets?<br \/>\n<u>ChatGPT\u2019s answer:<\/u> The AI can only reason over the chunks it\u2019s given. If relevant information exceeds the snippet limit, answers become partial, biased, or occasionally incorrect. The fix is a balance of chunk design, retrieval quality, and context size.<\/li>\n<\/ol>\n<ol start=\"5\">\n<li><strong>Duplicate Descriptions:<\/strong> There are two instances of the \u2018Description\u2019 field in my revised Mementos Index \u2013 one in the original \u2018Description\u2019 field, and one in the new \u2018Item Label\u2019 field which combines the \u2018Reference Number\u2019 field with the \u2018Description\u2019 field [as advised by ChatGPT]. For longer descriptions, this second instance of the \u2018Description\u2019 field may be resulting in the Maximum Number of Characters in a Chunk being exceeded. 
What would the impact on AI performance be if the \u2018Item Label\u2019 field were removed?<br \/>\n<u>ChatGPT\u2019s answer:<\/u> Removing the duplicated Description from the Item Label will likely:<br \/>\n&#8211; reduce chunk splitting<br \/>\n&#8211; improve retrieval accuracy<br \/>\n&#8211; increase effective context usage<br \/>\n&#8211; produce more complete answers<br \/>\nand has little to no downside if the reference number is still present elsewhere.<\/li>\n<\/ol>\n<ol start=\"6\">\n<li><strong>Attaching files: <\/strong>I\u2019m using AnythingLLM with Ollama and the Mixtral model on a laptop with an Intel Core i9 and 64GB RAM. I\u2019d prefer to have the model work on all parts of my Mementos Index with 2,390 entries, rather than a limited number of Chunks. I believe that the way to do that is not to embed the index but to ATTACH the index (now adjusted so that all entries are in single cells in a single column) in .txt format. Will that work on my configuration? And roughly how long will it take for the AI to start printing out its answers?<br \/>\n<u>ChatGPT\u2019s answer: <\/u>In AnythingLLM (even when using Ollama with Mixtral), attaching a file does <strong>not bypass context limits<\/strong>. The model still has a <strong>maximum context window<\/strong> (how much text it can \u201csee\u201d at once), so it cannot process all 2,390 entries simultaneously unless they fit inside that window. Mixtral typically supports a context of around ~32k tokens (varies by version). Rough conversion: 1 token \u2248 3\u20134 characters, so 32k tokens \u2248 100k\u2013120k characters. Your dataset reality: with 2,390 entries at even a modest 200 characters per entry, that\u2019s 2,390 \u00d7 200 = 478,000 characters \u2013 4\u20135\u00d7 larger than the entire context window. So the system will truncate the file, and only the first portion will be seen by the model. 
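The back-of-the-envelope arithmetic in this answer can be checked directly. The figures below are the rough approximations quoted above (entry size, context size, characters per token), not exact model limits:

```python
# Rough capacity check: does the whole Mementos Index fit in the context window?
# All figures are the approximate values quoted above, not exact model limits.
entries = 2390
chars_per_entry = 200            # a modest per-entry estimate
context_tokens = 32_000          # typical Mixtral context window
chars_per_token = 3              # conservative end of the 3-4 range

total_chars = entries * chars_per_entry           # 478,000 characters
context_chars = context_tokens * chars_per_token  # 96,000 characters
print(total_chars / context_chars)                # roughly 5x over budget
```

Even with the generous 4-characters-per-token figure (128,000 characters of context), the index is still nearly 4x too large to attach in full.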
In the worst case:<br \/>\n&#8211; System becomes slow or unresponsive<br \/>\n&#8211; Very long delay before generation starts<br \/>\n&#8211; Possible failure or timeout<br \/>\nThe bottom line is:<br \/>\n&#x274c; Attaching the full .txt will <strong>not<\/strong> let the model process all 2,390 entries<br \/>\n&#x274c; It will likely <strong>truncate or slow down significantly<\/strong><br \/>\n&#x23f1;&#xfe0f; You may wait <strong>10\u201360+ seconds<\/strong> before any output \u2013 and still get incomplete results<br \/>\n&#x2705; Embeddings + retrieval is the <strong>correct and scalable approach<\/strong><br \/>\n&#x2705; Your restructuring (one entry per cell) is exactly the right direction<\/li>\n<\/ol>\n<p>Having acquired these learnings, I was ready to continue testing. First, however, I knew I needed a more appropriate set of evaluation criteria. So, I put this final question to ChatGPT:<\/p>\n<p>\u201cWhen applying AI to an index to a collection in an Excel spreadsheet, what generally applicable evaluation questions could be asked which could not be answered directly by using the spreadsheet Filter facility?\u201d<\/p>\n<p>The answer provided <a href=\"https:\/\/www.pwofc.com\/ofc\/wp-content\/uploads\/2026\/04\/ChatGPTs-suggested-list-of-Evaluation-Questions.docx\">4 questions in each of 11 categories, and then one extra question<\/a>, i.e. 45 questions in all. 
From these I selected the following five:<\/p>\n<ul>\n<li>What are the main themes that run through the entire index?<\/li>\n<li>Are there distinct phases or periods in the collection?<\/li>\n<li>Which items suggest important life events or transitions?<\/li>\n<li>What patterns or motifs repeat across the collection?<\/li>\n<li>What are the top 5 most important entries, and why?<\/li>\n<\/ul>\n<p>Armed with these new evaluation questions, and a better understanding of the settings and parameters associated with embedding, I set out to undertake another round of tests using the Mementos Index.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The tests described in the previous post highlighted a number of questions which are discussed below: The Text Chunk Overlap is described as \u2018the maximum overlap of characters that occurs during chunking between two adjacent text Chunks\u2019. Under what circumstances &hellip; <a href=\"https:\/\/www.pwofc.com\/ofc\/2026\/04\/10\/embedding-parameters-and-new-evaluation-questions\/\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40],"tags":[],"class_list":["post-2771","post","type-post","status-publish","format-standard","hentry","category-ai-for-personal-archives"],"_links":{"self":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2771","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/comments?post=2771"}],"version-history":[{"count":13,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2771\/revisions"}],"predecessor-version":[{"id":2785,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2771\/revisions\/2785"}],"wp:attachment":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/media?parent=2771"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/categories?post=2771"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/tags?post=2771"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}