{"id":2874,"date":"2026-06-15T09:07:37","date_gmt":"2026-06-15T08:07:37","guid":{"rendered":"https:\/\/www.pwofc.com\/ofc\/?p=2874"},"modified":"2026-06-24T11:53:33","modified_gmt":"2026-06-24T10:53:33","slug":"file-splitting-and-truncation","status":"publish","type":"post","link":"https:\/\/www.pwofc.com\/ofc\/2026\/06\/15\/file-splitting-and-truncation\/","title":{"rendered":"File Splitting and Truncation"},"content":{"rendered":"<p>As described in the <a href=\"https:\/\/www.pwofc.com\/ofc\/2026\/06\/14\/a-prompt-about-prompts\/\">previous post<\/a>, the Index and file information in my PAWDOC collection is too big to be ingested and used in a single prompt in today\u2019s AI systems. So, I\u2019m using it to explore how AI might be able to analyse a very large collection in pieces and then to stitch the results together. The first step in such an undertaking is to decide what information is to be provided to the AI and how small the individual subsets (i.e. files) of that information need to be to guarantee that they will be taken into account in full in the AI\u2019s analysis (files that are too large will simply have some of their contents truncated after they are uploaded but before the AI starts its analysis).<\/p>\n<p>I decided to provide the following fields from the PAWDOC index to the AI:<\/p>\n<ul>\n<li>Reference Number: every document in the collection has a Reference Number which is made up of four parts &#8211; an Owner Identifier, a Set Identifier, a Serial Number, and a Sub-Serial Number. For example, PAW-DOC-4046-01.<\/li>\n<li>Title: this contains free format text describing the document(s) concerned. 
Three dots (&#8230;) in the middle of the text denote that what follows are Keywords\/Phrases.<\/li>\n<li>Publication date: this is the publication date of the oldest document relating to that Reference Number.<\/li>\n<li>Creation date: the date the index entry was created.<\/li>\n<\/ul>\n<p>I also decided to provide the file title for every file associated with each Index entry. File titles have a general structure of Reference Number, Description, filename extension. For example:<\/p>\n<p>PAW-DOC-1104-39\u00a0\u00a0 Planning the ITUSA document Interchange group X.400 User Test.tif<\/p>\n<p>I was able to produce a file of the relevant <strong>Index fields<\/strong> by exporting them from FileMaker. The export contained 17,381 entries composed of just over 2,400,000 characters (I calculated character numbers by using Excel\u2019s \u2018LEN\u2019 formula) and sized at 1.4MB.<\/p>\n<p>Obtaining a list of the associated <strong>Filenames<\/strong> was a little more difficult as the files reside inside separate folders for each Reference Number. However, a search of the net established that the 7z file compression utility will provide a list of all the files within a higher-level directory structure; so, I downloaded and installed 7z, and created a file of file titles containing 31,270 entries composed of 2,568,000 characters and sized at 1MB.<\/p>\n<p>To assess the number and size of the subset files that needed to be produced, I turned to the insights I recorded in the <a href=\"https:\/\/www.pwofc.com\/ofc\/2026\/04\/17\/phase-1-summary-results\/\">Phase 1<\/a> and <a href=\"https:\/\/www.pwofc.com\/ofc\/2026\/05\/06\/phase-2-results-and-enter-claude\/\">Phase 2<\/a> Summary results:<\/p>\n<ul>\n<li><strong>ChatGPT&#8217;s<\/strong> usable context window is about 80k-100k tokens (approximately 160k-300k characters \u2013 assuming 2-3 characters\/token). 
ChatGPT limits uploads to 3 files a day but this can be circumvented by putting multiple files in a zip file.<\/li>\n<li><strong>Copilot<\/strong> doesn\u2019t have a fixed context window &#8211; its design means that its effective context window is much larger and more flexible than a single token number would suggest. It limits uploads to about 7 batches of up to 20 files. Approximately 30k characters per file should work fine.<\/li>\n<li><strong>Claude\u2019s<\/strong> context window is 200k tokens (roughly 500 pages of text or approximately 800,000 characters). Claude caps file uploads at 30MB per file and 20 files per conversation.<\/li>\n<\/ul>\n<p>Copilot\u2019s and Claude\u2019s limits suggested that, for each prompt, I should load no more than 20 files containing a maximum of 30k characters each. This would exceed ChatGPT\u2019s limits (20 files of 30k characters is 600k characters, against an estimated 160k-300k), but I thought that I could do without ChatGPT if its results were poor.<\/p>\n<p>I had already decided that I would upload a copy of the &#8216;<a href=\"https:\/\/www.pwofc.com\/ofc\/wp-content\/uploads\/2026\/06\/PAWDOC-Guide.docx\">PAWDOC Guide<\/a>&#8216; file in every prompt, and that I would count the text request as a file in its own right; so that left a maximum of 18 files of Index and Filename information in each prompt. I duly set about splitting the Index file into subsets of around 30k characters, then creating associated files of file titles (ensuring that none of the files exceeded 30k characters), and assembling the combination of Index files and File Title files into groups of 18 files or fewer. It was a very laborious task. 
If I&#8217;d realised how time-consuming it was going to be, I would have found a way of splitting out the Reference Number from the File titles into a separate field and merging the overall Index file and overall File Title file into a single file sorted by Reference Number, which would have made the splitting task a much simpler and quicker operation &#8211; a lesson worth remembering (in fact, I&#8217;d achieved a similar feat a few months earlier simply by asking ChatGPT to provide me with an Excel function for a similar task &#8211; I&#8217;d just forgotten it was that easy). Anyway, after the job was completed, I found I had 10 subsets of either 17 or 18 files each, which I named Subsets A-J. Subset B\u2019s files are shown below.<\/p>\n<ul>\n<li>Test B files\\PAWDOC Index 12.csv<\/li>\n<li>Test B files\\PAWDOC Index 13.csv<\/li>\n<li>Test B files\\PAWDOC Index 14.csv<\/li>\n<li>Test B files\\PAWDOC Index 15.csv<\/li>\n<li>Test B files\\PAWDOC Index 16.csv<\/li>\n<li>Test B files\\PAWDOC Index 17.csv<\/li>\n<li>Test B files\\PAWDOC Index 18.csv<\/li>\n<li>Test B files\\PAWDOC File Names 7.csv<\/li>\n<li>Test B files\\PAWDOC File Names 8.csv<\/li>\n<li>Test B files\\PAWDOC File Names 9.csv<\/li>\n<li>Test B files\\PAWDOC File Names 10.csv<\/li>\n<li>Test B files\\PAWDOC File Names 11.csv<\/li>\n<li>Test B files\\PAWDOC File Names 12.csv<\/li>\n<li>Test B files\\PAWDOC File Names 13.csv<\/li>\n<li>Test B files\\PAWDOC File Names 14.csv<\/li>\n<li>Test B files\\PAWDOC File Names 15.csv<\/li>\n<li>Test B files\\PAWDOC File Names 16.csv<\/li>\n<\/ul>\n<p>I then set about getting each of the AIs to answer question 1 with Subset A. As it happened, there seemed to be no problem with ChatGPT: it happily accepted all the files in a zip file, and advised that no files were truncated when it analysed them. 
This didn\u2019t seem to concur with the ChatGPT context window information I had established earlier, but the results seemed reasonable so I decided to use ChatGPT in the testing. So too with Claude, which also was happy to ingest all the files compressed into a single zip file. Copilot, however, produced inferior results which I established, after some Q&amp;A with the AI, was due to the files being truncated. Copilot explained the problem very clearly:<\/p>\n<p><em>Your file size (\u224833 KB) and character count (\u224829,950) were both within your rule of thumb \u2014 but <strong>that rule is no longer reliable<\/strong>. <\/em><em>The truncation happened because:<\/em><\/p>\n<ol>\n<li><strong><em> The ingestion limit is based on tokens, not characters or kilobytes: <\/em><\/strong><em>A token is roughly 3\u20134 characters of English text, but <strong>CSV files with long lines, punctuation, quotes, and irregular structure tokenize very inefficiently<\/strong>. Your file contains long filenames, commas, quotes, and mixed punctuation dramatically increase token count. So even though the file is only ~30k characters, it may be <strong>10k\u201315k tokens<\/strong>, which is enough to trigger truncation.<\/em><\/li>\n<li><strong><em> CSVs with hundreds of long rows are the worst-case format: <\/em><\/strong><em>Each line is treated as a dense block of text. 
Tokenization cost spikes.<\/em><\/li>\n<li><strong><em> The system applies a per\u2011file token cap: <\/em><\/strong><em>This cap is not published, but from your repeated tests, the effective safe zone is:<\/em><\/li>\n<\/ol>\n<ul>\n<li><strong><em>\u224815,000\u201318,000 characters<\/em><\/strong><em> for CSV-like content<\/em><\/li>\n<li><strong><em>\u224820,000\u201322,000 characters<\/em><\/strong><em> for plain text<\/em><\/li>\n<li><strong><em>\u224830,000 characters only for very simple text<\/em><\/strong><\/li>\n<\/ul>\n<p><em>Your file was near 30k characters but had extremely token-heavy content \u2192 <strong>truncated<\/strong>.<\/em><\/p>\n<p>It was clear that I wasn\u2019t going to be able to conduct the tests with Copilot using the same subsets that I had already created. However, ChatGPT and Claude seemed able to deal with the subsets, and would enable me to make some sort of comparison of results. Furthermore, I wasn\u2019t prepared to redo the subsets, so I decided simply to exclude Copilot from the tests. This experience suggests that it\u2019s a good idea to test some sample file sizes with all the models to be used, before undertaking the final splitting of files into subsets.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>As described in the previous post, the Index and file information in my PAWDOC collection is too big to be ingested and used in a single prompt in today\u2019s AI systems. 
So, I\u2019m using it to explore how AI might &hellip; <a href=\"https:\/\/www.pwofc.com\/ofc\/2026\/06\/15\/file-splitting-and-truncation\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[40],"tags":[],"class_list":["post-2874","post","type-post","status-publish","format-standard","hentry","category-ai-for-personal-archives"],"_links":{"self":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2874","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/comments?post=2874"}],"version-history":[{"count":2,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2874\/revisions"}],"predecessor-version":[{"id":2905,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/posts\/2874\/revisions\/2905"}],"wp:attachment":[{"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/media?parent=2874"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/categories?post=2874"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.pwofc.com\/ofc\/wp-json\/wp\/v2\/tags?post=2874"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}