The object of Phase 2 was to explore how AI can support the combination of a collection’s Index and the titles of the associated files. The answer is straightforward: the assembled file titles need to be provided to the AI model in one or more files, in the same way that the Index is delivered. In Windows, file titles can be collected by highlighting the files you are interested in and selecting ‘Copy as path’ from the right-click menu. The results can then be pasted into either Excel or Word, and the path in front of each file name can be removed using the Find and Replace function: specify the path in the Find box, leave the Replace box blank, highlight the file names, and select Replace All. No doubt other operating systems have similar capabilities.
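If you would rather not do the Find and Replace by hand, the same clean-up can be scripted. Below is a minimal sketch of one possible approach, assuming the ‘Copy as path’ output has been pasted into a plain-text file; the input and output file names are placeholders, not part of my actual workflow.

```python
from pathlib import PureWindowsPath

# Sketch: strip the directory prefix and the surrounding quotes that
# Windows 'Copy as path' adds, leaving just the file titles.
# "copied_paths.txt" and "file_titles.txt" are placeholder names.
with open("copied_paths.txt", encoding="utf-8") as src:
    titles = [PureWindowsPath(line.strip().strip('"')).name
              for line in src if line.strip()]

with open("file_titles.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(titles) + "\n")
```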
I tried this out with the names of 2,065 files associated with the Mementos Index, along with version 1.62 of the Index (the version I used for the final set of tests in Phase 1). I retained the 5 evaluation questions used in Phase 1, but also added three more to further test the capabilities of the AI models concerned (more on these at the end of this post). I performed tests using AnythingLLM/Mistral and Copilot. ChatGPT wasn’t used because it has an upload limit of around 250KB, and the file titles together with the Index would have significantly exceeded that. I also tried out Anthropic’s Claude model for the first time. The results are shown in the table below, together with the results of the final test in Phase 1 for comparison purposes.
| System | Phase 1 – no file titles (5 evaluation questions) | Phase 2 – with file titles (5 evaluation questions) |
| --- | --- | --- |
| AnythingLLM/Mistral | 4.4 out of 10: Overall, this was a disappointing result. The answers were very sparse, with little rationale or summation. Some of the answers given were of dubious worth. | 2.3 out of 10: This was a generally very poor result. Four of the answers were virtually worthless, while two weren’t too bad at all. Quite apart from the content of the answers, they were all relatively short, with none of the embellishments and rationale that seem to be a standard part of the responses of AIs like ChatGPT and Copilot. |
| Copilot | 8.7 out of 10: Three of these five answers were exceptionally good, and all the responses were well illustrated with rationale and plenty of examples. The summaries at the end were well constructed and useful. I only spotted one error across the five answers, though there were three or four things which stood out as having been omitted. | 8.9 out of 10: Overall, this is an outstanding result: comprehensive answers with good introductions and summaries, and lots of examples with reference numbers included. Nearly everything was correct and complete – it was bemusing to read things like the series of homes we had lived in, or the different companies I’d worked for – an experience like being amazed that someone you are talking to seems to know your life history. The one concern is that in one answer the AI suggested it had taken facts from a document (the Barlborough prospectus) which it didn’t have access to; the only conclusion is that it hallucinated the information (indeed, it subsequently admitted it had). |
| Claude | – (not tested in Phase 1) | 8.2 out of 10: Claude’s answers were pretty good overall, but a bit patchy, with scores ranging from 6 to 9.8 (this latter score being ‘first class’). The poorer scores were largely the result of misinterpretations and errors. |
From these results, it’s not possible to assess whether the inclusion of file titles has made a difference: AnythingLLM’s performance was generally very poor across both tests; Copilot’s two scores were too similar to draw any conclusion; and Claude did not have any earlier test result to compare against. However, it seems reasonable to assume that if the file titles contain additional information, over and above that in the Index, there will be a better outcome.
Since this was the first time I had used Claude, I enquired about its operational parameters, and discovered that Claude’s context window is 200K tokens across all models and paid plans (except Enterprise plans). That is roughly 500 pages of text, or approximately 800,000 characters. Claude caps file uploads at 30MB per file and 20 files per conversation. I found the quality of Claude’s answers to be at the same sort of high level as Copilot’s – though the results indicate that, on this particular set of uploaded material, Copilot has a bit of an edge. I also noticed that Claude differs from Copilot in two ways: first, Claude, unlike Copilot, tells you what it’s doing in the course of responding to a question; and second, Claude took roughly three times as long to respond to the same set of 8 evaluation questions (59.3 seconds versus 19.6).
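Those figures imply the common rough rule of thumb of about four characters per token, which makes it easy to estimate whether a set of material will fit in the window before uploading it. The sketch below illustrates the arithmetic only; the file names are hypothetical and the characters-per-token ratio is an approximation, not an exact measure.

```python
# Rough estimate of whether some text files fit in a 200K-token context
# window, using the approximate 4-characters-per-token rule of thumb.
CONTEXT_WINDOW_TOKENS = 200_000
CHARS_PER_TOKEN = 4  # rough heuristic for English prose

files = ["mementos_index_v1_62.txt", "file_titles.txt"]  # placeholder names
total_chars = sum(len(open(f, encoding="utf-8").read()) for f in files)
estimated_tokens = total_chars / CHARS_PER_TOKEN

print(f"{total_chars:,} characters ~ {estimated_tokens:,.0f} tokens "
      f"({estimated_tokens / CONTEXT_WINDOW_TOKENS:.0%} of a 200K-token window)")
```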
The three extra evaluation questions that I added were:
- Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos.
- Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files.
- Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection.
These extra questions were deliberately designed to explore the AIs’ ability to make more broadly based connections and inferences, as opposed to identifying related items in a narrow subject area. This is where AI can excel – and sometimes fail – as indicated in the detailed results for each question provided below.
| Question | Copilot Result | Claude Result |
| --- | --- | --- |
| Describe the things that happened to Paul Wilson during the time he spent at Barlborough Hall School and support the story with references to relevant mementos. | 7 out of 10: This is a comprehensive summary of my time at Barlborough – though surprisingly lacking in information about sports activities. The answer claims to have drawn details from the school prospectus and from letters to parents – but it didn’t have those, and I think it just hallucinated the information. | 9.8 out of 10: This is an excellent, comprehensive answer utilising all the Barlborough material in the collection so far as I could see. I didn’t spot any errors. The answer was easy to read and included all the relevant Memento Reference Numbers. First class. |
| Identify anything strange or unexpected about the contents of the Memento collection and support the analysis with references to particular files. | 9 out of 10: I’m quite bemused by this answer: I don’t really know what I was expecting, but Copilot delivered a whole bunch of unusual stuff with good rationale for each item’s inclusion. Extraordinary! | 6 out of 10: Claude identified 12 different strange entries, but only about 5 are really valid. The so-called errors in the birthday card collection arose simply because I entered birthday cards from different years under the same year; the so-called error in the Miss Saigon Programme date arose because I also bought a special Souvenir programme bearing the date of the first performance. Item 5 is also covered by Item 1. So this is an interesting set of material – but not entirely correct or useful. |
| Describe life in the Wilson family during the 1990s citing relevant artefacts in the collection. | 9 out of 10: Another very comprehensive answer, replete with examples and a good summary at the end. I haven’t checked in detail whether it’s all correct, but nothing stands out as being wrong. I’m finding it hard to fault these answers. | 8.8 out of 10: This is a very complete account of our time in the 1990s, written in a very readable style spiced with occasional wry comments and humour. Each detail is accompanied by the relevant Reference Number. It is highly informative but does include a few misinterpretations – in particular claiming that my wife was at home during the 1990s when actually she had gone back to teaching; that I was seconded to the IR in the ‘mid-1990s’ whereas it was at the beginning of the 1990s; that the EDS chocolate bar had been kept whereas it was just the box it came in; that my son went to Exeter to do a second degree whereas it was his first; that I was a devoted Aston Villa supporter whereas it was my other son; and that my son had a band called Phases, which was stretching it since it was just he and a friend making a cassette recording. However, these are relatively small points amidst the huge array of correct facts presented in this highly readable piece. |
Finally, to come back to the objective of Phase 2, I have concluded that file titles can certainly be included in the material delivered to an AI model, and that doing so is worthwhile if they contain information beyond what is in an Index – or if an Index does not exist.
For completeness, the table below records the breakdown of the time I spent on Phase 2.
| Activity | Time spent (hours) |
| --- | --- |
| 8 Tasks (total elapsed time: 4 days) | 11 |
| Time spent assembling files of File Titles and testing | 8 |
| Time spent researching and drafting post for pwofc.com | 3 |