This is the start of my attempt to undertake Phase 1 of my investigation into providing an AI interrogation capability for archives. Phase 1 concerns providing AI support for my Memento collection’s index entries.
The first step was to apply the advice in the recent publication “AI preparedness guidelines for archivists” by Prof. Giovanni Colavizza and Prof. Lise Jaillant. This suggests addressing four main areas (referred to as Pillars). My analysis of the Memento collection’s preparedness relating to each of the four areas is recorded in a Memento Preparedness document and summarised below:
Pillar 1 – Completeness and excluded data. The collection is complete; all items in the Index are to be interrogated by AI in this phase; no items have been excluded.
Pillar 2 – Metadata and access. The Index contains 14 fields as columns, and these are available for AI interrogation. However, there are many additional columns (containing various analytical data) which are to be removed for this exercise. All information remaining in the Index after the additional columns have been removed will be available for AI interrogation; no information will be subject to restricted access in this exercise. Provenance and relationship information is embedded in the Reference No, and sometimes in the Notes field. An extensive range of narrative information about the collection and the Index is contained in a Guide worksheet within the Index spreadsheet.
Pillar 3 – Data types, formats, and file structures. Before making any changes to the Index file, an assessment will be made as to whether the change is wanted in the original Index or not. If it is not, a copy of the Index will be made. A variety of different file formats are present in the digital files of the collection, but the vast majority are .pdf, .docx, or .jpg files. Some standardisation changes may be required in some of the Index fields. Folder names distinguish between the different collection components and link back into the overall Collection folder structure. All item digital File Titles contain the relevant Reference Number.
Pillar 4 – Application-specific metrics and evaluation. The ability to find what you are looking for in the Index is the primary requirement of collection users. Another requirement is to find the last Reference Number used overall or in a particular series, in order to specify the appropriate next Reference Number for an item you are adding to the collection. How these and other criteria should be translated into evaluation metrics will be considered through the course of the project.
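The second requirement in Pillar 4 – finding the last Reference Number used in a series – is concrete enough to sketch in code. The sketch below is a minimal illustration only: it assumes the Index has been exported to CSV, that the column is literally named “Reference No”, and that Reference Numbers end in a numeric suffix in a form such as ‘MEM/A/017’. The real field names and numbering scheme may well differ.

```python
import csv

def last_reference_number(index_path, series_prefix):
    """Return the highest-numbered Reference No that begins with the
    given series prefix, or None if the series has no entries.
    Assumes (hypothetically) references of the form 'MEM/A/017'."""
    highest = None
    with open(index_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            ref = row.get("Reference No", "")
            if not ref.startswith(series_prefix):
                continue
            suffix = ref.rsplit("/", 1)[-1]  # numeric part after last '/'
            if suffix.isdigit() and (highest is None or int(suffix) > highest[0]):
                highest = (int(suffix), ref)
    return highest[1] if highest else None
```

Given that, the “appropriate next Reference Number” for a new item is simply the returned suffix plus one, formatted in the same style.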
As a result of the above analysis the following 14 actions were identified (the notation ‘P1.1’ means ‘the first action in Phase 1’; ‘P1.2’ is the second action in Phase 1; and so on).
P1.1 Mem(Index). Ensure completeness and normalisation. All 14 fields were checked to eliminate blanks and to normalise content where necessary.
P1.2 Mem(Index). Remove columns O to BX from the file used for this AI work. Columns O to BX were removed.
P1.3 Mem(All). Document all the Provenance and Relationship info embedded within the Index and the File Titles. The Guide was expanded to describe: a) the 14 fields, b) how the digital filename is constructed, and c) how the collection came about (which includes references to posts in the pwofc.com website).
P1.4 Mem(All). Observe how the Provenance and Relationship info is used, to create guidelines for producing such documentation. To be revisited during the implementation of Phase 1.
P1.5 Mem(Index). Identify any extra narrative info that is available or is needed. None needed.
P1.6 Mem(Index). Produce any extra narrative info that is required. None needed.
P1.7 Mem(All). Carry out the ‘wanted or not in the original index’ check before each action. Done.
P1.8 Mem(Items). Check what formats exist in the collection files. 16 different file formats are present in the collection – DOC, DOCX, FMP12, HTM, JPG, M4A, MP3, MP4, PDF, PDF-A1-b, PPTX, TIFF, XLSM, XLSX, XLS, ZIP.
P1.9 Mem(Items). Define AI-friendly standard formats: Only the Index to the collection (an XLSX document) is to be used in this phase, and this will be converted into a csv file for the purpose.
P1.10 Mem(Items). Make any changes to existing formats to conform to new standards. An AI-friendly file in csv format was derived from the original Index document. Since a csv file cannot contain multiple worksheets, two new files were created: one with the file name ‘Mementos Collection Index for AI Phase 1.csv’, and another with the file name ‘Mementos Collection Guide for AI Phase 1’.
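The worksheet-splitting step in P1.10 can be automated. The sketch below is one possible approach, assuming the openpyxl library is available; the worksheet and file names in the test are illustrative, not the collection’s actual ones, and in practice each output file would be renamed to the ‘… for AI Phase 1’ convention described above.

```python
import csv
from openpyxl import load_workbook  # third-party library for .xlsx files

def worksheets_to_csv(xlsx_path, out_dir):
    """Write each worksheet of an .xlsx workbook to its own CSV file,
    named after the worksheet. Returns the list of files written."""
    wb = load_workbook(xlsx_path, read_only=True, data_only=True)
    written = []
    for ws in wb.worksheets:
        out_path = f"{out_dir}/{ws.title}.csv"
        with open(out_path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            for row in ws.iter_rows(values_only=True):
                # Replace empty cells (None) with empty strings.
                writer.writerow(["" if v is None else v for v in row])
        written.append(out_path)
    return written
```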
P1.11 Mem(All). Document the folder structure for the derivative file. For all this AI work, a new folder was created into which these derived files, and all other files derived for AI purposes, will be placed: C:\Users\pwils\Documents\AI.
P1.12 Mem(All) – Find out what ‘supports programmatic retrieval’ means in practice. ChatGPT advised that this usually means: querying a vector database, calling a search API, pulling documents from a content repository, and fetching structured data from a database.
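The last of the retrieval styles ChatGPT listed – fetching structured data – is the one the Index CSV most naturally supports, and it can be illustrated with a few lines of code. This is only a sketch of the idea, not the project’s implementation: the field names ‘Title’ and ‘Notes’ are assumptions about the Index schema.

```python
import csv

def search_index(index_path, query, fields=("Title", "Notes")):
    """Tiny keyword search over the Index CSV: return every row in
    which any of the given fields contains the query, ignoring case.
    The default field names are assumptions, not the real schema."""
    query = query.lower()
    matches = []
    with open(index_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if any(query in (row.get(field) or "").lower() for field in fields):
                matches.append(row)
    return matches
```

Each of the other styles (vector database, search API, content repository) is a more elaborate version of the same contract: a question goes in, a small set of matching records comes back.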
P1.13 Mem(All) – Make any changes necessary to support programmatic retrieval. I don’t have enough knowledge yet to understand if any changes are needed to support that process. I will have to revisit this question when I actually start to try to implement the capability.
P1.14 Mem(All) – Prompt for ideas about success metrics as each action is taken in the course of the project. This question will be revisited as work on this phase progresses.
This brought to an end the Preparedness work, which took a total of approximately 7 hours. The next step was to try to implement an AI capability. To start this process, I asked ChatGPT what should be the first thing I do to create a RAG interrogation capability for the Memento collection’s Index (RAG stands for Retrieval-Augmented Generation, whereby the AI is not trained on the archive; instead, the archive data is provided to the AI at answer time). What followed is reported in the next post.