We recently received a personal archive collection that will be of great importance to researchers today and in the future, but at 76,500 files (25.5 GB) we were a little perplexed as to how to go about making the collection available. The main issues are:
- The collection is completely unsorted, comprising the contents of a number of PCs
- The size of the collection makes it impossible for us to survey it manually
- There is very little descriptive metadata – no meaningful titles; dates have been recorded as the date the collection was copied over to the archives; and there are no details of authors, so it is difficult to know what was created by the depositor and what is third-party content
- There is a huge amount of duplication, including exact duplicates and numerous versions of the same ‘record’, often with very little difference in content
- All this means we also have very little knowledge of the overarching content of the archive, making it difficult to provide even a collection-level description or to promote the collection to users.
So where to start? As archivists we know what we need to do to make the collection accessible, and with a paper collection we would physically sort it. But this is much more difficult with an electronic collection and, to be honest, we don’t have the technical knowledge and skills to make it happen. SPRUCE, a current JISC project looking at building the business case for digital preservation, has been working to help practitioners (i.e. archivists, curators and librarians) overcome these problems by holding a series of ‘Mash-Up’ events that bring together information professionals and technical developers (Rebecca Neilsen has written an excellent post about what the event involves). I recently attended the London Mash-Up and took a sample of this collection along with me.
By the end of the three days the developers I was working with had created a number of solutions to the problems outlined above. They started by using Apache Tika to extract descriptive metadata from the collection. From there they were able to create word clouds of single words, two-word terms (bigrams) and three-word terms (trigrams) (which didn’t work so well); the overarching date range of the collection; names of individuals and organisations; and file formats. This enables the archivist to start writing a broad description of the collection, and it was reassuring to see that the terms pulled out of the collection were what I would have expected, but would have found very difficult to produce from a blank slate.
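The n-gram counting behind those word clouds can be sketched in a few lines of Python. This is only an illustration of the general technique, not the developers’ actual tooling (which drew on Apache Tika’s extracted text); the `top_ngrams` function and the sample text are invented for this example:

```python
import re
from collections import Counter

def top_ngrams(text, n=2, k=5):
    """Return the k most common n-grams (n-word sequences) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    # Build overlapping windows of n consecutive words
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

sample = ("the record office holds the record series and "
          "the record office catalogues the record series")
print(top_ngrams(sample, n=2, k=2))
# → [('the record', 4), ('record office', 2)]
```

Run over the full extracted text of a collection, the most frequent bigrams give a rough sense of its recurring subjects, which is exactly what feeds a word cloud.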
Perl scripts were then used to aid the daunting task of appraising the collection. The first approach was to search for keywords in filenames to find all the files with similar titles, creating a filepath list which can then be used to manually locate the files and check for exact duplicates or versions of the same record. This does mean you need some idea of where there is potential duplication and what the file titles are, but this can be gained from a quick survey of part of the collection before automating the search across the whole of it. Probably my favourite solution was using checksums: normally used to check whether a file has been changed, here we used them to find exact duplicates. Again, a list of filepaths is created that the archivist can use to manually destroy files. There is also potential for this stage to be automated but, in the meantime, the creation of these lists certainly beats going through 76.5k files one by one searching for duplicates from memory.
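Both approaches can be sketched briefly. The developers worked in Perl; the Python below is a hypothetical re-creation of the same ideas, not their scripts. `list_paths_matching` builds a filepath list for a keyword, and `find_exact_duplicates` groups files by SHA-256 checksum, so any group with more than one path is a set of byte-for-byte identical files:

```python
import hashlib
import os
from collections import defaultdict

def list_paths_matching(root, keyword):
    """List paths whose filename contains the keyword (case-insensitive)."""
    return [os.path.join(dirpath, name)
            for dirpath, _, names in os.walk(root)
            for name in names
            if keyword.lower() in name.lower()]

def find_exact_duplicates(root):
    """Group files under root by SHA-256 checksum; files sharing a
    checksum are byte-for-byte identical."""
    by_hash = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MB chunks so large files don't exhaust memory
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(path)
    # Keep only checksums shared by more than one file
    return {k: v for k, v in by_hash.items() if len(v) > 1}
```

The output of either function is just a list of paths, which matches the workflow described above: the tool produces the list, and the archivist makes the appraisal decision.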
To capture all this work and make it available to others facing similar problems, the SPRUCE project team have developed a wiki to record information on the problems, technical developments, solutions and the work of practitioners who used the three days to begin thinking about building a strategic plan for digital preservation at their own organisations. All the information from the London event is available here, and anyone specifically interested in the technical work carried out on the collection I took should have a look at Peter Cliff’s description of the solution they adopted.
So after three days we now have a better knowledge of the tools required to begin work on sorting the collection to eventually make it available. We also have a better understanding of what steps we need to take to formalise policies and procedures for the Transformations project. And the best thing is that all this information is freely available to other information professionals facing similar dilemmas.