A few months back I wrote a post about my experience of going to one of the SPRUCE Mash-Up events. Since then I’ve been doing a bit more work looking at the digital collection I took with me and I thought it was about time I wrote an update on what I’ve been doing.
Firstly, I’ve found that the outcomes of the Mash-Up event were simply too complex to follow up with limited technological knowledge or assistance. Instead I’ve been using a more ‘archivist-friendly’ piece of software called Karen’s Directory Printer to generate descriptive metadata, which can then be imported into Excel to enable quicker assessment and appraisal of the collection. This software has already been recommended by a number of project teams working in the area of digital preservation, including the UK-USA collaborative AIMS project and the JISC-funded Paradigm project. The great benefit is that it’s extremely simple to use. Once it is downloaded, a single dialogue box lets you choose whether to print the metadata or save it to a file; whether to export metadata about files, folders or both; and which folders and files you want metadata to be generated for. Further options are also available. My handy hints would be:
- Under the ‘Sort files by’ option remember to select ‘don’t sort’ from the drop-down box.
- Under the ‘Print Option’ select ‘File info only’. You can extract metadata from both files and folders at the same time but, once imported into Excel, I found that the combined export is more difficult to manipulate as there is duplication of columns in the spreadsheet. In addition, the file-level metadata is more useful for finding exact duplicates and versions of the same record.
- On the right-hand side box you can select the metadata you want to extract. I found the following metadata (in this order) to be the most useful – Folder Name, File Name, File Path, Extension, File Size, Date Created, Date Modified, File Version, MD5 Hash, and SHA-1 Hash.
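For anyone comfortable with a little scripting, a similar file-level listing can be produced without the software. The Python below is only a rough sketch of that idea, not how the tool itself works: the function names (`file_hashes`, `describe_files`, `write_metadata`) are made up for illustration, the output is a plain CSV file, and ‘File Version’ is left out because it is not available from the standard library.

```python
import csv
import hashlib
import os
from datetime import datetime

# Column order follows the list above, minus 'File Version'.
COLUMNS = ["Folder Name", "File Name", "File Path", "Extension", "File Size",
           "Date Created", "Date Modified", "MD5 Hash", "SHA-1 Hash"]


def file_hashes(path, chunk_size=1024 * 1024):
    """Return (md5, sha1) hex digests, reading the file in chunks."""
    md5, sha1 = hashlib.md5(), hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()


def as_text(timestamp):
    """Format a file-system timestamp as local date and time."""
    return datetime.fromtimestamp(timestamp).isoformat(sep=" ", timespec="seconds")


def describe_files(root):
    """Yield one metadata row per file under root (files only, no folder rows)."""
    for folder, _subfolders, names in os.walk(root):
        for name in names:
            path = os.path.join(folder, name)
            info = os.stat(path)
            md5, sha1 = file_hashes(path)
            yield {
                "Folder Name": os.path.basename(folder),
                "File Name": name,
                "File Path": path,
                "Extension": os.path.splitext(name)[1].lower(),
                "File Size": info.st_size,
                # st_ctime is the creation time on Windows only; on other
                # systems it is the time of the last metadata change.
                "Date Created": as_text(info.st_ctime),
                "Date Modified": as_text(info.st_mtime),
                "MD5 Hash": md5,
                "SHA-1 Hash": sha1,
            }


def write_metadata(root, out_path):
    """Write the listing for root to a CSV file and return the number of files."""
    count = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS)
        writer.writeheader()
        for row in describe_files(root):
            writer.writerow(row)
            count += 1
    return count
```

The output file should be saved outside the folder being described, otherwise the listing will include itself.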
Once the metadata has been generated you can import the data into an Excel spreadsheet. In Excel go to Data>Get External Data>From Text and locate the file that has been exported from Karen’s Directory Printer. It does mean that if the digital collection is very large you will have a huge spreadsheet to work with, which can be quite slow and cumbersome. However, it is far more helpful than working your way through the actual collection.
The metadata for the whole collection will be retained as a permanent record and a second spreadsheet has been created to use as the working document for the appraisal. In terms of the appraisal I am starting to make some progress removing duplicates and other unwanted sections of the collection.
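If the spreadsheet becomes too slow, the exported text file can also be read directly with a script. This is a minimal sketch under some assumptions that should be checked against your own export: that the file is tab-delimited, that its first row holds the column headers, and that it is saved as UTF-8. The function name `load_export` is hypothetical.

```python
import csv


def load_export(path, delimiter="\t", encoding="utf-8-sig"):
    """Read a delimited text export into a list of dicts keyed by column header.

    The tab delimiter and the header row are assumptions about the export;
    pass a different delimiter if the file uses commas or another separator.
    """
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return [dict(row) for row in reader]
```

Each row then behaves like a line of the spreadsheet, for example `rows[0]["File Name"]`, and `len(rows)` gives the number of records.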
To find duplicates and potential duplicates go to Home>Conditional Formatting>Highlight Cells Rules>Duplicate Values and select the way in which you want duplicate data across cells to be highlighted. This is a more appropriate method than simply deleting rows (Data>Remove Duplicates) as you will have a record of the files that can either be flagged as absolute duplicates (using the checksum algorithms, MD5 / SHA-1) or as files that need to be checked as they have the same file titles. This snapshot of a section of the Excel spreadsheet, with the absolute duplicates highlighted in red, gives a sense of how helpful this software can be in the initial weed of the collection.
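The same two-tier check (identical checksum versus identical file title) can be sketched in a few lines of Python for anyone who prefers a script to conditional formatting. This is an illustration only: it assumes the metadata has already been read into a list of dicts with ‘SHA-1 Hash’ and ‘File Name’ columns, and the ‘Duplicate Status’ column and function names are made up for the example.

```python
from collections import defaultdict


def group_rows(rows, key_of):
    """Group rows by key_of(row), keeping only groups with two or more members."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_of(row)].append(row)
    return {key: members for key, members in groups.items() if len(members) > 1}


def flag_duplicates(rows, hash_column="SHA-1 Hash", name_column="File Name"):
    """Label every row instead of deleting any, so the record stays complete."""
    same_hash = group_rows(rows, lambda row: row[hash_column])
    # File names are compared without regard to case.
    same_name = group_rows(rows, lambda row: row[name_column].casefold())
    for row in rows:
        if row[hash_column] in same_hash:
            row["Duplicate Status"] = "absolute duplicate"
        elif row[name_column].casefold() in same_name:
            row["Duplicate Status"] = "same name, check content"
        else:
            row["Duplicate Status"] = ""
    return rows
```

As with the highlighting in Excel, every copy in a matching set is flagged, including the one you will eventually keep, so the decision about which copy to retain is still a manual one.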
Once all the absolute duplicates were removed, I started working my way through the spreadsheet locating sections of the collection that can easily be marked for disposal. In particular I’ve found a large number of set-up files for various systems and external media (including cameras), music files, and links to websites that we won’t retain (although we now have the links listed in the original metadata exported from Karen’s Directory Printer). I have found the spreadsheet quite difficult to work with at times, and it can be painstakingly slow working my way through such a large amount of data. However, it has enabled me to highlight whole sections of the collection that I have marked for destruction, and it has directed me towards other sections where a quick check of the actual content has informed an appraisal decision.
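A first pass of this kind could also be roughed out in Python by flagging rows on their file extension and then counting the results per folder, to show which sections are mostly disposable. This is a sketch rather than a record of how the appraisal was actually done: the list of extensions is a hypothetical starting point, the ‘Appraisal’ column is invented for the example, and anything flagged would still need a human decision.

```python
from collections import Counter, defaultdict

# Hypothetical starting list (installers, music, web shortcuts); adjust it
# to what the collection actually contains.
DISPOSAL_EXTENSIONS = frozenset({".exe", ".msi", ".dll", ".mp3", ".wma", ".url", ".lnk"})


def mark_for_disposal(rows, extensions=DISPOSAL_EXTENSIONS):
    """Set an 'Appraisal' value of 'dispose' or 'review' on every row."""
    for row in rows:
        # Normalise so that 'MP3', 'mp3' and '.mp3' are treated alike.
        extension = "." + row["Extension"].lower().lstrip(".")
        row["Appraisal"] = "dispose" if extension in extensions else "review"
    return rows


def summarise_by_folder(rows):
    """Count 'dispose' and 'review' rows for each folder name."""
    summary = defaultdict(Counter)
    for row in rows:
        summary[row["Folder Name"]][row["Appraisal"]] += 1
    return summary
```

A folder where nearly every file is flagged ‘dispose’ is a candidate for marking as a whole section, while a mixed folder points to somewhere a quick check of the actual content is needed.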
I’m not sure whether the way I’ve been working is the most sensible (or quickest), but the good news is that a spreadsheet that initially had more than 45,000 records is now down to just over 8,000. I’m hoping that through the work I’ve been doing so far I will be able to put together some procedures for dealing with completely unsorted collections in the future. It will also form the basis of the guidelines we will put together for individuals who would like to deposit their digital archives with us.