We have bulk-uploaded records to our Library catalogue to include some older theses which were previously listed only in a single printed list that sat on top of a card catalogue cabinet near the enquiry desk. When I discovered this, I knew it would be easy to scan the list and make it available as a PDF, but it seemed to me it would be much more useful if it were integrated with our other electronically catalogued theses and available for full searching.
To do this, we scanned the document and then ran OCR on it to make it machine-readable. We then wrote a program which looked for various patterns in the file (for example, the presence of a blank line) in order to split it into separate records with separate fields such as title, author, etc. After that, we did some manual cleanup where characters had been incorrectly recognised; for example, the lowercase letter l is often read as the number 1. We then loaded the records into the library system. The load took three minutes, and within a few hours all the records had been automatically indexed.
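For anyone curious what the record-splitting step might look like, here is a minimal sketch in Python. It is not the program we actually ran: the field order (author, title, year), the function names and the sample text are all assumptions made up for illustration. It splits the OCR output on blank lines and flags words where a digit 1 sits next to a letter, which is a common sign of a misread lowercase l.

```python
import re

# Hypothetical field order within each record. The layout of the real
# printed list is not described here, so this is an assumption.
FIELD_NAMES = ["author", "title", "year"]


def split_records(text):
    """Split OCR output into records, using blank lines as separators."""
    blocks = re.split(r"\n\s*\n", text.strip())
    records = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        # Pair each line with a field name; any extra lines are ignored.
        records.append(dict(zip(FIELD_NAMES, lines)))
    return records


def flag_suspect_words(value):
    """Return words where a digit 1 is adjacent to a letter (possible misread 'l')."""
    return [w for w in value.split() if re.search(r"[A-Za-z]1|1[A-Za-z]", w)]


# Invented sample text standing in for the OCR output.
sample = """Smith, J.
A study of coa1 mining
1972

Jones, A.
River ecology
1985
"""

records = split_records(sample)
for record in records:
    suspects = flag_suspect_words(record.get("title", ""))
    print(record, "check:", suspects)
```

A check like this only points a person at likely errors; the corrections themselves would still be made by hand, as ours were.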
We have similar projects in the pipeline, so check back regularly. You can search the set of theses here.