The Library on a Massive Scale
- By Bridget McCrea
- 10/05/11
What do you get when you combine the collections from 60 major research institutions into a single, digitized library? A comprehensive collection, of course, but also a major headache for the people who have to collect, organize, preserve, and publish the information in a user-friendly manner for students, professors, and the general public.
That's the headache faced by John Wilkin, associate university librarian for the University of Michigan's Library Information Technology (LIT) department and executive director of HathiTrust at U-M. The latter is a comprehensive digital archive comprising materials from the likes of U-M, Arizona State University, Baylor University, Columbia University, and Dartmouth College, to name just a handful.
At press time, HathiTrust's digitized collection included 9.71 million total volumes, more than 5.15 million book titles, 256,000 serial titles, 3.4 billion pages, and 435 TB of information.
Wilkin gave Campus Technology a look at how the initiative started, challenges it's faced when trying to manage huge volumes of data, and how HathiTrust has overcome those obstacles.
Bridget McCrea: How did HathiTrust come about?
John Wilkin: A few years ago we began to realize that a lot of research libraries have the same types of content, and that while there are also many "unique" volumes and titles, putting together what we call a "collective collection" would be extremely beneficial and efficient for all of the various institutions. The idea went over well, and we started with about 23 institutions--a number that's since grown to 60. We just exceeded 9 million volumes, which positions HathiTrust as one of the top 10 research libraries in North America.
McCrea: How did you handle digitization of the content?
Wilkin: College libraries have managed large amounts of content for a long time, but primarily on their respective campuses. So when we embarked on this venture we already had a lot of experience with large-scale digital preservation. With this project, however, we were pressing the boundaries of what we could do with the technology that we were already using.
McCrea: So had you outgrown traditional library content management systems?
Wilkin: We had a number of systems "strapped together" in intelligent ways, but running them required a lot of hardware administration. There were also performance issues, and unanswered questions regarding scaling--such as, how do you go from 100 terabytes of data to 200 terabytes? We really felt like we were at the outer boundaries of what we could do using our current systems.
McCrea: What did you do about it?
Wilkin: We turned to the storage experts at our partner institutions for help in putting together an RFP, which was very competitive and included a lot of analysis. It probably took a year to assemble, including writing up the specs, gathering vendor responses, and filtering out the companies that didn't meet our needs. We completed a very serious evaluation that took into account our scalability issues, data management questions, and the fact that, as institutions, we were working with limited staff and budgets. Basically, we wanted a scalable solution that didn't require us to add staff to support it. After that lengthy RFP process, we selected Isilon's OneFS management tool.
'As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much, and ending up with mismatched systems down the road.' --John Wilkin
McCrea: What functions does the solution handle for HathiTrust?
Wilkin: We ended up with a high-performing, easily managed, and scalable system that helped us accomplish what we wanted to do. We didn't just want to publish a library online and let it stagnate. We're at more than 9 million volumes and growing daily--most likely to 10 million by the end of this year. The system allows us to add up to 30,000 volumes per day. We can bring content into the repository, validate it, and know exactly what's being stored for the long term. It also handles ongoing validation checks, to make sure content isn't unintentionally being altered along the way.
Open to users worldwide, the archive is actively managed and synchronized across two sites [with materials brought in via one site and then replicated to the other site], and we also do load balancing across those sites.
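The ongoing validation Wilkin describes is a standard digital-preservation practice known as fixity checking: record a cryptographic checksum for each file at ingest, then periodically recompute and compare. A system like OneFS handles integrity internally, so the sketch below is purely illustrative of the concept rather than HathiTrust's actual implementation; the manifest format and function names are assumptions.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large volumes never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(manifest: dict[str, str], root: Path) -> list[str]:
    """Compare current checksums against those recorded at ingest.

    `manifest` maps relative file paths to their checksums at ingest time
    (a hypothetical format). Returns the paths whose content no longer
    matches, i.e. files that were unintentionally altered or corrupted.
    """
    return [
        rel for rel, recorded in manifest.items()
        if sha256_of(root / rel) != recorded
    ]
```

Run periodically against each replica, any non-empty result flags content to re-copy from the other site before the damage can propagate.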
McCrea: Does this take place seamlessly?
Wilkin: Yes. A team of six people manages the entire digital collection.
McCrea: What's next on the agenda?
Wilkin: We're dealing with a lot of books and journals that our libraries have held for hundreds of years, and that are now being digitized. Right now we're looking pretty closely at a number of different format issues, and experimenting with various formats. Going forward, we'll be examining other new publishing options, such as publishing directly into the HathiTrust archives, inclusive of audio, images, and other multimedia components.
McCrea: What advice would you give another institution or consortium that's grappling with data management problems?
Wilkin: Don't buy technology for a future that you can't already "feel." As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much, and end up with mismatched systems down the road. The key is to be more pragmatic in what you are doing, and to keep an eye on the future that's closer at hand.