The Library on a Massive Scale

What do you get when you combine the collections from 60 major research institutions into a single, digitized library? A comprehensive collection, of course, but also a major headache for the people who have to collect, organize, preserve, and publish the information in a user-friendly manner for students, professors, and the general public.

That's the headache that John Wilkin faces as associate university librarian for the University of Michigan's Library Information Technology (LIT) department and executive director of HathiTrust at U-M. The latter is a comprehensive digital archive comprising materials from the likes of U-M, Arizona State University, Baylor University, Columbia University, and Dartmouth College, to name just a handful.

At press time, HathiTrust's digitized collection included 9.71 million total volumes, more than 5.15 million book titles, 256,000 serial titles, 3.4 billion pages, and 435 TB of information.

Wilkin gave Campus Technology a look at how the initiative started, the challenges it has faced in managing huge volumes of data, and how HathiTrust has overcome those obstacles.

Bridget McCrea: How did HathiTrust come about?

John Wilkin: A few years ago we began to realize that a lot of research libraries have the same types of content, and that while there are also many "unique" volumes and titles, putting together what we call a "collective collection" would be extremely beneficial and efficient for all of the various institutions. The idea went over well, and we started with about 23 institutions--a number that's since grown to 60. We just exceeded 9 million volumes, which positions HathiTrust as one of the top 10 research libraries in North America.

McCrea: How did you handle digitization of the content?

Wilkin: College libraries have managed large amounts of content for a long time, but primarily on their respective campuses. So when we embarked on this venture we already had a lot of experience with large-scale digital preservation. With this project, however, we were pressing the boundaries of what we could do with the technology that we were already using.

McCrea: So had you outgrown traditional library content management systems?

Wilkin: We had a number of systems "strapped together" in intelligent ways, but running them required a lot of hardware administration. There were also performance issues, and unanswered questions regarding scaling--such as, how do you go from 100 terabytes of data to 200 terabytes? We really felt like we were at the outer boundaries of what we could do using our current systems.

McCrea: What did you do about it?

Wilkin: We turned to the storage experts at our partner institutions for help in putting together an RFP, which was very competitive and included a lot of analysis. It probably took a year to assemble, including writing up the specs, gathering vendor responses, and filtering out the companies that didn't meet our needs. We completed a very serious evaluation that took into account our scalability issues, data management questions, and the fact that, as institutions, we were working with limited staff and budgets. Basically, we wanted a scalable solution that didn't require us to add staff to support it. After that lengthy RFP process, we selected Isilon's OneFS management tool.


McCrea: What functions does the solution handle for HathiTrust?

Wilkin: We ended up with a high-performing, easily managed, and scalable system that helped us accomplish what we wanted to do. We didn't just want to publish a library online and let it stagnate. We're at more than 9 million volumes and growing daily--most likely to 10 million by the end of this year. The system allows us to add up to 30,000 volumes per day. We can bring content into the repository, validate it, and know exactly what's being stored for the long term. It also handles ongoing validation checks, to make sure content isn't unintentionally being changed or altered along the way.

Open to users worldwide, the archive is actively managed and synchronized across two sites [with materials brought in via one site and then replicated to the other site], and we also do load balancing across those sites.
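For readers curious what an ongoing validation check can look like in practice, here is a minimal sketch of one common approach: record a checksum for every file at ingest, then periodically recompute and compare. This illustrates the general technique only and is not HathiTrust's or Isilon's actual implementation; the function names (sha256_of, build_manifest, find_changes) and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical ingest step)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changes(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents no longer match the manifest."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

In this sketch, build_manifest would run once when content enters the repository, and find_changes would run on a schedule at each site; an empty result means nothing stored has been altered since ingest.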

McCrea: Does this take place seamlessly?

Wilkin: Yes. A team of six people manages the entire digital collection.

McCrea: What's next on the agenda?

Wilkin: We're dealing with a lot of books and journals that our libraries have held for hundreds of years, and that are now being digitized. Right now we're looking pretty closely at a number of different format issues, and experimenting with various formats. Going forward, we'll be examining other new publishing options, such as publishing directly into the HathiTrust archives, inclusive of audio, images, and other multimedia components.

McCrea: What advice would you give another institution or consortium that's grappling with data management problems?

Wilkin: Don't buy technology for a future that you can't already "feel." As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much and end up with mismatched systems down the road. The key is to be more pragmatic in what you are doing, and to keep an eye on the future that's closer at hand.
