The Library on a Massive Scale

What do you get when you combine the collections from 60 major research institutions into a single, digitized library? A comprehensive collection, of course, but also a major headache for the people who have to collect, organize, preserve, and publish the information in a user-friendly manner for students, professors, and the general public.

That's the headache facing John Wilkin, associate university librarian for the University of Michigan's Library Information Technology (LIT) department and executive director of HathiTrust at U-M. The latter is a comprehensive digital archive comprising materials from the likes of U-M, Arizona State University, Baylor University, Columbia University, and Dartmouth College, to name just a handful.

At press time, HathiTrust's digitized collection included 9.71 million total volumes, more than 5.15 million book titles, 256,000 serial titles, 3.4 billion pages, and 435 TB of information.

Wilkin gave Campus Technology a look at how the initiative started, the challenges it has faced in managing huge volumes of data, and how HathiTrust has overcome those obstacles.

Bridget McCrea: How did HathiTrust come about?

John Wilkin: A few years ago we began to realize that a lot of research libraries have the same types of content, and that while there are also many "unique" volumes and titles, putting together what we call a "collective collection" would be extremely beneficial and efficient for all of the various institutions. The idea went over well, and we started with about 23 institutions--a number that's since grown to 60. We just exceeded 9 million volumes, which positions HathiTrust as one of the top 10 research libraries in North America.

McCrea: How did you handle digitization of the content?

Wilkin: College libraries have managed large amounts of content for a long time, but primarily on their respective campuses. So when we embarked on this venture we already had a lot of experience with large-scale digital preservation. With this project, however, we were pressing the boundaries of what we could do with the technology that we were already using.

McCrea: So had you outgrown traditional library content management systems?

Wilkin: We had a number of systems "strapped together" in intelligent ways, but running them required a lot of hardware administration. There were also performance issues, and unanswered questions regarding scaling--such as, how do you go from 100 terabytes of data to 200 terabytes? We really felt like we were at the outer boundaries of what we could do using our current systems.

McCrea: What did you do about it?

Wilkin: We turned to the storage experts at our partner institutions for help in putting together an RFP, which was very competitive and included a lot of analysis. It probably took a year to assemble, including writing up the specs, gathering vendor responses, and filtering out the companies that didn't meet our needs. We completed a very serious evaluation that took into account our scalability issues, data management questions, and the fact that, as institutions, we were working with limited staff and budgets. Basically, we wanted a scalable solution that didn't require us to add staff to support it. After that lengthy RFP process, we selected Isilon's OneFS management tool.

McCrea: What functions does the solution handle for HathiTrust?

Wilkin: We ended up with a high-performing, easily managed, and scalable system that helped us accomplish what we wanted to do. We didn't just want to publish a library online and let it stagnate. We're at more than 9 million volumes and growing daily--most likely to 10 million by the end of this year. The system allows us to add up to 30,000 volumes per day. We can bring content into the repository, validate it, and know exactly what's being stored for the long term. It also handles ongoing validation checks, to make sure content isn't unintentionally being changed or altered along the way.

Open to users worldwide, the archive is actively managed and synchronized across two sites [with materials brought in via one site and then replicated to the other site], and we also do load balancing across those sites.

McCrea: Does this take place seamlessly?

Wilkin: Yes. A team of six people manages the entire digital collection.

McCrea: What's next on the agenda?

Wilkin: We're dealing with a lot of books and journals that our libraries have held for hundreds of years, and that are now being digitized. Right now we're looking pretty closely at a number of different format issues, and experimenting with various formats. Going forward, we'll be examining other new publishing options, such as publishing directly into the HathiTrust archives, inclusive of audio, images, and other multimedia components.

McCrea: What advice would you give another institution or consortium that's grappling with data management problems?

Wilkin: Don't buy technology for a future that you can't already "feel." As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much, and ending up with mismatched systems down the road. The key is to be more pragmatic in what you are doing, and to keep an eye on the future that's closer at hand.
