The Library on a Massive Scale

What do you get when you combine the collections from 60 major research institutions into a single, digitized library? A comprehensive collection, of course, but also a major headache for the people who have to collect, organize, preserve, and publish the information in a user-friendly manner for students, professors, and the general public.

That's the headache facing John Wilkin, associate university librarian for the University of Michigan's Library Information Technology (LIT) department and executive director of HathiTrust at U-M. The latter is a comprehensive digital archive comprising materials from the likes of U-M, Arizona State University, Baylor University, Columbia University, and Dartmouth College, to name just a handful.

At press time, HathiTrust's digitized collection included 9.71 million total volumes, more than 5.15 million book titles, 256,000 serial titles, 3.4 billion pages, and 435 TB of information.

Wilkin gave Campus Technology a look at how the initiative started, the challenges it has faced in managing huge volumes of data, and how HathiTrust has overcome those obstacles.

Bridget McCrea: How did HathiTrust come about?

John Wilkin: A few years ago we began to realize that a lot of research libraries have the same types of content, and that while there are also many "unique" volumes and titles, putting together what we call a "collective collection" would be extremely beneficial and efficient for all of the various institutions. The idea went over well, and we started with about 23 institutions--a number that's since grown to 60. We just exceeded 9 million volumes, which positions HathiTrust as one of the top 10 research libraries in North America.

McCrea: How did you handle digitization of the content?

Wilkin: College libraries have managed large amounts of content for a long time, but primarily on their respective campuses. So when we embarked on this venture we already had a lot of experience with large-scale digital preservation. With this project, however, we were pressing the boundaries of what we could do with the technology that we were already using.

McCrea: So had you outgrown traditional library content management systems?

Wilkin: We had a number of systems "strapped together" in intelligent ways, but running them required a lot of hardware administration. There were also performance issues, and unanswered questions regarding scaling--such as, how do you go from 100 terabytes of data to 200 terabytes? We really felt like we were at the outer boundaries of what we could do using our current systems.

McCrea: What did you do about it?

Wilkin: We turned to the storage experts at our partner institutions for help in putting together an RFP, which was very competitive and included a lot of analysis. It probably took a year to assemble, including writing up the specs, gathering vendor responses, and filtering out the companies that didn't meet our needs. We completed a very serious evaluation that took into account our scalability issues, data management questions, and the fact that, as institutions, we were working with limited staff and budgets. Basically, we wanted a scalable solution that didn't require us to add staff to support it. After that lengthy RFP process, we selected Isilon's OneFS management tool.

McCrea: What functions does the solution handle for HathiTrust?

Wilkin: We ended up with a high-performing, easily managed, and scalable system that helped us accomplish what we wanted to do. We didn't just want to publish a library online and let it stagnate. We're at more than 9 million volumes and growing daily--most likely to 10 million by the end of this year. The system allows us to add up to 30,000 volumes per day. We can bring content into the repository, validate it, and know exactly what's being stored for the long term. It also handles ongoing validation checks, to make sure content isn't unintentionally being changed or altered along the way.

Open to users worldwide, the archive is actively managed and synchronized across two sites [with materials brought in via one site and then replicated to the other site], and we also do load balancing across those sites.
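The ongoing validation checks Wilkin describes are commonly done by recording a checksum for each file at ingest and recomputing it later. The following is a minimal sketch of that general idea, not HathiTrust's or Isilon's actual implementation; the function names, the choice of SHA-256, and the in-memory manifest are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical ingest step)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def find_changed(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return manifest entries that are missing or no longer match their checksum."""
    changed = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            changed.append(name)
    return changed
```

Under these assumptions, build_manifest would run once when content enters the repository, and find_changed would run on a schedule at each site. An empty result means nothing was altered or lost since ingest.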

McCrea: Does this take place seamlessly?

Wilkin: Yes. A team of six people manages the entire digital collection.

McCrea: What's next on the agenda?

Wilkin: We're dealing with a lot of books and journals that our libraries have held for hundreds of years, and that are now being digitized. Right now we're looking pretty closely at a number of different format issues, and experimenting with various formats. Going forward, we'll be examining other new publishing options, such as publishing directly into the HathiTrust archives, inclusive of audio, images, and other multimedia components.

McCrea: What advice would you give another institution or consortium that's grappling with data management problems?

Wilkin: Don't buy technology for a future that you can't already "feel." As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much, and end up with mismatched systems down the road. The key is to be more pragmatic in what you are doing, and to keep an eye on the future that's closer at hand.
