The Library on a Massive Scale

What do you get when you combine the collections from 60 major research institutions into a single, digitized library? A comprehensive collection, of course, but also a major headache for the people who have to collect, organize, preserve, and publish the information in a user-friendly manner for students, professors, and the general public.

That's the headache facing John Wilkin, associate university librarian for the University of Michigan's Library Information Technology (LIT) department and executive director of HathiTrust at U-M. The latter is a comprehensive digital archive comprising materials from the likes of U-M, Arizona State University, Baylor University, Columbia University, and Dartmouth College, to name just a handful.

At press time, HathiTrust's digitized collection included 9.71 million total volumes, more than 5.15 million book titles, 256,000 serial titles, 3.4 billion pages, and 435 TB of information.

Wilkin gave Campus Technology a look at how the initiative started, the challenges it has faced in managing huge volumes of data, and how HathiTrust has overcome those obstacles.

Bridget McCrea: How did HathiTrust come about?

John Wilkin: A few years ago we began to realize that a lot of research libraries have the same types of content, and that while there are also many "unique" volumes and titles, putting together what we call a "collective collection" would be extremely beneficial and efficient for all of the various institutions. The idea went over well, and we started with about 23 institutions--a number that's since grown to 60. We just exceeded 9 million volumes, which positions HathiTrust as one of the top 10 research libraries in North America.

McCrea: How did you handle digitization of the content?

Wilkin: College libraries have managed large amounts of content for a long time, but primarily on their respective campuses. So when we embarked on this venture we already had a lot of experience with large-scale digital preservation. With this project, however, we were pressing the boundaries of what we could do with the technology that we were already using.

McCrea: So had you outgrown traditional library content management systems?

Wilkin: We had a number of systems "strapped together" in intelligent ways, but running them required a lot of hardware administration. There were also performance issues, and unanswered questions regarding scaling--such as, how do you go from 100 terabytes of data to 200 terabytes? We really felt like we were at the outer boundaries of what we could do using our current systems.

McCrea: What did you do about it?

Wilkin: We turned to the storage experts at our partner institutions for help in putting together an RFP, which was very competitive and included a lot of analysis. It probably took a year to assemble, including writing up the specs, gathering vendor responses, and filtering out the companies that didn't meet our needs. We completed a very serious evaluation that took into account our scalability issues, data management questions, and the fact that, as institutions, we were working with limited staff and budgets. Basically, we wanted a scalable solution that didn't require us to add staff to support it. After that lengthy RFP process, we selected Isilon's OneFS management tool.

McCrea: What functions does the solution handle for HathiTrust?

Wilkin: We ended up with a high-performing, easily managed, and scalable system that helped us accomplish what we wanted to do. We didn't just want to publish a library online and let it stagnate. We're at more than 9 million volumes and growing daily--most likely to 10 million by the end of this year. The system allows us to add up to 30,000 volumes per day. We can bring content into the repository, validate it, and know exactly what's being stored for the long term. It also handles ongoing validation checks, to make sure content isn't unintentionally being changed or altered along the way.

Open to users worldwide, the archive is actively managed and synchronized across two sites [with materials brought in via one site and then replicated to the other site], and we also do load balancing across those sites.

McCrea: Does this take place seamlessly?

Wilkin: Yes. A team of six people manages the entire digital collection.

McCrea: What's next on the agenda?

Wilkin: We're dealing with a lot of books and journals that our libraries have held for hundreds of years, and that are now being digitized. Right now we're looking pretty closely at a number of different format issues, and experimenting with various formats. Going forward, we'll be examining other new publishing options, such as publishing directly into the HathiTrust archives, inclusive of audio, images, and other multimedia components.

McCrea: What advice would you give another institution or consortium that's grappling with data management problems?

Wilkin: Don't buy technology for a future that you can't already "feel." As universities, we try to future-proof our problems and we sometimes end up buying speculatively on a future that isn't quite here yet. We wind up spending too much, and ending up with mismatched systems down the road. The key is to be more pragmatic in what you are doing, and to keep an eye on the future that's closer at hand.
