Digital Libraries

7 Tips for Data Mining Your Digital Archives

Content mining of archival materials can make for amazing discoveries. Here's how to prepare for the coming influx of researchers who will want access to digital archives as a data source.

Today's digital archives are more than a repository of content; they are a new data source for researchers. The computationally intensive research being performed in the digital humanities, for example, can chart data such as word frequency, proximity, source and other factors to 32 dimensions across decades of previously published journals, newspapers, annual reports or other documents, all in a matter of hours. "The human mind can't even conceive of what that would be, but the computer can do that," said Darby Orcutt, the assistant head of collection management at North Carolina State University Libraries.
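To make "word frequency" and "proximity" concrete, here is a minimal Python sketch of both measures. It is an illustration only, not the method used by any researcher quoted here; the sample records, the function names and the five-word window are all assumptions made up for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical sample records: (publication year, text). A real project
# would read these from the archive's own files.
SAMPLE_DOCS = [
    (1891, "The railway reached the town; the railway changed trade."),
    (1897, "Trade on the river declined as the railway grew."),
    (1904, "The motor car appeared beside the railway."),
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_by_decade(docs):
    """Count how often each word occurs, grouped by decade of publication."""
    counts = defaultdict(Counter)
    for year, text in docs:
        decade = (year // 10) * 10
        counts[decade].update(tokenize(text))
    return counts


def cooccurrences(text, first, second, window=5):
    """Count occurrences of `second` within `window` tokens of each `first`."""
    tokens = tokenize(text)
    hits = 0
    for i, token in enumerate(tokens):
        if token == first:
            nearby = tokens[max(0, i - window): i + window + 1]
            hits += nearby.count(second)
    return hits


if __name__ == "__main__":
    by_decade = frequency_by_decade(SAMPLE_DOCS)
    for decade in sorted(by_decade):
        print(decade, by_decade[decade].most_common(3))
```

Run across millions of pages rather than three sentences, counts like these are the raw material for the multidimensional analysis Orcutt describes.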

Orcutt's institution has been working hard to line up agreements with digital archive vendors to allow its researchers to data mine those archives — even in cases where there may not yet be a specific research need. He knows that one day the access will be requested and he wants to be ready.

Data mining of digital archives hasn't necessarily found wide pickup in higher education — or among the vendors that supply those archival materials. Orcutt points to culture as the culprit: "Among certain of the vendors there's really a culture of fear. This is something new and different. On the part of the libraries there's a culture that leans towards wanting to figure it all out — this whole world of data mining — before we do anything."

He has taken a different tack. "I've been saying, let's go ahead and enter into these arrangements. Let's go ahead and make this content available and not worry about every clause of every agreement being absolutely perfect."

Orcutt and Iris Hanney, president of Unlimited Priorities, a firm that consults with both libraries and vendors with archival collections, recently shared their advice for how to approach the business of content mining with digital archives.

1) Plan on Being a "Permanent Access Owner"

Most aggregators sell their content in one of two ways, Hanney explained. "You can subscribe for a year and have access for a year. Or 'permanent access' gives you the right to use the content in perpetuity and pay an annual maintenance fee to keep that data going."

With an annual subscription, once researchers have copied data off the vendor's servers for a research project, the vendor has no way to take it back when the subscription ends. "There's not a researcher in the world who's going to go to this length and accept that it's only going to be available for a year. It would defeat the purpose of the research," she said.

If your researchers need access for digital content mining, make sure you hold a permanent access license. "That becomes important to the aggregator to keep it feasible financially," she added.

2) Expect to Work On-Premises, Not Online

Be prepared to run your archive data mining activities on your own servers, not on the publishers' servers. Hanney said she learned the hard way as a vendor that allowing researchers to data mine in the same environment where the databases are hosted will cause myriad problems. "I pay for storage space. I pay for activity. I pay for COUNTER-compliant statistics. I pay to maintain the data in an environment that allows researchers to sit in their offices, call up our database online and search their hearts out." But allowing computer programs to run the research means "my usage factor on my server goes ballistic. All of a sudden, statistics go through the roof because millions of computers are doing millions of things."

At the same time, Orcutt pointed out, that rush of activity will prevent other researchers from gaining access to the resources. Frankly, he said, "for folks doing this work, even with the bandwidth we have on university campuses, it would be slow as molasses to do this online."

3) Don't Get Picky About Formats

As the old nugget goes, the wonderful thing about standards is that there are so many of them to choose from. The same holds true with formats for archival databases. As Hanney explained, aggregators have created the data in formats used to load the data into their systems for use by the public. "Don't come to me and say you want a different format because that means I have to go through gigabytes of data and convert it to conform to your needs. Am I happy to do that? You bet, if you want to pay me to do that."

The format issue has prevented many a library from pursuing access to digital archives for data mining, added Orcutt. "Colleagues from other libraries have said to me that they're not interested in these things unless they meet certain standards of conformity. Some of these vendors are never going to get to the point where they can produce exactly the format that people want. There's not even agreement in the library or scholarly community about what that format would look like. My feeling is that essentially we've got to go ahead and nail down what we can get."

Where there's a mismatch, Hanney advised, institutions "need to build something that conforms to what the greater population of the vendor world is utilizing as their standards."

4) Ask for Everything

In spite of a cautionary tone regarding what libraries may expect to receive from vendors, Orcutt works with an expectation that he can get "everything." "I want the XML files. I want the image files. I want the OCR data. And I want to know, at least to the extent that I can, something of the history of the product in terms of how those files were produced. That's the sort of information that's going to be useful to a researcher."

He fears that many libraries are waiting for "the ideal" before they move ahead to secure digital access rights. "I would rather secure the access and then if things change, go from there."

Besides, he added, every researcher comes to a project with his or her own ideal for how the data should be structured. "But there's an awful lot of prep work that goes into this. That's one of the things that's important to understand. It's not a matter of taking these files and plugging them into a computer program and saying, 'Here. Read it.'"
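As one small illustration of that prep work, the Python sketch below tidies raw OCR text before analysis. It is an assumption about what a typical cleanup step might look like, not a description of any vendor's files or of North Carolina State's workflow; the function name and the sample string are invented for the example.

```python
import re


def clean_ocr_text(raw):
    """Tidy raw OCR output: rejoin broken words and normalize spacing."""
    # Rejoin words split by a hyphen at a line break ("news-\npaper").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Collapse any run of whitespace, including line breaks, to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


if __name__ == "__main__":
    sample = "The news-\npaper  was\nprinted in 1902."
    print(clean_ocr_text(sample))
```

Even this simple rule has a cost: it also fuses genuine hyphenated compounds that happen to break across lines, which is the kind of trade-off each project has to settle for its own data.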

5) Pursue Access Rights, Not Just Data Mining Rights

Both Orcutt and Hanney agree that libraries should spend time defining what they really need — and right now, for Orcutt, that translates to access rights, not a data dump. "What we need to be doing is securing our ability to access these things. But I don't think that means we need to be saying, 'OK, let's download it all just in case somebody needs it.'"

But access rights are a tricky issue. Orcutt doesn't believe libraries necessarily need an explicit right to do data mining — unless it's something that a prior agreement has ruled out. "I'm not of the school that data or text mining rights are something we need to have explicitly granted to us." The flip side to that, however, is that none of his library's current agreements clearly state that researchers have the ability to access data sets in order to conduct their mining activities. "It's all well and good for me to say if I had the whole data on my server, I could mine it. But if I can't get the whole data on my server, it's irrelevant. So that's what we're agreeing to — that we can have access to the data for these purposes."

Orcutt added that virtually all of the contracts his institution has with vendors "do not allow somebody to go in and download huge chunks of information — for very good reasons."

He has seen model and actual contracts that stipulate that the library has the right to do mining but say nothing about how to get the data. Then, when the library requests access, the vendor comes back with a pricey proposal to make the data available in the format the school requested.

6) A Cost-Recovery Model May Be Your Best Option

To avoid unexpected (and egregious) vendor charges for gaining access to data, one approach is to work toward a "cost recovery" model. That's the kind of arrangement North Carolina State made with Gale in 2014. "The agreement we struck allows you to request a hard drive with all the files on it and to pay a modest cost recovery, somewhere around $200 or $300, to have the hard drive delivered to you," Orcutt said. "That is an inelegant solution, but it may be the most elegant of all the inelegant solutions."

The company immediately turned around and offered a similar arrangement to all of its other institutional customers.

Orcutt urges libraries and vendors to agree up front on what cost recovery means. "For some vendors their only experience of doing this thus far is with the 'boutique' formatting and selection of the data set for a particular research need. That's very expensive. This shouldn't be a profit center for the vendor. But it also shouldn't be a loss."

7) Understand What Your Researchers Are Doing

Deans from other libraries have told Orcutt that they don't have anybody involved with data mining, so they don't consider it important. "I guarantee that's because they don't understand what their researchers are doing," he observed. "I think you'd be hard pressed to find a single place where somebody isn't data mining."

That's why he encourages the library community, in cooperation with vendors, to do planning now. "These agreements can take a long time. My fear is, if we don't go ahead and nail them down now, they won't be there when we need them, when our researchers need them."