Big Data

In the first installment of a two-part series, CT explains what Big Data is and explores its potential for improving student learning and success.

Colleges and universities are swimming in an ever-widening sea of data. We all are. Human beings and machines together generate about 2.5 quintillion (10^18) bytes every day, according to IBM's latest estimate. The sources of all that data are dizzyingly diverse: e-mail, blogs, click streams, security cameras, weather sensors, social networks, academic research, and student portfolios, to name just a few. And it's all coming at us at warp speed: Google alone reportedly processes 24 petabytes (that's a quadrillion--10^15--bytes) every day.

The industry buzz phrase for this phenomenon is "Big Data," which loosely refers to data sets too large and/or diverse for conventional tools to manage and mine efficiently. For colleges and universities, Big Data presents a challenge that will only get...well...bigger. But approached with the right tools and strategies, Big Data also offers an incredibly rich resource for improving retention rates, fine-tuning curricula, and supporting students, faculty, and administration in myriad ways.


In higher education, Big Data may be seen in two distinct contexts: 1) as a product of research institutions that are charged with gathering, managing, and curating a wide range of structured and unstructured data; and 2) as a resource for predictive analytics.

The former is not exactly a new phenomenon, although the sources and velocities of the data streams are expanding and accelerating. But the latter has emerged as a way to leverage a variety of data sources--some new, some not--to help guide students along course and degree paths that will lead to higher graduation rates.

Predictive analytics, which applies statistical techniques to data to forecast likely outcomes, is not a new process either. The difference now is scale: Applied to vast amounts of data from a huge variety of sources, predictive analytics now seems capable of achieving its crystal ball promise.
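
To make the idea concrete, here is a minimal sketch of the kind of retention model such analytics might involve, written in Python with scikit-learn. The features, figures, and retention labels are hypothetical, chosen purely for illustration; a real model would draw on far more data from an institution's own systems.

```python
# A minimal, illustrative sketch of predictive analytics for student retention.
# All feature names and values below are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical records: [gpa, credits_attempted, lms_logins_per_week]
X_train = np.array([
    [3.6, 15, 12],
    [2.1, 12,  2],
    [3.0,  9,  7],
    [1.8, 15,  1],
])
y_train = np.array([1, 0, 1, 0])  # 1 = student was retained, 0 = student left

# Fit a simple statistical model to the historical outcomes.
model = LogisticRegression()
model.fit(X_train, y_train)

# Forecast the retention probability for a current student.
new_student = np.array([[2.7, 12, 4]])
print(model.predict_proba(new_student)[0][1])  # estimated probability of retention
```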

New Possibilities
"The thing you have to understand about Big Data is that it gives us something we haven't had before," says Gerry McCartney, CIO at Purdue University (IN). "Analyzing these massive data sets allows us to extract patterns, which you just can't obtain from smaller data sets, that allow us to predict, say, how a student will do in class. That predictive capability is a direct result of the volumes of data being analyzed."

In other words, you can see more useful things in the data when more of it is available. Of course, "big" is a relative term. One organization's terabyte is another's gigabyte. Traditionally, in most organizations, predictive analytics has involved data volumes in the high gigabytes and low terabytes. In small and mid-sized organizations today, it probably remains in that realm. But don't expect it to stay that way.

"In the last few years, we've crossed a threshold," says James Kobielus, a former Forrester analyst who now serves as Big Data evangelist at IBM. "Suddenly, we're seeing multi-terabytes--tens, hundreds, even thousands of terabytes, aka petabytes--coming into analytic application environments in many industries. In many ways it's still a specialized environment; it's not as if every business intelligence application needs that much data under the covers just yet. But it's moving toward that volume over the next several years over a wide range of use cases."

But it would be a mistake to think that the promise--and challenge--of Big Data is simply tied up in the size of the data sets available for analysis. "In my opinion, 'Big Data' is kind of a misnomer," says Darren Catalano, associate vice president of business intelligence for the University of Maryland University College (UMUC). "We've always had systems that have generated a whole lot of data. It's just that now we're paying attention to that data. We used to look at it as something that facilitates business processes--something very operational. But now we're applying advanced analytic techniques to these large data sets."

And it is these analytic techniques that represent the golden key to unlocking the secrets of giant data sets. As analysts peer into these very large data sets, patterns emerge, nuances appear, and trends reveal themselves. Big Data provides more granularity than smaller data sets, and--if you've got enough of the right type of data--you get what Kobielus calls a 360-degree portrait of a student, his world, what's going on in his mind, and what he's likely to do.

"You can get a deeper and more nuanced portrait of what your customers--the students--like and don't like, or the kinds of courses they would like to sign up for, or the kinds of majors they want to pursue," explains Kobielus. "This isn't just a way of selling a customer more stuff or signing up students for more expensive courses; it's about making those students happier and more fulfilled, and maybe leading them on their way to faster completion of a course of study, based on giving them fine-tuned guidance throughout their time at the university."

The Three V's
The analytic techniques and algorithms needed to identify actionable trends are far more complex than similar efforts of previous decades, for one simple reason: The data sets encompass data points that go way beyond easily quantified measurements.

"If you go by the numbers, most of the data being generated now is unstructured and semi-structured data," says Anjul Bhambhri, IBM's vice president of Big Data. "It's not just social data, although you hear a lot of talk about social data in Big Data circles. But enterprises are collecting a lot of log data, and there's a structured and unstructured component to this type of data. For example, enterprises are doing a lot of work around network-performance management, predicting maintenance-schedule requirements, and analyzing logs. I was at a conference recently where a company with a solution in the log analytics space talked about 2.5 exabytes [2.5 quintillion bytes] of log data being getting generated every two days."

The sheer variety of data is one of three defining characteristics of Big Data, commonly referred to as the Three V's--variety, volume, and velocity. For many higher education institutions, velocity and variety--not volume--are the trickier aspects of Big Data to manage, says Charles Thornburgh, CEO and founder of Civitas Learning, an Austin, TX-based predictive analytics company.

"The variety of the data causes the first hurdle," notes Thornburgh. "Many different autonomous systems are collecting very different types of data. Even schools that have deployed a data warehouse are unlikely to be centralizing all of the data in a manner that supports empirical mining."

Today, school decisions about what data to collect are often based on compliance needs, rather than a desire to be able to use the data predictively. "Additionally," adds Thornburgh, "the ability to scale the processing of this data in a near-real-time or even a nightly job is something that most schools could find challenging."

Indeed, the velocity at which this data is collected is only going to get faster. "We're now moving more toward real-time, continuous-streaming acquisition of data from various sources for a lot of different applications," explains Kobielus.

To a certain degree, schools are beneficiaries of the fact that so much of their student information is stored in relational databases, making it easy to slice and dice the data. "Most of the data in higher education is dramatically more structured," Thornburgh notes, citing student information systems (SIS) in particular. "This is precise information about all these students: exactly who they are, which classes they took, what grades they received in those classes. Each of those events is not just the click of a keystroke; it's the manifestation of months of work on both the faculty member's and the student's behalf."
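
As a rough illustration of what slicing and dicing structured SIS records can look like, the short Python sketch below queries a tiny, made-up enrollment table using the standard library's sqlite3 module. The schema, student IDs, and grades are hypothetical.

```python
import sqlite3

# Build a tiny in-memory table resembling the structured records an SIS holds.
# The schema and rows are hypothetical, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrollments (student_id INTEGER, course TEXT, term TEXT, grade TEXT)"
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?, ?)",
    [
        (1001, "MATH 101", "Fall 2011", "B"),
        (1001, "ENGL 120", "Fall 2011", "A"),
        (1002, "MATH 101", "Fall 2011", "D"),
    ],
)

# Because the data is structured, a question like "how many students earned
# below a C in each course?" is a one-line query.
for course, count in conn.execute(
    "SELECT course, COUNT(*) FROM enrollments WHERE grade IN ('D', 'F') GROUP BY course"
):
    print(course, count)
```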

Unfortunately, the very structure that makes it easy to analyze aspects of student data stands at odds with another underlying concept behind Big Data: flexibility. While relational databases are great at serving up data for preconfigured purposes, it can be a bear to set them up to generate different results, even when the amount of data involved is of moderate size.

"That's one of the key limitations of traditional designs today," insists Charles Zedlewski, vice president of the products group at Cloudera, one of the leading commercial supporters of Hadoop and a range of Big Data solutions and services. "Typically, once you set up a database, it's difficult and expensive to change later. In Big Data, the whole point is that you're acquiring so much data that it's not realistic to assume up front all the different ways that you're going to use it. You can't possibly predict that. So how do you make it possible to experiment and change a lot at very little cost?"

The New Tools of Big Data

The growth in the volume of the world's data is currently outpacing Moore's Law, which posits that the number of transistors on integrated circuits doubles approximately every two years. In other words, notes Charles Zedlewski, vice president of the products group at Cloudera, the pace of microprocessor innovation is not keeping up with the rate at which data is being created.

"Keep in mind that an ever-higher fraction of that data cannot be readily organized into the traditional rows and columns of a database," adds Zedlewski. "These two phenomena are basically starting to break the traditional architectures and technologies people have used for the past 20-30 years to manage data."

Enter Hadoop, an open source platform for data-intensive, distributed computing that has become synonymous with Big Data. The Apache Hadoop project was originally developed at Yahoo by Doug Cutting, now an architect at Cloudera. (The project was named for his son's stuffed elephant.)

At its core, Hadoop combines an implementation of Google's MapReduce programming model with the Hadoop Distributed File System (HDFS). MapReduce processes and generates large data sets by splitting work into parallel computations that can run on large clusters of commodity hardware, where individual machines may fail. HDFS is designed to scale to petabytes of storage and to run on top of the file systems of the underlying operating system. In 2009, Yahoo released to developers the source code for its internal distribution of Hadoop.
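
The MapReduce model itself can be pictured in miniature. The plain-Python sketch below imitates the phases Hadoop distributes across a cluster: a map step that emits key-value pairs, a shuffle that groups them, and a reduce step that aggregates them, using a word count as the classic example. On a real cluster, the same logic runs in parallel over far larger inputs.

```python
from collections import defaultdict

documents = ["big data big promise", "big data bigger challenge"]

# Map phase: each document is turned into (key, value) pairs independently,
# which is what lets Hadoop farm this step out across many machines.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each key's values into a single result.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'promise': 1, 'bigger': 1, 'challenge': 1}
```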

"It was essentially a storage engine and a data-processing engine combined," explains Zedlewski. "But Hadoop today is really a constellation of about 16 to 17 open source projects, all building on top of that original project, extending its usefulness in all kinds of different directions. When people say 'Hadoop,' they're really talking about that larger constellation."

Cloudera is a provider of Hadoop system-management tools and support services. Its Hadoop distribution, dubbed the Cloudera Distribution Including Apache Hadoop (CDH), is a data-management platform that combines a number of components, including support for the Hive and Pig languages; the HBase database for random, real-time read/write access; the Apache ZooKeeper coordination service; the Flume service for collecting and aggregating log and event data; Sqoop for relational database integration; the Mahout library of machine learning algorithms; and the Oozie server-based workflow engine, among others.

The sheer volume of data is not why most customers turn to Hadoop. Instead, it's the flexibility the platform provides. "It's the idea that you can hold on to lots and lots of data without having to predetermine how you're going to use it, and still make productive use of it later," says Zedlewski.

Make no mistake, Hadoop can handle the big stuff. Speaking at the annual Hadoop Summit in California this summer, Facebook engineer Andrew Ryan talked about his company's record-setting reliance on HDFS clusters to store more than 100 petabytes of data.

Hadoop is already an industry standard in the world of Big Data, and it is increasingly showing up in computer science curricula on campuses. One of Cloudera's co-founders, Jeff Hammerbacher, created a data-science class that he teaches at the University of California, Berkeley. And both Stanford University and the Massachusetts Institute of Technology require students in introductory computer science courses to write a MapReduce job.

Perhaps the most famous application of Hadoop is in IBM's Watson computer, which beat two former "Jeopardy" champions on national television in 2011.

As important as it is, Hadoop is just one of the technologies emerging to support Big Data analytics, according to James Kobielus, IBM's Big Data evangelist. NoSQL, a class of non-relational database-management systems, encompasses key-value stores and other approaches to storing and analyzing data, much of it unstructured content. New social graph analysis tools are applied to many of the new event-based sources to analyze relationships and enable customer segmentation by degrees of influence. And so-called semantic web analysis (which leverages the Resource Description Framework specification) is critical for many text analytics applications.
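
To give a flavor of the first two of those approaches, the sketch below stands in for a key-value store with an ordinary Python dictionary and measures "influence" in a tiny, invented social graph by counting incoming connections. The records and the influence measure are purely illustrative; production NoSQL databases and graph-analysis tools do this kind of lookup and traversal at vastly larger scale.

```python
# A key-value store in miniature: records are fetched by key, and each value
# can hold loosely structured data rather than fixed columns. All data is hypothetical.
student_profiles = {
    "s1001": {"major": "Biology", "posts": 42},
    "s1002": {"major": "History", "posts": 7},
}
print(student_profiles["s1001"]["major"])

# A toy social graph: each edge is a hypothetical "follows" relationship.
follows = [
    ("s1001", "s1002"),
    ("s1003", "s1001"),
    ("s1004", "s1001"),
]

# A crude measure of influence: how many others follow each student.
influence = {}
for follower, followed in follows:
    influence[followed] = influence.get(followed, 0) + 1
print(influence)  # {'s1002': 1, 's1001': 2}
```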

Pressure to Change
Until recently, the relational database has worked fine, because the level of analysis was fairly superficial. Traditionally, higher education has tended to think of analytics as a matter of slicing and dicing basic outcomes information by criteria such as ethnicity, geography, or Pell status. Such an approach can identify big-picture gaps in equity and effectiveness, but it's not really predictive.

Looking ahead, such a broad-brush approach is unlikely to be viable. "Higher education is facing a lot of pressure these days from an accountability standpoint," notes Cole Clark, global vice president in database giant Oracle's education and research industries group. "It's really being pushed hard to improve student outcomes and demonstrate that the money spent on higher education is producing the kinds of outcomes we all want to see. That pressure has pushed schools to look at ways to pull meaningful data out of their SISs, LMSs, and, increasingly, social media."

Purdue is one school that has invested heavily in the development of homegrown systems that leverage these kinds of resources, but its CIO believes that the majority of colleges and universities have just scratched the surface of what might be possible with Big Data analytics.

"Every higher education institution has this data, but it just sits there like gold in the ground," explains McCartney. "They have to do more than just manage it. Big Data and the new tools we're seeing now are about mining that gold. It's about extracting real value from the data that we're now accumulating at a ferocious rate."

And, in McCartney's eyes, the payoff promised by Big Data exceeds even the initial hype. "I think Big Data is going to have a bigger impact on regular people than the internet," he declares. "The internet is basically just a delivery mechanism. Now I can read my newspaper online, I've got on-demand movies, I've got VoIP, I can query Google. There's not a lot in there I couldn't do before--the internet just gave me more convenient and universal access to it. But Big Data gives me something I've never had before."

Big Data, Part II
In the November digital issue of CT, writer John K. Waters will examine some of the Big Data initiatives currently under way at colleges and universities.
