U California Researchers Release Beta for Big Data Management
By Dian Schaffhauser
A team of California universities has released a beta version of a system for managing big data along with more traditional forms of data. Researchers from the University of California campuses at Irvine, Riverside, and San Diego have banded together to create AsterixDB, a Java-based "big data management system" (BDMS).
The work began in 2009 with funding from the National Science Foundation and, later, from the state of California and other sources. The goal was to create a set of new technologies for "ingesting, storing, managing, indexing, querying, and analyzing vast quantities of semi-structured information." The researchers drew on ideas from three areas (semi-structured data, parallel databases, and data-intensive computing) to create a "next generation" open source application that could run on large clusters of commodity computers.
At the heart of the system, the AsterixDB engine uses a "shared nothing" architecture: each computer in the cluster runs independently and is self-sufficient, sharing neither memory nor disk storage with the others.
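To make the "shared nothing" idea concrete, here is a minimal sketch in Python. It is not AsterixDB code and does not use its API. The `Node` and `Cluster` classes, the CRC32-based partitioning, and the sample records are all assumptions made for illustration. Each node keeps its own private storage, records are routed to a node by hashing a key, and a query is answered by having every node work on its own partition before the partial results are merged.

```python
import zlib
from collections import Counter


class Node:
    """One hypothetical machine: private storage, nothing shared with peers."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.records = []  # local storage only

    def insert(self, record):
        self.records.append(record)

    def local_count_by(self, field):
        # Aggregate over this node's partition only; records that lack
        # the field (semi-structured data) are simply skipped.
        return Counter(r[field] for r in self.records if field in r)


class Cluster:
    """Routes records to nodes by key hash and merges per-node results."""

    def __init__(self, num_nodes):
        self.nodes = [Node(i) for i in range(num_nodes)]

    def node_for(self, key):
        # Deterministic hash, so a given key always lands on the same node.
        index = zlib.crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def insert(self, key, record):
        self.node_for(key).insert(record)

    def count_by(self, field):
        # Scatter the query to every node, then gather the partial counts.
        total = Counter()
        for node in self.nodes:
            total.update(node.local_count_by(field))
        return total


# Illustrative records with irregular fields (assumed sample data).
posts = [
    ("u1", {"source": "blog", "text": "hello"}),
    ("u2", {"source": "sensor", "reading": 21.5}),
    ("u3", {"source": "blog", "text": "again", "tags": ["db"]}),
    ("u4", {"text": "no source field"}),
]

cluster = Cluster(num_nodes=3)
for key, record in posts:
    cluster.insert(key, record)

counts = cluster.count_by("source")
print(counts)
```

In this sketch no node ever reads another node's records. The only coordination is the final merge of small per-node summaries, which is why this style of design can be spread across large clusters of commodity machines.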
"We're providing a next-generation platform for storing, managing, coordinating, and making use of Big Data," said Michael Carey, a UC Irvine professor leading the work. Big data is, of course, the output generated moment by moment by numerous online sources, including blogs, micro-blogging sites, transactions, sensors, status updates, and other computing activities. The challenge of managing that data with traditional database management technologies is that it is generated increasingly faster, takes multiple forms, and isn't easily categorized for rapid analysis.
According to an overview posted on the AsterixDB site, the work targets a range of usage scenarios, from cases where information is well-typed and highly regular (and predictably so) to situations where the content is textual, irregular, and therefore "hard to anticipate up front." Technical work has focused on highly scalable data storage and indexing, query processing of semi-structured data on very large clusters, and the merging of techniques from parallel database processing and data-intensive computing.
"Big Data crosses a lot of domains, from government to health care to business," noted Carey. "It's hard for us to imagine an area where AsterixDB can't contribute."
Now the authors of the system are hoping to extend real-world testing by finding partners that can use the platform in various domains generating big data. Those environments may currently be using data management schemes based on Apache projects Hadoop, Pig, Hive, and HBase as well as MongoDB, among others.
"We're putting AsterixDB out in an unrestricted open-source form," Carey explained. "Users can do whatever they want with it, and we can learn from what they do and further improve our platform based on their needs."