MIT Rethinks Big Data Processing
By Dian Schaffhauser
Research by a small team at the Massachusetts Institute of Technology may help streamline the processing of big data--the terabytes of streaming data generated by GPS receivers in smartphones and a multitude of other sensors. The basic idea is to create "succinct representations" of huge data sets so that existing algorithms can handle them more efficiently.
As described in "The Single Pixel GPS: Learning Big Data Signals from Tiny Coresets," a paper presented at the Association for Computing Machinery's International Conference on Advances in Geographic Information Systems, three MIT researchers have figured out how to represent data so that it takes up less space in memory while still being processed in conventional ways. That's useful because it means the technique can be used with existing algorithms rather than having to replace them with new ones.
The researchers applied the technique to the processing of two-dimensional location data generated by GPS receivers. According to Daniela Rus, a professor of computer science and engineering and director of MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), these receivers take position readings every 10 seconds. That adds up to about a gigabyte of data each day. Systems that attempt to analyze traffic patterns from readings sent by a massive number of cars can easily be bogged down by the volume of data generated.
What the scientists have figured out is that the analysis doesn't need to encompass every data point generated by a given car--only some of them, such as the points where the car turns. The path between one turn and the next can be approximated by a straight line. The collection of those reduced sets of data forms a new "coreset" that can be compressed on the fly.
The researchers' algorithm has to find a series of line segments that most accurately defines the data points. The algorithm also stores the exact coordinates of a random sampling of the points, which stand in for the potential randomness of the unsampled points in the calculations.
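To make the idea concrete, here is a minimal Python sketch of the two ingredients described above: a handful of line segments plus a random sample of exact points. It is not the researchers' algorithm. The segment fitting uses the classic Douglas-Peucker simplification as a stand-in, and the function names, the `tolerance` and `sample_size` parameters, and the fixed random seed are all assumptions made for illustration.

```python
import math
import random


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b; all are (x, y) tuples."""
    (ax, ay), (bx, by), (px, py) = a, b, p
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the line through a and b, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker (a stand-in, not the paper's method).

    Keeps the endpoints; if some point lies farther than `tolerance`
    from the straight line between them, splits there and recurses.
    """
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    split, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            split, dmax = i, d
    if dmax <= tolerance:
        return [first, last]
    left = simplify(points[:split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


def compress(points, tolerance, sample_size, seed=0):
    """Return (segment corners, random sample of exact points)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    corners = simplify(points, tolerance)
    sample = rng.sample(points, min(sample_size, len(points)))
    return corners, sample
```

For a car that drives east and then turns north, `compress` would keep only the start, the turn and the end as corners, plus a few exact readings. How much is retained depends on `tolerance`, which is the accuracy side of the tradeoff.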
The technique, which rests on a great deal of mathematics, is a tradeoff between "accuracy and complexity," said Dan Feldman, a postdoctoral researcher in Rus' group and lead author on the new paper. It's the combination of linear estimates and random sampling that allows the algorithm to compress data in chunks; as new data arrives, the algorithm recalculates.
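One common way to compress a stream in chunks is a "merge-and-reduce" scheme, in which small summaries are paired up and re-summarized as more data arrives. The Python sketch below illustrates that general pattern only; the article does not say whether the paper uses exactly this scheme. The class name `StreamingSummary`, its parameters, and the `reduce_chunk` placeholder (plain uniform sampling rather than a real coreset construction) are assumptions made for illustration.

```python
import random


def reduce_chunk(points, size, rng):
    """Placeholder reducer: keep at most `size` points, in original order.

    A real coreset construction would go here; uniform sampling is
    only a stand-in.
    """
    if len(points) <= size:
        return list(points)
    keep = sorted(rng.sample(range(len(points)), size))
    return [points[i] for i in keep]


class StreamingSummary:
    """Hypothetical merge-and-reduce summary of a point stream."""

    def __init__(self, chunk_size, summary_size, seed=0):
        self.chunk_size = chunk_size
        self.summary_size = summary_size
        self.rng = random.Random(seed)
        self.buffer = []   # raw points of the chunk currently being filled
        self.levels = []   # levels[i] is None or a summary of 2**i chunks

    def add(self, point):
        self.buffer.append(point)
        if len(self.buffer) == self.chunk_size:
            summary = reduce_chunk(self.buffer, self.summary_size, self.rng)
            self.buffer = []
            self._carry(summary)

    def _carry(self, summary):
        # Works like binary addition: two summaries at the same level
        # are merged, reduced, and carried up to the next level.
        level = 0
        while level < len(self.levels) and self.levels[level] is not None:
            merged = self.levels[level] + summary  # older data first
            summary = reduce_chunk(merged, self.summary_size, self.rng)
            self.levels[level] = None
            level += 1
        if level == len(self.levels):
            self.levels.append(None)
        self.levels[level] = summary

    def stored_points(self):
        """Number of points currently held in memory."""
        held = sum(len(s) for s in self.levels if s is not None)
        return held + len(self.buffer)
```

In this sketch the number of levels grows only logarithmically with the number of chunks seen, so memory stays small even as the stream gets long. The price is that each merge re-approximates data that was already approximated.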
What's the point? In practice, many potential uses for big data are never pursued because the processing they would require is too costly. The MIT team's approach suggests that a slightly inaccurate approximation is better than a calculation that never gets performed at all. The scientists must now identify other uses for the technique that share the characteristics of GPS receiver data.
One application under consideration by Feldman is the analysis of video data. Each scene might be considered comparable to a line segment; the shift from one scene to another is like the car turning. And sample frames from a scene could provide that random sampling.
This isn't the only research being done on campus in the area of big data. In May 2012 MIT was selected to host "bigdata@CSAIL," a new Intel-sponsored research center focused on developing techniques for working with big data.