Hadoop Summit: Yahoo Gathers the Stuffed Elephant Crowd

Yahoo hosted the first-ever Apache Hadoop Summit this week in Santa Clara, CA. The day-long event presented a program of speakers from the Hadoop developer and user communities, including representatives from Yahoo, IBM, Microsoft, Facebook, Google, and the University of California, Berkeley, among others.

The event drew around 500 attendees, though organizers were unsure of the exact number. Caught off guard by the turnout, they had to change venues to accommodate a standing-room-only crowd.

"We organized the summit because we've been investing a lot in Hadoop ourselves, and we knew there was a large community of Hadoop users out there that mostly haven't met each other," said Yahoo Technical Evangelist Jeremy Zawodny. "I guess it was larger than we thought."

The Hadoop framework is an open source, Java-based distributed computing platform designed to allow implementations of MapReduce to run on large clusters of commodity hardware. MapReduce, developed at Google, is a programming model for processing and generating large data sets. The model supports parallel computations over those data sets on clusters of unreliable machines.
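To make the programming model concrete, here is a minimal, single-process sketch of a MapReduce-style word count in Python. It is an illustration only, not Hadoop's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the sample input is made up. A real framework would spread the map and reduce steps across many machines and rerun any tasks that fail.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step (hypothetical helper): emit a (word, 1) pair per word."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step (hypothetical helper): sum all counts for one word."""
    return word, sum(counts)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, List[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Run map, group-by-key (shuffle), and reduce in one process.

    This only imitates the data flow of the model; nothing here is
    distributed or fault tolerant.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: collect values by key
    return dict(reducer(key, values) for key, values in groups.items())


# Made-up sample input, for illustration only.
lines = ["the elephant is yellow", "the elephant is stuffed"]
word_counts = run_mapreduce(lines, map_words, reduce_counts)
```

Because each map call sees only one record and each reduce call sees only one key, the work can in principle be split across machines without the functions needing to know about one another.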

Yahoo hired Hadoop's creator, Doug Cutting, early last year to work full-time on the framework. Cutting created the Lucene open source information retrieval library and, with Mike Cafarella, the Nutch open source search engine based on it. Both projects are now managed through the Apache Software Foundation.

"The momentum around Hadoop is growing every day," Cutting said. "It's really exciting to watch."

Cutting called Yahoo's resource commitment to the Hadoop framework "considerable," but offered no details. Yahoo has made a very public commitment to Hadoop. In February, it launched what company representatives claimed was the world's largest Hadoop production application. Called the Yahoo Webmap, the application runs on a 10,000-plus-core Linux cluster and produces data used in every Yahoo Web search query, according to company literature.

The initial intended use of Hadoop within Yahoo was to support Web search, Cutting said, by building the Web search index and maintaining that massive collection of data. But although it is making the Yahoo search engine more easily scalable and reliable, he said, the majority of in-house users are actually employing Hadoop for data exploration.

"It turns out that there are all these other people within the company who want to be able to access and analyze these massive data sets -- access logs, event logs, Web and geographic data -- and use them to improve the Web search software itself," Cutting said. "So they're using Hadoop for analysis to improve the software, as opposed to actually implementing the Web search. That's where we're seeing the big payoff."

And that's where he expects other companies to jump on the Hadoop bandwagon.

"The data exploration is more generalizable to lots of businesses, and that's why we're seeing all this interest," he added. "Companies are amassing more and more data, and they need to explore it. The tools that are out there for doing ad hoc exploration and analysis of new data sets aren't as convenient."

Along with several Yahoo representatives, the roster of summit presenters included the IBM Almaden Research Center's Kevin Beyer, who described how to use JAQL, a query language for JSON (JavaScript Object Notation) data, in Hadoop apps.

Microsoft's Michael Isard was also on hand to talk about DryadLINQ, which combines Microsoft's Dryad distributed execution engine and the .NET Language Integrated Query (LINQ). DryadLINQ is similar to JAQL and Yahoo's open source Pig, which is an infrastructure designed to support ad hoc analysis of very large data sets. But DryadLINQ doesn't actually run on Hadoop.

"Microsoft is doing a very similar set of technologies," Cutting explained, "but all within Microsoft. They're not using an open source model, and it's not even a commercial product at this point. I think they're here because they want to talk with people on a technical level, and because this is important technology, but not in terms of actually cooperating with people by sharing code and building on one another's efforts."

Cutting named the framework "Hadoop" after his son's yellow stuffed elephant. The yellow pachyderm is the official mascot/logo of the project.

About the Author

John K. Waters is a freelance journalist and author based in Mountain View, CA.
