Hadoop Summit: Yahoo Gathers the Stuffed Elephant Crowd

By John K. Waters
03/28/08

Yahoo hosted the first-ever Apache Hadoop Summit this week in Santa Clara, CA. The day-long event presented a program of speakers from the Hadoop developer and user communities, including representatives from Yahoo, IBM, Microsoft, Facebook, Google, and the University of California, Berkeley, among others.
The event drew around 500 attendees, but event organizers  were unsure of the exact number. They were, in fact, caught off guard by the  turnout and had to change venues to accommodate a standing-room-only crowd. 
"We organized the summit because we've been investing a  lot in Hadoop ourselves, and we knew there was a large community of Hadoop  users out there that mostly haven't met each other," said Yahoo Technical Evangelist  Jeremy Zawodny. "I guess it was larger than we thought."
Hadoop is an open source, Java-based distributed computing framework that implements MapReduce and is designed to run on large clusters of commodity hardware. MapReduce, introduced by Google, is a programming model for processing and generating large data sets. It supports parallel computations over those data sets on clusters of unreliable machines.
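To make the model concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce steps, using word counting as the task. It does not use Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are illustrative assumptions, and a real framework would spread these steps across many machines and recover from failures.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for each word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["yellow elephant", "stuffed yellow elephant"])
print(counts)
```

Because each call to map_phase looks at only one record and each call to reduce_phase looks at only one key, both steps can in principle run in parallel on separate machines, which is the property the programming model relies on.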
Yahoo hired Hadoop's creator, Doug Cutting, early last year to work full-time on the framework. Cutting created the Lucene open source information retrieval library and, with Mike Cafarella, the Nutch open source search engine based on it. Both projects are now managed through the Apache Software Foundation.
"The momentum around Hadoop is growing every day,"  Cutting said. "It's really exciting to watch."
Cutting called Yahoo's resource commitment to the Hadoop framework "considerable," but offered no details. Yahoo has made a very public commitment to Hadoop. In February, it launched what company representatives claimed to be the world's largest Hadoop production application. Called the Yahoo Webmap, the application runs on a 10,000-plus-core Linux cluster and produces data used in every Yahoo Web search query, according to company literature.
The initial intended use of Hadoop within Yahoo was to  support Web search, Cutting said, by building the Web search index and  maintaining that massive collection of data. But although it is making the  Yahoo search engine more easily scalable and reliable, he said, the majority of  in-house users are actually employing Hadoop for data exploration.
"It turns out that there are all these other people  within the company who want to be able to access and analyze these massive data  sets -- access logs, event logs, Web and geographic data -- and use them to  improve the Web search software itself," Cutting said. "So they're  using Hadoop for analysis to improve the software, as opposed to actually  implementing the Web search. That's where we're seeing the big payoff." 
And that's where he expects other companies to jump on the  Hadoop bandwagon. 
"The data exploration is more generalizable to lots of  businesses, and that's why we're seeing all this interest," he added. "Companies  are amassing more and more data, and they need to explore it. The tools that  are out there for doing ad hoc exploration and analysis of new data sets aren't as convenient."
Along with several Yahoo representatives, the roster of summit presenters included the IBM Almaden Research Center's Kevin Beyer, who described how to use JAQL, a query language for JSON (JavaScript Object Notation) data, in Hadoop apps.
Microsoft's Michael Isard was also on hand to talk about DryadLINQ, which combines Microsoft's Dryad distributed execution engine and the .NET Language Integrated Query (LINQ). DryadLINQ is similar to JAQL and to Yahoo's open source Pig, an infrastructure designed to support ad hoc analysis of very large data sets. But DryadLINQ doesn't actually run on Hadoop.
"Microsoft is doing a very similar set of technologies,"  Cutting explained, "but all within Microsoft. They're not using an open  source model, and it's not even a commercial product at this point. I think they're  here because they want to talk with people on a technical level, and because  this is important technology, but not in terms of actually cooperating with  people by sharing code and building on one another's efforts."
Cutting named the framework "Hadoop" after his son's yellow stuffed elephant. The yellow pachyderm is the official mascot/logo of the project.

About the Author
                    John K. Waters is a freelance journalist and author based in Mountain View, CA.