Next week Hadoop World takes place in New York City. The big event follows on the heels of the official gold release last week of Apache Hadoop 2.0, which significantly overhauls the MapReduce programming model for processing large data sets with a parallel, distributed algorithm on a cluster.
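For readers new to the model being overhauled, here is a minimal sketch of the classic MapReduce programming flow in plain Python. Hadoop distributes these phases across a cluster; here they simply run locally, and the function names are illustrative, not Hadoop API calls.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["hadoop yarn hadoop", "yarn spark"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'hadoop': 2, 'yarn': 2, 'spark': 1}
```

Because each phase is a pure function over key-value pairs, the framework can split the map and reduce work across many machines, which is what makes the model attractive for large data sets.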
Sitting on top of the Hadoop Distributed File System (HDFS), YARN (Yet Another Resource Negotiator) is designed to act as a large-scale, distributed operating system for big data applications. Multiple apps can now run at the same time in Hadoop, with the global ResourceManager and per-node NodeManagers providing a generic system for managing the applications in a distributed way.
Among the YARN-ready applications is Apache Giraph, an iterative graph processing system built for high scalability. Giraph is the programming framework behind Facebook's Graph Search service of connections across friends, subscriptions, and so on, giving Facebook the means to express a wide range of graph algorithms in a simple way and scale them to massive datasets. Facebook explained in a post in August that it had modified and used Giraph to analyze a trillion edges, or connections between different entities, in under four minutes.
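Giraph's "think like a vertex" style can be sketched in plain Python as single-source shortest paths run over supersteps: each vertex keeps the best distance it has seen, and improved distances are sent as messages along edges until none are in flight. The loop below only mimics the shape of Giraph's model; the variable names are illustrative, not the Giraph API.

```python
# Vertex-centric shortest paths from source "A", in the superstep style
# Giraph uses. graph maps each vertex to (neighbor, edge_weight) pairs.
INF = float("inf")
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": [("A", 7)]}

dist = {v: INF for v in graph}
inbox = {"A": [0]}  # superstep 0: the source receives distance 0

while inbox:  # run supersteps until no messages remain in flight
    outbox = {}
    for v, msgs in inbox.items():
        candidate = min(msgs)
        if candidate < dist[v]:          # compute(): keep the best distance seen
            dist[v] = candidate
            for nbr, w in graph[v]:      # propagate the improvement along edges
                outbox.setdefault(nbr, []).append(candidate + w)
    inbox = outbox

print(dist)  # {'A': 0, 'B': 1, 'C': 3}
```

In Giraph proper, each vertex's compute step runs in parallel across the cluster and vertices "vote to halt" when idle, which is what lets the same simple per-vertex logic scale to Facebook's trillion edges.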
Also ported over to run on YARN is Spark, an open source cluster computing system that aims to make data analytics fast to run and fast to write, and which was initially developed for iterative algorithms like those common in machine learning, as well as interactive data mining. Spark’s heritage is at UC Berkeley, and according to this report, the professors behind it are now cooking up stealth startup Databricks, whose web site notes that it plans to use the Apache Spark platform to “transform large-scale data analysis.”
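The "fast to write" appeal comes largely from Spark's style of chaining transformations over an in-memory dataset. The toy class below is an illustrative sketch in plain Python, not Spark's actual API: the name MiniRDD and its eager list-backed methods only mimic the shape of the interface, while real Spark adds lazy evaluation, partitioning, caching, and cluster execution.

```python
from functools import reduce

class MiniRDD:
    """A toy, single-machine stand-in for Spark's resilient distributed dataset."""

    def __init__(self, data):
        self.data = list(data)  # Spark keeps this partitioned across the cluster

    def map(self, f):
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, f):
        return reduce(f, self.data)

rdd = MiniRDD(range(1, 6))
total = (rdd.filter(lambda x: x % 2 == 1)   # keep odd numbers: 1, 3, 5
            .map(lambda x: x * x)           # square them: 1, 9, 25
            .reduce(lambda a, b: a + b))    # sum: 35
print(total)  # 35
```

Because the working set can stay in memory between passes, iterative algorithms like those in machine learning avoid re-reading data from disk on every iteration, which is the gap Spark was built to close.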
Apache HAMA is among the YARN-powered apps as well. Earlier this month, the latest Hadoop 2-compatible release of the computing framework, which runs on top of HDFS for massive scientific computations such as matrix, graph and network algorithms, added new features such as Bulk Synchronous Parallel-based machine learning algorithms (Clustering and NeuralNetwork) and dynamic graph APIs.
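The Bulk Synchronous Parallel (BSP) model HAMA builds on can be sketched in a few lines: each peer computes on its local data, then all peers exchange messages at a barrier before the next superstep. The simulated peers below are purely illustrative, not HAMA's API; real BSP peers run on separate machines and the barrier is enforced by the framework.

```python
# Three simulated BSP peers cooperatively compute a global sum.
peers = {"peer0": [1, 2, 3], "peer1": [4, 5], "peer2": [6]}

# Superstep 1: local computation — each peer sums its own partition.
local_sums = {name: sum(data) for name, data in peers.items()}

# Barrier synchronization: every peer's partial sum is delivered to every peer.
inboxes = {name: list(local_sums.values()) for name in peers}

# Superstep 2: each peer combines the partial sums it received.
global_sums = {name: sum(inbox) for name, inbox in inboxes.items()}

print(global_sums)  # every peer agrees: {'peer0': 21, 'peer1': 21, 'peer2': 21}
```

The barrier guarantees that no peer starts a superstep until all messages from the previous one have arrived, which is what makes BSP a natural fit for the iterative matrix and graph algorithms HAMA targets.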
Speaking of machine learning and Hadoop, news comes this week that Skytree’s Skytree Server (which The Semantic Web Blog initially discussed here) now integrates with Apache Hadoop to identify and deliver data insights and advanced analytics via machine learning methods in any Hadoop environment. Skytree has also partnered with Hortonworks, becoming the first machine learning vendor certified on Hortonworks Data Platform 2.0.
Released yesterday, the product from a leading voice in the enterprise Apache Hadoop community is purported to be the first commercial distribution built on the Hadoop 2 release and its YARN-based architecture. Skytree also announced a partnership with MapR Technologies to provide enterprises with easy, fast access to high-performance machine learning on Hadoop distributions.