by Angela Guess
BlueData’s Chief Architect Tom Phelan recently wrote an article entitled “HDFS Upgrades Are Painful. But they Don’t Have to Be.” Phelan begins, “It’s hard enough to gather all the data that an enterprise needs for a Hadoop deployment; it shouldn’t be hard to manage it as well. But if you follow the traditional Hadoop “best practices”, it is. In particular, upgrades to the Hadoop Distributed File System (HDFS) are excruciatingly painful. By way of background, each version of Hadoop is composed of a compute side and a data side. The compute side refers to MapReduce and other data processing applications. The data side refers to HDFS. The compute side and the data side of Hadoop are closely linked. This is due to the Protocol Buffers and RPC connections used by the Hadoop Java code to communicate between the compute and data sides of Hadoop. If the versions of the compute side and data side do not match, then your applications will not run.”
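The coupling Phelan describes can be pictured as a version handshake: the RPC layer on the data side rejects a compute-side client built against a different protocol version. The sketch below is a hypothetical Python illustration of that failure mode, not actual Hadoop code (Hadoop's real check happens over Protocol Buffers/RPC in Java between the client and the NameNode); all class and error names here are invented for illustration.

```python
# Hypothetical sketch of the compute-side / data-side version handshake
# described in the article. Not real Hadoop code; names are invented.

class VersionMismatchError(Exception):
    """Raised when the client and server protocol versions differ."""
    pass

class HdfsService:
    """Stands in for the data side (HDFS) running one Hadoop version."""
    def __init__(self, protocol_version):
        self.protocol_version = protocol_version

    def connect(self, client_version):
        # The RPC layer rejects clients whose protocol version differs,
        # which is why mismatched compute/data versions fail to run.
        if client_version != self.protocol_version:
            raise VersionMismatchError(
                f"client speaks v{client_version}, "
                f"server speaks v{self.protocol_version}"
            )
        return "connected"

# A compute-side job built against version X talks to HDFS on version X:
hdfs_x = HdfsService(protocol_version=9)
print(hdfs_x.connect(client_version=9))   # connects fine

# The same HDFS rejects a job built against a different version:
try:
    hdfs_x.connect(client_version=10)
except VersionMismatchError as e:
    print("error:", e)
```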
Phelan continues, “Let’s say your organization has a 100-node Hadoop cluster with 100 petabytes of data in HDFS running Hadoop version X, and you want to try out the latest version of Hive, from Hadoop version Y, on some of that data. Then you instantiate version Y of Hadoop on another system or virtual machine, and you configure your application to use remote HDFS access to your existing 100 petabytes of data. However, due to the version differences, your application from version Y gets an error trying to connect to the HDFS file system from version X. Now what do you do?”
He goes on, “When my colleagues and I founded BlueData, we recognized this challenge – and we built a solution specifically designed to address this problem and eliminate the pain. One of the key components of our BlueData EPIC software platform is called DataTap (or “dtap” for short). With DataTap, your applications from Hadoop version Y can access and tap into data from HDFS built on Hadoop version X (or any version or distribution). Simple. Easy. Straightforward. Painless. This allows you to try out the latest versions of Hadoop applications without having to go through the tedious and time-consuming process of upgrading your HDFS to the latest version, or copying data, or playing games with Java jar files. Similarly, when you upgrade the version of Hadoop running your HDFS, you can still use your existing and stable (i.e. older Hadoop version) applications against that data.”
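The core idea behind an indirection layer like DataTap can be sketched as path translation: applications address data through a stable `dtap://` path, and a resolver maps it to whatever storage system actually holds the data, hiding that system's Hadoop version from the application. The mapping and function below are entirely hypothetical, written only to illustrate the concept from the article; they are not BlueData's implementation or API.

```python
# Illustrative sketch of the DataTap idea from the article: applications use
# a stable dtap:// path, and a translation layer resolves it to the backing
# store. The registry and resolver here are hypothetical example code.

STORAGE_BACKENDS = {
    # logical name -> backing URI (example values; the real platform could
    # point this at any HDFS version or distribution, or other storage)
    "sales": "hdfs://prod-namenode:8020/data/sales",
}

def resolve(dtap_uri):
    """Map a dtap:// URI to the backing storage URI it stands for."""
    prefix = "dtap://"
    if not dtap_uri.startswith(prefix):
        raise ValueError(f"not a dtap URI: {dtap_uri}")
    name, _, subpath = dtap_uri[len(prefix):].partition("/")
    backend = STORAGE_BACKENDS[name]
    return f"{backend}/{subpath}" if subpath else backend

# An application built on Hadoop version Y addresses the data by logical
# name, without a direct version-X HDFS client connection of its own:
print(resolve("dtap://sales/2016/q1"))
# -> hdfs://prod-namenode:8020/data/sales/2016/q1
```

Because the application only ever sees the `dtap://` path, the storage underneath can be upgraded, or left on an older Hadoop version, without touching the application side, which is the decoupling the article describes.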
Read more at BlueData.com.
Photo credit: BlueData