The Apache Software Foundation and its Influence on Data Management

DB Review by Paul Williams

The open source Apache Software Foundation holds wide-ranging influence in the world of software development, having fostered a host of products now commonplace in the data management industry, in both the NoSQL and relational spaces. The foundation grew out of work on the Apache HTTP Server in the mid-1990s. The developers of that server, collectively known as the Apache Group, formed the Apache Software Foundation in 1999.

2002 saw the creation of the Apache Incubator project, a progenitor of innovation in software development. Any developer or group of developers with an idea for a software product can petition to become part of the Incubator project, which provides a pathway to full inclusion as an Apache Software Foundation project. Additionally, externally developed projects wanting to become part of Apache need to first go through the Incubator.

To get a feel for the role the Apache Software Foundation plays in the data management industry, it helps to take inventory of a selection of its many data-related projects.

The Explosion of Big Data Processed by Hadoop

While not a database system per se, Apache Hadoop plays an undeniably important role in the world of data management. Hadoop provides the framework for the distributed processing necessary to manage the volume and velocity of Big Data. At its core, the framework contains file system functionality and the MapReduce processing pattern, which was inspired by similar work performed at Google.
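To make the MapReduce pattern concrete, here is a minimal single-process sketch in Python. It is not Hadoop's Java API; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical and only illustrate the three stages that a real cluster would spread across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data big clusters", "data moves fast"])
```

Because each call to map_phase sees only one document and each call to reduce_phase sees only one key, both stages can in principle run in parallel, which is the property Hadoop exploits.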

Mostly written in Java, Hadoop thrives on the community input typical of the open source software movement. It grew out of an internet search project and received much of its early development at Yahoo!. While most instances of Hadoop include the Hadoop Distributed File System (HDFS), the framework supports other file systems, including those accessible through the FTP and HTTP protocols.

Hadoop is a top-level project at Apache, and over time the platform has grown to include a collection of related projects in the data management realm. Apache Hive provides analytical, OLAP-type functionality on top of Hadoop, essentially adding a data warehousing framework. Apache HBase is a tabular-style database modeled on Google’s Bigtable, which garnered previous coverage at DATAVERSITY. Apache ZooKeeper handles coordination between the distributed services in a Hadoop installation, serving in the role of traffic cop.

Hadoop is widely used throughout the industry. Yahoo! might be the largest user, with over 100,000 CPUs on over 40,000 different computers currently in production. Facebook claimed to manage the world’s largest Hadoop data cluster, with 100 petabytes stored as of last summer, although they have since moved to an in-house developed system to manage that data.

CouchDB in the World of NoSQL Document Databases

Apache CouchDB is a NoSQL document database that stores its data as JSON documents. Its API is accessed over HTTP, and queries are expressed as views built with MapReduce functions. It received earlier DATAVERSITY coverage in the NoSQL Movement article about Document databases.
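As a rough illustration of what "JSON over HTTP" means in practice, the Python sketch below builds, without sending, the kind of request that stores a document. The server address, the database name "articles", the document id, and the helper build_put_document are all assumptions made for the example; consult the CouchDB documentation for the authoritative API.

```python
import json
import urllib.request


def build_put_document(base_url, database, doc_id, document):
    """Build (but do not send) an HTTP request that stores one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{database}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


request = build_put_document(
    "http://localhost:5984",  # assumed address of a local server
    "articles",               # hypothetical database name
    "apache-overview",        # hypothetical document id
    {"title": "Apache and data management", "tags": ["hadoop", "couchdb"]},
)
# Sending it would be: urllib.request.urlopen(request)
```

The point is that no database driver is involved: any tool that can speak HTTP and serialize JSON can act as a client.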

Damien Katz began development on the CouchDB project in 2005. Katz formerly worked on Lotus Notes for IBM. CouchDB was accepted into the Apache Incubator program in 2008, and graduated to top-level status a few months later. Last year, Katz teamed up with the people behind the Membase memory caching database to form the company Couchbase, with a product that combines features from both databases.

The BBC uses CouchDB as a scalable data store across multiple data centers. CouchDB also provides the database backend for multiple Facebook applications. Other production instances of CouchDB can be found on Apache’s wiki page for the project.

Apache Cassandra Thrives on Distribution

Another NoSQL database that is a top-level Apache project is Apache Cassandra; it earned coverage in DATAVERSITY’s NoSQL Movement article on Key-Value Databases. Cassandra is essentially a key-value data store with a data model patterned on Google’s Bigtable. It pairs that model with a distribution design inspired by Amazon’s Dynamo, allowing it to thrive in highly distributed environments.
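One idea commonly associated with Dynamo-style distribution is consistent hashing, in which nodes and keys are placed on the same ring and each key is stored on the next few nodes around it. The Python sketch below is a toy version under that assumption; the HashRing class, the node names, and the choice of MD5 are illustrative and are not Cassandra's actual implementation.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring: a key belongs to the next nodes clockwise."""

    def __init__(self, nodes, replicas=2):
        self.replicas = replicas
        # Sort nodes by their position (token) on the ring.
        self.ring = sorted((self._token(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.ring]

    @staticmethod
    def _token(value):
        """Hash a string to an integer position on the ring."""
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def nodes_for(self, key):
        """Return the nodes holding copies of the key, walking the ring."""
        start = bisect.bisect_right(self.tokens, self._token(key)) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c"], replicas=2)
owners = ring.nodes_for("user:42")
```

Any node can compute the owners of a key from the ring alone, so no central coordinator is needed, and losing one node leaves the other replica reachable.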

Originally developed at Facebook to power their Inbox Search auto-complete functionality, Cassandra joined the Apache Incubator program in 2009, earning top-level project status the following year. Commercial support options for the database are offered by DataStax and Acunu.

In addition to its high performance in distributed environments, Cassandra is also known for its decentralized nature, which leads to a high level of fault tolerance. The database is in production at a large number of internet-based companies, including Reddit, Rackspace, Netflix, SoundCloud and Twitter. Facebook recently replaced their Inbox Search functionality with a system using Apache HBase.

Apache Accumulo Also Inspired by BigTable

Another Apache database inspired by Google’s BigTable is Apache Accumulo. Originally developed by the National Security Agency in 2008, Accumulo became part of the Apache Incubator in 2011, earning top-level status last year. It was covered in the DATAVERSITY NoSQL Movement article on tabular databases.

Also running on top of Hadoop, Accumulo differentiates itself from other tabular databases by its focus on security, hearkening back to its genesis at the NSA. Its innovative cell-based access control is a key part of Accumulo’s security functionality.
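The idea behind cell-based access control can be sketched as follows: every cell carries a label expression, and a reader sees the cell only if the reader's authorizations satisfy that expression. The Python below is a simplified illustration, not Accumulo's API. The functions tokenize and is_visible, the label names, and the rule that an empty label is visible to everyone are assumptions of this sketch, and it evaluates mixed & and | operators left to right rather than insisting on parentheses.

```python
import re


def tokenize(expression):
    """Split a label expression into names, operators, and parentheses."""
    return re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)


def is_visible(expression, authorizations):
    """Return True if the reader's authorizations satisfy the cell's label."""
    tokens = tokenize(expression)
    position = 0

    def parse_term():
        nonlocal position
        token = tokens[position]
        position += 1
        if token == "(":
            result = parse_expression()
            position += 1  # skip the closing ")"
            return result
        return token in authorizations

    def parse_expression():
        nonlocal position
        result = parse_term()
        while position < len(tokens) and tokens[position] in ("&", "|"):
            operator = tokens[position]
            position += 1
            right = parse_term()
            result = (result and right) if operator == "&" else (result or right)
        return result

    if not tokens:
        return True  # assumption: an unlabeled cell is visible to everyone
    return parse_expression()


# Hypothetical row: each cell is (column, label, value).
cells = [
    ("name", "", "Ada"),
    ("salary", "hr&manager", 90000),
    ("notes", "(hr&manager)|admin", "on leave"),
]
visible_to_admin = [col for col, label, _ in cells if is_visible(label, {"admin"})]
```

Filtering at the level of individual cells means two readers can scan the same row and receive different subsets of its columns.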

Object Relational Mapping with Apache

In addition to the Apache projects that focus on data persistence, or perform the distributed processing for data-driven applications, there are two object-relational mapping frameworks at Apache that are worth noting.

Apache Cayenne includes object-relational mapping and remoting services functionality. The framework provides a binding layer between database models and Java objects. Cayenne offers remoting through Remote Object Persistence, which allows these Java objects to be accessed by clients through a web services layer.
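For readers new to object-relational mapping, the sketch below shows the basic binding idea in Python: rows from a relational table are turned into typed objects. It is not Cayenne's API. Python's built-in sqlite3 module stands in for the relational database, and the Artist class, the artist table, and the fetch_all helper are hypothetical names chosen for the example.

```python
import sqlite3
from dataclasses import dataclass, fields


@dataclass
class Artist:
    """Plain object whose fields mirror the columns of the artist table."""
    id: int
    name: str


def fetch_all(connection, model, table):
    """Load every row of a table and bind each one to a model object."""
    columns = [field.name for field in fields(model)]
    rows = connection.execute(f"SELECT {', '.join(columns)} FROM {table}")
    return [model(*row) for row in rows]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO artist (id, name) VALUES (?, ?)",
    [(1, "Ada"), (2, "Grace")],
)
artists = fetch_all(connection, Artist, "artist")
```

Application code then works with Artist objects rather than raw result sets, which is the convenience a full ORM framework provides at much larger scale, along with change tracking, relationships, and caching.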

After its beginning in the early 2000s as an open source project led by ObjectStyle, Cayenne became a top-level project at Apache in 2006. Cayenne also features database reverse engineering functionality, as well as a GUI-based modeling tool called the CayenneModeler. It is regarded in the industry as a mature product with enterprise-level performance.

Another Apache ORM project serving as a binding tool between a relational database and Java objects is Apache Torque. Torque began as part of the Apache Turbine Framework, a project for the rapid development of web-based applications written in Java. Uncoupled from Turbine, Torque is now a sub-project of the Apache Database Project. Torque uses XML and DTD generated from an RDBMS to generate the Java classes used in binding.

Derby and the Apache Database Project

The Apache Database Project, referenced above in connection with Apache Torque, pursues a mission of “the creation and maintenance of commercial-quality, open-source, database solutions based on software licensed to the Foundation, for distribution at no charge to the public.” One of these projects is Derby, which got its start in the mid-1990s as a Java database at a company called Cloudscape. Informix acquired Cloudscape and was in turn acquired by IBM, which distributed the database for a time as IBM Cloudscape.

The database is known as Apache Derby, and Oracle also distributes the same binaries, branded as Java DB. Featuring a small footprint of only about 2 MB, Derby is suitable for a wide range of embedded projects that also rely on Java. The database supports both JDBC and SQL at the API level.

Apache Incubator Projects Hold Hope for the Future

Some projects currently in the Apache Incubator are worth noting by the data professional for future reference. Apache Giraph is a graph processing system that entered incubation status in early 2012. Giraph works as a Hadoop job and runs on a variety of Hadoop infrastructures, including Amazon EC2. It is similar to Google’s Pregel graph processor, but adds fault tolerance by leveraging the job coordination services provided by Apache ZooKeeper.
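Pregel-style processing works in rounds called supersteps: in each round, every vertex reads the messages sent to it, updates its own value, and sends messages to its neighbors, and the job ends when no messages remain. The single-process Python sketch below illustrates that loop by spreading the largest value through a small graph. It is not Giraph's Java API, it has no fault tolerance, and the function propagate_max and the sample graph are hypothetical.

```python
def propagate_max(graph, values):
    """Run supersteps until every vertex holds the largest value it can reach.

    graph maps each vertex to the list of vertices it sends messages to.
    """
    values = dict(values)
    # Superstep 0: every vertex sends its value to its neighbors.
    inbox = {vertex: [] for vertex in graph}
    for vertex, neighbors in graph.items():
        for neighbor in neighbors:
            inbox[neighbor].append(values[vertex])

    # Later supersteps: keep going while any vertex has incoming messages.
    while any(inbox.values()):
        outbox = {vertex: [] for vertex in graph}
        for vertex, messages in inbox.items():
            if messages and max(messages) > values[vertex]:
                values[vertex] = max(messages)
                for neighbor in graph[vertex]:
                    outbox[neighbor].append(values[vertex])
            # Otherwise the vertex stays quiet, which counts as voting to halt.
        inbox = outbox
    return values


graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = propagate_max(graph, {"a": 3, "b": 6, "c": 1})
```

Each vertex only ever looks at its own value and its own messages, which is what lets a system like Giraph split the vertices across many Hadoop workers.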

Those interested in graph processing should keep an eye out for when Giraph potentially graduates to top-level status at Apache or gets a commercial implementation.

Another Apache Incubator database project is Empire-db, which provides a data persistence layer with a measure of object-relational mapping functionality. The product’s aim is to provide better compile-time type safety by using Java object models instead of XML for schema definitions. In theory, this offers less dependence on the use of literals and string operations.

Empire-db compares with other ORM tools in the Java community like Hibernate. The tool also leverages metadata management to help serve as glue between the presentation and persistence layers. It is a product worth paying attention to for those involved in data and metadata modeling.

The innovations in the data industry ushered in by the Apache Software Foundation remain vital and far-reaching. They prove what can be accomplished by an inspired community of software engineers who code for the love of software instead of purely financial gain.
