
Revolutionizing Big Data and Hadoop: Operations and Analytics

April 28, 2015

by Jelani Harper

Implementing a Big Data initiative with Hadoop enables organizations to achieve a number of key objectives in their data-centric processes, including:

  • Incorporating more sources: According to MapR CMO Jack Norris, “There’s going to be ever increasing sources of data. The rate of increase is growing. 60 to 70 percent of data growth is happening year over year. Now with the Internet of Things (IoT), that’s even increasing.”
  • Maximizing agility while reducing time to action: Much of the value in utilizing Big Data lies in an organization’s ability to leverage that data with as little latency as possible to influence business processes. Doing so requires a form of agility which may need to supersede some of the conventional time constraints pertaining to relational, schema-oriented Data Modeling environments.
  • Determining the current and future state of business: The real-time analytics options of Big Data can both impact the business as data is generated and offer predictive capabilities for where the business is headed.

According to Norris, Big Data initiatives that utilize a single platform for both analytics and operations can realize these benefits with a simplified architecture, reduced costs, and lower latency, which together provide a distinct advantage that competitors can’t match:

“We recognized very early on that where Hadoop is going is well beyond the original batch orientation that the project was focused on. Going through and doing an index for Web searches is very different from doing online retail transactions. We made the innovations at that platform level because we recognized the journey that people would be on and [that] the move to broad, real-time integrated analytics and operations was required.”

Beyond Batch

Hadoop was able to accelerate beyond its conventionally ponderous batch-oriented processing via the incorporation of additional databases working in tandem with its distributed file system, HDFS, which makes the platform scalable for Big Data sets. The combination of databases such as MapR-DB with HDFS enables organizations to combine analytics and operations in a single environment at the scale and speed necessary to handle transactions in close to real time. Norris reflected on the impact of utilizing a database such as MapR-DB with Hadoop for operations:

“It’s much faster, [offers] consistent low latency, better scale, and we’ve been selected across industries for performance and reliability aspects. Major retailers, telcos, financial services, and governmental agencies that are doing these operations are doing them with MapR.”

Architectural Assumptions

There are a number of changes in contemporary architecture that make combining operations and analytics in a single platform more viable than before. As such, a number of basic architectural assumptions that resulted in separate repositories for analytics and operations are not as relevant as they once were. Such assumptions include the perception that:

  • Applications dictate the organization of data: Prior to the introduction of Big Data repositories such as Hadoop, which can scale to accommodate numerous applications, individual applications required their own data marts and their own storage and data access arrangements. The storage capabilities of contemporary Big Data platforms make those arrangements obsolete. “Today we’re seeing platforms where there are all sorts of data going in and all sorts of analytics being done,” Norris remarked.
  • The network can’t accommodate both computations (at scale and speed) and storage: One of the conventional justifications for separate platforms for analytics and operations was that the amount of storage required would create a strain on the network’s computing power—particularly with larger sets of data.
  • Analytics inherently slows production: Norris observed, “The reason that production and analytics stayed separated is because the analytics jobs were bringing the production jobs to their knees. So it was like, this isn’t working, if we have the analytics that we need to do we need separate infrastructure.”

Hadoop’s Architecture

Hadoop’s architecture directly addresses a number of these assumptions and renders them obsolete. Whereas previously organizations had to account for cost issues pertaining to separate infrastructure for analytics and operations, they can now utilize this open source option to combine infrastructures and also contend with issues of scalability. “Hadoop is a scale-out architecture where if you’ve got double the data, you just add additional commodity nodes and that last node is the same cost as the first node. It’s a linear scaling,” Norris said.
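The linear scaling Norris describes can be illustrated with a little arithmetic. The sketch below is only an illustration: the function names, the per-node capacity, and the per-node cost are hypothetical figures chosen for the example, and it ignores replication overhead and networking costs.

```python
import math


def nodes_needed(data_tb: float, capacity_per_node_tb: float) -> int:
    """Number of identical commodity nodes required to hold data_tb terabytes."""
    return math.ceil(data_tb / capacity_per_node_tb)


def cluster_cost(data_tb: float, capacity_per_node_tb: float, cost_per_node: float) -> float:
    """Total hardware cost when every node costs the same (scale-out model)."""
    return nodes_needed(data_tb, capacity_per_node_tb) * cost_per_node


# Hypothetical figures: 10 TB usable per node, 5,000 per node.
base_cost = cluster_cost(100, 10, 5000)      # 100 TB of data
doubled_cost = cluster_cost(200, 10, 5000)   # double the data

print(nodes_needed(100, 10), base_cost)
print(nodes_needed(200, 10), doubled_cost)
```

Under these assumptions, doubling the data doubles the node count and therefore the cost, because the last node is priced the same as the first.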

Equally important, instead of incurring the latency involved in running analytics on data that is stored in a separate location, Hadoop can expedite the analytics process to real-time speeds because “the compute and the disk are local and distributed across the cluster so you can do the analysis quickly,” Norris added.

Real-Time Value

Perhaps the most vital aspect of the combination of analytics and operations in a single platform is the ability for the enterprise to distinguish real-time analytics (and action) from near real time. Such differences appear in any number of verticals, from fraud detection and e-commerce to industrial equipment asset management and recommender engines. With Big Data applications becoming more of a reality each day, the difference between near-real-time and real-time action is taking on greater importance. Norris commented on this distinction:

“I want to make sure when organizations say real time they’re looking at the broad aspect of real time. It’s not just real time in one narrow aspect, how you’re leveraging data. It’s not how fast an individual query is, for instance. It’s from the data collection point until the business action is taken. It’s that whole phase that needs to be ‘real time’. When it comes to impacting business as it happens, it’s how fast that cycle happens.”
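Norris’s point, that real time is measured from data collection to business action and not by the speed of a single query, can be sketched as a sum over pipeline stages. The stage names and durations below are invented purely for illustration; they are not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    seconds: float


def end_to_end_seconds(stages: list[Stage]) -> float:
    """Total time from data collection to business action."""
    return sum(stage.seconds for stage in stages)


# Hypothetical pipeline with a separate analytics store fed by a batch load.
pipeline = [
    Stage("collect event", 2.0),
    Stage("load into analytics store", 3600.0),
    Stage("run query", 0.5),
    Stage("take business action", 1.0),
]

total = end_to_end_seconds(pipeline)
slowest = max(pipeline, key=lambda stage: stage.seconds)

print(f"end-to-end: {total} s, dominated by: {slowest.name}")
```

In this made-up case the query takes half a second, yet the full cycle takes about an hour because of the batch load. That gap is the difference between a fast query and a real-time business process.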

Governance Ramifications

Hadoop’s expanded role as a platform for both operations and analytics significantly impacts a number of issues related to Data Governance. Utilizing it as a solitary platform helps organizations reduce the number of copies of data they have and minimize the impact of a silo-based culture associated with an application-centric strategy for organizing their data. Such a simplification of infrastructure and its ramifications for governing data may be useful for ensuring regulatory compliance. According to Norris:

“One of the benefits of centralization is that you can focus your efforts and understand where your data is and who has access to it in a way that can simplify some of the aspects of that information.”

Additionally, a centralized approach can help expedite access to data since Hadoop can be used effectively as a Data Lake. In this respect, analysts are free to evaluate data from new sources and derive a rapid time to insight and action that simply would not be possible in more traditional environments involving schema and IT facilitated metadata. On the one hand, such hallmarks of Data Governance are not only useful but perhaps even necessary for managing data in the long term. On the other, they may hinder efforts to exploit competitive advantage in the short term, which may hinge on agility. According to Norris:

“Organizations can effectively deal with all of these fast growing data sources and do so in a very efficient manner so that they’re adjusting faster than the competition and making adjustments that have a much more significant impact than the competitors. It really doesn’t matter what your starting point is, that’s going to be the field that drives your competitive advantages and eventual dominance.”

What Big Data Means Today

Utilizing a single platform for analytics and operations is by no means a panacea. By housing all of this data in one repository, organizations have an even greater need to ensure that such a platform is trustworthy and supports business continuity through reliable backup and recovery capabilities in the event of failure. Questions of security are even more important with this approach, as a centralized repository holds the bulk of an organization’s data and requires detailed analyses of exposure and additional security concerns.

Ultimately, Big Data today represents a critical point of distinction between competitors. Such a distinction is between those that are leveraging such initiatives and those that are not, as well as between those that are doing so in a more readily accessible way than their competitors are. Big Data can yield significant points of insight about one’s customer base down to real-time actions and trends that can be exploited in fleeting time frames. With its penchant for enabling the incorporation of more data sources, increased agility, and faster time to action, Big Data’s salience has effectively transcended sheer quantities of data. Norris noted:

“This is not about ‘hey, I can do a Business Intelligence query against more data’. This is about significantly impacting an organization’s top line, or their ability to mitigate risk which has a huge payback, or streamlining their operations and becoming more efficient and avoiding big cost anomalies.”
