Big Data Integration 101: The What, Why, and How

Big Data Integration is an essential step in any Big Data project, but it raises several issues that need to be taken into consideration. Generally speaking, Big Data Integration combines data originating from a variety of sources and software formats, and then provides users with a translated and unified view of the accumulated data.

Managing “integrated” Big Data assures more confidence in decision-making and provides superior insights. The process of integrating huge data sets can be quite complicated and can present several challenges. Some challenges faced during the integration process include: uncertainty of data, management, syncing across data sources, finding insights, and skill availability.

A primary purpose of Big Data implementation is to present the data in new and unique ways, in order to gain new insights and, in business, new advantages. Recognizing the needs of the organization prior to “organizing” the data is useful in a broad range of Big Data projects, including business and scientific research. Big Data Integration combines traditional data, social media, data from the Internet of Things (IoT), and transactional data. Data that is not compatible, or has not been translated/transformed, is essentially useless for such projects. John Thielens, the Chief Technology Officer of Cleo, a Big Data Integration solutions service, said:

“A lot of what’s discussed concerning Big Data has to do with the wonders of today’s powerful analytics tools. But before any analytics can be performed, data integration has to happen. That means your data – historic, operational, and real-time – must be sourced, moved, transformed, and provisioned to users, with technologies that promise security and control all along the way.”

Big Data Integration Tools

As “traditional” tools for data integration continue to evolve, they should be reevaluated for their abilities to process the ever-increasing variety of unstructured data, as well as the growing volume of Big Data. Integration technologies must have a common platform to support Data Quality and profiling.

The integration of data from different applications takes data from one environment (the source) and sends it to another data environment (the target). In traditional data warehouses, ETL (extract, transform, and load) technologies are used to organize data. Those technologies have evolved, and continue to evolve, to work within Big Data environments.
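
As a rough illustration of the extract, transform, and load steps described above, the following minimal Python sketch moves a few rows from a source to a target. The CSV content, the `customers` table, and the function names are all hypothetical and chosen only for illustration; an in-memory SQLite database stands in for the target environment.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a CSV export from an operational system.
SOURCE_CSV = """customer_id,name,signup_date
1, Alice Smith ,2021-03-04
2,bob jones,2021-07-19
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize types and formatting for the target schema."""
    return [
        (int(r["customer_id"]), r["name"].strip().title(), r["signup_date"])
        for r in rows
    ]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", records
    )
    conn.commit()


def run_etl(text, conn):
    """Run the three steps in order and return the rows now in the target."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT customer_id, name, signup_date "
        "FROM customers ORDER BY customer_id"
    ).fetchall()
```

Calling `run_etl(SOURCE_CSV, sqlite3.connect(":memory:"))` runs the whole pipeline once. Keeping the three steps as separate functions makes it easier to swap in a different source or target later.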

When working with Big Data, tools that support both batch integration and real-time integration across several sources can be quite useful. A pharmaceutical company, for example, may want to merge data stored in its MDM (Master Data Management) system with Big Data from sources describing the outcomes of prescription drug usage.
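
The pharmaceutical example can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the drug identifiers, the invented drug names, the record layout, and the `merge_outcomes` helper. The idea is only to show master records being enriched with counts from a larger, messier event source.

```python
from collections import defaultdict

# Hypothetical master records, as an MDM system might hold them.
MDM_DRUGS = {
    "D-001": {"name": "Examplamine", "form": "tablet"},
    "D-002": {"name": "Samplazole", "form": "capsule"},
}

# Hypothetical outcome events from an external, high-volume source.
OUTCOME_EVENTS = [
    {"drug_id": "d-001", "outcome": "improved"},
    {"drug_id": "D-001", "outcome": "no_change"},
    {"drug_id": "D-002", "outcome": "improved"},
    {"drug_id": "D-999", "outcome": "improved"},  # no master record
]


def merge_outcomes(master, events):
    """Attach outcome counts to each master record; collect unmatched events."""
    counts = defaultdict(lambda: defaultdict(int))
    unmatched = []
    for event in events:
        # Normalize the key so "d-001" and "D-001" refer to the same drug.
        key = event["drug_id"].strip().upper()
        if key in master:
            counts[key][event["outcome"]] += 1
        else:
            unmatched.append(event)
    merged = {
        key: {**master[key], "outcomes": dict(counts[key])}
        for key in master
    }
    return merged, unmatched
```

Returning the unmatched events separately, rather than silently dropping them, gives a simple way to spot Data Quality problems in the incoming source.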

When using the cloud, data can be organized using integration Platform-as-a-Service (iPaaS). This service is generally easy to use and can include data from Cloud-based sources, such as Software-as-a-Service (SaaS).

Organizations use MDM systems to promote the collection, aggregation, consolidation, and delivery of reliable data throughout the organization. Additionally, new tools, such as Scribe and Sqoop, are being used to support the integration of Big Data. There is also an increasing emphasis on ETL technologies in Big Data research.

Mike Tuchen, CEO of Talend, an open source ETL solutions service, said:

“There is a once-in-a-generation shift taking place in the industry as the entire Data Management stack gets redefined. Companies now recognize that data is a competitive advantage and are turning away from legacy integration solutions to more agile and modern solutions that are optimized for Hadoop.”

The Challenges of Big Data Integration

Finding Staff: Though the number of data scientists and Big Data analysts continues to grow, there is still a lack of people to fill all the positions in the Big Data research industry. The typical Big Data expert has gained experience with tool implementation and has an understanding of how to organize the data to best research it. Data scientists and Big Data analysts should be familiar with traditional relational database tools, as well as in-memory analytics, NoSQL Data Management frameworks, and Hadoop ecosystems.

Bringing in the Data: Accessing data that comes from an extensive range of sources is also a challenge. Analyzing and processing Big Data requires the skills to navigate these extraction processes.

Synchronization: Data coming from a wide range of sources arrives on different schedules and at different rates, and can quickly become desynchronized from the originating system. Data synchronization keeps systems consistent by continually propagating updates. In traditional Data Management systems, the processes of data extraction, migration, and transformation all promote desynchronization.
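
One common way to keep a target in step with its source is to copy only the rows that changed since the last run, tracked with a "watermark" timestamp. The sketch below is a minimal illustration of that idea, not a description of any particular tool. The row layout, the integer `updated_at` field, and the `sync_changes` function are assumptions.

```python
def sync_changes(source_rows, target, last_synced):
    """Copy rows changed after `last_synced` into `target`.

    Returns the new watermark: the latest `updated_at` value seen,
    or `last_synced` unchanged if nothing was newer.
    """
    watermark = last_synced
    for row in source_rows:
        if row["updated_at"] > last_synced:
            target[row["id"]] = row  # insert or overwrite by primary key
            watermark = max(watermark, row["updated_at"])
    return watermark
```

Each run passes in the watermark returned by the previous run, so unchanged rows are skipped. This simple version does not detect rows deleted at the source, which would need separate handling.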

Data Management Tools: Incompatibility between Big Data Management tools can cause problems. NoSQL approaches can be incompatible with one another; hierarchical object representation and key-value storage provide two good examples. The range of NoSQL tools has caused some confusion regarding the compatibility of different approaches. Selecting the appropriate tools for a highly functional data integration system requires forethought. Small organizations that are planning to start data warehousing face a decision about the tools they will be using.
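
To make the mismatch between hierarchical and key-value representations concrete, the sketch below flattens a nested document into dotted-path key-value pairs. The `flatten` helper and the dotted-path convention are assumptions chosen for illustration; real tools use their own mapping rules.

```python
def flatten(document, prefix=""):
    """Flatten a nested dict/list document into dotted-path key-value pairs."""
    if isinstance(document, dict):
        items = document.items()
    elif isinstance(document, list):
        items = enumerate(document)
    else:
        # A scalar value: this is a leaf, so emit one key-value pair.
        return {prefix: document}
    pairs = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        pairs.update(flatten(value, path))
    return pairs
```

A document such as `{"user": {"name": "Ana", "tags": ["a", "b"]}}` becomes the pairs `user.name`, `user.tags.0`, and `user.tags.1`. Note that empty nested objects and lists produce no pairs at all in this version, which is one small example of how information can be lost when moving between models.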

Choosing a Strategy: Big Data Integration often begins with a simple need to share information. This is often followed by an interest in breaking down the “data silos” for purposes of analysis. Businesses will often leap from one project to another without an organizational plan. To meet goals that are sometimes contradictory, and include security and compliance needs, a true data integration strategy should be developed.

Considering the Big Picture

Ignoring Big Data Integration is, in the long run, inefficient and time consuming. Many organizational leaders take technology for granted, believing all data integration solutions are equal, without evaluating and testing them. In truth, there are a variety of data integration technologies available, in terms of functions and the problems they address. Considerations should include performance, Data Governance, and security.

Organizations implementing Big Data Integration solutions often ignore these considerations, because they don’t understand that these concepts are actually related to data integration. These are concepts that should be core constituents of the data integration process, starting with a logical architecture and moving to physical deployment. If they are not integrated initially, they will have to be added later. While the integration of performance, governance, and security may seem obvious to some, most organizations ignore them during the planning phase.

On the plus side, data integration technology continues to improve, and has kept pace with changes in infrastructure, such as the cloud and Big Data. In spite of its flexibility and continuing evolution, there must still be some hard thinking during the planning phase of setting up a Big Data Integration system.

Big Data Databases

A Big Data database organizes data in novel ways when compared to traditional relational databases. This is primarily the result of scalability and the use of both unstructured and structured data. For a Big Data analysis to be useful, it must be understood and trusted by upper management. The basics of a Big Data ecosystem include Cassandra, Hadoop, HBase, MongoDB, and many others. While each has its own ways of extracting and loading data, several use Hadoop as a foundation. Choosing the best Big Data platform requires some serious thought.

Cassandra combines ideas from two Big Data technologies, Amazon’s Dynamo and Google’s Bigtable.

This platform is “extremely” scalable, and it is designed to cope with the challenges of Data Management in modern business. It is also decentralized, providing redundancy mechanisms. Cassandra comes with Hadoop integration and MapReduce support. Cassandra’s weaknesses include limited options for retrieving data, and background tasks that make its performance “occasionally” unpredictable.

Hadoop comes with three great strengths. It works with both structured and unstructured data, it is cost-efficient (open source), and it is fast. The sources can come from social media, clickstream data, or government agencies. As a data storage system, Hadoop is a surprisingly cost-effective solution. Used as separate storage, it frees the primary system to work more quickly. Its file system, HDFS, also replicates data blocks across nodes, which protects against data loss when a node fails. On the other hand, Hadoop’s security features are not enabled by default, so an unconfigured cluster can be easily compromised. It’s also not very good at working with “small” data.

HBase is a very popular platform with several strengths, including consistency, sharding, failover support, and load sharing. It also comes with some weaknesses. If the “HMaster” fails, it takes a “long time” to recover it. It also has problems with querying, and it does not natively support secondary indexes: each table is indexed only by its row key.

MongoDB is a very fast document database, and offers ACID properties. It has a failover mechanism that works automatically. It supports common authentication mechanisms like LDAP and AD and makes replication very easy. Auto-sharding enables horizontal scalability and the database makes querying easy. Early versions did not support JOIN operations or multi-document transactions, although later releases added the $lookup aggregation stage (version 3.2) and multi-document transactions (version 4.0). It also has some memory limitations due to indexing methods.
