by Charles Roe
Part 1 of the article BI/Analytics on NoSQL: Review of Architectures discussed the general trends in the growth of NoSQL technologies over the past few years. It asked the question, as posed by Nicholas Goodman in his presentation, “how can companies put traditional Business Intelligence (BI) tools on top of NoSQL or big data systems?” The answer is still fraught with difficulties, including a lack of data points as evidence for substantial full-loop integrations of the various technologies. This lack of information proving the success of the marriage between BI and NoSQL has created conflicting viewpoints between the two primary factions in the relationship:
- The managers, analysts, and other business people on the data warehouse/analytics/BI side, who want ad hoc reporting systems, quickly rendered dashboards, self-service report authoring, and the other capabilities they are used to in their BI applications.
- The IT people and developers, who have seen so much of NoSQL’s core processing success that they believe it is the answer to many of big data’s problems, such as scalability and long-term storage, while also offering a rich analytical environment.
The rest of Part 1 then covered the first three of the primary architectural sets currently being employed by various organizations. These are only general use cases; each organization integrating NoSQL into its existing data systems, or starting from scratch with it, applies its own level of customization. The first three were:
- NoSQL Reports: The main attribute of this architecture is a developer-built reporting system that accesses the full richness of the NoSQL environment. It is expensive, but fully customizable.
- NoSQL thru and thru: This approach is similar to the first, but builds more flexibility into the entire system. It also carries a large developer overhead cost, and it has issues integrating the company’s other SQL-centric data.
- NoSQL + MySQL: The third architecture is a combination method that takes the data out of the NoSQL system and puts it into traditional SQL/BI applications. It lacks data freshness and NoSQL richness, but allows for the use of off-the-shelf software.
Part 2 of the article continues with the remaining three primary architectures being employed at this time, some real-world uses of NoSQL that embody various manifestations of the six architectures, and the conclusion.
4. NoSQL as ETL Data Source:
The fourth approach takes a novel look at the NoSQL data. Instead of treating it as something separate from the data warehouse, all the NoSQL data is understood as just another ETL data source. The data is extracted from the NoSQL or big data system, put into the data warehouse, and integrated with the data already there; this is the first architecture that enables integrated data, which can then be used with standard BI tools. Much as in approach three, though, the rich expressiveness of the NoSQL environment is lost, there is a large ETL development cost to make the integration work, traditional data warehouse tools are costly, and the NoSQL scalability upside disappears.
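As a rough illustration, a nightly extraction job of this kind might look like the following sketch. It assumes a hypothetical MongoDB `events` collection and a PostgreSQL warehouse table named `warehouse_events`; the names, fields, and connection strings are illustrative, not from the article.

```python
# Sketch: treating a NoSQL store as just another ETL source.
# Collection, table, and field names below are illustrative assumptions.
from datetime import datetime, timedelta

from pymongo import MongoClient   # NoSQL source
import psycopg2                   # warehouse target

def nightly_extract():
    """Pull yesterday's documents out of MongoDB and load them
    into a flat warehouse table alongside the rest of the data."""
    source = MongoClient("mongodb://nosql-host:27017")["appdb"]["events"]
    target = psycopg2.connect("dbname=warehouse user=etl")
    cutoff = datetime.utcnow() - timedelta(days=1)

    with target, target.cursor() as cur:
        for doc in source.find({"created_at": {"$gte": cutoff}}):
            # Flatten the nested document into warehouse columns; any
            # structure the schema can't express is simply dropped,
            # which is exactly the "lost richness" trade-off above.
            cur.execute(
                "INSERT INTO warehouse_events "
                "(event_id, user_id, event_type, created_at) "
                "VALUES (%s, %s, %s, %s)",
                (str(doc["_id"]),
                 doc.get("user", {}).get("id"),
                 doc.get("type"),
                 doc["created_at"]),
            )
    target.close()

if __name__ == "__main__":
    nightly_extract()
```

The point of the sketch is the shape of the work, not the specific stores: once the documents are squeezed into warehouse columns, standard BI tools apply, but the nesting and scalability of the NoSQL side stay behind.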
5. NoSQL Programs in BI Tools:
This approach brings the developer, and the requisite costs, back into the system, though not to the same extent as the first two approaches. A developer must write a program for a standard commodity BI tool that essentially flattens the NoSQL data and outputs it into a report. The developer does not have to spend time programming report details like margins and colors, but instead writes an application that connects the BI tool to the NoSQL data, skipping the need for SQL, ad hoc web-based access tools, and metadata. It is simpler in form, cheaper than writing loads of custom reports, still uses the rich NoSQL language (so developers can write MapReduce jobs), and gives up-to-date access to 100% of the dataset. The downsides include slower aggregations and summaries, some developer cost, a lack of integration with other systems, and no ad hoc access.
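To make the “flattening” concrete, here is a minimal sketch of what such a connector program could look like; the document structure and field names are invented for illustration, not taken from any particular BI tool’s API.

```python
# Sketch: a small connector that flattens nested NoSQL documents into
# the tabular rows a commodity BI tool expects. Field names are
# illustrative assumptions.
import csv
import sys

def flatten(doc, parent_key="", sep="."):
    """Recursively flatten a nested document into dotted column names,
    e.g. {"user": {"id": 7}} -> {"user.id": 7}."""
    items = {}
    for key, value in doc.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

def export_for_bi(docs, out=sys.stdout):
    """Write flattened documents as CSV, ready for a BI tool to
    ingest; no SQL layer or metadata catalog is involved."""
    rows = [flatten(d) for d in docs]
    columns = sorted({c for r in rows for c in r})
    writer = csv.DictWriter(out, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)

if __name__ == "__main__":
    sample = [
        {"type": "crash", "user": {"id": 7, "region": "EU"}},
        {"type": "crash", "user": {"id": 9, "region": "US"}},
    ]
    export_for_bi(sample)
```

In a real deployment the `sample` list would be replaced by a live query against the NoSQL store (or the output of a MapReduce job), but the essential job of the program is the same: turn rich documents into flat rows the BI tool can render.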
6. NoSQL via BI Database (SQL):
The final approach adds a third-party Enterprise Information Integration (EII) system between the commodity BI tool and the NoSQL or big data system. The EII tool can speak to both, so it acts as the intermediary that translates the data into models usable by the BI tool. This approach allows integration with other data and gives live, up-to-date access; the ETL is simple, with INSERT/MERGEs done nightly, and there is ad hoc access to live, cached data. It is a sort of “best of both worlds” approach: all the NoSQL or big data tools still exist on the back end at the core, while the front end is all traditional BI tools, making both parties (mentioned earlier) happy. But there are the additional cost and complications of adding a third system into the mix, and there is still a loss of the richness of the NoSQL environment, since some of the aggregations, reducers, and other constructs are simply not available, or are too awkward to render, in an SQL environment.
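For the nightly INSERT/MERGE step, a minimal sketch might look like the following, assuming a PostgreSQL-backed BI database and an illustrative staging table; none of these names come from the article, and the upsert syntax shown is PostgreSQL’s stand-in for a MERGE.

```python
# Sketch of the nightly INSERT/MERGE step, assuming a PostgreSQL-backed
# BI database; the table and column names are illustrative.
import psycopg2

NIGHTLY_MERGE = """
INSERT INTO bi_daily_metrics (metric_date, metric_name, metric_value)
SELECT metric_date, metric_name, metric_value
FROM   staging_daily_metrics
ON CONFLICT (metric_date, metric_name)   -- PostgreSQL's MERGE-like upsert
DO UPDATE SET metric_value = EXCLUDED.metric_value;
"""

def run_nightly_merge():
    """Fold the latest staged aggregates into the BI-facing table so
    ad hoc queries always see live, cached data."""
    conn = psycopg2.connect("dbname=bi_db user=etl")
    with conn, conn.cursor() as cur:   # commits on successful exit
        cur.execute(NIGHTLY_MERGE)
    conn.close()

if __name__ == "__main__":
    run_nightly_merge()
```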
Real-World Uses
There are numerous companies using a variety of these approaches separately and together to give them the NoSQL/big data and BI functionality they need.
- Mozilla: In their Socorro project they are essentially using Approach Two (NoSQL thru and thru): they collect Firefox web browser crash reports and store those reports in an HBase/HDFS database; then, as a nightly process, a combination of custom scripts and Hadoop processes and aggregates the data into usable summaries and imports them into a PostgreSQL database, where they are available for sharing with those who need them. So while this is mostly a NoSQL architecture, the endpoint eventually becomes one where the NoSQL data is translated into a SQL database (a sketch of this kind of nightly import appears after this list).
- Company X: They wanted the ability to do visualizations with Tableau, a commodity BI tool. So they built a system similar to Approach Six: Tableau as the BI frontend, LucidDB integrated as the third-party EII system in the middle, and Splunk, a commercial NoSQL data aggregator, as the NoSQL backend.
- Meteor Solutions: Through a custom-designed use of the ‘NoSQL thru and thru’ approach, built on Cloudant’s BigCouch, a cloud-based version of CouchDB, Meteor Solutions created a frontend reporting system with all the data aggregations and indices worked into it. Their custom reporting functionality essentially solved the common NoSQL/SQL aggregation problem and thus fixed one of the biggest downsides of Approach Two.
- Nameless Companies A, B, C: Three particular web-related companies are using Approach Three. They keep their data in big data systems like Hadoop or NoSQL, use those systems for data storage and ETL/data preparation, process the data through custom scripts, and then load it into MySQL or another form of analytic database. This gives them access to the summarized, aggregated data for easy ad hoc analysis, dashboarding, and other common BI tool uses.
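As referenced in the Mozilla bullet above, a nightly import of Hadoop-produced aggregates into a SQL database could be sketched as follows; the file path, table, and columns are illustrative assumptions, not details from Socorro itself.

```python
# Sketch of a nightly "import aggregates into PostgreSQL" step, of the
# kind the Mozilla bullet describes; all names here are illustrative.
import psycopg2

def import_daily_aggregates(csv_path="/data/hadoop_output/daily_summary.csv"):
    """Bulk-load a day's Hadoop-produced summary file into a SQL table
    so it can be shared through ordinary SQL/BI tooling."""
    conn = psycopg2.connect("dbname=reports user=etl")
    with conn, conn.cursor() as cur, open(csv_path) as f:
        cur.copy_expert(
            "COPY crash_summaries (day, signature, crash_count) "
            "FROM STDIN WITH CSV HEADER",
            f,
        )
    conn.close()

if __name__ == "__main__":
    import_daily_aggregates()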
The growth of NoSQL systems over the past few years has prompted more and more companies to work on integrating NoSQL and big data into traditional SQL-centric systems. The technologies are still in their adolescence, and for most companies the necessity for custom-built, developer-driven arrangements remains a primary concern, in terms of both usability and cost. Overall, the six architectures discussed above break down into two essential classifications: 1) those that use NoSQL only and build flexibility into the application side, adding better UIs, better ad hoc reporting, and other custom features onto the NoSQL products they are currently using; and 2) those that use NoSQL to run their applications, but then take that data out of the NoSQL system and put it into MySQL or a traditional data warehouse for more “after the fact” analysis. Both approaches have many success stories, and the best approach for a particular company depends on its specific needs. The debate will continue, further integrations will move forward, and more companies will enter the fray as the concerns of big data become ever more important in the coming years.