
Building a Better Data Supply Chain

By Jennifer Zaino  /  March 17, 2015


The data supply chain must be simplified and it must be integrated. That’s the ultimate mission of Metanautix, another startup to emerge – at least in part – from the mind of a former Google employee. (You can read about the startup SpaceCurve, also with ex-Googler origins, here.)

“A big part of the problem with data is that you can’t tell how it all fits together, and we want to help build that picture,” says Metanautix founder and CEO Theo Vassilakis, a former engineering director at Google who was in charge of Dremel, Google’s highly scalable, interactive ad-hoc query system for analyzing read-only nested data. BigQuery, which enables fast, SQL-like queries against append-only tables, is Google’s external implementation of Dremel.

Vassilakis’ experience with Dremel helped shape the vision for Metanautix, as did that of his co-founder and CTO, Toli Lerios, previously senior software lead for image algorithmics at Facebook. Metanautix brings together the high-level functionality of standard SQL with next-generation distributed technology in its Quest data compute engine, which provides a way to navigate and analyze data from any source, at any scale, without requiring that it be moved into a centralized system. Quest runs on VMware vSphere and VMware vCloud Air to provide virtualized Big Data analytics on-premises and in the Cloud. The data, accessed and combined as SQL tables, can be anything from logs to records to documents to images and video. “A lot of things people do with images are determining aggregates or sums – for example, the percentage of red or green – and SQL is really good at that,” Vassilakis says.
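To make that concrete, an image-aggregate query in this spirit might look like the following sketch. It is hypothetical: the article does not describe Quest’s actual schema conventions for image data, so the table layout (one row per pixel) and all names here are invented.

    -- Hypothetical sketch only: assumes the engine exposes each photo's
    -- pixels as rows of a SQL table; table and column names are invented.
    SELECT
      photo_id,
      100.0 * SUM(CASE WHEN dominant_channel = 'red' THEN 1 ELSE 0 END)
            / COUNT(*) AS pct_red_pixels
    FROM photo_pixels          -- one row per pixel, per photo
    GROUP BY photo_id;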

Accessing and analyzing data in different formats and structures, residing in data silos on-premises or in the Cloud, is a long-established challenge, and getting to actual insights upon which to take action can be a slow and often iffy process. It may require business analysts, themselves often familiar with SQL, to bring software engineers into a complex data analysis pipeline to bridge the gap between the nice, clean Big Data that typically resides in something like a Teradata system and the faster-changing, messier data that is more likely to sit inside a NoSQL system. Quest, however, integrates with Teradata, MongoDB – a popular NoSQL database – and many other data sources, making it easy to analyze JSON-structured data together with traditional relational data. Part of the goal, says Vassilakis, is to make it possible for anyone to query anything with SQL, as in the sketch below.
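As an illustration of what such a cross-system query might look like, the following sketch joins a MongoDB collection with a Teradata table, both surfaced as SQL tables. All identifiers are hypothetical, and the dotted access into the JSON payload assumes a Dremel-style nested-data syntax rather than documented Quest syntax.

    -- Hypothetical sketch: joining messy JSON events (MongoDB) with clean
    -- relational records (Teradata) in one statement; names are invented.
    SELECT c.customer_id,
           c.name,                      -- relational column from Teradata
           e.payload.page AS last_page  -- nested JSON field from MongoDB
    FROM teradata.warehouse.customers AS c
    JOIN mongo.events.clickstream AS e
      ON e.payload.customer_id = c.customer_id
    WHERE e.event_time >= DATE '2015-01-01';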

“A SQL database, a NoSQL database, a file, a JPEG or video. At the end of the day, if you have a unified high level English-like language to combine all your data, you are going to be better off,” he says. Also, most standard visual analytics tools generate SQL, so the language provides pluggability, he adds.

Metanautix also is trying to address the fact that most businesses today are lucky if they can afford to hire one Data Scientist, never mind more than one. When SQL – a 40-year-old, English-like language known by many – takes a big role in cross-data, cross-system analytics, that scarcity becomes less of a problem, because more ordinary users are empowered to do the work themselves. Newer, fancier, less well-known languages often are employed to deal with bigger and more diverse data, “but what about using the skills of old with smart systems underneath doing the heavy lifting,” he says. “Plus, SQL at a high level is easier to manage.”

And a big part of the “magic,” as he describes it, is the distributed computing aspect of the product, which uses aggregation trees for speedy insights, scaling to petabytes of data and thousands of servers. The simplified, transparent data supply chain that Quest enables rests on its facilities for end-to-end analysis, from ad-hoc analysis and discovery to in-memory serving.
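The article doesn’t detail Quest’s internals, but the general aggregation-tree idea can be illustrated with an ordinary aggregate query like the hypothetical one below: each leaf server computes partial counts and sums over its shard of the data, intermediate nodes merge those partials, and the root emits one final row per group.

    -- Hypothetical sketch: a query shape that aggregation trees parallelize
    -- well, because COUNT and SUM decompose into partial aggregates that
    -- can be merged up the tree.
    SELECT region,
           COUNT(*)   AS requests,
           SUM(bytes) AS total_bytes
    FROM web_request_logs      -- illustrative table name
    GROUP BY region;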

Data assets analyzed at speed by Quest can power Tableau dashboards for interactive visual analytics, or users can conduct interactive analysis using other favored tools such as Microsoft Excel or Hyperion.

Sharing Knowledge, Embracing Transparency

Technologies like Google’s BigQuery and Amazon’s Redshift, Vassilakis thinks, actually reinforce Metanautix’s message that a lot of traditional enterprises want to access their data using SQL. He does point out, however, that most Cloud analytics systems, unlike Metanautix, have modified SQL and so don’t offer the same value proposition of letting users leverage the technology in a traditional cluster on their own sites. A key idea for Metanautix is being a data compute engine packaged in standard VMware virtual machines. Unlike a database engine, the data compute approach doesn’t require pulling all the data in before starting a query, he says. With a Cloud analytics solution, “it is more like, ‘first put all your data in my cloud and then you can do useful stuff with it, and then we charge you for all the data in the cloud,’” he says. At Metanautix, the charges are for the compute, not for the data storage.

“We also do a wider variety of computations than BigQuery does – for example, processing image and video data, and we have built-in ETL access that BigQuery doesn’t do, at least as of today,” Vassilakis says. In fact, Metanautix has just closed a deal with a utility company that requires its contractors to take photos of every stage of gas pipeline installations, as well as include 3D models. This will ease the way for the utility to better analyze those images to ensure the work was done correctly, because it won’t have to go through the slow and expensive process of storing all that massive data before querying it.

“It takes a lot of money to put those data bits in, and so it’s not usual to put in raw binary videos or JPEGs,” he says. “A database is not optimized for analyzing [such things] but a data compute engine is.”

But Dremel, in its own right, brought other values to the table that Metanautix seeks to expand upon as well: transparency and visibility. Dremel was used for data access and analysis by nearly every discipline within Google because it was easy to use, it was always on, and it gave everyone a really clean picture of what was going on. “A lot of different people used the same tool so they could look at the queries they all ran in standard SQL, and from that you could tell what data people were interested in,” Vassilakis says. It became a way of sharing institutional knowledge, and of growing that knowledge by taking the tables people had used before and building new data combinations on top of them.

“We are trying to create that social effect with Metanautix for the general enterprise,” he says. “We think that is how you integrate the data supply chain and get a better view of what people are doing.”

Metanautix hasn’t fully realized its goals on this front yet, as it’s still working to build the right user interface experience. There is a prototype version that lets users see at a high level the queries other people are running, the tables they are touching and joining, and which tables were constructed atop others’ work – subject, of course, to organizations’ specific security requirements or compliance constraints. “We are working on nice, easy-to-consume Tableau-style workbooks that you can navigate conveniently,” he says.

Vassilakis also emphasizes that the security and privacy angle is substantial in all enterprises, and that Metanautix helps manage that.

As much as anything, organizations have to supplement what is possible through technology with cultural change that motivates people to embrace an integrated data supply chain. “Tools and systems have to help guide people to do the right thing,” according to Vassilakis. “It’s making systems with the right incentives, like an easier user interface.”
