Paths, Patterns, and Lakes: The Shapes of Data to Come

By James Kobielus

Data doesn’t exist outside your engagement with it. Or, rather, it may physically exist, but it’s little more than a shapeless mass of potential insights until you attempt to extract something useful from it.

Drilling for actionable intelligence can take either of two approaches: query for it or mine for it. The approach that you use will shape the types of data platforms you prefer. Increasingly, the pattern-seeking approach will predominate in the big data era, as more data is unstructured and more intelligence is mined from data lakes via machine learning and other statistical models.

Relational databases have become dominant because, for most applications, engagement with the data is primarily through structured queries. These define a precise grammatical path for extracting answers, and typically leverage the fact that the dominant shape of most data is the relational structure of tables linked by primary keys, joins, indexes, and other techniques. One of the principal reasons for relational technology’s near-universal adoption is that, abetted by the universal SQL grammar, it enables flexible query optimization across diverse data scenarios.

Relational queries can be amazingly agile and efficient if the shape of the underlying databases conforms to standard rules, such as third normal form, and leverages the design features mentioned above. Columnar and dimensional are access-path variations on the relational shape, enabling query speeds to be accelerated even further under specific schematic constraints.
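To make the idea of a structured query path concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names, and sample rows are hypothetical and chosen only for illustration; the point is that a primary key, a join, and an index together define the path the query follows.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE orders (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customers(id),
            amount      REAL NOT NULL
        );
        -- The index gives the join a fast access path into orders.
        CREATE INDEX idx_orders_customer ON orders(customer_id);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 40.0);
    """)
    return conn


def top_customers(conn):
    """Join orders to customers on the primary key and total spend per customer."""
    query = """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders AS o ON o.customer_id = c.id
        GROUP BY c.id
        ORDER BY total DESC
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(top_customers(build_demo_db()))
```

The query states what answer is wanted; the engine decides how to walk the tables, which is the flexibility in query optimization described above.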

However, graph databases are beasts of a different shape, one that relies on the concepts of “nodes” and “edges,” though they share with relational technology a fundamental focus on accelerating query access paths. Indeed, graph technology has its own standard query grammars, such as Gremlin, which are referred to as “graph traversal languages.”

As Andreas Kollegger notes in this article, graph technology excels in defining flexible access paths through networked data structures of arbitrary complexity, and doesn’t have relational’s inherent bias toward hierarchical query paths. As he explains, graph databases are well-suited for many complex query scenarios that relational databases can’t address without being twisted in awkward directions.
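A rough sketch of what a traversal over nodes and edges looks like may help here. This is plain Python, not Gremlin or any graph database's API, and the node names and the `shortest_path` helper are hypothetical. It treats edges as undirected and uses a breadth-first search to find one shortest access path between two nodes.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """Return one shortest path from start to goal, or None if unreachable.

    `edges` is an iterable of (node, node) pairs, treated as undirected.
    """
    # Build an adjacency map: node -> set of neighboring nodes.
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])  # each queue entry is a path so far
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        # Sorting makes the traversal order deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


if __name__ == "__main__":
    friendships = [("alice", "bob"), ("bob", "carol"),
                   ("carol", "dave"), ("alice", "erin")]
    print(shortest_path(friendships, "alice", "dave"))
```

Expressing the same multi-hop question relationally would typically take a chain of self-joins or a recursive query, which is the kind of awkwardness the paragraph above refers to.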

Kollegger tries to make a case for graph databases eventually obsoleting relational technology, but I’m not convinced. If graph technology were the panacea for optimizing data access performance, it would have been recognized as such long ago. Also, hierarchical relationships—the sweet spot for the relational model—are far more pervasive in applications than he gives credit for.

But even more fundamentally, graph technology, like relational, is not suited to the new world of unstructured data, machine learning, and schema on read. Whatever its merits for defining fast flexible query paths, a graph, like a relational structure, is inherently a schema-on-write technique applied to structured data.

The new shape of data is the lake. This refers to a clustered environment that includes a distributed file system, algorithm library, and machine learning execution engine. Rather than enforce a common schema and semantics on the objects it stores, the data lake uses schema on read and uses statistical models to extract meaningful patterns from it all. By “patterns,” I’m referring to the most salient correlations, features, and predictors that might otherwise go unnoticed in the data.
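Schema on read can be sketched in a few lines. In this hypothetical example, raw records are stored exactly as they arrived, with inconsistent field names and types, and a schema is applied only when a reader asks for the data. The record contents, the `SCHEMA` mapping, and the `read_with_schema` helper are all assumptions made for illustration.

```python
import json

# Raw objects as they might sit in the lake: no common schema was enforced.
RAW_LINES = [
    '{"user": "ada", "temp_c": "21.5", "ts": "2015-03-01"}',
    '{"user": "grace", "temperature": 19, "note": "sensor B"}',
    'not json at all',
]

# The reader's schema: output field -> (candidate source keys, converter).
SCHEMA = {
    "user": (("user",), str),
    "temp_c": (("temp_c", "temperature"), float),
}


def read_with_schema(lines, schema):
    """Yield rows shaped by `schema`, which is applied at read time only."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the object stays in the lake; this reader skips it
        if not isinstance(record, dict):
            continue
        row = {}
        for field, (keys, convert) in schema.items():
            # Take the first source key that is present in this record.
            raw = next((record[k] for k in keys if k in record), None)
            row[field] = convert(raw) if raw is not None else None
        yield row


if __name__ == "__main__":
    for row in read_with_schema(RAW_LINES, SCHEMA):
        print(row)
```

A different reader could apply a different schema to the same stored objects, which is what distinguishes this from the schema-on-write approach of relational and graph stores.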

Within a data lake, some data sets may take shapes that are neither relational nor graph, but which nonetheless facilitate subsequent statistical analyses. One mathematical approach, topological data analysis (TDA), reduces large, raw multi-dimensional data sets down to compressed representations with fewer dimensions while preserving properties of relevance to subsequent analyses. Mathematical approaches such as TDA, principal component analysis (PCA), and singular value decomposition (SVD) are important tools in wrestling such high-dimensional analytics challenges down to earth.
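As a small illustration of dimensionality reduction, the sketch below finds the first principal component of a data set and projects each row onto it. It covers only PCA, not TDA, and it uses power iteration on the covariance matrix with nothing beyond the standard library so that it runs anywhere; in practice one would more likely reach for a numerical library's SVD routine. The function name and sample data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, 1-D scores) for the top principal component.

    Uses power iteration on the sample covariance matrix. Assumes at least
    two rows, and that the starting vector of all ones is not orthogonal
    to the dominant direction.
    """
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    centered = [[r[j] - means[j] for j in range(dims)] for r in rows]

    # Sample covariance matrix (dims x dims).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]

    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims))
               for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break  # no variance along the current direction
        vec = [x / norm for x in nxt]

    # Project each centered row onto the component: many dimensions become one.
    scores = [sum(c[j] * vec[j] for j in range(dims)) for c in centered]
    return vec, scores


if __name__ == "__main__":
    data = [(1, 2, 0), (2, 4, 0), (3, 6, 0), (4, 8, 0)]
    direction, scores = first_principal_component(data)
    print(direction, scores)
```

In the sample data every point lies on a single line, so one score per row preserves all of the variation in the original three columns.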

From a pattern-sensing standpoint, the ease of mining any particular data lake is determined by the range of unstructured data platforms it includes (e.g., Hadoop, MongoDB, Cassandra) and by the statistical libraries and modeling tools available for mining it. But query tools are also an important feature of the data lake. After all, no lake is complete if it doesn’t support any of the commercial dialects of SQL on Hadoop.

Unless you’re a data scientist, your primary engagement with data lakes will be in the act of querying them, just as it is with any database, no matter what shape it’s in.
