Data is useless if it lacks structure or if there is no opportunity to add structure downstream so that it can be processed, analyzed, and delivered into real-world applications.
Structure is the contextual matrix within which data’s value is realized. It refers to the formats, schemas, models, tags, ontologies, patterns, and other artifacts without which data is a useless pile of raw bits. So-called “unstructured” sources–such as files, text, logs, events, sensors, and media–only possess value if there is sufficient metadata and other contextualizing artifacts to realize that value in use.
Structure is either built into the data from its inception or added to it subsequently. To the extent that data is unstructured at its source, data management professionals must rely on some combination of data integration, manual modeling and tagging, natural language processing, text mining, machine learning, and other approaches to discover, encode, and publish its structure for downstream utilization.
Data can acquire various levels of structure, including formats, schemas, metadata, models, tags, ontologies, context, and the like. It can gain structure in various ways, ranging from totally automated processes to entirely manual handling. It can acquire structure at many points in the data-processing pipeline, ranging from the moment it’s created all the way through data acquisition, integration, transformation, preparation, modeling, analysis, query, visualization, and usage.
Does it matter whether any of these structure-giving functions are performed at the time the data is created or at one or more points downstream in its life cycle?
One of the most geek-intensive perennial discussions in the big data industry is whether a core feature of traditional relational database management systems (RDBMSs), “schema on write” (i.e., data structured at creation into a pre-existing schema native to the database), is becoming obsolete in favor of “schema on read” (i.e., data structured or restructured in the downstream act of preparing, analyzing, and using it, independent of whatever native structure it may have had at the source). Some refer to this distinction as “early binding” of structure to data (i.e., schema on write) vs. “late binding” (i.e., schema on read). Here’s a good SlideShare that highlights the use cases where each approach is best suited.
Essentially, schema on write–the core of relational database technology–is suitable for repetitive data for operational applications with a priori semantics, specific use cases, stringent governance, and predictable performance. In other words, online transactional processing, data warehousing, statistical analysis, and business intelligence. The SlideShare describes these as use cases involving “known unknowns” in the data.
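To make the distinction concrete, here is a minimal sketch of schema on write, using SQLite as a stand-in for a full RDBMS (the table and column names are hypothetical): the schema exists before any data arrives, and every write must conform to it or be rejected on the spot.

```python
import sqlite3

# The schema is declared up front, before any data is loaded.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        order_id  INTEGER PRIMARY KEY,
        customer  TEXT NOT NULL,
        amount    REAL NOT NULL
    )
""")

# A conforming row is accepted.
conn.execute("INSERT INTO orders VALUES (1, 'acme', 19.99)")

# A row that violates the schema (NULL customer) is rejected at write time,
# which is exactly the "early binding" of structure to data.
try:
    conn.execute("INSERT INTO orders VALUES (2, NULL, 5.00)")
except sqlite3.IntegrityError as err:
    print("rejected at write time:", err)
```

The early rejection is what buys the predictable governance and performance described above: by query time, every row is already guaranteed to fit the model.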
The principal downsides of “schema on write,” per my colleague Tom Deutsch in an IBM Data magazine post last year, are several: it requires that data be modeled up front, its schemas are difficult to change, and it biases against sources that can’t be made to fit the schema.
By contrast, schema on read–the forte of Hadoop and NoSQL–is focused on exploratory applications with a posteriori semantics, myriad use cases, non-stringent governance, and best-effort performance. In other words, classic data science, which analyzes what the SlideShare refers to as “unknown unknowns.”
The downsides of schema on read, per Deutsch, include the need for compute-intensive processing, the data’s lack of self-documentation, and the need to “spend time creating the jobs that create the schema on read.”
That last point touches on something that schema-on-read advocates, especially Hadoop-only devotees, tend to downplay: data preparation, which is still necessary in the schema-on-read arena, just as it has always been for schema-on-write applications such as enterprise data warehousing (EDW) and business intelligence (BI).
In other words, we’re talking about extract-transform-load (ETL), albeit performed downstream by the data scientist, on an ad-hoc basis, using raw data as the input, and creating “schema on read” as the output prior to transforming and loading the source “unstructured” data. Indeed, the above-cited SlideShare refers to schema on read as requiring ETL “on the fly” when retrieving raw data from Hadoop Distributed File System (HDFS) for exploratory data science.
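A minimal sketch of what that ad-hoc, on-the-fly ETL looks like in practice (the log format and field names here are hypothetical): the raw lines land with no enforced schema, and structure is imposed only at the moment of reading, by whoever writes the parsing job.

```python
import re

# Raw "unstructured" lines as they might land in HDFS: no schema was
# enforced when they were written, so malformed records sit alongside good ones.
raw_lines = [
    "2024-03-01T12:00:00 GET /index.html 200",
    "2024-03-01T12:00:05 POST /login 401",
    "corrupt line that fits no schema",
]

# The "schema" is just a pattern applied on read, not a constraint on write.
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+) (?P<method>GET|POST|PUT|DELETE) (?P<path>\S+) (?P<status>\d{3})"
)

def read_with_schema(lines):
    """ETL on the fly: extract, transform, and filter at read time."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed rows surface only now, at read time
        record = match.groupdict()
        record["status"] = int(record["status"])  # the "transform" step
        yield record

records = list(read_with_schema(raw_lines))
print(records)
```

Note that the cost of the corrupt line is deferred rather than eliminated: instead of being rejected at write time, it silently drops out of every downstream job that applies this read-time schema.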
Though some industry observers, such as this one, refer to schema on read as a “no-ETL” approach, that characterization is simply not true. Rather, schema on read displaces ETL away from its traditional automated execution in a specialized pipeline node and away from its traditional handling by a special cadre of data-integration specialists. Instead, it pushes all of this processing downstream into the data-modeling workloads of data scientists, who write MapReduce code to support ETL and other data analytics functions executed in their Hadoop clusters.
Clearly, the need for data modeling is not going away with the rise of Hadoop. Not only do schemas–aka “data models”–need to be constructed on a late-binding basis to sift through unstructured data, but they will remain central to RDBMS platforms, which underpin online transaction processing (OLTP), EDW, BI, and other mainstay data applications.
Big data’s new frontiers, such as Hadoop and NoSQL, haven’t budged RDBMSs from these core roles, which will remain stable going forward.