The next big thing for the data management community is to give up central control and planning in order to gain scalability and robustness.
Relational database systems have been the backbone of enterprise information management since the 1970s. The growth of enterprise information beyond what traditional relational systems can effectively manage poses a generational challenge to enterprises. Drives toward maturing data management procedures and practices (via the Capability Maturity Model, or CMM) miss the point that the sheer volume of data is growing faster than current systems can manage, so data management practitioners are likely to have more (not less!) difficulty maturing their practices.
A common approach to this challenge has been to grow larger data collections in larger databases and data warehouses. This has proven to be both expensive and ineffective: data growth has outstripped the ability of relational systems to grow. Paradoxically, the solution to the problem of managing huge data volumes may be to adopt smaller, not larger, databases and to manage them differently. When databases get large and are directly coupled with applications, silos are the natural result. Smaller databases that are used in a more distributed manner and are loosely coupled to applications allow for more flexible management. These "data islands" may be used by a variety of applications that are not directly controlled by the owners of the databases, thus creating a form of data archipelago: a loose confederation of islands in a nominal partnership. Such an approach requires a radical rethinking of zones of control within enterprises. Enterprise data managers will need to give up the traditional social mechanisms of central control and central planning that are stopping them from effectively dealing with huge data volumes.
One approach to creating smaller, distributed data islands is to identify information architectures that provide principles of scalability, robustness to change and robustness to single-point failure, such as elements of World Wide Web architecture.
Elements of Web architecture considered critical for use include:
- REST (for architectural properties of scalability and robustness)
- URI (to obviate the need for a "bus" via universal addressing and late binding)
- RDF (data model grounded in URIs)
- RDFS/SKOS/OWL as appropriate (vocabulary description languages for RDF for definitions of inter-linking)
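As a minimal sketch of what "a data model grounded in URIs" could look like in practice, the following Python maps one relational row to RDF-style triples and serializes them as N-Triples. The base URI `http://data.example.org/`, the `customer` table, its columns, and the URI layout are all hypothetical choices made for illustration; a real data island would choose its own naming scheme and would likely use an RDF library rather than hand-rolled serialization.

```python
from urllib.parse import quote

# Hypothetical base URI for a single data island (an assumption for this sketch).
BASE = "http://data.example.org/"


def row_to_triples(table, key_column, row):
    """Map one relational row to (subject, predicate, object) triples.

    The subject URI is minted from the table name and the primary key value;
    every other non-NULL column becomes a predicate URI in a local vocabulary.
    """
    subject = f"{BASE}{quote(table)}/{quote(str(row[key_column]))}"
    triples = []
    for column, value in row.items():
        if column == key_column or value is None:
            continue  # the key is already in the subject URI; skip NULLs
        predicate = f"{BASE}vocab/{quote(table)}#{quote(column)}"
        triples.append((subject, predicate, str(value)))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples lines, treating objects as plain literals."""
    lines = []
    for s, p, o in triples:
        escaped = (
            o.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{s}> <{p}> "{escaped}" .')
    return "\n".join(lines)


# Hypothetical row, as it might come back from a database cursor.
row = {"id": 42, "name": "Acme Ltd", "region": None}
triples = row_to_triples("customer", "id", row)
print(to_ntriples(triples))
```

Because every subject and predicate is a URI, any application that can dereference or simply compare URIs can refer to this record without a shared bus or a compile-time binding to the database schema.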
Note that the suggested use of vocabulary description languages for metadata descriptions is not meant to imply any form of "upper level ontology". Localized descriptions of datasets published via a Web architecture are considered sufficient as long as any terms used are grounded in accessible URIs.
Similarly, the suggested use of "Web architecture" is not intended to be synonymous with the public World Wide Web. Enterprises are recognized to have security restrictions that surpass those of the public Web, although the public Web and private networks using Web architecture may be utilized as appropriate to augment and connect enterprise systems.
Addressing huge volumes of relational data suggests localized central management of relational data islands, interlinking using techniques pioneered in the Linking Open Data project and decentralized curation of the links between data islands.
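The division of labor described here (islands managed locally, links curated separately) can be sketched in a few lines of Python. Everything below is illustrative: the two islands, their URIs and vocabularies are hypothetical, and following `owl:sameAs` links with a simple graph traversal is only one of several ways the links between islands could be used.

```python
from collections import defaultdict

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Two hypothetical data islands, each managed centrally by its own owner.
sales_island = [
    ("http://sales.example.org/customer/42",
     "http://sales.example.org/vocab#name", "Acme Ltd"),
]
support_island = [
    ("http://support.example.org/account/A-7",
     "http://support.example.org/vocab#openTickets", "3"),
]

# Links between islands, curated separately from either island.
links = [
    ("http://sales.example.org/customer/42", OWL_SAME_AS,
     "http://support.example.org/account/A-7"),
]


def equivalents(uri, link_triples):
    """Return every URI reachable from `uri` over owl:sameAs links,
    treating the links as symmetric and transitive."""
    neighbours = defaultdict(set)
    for s, p, o in link_triples:
        if p == OWL_SAME_AS:
            neighbours[s].add(o)
            neighbours[o].add(s)
    seen, stack = {uri}, [uri]
    while stack:
        for nxt in neighbours[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def describe(uri, islands, link_triples):
    """Collect (predicate, object) facts about `uri` from all islands,
    following the curated links to equivalent URIs."""
    same = equivalents(uri, link_triples)
    return sorted(
        (p, o) for island in islands for s, p, o in island if s in same
    )


facts = describe(
    "http://sales.example.org/customer/42",
    [sales_island, support_island],
    links,
)
for predicate, value in facts:
    print(predicate, value)
```

Neither island needs to know about the other: removing or correcting an entry in `links` changes what a consumer sees without touching either database.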
The necessary social adjustments will include loosening hierarchical control over tightly held enterprise "core" systems to allow linking with remote divisions and users within a (possibly distributed) enterprise.
Top-down models of control and data function well in situations where we can identify and centralize all relevant information for a particular purpose. What happens when we can't? Common examples of the latter include mergers, acquisitions, interactions with separately developed workgroup systems and rapidly changing business directions.
There is an obvious analogy to software engineering: prior planning and centralized control were theorized to create perfect results but, for good reasons, that never happened. Newer so-called Agile methodologies use iterative development to cope with changing requirements and recognize the limitations of top-down planning. Similarly, data management in its current form will fail to mature along the CMM.
Like operators of Web systems that are used by other Web services without notice, foreknowledge or agreement (e.g. Flickr, del.icio.us, Google Docs), data island operators will have to adjust to unanticipated uses of their systems by remote users (in "enterprise mash-ups"). This will itself require a massive social change, since operators of such services can and do suffer "The Slashdot Effect", a marked and unexpected increase in traffic sufficient to crash machines or make them effectively unreachable.
Enterprises may not believe they can make such extreme sacrifices to acquire scalability and robustness. I ask simply in response, "What choice do you have?"