Almost everyone has heard of “big data” – the popular term for data so massive it’s difficult to manage. Today, the volume of search engine queries, online retail sales and Twitter messages regularly exceeds the capabilities of traditional databases.
There’s a complement to big data that we call “big schema”. Modern data not only arrives in vast quantities and at high rates, but can also have a diverse structure. Big schema can arise with enterprise data models, large data warehouses and scientific data.
Enterprise data models
An enterprise data model (EDM) describes the essence of an organization – it abstracts multiple apps, combining and reconciling their content. EDMs have many purposes, such as integrating app data, driving consistency across apps, documenting enterprise scope, finding functional gaps and overlaps, and providing a vision for future apps. Many enterprises have dozens of apps, so schema size can be very large.
The UK financial software vendor Avelo has been using an EDM to coordinate and integrate apps. Avelo was formed by the merger of four predecessor companies, so its apps align poorly: they have different abstractions, naming approaches and development styles. As a result, it was difficult to construct an EDM.
We limited the scope of Avelo’s EDM to cope with the poor alignment. We started by seeding the EDM via rapid reverse engineering. We browsed each app’s schema to find core concepts – the tables with the most foreign key connections – and used only the top 10. Business experts helped us reconcile the concepts to create a high-level EDM.
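The ranking step can be illustrated with a short Python sketch. It assumes the foreign keys have already been extracted from an app's schema as (child table, parent table) pairs. The function name `core_tables` and the sample table names are hypothetical, chosen only for illustration; they are not taken from Avelo's schemas.

```python
from collections import Counter


def core_tables(foreign_keys, top_n=10):
    """Rank tables by the number of foreign key connections that touch them.

    foreign_keys is an iterable of (child_table, parent_table) pairs.
    Each key counts once for the child and once for the parent, so a
    self-referencing key counts twice for its table.
    """
    counts = Counter()
    for child, parent in foreign_keys:
        counts[child] += 1
        counts[parent] += 1
    # Sort by descending count, then by name so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical foreign keys for one app, as (child_table, parent_table).
sample_fks = [
    ("policy", "client"),
    ("policy", "product"),
    ("claim", "policy"),
    ("payment", "policy"),
    ("address", "client"),
]

print(core_tables(sample_fks, top_n=2))
```

In practice the pairs would come from the database catalog, and the resulting list is only a starting point for the discussion with business experts.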
Large data warehouses
Data warehouses can also involve big schema. A data warehouse combines data from day-to-day operational apps and places it on a common basis for analysis and reporting. A large enterprise can have a great deal of data to analyze, leading to many data warehouse tables.
We can’t do much to restrain the size of a large data warehouse. But by using agile data modeling, we can make sure that payoff occurs incrementally, as the warehouse is constructed.
We recently worked on a large data warehouse encompassing multiple departments that illustrates both good and bad approaches. One department’s staff focused on building their portion of the warehouse and deferring usage. After many months of work, they are still building. Another department chose to build incrementally, according to business demand. This latter approach has been more successful, and its continued funding has been easier to justify.
Scientific data
Scientific data is a third source of big schema. Scientific apps are extremely complex, involving time series, complex data types, and deep dependencies and constraints. Scientific schema is often not only large, but also difficult to represent.
Many years ago, we worked on the PDXI project sponsored by the AIChE. The purpose of PDXI was to produce a data model to serve as the basis for a data exchange standard for chemical engineering apps. Chemical plants have a wide variety of equipment, complex mixtures of substances and a range of operating conditions, so there is a lot of data to represent. The PDXI model was several hundred pages. This was too much to manage, too much to explain and too much to understand.
In retrospect, we now realize that we should have used more generic data structures. For example, the PDXI model had fifty pages for equipment, such as tanks, reactors, pumps and distillation columns. A better model would have avoided all this detail by combining data and metadata. Then the fine particulars of each kind of equipment could have been specified elsewhere.
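One way to picture a generic structure that combines data and metadata is the Python sketch below. It is not the PDXI model or a proposed replacement for it. The class names (`AttributeDef`, `EquipmentType`, `Equipment`), the attributes and the units are all hypothetical. The idea is that a single `Equipment` structure holds the data, while the particulars of each kind of equipment are recorded as metadata rather than as separate model pages.

```python
from dataclasses import dataclass, field


@dataclass
class AttributeDef:
    """Metadata: one attribute that a kind of equipment may carry."""
    name: str
    unit: str


@dataclass
class EquipmentType:
    """Metadata: a kind of equipment, such as a tank or a pump."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> AttributeDef

    def define(self, name, unit):
        self.attributes[name] = AttributeDef(name, unit)
        return self  # allow chained definitions


@dataclass
class Equipment:
    """Data: one piece of equipment, checked against its type's metadata."""
    tag: str
    kind: EquipmentType
    values: dict = field(default_factory=dict)  # attribute name -> value

    def set(self, name, value):
        if name not in self.kind.attributes:
            raise KeyError(f"{self.kind.name} has no attribute {name!r}")
        self.values[name] = value

    def describe(self):
        return {n: f"{v} {self.kind.attributes[n].unit}"
                for n, v in self.values.items()}


# Hypothetical equipment type, defined as metadata rather than as a new class.
tank = EquipmentType("tank").define("volume", "m3").define("design_pressure", "kPa")
t101 = Equipment("T-101", tank)
t101.set("volume", 50)
print(t101.describe())
```

Adding reactors, pumps or distillation columns then means adding `EquipmentType` entries, not enlarging the model itself. The trade-off is that constraints move out of the schema and into the metadata, where they have to be checked by code such as `set`.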
So when you build applications, think not only about big data, but also about big schema. For where there is big data, there is often big schema. And big schema can even arise by itself.