John Singer wonders if Conceptual Data Modeling can save IT from itself. “I definitely think that we need a bit of saving. We need a little help in terms of how we build systems, especially from a data perspective.”
Singer spoke at DATAVERSITY®’s Enterprise Data World Conference about Data Modeling, current gaps in the field, and how the future of modeling might look. Singer is the founder of NodeEra, open-source Property Graph Modeling software.
There’s no question that people are doing amazing things with machine learning and business analytics, Singer said. It’s not that today’s systems don’t produce good results, but at the end of the day, we’re really still building unit record processing systems — they’re just faster and better at what they do. “And I don’t think we can move forward until we address that issue.”
Current Data Modeling Tools
Typically, a data modeler is assigned a project where the stated product is a data model, but in reality, what the project owners are asking for is a physical database design. The methodology that modelers are taught is to first build a conceptual data model, then extract a logical data model from that, and then to refine that into a physical data model.
Conceptual Data Modeling is business-oriented, technology independent, and abstract. The logical model adds specific properties and technical elements, and the physical model includes DDL and super/sub types specific to the database, he said.
The Problem with the Model
But Singer has a problem with the conceptual data model because it’s usually defined in such broad-brush strokes. Ask what a conceptual data model is, and the answer is often: “It’s more abstract.” To Singer, that’s not sufficient. “It’s really not what we need to accomplish, but it’s all we have.” Another issue is the polyglot persistence layer. Organizations have so many different target databases that an Entity/Relationship model doesn’t really apply to many of the databases in use today.
Current modeling tools support the creation of these different models, and they can be linked, but the maintenance is a big problem, he said. “You can create the greatest conceptual model in the world, but nobody cares about it, because it’s just not impactful to anyone other than the data modeler.” Although he has no complaint with the process, it’s just not enough for conceptual models.
A Desperate Need for Conceptual Data Modeling
Singer pointed out that most of EDW addresses topics that exist to fix the lack of a good conceptual data design: governance, data catalogs, data glossary, lineage, strategy, and quality — these are all necessary, but the design at the front end of the system gets lost because the data model can’t capture it. “And when we persist the data into the database, it sure doesn’t get captured there.” Which leads to his assertion that there is a critical need for a conceptual data model.
Solution Requirements for the Conceptual Database
Singer’s solution, which he calls a “conceptual database,” includes both the model and the persistence, and it has three requirements:
- Model = data
The model and the data are defined using the same language so that the model equals the data.
- Technology neutral
The model must easily map back and forth to and from existing systems and databases.
- Mirror human behavior/be intuitive
The model should be intuitive, more closely mirroring human behavior, because humans excel at defining and discussing concepts, he said. “Language is really the missing piece.”
Existing Conceptual Data Modeling Approaches
In 1976, Peter Pin-Shan Chen published a paper titled “The Entity-Relationship Model: Toward a Unified View of Data.” His goal was to unify the different data models in use at the time.
“The relational model is based on relational theory,” said Chen, “but it may lose some important semantic information about the real world.” We can create a conceptual model that’s more semantically rich, Singer added, “but as soon as we put that data in a relational database, we lose all the context.”
Early Linguistic Based Modeling: NIAM/ORM
In the 1990s, another conceptually-oriented modeling approach, NIAM, emerged. Originally an acronym for Nijssen’s Information Analysis Methodology (after G.M. Nijssen, one of the researchers who developed it), it was later renamed Natural Language Information Analysis Model to clarify that the model was a team effort. The approach eventually became known as Object-Role Modeling (ORM).
ORM was designed to better reflect the human language used to describe the concepts in the model. It’s a more semantically rich way to model data, Singer said. It doesn’t persist in this form in a database, so although a relational design could be built from it, all the semantic detail would be lost.
Toward a New Database Management System (DBMS)
Newer technologies like property graphs and semantic web provide some, but not all, of what is needed.
- Property graph
To understand property graphs, it’s important to let go of the assumptions inherent in a relational database structure. An extremely flexible model, the property graph is very simple: “It’s nodes and relationships, and you put properties on them. You can really do anything you want with it,” he said, and modelers will often naturally gravitate toward a Chen- or an ORM-style model. The conceptual data model is not predefined, and because it’s not created until runtime, the modeler can just intuitively start modeling the data, treating every property as an entity. The downside, he said, is that “The semantics are just all in your head. And the underlying database doesn’t really have any understanding of the semantics.”
- Semantic Web Technologies
Distributed by its very nature, the goal of the semantic web is that “anyone anywhere can say anything about anything.” Users can publish data, and that data can be linked to any other published data. As with property graphs, the semantic web differs from the relational database structure: it describes things using a form of logic. The basic unit, called an “RDF triple” (RDF stands for Resource Description Framework), is an assertion of some fact, a relationship that exists between the subject and the object, expressed as three parts of a sentence in the form subject-predicate-object. The combination of all RDF assertions is called the RDF graph. Unlike previous models, there is no loss of semantics when persisting data, he said.
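To make the two structures concrete, here is a minimal sketch in plain Python, with no graph database or RDF library involved. The customer and order data, the property names, and the `to_triples` helper are all hypothetical illustrations and do not come from Singer's talk. The sketch holds one small property graph (nodes and relationships with properties on them) and then flattens it into subject-predicate-object triples.

```python
# Hypothetical example data; none of these names come from the talk.

# Property graph: nodes and relationships, each carrying arbitrary properties.
nodes = {
    "c1": {"labels": {"Customer"}, "props": {"name": "Acme Corp"}},
    "o1": {"labels": {"Order"}, "props": {"total": 250.0}},
}
relationships = [
    {"start": "c1", "type": "PLACED", "end": "o1", "props": {"channel": "web"}},
]


def to_triples(nodes, relationships):
    """Flatten the property graph into (subject, predicate, object) triples."""
    triples = set()
    for node_id, node in nodes.items():
        for label in node["labels"]:
            triples.add((node_id, "type", label))
        for key, value in node["props"].items():
            triples.add((node_id, key, value))
    for rel in relationships:
        # Simplification: relationship properties are dropped, because a
        # bare triple has no slot for them.
        triples.add((rel["start"], rel["type"], rel["end"]))
    return triples


triples = to_triples(nodes, relationships)
for triple in sorted(triples, key=str):
    print(triple)
```

In this sketch, labels such as `Customer` mean something only because the modeler chose them, which is the "semantics in your head" limitation described for property graphs. The triple form states each fact as a small sentence, but it carries meaning only once the predicates themselves are formally defined.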
Differences from a Relational Database
In a relational database, the table type must be defined before data can be added to it. With the semantic web, instance data can be collected and the database can classify it for you, or it can determine what category it belongs to.
Everything is expressed using the physical data model (the triple), but the conceptual data model is rigorously defined, as opposed to the property graph, where the conceptual model is defined just by convention.
“Here, it’s specifically called out.” Singer calls semantic web’s inferencing engine its “superpower,” because it can infer new facts or types from given facts, and it can classify things independently. “The ‘kryptonite’ part is that it’s hard to understand. Really smart people get the logic and the rest of us all kind of struggle.”
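The kind of inference described here can be sketched in a few lines of plain Python. This is an illustrative toy, not a semantic web engine: the `infer_types` function and the employee facts are hypothetical, and the two rules are loosely modeled on RDFS subclass reasoning (subclass relationships are transitive, and an instance of a subclass is also an instance of its superclass).

```python
def infer_types(triples):
    """Forward-chain two subclass rules until no new facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for (s, p, o) in facts:
            if p != "subClassOf":
                continue
            for (s2, p2, o2) in facts:
                # Rule 1: subClassOf is transitive.
                if p2 == "subClassOf" and s2 == o:
                    new.add((s, "subClassOf", o2))
                # Rule 2: an instance of a subclass is an instance of the superclass.
                if p2 == "type" and o2 == s:
                    new.add((s2, "type", o))
        if not new <= facts:
            facts |= new
            changed = True
    return facts


# Hypothetical given facts.
given = {
    ("Manager", "subClassOf", "Employee"),
    ("Employee", "subClassOf", "Person"),
    ("alice", "type", "Manager"),
}

all_facts = infer_types(given)
inferred = all_facts - given
for fact in sorted(inferred):
    print(fact)
```

Nothing in the given facts says that alice is a person; the engine derives it. Production reasoners support far richer rule sets, which is part of why the logic can be hard to follow.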
Semantic web databases seem to fulfill some of the requirements of a conceptual database, he said. Most importantly, the “model = data” requirement is clearly there, but the real issue is ease of use. How can this be made easier to use and accessible to business users, not just IT experts?
The concept of formal semantics grew out of the study of linguistics. Formal semantics uses techniques from mathematics and logic to form theories about human or computer languages.
The basic unit in formal semantics is the sentence, which, like human language, is a grammatically sound string of words. Each sentence has meaning and that meaning is called a “proposition.” Propositions are converted into a logical meta-language using a form of logic called predicate calculus. Propositions are matched with a set of values about the world and based on how well they match, can be determined to be true or not.
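As a rough illustration of matching a proposition against a set of values about the world, the sketch below evaluates two sentences over a tiny hand-built world in plain Python. The world, the predicate names, and the `holds` helper are assumptions made for this example; real formal-semantics systems are considerably more elaborate.

```python
# A hypothetical "world": a domain of individuals plus the tuples
# for which each predicate is true.
world = {
    "domain": {"alice", "bob", "acme"},
    "Customer": {("acme",)},
    "Employee": {("alice",), ("bob",)},
    "works_for": {("alice", "acme"), ("bob", "acme")},
}


def holds(predicate, *args):
    """Return True if the predicate is true of these arguments in the world."""
    return tuple(args) in world[predicate]


# Sentence: "Every employee works for some customer."
# Predicate calculus: for all x, Employee(x) implies there exists y such that
# Customer(y) and works_for(x, y).
every_employee_works_for_a_customer = all(
    (not holds("Employee", x))
    or any(
        holds("Customer", y) and holds("works_for", x, y)
        for y in world["domain"]
    )
    for x in world["domain"]
)

# Sentence: "Acme is an employee."  Predicate calculus: Employee(acme).
acme_is_an_employee = holds("Employee", "acme")

print(every_employee_works_for_a_customer)
print(acme_is_an_employee)
```

The first proposition matches the world and the second does not, so the same machinery that stores the facts can also decide which sentences about them are true.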
Toward a Language-Based API
The way data concepts are modeled must evolve to an easily understood form that survives persistence to a database, he said, “And the only way I’m able to see how this can happen is by going to a more language-based API.”
Language processing occurs in the subconscious mind. The system should be able to explain itself when asked: “What is the definition of that?” or “Which part of the business cares about this?” “We should be able to capture and maintain all this business context in a way that stays with the data.”
Conceptual Database Future
The challenge is to bridge from the logic to the language. “We need to do this in a way that more mirrors human behavior,” and Singer believes that language is the way to accomplish that. People are undoubtedly doing amazing things with machine learning and business analytics, he said, “but at the end of the day, we’re really still building unit record processing systems; they’re just faster and better at what they do. And I don’t think we can move forward until we address that issue.”