Data Modeling is the “act” of creating a data model (physical, logical, conceptual, etc.) and includes defining and determining an organization’s data needs and goals. The act of Data Modeling defines not just data elements, but also the structures they form and the relationships between them. Developing a data model requires the data modelers to work closely with the rest of the organization to establish the goals, and the end users of the information systems to establish the processes.
A data model contains “data elements” (for example, a customer’s name, an address, or a picture of an airplane) that are standardized and organized into patterns, allowing them to relate to one another. The programming language used has an influence on the shape of the model, as does the database being used. The model defines how data is connected, and how data is processed and stored inside the computer system. For instance, a data element representing a house can be associated with other elements, which, in turn, represent the color of the house, its size, address, and the owner’s name. How the information is organized varies from one model to the next.
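The house example above can be sketched in code. This is only an illustration, not a prescribed design: the class names, field names, and sample values (House, Owner, Address, size_sq_m, and so on) are assumptions invented for this sketch, and Python is used simply as a convenient notation.

```python
from dataclasses import dataclass


# All names and values below are hypothetical, chosen only to mirror
# the house example in the text.
@dataclass
class Owner:
    name: str


@dataclass
class Address:
    street: str
    city: str


@dataclass
class House:
    color: str
    size_sq_m: float
    address: Address  # relationship: a house is located at an address
    owner: Owner      # relationship: a house belongs to an owner


def describe(house: House) -> str:
    """Follow the relationships from the house to its related elements."""
    return f"{house.owner.name} owns a {house.color} house at {house.address.street}"


house = House(
    color="blue",
    size_sq_m=120.0,
    address=Address(street="12 Elm Street", city="Springfield"),
    owner=Owner(name="Pat Lee"),
)
print(describe(house))
```

A different model could just as well store the owner's name directly on the house, or let one owner hold many houses; that choice is exactly what varies from one model to the next.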
Data Modeling, databases, and programming languages are interdependent and have evolved together. Databases have evolved in basically four phases, and these phases tend to overlap:
- Phase I took place from roughly the 1960s to 1999, and included the development of database management systems (DBMSs): hierarchical databases, inverted list databases, and network databases, followed by the first object-oriented databases around 1985.
- Phase II is described as relational databases, which are queried with the Structured Query Language (SQL) and had become widespread by about 1990.
- Phase III supported Online Analytical Processing (OLAP), which was developed around 1990 (along with specialized DBMSs) and continues to be used today.
- Phase IV introduced NoSQL in 2008, supporting the use of big data, nonrelational data, graphs, and more.
In his book “Data and Reality” (1978), Bill Kent compared data models to road maps, emphasizing the differences between the real world and the world of symbols. He wrote, “Highways are not painted red, rivers don’t have county lines running down the middle, and you can’t see contour lines on a mountain.” This observation contrasts with the approach of many researchers, who attempted to create clean, mathematically sterile models. Kent preferred to emphasize the basic messiness of reality, and suggested data modelers should focus on creating order out of the chaos without distorting the basic truth. (With the popularity of NoSQL and non-relational data, Kent’s suggestions from 1978 have proven to be a good idea, but for technical reasons, it took us a while to get there.)
Data Modeling in the 1960s
The concept of Data Modeling started becoming important in the 1960s, as management information systems (MISs) became popular. (Before 1960, there was very little data or data storage. Computers of this time were essentially giant calculators.) Various theoretical data models were proposed during the 1960s, including three that became a reality. The first two are “the hierarchical data model” and “the network data model.” The third, the relational model, was proposed by Edgar F. Codd in the late 1960s.
The first true commercial database system, the Integrated Data Store (IDS), became available in 1964. It was developed by Charles Bachman, with General Electric supporting his research. IDS used the network model, described as a flexible way of representing objects and their relationships in a graph form. IBM chose to focus on the hierarchical model, which it used in its Information Management System (IMS). In this model, the relationships between records take a treelike shape. While the structure is simple, it is also inflexible due to a confining “one-to-many” relationship format: a parent record can have many children, but each child record can have only one parent.
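The difference between the two shapes can be sketched with plain dictionaries. This is a toy illustration only; it does not reflect how IDS or IMS actually stored records, and the record names and the parents_of helper are hypothetical.

```python
# Hierarchical model: a tree, so each record appears under exactly one parent.
hierarchy = {
    "Company": ["Sales", "Engineering"],
    "Sales": ["Alice"],
    "Engineering": ["Bob"],
}

# Network model: a graph, so the same record may be linked from several owners.
network = {
    "Sales": ["Alice", "Bob"],
    "Engineering": ["Bob"],
    "Project X": ["Alice", "Bob"],
}


def parents_of(links, record):
    """Return every record that lists `record` as one of its members."""
    return [parent for parent, members in links.items() if record in members]


print(parents_of(hierarchy, "Bob"))  # one parent only
print(parents_of(network, "Bob"))    # several owners are allowed
```

In the tree, placing Bob under both Sales and Engineering would require duplicating the record; the graph form expresses it directly.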
As Data Modeling and DBMSs evolved, so too did programming languages. Simula was developed in 1967, and was the first object-oriented programming language. (Other languages evolved from Simula, such as Java, Eiffel, C++, and Smalltalk.) The evolution of programming languages was a strong influence in shaping the models that used these languages.
Data Modeling in the 1970s
In 1970, Edgar F. Codd’s ideas were published. They offered a significantly different way of handling data, suggesting all data within a database could be displayed as tables of columns and rows, which would be called “relations.” These “relations” would be accessible using a non-procedural, or declarative, language. (Remember, languages influence the shape of the model, and vice versa.) Rather than writing an algorithm that spells out how to reach the data, a user of this approach only had to state which data was wanted, for example by naming a table and a condition. This clever idea led to much higher productivity. It was faster and more efficient, and prompted IBM to create SQL (originally called SEQUEL, or Structured English Query Language).
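The contrast between procedural and declarative access can be sketched with Python’s built-in sqlite3 module. The customer table, its columns, and the sample rows are assumptions invented for this illustration; SQLite stands in for any relational database.

```python
import sqlite3

# Hypothetical sample data: (name, city) pairs.
records = [("Ada", "London"), ("Grace", "New York"), ("Edgar", "London")]

# Procedural access: the program spells out HOW to find the rows.
procedural_names = []
for name, city in records:
    if city == "London":
        procedural_names.append(name)
procedural_names.sort()

# Declarative access: the data lives in a relation (table), and the
# query states only WHAT is wanted; the database decides how to get it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, city TEXT)")
conn.executemany("INSERT INTO customer (name, city) VALUES (?, ?)", records)
rows = conn.execute(
    "SELECT name FROM customer WHERE city = ? ORDER BY name", ("London",)
).fetchall()
declarative_names = [name for (name,) in rows]
conn.close()

print(procedural_names)
print(declarative_names)
```

Both approaches return the same names, but only the procedural version has to change if the storage layout changes.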
Also, during this decade, G.M. Nijssen created “The Nijssen Information Analysis Method” (NIAM). Because this method’s evolution has included a number of other developers, the title has been altered to read “Natural language Information Analysis Method” with a small “L” in language, so it maintains the same acronym.
Data Modeling in the 1980s
NIAM was developed further in the 1980s, with the help of Terry Halpin. Its name was changed to Object Role Modeling (ORM). ORM brought about a dramatic change in the way data is perceived and how it is processed. The traditional mindset required that data and procedures be stored separately. (It should be noted that a number of techs dislike ORM because it breaks all the rules.)
By the end of the 1980s, the hierarchical model was becoming outdated, with Codd’s relational model becoming the popular replacement. Query optimizers had become inexpensive enough, and sophisticated enough, for the relational model to be incorporated into the database systems of most industries. (Banks, and similar institutions, still prefer hierarchical data models for processing monetary and statistical information.)
1998 and NoSQL
The original version of NoSQL is a database developed by Carlo Strozzi in 1998. He created a relational, open-source database that “did not expose” the SQL connections, “but was still relational.” Later versions of NoSQL dropped the relational model aspects completely.
2008 to Present: The Growth of Non-Relational Models
One of NoSQL’s advantages is its ability to store data using a schema-less, or non-relational, format. Another is its horizontal scalability: storage capacity grows by adding more machines, which allows very large volumes of data to be stored. This makes it particularly well-suited for handling unstructured data, and in turn, well-suited for processing big data. (The term “big data” lost its meaning as using big data became the norm.) Rick van der Lans, an independent analyst and consultant, stated in a DATAVERSITY interview:
“The Data Modeling process is always there. You can look at that role in a simple way, by thinking of it as a process that leads to a diagram. In the process of creating the diagram, you are trying to understand what the data means and how the data elements relate together. Thus, understanding is a key aspect of Data Modeling.”
Because the data is stored without a fixed structure, a variety of data models can be used, after the fact, to translate and map out the data, giving it structure. It is generally understood that different data models, and the different languages associated with them, provide different paradigms, or different ways of looking at problems and solutions. With NoSQL, it is also common to keep data in a variety of data stores, each suited to a particular kind of data, which provides a variety of potential data model translations. This practice is called polyglot persistence. The question then becomes, “What is the best data model to use?” According to van der Lans:
“That’s why some call the data multi-structured, meaning that you can look at the same data from different angles. It’s as if you are using different filters when looking at the same object.”
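The idea of applying different “filters” to the same stored data can be sketched as follows. The documents, field names, and the two view functions (as_rows and as_edges) are hypothetical, invented for this illustration, and are not taken from any particular NoSQL product.

```python
# Schema-less storage: documents of different shapes sit side by side.
documents = [
    {"type": "order", "customer": "Ada",
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
    {"type": "order", "customer": "Grace",
     "items": [{"sku": "A1", "qty": 5}]},
    {"type": "review", "customer": "Ada", "sku": "A1", "stars": 4},
]


def as_rows(docs):
    """Tabular filter: flatten orders into (customer, sku, qty) rows."""
    return [
        (doc["customer"], item["sku"], item["qty"])
        for doc in docs
        if doc.get("type") == "order"
        for item in doc["items"]
    ]


def as_edges(docs):
    """Graph filter: one customer-to-product edge per mention, any doc type."""
    edges = set()
    for doc in docs:
        skus = [item["sku"] for item in doc.get("items", [])]
        if "sku" in doc:
            skus.append(doc["sku"])
        for sku in skus:
            edges.add((doc["customer"], sku))
    return sorted(edges)


print(as_rows(documents))
print(as_edges(documents))
```

Neither view is stored anywhere; each structure is imposed on the same documents at read time, which is why the choice of model can be deferred.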
Because of their flexibility and large data storage capacity, NoSQL-style data stores have become popular. However, NoSQL databases still have a long way to go in terms of evolution. Many organizations have not included a data model in their NoSQL systems, since Data Modeling with such data stores exists primarily within the actual code.
These same organizations may want to build and use a data model, and to add staff with Data Modeling skills. The gap between that wish and current practice comes from a lack of modelers experienced with NoSQL databases, combined with nearly no tools for NoSQL Data Modeling. Experienced NoSQL data modelers, and the appropriate tools, are still needed.
Hackolade has developed a downloadable, user-friendly Data Modeling tool that provides powerful visual tools for NoSQL. Its software combines the simplicity of graphic data models with NoSQL document databases. According to the company, this combination reduces development time, increases application quality, and lowers execution risks. The software is currently compatible with Couchbase, DynamoDB, and MongoDB schemas, and the company plans to introduce software for several other NoSQL databases.
The desire for Data Modeling in new database models will continue to move the industry forward as more organizations seek to capitalize on the diversity of non-relational designs while still utilizing their time-honored and well-known Data Modeling practices.