Data Architecture is a set of rules, policies, and models that determine what kind of data gets collected, and how it gets used, processed, and stored within a database system. Data integration, for example, is dependent on Data Architecture for instructions on the integration process. Without the shift from a programming paradigm to a Data Architecture paradigm, modern computers would be much clumsier and much slower.
In the early days of computers, simplistic programs were created to deal with specific types of computer problems, and concepts such as data integration were not even considered. Each program was isolated from other programs. From the 1940s to the early 1970s, program processing was the primary concern. An architectural structure for data was generally given little (if any) consideration. A programmer’s main focus was on getting a computer to perform specific actions that supported an organization’s short-term goals. Only data defined as “needed for the program” was used, and computers were not used for long-term data storage. Recovering data required writing programs capable of retrieving specific information, which was time-consuming and expensive.
Shifting from a Programming Paradigm to a Database Architecture Paradigm
In 1970, Edgar F. Codd published a paper (A Relational Model of Data for Large Shared Data Banks) describing a relational procedure for organizing data. Codd’s theory was based on the mathematics used in set theory, combined with a list of rules that assured data was being stored with a minimum of redundancy. His approach successfully created database structures which streamlined the efficiency of computers. Prior to Codd’s work, COBOL programs, and most others, had their data arranged hierarchically. This arrangement made it necessary to start a search in the general categories, and then search through progressively smaller ones. The relational approach allowed users to store data in a more organized, more efficient way using two-dimensional tables (or as Codd called them, “relations”).
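To make the idea of two-dimensional tables concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (customer, purchase), the sample rows, and the helper functions are hypothetical and chosen only for illustration; they do not come from Codd's paper. Each fact is stored once, and a query relates the two tables instead of walking down a hierarchy.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with two small relations (tables)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        CREATE TABLE purchase (
            purchase_id INTEGER PRIMARY KEY,
            customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
            item        TEXT NOT NULL
        );
        """
    )
    # Each customer's name is stored exactly once; purchases refer to it by key.
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO purchase VALUES (?, ?, ?)",
        [(10, 1, "tape reel"), (11, 1, "punch cards"), (12, 2, "terminal")],
    )
    conn.commit()
    return conn


def items_bought_by(conn, name):
    """Return the items bought by one customer, found by joining the two tables."""
    rows = conn.execute(
        """
        SELECT p.item
        FROM purchase AS p
        JOIN customer AS c ON c.customer_id = p.customer_id
        WHERE c.name = ?
        ORDER BY p.purchase_id
        """,
        (name,),
    ).fetchall()
    return [item for (item,) in rows]
```

Calling items_bought_by(build_demo_db(), "Ada") returns the two items recorded for that customer. The query states what is wanted rather than how to navigate to it, which is the practical difference from a top-down hierarchical search.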
In 1976, while working at MIT, Peter Chen published a paper (The Entity-Relationship Model-Toward a Unified View of Data) introducing “entity/relationship modeling,” more commonly known today as “data modeling.” His approach represented data structures graphically. Three years later, in 1979, Relational Software, Inc. (the company later renamed Oracle) released the first commercially available relational database management system (RDBMS) designed for business.
People working with computers began to realize these data structures were more reliable than program structures. This stability was supported by redesigning the middle of the system and isolating the processes from each other (similar to the way programmers kept their programs isolated). The key to this redesign was the addition of data buffers.
Buffers were originally a temporary memory storage system designed to remove data from a primitive computer’s memories quickly, so the computer would not get bogged down, and could continue working on problems. The data was then transferred from the buffer to a printer, which “slowly” printed out the most recent calculations. Today’s version of a data buffer is an area shared by devices, or a program’s processes, that are operating at different speeds, or with different priorities. A modern buffer allows each process, or device, to operate without conflict. Similar to a cache, a buffer acts as a “midway holding space,” but also helps to coordinate separate activities, rather than simply streamlining memory access.
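The coordinating role of a modern buffer can be sketched with Python's standard queue and threading modules. This is an illustrative example only: the function name run_buffered_pipeline, the squaring "calculation," and the delay value are assumptions made for the demonstration. A fast producer hands results to a bounded buffer, and a deliberately slower consumer (standing in for the printer) drains it at its own pace.

```python
import queue
import threading
import time


def run_buffered_pipeline(values, buffer_size=4, consumer_delay=0.001):
    """Pass results from a fast producer to a slow consumer through a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)  # the shared "midway holding space"
    done = object()                            # sentinel marking the end of the stream
    printed = []                               # stands in for the slow output device

    def producer():
        for v in values:
            # put() blocks only while the buffer is full, so the producer
            # keeps working whenever there is room.
            buffer.put(v * v)
        buffer.put(done)

    def consumer():
        while True:
            item = buffer.get()  # blocks while the buffer is empty
            if item is done:
                break
            time.sleep(consumer_delay)  # simulate a slow device
            printed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return printed
```

With one producer and one consumer, the queue preserves order, so run_buffered_pipeline(range(5)) returns the squares in the order they were computed. Neither side needs to know the other's speed; the buffer absorbs the difference.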
The business community quickly recognized the advantages of Edgar F. Codd’s and Peter Chen’s insights. The new data structure designs were noticeably faster, more flexible, and more stable than program structures. Additionally, their insights prompted a cultural shift in the computer programming community. The structure of data was now considered more important than the programs.
Assumptions Lost During the Paradigm Shift
The evolution of Data Architecture required the elimination of three basic assumptions. (An assumption is something taken for granted: a guess that lacks hard evidence but is treated as fact.)
Assumption 1: Each program should be isolated from other programs. This isolation philosophy led to duplication of program code, data definitions, and data entries. Codd’s relational approach resolved the issue of unnecessary duplication. His model separated the database’s schema, or layout, from the physical information storage (an approach that became the standard for database systems). His relational model showed that data did not need to be stored in separate, isolated programs, and that data entries and program code did not need to be duplicated. A single relational database could be used to store all the data. As a result, consistency could be (almost) guaranteed and it was easier to find errors.
Assumption 2: Input and output are equal, and should be designed as matching pairs. In practice, input and output devices have data processing rates that can vary tremendously, which is quite different from the expectation that both will operate at the same speed. The use of buffers prompted the realization that output could, and should, be treated differently from input. Peter Chen’s innovations brought to light the differences between the creators of data and the consumers of data. Consumers of data generally want to see large amounts of information from different parts of the underlying database for comparison, and to selectively extract the most useful information. Creators of data, on the other hand, focus on dealing with it one process at a time. The goals of data creators (input) and data consumers (output) are completely different.
Assumption 3: The organization of a business should be reflected in its computer programs. With the use of buffers and a relational database, the notion that programs should imitate a company’s structure gradually shifted. The more flexible databases took over the role of providing a useful structure for businesses to follow while gathering and processing information. A modern data model will reflect both the organization of a business and the tools used to realize its goals.
SQL and Data Architecture
Codd’s relational approach led to the Structured Query Language (SQL), which became the standard query language in the 1980s. Relational databases became quite popular and boosted the database market, in turn causing a major loss of popularity for hierarchical database models.
In the early 1990s, many major computer companies (still focused on programs) tried to sell expensive, complicated database products. In response, new, more competitive businesses began releasing tools and software (Oracle Developer, PowerBuilder) for enhancing a system’s Data Architecture. In the mid-1990s, use of the Internet promoted significant growth in the database industry and in the general sale of computers.
One result of architecturally designed databases is the development of Data Management. Organizations and businesses have discovered that the information itself is valuable to the company. Through the 1990s, the titles “data administrator” and “database administrator” began appearing. The data administrator is responsible for the quality and integrity of the data used.
Relational database management systems have made it possible to create a database presenting a conceptual schema (a map of sorts) and then offer different perspectives of the database, designed for both the data creators and the data consumers. Additionally, each database management system can tune its physical storage parameters separately from the table and column structure.
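The separation described above can be sketched with sqlite3 from the Python standard library. The schema (a sale table), the view name sales_by_region, and the helper functions are hypothetical, invented for this illustration. Data creators insert one row at a time, data consumers read an aggregated view, and an index (a physical storage choice) can be added later without changing either perspective.

```python
import sqlite3


def build_sales_db():
    """Create one base table plus a summary view for data consumers."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE sale (
            sale_id INTEGER PRIMARY KEY,
            region  TEXT NOT NULL,
            amount  INTEGER NOT NULL
        );
        -- Consumer perspective: many rows summarized for comparison.
        CREATE VIEW sales_by_region AS
            SELECT region,
                   COUNT(*)    AS sale_count,
                   SUM(amount) AS total_amount
            FROM sale
            GROUP BY region;
        """
    )
    return conn


def record_sale(conn, region, amount):
    """Creator perspective: enter a single sale, one process at a time."""
    conn.execute("INSERT INTO sale (region, amount) VALUES (?, ?)", (region, amount))
    conn.commit()


def regional_report(conn):
    """Consumer perspective: read the summarized view, not the raw rows."""
    return conn.execute(
        "SELECT region, sale_count, total_amount FROM sales_by_region ORDER BY region"
    ).fetchall()


def add_region_index(conn):
    """Physical tuning: add an index without touching tables, columns, or views."""
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sale_region ON sale (region)")
    conn.commit()
```

After a few calls to record_sale, regional_report returns one summary row per region, and calling add_region_index leaves that report unchanged. The index may affect how the data is retrieved, but not what the creators or consumers see.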
NoSQL and Data Architecture
NoSQL is not a single program. It is a family of database management systems that typically use a fairly simple architecture. It can be useful when handling big data where a relational model is not needed. NoSQL database systems are quite diverse in the methods and processes they use to manage and store data. SQL systems often have more flexibility in terms of functionality than NoSQL systems, but lack the scalability NoSQL systems are famous for. However, numerous commercial packages now combine the two in a “best of both worlds” approach, and more are coming to the market all the time.
A number of organizations recently covered in articles and interviews on DATAVERSITY® (there are many other possibilities available) offer a Data Architecture solution for processing big data with tools common to relational databases. Kyvos Insights sells software that works with Hadoop storage systems. Their Hadoop/OLAP combination promotes the processing of both unstructured and structured data at a variety of scales, allowing big data to be analyzed with relative ease.
Hackolade also sells a software package, with a user-friendly data model offering “highly functional” tools for dealing with NoSQL. The software merges NoSQL with the simplicity of visual graphics. This, combined with Hackolade’s other tools, reduces development time and increases application quality. Their software is currently compatible with Couchbase, DynamoDB, and MongoDB schemas (they have plans to include additional NoSQL databases).
RedisLabs combines access to their cloud with their software package, the Redis Pack, to provide another architectural solution. The three strengths provided by the Redis Pack and their cloud are speed, persistence (saving your info), and the variety of datatypes they have available. Essentially, Redis is an “extremely fast” NoSQL, key-value data store, and acts as a database, a cache, and as a message broker.
Reltio provides a service. They have created a cloud management platform, and provide the tools and services needed to process big data. They furnish researchers, merge big data from multiple sources with Master Data Management (MDM), and develop unified objectives. Reltio’s systems support a variety of industries, including retail, life sciences, entertainment, healthcare, and the government.
Data Architecture has changed completely since its early days. Newer trends such as the Internet of Things, cloud computing, microservices, advanced analytics, machine learning, and artificial intelligence, along with emergent technologies like blockchain, will likely continue to alter it far into the future.