Data Management is the organization of data, together with the steps used to handle that data efficiently and to gather business intelligence from it. Data Management, as a concept, began in the 1960s, with ADAPSO (the Association of Data Processing Service Organizations) offering Data Management advice, with an emphasis on professional training and quality assurance metrics. Data management has evolved significantly over the last six decades.
Data Management should not be confused with Data Governance, nor with Database Management. Data Governance is a set of practices and concepts which prioritize and organize data, as well as the enforcement of policies around data, while following various regulations and curtailing poor data practices.
Data Governance is essentially a part of the greater whole of Data Management. Database Management, on the other hand, is focused on the tools and technology used to create and alter the foundation of data, rather than the overall system used to organize the data. Database Management is also a subdivision of Data Management.
To gain a better understanding of Data Management, consider the following: Each airport has outgoing flights. Each passenger has a destination, and reaching each destination requires one or more flights. Additionally, each flight has a certain number of passengers. The information could be shown hierarchically, but this method has a major problem. The displayed data can be focused on flights, or passengers, or destinations, but not all three simultaneously. Displaying three separate hierarchies requires storing the data redundantly, which becomes expensive. Also, updating the data in three separate files is more difficult than updating it in one, because all three hierarchies must be kept in sync to avoid confusion. A network data model, which is much more flexible, provides a better solution. Good Data Management is key to a successful business.
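The airport example can be sketched in a few lines of Python. This is only an illustration, not a real network database: the flight numbers, passenger names, and function names are hypothetical, and "destination" is simplified to mean the arrival airport of each booked flight. Each fact is stored once, and the three views are derived on demand instead of being kept as three redundant hierarchies.

```python
from collections import defaultdict

# Hypothetical sample data: every flight and every booking is stored exactly once.
flights = {
    "F100": {"origin": "DEN", "destination": "ORD"},
    "F200": {"origin": "ORD", "destination": "JFK"},
    "F300": {"origin": "DEN", "destination": "JFK"},
}
bookings = [
    ("alice", "F100"),
    ("alice", "F200"),
    ("bob", "F300"),
    ("carol", "F100"),
]


def passengers_by_flight(bookings):
    """View 1: group passengers under each flight."""
    view = defaultdict(list)
    for passenger, flight in bookings:
        view[flight].append(passenger)
    return dict(view)


def flights_by_passenger(bookings):
    """View 2: group flights under each passenger."""
    view = defaultdict(list)
    for passenger, flight in bookings:
        view[passenger].append(flight)
    return dict(view)


def passengers_by_destination(bookings, flights):
    """View 3: group passengers under each arrival airport."""
    view = defaultdict(set)
    for passenger, flight in bookings:
        view[flights[flight]["destination"]].add(passenger)
    return {airport: sorted(names) for airport, names in view.items()}
```

Because the views are computed from one shared set of records, changing a booking in one place changes all three views at once.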
The management of data first became an issue in the 1950s, when computers were slow, clumsy, and required massive amounts of manual labor to operate. Several computer-oriented companies used entire floors to warehouse and “manage” only the punch cards storing their data. These same companies used other floors to maintain sorters, tabulators, and banks of card punches. Programs of the time were set up in binary or decimal form, and were read from toggle switches on the front of the computer, from magnetic tape, or even from punch cards. This form of programming was originally called Absolute Machine Language (and was later renamed First Generation Programming Languages).
Second Generation Programming Languages
Second Generation Programming Languages (formerly called Assembly Languages) were used as an early method for organizing and managing data. These languages became popular in the late 1950s and used letters from the alphabet for programming, rather than a complex string of ones and zeros. Because of this, programmers could use assembly mnemonics, making it easier to remember the codes. These languages are now antiquated, but helped to make programs much more readable for humans, and freed programmers from tedious, error-prone calculations.
High Level Languages
An understanding of foundational languages can help in creating a new web service or application.
High Level Languages (HLL) are older programming languages designed to be easy for humans to read. Some are still popular. Some aren’t. They allow a programmer to write generic programs which are not completely dependent on a specific kind of computer. While the emphasis of these languages is on ease-of-use, their primary purpose is to organize and manage data. Different High-Level Languages come with different strengths:
- FORTRAN was originally created by IBM during the 1950s for engineering and science applications. It is still used for numerical weather prediction, finite element analysis, computational fluid dynamics, computational physics, crystallography and computational chemistry.
- Lisp (historically, LISP) was originally described in 1958, and quickly became a favorite programming language for AI research. It was unusual in that it made no distinction between data and code, and was one of the first programming languages to initiate a number of ideas in computer science, such as automatic storage management, dynamic typing, and tree data structures. Lisp also had the flexibility to expand in ways its designers had never thought of. (Lisp is on the decline.)
- COBOL (Common Business Oriented Language) was developed by CODASYL in 1959, and was part of a U.S. Department of Defense goal to create a “portable” programming language for data processing. It is an English-like programming language designed primarily for business, finance, and administrative systems. In 2002, COBOL was revised and became an object-oriented programming language.
- BASIC (the Beginner’s All-purpose Symbolic Instruction Code) describes a group of general-purpose programming languages designed to be user-friendly. It was designed in 1964 at Dartmouth College. (BASIC doesn’t get used much, these days.)
- C was invented at Bell Labs in the early 1970s, and was used to rewrite an operating system. The operating system was UNIX, and because it was now written in C rather than in assembly language, UNIX could be ported to other systems. (At present, C continues to be one of the most popular programming languages in the world.)
- C++ (pronounced “c plus plus”) is based on C, and is a general-purpose programming language with low-level memory manipulation. It was designed to be flexible, is widely used to build desktop applications, and runs on a variety of platforms. (It is still used widely, and popularity seems to be growing.)
Extract, Transform, and Load
One of the earliest Data Management tools is ETL. ETL (extract, transform, and load) started gaining popularity in the 1970s, and is still one of the most popular data integration techniques on the market. It collects data from different sources, and converts it into a consistent form. The integrated data is then loaded into a data warehouse (or some other storage system).
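The three ETL stages can be sketched with Python's standard library. This is a minimal illustration under stated assumptions: the two sources, their field names, and the `sales` table are all hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a CSV export with amounts in dollars.
CSV_SOURCE = "customer,amount_usd\nAcme, 12.50\nGlobex,7.25\n"
# Hypothetical source 2: records with different field names and amounts in cents.
API_SOURCE = [{"name": "ACME", "cents": 300}, {"name": "Initech", "cents": 1999}]


def extract():
    """Extract: read raw rows from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    return csv_rows, API_SOURCE


def transform(csv_rows, api_rows):
    """Transform: convert both sources to (lower-case customer, amount in cents)."""
    unified = []
    for row in csv_rows:
        cents = round(float(row["amount_usd"]) * 100)
        unified.append((row["customer"].strip().lower(), cents))
    for row in api_rows:
        unified.append((row["name"].strip().lower(), int(row["cents"])))
    return unified


def load(rows, conn):
    """Load: write the unified rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount_cents INTEGER)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
csv_rows, api_rows = extract()
load(transform(csv_rows, api_rows), conn)

# Once loaded, the data can be queried in its single consistent form.
totals = dict(
    conn.execute("SELECT customer, SUM(amount_cents) FROM sales GROUP BY customer")
)
```

Keeping the stages as separate functions means a new source only needs its own extract and transform logic; the load step stays unchanged.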
Database Management as a Component of Data Management
Data Management has steadily evolved to include a broad range of technologies and tools. This includes database management software as a component of the data management system. Database Management systems are the most common form of Data Management platforms, and act as an interface between the database and the end user.
Two popular families of Database Management systems are relational Database Management systems (RDBMS), which are queried with SQL, and NoSQL databases, which store data using non-relational storage formats and are often scalable (or expandable).
Online Data Management
Online Data Management systems, such as travel reservations and stock market trading, must coordinate and manage data quickly and efficiently. In the late 1950s, several industries began experimenting with online transactions. Currently, Online Data Management systems can process healthcare information (think efficiency), or measure, store, and analyze as many as 7.5 million weld sessions per day (think productivity). These systems allow a program to read files or records, update them, and send the updated info back to the online user.
SQL (Structured Query Language) was developed at IBM by Donald Chamberlin and Raymond Boyce during the 1970s, building on the relational model proposed by Edgar F. Codd. It focuses on relational databases, providing consistent data processing and reducing the amount of duplicated data. The language is also fairly easy to learn, because its commands use English-like keywords. The relational model allows large amounts of data to be processed quickly and efficiently. SQL became an ANSI standard in 1986.
Relational models represent both relationships and subject matter in a uniform way. A characteristic of relational data models is their use of a unified language for navigating, manipulating, and defining data, rather than separate languages for each task. Relational “algebra” is used to process record sets as a group, with “operators” being applied to whole record sets. Relational data models, combined with operators, provide shorter and simpler programs.
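A short sketch using Python's built-in `sqlite3` module shows both ideas: one language (SQL) defines, manipulates, and queries the data, and each statement operates on whole record sets with no explicit loop over rows. The tables and sample rows are hypothetical and reuse the airport example from earlier in the article.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Defining data: the schema is written in SQL.
conn.executescript(
    """
    CREATE TABLE passengers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE flights (id INTEGER PRIMARY KEY, destination TEXT);
    CREATE TABLE bookings (
        passenger_id INTEGER REFERENCES passengers(id),
        flight_id INTEGER REFERENCES flights(id)
    );
    """
)

# Manipulating data: inserts are written in the same language.
conn.executemany(
    "INSERT INTO passengers VALUES (?, ?)", [(1, "Alice"), (2, "Bob"), (3, "Carol")]
)
conn.executemany("INSERT INTO flights VALUES (?, ?)", [(10, "ORD"), (20, "JFK")])
conn.executemany("INSERT INTO bookings VALUES (?, ?)", [(1, 10), (2, 20), (3, 10)])

# Navigating data: join, selection, and projection are applied to whole sets.
rows = conn.execute(
    """
    SELECT p.name
    FROM passengers AS p
    JOIN bookings AS b ON b.passenger_id = p.id
    JOIN flights AS f ON f.id = b.flight_id
    WHERE f.destination = 'ORD'
    ORDER BY p.name
    """
).fetchall()
names_to_ord = [row[0] for row in rows]

# A set-based update: every matching record changes in one statement.
conn.execute("UPDATE flights SET destination = 'MDW' WHERE destination = 'ORD'")
```

The query states which records are wanted and leaves the row-by-row work to the database, which is where the shorter programs come from.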
The relational model presented some unexpected benefits. It turned out to be very well-suited for parallel processing, client-server computing, and GUIs (graphical user interfaces). Additionally, a relational database management system (RDBMS) allows multiple users to access the same database simultaneously.
The primary purpose of NoSQL is the processing and research of big data. It started as basically a search engine, with some additional management features, and is not built on the relational model. That has changed with today's much more advanced NoSQL platforms. While structured data can be used during the research, it is not necessary. NoSQL’s true strength is its capacity to store and filter huge amounts of structured and unstructured data. The data manager has a variety of NoSQL databases to choose from, each with its own specific strengths. NoSQL databases are commonly used for big data research because they can store and manage a variety of data types.
The efficiency of NoSQL is the result of its unstructured nature, trading off consistency for speed and agility. This style of architecture supports horizontal scalability and has allowed significantly large-scale data warehouses (Amazon, Google, and the CIA) to process vast amounts of information. NoSQL is great at processing big data. The term “big data” has started to fade away since 2019-20, as the use of massive amounts of data is now the norm.
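Horizontal scalability is often achieved by partitioning ("sharding") records across many machines by key. The sketch below is a toy illustration of that idea, not how any particular NoSQL product works: `ShardedStore` is a hypothetical class, plain dictionaries stand in for separate servers, and real systems add replication and more careful partitioning schemes.

```python
import hashlib


class ShardedStore:
    """Toy key-value store that spreads schema-free documents across shards."""

    def __init__(self, shard_count):
        # Each dict stands in for one server in a cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash of the key decides which shard owns the document.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.shards)

    def put(self, key, document):
        self.shards[self._shard_for(key)][key] = document

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = ShardedStore(shard_count=4)
# Documents need not share a schema: each one carries its own fields.
store.put("user:1", {"name": "Alice", "tags": ["admin"]})
store.put("sensor:7", {"reading": 21.5, "unit": "C"})
store.put("note:3", "free-form text with no structure at all")
```

Because each key maps to exactly one shard, reads and writes for different keys can be served by different machines, and capacity grows by adding shards.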
The term NoSQL was first used in 1998 by Carlo Strozzi, but the concept did not begin to gain in popularity until after 2005, when Doug Cutting and Mike Cafarella released Nutch to the general public. Nutch led to Hadoop (now referred to as Apache Hadoop), which, as “free” open source software, quickly became quite popular.
Data integration can be described as combining data gathered from several sources, and transforming that data so it can be presented in a unified way. The earliest data integration system was designed in 1991, for the University of Minnesota. The original goal was to make data easier to use and process, for both systems and people.
The Data Pipeline
In 1999, data pipelines were beginning to be used in support of the internet. A data pipeline is a form of IT architecture designed to collect, organize, and route data for further use in data analytics. ETLs (or an equivalent) are part of the data pipeline’s architecture. Generally speaking, a data pipeline is an automated series of steps performed on data. It is often used to collect and process data from multiple sources, and then send it to a data warehouse, or to use it for some form of analysis.
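The idea of "an automated series of steps performed on data" can be sketched as a list of functions applied in order. The step names, the sensor records, and `run_pipeline` are hypothetical, chosen only to illustrate the pattern.

```python
from functools import reduce


def collect(sources):
    """Step 1: merge records from several sources into one list."""
    return [record for source in sources for record in source]


def drop_incomplete(records):
    """Step 2: discard records missing a sensor id or a reading."""
    return [r for r in records if r.get("sensor") and r.get("reading") is not None]


def normalize(records):
    """Step 3: upper-case sensor ids and coerce readings to floats."""
    return [
        {"sensor": r["sensor"].upper(), "reading": float(r["reading"])}
        for r in records
    ]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one."""
    return reduce(lambda current, step: step(current), steps, data)


# Hypothetical inputs from two sources with inconsistent quality.
SOURCE_A = [{"sensor": "t1", "reading": "20.5"}, {"sensor": "t2", "reading": None}]
SOURCE_B = [{"sensor": "t3", "reading": 19}, {"reading": 3}]

result = run_pipeline([SOURCE_A, SOURCE_B], [collect, drop_incomplete, normalize])
```

Treating the pipeline as data (a list of steps) makes it easy to add, remove, or reorder stages without touching the runner.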
In the early 2000s, data catalogs started being used for Data Management, Metadata Management, and Data Curation. Primarily, a data catalog records an organization’s available data assets and organizes them. Data cataloging includes tasks such as metadata ingestion, metadata discovery, and the creation of semantic relationships between the metadata.
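At its simplest, a data catalog is a searchable registry of metadata about data assets plus the relationships between them. The sketch below assumes a hypothetical `DataCatalog` class with made-up asset names and locations; production catalogs also ingest and discover metadata automatically rather than relying on manual registration.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (not the data itself)."""

    name: str
    location: str
    owner: str
    tags: set = field(default_factory=set)
    related: set = field(default_factory=set)


class DataCatalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Record an asset's metadata under its name."""
        self._entries[entry.name] = entry

    def relate(self, name_a, name_b):
        """Record a two-way semantic relationship between two assets."""
        self._entries[name_a].related.add(name_b)
        self._entries[name_b].related.add(name_a)

    def find_by_tag(self, tag):
        return sorted(e.name for e in self._entries.values() if tag in e.tags)

    def related_to(self, name):
        return sorted(self._entries[name].related)


catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "warehouse.sales.orders", "sales-team", {"sales", "pii"}))
catalog.register(CatalogEntry("customers", "warehouse.crm.customers", "crm-team", {"pii"}))
catalog.register(CatalogEntry("web_logs", "lake/raw/web_logs", "platform-team", {"raw"}))
catalog.relate("orders", "customers")
```

A lookup such as `catalog.find_by_tag("pii")` then answers the kind of question a catalog exists for: which assets hold a given class of data, and where they live.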
In the mid-2000s, data hubs became a form of Data Management. They started being used to store data and act as a point of integration using a hub-and-spoke architecture. A modern data hub uses data-centric storage architecture for consolidating and sharing data to support analytics and AI workloads.
A basic difference between data hubs and a data lake or a data warehouse is that data hubs are designed for short-term data storage, while data lakes and warehouses are for long-term storage. Data hubs can also support the seamless governance and flow of data between different endpoints.
Modern data hubs perform stream processing, batch processing, and AI/ML processing. The AI/ML features make it practical to perform analytical processing in the hub, rather than moving massive amounts of data across a network for analysis. Modern data hubs also offer features like data exploration, data protection, indexing, and metadata management.
Big Data and Data Lakes
NoSQL, with its expandable storage and ability to process structured and unstructured data, opened the doors to big data research. Data warehouses and data lakes are two data storage systems commonly used for data research (analytics).
The Data Management tools for data warehouses differ from those used by data lakes, in that data warehouses are typically used with a relational database (SQL), and store structured data gathered from a variety of different sources and prepared for analysis. Data warehouses are used primarily for enterprise reporting and limited business intelligence.
Data lakes, on the other hand, store a large mass of unstructured data (NoSQL) for machine learning, large scale business intelligence, and other analytics applications. Data lakes often store raw data exactly as it was received. Credit for the term “data lakes” is given to James Dixon, who used it in October of 2010.
Data Governance is typically part of a Data Management platform, designed to assure the quality and usability of data that has been collected by an organization. Early versions of Data Governance programs were focused on data cataloging.
In 2005, Data Governance started gaining more popularity as a way to access quality data for big data research purposes.
The European Union’s GDPR (General Data Protection Regulation) was adopted in 2016 and became enforceable in May 2018, causing many businesses to scramble in trying to meet the new compliance standards. The simplest IT solution for dealing with the new privacy protection laws was to develop new Data Governance software.
By 2018, a number of massive data breaches had hit a variety of organizations in different industries (some examples being Facebook, Equifax, Yahoo, and Marriott). As a result, security and data governance became intertwined.
A data fabric platform improves Data Management by making repetitive tasks automated. Data fabric uses a combination of technology and architecture designed to manage several kinds of data from multiple database management systems. It provides a more sophisticated Data Management platform.
Data fabric is a single platform designed to manage all kinds of data, and different technologies from multiple data centers, including the cloud.
However, the primary purpose of a data fabric is data orchestration — which requires coordination with data ingestion, storage, preparation, and pipelines. It also includes improved metadata management, governance, and security.
Data Management in the Cloud
Cloud Data Management is fast becoming an additional responsibility for in-house data managers. Though the concept of cloud storage was developed in the 1960s, it didn’t become a reality until 1999, when Salesforce offered the delivery of applications via its website. Amazon imitated the idea in 2002, providing internet-based (cloud) services, which included storage. The rented use of applications and services on a website, via the internet, quickly became a popular way of dealing with large and unusual projects. As comfort with the services developed, many organizations began shifting the bulk of their storage, and processing activities, to the cloud. Consequently, a number of cloud start-ups formed.
The cloud now provides organizations with dedicated Data Management resources, as-needed. The benefits of managing data in the cloud include:
- Access to cutting-edge technology.
- The reduction of in-house system maintenance costs.
- Increased flexibility in meeting the changing needs of business.
- The processing of big data.
SLAs (Service Level Agreements) are the contracts used to agree on guarantees between customers and a service provider. As the architecture of different cloud providers varies, it is in the data manager’s best interest to investigate, and select the best fit, based on their organization’s needs. A cloud’s security compatibility and its access to storage are both crucial concerns for a cloud data manager, and should be researched thoroughly.
Artificial Intelligence and Data Management
It is predictable that, within the next ten years, AI will help organize and sort through huge amounts of stored data, and make routine decisions on basic procedures. It will become more and more valuable as an assistant to the data manager. Some examples include:
- Processing, managing, and storing unstructured data.
- Discarding irrelevant data.
- Maximizing data integration for research and info queries.
- Determining the value of data, and the best location to store it.
Artificial intelligence has great potential for assisting data managers in developing and managing a highly functional Data Management program.