A Database Management System allows a person to organize, store, and retrieve data from a computer. It is a way of communicating with a computer’s “stored memory.” In the very early years of computers, “punch cards” were used for input, output, and data storage. Punch cards offered a fast way to enter data, and to retrieve it. Herman Hollerith is given credit for adapting the punch cards used for weaving looms to act as the memory for a mechanical tabulating machine, in 1890. Much later, databases came along.
Databases (or DBs) have played a very important part in the recent evolution of computers. The first computer programs were developed in the early 1950s, and focused almost completely on coding languages and algorithms. At the time, computers were basically giant calculators and data (names, phone numbers) was considered the leftovers of processing information. Computers were just starting to become commercially available, and when business people started using them for real-world purposes, this leftover data suddenly became important.
Enter the Database Management System (DBMS). A database, as a collection of information, can be organized so a Database Management System can access and pull specific information. In 1960, Charles W. Bachman designed the Integrated Database System, the “first” DBMS. IBM, not wanting to be left out, created a database system of their own, known as IMS. Both database systems are described as the forerunners of navigational databases.
By the mid-1960s, as computers developed speed and flexibility, and started becoming popular, many kinds of general use database systems became available. As a result, customers demanded a standard be developed, in turn leading to Bachman forming the Database Task Group. This group took responsibility for the design and standardization of a language called Common Business Oriented Language (COBOL). The Database Task Group presented this standard in 1971, which also came to be known as the “CODASYL approach.”
The CODASYL approach was a very complicated system and required substantial training. It depended on a “manual” navigation technique using a linked data set, which formed a large network. Searching for records could be accomplished by one of three techniques:
- Using the primary key (also known as the CALC key)
- Moving relationships (also called sets) to one record from another
- Scanning all records in sequential order
Eventually, the CODASYL approach lost its popularity as simpler, easier-to-work-with systems came on the market.
Edgar Codd worked for IBM in the development of hard disk systems, and he was not happy with the lack of a search engine in the CODASYL approach, and the IMS model. He wrote a series of papers, in 1970, outlining novel ways to construct databases. His ideas eventually evolved into a paper titled, A Relational Model of Data for Large Shared Data Banks, which described new method for storing data and processing large databases. Records would not be stored in a free-form list of linked records, as in CODASYL navigational model, but instead used a “table with fixed-length records.”
IBM had invested heavily in the IMS model, and wasn’t terribly interested in Codd’s ideas. Fortunately, some people who didn’t work for IBM “were” interested. In 1973, Michael Stonebraker and Eugene Wong (both then at UC Berkeley) made the decision to research relational database systems. The project was called INGRES (Interactive Graphics and Retrieval System), and successfully demonstrated a relational model could be efficient and practical. INGRES worked with a query language known as QUEL, in turn, pressuring IBM to develop SQL in 1974, which was more advanced (SQL became ANSI and OSI standards in 1986 1nd 1987). SQL quickly replaced QUEL as the more functional query language.
RDBM Systems were an efficient way to store and process structured data. Then, processing speeds got faster, and “unstructured” data (art, photographs, music, etc.) became much more common place. Unstructured data is both non-relational and schema-less, and Relational Database Management Systems simply were not designed to handle this kind of data.
NoSQL (“Not only” Structured Query Language) came about as a response to the Internet and the need for faster speed and the processing of unstructured data. Generally speaking, NoSQL databases are preferable in certain use cases to relational databases because of their speed and flexibility. The NoSQL model is non-relational and uses a “distributed” database system. This non-relational system is fast, uses an ad-hoc method of organizing data, and processes high-volumes of different kinds of data.
“Not only” does it handle structured and unstructured data, it can also process unstructured Big Data, very quickly. The widespread use of NoSQL can be connected to the services offered by Twitter, LinkedIn, Facebook, and Google. Each of these organizations store and process colossal amounts of unstructured data. These are the advantages NoSQL has over SQL and RDBM Systems:
- Higher scalability
- A distributed computing system
- Lower costs
- A flexible schema
- Can process unstructured and semi-structured data
- Has no complex relationship
Unfortunately, NoSQL does come with some problems. Some NoSQL databases can be quite resource intensive, demanding high RAM and CPU allocations. It can also be difficult to find tech support if your open source NoSQL system goes down.
NoSQL Data Distribution
Hardware can fail, but NoSQL databases are designed with a distribution architecture that includes redundant backup storage of both data and function. It does this by using multiple nodes (database servers). If one, or more, of the nodes goes down, the other nodes can continue with normal operations and suffer no data loss. When used correctly, NoSQL databases can provide high performance at an extremely large scale, and never shut down. In general, there are four kinds of NoSQL databases, with each having specific qualities and characteristics.
A Document Store (often called a document-oriented database), manages, stores, and retrieves semi-structured data (also known as document-oriented information). Documents can be described as independent units that improve performance and make it easier to spread data across a number of servers. Document Stores typically come with a powerful query engine and indexing controls that make queries fast and easy. Examples of Document Stores are: Mongo DB, and Amazon Dynamo DB
Document-oriented databases store all information for a given “object” within the database, and each object in storage can be quite different from the others. This makes it easier for mapping objects to the database and makes document storage for web programming applications very attractive. (An “object” is a set of relationships. An article object could be related to a tag [an object], a category [another object], or a comment [another object].)
A DBMS using columns is quite different from traditional relational database systems. It stores data as portions of columns, instead of as rows. The change in focus, from row to a column, lets column databases maximize their performance when large amounts of data are stored in a single column. This strength can be extended to data warehouses and CRM applications. Examples of column-style databases include Cloudera, Cassandra, and HBase (Hadoop based).
A key-value pair database is useful for shopping cart data or storing user profiles. All access to the database is done using a primary key. Typically, there is no fixed schema or data model. The key can be identified by using a random lump of data. Key-value stores “are not” useful when there are complex relationships between data elements or when data needs to be queried by other than the primary key. Examples of key-value stores are: Riak, Berkeley DB, and Aerospike.
An element can be any single “named” unit of stored data that might, or might not, contain other data components.
Graph Data Stores
Location aware systems, routing and dispatch systems, and social networks are the primary users of Graph Databases (also called Graph Data Stores). These databases are based on graph theory, and work well with data that can be displayed as graphs. They provide a very functional, cohesive picture of Big Data.
It differs from relational databases, and other NoSQL databases, by storing data relationships as actual relationships. This type of storage for relationship data results in fewer disconnects between an evolving schema and the actual database. It has interconnected elements, using an undetermined number of relationships between them. Examples Graph Databases are: Neo4j, GraphBase, and Titan.
Polyglot Persistence is a spin-off of “polyglot programming,” a concept developed in 2006 by Neal Ford. The original idea promoted applications be written using a mix of languages, with the understanding that a specific language may solve a certain kind of problem easily, while another language would have difficulties. Different languages are suitable for tackling different problems.
Many NoSQL systems run on nodes and large clusters. This allows for significant scalability and redundant backups of data on each node. Using different technologies at each node supports a philosophy of Polyglot Persistence. This means “storing” data on multiple technologies with the understanding certain technologies will solve one kind of problem easily, while others will not. An application communicating with different database management technologies uses each for the best fit in achieving the end goal.
Photo Credit: arka38/Shutterstock.com