A Data Warehouse (DW) stores corporate information and data from operational systems and a wide range of other data resources. Data Warehouses are designed to support the decision-making process through data collection, consolidation, analytics, and research. They can be used to analyze a specific subject area, such as "sales," and are an important part of modern Business Intelligence. The Data Warehouse architecture was developed in the 1980s to assist in transforming data from operational systems into decision-support systems. Normally, a Data Warehouse is hosted on an organization's own servers or in the cloud.
In a Data Warehouse, data from many different sources is brought to a single location and then translated into a format the Data Warehouse can process and store. For example, a business stores data about its customers, products, employees and their salaries, sales, and invoices. The boss may ask about the latest cost-reduction measures, and getting answers will require an analysis of all of that data. Unlike basic operational data stores, Data Warehouses contain aggregated historical data drawn from a variety of sources.
Punch cards were the first solution for storing computer-generated data. By the 1950s, punch cards were an important part of the American government and businesses. The warning "Do not fold, spindle, or mutilate" originally came from punch cards. Punch cards continued to be used regularly until the mid-1980s. They are still used to record the results of voting ballots and standardized tests.
Magnetic tape storage slowly replaced punch cards starting in the 1960s. Disk storage came as the next evolutionary step. Disk storage (hard drives and floppies) started becoming popular in 1964 and allowed data to be accessed directly, a significant improvement over the clumsier sequential access of magnetic tape.
IBM was primarily responsible for the early evolution of disk storage. The company invented both the floppy disk drive and the hard disk drive, and is credited with many of the subsequent improvements to disk technology. IBM began developing and manufacturing disk storage devices in 1956. In 2003, it sold its hard disk business to Hitachi.
Database Management Systems
Disk storage was quickly followed by software called a Database Management System (DBMS). In 1966, IBM introduced its own DBMS, known at the time as the Information Management System. DBMS software was designed to manage the storage on the disk and included the following abilities:
- Identify the proper location of data
- Resolve conflicts when more than one unit of data is mapped to the same location
- Allow data to be deleted
- Find room when stored data won’t fit in a specific, limited physical location
- Find data quickly (which was the greatest benefit)
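The storage-management duties listed above can be sketched with a toy key-value store. This is only an illustration of the concepts, not how any real DBMS is implemented: records are located by hashing their key, collisions are resolved by chaining within a bucket, and deletion and fast lookup fall out of the same structure.

```python
# A toy sketch (not a real DBMS) of the storage-management abilities
# listed above: locating data, resolving collisions, deleting records,
# and finding data quickly.

class ToyStore:
    """A fixed set of buckets; records that hash to the same slot are
    chained together in that bucket (collision resolution)."""

    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _slot(self, key):
        # Identify the proper location of the data for a given key.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already stored: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # collision: chain in the bucket

    def get(self, key):
        # Fast lookup: only one bucket is scanned, not the whole store.
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        slot = self._slot(key)
        self.buckets[slot] = [(k, v) for k, v in self.buckets[slot]
                              if k != key]

store = ToyStore()
store.put("acct-001", {"balance": 250})
store.put("acct-002", {"balance": 75})
print(store.get("acct-001"))   # {'balance': 250}
store.delete("acct-001")
print(store.get("acct-001"))   # None
```

Real systems add far more (overflow pages, indexes, recovery logs), but the core idea of mapping data to a disk location and resolving conflicts is the same.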
In the late 1960s and early '70s, commercial online applications came into play, shortly after disk storage and DBMS software became popular. Once it was realized that data could be accessed directly, information began being shared between computers. As a result, a large number of commercial applications emerged for online processing. Some examples included:
- Claims processing
- Bank teller processing
- Automated teller machine (ATM) processing
- Airline reservation processing
- Retail point of sale processing
- Manufacturing control processing
In spite of these improvements, finding specific data could be difficult, and the results were not necessarily trustworthy: the data found might be based on "old" information. Corporations were generating so much data that people could no longer trust the accuracy of what they were using.
Personal Computers and 4GL Technology
In response to this confusion and lack of trust, personal computers emerged as a viable alternative.
Personal computer technology let anyone bring their own computer to work and do processing when convenient. This led to personal computer software, and the realization that the personal computer’s owner could store their “personal” data on their computer. With this change in work culture, it was thought a centralized IT department might no longer be needed.
Simultaneously, a technology called 4GL was developed and promoted. 4GL technology (developed from the 1970s through the 1990s) was based on the idea that programming and system development should be straightforward and that anyone should be able to do it. This new technology also prompted the disintegration of centralized IT departments.
4GL technology and personal computers had the effect of freeing the end user, allowing them to take much more control of the computer system and find information quickly and efficiently. The goal of freeing end users and allowing them to access their own data was a very popular step forward. Personal computers and 4GL quickly gained popularity in the corporate environment. But along the way, something unexpected happened. End users discovered that:
- Incorrect data can be misleading.
- Incomplete data may not be very useful.
- Old data is not desirable.
- Multiple versions of the same data can be confusing.
- Data lacking documentation is questionable.
Relational databases became popular in the 1980s. They were significantly more user-friendly than their predecessors, and Structured Query Language (SQL) became the standard language of relational database management systems (RDBMS). By the late 1980s, a large number of businesses had moved from mainframe computers to client-server architectures. Staff members were now assigned a personal computer, and office applications (Microsoft Excel, Word, and Access) started gaining favor.
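A small, self-contained example shows what made the relational model and SQL so much friendlier than earlier systems: data lives in tables, and a declarative query joins and aggregates them without the user worrying about physical storage. This sketch uses Python's built-in sqlite3 module; the table and column names are illustrative, not taken from any particular system.

```python
import sqlite3

# An in-memory relational database: two tables linked by a key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
conn.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 100.0), (1, 250.0), (2, 75.0)])

# A join plus an aggregate: the user states WHAT is wanted, and the
# RDBMS decides HOW to retrieve it.
rows = conn.execute("""
    SELECT c.name, SUM(s.amount) AS total
    FROM customers c JOIN sales s ON s.customer_id = c.id
    GROUP BY c.name
    ORDER BY total DESC
""").fetchall()
print(rows)   # [('Acme', 350.0), ('Globex', 75.0)]
```

The same declarative style scales from this toy to the multi-table queries a Data Warehouse runs against years of history.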
The Need for Data Warehouses
During the 1990s major cultural and technological changes were taking place. The internet was surging in popularity. Competition had increased due to new free trade agreements, computerization, globalization, and networking. This new reality required greater business intelligence, resulting in the need for true data warehousing. During this time, the use of application systems exploded.
By the year 2000, many businesses discovered that, with the expansion of databases and application systems, their systems had been badly integrated and that their data was inconsistent. They discovered they were receiving and storing lots of fragmented data. Somehow, the data needed to be integrated to provide the critical “Business Information” needed for decision-making in a competitive, constantly-changing global economy.
Data Warehouses were developed by businesses to consolidate the data they were taking from a variety of databases, and to help support their strategic decision-making efforts.
The Use of NoSQL
As Data Warehouses came into being, an accumulation of Big Data began to develop, fed by computers, smartphones, the Internet, and the Internet of Things. Credit card transactions have also played a role, as has social media.
Facebook began using a NoSQL system in 2008. NoSQL is a "non-relational" Database Management System that uses fairly simple architecture and is quite useful when processing Big Data. NoSQL database systems are diverse, and while SQL systems normally offer richer query flexibility than NoSQL systems, the historical lack of horizontal scalability in SQL systems (though that has changed recently) gave NoSQL a decisive advantage. Non-relational (NoSQL) databases rest on two novel ideas: horizontal scaling (spreading storage and work across machines) and the elimination of the need for Structured Query Language to arrange and organize data. NoSQL databases have gradually evolved to include a wide variety of differing models. Cassandra and HBase (from the Hadoop ecosystem) are two examples of the 225+ NoSQL-style databases available.
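Horizontal scaling, the first of the two NoSQL ideas above, can be sketched in a few lines: records are spread across several nodes by hashing their key, so capacity grows by adding machines rather than buying a bigger one. In this sketch a "node" is just a Python dict standing in for a real server; the class and key names are illustrative.

```python
import hashlib

class ShardedKVStore:
    """A toy sharded key-value store: each key is routed to one of
    several nodes by a stable hash of the key (horizontal scaling)."""

    def __init__(self, num_nodes=3):
        # Each "node" is a plain dict standing in for a server.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always lands on the same node.
        digest = hashlib.sha256(key.encode()).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key):
        return self._node_for(key).get(key)

store = ShardedKVStore()
for user_id in range(10):
    store.put(f"user:{user_id}", {"likes": user_id * 3})

print(store.get("user:7"))            # {'likes': 21}
print([len(n) for n in store.nodes])  # keys spread across the nodes
```

Production systems such as Cassandra layer replication and rebalancing on top of this routing idea, but key-based partitioning is the core of it.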
Data Warehouse Alternatives
Data Silos can be a natural occurrence in large organizations, with each department having different goals, responsibilities, and priorities. Data silos are storage areas of fixed data which are under the control of a single department and have been separated and isolated from access by other departments for privacy and security. Data silos can also happen when departments compete instead of working together towards common goals. They are generally considered a hindrance to collaboration and efficient business practices.
A Data Mart is an area for storing data that serves a particular community or group of workers. They are storage areas with fixed data and deliberately under the control of one department within the organization.
Data Lakes take a more flexible approach to incoming data than Data Warehouses. Where a warehouse requires data to be organized to fit its schema on the way in, a lake stores data in its original, raw form and adds structure only as the data moves to the application layer. Because they preserve the original structure of data, Data Lakes can serve as a storage and retrieval system for Big Data and could, theoretically, scale upward indefinitely.
Data Swamps can be the result of a poorly designed or neglected Data Lake. The term describes a lake whose stored data was never documented correctly, making it difficult to analyze and use efficiently. The original data may still be there, but without the appropriate metadata for context it cannot be meaningfully recovered.
A Data Cube is software that stores data in matrices of three or more dimensions, with transformations expressed as tables and arrays of processed information. After tables have matched rows of data with columns of data types, the data cube cross-references tables from one or more data sources, increasing the detail of each data point. This arrangement lets researchers find deeper insights than other techniques.
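A minimal sketch makes the cube idea concrete: each cell holds a measure indexed by three dimensions, and "roll-ups" are computed by fixing some dimensions and summing over the rest. The dimensions and figures below are invented for illustration.

```python
# A toy three-dimensional data cube: (product, region, quarter) -> sales.
cube = {
    ("widget", "east", "Q1"): 100,
    ("widget", "west", "Q1"): 80,
    ("gadget", "east", "Q1"): 50,
    ("widget", "east", "Q2"): 120,
    ("gadget", "west", "Q2"): 60,
}

def roll_up(cube, product=None, region=None, quarter=None):
    """Sum the measure over all cells matching the fixed dimensions;
    a dimension left as None is aggregated away."""
    total = 0
    for (p, r, q), amount in cube.items():
        if (product in (None, p) and region in (None, r)
                and quarter in (None, q)):
            total += amount
    return total

print(roll_up(cube, product="widget"))             # 300: all widget sales
print(roll_up(cube, region="east", quarter="Q1"))  # 150: one "slice"
print(roll_up(cube))                               # 410: grand total
```

Real OLAP engines precompute many of these aggregates, but the slice-and-roll-up operations shown here are the same ones an analyst performs on a production cube.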