A data warehouse stores data from in-house systems and various outside sources. Data warehouses are designed to support the decision-making process through data collection, consolidation, analytics, and research. They can be used in analyzing a specific subject area, such as “sales,” and are an important part of modern business intelligence. The architecture for data warehouses was developed in the 1980s to assist in transforming data from operational systems to decision-making support systems.
Data warehouses can be part of a business’s mainframe server but are more often located in the cloud.
In a data warehouse, data from many different sources is brought to a single location and then translated into a format the data warehouse can process and store. For example, a business stores data about its customer’s information, products, employees and their salaries, sales, and invoices. If upper management asks for the latest cost-reduction measures, getting answers may require analyzing all of the previously mentioned data. Below, we highlight the history of the data warehouse and data storage from the 1950s to the present day.
Early Data Storage
Punch cards were the first solution for storing computer-generated data. By the 1950s, punch cards were an important part of the American government and businesses. The warning “Do not fold, spindle, or mutilate” originally came from punch cards. Punch cards continued to be used regularly until the mid-1980s. They are still used to record the results of voting ballots and standardized tests.
“Magnetic storage” slowly replaced punch cards starting in the 1960s. Disk storage came as the next evolutionary step for data storage. Disk storage (hard drives and floppies) started becoming popular in 1964 and allowed data to be accessed directly, significantly improving the clumsier magnetic tapes.
IBM was primarily responsible for the early evolution of disk storage. They invented the floppy disk drive as well as the hard disk drive. They are also credited with several of the improvements now supporting their products. IBM began developing and manufacturing disk storage devices in 1956. In 2003, they sold their “hard disk” business to Hitachi.
Database Management Systems
Disk storage was quickly followed by software called a database management system (DBMS). In 1966, IBM came up with its own DBMS called, at the time, an information management system. DBMS software was designed to manage “the storage on the disk” and included the following abilities:
- Identify the proper location of data
- Resolve conflicts when more than one unit of data is mapped to the same location
- Allow data to be deleted
- Find room when stored data won’t fit in a specific, limited physical location
- Find data quickly (which was the greatest benefit)
- Online Applications
In the late 1960s and early ‘70s, commercial online applications came into play, shortly after disk storage and DBMS software became popular. Once it was realized data could be accessed directly, information began being shared between computers. As a result, there were a large number of commercial applications which could be applied to online processing. Some examples included:
- Claims processing
- Bank teller processing
- Automated teller processing (ATMs)
- Airline reservation processing
- Retail point-of-sale processing
- Manufacturing control processing
In spite of these improvements, finding specific data could be difficult, and it was not necessarily trustworthy. The data found might be based on “old” information. At this time, so much data was being generated by corporations that people couldn’t trust the accuracy of the data they were using.
Personal Computers and 4GL Technology
In response to this confusion and lack of trust, personal computers became reasonable, functional solutions.
Personal computer technology lets anyone bring their computer to work and do processing when convenient. This led to personal computer software and the realization that the personal computer’s owner could store their “personal” data on their computer. With this change in work culture, it was thought that a centralized IT department might no longer be needed.
Simultaneously, a technology called 4GL was developed and promoted. 4GL technology (developed in the 1970s through 1990) was based on the idea that programming and system development should be straightforward and anyone can do it. This new technology also prompted the disintegration of centralized IT departments.
4GL technology and personal computers had the effect of freeing the end user, allowing them to take much more control of the computer system and find information quickly and efficiently. The goal of freeing end users and allowing them to access their own data was a very popular step forward. Personal computers and 4GL quickly gained popularity in the corporate environment. But along the way, something unexpected happened. End users discovered that:
- Incorrect data can be misleading.
- Incomplete data may not be very useful.
- Old data is not desirable.
- Multiple versions of the same data can be confusing.
- Data lacking documentation is questionable.
Relational databases became popular in the 1980s. Relational databases were significantly more user-friendly than their predecessors. Structured Query Language (SQL) is the language used by relational database management systems (RDBMS). By the late 1980s, many businesses had moved away from mainframe computers. Staff members were now assigned a personal computer, and office applications (Excel, Microsoft Word, and Access) started gaining favor.
The Need for Data Warehouses
During the 1990s major cultural and technological changes were taking place. The internet was surging in popularity. Competition had increased due to new free trade agreements, computerization, globalization, and networking. This new reality required greater business intelligence, resulting in the need for true data warehousing. During this time, the use of application systems exploded.
By the year 2000, many businesses discovered that, with the expansion of databases and application systems, their systems had been badly integrated and that their data was inconsistent. They discovered they were receiving and storing lots of fragmented data. Somehow, the data needed to be integrated to provide the critical “business information” needed for decision-making in a competitive, constantly-changing global economy.
Data warehouses were developed by businesses to consolidate the data they were taking from a variety of databases and to help support their strategic decision-making efforts.
The Use of NoSQL
As data warehouses emerged, an accumulation of big data began to develop. This accumulation required the development of computers, smartphones, the internet, and the Internet of Things to provide the data. Credit cards have also played a role, as has social media.
Facebook began using a NoSQL system in 2008. NoSQL is a “non-relational” database management system that uses fairly simple architecture. It is quite useful when processing big data. NoSQL database systems are diverse, and while SQL systems normally have more flexibility than NoSQL systems, the lack (though that has changed recently) of scalability in SQL gives NoSQL systems a decisive advantage.
Non-relational databases (or NoSQL) use two novel concepts: horizontal scaling (the spreading of storage and work) and the elimination of the need for Structured Query Language to arrange and organize data. NoSQL databases have gradually evolved to include a wide variety of differing models. Cassandra and Hadoop are two examples of the 225-plus NoSQL-style databases available.
Data Warehouse Alternatives
Data lakes, in addition to data lakehouses, have recently gained popularity. Data lakes use a more flexible structure for collecting and storing data than a data warehouse. Data lakes preserve the original structure of data and can be used as a retrieval and storage system for big data, which could, theoretically, scale upward indefinitely. (The term “big data” is dropping out of use, because, these days, big data is normal and no longer “big.”)
A data mart is an area for storing data that serves a particular community or group of workers. It is a storage area with fixed data and is deliberately under the control of one department within the organization.
A data cube is software that stores data in matrices of three or more dimensions. Any transformations in the data are expressed as tables and arrays of processed information. After tables have matched the rows of data strings with the columns of data types, the data cube then cross-references tables from a single data source or multiple data sources, increasing the detail of each data point. This arrangement provides researchers with the ability to find deeper insights than other techniques.
Data silos can naturally occur in large organizations, with each department having different goals, responsibilities, and priorities. Data silos are storage areas of fixed data that are under the control of a single department and have been separated and isolated from access by other departments for privacy and security. Data silos can also happen when departments compete instead of working together towards common goals. They are generally considered a hindrance to collaboration and efficient business practices.
Data swamps can result from a poorly designed or neglected data lake. A data swamp describes the failures to document stored data correctly. This situation makes the data difficult to analyze and use efficiently. While the original data may still be there, a data swamp cannot recover it without the appropriate metadata for context.
Image used under license from Shutterstock.com