The original purpose of a data silo was to keep secrets. People have been keeping secrets for a long, long time. Prior to the written word, keeping a secret meant simply not speaking it to anyone else. Then came the written word: secrets could be shared accidentally, or even stolen. Life became more complicated, but in some ways, more efficient. Then came mathematics, eventually followed by the computer, which stored information in the form of data. Computers also store secrets, and some of those end up in data silos, whether purposefully or just from years of doing business in the same way.
A data silo is ostensibly meant to keep private information from the eyes of those who do not need to know. Unfortunately, information that is needed by others may also be stored in the data silo. In the worst-case scenario, a silo becomes a dumping ground for data that “might be” useful sometime in the future, and then sits there, never used.
Data silos often contain “incompatible data” that is believed important enough to translate at a later time. For many organizations, a significant amount of data has been stored this way, still awaiting translation. The inappropriate, and all too human, tendency to “stash” potentially valuable information in a convenient and “safe” place (such as a data silo) has created a significant problem for Big Data analysts.
Keeping Secrets on Paper
Prior to the 1700s, individuals would traditionally fold and bundle private, confidential papers, contracts, and documents, storing them in envelopes, and then hiding them in a wooden box with a lock. Metal safes became popular during the 1700s, and file cabinets (which could be locked) were introduced in 1898. File cabinets and safes were (and, to some extent, still are) used to store confidential business information.
Paper is a solid, tangible object that can be touched, held, and felt. Because paper documents are real, and not virtual, they can be stolen, lost, or destroyed. (So can virtual documents, but in different ways.) Fire is a very real danger for paper documents, as is water damage. Even small amounts of damage to a document may leave it illegible.
Paper documents can reveal important clients, profits, and other confidential information. If these documents are not stored (or shredded) properly, there can be significant problems, such as the theft of confidential client information, trade secrets, or even a client’s identity. Information conveyed on paper is still used, though since the 1970s, computers have become a more and more popular method of communicating and storing information.
The Data Warehouse
Disk storage started to become popular in 1964. It was a new technology that allowed data to be accessed directly, and significantly increased data storage. Prior to disk storage, magnetic tape, with a painfully slow processing speed, was the popular format for storing data.
The increases in speed and storage space were evolutionary steps necessary for the development of data warehouses, and their extension, data silos. In 1988, the concept of data warehousing originated with an article in IBM Systems Journal. Data warehouses came first, with data silos developing later as a subdivision of data warehouses. In 1992, Bill Inmon published a book titled Building the Data Warehouse.
A data warehouse is data storage for “all the data” gathered and collected by an organization’s different operational systems. Data warehouses gather data from a variety of sources for both access and analysis.
A data warehouse is typically located within the relational database of an organization’s mainframe server, though, presently, it might also be located in a cloud. Data is collected from a variety of OLTP (online transaction processing) applications, as well as other sources for purposes of business intelligence, decision support, and answering user inquiries. Data warehouses can also be used for OLAP (online analytical processing).
Relational databases were also a necessary step in the evolution of data silos. Relational databases became popular in the 1980s, sparking an era of faster data and greater computing efficiency. SQL (structured query language) is commonly used by RDBMS (relational database management systems).
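As a minimal illustration of the SQL/RDBMS pairing described above, the sketch below uses Python’s built-in sqlite3 module; the table, columns, and data are invented for the example:

```python
# Illustrative only: a tiny relational table queried with SQL via
# Python's built-in sqlite3 module. Names and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clients (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO clients (name, region) VALUES (?, ?)",
    [("Acme", "EU"), ("Globex", "US")],
)

# A declarative SQL query: the RDBMS decides how to fetch the rows.
rows = conn.execute("SELECT name FROM clients WHERE region = ?", ("EU",)).fetchall()
print(rows)  # [('Acme',)]
```

The point of the example is the declarative style: the query states *what* data is wanted, and the relational engine works out *how* to retrieve it.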
By the late 1980s, many businesses had shifted from mainframe computers to client/server systems. Staff were each assigned a personal computer, along with office applications such as Excel, Access, and Microsoft Word. Relational databases were designed to operate on a single server, and the bigger that server, the better; increasing its capacity meant physically upgrading the memory and processors.
Free trade agreements, globalization, networking, and computerization have made competition more and more intense. This reality has required a greater emphasis on “business intelligence,” in turn demanding the storage of data in a data warehouse.
In the late 1990s, as many businesses attempted to adjust and expand their databases, they discovered their systems were badly integrated, and their data inconsistent. They also discovered they were storing large amounts of fragmented data. The goal became accessing the unintegrated data for research purposes and accessing the business information needed for staying competitive in a constantly changing global economy.
Departments that had developed their own data silos within the data warehouse were especially difficult to work with. Department managers did not want to share the secrets they were “responsible” for protecting, which resulted in great frustration for researchers.
Enter NoSQL and the Criticism of Data Silos
NoSQL (Not only SQL) has provided a way to access the unintegrated, non-relational data stored in data warehouses (and their extensions, data silos). NoSQL database technologies were created to deal with the need to gather business information from bulk data, and more recently, have been used in building modern applications.
Relational databases, on the other hand, were not designed to handle the challenges of Big Data research, nor for developing modern applications. The NoSQL model has a distributed database system, which means a system made up of multiple computers. When more processing power is needed, another computer is added to the system.
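The scale-out model described above can be sketched in a few lines. This is a toy illustration, not any particular NoSQL product’s mechanism: each key is hashed to choose a node, and appending a node to the list adds capacity. All node names are invented.

```python
# Toy sketch of horizontal scaling in a distributed (NoSQL-style) store:
# hash each key to pick a node; adding a node to the list adds capacity.
import hashlib

def node_for(key: str, nodes: list[str]) -> str:
    """Map a record key to one of the cluster's nodes (naive hash-mod placement)."""
    digest = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return nodes[digest % len(nodes)]

nodes = ["node-a", "node-b"]
placement = node_for("customer:42", nodes)

# When more processing power is needed, another computer joins the cluster:
nodes.append("node-c")
```

(Real systems use consistent hashing rather than this naive modulo scheme, so that adding a node relocates only a small fraction of the keys.)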
The rise of NoSQL also turned data silos into a source of irritation for Big Data researchers. Many Big Data researchers dislike data silos so much, they believe they should be eliminated entirely. From their perspective, the largest obstacle blocking the use of Big Data and advanced data analytics isn’t a lack of skilled workers (that might be the second largest obstacle), but a lack of access to the data. From the department manager’s perspective, data silos hold secrets that should not be shared with the general public, nor with strangers.
Consider the Business Dictionary’s description of “silo mentality”: a mindset that takes hold when department managers decide it’s a bad idea to share their information with researchers or other members of the organization. This kind of behavior is generally considered detrimental and destructive to the organization.
For example, two in-house data silos storing (theoretically) the same data may actually have different content, causing chaos and confusion about the data’s accuracy in at least one silo. While it is true a silo mentality may provide excellent security, it also has the potential to impede productivity in terms of Big Data research and development of modern applications.
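The divergence problem above is easy to demonstrate. In this hypothetical sketch, two departmental silos hold what should be the same customer records, and a simple comparison surfaces the drift; all names and values are invented:

```python
# Hypothetical: two silos that "should" hold the same customer data
# have drifted apart. Comparing them surfaces the inconsistencies.
silo_sales   = {"cust-1": "Alice Ltd", "cust-2": "Bob GmbH"}
silo_billing = {"cust-1": "Alice Ltd", "cust-2": "Bob Inc", "cust-3": "Cara SA"}

# Keys present in both silos but with conflicting values:
mismatched = {k for k in silo_sales.keys() & silo_billing.keys()
              if silo_sales[k] != silo_billing[k]}

# Keys that exist in only one of the two silos:
only_in_one = silo_sales.keys() ^ silo_billing.keys()

print(mismatched)   # {'cust-2'}
print(only_in_one)  # {'cust-3'}
```

Which silo holds the *correct* value for `cust-2` cannot be decided from the data alone, which is exactly the chaos the paragraph above describes.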
Brian Moffo, the Director of Analytics Delivery at Anexinet, blames silo mentality on fear, inertia, and an exaggerated focus on departmental security. Moffo said:
“Most silos start out of necessity, and have existed for a long period of time. They often throw up walls and boundaries to access the data of those silos because they feel like they are the only ones who can manage it.”
Consolidation Rather than Elimination
While many IT experts want data silos to be eliminated completely, their continued use suggests they still have value, which ultimately rests on their ability to keep secrets. Some organizations have opted for an alternative to complete elimination: “consolidating” their data silos, in essence minimizing the number of silos they have.
For example, The Intern Group chose silo consolidation as a means of streamlining its everyday business tasks. Kevin Harper, a data scientist at The Intern Group, said:
“What we have uncovered along the way, are many opportunities to build data science tools that support our operations. Without first consolidating our data, we would not have been able to move on to more advanced analytics and data science. Data migrations can be complicated and long-term projects. So, you need to start with leadership to inspire motivation throughout the organization. After that, it comes down to finding the right talent to lead the transition.”
Three key factors should be considered when planning the migration and consolidation of data silos:
- Research and select a technology that allows users to consolidate data silos while minimizing the disruption for departments and staff.
- Interrelate all the existing data and enrich it uniformly with metadata to build a rich context and improve the user experience. Adding context is a key part of the consolidation process; duplicates and mismatched versions are removed along the way.
- As the data is consolidated, the infrastructure of the old silos gets consolidated as well. Applications, processes, and workflows can be evaluated, at your own pace, with new solutions replacing the old.
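The core of the second step, merging silo contents while enriching each record with source metadata and dropping duplicates, can be sketched as follows. This is a minimal illustration under invented silo names and record fields, not a real migration tool:

```python
# Hedged sketch of silo consolidation: pull records from several silos
# into one store, tag each record with source metadata, and drop exact
# duplicates. Silo names and record fields are invented for illustration.
def consolidate(silos: dict[str, list[dict]]) -> list[dict]:
    seen = set()
    merged = []
    for silo_name, records in silos.items():
        for rec in records:
            key = tuple(sorted(rec.items()))  # fingerprint for duplicate detection
            if key in seen:
                continue                      # same record already seen in another silo
            seen.add(key)
            merged.append({**rec, "source_silo": silo_name})  # enrich with metadata
    return merged

silos = {
    "finance": [{"id": 1, "name": "Acme"}],
    "sales":   [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}],
}
result = consolidate(silos)  # duplicate "Acme" record is kept only once
```

A real migration would also have to reconcile *mismatched* versions of the same record (as in the two-silo example earlier), which requires business rules about which silo is authoritative, not just deduplication.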