Organizations all over the world are gathering, analyzing, and evaluating huge volumes of data from a wide variety of sources, with the goal of increasing productivity and efficiency. Big Data Analytics is being used to stop credit card fraud, anticipate hardware failures, and reroute internet traffic to avoid congestion. Big Data technologies can improve network operations, and the same data can provide an understanding of the organization’s business operations, offer insights into user behavior, and increase revenues. Many companies using Big Data say it drives their revenues by delivering deep insights into their customers’ behavior. Not surprisingly, the use of Big Data also comes with some challenges.
The primary goal of Big Data is to provide Business Intelligence. This goal presents the first challenge: finding useful information. Useful information is buried in a variety of resources across the network, and it is not easy to draw insights from massive amounts of data. Maksim Tsvetovat, author of Social Network Analysis for Startups, used an analogy from broadcast radio communications: “There has to be a discernible signal in the noise that you can detect, and sometimes there just isn’t one. Once we’ve done our intelligence on the data, sometimes we have to come back and say we just didn’t measure this right, or measured the wrong variables, because there’s nothing we can detect here.”
However, when used effectively, Big Data can provide highly useful business insights. Used properly, it can also serve as “fast data.” Paul Maritz, Chief Executive Officer of Pivotal within the EMC Federation, wrote in a Capgemini report:
“If you can obtain all the relevant data, analyze it quickly, surface actionable insights, and drive them back into operational systems, then you can affect events as they’re still unfolding. The ability to catch people or things ‘in the act,’ and affect the outcome, can be extraordinarily important, valuable, and disruptive.”
Finding skilled data scientists and Big Data analysts is the second challenge of working with Big Data. The field is new and there is a shortage of skilled labor. The skills require a combination of statistical experience and intuition, which makes for a curious mix of personality traits. People who are good with statistics and mathematics tend to avoid situations requiring intuition, and vice versa.
One option for dealing with this situation (if the money is available) is to build an in-house data analyst team, through a combination of re-training current workers and recruiting new staff who specialize in Big Data. A less expensive option is to hire a freelance Big Data contractor. When freelance contractors are used, a standard protocol should be established for data entry, so that information is recorded consistently and confusion between permanent and temporary staff is avoided.
After the decision has been made to improve the IT infrastructure, many of the upcoming problems are predictable. The shift to using Big Data should be well-organized and the architecture should be well planned. Organizations should take a systematic approach in planning the evolution of their computer system. Additionally, companies should:
- Schedule workshops for staff in preparation for using Big Data
- Pay attention to costs and plan upscaling in the future
- Recognize data is not 100 percent accurate and manage its quality
- Get serious about finding useful business insights
- Never neglect the security of Big Data
Collecting and Storing Data
Collecting and accumulating Big Data can be a challenge. Big Data research sources are often spread out through government agencies, in-house accounts, the Internet of Things, and other data sources. Bringing it all together requires thoughtful planning.
Additionally, the quality and accuracy of the data need to be ensured. This requires data cleansing (often a manual process), as well as a review of Data Governance. (Is the data accurate? Was it recorded accurately? Have errors crept in over time?)
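Parts of the cleansing step can be automated. The sketch below shows a minimal, hypothetical pass over customer records gathered from several sources: the record fields, the sample data, and the validation rule are all illustrative assumptions, not a prescribed standard.

```python
import re

# Hypothetical raw customer records pulled from several sources.
raw_records = [
    {"name": "  Ada Lovelace ", "email": "ada@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},   # duplicate
    {"name": "Bob", "email": "not-an-email"},               # invalid email
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def cleanse(records):
    """Trim whitespace, drop records with invalid emails, and de-duplicate."""
    seen = set()
    clean = []
    for rec in records:
        name = rec["name"].strip()
        email = rec["email"].strip().lower()
        if not EMAIL_RE.match(email):
            continue  # in practice, quarantine for manual review instead
        key = (name, email)
        if key in seen:
            continue  # duplicate entry from another source
        seen.add(key)
        clean.append({"name": name, "email": email})
    return clean

print(cleanse(raw_records))  # one valid, de-duplicated record survives
```

In a real pipeline the rejected records would be routed to a quarantine queue for the manual review the text mentions, rather than silently dropped.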
Data lakes store all of the captured data in its raw form, as separate units with no relationship to one another. This data is stored in the hope it will prove useful later (and in some cases, storing it is legally required). In this state, and lacking a NoSQL system, the stored data cannot be manipulated and researched for insights because it has not been integrated. To be used properly, the data in these silos should be integrated, or shifted to a NoSQL system.
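The integration step can be pictured as merging per-customer fragments from unrelated stores into one document per customer, the shape a document-oriented NoSQL store would hold. The silo names, keys, and fields below are purely illustrative assumptions.

```python
# Hypothetical silos: each source knows customers by the same id,
# but the records live in unrelated stores.
billing_silo = {"c1": {"balance": 42.50}, "c2": {"balance": 0.0}}
support_silo = {"c1": {"open_tickets": 2}}

def integrate(*silos):
    """Merge per-customer fragments into one document per customer."""
    documents = {}
    for silo in silos:
        for customer_id, fragment in silo.items():
            documents.setdefault(customer_id, {}).update(fragment)
    return documents

docs = integrate(billing_silo, support_silo)
print(docs["c1"])  # {'balance': 42.5, 'open_tickets': 2}
```

The hard part in practice is not the merge itself but agreeing on a shared key (here, the customer id) across sources that were never designed to talk to each other.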
Data lakes can also be clumsy to use because they often contain inaccurate data. According to a report by Experian Data Quality, up to 75 percent of the businesses polled believe their own customer contact information is incorrect. A database full of inaccurate customer information can be worse than no data at all. Data can be integrated as it comes in, but doing so may require additional software and hardware.
The IT Infrastructure
Realizing the promise of Big Data Analytics requires that organizations adjust how they do business. For some organizations, there may be concerns about “ripping and replacing” the majority of their IT infrastructure (a cloud service provider may be an alternative). The combined effects of ever-higher data volumes, complex data content, and a wide variety of data types have presented some serious problems for businesses.
While NoSQL systems, such as Hadoop, are extremely popular, there is Big Data software that works well with “smaller” amounts of Big Data and Relational Database Management Systems (RDBMSs). A relational database is a database designed to save data using a structured format, with rows and columns. It is called “relational” because the values stored within each table are associated, or “related,” to one another.
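A minimal sketch of what “related” means in practice, using Python’s built-in SQLite driver: an orders table points back to a customers table through a foreign key, and a join recombines the rows. The table and column names are invented for illustration.

```python
import sqlite3

# Two "related" tables: each order row references a customer row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL)""")
conn.execute("INSERT INTO customers VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.95)")

# The relation lets us join rows across the two tables.
row = conn.execute("""
    SELECT c.name, o.total
    FROM orders o JOIN customers c ON o.customer_id = c.id
""").fetchone()
print(row)  # ('Acme Corp', 99.95)
```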
Two popular programs are Wizard: Statistics & Data Analysis Software, designed for the Mac, and The R Project for Statistical Computing, which is free and runs on a variety of UNIX platforms, macOS, and Windows.
However, a lack of scaling, or the inability to scale, can present significant problems when working with Big Data. The most notable feature of Big Data is its ability to grow, and that growth is one of its most significant challenges. This is why NoSQL systems are so popular: they can scale out to fit the amount of data being stored and researched. The problem is not the actual process of installing new storage capacity in an SQL system, but rather that the system’s performance may decline if the expansion is not done properly. A good architectural design can keep this from becoming a problem.
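One common scale-out idea, sketched under loose assumptions: records are spread across nodes by hashing their key, so adding capacity means adding nodes rather than a bigger machine. The node names are placeholders, and this naive modulo scheme is only for illustration.

```python
import hashlib

# Hypothetical cluster: records are partitioned across nodes by key hash.
NODES = ["node-a", "node-b", "node-c"]

def node_for(key: str, nodes=NODES) -> str:
    """Pick the node responsible for a key (simple hash partitioning)."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Every client computes the same placement, so reads and writes for a
# given key always land on the same node.
print(node_for("customer-42") == node_for("customer-42"))  # True
```

Note that this simple modulo scheme reshuffles most keys when the node count changes; production systems typically use consistent hashing to limit that movement.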
A good architectural design can also minimize problems that might occur later. The design of Big Data algorithms also plays a role in eliminating problems. And the design should allow for easy upscaling in the future. This is also a good time to plan the system’s maintenance, and schedule systematic performance audits to help identify weaknesses and address them quickly.
Big Data Issues in the Cloud
Cloud computing essentially describes a type of computing that delivers services through the internet or a network of servers. The primary purpose of public cloud computing is to provide large amounts of computing power to paying customers.
The cloud uses networks of servers with specialized connections designed to distribute the data processing work among the servers. Rather than installing specialized software on each computer, public cloud technologies install software programs on a “host” computer that users log into as a web-based service. The cloud hosts a large variety of Big Data programs that are useful to the user. This can shift the workload significantly, and lessen the burden of hosting several programs and applications on an in-house computer system.
There is something called the “plumbing problem” when working with the cloud. This is based on the continuing problem of the increasing amounts of data being created and saved every day. This has the effect of slowing down processing speeds and creating bottlenecks. Without jumping through lots of hoops, the easiest way to deal with this problem is to find a cloud that doesn’t have this problem, or to work on the cloud during times of low usage. There are more expensive (and more efficient) ways of dealing with the cloud’s plumbing problems.
Technical difficulties may temporarily shut a cloud down. For example, in early June, Google’s Cloud went down, and took with it a variety of services that relied on Google software. (This is mildly amusing, in that Google had no way to access the “down” cloud-based tools they needed to repair their cloud; they had accidentally locked themselves out without a key.) In a situation like this, it’s best to have some “other” clouds available for use. They may not be your first choice, but they’ll be there in an emergency.
Security is also an important issue when working in a cloud. Cloud technology comes with a variety of security issues. A cloud encompasses several technologies, which may include databases, networks, operating systems, resource allocation, containerization, virtualization, resource scheduling, transaction management, and load balancing. Each of these presents a potential point of attack. For example, the network connecting the systems in the cloud might have a backdoor allowing a hacker access, or a container may have delivered malware or a virus to the cloud. Data can be protected in these ways:
- The use of data mining techniques can detect malware in clouds.
- Sensitive data can be protected through the use of cryptography and granular access control techniques.
- A variety of threat models can be developed against the most common cyber-attacks and/or data leakage scenarios.
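The second protection above, granular access control, can be sketched as a simple role-based filter: each role maps to the fields a caller may see, and everything else is stripped before the data leaves the data layer. The role names and record fields here are illustrative assumptions, not any particular product’s API.

```python
# Minimal role-based access control sketch: roles map to visible fields.
ROLE_FIELDS = {
    "analyst": {"region", "purchase_total"},   # no direct identifiers
    "support": {"name", "email", "region"},
}

record = {"name": "Ada", "email": "ada@example.com",
          "region": "EU", "purchase_total": 120.0}

def view(record, role):
    """Return only the fields the given role is allowed to see."""
    allowed = ROLE_FIELDS.get(role, set())
    return {k: v for k, v in record.items() if k in allowed}

print(view(record, "analyst"))  # {'region': 'EU', 'purchase_total': 120.0}
```

Defaulting unknown roles to an empty field set keeps the filter fail-closed: a misconfigured caller sees nothing rather than everything.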