by Sunil Soares
I have met with dozens of clients this year in industries such as financial services, retail, and government. When the subject of big data comes up, the first question is usually “What data am I supposed to be looking at?”
Before anyone can answer that question, a broader one must be answered: “What problem are you trying to solve?”
Let’s assume that answer comes quickly, and it’s time to look at the different kinds of big data that businesses aren’t necessarily getting insights from today.
We need a good classification for big data. The figure above provides a framework that I have been fine-tuning for the past several months.
I believe that most big data can be broadly classified into five types:
- Web and social media
- Machine-to-machine data
Machine-to-machine (M2M) refers to technologies that allow both wireless and wired systems to communicate with other devices. M2M uses a device such as a sensor or meter to capture an event (such as speed, temperature, pressure, flow, or salinity), which is then relayed through a wireless, wired, or hybrid network to an application that translates the captured event into meaningful information. M2M communications create the so-called “internet of things.” The big data governance program needs to establish a number of policies around M2M data. For example, the program needs to draw up guidelines for the acceptable use of geolocation and RFID data, which can be used to build profiles of individuals and potentially violate their privacy. The program also needs to establish retention policies for the massive volumes of M2M data, which can easily overwhelm IT budgets if not properly controlled. Finally, the program needs to address data quality concerns, such as poor RFID read rates in environments with high moisture content and heavy congestion.
- Big transaction data
Big transaction data includes healthcare claims, telecommunications call detail records, and utility billing records. This data is increasingly available in semi-structured and unstructured formats. Information governance challenges such as metadata, data quality, privacy, and information lifecycle management also apply to it.
- Biometrics
Biometric information includes fingerprints, retinal scans, facial recognition, and genetics. Advances in technology have vastly increased the amount of biometric data available. Law enforcement, the legal system, and intelligence agencies have been using this information for a long time. However, biometric data is increasingly available in the commercial arena, where it can be commingled with other types of data such as social media. For example, page 45 of the attached FTC report describes a scenario where retailers can combine facial recognition with social media to personalize messages to customers.
All this creates new business opportunities, but it also raises several governance issues relating to privacy and data retention.
- Human-generated data
Human beings generate vast quantities of data such as call center agents’ notes, voice recordings, email, paper documents, surveys, and electronic medical records. This data may contain sensitive information that needs to be masked. It may also contain insights that can improve the quality of structured data sets and should be integrated with master data management (MDM). Finally, organizations need to establish policies regarding the retention period for this data, both to adhere to regulations and to manage storage costs.
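To make the masking and retention policies for human-generated data a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the two regular expressions, the placeholder tokens, and the function names `mask_sensitive` and `is_past_retention` are hypothetical, and a real governance program would rely on vetted detection tooling and on retention periods set by its own regulations.

```python
import re
from datetime import date, timedelta

# Hypothetical patterns for illustration only; they are far from exhaustive.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_sensitive(text: str) -> str:
    """Replace SSN-like and email-like strings with placeholder tokens."""
    text = SSN_PATTERN.sub("[SSN]", text)
    return EMAIL_PATTERN.sub("[EMAIL]", text)


def is_past_retention(created: date, today: date, retention_days: int) -> bool:
    """Return True when a record is older than the assumed retention period."""
    return today - created > timedelta(days=retention_days)


# Example: a made-up call center note.
note = "Customer jane.doe@example.com called; SSN 123-45-6789 confirmed."
masked = mask_sensitive(note)
print(masked)
```

Masking before the notes are stored or shared means downstream analytics never see the raw identifiers, while the retention check gives a simple rule for flagging records that are due for deletion.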
Of course, the true test of a framework is whether it can stand up over time and address different scenarios. I have had to revise this framework a number of times to account for new types of big data. It is entirely possible that I have missed something, and I look forward to your feedback.