by Sunil Soares
I have met with dozens of clients this year in industries such as financial services, retail, and government, and when the subject of big data comes up, the first question is usually, “What data am I supposed to be looking at?”
Before anyone can answer that question, a broader question must be answered: “what problem are you trying to solve?”
Let’s assume that answer comes quickly, and it’s time to look at the different kinds of big data that businesses aren’t necessarily getting insights from today.
We need a good classification for big data. The figure above provides a framework that I have been fine-tuning for the past several months.
I believe that most big data can be broadly classified into five types:
- Web and social media
This includes clickstream and social media data from sources such as Facebook, Twitter, LinkedIn, and blogs. Big data governance programs will increasingly be required to integrate this data with master data and with core business processes such as customer loyalty programs. The big data governance program needs to establish policies regarding the acceptable use of social media data, especially as regulations and precedents are continually evolving. The program also needs to establish guidelines regarding the acceptable use of cookies, especially third-party cookies, to track users and to personalize their web interactions. Metadata is also critical to web and social media. For example, two sites may define the term “unique visitors” differently for clickstream analytics: one site may count unique visitors within a month, while another may count unique visitors within a week.
- Machine-to-machine data
Machine-to-machine (M2M) refers to technologies that allow both wireless and wired systems to communicate with other devices. M2M uses a device such as a sensor or meter to capture an event (such as speed, temperature, pressure, flow, or salinity), which is relayed through a wireless, wired, or hybrid network to an application that translates the captured event into meaningful information. M2M communications create the so-called “internet of things.” The big data governance program needs to establish a number of policies around M2M data. For example, the program needs to draw up guidelines around the acceptable use of geolocation and RFID data, which can be used to build a profile of individuals and potentially violate their privacy. The program also needs to establish retention policies around the massive volumes of M2M data, which can easily overwhelm IT budgets if not properly controlled. The big data governance program also needs to address data quality concerns such as RFID read rates in environments with high moisture content and heavy congestion.
- Big transaction data
This includes healthcare claims, telecommunications call detail records, and utility billing records. Big transaction data is increasingly available in semi-structured and unstructured formats. Information governance challenges such as metadata, data quality, privacy, and information lifecycle management also apply to this data.
- Biometrics
Biometric information includes fingerprints, retinal scans, facial recognition, and genetics. Advances in technology have vastly increased the available biometric data. Law enforcement, the legal system, and intelligence agencies have been using this information for a long time. However, biometric data is increasingly available in the commercial arena, where it can be commingled with other types of data such as social media. For example, page 45 of the attached FTC report describes a scenario where retailers can combine facial recognition with social media to personalize messages to customers.
http://ftc.gov/os/2012/03/120326privacyreport.pdf
All this opens up new business opportunities as well as several governance issues relating to privacy and data retention.
- Human generated data
Human beings generate vast quantities of data such as call center agents’ notes, voice recordings, email, paper documents, surveys, and electronic medical records. This data may contain sensitive information that needs to be masked. It may contain insights that can improve the quality of structured data sets and should be integrated with MDM. Finally, organizations need to establish policies regarding the retention period for this data to adhere to regulations and to manage storage costs.
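As a rough illustration of the masking step mentioned above, here is a minimal Python sketch. The patterns, the `mask_sensitive` function, and the sample note are all assumptions made for illustration; they are not taken from any particular governance tool, and a real program would need patterns vetted against its own data and regulations.

```python
import re

# Hypothetical patterns for illustration only. A real governance program
# would maintain a reviewed catalog of sensitive-data patterns.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def mask_sensitive(text: str) -> str:
    """Replace each pattern match with a labeled placeholder such as [SSN]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Made-up call center note used only to exercise the function.
note = "Caller gave SSN 123-45-6789 and asked us to reply to jane.doe@example.com."
print(mask_sensitive(note))
```

Regular expressions only catch sensitive values with a predictable shape. Free-text notes often contain names, diagnoses, or account details that would need other techniques, so this should be read as a starting point rather than a complete masking policy.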
Of course, the true measure of a framework is whether it can withstand the test of time and address different scenarios. I have had to evolve this framework a number of times to account for new types of big data. It is entirely possible that I have missed something, and I look forward to your feedback.