Big Data is the term used to describe the massive data sets that individuals or organizations collect in order to search, analyze, visualize, or share significant trends or patterns in human behavior. It has been collected to understand everything from how consumers make their purchases to what intelligence aircraft hovering over warzones can gather. It has been used for everything from detecting maritime security threats emerging from the shores of Somalia to tracking metrics for baseball players in pursuit of big wins (see: Moneyball).
But for all the arenas in which Big Data is used to understand a bigger picture, the methods by which data is acquired, stored, searched, and analyzed come with a seemingly endless number of problems and criticisms from all sides. Some are concerned about privacy and how data is acquired. Others are curious about what happens once a conclusion has been reached, specifically whether a Big Data analysis leads one to draw spurious correlations or make decisions that adversely affect a given population.
With all of the innovations and critiques that the Big Data universe attracts, the acquisition of massive amounts of data doesn’t seem to be waning. Indeed, even the United States government has collected more data than it could ever possibly read. But while the amelioration or advancement of Big Data can be argued within the realm of commercial enterprise, global politics, or even sports, can it possibly help solve problems surrounding something as non-partisan as a regional or global health crisis?
As Big Data moves like an unstoppable shark into the sphere of (global) public health, this is what thousands of epidemiologists are now asking. Epidemiology is the study of medicine and human activities that deal with the incidence, distribution, and control of diseases – in one word, epidemics. And epidemiologists are now experiencing the highs, lows, and everything in between when it comes to the world of Big Data.
Academic journals of epidemiology are now discussing more than the spread of viruses across the globe and throughout various socioeconomic classes. One can now find articles and material that focus on how to manage Big Data collections during research projects, how Big Data affects epidemic forecasting, and why massive amounts of unstructured data are needed to visualize global health trends.
In August of last year, the issue made its way to the stages of TED, where Nathan Wolfe, director of the Global Viral Forecasting Initiative, went “beyond talking about the role of viruses in human history” to suggest the implications and consequences of connectedness and information exchange within Big Data and digital epidemiology.
Last month, Big Data experts working with the consulting firm Perficient gave a presentation on using Big Data to improve healthcare operations and analytics. A large part of the presentation focused on how new developments in big data analytics could help track the spread of disease based on streaming data and visualize global outbreaks, which could ultimately help determine the source of an infection.
There are several ways in which epidemiologists are acquiring, sharing, and displaying information from massive amounts of unstructured data. And the rate at which this data is being collected is rapidly increasing. According to the Global Viral Forecasting Initiative:
“[G]lobalisation will also speed the flow of health data […] People in viral hotspots around the world will report suspicious human and animal deaths (often a warning sign of a coming plague) by mobile phones. These data will be posted to the web, instantly enriching the data that came from traditional surveillance systems and electronic medical records. Organizations like Google.org will scour search patterns around the world, expanding their search-based predictions of influenza to other infectious diseases.”
Slate magazine made phenomenal use of data that had previously been acquired and stored by the Centers for Disease Control and made available for public consumption via its website. Slate took the data (in this case, on diabetes in the United States) and created a web-based visualization of the spread of diabetes over a period of time.
Greenplum, a company under EMC’s Big Data Division, noted the relevance, not only for epidemiology, but for the visualization of the material from massive data sets:
“The CDC had already produced static heat maps to visualize the rates of diabetes in the United States by merging the modeled estimates in database format using geographic boundary files named shapefiles. This allowed the CDC to spatially reference the statistical data with associated state and county boundaries. The resulting maps had a wealth of useful information, but lacked the interactivity necessary for effective data-driven journalism. By accessing the CDC’s data, which is free to download in .xls .ppt formats from the Department of Health and Human Services website, Slate transformed the CDC’s data into an engaging narrative, creating an interactive time-elapsed map that allowed the site’s readers to get specific rates for counties with a simple rollover. On the merits of this compelling visualization, Slate’s “Diabetes on the March” coverage went viral online, in a way that a typical Centers for Disease Control report might not.”
The US military’s Electronic Surveillance System for the Early Notification of Community-based Epidemics, called Essence, can monitor health data on the hour across its 400 facilities around the world. Since 2008, the program has collected around 2.5 terabytes of data per month to monitor trends within the military health system. With a relatively quick compilation of data, the US military can distinguish a handful of isolated cases from a genuine outbreak, whether that is flu on an air force base in Germany or a gastrointestinal illness among a large population in South Korea.
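Telling a few flu cases apart from an outbreak usually comes down to comparing a current count with a recent baseline. The sketch below is only an illustration of that idea, not a description of how Essence actually works: it flags any day whose count exceeds the trailing mean by a chosen number of standard deviations. The function name, the seven-day baseline, the threshold of 3.0, the `min_sigma` floor, and the sample counts are all assumptions made for the example.

```python
from statistics import mean, stdev


def flag_aberrations(counts, baseline=7, threshold=3.0, min_sigma=1.0):
    """Return the indices of days whose count is unusually high.

    A day is flagged when its count exceeds the mean of the preceding
    `baseline` days by more than `threshold` standard deviations.
    All parameter defaults are illustrative assumptions.
    """
    flagged = []
    for i in range(baseline, len(counts)):
        window = counts[i - baseline:i]
        mu = mean(window)
        # Floor the spread so a perfectly flat baseline does not make
        # every tiny increase look like an outbreak.
        sigma = max(stdev(window), min_sigma)
        if counts[i] > mu + threshold * sigma:
            flagged.append(i)
    return flagged


# Synthetic daily flu-visit counts for one hypothetical facility.
daily_flu_visits = [4, 6, 5, 3, 5, 4, 6, 5, 7, 30]
print(flag_aberrations(daily_flu_visits))
```

In this made-up series, only the final day stands out against its preceding week; the small rise to 7 the day before stays within normal variation. Real surveillance systems would also need to account for seasonality, day-of-week effects, and reporting delays, which this sketch ignores.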
Epidemiologists are trying to achieve a similar feat with civilian populations at regional and global levels. And while the Big Data experience for the US military’s health system has so far proved successful, epidemiology is still faced with the same Big Data glut the Pentagon and State Department face when attempting to analyze the petabytes of intelligence data they gather from drones and ground forces alike.
While the challenges seem to come from all directions, more is being learned about how to address the challenges of Big Data and forecasting epidemics. SupplyChainBrain magazine spelled out just one of the lessons learned:
“During the H1N1 pandemic of 2009, a manufacturer indirectly tracked the spread of the virus faster than the Center for Disease Control by looking at changes in daily tissue forecasts. Not surprisingly, consumers reached for tissues at the first sign of a runny nose, well in advance of full blown symptoms and doctor visits. While tissue forecasts as an outbreak precursor is fascinating, the real story is that the manufacturer was able to identify the change in demand and respond by shifting production and deployment. While competitors experienced stock-outs, their products were on-shelf, capturing an unexpected lift in revenue and building brand loyalty.”
But while this may not be the best methodology to follow when it comes to pandemics or disease outbreaks (especially in the developing world where commercial spaces for tissues may not be so readily available), it does reveal how one can change, alter, and skew visualization by shifting the ways of thinking about the behavior and location of a given population.
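One way to reason about a proxy signal such as tissue demand is to ask how many days it runs ahead of officially reported cases. The sketch below, offered as an assumption-laden illustration rather than the manufacturer's method, shifts a proxy series against a case series and reports the lag with the highest Pearson correlation. The function names, the `max_lag` default, and both data series are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def best_lead(proxy, cases, max_lag=7):
    """Return (lag, r): the lead in days at which `proxy` correlates
    most strongly with later values of `cases`."""
    best = (0, pearson(proxy, cases))
    for lag in range(1, max_lag + 1):
        # Pair the proxy on day t with reported cases on day t + lag.
        r = pearson(proxy[:-lag], cases[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Synthetic series: reported cases echo tissue demand three days later.
tissue_demand = [10, 10, 10, 12, 18, 30, 45, 40, 28, 15, 11, 10, 10, 10, 10]
reported_cases = [10, 10, 10, 10, 10, 10, 12, 18, 30, 45, 40, 28, 15, 11, 10]

lag, r = best_lead(tissue_demand, reported_cases)
print(f"proxy leads reported cases by {lag} days (r = {r:.2f})")
```

A high correlation at some lag only suggests that the proxy moves ahead of the official figures in this sample. As the passage above cautions, such a signal depends heavily on where and how the population in question actually buys the product.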
Nevertheless, research projects attempting to solve the Big Data problem for epidemiologists are already underway. One such project consists of 12 teams in 8 European countries that have come to a consensus on the Big Data needs for the study of epidemiology:
- The foundation and development of the mathematical and computational methods needed to achieve prediction and predictability of disease spreading in complex techno-social systems;
- The development of large scale, data driven computational models endowed with a high level of realism and aimed at epidemic scenario forecast;
- The design and implementation of original data-collection schemes motivated by identified modeling needs, such as the collection of real-time disease incidence, through innovative web and ICT applications;
- The set-up of a computational platform for epidemic research and data sharing that will generate important synergies between research communities and countries.
If research projects like these yield significant progress in addressing the Big Data problem, the prediction and prevention of diseases may come to rely on massive computational algorithms, streamlining the mandate of organizations like the Department of Health and Hospitals, the Centers for Disease Control and Prevention, the International Red Cross, and the World Health Organization, and making it more efficient.