In 2003, the Human Genome Project completed the first full sequence of the human genetic code: more than three billion base pairs. The scientific world let out a sigh of relief coupled with a gasp of exasperation, because so much data had to be analyzed before it could become useful. As of May 2009, the International Nucleotide Sequence Database Collaboration – a group working to gather, evaluate, and distribute the combined DNA and RNA information from the DNA Data Bank of Japan, the European Molecular Biology Laboratory in Germany, and GenBank in the USA – had collected more than 287 billion pieces of genetic information. The gargantuan task of gathering and disseminating so much data is underway, and soon (within the next decade or two) individual genetic healthcare solutions will be accessible to people all over the world. Big Science is working with Big Data to make such avenues possible; other Big Data projects include those in physics, space exploration, weather, healthcare, and any other industry that seeks to gain the advantages of Big Data analytics.
The Sloan Digital Sky Survey (SDSS) at Apache Point Observatory in New Mexico began collecting data in 2000 and will continue through 2014. Built in cooperation with more than 150 scientists and institutions worldwide, the SDSS focuses primarily on mapping some 35% of the sky in detail. It has three primary phases:
- SDSS-I (2000-2005): The first phase allowed the telescope to image some 8,000 square degrees in five bandpasses and acquire detailed galaxy and quasar data from around 5,700 square degrees, as well as a meticulous scan of the southern Galactic cap.
- SDSS-II (2005-2008): The success of the first phase allowed the research institutions to gain more funding and continue with three surveys: the Sloan Legacy Survey (a continuation of SDSS-I); SEGUE (the Sloan Extension for Galactic Understanding and Exploration), which looked deeper into the overall structure and history of the Milky Way galaxy; and the Sloan Supernova Survey, which found more than 500 confirmed Type Ia supernovae.
- SDSS-III (2008-2014): The final stage of the survey comprises four distinct surveys: the Apache Point Observatory Galactic Evolution Experiment (APOGEE), the Baryon Oscillation Spectroscopic Survey (BOSS), the Multi-Object APO Radial Velocity Exoplanet Large-area Survey (MARVELS), and the Sloan Extension for Galactic Understanding and Exploration 2 (SEGUE-2).
The latest data release (DR8), in January 2011, covered 14,000 square degrees of the sky (about 35%); more than 930,000 galaxies, 120,000 quasars, 230 million celestial objects, and 460,000 stars have now been successfully mapped. The 2.5-meter telescope utilizes a 120-megapixel camera and produces around 200 GB of data every night that must be collected and distributed to the various researchers and institutions taking part in the project.
The original database design in 2000 was a fairly standard snowflake schema split across two separate databases that totaled only about 818 GB (a considerable amount back in the early 2000s). The SDSS-III data cluster and server configurations are now much more complex, as the data loads and numbers of users have skyrocketed during the various phases of the project (see the bottom of this article for links to further information).
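For readers unfamiliar with the term, a snowflake schema is a central fact table whose descriptive attributes are normalized out into chains of dimension tables. The sketch below is a deliberately tiny, hypothetical example of that layout using Python's built-in sqlite3 module; the table and column names are invented for illustration and are not the actual SDSS database design.

```python
# Illustrative only: a toy snowflake-style schema built with Python's
# built-in sqlite3 module. Table and column names are invented for this
# example and are not the actual SDSS design.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- dimension tables, normalized out into a chain ("snowflaked")
    CREATE TABLE run (run_id INTEGER PRIMARY KEY, observed_on TEXT);
    CREATE TABLE field (
        field_id INTEGER PRIMARY KEY,
        run_id   INTEGER REFERENCES run(run_id),
        ra_center_deg REAL, dec_center_deg REAL
    );
    CREATE TABLE band (band_id INTEGER PRIMARY KEY, name TEXT);  -- u, g, r, i, z

    -- central fact table: one row per object measurement
    CREATE TABLE photo_obj (
        obj_id   INTEGER PRIMARY KEY,
        field_id INTEGER REFERENCES field(field_id),
        band_id  INTEGER REFERENCES band(band_id),
        ra_deg REAL, dec_deg REAL, magnitude REAL
    );
""")

# A typical query joins the fact table back out through the dimension chain.
rows = conn.execute("""
    SELECT p.obj_id, r.observed_on, b.name, p.magnitude
    FROM photo_obj p
    JOIN field f ON p.field_id = f.field_id
    JOIN run   r ON f.run_id   = r.run_id
    JOIN band  b ON p.band_id  = b.band_id
    WHERE p.magnitude < 20.0
""").fetchall()
print(rows)  # empty here, since no rows were inserted; the joins show the shape
```

The appeal of this kind of layout at survey scale is that queries join one very large fact table out through small dimension tables, which keeps the design manageable even as the fact table grows.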
The next major advancement in astrophysics data collection is planned to begin sometime in 2014: the Large Synoptic Survey Telescope (LSST). With an 8.4-meter telescope and a 3,200-megapixel camera, the LSST research teams plan to look for evidence of Dark Matter and Dark Energy, map near-Earth asteroids down to 100 meters in size, and continue mapping more of the Milky Way, among numerous other experiments. It will be located on Cerro Pachón in northern Chile. The data collection task is monumental at somewhere around 300 MB per second, 8-16 TB per night, or more than 10,000 TB per year (a rough check on these rates follows the list of papers below). The computing grid is still being designed and built, but a number of papers have been released on the LSST main site proposing how the grid will be constructed:
- Connolly, A., “LSST Data Management: Prospects for Processing and Archiving Massive Astronomical Data Sets” (n.d.)
- Axelrod, T., Becla, J., Cook, K., Nikolaev, S., Gray, J., Plante, R., Nieto-Santisteban, M., Szalay, A., Thakar, A., “Designing for Peta-Scale in the LSST Database” (2007)
- LSST Petascale Data R&D Challenges – this link discusses some of the main R&D challenges associated with the LSST
- LSST Data Management Historical Material – this page has a list of PDF files with more information on the LSST data question
There are of course many more proposals, papers, and articles on the petabyte-level data storage problem the LSST poses, but those above were published on the LSST site and give a good overview of the issue.
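As a rough consistency check on the data rates quoted above (assuming an observing night of roughly ten hours, which is an assumption on my part rather than a figure from the LSST papers):

- 300 MB/s × 10 h × 3,600 s/h ≈ 10.8 TB per night

which sits comfortably inside the quoted 8-16 TB per night range.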
The Large Hadron Collider (LHC) is now a celebrity in the particle physics world. The search for the Higgs boson has become popular science for experts and laypersons alike. The overall structure of the LHC amounts to a 27 km circular tunnel beneath the border of France and Switzerland near Geneva. It is governed by CERN (the European Organization for Nuclear Research, from the French Conseil Européen pour la Recherche Nucléaire), which is currently the largest particle physics laboratory in the world.
The LHC took approximately 10 years and over $10 billion to build. It contains four major particle detectors that record the many millions of collisions happening every second when the LHC is under full operation: ALICE (A Large Ion Collider Experiment), LHCb (which studies B-meson decay), CMS (Compact Muon Solenoid), and ATLAS (A Toroidal LHC Apparatus). Three other experiments currently in operation are TOTEM (Total Cross Section, Elastic Scattering and Diffraction Dissociation at the LHC), LHCf (Large Hadron Collider forward), and MoEDAL (Monopole and Exotics Detector At the LHC). For detailed information on each of these experiments, visit the CERN/LHC site or the STFC/LHC site.
The data management system for the LHC computing grid is ground-breaking in its size and complexity. Comprising some 200,000 processing cores and 150 PB of storage space, and spanning 140 computing centers and 240 institutions in 35 countries, the grid was designed in multiple tiers to allow for a constant reduction of the massive 15 PB of data collected each year. At its height the project is expected to generate some 27 TB of data per day, plus an extra 10 TB of summary data. The data are first collected and reduced at Tier 0 (CERN), then farmed out to Tier 1 (primary institutions in Europe, Asia, and North America), which reduce and analyze the data further before sending it on to Tier 2 (some 150 institutions worldwide that help evaluate the data). In 2010, all of the LHC experiments together produced around 13 PB of data.
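Those figures hang together as an order-of-magnitude check; if the peak daily rate were sustained year-round (which overstates the real duty cycle), then:

- (27 TB + 10 TB) per day × 365 days ≈ 13.5 PB per year

which is the same order as the roughly 15 PB per year the grid was designed around and the 13 PB actually produced in 2010.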
According to the LHC website (and to bring that amount of data down to earth a bit), consider that during full operation (expected sometime in 2014-15) the ATLAS and CMS detectors will record some 600 million proton-proton collisions per second, with each event amounting to about 1 MB of data. Taken further, that means:
- 10⁹ collisions/s × 1 MB/collision = 10¹⁵ bytes/s = 1 PB/s (1 petabyte per second)
That amounts to approximately 200,000 DVDs or 6,000 iPods per second. The system is not designed to deal with that amount of data, so the engineers have set up a three-level trigger system that removes much of the data before it is ever sent to the storage and analysis facilities in the three-tier model. The trigger system allows CERN and the rest of the institutions involved in such a monumental project to filter the results down to their “more manageable” goal of around 15 PB per year.
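To make the idea of a multi-level trigger concrete, here is a minimal, purely illustrative Python sketch of a cascade of increasingly selective filters. The event fields, distributions, and thresholds are invented for this example; they are not the actual ATLAS or CMS trigger criteria, which run partly in custom hardware.

```python
# Illustrative only: a toy multi-level trigger cascade with invented
# event fields, distributions, and thresholds.
import random

def make_event():
    # A fake "event" summarized by two quantities drawn from toy distributions.
    return {
        "calorimeter_sum": random.expovariate(1 / 20.0),   # coarse energy sum
        "reconstructed_pt": random.expovariate(1 / 10.0),  # finer reconstructed quantity
    }

def level1(event):
    # Cheap, hardware-style cut on the coarse quantity; rejects the bulk of events.
    return event["calorimeter_sum"] > 60.0

def level2(event):
    # Software cut on a partially reconstructed quantity, run only on L1 survivors.
    return event["reconstructed_pt"] > 25.0

def level3(event):
    # Final, most expensive selection before an event is written to storage.
    return event["calorimeter_sum"] + event["reconstructed_pt"] > 100.0

def run_trigger(events):
    # Only events surviving every level of the cascade are kept.
    return [e for e in events if level1(e) and level2(e) and level3(e)]

if __name__ == "__main__":
    events = [make_event() for _ in range(1_000_000)]
    kept = run_trigger(events)
    print(f"stored {len(kept)} of {len(events)} events ({len(kept) / len(events):.5%})")
```

The point of the cascade is economic: the cheap first-level test throws away the overwhelming majority of events, so the more expensive reconstruction in later levels only ever runs on a tiny fraction of them.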
By June 2011, the LHC had delivered about 1 inverse femtobarn of integrated luminosity to each of the ATLAS and CMS experiments, or about 70 million million (70 × 10¹²) collisions. The LHC is currently in a repair phase that will last until 2014.
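The conversion from integrated luminosity to a collision count is just the luminosity multiplied by the proton-proton cross section. Taking an effective cross section of roughly 70 mb (an approximate value assumed here, consistent with the figure above) and noting that 1 mb = 10¹² fb:

- 1 fb⁻¹ × 70 mb = 1 fb⁻¹ × 7 × 10¹³ fb = 7 × 10¹³ collisions

which matches the 70 million million quoted above.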
Big Data is now traversing the lecture halls, research laboratories, engineering facilities, and operations centers of Big Science everywhere. Projects like the Large Hadron Collider, the Sloan Digital Sky Survey, and the Large Synoptic Survey Telescope are only a few of the better-known efforts that continue to make headlines. The 1000 Genomes Project is a collaboration among numerous public and private organizations worldwide that uses the Amazon Web Services cloud to share genetic data with researchers everywhere. The Climate Corporation is working to make better predictions of extreme weather, taking in more than 2.5 million weather measurements daily with 10 trillion data points already stored in its systems; it is only one of many companies working on Big Weather. The Obama Administration has launched a new Big Data Initiative, and other governments everywhere are following suit or already have their own initiatives underway. Alex Yoder, CEO of Webtrends, summarized the issue well in his recent CNET article:
“Gold requires mining and processing before it finds its way into our jewelry, electronics, and even the Fort Knox vault. Oil requires extraction and refinement before it becomes the gasoline that fuels our vehicles. Likewise, data requires collection, mining and, finally, analysis before we can realize its true value for businesses, governments, and individuals alike.”
No industry is hiding from this new trend. Every industry needs the discoveries that Big Data analytics can provide; the future is open to those who understand the implications and are moving forward with a clear focus.
SDSS Further Information
LSST Further Information
LSST Data Management Pipelines
LSST Data Management Facilities
LSST Data Products
Case Study: Designing the Large Synoptic Survey Telescope with Enterprise Architect
SLAC National Accelerator Laboratory LSST News Page
LSST: from Science Drivers to Reference Design and Anticipated Data Products
NOAO and the Large Synoptic Survey Telescope
LHC Further Research