Big Data has been described by some Data Management pundits (with a bit of a snicker) as “huge, overwhelming, and uncontrollable amounts of information.” In 1663, John Graunt dealt with “overwhelming amounts of information” as well, while he studied the bubonic plague, which was then ravaging Europe. Graunt used statistics and is credited with being the first person to use statistical data analysis. In the early 1800s, the field of statistics expanded to include collecting and analyzing data.
The evolution of Big Data includes a number of preliminary steps for its foundation, and while looking back to 1663 isn’t necessary for the growth of data volumes today, the point remains that “Big Data” is a relative term depending on who is discussing it. Big Data to Amazon or Google is very different from Big Data to a medium-sized insurance organization, but no less “Big” in the minds of those contending with it.
Such foundational steps to the modern conception of Big Data involve the development of computers, smart phones, the internet, and sensory (Internet of Things) equipment to provide data. Credit cards also played a role, by providing increasingly large amounts of data, and certainly social media changed the nature of data volumes in novel and still developing ways. The evolution of modern technology is interwoven with the evolution of Big Data.
The Foundations of Big Data
Data became a problem for the U.S. Census Bureau in 1880. They estimated it would take eight years to handle and process the data collected during the 1880 census, and predicted the data from the 1890 census would take more than 10 years to process. Fortunately, in 1881, a young man working for the bureau, named Herman Hollerith, created the Hollerith Tabulating Machine. His invention was based on the punch cards designed for controlling the patterns woven by mechanical looms. His tabulating machine reduced ten years of labor into three months of labor.
In 1927, Fritz Pfleumer, an Austrian-German engineer, developed a means of storing information magnetically on tape. Pfleumer had devised a method for adhering metal stripes to cigarette papers (to keep a smoker’s lips from being stained by the rolling papers available at the time), and decided he could use this technique to create a magnetic strip, which could then be used to replace wire recording technology. After experiments with a variety of materials, he settled on a very thin paper, striped with iron oxide powder and coated with lacquer, for his patent in 1928.
During World War II (more specifically, 1943), the British, desperate to crack Nazi codes, invented a machine that scanned for patterns in messages intercepted from the Germans. The machine was called Colossus, and it scanned 5,000 characters a second, reducing the workload from weeks to merely hours. Colossus was the first electronic data processor. Two years later, in 1945, John von Neumann published a paper on the Electronic Discrete Variable Automatic Computer (EDVAC), the first “documented” discussion of program storage, which laid the foundation of computer architecture today.
It is said these combined events prompted the “formal” creation of the United States’ NSA (National Security Agency), by President Truman, in 1952. Staff at the NSA were assigned the task of decrypting messages intercepted during the Cold War. Computers of this time had evolved to the point where they could collect and process data, operating independently and automatically.
The Internet Effect and Personal Computers
ARPANET began on Oct 29, 1969, when a message was sent from UCLA’s host computer to Stanford’s host computer. It received funding from the Advanced Research Projects Agency (ARPA), a subdivision of the Department of Defense. Generally speaking, the public was not aware of ARPANET. In 1973, it connected with a transatlantic satellite, linking it to the Norwegian Seismic Array. However, by 1989, the infrastructure of ARPANET had started to age. The system wasn’t as efficient or as fast as newer networks. Organizations using ARPANET started moving to other networks, such as NSFNET, to improve basic efficiency and speed. In 1990, the ARPANET project was shut down, due to a combination of age and obsolescence. The creation of ARPANET led directly to the Internet.
In 1965, the U.S. government built the first data center, with the intention of storing millions of fingerprint sets and tax returns. Each record was to be transferred to magnetic tape and stored in a central location. Conspiracy theorists expressed their fears, and the project was closed. However, in spite of its closure, this initiative is generally considered the first effort at large-scale data storage.
Personal computers came on the market in 1977, when microcomputers were introduced, and became a major stepping stone in the evolution of the internet, and subsequently, Big Data. A personal computer could be used by a single individual, as opposed to mainframe computers, which required an operating staff, or some kind of time-sharing system, with one large processor being shared by multiple individuals. After the introduction of the microprocessor, prices for personal computers dropped significantly, and the machines came to be described as “an affordable consumer good.” Many of the early personal computers were sold as electronic kits, designed to be built by hobbyists and technicians. Eventually, personal computers would provide people worldwide with access to the internet.
In 1989, a British computer scientist named Tim Berners-Lee came up with the concept of the World Wide Web. The Web is an information space in which resources are identified by URLs, interlinked by hypertext links, and accessed via the Internet. His system also allowed for the transfer of audio, video, and pictures. His goal was to share information on the Internet using a hypertext system. By the fall of 1990, Tim Berners-Lee, working for CERN, had created the three foundational technologies that underpin today’s web:
- HTML: HyperText Markup Language. The formatting language of the web.
- URL: Uniform Resource Locator. A unique “address” used to identify each resource on the web. It is a form of URI (Uniform Resource Identifier).
- HTTP: Hypertext Transfer Protocol. Used for retrieving linked resources from all across the web.
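These three pieces fit together: an HTTP request uses a URL to retrieve an HTML document. As a small illustration of how a URL packs together everything needed to locate a resource, the following Python sketch splits a made-up web address into its parts using the standard library:

```python
from urllib.parse import urlsplit

# Split a sample (illustrative) web address into the parts a browser
# uses to locate a resource on the web.
parts = urlsplit("https://www.example.com/history/big-data?lang=en#origins")

scheme = parts.scheme      # protocol to use, e.g. "https"
host = parts.netloc        # the server that holds the resource
path = parts.path          # where the resource lives on that server
query = parts.query       # optional parameters sent to the server
fragment = parts.fragment  # a position within the retrieved document

print(scheme, host, path, query, fragment)
```

The browser resolves the host, opens a connection, and issues an HTTP request for the path; the server answers with the HTML document, whose hyperlinks are themselves URLs, and the cycle repeats.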
In 1993, CERN announced the World Wide Web would be free for everyone to develop and use. The free part was a key factor in the effect the Web would have on the people of the world. (It’s the companies providing the “internet connection” that charge us a fee).
The Internet of Things (IoT)
The concept of the Internet of Things was given its official name in 1999. By 2013, the IoT had evolved to include multiple technologies, using the Internet, wireless communications, micro-electromechanical systems (MEMS), and embedded systems. All of these transmit data about the person using them. Automation (including buildings and homes), GPS, and others, support the IoT.
The Internet of Things, unfortunately, can make computer systems vulnerable to hacking. In October of 2016, hackers crippled major portions of the Internet with a distributed denial-of-service attack launched from a botnet of compromised IoT devices. The early response has been to develop Machine Learning and Artificial Intelligence focused on security issues.
Computing Power and Internet Growth
There was an incredible amount of internet growth in the 1990s, and personal computers became steadily more powerful and more flexible. Internet growth was driven by Tim Berners-Lee’s efforts, CERN’s decision to provide free access, and the spread of individual personal computers.
In 2005, Big Data, which had been used without a name, was given its label by Roger Mougalas. He was referring to a large set of data that, at the time, was almost impossible to manage and process using the traditional business intelligence tools available. That same year, Hadoop, which could handle Big Data, was created. Hadoop grew out of Nutch, an Open Source web-crawler project, and incorporated Google’s MapReduce programming model. As an Open Source software framework, Hadoop can process structured and unstructured data from almost all digital sources. Because of this flexibility, Hadoop (and its sibling frameworks) can process Big Data.
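The MapReduce model behind Hadoop can be shown in miniature: a map step emits key/value pairs from each input record, and a reduce step aggregates the values for each key. The sketch below is a simplified, single-machine Python rendition of the pattern (Hadoop itself distributes these same phases across a cluster); the word-count task and function names are illustrative only:

```python
from collections import defaultdict

def map_phase(documents):
    # Map step: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    # Shuffle/reduce step: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["Big Data", "big data big"]
result = reduce_phase(map_phase(docs))
print(result)  # {'big': 3, 'data': 2}
```

Because the map step treats each record independently and the reduce step only needs the pairs that share a key, both phases can be spread across many machines, which is what lets frameworks like Hadoop chew through data sets far too large for one computer.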
Big Data Storage
Magnetic storage is currently one of the least expensive methods for storing data. Fritz Pfleumer’s 1927 concept of the magnetic stripe has been adapted to a variety of formats, ranging from magnetic tape and magnetic drums to floppies and hard disk drives. Magnetic storage describes any data storage based on a magnetized medium. It uses the two magnetic polarities, North and South, to represent a zero or one, or on/off.
Cloud Data Storage has become quite popular in recent years. The first true Cloud appeared in 1983, when CompuServe offered its customers 128K of data space for personal and private storage. In 1999, Salesforce offered Software-as-a-Service (SaaS) from its website. Technical improvements within the internet, combined with falling data storage costs, have made it more economical for businesses and individuals to use the Cloud for data storage purposes. This saves organizations the cost of buying, maintaining, and eventually replacing their own computer systems. The Cloud offers near-infinite scalability, is accessible anywhere at any time, and provides a variety of services.
The Uses of Big Data
Big Data is revolutionizing entire industries and changing human culture and behavior. It is a result of the information age and is changing how people exercise, create music, and work. The following provides some examples of Big Data use.
- Big Data is being used in healthcare to map disease outbreaks and test alternative treatments.
- NASA uses Big Data to explore the universe.
- The music industry replaces intuition with Big Data studies.
- Utilities use Big Data to study customer behavior and avoid blackouts.
- Nike uses health monitoring wearables to track customers and provide feedback on their health.
- Big Data is being used by cybersecurity to stop cybercrime.
Big Data Analytics
Analytics has, in a sense, been around since 1663, when John Graunt dealt with “overwhelming amounts of information,” using statistics to study the bubonic plague. In 2017, a survey of 2,800 experienced Business Intelligence professionals predicted that Data Discovery and Data Visualization would become an important trend. Data Visualization is a form of visual communication (think infographics). It describes information that has been translated into schematic form, including changes, variables, and fluctuations. A human brain can process visual patterns very efficiently.
Visualization models are steadily becoming more popular as an important method for gaining insights from Big Data. Graphics are already common, and animation will become common as well, although at present data visualization models remain a little clumsy and could use some improvement. A growing number of businesses now offer Big Data visualization models.
To be sure, the Brief History of Big Data is not as brief as it seems. Even though the 17th century didn’t see anywhere near the exabyte-level volumes of data that organizations are contending with today, to those early data pioneers the data volumes certainly seemed daunting at the time. Big Data is only going to continue to grow and with it new technologies will be developed to better collect, store, and analyze the data as the world of data-driven transformation moves forward at ever greater speeds.