
Big Data

by Viktor Mayer-Schönberger


Introduction

In the digital age, we are surrounded by an unprecedented amount of information. From our online searches to our social media interactions, from our shopping habits to our physical movements, every aspect of our lives is being captured and stored as data. This explosion of data, combined with powerful computing technologies, has given rise to what we now call "big data."

Viktor Mayer-Schönberger's book "Big Data" explores this phenomenon and its far-reaching implications for society, business, and our daily lives. The author takes us on a journey through the world of big data, explaining its potential to revolutionize how we understand and interact with the world around us.

The Rise of Big Data

From Scarcity to Abundance

Not too long ago, collecting and analyzing data was a time-consuming and expensive process. The author illustrates this point with the example of the 1880 US census, which took over eight years to complete and publish. By the time the results were available, they were already outdated.

Fast forward to today, and the picture has changed dramatically. The advent of computers, digitization, and the Internet has transformed the landscape of data collection and analysis. Information can now be gathered passively or with minimal effort, and the cost of data storage has plummeted.

This shift from data scarcity to data abundance marks the beginning of the big-data era. While there's no formal definition, "big data" refers to the massive scale of data being captured and the unprecedented opportunities for insights that these large datasets offer.

Google Flu Trends: A Big Data Success Story

To illustrate the power of big data, the author presents the case of Google Flu Trends. In 2009, Google published a research paper demonstrating how they could predict flu outbreaks by analyzing users' search terms. By comparing historical search data with official flu spread statistics, they identified 45 search terms that could accurately predict flu outbreaks.

This breakthrough came just in time for the H1N1 flu pandemic. Google's system provided more timely and useful information to public health officials than traditional government statistics. This example showcases how big data can offer insights and predictions that were previously impossible, potentially revolutionizing fields like public health.

The Power of Datafication

Turning Everything into Data

One of the key concepts introduced in the book is "datafication" – the process of capturing information about the world in the form of data. This trend extends far beyond our online activities and is increasingly being applied to aspects of our lives we never thought could be quantified.

The author provides several fascinating examples of this trend:

  1. Buttprint Recognition: Japan's Advanced Institute of Industrial Technology has developed pressure sensors that can identify individuals based on the weight distribution of their backsides on a car seat. This technology could potentially be used as a security device for cars.

  2. Health Monitoring Earbuds: Apple has patented technology to measure blood oxygenation, heart rate, and body temperature through earbuds.

  3. Smart Floors: IBM has patented touch-sensitive floor surfaces that can track people's movements across a space.

These examples demonstrate how researchers and companies are finding new ways to capture data from unexpected sources. The goal is to gain valuable insights into human behavior and create innovative products and services.

Beyond Sampling: The Whole Picture

The Limitations of Small Samples

Traditionally, when we wanted to understand a large population or phenomenon, we relied on sampling – taking a small subset of data and extrapolating it to represent the whole. The author uses the example of a telephone survey for a local election to illustrate this point.

While sampling can provide useful insights, it has inherent limitations. As you try to analyze smaller subgroups within your sample, you quickly run out of data points, making it impossible to draw reliable conclusions. For instance, in our election survey example, you might have enough data to make general predictions about the entire population, but not enough to say anything meaningful about specific subgroups like public servants under 30.

The Big Data Advantage

Big data changes this paradigm. Instead of relying on small samples, we now have access to vast amounts of data – sometimes even all of it. This allows us to "zoom in" on subgroups almost endlessly without losing statistical significance.

In the context of our election survey example, a big data approach might involve analyzing the voting preferences of tens of thousands of people, or even everyone in the town. This wealth of data allows for much more detailed and reliable analysis, even for very specific subgroups.
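The arithmetic behind this limitation is easy to demonstrate. Here is a minimal Python sketch (the population shares are invented for illustration, not taken from the book) showing how quickly a subgroup shrinks in a small sample:

```python
# Illustrative sketch: expected subgroup size after slicing a sample.
# Assumed shares (invented): 4% of respondents are public servants,
# and 20% of those are under 30.
def expected_subgroup(sample_size, *shares):
    """Expected number of respondents left after slicing by each share."""
    n = sample_size
    for share in shares:
        n *= share
    return n

# A classic 1,000-person phone survey vs. a town-wide dataset:
print(expected_subgroup(1_000, 0.04, 0.20))    # 8.0 -> too few to conclude anything
print(expected_subgroup(200_000, 0.04, 0.20))  # 1600.0 -> enough to analyze
```

With only eight expected respondents in the subgroup, any observed preference is statistical noise; with 1,600, reliable subgroup analysis becomes possible.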

Embracing Messiness: Quantity over Quality

The IBM Translation Experiment

The author shares an interesting anecdote about IBM's attempt to develop a language translation program in the 1980s. Instead of using traditional methods based on grammar rules and dictionaries, IBM engineers tried a novel statistical approach. They fed three million sentence pairs from official Canadian parliamentary documents into the computer, hoping the system would learn to translate based on statistical probabilities.

Despite initial promise, the project ultimately failed. The system could reliably translate common words and phrases but struggled with less frequent ones. The problem wasn't the quality of the data – it was the quantity. There simply wasn't enough data for the system to learn from.

Google's Success with Messy Data

Less than a decade later, Google tackled the same problem with a different approach. Instead of using a limited set of high-quality data, they used the entire global Internet – billions of pages of text in various languages. Despite the questionable quality of much of this data, the sheer volume made Google's translations more accurate than any rival system.

This example illustrates a key principle of big data: sometimes, having vast amounts of messy data can be more valuable than having a smaller amount of clean, high-quality data. When working with big data, we can afford to be more forgiving of inaccuracies because the sheer volume of data tends to minimize their impact.
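The statistical idea behind both systems can be sketched in a few lines. The toy example below is my own illustration with made-up sentence pairs – not IBM's or Google's actual models – estimating word-translation probabilities from raw co-occurrence counts; feeding in more pairs, even noisy ones, sharpens the estimates:

```python
from collections import Counter

# Toy statistical-translation sketch: count how often each English word
# co-occurs with each French word across aligned sentence pairs.
pairs = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("the blue house", "la maison bleue"),
]

cooc = Counter()    # (english_word, french_word) -> co-occurrence count
totals = Counter()  # english_word -> total French words seen alongside it
for en, fr in pairs:
    for e in en.split():
        for f in fr.split():
            cooc[(e, f)] += 1
        totals[e] += len(fr.split())

def p_translation(e, f):
    """Estimate P(f | e) from raw co-occurrence counts."""
    return cooc[(e, f)] / totals[e]

print(p_translation("house", "maison"))  # 0.4 -> the strongest candidate for "house"
```

With three sentence pairs the estimates are crude; with billions of pages, the same counting logic begins to dominate hand-built grammar rules.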

Correlation vs. Causation: Letting the Data Speak

The Orange Car Mystery

The author presents an intriguing example from a data analysis competition. Contestants were given the task of identifying factors that predict whether a used car is likely to be a "lemon" (a defective car). Surprisingly, the analysis revealed that orange cars were half as likely to have defects as the average car.

This finding raises an obvious question: why? As humans, we naturally want to understand the reasons behind such correlations. However, one of the key insights of big data is that we don't always need to know why two things are related – sometimes, it's enough to know that they are.

The Power of Unexpected Correlations

The author argues that automatic analyses of large datasets can reveal correlations we never even thought to look for. While we may not always understand the underlying causes, these correlations can still be incredibly useful.

He illustrates this point with a study conducted by IBM and the University of Ontario Institute of Technology on premature babies. By analyzing vast amounts of data on babies' vital signs, they discovered that very stable vital signs often preceded serious infections – a counterintuitive finding that doctors could use to provide better care.

This example shows how big data can uncover valuable insights that go beyond our intuitive understanding or preconceived notions. While it doesn't tell us why things are related, it can tell us that they are – and often, that's enough to drive meaningful improvements.
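Mechanically, surfacing such correlations is simple: compare a subgroup's rate against the overall rate and flag large gaps. A sketch with invented records (not the competition's actual data):

```python
# Illustrative sketch: flag subgroups whose defect rate differs from the
# overall rate, without explaining *why* the gap exists. Data is invented.
cars = [
    {"color": "orange", "defective": False},
    {"color": "orange", "defective": False},
    {"color": "white",  "defective": True},
    {"color": "white",  "defective": False},
    {"color": "black",  "defective": True},
    {"color": "black",  "defective": False},
]

def defect_rate(rows):
    """Fraction of rows marked defective (bools sum as 0/1)."""
    return sum(r["defective"] for r in rows) / len(rows)

overall = defect_rate(cars)
orange = defect_rate([c for c in cars if c["color"] == "orange"])
print(overall, orange)  # the orange subgroup sits below the average
```

Run over every attribute in a large dataset, this kind of scan turns up correlations no analyst thought to look for – which is exactly the point the author is making.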

The Hidden Value of Data

Primary and Secondary Uses of Data

Most data is collected with a specific purpose in mind. Stores collect sales data for accounting, factories monitor output for productivity tracking, and websites analyze user behavior to improve user experience. However, the author argues that the real value of data often lies in its secondary uses – applications that weren't originally intended but can prove even more valuable than the primary use.

He provides several examples:

  1. SWIFT's GDP Forecasts: The interbank payment system SWIFT collects data on financial transactions for record-keeping. They later discovered this data correlates well with global economic activity, allowing them to offer accurate GDP forecasts as a new service.

  2. Search Term Mining: Companies like Experian allow clients to analyze old Internet search terms to gain insights into consumer tastes and market trends.

  3. Mobile Phone Location Data: While primarily collected for routing calls, this data can be used for traffic monitoring or location-based advertising.

These examples demonstrate how data collected for one purpose can find valuable secondary applications. This realization has led forward-thinking companies to design their products and systems with potential secondary uses of data in mind.

The Big Data Mindset

Spotting Opportunities in Data

The author introduces the concept of the "big data mindset" – the ability to recognize where available data can be mined for valuable information. He argues that anyone can spot new opportunities to create value from data, even without owning vast amounts of data or possessing advanced analytical skills.

He provides two examples of individuals who successfully leveraged this mindset:

  1. FlightCaster: Bradford Cross and his friends combined publicly available data on flight times and historical weather records to predict flight delays across the US. Their predictions became so accurate that even airline employees began using their site.

  2. Decide.com: This company records billions of price quotes for millions of products from e-commerce sites. By analyzing this data, they not only provide users with the cheapest price but also advise on the best time to buy, predicting future price changes.

These examples show how individuals with a big data mindset can spot opportunities to extract value from available data, creating innovative services and products in the process.

The Power of Data Combination

Synergy in Data Sets

The author draws an analogy to the board game Clue (Cluedo) to illustrate how combining different pieces of information can reveal insights that aren't apparent when looking at each piece in isolation. This principle applies to big data as well – combining different datasets can often create greater value than the sum of their parts.

He provides two examples:

  1. Danish Cancer Study: A research group combined mobile phone user data with cancer patient records to conduct one of the largest studies on the potential link between mobile phone use and cancer. The comprehensive nature of the data allowed them to control for factors like education and income without compromising reliability.

  2. Inrix Traffic Analysis: This company gathers real-time location data from various sources (car manufacturers, commercial fleets, and their own smartphone app) to create accurate, timely data on traffic flows and jams.

These examples demonstrate how combining different datasets can reveal trends and insights that weren't discoverable from the individual datasets alone.
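At its simplest, combining datasets is a merge on a shared key. The hypothetical sketch below (the feeds, road segments, and speeds are invented, not Inrix's actual data) pools speed readings from several independent sources into per-segment estimates that no single source could provide alone:

```python
from collections import defaultdict

# Hypothetical speed feeds from independent sources, keyed by road segment.
feeds = {
    "fleet":     [("I-95", 52), ("I-95", 48), ("RT-1", 30)],
    "carmakers": [("I-95", 50)],
    "phones":    [("RT-1", 28), ("RT-1", 32)],
}

# Merge all readings on the shared segment key.
readings = defaultdict(list)
for source, rows in feeds.items():
    for segment, mph in rows:
        readings[segment].append(mph)

# Average across sources to estimate current traffic speed per segment.
averages = {seg: sum(v) / len(v) for seg, v in readings.items()}
print(averages)  # {'I-95': 50.0, 'RT-1': 30.0}
```

Each feed alone covers only part of the road network; merged, they yield a complete picture – the "greater than the sum of its parts" effect the author describes.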

The Data Exhaust Gold Mine

Recycling User Interactions

The author introduces the concept of "data exhaust" – the trail of data we leave behind as we interact with online services. Smart companies are increasingly capturing and analyzing this data to improve their products and services.

He provides several examples:

  1. Google's Innovations: Google uses data from users' search queries and typos to improve its spell-checker and autocomplete systems.

  2. Facebook's Layout Optimization: By analyzing user behavior, Facebook discovered that users were more likely to post content or reply to posts if they had just seen a friend do so. They used this insight to adjust their layout.

  3. Zynga's Game Refinement: The online gaming company analyzes player behavior to identify and address points where players tend to give up, improving the overall gaming experience.

These examples show how companies can leverage the wealth of data generated by user interactions to continuously refine and enhance their offerings.
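As a rough illustration of the spell-checker idea, imagine query logs where users retype a misspelled search moments later. The pairs and logic below are my own invention, not Google's pipeline; the frequency of each (typo, correction) pair in the exhaust drives the suggestion:

```python
from collections import Counter

# Invented "data exhaust": (misspelled query, corrected retype) pairs.
retype_log = [
    ("britny spears", "britney spears"),
    ("britny spears", "britney spears"),
    ("brittany spears", "britney spears"),
    ("recieve", "receive"),
]

corrections = Counter(retype_log)  # (typo, fix) -> observation count

def suggest(query):
    """Return the most frequently observed correction, or the query itself."""
    candidates = [(count, fix) for (typo, fix), count in corrections.items()
                  if typo == query]
    return max(candidates)[1] if candidates else query

print(suggest("britny spears"))  # britney spears
print(suggest("hello"))          # hello (no correction observed)
```

No dictionary or language model is needed: the users' own behavior, captured as a byproduct, supplies the corrections.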

The Privacy Paradox

Outdated Privacy Protection

The author argues that current privacy laws and anonymization methods are becoming increasingly ineffective in the age of big data. He points out two main issues:

  1. Consent and Purpose Limitation: Current laws require companies to inform users about what data is being collected and for what purpose, and to obtain consent. This approach hinders the discovery of valuable secondary uses for data, as companies would need to seek approval from every user before adopting data for a new purpose.

  2. Re-identification Risk: The detailed nature of big data makes it possible to re-identify individuals from anonymized data sets. The author cites the example of AOL releasing anonymized search terms in 2006, which led to the New York Times successfully identifying one of the users within days.

These issues highlight the need for new approaches to privacy protection that are better suited to the realities of big data.

The Ethics of Prediction

The Minority Report Dilemma

The author draws a parallel between the science fiction movie "Minority Report" and the increasing use of predictive analytics in law enforcement and criminal justice. He points out that while big data facilitates the prediction of criminal behavior, we must be cautious about how we use these predictions.

He provides examples of how predictive analytics are already being used:

  1. Parole Decisions: Many US states use data-analysis-based predictions of a prisoner's chance of re-offending when deciding on parole.

  2. Predictive Policing: Police departments are increasingly using data analysis to allocate resources, often based on profiling of individuals, groups, and neighborhoods.

While these methods can be useful, the author warns against taking them to extremes. He argues that we must never judge or punish someone for what they are predicted to do, only for what they have actually done. To do otherwise would deny individuals their free will and the possibility of moral choice.

The Perils of Data-Driven Decision Making

When Data Leads Us Astray

While big data offers powerful tools for decision-making, the author cautions against becoming overly reliant on data. He identifies several potential pitfalls:

  1. Measuring the Wrong Thing: Quantifying complex phenomena can lead us to focus on metrics that don't truly capture what we intend to measure. The author uses the example of standardized tests in education, which may not fully reflect the range of qualities we expect education to provide.

  2. Unintended Incentives: Misuse of data can incentivize behavior we never intended. Again, standardized tests serve as an example, as their importance has led teachers and students to focus on improving test scores rather than overall education quality.

  3. Relying on Inaccurate Data: Being overly data-driven can lead us to base decisions on biased or unreliable data. The author cites the example of Robert McNamara during the Vietnam War, who became fixated on enemy body count as a measure of progress, despite the unreliability of this data in wartime conditions.

These examples highlight the importance of maintaining perspective and critical thinking when using big data, rather than blindly following what the data seems to tell us.

Conclusion: Navigating the Big Data Revolution

As we move further into the age of big data, it's clear that we're dealing with a fundamentally new phenomenon that requires us to adjust our thinking and approaches. The vast amounts of data being collected, shared, and combined offer unprecedented opportunities for creating value, enhancing products and services, and gaining new insights into the world around us.

However, as the author has shown throughout the book, big data also comes with significant challenges and potential pitfalls. We need to be mindful of privacy concerns, ethical considerations, and the limitations of data-driven decision-making.

The key to navigating this new landscape lies in developing a nuanced understanding of big data – its potential and its pitfalls. We need to embrace the opportunities it offers while remaining critical and thoughtful about how we use it.

As individuals and organizations, we should strive to cultivate a big data mindset – the ability to recognize the potential value in the data around us. At the same time, we must guard against becoming overly reliant on data, always remembering that data should inform our decisions, not make them for us.

The big data revolution is already transforming how we live, work, and think. By understanding its principles and implications, we can harness its power to create a better future while avoiding the potential pitfalls along the way.

Actionable Idea: Extracting Hidden Value from Data

One of the most powerful takeaways from the book is that anyone can create value from big data – you just need to identify the right data and users. Here's how you can put this idea into action:

  1. Inventory Available Data: Start by considering what data you have access to, both in your personal life and professional context. Don't forget about publicly available data, especially online sources.

  2. Think Beyond Primary Uses: For each dataset, try to think of uses that are different from the reason it was initially collected. How could this data be valuable in a completely different context?

  3. Consider Combinations: Think about how different datasets could be combined to reveal new insights. Remember, the value often lies in the unexpected connections between different types of data.

  4. Adopt Different Perspectives: Try to look at the data from the viewpoint of different industries or businesses. How could they benefit from this information?

  5. Identify Potential Users: For each potential use you've identified, think about who would find this information valuable. Could it help consumers make better decisions? Could it help businesses optimize their operations?

  6. Prototype and Test: If you've identified a promising idea, consider creating a simple prototype or proof of concept. This could be as simple as a spreadsheet analysis or a basic web application.

By following these steps, you might discover an innovative way to turn the data around you into a valuable resource. Remember, some of the most successful big data applications came from individuals who simply saw potential where others didn't. With creativity and persistence, you too could uncover the hidden value in the data that surrounds us all.
