Big Data : When data gets big, it turns out to be a serious issue

8 min readSep 16, 2020

What is Big Data?

The term “big data” refers to data that is so large, fast or complex that it’s difficult or impossible to process using traditional methods. The act of accessing and storing large amounts of information for analytics has been around a long time. But the concept of big data gained momentum in the early 2000s when industry analyst Doug Laney articulated the now-mainstream definition of big data as the three V’s:

Volume: Organizations collect data from a variety of sources, including business transactions, smart (IoT) devices, industrial equipment, videos, social media and more. In the past, storing it would have been a problem — but cheaper storage on platforms like data lakes and Hadoop have eased the burden.

Velocity: With the growth in the Internet of Things, data streams in to businesses at an unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart meters are driving the need to deal with these torrents of data in near-real time.

Variety: Data comes in all types of formats — from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audios, stock ticker data and financial transactions.

Talking about the three V’s

VOLUME

Volume is the V most associated with big data because, well, volume can be big. What we’re talking about here is quantities of data that reach almost incomprehensible proportions.

Facebook, for example, stores photographs. That statement doesn’t begin to boggle the mind until you start to realize that Facebook has more users than China has people. Each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.

Can you imagine? Seriously. Go ahead. Try to wrap your head around 250 billion images. Try this one. As far back as 2016, Facebook had 2.5 trillion posts. Seriously, that’s a number so big it’s pretty much impossible to picture.

So, in the world of big data, when we start talking about volume, we’re talking about insanely large amounts of data. As we move forward, we’re going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.

How much will it add up? Consider this. Gartner, Cisco, and Intel estimate there will be between 20 and 200 (no, they don’t agree, surprise!) connected IoT devices, the number is huge no matter what. But it’s not just the quantity of devices.

Consider how much data is coming off of each one. I have a temperature sensor in my garage. Even with a one-minute level of granularity (one measurement a minute), that’s still 525,950 data points in a year, and that’s just one sensor. Let’s say you have a factory with a thousand sensors, you’re looking at half a billion data points, just for the temperature alone.

Or, consider our new world of connected apps. Everyone is carrying a smartphone. Let’s look at a simple example, a to-do list app. More and more vendors are managing app data in the cloud, so users can access their to-do lists across devices. Since many apps use a freemium model, where a free version is used as a loss-leader for a premium version, SaaS-based app vendors tend to have a lot of data to store.

Todoist, for example (the to-do manager I use) has roughly 10 million active installs, according to Android Play. That’s not counting all the installs on the Web and iOS. Each of those users has lists of items — and all that data needs to be stored. Todoist is certainly not Facebook scale, but they still store vastly more data than almost any application did even a decade ago.

Then, of course, there are all the internal enterprise collections of data, ranging from energy industry to healthcare to national security. All of these industries are generating and capturing vast amounts of data.

That’s the volume vector.

VELOCITY

Remember our Facebook example? 250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a drop in the bucket in a few months.

Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day. It has to ingest it all, process it, file it, and somehow, later, be able to retrieve it.

Here’s another example. Let’s say you’re running a marketing campaign and you want to know how the folks “out there” are feeling about your brand right now. How would you do it? One way would be to license some Twitter data from Gnip (acquired by Twitter) to grab a constant stream of tweets, and subject them to sentiment analysis.

That feed of Twitter data is often called “the firehose” because so much data (in the form of tweets) is being produced, it feels like being at the business end of a firehose.

Here’s another velocity example: packet analysis for cyber security. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into a corporate network.

Unfortunately, due to the rise in cyber attacks, cyber crime, and cyber espionage, sinister payloads can be hidden in that flow of data passing through the firewall. To prevent compromise, that flow of data has to be investigated and analyzed for anomalies, patterns of behavior that are red flags. This is getting harder as more and more data is protected using encryption. At the very same time, bad guys are hiding their malware payloads inside encrypted packets.

That flow of data is the velocity vector.

VARIETY

You may have noticed that I’ve talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these are very different from each other. This data isn’t the old rows and columns and database joins of our forefathers. It’s very different from application to application, and much of it is unstructured. That means it doesn’t easily fit into fields on a spreadsheet or a database application.

Take, for example, email messages. A legal discovery process might require sifting through thousands to millions of email messages in a collection. Not one of those messages is going to be exactly like another. Each one will consist of a sender’s email address, a destination, plus a time stamp. Each message will have human-written text and possibly attachments.

Photos and videos and audio recordings and email messages and documents and books and presentations and tweets and ECG strips are all data, but they’re generally unstructured, and incredibly varied.

All that data diversity makes up the variety vector of big data.

Examples of Big Data

The New York Stock Exchange generates about one terabyte of new trade data per day.

Social Media

The statistic shows that 500+terabytes of new data get ingested into the databases of social media site Facebook, every day. This data is mainly generated in terms of photo and video uploads, message exchanges, putting comments etc.

A single Jet engine can generate 10+terabytes of data in 30 minutes of flight time. With many thousand flights per day, generation of data reaches up to many Petabytes.

How Google Solved Big Data Problem?

This problem tickled google first due to their search engine data, which exploded with the revolution of the internet industry. And it is very hard to get any proof of it that its internet industry. They smartly resolved this difficulty using the theory of parallel processing. They designed an algorithm called MapReduce. This algorithm distributes the task into small pieces and assigns those pieces to many computers joined over the network, and assembles all the events to form the last event dataset.

Well, this looks logical when you understand that I/O is the most costly operation in data processing. Traditionally, database systems were storing data into a single machine, and when you need data, you send them some commands in the form of SQL query. These systems fetch data from the store, put it in the local memory area, process it and send it back to you. This is the real thing which you could do with limited data in control and limited processing capability.

But when you see Big Data, you cannot collect all data in a single machine. You MUST save it into multiple computers (maybe thousands of devices). And when you require to run a query, you cannot aggregate data into a single place due to high I/O cost. So what MapReduce algorithm does; it works on your query into all nodes individually where data is present, and then aggregate the final result and return to you.

It brings two significant improvements, i.e. very low I/O cost because data movement is minimal; and second less time because your job parallel ran into multiple machines into smaller data sets.

Thanks for Reading

HAPPY READING!