Defining Big Data
There is no single definition of Big Data. In this lesson, we will try to explain what is usually understood by “Big Data” and how it can help to solve business problems.
Wikipedia defines Big Data as “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications”.
In 2011, McKinsey defined Big Data as “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
Thus, Big Data is a term that refers to the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
However, the relevance and importance of Big Data do not lie in its size, but in its value and in the insights that companies can derive from the data to make better strategic decisions.
Watch this 2-minute video by Bernard Marr called “What is Big Data” to learn more.
Dimensions of Big Data
When thinking about Big Data we often just associate the term with data that is big in size. This is data that occupies petabytes* instead of megabytes.
But this is a very simplistic view of Big Data – it is much more than just lots of data.
*A petabyte (PB) is equivalent to 1,000 terabytes (TB) or 1,000,000 gigabytes (GB).
So to understand the real value and meaning of Big Data, let’s look at Cesar Hidalgo’s definition of Big Data. Cesar Hidalgo is an Associate Professor of Media Arts and Sciences at MIT and the Director of the Macro Connections group at The MIT Media Lab.
According to Hidalgo, Big Data needs to be big in three different ways. He describes three dimensions that define Big Data:
1) Big in Size
This is the simplest criterion and the best known one. Data has to be big in size.
So how does a bank have data that is big in size? A bank holds financial data on millions of customers, rather than on just a small sample of a few hundred people.
2) Big in Resolution
We may have data from millions of individuals, but if that data is highly aggregated it will not be very meaningful and cannot be considered Big Data.
For example, suppose we have financial data from bank customers but only know each customer’s balance at the end of the year – this information is of limited use, and the insights we can derive from it will be correspondingly limited. A bank, however, does have Big Data in resolution, because it records every transaction at a very fine level of detail. Each record tells you who made the transaction, at what time, of what type, where it occurred, and so on.
So if we have each transaction with spatial detail down to its coordinates, that is high resolution. If we instead have the transactions aggregated by zip code, the data has lower resolution.
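The contrast between high and low resolution can be made concrete with a small sketch. The code below uses invented sample data (the transaction records, field names, and zip codes are all hypothetical, not from any real bank) to show how aggregating the same records by zip code discards the per-transaction detail:

```python
from collections import defaultdict

# Hypothetical fine-grained transaction records: each row keeps who, when,
# where (coordinates and zip code), what kind of transaction, and how much.
transactions = [
    {"customer": "A", "time": "2024-01-05T09:14", "lat": 40.7128, "lon": -74.0060,
     "zip": "10007", "type": "card", "amount": 42.50},
    {"customer": "A", "time": "2024-01-05T18:02", "lat": 40.7306, "lon": -73.9352,
     "zip": "11211", "type": "atm", "amount": 100.00},
    {"customer": "B", "time": "2024-01-06T12:40", "lat": 40.7128, "lon": -74.0060,
     "zip": "10007", "type": "card", "amount": 15.25},
]

# High resolution: every individual transaction is available, each with
# full detail (customer, timestamp, coordinates, type, amount).
high_res_rows = len(transactions)

# Lower resolution: the same data reduced to one total per zip code.
# Who, when, and what type of transaction can no longer be recovered.
totals_by_zip = defaultdict(float)
for t in transactions:
    totals_by_zip[t["zip"]] += t["amount"]

print(high_res_rows)        # 3
print(dict(totals_by_zip))  # {'10007': 57.75, '11211': 100.0}
```

Both versions describe the same underlying activity, but only the first supports the kinds of questions – who transacted, where, and when – that make the data valuable.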
3) Big in Scope
This is probably the most overlooked dimension, yet the most important one. Big in scope means that the data can yield insights beyond the purpose for which it was originally collected.
There are probably many sources of data that are big in resolution and size but not necessarily in scope.
There is so much more to learn about Big Data. What are the 4 V’s of Big Data? What software is commonly used to process Big Data? If you’re interested in understanding the basics of Big Data, along with 11 other core areas of business administration, check out The Lean MBA program.
This is an excerpt from a lesson in the Statistics & Big Data Module of The Lean MBA program.