Big Data Types

The different things that make up big data

The holidays unfortunately ended last week, but in better news we have started learning about big data, and specifically its types. These are ways of classifying data at scale; they help with working out the basic requirements to start a project, and they're just plain interesting. Even without knowing the terms, most people could roughly describe data this way.

Now, that is enough putting off saying what the different types are. To determine whether big data is actually “big”, we can use a set of measures commonly called the “4 Vs”: the Volume, Velocity, Variety, and Veracity of the data.

The volume of data is mostly about how much raw data is accessible. We want as much data as possible so that the visualisations (or whatever else you are doing with the data) are as accurate as they can be. A rule of thumb is “more data is better”.
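As a quick illustration, volume can be sized up before any analysis even starts. This is a minimal sketch, assuming a hypothetical sales.csv file on disk (the file name is made up for the example):

```python
import os

# Hypothetical file, used only for illustration.
PATH = "sales.csv"

# Raw size on disk is the bluntest measure of volume.
size_mb = os.path.getsize(PATH) / 1e6

# Row count is another simple measure (minus 1 for the header).
with open(PATH) as f:
    rows = sum(1 for _ in f) - 1

print(f"{size_mb:.1f} MB across {rows} rows")
```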

The velocity of data is how fast it can be acquired and how fresh it is. The last thing anyone wants is to sit around waiting for data to download (in other words, time when you can't be analysing it). Velocity also includes processing time, so on a potato of a computer the velocity will be low, while on a supercomputer it would most likely be high. How real-time the data is also contributes significantly: if the data is from 10 years ago and describes things that have changed a lot since, it isn't very real-time. A rule of thumb is “the faster we can get data, the better”.
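One rough way to put a number on the acquisition side of velocity is simply to time the download. A minimal sketch, assuming a stand-in URL (not a real dataset):

```python
import time
import urllib.request

# Stand-in URL for illustration; swap in a real dataset.
URL = "https://example.com/data.csv"

start = time.perf_counter()
with urllib.request.urlopen(URL) as resp:
    data = resp.read()
elapsed = time.perf_counter() - start

# Bytes per second is a crude but honest velocity measure.
print(f"{len(data)} bytes in {elapsed:.2f}s "
      f"({len(data) / elapsed / 1e6:.2f} MB/s)")
```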

The variety of data is about how many kinds of data you have (the different headings, e.g. sales data or profiles). More variety opens up more opportunities for data manipulation, and that is basically what data science is about. There are primarily two types, which describe how usable the data is: structured data and unstructured data. Structured data is easy for machines to use (e.g. spreadsheets or JSON), while unstructured data is a pain for a computer to utilise because it needs some ML/AI to interpret (e.g. wikis or Word documents). A rule of thumb is “more types of data make better data”.
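To make the structured/unstructured split concrete, here is a small sketch with made-up numbers (using pandas): the structured version answers a question in one line, while the unstructured version would need parsing or NLP first.

```python
import pandas as pd

# Structured: rows and columns a machine can query directly.
structured = pd.DataFrame([
    {"product": "widget", "sales": 120},
    {"product": "gadget", "sales": 80},
])
print(structured["sales"].sum())  # 200 -- trivial to compute

# Unstructured: the same facts buried in free text. A program
# needs parsing (or some ML/NLP) before it can answer the
# same "total sales?" question.
unstructured = "We sold 120 widgets and 80 gadgets last month."
```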

The veracity of data is how “clean” it is; bias and noise are the main things that make data “unclean”. Bias is bad because data is meant to be factual: if it carries someone's opinion (skewed data that someone could have manually changed), then any visualisations built on it will be misleading. Noise covers outliers, incomplete records, and irrelevant data. Noise has to be “cleaned” out before use, while the only real fix for bias is getting a new dataset. A rule of thumb is “clean data before use”.
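Here is what a basic noise clean-up might look like in pandas. This is a sketch with made-up numbers, and the 1.5 × IQR outlier rule is just one common choice, not the only way to do it:

```python
import pandas as pd

# Toy column with both kinds of noise mentioned above:
# a missing value and an obvious outlier.
df = pd.DataFrame({"sales": [10, 12, None, 11, 9, 500]})

# Drop incomplete records.
clean = df.dropna()

# Drop outliers using the 1.5 * IQR rule.
q1, q3 = clean["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

print(clean)  # the missing row and the 500 outlier are gone
```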


This week I learnt the different ways to determine whether big data is actually big. This was relatively basic and wasn't that hard to wrap my head around, and I think everyone else in my class had a similar opinion. However, there are still ways for me to improve: instead of having a mindset of understanding something and then leaving it, I should study it and extend it further. That way I will retain it better and will be able to help others in class. Speaking of helping others, I would say that I distracted more than I helped this week. That is unfortunate, but it should be a learning opportunity to do better. I was probably distracting because it was the first week back and I wasn't as bothered as I should have been about completing class work; this week I will strive to be better. My mindset also needs to change back to what it was pre-holidays, because without a good mindset no work can be done well. It slipped because I got lazy over the break and didn't do much work, so naturally it wants to stay that way; hopefully this week it'll be better.

TL;DR: New semester, started learning about big data (the 4 Vs), and I will try to be better next week.