Issues with Data Analysis

How you can analyse with bias (spoiler: it’s not good)

Overview

Over the past week in data science we looked at the many ways that data can be misleading, mainly in big data. Data can be inaccurate, or it can be shaped to convey a particular message; this is called bias. Bias in statistics (or in data generally) is not good because it is unethical and it can confuse or mislead people looking at a representation of the data (a representation being one way of showing data).

What was learned

There are a few different types of data bias: response bias, where the data collected does not fairly represent the whole group; selection bias, which is basically a feedback loop; presentation bias, which is about how the data is presented; omitted variable bias, where something important is left out of the data; and societal bias, where the data is based on existing societal structures. All of these are bad and should be minimised, although it is very hard to remove them completely because we are humans.

Response bias, the one where the data itself is untrustworthy, is one of the major ones. An example is “7% of users produce 50% of the posts on Facebook”. This shows how risky it is to assume that the whole of Facebook is a certain way when the data probably comes from only a minor proportion of the users. This bias is mostly found when using data derived from online social platforms, and it is hard to combat.
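To check my understanding of the Facebook example, here is a minimal sketch. All of the numbers and the `likes_feature` field are made up for illustration; the only thing taken from the example is the idea that 7% of users produce 50% of the posts.

```python
# Hypothetical platform with 100 users, 7 of them "heavy" posters.
# Post counts are invented so that the heavy posters write half of all posts.
users = []
for i in range(100):
    heavy = i < 7
    users.append({
        "likes_feature": heavy,       # assumption: only heavy posters like it
        "posts": 93 if heavy else 7,  # 7 * 93 = 651 posts vs 93 * 7 = 651 posts
    })

total_posts = sum(u["posts"] for u in users)
posts_in_favour = sum(u["posts"] for u in users if u["likes_feature"])
users_in_favour = sum(1 for u in users if u["likes_feature"])

# What you would conclude by counting posts versus counting people.
share_by_posts = posts_in_favour / total_posts
share_by_users = users_in_favour / len(users)

print(f"Share of posts in favour: {share_by_posts:.0%}")
print(f"Share of users in favour: {share_by_users:.0%}")
```

Counting posts makes it look like half of the platform is in favour, while counting people shows it is only the small group of heavy posters.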

Selection bias is the one where a feedback loop is created because a system uses its own output to create new data. An example is the YouTube algorithm: when a user goes to the next recommended video enough times, the algorithm ends up learning from its own recommendations, so it can get confused and is not as good as when the user interacts more randomly. It is also hard to combat, but it can be dealt with by showing a message to the affected user once the system thinks it is required.
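The feedback loop can be sketched with a toy recommender. This is not how YouTube actually works; the `simulate` function, the categories and the starting click counts are all assumptions made up for illustration.

```python
from collections import Counter

def simulate(rounds, follow_recommendation):
    """Return click counts per category after `rounds` views."""
    categories = ["music", "cooking", "science"]
    # Made-up viewing history with a very small lead for "music".
    clicks = Counter({"music": 2, "cooking": 1, "science": 1})
    for step in range(rounds):
        # Toy recommender: always suggest the most-clicked category so far.
        recommended = clicks.most_common(1)[0][0]
        if follow_recommendation:
            watched = recommended  # user just plays whatever comes next
        else:
            watched = categories[step % len(categories)]  # user varies picks
        clicks[watched] += 1  # the click becomes new training data
    return clicks

autoplay = simulate(30, follow_recommendation=True)
varied = simulate(30, follow_recommendation=False)

print("Following recommendations:", dict(autoplay))
print("Varied interactions:      ", dict(varied))
```

When the user only follows recommendations, the small starting lead for one category turns into almost all of the data, even though nothing says the user actually prefers it.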

Presentation bias is the easiest way to stay accurate to the data while still misleading people into thinking something. An example is using font and highlighting to draw attention to less important things (and hide the important data). This is a simple fix for the creator: just don’t do it (even though some of it may be unintentional).

Omitted variable bias, aka leaving out data that is needed to prove something. It can happen when data is inputted by humans and events or actions are not recorded due to privacy concerns or lack of access (or just not realising they’re important). An example of this is the probability of death (POD) AI made by a hospital: it didn’t include the treatment that had been given when making up its mind about who needs the ventilator.
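A rough sketch of how leaving out treatment could mislead a model like this. The patient groups and death counts below are invented, and `death_rate` is a hypothetical helper, not anything from the real hospital system.

```python
# (severe, treated, number_of_patients, number_who_died), all made-up figures
groups = [
    (True, True, 40, 4),     # severe cases that received treatment
    (True, False, 10, 8),    # severe cases that did not
    (False, False, 50, 10),  # mild cases, none treated
]

def death_rate(rows):
    """Deaths divided by patients across the given groups."""
    patients = sum(r[2] for r in rows)
    deaths = sum(r[3] for r in rows)
    return deaths / patients

# Treatment omitted: only severity is visible to the analysis.
severe_all = death_rate([g for g in groups if g[0]])
mild_all = death_rate([g for g in groups if not g[0]])

# Treatment included: look at severe cases that were not treated.
severe_untreated = death_rate([g for g in groups if g[0] and not g[1]])

print(f"Severe (treatment ignored): {severe_all:.0%}")
print(f"Mild (treatment ignored):   {mild_all:.0%}")
print(f"Severe and untreated:       {severe_untreated:.0%}")
```

With treatment left out, severe and mild patients look almost equally at risk, so severe patients would not seem to need the ventilator much more. Once treatment is included, it is clear that severe patients only did well because they were treated.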

And societal bias is making a decision based on the current ways of doing things. An example is where Amazon made an AI to determine who should be interviewed, but it discriminated against women because the company had a male-dominated workplace at the time. It is probably the hardest to combat because there needs to be some manual intervention to remove the bias.
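Here is a small sketch of how a model could copy a bias from past decisions. This is not Amazon’s actual system; the `history` records and the `score` function are made up, and dropping the gender column is just one assumed form of manual intervention (on real data it may not be enough by itself, since other fields can hint at gender).

```python
# (gender, qualified, applicants, interviewed), made-up historical records
history = [
    ("male", True, 80, 60),
    ("female", True, 20, 5),
    ("male", False, 40, 4),
    ("female", False, 10, 0),
]

def score(gender, qualified, use_gender):
    """Interview rate among similar past applicants (a toy 'model')."""
    rows = [
        r for r in history
        if r[1] == qualified and (not use_gender or r[0] == gender)
    ]
    return sum(r[3] for r in rows) / sum(r[2] for r in rows)

# Model that learns from gender: equally qualified people get different scores.
biased_m = score("male", True, use_gender=True)
biased_f = score("female", True, use_gender=True)

# Manual intervention: gender is removed from what the model can see.
fair_m = score("male", True, use_gender=False)
fair_f = score("female", True, use_gender=False)

print(f"With gender:    male {biased_m:.2f}, female {biased_f:.2f}")
print(f"Without gender: male {fair_m:.2f}, female {fair_f:.2f}")
```

The toy model is not deciding anything new; it just repeats the pattern already in the old data, which is why someone has to step in manually.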

Reflection

What’s the purpose of this?

My best understanding of why this is being taught is that when we are analysing data (this is data science, after all), we need to make sure that none of the biases above are present in any form, or at least minimise them as much as possible. This allows for better analysis and makes the work less open to critique, because its conclusions are sensible and rest on accurate data. It is also good to do so I don’t lose easy marks on the assessments.

Did you understand the task given?

It did take some time to understand the differences between the biases. Some of them are similar to each other (like response bias and omitted variable bias), so learning the differences between all of them was necessary. Luckily, for the most part I now know how to differentiate them, but I might still struggle if asked to explain them without any notes. I can improve on this by studying them and by doing the task where I have to provide examples of them.

Were you understood when you showed what you meant?

When I was trying to understand the concepts (since I didn’t fully understand them), I tried to explain my thinking to others in the class to see if I actually knew them. It was hard to explain, but I think people did understand me. I would say I also tried to help people who didn’t know the concepts to understand them. I think this approach was a good idea, and I will try to use it if and when I get stuck trying to understand something in IT.