Ranting about “big data”

I’m way behind on my blog writing, and this turned into quite a rant with not much structure. But it’s all I’ve got, and having missed Thursday’s post entirely, and failed to do any reviews like I promised, I feel I have to put SOMETHING up. So here is my “big data” rant:

Last week, I wrote about what happens when researchers study sexual arousal, and how it causes me some issues. I wrote:

I have a more general nightmare, related to “big data”, surveillance society, and so on, of a world in which I will say I feel one way and the computers will say I am lying or mistaken, and everyone else will believe the computer and not me. My internal feelings and emotions will be irrelevant because they “know” that I will like X and dislike Y, even if I repeatedly say I hate X and love Y, so I will forever be condemned to X.

Big data appears to refer to the myriad ways in which the processing capacity of modern computers allows for the mass accumulation of information about populations – whether animal, human, or mechanical – and its analysis to demonstrate statistically significant patterns in behaviours and outcomes. It was mistakenly reported that engine data from flight MH370 was being transmitted back to Boeing several hours after the plane disappeared from air traffic control radar (it was a different kind of signal altogether). But it makes topical the fact that Boeing use the data received from every engine on every plane they have built in recent years to analyse and anticipate general or specific problems before they become serious enough to cause a disaster, and thus improve air travel safety immensely.

A more immediate form is the “people who looked at this item bought these items” section on an Amazon product page, or their “your recommendations” page, based on previous purchases, wishlists and ratings.

There are all sorts of applications, including surveillance society issues (such as the NSA data-gathering that’s also been a hot topic in recent months) and even translation software. There is an ongoing tendency to try to predict people’s behaviour based on how other people similar to them in various ways have behaved in the past.

Thus, Amazon often suggests to me things that I know I don’t like, because other people who liked some of the things that I did like, also liked the thing I don’t. Sometimes it does lead to a revelation or new discovery, but more often it is simply unhelpful. Back when I used Google’s services, I discovered that, based on the websites I frequented, they believed me to be ten years younger, and female. One of the big reasons I switched to DuckDuckGo as my search engine of choice is that they do not track my searches or site visits, and do not attempt to prioritise results based on whether they think I will like or agree with those sites. Google does do that, and it is most worrying to me.
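To make the "people who liked this also liked that" logic concrete, here is a minimal sketch of a co-purchase recommender. It is not Amazon's actual method, which I don't know; the function name `recommend`, the shoppers and the albums are all invented for illustration.

```python
from collections import Counter


def recommend(purchases, user, top_n=3):
    """Suggest items owned by people who share at least one purchase with `user`.

    `purchases` maps a (hypothetical) shopper name to the set of items they bought.
    """
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other == user:
            continue
        overlap = len(mine & items)  # how similar this shopper looks to me
        if overlap == 0:
            continue
        # Everything they own that I don't gets credit, weighted by the overlap.
        for item in items - mine:
            scores[item] += overlap
    return [item for item, _ in scores.most_common(top_n)]


# Invented example data.
purchases = {
    "me": {"folk album", "jazz album"},
    "a": {"folk album", "pop album"},
    "b": {"jazz album", "pop album"},
    "c": {"metal album"},
}

print(recommend(purchases, "me"))  # ['pop album']
```

Nothing in the sketch ever asks whether I like pop. The suggestion comes entirely from what the people who resemble me bought, and the metal album never shows up at all because nobody who resembles me owns it.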

As I wrote last week, this leads me to imagine a future in which niches and social groupings are ever more rigorously policed and enforced by the range of options that people are offered on the basis of their similarities with others. We already see how gender, age, disability, sexuality and race affect how people’s choices are limited or bounded by social perceptions and norms. In some ways it is possible to push against that, because we can at least see what options those outside of our group are offered. But suppose your online services “know” that girls like pink, and know you are a girl? What happens if, for every toy you search for, the results come back with dolls and pink things? If you want a blue toy truck, you get Barbie’s beach buggy and some story about how she likes the blue of the sea? What happens when you don’t even know that there could be a toy that was blue, or a truck, because that option is never visible to you?

It’s an extreme example, perhaps, but this seems to be where the concept and uses of big data are leading, as they have been put into practice. We don’t get to know what the options are because some options are presumed to appeal to us more than the alternatives, and unless we dig deeper we might not find out that those alternatives exist.

One of the big reasons I buy music from charity shops is that the range offered is inherently random, and manageable in size. If I go to a record shop, then the albums are listed by artist name, and segregated by genres, so that there is no chance of being surprised by something I’ve never heard of and just thinking, “I wonder what that sounds like!” or simply, “What a cool title/cover/concept!” But flicking through the second-hand CDs in a charity shop, this is precisely what happens, because the CDs are jumbled, at best segregated by “classical” and “other”, so there will be folk, heavy metal, jazz and hiphop side-by-side in the same rack. I’ve listened to music in all of those genres, but if it were left to big data, it’s doubtful I would have the opportunity to discover some artists who don’t sit comfortably in the niches, or who are just not well-known enough to register in the statistics. In a charity shop, I get to be surprised and delighted by something new and unexpected. And when I discover an artist that way, I tend to go out and look for them online, in record shops, wherever they might be. If it weren’t for the charity shop, I never would.

I distrust big data. While I believe that there are valuable lessons to be learned from statistical correlations in populations, I believe that too often more is claimed for these relationships than is reasonable. I certainly believe that humans are prone to deviation, which is to say that there is nearly always an exception to the rule: people who do not fit with the neat theory that supposedly explains what “people” are like. In various ways, I have almost become accustomed to not fitting in like this.

I believe that a computer that wanted to predict my future behaviour based on my past choices might very well be able, with sufficient access to my previous actions and outcomes, to do so to a reasonable degree of accuracy. But otherwise, it seems like an “emperor of China’s nose fallacy” – which is that no one has ever seen the face of the emperor, so the best way to get an accurate estimate of the length of his nose is to ask many people how long they think it is. You will then have a very precise average answer. The trouble is, at no point is there any factual basis for the answer that relates to the actual nose – it’s just a lot of wild guesses. It will be more useful for figuring out the average length of the noses of the population, rather than the actual length of one specific nose. Similarly, it seems to me that in using big data to predict an individual’s choices based on the behaviours of other people, one is merely creating a very precise average that has no connection to the actual situation, which in this case is the person’s genuine state of mind.
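The nose fallacy is easy to simulate. This is a rough sketch with invented numbers: I assume the emperor's nose is really 8 cm long, while the guessers, who have never seen him, guess around a typical 5 cm nose. The function name `survey_nose` and every parameter are made up for the illustration.

```python
import random
import statistics


def survey_nose(true_length_cm, n_guesses, typical_cm=5.0, spread_cm=1.0, seed=0):
    """Average many blind guesses at a nose length nobody has seen.

    Returns (mean guess, standard error of the mean, distance from the truth).
    All numbers are invented assumptions, not data.
    """
    rng = random.Random(seed)
    # The guesses centre on a typical nose, because that is all the guessers know.
    guesses = [rng.gauss(typical_cm, spread_cm) for _ in range(n_guesses)]
    mean = statistics.fmean(guesses)
    std_error = statistics.stdev(guesses) / n_guesses ** 0.5
    return mean, std_error, abs(mean - true_length_cm)


for n in (100, 10_000):
    mean, std_error, miss = survey_nose(true_length_cm=8.0, n_guesses=n)
    print(f"{n:>6} guesses: mean {mean:.2f} cm, +/- {std_error:.3f}, off by {miss:.2f} cm")
```

Asking a hundred times as many people shrinks the "plus or minus" figure by roughly a factor of ten, but the answer stays about 3 cm away from the real nose. More data buys precision, not accuracy.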

For all the similarities that a person may have to members of a specific population, there will always be specific elements by which they differ (even if it’s only to deny that they’re an individual – obligatory Monty Python reference!) Sometimes these will be significant, and sometimes they won’t, but only that individual knows which.


About ValeryNorth

I overthink everything.
This entry was posted in Economics, Philosophy, Politics, Science.
