Mark Zuckerberg’s appearance before Congress is a good example of the extent to which politicians and regulators have no idea, to quote The Donald, of “what on earth is going on”. It is not just them, this lack of understanding extends into the communities of thought and opinion framed by academia and journalism. This is a problem, because it means we have not yet identified the questions we need to be asking or the problems we need to be solving. If we think we are going to achieve anything by hauling Mark Zuckerberg over the coals, or telling Facebook to “act on data privacy or face regulation”, we have another think coming.
This is my attempt to provide that think.
The Google search and anonymity problem
Let’s start with Google Search. Imagine you sit down at a computer in a public library (i.e. a computer that has no data history associated with you) and type a question into Google. In this situation you are reasonably anonymous, especially if we imagine that the computer has a browser that isn’t tracking track search history. Despite this anonymity, Google can serve you up an answer that is incredibly specific to you and what it is you are looking for. Google knows almost nothing about you, yet is able to infer a very great deal about you – at least in relation to the very specific task associated with finding an answer to your question. It can do this because it (or its algorithms) ‘knows’ a very great deal about everyone or everything in relation to this specific search term.
So what? Most people sort of know this is how Google works and understand that Google uses data derived from how we all use Google, to make our individual experiences of Google better. But hidden within this seemingly benign and beneficial use of data is the same algorithmic process that could drive cyber warfare or mass surveillance. It therefore has incredibly important implications for how we think about privacy and regulation, not least because we have to find a way to outlaw the things we don’t like, while still allowing the things that we do (like search). You could call this the Google search problem or possibly the Google anonymity problem, because it demonstrates that in the world of the algorthm, anonymity has very little meaning and provides very little defence.
The Stasi problem
When you frame laws or regulations you need to start with defining what sort of problem you are trying to solve or avoid. To date the starting point for regulations on data and privacy (including the GDPR – the regulation to come) is what I call the STASI problem. The STASI was the East German Security Service and it was responsible for a mass surveillance operation that encouraged people to spy on each other and was thus able to amass detailed data files on a huge number of its citizens. The thinking behind this, and indeed the thinking applied to the usage of personal data everywhere in the age before big data, is that the only way to ‘know’ stuff about a person is to collect as much information about them as possible. The more information you have, the more complete the story and the better your understanding. At the heart of this approach is the concept that there exists in some form a data file on an individual which can be interrogated, read or owned.
The ability of a state or an organisation to compile such data files was seen as a bad thing and our approach to data regulation and privacy has therefore been based on trying to stop this from happening. This is why we have focused on things like anonymity, in the belief that a personal data file without a name attached to it becomes largely useless in terms of its impact on the individual to whom the data relates. Or we have established rights that allow us to see these data files, so that we can check that they don’t contain wrong information or give us the ability to edit, correct or withdraw information. Alternatively, regulation has sought to establish rights for us to determine how our the data in the file data is used or for us to have some sort of ownership or control over that data, wherever they may be held.
But think again about the Google search example. Our anonymity had no material bearing on what Google was able to do. It was able to infer a very great deal about us – in relation to a specific task – without actually knowing anything about us. It did this because it knew a lot about everything, which it had gained from gathering a very small amount of data from a huge number of people (i.e. everyone who had previously entered that same search term). It was analysing data laterally, not vertically. This is what I call Google anonymity, and it is a key part of Google’s privacy defence when it comes to things such as gmail. If you have a gmail account, Google ‘reads’ all your emails. If you have Google keyboard on your mobile, Google ‘knows’ everything that you enter into your mobile (including the passwords to your bank account) – but Google will say that it doesn’t really know this because algorithmic reading and knowledge is a different sort of thing. We can all swim in a sea of Google anonymity right up until the moment a data fisherman (such as a Google search query) gets us on the hook.
The reason this defence (sort of) stacks-up is that Google can only really know your bank account password is if it analyses your data vertically. The personal data file is a vertical form of data analysis. It requires that you mine downwards and digest all the data to then derive any range of conclusions about the person to whom that data corresponds. It has its limitations, as the Stasi found out, in that if you collect too much data you suffer from data overload. The bigger each file becomes the more cumbersome it is to read or digest the information that lies within it. It is a small data approach. Anyone who talks about data overload or data noise is a small data person.
Now while it might have been possible to get the Stasi to supply all the information it has on you, the idea that you place the same requirement on Google is ridiculous. If I think about all the Google services I use and the vast amount of data this generates, there is no way this data could be assembled into a single file and even if it could, it would have no meaning, because the way it would be structured has no relevance to the way in which Google uses this data. Google already has vastly more data on me than the biggest data file the STASI ever had on a single individual. But this doesn’t mean that Google actually knows anything about me as an individual. I still have a form of anonymity, but this anoymity is largely useless because it has no bearing on the outcomes that derive from the usage of my data.
The KitKat Problem
Algorithms don’t suffer from data overload, not just because of the speed at which they can process information but because they are designed to create shortcuts through the process of correlations and pattern recognition. One of the most revealing nuggets of information within Carole Cadwalladr’s expose of the Facebook / Cambridge Analytica ‘scandal’ was the fact that a data agency of the like of Cambridge Analytica working for a state intelligence service had discovered a correlation between people who self-confess to hating Israel and a tendency to like Nike trainers and KitKats. This exercise, in fact, became known as Operation KitKat. To put it another way, with an algorithm it is possible to infer something very consequential about someone (that they hate Israel) not by a detailed analysis of their data file, but by looking at consumption of chocolate bars. This is an issue I first flagged back in 2012.
I think this is possibly the most important revelation of the whole saga, because, as with the Google search example it cuts right to the heart of the issue and exposes the extent to which our current definition of the problem is so misplaced. We shouldn’t be worrying about the STASI problem, we should be worried about the KitKat problem. Operation KitKat demonstrates two of the fundamental characteristics of algorithmic analysis (or algorithmic surveillance). First, not only can you derive something quite significant about a person based on data that has nothing whatsoever to do with what it is you are looking for. Second, algorithms can tell you what to do (discover haters of Israel by looking at chocolate and trainers) without the need to understand why this works. An algorithm cannot tell you why there is a link between haters of Israel and KitKats. There may not even be reason that makes any sort of sense. Algorithms cannot explain themselves, they leave no audit trail – they are the classic black box.
The reason this is so important is that it drives a cart and horse through any form of regulation that tries to establish a link between any one piece of data and the use to which that data is then put. How could one create a piece of legislation that requires manufacturers or retailers of KitKats to anticipate (and otherwise encourage or prevent) data about their product being used to identify haters of Israel? It also scuppers the idea that any form protection can be provided through the act of data ownership. You cannot make the consumption of a chocolate bar or the wearing of trainers a private act, the data on which is ‘owned’ by the people concerned.
KitKats and trainers bring us neatly to the Internet of Things. Up until now we have been able to assume that most data is created by, or about, people. This is about to change as the amount of data produced by people is dwarfed by the amount of data produced by things. How do we establish rules about data produced by things especially when it is data about other things? If your fridge is talking to your lighting about your heating thermostat, who owns that conversation? There is a form of Facebook emerging for objects and it is going to be much bigger than the Facebook for people.
Within this world the concept of personal data as a discrete category will melt away and instead we will see the emergence of vaste new swathes of data, most of which is entirely unregulatable or even unownable.
The digital caste problem
A recent blog post by Doc Searls has made the point that what Facebook has been doing is simply the tip of an iceberg, in that all online publishers or any owners of digital platforms are doing the same thing to create targeted digital advertising opportunities. However, targeted digital advertising is itself the tip of a much bigger iceberg. One of Edward Snowden’s Wikileaks exposures concerned something known as Operation Prism. This was (probably still is) a programme run by the NSA in the US that involves the abilty to hoover-up huge swathes of data from all of the world’s biggest internet companies. Snowden also revealed that the UK’s GCHQ is copying huge chunks of the internet by accessing the data cables that carry internet traffic. This expropriation of data is essentially the same as Cambridge Analytica’s usage of the ‘breach’ of Facebook data, except on a vastly greater scale. Cambridge Analytica used their slice of Facebook to create a targeting algorithm to analyse political behaviour or intentions, whereas GCHQ or the NSA can use their slice of the internet to create algorithms that analyse the behaviour or intentions of all of us about pretty much anything. Apparently GCHQ only holds the data it copies for a maximum of 30 days, but once you have built your algorithms and are engaged in a process of real-time sifting, the data that you used to build the agorithm in the first place, or the data that you then sift through it, is of no real value anymore. Retention of data is only an issue if you are still thinking about personal data files and the STASI problem.
This is all quite concerning on a number of levels, but when it comes to thinking about data regulation it highlights the fact that, provided we wish to maintain the idea that we live in a democracy where governments can’t operate above the law, any form of regulation you might decide to apply to Facebook and any current or future Cambridge Analyticas also has to apply to GCHQ and the NSA. The NSA deserves to be put in front of Congress just as much as Mark Zuckerberg.
Furthermore, it highlights the extent to which this is so much bigger than digital advertising. We are moving towards a society structured along lines defined by a form of digital caste system. We will all be assigned membership of a digital caste. This won’t be fixed but will be related to specific tasks in the same way that Google search’s understanding of us is related to the specific task of answering a particular search query. These tasks could be as varied as providing us with search results, to deciding whether to loan us money, or whether we are a potential terrorist. For some things we may be desirable digital Brahmins, for others we may be digital untouchables and it will be algorithms that will determine our status. And the data the algorithms use to do this could come from KitKats and fridges – not through any detailed analysis of our personal data files. In this world the reality of our lives becomes little more than personal opinion: we are what the algorithm says we are, and the algorithm can’t or won’t tell us why it thinks that. In a strange way, creating a big personal data file and making this available is the only way to provide protection in this world so that we can ‘prove’ our identity (cue reference to a Blockchain solution which I could devise if I knew more about Blockchains), rather than have an algorithmic identity (or caste) assigned to us. Or to put it another way, the problem we are seeking to avoid could actually be a solution to the real problem we need to solve.
The digital caste problem is the one we really need to be focused on.
So – the challenge is how do we prevent or manage the emergence of a digital caste system. And how do we do this in a way which still allows Google search to operate, doesn’t require that we make consumption of chocolate bars a private act or regulate conversations between household objects (and all the things on the Internet of Things) and can apply both to the operations of Facebook and the NSA. I don’t see any evidence thus far the the great and the good have any clue that this is what they need to be thinking about. If there is any clue as to the direction of travel it is that the focus needs to be on the algorithms, not the data they feed on.
We live in a world of a rising tide of data, and trying to control the tides is a futile exercise, as Canute the Great demonstrated in the 11th century. The only difference between then and now is that Canute understood this, and his exercise in placing his seat by the ocean was designed to demostrate the limits of kingly power. The GDPR is currently dragging its regulatory throne to the waters edge anticipating an entirely different outcome.
P.S. I am talking about this post on the excellent Echo Junction podcast, hosted by Adam Fraser