A new era of data analysis is dawning, and it’s because people are sharing so much information about themselves.
Christian Rudder is one of the founders of the popular online dating site OkCupid. People under 50 can go to the site, enter information about themselves and then make contact with prospective dates and mates according to suggestions made by the site’s algorithms. Rudder realised he was sitting on a mine of data that can reveal new insights about the human condition. He’s written a book about it, titled Dataclysm (London: Fourth Estate, 2014).
At OkCupid, users make judgements about various things, including the looks of other people. Examining only the judgements of self-declared heterosexuals, Rudder plots, for each age, the age of the member of the opposite sex who is rated most attractive. For women up to the age of 30, the most attractive man is slightly older; thereafter, the most attractive man is slightly younger than they are. Rudder shows this in a graph.
Then he does the same thing with men, and comes up with a very different sort of graph.
The men, overwhelmingly, see 20- to 23-year-old women as most attractive. Not too surprising, perhaps, but here is the bonus. This sort of data enables research that overcomes many of the shortcomings of conventional psychological research, for which the experimental subjects are mostly university undergraduates in artificial conditions. On OkCupid, a broader cross-section of the population is included, and the conditions are real-life.
There are methodological obstacles, to be sure. One of them is that people lie, for example about their own attributes. But there’s something more in the data that people are unlikely to lie about: their behaviour. Subscribers at OkCupid, after obtaining the address of a possible match, can choose to contact the person, or not, and the recipient of a message can choose to respond, or not. Given the information collected by OkCupid, it is possible to look for correlations between this behaviour and any number of attributes, for example age, looks, ethnicity and sexuality.
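To make this kind of behavioural analysis concrete, here is a minimal sketch in Python. It is not Rudder's code, and the message records are invented for illustration; the function name and the age groupings are assumptions. It simply computes, for each sender age group, the fraction of messages that received a reply.

```python
from collections import defaultdict

# Hypothetical message log, invented for illustration (not OkCupid data).
# Each record is (sender_age_group, recipient_replied).
messages = [
    ("20-29", True), ("20-29", False), ("20-29", True), ("20-29", True),
    ("30-39", True), ("30-39", False), ("30-39", False), ("30-39", False),
]

def reply_rate_by_group(records):
    """Return the fraction of messages that received a reply, per group."""
    sent = defaultdict(int)
    replied = defaultdict(int)
    for group, got_reply in records:
        sent[group] += 1
        if got_reply:
            replied[group] += 1
    return {group: replied[group] / sent[group] for group in sent}

rates = reply_rate_by_group(messages)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} of messages answered")
```

The same grouping could be applied to any other recorded attribute. A real analysis would also need to account for sample sizes and confounding factors.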
There is yet another source of information: the words people use to describe themselves. Rudder provides some useful tables of words characteristically used by particular groups on OkCupid, for example white men. He tabulates the words used by white men that most distinguish them from black men, Latinos and Asian men: these include “my blue eyes,” “blonde hair” and “ween.” In contrast, words most distinctively used by black men include “dreads,” “jill scott” and “haitian.” And so on with many more words for each group, and for various other groups, such as Asian women. Then there are the antithetical words, namely the words a group is least likely to use compared to other groups. For Latinos, these include “southern accent,” “from the midwest” and “ann arbor.”
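One simple way to find distinctive and antithetical words can be sketched as follows. This is an assumption-laden illustration, not the method Rudder uses in the book: it scores single words (not phrases) by the ratio of their smoothed relative frequency in one group to that in a comparison group. The profile texts and the function name are invented.

```python
from collections import Counter

def distinctive_words(group_texts, other_texts, top_n=2, smoothing=1.0):
    """Rank words by how much more often one group uses them than another.

    Returns (most_distinctive, most_antithetical) lists of words.
    """
    group_counts = Counter(w for t in group_texts for w in t.lower().split())
    other_counts = Counter(w for t in other_texts for w in t.lower().split())
    group_total = sum(group_counts.values())
    other_total = sum(other_counts.values())
    vocab = set(group_counts) | set(other_counts)

    scores = {}
    for word in vocab:
        # Add-one style smoothing avoids division by zero for unseen words.
        p_group = (group_counts[word] + smoothing) / (group_total + smoothing * len(vocab))
        p_other = (other_counts[word] + smoothing) / (other_total + smoothing * len(vocab))
        scores[word] = p_group / p_other

    # Highest ratio first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n], ranked[-top_n:]

# Invented profile snippets, for illustration only.
group_a = ["i love hiking and coffee", "hiking every weekend", "coffee and hiking"]
group_b = ["i love surfing and tea", "surfing every weekend", "tea and surfing"]

most, least = distinctive_words(group_a, group_b)
print("Most distinctive for group A:", most)
print("Least likely for group A:", least)
```

Words used equally by both groups score close to 1 and fall in the middle of the ranking, which is why common words such as "and" do not show up at either end.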
Rudder uses data from OkCupid because he knows it best, but he also draws on data from Facebook, Google, Twitter and other sites that have far larger user numbers. He provides fascinating insights by looking at people’s locations, political views, sexuality and much else. Who would have thought, for example, that data could show that two people meeting through an online dating site, with no prior information about appearance, are equally satisfied with the date regardless of the difference between their attractiveness ratings? As Rudder notes, “people appear to be heavily preselecting online for something [attractiveness] that, once they sit down in person, doesn’t seem important to them” (p. 90).
Rudder confirms the widely noted bias in favour of good-looking women. He goes beyond this to comment on a perverse result:
Think about how the Shiftgig data changes our understanding of women’s perceived workplace performance. They are evidently being sought out (and exponentially so) for a trait [beauty] that has nothing to do with their ability to do a job well. Meanwhile, men have no such selection imposed. It is therefore simple probability that women’s failure rate, as a whole, will be higher. And, crucially, the criteria are to blame, not the people. Imagine if men, no matter the job, were hired for their physical strength. You would, by design, end up with strong men facing challenges that strength has nothing to do with. In the same way, to hire women based on their looks is to (statistically) guarantee poor performance. It’s either that or you limit their opportunities. Thus Ms. Wolf [Naomi Wolf in The Beauty Myth]: “The beauty myth is always actually prescribing behavior and not appearance.” She was speaking primarily in a sexual context, but here, we see how it plays out, with mathematical equivalence, in the workplace. (p. 121)
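The statistical point in this passage can be illustrated with a toy simulation. This is a sketch under strong assumptions, not an analysis from the book: each candidate is given an ability score and an unrelated second trait, both drawn independently from a standard normal distribution, and the function name and parameters are invented. Selecting on the unrelated trait yields hires of only average ability, whereas selecting on ability yields hires well above average.

```python
import random
from statistics import mean

def simulate_hiring(n_candidates=10_000, n_hired=100, seed=0):
    """Compare mean ability of hires chosen by ability vs. by an unrelated trait."""
    rng = random.Random(seed)
    # Each candidate is (ability, unrelated_trait); the two are independent.
    candidates = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_candidates)]

    by_ability = sorted(candidates, key=lambda c: c[0], reverse=True)[:n_hired]
    by_trait = sorted(candidates, key=lambda c: c[1], reverse=True)[:n_hired]

    return mean(a for a, _ in by_ability), mean(a for a, _ in by_trait)

ability_mean, trait_mean = simulate_hiring()
print(f"Mean ability when hiring on ability:         {ability_mean:.2f}")
print(f"Mean ability when hiring on unrelated trait: {trait_mean:.2f}")
```

The gap between the two averages comes entirely from the selection criterion, not from any difference between the candidates themselves.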
One of Rudder’s key topics is racism. One way to detect racist views using Internet data is by looking at the terms people put into search engines. Using Google data (in particular, the Google Trends tool), Rudder plots the number of searches for the word “nigger” against the months before and after Barack Obama’s election victory in November 2008. Several spikes in the graph connect to significant events in the campaign. As Rudder puts it, the graph enables you to “watch the country come to grips with the prospect of a black president” (p. 129). Rudder also uses online data to show that racism in the US is pervasive; biases are widespread rather than restricted to a few open racists. On the other hand, racial biases shown by US data are nearly absent in comparable data about people in Britain and Japan.
Then there is online mobbing, when people gang up against a target. Rudder uses the example of Justine Sacco, who tweeted a poor attempt at humour. She was condemned by thousands for racism, received death threats and lost her job. The hashtag #HasJustineLandedYet was followed by tens of millions of people. Rudder describes his own effort to inject some sense into the conversation about Sacco, only to be countered by a damaging claim about Sacco, one that turned out to be false.
Rudder reviews what researchers say about rumours, gossip and human sacrifice, as social phenomena in history and in the Internet age.
So much of what makes the Internet useful for communication — asynchrony, anonymity, escapism, a lack of central authority — also makes it frightening. People can act however they want (and say whatever they want) without consequences, a phenomenon first studied by John Suler, a professor of psychology at Rider University. His name for it is the “online disinhibition effect.” The webcomic Penny Arcade puts it a little better:
Greater Internet Fuckwad Theory
normal person + anonymity + audience = total fuckwad (p. 145)
Rudder comments that it is strange to be writing a book, in old-fashioned hard copy, in the digital age. But what a book it is! It is stylishly laid out, with an elegant font and beautifully crafted diagrams. It is not quite a coffee table book (there are no colour photos), but for an intellectual work it is exceptionally attractive.
Rudder’s writing style is equally striking, with a mixture of colloquial language, wide-ranging cultural references, scholarly citations and astute observations. Referring to the Twitterstorm against Justine Sacco, Rudder muses:
… this, to me, is why the data generated from outrage could ultimately be so important. It embodies (and therefore lets us study) the contradictions inherent in us all. It shows we fight hardest against those who can least fight back. And, above all, it runs to ground our age-old desire to raise ourselves up by putting other people down. Scientists have established that the drive is as old as time, but this doesn’t mean they understand it yet. As Gandhi put it, “It has always been a mystery to me how men can feel themselves honored by the humiliation of their fellow beings.”
I invite you to imagine when it will be a mystery no more. That will be the real transformation — to know not just that people are cruel, and in what amounts, and when, but why. Why we search for “nigger jokes” when a black man wins; why inspiration is hollow-eyed, stripped, and above all, #thin; why people scream at each other about the true age of the earth. And why we seem to define ourselves as much by what we hate as by what we love. (pp. 148–149)
Rudder suggests that a new approach to studying human behaviour is emerging. Rather than relying on studies of undergraduate students in experimental (artificial) conditions, data will become available for examining human behaviour in “natural” conditions, namely when people think no one is looking at them. This is the idea underlying the subtitle of Dataclysm: Who We Are* — with the footnote *When We Think No One’s Looking.
Rudder is quite aware that online data are incomplete. Those who use OkCupid are not a perfect cross-section of the population between 18 and 50. Even Facebook, with its billions of users, does not incorporate everyone. But there is a qualitative as well as a quantitative jump in what it is possible to analyse: the behaviour of millions of people in natural conditions. This requires access to the data and knowledge of quantitative methodologies.
Rudder comments on the disappearance of privacy, and the fact that most people seem not to care too much: they willingly share all sorts of intimate data, for example on Facebook. It is now possible for marketers to predict fairly accurately, on the basis of automated analysis of data and words, whether you are gay, straight, unemployed or pregnant, among other information relevant for marketing. Analysts are working on how to assess a person’s intelligence from their online presence. Few people realise the potential implications for their careers of their casual interactions on social media.
The masses of data now available about individuals can be mined for insights about human behaviour, and many of these insights are fascinating, sometimes confirming conventional ideas and sometimes challenging them. Readers of Dataclysm can obtain a good sense of a future, part of which is already here, in which data obtained about seemingly innocent activities (such as your Facebook likes, the words you use on Twitter or the terms you enter into search engines) can be used to draw inferences about your prejudices, activities and capabilities. Perhaps, like Rudder, you may decide to become a bit more cautious about your online activities.