TNW had a conversation with Cambridge University scholar Aleksandr Kogan, one of the architects of Cambridge Analytica’s Facebook targeting model, to learn how exactly the statistical model processed Facebook data for use in targeting and influencing voters:
In 2013, Cambridge University researchers Michal Kosinski, David Stillwell and Thore Graepel published an article on the predictive power of Facebook data, using information gathered through an online personality test. Their initial analysis was nearly identical to methods used in the Netflix Prize competition, applying singular value decomposition (SVD) to reduce both users and the things they “liked” to their top 100 factors.
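The factor step can be sketched with a toy version of that setup. This is a minimal illustration with made-up data, not the researchers' code: rows are users, columns are pages, and a truncated SVD places both in the same low-dimensional factor space (the paper kept 100 factors; here we keep 3).

```python
import numpy as np

# Hypothetical binary user-by-like matrix: rows are users, columns are
# Facebook pages; 1 means that user "liked" that page. The real matrix
# had tens of thousands of columns.
rng = np.random.default_rng(0)
likes = (rng.random((8, 12)) > 0.6).astype(float)

# Truncated SVD: likes ≈ U @ diag(s) @ Vt, keeping only the k largest
# singular values. Rows of U position users, and columns of Vt position
# pages, in the same k-dimensional factor space.
k = 3
U, s, Vt = np.linalg.svd(likes, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # each user as k factor scores
page_factors = Vt[:k, :].T        # each page as k factor loadings

# The low-rank product approximates the original likes matrix, and the
# approximation improves as k grows toward full rank.
approx = user_factors @ page_factors.T
print(approx.shape)  # (8, 12)
```

Once users sit in that factor space, any attribute correlated with liking behavior can be read off the factor scores.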
The paper showed that a factor model made with users’ Facebook “likes” alone was 95 percent accurate at distinguishing between black and white respondents, 93 percent accurate at distinguishing men from women, and 88 percent accurate at distinguishing people who identified as gay men from men who identified as straight. It could even correctly distinguish Republicans from Democrats 85 percent of the time.
It was also useful, though not as accurate, for predicting users’ scores on the “Big Five” personality test.
This is exactly why allowing nefarious companies like Cambridge Analytica to use personal data provided to Facebook is so dangerous: it becomes easy to profile people and then target them with scary accuracy. The average Facebook user never stops to consider the unintended consequences of the data they hand over to the platform; they see only the immediate benefit, not the long-term effects.
Knowing how the model is built helps explain Cambridge Analytica’s apparently contradictory statements about the role, or lack thereof, that personality profiling and psychographics played in its modeling. Those statements are all technically consistent with what Kogan describes.
A model like Kogan’s would give estimates for every variable available on any group of users. That means it would automatically estimate the Big Five personality scores for every voter. But these personality scores are the output of the model, not the input. All the model knows is that certain Facebook likes, and certain users, tend to be grouped together.
With this model, Cambridge Analytica could say that it was identifying people with low openness to experience and high neuroticism. But the same model, with the exact same predictions for every user, could just as accurately claim to be identifying less educated older Republican men.
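The interchangeability described above can be demonstrated with a small sketch using synthetic data (the variable names and coefficients below are invented for illustration): fit two differently framed labels, a “neuroticism” score and a party indicator, as linear readouts of the same user factor scores. The factors carry no notion of which framing was chosen.

```python
import numpy as np

# Made-up latent factor scores for 200 users across 5 factors, standing
# in for the user_factors an SVD of the likes matrix would produce.
rng = np.random.default_rng(1)
factors = rng.standard_normal((200, 5))

# Two hypothetical targets driven by largely the same underlying
# factors (as personality traits and demographics were in practice),
# plus a little noise.
neuroticism = factors @ np.array([0.8, -0.3, 0.1, 0.0, 0.0]) \
    + 0.1 * rng.standard_normal(200)
republican = factors @ np.array([0.7, -0.4, 0.0, 0.2, 0.0]) \
    + 0.1 * rng.standard_normal(200)

# Least-squares readout for each label from identical inputs.
w_neu, *_ = np.linalg.lstsq(factors, neuroticism, rcond=None)
w_rep, *_ = np.linalg.lstsq(factors, republican, rcond=None)

# Both labels are recovered well from the same factor space, so the
# same predictions could be described under either framing.
pred_neu = factors @ w_neu
pred_rep = factors @ w_rep
```

Nothing in the model privileges the psychographic description over the demographic one; the choice of label is a matter of presentation.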
In other words, statistics were used to frame the company’s answers so that its use of the data appeared less sinister than it really was.