2018-04-03

Facebook makes most of its money from advertising, and – as the Cambridge Analytica scandal continues to haunt Mark Zuckerberg’s company – users are demanding to know how their data is being wrangled and harvested.

But while concern about Facebook user privacy has spiked, it’s been clear since Facebook’s inception that its business is based on widespread surveillance of people, whose data is the product.

Some have portrayed the revelations of the Cambridge Analytica scandal – in which data was allegedly harvested from 50m Facebook profiles – as an “existential crisis”, while others have highlighted potential implications for academic research.

In short, Facebook’s data harvesting methods have become a subject of sudden and widespread concern.

What is data harvesting?

Harvesting data, as its agricultural name suggests, is similar to gathering crops because it involves collection and storage with the expectation of future reward.

Data can be harvested in different ways, ranging from simple copy-and-pasting to more complicated programming. The chosen method is often constrained by the site being harvested. At the simplest level, many sites combat automated harvesting with CAPTCHAs, such as Google’s reCAPTCHA, which help sites differentiate between humans and bots.

If you’ve ever copy-and-pasted text from Facebook or saved an image from Twitter, you’ve harvested social media data. The action of “screenshotting” is permitted on most sites because users can usually only access information that is either public or visible to them because they have logged in. Also, it would be impossible to completely eradicate the simplest data harvesting methods, such as making notes and taking photographs.

Facebook and other social networks are more concerned with restricting automated data harvesting, both to limit demands on their web servers and to control who has access to what data (and why). Personal information and behaviour on social media have commercial, political and research value.

Social networks decide their own usage policies, balancing their commercial interests with third parties against regulatory and user privacy concerns – often described in company documents as juggling the optimisation of “customer behaviour” and adhering to “community standards”.

How is data harvested?

Application Programming Interfaces (APIs) are used by Facebook, Twitter, Instagram and other sites to restrict would-be harvesters’ access. APIs work as a software go-between that allows a researcher or app developer’s computer to “talk” to a social network in a controlled way.
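The go-between role can be illustrated with a minimal Python sketch. It is a toy model, not Facebook’s or any other network’s actual API: the class `ToyApi`, its method `get_profile`, the token name and the request cap are all hypothetical, chosen only to show an app receiving the fields it was granted and being cut off after too many requests.

```python
from dataclasses import dataclass, field


@dataclass
class ToyApi:
    """Hypothetical gatekeeper between an app and a network's stored profiles."""
    profiles: dict          # user_id -> {field_name: value}
    grants: dict            # access token -> set of field names the app may read
    max_requests: int = 3   # toy per-token request cap (assumed value)
    used: dict = field(default_factory=dict)  # access token -> requests made so far

    def get_profile(self, token, user_id):
        # Refuse apps that were never issued a token.
        if token not in self.grants:
            raise PermissionError("unknown access token")
        # Refuse apps that have used up their request allowance.
        count = self.used.get(token, 0)
        if count >= self.max_requests:
            raise RuntimeError("rate limit exceeded")
        self.used[token] = count + 1
        allowed = self.grants[token]
        # Return only the fields this token was granted, never the full record.
        return {k: v for k, v in self.profiles[user_id].items() if k in allowed}


api = ToyApi(
    profiles={"u1": {"name": "Ada", "email": "ada@example.com", "likes": ["chess"]}},
    grants={"app-token": {"name", "likes"}},
)
result = api.get_profile("app-token", "u1")
print(result)
```

In this sketch the app never sees the `email` field, because its token was not granted that field, and a fourth request with the same token would be refused.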

Access through an API comes with conditions. One of the main ones restricts how collected data can be used and shared, and it can be enforced aggressively. In 2010, computer programmer Pete Warden harvested data from 210m public Facebook profiles for research purposes. But he failed to seek permission from Facebook first, thereby violating its terms of service. He later faced the threat of legal action from Facebook and was forced to delete the data – in an echo of academic researcher Aleksandr Kogan’s alleged part in the Cambridge Analytica scandal.

Kogan’s app, dubbed “thisisyourdigitallife”, was developed in 2014 through his company Global Science Research (GSR), separately from his university work. It was a personality test that 270,000 users logged into, accepting that it would have access to some of their personal information and some of their friends’ data too. Those friends, however, had not consented to their data being used in this way.
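A small Python sketch shows why friend-level access reaches far beyond the people who agree to it. The friend lists and names below are invented for illustration, and `reachable_profiles` is a hypothetical helper, not a reconstruction of how the app actually worked.

```python
def reachable_profiles(installers, friends):
    """Return every profile an app can read if each installer also
    exposes their friends' data (a simplifying assumption)."""
    reached = set(installers)
    for user in installers:
        # Each installer brings their whole friend list along.
        reached.update(friends.get(user, ()))
    return reached


# Invented example data: two people install the app.
friends = {"alice": {"bob", "carol"}, "dave": {"carol", "erin"}}
installers = {"alice", "dave"}

reached = reachable_profiles(installers, friends)
non_consenting = reached - installers  # reached without agreeing to anything
print(sorted(non_consenting))
```

Here two installers expose five profiles in total, three of which belong to people who never opened the app.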