The global AI explosion has greatly increased the need for a common-sense, people-centric approach to data privacy and ownership. Leading the way is Europe's General Data Protection Regulation (GDPR), but there is more than just personally identifiable information (PII) at stake in the modern market.
What about the data we generate as content and art? It is certainly not legal to copy someone else’s work and then present it as your own. But there are AI systems that scrape as much human-generated content from the web as possible in order to generate comparable content.
Can the GDPR or any other EU-focused policy protect this type of content? As it turns out, like most things in the machine learning world, it depends on the data.
Privacy vs. Property
The primary purpose of the GDPR is to protect European citizens from harmful actions and consequences related to the misuse, abuse or exploitation of their private data. It is of little use to citizens (or organisations) when it comes to protecting intellectual property (IP).
Unfortunately, to the best of our knowledge, the policies and regulations put in place to protect IP are not equipped to cover data scraping and anonymization. That makes it difficult to understand exactly where the regulations apply when it comes to scraping the web for content.
These techniques, and the data they obtain, are used to create massive databases for use in training large AI models such as OpenAI’s GPT-3 and DALL-E 2 systems.
The only way to teach an AI to imitate humans is to expose it to human-generated data. And the more data you put into an AI system, the more robust its output is.
Here’s how it works: imagine you draw a picture of a flower and post it on an online artist forum. Using scraping techniques, a tech outfit sucks up your image, along with billions of others, so it can create a massive dataset of artwork. The next time someone asks the AI to generate an image of a ‘flower’, there’s a greater than zero chance that your work will be used in the AI’s interpretation of the prompt.
Whether such use would be ethical remains an open question.
Public data vs PII
While the GDPR's regulatory oversight can be described as far-reaching when it comes to protecting private information and providing the right to erasure, it seemingly does very little to protect content from scraping. However, that does not mean that the GDPR and other EU regulations are completely ineffective in this regard.
Individuals and organizations have to follow very specific rules when scraping PII or else they will be in violation of the law, something that can get quite costly.
For example, it has become nearly impossible for Clearview AI, a company that builds facial recognition databases for government use by scraping social media data, to do business in Europe. EU watchdogs from at least seven countries have already issued or recommended hefty fines over the company’s refusal to comply with the GDPR and similar regulations.
At the other end of the spectrum, companies like Google, OpenAI and Meta use similar data scraping practices directly or through the purchase or use of scraped datasets for many of their AI models without any consequences. And while major tech companies in Europe have received a large share of the fines, very few of the violations involve data scraping.
Why not ban scraping?
At first glance, scraping may seem like a practice with too much potential for abuse not to ban outright. However, for many organizations that rely on scraping, the data that is obtained is not necessarily “content” or “PII”, but information that can serve the public.
We contacted the UK's data privacy watchdog, the Information Commissioner's Office (ICO), to find out how it regulates internet-scale scraping techniques and datasets, and to understand why it is so important not to over-regulate.
A spokesperson for the ICO told TNW:
Using publicly available information can bring many benefits, from research to developing new products, services and innovations, including in the field of AI. However, if this information is personal data, it is important to understand that data protection laws apply. This is the case whether the techniques used to collect the data include scraping or something else.
In other words, it’s more about the type of data used than how it’s collected.
Whether you’re copying images from Facebook profiles or using machine learning to scrape the web for tagged images, you’re likely violating GDPR and other European privacy rules if you build a facial recognition engine without the consent of the people whose faces are in its database.
But it’s generally acceptable to scour the Internet for massive amounts of data, as long as you either anonymize it or make sure there is no PII in the dataset.
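To make the "no PII in the dataset" idea concrete, here is a minimal Python sketch of a filter that redacts two obvious kinds of PII from scraped text before it is stored. The patterns and the function names (`redact_pii`, `contains_pii`, `clean_dataset`) are hypothetical and exist only for illustration. A pattern filter like this is crude: it misses names and postal addresses, it can flag harmless number strings, and removing a few fields may not by itself amount to anonymization in the legal sense.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII definition).
# Real pipelines need much broader coverage and legal review.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Matches phone-like runs of digits; may also catch other long number strings.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def contains_pii(text: str) -> bool:
    """Return True if the text matches any of the illustrative patterns."""
    return bool(EMAIL_RE.search(text) or PHONE_RE.search(text))


def redact_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


def clean_dataset(records):
    """Redact every scraped record; return the cleaned list and a count
    of how many records had to be changed."""
    cleaned = []
    changed = 0
    for record in records:
        redacted = redact_pii(record)
        if redacted != record:
            changed += 1
        cleaned.append(redacted)
    return cleaned, changed


if __name__ == "__main__":
    scraped = [
        "A drawing of a flower.",
        "Contact jane@example.com or +44 20 7946 0958.",
    ]
    cleaned, changed = clean_dataset(scraped)
    print(cleaned, changed)
```

Redacting in place keeps the rest of a record usable. A stricter pipeline could instead drop any record for which `contains_pii` returns true.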
Further gray areas
But even within the allowed use cases, there are still some gray areas associated with private information.
For example, GPT-2 and GPT-3 are known to occasionally output PII in the form of addresses, phone numbers, and other information apparently baked into their corpora via large-scale training datasets.
Here, where it is clear that the company behind GPT-2 and GPT-3 is taking steps to mitigate this, the GDPR and similar regulations are doing their job.
Simply put, we can either choose not to train large AI models, or give the companies training them the leeway to investigate edge cases and address the concerns.
What might be needed is a GDUR, a General Data Use Regulation, something that could provide clear guidance on how human-generated content can be used legally in large data sets.
At the very least, it seems worth having a conversation about whether European citizens should have as much right to have the content they create removed from datasets as their selfies and profile photos.
For now, it seems that in the UK and in the rest of Europe the right to erasure only extends to our PII. Everything else we put online will probably end up in some AI’s training dataset.