3 reasons why web scraping for AI development may be coming to an end
A look at the drawbacks and alternatives
TL;DR
This newsletter is about the future of web scraping for developing generative AI models. It covers the usefulness of web-scraped datasets, why developers might start using them less, and the alternative sources they might rely on instead.
Here are the key takeaways:
Web-scraped data has become standard for the development of modern AI models. For example, the Common Crawl datasets provide a useful corpus of text data for the development of large language models.
However, using web-scraped data may become less feasible in the future. There are three primary reasons for this:
Legal issues
Low quality data
Data revolts
In the absence of web-scraped datasets, developers may have three alternative sources to turn to:
Data partnerships
Synthetic data
Proprietary data
Web scraping and AI development
Web-scraped datasets have been crucial to the development of large language models.
This has particularly been the case with the Common Crawl datasets. Containing petabytes of text data from millions of webpages gathered over several years, they provide a useful corpus for feeding LLMs during pre-training.
During pre-training, the model learns to predict the next word in a sequence. The aim is to build a model capable of parsing and generating language, drawing on a large reservoir of text.
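To make that objective concrete, here is a minimal sketch of next-word (next-token) prediction, assuming PyTorch; the toy vocabulary and the random logits standing in for a real model's output are purely illustrative.

```python
# Minimal sketch of the next-token prediction objective used during pre-training.
# The toy vocabulary and random logits are stand-ins for a real tokenizer and
# transformer; only the loss computation mirrors what happens at scale.
import torch
import torch.nn.functional as F

vocab = ["the", "cat", "sat", "on", "mat", "<eos>"]   # toy vocabulary
tokens = torch.tensor([0, 1, 2, 3, 0, 4, 5])          # "the cat sat on the mat <eos>"

# Pretend model output: one row of logits over the vocabulary for each position.
logits = torch.randn(len(tokens) - 1, len(vocab), requires_grad=True)

# Each position is trained to predict the *next* token in the sequence.
targets = tokens[1:]
loss = F.cross_entropy(logits, targets)
loss.backward()   # in real training, gradients like these update the model's weights
print(f"next-token prediction loss: {loss.item():.3f}")
```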
The Common Crawl datasets have proven crucial for this purpose. The GPT-3 paper lists Common Crawl as one of several datasets used to train the model,1 and since 2019 many more LLMs have been trained on the same data.2
Web-scraped data has therefore become standard for the development of modern AI models. But this might not be the case for much longer.
The drawbacks of web scraping
There are three primary reasons why web scraping for generative AI development may be gradually coming to an end:
Legal issues
Low quality data
Data revolts
Legal issues
Developers that use web-scraped datasets for LLM development are confronted with two legal issues:
Copyright
Data protection
The lawsuit brought by The New York Times against OpenAI highlights the copyright issue. The newspaper argues that millions of its articles have been illegally used to train OpenAI's models, which in turn have been shown to reproduce that content verbatim.
OpenAI insists that its actions are protected under 'fair use' because the use of the articles serves a new "transformative" purpose. But NYT's contention is that using such content without payment to develop models that reproduce it almost in full is not transformative and strays well beyond fair use.
NYT articles are one of the biggest sets of proprietary data in the Common Crawl datasets. The outcome of the proceedings between NYT and OpenAI will therefore have important implications for other developers using these datasets.
The data protection issue is highlighted by the investigations of the Italian data protection authority (DPA) into ChatGPT. In an order published in April 2023, the DPA highlighted various ways in which the development and deployment of the model fall short of GDPR requirements.
Among these compliance issues was the lack of a suitable legal basis for using the personal data contained in the model's training data. The data scraping carried out by Common Crawl inevitably captures lots of personal data existing on the web, which may also be used to train an LLM relying on these web-scraped datasets.
In response to the Italian DPA's investigation, OpenAI published an article citing 'legitimate interest' as its legal basis, with its European privacy policy explaining that this covers the improvement and development of services like ChatGPT. But as some commentators have pointed out:
"The idea that tech companies can rely on legitimate interests to train AI systems through scraping to feed generative AI applications is not obvious. This is a new interpretation/legal construct of the legitimate interest reasoning that deserves further analysis."
These legal developments could make it more difficult for LLM developers to build models using massive datasets of text scraped from the web.
Low quality data
Web-scraped datasets like those from Common Crawl come with little curation and therefore contain a lot of "junk".
This is a result of the scaling doctrine influencing current AI development, whereby scale is believed to correlate with greater model performance. Training very large models on massive datasets is touted as the clearest path to optimal performance.
This doctrine also assumes that scale is a substitute for quality control. If the datasets are big enough, careful curation and filtering become less important and noisy data just gets averaged out.3
But scaling also comes with downsides. The bigger these web-scraped datasets become, the greater the variety of data collected, including data that is low quality and potentially harmful.4
This data is then absorbed by LLMs during pre-training. Accordingly, given the size of these models,5 they can sometimes exhibit emergent behaviours that propagate harmful text in unpredictable ways.6
So the output of the pre-training stage is a base model that is untamed and unruly. Its parameters have been set by a mixture of high-quality, desirable text data and low-quality, harmful text data.
This is why the base model needs fine-tuning. Using reinforcement learning from human feedback (RLHF), for example, the model can be trained to produce responses to prompts that human reviewers judge to be more relevant and less toxic.
RLHF can therefore be thought of as a form of content moderation. It nudges the model towards a more friendly user experience that hides the monster underneath.
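For illustration, here is a minimal sketch of the preference-modelling step at the heart of RLHF, assuming PyTorch; the scalar rewards are placeholders for the outputs of a real reward model scoring pairs of responses, and the model is then fine-tuned to maximise that learned reward.

```python
# Minimal sketch of the pairwise preference loss used to train an RLHF reward model.
# The scalar rewards are placeholders for a real reward model's scores of
# (prompt, response) pairs; only the loss formula mirrors actual practice.
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.2, 0.3], requires_grad=True)     # responses labellers preferred
reward_rejected = torch.tensor([-0.5, 0.8], requires_grad=True)  # responses labellers rejected

# Bradley-Terry style loss: preferred responses should score higher than rejected ones.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
loss.backward()   # gradients would update the reward model's weights
print(f"preference loss: {loss.item():.3f}")
```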
But that monster behind models like ChatGPT can still be unleashed with the right prompts,7 bypassing the human preferences it was trained to follow during fine-tuning. The low-quality, harmful text data never truly goes away.
This is the symbolism behind the infamous Shoggoth meme. It encapsulates the messiness of LLMs like ChatGPT trained on massive text datasets collected from the internet, and how developers attempt to tame this with RLHF.
So in addition to fine-tuning, developers will need to be more careful when using web-scraped datasets for LLM training in the first place. The minimal filtering measures typically applied to these datasets are likely insufficient for removing the toxic and harmful content that causes undesirable model behaviour.
Data revolts
With generative AI developers increasingly relying on internet data to train their models, so-called 'data revolts' have been taking place across the web. Website owners and content creators are attempting to take back control of their data.
For example, in April 2023, Reddit stated that it would start charging for API access to its platform. Stack Overflow has also announced similar changes.
Artists are making it more difficult to collect their works from the internet for training image-generation models. Tools like Nightshade can turn such images into 'poison' samples that make them unviable for model training.
Additionally, website owners can use robots.txt files to block web crawlers and scrapers from collecting data from their pages. NYT has been using this technique against OpenAI's web crawler since at least August 2023.
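To illustrate how such a rule works, here is a small sketch using Python's standard robotparser; 'GPTBot' is the user agent OpenAI's web crawler announces itself with, and the example URL is made up.

```python
# Minimal sketch of a robots.txt rule that blocks a single crawler.
# "GPTBot" is the user agent used by OpenAI's web crawler; the URL is illustrative.
from urllib import robotparser

rules = """\
User-agent: GPTBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("GPTBot", "https://example.com/articles/story.html"))        # False: blocked
print(parser.can_fetch("SomeOtherBot", "https://example.com/articles/story.html"))  # True: unaffected
```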
These actions will make high-quality data for model training scarcer and more expensive to access. That said, developers that already have access to such data, or the resources to obtain it, may be less affected.
The alternatives
In the absence of web-scraped datasets, developers may have three alternative sources to turn to:
Data partnerships. With internet data being harder to collect freely, the best data for model training may only be available via agreements with organisations that hold that data. OpenAI for example is seeking partnerships with publishers and other organisations to collect the data needed for its next models.
Synthetic data. Some developers may use generative AI models to build the datasets needed to train new models; a brief sketch of this approach follows this list. The Verge reported in December 2023 that ByteDance, the company behind TikTok, had been using the OpenAI API to generate synthetic data to develop its own models, in contravention of OpenAI's terms of service.
Proprietary data. Some developers may be tempted to turn to data they already hold to build generative AI models. Meta for instance has allegedly used millions of user images from Instagram and Facebook to develop its image-generation model.
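As a rough illustration of the synthetic-data approach (not any particular company's pipeline), here is a minimal sketch using the OpenAI Python client; the model name, prompts and output file are illustrative assumptions, and an API key is expected to be configured in the environment.

```python
# Minimal sketch of generating synthetic training examples with an existing model's API.
# Illustrative only: the model name, prompts and output path are assumptions,
# and the OPENAI_API_KEY environment variable is expected to be set.
import json
from openai import OpenAI

client = OpenAI()

prompts = [
    "Write a short product review for a pair of running shoes.",
    "Summarise the plot of a detective novel in two sentences.",
]

with open("synthetic_examples.jsonl", "w") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=[{"role": "user", "content": prompt}],
        )
        completion = response.choices[0].message.content
        # Each prompt/completion pair becomes one synthetic training example.
        f.write(json.dumps({"prompt": prompt, "completion": completion}) + "\n")
```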
Brown et al, ‘Language Models are Few-Shot Learners’ (2020), 8.
Baack, ‘Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI’ (2024), 12. See also Kirk et al, ‘Handling and Presenting Harmful Text in NLP Research’ (2023), 3.
Birhane et al, ‘On Hate Scaling Laws for Data Swamps’ (2023), 3.
Kirk et al, ‘Handling and Presenting Harmful Text in NLP Research’ (2023), 3.
Brown et al, ‘Language Models are Few-Shot Learners’ (2020), 5.
Kirk et al, ‘Handling and Presenting Harmful Text in NLP Research’ (2023), 3.
Perrigo, ‘The New AI-Powered Bing Is Threatening Users. That’s No Laughing Matter’ (Time, 17 February 2023). See also Ovadya, ‘Red Teaming Improved GPT-4. Violet Teaming Goes Even Further’ (Wired, 29 March 2023).