Will AI Systems Run Out of Publicly Available Data on the Internet?

08:32 June 12, 2024


A research group says artificial intelligence (AI) companies could run out of publicly available data for their systems in less than eight years.

Training data includes writing and information publicly available on the Internet. AI companies use the internet to “train” AI systems to create human-sounding writing. This “training” is what developers use to create large language models. Currently, many technology companies are developing large language models this way.

The nonprofit research group Epoch AI examines issues relating to AI. It has been following the development of large language models for a few years. In a recent paper, the group said technology companies will exhaust the supply of publicly available training data for AI language models between 2026 and 2032.

The team’s latest paper has been reviewed by experts, or peer reviewed. It is to be presented at the International Conference on Machine Learning in Vienna, Austria, this summer. Epoch AI is linked to the research group Rethink Priorities based in San Francisco, California.

A ‘gold rush’

Researcher Tamay Besiroglu is one of the paper’s writers. He compared the current situation to a “gold rush” in which limited resources are depleted. He said the field of AI might face problems as the present speed of development uses up the existing supply of human writing.

As a result, technology companies like OpenAI, the maker of ChatGPT, and Google are seeking to pay for high-quality data. Their goal is to ensure a flow of good material to train their systems. OpenAI has made deals with social media service Reddit and news provider News Corp. to use their material. The researchers consider this a short-term answer.

Over the long term, the group said, there will not be enough new blogs, news stories or social media writing to support the speed of AI development. That could lead companies to seek online data considered private, such as email and phone communications. They also might increasingly use AI-created data, such as chatbot content.

A ‘bottleneck’ in development?

Besiroglu described the issue as a “bottleneck” that can prevent companies from making improvements to their AI models, a process called “scaling up.”

“…Scaling up models has been probably the most important way of expanding their capabilities and improving the quality of their output.”

The Epoch AI group first made their predictions two years ago. That was weeks before the release of ChatGPT. At the time, the group said “high-quality language data” would be exhausted by 2026. Since then, AI researchers have developed new methods that make better use of data and that “overtrain” models on the same data many times. But there are limits to such methods.

While the amount of written information that is fed into AI systems has been growing, so has computing power, Epoch AI said. The parent company of Facebook, Meta Platforms, recently said the latest version of its Llama 3 model was trained on up to 15 trillion word pieces called tokens.

But whether a “bottleneck” in development is a concern remains the subject of debate.

Nicolas Papernot teaches computer engineering at the University of Toronto. He was not involved in the Epoch study. He said building more skilled AI systems can come from training them for specialized tasks. Papernot said he is concerned that training AI systems on AI-produced writing could lead to a situation known as “model collapse.”

Permission and quality

Also, internet-based services such as Reddit and the information service Wikipedia are considering how they are being used by AI models. Wikipedia has placed few restrictions on how AI companies use its articles, which are written by volunteers.

But professional writers are worried about their protected materials. Last fall, 17 writers brought a legal action against OpenAI for what they called “systematic theft on a mass scale.” They said ChatGPT was using their materials, which are protected by copyright laws, without permission.

AI developers are concerned about the quality of what they train their systems on. Epoch AI’s study noted that paying millions of humans to write for AI models “is unlikely to be an economical way” to improve performance.

The chief of OpenAI, Sam Altman, told a group at a United Nations event last month that his company has experimented with “generating lots of synthetic data” for training. He said both humans and machines produce high- and low-quality data.

Altman expressed concerns, however, about depending too heavily on synthetic data over other technical methods to improve AI models.

“There’d be something very strange if the best way to train a model was to just generate…synthetic data and feed that back in,” Altman said. “Somehow that seems inefficient.”

I’m Caty Weaver.

And I’m Mario Ritter, Jr.
