Garbage Pail Kids
The internet's steady fall into the AI-garbled dumpster continues. As Vice reports, a recent study conducted by researchers at the Amazon Web Services (AWS) AI Lab found that a "shocking amount of the web" is already made up of poor-quality AI-generated and translated content.
The paper has yet to be peer-reviewed, but "shocking" feels like the right word. According to the study, over half (specifically, 57.1 percent) of all of the sentences on the internet have been translated into two or more other languages. The poor quality and staggering scale of these translations suggest that large language model (LLM)-powered tools were used to both create and translate the material. The phenomenon is especially prominent in "lower-resource languages," or languages with less readily available data with which to effectively train AI models.
In other words, in what the researchers believe to be a ploy to garner clickbait-driven ad revenue, AI is being used to first generate poor-quality English-language content at a remarkable scale, and then AI-powered machine translation (MT) tools translate that content into several other languages. The translated material gets worse each time, and as a result, entire regions of the web are filling to the brim with degraded, AI-scrambled copies of copies.
"Machine-generated, multi-way parallel translations not only dominate the total amount of translated content on the web in lower-resource languages," the AWS researchers write in the paper, "it also constitutes a large fraction of the total web content in those languages."
Dead Internet Theory
This wouldn't be the first warning sign of generative AI's existential threat to the web's usability. Google, for example, has been forced to grapple with the persistence of AI-generated material in its search and, as a new 404 Media report shows, its Google News algorithms. Amazon has also had a notably rough go with AI content; in addition to its serious AI-generated book listings problem, a recent Futurism report revealed that the e-commerce giant is flooded with products featuring titles such as "I cannot fulfill this request it goes against OpenAI use policy."
But while the English-language web is experiencing a gradual, if palpable, AI creep, this new study suggests that the issue is far more pressing for many non-English speakers.
What's worse, the prevalence of AI-spun gibberish might make effectively training AI models in lower-resource languages nearly impossible in the long run. To train an advanced LLM, AI scientists need large amounts of high-quality data, which they generally get by scraping the web. If a given area of the internet is already overrun by nonsensical AI translations, the possibility of training advanced models in rarer languages could be stunted before it even starts.