In tech we are all, ultimately, parasites. As Drupal creator Dries Buytaert said years ago, we are all more “taker” than “maker.” Buytaert was referring to common practice in open source communities: “Takers don’t contribute back meaningfully to the open source project that they take from,” hurting the projects upon which they depend. Even the most ardent open source contributor takes more than she contributes.
This same parasitic trend has played out for Google, Facebook, and Twitter, each dependent on others’ content, and it is arguably even more pronounced in generative AI (GenAI) today. Sourcegraph developer Steve Yegge dramatically declares, “LLMs aren’t just the biggest change since social, mobile, or cloud—they’re the biggest thing since the World Wide Web,” and he’s likely correct. But those large language models (LLMs) are essentially parasitic in nature: They depend on scraping others’ repositories of code (GitHub), technology answers (Stack Overflow), literature, and much more.
As has happened in open source, content creators and aggregators are starting to wall off LLM access to their content. In light of declining site traffic, for example, Stack Overflow has joined Reddit in demanding that LLM creators pay for the right to use their data for training, as detailed by Wired. It’s a bold move, reminiscent of the licensing wars that have played out in open source and of the paywalls publishers imposed to ward off Google and Facebook. But will it work?
Overgrazing the commons
I’m sure the history of technology parasites predates open source, but that’s when my career started, so I’ll begin there. Since the earliest days of Linux and MySQL, there have been companies set up to profit from others’ contributions. Most recently in Linux, for example, Rocky Linux and AlmaLinux both promise “bug for bug compatibility” with Red Hat Enterprise Linux (RHEL) while contributing nothing toward Red Hat’s success. Indeed, the natural conclusion of these two RHEL clones’ success would be to eliminate their host, leading to their own demise, which is why one person in the Linux space called them the “dirtbags” of open source.
Perhaps too colorful a phrase, but you see their point. It’s the same criticism once lobbed at AWS (a “strip-mining” criticism that loses relevance by the day) and has motivated a number of closed source licensing permutations, business model contortions, and seemingly endless discussion about open source sustainability.
Open source, of course, has never been stronger. Individual open source projects, however, have varying degrees of health. Some projects (and project maintainers) have figured out how to manage “takers” within their communities; others have not. As a trend, however, open source keeps growing in importance and strength.
Draining the well
This brings us to the LLMs. Large enterprises such as JPMorgan Chase are spending billions of dollars and hiring more than 1,000 data scientists, machine learning engineers, and others to drive billion-dollar impact in personalization, analytics, and more. Although many enterprises have been skittish about publicly embracing tools like ChatGPT, the reality is that their developers are already using LLMs to drive productivity gains.
The cost of those gains is only now becoming clear: the cost to companies like Stack Overflow that have historically been the source of those productivity improvements.
For example, traffic to Stack Overflow has declined by an average of 6% every month since January 2022 and dropped a precipitous 13.9% in March 2023, as detailed by Similarweb. It’s likely an oversimplification to blame ChatGPT and other GenAI-driven tools for the decline, but it would also be naive to think they’re not involved.
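That 6% figure compounds quickly. As a rough sketch (assuming, purely for illustration, that the reported average decline held constant every month, which the Similarweb data doesn’t claim), a steady 6% monthly drop leaves well under half of baseline traffic within 14 months:

```python
# Back-of-the-envelope only: treats Similarweb's reported 6% *average*
# monthly decline as if it were constant, to show how the losses compound.
MONTHLY_DECLINE = 0.06

def remaining_traffic(months: int, decline: float = MONTHLY_DECLINE) -> float:
    """Fraction of baseline traffic left after `months` of compound decline."""
    return (1 - decline) ** months

# January 2022 through March 2023 is roughly 14 months.
for months in (6, 12, 14):
    print(f"After {months:2d} months: {remaining_traffic(months):.1%} of baseline")
# After  6 months: 69.0% of baseline
# After 12 months: 47.6% of baseline
# After 14 months: 42.1% of baseline
```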
Just ask Peter Nixey, founder of Intentional.io and a top 2% user on Stack Overflow, with answers that have reached more than 1.7 million developers. Despite his prominence on Stack Overflow, Nixey says, “It’s unlikely I’ll ever write anything there again.” Why? Because LLMs like ChatGPT threaten to drain the pool of knowledge on Stack Overflow.
“What happens when we stop pooling our knowledge with each other and instead pour it straight into The Machine?” Nixey asks. By “The Machine” he is referring to GenAI tools such as ChatGPT. It’s fantastic to get answers from an AI tool like GitHub Copilot, for example, which was trained on GitHub repositories, Stack Overflow Q&A, etc. But those questions, asked in private, yield no public repository of information, unlike Stack Overflow. “So while GPT-4 was trained on all of the questions asked before 2021 [on Stack Overflow], what will GPT-6 train on?” he asks.
One-way information highways
See the problem? It’s not trivial, and it may be more serious than what we’ve haggled over in open source land. “If this pattern replicates elsewhere and the direction of our collective knowledge alters from outward to humanity to inward into the machine then we are dependent on it in a way that supersedes all of our prior machine dependencies,” he suggests. To put it mildly, this is a problem. “Like a fast-growing COVID-19 variant, AI will become the dominant source of knowledge simply by virtue of growth,” he stresses. “If we take the example of Stack Overflow, that pool of human knowledge that used to belong to us may be reduced down to a mere weighting inside the transformer.”
There’s a lot at stake, and not just the copious quantities of cash that keep flowing into AI. We also need to take stock of the relative worth of the information generated by things like ChatGPT. Stack Overflow, for example, banned ChatGPT-derived answers in December 2022 because they were text-rich and information-poor: “Because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking and looking for correct answers [emphasis in original].” Things like ChatGPT aren’t designed to yield correct information, only probable-sounding information that fits patterns in their training data. In other words, open source might be filled with “dirtbags,” but without a steady stream of good training data, LLMs may simply replenish themselves with garbage information, becoming less useful.
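To make “probable-sounding information” concrete, here is a deliberately toy sketch of next-token sampling. The prompt and the weights are invented for illustration (no real model works from a four-entry table), but the mechanism is the point: candidates are weighted by how well they fit patterns in the training data, and nothing in the loop checks whether the sampled answer is true.

```python
import random

# Toy next-token distribution for the prompt "Stack Overflow launched in ...".
# The weights below are invented; a real LLM derives its distribution from
# patterns in training data. Note there is no truth check anywhere.
pattern_fit = {
    "2008": 0.55,  # happens to be the correct year
    "2009": 0.25,  # plausible-sounding, but wrong
    "2010": 0.15,  # plausible-sounding, but wrong
    "2011": 0.05,
}

def sample_next_token(weights: dict[str, float]) -> str:
    """Sample a token in proportion to how well it fits the learned patterns."""
    tokens, probs = zip(*weights.items())
    return random.choices(tokens, weights=probs, k=1)[0]

# Run it a few times: nearly half the completions will be fluent and wrong.
print([sample_next_token(pattern_fit) for _ in range(10)])
```

The toy numbers don’t matter; what matters is that sampling by pattern fit produces fluent text whether or not the highest-weight token is the true one.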
I’m not disparaging the promise of LLMs and GenAI, generally. As with open source and news publishers before them, we can be grateful for OpenAI and other companies that help us harness collectively produced information, while still cheering on contributors like Reddit (itself an aggregator of individual contributions) when they expect payment for the parts they play. Open source had its licensing wars, and it looks like we’re about to have something similar in the world of GenAI, but with bigger consequences.