The Great AI stagnation

Info
This essay is about generative (decoder) Large Language Models, or what is dubbed “AI” or sometimes “AGI” in pseudo-technical circles, not about machine learning as a whole. My argument, reduced to 1 sentence, is that it is difficult for AI to achieve superhuman ability when it depends on human-curated training data. This does not apply to domain-specific machine learning, such as AlphaGo or Deep Blue, where the problem space is tightly defined and there are objective criteria for evaluation. In those domains, computers have thoroughly outperformed humans for a long time, and will continue to get better.

People have finally noticed that AI (especially Large Language Model AI) has plateaued. The prophesied exponential growth to an “Artificial Super-Intelligence” “singularity”, where AI gets smart enough to learn how to reprogram itself and becomes smarter at an uncontrollable rate… simply hasn’t happened and doesn’t look like it’s going to.

Curiously, the skill / intelligence of Large Language Model AI has plateaued right around the level of a moderately above-average human with an unusually good memory (i.e., one who knows a lot of facts). It was obvious to me around 2020, when I first learned about GPT-3, that this was going to happen. Some people are still surprised or confused by this, so I would like to explain why I predicted it so long before it happened, despite the initial appearance of a (now clearly illusory) exponential growth curve.

Demystifying “Artificial Intelligence”

A lot of effort is spent asking, “Is AI conscious? Is it really thinking?” The more important question is, “Are humans conscious? Are we really thinking?” And the answer is, “Who cares?” (Ctrl+F “submarines” ofc.)

Let’s examine how all existing AI works (Large Language Models in particular, since everyone agrees they are the primary vector toward artificial general intelligence).

AI only knows 1 thing: “What would my training data say?” More specifically, “What response would represent the smallest cumulative deviation from my training data?”
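
To make “smallest cumulative deviation” a bit more concrete (this is my gloss on the standard pre-training setup, not anything specific to one vendor’s model): these models are trained to minimize the total negative log-likelihood (cross-entropy) of the training text, one token at a time:

$$
\theta^{*} \;=\; \arg\min_{\theta} \; -\sum_{t} \log p_{\theta}\!\left(x_{t} \mid x_{<t}\right)
$$

where the $x_t$ are the tokens of the training data and $p_{\theta}$ is the model’s predicted next-token distribution. “Deviation from the training data” is literally the quantity being minimized.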

When you ask an AI a question, it tries to predict the answer that the totality of its training data would be most likely to give. It can extrapolate a fair bit: in the same way that someone trained in physics has significant knowledge overflow into chemistry, if you ask an AI something its training data doesn’t exactly cover, it will match the question against the closest-fitting piece of training data it has. Especially for simple stuff, it can do pretty well, like a physics professor answering basic questions about chemistry. The larger the training data, the more likely it is to have some reference point to match against, even for increasingly obscure inquiries. The further outside the scope of its training data you go, the worse the results become.

However, AI can’t give answers that aren’t based on its training data. This is a complicated thing to understand, because it’s not always intuitive how one piece of information might map onto a seemingly unrelated one, after passing through dozens of neural network layers. Neurons and layers clearly do establish “concepts” of some kind, so a Large Language Model is able to relate data in a broader non-literal sense, in much the same way a human does.

Is this “understanding”? Is this “consciousness”? Again, who cares. It gives good answers to a lot of types of questions; I don’t see any reason to consider this different from “general” intelligence. Humans don’t know stuff that they haven’t learned either.

Secondly, whether something is “covered” in the training data isn’t a yes/no question; it’s a matter of degree. The better something is represented in the training data, the more perfectly AI can imitate it. This is true in both a good and a bad sense. For example, imagine there were an internet forum dedicated to the belief that doors should always be purple, whose members wrote thousands of pages of essays about purple doors. An AI trained on this data would likely mention that a door is purple any time doors come up in its response. This doesn’t mean it’s “correct” in a human sense, but it is correct in the AI sense, because “correct” just means “very similar to my training data”.
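
As a toy illustration of “correct just means similar to my training data” (a deliberately crude counting sketch, nothing like a real transformer, and with an entirely invented corpus), here is how a purely frequency-based model would answer a question about door colors:

```python
from collections import Counter

# Entirely hypothetical training data: thousands of posts from the imagined
# purple-door forum, plus a little ordinary text about doors.
training_posts = (
    ["doors should always be purple", "i painted my door purple",
     "a purple door is the only correct door"] * 1000
    + ["our front door is red", "the barn door is brown"] * 10
)

COLORS = {"purple", "red", "brown", "blue", "green", "white"}

# "Training": count how often each color word shows up in the corpus.
color_counts = Counter(
    word
    for post in training_posts
    for word in post.split()
    if word in COLORS
)

def what_color_is_a_door() -> str:
    """The 'correct' answer in the AI sense: the one that deviates least
    from the training data, i.e. simply the most frequent one."""
    return color_counts.most_common(1)[0][0]

print(what_color_is_a_door())  # -> 'purple', because that's what the data says
```

A real LLM works with learned weights over concepts rather than literal word counts, but the principle is the same: over-representation in the training data becomes “correctness” in the output.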

GPT-3

The quality of AI language models jumped enormously in 2019-2020, due to a massive increase in the training data volume. The innovation was, “What if we train it on all information in existence?” That turned out to be a good idea.

GPT-3, the precursor of ChatGPT, was trained on all usable written information on the internet. (This is important.)

This was a monumental breakthrough. Most topics are discussed extensively on the internet, and it quickly became clear that GPT-3 knew an awful lot about all sorts of things. This was a huge step up from GPT-3’s predecessor, GPT-2, which was trained on a more selective but already very large body of public text. Official numbers aren’t available, but it’s believed that GPT-3 was trained on about 100x as much data as GPT-2.

It was obvious that GPT-3 was a gigantic step up from everything before it. Demos started popping up and going viral in mid 2020, as people started to get beta access to GPT-3. This is roughly when narratives about “Everyone is going to get replaced by AI”, “The AI singularity is here”, and so forth, started to spread in earnest. People drew charts saying “This is where we were 2 years ago, here’s where we are now, what comes next???” with crazy upward-sloping exponential curves.

I find this video (and others like it) interesting, because this is the exact same reaction that people less deep in the AI sphere started to have about 2.5 years later, when ChatGPT was released. This video is from July 2020! ChatGPT was released in November 2022.

ChatGPT and so forth

I am going to largely skim over the successors to GPT-3, not because they’re unimportant from a research perspective, but because their improvements were more like “polishing” the latent abilities of GPT-3 to make it consumer-grade.

ChatGPT was mostly just a public release of GPT-3. GPT-3.5 made some improvements to its conversational skill; GPT-4 made the context window much larger, supposedly made it more responsive to emotional context, and improved its ability to stay on topic without getting distracted. The difference between GPT-3 and GPT-4 is nowhere near as big as the difference between GPT-2 and GPT-3. GPT-4 supposedly increased the parameter count by another 10x, and some people claim to get somewhat better results from it, but I didn’t personally notice much difference.

Most power users seem to agree, at this point, that the high point of generative LLMs was shortly after ChatGPT was released, perhaps March 2023. Some things are unambiguously better, like the ability to avoid getting stuck in a repetitive loop; but these modest improvements are mostly cancelled out by the infuriating HR policies that have been baked into the model. In my view it’s a wash, and this increasingly seems to be the consensus view.

As an illustrative example, consider the famous “bottomless pit inspector” meme, a greentext produced by GPT-3 in June 2022, 5 months before ChatGPT was released. This is just one of a collection of GPT-3 greentexts generated around the same time, many of which are also quite funny. If you make a similar request to one of GPT-3’s successors, including ChatGPT, it’s a complete flop; it gives you sterile garbage. (Try this yourself if you’d like!)

Here is the original Bottomless Pit Inspector post:

AI and IQ

GPT-3, having been trained on the whole public internet, knew everything you could learn from the public internet.

In particular (this is the key point):

  • it knows as much about those topics as the people who wrote about them on the internet did.

Who writes on the internet? Literate people. Writing significant bodies of text has an IQ cutoff well above the general population average. So GPT-3 was effectively trained to replicate the output of above-average-IQ people.

But remember, AI doesn’t pick the “smartest” or “most correct” response from its training data; it picks the most likely response. If the average IQ of writing in GPT’s training set is 120, then GPT’s output is going to sound a lot like someone with 120 IQ. How smart are the people who wrote GPT-3’s training data? Well, it depends on the source. People who write about complex or technical topics are likely to be smarter. You can notice this yourself: if you ask ChatGPT about something ordinary, it gives banal platitudes; if you ask it about a difficult topic, it appears to get much “smarter”.

Another way to think about this: when you ask GPT a question, the answer is effectively what you’d get if you asked every person on the internet that question and then averaged out their answers. This tends to work well because people most often write about topics they know well. So GPT feels like talking to a practitioner of the relevant field, often a pretty good one, no matter what you ask it about.
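
One loose way to formalize this averaging intuition (my own framing, offered as a sketch rather than a theorem): treat the model’s answer as a weighted mixture over the writers in its training set, with each writer weighted roughly by how much relevant text they contributed:

$$
p_{\text{model}}(\text{answer} \mid \text{question}) \;\approx\; \sum_{w \in \text{writers}} \lambda_{w}\, p_{w}(\text{answer} \mid \text{question}),
\qquad \lambda_{w} \propto \text{amount of relevant text written by } w
$$

Because the prolific, merely-competent writers dominate the weights, the answer tracks the typical practitioner rather than the single best expert.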

Obviously, the intelligence of writers on the internet doesn’t scale up to infinity, and although higher-IQ internet writers are generally more prolific, there are also far fewer of them. There’s way more text on the internet written by 120 IQ people than by 140 IQ people, and text written by 160 IQ people is enormously scarcer still. The exact number doesn’t really matter for my argument, but I’d guess (i.e., invent) that the “effective IQ” of ChatGPT ranges from 110 to 130, depending on the topic. (There’s no scientific procedure to this; feel free to invent your own numbers if you like. You have as much right to as I do.)

People’s reactions to ChatGPT seem to go roughly according to their IQ buckets, relative to GPT’s:

  • If the average GPT training set author is smarter than you, then it feels like “Artificial Super-Intelligence” is already here.
  • If you’re at about the same IQ level as the average training set author, then GPT seems impressive and somewhat intimidating, because if you extrapolate it getting “smarter” over time, then you can imagine it eventually becoming much smarter than you.
  • If you’re smarter than the average training set author, then GPT seems quite overrated, and feels like talking to a mildly incompetent and insanely overconfident colleague.

The future – is Artificial Super-Intelligence coming?

No.

GPT-3 achieved an enormous improvement over GPT-2 because it was trained on 100x the data. But we’ve hit the limit; it already used the whole internet. There isn’t another source of text data that is 100x bigger.

In particular, there’s no way to make AI smarter without finding smarter training data. Using all the data very quickly got AI up to the level of the average human writer on the internet (who, again, is much smarter than the average human). To improve the intelligence of the training set, you can either add more text by extremely smart people, or remove a lot of text from less smart people. Both of those are infeasible at this point: the internet is starting to get polluted by huge volumes of meaningless AI-generated content, and internet companies have started paywalling access to their data, convinced that it’s a valuable / marketable asset. Even re-assembling the data set used for GPT-3 from scratch would be almost impossible now, let alone assembling something better.

Besides increasing the raw volume of training data, the other approaches to making LLMs better have typically been to increase the number of parameters (model size) and the amount of GPU-hours (computing power) used to train them. When it comes to improving the intelligence of the model, these both have diminishing returns; they both have the effect of making the output of the model more faithful to its training data. In the limit, if you imagine an LLM with infinitely many parameters trained on unlimited GPUs, it would sound perfectly indistinguishable from a typical internet user.
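
Stated in terms of the standard cross-entropy objective (a textbook decomposition, used here to make the point precise rather than to describe any particular model’s training run):

$$
\underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\!\left[-\log p_{\theta}(x)\right]}_{\text{training loss}}
\;=\; H(p_{\text{data}}) \;+\; D_{\mathrm{KL}}\!\left(p_{\text{data}} \,\|\, p_{\theta}\right)
$$

More parameters and more compute can only push the KL term toward zero, i.e. toward $p_{\theta} = p_{\text{data}}$. They buy a more faithful copy of the training distribution, not a distribution smarter than the data it came from.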

An AI model is sort of like a gigantic compression format that squeezes a large amount of training data into a relatively small model. A huge-capacity model isn’t much use if most of that capacity goes unused. In practice, the approach of just increasing the model size seems mostly tapped out; the huge models aren’t (much) better than the medium-sized ones (e.g., Llama, Mistral).

One of the sillier proposals I’ve seen is using ChatGPT to generate more training data for itself. This is information-theoretically impossible. At best you would smooth out some of the variation in output quality (both upward and downward); in practice, each iteration would make the model slightly worse, since it will be less than 100% efficient at transcribing its embodied “knowledge”.
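
Here is a toy way to see the compounding-error claim (a deliberately simple statistical caricature with made-up numbers, not a simulation of any real model): repeatedly fit a distribution to samples drawn from the previous generation’s fit. No new information ever enters the loop, and the estimation noise from each round accumulates as drift.

```python
import random
import statistics

random.seed(0)

# "Generation 0": the real, human-produced data distribution (invented numbers).
mean, stdev = 100.0, 15.0
samples_per_generation = 50   # each generation sees only a finite sample

for generation in range(1, 21):
    # Each generation is trained only on the output of the previous generation:
    # draw samples from the current fit, then refit to those samples.
    samples = [random.gauss(mean, stdev) for _ in range(samples_per_generation)]
    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)
    print(f"gen {generation:2d}: mean={mean:7.2f}  stdev={stdev:6.2f}")

# The fitted parameters random-walk away from (100, 15); nothing in the loop
# can pull them back, because no fresh human-generated data is ever added.
```

The same logic applies to a language model retrained on its own output: at best it reproduces itself, and in practice it slowly degrades.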

In practice, people have observed that ChatGPT has mostly stayed the same or even gotten worse since it was initially launched in late 2022 and early 2023. There aren’t a lot of easy ways to make it better; meanwhile, for various social reasons, there are a lot of people working to cripple what little AI we have now. For example, many people got “triggered” by the GPT-2 to GPT-3 jump – mostly people from the middle IQ “bucket” I described above – and, for both sincere and cynical reasons, decided AI had to be stopped. There is also a good argument to be made that it’s impossible for an AI model to develop improved pattern-recognition capabilities without it also noticing various non-HR-policy-compatible patterns, which again, for social reasons, is considered undesirable by certain segments of the population. These groups have formed an unholy coalition intent on making AI worse and less capable; so far they seem to be pulling even with, or slightly ahead of, the AI researchers.

Small Language Models

Most of the progress in Large Language Models nowadays is in making them smaller, i.e., increasing the compression efficiency – packing the same data with as little loss as possible into a 7 GB package instead of a 700 GB one, for instance. I like this direction of progress, for accessibility and usability reasons, but not even AI Doomers think this approach poses an “Artificial Super-Intelligence” risk.
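
For the curious, here is the back-of-the-envelope arithmetic behind figures like “700 GB vs 7 GB” (the parameter counts and precisions below are illustrative assumptions, not official specifications of any particular model): download size is roughly parameters times bytes per parameter.

```python
def model_size_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate on-disk size: parameters x bits per parameter, in gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative figures only:
print(model_size_gb(175e9, 32))  # ~700 GB: a 175B-parameter model stored as 32-bit floats
print(model_size_gb(7e9, 8))     # ~7 GB:   a 7B-parameter model quantized to 8 bits
print(model_size_gb(13e9, 4))    # ~6.5 GB: a 13B-parameter model quantized to 4 bits
```

Quantization shrinks the bits per parameter and distillation shrinks the parameter count; neither adds any new information to the model.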

I expect the next era of AI will mostly be more of this. The “Autonomous AI Agent” narrative of 2023, the Holy Grail of AI Hypefluencers, was a bust. Even the best AI isn’t anywhere near smart enough to be an autonomous agent, even for very simple tasks. So it’s not clear there’s much benefit in investing in huge, expensive, and hard-to-run models. What would you be trying to achieve?

Interestingly, what I’ve started calling “Small Language Models”, i.e., 10-100 billion parameter open models that you can run on your own machine, are not enormously worse than the Large Language Models with >1 trillion parameters. They’re worse, but not way worse. And the gap is closing.

Conclusion: AI is just its training data

An AI attempts to replicate (interpolate or extrapolate from) its training data. The training data is produced and evaluated by humans. AI cannot generate its own training data; doing so would achieve nothing except introducing random, compounding errors with each successive round of training. The intelligence of AI is therefore bounded by the intelligence of the training data humans can provide to it; therefore, the intelligence of AI is bounded by the intelligence of humans, QED.

One of the oldest computing adages is, “Garbage in, garbage out”. If you provide a program with nonsense data, it will produce gibberish as a result. GPT has shown, by the same token: “pretty good in, pretty good out”.