There Will Never Be a Generalised Artificial Intelligence (AGI)

Mar 5, 2026 | Gabe Musker

At the recent AI summit in India, OpenAI CEO Sam Altman made some pretty controversial statements that were picked up by the world’s press, including this choice quote that gained particular traction (I spotted it in an article in the Guardian):

People talk about how much energy it takes to train an AI model – but it also takes a lot of energy to train a human. It takes about 20 years of life – and all the food you consume during that time – before you become smart.

There’s a lot to unpick in that statement, but the thing I’m most interested in is an implied assumption that he makes, which he later dives into more explicitly (quoted from an article from India Express):

…AGI feels pretty close at this point. If you had asked most people 6 years ago, what if we had systems that could do new research on their own, or programme on their own, you would say that sounds pretty intelligent and pretty general.

This narrative - that AGI is close, and it will change everything - is one that is peddled by excitable tech bros and AI doomsayers with equal vigour. For Sam Altman et al, it’s their key selling point for why we should all be eagerly awaiting the next iteration of their LLMs - because with each new GPT model, we get one step closer to AGI.

However, this idea has one big problem for me - it simply isn’t true.

What is AGI?

AGI has been the future-state scenario for AI for as long as the technological drive for non-human intelligence has existed. “Artificial General Intelligence”, as a concept, refers to the idea that we might one day have a computer (or application, or algorithm) that is capable of solving a diverse range of problems without additional need for training, tuning, or prompting.

A classic depiction of AGI in popular media would be Tony Stark’s JARVIS AI assistant from the early Iron Man movies. Initially, it seems that JARVIS is something that we’re not far away from at all - a kind of “advanced Alexa”, who is capable of retrieving any information from the internet, providing contextual advice, solving advanced mathematical and data analytic problems, controlling the Iron Man suit systems and giving tactical tips, and managing the automated systems within the house (not to mention his advanced situational sarcastic jibe system).

Given that we already have Alexa to control our lights and our music, and Gemini to tell us what the time currently is in Shanghai, and Wolfram Alpha for solving hard maths problems, and auto-pilot systems for flying drones, and ChatGPT for telling us about the history of the Roman Senate in a sassy tone, it seems like we’re pretty much there in terms of having a single JARVIS-like system for our lives. Surely it’s just a case of combining these models into one big mega-model, and then it’ll all just work together harmoniously.

Right?

Flexibility vs Accuracy

A key challenge here is a fundamental mathematical limitation in machine learning (the field of study dedicated to designing and training AI models) called the bias-variance trade-off. This basically describes the idea that when we train these models (by showing them lots and lots of examples of what we want them to do), we have to ensure our training data has a balance of the following two features:

  • Specificity: there have to be enough similar examples of the problem we want to solve and the solution we want, so that the model can solve the specific problem we’re asking it to;
  • Variability: the examples that we show it have to be varied enough to ensure that when the model is shown new, unseen problems within the same problem set, it is able to solve them too.

Think of it like trying to learn your times tables. You want to see the same times tables often enough that you spot the pattern, but with enough new examples scattered in that you actually learn how multiplication works, rather than just parroting answers you’ve already seen.

If you don’t see the same examples enough, you never learn the pattern. In ML, we call this underfitting - it means you have a high bias. If you see the same examples too often, you’ll only learn how to solve those examples, and never be able to generalise. In ML, we call this overfitting - it means you have a high variance.

Thus, when training new AI models, we are always fighting to trade off these phenomena. In the case of our times tables, many of us have overfit to our training data, since we can do our times tables up to 12 with ease (due to having them drilled into us on a daily basis in primary school) but struggle to generalise with numbers higher than that in our heads.

Thanks, Mr Niblett. Now I have high variance on my times tables. I hope you’re proud.
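The memorisation failure mode can be sketched in a few lines of Python. Both “models” below are hypothetical toys built around the times-tables analogy, not real ML - the point is just the difference between memorising answers and learning the underlying rule:

```python
# Toy illustration of overfitting-as-memorisation, using the times-tables
# analogy from the text. Both "models" are invented sketches, not real ML.

# Training data: every product up to 12 x 12 (what we drilled at school).
training_data = {(a, b): a * b for a in range(1, 13) for b in range(1, 13)}

def memoriser(a, b):
    """High-variance 'model': a pure lookup table. Perfect on examples it
    has seen, useless on anything outside the training set."""
    return training_data.get((a, b))  # None for unseen pairs

def generaliser(a, b):
    """Low-variance 'model': has learned the underlying rule (repeated
    addition), so it handles unseen inputs too."""
    total = 0
    for _ in range(b):
        total += a
    return total

print(memoriser(7, 8))     # 56 - seen in training
print(memoriser(13, 4))    # None - outside the 12x12 training set
print(generaliser(13, 4))  # 52 - the rule generalises
```

The memoriser has effectively zero bias on its training set and catastrophic variance outside it - which is roughly the 13-times-table situation described above.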

LLMs Aren’t Special

It’s not often talked about, but this trade-off applies to LLMs too. They’re built on a type of ML model called a transformer, which is specifically set up to have very high variance (i.e. they tend to overfit). And although they’re marketed as “all-purpose” models, LLMs are actually just trying to solve one specific problem - to accurately predict sequences of words. It just happens that we humans give a lot of contextual meaning to those sequences of words, particularly when they’re long enough.

This tendency to overfit on the problem of accurately predicting sequences of words is why LLMs will often latch onto key words or phrases that you’ve put in your prompt, and also why they’ll really confidently tell you something that isn’t true - they’re designed to sound like a person, not to understand the meaning of what they’re writing (this article contains a more detailed explanation).
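To make the “predicting sequences of words” objective concrete, here is a toy bigram model - an enormous simplification of a real transformer, with an invented corpus, but the training objective is the same shape: predict the next token from what tended to follow in the training data, with no notion of whether the output is true:

```python
from collections import Counter, defaultdict

# A toy bigram "language model": predicts the next word purely from which
# word most often followed the current one in the training text. Real LLMs
# are vastly more sophisticated, but the objective is the same: predict
# the next token - not to understand, and not to be truthful.

corpus = (
    "the cat sat on the mat the cat ate the fish "
    "the dog sat on the rug"
).split()

follow_counts = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follow_counts[current][nxt] += 1

def predict_next(word):
    """Return the most frequent follower of `word` seen in training."""
    return follow_counts[word].most_common(1)[0][0]

print(predict_next("the"))  # "cat" - the most common follower
print(predict_next("sat"))  # "on"
```

The model will confidently emit “the cat” regardless of whether any cat exists - a miniature version of the confident-but-wrong behaviour described above.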

This is also why they’re actually bad at many tasks, compared to other types of model that are specifically set up to complete those tasks. For example, this paper talks about the problem of classifying mental health status based on social media text statements - a classic problem in the field of Natural Language Processing (NLP, a sub-field of ML). A prompt-engineered LLM achieved only 65% accuracy on this task, and while the fine-tuned LLM was a lot better (91%), it still misclassified almost twice as often as a feature-engineered classical ML model (a support vector machine, for those interested), which had 95% accuracy.

This is because LLMs are not designed to solve the problem of classification. And so, no matter how impressive they are at solving the problem they’re designed to solve (i.e. predicting text sequences), it shouldn’t be a surprise that they’re not good at solving problems that they’re not designed to solve (even when those problems are also based on text data).
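For a sense of what “feature-engineered classifier” means here, the sketch below scores texts against hand-picked keyword sets. To be clear: this is not the SVM from the paper - the labels and keyword lists are invented for illustration, and a real model would learn feature weights from labelled data - but it shows how a purpose-built classifier targets the classification objective directly:

```python
# A minimal sketch of a feature-engineered text classifier. NOT the SVM
# from the paper cited above - the labels and keyword sets are invented
# for this example. It illustrates the idea of explicit features chosen
# for the classification task, rather than generic next-word prediction.

FEATURES = {
    "distressed": {"hopeless", "exhausted", "alone", "worthless"},
    "neutral":    {"coffee", "weekend", "football", "weather"},
}

def classify(text):
    """Score each class by how many of its keywords appear in the text,
    and return the highest-scoring label. A real model would learn these
    feature weights from labelled training data."""
    words = set(text.lower().split())
    scores = {label: len(words & keywords)
              for label, keywords in FEATURES.items()}
    return max(scores, key=scores.get)

print(classify("I feel so alone and worthless lately"))     # "distressed"
print(classify("great weather for football this weekend"))  # "neutral"
```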

Diminishing Returns

Fine, you might say, LLMs aren’t great at solving this kind of text-based classification problem right now, but they’re getting better all the time, aren’t they? If we just keep increasing the model complexity and adding more data, won’t they become more able to solve more general problems eventually?

Not according to Dr Mike Pound over at Computerphile. He says that, according to a paper released in 2024, we are reaching a point of diminishing returns when it comes to the ability of LLMs to solve more generalised problems. According to the paper, “multimodal models require exponentially more data to achieve linear improvements in downstream “zero-shot” performance, following a sample inefficient log-linear scaling trend”.

In English, this means that each incremental improvement in LLM performance on new tasks requires exponentially more training data. And given where we are with performance on these tasks right now - good, but not great (taking our 91% from the example above as a benchmark) - and the amount of training data already required to get there (billions of data points), we will soon reach the point where there isn’t enough data in the world to increase the accuracy of these models by any meaningful amount, especially for more specific tasks.
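The shape of a log-linear trend is easy to see with some back-of-the-envelope arithmetic. The constants below are invented purely to illustrate the curve - they are not fitted to any real model or taken from the paper:

```python
import math

# Illustrative arithmetic for a log-linear scaling trend: performance
# grows with the *logarithm* of training-set size, i.e.
#     performance = k * log10(data_points) + c
# The constants k and c are invented for illustration only.

k, c = 0.10, 0.0

def performance(data_points):
    return k * math.log10(data_points) + c

def data_needed(target):
    """Invert the trend: data required to hit a given performance level."""
    return 10 ** ((target - c) / k)

# Under this trend, each additional 0.1 of performance costs 10x the data:
for target in (0.8, 0.9, 1.0):
    print(f"performance {target}: ~{data_needed(target):.0e} data points")
```

Linear gains, multiplicative data costs: that is the “sample inefficient” part of the quote above.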

We may even have reached that point of diminishing returns already, although I’m sure that the NVIDIA sales reps hawking the GPUs used to train these models would disagree vehemently.

What About Agent Orchestration?

Okay, so we can’t have a single AI model that performs all of these tasks to a high level, but what about orchestration? Surely the promise of AI agents is that you can combine a number of different specialist models into a combined tool that is greater than the sum of its parts?

Well, yes, to an extent. Tools like LangChain/LangGraph make it easy to string together different AI models, each of which can be tuned to provide the highest possible level of accuracy on the thing they’re tasked with doing. Therefore, it would seem like a logical next step that you could then combine all of the AI tools you need into a JARVIS-style generalist platform.

There is, however, still a large technical barrier for this - context. Each of these individual models still has a very limited context window (i.e. amount of contextual information it can receive in its input), which severely limits the potential ways in which these separate models could interact within the orchestration. In other words, with this approach, your real life JARVIS could only receive a very limited amount of information from one use case to the next - so your home automation agent wouldn’t be able to dynamically react to the news from your online information agent (like a flood alert) unless you specifically created a workflow for it to do so.

And once we’re at the point where we need to think up and hard-code each specific situational flow of contextual information from each agent to each other agent, we’re starting to get to the point where it’s going to be a lot of manual work to create these connections. At some point, wouldn’t we just rather have the human as the orchestrator, given that’s what we’re going to effectively be doing anyway?
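The hard-coding problem can be sketched in plain Python. These “agents” are hypothetical stubs, not LangChain/LangGraph code - the point is that every flow of context from one agent to another exists only because a human anticipated that scenario and wired it up:

```python
# A plain-Python sketch of the orchestration problem described above.
# These "agents" are invented stubs for illustration - not real
# LangChain/LangGraph code.

def news_agent():
    """Stub: pretend this fetches alerts from an online information
    service (the hypothetical flood-alert scenario from the text)."""
    return {"alert": "flood_warning", "region": "your_area"}

def home_automation_agent(context):
    """Stub: reacts only to whatever context it is explicitly handed."""
    if context.get("alert") == "flood_warning":
        return "close_basement_vents"
    return "idle"

def orchestrate():
    """The orchestration layer: a human-designed, hard-coded workflow.
    The home agent reacts to the flood alert ONLY because we thought of
    this specific scenario in advance and wrote the connection ourselves.
    Every other agent-to-agent interaction needs its own wiring."""
    news = news_agent()
    return home_automation_agent(news)

print(orchestrate())  # "close_basement_vents"
```

Multiply this by every pair of agents and every situational flow of context, and the human ends up doing the orchestration design anyway - which is the point made above.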

Conclusion

LLMs are massive, extremely impressive tools. I don’t doubt their potential to improve the way that many data science teams, and companies, will work going forwards. I also don’t doubt that there are some very loud voices from some very big companies who have a vested interest in getting everyone excited about their next big product launch (because they’re leveraged up to their eyeballs and are desperately trying to dig themselves out of a hole).

The cynical and misguided assertions from these companies about the future of LLMs are contributing to a massive tech bubble that is looking ever-closer to popping, and could have massive knock-on implications in the future. And the worst part? Nobody even wants this.

Nobody wants the AI overlord, the HAL9000, that controls every aspect of our lives and is owned and run by a single company. No business says “I wish there was one platform with a complete monopoly over every tech aspect of what I do, which could make changes to its services and charge whatever it wants without asking its consumers first”.

So, I’m sorry, Sam Altman, but you’re wrong - and worse than that, you don’t know what your customers want. AGI is still just a pipe dream for tech bros to sell to clueless investors.

As for the rest of us - let’s just stick with using the right tools for the job.

Fervently disagree with me? Think I’ve hit the nail on the head? Simply want to discuss this topic further? Feel free to reach out to me via our contact page, or directly on LinkedIn.