The Generalisation Illusion: A 2025 Psychological Audit of Artificial Intelligence

A dark blue background features a stylized human head in profile on the right, with a glowing blue circuit board pattern representing the brain. On the left, a smaller, glowing blue neural network structure is connected via lines to a microchip icon, which in turn is connected to the brain circuit pattern.
The brilliance of Large Language Models requires a critical psychological audit. (📷:empowervmedia)

The acceleration of Artificial Intelligence, especially the monumental Large Language Models, has dramatically blurred the lines defining intelligence. In 2025, systems like GPT-4 appear to possess cognitive capacities that were unimaginable until recently, prompting widespread (and often premature) declarations of Artificial General Intelligence (AGI). Yet, this dazzling performance compels us toward a deeper, more necessary question: What is the true measure of intelligence, and does AI’s current success satisfy that criterion? This dilemma revives a fundamental philosophical and psychological debate first articulated in 1980 and meticulously updated in 2021 by Van der Maas, Snoek, and Stevenson in their seminal paper.

The core conflict hinges on the very definition of intelligence. If success is merely defined by the rapid processing and synthesis of massive information (a statistical recall function), modern AI is undeniably triumphant. However, if true intelligence demands flexibility, adaptability, and the capacity to solve problems never before encountered, then the human mind remains the gold standard. Understanding this divide requires grounding computational feats back into cognitive science, clarifying why the success of 2025 models is not solely an engineering breakthrough but a profound psychological challenge.

Video: 'Joseph Weizenbaum and Roger Schank about AI and the shared experience' (1m32s)

Psychological Roots of AI

It is easy to forget that the foundational techniques driving modern AI are deeply inspired by psychological theories of learning. Contemporary architectures, including Deep Learning (DL) and Reinforcement Learning (RL), are computational versions of principles long studied in cognitive and behavioural science. Reinforcement Learning, for example, shares conceptual DNA with operant conditioning, where agents learn optimal behaviour through iterative interaction and reward maximisation within an environment. By emphasising these psychological underpinnings, we move beyond a purely technological assessment and frame AI development as a continuation of the study of human cognition itself. This intersection establishes a critical role for psychology, not merely as a beneficiary of AI, but as an essential discipline for guiding its advancement and understanding its cognitive limitations.
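To make that lineage concrete, the sketch below shows tabular Q-learning on a toy, entirely hypothetical five-state environment (all names and numbers are illustrative, not drawn from any of the systems discussed): the agent is "reinforced" for reaching a goal state and gradually shapes its behaviour from reward alone, much as operant conditioning describes.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch: the agent learns by trial, error, and reward,
# the computational analogue of operant conditioning. The toy environment is
# hypothetical: states 0..4 on a line, with a reward for reaching state 4.
ACTIONS = [-1, +1]          # step left or right
N_STATES, GOAL = 5, 4
alpha, gamma, epsilon = 0.1, 0.9, 0.2

Q = defaultdict(float)      # Q[(state, action)] -> estimated future reward

for episode in range(500):
    state = 0
    while state != GOAL:
        # Explore occasionally, otherwise exploit the current value estimates.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])

        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0

        # Reinforcement: nudge the value of (state, action) toward the observed
        # reward plus the discounted value of the best follow-up action.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# After training, the greedy policy steps right toward the rewarded state.
print({s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)})
```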

The Critical Gap

The 2021 analysis by Van der Maas et al. concluded that the single most difficult issue facing AI remains generalisation, aligning with Schank’s decades-old observation that intelligence is all about generalisation. Generalisation, in this critical context, means the ability to apply acquired knowledge and learned patterns to completely novel cases or tasks (often referred to as Out-of-Distribution (OOD) generalisation).

This update confirms that this fundamental struggle has withstood the test of time, despite exponential increases in training data and model parameters. Current AI systems, while masterful at tasks for which training data is abundant, consistently fail when confronted with genuine conceptual novelty. They demonstrate a persistent difficulty in learning invariant, cross-task patterns that would enable them to build truly flexible and generalisable representations, a hallmark of the human brain. This inability to flexibly transfer knowledge to unfamiliar contexts defines the critical scientific gap separating current systems from true Artificial General Intelligence.
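A toy illustration of what OOD failure looks like in practice (a deliberately simple curve-fitting example, not drawn from any of the papers discussed) makes the point: a pure pattern-matcher that performs almost perfectly within its training range can still collapse the moment it is queried outside that range.

```python
import numpy as np

# Toy illustration: "OOD generalisation" means performing well when test inputs
# come from a region the training data never covered.
rng = np.random.default_rng(0)

# Training distribution: x in [0, 1]; the underlying rule is y = 2x (plus a little noise).
x_train = rng.uniform(0, 1, 200)
y_train = 2 * x_train + rng.normal(0, 0.05, 200)

def predict(x_query):
    """A pure pattern-matcher: answer with the label of the nearest training example."""
    nearest = np.abs(x_train[None, :] - x_query[:, None]).argmin(axis=1)
    return y_train[nearest]

x_in  = rng.uniform(0, 1, 100)    # in-distribution test set
x_ood = rng.uniform(2, 3, 100)    # out-of-distribution test set

mse_in  = np.mean((predict(x_in)  - 2 * x_in)  ** 2)
mse_ood = np.mean((predict(x_ood) - 2 * x_ood) ** 2)

print(f"in-distribution MSE:      {mse_in:.4f}")    # essentially the noise floor
print(f"out-of-distribution MSE:  {mse_ood:.4f}")   # orders of magnitude larger
```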

The Problem of Measurement

A crucial difficulty in evaluating AI lies in the conflation of different types of intelligence. Public perception of AI’s cognitive prowess is often derived from a model’s remarkable fluency (its ability to process complex language sequences and generate highly coherent, statistically probable output). However, researchers in cognitive science meticulously distinguish this fluency, which is largely rooted in crystallised intelligence, from genuine fluidity (the flexible, abstract problem-solving capacity necessary for OOD generalisation).

When LLMs display intelligence, they are primarily demonstrating a mastery of knowledge retrieval and synthesis (a highly sophisticated form of statistical mimicry). For a true psychological audit, we must move past benchmarks that measure statistical association and focus instead on those that test abstract reasoning under conditions of maximum novelty.

High-Stakes Simulation

One of the most widely celebrated achievements of modern AI, frequently cited as proof of AGI capability, is GPT-4's performance on the simulated Bar Exam, where it achieved a score around the top 10% of test takers. However, a closer psychological analysis suggests this is predominantly an achievement of crystallised intelligence. The Bar Exam, while rigorous, tests a closed set of known legal rules, concepts, and precedents. The model excels because it can rapidly synthesise relevant knowledge from its vast training dataset and apply known rules to familiar compositional structures.

Critical perspectives further temper this achievement, pointing out that comparison cohorts often include repeat test takers, who generally score lower than first-time candidates, potentially exaggerating the reported percentile gain. More fundamentally, the ability to synthesise existing legal knowledge does not necessarily translate into the fluid intelligence required for developing completely novel legal arguments or navigating unprecedented ethical dilemmas (tasks that demand true OOD generalisation).

The True Test

To move beyond the limitations of crystallised intelligence testing, the research community is developing new benchmarks specifically designed to force models into situations requiring genuine abstract and novel reasoning. These rigorous tests serve as the field’s hypothesis-testing mechanism, aiming to assess whether increasing computational scale can truly overcome the generalisation barrier.

Two prominent examples from the 2025 landscape include:

1. MM-IQ (Multimodal Intelligence Quotient): This comprehensive framework was proposed to quantify critical cognitive dimensions in multimodal systems, specifically focusing on abstraction and reasoning. It comprises 2,710 meticulously curated test items spanning eight distinct reasoning paradigms, intentionally designed to stress-test the model’s ability to handle highly abstract and unfamiliar concepts, providing a stringent measure of fluid intelligence.

2. Humanity’s Last Exam (HLE): Created as a multimodal benchmark to push LLMs toward expert-level human knowledge and reasoning, HLE consists of 2,500 questions across numerous subjects. Crucially, the questions were rigorously filtered to eliminate items that could be answered via prompt-memorisation or simple web search, ensuring that success is contingent upon genuine conceptual transfer and OOD generalisation.

The introduction of these cognitive-oriented assessments confirms a consensus within the psychological community: true AGI tests must take into account the complexity and diversity of human intelligence, encompassing crystallised, fluid, social, and embodied intelligence aspects.

An infographic contrasting "Crystallised Intelligence (LLMs)" on the left with "Fluid Intelligence (Humans)" on the right, separated by a red banner stating "GENERALISATION GAP". The LLM side shows a complex neural network, icons for vast knowledge, data knowledge, complex calculations, and Bar Exam success, pointing to "HIGH PERFORMANCE" with a green upward arrow. The Human side features abstract geometric shapes, icons for novel problems, novel reasoning, and abstract transfer, pointing to "LOW PERFORMANCE" with a red downward arrow (in AI context). Below the gap, a question mark asks "WHY AGI ISN'T HERE... YET". The background is a dark blue circuit board pattern.
(📷:empowervmedia)

The 2025 Generalisation Frontier

Empirical investigations into LLM behaviour reveal recurring patterns where the facade of generalised intelligence falters. A common thread in research is the model’s reliance on superficial patterns and statistical correlations inherent in the training data, rather than developing deep, abstract logical rules (a critical limitation termed 'brittle logic'). This reliance means that the system is learning to be a sophisticated predictor within known statistical boundaries, but struggles the moment those boundaries are fundamentally breached.

A stark, undeniable example of this failure to learn invariant properties is the phenomenon known as the Reversal Curse. This finding demonstrates that a model fine-tuned on a statement like “A is B” often fails to generalise the symmetric relationship “B is A” when queried later. This failure to learn a simple, bidirectional property of facts, instead learning a one-way statistical association, serves as definitive evidence against true conceptual generalisation. Similarly, recent studies show that when LLMs are prompted to learn completely novel functions, they often revert to highly similar functions encountered during pre-training. This tendency to default to the familiar indicates a persistent inability to truly build new, flexible, and generalisable cognitive representations.
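The mechanism is easy to caricature. The sketch below is not the Reversal Curse methodology itself, just a cartoon of a purely associative learner (the "facts" and helper names are invented for illustration) that stores only the direction it was trained on and therefore cannot answer the logically equivalent reverse query.

```python
# Toy caricature of a one-way associative learner (fictional facts, illustrative only).
training_facts = [
    ("alara finch", "the author of the midnight atlas"),
    ("orin vale", "the composer of the glass symphony"),
]

# "Training" here simply memorises forward associations A -> B.
forward_memory = {a: b for a, b in training_facts}

def answer(query_entity):
    """Return whatever the learner has associated with the queried entity, if anything."""
    return forward_memory.get(query_entity, "<no association learned>")

# Forward query ("A is ...") succeeds: the association was stored in that direction.
print(answer("alara finch"))
# Reverse query ("... is B, so who is B?") fails: the logically implied link B -> A
# was never stored, because nothing in training required a symmetric representation.
print(answer("the author of the midnight atlas"))
```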

Quantifying the Generalisation Gap

The most definitive quantification of the generalisation gap comes from the application of the newly developed fluid intelligence benchmarks. The results sharply contradict the narrative of unrestricted cognitive growth often associated with computational scale.

Consider the disparity in performance across intelligence domains. In crystallised intelligence tasks, which involve rote knowledge, synthesis, and rule application, models demonstrate high statistical mimicry, excelling in domains like the simulated Bar Exam where they score in the top 10% of test takers. Conversely, when rigorously evaluated on the challenging MM-IQ benchmark (designed specifically to test core competencies in abstraction and reasoning across novel paradigms) leading architectures show poor generalisation. Despite their impressive size, state-of-the-art models achieved performance only marginally superior to random chance, scoring 27.49% accuracy against a 25% baseline. This critical data point demonstrates that when memorisation and simple pattern matching are deliberately filtered out, current AI systems fail to achieve even minimal human proficiency in transferable intelligence.
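A back-of-envelope check puts that figure in perspective. Assuming every MM-IQ item is a four-option multiple-choice question (so chance is exactly 25%, an assumption that may not hold for the real benchmark), 27.49% over 2,710 items is statistically distinguishable from guessing, but the absolute gain is only about two and a half percentage points, far short of the human performance shown in the table below.

```python
from math import sqrt, erf

# Assumptions: 2,710 four-option items, so accuracy under pure guessing is 25%.
n, accuracy, chance = 2710, 0.2749, 0.25

se = sqrt(chance * (1 - chance) / n)          # standard error of accuracy under guessing
z = (accuracy - chance) / se                  # roughly 3 standard errors above chance
p_one_sided = 0.5 * (1 - erf(z / sqrt(2)))    # approximate one-sided p-value

print(f"z = {z:.2f}, one-sided p ≈ {p_one_sided:.4f}")
print(f"absolute gain over chance: {accuracy - chance:.2%}")   # about 2.5 points
```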

A detailed table titled "Model" displays performance scores across various intelligence tests for several Open-Source LLMs and Proprietary LLMs, with "Human Performance" listed at the bottom for comparison. The columns include Mean, LO, Math, 2D-G, 3D-G, VI, TM, SR, and CO, each showing numerical scores. The "Human Performance" row consistently shows significantly higher scores across all categories (e.g., Mean 51.27, LO 61.36, Math 45.03) compared to all listed AI models, which have scores generally in the 20s and 30s. One open-source model, Qwen2.5-VL-7B-Instruct (RL), has its highest scores highlighted in blue for Mean, LO, Math, 2D-G, 3D-G, VI, TM, SR, and CO.
LMMs and Human Performance on MM-IQ (%). Abbreviations adopted: LO for Logical Operation; 2D-G for 2D-Geometry; 3D-G for 3D-Geometry; VI for Visual Instruction; TM for Temporal Movement; SR for Spatial Relationship; CO for Concrete Object. Qwen2.5-VL-7B-Instruct (RL) denotes the RL-trained baseline. (📷:Cai, H., Yang, Y., & Hu, W., 2025)

Another frequently discussed observation relates to the concept of "emergent abilities". These are seemingly unpredictable new capabilities that appear sharply as model scale increases. Recent critical analyses suggest that many so-called emergent abilities may, in fact, be a "Measurement Mirage", appearing due to the researcher’s choice of non-linear or discontinuous metrics, rather than fundamental, instantaneous changes in the model’s intrinsic cognitive capacity. If emergent abilities are merely artefacts of evaluation methodology, it further supports the notion that the path to generalisation requires architectural innovation derived from cognitive theory, not just endless scaling.
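The argument can be illustrated with a few lines of arithmetic (the numbers below are invented for illustration, not real model data): if a model's per-token accuracy improves smoothly with scale but the reported metric is exact match on a ten-token answer, the reported curve hugs zero and then climbs steeply, looking "emergent" even though nothing discontinuous happened underneath.

```python
import numpy as np

# Sketch of the "Measurement Mirage" argument with invented numbers: a smooth underlying
# capability can look like a sudden jump when scored with a discontinuous metric.
scales = np.logspace(0, 4, 9)                          # pretend model sizes, 1 to 10,000
per_token_acc = 0.5 + 0.45 * (np.log10(scales) / 4)    # smooth climb from 0.50 to 0.95

answer_length = 10
exact_match = per_token_acc ** answer_length           # all ten tokens must be correct

for s, tok, em in zip(scales, per_token_acc, exact_match):
    print(f"scale={s:8.0f}   per-token acc={tok:.3f}   exact-match={em:.3f}")
# The per-token column climbs steadily; the exact-match column stays near zero for most
# of the range and then rises sharply, which is easy to misread as a new discrete ability.
```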

Reclaiming the Human Edge

The persistently low scores on fluid intelligence benchmarks like MM-IQ and HLE provide powerful empirical support for the enduring psychological hypothesis put forth by Schank and reinforced by Van der Maas et al.: the core of intelligence lies in generalisation. The analysis confirms that current AI models are optimised for maximising statistical correlation within known data distributions, yet they have not developed the robust, transferable cognitive structures that characterise the human g-factor, or general intelligence.

The findings demonstrate that simply scaling up a statistical prediction machine does not intrinsically yield the flexible wisdom required for true problem-solving. This suggests that the ultimate solution to AGI is not merely computational engineering but a return to fundamental, structured cognitive theory.

Beyond Scaling

If sheer scale fails to guarantee generalisation, then the necessary evolution must be architectural. The scientific community is increasingly looking toward cognitive science-inspired systems (known as Cognitive Architectures) to address this gap. Cognitive architectures, such as ACT-R and SOAR, propose fixed, structured components (like dedicated memory modules and explicit decision-making processes) that dictate how knowledge is acquired and utilised.

Moving computational models toward frameworks that are explicitly "cognitive-oriented" and capable of generalising representations, often incorporating concepts like episodic memory and meta-learning, is considered crucial for future psychological plausibility. By adopting a structured cognitive foundation, AI systems might finally be able to decouple abstract rules from the specific training data, enabling the efficient, transferable reasoning currently exclusive to human intelligence.
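To indicate what "structured components" means in practice, here is a deliberately cartoonish sketch (not ACT-R or SOAR code, and every name in it is hypothetical) of an agent with separate declarative, episodic, and procedural modules and an explicit decision cycle. Note how storing a relation symbolically makes it usable in both directions, the very thing the Reversal Curse shows one-way statistical association failing to do.

```python
from dataclasses import dataclass, field

# Cartoon of the cognitive-architecture idea: fixed, separate modules for declarative
# facts, episodic traces, and procedural rules, with an explicit decision cycle.
@dataclass
class Agent:
    declarative: dict = field(default_factory=dict)    # stable facts, stored symbolically
    episodic: list = field(default_factory=list)       # time-ordered experiences
    procedural: list = field(default_factory=list)     # (condition, action) rules

    def learn_fact(self, a, b):
        # Storing the relation symbolically lets the agent use it in both directions,
        # unlike a one-way statistical association.
        self.declarative[a] = b
        self.declarative[b] = a
        self.episodic.append(("learned", a, b))

    def decide(self, cue):
        # Explicit decision cycle: fire the first procedural rule whose condition matches.
        for condition, action in self.procedural:
            if condition(cue, self.declarative):
                return action(cue, self.declarative)
        return None

agent = Agent()
agent.procedural.append((lambda cue, mem: cue in mem, lambda cue, mem: mem[cue]))
agent.learn_fact("valkyrie prime", "the ninth planet")
print(agent.decide("the ninth planet"))   # answers the reverse query as well
```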

Limitations, Implications, and the Prosocial Imperative

The generalisation gap carries profound real-world implications, transforming the challenge into an ethical and societal imperative. Systems that are brilliant mimics but suffer from brittle logic and OOD failure pose substantial risks, including the perpetuation of bias, the amplification of systemic errors, and the dissemination of convincing misinformation. Therefore, relying solely on AI performance on high-stakes, closed-domain tests provides a dangerously incomplete picture of capability.

This limitation reinforces the critical necessity of a pro-social AI framework, which demands that AI development be guided by human ethical and psychological oversight. This prosocial approach is defined by four essential pillars: solutions must be tailored to specific societal needs, trained on diverse datasets to mitigate bias, tested through rigorous ethical audits, and targeted at solving measurable societal challenges, such as improving global education access or supporting mental health resources. The current deficiencies in AI generalisation transform the pursuit of AGI into an inspirational, shared goal (leveraging computational power to overcome shared community challenges, guided by a robust ethical compass).

A dynamic image depicts a human hand and a robotic hand, both with extended index fingers, nearly touching a glowing, translucent blue human brain icon at the center. Around the brain are various hexagonal icons representing technology, data, AI, and human thought. The background is a dark, futuristic cityscape with digital lines and circuitry, suggesting advanced technology and connectivity. The overall color scheme is blue and black, highlighting the interaction between human and artificial intelligence.
When LLMs display intelligence, they are primarily demonstrating a mastery of knowledge retrieval and synthesis. (📷:fity.club)

The journey toward understanding and mitigating the generalisation illusion requires a massive parallel investment in computational cognitive science. Psychologists are uniquely positioned to spearhead this endeavour, providing the foundational theories necessary for the next generation of AI architecture.

Future research must prioritise two crucial areas:

First, the development of sophisticated, psychologically sound AGI tests that encompass the full diversity of human intelligence (fluid, crystallised, social, and embodied). These benchmarks must be resistant to prompt-memorisation and demand genuine, flexible transfer of knowledge.

Second, the proactive integration of AI ethics and responsible use into advanced psychology curricula. The next generation of scholars who successfully bridge the theoretical divide between cognitive psychology and computational implementation are not simply desirable; they are essential to achieving a truly beneficial, reliable, and generalised Artificial General Intelligence. This convergence of disciplines represents the most exciting and critical pursuit for advanced research degrees, offering the opportunity to define not just what machines can be, but what humanity must remain: the irreplaceable source of creative, flexible, and generalised wisdom.

 
