Today's large language models (LLMs) routinely generate coherent, grammatical and seemingly meaningful paragraphs of text.
This achievement has led to speculation that these networks are -- or will soon become -- "thinking machines", capable of performing tasks that require abstract knowledge and reasoning.
Here, we review the capabilities of LLMs by considering their performance on two different aspects of language use: 'formal linguistic competence', which includes knowledge of rules and patterns of a given language, and 'functional linguistic competence', a host of cognitive abilities required for language understanding and use in the real world.
Drawing on evidence from cognitive neuroscience, we show that formal competence in humans relies on specialized language processing mechanisms, whereas functional competence recruits multiple extralinguistic capacities that comprise human thought, such as formal reasoning, world knowledge, situation modeling, and social cognition.
In line with this distinction, LLMs show impressive (although imperfect) performance on tasks requiring formal linguistic competence, but fail on many tests requiring functional competence.
Based on this evidence, we argue that:
- Contemporary LLMs should be taken seriously as models of formal linguistic skills;
- Models that master real-life language use would need to incorporate or develop not only a core language module, but also multiple non-language-specific cognitive capacities required for modeling thought.
Overall, a distinction between formal and functional linguistic competence helps clarify the discourse surrounding LLMs' potential and provides a path toward building models that understand and use language in human-like ways.
When we hear a sentence, we typically assume that it was produced by a rational, thinking agent (another person).
The sentences that people generate in day-to-day conversations are based on their world knowledge (“Not all birds can fly.”), their reasoning abilities (“You’re 15, you can’t go to a bar.”), and their goals (“Would you give me a ride, please?”).
Naturally, we often use other people’s statements not only as a reflection of their linguistic skill, but also as a window into their mind, including how they think and reason.
In 1950, Alan Turing leveraged this tight relationship between language and thought to propose his famous test [Turing, 1950].
The Turing test uses language as an interface between two agents, allowing human participants to probe the knowledge and reasoning capacities of two other agents to determine which of them is a human and which is a machine.
Although the utility of the Turing test has since been questioned, it has undoubtedly shaped the way society today thinks of machine intelligence [French, 1990, 2000, Boneh et al., 2019, Pinar Saygin et al., 2000, Moor, 1976, Marcus et al., 2016].
The popularity of the Turing test, combined with the fact that language can, and typically does, reflect underlying thoughts, has led to several common fallacies related to the language-thought relationship.
We focus on two of these:
1) The first fallacy is that an entity (be it a human or a machine) that is good at language must also be good at thinking: if an entity generates long coherent stretches of text, it must possess rich knowledge and reasoning capacities.
Let’s call this the “good at language -> good at thought” fallacy.
The rise of large language models [LLMs; Vaswani et al., 2017a, Devlin et al., 2019, Bommasani et al., 2021], most notably OpenAI’s GPT-3 [Brown et al., 2020], has brought this fallacy to the forefront.
Some of these models can produce text that is difficult to distinguish from human output, and even outperform humans at some text comprehension tasks [Wang et al., 2018, 2019a, Srivastava et al., 2022].
As a result, claims have emerged—both in the popular press and in the academic literature—that LLMs represent not only a major advance in language processing but, more broadly, in Artificial General Intelligence (AGI), i.e., a step towards a “thinking machine” (see e.g., Dale 2021 for a summary of alarmist newspaper headlines about GPT-3).
Some, like philosopher of mind David Chalmers, have even taken seriously the idea that these models have become sentient [although Chalmers stops short of arguing that they are sentient; see also Cerullo, 2022].
However, as we show below, LLMs’ ability to think is more questionable.
The “good at language -> good at thought” fallacy is unsurprising given the propensity of humans to draw inferences based on their past experiences.
It is still novel, and thus uncanny, to encounter an entity (e.g., a model) that generates fluent sentences despite lacking a human identity.
Thus, our heuristics for understanding what the language model is doing—heuristics that emerged from our language experience with other humans—are broken.
2) The second fallacy is that a model that is bad at thinking must also be a bad model of language.
Let’s call this the “bad at thought -> bad at language” fallacy.
LLMs are commonly criticized for their lack of consistent, generalizable world knowledge [e.g., Elazar et al., 2021a], lack of commonsense reasoning abilities [e.g., the ability to predict the effects of gravity; Marcus, 2020], and failure to understand what an utterance is really about [e.g., Bender and Koller, 2020a, Bisk et al., 2020].
While these efforts to probe model limitations are useful in identifying things that LLMs can’t do, some critics suggest that the models’ failure to produce linguistic output that fully captures the richness and sophistication of human thought means that they are not good models of human language.
Chomsky said in a 2019 interview (Lex Fridman, 2019):
“We have to ask here a certain question: is [deep learning] engineering or is it science? [...]
On engineering grounds, it’s kind of worth having, like a bulldozer.
Does it tell you anything about human language? Zero.”
The view that deep learning models are not of scientific interest remains common in linguistics and psycholinguistics, and, despite a number of position pieces arguing for integrating such models into research on human language processing and acquisition [Baroni, 2021, Linzen, 2019, Linzen and Baroni, 2021, Pater, 2019, Warstadt and Bowman, 2022, Lappin, 2021], this integration still encounters resistance (e.g., from Chomsky above).
Both the “good at language -> good at thought” and the “bad at thought -> bad at language” fallacies stem from the conflation of language and thought, and both can be avoided if we distinguish between two kinds of linguistic competence: formal linguistic competence (the knowledge of rules and statistical regularities of language) and functional linguistic competence (the ability to use language in the real world, which often draws on non-linguistic capacities).
Of course, language does not live in a vacuum and is fundamentally embedded and social, so the formal capacity is of limited value without being integrated in a situated context [e.g., Clark, 1996, Hudley et al., 2020, Bucholtz and Hall, 2005, Labov, 1978, Wittgenstein, 1953, Grice, 1975, Lakoff, 1972, Clark, 1992].
But even solving the more restricted problem of formal linguistic competence (e.g., what counts as a valid string of a language) is far from trivial and indeed has been a major goal of modern linguistics.
Our motivation for the distinction between formal and functional linguistic competence comes from the human brain.
A wealth of evidence from cognitive science and neuroscience has established that language and thought in humans are robustly dissociable: the machinery dedicated to processing language is separate from the machinery responsible for memory, reasoning, and social skills [e.g., Fedorenko and Varley, 2016a; Section 2].
Armed with this distinction, we evaluate contemporary LLM performance and argue that LLMs have promise as scientific models of one piece of the human cognitive toolbox—formal language processing—but fall short of modeling human thought.
Ultimately, what “pure” LLMs can learn is necessarily constrained both by the information available in their training data and by whether that information is learnable through a word prediction mechanism.
It has turned out that quite a lot of linguistic knowledge, e.g., about syntax and semantics, can be learned from language data alone [Potts, 2020, Merrill et al., 2022, Bommasani et al., 2021], in our opinion far more than most researchers in the field would have guessed 5 or 10 years ago (see Merrill et al. for an argument for how semantic information is in principle learnable from language data, and Piantadosi and Hill for an argument that models can genuinely learn meaning).
The success of these models is a major development, with far-reaching implications.
But LLMs’ success in developing linguistic knowledge by predicting words using massive amounts of text does not guarantee that all aspects of thought and reasoning could be learned that way (although, as we will discuss, some aspects of thought and reasoning can be learned that way provided the relevant information is typically encoded in distributional patterns over words).
By saying that LLMs do not, in and of themselves, model human thought, we are not suggesting that AI approaches which start from building LLMs will necessarily run up against hard limits.
Indeed, at the end of this article, we discuss current modular approaches in which separate architectures or diverse objectives are combined. InstructGPT [Ouyang et al., 2022] and ChatGPT are examples of successes in this vein, in that they combine an LLM with Reinforcement Learning from Human Feedback (RLHF) [Christiano et al., 2017], whereby human feedback is used to iteratively adjust the trained models.
In that sense, they are more than just LLMs and can learn based on more than just what is available in massive amounts of passively observed text.
For our purposes here, we will use the term LLMs to refer primarily to “pure” language models (such as the original GPT-3) that are trained to predict held-out language tokens conditional on the immediate linguistic context, from large corpora of naturally observed language use.
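To make the notion of a word prediction objective concrete, the following is a minimal sketch, not a description of how GPT-3 or any other LLM is implemented. It assumes a toy count-based bigram model in which the "immediate linguistic context" is just the single preceding word; actual LLMs use neural networks conditioned on much longer contexts. The function names (`train_bigram`, `next_word_distribution`, `predict`) and the three-sentence corpus are hypothetical and chosen purely for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the observed text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        # Sentence boundary markers are an assumption of this toy setup.
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, prev):
    """Estimate P(next word | previous word) from relative frequencies."""
    followers = counts.get(prev.lower(), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # context never observed: the model has nothing to say
    return {word: n / total for word, n in followers.items()}


def predict(counts, prev):
    """Return the most probable next word, or None for an unseen context."""
    dist = next_word_distribution(counts, prev)
    return max(dist, key=dist.get) if dist else None


# Hypothetical toy corpus of "naturally observed" sentences.
corpus = [
    "not all birds can fly",
    "some birds can swim",
    "penguins can swim",
]
model = train_bigram(corpus)

print(predict(model, "birds"))    # most frequent continuation of "birds"
print(predict(model, "gravity"))  # unseen context
```

Even in this toy setting, everything the model "knows" comes from distributional patterns in the text it was trained on: it has no estimate at all for a context it never observed, which is the sense in which what a "pure" language model can learn is bounded by its training data and its prediction objective.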
In the rest of the paper, we formulate an account of what we should and should not expect from a model of language and evaluate contemporary LLMs within this framework.
In Section 2, we elaborate on the constructs of formal and functional linguistic competence and motivate this distinction based on the evidence from human cognitive science and neuroscience.
In Section 3, we discuss the successes of LLMs in achieving formal linguistic competence, showing that models trained on word filling-in/prediction tasks capture numerous complex linguistic phenomena.
In Section 4, we consider several domains required for functional linguistic competence—formal reasoning, world knowledge, situation modeling, and social-cognitive abilities—on which today’s LLMs fail, or at least perform much worse than humans.
In Section 5, we discuss the implications of our framework for building and evaluating future models of language and Artificial General Intelligence (AGI).
In Section 6, we summarize our key conclusions.
Dissociating language and thought in large language models: a cognitive perspective