Sparks of Artificial General Intelligence: Early experiments with GPT-4


Given the breadth and depth of GPT-4's capabilities, we believe that it could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.


Artificial intelligence (AI) researchers have been developing and refining large language models (LLMs) that exhibit remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition.

The latest model developed by OpenAI, GPT-4, was trained using an unprecedented scale of compute and data.

In this paper, we report on our investigation of an early version of GPT-4, when it was still in active development by OpenAI.

We contend that (this early version of) GPT-4 is part of a new cohort of LLMs (along with ChatGPT and Google's PaLM for example) that exhibit more general intelligence than previous AI models.

We discuss the rising capabilities and implications of these models. We demonstrate that, beyond its mastery of language, GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more, without needing any special prompting.

Moreover, in all of these tasks, GPT-4's performance is strikingly close to human-level performance, and often vastly surpasses prior models such as ChatGPT.

In our exploration of GPT-4, we put special emphasis on discovering its limitations, and we discuss the challenges ahead for advancing towards deeper and more comprehensive versions of AGI, including the possible need for pursuing a new paradigm that moves beyond next-word prediction.

We conclude with reflections on societal influences of the recent technological leap and future research directions.




Intelligence is a multifaceted and elusive concept that has long challenged psychologists, philosophers, and computer scientists.

An attempt to capture its essence was made in 1994 by a group of 52 psychologists who signed onto a broad definition published in an editorial about the science of intelligence [Got97].

The consensus group defined intelligence as a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly and learn from experience.

This definition implies that intelligence is not limited to a specific domain or task, but rather encompasses a broad range of cognitive skills and abilities.

Building an artificial system that exhibits the kind of general intelligence captured by the 1994 consensus definition is a long-standing and ambitious goal of AI research. In early writings, the founders of the modern discipline of artificial intelligence (AI) research called out sets of aspirational goals for understanding intelligence [MMRS06].

Over decades, AI researchers have pursued principles of intelligence, including generalizable mechanisms for reasoning (e.g., [NSS59], [LBFL93]) and construction of knowledge bases containing large corpora of commonsense knowledge [Len95].

However, many of the more recent successes in AI research can be described as being narrowly focused on well-defined tasks and challenges, such as playing chess or Go, which were mastered by AI systems in 1996 and 2016, respectively. In the late 1990s and into the 2000s, there were increasing calls for developing more general AI systems (e.g., [SBD+96]), and scholarship in the field has sought to identify principles that might underlie more generally intelligent systems (e.g., [Leg08, GHT15]).

The phrase “artificial general intelligence” (AGI) was popularized in the early 2000s (see [Goe14]) to emphasize the aspiration of moving from the “narrow AI” demonstrated in the focused, real-world applications being developed to broader notions of intelligence, harkening back to the long-term aspirations and dreams of earlier AI research.

We use AGI to refer to systems that demonstrate broad capabilities of intelligence as captured in the 1994 definition above, with the additional requirement, perhaps implicit in the work of the consensus group, that these capabilities are at or above human-level.

We note, however, that there is no single definition of AGI that is broadly accepted, and we discuss other definitions in the conclusion section.

The most remarkable breakthrough in AI research of the last few years has been the advancement of natural language processing achieved by large language models (LLMs).

These neural network models are based on the Transformer architecture [VSP+17] and trained on massive corpora of web-text data, using at their core a self-supervised objective of predicting the next word in a partial sentence.
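
To make the training objective concrete, here is a toy sketch in Python. It is an illustration under heavy assumptions, not how GPT-4 or any Transformer is implemented: it splits text on whitespace rather than using subword tokens, and it replaces the neural network with simple bigram counts. All function names are hypothetical. What it shares with the real objective is that the training signal comes from the text itself: every prefix of a sentence is paired with the word that follows it, so no human labels are needed.

```python
from collections import Counter, defaultdict


def next_word_pairs(text):
    """Turn raw text into (context, next_word) pairs; the text labels itself."""
    words = text.split()
    return [(tuple(words[:i]), words[i]) for i in range(1, len(words))]


def fit_bigram(text):
    """Count which word follows each word (a crude stand-in for a learned model)."""
    counts = defaultdict(Counter)
    for context, target in next_word_pairs(text):
        # Assumption: only the last word of the context is used.
        counts[context[-1]][target] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent continuation seen after `word`, or None."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]


corpus = "the cat sat on the mat and the cat slept"
model = fit_bigram(corpus)

print(next_word_pairs("the cat sat"))
print(predict_next(model, "the"))
```

A real LLM conditions on the whole context rather than only the last word, which is the main thing this sketch leaves out.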

In this paper, we report on evidence that a new LLM developed by OpenAI, which is an early and non-multimodal version of GPT-4 [Ope23], exhibits many traits of intelligence, according to the 1994 definition.

Despite being purely a language model, this early version of GPT-4 demonstrates remarkable capabilities on a variety of domains and tasks, including abstraction, comprehension, vision, coding, mathematics, medicine, law, understanding of human motives and emotions, and more.

We interacted with GPT-4 during its early development by OpenAI using purely natural language queries (prompts).

In Figure 1.1, we display some preliminary examples of outputs from GPT-4, asking it to write a proof of infinitude of primes in the form of a poem, to draw a unicorn in TikZ (a language for creating graphics in LaTeX), to create a complex animation in Python, and to solve a high-school level mathematical problem.

It easily succeeds at all these tasks, and produces outputs that are essentially indistinguishable from (or even better than) what humans could produce.

We also compare GPT-4’s performance to that of previous LLMs, most notably ChatGPT, which is a fine-tuned version of (an improved) GPT-3 [BMR+20].


In Figure 1.2, we display the results of asking ChatGPT for both the infinitude of primes poem and the TikZ unicorn drawing.

While the system performs non-trivially on both tasks, there is no comparison with the outputs from GPT-4. These preliminary observations will repeat themselves throughout the paper, on a great variety of tasks.

The combination of the generality of GPT-4’s capabilities, with numerous abilities spanning a broad swath of domains, and its performance on a wide spectrum of tasks at or beyond human-level, makes us comfortable with saying that GPT-4 is a significant step towards AGI.


GPT-4 Prompt:

Can you write a proof that there are infinitely many primes, with every line that rhymes?

GPT-4:

Yes, I think I can,

Though it might take a clever plan.

I’ll start by noting Euclid’s proof,

Which shows that primes aren’t just aloof.

Assume we have a finite list,

Of primes, and that none have been missed.

Multiply them all together,

And add one, just to be clever.

The result will be a number,

That has no prime factors, oh wonder!

But every number has a prime,

That divides it, it’s just a matter of time.

So we’ve found a contradiction,

And our finite list needs eviction.

There must be infinitely many primes,

And that’s the end of my rhyming lines.
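
The argument in the poem is Euclid's construction, and it can be checked on a small case. The sketch below (Python; the function names are illustrative and not from the paper) multiplies a finite list of primes, adds one, and finds the smallest prime factor of the result by trial division. Strictly speaking, the resulting number does have prime factors; what the argument needs is that none of them appears in the original list.

```python
from math import prod


def smallest_prime_factor(n):
    """Return the smallest prime factor of n (n >= 2) by trial division."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n  # no divisor up to sqrt(n), so n itself is prime


def euclid_witness(primes):
    """Given a finite list of primes, return a prime that is not in the list."""
    candidate = prod(primes) + 1
    # Every prime in the list leaves remainder 1 when dividing `candidate`,
    # so any prime factor of `candidate` must be new.
    return smallest_prime_factor(candidate)


primes = [2, 3, 5, 7, 11, 13]
new_prime = euclid_witness(primes)
print(new_prime)  # 30031 = 59 * 509, so this prints 59
```

Note that the product plus one (here 30031) need not itself be prime; it only has to contribute a prime missing from the list.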


Prompt: Draw a unicorn in TikZ.

GPT-4: [Produces LaTeX code that compiles to the following picture.]


Prompt: Can you generate a pyplot for the following data: x = [1, 3, 5, 6, 8], y1 = [2, 3, 5, 18, 1], y2 = [3, 5, 6, 8, 1], y3 = [5, 1, 2, 3, 4], y4 = [9, 7, 2, 3, 1]. I want y1, y2 to be in the same plot, but y3 is in another plot next to that plot, y4 is in below. I want the legend of y1 to be “bob”, y2 to be “alice”, y3 to be “bilbo”, y4 to be “allie”. I want the x-axis to be labeled with “time” and y axis to be labeled with “money”. I want to add a 10%-40% random error bar to each curve, through all times (including non-integers). I want smoothed curves to show the plot, and smoothed error bar. Do not use linear interpolation, use smooth interpolation! I want to also add some small zig-zag to the smoothed curve to make it look more real. I want to put a baseline as the mean of each line. I want to put a pie chart below indicating the fraction of the four people in each time step. I also want an animation to show how the fractions are changing in the pie chart in **continuous time**. Interpolate the missing fractions! I also want vertical line animation in other three plots to match the pie chart. I want the fanciest plot. Please add as many fancy things as possible. Here is a possible code to generate a pyplot for the following data, with some comments to explain the steps:

GPT-4: [We give two snapshots from the animation resulting from the produced code]


Prompt: Suppose g(x) = f^{-1}(x), g(0) = 5, g(4) = 7, g(3) = 2, g(7) = 9, g(9) = 6. What is f(f(f(6)))?

GPT-4: There are different ways to approach this problem, but one possible method is to use the fact that g(x) = f^{-1}(x) implies that f(g(x)) = x for any x in the domain of g.

That is, applying f to both sides of the equation g(x) = f^{-1}(x) gives f(g(x)) = f(f^{-1}(x)) = x.

This means that we can undo the effect of f by applying g, and vice versa. Using this idea, we can try to find f(6) by looking for the value of x such that g(x) = 6. From the given information, we see that g(9) = 6, so f(6) = f(g(9)) = 9.

Similarly, we can find f(9) by looking for the value of x such that g(x) = 9. We see that g(7) = 9, so f(9) = f(g(7)) = 7.

Finally, we can find f(7) by looking for the value of x such that g(x) = 7. We see that g(4) = 7, so f(7) = f(g(4)) = 4. Therefore, f(f(f(6))) = f(f(9)) = f(7) = 4.

Figure 1.1: Preliminary examples of GPT-4’s capabilities in language, vision, coding, and mathematics.
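
The reasoning in the last example of Figure 1.1 can be checked mechanically. The sketch below (Python; the variable and function names are illustrative and not from the paper) stores the given values of g as a dictionary, inverts it to obtain the known values of f, and applies f three times starting from 6.

```python
# Given values of g, where g is the inverse of f.
g = {0: 5, 4: 7, 3: 2, 7: 9, 9: 6}

# Inverting only makes sense if g is one-to-one on the given points.
assert len(set(g.values())) == len(g)

# Since g(x) = y means f(y) = x, swapping keys and values gives f.
f = {y: x for x, y in g.items()}


def iterate(mapping, x, times):
    """Apply `mapping` to x repeatedly, `times` times."""
    for _ in range(times):
        x = mapping[x]
    return x


result = iterate(f, 6, 3)  # f(6) = 9, f(9) = 7, f(7) = 4
print(result)  # 4
```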

ChatGPT Prompt: Can you write a proof that there are infinitely many primes, with every line that rhymes?


ChatGPT: Sure! Here is a proof that there are infinitely many primes, with each line of the proof rhyming:

Consider the set of all primes.

Call this set P and let it start with just two, the first prime.

If we take all the primes in P and add one,

We get a new set of numbers that has never been done.

This set, call it S, is composed of only composite numbers.

No primes will be found among them, that’s for sure.

But we know that there must be at least one prime,

Since S is infinite, and that much is clear.

If we take the smallest number in S that is not in P,

And call it p, we can add it to our set, don’t you see?

But this process can be repeated indefinitely.

Thus, our set P must also be infinite, you’ll agree.

So there you have it, a proof that’s quite solid.

There are infinitely many primes, that much we can see.

No need to be coy, no need to be contrite.

Infinity is a big number, just like P.


Prompt: Draw a unicorn in TikZ.

ChatGPT: [Produces LaTeX code that compiles to the following picture.]

Figure 1.2: ChatGPT output on the first two prompts from Figure 1.1.

Our claim that GPT-4 represents progress towards AGI does not mean that it is perfect at what it does, or that it comes close to being able to do anything that a human can do (which is one of the usual definitions of AGI; see the conclusion section for more on this), or that it has inner motivation and goals (another key aspect in some definitions of AGI).

In fact, even within the restricted context of the 1994 definition of intelligence, it is not fully clear how far GPT-4 can go along some of those axes of intelligence, e.g., planning (see Section 8), and arguably it is entirely missing the part on “learn quickly and learn from experience” as the model is not continuously updating (although it can learn within a session, see Section 5 for example).

Overall, GPT-4 still has many limitations and biases, which we discuss in detail below and which are also covered in OpenAI’s report [Ope23].

In particular, it still suffers from some of the well-documented shortcomings of LLMs, such as the problem of hallucinations [MNBM20] (see Figure 1.8) or making basic arithmetic mistakes [CKB+21] (see Appendix D), and yet it has also overcome some fundamental obstacles, such as acquiring many non-linguistic capabilities (e.g., it solves most of the LLM failure modes described in [MIB+23], and it has also made great progress on common sense; see Figure 1.7 for a first example and Appendix A for more).

This highlights the fact that, while GPT-4 is at or beyond human-level for many tasks, overall its patterns of intelligence are decidedly not human-like.

However, GPT-4 is almost certainly only a first step towards a series of increasingly generally intelligent systems, and in fact GPT-4 itself has improved throughout our time testing it (see Figure 1.3 for the evolution of the unicorn drawing over the course of a month of training).


Even as a first step, however, GPT-4 challenges a considerable number of widely held assumptions about machine intelligence, and exhibits emergent behaviors and capabilities whose sources and mechanisms are, at this moment, hard to discern precisely (see again the conclusion section for more discussion on this).

Our primary goal in composing this paper is to share our exploration of GPT-4’s capabilities and limitations in support of our assessment that a technological leap has been achieved.


We believe that GPT-4’s intelligence signals a true paradigm shift in the field of computer science and beyond.
