To assess our approach, we evaluate the agent's ability to complete decision-making tasks in AlfWorld environments and knowledge-intensive, search-based question-and-answer tasks in HotPotQA environments.
We observe success rates of 97% and 51%, respectively, and provide a discussion on the emergent property of self-reflection.
Recent advancements in decision-making large language model (LLM) agents have demonstrated impressive performance across various benchmarks.
However, these state-of-the-art approaches typically necessitate internal model fine-tuning, external model fine-tuning, or policy optimization over a defined state space.
Implementing these methods can prove challenging due to the scarcity of high-quality training data or the lack of a well-defined state space.
Moreover, these agents do not possess certain qualities inherent to human decision-making processes, specifically the ability to learn from mistakes.
Self-reflection allows humans to efficiently solve novel problems through a process of trial and error.
Building on recent research, we propose Reflexion, an approach that endows an agent with dynamic memory and self-reflection capabilities to enhance its existing reasoning trace and task-specific action choice abilities.
To achieve full automation, we introduce a straightforward yet effective heuristic that enables the agent to pinpoint hallucination instances, avoid repetition in action sequences, and, in some environments, construct an internal memory map of the given environment.
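A minimal sketch of what such a heuristic could look like is given below. It assumes a trajectory is recorded as a list of (action, observation) pairs and flags a trial either when the same pair repeats consecutively (a possible sign of hallucination) or when the trajectory exceeds a step budget (inefficient action execution). The function name `should_reflect` and both thresholds are illustrative assumptions, not values stated in this section.

```python
from typing import List, Tuple

# One step of a trajectory: (action issued by the agent, observation returned).
Step = Tuple[str, str]


def should_reflect(
    history: List[Step],
    max_repeats: int = 3,   # assumed threshold, chosen for illustration only
    max_steps: int = 30,    # assumed step budget, chosen for illustration only
) -> bool:
    """Return True when the trajectory looks stuck or overly long."""
    # Inefficient execution: the trajectory has exceeded the step budget.
    if len(history) > max_steps:
        return True

    # Suspected hallucination: the last `max_repeats` steps are identical,
    # i.e. the same action keeps producing the same observation.
    if len(history) >= max_repeats:
        tail = history[-max_repeats:]
        if all(step == tail[0] for step in tail):
            return True

    return False
```

For example, three consecutive `("go to desk 1", "Nothing happens.")` steps would trigger the check, while alternating between two different steps would not.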
Mastering decision-making and knowledge-intensive search tasks in novel environments is a crucial skill set for large-scale natural language agents. LLMs such as OpenAI’s GPT-3 (Brown et al., 2020), Google’s PaLM (Chowdhery et al., 2022), and others have achieved impressive results on various benchmarks (Kaplan et al., 2020; Rae et al., 2021; Nakano et al., 2021; Kojima et al., 2022; Ouyang et al., 2022; Chung et al., 2022).
These models exhibit human-like abilities to understand tasks in given environments, marking significant progress in the field of natural language processing.
Grounding complex tasks in natural language allows agents to overcome high syntactic barriers that may result in false-negative errors.
However, learning optimal policies for natural language RL agents is challenging due to vast and mostly unbounded state spaces.
Several decision-making approaches have been proposed to enable natural language agents to select their next action without a learned policy in text-based environments. Chain-of-thought (CoT) reasoning leverages emergent properties such as reasoning and commonsense to solve tasks in a single action that is reasoned through over several steps (Huang et al., 2022a; Wei et al., 2022b).
However, the accuracy of these approaches decreases as the number of required subtasks increases, because the model is more prone to hallucinate over longer sequences. ReAct (Yao et al., 2023) is an approach that utilizes emergent properties in LLMs, such as verbal reasoning traces, to solve problems by allowing the agent to reason and act, achieving substantial performance on various text-based benchmarks.
In addition, several recent works have aimed to allow natural language agents to exhibit reflective-like qualities to infer more intuitive future actions.
The Describe, Explain, Plan, and Select (DEPS) approach uses multi-step reasoning and sub-task error correction to solve long-range tasks (Wang et al., 2023).
DEPS demonstrates impressive performance due to its ability to explain mistakes in sub-tasks within trials, but relies on immediate failure detection for subtasks and cannot explain mistakes that may have developed over a long range of actions and subtasks.
Huang et al. (2022b) use inner monologue to further process next decisions within closed-loop feedback environments.
Huang et al. (2022b) use a success detection approach in which the agent explicitly knows whether an executed action has led to a successful state. Huang et al. (2022a) and Haluptzok et al. (2022) use self-generated solutions to fine-tune an LLM to improve performance without access to a labeled dataset.
Although these approaches have achieved remarkable accuracy across various decision-making tasks or knowledge-intensive tasks, they lack the ability to utilize success detection cues to improve their behavior over long trajectories. In addition, they often succumb to common mistakes, such as repetitive action choice, cyclic hallucination, or random action choice.
In other words, while these methods achieve state-of-the-art results, a small subset of tasks remains unsolved due to the agent’s inability to learn from its own mistakes over long trajectories to correct future action sequence planning and execution.
To address common failure points, human-in-the-loop (HITL) approaches have been commonly used to improve performance (Fan et al., 2022; Wu et al., 2022). Yao et al. (2023) briefly explore a HITL approach to redirect the agent’s reasoning trace after erroneous actions.
While this approach achieves improved performance with minimal human intervention, it is not fully autonomous because it relies on human trainers to monitor trajectories at each time step.
Large-scale LLMs have been shown to exhibit advanced human-like qualities that enable natural language agents to solve tasks in more intuitive ways (Wei et al., 2022a).
We hypothesize that LLMs possess an emergent property of self-reflection and could effectively utilize self-optimization grounded in natural language if given the opportunity to autonomously close the trial loop.
To test our hypothesis, we equip an LLM-based agent with a self-reflective LLM and a simple heuristic for detecting hallucination and inefficient action execution in an approach named Reflexion.
We then challenge the agent to learn from its own mistakes on the AlfWorld text-based benchmark (Shridhar et al., 2021) and the HotPotQA question-answering benchmark (Yang et al., 2018).
This results in improved performance in decision-making and knowledge-intensive tasks. When combined with the ReAct problem-solving technique (Yao et al., 2023), self-reflection guides the Reflexion agent to achieve a 97% success discovery rate on the AlfWorld benchmark in just 12 autonomous trials, outperforming the base ReAct agent, which reaches an accuracy of 75%. We also evaluated a Reflexion-based ReAct agent on 100 questions from HotPotQA.
The agent achieved a 51% success discovery rate by iteratively refining its content search and content extraction using advice from its memory, outperforming a base ReAct agent by 17%.
It is essential to emphasize that Reflexion is not designed to achieve near-perfect accuracy scores; instead, its goal is to demonstrate learning through trial and error to enable discovery in tasks and environments previously considered nearly impossible to solve.
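The trial-and-error procedure described above can be sketched as a simple loop. This is an illustrative outline under stated assumptions rather than the actual implementation: `run_trial` and `reflect` are hypothetical callables standing in for the acting agent and the self-reflective LLM, the memory is assumed to be a plain list of natural-language reflections, and the only feedback signal is a binary success flag. The default of 12 trials mirrors the AlfWorld setting reported above.

```python
from typing import Callable, List, Tuple

# A trajectory is a list of (action, observation) pairs.
Trajectory = List[Tuple[str, str]]


def reflexion_loop(
    run_trial: Callable[[List[str]], Tuple[bool, Trajectory]],  # hypothetical agent
    reflect: Callable[[Trajectory], str],                       # hypothetical reflector
    max_trials: int = 12,
) -> Tuple[bool, int, List[str]]:
    """Retry a task, adding one self-reflection to memory after each failure."""
    memory: List[str] = []
    for trial in range(1, max_trials + 1):
        # The agent attempts the task with all reflections gathered so far.
        success, trajectory = run_trial(memory)
        if success:
            return True, trial, memory
        # Binary failure signal: summarize what went wrong and remember it.
        memory.append(reflect(trajectory))
    return False, max_trials, memory


# Toy stand-ins: the "agent" only succeeds once it holds two reflections.
def toy_run_trial(memory: List[str]) -> Tuple[bool, Trajectory]:
    return len(memory) >= 2, [("look", "Nothing happens.")]


def toy_reflect(trajectory: Trajectory) -> str:
    return f"Avoid repeating '{trajectory[-1][0]}'."


solved, trials_used, notes = reflexion_loop(toy_run_trial, toy_reflect)
```

In this toy run the task is solved on the third trial, after two reflections have been stored. In a real setting, `run_trial` would wrap a ReAct-style agent acting in the environment, and `reflect` would prompt an LLM with the failed trajectory.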
We proposed an approach that allows natural language agents to learn from past mistakes and redirect future decisions in planning sequences, which removes the need for the human trainer in a human-in-the-middle approach.
We demonstrated learning curves on the AlfWorld and HotPotQA benchmarks that significantly outperform base ReAct agents.
In addition, we include an inconclusive attempt to improve performance on the WebShop benchmark and provide a discussion that highlights a few limitations of this approach.
Reflexion is a highly applicable method to improve performance between trials on decision-making and knowledge-intensive tasks due to its sole dependence on a binary reward model. In the AlfWorld and HotPotQA experiments, we constrained the reward model to imitate environments in which informative reward models may be difficult to design or compute.
We encourage others to apply Reflexion to more complex tasks in which the agent must learn to develop new ideas, explore larger unseen state spaces, and form more accurate plans of action through its experiences in past environments.