Our Techniques for Building LLM-Powered Autonomous Agents

Matt Payne | Patrick Hennis
October 26, 2023
Planning via LLMs
Architectural framework for LLM-powered autonomous agents (Source: Wang et al.)

Large language models (LLMs) like ChatGPT excel at language tasks like creative writing and summarization because of the volume of text data they're trained on.

Now, researchers are discovering that the deep understanding of language that LLMs have, combined with knowledge about the real world present in their training data, makes them impressively capable at language-adjacent problems like reasoning, task planning, decision-making, and action selection — the very capabilities needed for fully autonomous agents.

In this article, we explore various techniques you can use to build LLM-powered autonomous agents. Through case studies, we also study the design and operations of some actual agents.

What Are LLM-Powered Autonomous Agents?

LLM + context framework for agents

LLM-powered autonomous agents are software components whose logic and state are controlled by an LLM and demonstrate these traits of autonomy:

  • Ability to plan: These agents accept high-level instructions or goals in natural language and break them down into smaller tasks, plan their sequence of execution, evaluate their results, and refine them toward satisfying the instruction or goal as correctly as possible.
  • Ability to use tools: These agents demonstrate the ability to understand software call syntax and semantics, select the software tools they need for any task, and run them by supplying syntactically and semantically sound parameters. These tools may be other LLM-based agents, non-LLM-based artificial intelligence agents, external application programming interfaces (APIs), retrievers from private data sources, or just simple functions that execute some data processing logic.
  • Ability to use contextual information: Lastly, such agents can adapt their planning and actions based on contextual information present in the prompts (also called in-context learning) or in external data sources like vector databases (known as retrieval-augmented generation).

In the following sections, we’ll take a look at how we implement SOTA frameworks and techniques to create LLM systems that complete tasks on their own.

How to Implement Task Decomposition

Autonomous agents must be capable of decomposing a high-level instruction or goal into a logical set of tasks in the right order. We explore some simple as well as complex techniques to do that.

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting is one of the simplest task decomposition techniques because it neither requires any special training of the LLM nor involves any additional software.

All it involves is a simple instruction like "think step by step" or "let's think step by step" added to the user prompt or system prompt. Most LLMs respond to it by breaking down the goal into a logical sequence of steps.

Despite its simplicity, CoT turns out to be surprisingly effective for simple goals.

In this example, we provide a system prompt to GPT-4 explaining the tools available and ask it to use CoT for task decomposition:

A system prompt with tool descriptions and CoT reasoning
A system prompt with tool descriptions and CoT reasoning

The tools are deliberately listed in a random order to prevent the LLM from being influenced by their sequence in the system prompt.

As shown below, CoT forces the LLM to break down the instruction into a logical sequence of tasks using the tools we have described:

Task decomposition with CoT

The reasoning and sequence of steps are correct. The only missing aspect here is more structured invocations of the tools that a script can process programmatically. It's easy to add that too, but for now, we're only interested in checking how well CoT works.

If the "think step by step" CoT instruction is removed, the steps are not always as logical:

Task decomposition without CoT
Task decomposition without CoT

How We’ve Used CoT Prompting

CoT prompting is a great framework for adding a conversational frontend to other generative ai tasks to make a more “agent like” workflow. The planning nature of the output allows us to walk through multi-step chatbot inputs to ensure we perform each task and cover our grounds. The rather simple prompt structure of CoT makes it easy for LLMs to understand how we’ve outlined our task and follow few-shot examples.

Chatbots that require code generation are one of the best use cases for chain-of-thought. We’ve used this framework to generate code queries, run the code that is generated step by step, and evaluate the results to ensure a quality output relative to the provided query. This breaking up of the code required into multiple steps makes it easier to follow the logic and allows you to use simpler queries that are easier for the model to generate. You can even work <THINK> bubbles into the plan generation to add a bit of language understanding to why you are generating the specific code and how it relates to the original query. Here’s an example of what the format looks like for generating a plan with thought sections that describe why we perform this step.

Plan generation with GPT-4

While Chain-of-thought is the most simple way to set up LLM agents, its a really valuable addition to these chatbots that allow you to autotomize the workflow to understand what to do. Adding error messaging or other return results make it so code can just check the results that come back and perform any required operations on its own.

Tree of Thoughts Reasoning

ToT vs. CoT (Source: Yao et al.)
ToT vs. CoT (Source: Yao et al.)

Tree of thoughts (ToT) reasoning extends CoT to achieve better reasoning toward goals where exploration, strategic lookahead, or criticality of initial decisions matter. On some tasks, ToT achieves a success rate of 74% compared to CoT's 4%.

ToT improves decision-making by:

  • Considering multiple reasoning paths
  • Self-evaluating its choices to decide the next course of action
  • Looking ahead or backtracking as needed to make global choices

ToT perceives any instruction or goal as a tree of possibilities and searches for the best possible path through that tree. Essentially, it does a tree search using common algorithms like depth-first or breadth-first search.

ToT starts by decomposing the instruction into several intermediate thought states. Each thought state is a node in the tree that represents a thought and was the result of traversing a particular path in the tree. Next, it generates multiple thoughts from each thought state. It then evaluates each thought using a classifier or voting.

An example of ToT applied to a creative writing task is shown here:

ToT example for creative writing task (Source: Yao et al.)
ToT example for creative writing task (Source: Yao et al.)

How We’ve Used Tree of Thoughts

Tree of Thoughts is best known as a prompting framework for creative writing and complex math related tasks. The combination of prompt based planning and the look ahead mechanisms solves some of the common issues seen in replicating human level long form tasks that require a bit of understanding of what has already been done and what will be done in the future for any given task.

We’ve used this in our long form content generation agents that focus on generating long form content (blog posts), refining and editing them, adding required images and alt text, and publishing them out. One of the key issues customers see with blog post generation is the models desire to repeat information already said in the sections above. This reads poorly as it looks like someone who didn’t write the sections above is now writing this one. It commonly comes out as very basic information that the reader doesn’t need to read again, and sometimes adds a “summary” paragraph to the end of each generation.

Tree of Thoughts cleans this up quite a bit by providing the entire plan for the blog post to the model as it generates the sections. Some prompting to connect these two concepts together makes sure the model understands that specific topics and ideas will be covered later in the blog based on the plan and the outline the plan generates. The plan also helps to make sure the blog post covers all required topics and doesn’t miss anything. This greatly expands the length of our blog posts and has allowed us to generate blogs that reach 2500+ words without repeating information unless required in a conclusion style section.

How to Implement Self-Introspection and Task Refinement

For better reasoning and task decomposition, the LLM should be able to evaluate its reasoning and backtrack or refine its steps if necessary. In this section, we review techniques to help do that.

Chain-of-Thought With Self-Consistency (CoT-SC)

Plain CoT vs. CoT with self-consistency (Source: Wang et al.)
Plain CoT vs. CoT with self-consistency (Source: Wang et al.)

This technique aims to improve upon CoT using a metric of self-consistency.

As we know, LLMs can generate multiple results for a prompt. In regular CoT, we normally just pick the first result as the answer. Instead, we can pick multiple results using top-k or similar sampling techniques.

CoT-SC intuitively recognizes that there must be some ground truth answer for each prompt. If we rerun the LLM multiple times and sample from its top-k results rather than just the first result, a significant number of results will be similar to one another and semantically close to the ground truth answer. This concept of similarity between results is called self-consistency.

CoT-SC samples multiple reasoning paths following a "think step by step" instruction. Then, instead of only taking the first result, it searches for the most consistent answer by marginalizing out the sampled reasoning paths.

The final answer is then decided by majority voting.

Graph of Thought Reasoning

GoT vs. other decomposition techniques (Source: Besta et al.)
GoT vs. other decomposition techniques (Source: Besta et al.)

Graph of thoughts (GoT) goes beyond ToT's tree model by interpreting each output from an LLM as an arbitrarily interconnected graph with complex operations (branches) like aggregating multiple thoughts into a new one or looping over a thought to refine it.

The full GoT framework (Source: Besta et al.)
The full GoT framework (Source: Besta et al.)

GoT can be used for use cases like document merging. Besta et al. give the example of creating a new legal agreement by combining legal clauses from other existing agreement documents. If you use GoT to implement an autonomous agent for agreement creation, the existing suitable agreements and relevant clauses are first fetched by suitable data fetching tools after which the merging is done using GoT.

Chain of Hindsight

Chain of hindsight (Source: Liu et al.)
Chain of hindsight (Source: Liu et al.)

Chain of hindsight is an LLM fine-tuning technique that combines supervised fine-tuning and reinforcement learning with human feedback but without RL.

Instead, the human feedback for a multi-turn conversation is converted to natural language text and directly used as additional training sequences for supervised fine-tuning of the LLM. The hindsight in the name refers to all the feedback provided during the conversation.

How to Implement Tool Use and Actions

Autonomous agents must have the ability to run other tools and services to either directly achieve the given goals or obtain additional information necessary to achieve them. In this section, we review techniques to add tool use and actions to LLMs.

Our Most Used Framework - ReAct

ReAct example (Source: Yao et al.)
ReAct example (Source: Yao et al.)

ReAct combines reasoning and action as interleaving sets of thoughts, actions, and observations.

ReAct works by providing few-shot examples that demonstrate reasoning, action selection, and observations following the action as shown in the example above.

A ReAct prompt example framework
A ReAct prompt

The action descriptions are similar to the CoT example above but more structured so that we can process the output.

If you're using GPT-4, the action specification and detection can be made more structured and reliable using the OpenAI function-calling APIs.

Using the API, specify actions and their information in detail as shown below:

OpenAI function calling API example (Source: OpenAI)
OpenAI function calling API example (Source: OpenAI)

The LLM uses this information — the function descriptions, parameter descriptions, and so on — to decide when and how to invoke your actions. The LLM's response contains structured details on how your script must invoke a selected action.

I wrote an entire guide to using ReAct for common use cases like chatbots and document summarization as the framework's ability to use various tools inline with the result generation is perfect for these use cases. The evaluation step at the end helps make sure the results actually make sense given the provided task and query. ReAct is my favorite framework for autonomous agents as building workflow examples for training and prompting are a breeze and allow you to refine the agents ability and understanding of the task down to the very line. I highly recommend this framework in chatbots that leverage external APIs and indexed knowledge bases in the same tool as the management of multiple external data sources relative to the input query can be challenging. This same workflow can be used to attach this chatbot framework to other systems to create a full autonomous agent.

Simply provide the prompting framework with access to different tools with a description of what the tool should be used for and the action step will manage accessing and leveraging the tools. Here’s a look at our banking chatbot architecture that accesses multiple external API services such as an interest rate calculator, a customer profile database, and Mortgage value calculator.

Banking Chatbot with ReAct


Textual reinforcement in Reflexion (Source: Shinn et al.)
Textual reinforcement in Reflexion (Source: Shinn et al.)

Reflexion achieves self-introspection and refinement by applying reinforcement learning (RL) to LLMs. But instead of updating model weights like regular RL, it provides textual feedback that can be included in subsequent prompt contexts to improve the actions and results.


ToolFormer finetunes an LLM to decide if a tool must be used, select a suitable tool, and invoke it with the correct syntax, semantics, and arguments.

An example conversation using ToolFormer is shown below:

Question-answering session with a ToolFormer-trained LLM
Question-answering session with a ToolFormer-trained LLM (Source: Schick et al.)

Finetuning can be more reliable than in-context learning but also less flexible. If your agent addresses a narrow problem for which it needs only a limited number of predetermined tools, use ToolFormer. However, if you need the LLM to learn a new tool, you must fine-tune it again.

On the other hand, if your LLM needs access to a wide and easily extensible set of tools in the future, use an in-context technique like ReAct.

One of the reasons we use this framework is the inline nature of the tool usage when generating the response. Most of these systems use the tool, generate a result, then generate the full language result by combining the information. ToolFormer uses the tools right inline with the natural language result. This makes it easier to tweak the conversational tone, length, and style.

How to Implement Contextual Memory

Context determines the planning and action decisions of the LLM.

The first type of context consists of the LLM's own weights. They form a type of contextual memory that is non-volatile and usually read-only. They represent all the text sequences and surrounding contexts that the LLM saw during its training. Fine-tuning and reinforcement learning are typically used to modify this memory.

A second type of context consists of all the information present in the user and system prompts. This is more volatile memory (analogous to stack memory in programming) that's only used for the duration of a single request to the LLM and then forgotten by the LLM. The technique of providing relevant information for a single request is called in-context learning.

A third type of context involves storing useful information in an external database and supplying it to the LLM on demand (analogous to disk storage). When the LLM receives a new prompt, it looks for relevant information in the database, retrieves it, and injects it into the prompt. This technique is called retrieval-augmented generation (RAG).

The typical RAG implementation is:

  1. Derive embeddings for relevant documents, document fragments, or relevant few-shot examples. Either a local framework like SentenceTransformers or a managed API like OpenAI Embeddings is used to derive these vector representations. We recommend SentenceTransformers as its open source nature makes it possible to fine-tune the embeddings to be domain specific. This greatly increases the accuracy of the search for specific use cases. I wrote a whole guide on it here.
  2. Store the embeddings in a vector database like Weaviate or a vector database service like Pinecone.
  3. When a new prompt is received, derive an embedding for it too.
  4. Using the database's vector similarity search, look up stored embeddings that are semantically similar to the prompt.
  5. Fetch the text of the documents, fragments, or examples associated with those matching embeddings.
  6. Alternatively, instead of using embeddings, an action in the LLM may trigger a data fetch request and get relevant text that way.
  7. Inject that text into the prompt as context.
  8. The LLM applies the instruction in the prompt to that context.

End-to-End Frameworks for LLM-Powered Autonomous Agents

There are helper frameworks that already implement most of the moving parts explained in previous sections. You just have to customize a few bits for your specific autonomous agent. In this section, we review some of these existing frameworks.


Implementing any of the above techniques typically involves stringing together logic that interacts with diverse AI libraries like PyTorch and Transformers, client libraries for external APIs like Wikipedia, and different database technologies like FAISS or MySQL.

One problem is that such logic can be brittle — fine for a proof of concept but unreliable for a production system. Another problem with it is that it's usually not fully reusable across multiple agents; you're forced to reinvent the wheel to some extent for every agent you need.

A better approach is to treat this entire paradigm as a specialized domain, the domain of LLM programming, and come up with a domain-specific language (DSL). In such a DSL, all these techniques like RAG or self-reflection become first-class citizens that can be expressed directly by name rather than indirectly in programming code.

DSPy is a framework that provides an improved approach to programming LLMs using a domain-specific language with all the common LLM techniques as first-class citizens.

A DSPy example for RAG-based question-answering is shown below:

DSPy example code
DSPy example (Source: DSPy)


AutoGPT is both a general-purpose autonomous agent you can run standalone as well as a framework for implementing new tools. Its configurable settings make it very powerful for both purposes. With simple configuration settings, you can enable an entire ecosystem of tools for:

  • Web search
  • News search
  • Text to speech
  • Image generation
  • Image captioning
  • Video summaries
  • …and more

AutoGPT Demo

This demo shows how AutoGPT can answer complex questions by combining reasoning and tools like web search.

We gave it the following complex task involving legal and medical compliance: "Summarize the key provisions of the California Consumer Privacy Act relevant to a medical assistance app."

In response, AutoGPT first set a role for itself as a legal expert and decomposed the task into this set of logical goals:

Next, it thought and reasoned about the goals to create a plan of action:

It subjected its reasoning and plan to a round of self-introspection:

As its first action, it searched the web to get relevant knowledge (using its web search implementation based on the DuckDuckGo search engine):

However, after fetching this information, it repeated the same search a couple of times, apparently stuck in a loop. We intervened with an instruction to stop searching and proceed with the information already fetched:

AutoGPT completed the task by summarizing complex information (in this case, the provisions of a legal statute) relevant to a specific scenario (a medical assistance app):


BabyAGI framework
BabyAGI's algorithm (Source: BabyAGI)

BabyAGI is a reusable autonomous agent implementation that's capable of task planning and mainly useful for retrieval-augmented generation use cases using GPT LLMs. It has built-in support for various vector databases.

LLM-Powered Autonomous Agents for Your Workflows

In this article, we explored state-of-the-art techniques and technologies being used to implement autonomous agents. LLMs were initially meant for natural language processing but they have turned out to be surprisingly good at general problem solving, reasoning, and task planning. You can automate even your complex business workflows to a great extent using just LLMs and natural language instructions.

If you have a novel business idea that you want to test with a proof of concept, you should seriously consider prototyping it using an autonomous agent with an LLM brain. Contact us for help.


  • Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, Karthik Narasimhan (2023). "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv:2305.10601 [cs.CL]. https://arxiv.org/abs/2305.10601
  • Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, Torsten Hoefler (2023). "Graph of Thoughts: Solving Elaborate Problems with Large Language Models." arXiv:2308.09687 [cs.CL]. https://arxiv.org/abs/2308.09687
  • Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou (2022). "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171 [cs.CL]. https://arxiv.org/abs/2203.11171
  • Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, Shunyu Yao (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." arXiv:2303.11366 [cs.AI]. https://arxiv.org/abs/2303.11366
  • Hao Liu, Carmelo Sferrazza, Pieter Abbeel (2023). "Chain of Hindsight Aligns Language Models with Feedback." arXiv:2302.02676 [cs.LG]. https://arxiv.org/abs/2302.02676
  • Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao (2022). "ReAct: Synergizing Reasoning and Acting in Language Models." arXiv:2210.03629 [cs.CL]. https://arxiv.org/abs/2210.03629
  • Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." arXiv:2302.04761 [cs.CL]. https://arxiv.org/abs/2302.04761
  • Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, Ji-Rong Wen (2023). "A Survey on Large Language Model based Autonomous Agents." arXiv:2308.11432 [cs.AI]. https://arxiv.org/abs/2308.11432