Chain-of-Thought Prompting — Improve Accuracy by Getting LLMs to Reason

Karthik Shiraly
August 15, 2023

Large language models (LLMs) sometimes have trouble processing complex questions or elaborate instructions that involve multiple conditional sentences, knotty logic, intricate associations between named entities, or math calculations.

You can drastically improve the accuracy of such tasks using a technique called chain-of-thought (CoT) prompting.

What Is Chain-of-Thought Prompting?

Chain-of-thought prompting is a prompt engineering technique to make LLMs answer complex questions or follow elaborate instructions by first generating a sequence of intermediate reasoning steps in natural language.

Researchers have shown that, with a few special prompts and no additional fine-tuning, you can make LLMs accurately follow elaborate instructions and work out answers to complicated questions simply by asking them to reason step by step toward the right answers.

In the following sections, we survey some fundamental research in CoT reasoning.

Overview of Chain-of-Thought Research

Few-shot CoT was proposed first and zero-shot CoT was demonstrated a few months later. But because zero-shot CoT is easier and you can use it for any task in any domain, we explore it first.

Zero-Shot Chain-of-Thought Prompting

In their 2022 paper, Large Language Models are Zero-Shot Reasoners, Kojima et al. explored a zero-shot CoT approach using appropriate prompts.

To force CoT reasoning, they proposed a simple, two-step prompting sequence:

  1. Reasoning prompt: The first step is a prompt containing the problem or question and a trigger instruction that makes an LLM generate a chain of thought. By comparing the metrics of candidate prompts, they found that the prompt that works best is: "Let's think step by step." The LLM generates a chain of thought in response to it.
  2. Answer extraction: Using another prompt, the second step extracts an answer in a format suitable for the task. This prompt is appended to the chain of thought the LLM just generated, and the LLM completes it with the final answer. For example, the extraction prompt for math word problems is: "Therefore, the answer (Arabic numerals) is."
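Glued together, the two-step sequence can be sketched as follows. Here `llm` is a placeholder for any text-completion call (an API client, a local model, and so on), and the question format is illustrative; the two trigger strings are the ones reported by Kojima et al.

```python
def zero_shot_cot(question, llm):
    # Step 1: reasoning prompt. The trigger instruction makes the LLM
    # generate a chain of thought before committing to an answer.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    chain_of_thought = llm(reasoning_prompt)

    # Step 2: answer extraction. Feed the generated chain of thought
    # back in, followed by a task-specific extraction trigger.
    extraction_prompt = (
        f"{reasoning_prompt}\n{chain_of_thought}\n"
        "Therefore, the answer (Arabic numerals) is"
    )
    answer = llm(extraction_prompt)
    return chain_of_thought, answer
```

Note that the LLM is called twice: once to produce the reasoning and once to condense that reasoning into a usable answer.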

Answer Cleansing Step

The two-step sequence doesn't always yield usable answers. Sometimes, the LLM hallucinates multiple results. When asked to select an answer from multiple choices, it may select more than one.

To overcome such problems, the paper employs an additional answer-cleansing step of applying task- or format-specific manual rules to the generated answers.
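As a minimal sketch of such a rule, here is a cleansing function for numeric tasks that keeps only the first number in the generated answer; the rules in the paper are task- and format-specific, so treat this as one illustrative example.

```python
import re

def cleanse_numeric_answer(raw_answer):
    # Keep only the first number in the generated text; if the LLM
    # hallucinated several candidate results, the later ones are discarded.
    matches = re.findall(r"-?\d+(?:\.\d+)?", raw_answer)
    return matches[0] if matches else None
```

A multiple-choice task would use an analogous rule, such as keeping only the first answer letter that appears.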

In the section on improving CoT prompting, we explore how you can wield more control over malformed answers.

Zero-Shot CoT for Math Problems

Math problems can be quite confusing for LLMs.

Here, standard GPT-3 gets an arithmetic problem wrong:

zero shot cot prompting for math

With zero-shot CoT reasoning, the LLM reasons out the steps correctly:

results with zero shot cot prompting

Zero-Shot CoT for Common Sense Question-Answering

The example below shows a common-sense question regarding the real world:

Zero-Shot CoT for Common Sense Question-Answering

While the answer isn't technically wrong, it's not the best among the available options.

With zero-shot CoT, the LLM gets it right:

real results for common sense question answering

Zero-Shot CoT Results

Zero-shot CoT outperforms the baseline LLM on several tasks, especially in arithmetic and symbolic reasoning. Note, however, that it's not a panacea, and performance on some tasks may actually degrade.

Zero-Shot CoT Results on different datasets

Few-Shot Chain-of-Thought Prompting

CoT prompting was first proposed by Wei et al. in their paper, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Their approach used few-shot prompt engineering for in-context learning. In their prompts, they included several pairs of questions and answers with reasoning chains as examples for the LLM. The examples guided the LLM to use reasoning in its generated answers.

In the sections below, we explore few-shot CoT in action.

Few-Shot CoT for Math Problems

For few-shot CoT on math problems, the researchers supplied examples like the ones below:

few shot cot prompting for math problems
Few-shot in-context learning for math problems (Source: Wei et al.)

The test problems with CoT reasoning are shown below:

Few-shot CoT reasoning for math problems
Few-shot CoT reasoning for math problems (Source: Wei et al.)

Few-Shot CoT for Logical Problems

For logical and common-sense reasoning, few-shot examples like these were provided in the prompt:

Few-shot in-context learning for logical questions

Few-shot CoT reasoning for logical questions
Few-shot CoT reasoning for logical questions (Source: Wei et al.)

Few-Shot CoT Results

The paper demonstrated that few-shot CoT reasoning outperformed baseline performance on several math reasoning datasets:

Few-shot CoT comparison on math problems
Few-shot CoT comparison on math problems (Source: Wei et al.)

The results below show improved performance of few-shot CoT on logical and common-sense questions:

Few-shot CoT comparison on logical and common-sense questions

Why Does Chain-of-Thought Prompting Work?

Are LLMs really reasoning? Does CoT prove that LLMs are on par with sentient intelligence? In this section, we try to develop an intuition for why CoT works. Doing so helps develop insights into adapting these methods for custom workflows.

Basic Working of an LLM

It's useful to start with an understanding of how LLMs work. You can think of an LLM as a black box that produces one text token at each step based on a certain probability distribution.

The factors that influence the probabilities of tokens in each step are:

  • The fixed query, key, and value matrices of the multi-head attention layers, learned during training
  • The fixed weights and biases of the multi-layer perceptrons, residual connections, and layer normalizations, all learned during training
  • The dynamic probabilities of all the earlier tokens in the current input sequence
  • The positional embedding of the current token

At each step, all these are combined to yield an output probability for each token in the vocabulary. The LLM then emits a token, typically the one with the highest probability (greedy decoding) or one sampled from the distribution. The entire process then repeats for the next token, now taking the current step's output into account.

In-context learning happens because the current step's probabilities are influenced by the probabilities of all the tokens that came before it.
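This token-by-token loop can be sketched with a toy scoring function standing in for the transformer forward pass; in a real model, the logits come from the attention and MLP layers described above.

```python
import math

def softmax(logits):
    # Convert raw scores into a probability distribution over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(score_fn, context, vocab, max_new_tokens):
    # score_fn(context) returns one logit per vocabulary entry; it stands
    # in for the transformer forward pass. Greedy decoding always emits
    # the most probable token and appends it to the context, so every
    # later step is conditioned on everything generated so far.
    for _ in range(max_new_tokens):
        probs = softmax(score_fn(context))
        best = max(range(len(vocab)), key=lambda i: probs[i])
        context = context + [vocab[best]]
    return context
```

Because each step conditions on the growing context, tokens placed in the prompt (including a reasoning trigger) steer every subsequent token, which is the mechanism behind in-context learning.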

CoT Does Not Emerge Due to Instruction Following

Does the training provided for instruction following, using reinforcement learning from human feedback (RLHF) or supervised fine-tuning, result in CoT capabilities?

The evidence lets us eliminate this possibility straight away: both zero-shot and few-shot CoT reasoning were demonstrated with standard LLMs like GPT-3, not just their instruction-following variants.

However, it's certainly possible that instruction-following improves the quality of CoT reasoning.

CoT Emerges Only Above a Certain Scale

Another intriguing phenomenon, reported by both Wei et al. and Kojima et al., is that good CoT reasoning apparently emerges only in large LLMs above 100 billion parameters, and it actually reduces the performance of smaller LLMs.

Is CoT Due to Specific Datasets During LLM Training?

Though Wei et al. leave the question unanswered, they do include the possibility that pretraining data may be one of the possible factors.

Combining that with our knowledge of basic LLM operation, one possible explanation for reasoning emerging only in large LLMs is that scale itself has nothing to do with it. It may simply be a coincidence that those larger LLMs also happened to be trained on web or textbook datasets that included step-by-step reasoning.

Final Intuition

LLMs may not necessarily be reasoning. When they receive a step-by-step trigger instruction, the token probabilities get directed toward reproducing the reasoning they saw during training. They may simply be imitating reasoning by generating reasoning-like tokens whose probabilities happen to reflect such datasets.

We can test this hypothesis by training open-source LLMs like LLaMA, DollyV2, or GPT-J on reasoning datasets and testing them with CoT prompts. That's one approach for deliberately adding CoT capabilities to custom self-hosted LLMs for businesses.

Automatic Prompt Engineer for Improved Zero-Shot CoT

Automatic Prompt Engineer for Improved Zero-Shot CoT
APE workflow (Source: Zhou et al.)

We saw earlier that the key step in zero-shot CoT is the trigger prompt, "Let's think step by step." The problem with it is that it's hand-crafted. Kojima et al. tested a few hand-crafted prompts, observed their metrics on different datasets, and selected that particular prompt because it gave the best results.

Intuition tells us that we can surely find better CoT trigger prompts if we can systematically explore the massive space of candidate instructions. Automatic Prompt Engineer (APE) by Zhou et al. is designed to do exactly that: discover better CoT prompts automatically. For this, it again uses the power of LLMs, as explained below.

The APE Approach

APE's automated discovery of CoT trigger prompts goes like this:

  1. Use an LLM to generate suitable CoT trigger prompts for a dataset of input questions and reasoning chains.
  2. Score the prompts.
  3. For the high-scoring prompts, use the LLM to generate alternative trigger prompts using the prompt: "Generate a variation of the following instruction while keeping the semantic meaning." Score the alternatives too.
  4. Select the prompt that scored best.

For the dataset, they selected example questions and generated reasoning chains using regular zero-shot CoT.

To generate the trigger prompts, they used the following prompt template for the LLM:

prompt instructions for ape

Each candidate prompt was scored by the likelihood of the LLM generating the desired reasoning chain when given that prompt.
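The search loop described above can be sketched as follows. Here `generate_fn`, `score_fn`, and `paraphrase_fn` are placeholders for the LLM calls and likelihood scoring used in the paper, not real APIs.

```python
def ape_search(generate_fn, score_fn, paraphrase_fn, n_keep=2, rounds=1):
    # generate_fn() proposes candidate trigger prompts with an LLM.
    # score_fn(prompt) rates a prompt, e.g. by the likelihood of the
    # target reasoning chains when the prompt is used.
    # paraphrase_fn(prompt) asks the LLM for semantically similar variants.
    candidates = generate_fn()
    for _ in range(rounds):
        top = sorted(candidates, key=score_fn, reverse=True)[:n_keep]
        candidates = top + [v for p in top for v in paraphrase_fn(p)]
    # Return the best-scoring prompt across originals and variants.
    return max(candidates, key=score_fn)
```

The key design choice is that the LLM appears on both sides of the search: it proposes and paraphrases candidates, while a scoring function over a small dataset decides which candidates survive.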

APE Results

APE discovered that the best scoring prompt was: "Let’s work this out in a step by step way to be sure we have the right answer."

APE achieves on-par or better accuracy on multiple instruction induction datasets:

accuracy and results of ape vs other methods of prompt optimization
APE accuracy (Source: Zhou et al.)

APE Examples

The table below shows APE's discovered trigger prompts for various tasks:

examples of the results of automatic prompt engineer
APE discovered CoT trigger prompts (Source: Zhou et al.)

Autogenerate Examples for Few-Shot CoT

A major inconvenience and time drain in few-shot CoT is its need for task-specific example pairs of questions and reasoned answers. To streamline that, Zhang et al. proposed a technique to autogenerate the examples in their paper, Automatic Chain of Thought Prompting in Large Language Models.

Their approach combines zero-shot CoT and clustering as explained next.

Clustering to Ensure a Diverse Set of Examples

First, collect a dataset of problems or questions suitable for the task. For example, use the GSM8K dataset for math problems.

A naive approach for selecting examples is to start with a set of test questions and use cosine similarity to find other questions that are similar to each of them. For these test questions, we can generate reasoning chains using zero-shot CoT.
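Cosine similarity here is just the normalized dot product between two question embeddings; a toy version on plain Python lists:

```python
import math

def cosine_similarity(u, v):
    # 1.0 means the embeddings point in the same direction (very similar
    # questions); 0.0 means they are orthogonal (unrelated questions).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm
```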

But the problem there is that if the zero-shot CoT yields a faulty reasoning chain for a selected test question, it's likely to do so for most of the other similar questions. These wrong demonstrations of reasoning mislead the LLM into faulty reasoning.

To avoid that, we must ensure reasonable diversity in the selected example questions. A simple way to do this is:

  1. Convert the questions to embeddings using Sentence-BERT.
  2. Cluster the embeddings using k-means clustering to group them based on contextual similarities.
  3. Use zero-shot CoT or APE to generate a reasoning chain for each question. Apply the two-step prompting explained earlier.
  4. Sample from all the clusters using some selection criteria.

The selection criteria for the questions are simple rules, like:

  • They must be close to their respective cluster centers.
  • They shouldn't exceed some number of tokens (60 in the paper).
  • Their reasoning chains shouldn't exceed some number of steps (five in the paper).

This process yields a diverse set of example questions along with their reasoning chains and final answers.
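A minimal sketch of the per-cluster selection rules follows. The whitespace-split token count and the one-step-per-line convention are simplifications; a real pipeline would use the model's tokenizer and the paper's exact step-counting rule.

```python
def select_demo(questions, distance_to_center, reasoning_chains,
                max_tokens=60, max_steps=5):
    # Walk a cluster's questions from nearest to farthest from the
    # cluster center, and keep the first one whose question and
    # reasoning chain satisfy the size rules.
    for q in sorted(questions, key=distance_to_center):
        chain = reasoning_chains[q]
        n_steps = chain.count("\n") + 1  # one reasoning step per line
        if len(q.split()) <= max_tokens and n_steps <= max_steps:
            return q, chain
    return None
```

Running this once per cluster yields one demonstration per cluster, which is what gives the final few-shot prompt its diversity.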

Autogeneration Results

The results below show that accuracy with autogenerated CoT examples closely matches or exceeds those of hand-crafted CoT examples but with significant time savings:

Accuracy with autogenerated CoT examples
Accuracy with autogenerated CoT examples (Source: Zhang et al.)

Autogeneration Examples

The illustrations below show some autogenerated examples for math problems:

Autogenerated few-shot CoT examples for math problems
Autogenerated few-shot CoT examples for math problems (Source: Zhang et al.)

Here are some autogenerated examples for common-sense questions:

Autogenerated few-shot CoT examples for common-sense questions
Autogenerated few-shot CoT examples for common-sense questions (Source: Zhang et al.)

Let's see CoT reasoning in practice.

Practical Chain-of-Thought Reasoning

We demonstrate some CoT reasoning tasks in common business activities.

Code Generation

We ask an LLM to generate code to split text into sentences. This is non-trivial because sentences can have real numbers, periods inside direct quotations, parentheses, and so on.
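To see why this is non-trivial, compare a naive split with a slightly smarter regex. The regex protects the decimal number but still mis-splits inside the quotation, which is exactly the kind of edge case we want the LLM to reason through:

```python
import re

text = 'The price rose 3.5 percent. She said "Stop. Wait." Then she left.'

# Naive: splitting on every period also breaks the decimal number 3.5.
naive = text.split(".")

# Smarter: split only after sentence-ending punctuation followed by
# whitespace and a capital letter, which keeps 3.5 intact but still
# splits wrongly inside the quotation.
smarter = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
```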

A basic prompt generates the following code:

code generation with gpt-4 and cot

By asking the LLM to reason about it step by step, the generated code is improved:

results with better function for code generation

Interpret Complex Legal Judgments and Documents

Look at this paragraph from a legal judgment:

An excerpt from a legal judgment
An excerpt from a legal judgment (Source: Wiredu)

Legal judgments and contracts are often full of complex sentence constructions, multiple clauses, and legal concepts.

They are difficult to understand for laypersons and time-consuming for legal professionals. Plus, the complex language may result in mistaken interpretations, legal risks, compliance risks, and penalties for businesses.

Using CoT reasoning, GPT-4 can explain such complex legal paragraphs step by step:

CoT reasoning for legalese using GPT-4

Understand Medical Records

Medical records and reports often contain complex, critical information that medical professionals need answers from. LLM-based chatbots can save them time by answering questions based on the information in patients' health and medical records. Confidence in such chatbots will be higher if they can demonstrate the ability to reason while answering questions.

Below is a rather confusing medical investigation report:

A medical investigation report
A medical investigation report (Source: NHS)

We ask an LLM to interpret it using CoT reasoning and answer follow-up questions:

CoT reasoning for a medical report by GPT-4

Some follow-up questions based on the reasoning:

questions that follow up the prompt

Comprehend Real-World Dialogue

In customer service or other business activities involving dialogue, customers or business partners may use sarcasm, jokes, or similar expressions. Automated systems may misinterpret such dialogues during sentiment classification or summarization.

In this example, a reviewer posts a sarcastic comment but the system misunderstands it:

Faulty customer sentiment identification

But using advanced CoT reasoning, you can make it correctly understand the customer's intention:

Correct sentiment identification using advanced CoT reasoning
Correct sentiment identification using advanced CoT reasoning (Source: Sun et al.)

Reasoning for Your Business Workflows With Chain-of-Thought Prompting

Many businesses hesitate to automate their workflows using LLMs due to a lack of confidence in their accuracy, reliability, and repeatability. Just one wrong or ad hoc answer can force costly manual quality checks into a business workflow.

By using chain-of-thought prompting and step-by-step reasoning, you can confidently start using LLMs in your critical business processes. We have the insights you need to use LLMs without losing sleep. Contact us to know more about accurate CoT prompting and reasoning in your business workflows.


  • Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa (2022). "Large Language Models are Zero-Shot Reasoners." arXiv:2205.11916 [cs.CL]. https://arxiv.org/abs/2205.11916
  • Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903 [cs.CL]. https://arxiv.org/abs/2201.11903
  • Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola (2022). "Automatic Chain of Thought Prompting in Large Language Models." arXiv:2210.03493 [cs.CL]. https://arxiv.org/abs/2210.03493
  • Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, Jimmy Ba (2022). "Large Language Models Are Human-Level Prompt Engineers." arXiv:2211.01910 [cs.LG]. https://arxiv.org/abs/2211.01910
  • Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, Guoyin Wang (2023). "Text Classification via Large Language Models." arXiv:2305.08377 [cs.CL]. https://arxiv.org/abs/2305.08377