Prompting Large Language Models to Plan and Execute Actions Over Long Documents (PEARL)

Matt Payne | Patrick Hennis | Andre Beukers

March 15, 2024

how pearl works — *PEARL overview (Source:* *Sun et al*.)

Using large language models (LLMs) to process long-form content like legal contracts, financial reports, books, and web content brings enormous productivity and efficiency improvements. Unfortunately, it's a problem that's not been completely solved. LLMs still face difficulties and drawbacks when processing lengthy documents.

In this article, we review the kinds of problems LLMs face and explore a technique that prompts large language models to plan and execute actions over long documents.

Problems Faced by LLMs With Long Documents

When using LLMs for long document processing, you often see issues that lead to the development of “workarounds”. Some of these are due to context length limits but you can overcome those to an extent using chunking strategies.

However, there are other semantic-level drawbacks even when the document satisfies token limits. For example, if a particular concept is referred to in many sections due to its connections with other concepts, we find that straightforward prompts often ignore entire connections and sections.

Additionally, Liu et al. showed that the accuracies of LLM-generated answers degrade with longer context lengths and when relevant information is in the middle of long contexts.

In this article, we explore planning and executable actions for reasoning over long documents (PEARL), a technique that addresses such drawbacks through more sophisticated reasoning than naive prompts.

Introduction to PEARL

PEARL is a strategy to reason over long documents using specially crafted prompts and a small number of few-shot, in-context examples.

PEARL combines task-specific decomposition with prompt generation to create specially crafted prompts that help LLMs process complex semantic connections across the document.

Briefly, PEARL consists of three high-level stages:

Action mining: For the given complex task in a particular domain, dynamically decompose the task into a set of actions expressed like programming functions. Each action is accompanied by an instruction for the LLM and the inputs the LLM needs.
Plan generation: Form a plan consisting of these actions in the right order along with inputs and information passing. The generated plan resembles a pseudocode program.
Plan execution: Apply this action plan to the target document by asking the LLM to execute each action's instruction.

The illustration below shows one such generated plan consisting of mined actions:

We first provide some intuition into why PEARL's approach works.

Intuition Behind PEARL

Since PEARL's outputs come across as unnecessarily complicated at first glance, building some intuition may help you implement it better for your use cases.

Instead of sticking to natural language, PEARL surprises you by switching to text that resembles pseudocode or a high-level programming language:

Actions look like functions, with input arguments and output variables.
The generated plan looks a lot like a programming script with a particular grammar.
Finally, the LLM is made to behave like a simulator to run that plan as if it's a program.

But why go to all this trouble? Why not stick to natural language, like chain-of-thought prompting and other strategies? Why simulate a virtual machine to run the plan?

Intuition #1 — PEARL Generates Task-Specific Semantic Pipelines

Here's a key insight that might help: Think of PEARL as a generator of task-specific text processing pipelines.

If you've worked on traditional search or text processing, you'll be familiar with expressing high-level semantic goals using low-level processing pipelines that:

Clean up and preprocess the raw text. Steps like removing stop words, modifying cases, or handling abbreviations are common. They're often domain- and task-specific too.
Find regions of semantic interest using low-level techniques like regular expressions and keyword matching. The regular expressions and keywords are selected by observing patterns in the particular text and may not be reusable across documents.
Extract the information you need from the regions of interest. These tend to be custom rules with plenty of conditional logic suited to the domain and task.
Modify the extracted information to obtain the final answer, again using more conditional logic related to the domain and task.

PEARL essentially automates the generation of such text-processing pipelines. But it goes a step further.

Recognizing that LLMs are inherently better at semantics, PEARL completely ignores low-level approximations like regular expressions. Instead, it directly uses semantic criteria and goals in actions like these examples:

FIND_EMOTION(CTX, X, Y): Find the emotion or feeling X feels towards Y given the input CTX.
FIND_ENDING(CTX, X): Find the ending or conclusion of X's story or the input CTX.
FIND_EVENT(CTX, X): Find the event involving X in the input CTX (e.g., betrayal, change, climax).

Notice how the actions are expressed in terms of high-level semantic concepts like emotions, conclusions, events, and so on.

For example, to detect emotions, most human developers would use regular expressions or string matching to search for phrases like "I/he/she/they felt angry/sad/happy." Some may use better semantic techniques like sentence-level embeddings but use conditional logic for matching. Unlike them, PEARL leverages the strong semantic capabilities of LLMs to directly search for the abstract concept of emotion.

Intuition #2 — Pseudocode Reduces Confusion

But why use pseudocode to represent these pipelines? That's to reduce ambiguity.

When conversing with LLMs like ChatGPT, we refer to earlier contexts using vague expressions like:

"In the article…"
"In your second answer…"
"Earlier you said…"

When a conversation has multiple contexts (which is common with text processing pipelines and long documents), such vague references can confuse or mislead the LLMs. PEARL's formal pseudocode syntax eliminates such ambiguities.

How PEARL Works

PEARL consists of three primary stages but requires some secondary stages to work well.

In the following sections, we get into the mechanics of each stage, especially the input aspects and prompt engineering. But before that, it helps to understand the dataset used by the PEARL paper.

The QuALITY Long-Context Question-Answering Dataset

Though PEARL is a general-purpose prompting approach, the generated actions are specific to the domain, training dataset, and task. In particular, the PEARL research paper and code are customized for a question-answering dataset called the QuALITY dataset. Since the sections below show GPT-4 prompts and results generated from this dataset, let's get to know it first.

QuALITY is a human-annotated comprehension dataset based on long-form content. It contains fiction and nonfiction content from Project Gutenberg, Slate Magazine, and other sources.

Each article is accompanied by a set of related human-written questions with multiple-choice answers and best answers. The questions tend to be complex and can't be answered trivially by skimming or searching keywords.

PEARL makes use of QuALITY's articles and questions. However, instead of selecting one of the given answers, it generates a detailed long-form answer using information and relationships from all parts of the article.

Stage 1 — Action Mining

In the action mining stage, the LLM is asked to generate a comprehensive set of actions that are likely to help the task at hand. It's supplied with the following information:

A seed set of human-written actions
Detailed instructions on what the LLM should do
Few–shot examples that demonstrate action creation
The target question for which the LLM must create an answering plan using a plausible set of necessary actions. However, the focus at this stage is not the plan itself but the actions and action descriptions that constitute that plan.

This prompt is repeated for each question in the training set.

The prompt template for action mining is illustrated below.

Prompt template for action mining (Source: Sun et al.) — *Prompt template for action mining (Source:* *Sun et al*.)

Since action mining is the most important stage in PEARL, a detailed explanation for each component of the prompt follows.

Seed Set of Actions

The main purpose of the seed set is to demonstrate the pseudocode syntax for actions using function names, function call syntax, function descriptions, and arguments.

They are mostly general-purpose text processing or semantic actions, like concatenating text or finding relevant information. However, some, like FIND_MORAL, can be dataset-specific.

Action Mining Instructions

These instructions guide the LLM on using existing actions and, if necessary, generating new actions. Plus, they demonstrate the desired format for the results.

‍

Few-Shot Examples for Action Mining

Using sample questions and actions, the few-shot examples clarify what the LLM must do. Crucially, these must be relevant to the domain and task at hand.

*Few-shot demonstrations for action mining*

Notice how, unlike the seed set, the action arguments and output variables are sometimes actual examples relevant to the dataset rather than generic terms.

Input Question for Action Mining

The last component of the prompt is the training set question for which we want the LLM to discover plausible new actions while formulating an answering plan.

Example Results of Action Mining

The illustration below shows some results from action mining the QuALITY training dataset.

In this example, the LLM discovered that it needs a new action, FIND_FEELINGS, to answer question 22.

Stage 2 — Action Consolidation and Refinement

Action mining is run on every question in the training dataset. In QuALITY's case, we're talking about 2,100+ questions. With such a large number of questions, some generated actions may be faulty or duplicated as shown below:

PEARL gathers the results of action mining all the training questions. It then relies on the semantic strengths of the LLM to find duplicated or faulty actions and produce a final set of refined actions. This is the prompt for action refinement:

The final consolidated set of actions for the QuALITY dataset contains about 80 actions. This limit is configurable. The paper experimented with different limits and found that for this particular dataset, 80 is an optimal number to get the best results.

Let's examine some of these 80 actions. Some are general-purpose actions:

EXPLAIN_PROCESS(CTX, X): Provide a detailed explanation of the process X given the input CTX.
FIND_COMMON(CTX, X, Y, Z): Find the common ground, characteristics, or commonalities between X, Y, and Z given the input CTX.

Others are dataset-specific:

FIND_ELEMENT(CTX, X, Y): Find the element X related to Y given the input CTX. This function can cover message, method, metrics, mismatch, mission, mistake, most likely, motif, motivation, nationalities, negative critique, negative effect, next event, normal, objective, obstacles, outcome, outlier, paradox, part, perception, personality traits, perspective, pivotal decision, plan, plausible explanations, plot, prediction, probable cause, promises, proof, protagonist, purpose, qualities, rank, reaction, reasons reception, recommendations, relation, repercussion results, role representation, sentiment thoughts, setting effect, significance, similarity, solution struggles, theme symbolism, tech advancements, tension tone, traits values, unexpected, and writing style, etc.
FIND_EMOTION(CTX, X, Y): Find the emotion or feeling X feels towards Y given the input CTX.
FIND_FEARS_DISTRACTIONS(CTX, X): Find the fears, concerns, or distractions of X given the input CTX.
FIND_FORESHADOW(CTX, X, Y): Find the instance where X foreshadows Y in the input CTX.
FIND_FUTURE(CTX, X): Find the future, predicted outcome, or action of X given the input CTX.
FIND_GRIEVANCE(CTX, X): Find and summarize the grievance X has against something or someone in the input CTX.

Stage 3 — Plan Generation

The next stage in PEARL focuses on making the LLM generate an answering plan (or task plan, in general) for a given question using the mined actions.

The prompt template for plan generation is depicted below:

We supply the full list of mined actions in the prompt.

It's followed by these detailed instructions:

Instructions for plan generation with PEARL framework — *Instructions for plan generation*

Notice how the instructions now focus on the plan and urge the LLM to try to use only the provided actions. However, it remains open to the possibility of new actions as well.

PEARL then includes some few-shot examples with questions and their answering plans:

Few-shot examples for plan generation with PEARL — *Few-shot examples for plan generation (Source:* *PEARL*)

The default prompt for QuALITY contains 11 few-shot examples. Ideally, they must be written manually and suitable for the domain and task. However, if you prefer, the PEARL code has a "refine" function to autogenerate these few-shot examples from a dataset.

Finally, we append the question for which we want the LLM to generate a plan using the mined actions.

Importantly, we do not provide the full article as input at this stage. The plan is generated using just the mined actions, the target question, and the few-shot examples.

Plan Generation Example

An example question from QuALITY and its generated plan are shown below:

In this example, the LLM has expressed the plan using only the mined actions. Further, it has replaced the action parameters with question-specific arguments.

Stage 4 — Plan Self-Correction

With just a small number of few-shot examples and complex instructions for the LLM, a generated plan may have problems like:

Unknown actions
Wrong number of arguments for an action
Wrong ordering of actions

So, PEARL follows this iterative self-correction strategy:

Evaluate the plan using validation rules for action names and arguments.
If the plan is deemed invalid, prepare suitable text error messages to describe the problems.
Generate a new plan. But provide the invalid plan and error messages too so that the LLM can self-correct those errors.
Repeat these steps for a few iterations (seven by default) until the plan is valid.

At the end, you'll have a self-corrected plan that has the correct syntax and grammar. Note that even at this stage, the long document isn't used.

The prompt for plan self-correction is very similar to plan generation but with these self-correcting instructions:

*Instructions for plan self-correction (Source:* *PEARL*)

Plan Self-Correction Example

In this example, the LLM finds syntax errors over two iterations:

In the end, it generates this correct plan:

Stage 5 — Plan Execution

The last stage is to apply the plan's actions to the supplied document to get the result. PEARL examines the actions of the plan and replaces all CTX arguments with the supplied document.

After that, each action is run by the LLM one by one based on the action description, and its results are assigned to generated output variables.

The execution trace below shows the first three actions and their results for our example plan:

*The first three steps of example plan execution*

Here are the next two steps of the example six-step plan:

*The next two steps of example plan execution*

Example Plan Result

The final execution step produces the result for the question we asked ("What makes Gubelin an outlier in the present day?"):

PEARL for Medical Report Question Answering

In this section, we explore how PEARL performs in the question and answer domain. This is a pretty common use case for longer context windows. Engineering these systems spends a ton of time on adjusting both the chunk parameters (size,splitting,keywords etc), and the structure of how these chunks are pulled into the context window. Some systems struggle due to the fact that the most relevant chunks aren’t always the ones that best answer the question. If the database has multiple medical reports for different patients, or the same patient but different visits, the similar chunks that get retrieved might all be exactly the same section, when the query actually requires various parts of the document.

Adapting PEARL

We know the inputs that are either domain-specific or manually supplied:

Seed set of actions for action mining
Domain-specific training questions for action mining
Few-shot examples for action mining
Few-shot examples for plan generation
Relevant long documents

To prepare these inputs, we first look for suitable medical datasets.

Medical Datasets

We get our long documents — medical reports — from the Kaggle medical transcriptions dataset that we have used earlier for summarization work:

Sample medical report (Source: Kaggle medical transcriptions dataset) — *Sample medical report*

We choose the longest reports that are 12,000-16,000 tokens.

The input question will be a complex one: How is the medical condition affecting their lifestyle and what changes should the patient make in their lifestyle?

I love this question for evaluation because it is abstract enough that it causes a few key issues that you see with these Q&A systems:

- The specific medical condition is not specified. Keyword heavy systems or systems with limited domain knowledge will struggle with this.

- The first question is vague and does not give a direct correlation to anything in the document. This means the LLM requires in-domain knowledge of how to correlate different parts of the context to the query.

- The model has two options for what makes the most sense to answer the second part of the question. Either it can find a summary in the provided context and perform an answer extraction, or if this cannot be found, summarize multiple sections into an answer. This requires a deep level of understanding of the input text relative to the query.

- Multiple questions in the same query. One of the places where RAG based systems struggle. The model has to pull in context for both parts of the question.

That leaves us with questions and few-shot plans for action mining and plan generation. We use the MEDIQA-AnS (MEDIQA-Answer Summarization) dataset by Savery et al. because:

It contains questions from laypersons regarding medical problems. While some of these are quite long with a conversational tone, it's possible to summarize and simplify them to short questions.
It contains detailed long-form answers sourced from medical experts or extracted from articles by medical experts.
Simple, short, expert-created summaries of those complex answers are available.
Each pair of long-form answer and its simple summary enables us to prepare an answering plan that maps the question to the answer.

A sample entry from MEDIQA-AnS is shown below:

Medical Action Mining

For the medical seed set of actions, we reuse most of the seed set as PEARL since they're generic except for FIND_MORAL which isn't relevant for medical data.

For the few-shot examples, we extract questions from 30 random entries in MEDIQA-AnS primary dataset file and apply action mining to them. The results look quite promising:

Medical Plan Generation

We supply these few-shot examples in the plan generation prompt:

For a test question, "What blood tests confirm celiac disease," it generates this plan:

It's not a valid plan because FILTER isn't on the list of refined actions. Nonetheless, it's a promising start.

Plan Refinement

For our actual question on connections between the patient's medical condition and their lifestyle, PEARL generates this final plan after refinement:

Medical Report Question Answering Results

For a qualitative evaluation of the approach, we applied this lifestyle recommendation plan to the top five longest medical reports from the transcription dataset. These reports contain multiple sections for different medical specialties. We want the recommendation plan to extract relevant information from all sections.

Here are some of the results. Below is the longest and most complex report:

The lifestyle recommendation plan for it returns this comprehensive answer:

Notice how the plan includes information from all sections of the report.

The next longest report contains complex information and connections:

The lifestyle plan for it is shown below:

Again, the answer is comprehensive with information from all sections of the report.

These results show that PEARL can be easily adapted for multiple domains.

PEARL Overcomes Many LLM Drawbacks

In this article, you saw how PEARL can dynamically generate pipelines to solve complex tasks. In addition, it does so using just prompts and few-shot examples without requiring any fine-tuning. This is a capability that was previously not possible with other prompting strategies or task-planning techniques. Contact us to discover how you can apply PEARL for your difficult business problems.

References

Simeng Sun, Yang Liu, Shuohang Wang, Chenguang Zhu, Mohit Iyyer (2023). "PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents." arXiv:2305.14564 [cs.CL]. https://arxiv.org/abs/2305.14564
Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thompson, He He, Samuel R. Bowman (2021). "QuALITY: Question Answering with Long Input Texts, Yes!" arXiv:2112.08608 [cs.CL]. https://arxiv.org/abs/2112.08608
Max Savery, Dina Demner, Asma Ben Abacha, Soumya Gayen (2022). "Question-Driven Summarization of Answers to Consumer Health Questions." doi:10.17605/OSF.IO/FYG46. https://doi.org/10.17605/OSF.IO/FYG46

‍

Lets Talk