Fine-tuning Open LLMs with Reinforcement Learning from Human Feedback

Matt Payne
November 2, 2023
Llama 2 RLHF fine-tuning process
Llama 2 RLHF process (Source: Touvron et al.)

The open-source large language model (LLM) ecosystem has now matured past the point of simple manually built prompts and completion fine-tuning. They can now follow complex instructions and hold useful multi-turn conversations with your users and customers on par with market-leading commercial LLMs.

In this article, we explore a technique called reinforcement learning from human feedback that transforms raw LLMs into capable conversationalists and instruction-following autonomous agents.

What Is Reinforcement Learning From Human Feedback?

Out of the box, pretrained large language models are only capable of completions, i.e., they can only add text to a supplied prompt. You can't talk back and forth with a pretrained LLM like it's a chatbot.

To enable them to conduct dialogue with people, you must first train and finetune them for instruction-following and conversations.

Reinforcement Learning from Human Feedback (RLHF) is a training process to teach an LLM for instruction-following and multi-turn conversations using reinforcement learning.


The best way to understand what RLHF can do is to compare the output of a stock LLM with its RLHF-enhanced version.

In the demo below, we ask a question to a stock Llama 2 LLM that hasn't been finetuned for conversations or instructions:

example of base llama

The simple question in the prompt elicits almost random text as the output. That's because the LLM examines the last word "mean" and decides the next word purely based on the text corpora it's trained on. It doesn't understand that the prompt is a question it must answer.

Now compare the result from an RLHF-enhanced Llama 2 chat LLM:

example of the results of fine-tuning

Thanks to RLHF, the LLM understands that it's being asked a question and that its generated text must not only be a meaningful answer but also have the tone and cadence of an answer.

In the rest of this article, we explore how to implement RLHF for open-source LLMs.

Survey of RLHF-Trained Open-Source LLMs

When the first set of open-source LLMs like LLaMA came out in early 2023, open-source RLHF datasets and techniques weren't available. Only the large commercial LLMs were implementing RLHF. But by the third quarter of 2023, RLHF-powered open-source models as well as software and datasets to implement it were readily available.

As of September 2023, you can use these RLHF-trained or RLHF-ready open-source LLMs.

Outline of the options of models

LLM Training Procedure Using RLHF

The diagram below shows the typical process to finetune an LLM for instruction-following and multi-turn dialogue using RLHF.

Fine-tuning outline connecting the data and LLM

We explain each of these steps and datasets in the sections below.

LLM Pretraining

LLM pretraining is the initial training of an untrained LLM on large text corpora. All LLMs start out like this, including popular ones like the legacy GPT-3 davinci-002 model and the Llama 2 base models. Since so many pretrained, open-source, ready-for-commercial-use LLMs are readily available, you rarely have to implement this step from scratch.

Supervised Fine-Tuning

SFT is often the first step in creating an LLM capable of instruction-following or multi-turn dialogue. It takes a pretrained LLM that's only capable of completions and trains it to follow instructions or chat with a user.

Instruction-following refers to the ability of an LLM to follow instructions for specific tasks like summarization or code generation. These LLMs typically contain "instruct" in their names (like InstructGPT, for example).

In contrast, multi-turn dialogue is an LLM's ability to follow any number of diverse instructions and answer a variety of questions through back-and-forth conversation with a user while remembering all the information exchanged so far. It's a superset of instruction-following. Such LLMs are typically labeled as chat or chatbot LLMs (for example, Llama-2-70b-chat).

How SFT Works

SFT's goal is to make a pretrained LLM reply to a user's prompt with specialized information in a suitable tone instead of simply adding text to the prompt like a raw LLM. It's similar to LLM pretraining but with differences in these aspects:

  • Demonstration data for training: Instead of raw text, the training data is organized as pairs of prompts and relevant responses that demonstrate instruction-following or multi-turn dialogue to the LLM.
  • Loss function based only on responses: Though a conversation includes user prompts and LLM responses, the cross-entropy loss in SFT examines only the LLM-generated tokens against the training responses. The tokens in the user prompts are ignored.

In the next section, we examine some datasets useful for SFT.

Demonstration Datasets for SFT

For SFT, you can use these public datasets.

Dolly-15K Dataset

Demonstration data for SFT
Demonstration data for SFT (Source: databricks-dolly-15k)

The Databricks-dolly-15k is a dataset for instruction-following created by Databricks employees through an internal gamified crowdsourcing process. The Dolly-15K dataset has these characteristics:

  • Nature: It includes prompts and responses covering multiple categories like chatting, brainstorming, classification, question answering, text generation, information extraction, summarization, and more.
  • Structure: Each row consists of an instruction or question, a response, the category, and an optional context for the question.
  • Number of rows: It contains 15,000 entries that conform to the recommendations of the InstructGPT paper.
  • License: It's available under the Creative Commons Attribution-ShareAlike 3.0 Unported License which allows modifications for commercial use as long as they're also distributed under the same license.

Stanford Alpaca Dataset

Examples from the Alpaca dataset
Examples from the Alpaca dataset (Source: Alpaca)

The Stanford Alpaca dataset has these characteristics:

  • Nature: It consists of LLM-generated prompts, responses, and inputs for additional context.
  • Structure: Each row has a synthetically generated instruction, response, and inputs. All of them were generated by the OpenAI text-davinci-003 GPT-3 model.
  • Number of rows: It contains 52,000 entries.
  • License: It's available under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license that does not permit any commercial use of the data.

OpenAssistant Conversations Dataset

Example row (Source: OpenAssistant)
Example row (Source: OpenAssistant)

The OpenAssistant conversations dataset has these characteristics:

  • Nature: It's a human-written, human-annotated corpus generated by over 13,000 volunteers through a worldwide crowdsourcing effort. It contains conversations in 35 different languages spanning multiple rows. About 10,000 conversations have fully- annotated quality ratings.
  • Structure: This dataset contains message trees that span multiple rows. Each message tree has the initial prompt message as the root node and multiple child messages as replies. A child message can also have multiple replies. Each message has a role that can either be "assistant" or "prompter."
  • Number of rows: It contains over 161,000 messages.
  • License: It's available under the Apache 2.0 license.

SFT Limitations

SFK teaches an LLM to generate plausible and meaningful responses to instructions and questions. Unfortunately, people may not perceive many of its responses as relevant, helpful, or safe. To overcome that problem, additional training using RLHF is necessary.

Reinforcement Learning From Human Feedback

RLHF is a model training technique that aligns an LLM with human preferences in terms of relevance, helpfulness, and safety. The technique combines reinforcement learning with a machine learning model that mimics human preferences, as explained in the sections below.

RLHF Intuition

The standard technique to combine human preferences or behavior with a machine learning model is reinforcement learning (RL).

The typical RL paradigm works like this: In response to an input, a model acts according to a policy. The policy can be a complex non-linear function implemented by a neural network. A separate reward model (RM) then evaluates its action and either rewards or penalizes it. This reward or penalty is fed back to the policy to optimize it such that the probability of future reward is maximized.

Applying the RL algorithm to an LLM, the language model becomes the policy and its action is the generation of the next token. The RM rewards the policy if the generated tokens satisfy human preferences and penalizes it if not.

This reward is implemented as part of the loss function for the fine-tuning. A reward minimizes the training loss while a penalty maximizes it. Accordingly, the gradient optimization learning algorithm adjusts the policy's (i.e., the LLM's) unfrozen weights such that future rewards are maximized.

We first explore how to implement RLHF reward models in more depth.

Reward Models in RLHF

There are many ways to implement reward models that mimic human preferences.

Multivariate Ranking Models

The RM can be a multivariate regression model that produces a composite numerical score as a reward (or penalty) for an action. This regression model combines scores on various aspects like:

  • Quality
  • Relevance
  • Helpfulness
  • Harmfulness
  • Correctness
  • Realism
  • Comprehensibility

The training data for such a model is provided by thousands of human evaluators who score the LLM responses on the above aspects on some scale like one to five.

However, such scoring is time-consuming and expensive. Besides, numerical scores bring a lot of subjectivity into the process and require additional quality control checks like measuring inter-rater agreement using Fleiss' Kappa or similar metrics.

Binary Ranking Models

A faster, simpler, and cheaper approach is to present just two alternative responses to each evaluator and ask them to select the better one. In the training data created this way, each row will have a prompt, a winning response, and a losing response.

The reward function calculates a scalar score for each winning and losing response. The loss is simply a function of the difference in their scores, also known as the pairwise ranking loss. It ensures that the winning response's score is always greater than the loser's.

An example of this is the Llama 2 RM's loss function:

Llama 2 reward model's loss function
Llama 2 reward model's loss function

Reinforcement Learning With Proximal Policy Optimization

Proximal policy optimization (PPO) is a technique to optimally update the policy (i.e., the LLM's unfrozen layers).

Normally, policy gradient optimization using gradient descent involves just one gradient update per data sample. In contrast, PPO performs a gradient update across a batch of samples. This turns out to be faster as well as simpler to implement.

Reinforcement Learning With Rejection Sampling

Another technique you can use for RL is rejection sampling. It's one of the two techniques, along with PPO, used by Llama 2 for its RLHF implementation.

In rejection sampling, multiple responses are generated by the policy (i.e., the LLM), and the best candidate among them is selected by the reward model. For each prompt, the sample with the highest reward score is considered the new gold standard. The model is then finetuned on this new set of re-ranked samples, effectively reinforcing the reward with another reward.

RLHF Datasets

You can implement your RLHF finetuning using these public datasets.

Stanford Human Preferences (SHP) Dataset

A row from the SHP dataset

The SHP dataset has these characteristics:

  • Nature: It consists of organic (not synthetic) natural human-written single-turn responses derived from Reddit conversations from 18 diverse topic forums like academics, baking, engineering, legal advice, and more.
  • Structure: A conversation spans multiple rows. Each row consists of a question or instruction (which is just the Reddit post) and two top-level replies where evaluators have collectively preferred one reply over the other. For each reply, a score is provided as the difference between its upvotes and downvotes.
  • Number of rows: It contains about 385,000 rows.
  • License: This data is scraped from Reddit and is subject to that platform's terms of use.

Anthropic HH-RLHF Dataset

Example from the Anthropic HH-RLHF dataset

The Anthropic HH-RLHF dataset has these characteristics:

  • Nature: It consists of synthetic, LLM-generated, multi-turn conversations.
  • Structure: Each row consists of a chosen conversation and a rejected conversation.
  • Number of rows: It contains about 170,000 rows.
  • License: This data is available under the MIT license.

Implementing RLHF

In this section, we review useful frameworks and libraries you can use to implement RLHF.

Transformer Reinforcement Learning (TRL) Library

Finetuning process using TRL (Source: TRL)
Finetuning process using TRL (Source: TRL)

TRL is a Hugging Face helper library with these out-of-the-box capabilities:

  • Reward modeling
  • Proximal policy optimization
  • Supervised finetuning
  • Rejective sampling
  • Direct preference optimization

LLaMA Efficient Tuning Framework

LLaMA efficient tuning is a framework for fine-tuning open-source LLMs like Llama 2, LLaMA, Falcon, and more. It provides the following capabilities:

  • Supervised finetuning
  • Reward modeling
  • Proximal policy optimization training
  • Direct preference optimization training

Interested in fine-tuning your LLM?

RLHF has been a key advancement in turning LLMs from pure language models into useful tools for business workflows. In this article, you explored how you can apply supervised fine-tuning and RLHF to open-source LLMs to create your own capable agents and chatbots. Contact us if you need such customized chatbots in your customer service or other processes.


  • Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288 [cs.CL]. https://arxiv.org/abs/2307.09288