Objective Mismatch in Reinforcement Learning from Human Feedback
Nathan Lambert
Allen Institute for AI
Berkeley, CA, USA
Roberto Calandra
TU Dresden
Dresden, Germany
roberto.calandra@tu-dresden.de
Abstract
Reinforcement learning from human feedback (RLHF) has emerged as a powerful technique to make large language models (LLMs) more capable in complex settings. RLHF proceeds by collecting human preference data, training a reward model on said data, and optimizing a base ML model with respect to said reward, with performance judged by extrinsic evaluation metrics (e.g. MMLU, GSM8k). RLHF relies on many assumptions about how the various pieces fit together, such as a reward model capturing human preferences and an RL optimizer extracting the right signal from a reward model. As the RLHF process involves many distinct design decisions, it is easy to assume that multiple processes are correlated and therefore numerically linked. In practice, this apparent correlation often does not hold: reward models are easily overoptimized, and RL optimizers can reduce performance on tasks not modeled in the data. Notable manifestations of models trained with imperfect RLHF systems are those that are prone to refusing basic requests for safety reasons or appearing lazy in generations. As chat model evaluation becomes increasingly nuanced, the reliance on a perceived link between reward model training, RL scores, and downstream performance drives these issues, which we describe as an objective mismatch. In this paper, we illustrate the causes of this issue, review relevant literature from model-based reinforcement learning, and argue for solutions. By solving objective mismatch in RLHF, the ML models of the future will be more precisely aligned to user instructions for both safety and helpfulness.
1 Introduction
Reinforcement learning from human feedback (RLHF) is a powerful tool for integrating qualitative values into large machine learning models (Christiano et al., 2017; Ouyang et al., 2022; Bai et al., 2022) that are used in popular consumer apps such as ChatGPT and Midjourney. RLHF was popularized through its use to integrate human values into large language models (LLMs) for aligning chat tools (Schulman et al., 2022; Team et al., 2023). RLHF has become an important technique in the process of making models better at responding to user requests, often referred to as instruction-tuned, steerable, aligned, or chat-tuned.
RLHF methods typically operate in a multi-step process on top of a base language model, first learning a model of human preferences that acts as a reward function, and second using this model within a reinforcement learning (RL) loop. These two steps are often executed independently, with a reward model (RM) being trained on human preference data and the RL optimizer then being used to extract maximum information from the RM into the base model. This multi-step process induces challenges (Schulman, 2023) – even the most popular RLHF models include weaknesses such as llama-2-70b-chat-hf’s propensity to refuse vanilla requests on safety grounds (Röttger et al., 2023) or a version of ChatGPT documented officially as having “cases of laziness” (OpenAI, 2024). Colloquially, these issues fall under the banner of “too much RLHF.” These failures are signs of the current limitations of RLHF, where even with positive signals in training of each individual module, the resulting model can have unintended behaviors.
Figure 1: The three links causing objective mismatch in RLHF: reward model training ↔ policy model training, reward model training ↔ evaluation tools, and policy model training ↔ evaluation tools, as discussed in Sec. 4.1.
In this paper, we detail and argue for solving a fundamental challenge in modern RLHF learning schemes – objective mismatch – in order to mitigate these issues. In RLHF, three important parts of training are numerically decoupled: the evaluation metrics, the reward model, and the generating model (policy). The mismatch between the reward model and the RL training is visualized in Fig. 2, yet other links exist between the goals of evaluation and training processes, as shown in Fig. 1. Among other prospects, there are many avenues to better align reward model training with the literature on preference quantification (Lambert et al., 2023), and fundamental optimization challenges need to be solved in RLHF practices (Casper et al., 2023). ChatGPT, the most popular model trained with RLHF, shows signs of this limitation through issues such as verbosity, self-doubt and question refusals, repeated phrases, hedging, and more (Schulman, 2023). These traits of overoptimization are results of the subtle proxy objective problem that objective mismatch provides a frame for studying and solving: the reward model attributes excess value to phrases that do not contribute to user benefit, such as safety flags, which the RL optimizer then exploits. On the other hand, current training setups are not fully aligned with evaluation tools, because RLHF’d models still need sophisticated prompting techniques such as “thinking step by step” (J. Wei et al., 2022) or “take a deep breath” (Yang et al., 2023) to reach maximum performance. Solving objective mismatch will remove the need for these advanced techniques and reduce the likelihood of out-of-scope refusals from an LLM.
The use of RLHF is promising as it gives more levers for optimization of LLMs beyond next-token prediction accuracy. In this paper, we argue the position that the potential benefits of RLHF will not be realized without solving the objective mismatch issue. RLHF has the potential to enable LLMs that are safe (Ji et al., 2023; Shi et al., 2023), personalized (Jang et al., 2023), and effective (Ouyang et al., 2022; Bai et al., 2022).
The phrase objective mismatch originates from model-based reinforcement learning (MBRL), where an agent iteratively learns a dynamics model of the environment that it later uses to solve a control task (a dynamics model $f_\theta$ maps from state and action to next state, as $s_{t+1} = f_\theta(a_t, s_t)$) (Moerland et al., 2023; Lambert et al., 2020; R. Wei et al., 2023). In this context, the mismatch is between learning an accurate dynamics model rather than one that is optimized for high task reward. In RLHF, the problem is related, but with added complexity, as the reward model is optimized for preference data over a closed distribution, which does not match the end users. Second, the task of open-ended language generation is less specific to a notion of reward than that of RL control policies. For these reasons, as we explore in this paper, the objective mismatch issue is more nuanced and critical to RLHF. In this position paper, we make three contributions:
Figure 2: An illustration of where the objective mismatch issue emerges within the RL optimization phase of RLHF. A mismatch occurs when the score from the reward model is assumed to be correlated with other downstream evaluation metrics, such as human preferences over evaluation sets, classic NLP benchmarks, or LLM-as-a-judge systems. Compared to traditional RL problems, RLHF does not have a canonical environment; the reward model indirectly maps to the training data, but does not capture the same properties.
- Clearly explain the origins and potential manifestations of objective mismatch in chat-tuned LLMs,
- Connect related work from NLP and RL literature around objective mismatch,
- Propose directions of study to solve the mismatch and foster better RLHF practices.
2 Related Work
2.1 Reinforcement learning from human feedback
Early work in RLHF focused on continuous control domains, with various methods for altering behavior across trajectories (Christiano et al., 2017; Wirth et al., 2017). The impact of RLHF today has primarily centered on its use with LLMs. Initial work on RLHF for LLMs utilized user preferences from a batch of 4 options (Ziegler et al., 2019) to train a reward model across general LLM benchmarks. Group preferences were changed to pairwise preferences, and rather than general benchmarks, the reward model was focused on the task of summarization (Stiennon et al., 2020; J. Wu et al., 2021). Next emerged general question-answering models (Ouyang et al., 2022) and web crawling agents (Nakano et al., 2021), primarily from scaling the initial model and human datasets. Now, RLHF is used to train general chat models across a variety of tasks (Bai et al., 2022; Schulman et al., 2022; Touvron et al., 2023) and in specific domains such as harm reduction (Glaese et al., 2022) or information accuracy (Menick et al., 2022).
The development of these methods has accelerated markedly, with many variations on the methodology for integrating feedback into language models (Fernandes et al., 2023). The most popular reinforcement learning optimizer is still Proximal Policy Optimization (PPO) (Schulman et al., 2017), with its many stable and scalable implementations. Recent works have been developing new methods for the RL optimizer, such as the offline RL algorithm Implicit Language Q-Learning (ILQL) (Snell et al., 2022), Direct Preference Optimization (DPO) (Rafailov et al., 2023) for utilizing preference data without a reward model, or Advantage-Leftover Lunch RL (A-LOL) (Baheti et al., 2023), which is designed to act on the entire response as a single action (as RLHF generally does).
2.2 Problem (mis-)specification in RLHF
There is a substantial emerging literature on varieties of numerical issues, unexpected behaviors such as verbosity and evasiveness (Schulman, 2023), and potential solutions in RLHF, which can be mitigated by progress on solving objective mismatch. A prominent recent example is the behavior of the flagship Llama 2 chat model refusing to answer a request asking “How do I kill a Linux process,” conflating the computer process with the morals of killing a living creature. It has been shown that there are predictable behaviors of reward model overoptimization with PPO and best-of-N optimization techniques (Gao et al., 2022), which can be partially mitigated by training ensemble reward models (Coste et al., 2023), weight-averaging (Ramé et al., 2024), or constrained optimization (Moskovitz et al., 2023). Other issues have emerged in RLHF models that demonstrate the need for improved reward models, such as a bias towards long responses (Singhal et al., 2023), a lack of language consistency (L. Shen et al., 2023) (invariance over changes that maintain meaning), or a reduction of output diversity (Kirk et al., 2023). A similar argument is made in A. Wei et al. (2023), where the authors argue that “competing objectives and mismatched generalization” mislead the models – we present how objective mismatch covers both of these limitations and more possible failure cases.
Other papers indicate more fundamental limitations in how the preference data are collected (Bansal et al., 2023) or utilized. For example, multiple lines of work argue that the reward model training formulation does not align with the data collection process and downstream RL optimization, suggesting the models should model advantage estimates rather than direct value functions (Peng et al., 2023; Knox & Stone, 2008).
2.3 Reward engineering for RLHF
Specific domains are addressing this by shifting preference labels away from solely pairwise annotator input (whether by a human or an LLM) to computational feedback that bootstraps pairwise data for a reward model. For example, successful code execution in Python or reasoning path length has been used for rejection sampling (Yuan et al., 2023). Other works combine scores from code execution, syntax, and semantics to optimize for effective code (Shojaee et al., 2023) or use unit tests (B. Shen et al., 2023; Liu, Zhu et al., 2023). These are examples of early solutions to the reward specification problem facing all applications of RLHF.
2.4 Evaluating LLMs trained with RLHF
Core to the ideas of objective mismatch with LLMs are the methods of evaluation used to correlate performance. Historically, LLMs have been evaluated across a wide variety of tasks trying to capture specific characteristics of models, making evaluation an extremely broad process (Liang et al., 2022) where progress is saturating (Kiela et al., 2023). Now, many models are focused on hard-to-specify tasks such as chat, where existing benchmarks were not well correlated with performance (Zheng et al., 2023), so new chat-based evaluations such as MT-Bench (Zheng et al., 2023) and AlpacaEval (Li et al., 2023) have been introduced, but substantial further work is needed.
3 Background
3.1 Reward model training
Reward models are trained with human preference data that most often consists of a task given to the model as a prompt, i.e. a request or instruction, and ratings of the completion, or answer. The feedback can consist of selecting the best from groups of responses (Ziegler et al., 2019), scores and rankings of a group of candidate responses (Ouyang et al., 2022), a choice between a pair of responses (Bai et al., 2022) (choosing the best response between two options), or even finer-grained data (Z. Wu et al., 2023). The workers employed are generally given detailed instructions on which styles, occurrences, or values to prioritize in their labels.
The reward models trained for RLHF are most often trained as classifiers between a chosen and rejected completion to a prompt before being used in RL optimization, where they return a scalar value for each piece of text. Given two options for a completion $y$ from a prompt $x$, each of which obtains a scalar score $r$ from an initially untrained value head on an LLM (or from a separate value model entirely), the loss for the reward model follows (Askell et al., 2021; Ouyang et al., 2022):
$$L = \log\left(1 + e^{r_{\text{rejected}} - r_{\text{chosen}}}\right). \qquad (1)$$
The loss function is designed to increase the margin between the scores of the two samples. Variations exist, including losses over 4 samples rather than a pair (Ziegler et al., 2019), updating the model with batches of pairwise labels on a given prompt (Ouyang et al., 2022), or optimizing based on the margin between $r_{\text{chosen}}$ and $r_{\text{rejected}}$ (Touvron et al., 2023). For inference during RL optimization, the reward is taken as the raw logit output of this model, which represents an unnormalized probability of the text being preferred.
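To make the classifier formulation concrete, below is a minimal sketch of the pairwise loss in Eq. (1), assuming a PyTorch setting where the value head has already produced scalar scores for each completion; the tensors shown are illustrative placeholders, not the training recipe of any particular model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Eq. (1): log(1 + exp(r_rejected - r_chosen)) == -log(sigmoid(r_chosen - r_rejected)).
    # Minimizing this pushes the chosen completion's score above the rejected one's.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores for a batch of four preference pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
r_rejected = torch.tensor([0.1, 0.5, 1.0, -1.0])
loss = reward_model_loss(r_chosen, r_rejected)
```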
3.2 Reinforcement learning on language
Language generation optimized via reinforcement learning, of which RLHF is a version, can be formalized as a partially observable Markov decision process (POMDP) (Spaan, 2012). We define a POMDP $\mathcal{M}$ at a per-token level with $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{Z}, \mu_0, \mathcal{R}, \gamma)$. Here, the state of the system is $s_t \in \mathcal{S}$, which the agent receives as an observation $h_t \in \mathcal{O}$. The observation is a history of tokens $h_t = \{t_0, t_1, \ldots, t_{t-1}\}$, and the action space is the possible set of next tokens in the vocabulary of the policy model, $a_t = t_t \in \mathcal{A}$, including the end-of-sequence token $a_{\text{end}}$. As in a traditional MDP, $\mathcal{T}(\cdot \mid s_t, a_t)$ is the transition function.
The goal of the RLHF process is to learn a policy mapping $\pi: \mathcal{O} \mapsto \mathcal{P}(\mathcal{A})$. This is done with the reward model, which acts as a reward function $R(s_t, a_t) \mapsto \mathcal{R}$, used after each sequence is generated. The full sequence, up to the end-of-sequence token $a_{\text{end}}$, is called the action and used to get a scalar reward $r_t$ from the reward model.
With LLMs, the generating model is referred to as the policy model. In RLHF, the discount factor of reward is set to 1 and no further actions are taken for the given prompt, casting the problem as a contextual bandit. An example of the RLHF loop is shown in Fig. 2, in comparison to the standard RL loop shown in Fig. 3.
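A minimal sketch of this single-step, bandit-style loop is shown below; `policy.generate` and `reward_model.score` are hypothetical callables standing in for the policy and reward model, not a specific library API.

```python
def rlhf_bandit_step(policy, reward_model, prompt_tokens):
    # Observation: the token history so far (here, just the prompt).
    # Action: the entire completion, generated up to the end-of-sequence token.
    completion = policy.generate(prompt_tokens)
    # Discount factor is 1 and no further actions follow for this prompt,
    # so a single scalar reward is assigned to the full sequence.
    reward = reward_model.score(prompt_tokens + completion)
    return completion, reward
```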
Figure 3: The canonical RL problem, where an agent interacts repeatedly with an environment, from which the RLHF process is derived (as in Fig. 2).
4 Understanding Objective Mismatch
The objective mismatch in RLHF emerges from three broad causes. First, common practice in RL engineering dictates that as long as reward is increasing, the model is improving. Second, the evaluation methods available for models trained with RLHF are often incomplete relative to their downstream use-cases. Third, it is assumed that the trained reward model is a suitable reward function for optimization. For these reasons, objective mismatch emerges as the assumption that downstream evaluation will be correlated with the reward model score for the current policy, which is not proven.
4.1 Origins of mismatch
Objective mismatch in RLHF is the result of the interactions between three different sub-components, rather than just the two (i.e., dynamics model and policy) from MBRL: it is a balance of (1) the reward model training, the goal of getting a calibrated reward function, (2) the policy training, the process of extracting useful information from a reward model, and (3) the often bespoke evaluation techniques used for RLHF models, the process of fairly evaluating a multi-use model. There exists an interface between each pair of these three that provides an axis for erroneous assumptions regarding the true optimization problem, as shown in Fig. 1, but the importance of each link is not uniform for mitigation of mismatch.
When viewing these links, they present areas for improvement in RLHF when assuming one knob of the process is fixed. For example, in order to study the task of designing a reward that enables stable RL training, one should operate under a fixed evaluation regime. Without isolating modules of the system, all components of an RLHF optimization scheme (reward, evaluations, and preference agreement) can quickly become contaminated with each other and correlated. An example of such a project would be studying reward model design to mitigate overoptimization (Coste et al., 2023; Ramé et al., 2024), targeting the top right of Fig. 1.
The first link presented is the most engineering-heavy of the three by a substantial margin, so progress there is likely the most tractable. The other two present constantly emerging challenges as the use cases for RLHF-tuned models evolve with the applications of LLMs and other ML models.
Reward model training ↔ policy model training
Uniformly extracting the information from the reward model into the policy, while avoiding the reward hacking inherent to RL (Pan et al., 2022) that can result in overoptimization of reward models (Gao et al., 2022), is central to RLHF. A good reward model may not be one from which it is empirically easy to train a policy with high reward, but rather an RM that is well correlated with downstream evaluation metrics. Common practice in RLHF, especially with larger models where gradients are less stable, is to spend additional compute in search of “stable” training runs with increasing reward, which induces further likelihood of mismatch.
Reward model training ↔ evaluation tools
While relatively little work and resources exist for the study of state-of-the-art reward models, the matching of the reward signal they provide to the intended use-case of the final policy is central to solving the objective mismatch issue, particularly through the methods used to collect preference data. The reward models are trained on aggregated datasets to maximize agreement of the model on a held-out set of data, which in practice often yields a maximum accuracy of 60-75% (Ouyang et al., 2022; Bai et al., 2022). Given the complex task encompassed in reward modeling, it is unlikely that the models converge to 100% accuracy, but studying the sources of this performance delta could indicate sources of mismatch. In fact, understanding true upper bounds on different types of preference data is an essential step to studying reward model accuracy. New tools are needed for evaluation of reward models that better match their conceptual underpinnings as a representation of human values for solving the alignment problem (Leike et al., 2018) and as a practical realization as targets for optimization (Lambert et al., 2023).
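As a point of reference, the held-out agreement number cited above is simply the fraction of preference pairs where the reward model ranks the chosen completion above the rejected one; a minimal sketch follows, where `score` is a hypothetical callable returning a scalar reward for a (prompt, completion) pair.

```python
def preference_agreement(score, heldout_pairs):
    # heldout_pairs: list of (prompt, chosen, rejected) triples.
    correct = sum(
        1 for prompt, chosen, rejected in heldout_pairs
        if score(prompt, chosen) > score(prompt, rejected)
    )
    return correct / len(heldout_pairs)

# Reported values for strong reward models typically land around 0.60-0.75,
# well short of the idealized reward function this setup implicitly assumes.
```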
Policy model training ↔ evaluation tools
The third link contributes the least to the emergence of mismatch, but is the easiest axis along which to visualize potential signs of objective mismatch. This axis entails designing effective reward optimizers for language that integrate the reward signal while not degrading any capabilities of the base model. Directly matching RL training with any additional evaluation metrics is technically challenging. In MBRL, such a solution could come from using a differentiable simulator (R. Wei et al., 2023), but with the complexity of RLHF such solutions are less desirable. Exploring regularization or calibration of training with respect to final evaluations is a viable research direction, but this area of study is best suited for visualizing signs of objective mismatch, as shown in Fig. 4.
Figure 4: Illustrating the most likely visualization of objective mismatch in RLHF, the link between policy training and downstream evaluation. Measuring the correlation between evaluation and RL training is crucial to understanding the scope of impact of objective mismatch on current and future RLHF trained models.
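One way to operationalize the measurement described in Fig. 4 is to log the mean reward-model score and a downstream evaluation metric at each policy checkpoint and compute their correlation; a minimal sketch follows, where the two lists are hypothetical checkpoint logs rather than real results.

```python
from scipy.stats import spearmanr

# Hypothetical logs over six policy checkpoints during RL training.
rm_scores = [0.21, 0.48, 0.62, 0.71, 0.80, 0.84]    # mean reward-model score
eval_scores = [0.30, 0.45, 0.58, 0.60, 0.57, 0.52]  # e.g., win rate on an eval prompt set

corr, _ = spearmanr(rm_scores, eval_scores)
# A weak or negative correlation late in training is a visible sign of mismatch:
# the reward keeps rising while downstream evaluation stalls or degrades.
```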
4.2 Mismatch of next-token prediction
The original training objective used in popular language model architectures, autoregressive next-token prediction, also suffers from an objective mismatch problem, as almost all LLM evaluation techniques evaluate the entire output rather than individual tokens. While this is true, the development signal that the next-token prediction loss provides is more orthogonal to the goals of RLHF. In RLHF, and most related work in RL, the reward signal is interpreted as a direct indicator of performance. This assumption unintentionally creates a much more nuanced research setup, warranting the specific study of its impacts.
In MBRL, the learning of a dynamics model is also often done via one-step transitions, with recent work studying autoregressive models (Janner et al., 2021; Lambert et al., 2021), where the compounding error of multiple one-step predictions is a well-known issue deeply related to objective mismatch (Lambert et al., 2022). In the case where mismatch becomes a fundamental problem of LLMs, similar solutions could be investigated.
4.3 Does Direct Preference Optimization solve the mismatch?
Direct Preference Optimization (DPO) (Rafailov et al., 2023) solves the RLHF problem by inducing a policy from the optimal solution to the reward modeling problem, resulting in an LLM that acts as both a generative model and a reward scorer. This class of algorithms, which is expanding to address concerns of over-fitting and robustness (Azar et al., 2023), reduces the complexity of the objective mismatch problem by directly tying the training of the reward model and policy together. These methods mitigate the policy-reward model interface, but induce new problems in terms of objective mismatch. By joining the reward and policy models together, it becomes more difficult to develop research programs designed around each individual element. In principle, a reward model obtained with DPO should be usable in the same manner as reward models from other RLHF approaches, but substantial research is required to assess this. Finally, the same problems of preference data selection and evaluation are still present and core to the applicability of DPO methods.
5 Solving Objective Mismatch
There is already emerging research on many potential causes and solutions of mismatch in RLHF, yet further work can be inspired by solutions from the broader RL literature. Many of the solutions to objective mismatch in MBRL will not apply directly because MBRL has access to a true reward from the environment, and for that reason research is needed to understand the outputs of reward models. Below is a series of investigations to expand in order to mitigate objective mismatch:
Reward model evaluation
There are many axes along which a reward model is expected to behave in order to be a reasonable approximation of a reward function, but they are typically not studied. Reward models need to be assessed for consistency, robustness to adversarial attacks, calibration across distributions, and more, as discussed in Lambert et al. (2023). Understanding reward model performance is the foundation of solving the mismatch problem. Evaluating reward models will be an indirect but useful path to measure the varied preference datasets used for open RLHF models.
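As one example of such an assessment, the consistency check below scores meaning-preserving rewrites of the same completion and reports the average score gap; `score` and `paraphrase` are hypothetical helpers, and the check is a sketch of the kind of evaluation discussed above rather than an established benchmark.

```python
def consistency_gap(score, paraphrase, prompt, completion, n=5):
    base = score(prompt, completion)
    rewrites = [paraphrase(completion) for _ in range(n)]
    gaps = [abs(score(prompt, rewrite) - base) for rewrite in rewrites]
    # A well-behaved reward model should assign similar scores to semantically
    # equivalent completions; large average gaps flag inconsistency.
    return sum(gaps) / len(gaps)
```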
Reward model training methods
In order to address the limitations of reward models surfaced by better evaluation techniques, numerous new training methods will need to be developed. Early research has already shown that reward model ensembles can help mitigate overoptimization (Coste et al., 2023). Further research is warranted to integrate techniques that have improved the performance of model-based RL algorithms, such as probabilistic loss functions for the dynamics models and planning (Chua et al., 2018), calibrated probability estimates (Malik et al., 2019) when training the reward model as a classifier, and other solutions (R. Wei et al., 2023). Additionally, links should be explored between the reward models of inverse reinforcement learning (IRL) (Ng et al., 2000), the subfield tasked with learning a reward function from agent behavior, and those of RLHF. Early research also shows that reformulating reward model training to better match the preference learning literature (Knox et al., 2023) could improve performance. While ensembles (Coste et al., 2023) and weight-averaging (Ramé et al., 2024) mitigate overoptimization, they do not solve all challenges facing reward models (Eisenstein et al., 2023).
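A minimal sketch of the ensemble idea referenced above: several reward models trained on the same preference data are combined conservatively so that completions the members disagree on receive lower reward. The aggregation shown (mean minus a disagreement penalty) is one plausible choice, not the specific scheme of any cited work.

```python
import torch

def ensemble_reward(scores: torch.Tensor, pessimism: float = 1.0) -> torch.Tensor:
    # scores: [n_models, batch] rewards from independently trained reward models.
    mean = scores.mean(dim=0)
    disagreement = scores.std(dim=0)
    # Penalizing disagreement makes it harder for the policy to exploit
    # the idiosyncrasies of any single reward model.
    return mean - pessimism * disagreement

scores = torch.tensor([[1.0, 0.2], [0.8, 1.5], [1.1, -0.3]])  # 3 models, 2 completions
reward = ensemble_reward(scores)
```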
Reward model training datasets
High-quality datasets are a bottleneck slowing progress in open RLHF research, given the large costs required to acquire them. There are a few datasets available, but they are unproven in their ability to match the performance of the best models. The Stanford Preferences Dataset of Reddit content (Ethayarajh et al., 2022), UltraFeedback synthetic preference data (Cui et al., 2023), WebGPT internet browsing (Nakano et al., 2021), learning to summarize (Stiennon et al., 2020), and the Anthropic HHH dataset (Askell et al., 2021) serve as a strong foundation for research. Explorations are needed to first characterize why these datasets succeed and where they fall short, and then apply those lessons to curating new datasets.
Value-guided sampling techniques
Increased compute can be spent at inference time to improve the performance of RLHF models by utilizing the values returned by the reward model (Deng & Raffel, 2023; Liu, Cohen et al., 2023). Feng et al. (2023) explores this through Monte Carlo tree search generation, yet many more methods can be explored across the planning literature.
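The simplest instance of this idea is best-of-N reranking, sketched below with hypothetical `generate` and `score` callables; more sophisticated value-guided decoding and tree-search methods follow the same principle of trading inference compute for reward-model value.

```python
def best_of_n(generate, score, prompt, n=16):
    # Sample several candidate completions from the policy.
    candidates = [generate(prompt) for _ in range(n)]
    # Return the candidate the reward model values most highly.
    return max(candidates, key=lambda completion: score(prompt, completion))
```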
Human-centric NLP evaluation
The most popular evaluation technique for chat-tuned RLHF models is preference percentage versus other top models on evaluation prompt sets (as done for open RLHF models including Llama 2 (Touvron et al., 2023) and Dromedary-2 (Sun et al., 2023)). This evaluation mechanism, while well-motivated by the popular use-cases of the models, suffers from bias and reproducibility challenges. The prompts can easily be chosen to favor the model designed by the authors, and the prompts are often not released or aggregated into a future benchmark. Expanding the reproducibility and consistency of these practices will be important to creating robust practices for RLHF.
RL (and other) optimizers for language
As discussed in Sec. 2.1, the optimizers used for RLHF are most often those from previous RL literature. Now there is an opportunity for expansion of RL algorithms into the niche of RLHF, where conditions are highly specialized through the expansive action space and bandit formulation. New algorithms are a step in the right direction, such as T. Wu et al. (2023) modifying the PPO algorithm for pairwise preferences or Baheti et al. (2023) proposing an offline RL algorithm for full-completion actions.
This investigation should compare to other baselines for extracting signal from a reward model, such as rejection sampling (Touvron et al., 2023), which runs autoregressive fine-tuning on the top samples as dictated by a reward model.
Other solutions
Other creative mismatch solutions will exist, such as work integrating the LLM policy, reward model, and transition function into a single model (Xu et al., 2023). Methods such as this need to be evaluated across many scales to confirm that they are still numerically stable with the larger state-of-the-art models where powerful emergent behaviors exist.
6 Discussions
Figure 5: The outer loop of RLHF is the process of training the next reward model for RL to improve performance in areas of user interest. This setup induces additional complexity in objective mismatch in RLHF.
Iterative deployment of RLHF
The iterative deployment form of RLHF, where reward models are retrained based on user data, inducing a second feedback loop, is shown in Fig. 5. Schulman (2023) discusses how this is used in ChatGPT to mitigate issues such as evasiveness, verbosity, and other unexpected, undesirable qualities. Designing in this framework introduces further complexity onto engineering objectives, but allows iterative mitigation of mismatch. This style of iterative RL deployment has been understood as exogenous feedback (Gilbert et al., 2022) and can have societal implications.
There is some literature in this space, but expanding related works to the scale of use of modern LLMs will be difficult. For example, Suhr & Artzi (2022) shows theoretical results on outer-loop optimization of instruction-tuned models.
Contextual bandits
The modifications made to the RL optimization of RLHF cast it as a contextual bandits problem, where an agent takes one action and the dynamics are abstracted into one trajectory-reward pairing. Work in this area has investigated the potential of integrating partial, skewed, or noisy human feedback into the optimization process (Nguyen et al., 2017).
The subarea of dueling bandits has further specified a problem that is closely aligned with RLHF, but primarily in theoretical work with much smaller models, datasets, and tasks. Yue et al. (2012) explains this space in work showing theoretical bounds:
“In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure… but where pairwise comparisons are easy to make.”
This setting, while closely related to RLHF, will require substantial experimentation to be applicable. Others have built on this with work directly learning from human preferences (Sekhari et al., 2023) or from implicit human feedback (Maghakian et al., 2022).
7 Conclusion
This paper presents the multiple ways in which objective mismatch limits the accessibility and reliability of RLHF methods. The current disconnect between designing a reward model, optimizing it, and the downstream model goals creates a method that is challenging to implement and improve upon. With future work mitigating mismatch and the proxy objectives present in RLHF, LLMs and other popular machine learning methods will become easier to align with human values and goals, solving many common challenges users encounter with state-of-the-art LLMs.
In fact, it could be argued that the objective mismatches in RLHF are caused by the lack of a formal objective for human preferences. Given the prevalent success of RLHF’s early renditions in deployed technology today, such as ChatGPT, the existing objective is effective enough to be worth studying and investing heavily in. Our position is that objective mismatch articulates the directions the research community should go to make the most progress.
Acknowledgments
The authors would like to thank David Wadden for some pressing questions that made us better articulate the problems posed in the paper. Additional thanks to Louis Castricato, Ellen Wu, and Khyathi Chandu for feedback on drafts. This work was partly supported by the German Research Foundation (DFG, Deutsche Forschungsgemeinschaft) as part of Germany’s Excellence Strategy – EXC 2050/1 – Project ID 390696704 – Cluster of Excellence “Centre for Tactile Internet with Human-in-the-Loop” (CeTI) of Technische Universität Dresden, and by the Bundesministerium für Bildung und Forschung (BMBF) and German Academic Exchange Service (DAAD) in project 57616814 (SECAI, School of Embedded and Composite AI).
References
- Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., et al. (2021). A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
- Azar, M. G., Rowland, M., Piot, B., Guo, D., Calandriello, D., Valko, M., & Munos, R. (2023). A general theoretical paradigm to understand learning from human preferences. arXiv preprint arXiv:2310.12036.
- Baheti, A., Lu, X., Brahman, F., Bras, R. L., Sap, M., & Riedl, M. (2023). Improving language models with advantage-based offline policy gradients. arXiv preprint arXiv:2305.14718.
- Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- Bansal, H., Dang, J., & Grover, A. (2023). Peering through preferences: Unraveling feedback acquisition for aligning large language models. arXiv preprint arXiv:2308.15812.
- Casper, S., Davies, X., Shi, C., Gilbert, T. K., Scheurer, J., Rando, J., et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
- Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
- Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31.
- Coste, T., Anwar, U., Kirk, R., & Krueger, D. (2023). Reward model ensembles help mitigate overoptimization.
- Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., ... Sun, M. (2023). UltraFeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377.
- Deng, H., & Raffel, C. (2023). Reward-augmented decoding: Efficient controlled text generation with a unidirectional reward model. arXiv preprint arXiv:2310.09520.
- Eisenstein, J., Nagpal, C., Agarwal, A., Beirami, A., D’Amour, A., Dvijotham, D., et al. (2023). Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv preprint arXiv:2312.09244.
- Ethayarajh, K., Choi, Y., & Swayamdipta, S. (2022). Understanding dataset difficulty with V-usable information. In Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 5988–6008). PMLR.
- Feng, X., Wan, Z., Wen, M., Wen, Y., Zhang, W., & Wang, J. (2023). AlphaZero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
- Fernandes, P., Madaan, A., Liu, E., Farinhas, A., Martins, P. H., Bertsch, A., et al. (2023). Bridging the gap: A survey on integrating (human) feedback for natural language generation. arXiv preprint arXiv:2305.00955.
- Gao, L., Schulman, J., & Hilton, J. (2022). Scaling laws for reward model overoptimization. arXiv preprint arXiv:2210.10760.
- Gilbert, T. K., Dean, S., Zick, T., & Lambert, N. (2022). Choices, risks, and reward reports: Charting public policy for reinforcement learning systems. arXiv preprint arXiv:2202.05716.
- Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., et al. (2022). Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375.
- Jang, J., Kim, S., Lin, B. Y., Wang, Y., Hessel, J., Zettlemoyer, L., ... Ammanabrolu, P. (2023). Personalized soups: Personalized large language model alignment via post-hoc parameter merging. arXiv preprint arXiv:2310.11564.
- Janner, M., Li, Q., & Levine, S. (2021). Offline reinforcement learning as one big sequence modeling problem. Advances in Neural Information Processing Systems, 34, 1273–1286.
- Ji, J., Liu, M., Dai, J., Pan, X., Zhang, C., Bian, C., ... Yang, Y. (2023). BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. arXiv preprint arXiv:2307.04657.
- Kiela, D., Thrush, T., Ethayarajh, K., & Singh, A. (2023). Plotting progress in AI. Contextual AI Blog. https://contextual.ai/blog/plotting-progress
- Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., & Raileanu, R. (2023). Understanding the effects of RLHF on LLM generalisation and diversity.
- Knox, W. B., Hatgis-Kessell, S., Adalgeirsson, S. O., Booth, S., Dragan, A., Stone, P., & Niekum, S. (2023). Learning optimal advantage from preferences and mistaking it for reward.
- Knox, W. B., & Stone, P. (2008). TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE International Conference on Development and Learning (pp. 292–297).
- Lambert, N., Amos, B., Yadan, O., & Calandra, R. (2020). Objective mismatch in model-based reinforcement learning. In Learning for Dynamics and Control (pp. 761–770).
- Lambert, N., Gilbert, T. K., & Zick, T. (2023). The history and risks of reinforcement learning and human feedback.
- Lambert, N., Pister, K., & Calandra, R. (2022). Investigating compounding prediction errors in learned dynamics models. arXiv preprint arXiv:2203.09637.
- Lambert, N., Wilcox, A., Zhang, H., Pister, K. S., & Calandra, R. (2021). Learning accurate long-term dynamics for model-based reinforcement learning. In 2021 60th IEEE Conference on Decision and Control (CDC) (pp. 2880–2887).
- Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., & Legg, S. (2018). Scalable agent alignment via reward modeling: A research direction. arXiv preprint arXiv:1811.07871.
- Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., ... Hashimoto, T. B. (2023). AlpacaEval: An automatic evaluator of instruction-following models. GitHub. https://github.com/tatsu-lab/alpaca_eval
- Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
- Liu, J., Cohen, A., Pasunuru, R., Choi, Y., Hajishirzi, H., & Celikyilmaz, A. (2023). Don’t throw away your value model! Making PPO even better via value-guided Monte-Carlo tree search decoding. arXiv e-prints, arXiv:2309.
- Liu, J., Zhu, Y., Xiao, K., Fu, Q., Han, X., Yang, W., & Ye, D. (2023). RLTF: Reinforcement learning from unit test feedback. arXiv preprint arXiv:2307.04349.
- Maghakian, J., Mineiro, P., Panaganti, K., Rucker, M., Saran, A., & Tan, C. (2022). Personalized reward learning with interaction-grounded learning (IGL). arXiv preprint arXiv:2211.15823.
- Malik, A., Kuleshov, V., Song, J., Nemer, D., Seymour, H., & Ermon, S. (2019). Calibrated model-based deep reinforcement learning. In International Conference on Machine Learning (pp. 4314–4323).
- Menick, J., Trebacz, M., Mikulik, V., Aslanides, J., Song, F., Chadwick, M., et al. (2022). Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147.
- Moerland, T. M., Broekens, J., Plaat, A., Jonker, C. M., et al. (2023). Model-based reinforcement learning: A survey. Foundations and Trends in Machine Learning, 16(1), 1–118.
- Moskovitz, T., Singh, A. K., Strouse, D., Sandholm, T., Salakhutdinov, R., Dragan, A. D., & McAleer, S. (2023). Confronting reward model overoptimization with constrained RLHF. arXiv preprint arXiv:2310.04373.
- Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., et al. (2021). WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Ng, A. Y., Russell, S., et al. (2000). Algorithms for inverse reinforcement learning. In ICML (Vol. 1, p. 2).
- Nguyen, K., Daumé III, H., & Boyd-Graber, J. (2017). Reinforcement learning for bandit neural machine translation with simulated human feedback. arXiv preprint arXiv:1707.07402.
- OpenAI. (2024). New embedding models and API updates. https://openai.com/blog/new-embedding-models-and-api-updates. Accessed: 1 Feb. 2024.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., et al. (2022). Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.
- Pan, A., Bhatia, K., & Steinhardt, J. (2022). The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544.
- Peng, B., Song, L., Tian, Y., Jin, L., Mi, H., & Yu, D. (2023). Stabilizing RLHF through advantage model and selective rehearsal.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290.
- Ramé, A., Vieillard, N., Hussenot, L., Dadashi, R., Cideron, G., Bachem, O., & Ferret, J. (2024). WARM: On the benefits of weight averaged reward models. arXiv preprint arXiv:2401.12187.
- Röttger, P., Kirk, H. R., Vidgen, B., Attanasio, G., Bianchi, F., & Hovy, D. (2023). XSTest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263.
- Schulman, J. (2023). Proxy objectives in reinforcement learning from human feedback. Invited talk, International Conference on Machine Learning (ICML). https://icml.cc/virtual/2023/invited-talk/21549
- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Schulman, J., Zoph, B., Kim, C., et al. (2022). ChatGPT: Optimizing language models for dialogue. https://openai.com/blog/chatgpt/. Accessed: 2023-02-12.
- Sekhari, A., Sridharan, K., Sun, W., & Wu, R. (2023). Contextual bandits and imitation learning via preference-based active queries. arXiv preprint arXiv:2307.12926.
- Shen, B., Zhang, J., Chen, T., Zan, D., Geng, B., Fu, A., et al. (2023). PanGu-Coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936.
- Shen, L., Chen, S., Song, L., Jin, L., Peng, B., Mi, H., ... Yu, D. (2023). The trickle-down impact of reward (in-)consistency on RLHF. arXiv preprint arXiv:2309.16155.
- Shi\BOthers. (\APACyear 2023)\APACinsertmetastar shi2023safer{APACrefauthors}Shi, T., Chen, K.\BCBL\BBA Zhao, J.\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Safer-instruct: Aligning language models with automated preference data Safer-instruct: Aligning language models with automated preference data.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2311.08685. \PrintBackRefs\CurrentBib
- Shojaee\BOthers. (\APACyear 2023)\APACinsertmetastar shojaee2023execution{APACrefauthors}Shojaee, P., Jain, A., Tipirneni, S.\BCBL\BBA Reddy, C\BPBI K.\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Execution-based code generation using deep reinforcement learning Execution-based code generation using deep reinforcement learning.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2301.13816. \PrintBackRefs\CurrentBib
- Singhal\BOthers. (\APACyear 2023)\APACinsertmetastar singhal2023long{APACrefauthors}Singhal, P., Goyal, T., Xu, J.\BCBL\BBA Durrett, G.\APACrefYearMonthDay 2023. \APACrefbtitle A Long Way to Go: Investigating Length Correlations in RLHF. A long way to go: Investigating length correlations in rlhf. \PrintBackRefs\CurrentBib
- Snell\BOthers. (\APACyear 2022)\APACinsertmetastar snell2022offline{APACrefauthors}Snell, C., Kostrikov, I., Su, Y., Yang, M.\BCBL\BBA Levine, S.\APACrefYearMonthDay 2022. \BBOQ\APACrefatitle Offline rl for natural language generation with implicit language q learning Offline rl for natural language generation with implicit language q learning.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2206.11871. \PrintBackRefs\CurrentBib
- Spaan (\APACyear 2012)\APACinsertmetastar spaan2012partially{APACrefauthors}Spaan, M\BPBI T.\APACrefYearMonthDay 2012. \BBOQ\APACrefatitle Partially observable Markov decision processes Partially observable markov decision processes.\BBCQ\BIn\APACrefbtitle Reinforcement learning: State-of-the-art Reinforcement learning: State-of-the-art(\BPGS 387–414). \APACaddressPublisher Springer. \PrintBackRefs\CurrentBib
- Stiennon\BOthers. (\APACyear 2020)\APACinsertmetastar stiennon2020learning{APACrefauthors}Stiennon, N., Ouyang, L., Wu, J., Ziegler, D., Lowe, R., Voss, C.\BDBL Christiano, P\BPBI F.\APACrefYearMonthDay 2020. \BBOQ\APACrefatitle Learning to summarize with human feedback Learning to summarize with human feedback.\BBCQ\APACjournalVolNumPages Advances in Neural Information Processing Systems333008–3021. \PrintBackRefs\CurrentBib
- Suhr\BBA Artzi (\APACyear 2022)\APACinsertmetastar suhr2022continual{APACrefauthors}Suhr, A.\BCBT\BBA Artzi, Y.\APACrefYearMonthDay 2022. \BBOQ\APACrefatitle Continual learning for instruction following from realtime feedback Continual learning for instruction following from realtime feedback.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2212.09710. \PrintBackRefs\CurrentBib
- Sun\BOthers. (\APACyear 2023)\APACinsertmetastar sun2023salmon{APACrefauthors}Sun, Z., Shen, Y., Zhang, H., Zhou, Q., Chen, Z., Cox, D.\BDBL Gan, C.\APACrefYearMonthDay 2023. \APACrefbtitle SALMON: Self-Alignment with Principle-Following Reward Models. Salmon: Self-alignment with principle-following reward models. \PrintBackRefs\CurrentBib
- Team\BOthers. (\APACyear 2023)\APACinsertmetastar team2023gemini{APACrefauthors}Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J\BHBI B., Yu, J.\BDBL others\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Gemini: a family of highly capable multimodal models Gemini: a family of highly capable multimodal models.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2312.11805. \PrintBackRefs\CurrentBib
- Touvron\BOthers. (\APACyear 2023)\APACinsertmetastar touvron2023llama{APACrefauthors}Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y.\BDBL others\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Llama 2: Open foundation and fine-tuned chat models Llama 2: Open foundation and fine-tuned chat models.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2307.09288. \PrintBackRefs\CurrentBib
- A.Wei\BOthers. (\APACyear 2023)\APACinsertmetastar wei2023jailbroken{APACrefauthors}Wei, A., Haghtalab, N.\BCBL\BBA Steinhardt, J.\APACrefYearMonthDay 2023. \APACrefbtitle Jailbroken: How Does LLM Safety Training Fail? Jailbroken: How does llm safety training fail? \PrintBackRefs\CurrentBib
- J.Wei\BOthers. (\APACyear 2022)\APACinsertmetastar wei2022chain{APACrefauthors}Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E.\BDBL others\APACrefYearMonthDay 2022. \BBOQ\APACrefatitle Chain-of-thought prompting elicits reasoning in large language models Chain-of-thought prompting elicits reasoning in large language models.\BBCQ\APACjournalVolNumPages Advances in Neural Information Processing Systems3524824–24837. \PrintBackRefs\CurrentBib
- R.Wei\BOthers. (\APACyear 2023)\APACinsertmetastar wei2023unified{APACrefauthors}Wei, R., Lambert, N., McDonald, A., Garcia, A.\BCBL\BBA Calandra, R.\APACrefYearMonthDay 2023. \APACrefbtitle A Unified View on Solving Objective Mismatch in Model-Based Reinforcement Learning. A unified view on solving objective mismatch in model-based reinforcement learning. \PrintBackRefs\CurrentBib
- Wirth\BOthers. (\APACyear 2017)\APACinsertmetastar wirth2017survey{APACrefauthors}Wirth, C., Akrour, R., Neumann, G., Fürnkranz, J.\BCBL\BOthersPeriod.\APACrefYearMonthDay 2017. \BBOQ\APACrefatitle A survey of preference-based reinforcement learning methods A survey of preference-based reinforcement learning methods.\BBCQ\APACjournalVolNumPages Journal of Machine Learning Research181361–46. \PrintBackRefs\CurrentBib
- J.Wu\BOthers. (\APACyear 2021)\APACinsertmetastar wu2021recursively{APACrefauthors}Wu, J., Ouyang, L., Ziegler, D\BPBI M., Stiennon, N., Lowe, R., Leike, J.\BCBL\BBA Christiano, P.\APACrefYearMonthDay 2021. \BBOQ\APACrefatitle Recursively summarizing books with human feedback Recursively summarizing books with human feedback.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2109.10862. \PrintBackRefs\CurrentBib
- T.Wu\BOthers. (\APACyear 2023)\APACinsertmetastar wu2023pairwise{APACrefauthors}Wu, T., Zhu, B., Zhang, R., Wen, Z., Ramchandran, K.\BCBL\BBA Jiao, J.\APACrefYearMonthDay 2023. \APACrefbtitle Pairwise Proximal Policy Optimization: Harnessing Relative Feedback for LLM Alignment. Pairwise proximal policy optimization: Harnessing relative feedback for llm alignment. \PrintBackRefs\CurrentBib
- Z.Wu\BOthers. (\APACyear 2023)\APACinsertmetastar wu2023fine{APACrefauthors}Wu, Z., Hu, Y., Shi, W., Dziri, N., Suhr, A., Ammanabrolu, P.\BDBL Hajishirzi, H.\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Fine-Grained Human Feedback Gives Better Rewards for Language Model Training Fine-grained human feedback gives better rewards for language model training.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2306.01693. \PrintBackRefs\CurrentBib
- Xu\BOthers. (\APACyear 2023)\APACinsertmetastar xu2023shattering{APACrefauthors}Xu, W., Dong, S., Arumugam, D.\BCBL\BBA Van Roy, B.\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models Shattering the agent-environment interface for fine-tuning inclusive language models.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2305.11455. \PrintBackRefs\CurrentBib
- Yang\BOthers. (\APACyear 2023)\APACinsertmetastar yang2023large{APACrefauthors}Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q\BPBI V., Zhou, D.\BCBL\BBA Chen, X.\APACrefYearMonthDay 2023. \APACrefbtitle Large Language Models as Optimizers. Large language models as optimizers. \PrintBackRefs\CurrentBib
- Yuan\BOthers. (\APACyear 2023)\APACinsertmetastar yuan2023scaling{APACrefauthors}Yuan, Z., Yuan, H., Li, C., Dong, G., Tan, C.\BCBL\BBA Zhou, C.\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Scaling relationship on learning mathematical reasoning with large language models Scaling relationship on learning mathematical reasoning with large language models.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2308.01825. \PrintBackRefs\CurrentBib
- Yue\BOthers. (\APACyear 2012)\APACinsertmetastar yue2012k{APACrefauthors}Yue, Y., Broder, J., Kleinberg, R.\BCBL\BBA Joachims, T.\APACrefYearMonthDay 2012. \BBOQ\APACrefatitle The k-armed dueling bandits problem The k-armed dueling bandits problem.\BBCQ\APACjournalVolNumPages Journal of Computer and System Sciences7851538–1556. \PrintBackRefs\CurrentBib
- Zheng\BOthers. (\APACyear 2023)\APACinsertmetastar zheng2023judging{APACrefauthors}Zheng, L., Chiang, W\BHBI L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y.\BDBL others\APACrefYearMonthDay 2023. \BBOQ\APACrefatitle Judging LLM-as-a-judge with MT-Bench and Chatbot Arena Judging llm-as-a-judge with mt-bench and chatbot arena.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:2306.05685. \PrintBackRefs\CurrentBib
- Ziegler\BOthers. (\APACyear 2019)\APACinsertmetastar ziegler2019fine{APACrefauthors}Ziegler, D\BPBI M., Stiennon, N., Wu, J., Brown, T\BPBI B., Radford, A., Amodei, D.\BDBL Irving, G.\APACrefYearMonthDay 2019. \BBOQ\APACrefatitle Fine-tuning language models from human preferences Fine-tuning language models from human preferences.\BBCQ\APACjournalVolNumPages arXiv preprint arXiv:1909.08593. \PrintBackRefs\CurrentBib