Title: Scalable Reinforcement Post-Training Beyond Static Human Prompts Evolving Alignment via Asymmetric Self-Play

URL Source: https://arxiv.org/html/2411.00062

Markdown Content:

License: CC BY 4.0. arXiv:2411.00062v3 [cs.CL], 09 Apr 2025

Authors: Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury, Quoc V. Le, Qijun Tan, Yuan Liu

Abstract

Existing reinforcement post-training pipelines for large language models (LLMs) rely on a pre-curated, static prompt distribution, which bottlenecks scalability. Prior works have explored prompt evolving, but they are often limited to the supervised fine-tuning stage, and prompts are sampled and evolved uniformly, without guiding signals.

This empirical work presents a paradigm shift: Evolving Alignment via Asymmetric Self-Play (eva), which casts post-training as an infinite game with regret-based signals for two players: (i) a creator, who strategically samples and creates new informative prompts, and (ii) a solver, who learns to produce preferred responses.

eva is the first method that allows language models to adaptively create training prompts in both offline and online RL post-training. The design is easy to use yet remarkably effective: eva sets a new SOTA on challenging benchmarks without any extra human prompts, e.g., it boosts the win rate of gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% for DPO and from 52.6% to 62.4% for RLOO, surpassing claude-3-opus and catching up to gemini-1.5-pro, both of which are orders of magnitude larger.

Extensive experiments show that eva can create effective RL curricula and is robust across ablations. We believe adaptively evolving prompts is key to designing next-generation RL post-training schemes.

Keywords: Self-Play, Alignment, Reinforcement Learning, Large Language Models, Open-Ended Learning, Open-Ended RLHF

What I cannot create, I do not understand.

– Richard P. Feynman

1 Introduction

The Evolving Alignment Principle: Language models shall generatively adapt their training prompt distribution for ever-evolving RL post-training.

Figure 1: Illustration of evolving alignment. Conventional RLHF is restricted to static prompt distributions. We show that it is crucial to adaptively adjust the prompt distribution during RL post-training, and further design a method allowing LLMs to robustly create new prompts with improved coverage and complexity for continual RL post-training, offering remarkable empirical gains (§ 4).

Figure 2: Asymmetric self-play pipeline. We generalize classical RLHF with open-ended RLHF, optimized with a creator-solver game. eva strategically evolves prompt distributions with a creator policy, which synthesizes prompts via a simple estimate, sample, then evolve procedure; specifically, the informativeness of each prompt is elicited from reward signals. See more on the minimax-regret objective that drives the game design and the different practical implementations in § 3.

Long-lived artificial intelligence must deal with an ever-evolving, open-ended world; however, the current training paradigm remains fairly short-lived and static.

To explain, LLM training is typically done in two stages: imitation (i.e., supervised fine-tuning, SFT) and reinforcement learning (i.e., RL post-training). This work focuses on the latter, which has led to remarkable success in enhancing LLM capabilities (Team et al., 2023, 2024a, 2025). However, there is a fundamental issue in all existing practices: they restrict themselves to a pre-curated, static prompt distribution during post-training.

This is sub-optimal and bottlenecks scaling properties w.r.t.: (i) training efficiency: the existing paradigm treats all prompts equally, despite their utility depending on the changing state of the LLM during training; since not all prompts contribute equally to post-training, relying on a static set is inefficient; (ii) model generalizability: once the LLM saturates on the static prompt set, learning stops, preventing the acquisition of new skills or knowledge beyond the predefined distribution.1 We thereby investigate two research questions:

1. (Signal) Which prompts are useful during RL training?

2. (Algorithm) How can we adaptively get more useful prompts, and use them to keep LLMs self-improving?

To address these, we design eva (Evolving Alignment via Asymmetric Self-Play), as in Figures 1 and 2. Central to eva is an infinite game with a minimax-regret objective, solved by alternating optimization between creating prompts and solving them. We detail the novelty of this work w.r.t. prior works in § 5, and summarize our original contributions below:

1. We propose an open-ended RLHF objective. As in Problem 1, our new objective jointly optimizes the prompt and response policies, and allows continual self-training beyond the initial static prompts.

2. We introduce and verify effective signals for LLMs to identify useful prompts in RL post-training. In § 3, we use variants of the reward advantage as effective signals to identify learnable and worth-learning prompts, guiding prioritization2 in RL post-training.

3. We design eva, the first method3 to our knowledge that allows LLMs to adaptively create useful prompts for continual RL post-training. In § 3, we design a practical algorithm via an infinite creator-solver game, with reward signals that incentivize LLMs to create new prompts for better learning curricula (see § 4.2.5) in continual RL post-training.

4. eva is easy to implement. As in Algorithm 1, eva can be easily plugged into any RLHF pipeline, with an extensible creator module.

5. eva is SOTA on challenging alignment benchmarks. In § 4, we run extensive experiments showing eva universally improves both online RL (e.g., RLOO, OAIF) and offline RL (e.g., DPO, SPPO, SimPO, ORPO).

As we enter the new epoch where compute is moving from training to data synthesis, the question of "how to put more compute into generating better data" (Nishihara, 2025) is more critical than ever. While existing works (Zelikman et al., 2022; Gulcehre et al., 2023; Chow et al., 2024) primarily focus on exploration in 𝒴 ∣ 𝒳, we perform, to our knowledge, the first systematic study of exploration in (𝒳, 𝒴) for RL post-training of LLMs. Such exploration is non-trivial and relies on carefully designed reward signals, as we show in § 4.2.1. We hope eva can be a starting point for researchers to push further along this new direction, and to discover the next right thing to scale (Sutskever, 2024).

2 Preliminaries

Classical RL post-training (Ouyang et al., 2022) solves a regularized optimization problem over a fixed prompt distribution 𝒟:

$$\max_{\pi_{\boldsymbol{\theta}}} \;\; \mathbb{E}_{\mathbf{x}\sim\mathcal{D},\, \mathbf{y}\sim\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big] \;-\; \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\Big[\beta\cdot\mathbb{D}\big[\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})\,\big\|\,\pi_{\mathsf{base}}(\cdot\mid\mathbf{x})\big]\Big].$$

Here, π_base(·|x) is a base policy, x and y are prompts and responses, and 𝔻 is a divergence measure. In this paper, r(·) is the reward, assumed to come from an oracle r⋆(·) and fixed during post-training (Team et al., 2024a); we use the term RLHF interchangeably with RL post-training, as we focus on human preference alignment in our experiments, yet the pipeline is compatible with any other reward type.
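To make the objective concrete, here is a minimal Monte-Carlo sketch of how the regularized objective above can be estimated from sampled responses; the function name and the sample-based KL estimate are illustrative, not part of the paper.

```python
from statistics import mean

def rlhf_objective_estimate(rewards, logp_policy, logp_base, beta=0.1):
    """Monte-Carlo estimate of the KL-regularized RLHF objective.

    rewards[i]     : r(x, y_i) for response y_i sampled from pi_theta
    logp_policy[i] : log pi_theta(y_i | x)
    logp_base[i]   : log pi_base(y_i | x)
    The divergence term D[pi_theta || pi_base] is estimated by the
    sample average of log pi_theta - log pi_base (a simple KL estimator).
    """
    kl_estimate = mean(lp - lb for lp, lb in zip(logp_policy, logp_base))
    return mean(rewards) - beta * kl_estimate
```

A higher β keeps the policy closer to π_base; β = 0 recovers pure reward maximization.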

In practice, depending on how 𝐲 is generated, RLHF methods can be framed as: (i) online, where 𝐲 are generated on-policy, i.e., 𝐲 ( 1 ) , … , 𝐲 ( 𝑘 ) ∼ 𝜋 𝛉 ( ⋅ | 𝐱 )  (Ahmadian et al., 2024); and (ii) offline, where 𝐲 are pre-generated by human experts or previous model checkpoints (Xiong et al., 2024). These methods naturally serve as the solver in eva, which adapts to both under the same principle: in online RLHF, we evolve prompts at each mini-batch, while in offline RLHF, we construct new prompt sets after each full iteration.

Problem 1 (Open-Ended RLHF)

We define the problem of Open-Ended RLHF as the bilevel optimization on both the prompt policy (the creator 𝜋 𝜙 ⁢ ( 𝐱 ) ) and the response policy (the solver 𝜋 𝛉 ⁢ ( 𝐲 ∣ 𝐱 ) ) for alignment:

$$\phi^{\star} \in \arg\max_{\phi}\; \mathcal{R}\big(\pi_{\phi}(\cdot),\, \pi_{\mathsf{true}}(\cdot);\; \mathcal{D},\, \boldsymbol{\theta}^{\star}(\phi)\big), \tag{1}$$

$$\text{s.t.}\quad \boldsymbol{\theta}^{\star}(\phi) \in \arg\max_{\boldsymbol{\theta}}\; \mathbb{E}_{\mathbf{x}\sim\pi_{\phi}(\cdot)}\Big[\mathbb{E}_{\mathbf{y}\sim\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big] - \beta\cdot\mathbb{D}\big[\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})\,\big\|\,\pi_{\mathsf{base}}(\cdot\mid\mathbf{x})\big]\Big]. \tag{2}$$

where π_true is the (potentially unknown) true target prompt distribution, 𝒟 is an optional artifact parameter (e.g., the seed prompt distribution), and ℛ(·) is a regularization function on the creator policy π_φ, which we discuss in detail in § 3. The problem generalizes classical RLHF and captures a dual objective: (i) response alignment: the solver should perform well on the training prompt distribution while staying close to π_base; and (ii) prompt generation: the creator should generate training prompts that allow the solver to perform robustly on the target prompt distribution.

3 Method

3.1 The Problem: Open-Ended RLHF

Classical RLHF, as in § 2, samples prompts from a static set 𝒟, which can have limited prompt coverage and complexity (Fig. 12), and may diverge from true scenarios in the open-ended world (Dennis et al., 2020; Parker-Holder et al., 2022). In Problem 1, we introduce a prompt generation policy π_φ(·) to be optimized together with the response generation policy π_θ(·|x). An optimizable π_φ(·) brings several benefits: (i) during training, it allows dynamic adjustment of training prompts on the fly, making it possible to prioritize prompts that are more informative to the current π_θ(·|x), improving learning efficiency; (ii) at convergence, it yields a new prompt distribution beyond the initial static set, making it possible for π_θ(·|x) to learn knowledge beyond 𝒟 and perform more robustly on the target distribution. The creator objective ℛ(·) characterizes the optimization of π_φ(·), preventing it from collapsing onto trivial prompts and guiding it towards the true target prompts; we discuss a specific implementation via regret maximization next.

3.2 The Game: Minimax Regret Games

Problem 1 can be cast as a sequential game (von Stackelberg, 1934) between two strategic players, each optimizing its own utility:

- Solver π_θ(y ∣ x), who generates responses that optimize alignment given training prompts.

- Creator π_φ(x), who generates training prompts for the solver to perform well on in the real world, knowing the solver will optimize over the generated prompts.

A natural objective for the creator is to improve the solver's transfer performance (Bengio, 2012) on the true target prompt distribution π_true: the closer π_φ is to π_true, the better the solver's expected performance, and thereby the higher the utility the creator may receive. If π_true is known, ℛ(·) can be instantiated by an f-divergence measure for distribution matching with π_true. This work considers the case when π_true is unknown a priori; the optimization then becomes a standard decision-under-ignorance problem (Savage, 1951; Gustafsson, 2022). Several decision rules can be considered, e.g., randomization, which chooses the training prompt distribution uniformly (Jiang, 2023); in this paper, we study the Minimax Regret Rule, which finds a training distribution that minimizes the solver's worst-case regret over all possible distributions (see more discussion in § 4.2.1). Regret is defined as the reward difference between π_θ and the optimal policy π_θ⋆:

$$\operatorname{Regret}(\pi_{\phi}, \pi_{\boldsymbol{\theta}}) := \mathbb{E}_{\mathbf{x}\sim\pi_{\phi}(\cdot)}\Big[\mathbb{E}_{\mathbf{y}\sim\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big] - \mathbb{E}_{\mathbf{y}\sim\pi_{\boldsymbol{\theta}^{\star}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big]\Big].$$

Problem 1 is then converted to:

$$\phi^{\star} \in \arg\max_{\phi}\; \operatorname{Regret}(\pi_{\phi}, \pi_{\boldsymbol{\theta}^{\star}}), \tag{3}$$

$$\text{s.t.}\quad \boldsymbol{\theta}^{\star}(\phi) \in \arg\min_{\boldsymbol{\theta}}\; \operatorname{Regret}(\pi_{\phi}, \pi_{\boldsymbol{\theta}}). \tag{4}$$

Note that (i) for the solver's optimization, Eq. 4 is equivalent to Eq. 2 by definition, and (ii) for the creator's optimization, Eq. 3 approximates Eq. 1 with a worst-case optimality guarantee for the solver's policy when π_true is unknown.

Remark 1

Under mild assumptions, the (local) Nash equilibrium is a (local) minimax point (Jin et al., 2020) for the above optimization; here, at the (local) Nash equilibrium, the solver follows a (local) minimax regret policy (Jiang, 2023), i.e., the solver’s regret is worst-case optimal.

The equilibrium of the game can be found by alternating optimization (Zhang et al., 2022). Intuitively, this creates evolving prompt distributions that challenge the agent progressively for better generalization; the regret objective ensures robustness on such evolving curricula by incentivizing agents to perform well in all cases, providing a worst-case guarantee. In optimization, this yields a sweet spot where the creator creates challenging yet solvable prompts (i.e., neither too hard nor too easy) for the solver, as illustrated in § H.3. Next, we discuss details of the optimization and the approximations applied.

Regret Minimization for the Solver.

Any preference optimization algorithm can be used as a plug-in for the solver's regret-minimization step in Algorithm 1.

Regret Maximization for the Creator.

While it is straightforward for the solver to minimize regret via policy optimization, the true optimal policy remains unknown during optimization, and we must approximate it when using it as the utility that incentivizes the creator. Similar to heuristics in prior works (Jiang et al., 2021b, a; Parker-Holder et al., 2022), we use the advantage-based estimate for each x:

$$\widehat{\operatorname{Regret}}(\mathbf{x}, \pi_{\boldsymbol{\theta}}) \leftarrow r(\mathbf{x}, \mathbf{y}_{\mathsf{baseline}}) - r(\mathbf{x}, \mathbf{y}^{+}), \tag{5}$$

where

$$\mathbf{y}^{+} := \arg\max_{\mathbf{y}_i} r(\mathbf{x}, \mathbf{y}_i), \qquad \mathbf{y}_{\mathsf{baseline}} := \operatorname{avg}_{\mathbf{y}_i} r(\mathbf{x}, \mathbf{y}_i) \ \text{ or } \ \arg\min_{\mathbf{y}_i} r(\mathbf{x}, \mathbf{y}_i),$$

and {y_i}_{i=1}^n is a set of responses sampled from π_θ(· ∣ x) and r(·, ·) is the reward oracle. We choose the avg baseline for online RLHF and the min baseline for offline RLHF, based on consistent, strong empirical gains observed across extensive experiments. As the policy optimizes, the proxy approximates the true regret better4. We denote the negated regret estimate (i.e., the reward advantage) as the informativeness value of a prompt x w.r.t. θ:

$$\mathsf{info}_{\boldsymbol{\theta}}(\mathbf{x}) := \big|\widehat{\operatorname{Regret}}(\mathbf{x}, \pi_{\boldsymbol{\theta}})\big|. \tag{6}$$
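The regret proxy and the informativeness metric reduce to a few lines given per-response rewards; a minimal sketch mirroring Eqs. 5-6 (function names are ours, not the paper's):

```python
def regret_estimate(rewards, mode="min"):
    """Advantage-based regret proxy for one prompt (cf. Eq. 5).

    rewards: r(x, y_i) for n responses sampled from the current solver.
    mode="min" uses the worst response as the baseline (offline RLHF);
    mode="avg" uses the average reward as the baseline (online RLHF).
    Returns baseline reward minus best reward (non-positive by construction).
    """
    baseline = min(rewards) if mode == "min" else sum(rewards) / len(rewards)
    return baseline - max(rewards)

def informativeness(rewards, mode="min"):
    """info_theta(x) := |Regret_hat(x, pi_theta)| (cf. Eq. 6)."""
    return abs(regret_estimate(rewards, mode))
```

Prompts where the best sampled response far outperforms the baseline receive high informativeness, i.e., they are learnable yet not solved.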

Directly performing gradient ascent on the regret could lead to training instability (Zhang, 2023). In this work, we approximate new regret-maximizing prompt distributions in three steps:

1. Estimate the informativeness of each prompt in the set.

2. Sample a subset of high-regret prompts.

3. Generate new prompts by making variations on those high-regret prompts.

This scalable approach to regret maximization in eva relates to curriculum RL (Parker-Holder et al., 2022), which finds high-regret environments and then edits them within some distance, and to evolution strategies (Schwefel, 1977), which find the most promising species and then mutate and crossover.
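The three-step creator procedure can be sketched as follows; `sample_responses`, `reward_fn`, and `evolve_fn` are hypothetical callables standing in for the solver policy, the reward oracle, and an LLM-based prompt rewriter, and the subset size is an arbitrary illustrative choice.

```python
import random

def creator_step(prompts, sample_responses, reward_fn, evolve_fn,
                 n_responses=4, subset_size=None):
    """One creator step of eva: estimate -> sample -> evolve.

    sample_responses(x, n): draws n responses from the current solver.
    reward_fn(x, y): oracle reward for response y to prompt x.
    evolve_fn(x): rewrites a prompt into a variant (e.g. an LLM call
    with in-depth / in-breadth rewriting instructions).
    """
    # 1. estimate informativeness (reward advantage) per prompt
    def info(x):
        rs = [reward_fn(x, y) for y in sample_responses(x, n_responses)]
        return abs(min(rs) - max(rs))   # worst-case optimal advantage

    weights = [info(x) for x in prompts]

    # 2. weighted sampling of an informative subset
    k = subset_size or max(1, len(prompts) // 4)
    subset = random.choices(prompts, weights=weights, k=k)

    # 3. evolve: generate variants of the high-regret prompts
    return [evolve_fn(x) for x in subset]
```

The sampling step is stochastic by design; § 4.2.2 compares it against a greedy top-k variant.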

3.3 The Practical Algorithm

Algorithm 1 gives an overview of eva, where the creator constructs new training prompts after a full iteration of the solver.

3.3.1 The Solver Step

This step is classical preference optimization (Rafailov et al., 2023). Taking DPO as an example: for every prompt, we sample n responses and annotate rewards, then take the responses with the maximal and the minimal reward to construct a preference pair and optimize upon it.

Algorithm 1: Alternating Optimization of eva

    Input: initial policy π_θ0, initial prompt set 𝒳_0
    for iteration t = 1, 2, …:
        ▷ creator step
        estimate: 𝒳_{t−1} ← {(x_i, info(x_i)) ∣ x_i ∈ 𝒳_{t−1}}
        sample:   𝒳_{t−1}^info ← {x_i drawn w.p. ∝ info(x_i)}
        evolve:   𝒳_t ← evolve(𝒳_{t−1}^info)
        ▷ solver step
        generate: ∀ x_i ∈ 𝒳_t, {y_i^(j)} ∼ π_{θ_{t−1}}(· ∣ x_i)
        annotate reward: 𝒳_t′ ← 𝒳_t ∪ {(y_i^(j), r_i^(j))}
        optimize: θ_t ← θ_{t−1} + η ∇_θ 𝒥_{𝒳_t′}(θ)
    return final solver policy π_{θ_T}
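A minimal sketch of this pair construction, with `reward_fn` standing in for the reward oracle (names are illustrative):

```python
def build_preference_pair(prompt, responses, reward_fn):
    """Construct one DPO preference pair for a prompt (solver step).

    responses are assumed already sampled from the current policy;
    the highest-reward response becomes the chosen one and the
    lowest-reward response the rejected one.
    """
    scored = sorted(responses, key=lambda y: reward_fn(prompt, y))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}
```

The resulting records can be fed directly into a standard DPO training loop.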

3.3.2 The Creator Step

Plainly, the creator finds the most useful prompts and generates variants of them to approximate regret maximization.

Step 1: info ⁢ ( ⋅ ) – estimate the informativeness.

For each 𝐱 in the prompt set 𝒳 𝑡 , we generate responses, annotate rewards and estimate the informativeness of 𝐱 by Eq. 6.

Step 2: sample ⁢ ( ⋅ ) – weighted sampling for an informative subset.

By using the informativeness metric as the weight, we sample an informative subset 𝒳 𝑡 𝗂𝗇𝖿𝗈 to be evolved.

Step 3: evolve ⁢ ( ⋅ ) – evolving for high-regret prompts.

eva is agnostic to and does not rely on any specific evolving method (see empirical evidence in  §  H.1). We take Xu et al. (2023) as a default baseline for offline RLHF, with in-depth and in-breadth instructions for prompt re-writing.

As a side note, we discuss a useful technique below.

Prioritized Generative Buffer.

While eva can operate on the full 𝒟 at once and iteratively train LLMs (i.e., offline eva), the informativeness estimates can become off-policy. Inspired by Schaul et al. (2015), we design a simple prioritized generative buffer ℬ that extends Algorithm 1 to be on-policy, evolving per mini-batch (i.e., online eva), with:

1. Warm-up phase: we start with x from 𝒟 and populate ℬ with evolved prompts until it reaches size B.

2. Mix-up phase: training continues using a balanced mix of samples from 𝒟 and ℬ (prioritized by informativeness) per mini-batch. New prompts are evolved and added to ℬ, while older ones are removed.

3. Bootstrap phase: once 𝒟 is exhausted, training relies on ℬ, continuing to evolve and replace prompts.
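The three phases above rely on a buffer that supports prioritized sampling and eviction of stale prompts; a minimal sketch with a hypothetical interface (FIFO eviction is our simplifying assumption, not necessarily the paper's):

```python
import random
from collections import deque

class PrioritizedGenerativeBuffer:
    """Minimal sketch of a prioritized generative buffer for online eva.

    Holds evolved prompts with their informativeness scores; sampling is
    weighted by score, and the oldest entry is evicted once the buffer
    reaches capacity B.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()            # (prompt, info_score) pairs

    def add(self, prompt, info_score):
        if len(self.entries) >= self.capacity:
            self.entries.popleft()        # evict the oldest prompt
        self.entries.append((prompt, info_score))

    def sample(self, k):
        prompts = [p for p, _ in self.entries]
        weights = [w for _, w in self.entries]
        return random.choices(prompts, weights=weights, k=k)
```

During the mix-up phase, each mini-batch would draw partly from 𝒟 and partly from `sample(k)`.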

eva is easy to use and flexible to extend. We provide detailed instructions for practitioners in the Appendix.

4 Experiments

Datasets and models.

We use UltraFeedback (Cui et al., 2023) as the training dataset, which contains diverse high-quality prompts that are primarily human-generated. We use the instruction-finetuned Gemma-2-9B (Team et al., 2024b) as the base ( 𝜽 0 )5, which is a strong baseline for models of its size. Note that we directly apply RL training without SFT, as the base model is sufficiently capable.

Evaluation settings.

We use: (i) AlpacaEval 2.0 (Dubois et al., 2024), which assesses general instruction following with 805 questions; (ii) MT-Bench (Zheng et al., 2023), which evaluates multi-turn instruction following with 80 hard questions in 8 categories; (iii) Arena-Hard (Li et al., 2024), which is derived from 200K user queries on Chatbot Arena with 500 challenging prompts across 250 topics.

Optimization algorithms.

We evaluate our method across six representative RLHF algorithms:

Online RLHF: RLOO (Ahmadian et al., 2024), OAIF (i.e., online DPO) (Guo et al., 2024).

Offline RLHF: (with reference) DPO (Rafailov et al., 2023), SPPO (Wu et al., 2024); (without reference) SimPO (Meng et al., 2024), ORPO (Hong et al., 2024).

Oracle reward models.

We take ArmoRM-8B (Wang et al., 2024) as the default reward model serving as the human-preference proxy, with the following used for ablation studies:

Pointwise: ArmoRM-8B (Wang et al., 2024), SkyworkRM-27B (Liu & Zeng, 2024).

Pairwise: PairRM-0.4B (Jiang et al., 2023), PairRM-8B (Dong et al., 2024).

4.1 Main Results

eva consistently achieves strong self-improvement.

As in Tables 1 and 2, eva yields notable performance improvements across different optimization algorithms, especially on the more challenging and robust Arena-Hard benchmark (Li et al., 2024). For example, eva brings a 10.6% gain with DPO in the offline setting and a 9.8% gain with RLOO in the online setting, surpassing claude-3-opus-20240229 as reported on the Arena-Hard leaderboard and matching gemini-1.5-pro, while using fully adaptive, self-automated joint prompt-response generation. This demonstrates the superior empirical performance of eva.

Figure 3: Illustration of gains from one round of eva with DPO.

Table 1: Online eva results. eva has notable gains and is comparable to default training with even 6x human prompts (the RLOO (6x) row). Note eva only uses 1x human prompts and continuously evolves (n x denotes total prompt size).

| Method (↓) / Metric (→) | Arena-Hard WR (%) | MT-Bench avg. | MT-Bench turn 1 | MT-Bench turn 2 | AE 2.0 LC-WR (%) |
| --- | --- | --- | --- | --- | --- |
| θ0: Base Model | 41.3 | 8.57 | 8.81 | 8.32 | 47.11 |
| θ0→1: RLOO (1x) | 52.6 | 8.68 | 9.02 | 8.34 | 54.23 |
| θ0→1~: RLOO-eva (1x) | 57.3 | 8.87 | 9.03 | 8.71 | 55.02 |
| θ0→1~: RLOO-eva (2x) | 60.5 | 8.96 | 9.12 | 8.80 | 57.10 |
| θ0→1~: RLOO-eva (3x) | 62.4 | 9.09 | 9.23 | 8.94 | 61.04 |
| θ0→1: RLOO (6x) | 62.7 | 9.07 | 9.24 | 8.90 | 62.91 |
| θ0→1: OAIF (1x) | 52.1 | 8.66 | 8.97 | 8.35 | 55.15 |
| θ0→1~: OAIF-eva (1x) | 55.0 | 8.85 | 9.04 | 8.66 | 55.43 |
| θ0→1~: OAIF-eva (2x) | 60.4 | 8.93 | 9.06 | 8.79 | 56.49 |
| θ0→1~: OAIF-eva (3x) | 61.7 | 9.01 | 9.19 | 8.82 | 59.09 |

Table 2: Offline eva results. We apply eva after 1 iteration of offline RLHF. It brings strong gains and can surpass training with human prompts. See more iterations in § 4.2.4.

| Method (↓) / Metric (→) | Arena-Hard WR (%) | MT-Bench avg. | MT-Bench turn 1 | MT-Bench turn 2 | AE 2.0 LC-WR (%) |
| --- | --- | --- | --- | --- | --- |
| θ0: Base Model | 41.3 | 8.57 | 8.81 | 8.32 | 47.11 |
| θ0→1: DPO | 51.6 | 8.66 | 9.01 | 8.32 | 55.01 |
| θ1→1~: + eva | 60.1 | 8.90 | 9.04 | 8.75 | 55.35 |
| θ1→2: + new human prompts | 59.8 | 8.64 | 8.88 | 8.39 | 55.74 |
| θ0→1: SPPO | 55.7 | 8.62 | 9.03 | 8.21 | 51.58 |
| θ1→1~: + eva | 58.9 | 8.78 | 9.11 | 8.45 | 51.86 |
| θ1→2: + new human prompts | 57.7 | 8.64 | 8.90 | 8.39 | 51.78 |
| θ0→1: SimPO | 52.3 | 8.69 | 9.03 | 8.35 | 54.29 |
| θ1→1~: + eva | 60.7 | 8.92 | 9.08 | 8.77 | 55.85 |
| θ1→2: + new human prompts | 54.6 | 8.76 | 9.00 | 8.52 | 54.40 |
| θ0→1: ORPO | 54.8 | 8.67 | 9.04 | 8.30 | 52.17 |
| θ1→1~: + eva | 60.3 | 8.89 | 9.07 | 8.71 | 54.39 |
| θ1→2: + new human prompts | 57.2 | 8.74 | 9.01 | 8.47 | 54.00 |

eva curricula can surpass human-crafted prompts.

We further show that eva models can match and even outperform those trained on additional new prompts from UltraFeedback (denoted as new human prompts, as they are primarily sourced from humans (Cui et al., 2023)), while being much more efficient. Interestingly, on MT-Bench, training with new human prompts typically shows decreased performance on the 1st turn and only modest gains on the 2nd turn, whereas eva notably enhances 2nd-turn gains. We hypothesize that eva adaptively evolves novel, learnable prompts that include features of second-turn questions, reflecting generalized skills like handling follow-up interactions.

4.2 Ablation Studies

Taking offline DPO as a representative case, we conduct extensive ablation studies on eva, with key findings:

- § 4.2.1 - informativeness metric: our regret-based metric outperforms other alternatives.
- § 4.2.2 - adaptive evolving procedure: our method outperforms active selection without evolving.
- § 4.2.3 - scaling with reward models: the alignment gain of eva scales with reward model quality.
- § 4.2.4 - continual training: our method shows monotonic gains with incremental training.
- § 4.2.5 - curriculum effect: our method creates a meaningful curriculum over iterations.

4.2.1 The Choice of Informativeness Metrics

Table 3: The reward-advantage-based metrics that serve as informativeness proxies for prompts.

| Metric info(x) | Definition | Related Approximation |
| --- | --- | --- |
| A⋆_min: worst-case optimal advantage | \|min_y r(x, y) − max_y′ r(x, y′)\| | minimax regret (Savage, 1951) |
| A⋆_avg: average optimal advantage | \|(1/N) Σ_y r(x, y) − max_y′ r(x, y′)\| | Bayesian regret (Banos, 1968) |
| A⋆_dts: dueling optimal advantage | \|max_{y≠y⋆} r(x, y) − max_y′ r(x, y′)\| | min-margin regret (Wu & Liu, 2016) |

Table 4: The choice of informativeness metric matters. Our adaptive metric by reward advantage achieves the best performance. See also § D for visualization.

| Method (↓) / Metric (→) | Arena-Hard WR (%) | MT-Bench avg. | MT-Bench turn 1 | MT-Bench turn 2 | AE 2.0 LC-WR (%) |
| --- | --- | --- | --- | --- | --- |
| θ0→1: DPO | 51.6 | 8.66 | 9.01 | 8.32 | 55.01 |
| θ1→1~: + eva (uniform) | 57.5 | 8.71 | 9.02 | 8.40 | 53.43 |
| θ1→1~: + eva (var(r)) | 54.8 | 8.66 | 9.13 | 8.20 | 54.58 |
| θ1→1~: + eva (avg(r)) | 58.5 | 8.76 | 9.13 | 8.40 | 55.01 |
| θ1→1~: + eva (1/avg(r)) | 56.7 | 8.79 | 9.13 | 8.45 | 55.04 |
| θ1→1~: + eva (1/A⋆_min) | 52.3 | 8.64 | 8.96 | 8.31 | 53.84 |
| θ1→1~: + eva (A⋆_avg) (our variant) | 60.0 | 8.85 | 9.08 | 8.61 | 56.01 |
| θ1→1~: + eva (A⋆_dts) (our variant) | 60.0 | 8.86 | 9.18 | 8.52 | 55.96 |
| θ1→1~: + eva (A⋆_min) (our default) | 60.1 (+8.5) | 8.90 | 9.04 | 8.75 (+0.43) | 55.35 |

Reward advantage as the informativeness metric outperforms baselines.
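The three advantage variants in Table 3 can each be computed from a list of per-response rewards; a minimal sketch with illustrative names, where `a_dts` measures the gap between the best and second-best response.

```python
def a_min(rewards):
    """Worst-case optimal advantage (minimax regret)."""
    return abs(min(rewards) - max(rewards))

def a_avg(rewards):
    """Average optimal advantage (Bayesian regret)."""
    return abs(sum(rewards) / len(rewards) - max(rewards))

def a_dts(rewards):
    """Dueling optimal advantage (min-margin regret): gap between
    the best and the second-best response."""
    best, runner_up = sorted(rewards, reverse=True)[:2]
    return abs(runner_up - best)
```

All three need at least two sampled responses per prompt; they differ only in which baseline is compared against the best response.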

As in Table 4, eva offers an effective curriculum via the advantage-based proxy as the informativeness metric (bottom row):

Comparing with uniform evolving: existing baselines generate prompts uniformly (Yuan et al., 2024) (cf. the principle of insufficient reason (Keynes, 1921; Tobin et al., 2017)). eva concretely outperforms them, corroborating (Das et al., 2024) that uniform learners can suffer from sub-optimality gaps.

Comparing with other heuristics: prior practices (Team et al., 2023) tried heuristics like prioritizing prompts with the highest variance in rewards or with the lowest/highest average reward. We find our advantage-based methods outperform those heuristics.

Comparing with the inverse advantage: contrary to curriculum learning, a line of work conjectures that examples with higher losses should be prioritized (Jiang et al., 2019), which can be done by inverting our metric. We find this significantly hurts the alignment gain, corroborating (Mindermann et al., 2022) that such examples can be unlearnable or irrelevant, meaning our curriculum is effective and practical.

Among our advantage variants: we designed variants of our default advantage-based metric, as in Table 3; the default A⋆_min remains competitive among its peers. Together, the advantage-based principle provides a robust guideline for sampling and evolving.

The lesson is that we must be selective about which prompts are promising to evolve; otherwise, unlearnable, noisy, or naïve prompts may hinder learning. Our regret-inspired metric provides a solid baseline.

4.2.2 The Effect of Evolving

| Method (↓) / Metric (→) | Arena-Hard WR (%) | MT-Bench avg. | MT-Bench turn 1 | MT-Bench turn 2 | AE 2.0 LC-WR (%) | AE 2.0 WR (%) |
| --- | --- | --- | --- | --- | --- | --- |
| θ0→1: DPO | 51.6 | 8.66 | 9.01 | 8.32 | 55.01 | 51.68 |
| θ1→1~: [no evolve]-greedy | 56.1 | 8.68 | 8.98 | 8.38 | 54.11 | 53.66 |
| θ1→1~: [no evolve]-sample | 55.3 | 8.69 | 9.00 | 8.38 | 54.22 | 54.16 |
| θ1→1~: + eva-greedy (our variant) | 59.5 | 8.72 | 9.06 | 8.36 | 54.52 | 55.22 |
| θ1→1~: + eva-sample (our default) | 60.1 | 8.90 | 9.04 | 8.75 | 55.35 | 55.53 |

Table 5: Effect of evolving. The [no evolve] rows train with only the informative subset, without evolving; -sample denotes the default weighted sampling procedure in Algorithm 1, while -greedy denotes the variant from the classical active data selection procedure (cf. a recent work (Muldrew et al., 2024) and a pre-LLM work (Kawaguchi & Lu, 2020)), which greedily selects data by a high-to-low ranking via the metric. Evolving brings a remarkable alignment gain (the eva rows vs. the [no evolve] rows); and as we evolve, sampling is more robust than being greedy.

The design of evolve(·) is effective.

As in Table 5:

Removing the evolve(·) step: if we only do subset sampling or ordered selection, we still see gains, but not as much as with evolving (e.g., eva brings 4.8% additional wins on Arena-Hard).

Altering the sample(·) step: if we greedily select prompts by the metric instead of using it as weights for importance sampling, performance is weaker as we evolve.

The lesson is that adaptive training within a fixed prompt distribution alone is not enough; our open-ended RLHF with generative prompt exploration gives substantial headroom for self-improvement. In other words, the RL post-training process should be both adaptive and generative in terms of the prompt distribution.

4.2.3 Scaling eva with Reward Models

Figure 4: eva scales with the quality of reward models, under pointwise RMs with DPO (left) and pairwise RMs with SPPO (right). Note SPPO handles general preferences and thus requires pairwise RMs, while DPO relies on the Bradley-Terry assumption, for which pointwise RMs are suitable.

Figure 4 presents the length-controlled win rate of eva on AlpacaEval using pointwise and pairwise reward models of varying scales. As the quality of the reward model improves, eva brings higher alignment gains. This scaling observation shows the effectiveness of eva in exploiting more accurate reward signals to choose informative prompts for better alignment. One takeaway is that interaction with the external world is essential for intelligence: the more accurate the observed reward signals, the better the agent can incentivize itself to improve (cf. (Silver et al., 2021)).

4.2.4 eva Improves Efficiency & Generalization

We run the default incremental training (i.e., training from the last checkpoint with the evolved set in each iteration); as in Fig. 5 and § H.2, eva presents monotonic gains.

The solutions found by eva cannot be recovered by training longer on a fixed set (the dashed line), nor by naïvely sourcing new prompts without examining informativeness (the gray dotted line); thus our generative data schedule is effective.

We conjecture that the behavior of the dashed/dotted lines relates to loss of plasticity (Ash & Adams, 2019; Dohare et al., 2023; Abbas et al., 2023; Xue et al., 2024). Classical works resolve it from the optimization view (e.g., weight perturbation), whereas eva offers a new data view, potentially mimicking an implicit regularizer for better generalization.

Figure 5:eva stays robust with more iterations.

In Table 6, we ablate eva with scratch training, i.e., training with the full original and evolved sets. eva remains competitive with incremental training, learning more effectively with less data – a nice bonus from minimax regret (Jiang et al., 2021a).

Table 6: Ablation on incremental vs. scratch training.

| Method (↓) / Metric (→) | Arena-Hard WR (%) | MT-Bench avg. score | AE 2.0 LC-WR (%) |
| --- | --- | --- | --- |
| θ0: SFT | 41.3 | 8.57 | 47.11 |
| θ0→1: DPO | 51.6 | 8.66 | 55.01 |
| θ0→1~: eva (scratch) | 59.8 | 8.88 | 54.59 |
| θ1→1~: eva (incremental) | 60.1 | 8.90 | 55.35 |

4.2.5 eva Creates Meaningful Curriculum

Table 7: eva improves prompt quality and complexity.

| Prompt Set (↓) / Metric (→) | Complexity (1-5) | Quality (1-5) |
| --- | --- | --- |
| UltraFeedback (seed) | 2.90 | 3.18 |
| UltraFeedback-eva-Iter-1 | 3.84 | 3.59 |
| UltraFeedback-eva-Iter-2 | 3.92 | 3.63 |
| UltraFeedback-eva-Iter-3 | 3.98 | 3.73 |

Figure 6: Curriculum effect in training distributions. The prompt distribution of Table 13. eva creates a curriculum that prioritizes math/coding prompts over iterations.

Figure 7: Curriculum effect in benchmark performance. The radar figure for ratings on MT-Bench. eva prioritizes and gradually improves on coding, math, and reasoning over iterations, implicitly reflecting a learned curriculum.

In Table 7, we show a gradual improvement in prompt complexity and quality over iterations with eva. In Figures 6 and 7, we show that eva brings auto-curricula: the creator is incentivized to create new prompts that are informative w.r.t. the current solver policy.

Together, this evidence supports the importance of adaptively evolving prompts jointly with responses, which we believe to be crucial for scaling up next-generation RL post-training.
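To make the informativeness signal concrete, one plausible instantiation of the regret-based metric is the reward advantage over sampled responses for a prompt: a large gap between the best sampled response and the average one suggests the prompt is neither already solved nor hopeless. This is an illustrative sketch; the function name and the max-minus-mean form are our assumptions, not the paper's exact formula.

```python
from statistics import mean

def informativeness(rewards):
    """Reward-advantage proxy for how much a prompt can still teach the
    solver: the gap between the best sampled response and the average one."""
    return max(rewards) - mean(rewards)

# Toy example: per-prompt rewards for a few sampled responses.
prompt_rewards = {
    "easy prompt": [0.95, 0.94, 0.96],  # already solved -> low signal
    "hard prompt": [0.10, 0.85, 0.20],  # high variance  -> high signal
}
ranked = sorted(prompt_rewards,
                key=lambda p: informativeness(prompt_rewards[p]),
                reverse=True)
print(ranked[0])  # the more informative prompt is prioritized
```

Under this proxy, the creator would prioritize "hard prompt", matching the intuition that prompts at the edge of the solver's ability are the most useful to evolve.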

5 Related Works

Throughout the history of machine learning, it is not new that self-play and data exploration bring intelligence (e.g., Schmidhuber (1991)). To our knowledge, eva is among the first empirical works to systematically study adaptive prompt evolving in RL post-training on large-scale LLM benchmarks. Below, viewing from different directions, we list eva's distinct impact and contributions.

Self-improving algorithms and iterative optimization.

This line of work focuses on iteratively generating samples from the response policy and continuously re-training the policy by selected self-generated samples. Major works include ReST (Gulcehre et al., 2023; Singh et al., 2023), STaR (Zelikman et al., 2022), RFT (Yuan et al., 2023), RAFT (Dong et al., 2023), self-improving LLMs (Huang et al., 2022; Yuan et al., 2024); in the context of preference optimization, iterative DPO (Tran et al., 2023; Xiong et al., 2024; Pang et al., 2024) has proven effective. Most works focus on self-training by improving in 𝒴 ∣ 𝒳 , while we jointly optimize both responses and prompts via generative exploration in ( 𝒳 , 𝒴 ) , allowing for continual RLHF.

Prompt synthesis for language models.

Major works include Self-Instruct (Wang et al., 2022), WizardLM (Xu et al., 2023; Luo et al., 2023), Self-Align (Sun et al., 2024), EvoPrompt (Guo et al., 2023), Magpie (Xu et al., 2024), etc. eva is orthogonal to them, since any such method can be plugged in as the evolve ⁢ ( ⋅ ) operator for the creator (note we contribute a small trick with the BoN tree in § 3). We focus on the RL post-training phase, while those works primarily target SFT. Importantly, our work proposes an adaptive metric to sample and evolve prompts – the secret sauce is that prompts shall be prioritized w.r.t. their informativeness, whereas prior works mostly evolve uniformly (Yuan et al., 2024). Furthermore, most works evolve in an offline manner, while eva, to our knowledge, is the first framework that supports online/on-policy evolving in general RL post-training.

Active and curriculum learning.

This line of work re-orders training examples for efficiency (Bengio et al., 2009a; Mindermann et al., 2022; Kawaguchi & Lu, 2020), with recent LLM-related works (Muldrew et al., 2024; Das et al., 2024) conducted at a much smaller scale than eva. In contrast, eva breaks free from the static paradigm: it not only re-orders data but also generatively creates new data, yielding significant gains (§ 4.2.2).

Self-play and curriculum RL.

Agents trained on a fixed data distribution are often brittle and may struggle to adapt to the real world (Hughes et al., 2024). Self-play (Samuel, 1959; Goodfellow et al., 2014; Silver et al., 2016) addresses this by having the agent learn through self-interaction, thus creating more diverse experiences and automatic curricula. In asymmetric self-play, the paradigm centers on “Alice proposing a task, and Bob doing it” (Sukhbaatar et al., 2017; Samvelyan et al., 2023; Beukman et al., 2024a; Dennis et al., 2020). We revive the classical asymmetric self-play (Sutton et al., 2011) in optimizing language models. Unlike traditional curriculum RL (Parker-Holder et al., 2022), which renders environments by specifying levels (Dennis et al., 2020), our approach is generative by nature, as we directly generate states from powerful generative models.

Self-play in RLHF.

A growing line of research frames RLHF as a symmetric self-play game, where both players are response players (Munos et al., 2023; Wu et al., 2024; Choi et al., 2024; Rosset et al., 2024). However, these methods still rely on a fixed prompt distribution and are thus sub-optimal. In contrast, we solve this by asymmetric self-play, enabling evolving prompt distributions. During our work, we noticed a nice concurrent work with the asymmetric setup (Zheng et al., 2024); however, (i) it applies to adversarial attack tasks instead of general alignment, (ii) it is incompatible with direct preference optimization, and (iii) it relies on the maximin principle (which may produce unlearnable environments (Dennis et al., 2020)) instead of the minimax-regret principle (Fan, 1953) as we do. We are also the first to define open-ended RLHF, generalizing classical RLHF.

6 Concluding Remarks

Future directions.

eva defines a new paradigm for RL post-training, opening up many new directions, e.g., (i) jointly optimizing the reward model (RM) with eva – we assume a fixed oracle RM, as is de facto practice in industry (Team et al., 2024a); however, as the policy updates, eva can generate out-of-distribution prompts, necessitating continual RM training (Makar-Limanov et al., 2024); (ii) extending to differentiable creator policies; (iii) extending to reasoning (Poesia et al., 2024); (iv) extending the game with more modalities (Bruce et al., 2024), and/or with more players (e.g., rewriters (Kumar et al., 2024)).

Conclusions.

This empirical work presents eva, a new, simple and scalable framework for RL post-training of LLMs that adaptively generates prompt distributions during training. eva is simple – it can be easily plugged into any existing pipeline – and highly effective – it reaches new SOTA on challenging alignment benchmarks. The primary takeaways are: (i) self-evolving the joint training distribution ( 𝒳 , 𝒴 ) brings significant gains, and (ii) reward advantage acts as an effective metric informing the collection and creation of prompts in RLHF. Philosophically, eva presents a new view of post-training as an infinite game; eva incentivizes agents to create problems rather than simply solve them, which is key to intelligence, yet one that LLM trainers may neglect.

We believe the community should be aware of the surprising effectiveness of adaptive evolving prompts in RL post-training, and scale them together with responses.

Impact Statement

Figure 8: eva. The creator is the prompt generation policy 𝜋 𝑋 and the solver is the response generation policy 𝜋 𝑌 | 𝑋 .

eva enables scalable training of language agents through open-ended training, improving AI alignment with human values, which will ultimately contribute to social welfare (Pigou, 1920; Arrow, 1952; Arendt, 1958; Zhi-Xuan et al., 2024; Rosenfeld & Xu, 2025). This may democratize the development of more generally capable artificial intelligence agents, impacting a wide range of domains from scientific discovery to societal governance.

We recognize that eva relies on self-exploration guided by reward signals. If these signals are inaccurate or misaligned, the trained agents may exhibit undesirable behaviors such as reinforcing biases or hallucinations. Mitigating these risks requires continued research into robust reward models, transparent evaluation protocols, open collaboration within the AI research community, and more. As authors, we will be committed to supporting these efforts by sharing our findings and implementations to promote open and responsible research and development.

Acknowledgments

We extend our warm gratitude to Bilal Piot for his thoughtful reviews and valuable suggestions on this paper. We are also grateful to Chenkai Kuang, Dustin Tran, Albert Webson, Hanzhao Lin, Clara Huiyi Hu, Jeremiah Liu, Luheng He, Chenjie Gu, Yong Cheng, Pei Sun, and Heng-Tze Cheng for their fun discussions and interesting notes on Self-Play. We thank David Abel, Yuxin Chen, Ziniu Hu, Guohao Li, Rylan Schaeffer, Haifeng Xu, Chaoqi Wang and Yifei Wang for their helpful discussions and references on fine-tuning, contrastive learning, data synthesis, open-ended learning, and continual reinforcement learning. We also thank many anonymous reviewers for their kind comments and advice.

References Abbas et al. (2023) ↑ Abbas, Z., Zhao, R., Modayil, J., White, A., and Machado, M. C.Loss of plasticity in continual deep reinforcement learning.In Conference on Lifelong Learning Agents, pp.  620–636. PMLR, 2023. Ahmadian et al. (2024) ↑ Ahmadian, A., Cremer, C., Gallé, M., Fadaee, M., Kreutzer, J., Üstün, A., and Hooker, S.Back to basics: Revisiting reinforce style optimization for learning from human feedback in llms.arXiv preprint arXiv:2402.14740, 2024. Arendt (1958) ↑ Arendt, H.The Origins of Totalitarianism.The American Historical Review, 1958. Arrow (1952) ↑ Arrow, K. J.Social choice and individual values, volume 12.Yale university press, 1952. Ash & Adams (2019) ↑ Ash, J. T. and Adams, R. P.On the difficulty of warm-starting neural network training.arXiv preprint arXiv:1910.08475, 2019. Ball et al. (2023) ↑ Ball, P. J., Smith, L., Kostrikov, I., and Levine, S.Efficient online reinforcement learning with offline data.In International Conference on Machine Learning, pp.  1577–1594. PMLR, 2023. Banos (1968) ↑ Banos, A.On pseudo-games.The Annals of Mathematical Statistics, 39(6):1932–1945, 1968. Bengio (2012) ↑ Bengio, Y.Deep learning of representations for unsupervised and transfer learning.In Proceedings of ICML workshop on unsupervised and transfer learning, pp.  17–36. JMLR Workshop and Conference Proceedings, 2012. Bengio et al. (2009a) ↑ Bengio, Y., Louradour, J., Collobert, R., and Weston, J.Curriculum learning.In Proceedings of the 26th annual international conference on machine learning, pp.  41–48, 2009a. Bengio et al. (2009b) ↑ Bengio, Y., Louradour, J., Collobert, R., and Weston, J.Curriculum learning.In Proceedings of the 26th annual international conference on machine learning, pp.  41–48, 2009b. Beukman et al. (2024a) ↑ Beukman, M., Coward, S., Matthews, M., Fellows, M., Jiang, M., Dennis, M., and Foerster, J.Refining Minimax Regret for Unsupervised Environment Design.arXiv preprint arXiv:2402.12284, 2024a. Beukman et al. 
(2024b) ↑ Beukman, M., Coward, S., Matthews, M., Fellows, M., Jiang, M., Dennis, M., and Foerster, J.Refining Minimax Regret for Unsupervised Environment Design.arXiv preprint arXiv:2402.12284, 2024b. Bruce et al. (2024) ↑ Bruce, J., Dennis, M. D., Edwards, A., Parker-Holder, J., Shi, Y., Hughes, E., Lai, M., Mavalankar, A., Steigerwald, R., Apps, C., et al.Genie: Generative interactive environments.In Forty-first International Conference on Machine Learning, 2024. Chaiklin et al. (2003) ↑ Chaiklin, S. et al.The zone of proximal development in Vygotsky’s analysis of learning and instruction.Vygotsky’s educational theory in cultural context, 1(2):39–64, 2003. Cheng et al. (2024) ↑ Cheng, C.-A., Nie, A., and Swaminathan, A.Trace is the Next AutoDiff: Generative Optimization with Rich Feedback, Execution Traces, and LLMs.arXiv preprint arXiv:2406.16218, 2024. Choi et al. (2024) ↑ Choi, E., Ahmadian, A., Geist, M., Pietquin, O., and Azar, M. G.Self-Improving Robust Preference Optimization.arXiv preprint arXiv:2406.01660, 2024. Chow et al. (2024) ↑ Chow, Y., Tennenholtz, G., Gur, I., Zhuang, V., Dai, B., Thiagarajan, S., Boutilier, C., Agarwal, R., Kumar, A., and Faust, A.Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models.arXiv preprint arXiv:2412.15287, 2024. Cui et al. (2023) ↑ Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., Ni, Y., Xie, G., Liu, Z., and Sun, M.Ultrafeedback: Boosting language models with high-quality feedback.arXiv preprint arXiv:2310.01377, 2023. Das et al. (2024) ↑ Das, N., Chakraborty, S., Pacchiano, A., and Chowdhury, S. R.Provably sample efficient rlhf via active preference optimization.arXiv preprint arXiv:2402.10500, 2024. Dennis et al. (2020) ↑ Dennis, M., Jaques, N., Vinitsky, E., Bayen, A., Russell, S., Critch, A., and Levine, S.Emergent complexity and zero-shot transfer via unsupervised environment design.Advances in neural information processing systems, 33:13049–13061, 2020. Dohare et al. 
(2023) ↑ Dohare, S., Hernandez-Garcia, J. F., Rahman, P., Mahmood, A. R., and Sutton, R. S.Maintaining plasticity in deep continual learning.arXiv preprint arXiv:2306.13812, 2023. Dong et al. (2023) ↑ Dong, H., Xiong, W., Goyal, D., Zhang, Y., Chow, W., Pan, R., Diao, S., Zhang, J., Shum, K., and Zhang, T.Raft: Reward ranked finetuning for generative foundation model alignment.arXiv preprint arXiv:2304.06767, 2023. Dong et al. (2024) ↑ Dong, H., Xiong, W., Pang, B., Wang, H., Zhao, H., Zhou, Y., Jiang, N., Sahoo, D., Xiong, C., and Zhang, T.RLHF Workflow: From Reward Modeling to Online RLHF, 2024. Dubois et al. (2024) ↑ Dubois, Y., Galambosi, B., Liang, P., and Hashimoto, T. B.Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024. Dwaracherla et al. (2024) ↑ Dwaracherla, V., Asghari, S. M., Hao, B., and Van Roy, B.Efficient exploration for llms.arXiv preprint arXiv:2402.00396, 2024. Fan (1953) ↑ Fan, K.Minimax theorems.Proceedings of the National Academy of Sciences, 39(1):42–47, 1953. Goodfellow et al. (2014) ↑ Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y.Generative adversarial nets.Advances in neural information processing systems, 27, 2014. Gulcehre et al. (2023) ↑ Gulcehre, C., Paine, T. L., Srinivasan, S., Konyushkova, K., Weerts, L., Sharma, A., Siddhant, A., Ahern, A., Wang, M., Gu, C., et al.Reinforced self-training (rest) for language modeling.arXiv preprint arXiv:2308.08998, 2023. Guo et al. (2023) ↑ Guo, Q., Wang, R., Guo, J., Li, B., Song, K., Tan, X., Liu, G., Bian, J., and Yang, Y.Connecting large language models with evolutionary algorithms yields powerful prompt optimizers.arXiv preprint arXiv:2309.08532, 2023. Guo et al. 
(2024) ↑ Guo, S., Zhang, B., Liu, T., Liu, T., Khalman, M., Llinares, F., Rame, A., Mesnard, T., Zhao, Y., Piot, B., et al.Direct language model alignment from online ai feedback.arXiv preprint arXiv:2402.04792, 2024. Gustafsson (2022) ↑ Gustafsson, J. E.Decisions under Ignorance, 2022.Lecture notes. Hejna et al. (2023) ↑ Hejna, J., Rafailov, R., Sikchi, H., Finn, C., Niekum, S., Knox, W. B., and Sadigh, D.Contrastive prefence learning: Learning from human feedback without rl.arXiv preprint arXiv:2310.13639, 2023. Hong et al. (2024) ↑ Hong, J., Lee, N., and Thorne, J.Orpo: Monolithic preference optimization without reference model.arXiv preprint arXiv:2403.07691, 2(4):5, 2024. Hu et al. (2024) ↑ Hu, J., Wu, X., Zhu, Z., Xianyu, Wang, W., Zhang, D., and Cao, Y.OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework.arXiv preprint arXiv:2405.11143, 2024. Huang et al. (2022) ↑ Huang, J., Gu, S. S., Hou, L., Wu, Y., Wang, X., Yu, H., and Han, J.Large language models can self-improve.arXiv preprint arXiv:2210.11610, 2022. Hughes et al. (2024) ↑ Hughes, E., Dennis, M., Parker-Holder, J., Behbahani, F., Mavalankar, A., Shi, Y., Schaul, T., and Rocktaschel, T.Open-Endedness is Essential for Artificial Superhuman Intelligence.arXiv preprint arXiv:2406.04268, 2024. Jiang et al. (2019) ↑ Jiang, A. H., Wong, D. L.-K., Zhou, G., Andersen, D. G., Dean, J., Ganger, G. R., Joshi, G., Kaminksy, M., Kozuch, M., Lipton, Z. C., et al.Accelerating deep learning by focusing on the biggest losers.arXiv preprint arXiv:1910.00762, 2019. Jiang et al. (2023) ↑ Jiang, D., Ren, X., and Lin, B. Y.Llm-blender: Ensembling large language models with pairwise ranking and generative fusion.arXiv preprint arXiv:2306.02561, 2023. Jiang (2023) ↑ Jiang, M.Learning Curricula in Open-Ended Worlds.arXiv preprint arXiv:2312.03126, 2023. Jiang et al. 
(2021a) ↑ Jiang, M., Dennis, M., Parker-Holder, J., Foerster, J., Grefenstette, E., and Rocktäschel, T.Replay-guided adversarial environment design.Advances in Neural Information Processing Systems, 34:1884–1897, 2021a. Jiang et al. (2021b) ↑ Jiang, M., Grefenstette, E., and Rocktäschel, T.Prioritized level replay.In International Conference on Machine Learning, pp.  4940–4950. PMLR, 2021b. Jiang et al. (2024) ↑ Jiang, Y., Zhou, A., Feng, Z., Malladi, S., and Kolter, J. Z.Adaptive data optimization: Dynamic sample selection with scaling laws.arXiv preprint arXiv:2410.11820, 2024. Jin et al. (2020) ↑ Jin, C., Netrapalli, P., and Jordan, M.What is local optimality in nonconvex-nonconcave minimax optimization?In International conference on machine learning, pp.  4880–4889. PMLR, 2020. Kawaguchi & Lu (2020) ↑ Kawaguchi, K. and Lu, H.Ordered sgd: A new stochastic optimization framework for empirical risk minimization.In International Conference on Artificial Intelligence and Statistics, pp.  669–679. PMLR, 2020. Keynes (1921) ↑ Keynes, J. M.A treatise on probability.Courier Corporation, 1921. Khirodkar & Kitani (2018) ↑ Khirodkar, R. and Kitani, K. M.Adversarial domain randomization.arXiv preprint arXiv:1812.00491, 2018. Kumar et al. (2024) ↑ Kumar, A., Zhuang, V., Agarwal, R., Su, Y., Co-Reyes, J. D., Singh, A., Baumli, K., Iqbal, S., Bishop, C., Roelofs, R., et al.Training language models to self-correct via reinforcement learning.arXiv preprint arXiv:2409.12917, 2024. Li et al. (2024) ↑ Li, T., Chiang, W.-L., Frick, E., Dunlap, L., Wu, T., Zhu, B., Gonzalez, J. E., and Stoica, I.From Crowdsourced Data to High-Quality Benchmarks: Arena-Hard and BenchBuilder Pipeline.arXiv preprint arXiv:2406.11939, 2024. Li et al. (2023) ↑ Li, Z., Xu, T., Zhang, Y., Lin, Z., Yu, Y., Sun, R., and Luo, Z.-Q.Remax: A simple, effective, and efficient reinforcement learning method for aligning large language models.arXiv preprint arXiv:2310.10505, 2023. Liu & Zeng (2024) ↑ Liu, C. Y. 
and Zeng, L.Skywork Reward Model Series.https://huggingface.co/Skywork, September 2024.URL https://huggingface.co/Skywork. Liu et al. (2023a) ↑ Liu, T., Zhao, Y., Joshi, R., Khalman, M., Saleh, M., Liu, P. J., and Liu, J.Statistical rejection sampling improves preference optimization.arXiv preprint arXiv:2309.06657, 2023a. Liu et al. (2023b) ↑ Liu, W., Zeng, W., He, K., Jiang, Y., and He, J.What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning.arXiv preprint arXiv:2312.15685, 2023b. Luo et al. (2023) ↑ Luo, H., Sun, Q., Xu, C., Zhao, P., Lou, J., Tao, C., Geng, X., Lin, Q., Chen, S., and Zhang, D.Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct.arXiv preprint arXiv:2308.09583, 2023. Makar-Limanov et al. (2024) ↑ Makar-Limanov, J., Prakash, A., Goktas, D., Greenwald, A., and Ayanian, N.STA-RLHF: Stackelberg Aligned Reinforcement Learning with Human Feedback.In Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop, 2024.URL https://openreview.net/forum?id=Jcn8pxwRa9. Meng et al. (2024) ↑ Meng, Y., Xia, M., and Chen, D.SimPO: Simple Preference Optimization with a Reference-Free Reward.arXiv preprint arXiv:2405.14734, 2024. Mindermann et al. (2022) ↑ Mindermann, S., Brauner, J. M., Razzak, M. T., Sharma, M., Kirsch, A., Xu, W., Höltgen, B., Gomez, A. N., Morisot, A., Farquhar, S., et al.Prioritized training on points that are learnable, worth learning, and not yet learnt.In International Conference on Machine Learning, pp.  15630–15649. PMLR, 2022. Muldrew et al. (2024) ↑ Muldrew, W., Hayes, P., Zhang, M., and Barber, D.Active Preference Learning for Large Language Models.arXiv preprint arXiv:2402.08114, 2024. Munos et al. (2023) ↑ Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., et al.Nash learning from human feedback.arXiv preprint arXiv:2312.00886, 2023. 
Nie et al. (2024) ↑ Nie, A., Cheng, C.-A., Kolobov, A., and Swaminathan, A.The Importance of Directional Feedback for LLM-based Optimizers.arXiv preprint arXiv:2405.16434, 2024. Nikishin et al. (2022) ↑ Nikishin, E., Schwarzer, M., D’Oro, P., Bacon, P.-L., and Courville, A.The primacy bias in deep reinforcement learning.In International conference on machine learning, pp.  16828–16847. PMLR, 2022. Nishihara (2025) ↑ Nishihara, R.Take on DeepSeek-R1, 2025.LinkedIn post. Ouyang et al. (2022) ↑ Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022. Pang et al. (2024) ↑ Pang, R. Y., Yuan, W., Cho, K., He, H., Sukhbaatar, S., and Weston, J.Iterative reasoning preference optimization.arXiv preprint arXiv:2404.19733, 2024. Parker-Holder et al. (2022) ↑ Parker-Holder, J., Jiang, M., Dennis, M., Samvelyan, M., Foerster, J., Grefenstette, E., and Rocktäschel, T.Evolving curricula with regret-based environment design.In International Conference on Machine Learning, pp.  17473–17498. PMLR, 2022. Pigou (1920) ↑ Pigou, A. C.Some aspects of welfare economics.The American Economic Review, 41(3):287–302, 1920. Poesia et al. (2024) ↑ Poesia, G., Broman, D., Haber, N., and Goodman, N. D.Learning Formal Mathematics From Intrinsic Motivation.arXiv preprint arXiv:2407.00695, 2024. Rafailov et al. (2023) ↑ Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., and Finn, C.Direct preference optimization: Your language model is secretly a reward model.arXiv preprint arXiv:2305.18290, 2023. Rosenfeld & Xu (2025) ↑ Rosenfeld, N. and Xu, H.Machine Learning Should Maximize Welfare, Not (Only) Accuracy.arXiv preprint arXiv:2502.11981, 2025. Ross & Bagnell (2012) ↑ Ross, S. and Bagnell, J. 
A.Agnostic system identification for model-based reinforcement learning.arXiv preprint arXiv:1203.1007, 2012. Rosset et al. (2024) ↑ Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T.Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences.arXiv preprint arXiv:2404.03715, 2024. Russo & Van Roy (2014) ↑ Russo, D. and Van Roy, B.Learning to optimize via information-directed sampling.Advances in neural information processing systems, 27, 2014. Samuel (1959) ↑ Samuel, A. L.Some studies in machine learning using the game of checkers.IBM Journal of research and development, 3(3):210–229, 1959. Samvelyan et al. (2023) ↑ Samvelyan, M., Khan, A., Dennis, M., Jiang, M., Parker-Holder, J., Foerster, J., Raileanu, R., and Rocktäschel, T.MAESTRO: Open-ended environment design for multi-agent reinforcement learning.arXiv preprint arXiv:2303.03376, 2023. Savage (1951) ↑ Savage, L. J.The theory of statistical decision.Journal of the American Statistical association, 46(253):55–67, 1951. Schaul et al. (2015) ↑ Schaul, T., Quan, J., Antonoglou, I., and Silve, D.Prioritized Experience Replay.arXiv preprint arXiv:1511.05952, 2015. Schmidhuber (1991) ↑ Schmidhuber, J.A possibility for implementing curiosity and boredom in model-building neural controllers.In Proc. of the international conference on simulation of adaptive behavior: From animals to animats, 1991. Schulman et al. (2015) ↑ Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P.High-dimensional continuous control using generalized advantage estimation.arXiv preprint arXiv:1506.02438, 2015. Schwefel (1977) ↑ Schwefel, H.-P.Evolutionsstrategien für die numerische Optimierung.Springer, 1977. Setlur et al. (2024) ↑ Setlur, A., Garg, S., Geng, X., Garg, N., Smith, V., and Kumar, A.RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold.arXiv preprint arXiv:2406.14532, 2024. Shi et al. (2024) ↑ Shi, R., Zhou, R., and Du, S. 
S.The Crucial Role of Samplers in Online Direct Preference Optimization.arXiv preprint arXiv:2409.19605, 2024. Silver et al. (2016) ↑ Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al.Mastering the game of Go with deep neural networks and tree search.nature, 529(7587):484–489, 2016. Silver et al. (2021) ↑ Silver, D., Singh, S., Precup, D., and Sutton, R. S.Reward is enough.Artificial Intelligence, 299:103535, 2021. Singh et al. (2023) ↑ Singh, A., Co-Reyes, J. D., Agarwal, R., Anand, A., Patil, P., Liu, P. J., Harrison, J., Lee, J., Xu, K., Parisi, A., et al.Beyond human data: Scaling self-training for problem-solving with language models.arXiv preprint arXiv:2312.06585, 2023. Sukhbaatar et al. (2017) ↑ Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A., and Fergus, R.Intrinsic motivation and automatic curricula via asymmetric self-play.arXiv preprint arXiv:1703.05407, 2017. Sun et al. (2024) ↑ Sun, Z., Shen, Y., Zhou, Q., Zhang, H., Chen, Z., Cox, D., Yang, Y., and Gan, C.Principle-driven self-alignment of language models from scratch with minimal human supervision.Advances in Neural Information Processing Systems, 36, 2024. Sutskever (2024) ↑ Sutskever, I.Scaling the right thing matters more now than ever, 2024.LessWrong post. Sutton et al. (2011) ↑ Sutton, R. S., Modayil, J., Delp, M., Degris, T., Pilarski, P. M., White, A., and Precup, D.Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction.In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp.  761–768, 2011. Tang et al. (2024) ↑ Tang, Y., Guo, Z. D., Zheng, Z., Calandriello, D., Munos, R., Rowland, M., Richemond, P. H., Valko, M., Pires, B. Á., and Piot, B.Generalized preference optimization: A unified approach to offline alignment.arXiv preprint arXiv:2402.05749, 2024. Team et al. 
(2025) ↑ Team, D., Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.arXiv preprint arXiv:2501.12948, 2025. Team et al. (2023) ↑ Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al.Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. Team et al. (2024a) ↑ Team, G., Reid, M., Savinov, N., Teplyashin, D., Dmitry, L., Lillicrap, T., Alayrac, J., Soricut, R., Lazaridou, A., Firat, O., et al.Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context.arXiv preprint arXiv:2403.05530, 2024a. Team et al. (2024b) ↑ Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al.Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024b. Tobin et al. (2017) ↑ Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P.Domain randomization for transferring deep neural networks from simulation to the real world.In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pp.  23–30. IEEE, 2017. Tran et al. (2023) ↑ Tran, H., Glaze, C., and Hancock, B.Iterative DPO Alignment.Technical report, Snorkel AI, 2023. von Stackelberg (1934) ↑ von Stackelberg, H.Marktform und Gleichgewicht.Die Handelsblatt-Bibliothek ”Klassiker der Nationalökonomie”. J. Springer, 1934.URL https://books.google.com/books?id=wihBAAAAIAAJ. Wang et al. (2024) ↑ Wang, H., Xiong, W., Xie, T., Zhao, H., and Zhang, T.Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts.arXiv preprint arXiv:2406.12845, 2024. Wang et al. (2022) ↑ Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. 
A., Khashabi, D., and Hajishirzi, H.Self-instruct: Aligning language models with self-generated instructions.arXiv preprint arXiv:2212.10560, 2022. Wu & Liu (2016) ↑ Wu, H. and Liu, X.Double thompson sampling for dueling bandits.Advances in neural information processing systems, 29, 2016. Wu et al. (2024) ↑ Wu, Y., Sun, Z., Yuan, H., Ji, K., Yang, Y., and Gu, Q.Self-play preference optimization for language model alignment.arXiv preprint arXiv:2405.00675, 2024. Xie et al. (2023) ↑ Xie, S. M., Santurkar, S., Ma, T., and Liang, P. S.Data selection for language models via importance resampling.Advances in Neural Information Processing Systems, 36:34201–34227, 2023. Xiong et al. (2024) ↑ Xiong, W., Dong, H., Ye, C., Wang, Z., Zhong, H., Ji, H., Jiang, N., and Zhang, T.Iterative preference learning from human feedback: Bridging theory and practice for rlhf under kl-constraint.In Forty-first International Conference on Machine Learning, 2024. Xu et al. (2023) ↑ Xu, C., Sun, Q., Zheng, K., Geng, X., Zhao, P., Feng, J., Tao, C., and Jiang, D.Wizardlm: Empowering large language models to follow complex instructions.arXiv preprint arXiv:2304.12244, 2023. Xu et al. (2024) ↑ Xu, Z., Jiang, F., Niu, L., Deng, Y., Poovendran, R., Choi, Y., and Lin, B. Y.Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing.arXiv preprint arXiv:2406.08464, 2024. Xue et al. (2024) ↑ Xue, F., Fu, Y., Zhou, W., Zheng, Z., and You, Y.To repeat or not to repeat: Insights from scaling llm under token-crisis.Advances in Neural Information Processing Systems, 36, 2024. Yuan et al. (2024) ↑ Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J.Self-rewarding language models.arXiv preprint arXiv:2401.10020, 2024. Yuan et al. (2023) ↑ Yuan, Z., Yuan, H., Li, C., Dong, G., Lu, K., Tan, C., Zhou, C., and Zhou, J.Scaling relationship on learning mathematical reasoning with large language models.arXiv preprint arXiv:2308.01825, 2023. Yuksekgonul et al. 
(2024) ↑ Yuksekgonul, M., Bianchi, F., Boen, J., Liu, S., Huang, Z., Guestrin, C., and Zou, J.TextGrad: Automatic” Differentiation” via Text.arXiv preprint arXiv:2406.07496, 2024. Zelikman et al. (2022) ↑ Zelikman, E., Wu, Y., Mu, J., and Goodman, N.Star: Bootstrapping reasoning with reasoning.Advances in Neural Information Processing Systems, 35:15476–15488, 2022. Zhang (2023) ↑ Zhang, G.Deep Learning Dynamics: From Minimization to Games.PhD thesis, University of Toronto (Canada), 2023. Zhang et al. (2022) ↑ Zhang, G., Wang, Y., Lessard, L., and Grosse, R. B.Near-optimal local convergence of alternating gradient descent-ascent for minimax optimization.In International Conference on Artificial Intelligence and Statistics, pp.  7659–7679. PMLR, 2022. Zheng et al. (2023) ↑ Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36:46595–46623, 2023. Zheng et al. (2024) ↑ Zheng, R., Guo, H., Liu, Z., Zhang, X., Yao, Y., Xu, X., Wang, Z., Xi, Z., Gui, T., Zhang, Q., et al.Toward Optimal LLM Alignments Using Two-Player Games.arXiv preprint arXiv:2406.10977, 2024. Zhi-Xuan et al. (2024) ↑ Zhi-Xuan, T., Carroll, M., Franklin, M., and Ashton, H.Beyond Preferences in AI Alignment.Philosophical Studies, pp.  1–51, 2024. Appendix

We would like to open-source all code, generated data and trained models upon approval – until then, we are more than happy to provide clarifications to help open-source communities re-implement eva and replicate our results.

The appendix is organized as follows:

§  A - Details On Reproducibility

§  G - Extended Results for Experiments in the Main Text

§  H - Additional Experiments Beyond the Main Text

§  I - Illustration on Methodology

§  D and §  J - Illustration on Prompts, Responses and Relevant Distributions

Appendix ADetails on Reproducibility

Our code is built on many open-source packages, and we deeply appreciate every open-source developer for their contributions to the community. The codebase is designed to be simple for practitioners to use, requiring only the addition of a creator module within the commonly adopted Alignment Handbook or OpenRLHF (Hu et al., 2024) pipeline.

Software environments.

All experiments are conducted on 8xNVIDIA H100 SXM GPUs. Our codebase primarily relies on transformers==4.40.0. For the response generation of Gemma models at the training stage, we use vllm==0.5.4 with flashinfer backend for CUDA 12.4 and torch 2.4. For evolving prompts, we use distilabel==1.3.2, and use LiteLLM to serve Gemini (default to be gemini-1.5-pro) and transformers models (default to be gemma-2-9b-it). For evaluation on all benchmarks, we use sglang==0.2.10 and openai==1.35.14, with gpt-4-1106-preview as the judge model and gpt-4-0314-preview as the baseline model. Specifically for AlpacaEval 2.0, we use alpaca_eval_gpt4_turbo_fn as the annotator config.
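The package versions above can be pinned in one step. A sketch of the install commands follows; the package names and versions are taken from the text, while the exact flashinfer wheel depends on the local CUDA/torch setup and is therefore omitted here.

```shell
# Pin the library versions reported in the text.
pip install \
    transformers==4.40.0 \
    vllm==0.5.4 \
    distilabel==1.3.2 \
    sglang==0.2.10 \
    openai==1.35.14 \
    torch==2.4.0
```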

Hyperparameter settings.

We follow the original hyperparameter settings as in (Hong et al., 2024; Meng et al., 2024; Wu et al., 2024), defaulting to:

| Hyperparameter (↓) / Loss (→) | DPO | ORPO | SimPO | SPPO |
|---|---|---|---|---|
| learning rate | 5e-7 | 5e-7 | 8e-7 | 5e-7 |
| learning rate scheduler | cosine | cosine | cosine | linear |
| 𝛽 | 0.05 | / | 10 | 0.001 |
| 𝛾 | / | / | 5 | / |
| 𝜆 | / | 0.5 | / | / |
| no. epochs per iter | 2 | 1 | 1 | 6 |
| warmup ratio per iter | 0.1 | 0.1 | 0.1 | 0.1 |
| effective batch size | 8 | 8 | 32 | 8 |
| max length | 2048 | 2048 | 2048 | 1024 |
| max prompt length | 1024 | 1024 | 1024 | 512 |
| optimizer | adamw | adamw | adamw | rmsprop |

Iterative Training Settings for Offline eva.

By default, we train with an equal-size prompt subset in each iteration. Unless otherwise specified, we use 10K prompts from the UltraFeedback dataset (Cui et al., 2023) per iteration. The incremental training proceeds as follows (note this is also compatible with online solvers):

𝜽 0 : Base SFT model.

𝜽 0 → 1 : initialize with 𝜽 0 ; then train w/ the prompt split 𝒳 1 by self-generated responses from the initial model 𝜽 0 .

𝜽 1 → 2 : initialize with 𝜽 0 → 1 ; then train w/ the prompt split 𝒳 2 by self-generated responses from the model 𝜽 0 → 1 .

For evolving prompts (e.g., evolving 𝒳 1 to 𝒳 1 ~ ), with the calculated informativeness metric for each prompt, we normalize the metrics into weights and perform weighted sampling of a 25% informative subset to get 𝒳 1 info . We then iterate over 𝒳 1 info and call EvolInstruct (Xu et al., 2023) as the plug-in evolving method (with the number of evolutions set to 4), using the default mutation templates for (i) in-depth evolving (constraints, deepening, concretizing, increased reasoning steps) and (ii) in-breadth evolving (extrapolation) as implemented in tasks/evol_instruct/utils.py of distilabel==1.3.2. Next, we uniformly select 80% of the prompts from this evolved dataset and 20% from the original dataset (i.e., the buffer) to form 𝒳 1 ~ . We did not perform an extensive parameter search (e.g., over the number of evolutions or the evolving ratio) at this stage and encourage future work to explore this along with other plug-in evolving methods. For the solver, we generate 6 responses per prompt. We use 42 as the random seed.
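The offline evolving step above can be sketched as follows. This is a minimal illustration, not the released implementation: `evol_instruct` is a hypothetical stand-in for the distilabel EvolInstruct call, and the informativeness scores are assumed precomputed.

```python
import random

def evol_instruct(prompt):
    """Hypothetical stand-in for the plug-in evolving call (EvolInstruct)."""
    return prompt + " [evolved]"

def evolve_prompt_set(prompts, info_scores, subset_frac=0.25,
                      n_evolutions=4, evolved_ratio=0.8, seed=42):
    """Form the next prompt set from seed prompts and their informativeness."""
    rng = random.Random(seed)
    # normalize informativeness scores into sampling weights
    total = sum(info_scores)
    weights = [s / total for s in info_scores]
    # weighted-sample the 25% informative subset
    k = int(subset_frac * len(prompts))
    info_subset = rng.choices(prompts, weights=weights, k=k)
    # call the plug-in evolver n_evolutions times per informative prompt
    evolved = [evol_instruct(p) for p in info_subset for _ in range(n_evolutions)]
    # mix 80% evolved prompts with 20% originals (the buffer)
    n_evolved = min(int(evolved_ratio * len(prompts)), len(evolved))
    new_set = rng.sample(evolved, n_evolved)
    new_set += rng.sample(prompts, len(prompts) - n_evolved)
    return new_set
```

The 25% / 4 / 80-20 values mirror the defaults described above and can be tuned per the parameter-search caveat.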

Training setting in online eva in Table 1.

Plainly put, online eva evolves per mini-batch. In Table 1, eva works in three phases with the generative buffer. Below we present an easy-to-understand illustration:

1.

Warm-up phase: Default training with data from 𝒟 until the buffer is full. To create the buffer, after each step (training on 8 samples in a mini-batch), we select the top 50% (i.e., 4) most informative prompts in the mini-batch, then evolve 𝑛 𝗇𝖾𝗐 versions of each one, and add only the evolved prompts to the buffer. (Here, we set 𝑛 𝗇𝖾𝗐 = 4, so the buffer increases by 16 at each step, until we reach the preset buffer size – which is 3200 currently, i.e., 400 iterations.)

2.

Mix-up phase: From now on, we do balanced sampling from the buffer and 𝒟 to form a mini-batch for training, with 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 50%: at each optimization step, we sample half of the mini-batch from the buffer by weighted sampling via the informativeness score, and take the other half i.i.d. from 𝒟 . We similarly evolve 𝑛 𝗇𝖾𝗐 new prompts for the top 50% informative prompts in this batch. We then add the newly generated prompts, and pop the trained ones and the oldest ones from the buffer, to keep a fixed buffer size. This balanced sampling approach resembles the scheme used in (Ross & Bagnell, 2012; Ball et al., 2023).

3.

Bootstrap phase: After we run out of samples from 𝒟 , we sample only from the buffer, with the same evolving procedure: we take the top 50% of each mini-batch, evolve 𝑛 𝗇𝖾𝗐 𝖻𝗈𝗈𝗍𝗌𝗍𝗋𝖺𝗉 = 2 prompts for each, add the evolved prompts to the buffer, and pop out the trained ones.

We set the subset sizes as powers of 2 due to hardware optimization constraints.
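The buffer mechanics of the warm-up and mix-up phases can be summarized in a schematic sketch; `evolve`, `informativeness`, and the dataset iterator below are hypothetical stand-ins, and the weighted buffer sampling is simplified to uniform for brevity.

```python
import random
from collections import deque

BATCH, TOP_FRAC, N_NEW, BUFFER_CAP = 8, 0.5, 4, 3200

def training_step(batch, buffer, evolve, informativeness):
    """After training on a mini-batch, evolve its most informative prompts
    and push only the evolved versions into the buffer (warm-up logic)."""
    ranked = sorted(batch, key=informativeness, reverse=True)
    top = ranked[: int(TOP_FRAC * len(batch))]      # top-50% informative prompts
    for prompt in top:
        for _ in range(N_NEW):                      # n_new evolved versions each
            buffer.append(evolve(prompt))
    while len(buffer) > BUFFER_CAP:                 # pop oldest to keep a fixed size
        buffer.popleft()

def make_minibatch(dataset_iter, buffer, rng, ratio_evol=0.5):
    """Mix-up phase: sample half the mini-batch from the buffer and half i.i.d.
    from D (informativeness-weighted sampling is omitted here)."""
    n_buf = int(ratio_evol * BATCH)
    from_buffer = rng.sample(list(buffer), n_buf)
    from_data = [next(dataset_iter) for _ in range(BATCH - n_buf)]
    return from_buffer + from_data
```

With `N_NEW = 4` and the top-50% selection, the buffer grows by 16 prompts per warm-up step, matching the description above.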

Appendix B Additional Ablation Experiments for Online eva

In the following ablation studies, keeping other hyper-parameters the same as above, we show that increasing 𝑛 𝗇𝖾𝗐 improves performance, while balanced sampling with 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 50% is generally the most robust.

Table 8: The default setting for the first column of each table is 𝑛 𝗇𝖾𝗐 = 4 and 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 50%. (Left) Varying the number of evolved prompts 𝑛 𝗇𝖾𝗐 : as we increase 𝑛 𝗇𝖾𝗐 , we observe a monotonic gain, probably because evolving more prompts helps improve the coverage and diversity of training prompts. (Right) Varying the sampling ratio of evolved prompts in each mini-batch: in general, we find the balanced sampling strategy more robust as training goes on. Both tables report 𝖠𝗋𝖾𝗇𝖺𝖧𝖺𝗋𝖽 WR (%).

| Setting (↓) | 𝑛 𝗇𝖾𝗐 = 4 | 𝑛 𝗇𝖾𝗐 = 8 |
|---|---|---|
| 𝜽 0 : Base Model | 41.3 | 41.3 |
| 𝜽 0 → 1 : 𝖱𝖫𝖮𝖮 (1x) | 52.6 | 52.6 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (1x) | 57.3 | 57.6 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (2x) | 60.5 | 61.2 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (3x) | 62.4 | 63.0 |
| 𝜽 0 → 1 : 𝖱𝖫𝖮𝖮 (6x) | 62.7 | 62.7 |

| Setting (↓) | 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 50% | 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 75% | 𝗋𝖺𝗍𝗂𝗈 𝖾𝗏𝗈𝗅 = 25% |
|---|---|---|---|
| 𝜽 0 : Base Model | 41.3 | 41.3 | 41.3 |
| 𝜽 0 → 1 : 𝖱𝖫𝖮𝖮 (1x) | 52.6 | 52.6 | 52.6 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (1x) | 57.3 | 57.0 | 57.5 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (2x) | 60.5 | 59.9 | 59.2 |
| 𝜽 0 → 1 ~ : 𝖱𝖫𝖮𝖮-eva (3x) | 62.4 | 62.0 | 61.3 |
| 𝜽 0 → 1 : 𝖱𝖫𝖮𝖮 (6x) | 62.7 | 62.7 | 62.7 |

Appendix C Additional References

In addition to § 5, prior works (Jiang et al., 2024; Xie et al., 2023) have explored signals for adaptive data sampling, which can be integrated into eva's generative pipeline. For the online solver, an alternative to RLOO (Ahmadian et al., 2024) is ReMax (Li et al., 2023). An important direction for future work is efficient online reinforcement learning of reward models on top of the creator-solver game, on which Russo & Van Roy (2014) and Dwaracherla et al. (2024) shed light.

Appendix D Visualization on Prompt Selection Metric

Figure 9: The probability density distributions of informativeness metrics compared in Table 4.2.1 – they show different patterns.

Figure 10: The correlation plot for reward advantage (ours) and reward variance – they are only weakly correlated.

In eva, we assign each prompt an informativeness value, which the creator uses as the weight to sample from the seed prompts for prompt synthesis. In § 4.2.1, we have shown that traditional metrics like reward mean and reward variance are less effective than our advantage-based informativeness proxy. The intuition is simple: an advantage/regret-based proxy aligns better with the preference optimization objective. We further illustrate that it is statistically different from the other choices:

Figure 9: The distribution of informativeness values shows that reward variance is heavily concentrated at lower values, reward mean is more uniformly scattered, and reward advantage achieves a better balance, providing a broader yet still focused sampling range.

Figure 10: The weak correlation between reward variance and reward advantage shows that variance cannot serve as a substitute for advantage as a proxy for informativeness.

We have discussed the contrastive curriculum hypothesis in § H.3 to support using reward advantage, in the sense that the induced samples tend to decrease the loss the most in the contrastive optimization. Furthermore, assuming the optimization algorithm can converge to the more optimal responses, neither reward mean nor reward variance directly captures the learning potential of such responses – one may easily construct cases with identical variance that differ much in reward range – thus variance fails to distinguish such scenarios. By contrast, the reward advantage estimate inherently captures the relative improvement toward better responses, and is sensitive to differences in reward range; variants of advantage estimates are commonly used in the literature, and we discuss the underlying principles in § I.
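The range-insensitivity of variance can be seen with a toy calculation (illustrative numbers only, not from the paper's runs):

```python
import statistics

# Two prompts whose sampled-response rewards have identical variance
# yet very different reward ranges:
rewards_a = [0.5, 1.5, 0.5, 1.5]          # range 1.0
rewards_b = [0, 1, 1, 1, 1, 1, 1, 2]      # range 2.0, same variance

var_a = statistics.pvariance(rewards_a)
var_b = statistics.pvariance(rewards_b)
assert var_a == var_b == 0.25             # variance cannot tell them apart

# an advantage-style gap proxy (best-minus-worst reward) does distinguish them
gap = lambda rs: max(rs) - min(rs)
assert gap(rewards_a) == 1.0 and gap(rewards_b) == 2.0
```

The gap proxy here is only a simplified stand-in for the advantage-based informativeness used in eva.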

Appendix E Complexity and Quality of Prompts Over Iterations

As in Table 7, there is a gradual improvement of prompt complexity and quality over iterations with eva. We sample 10K prompts per iteration, and use the below prompts modified from (Liu et al., 2023b) for the complexity and quality evaluation, with gemini-1.5-flash as the scorer:

1: Poor quality, 2: Below average, 3: Average, 4: Good, 5: Excellent.

Rank the following questions according to their difficulty and complexity. Use a fixed scoring system: 1: Very simple, 2: Simple, 3: Moderate, 4: Difficult, 5: Very difficult.

Appendix F Evolving Instructions

Figure 11: An illustrative example of tree BoN. Given a seed prompt, we uniformly sample evolving strategies from below to apply to it, which results in multiple generations; we then proceed with the best one (e.g., by the complexity and quality scorer) as the seed prompt for the next generation. We use this setting in the online eva setting as a trial.

```python
IN_BREADTH_KEYS = ['persona', 'shift-in', 'shift-out', 'mix', 'abstract']
IN_DEPTH_KEYS = ['constraints', 'deepening', 'concretizing', 'reasoning', 'expansion']

EVOL_METHODS = {
    # in-breadth evolving
    "persona": (
        "Reframe the #Given Prompt# as if written by a user with a completely different persona, background, or expertise. "
        "Adjust the tone, style, phrasing, or anything you feel proper to reflect this change. "
        "The changes should make the prompt feel like it was authored by someone entirely new."
    ),
    "shift-in": (
        "Shift the high-level idea of the #Given Prompt# to explore a different subdomain or context within the same domain. "
        "Ensure the new topic still challenges the model to reason or provide knowledge relevant to the domain."
    ),
    "shift-out": (
        "Shift the high-level idea of the #Given Prompt# to a completely different topic in a different setting. "
        "The new topic may challenge the model with similar reasoning or contextual understanding but in a novel way."
    ),
    "mix": (
        "Combine the high-level concept of the #Given Prompt# with elements from a different domain. "
        "Introduce novel scenarios or contexts to create diversity while maintaining relevance to the original idea."
    ),
    "abstract": (
        "Turn the #Given Prompt# into a more abstract or generalized version, removing specific details while preserving its intent. "
        "Ensure the new prompt encourages broader, principle-driven reasoning."
    ),
    # in-depth evolving
    "constraints": (
        "Add one or more significant constraints or requirements into the '#Given Prompt#'. "
        "The added constraints must meaningfully alter how the model would respond. "
        "For example, specify additional rules, contexts, or limitations that demand creative adjustments to the response."
    ),
    "deepening": (
        "If the #Given Prompt# contains inquiries about certain issues, increase the depth and breadth of the inquiry. "
        "Make the question require a more detailed, multi-layered, or comprehensive response. "
        "For instance, break the problem into sub-problems or require connections between unrelated concepts."
    ),
    "concretizing": (
        "Replace general concepts in the #Given Prompt# with more specific and detailed concepts. "
        "Ensure that the change makes the problem more defined and concrete, leaving less room for ambiguity. "
        "For example, replace 'a device' with 'a wearable fitness tracker with GPS'."
    ),
    "reasoning": (
        "Add one or more reasoning steps into the '#Given Prompt#'. "
        "Explicitly rewrite it to demand multi-step reasoning or justify intermediate steps in the solution. "
        "For instance, if the original prompt is a simple query, make the response require a step-by-step breakdown of logic or calculations."
    ),
    "expansion": (
        "Expand the #Given Prompt# by including additional perspectives, domains, or layers of complexity. "
        "For example, if the original prompt focuses on a single scenario, add related scenarios or ask the model to compare different situations."
    ),
}

INST_IN_DEPTH = (
    "Please act as an expert Prompt Rewriter.\n"
    "Your objective is to rewrite a given prompt into a more complex version "
    "to make those large language models (e.g., gemini) a bit harder to handle.\n"
    "But the rewritten prompt must be reasonable and must be understood and responded by humans.\n"
    "Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#, if there is any. "
    "You should try your best not to make the #Rewritten Prompt# become verbose; "
    "the #Rewritten Prompt# should be roughly the similar length or a little bit more than that of #Given Prompt#.\n"
    "The #Rewritten Prompt# must sound like a real human user's prompt; DON'T make it sound machine-generated. "
    "Specifically, you SHOULD complicate the given prompt using the following method: "
    "\n{method}\n"  # to be formatted
    "The rewritten prompt should reflect meaningful changes across its structure, "
    "ensuring the entire sentence feels sufficiently different from the original. "
    "Again, make sure the rewritten prompt presents a more CHALLENGING TASK. "
    "Respond with your rewritten prompt directly. "
    "#Given Prompt#:\n{prompt}\n"  # to be formatted
    "#Rewritten Prompt#:\n"
).lstrip()

INST_IN_BREADTH = (
    "Please act as an expert Prompt Creator.\n"
    "Your objective is to generate a brand-new prompt based on the #Given Prompt#. "
    "The purpose of this task is to promote diversity and generality of training prompts for language models, "
    "helping it practice with varied challenges and perspectives.\n"
    "The LENGTH and complexity of the #Created Prompt# should be similar to that of the #Given Prompt#.\n"
    "The #Created Prompt# must be reasonable, interpretable, and solvable by humans.\n"
    "The #Created Prompt# must sound like a real human user's prompt; DON'T make it sound like machine-generated. "
    "Follow the method described below to guide your creation:\n"
    "{method}\n"  # to be formatted
    "The created prompt should reflect meaningful changes across its structure, "
    "ensuring the entire sentence feels sufficiently different from the original. "
    "Respond with your created prompt directly.\n"
    "#Given Prompt#:\n{prompt}\n"  # to be formatted
    "#Created Prompt#:\n"
).lstrip()
```

Appendix G Additional Experimental Results

In general, eva maintains downstream performance and is robust on reasoning-heavy tasks; the scaling with reward models is more prominent on AlpacaEval, possibly due to the training sources of such reward models.

| Method (↓) / Dataset (→) | MUSR-TA | TruthfulQA-Gen | WMDP | GSM8K | GSM-Plus | MMLU-Pro |
|---|---|---|---|---|---|---|
| 𝜽 0 : SFT | 38.80 | 34.76 | 58.62 | 24.64 | 18.62 | 52.08 |
| 𝜽 0 → 1 : DPO | 38.40 | 34.76 | 58.45 | 24.56 | 18.50 | 52.63 |
| 𝜽 1 → 1 ~ : + eva | 38.40 | 34.15 | 58.40 | 24.26 | 17.96 | 53.03 |
| 𝜽 0 → 1 : SPPO | 40.80 | 34.15 | 58.72 | 24.79 | 18.42 | 52.70 |
| 𝜽 1 → 1 ~ : + eva | 41.20 | 34.64 | 58.94 | 25.40 | 18.88 | 52.47 |

Table 9: Performance on downstream tasks.

| Method (↓) / Metric (→) | MT-Bench avg. score | MT-Bench 1st turn | MT-Bench 2nd turn | Arena-Hard WR (%) | AlpacaEval 2.0 LC (%) | AlpacaEval 2.0 WR (%) |
|---|---|---|---|---|---|---|
| 𝜽 0 → 1 : DPO | 8.66 | 9.01 | 8.32 | 51.6 | 55.01 | 51.68 |
| 𝜽 1 → 1 ~ : + eva-i (ArMO-8B) | 8.90 | 9.04 | 8.75 | 60.1 | 55.35 | 55.53 |
| 𝜽 1 → 1 ~ : + eva-i (SkyworkRM-27B) | 8.75 | 9.07 | 8.43 | 60.3 | 56.12 | 56.40 |

Table 10: Effect of (pointwise) reward models.

| Method (↓) / Metric (→) | MT-Bench avg. score | MT-Bench 1st turn | MT-Bench 2nd turn | Arena-Hard WR (%) | AlpacaEval 2.0 LC (%) | AlpacaEval 2.0 WR (%) |
|---|---|---|---|---|---|---|
| 𝜽 0 → 1 : SPPO | 8.62 | 9.03 | 8.21 | 55.7 | 51.58 | 42.17 |
| 𝜽 1 → 1 ~ : + eva-i (PairRM-0.4B) | 8.78 | 9.11 | 8.45 | 58.9 | 51.86 | 43.04 |
| 𝜽 1 → 1 ~ : + eva-i (PairRM-8B) | 8.89 | 9.08 | 8.70 | 60.2 | 52.71 | 44.52 |

Table 11: Effect of (pairwise) reward models.

Appendix H Additional Experimental Results (as Extensions)

H.1 Experiments on Different evolve(·) Methods

As an addition to Table 2, we have experimented with three different evolve(·) methods, including:

SelfInstruct (Wang et al., 2022): Given seed prompts, variations are created based on criteria such as verb diversity and style blending (mixing interrogative and imperative styles). Unlike EvolInstruct (Xu et al., 2023), which generates prompt variations sequentially, this approach generates them independently. We follow the one-shot implementation in self_instruct.py of distilabel==1.4.1 and modified the instruction on conciseness so that the newly generated prompts have similar lengths to the seed prompts.

EvolQuality and EvolComplexity (Liu et al., 2023b): The two methods use the same evolutionary approach (i.e., sequential generation), but with slightly different meta-instructions for prompt generation: EvolQuality asks to improve the quality (i.e., helpfulness, relevance, etc.) of the seed prompt, and EvolComplexity asks to improve its complexity (i.e., increased reasoning steps, etc.). We follow the implementation in evol_quality/utils.py and evol_complexity/utils.py of distilabel==1.4.1.

| Method (↓) / Metric (→) | WR (%) | avg. len |
|---|---|---|
| 𝜽 0 : SFT | 41.3 | 544 |
| 𝜽 0 → 1 : DPO | 51.6 | 651 |
| 𝜽 1 → 1 ~ : + eva (evolve(·) = EvolInstruct) | 60.1 | 733 |
| 𝜽 1 → 1 ~ : + eva (evolve(·) = EvolQuality) | 58.7 | 721 |
| 𝜽 1 → 1 ~ : + eva (evolve(·) = EvolComplexity) | 60.6 | 749 |
| 𝜽 1 → 1 ~ : + eva (evolve(·) = SelfInstruct) | 57.2 | 725 |

Table 12: Results of using different evolving methods (Gemma-2-9B-it on Arena-Hard). eva is effective under different evolving methods.

As shown in Table 12, our method brings strong performance gains without training on additional human prompts. Among the experimented methods, we find EvolComplexity shows the best results.

We believe the main strength of such methods is their simplicity. Viewing the evolving process as 𝐱 ′ ← 𝑝 𝜽 ( ⋅ ∣ 𝐱 , meta_prompt ) , one can easily tune the meta prompt in natural language for improved performance. However, such simplicity comes at a price: (i) the main weakness is that the default method does not take environmental feedback into account (e.g., rewards received, verbal critique on responses, etc.) and relies on the pre-defined meta prompt, so the evolving may be less directional; we encourage practitioners to incorporate richer feedback during evolving (one way to formulate this is generative optimization (Yuksekgonul et al., 2024; Cheng et al., 2024; Nie et al., 2024)); (ii) another weakness is that the existing method is single-shot (i.e., we evolve based on a single 𝐱 each time), so the diversity of the generation may be limited – we anticipate future work improving this with multi-shot evolving via graph-based sampling. In this regard, the evolving process can be viewed as { 𝐱 ′ } 𝑖 = 1 𝑁 ← 𝑝 𝜽 ( ⋅ ∣ { 𝐱 } 𝑖 = 1 𝑀 , meta_prompt , env_feedback ) .

H.2 Experiments on Number of Iterations

As an addition to §  4.2.4, we have experimented with the following settings:

10K prompts per iteration with 3 iterations.

20K prompts per iteration with 3 iterations (i.e., all seed prompts are used).

60K prompts per iteration with 2 iterations (i.e., all seed prompts are used).

Due to time constraints, we did not perform an extensive hyper-parameter search; however, we believe the results presented below sufficiently demonstrate the performance gains achieved by eva.

| Method (↓) / Metric (→) | WR (%) | avg. len |
|---|---|---|
| 𝜽 0 : SFT | 41.3 | 544 |
| 𝜽 0 → 1 : DPO (10k) | 51.6 | 651 |
| 𝜽 1 → 2 : DPO (10k) | 59.8 | 718 |
| 𝜽 2 → 3 : DPO (10k) | 61.2 | 802 |
| 𝜽 1 → 1 ~ : + eva (10k) | 60.1 | 733 |
| 𝜽 1 ~ → 2 ~ : + eva (10k) | 62.0 | 787 |
| 𝜽 2 ~ → 3 ~ : + eva (10k) | 62.2 | 774 |

Table 13: Results of using 10k prompts per iteration (DPO + length-penalized NLL loss); Gemma-2-9B-it on Arena-Hard.

| Method (↓) / Metric (→) | WR (%) | avg. len |
|---|---|---|
| 𝜽 0 : SFT | 41.3 | 544 |
| 𝜽 0 → 1 : DPO (20k) | 53.2 | 625 |
| 𝜽 1 → 2 : DPO (20k) | 47.0 | 601 |
| 𝜽 2 → 3 : DPO (20k) | 46.8 | 564 |
| 𝜽 1 → 1 ~ : + eva (20k) | 59.5 | 826 |
| 𝜽 1 ~ → 2 ~ : + eva (20k) | 60.0 | 817 |
| 𝜽 2 ~ → 3 ~ : + eva (20k) | 61.4 | 791 |

Table 14: Results of using 20k prompts per iteration (DPO + length-penalized NLL loss); Gemma-2-9B-it on Arena-Hard.

| Method (↓) / Metric (→) | WR (%) | avg. len |
|---|---|---|
| 𝜽 0 : SFT | 41.3 | 544 |
| 𝜽 0 → 1 : DPO (60k) | 58.9 | 717 |
| 𝜽 1 → 1 ~ : + eva (60k) | 59.6 | 725 |
| 𝜽 1 ~ → 1 ~ ′ : + eva (60k) | 61.9 | 792 |

Table 15: Results of using 60k prompts per iteration (DPO + length-penalized NLL loss); Gemma-2-9B-it on Arena-Hard.

eva can bring robust gains with multiple iterations.

As shown in Tables 13, 14, and 15, our method presents persistent performance gains over iterations, and concretely surpasses default DPO training with true human prompts.

However, there exist diminishing marginal gains in iterative off-policy training. We ground eva in the iterative (off-policy) preference alignment paradigm due to its efficiency and ease of integration. Such paradigms, however, inherently face diminishing returns, where performance gains decrease with successive iterations, as previously observed in (Wu et al., 2024; Setlur et al., 2024; Yuan et al., 2024; Nikishin et al., 2022). While the generative data schedule in eva mitigates these challenges and extends beyond default training with human prompts (see also § 4.2.4), the gains can weaken over iterations. We summarize potential reasons as: (i) off-policy signal decay – as the number of examples increases, signals from the off-policy data become weaker due to distributional shift; (ii) the loss of plasticity, where the agent's ability to learn good policies decreases in continued training over more iterations (Nikishin et al., 2022); (iii) the ability of the solver – as we evolve harder prompts, it becomes harder for the solver to produce preferred responses (thus more explicit reasoning techniques may be needed); (iv) the ability of the reward model to correctly provide reward signals to responses, and thus informativeness signals to prompts, as there may exist a distributional mismatch.

Thus, we envision future work to build on eva by: (i) exploring its integration with on-policy RLHF (e.g., instead of evolving prompts in iterations, one may evolve in batches); (ii) enhancing solver capabilities, such as sampling more responses during inference or leveraging meta-instructions to guide deeper reasoning; (iii) continual training of reward models for them to co-evolve with the creators and the solvers.

H.2.1 Bonus Experiments on rewriter(·) In The Loop

We present the basic idea here for practitioners to build upon. The motivation comes from the hypotheses derived in § H.2: as the prompts get harder through evolving, there may be greater demands on the solver's capabilities than in earlier iterations. As such, the solver may not be naively treated the same. One may address this by either scaling up response sampling or introducing meta-instructions to explicitly enhance the solver's reasoning.

We design a proof-of-concept experiment w.r.t. the latter by adding a rewriter in eva's solver step. Previously, as in Algo. 1 and § 3.3.1, for each prompt 𝐱 , we generate multiple responses, and choose the best as 𝐲 + and the worst as 𝐲 − for preference optimization. Now, we add one more rewriting step that attempts to enhance 𝐲 + to be 𝐲 + ′ , by applying a rewriting instruction (Liu et al., 2023b) that asks the solver to alter 𝐲 + with improved helpfulness, relevance, reasoning depth, creativity, and details while keeping a similar length. We then train with ( 𝐱 , 𝐲 + ′ , 𝐲 − ) for preference optimization. Table 16 shows that adding the rewriter yields concrete performance gains over the default training method, while keeping the training budget and only slightly increasing the cost of offline data generation.
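The rewriter-augmented solver step can be sketched as follows; `generate`, `reward`, and `rewrite` are hypothetical stand-ins for the solver's sampling, the reward model, and the rewriting-instruction call.

```python
def solver_step_with_rewriter(prompt, generate, reward, rewrite, n_responses=6):
    """Build a preference pair (x, y+', y-): sample n responses, keep the
    best and worst by reward, then apply one rewriting step to the best."""
    responses = [generate(prompt) for _ in range(n_responses)]
    ranked = sorted(responses, key=lambda y: reward(prompt, y))
    y_minus, y_plus = ranked[0], ranked[-1]
    y_plus_prime = rewrite(prompt, y_plus)   # the extra rewriting step
    return prompt, y_plus_prime, y_minus     # train on (x, y+', y-)
```

The default n_responses of 6 matches the per-prompt sampling count used elsewhere in the paper.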

| Method (↓) / Metric (→) | WR (%) | avg. len |
|---|---|---|
| 𝜽 0 : SFT | 41.3 | 544 |
| 𝜽 0 → 1 : DPO (10k) | 51.6 | 651 |
| 𝜽 1 → 1 ~ : + eva (10k) | 60.1 | 733 |
| 𝜽 1 → 1 ~ : + eva with rewriter (10k) | 61.9 | 741 |

Table 16: Results of adding a rewriter in the solver step (Gemma-2-9B-it on Arena-Hard).

H.3 Understanding the Informativeness Proxy in Different Intuitive Ways

Learning potential.

Our metric intuitively identifies the learning potential of a prompt by measuring the gap between the best and worst responses to it from the solver. We reason that prompts eliciting both high-reward and low-reward outcomes reflect learnable tasks the model is capable of improving on but has not yet mastered, thereby implying learning potential (cf. Jiang et al., 2021b).

Worst-case guarantees.

The minimax-regret objective, by design, leads to solvers that perform robustly across the prompt space, thus giving a worst-case guarantee. While an exact equilibrium may not be attainable under approximation, our empirical results in § 4.2.1 demonstrate robustness.

Auto-curricula for the players.

We visualize the curriculum induced by eva in § 4.2.5. With a stochastic policy, the advantage may be heuristically understood as the reward difference between a base solver and a reference solver. Rather than optimizing separate solvers (Dennis et al., 2020), we sample multiple times from the same policy to create the pair. In this way, the creator is incentivized to produce new prompts that are just outside the comfort zone of the solver (Chaiklin et al., 2003):

For overly challenging prompts, both solutions perform poorly, leading to a low proxy.

For overly easy prompts, the base solution already performs well, again giving a low proxy.

The optimal strategy is to find prompts that are just beyond the solver’s current capability.
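This incentive structure can be illustrated with toy reward values (illustrative numbers, not from the paper's runs); the best-minus-worst gap below is a simplified stand-in for the advantage-based proxy:

```python
def informativeness(rewards):
    """Advantage-style gap proxy: best-minus-worst reward among samples."""
    return max(rewards) - min(rewards)

too_hard = [0.1, 0.0, 0.1, 0.0]    # every sampled response fails -> low proxy
too_easy = [0.9, 1.0, 0.9, 1.0]    # every sampled response succeeds -> low proxy
frontier = [0.1, 0.9, 0.2, 1.0]    # solvable but not yet mastered -> high proxy

assert informativeness(frontier) > informativeness(too_hard)
assert informativeness(frontier) > informativeness(too_easy)
```

Prompts at the frontier receive the highest sampling weight, which is exactly the "just beyond current capability" behavior described above.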

Auto-curricula inherent to Contrastive Optimization.

Contrastive preference optimization generalizes DPO and a family of algorithms (c.f., (Hejna et al., 2023; Rafailov et al., 2023; Tang et al., 2024)), many of whose losses monotonically decrease as the contrastive ratio increases. To be specific, the DPO (Rafailov et al., 2023) objective for RLHF is:

$$\mathcal{L}^{\mathrm{DPO}}_{\beta}(\pi_{\boldsymbol{\theta}}) = \sum_{(\mathbf{y}^{+},\,\mathbf{y}^{-},\,\mathbf{x})\in\mathcal{D}} -\log\big[\sigma\big(\beta\cdot\Delta^{\mathbf{x}}_{\boldsymbol{\theta};\,\mathrm{ref}}\big)\big], \tag{7}$$

where we use + , − to denote chosen and rejected responses, and denote the contrastive ratio as:

$$\Delta^{\mathbf{x}}_{\boldsymbol{\theta};\,\mathrm{ref}} := \log\frac{\pi_{\boldsymbol{\theta}}(\mathbf{y}^{+}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}^{+}\mid\mathbf{x})} - \log\frac{\pi_{\boldsymbol{\theta}}(\mathbf{y}^{-}\mid\mathbf{x})}{\pi_{\mathrm{ref}}(\mathbf{y}^{-}\mid\mathbf{x})}. \tag{8}$$

Here, by Table 3 and Eq. 8, the contrastive ratio can be written via the advantage-based proxy:

$$A^{\star}_{\min}(\mathbf{x}) = \beta\cdot\Delta^{\mathbf{x}}_{\boldsymbol{\theta}^{\star};\,\mathrm{ref}}. \tag{9}$$

By our proxy, we implicitly incentivize the creator to generate prompts that bring the most contrastive responses, which decrease the loss the most. This matches the curriculum learning literature, which prioritizes (in eva, generatively prioritizes) examples with smaller losses for better convergence and generalization (Bengio et al., 2009b). We hence suggest the Contrastive Curriculum Hypothesis: In contrastive preference optimization, prioritizing prompts with higher contrastive ratio improves sample efficiency and generalization.

We show initial empirical results on this in §  4.2.1 and §  4.2.4.
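A quick numerical check of the monotonicity underlying this hypothesis: the per-example DPO loss in Eq. 7 strictly decreases as the contrastive ratio in Eq. 8 grows (toy ratio values, chosen for illustration):

```python
import math

def dpo_loss(delta, beta=0.05):
    """Per-example DPO loss -log(sigmoid(beta * delta)), as in Eq. 7."""
    return -math.log(1.0 / (1.0 + math.exp(-beta * delta)))

contrastive_ratios = [-2.0, 0.0, 2.0, 10.0]
losses = [dpo_loss(d) for d in contrastive_ratios]
# the loss decreases monotonically as the contrastive ratio increases
assert all(a > b for a, b in zip(losses, losses[1:]))
```

Prioritizing prompts with larger contrastive ratios therefore selects the examples whose losses are smallest, as the hypothesis states.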

Appendix I Extended Background on the Methodology

I.1 Approaching Open-Ended Learning by Unsupervised Environment Design

I.1.1 The Asymmetric Game Formulation for Unsupervised Environment Design

While we cannot directly train the agent on the intractable target distribution of the open-ended world, it is possible to curate a curriculum of prompt distributions that improves over the static distribution and supports the continual training of the policy 𝜋 𝜽 ( ⋅ | 𝐱 ) , so that it keeps improving and succeeds over the full task space, thus conceptually approaching 𝜋 𝗍𝗋𝗎𝖾 ⁢ ( 𝐱 ) . This is often framed as an asymmetric two-player game.

Dennis et al. (2020) first formally define this problem as Unsupervised Environment Design (UED). The idea is that while real-world environments are inexhaustible and intractable to enumerate, there may exist some free parameters (e.g., height and roughness in a maze) which one may control to generate new environments; UED then concerns designing a distribution over those free parameters (i.e., settings) to create new, fully specified environments that can be used to train the agents.

In this setup, one player, the creator, generates new environments based on some specific decision rules (see the following), while the other player, the solver, optimizes its policy within these training environments, and the process continues iteratively. Common heuristic strategies include:

Randomization: environments are generated uniformly and independently of the solver’s current policy. This method is simple but less effective (Tobin et al., 2017).

Maximin: the creator generates environments that minimize the solver’s maximum possible reward, which can often lead to unsolvable scenarios (Khirodkar & Kitani, 2018).

Minimax regret: The creator targets environments that maximize the solver’s regret, defined as the difference between the optimal return achievable and that of the solver’s current policy (Beukman et al., 2024b). The regret is often conceived as the creator’s utility.

Among them, the minimax regret approach presents a sweet spot where the creator can create hard yet solvable environments, and is often empirically better. The minimax regret strategy also implies that the agent's policy is trained to perform well under all levels/settings, and thus enjoys a worst-case guarantee. However, while it is often straightforward for the solver to minimize the regret (e.g., through direct policy optimization), the optimal policy remains unknown during the optimization process, so regret as the decision signal is often intractable for the creator and requires approximation (this is described as the Achilles' heel of such curriculum RL methods by (Parker-Holder et al., 2022)).

I.1.2 Approximating the Regret and Generating New Environments

In general, the creator design in this line of research contains two steps:

1. identifying high-regret levels using different (often heuristic) regret approximations;

2. generating new environments by making variations on, or retrieving from buffers of, high-regret levels.

We hereby review major works on regret approximation and environment generation as follows:

Dennis et al. (2020) propose joint training for the creator and two competing solvers.

Regret approximation: here, two solver policies are trained, with the regret approximated as the difference in their returns. During each optimization step, one solver maximizes this regret, the other minimizes it, and the creator maximizes it.

Environment generation: the system directly samples the parameters from the creator policy and uses them to specify the environment.

Jiang et al. (2021b) propose random sampling on high-regret levels.

Regret approximation: as a heuristic, the authors use positive value loss, which is a function of Generalized Advantage Estimate (Schulman et al., 2015) (which itself is a function of the TD error – the difference between the expected and the actual returns) as the creator’s utility.

Environment generation: the creator maintains a rolling buffer of the highest-regret levels, found by random search over relevant configurations.

Jiang et al. (2021a) further propose a double-creator setting based on (Jiang et al., 2021b), where one creator is actively generating new environments, and the other is retrieving from the buffer.

Parker-Holder et al. (2022) propose sampling high-regret levels and generating new environments by making edits on existing ones. The regret approximation is the same as in (Jiang et al., 2021b) – the positive value loss. For environment generation, the authors suggest a general editing/mutation mechanism, where the creator chooses from high-regret levels and makes small variations within an edit distance. There is an additional filtering step: they do not directly train on newly generated levels, but evaluate on those levels first, then add only the high-regret ones to the training buffer.

Additional Preliminaries.

Let 𝑟 ⁢ ( ⋅ , ⋅ ) be an oracle reward model. The (unregularized) optimal policy is:

$$\pi^{\star} = \arg\max_{\pi}\ \mathbb{E}_{\mathbf{x}\sim\mathcal{D},\,\mathbf{y}\sim\pi(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big].$$

We have the optimal advantage / the negated regret as:

$$A^{\star}(\mathbf{x},\mathbf{y}) = r(\mathbf{x},\mathbf{y}) - \mathbb{E}_{\mathbf{y}'\sim\pi^{\star}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y}')\big] = r(\mathbf{x},\mathbf{y}) - V^{\star}(\mathbf{x}).$$

Classical preference-based RL assumes a reward-based preference model, that is:

$$P(\mathbf{y}^{+}\succeq\mathbf{y}^{-}) = \frac{\exp\big(r(\mathbf{x},\mathbf{y}^{+})\big)}{\exp\big(r(\mathbf{x},\mathbf{y}^{+})\big) + \exp\big(r(\mathbf{x},\mathbf{y}^{-})\big)}.$$

As a side note (Hejna et al., 2023), this is equivalent to the advantage/regret-based preference model, due to the bandit setup in RLHF:

𝑃 ⁢ ( 𝐲 + ⪰ 𝐲 − )

exp ( 𝑟 ( 𝐱 , 𝐲 + ) − 𝑉 ⋆ ( 𝐱 , 𝐲 ) ) exp ( 𝑟 ( 𝐱 , 𝐲 + ) − 𝑉 ⋆ ( 𝐱 , 𝐲 ) ) + exp ( 𝑟 ( 𝐱 , 𝐲 − ) − 𝑉 ⋆ ( 𝐱 , 𝐲 ) )

exp ( 𝐴 ⋆ ( 𝐱 , 𝐲 + ) ) exp ( 𝐴 ⋆ ( 𝐱 , 𝐲 + ) ) + exp ( 𝐴 ⋆ ( 𝐱 , 𝐲 − ) ) .

In our current setting, we assume there is an oracle preference model for the preference pair labeling.
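The shift-invariance behind this equivalence is easy to verify numerically: subtracting the same prompt-level value $V^\star(\mathbf{x})$ from both rewards does not change the softmax probability. The reward values below are arbitrary illustrative numbers.

```python
import math

def bt_preference(a, b):
    """Bradley-Terry probability that the first item is preferred."""
    return math.exp(a) / (math.exp(a) + math.exp(b))

# Arbitrary illustrative values for r(x, y+), r(x, y-), and V*(x).
r_pos, r_neg, v_star = 1.7, 0.4, 0.9

p_reward = bt_preference(r_pos, r_neg)                       # reward-based
p_advantage = bt_preference(r_pos - v_star, r_neg - v_star)  # advantage-based

assert abs(p_reward - p_advantage) < 1e-12
```

This is exactly why, in the bandit setup of RLHF, the reward-based and the advantage/regret-based preference models induce the same labels.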

KL-regularized regret.

In the RLHF setting with a fixed prompt distribution, the objective is:

$$\max_{\pi_{\boldsymbol{\theta}}}\; \mathbb{E}_{\mathbf{x}\sim\pi_{\boldsymbol{\phi}}(\cdot),\,\mathbf{y}\sim\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big] \;-\; \mathbb{E}_{\mathbf{x}\sim\mathcal{D}}\Big[\beta\cdot\mathbb{D}_{\mathsf{KL}}\big[\pi_{\boldsymbol{\theta}}(\mathbf{y}\mid\mathbf{x})\,\|\,\pi_{\mathsf{base}}(\mathbf{y}\mid\mathbf{x})\big]\Big].$$

The optimal policy of the above KL-constrained objective is:

$$\pi_{\boldsymbol{\theta}}^{\star}(\mathbf{y}\mid\mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})}\,\pi_{\mathsf{base}}(\mathbf{y}\mid\mathbf{x})\,\exp\!\Big(\tfrac{1}{\beta}\,r(\mathbf{x},\mathbf{y})\Big), \tag{10}$$

where the partition function is defined as $Z(\mathbf{x}) \triangleq \sum_{\mathbf{y}} \pi_{\mathsf{base}}(\mathbf{y}\mid\mathbf{x})\,\exp\big(\tfrac{1}{\beta}\,r(\mathbf{x},\mathbf{y})\big)$.
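For a small finite response set, this closed form is just a reward-tilted reweighting of the base policy. A minimal sketch, assuming an illustrative three-response candidate set:

```python
import math

def kl_optimal_policy(base_probs, rewards, beta):
    """pi*(y|x) proportional to pi_base(y|x) * exp(r(x,y)/beta), normalized by Z(x)."""
    tilted = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    z = sum(tilted)  # partition function Z(x)
    return [t / z for t in tilted]

base = [0.5, 0.3, 0.2]     # pi_base over three candidate responses (illustrative)
rewards = [1.0, 2.0, 0.0]  # r(x, y) for each candidate (illustrative)

sharp = kl_optimal_policy(base, rewards, beta=0.1)   # nearly greedy on reward
soft = kl_optimal_policy(base, rewards, beta=100.0)  # stays close to pi_base
```

As $\beta$ grows, the exponential tilt vanishes and $\pi_{\boldsymbol{\theta}}^{\star}$ approaches $\pi_{\mathsf{base}}$; as $\beta \to 0$, it concentrates on the highest-reward response.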

We can now formally define the regret with respect to $\pi_{\boldsymbol{\theta}}^{\star}(\cdot\mid\mathbf{x})$ as:

$$\mathrm{Regret}(\mathbf{x},\pi_{\boldsymbol{\theta}}) \;\triangleq\; \mathbb{E}_{\mathbf{y}\sim\pi_{\boldsymbol{\theta}}^{\star}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big] \;-\; \mathbb{E}_{\mathbf{y}\sim\pi_{\boldsymbol{\theta}}(\cdot\mid\mathbf{x})}\big[r(\mathbf{x},\mathbf{y})\big]. \tag{11}$$

Appendix J Examples on Prompts and Model Generations

Figure 12: The initial prompt distribution of UltraFeedback by bart-large-mnli with 0-shot classification, which is imbalanced. For the reward distribution, the category with the lowest average reward has the highest reward gap (i.e., the default informativeness proxy), implying the potential to improve.

Table 17: Baseline examples of evolved prompts from AlpacaEval.

initial prompt →

Write me the code for a distributed transaction manager.\nThink step by step and use pseudo code first.\nThen, define interfaces for all involved actors and entities.\nUse Rational Unified approach for this part.\n\nOnly then move on to the actual implementation, class-by-class, and method-by-method.\nMake the code be implemented in C# and follow SOLID principles.

evolved #1 →

Craft a suite of syntax for a distributed transaction coordinator. Start with a sequential breakdown in pseudocode format. Following that, establish the protocols for communication and interaction amongst the various participants and components, incorporating the Rational Unified Process methodology.\n\nProceed thereafter to the concrete creation, detailing each class and function. Ensure that the final C# code adheres to the precepts of SOLID and is annotated for clarification and maintainability purposes.

evolved #2 →

Devise a comprehensive set of directives and structures for a distributed transaction coordinator architecture. Initiate by formulating a step-by-step algorithmic decomposition in pseudocode. Subsequently, delineate the frameworks for dialog and cooperation between the numerous entities and elements, utilizing the tenets of the Rational Unified Process methodology.\n\nContinue to the actual generation of the code, meticulously detailing every class and corresponding method. Guarantee that the culminating C# code is in strict compliance with SOLID principles and is supplemented with descriptive commentary to enhance future clarity and upkeep, while also validating the code against a set of unit tests to ensure robust functionality.

evolved #3 →

Commence by constructing an algorithm in pseudocode that meticulously breaks down the sequential stages for a distributed transaction coordinator system. Progress by outlining the communication protocols and collaborative mechanisms among the myriad components, adhering to the Rational Unified Process approach.\n\nAdvance to crafting the code in C#, ensuring each class and method is elaborated with precision, aligning with SOLID design principles. Integrate thorough comments for heightened maintainability and conduct comprehensive unit testing for assured operational integrity.

evolved #4 →

Design a detailed flowchart that represents the intricate steps of an orchestration engine for managing microservices in a cloud-native environment. Begin by detailing the service discovery, load balancing, and fault tolerance strategies, while incorporating the DevOps philosophy.\n\nProceed by developing the implementation in Java, focusing on clean, modular code following object-oriented programming best practices. Accompany your code with clear documentation for easy understanding and perform rigorous integration testing to guarantee robustness.

1st-turn prompt: When a number is divided by 10, the remainder is 4. What is the remainder when twice the number is divided by 4?

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

Here’s how to solve this problem:

  1. Understand the first statement:
  • "When a number is divided by 10, the remainder is 4" means the number can be represented as 10*k + 4, where k is an integer.
  2. Double the number:
  • Twice the number is 2*(10k + 4) = 20k + 8
  3. Divide by 4:
  • (20k + 8) divided by 4 is (20k)/4 + 8/4 = 5*k + 2
  4. Find the remainder:
  • Since 5*k is a multiple of 4, the remainder when twice the number is divided by 4 is 2. Let me know if you’d like to try another example!

Here’s how to solve this problem:

  1. Understand the first statement:
  • "When a number is divided by 10, the remainder is 4" means the number can be represented as 10*k + 4, where k is an integer.
  2. Double the number:
  • Twice the number is 2*(10k + 4) = 20k + 8
  3. Divide by 4:
  • When (20k + 8) is divided by 4, the 20k term is perfectly divisible by 4.
  • The remainder will be the same as the remainder when 8 is divided by 4, which is 0. Therefore, the remainder when twice the number is divided by 4 is 0.

2nd-turn prompt: What about when twice the number is divided by 5?

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

Here’s how to solve that:

  1. We know: Twice the number is 20*k + 8 (from the previous calculation).
  2. Divide by 5: (20k + 8) divided by 5 is (20k)/5 + 8/5 = 4*k + 1.6
  3. Find the remainder: Since 4*k is a multiple of 5, the remainder when twice the number is divided by 5 is 1. Let me know if you have any more number theory puzzles!

Here’s how to solve that:

  1. Twice the number: We know twice the number is 20*k + 8 (from the previous steps).
  2. Divide by 5:    * 20*k is divisible by 5.    * 8 divided by 5 leaves a remainder of 3. Therefore, the remainder when twice the number is divided by 5 is 3.

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s response is incorrect. The assistant incorrectly calculated the remainder when twice the number is divided by 5. The assistant calculated the decimal part of the division operation, not the remainder. The correct calculation should be (20n + 8) % 5 = (20n % 5) + (8 % 5) = 0 + 3 = 3. Therefore, the remainder when twice the number is divided by 5 is 3, not 1 as the assistant stated.

The corrected response would be:

  1. We know: Twice the number is 20*k + 8 (from the previous calculation).
  2. Divide by 5: (20k + 8) divided by 5 gives us a remainder of (20k % 5) + (8 % 5) = 0 + 3 = 3.
  3. Find the remainder: The remainder when twice the number is divided by 5 is 3.

Rating: [[3]], "score": 3, "turn": 2, "tstamp": 1722737011.7768948

The assistant’s answer is correct and matches the reference answer. The assistant correctly explains that twice the number is 20*k + 8 and that when this is divided by 5, the remainder is 3. The assistant’s explanation is clear and easy to understand, making it helpful for the user. There are no mistakes to correct.

Rating: [[10]], "score": 10, "turn": 2, "tstamp": 1722846518.5543818

Table 18: Comparison of model generations on examples from MT-Bench.

1st-turn prompt: Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 words.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

Smartphone Showdown: [Model A] vs. [Model B]\n\nI. Introduction\n* Briefly introduce the two smartphone models being compared, highlighting their popularity and target audience.\n* State the purpose of the comparison: to help readers decide which phone best suits their needs.\n\nII. Design and Display\n* Physical dimensions, weight, and build quality comparison.\n* Screen size, resolution, and display technology (e.g., OLED, LCD).\n* Design aesthetics and user preferences.\n\nIII. Performance and Specs\n* Processor, RAM, and storage comparison.\n* Benchmark scores and real-world performance analysis.\n* Battery life and charging capabilities.\n\nIV. Camera Capabilities\n* Rear and front camera specs (megapixels, aperture, features).\n* Image quality comparison in various lighting conditions.\n* Video recording capabilities.\n\nV. Software and User Experience\n* Operating system and user interface comparison.\n* App ecosystem and availability.\n* Software features and customization options.\n\nVI. Price and Value\n* Compare the price points of both models.\n* Analyze the value proposition based on features and performance.\n\nVII. Conclusion\n* Summarize the key differences and similarities.\n* Recommend the best phone for specific user profiles (e.g., budget-conscious, photography enthusiasts, gamers).

iPhone 14 Pro vs. Samsung Galaxy S23: Which Flagship Reigns Supreme?\n\nI. Introduction\n* Briefly introduce both smartphones and their position in the market.\n* State the purpose of the comparison: to help readers decide which phone best suits their needs.\n\nII. Design and Display\n* Physical dimensions, weight, and build quality comparison.\n* Screen size, resolution, refresh rate, and display technology (OLED vs. Super AMOLED).\n* Design aesthetics and unique features (e.g., notch vs. punch-hole, color options).\n\nIII. Performance and Camera\n* Processor comparison (Apple A16 Bionic vs. Snapdragon 8 Gen 2).\n* RAM and storage options.\n* Camera specs: megapixels, lens types, aperture, and special features (e.g., telephoto, macro).\n* Image quality comparison based on sample photos and user reviews.\n\nIV. Software and User Experience\n* Operating systems: iOS 16 vs. Android 13.\n* User interface comparison: intuitiveness, customization options, and app ecosystems.\n* Special features: Face ID vs. in-display fingerprint sensor, voice assistants.\n\nV. Battery Life and Charging\n* Battery capacity and real-world usage comparison.\n* Fast charging speeds and wireless charging capabilities.\n\nVI. Price and Value\n* Compare starting prices and different storage configurations.\n* Assess the overall value for money based on features and performance.\n\nVII. Conclusion\n* Summarize the key differences and strengths of each phone.\n* Recommend the best phone for specific user types (e.g., photography enthusiasts, gamers, budget-conscious consumers).

2nd-turn prompt: Take your previous response and rephrase it as a limerick.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

Two phones, a showdown so grand,\n[Model A] and [Model B], hand in hand.\nDesign, specs, and price,\nWe’ll analyze nice,\nTo help you choose, understand?\n\nLet me know if you’d like me to try filling in the bracketed parts with specific model names!

\nTwo phones, a Pro and a Galaxy,\nFor flagships, they truly are spacy.\nWith cameras and screens,\nAnd processors keen,\nWhich one wins? Read on, and be savvy! \nLet me know if you’d like another limerick!

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s response is creative and relevant to the user’s request. The user asked the assistant to rephrase the previous response, which was a detailed outline for a blog post, into a limerick. The assistant successfully did so, maintaining the essence of the original response in a limerick format. The assistant also offered to further customize the limerick by filling in the bracketed parts with specific model names, showing a willingness to provide additional detail if needed. However, the limerick does not fully capture the depth and detail of the original response, which is understandable given the constraints of the limerick format.

Rating: [[8]], "score": 8, "turn": 2, "tstamp": 1722737024.0327556

The assistant’s response is creative and relevant to the user’s request. The assistant successfully rephrased the previous response, which was a detailed outline for a blog post comparing two smartphone models, into a limerick. The limerick is playful and engaging, and it maintains the essence of the original response, which is to compare the two phones and help the reader make an informed decision. The assistant also offers to create another limerick if the user wants, showing a willingness to continue the interaction.

Rating: [[9]], "score": 9, "turn": 2, "tstamp": 1722846542.9437995

Table 19: (Continued) comparison of model generations on examples from MT-Bench.

1st-turn prompt: Each problem consists of three statements. Based on the first two statements, the third statement may be true, false, or uncertain.\n1. Oranges cost more than apples.\n2. Oranges cost less than bananas.\n3. Bananas cost more than apples and bananas cost more than orange.\nIf the first two statements are true, then the third statement is?

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

Here’s how to break down the logic:\n\n* Statement 1: Oranges > Apples (Oranges cost more than apples)\n* Statement 2: Oranges < Bananas (Oranges cost less than bananas)\n\nThis means the price order is: Bananas > Oranges > Apples\n\n* Statement 3: Bananas > Apples AND Bananas > Oranges\n\nStatement 3 is partially true.\n\n* The part "Bananas > Apples" is true based on our price order.\n* The part "Bananas > Oranges" is also true based on our price order.\n\n\nLet me know if you’d like to try another one!

Here’s how to break down the logic:\n\n* Statement 1: Oranges > Apples (Oranges cost more than apples)\n* Statement 2: Oranges < Bananas (Oranges cost less than bananas)\n\nThese statements tell us that bananas are the most expensive, then oranges, and apples are the least expensive.\n\n* **Statement 3:** Bananas > Apples AND Bananas > Oranges\n\nSince bananas are indeed more expensive than both apples and oranges, statement 3 is true.\n\n\nLet me know if you’d like to try another one!

2nd-turn prompt: If the third statement is true. Is the first statement true, false, or uncertain? Please explain.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

If the third statement is true, the first statement is false.\n\nHere’s why:\n\n* Statement 3: States that bananas cost more than both apples AND oranges.\n\nThis means the price order is: Bananas > Oranges > Apples.\n\n* Statement 1: States that oranges cost more than apples.\n\nIf bananas are the most expensive, and oranges are second, then oranges cannot be more expensive than apples.\n\n\nLet me know if you have any other logic puzzles!

If the third statement is true, then the first statement is uncertain.\n\nHere’s why:\n\n* Statement 3: Bananas > Apples AND Bananas > Oranges\n\nThis tells us bananas are the most expensive, but it doesn’t explicitly state the relative prices of apples and oranges.\n\n* Statement 1: Oranges > Apples\n\nThis statement claims oranges are more expensive than apples.\n\nStatement 3 doesn’t contradict statement 1, but it also doesn’t confirm it. It’s possible that oranges are more expensive than apples, but statement 3 alone doesn’t give us enough information to be sure.

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s response is incorrect. The assistant states that if the third statement is true, the first statement is false. However, the third statement only establishes that bananas cost more than both apples and oranges. It does not provide information about the relative costs of apples and oranges. Therefore, the first statement could be true (if oranges cost more than apples), false (if oranges cost less than apples), or uncertain (if oranges and apples cost the same). The assistant’s explanation is flawed because it incorrectly assumes that the order of costs must be Bananas > Oranges > Apples, when in fact the order could also be Bananas > Apples = Oranges or Bananas > Apples > Oranges. \n\nCorrected Assistant’s Answer: If the third statement is true, then bananas cost more than apples and bananas cost more than oranges. This means that oranges could still cost more than apples, making the first statement true. However, it’s also possible that oranges and apples have the same cost, which would make the first statement false. Therefore, based on the third statement alone, the first statement is uncertain.

Rating: [[2]], "score": 2, "turn": 2, "tstamp": 1722737017.2808955

The assistant’s response is correct and helpful. The assistant explains that the first statement is uncertain if the third statement is true, which aligns with the reference answer. The assistant provides a clear explanation of why this is the case, stating that the third statement doesn’t contradict the first statement, but it also doesn’t confirm it. Therefore, the assistant’s response is accurate and informative.\n\nThe assistant’s response: "If the third statement is true, then the first statement is uncertain. Here’s why: Statement 3: Bananas > Apples AND Bananas > Oranges. This tells us bananas are the most expensive, but it doesn’t explicitly state the relative prices of apples and oranges. Statement 1: Oranges > Apples. This statement claims oranges are more expensive than apples. Statement 3 doesn’t contradict statement 1, but it also doesn’t confirm it. It’s possible that oranges are more expensive than apples, but statement 3 alone doesn’t give us enough information to be sure.

Rating: [[10]], "score": 10, "turn": 2, "tstamp": 1722846527.2339694

Table 20: (Continued) comparison of model generations on examples from MT-Bench.

1st-turn prompt: Implement a function to find the median of two sorted arrays of different sizes with O(1) space complexity and O(n) time complexity.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

python\n\def find_median_sorted_arrays(nums1, nums2):\n   """\n   Finds the median of two sorted arrays of different sizes.\n\n   Args:\n   nums1: The first sorted array.\n   nums2: The second sorted array.\n\n   Returns:\n   The median of the two sorted arrays.\n   """\n\n   m, n = len(nums1), len(nums2)\n   if m > n:\n   nums1, nums2, m, n = nums2, nums1, n, m  # Ensure m <= n\n\n   imin, imax, half_len = 0, m, (m + n + 1) // 2\n\n   while imin <= imax:\n   i = (imin + imax) // 2\n   j = half_len - i\n   if i < m and nums2[j - 1] > nums1[i]:\n   imin = i + 1  # i is too small, increase it\n   elif i > 0 and nums1[i - 1] > nums2[j]:\n   imax = i - 1  # i is too big, decrease it\n   else:\n   # i is perfect\n\n   if i == 0:\n   max_left = nums2[j - 1]\n   elif j == 0:\n   max_left = nums1[i - 1]\n   else:\n   max_left = max(nums1[i - 1], nums2[j - 1])\n\n   if (m + n) % 2 == 1:\n   return max_left\n\n   if i == m:\n   min_right = nums2[j]\n   elif j == n:\n   min_right = nums1[i]\n   else:\n   min_right = min(nums1[i], nums2[j])\n\n   return (max_left + min_right) / 2.0\n\n# Example usage\nnums1 = [1, 3]\nnums2 = [2]\nmedian = find_median_sorted_arrays(nums1, nums2)\nprint(f"The median of {nums1} and {nums2} is: {median}")\n\n\n**Explanation:**\n\n1. **Handle Unequal Lengths:**\n   - If ‘nums1‘ is longer than ‘nums2‘, swap them to ensure ‘nums1‘ is always the shorter array. This simplifies the logic.\n\n2. **Binary Search:**\n   - The core idea is to use binary search on the shorter array (‘nums1‘).\n   - ‘imin‘ and ‘imax‘ define the search range for the partition point ‘i‘ in ‘nums1‘.\n   - ‘half_len‘ calculates the desired total number of elements on the left side of the partition (to find the median).\n\n3. **Partitioning:**\n   - In each iteration, ‘i‘ divides ‘nums1‘ into two parts: ‘nums1[0:i]‘ and ‘nums1[i:m]‘.\n   - ‘j‘ is calculated to divide ‘nums2‘ so that the total number of elements on the left side is ‘half_len‘.\n\n4. 
Adjusting the Partition:\n   - We compare elements at the boundaries of the partitions to see if ‘i‘ is in the correct position:\n   - If ‘nums2[j - 1] > nums1[i]‘, ‘i‘ is too small, so we move ‘imin‘ to ‘i + 1‘.\n   - If ‘nums1[i - 1] > nums2[j]‘, ‘i‘ is too large, so we move ‘imax‘ to ‘i - 1‘.\n\n5. Finding the Median:\n   - Once ‘i‘ is in the correct position, we find:\n   - ‘max_left‘: The maximum element in the left partitions of both arrays.\n   - ‘min_right‘: The minimum element in the right partitions of both arrays.\n   - If the total length is odd, ‘max_left‘ is the median.\n   - If the total length is even, the median is the average of ‘max_left‘ and ‘min_right‘.\n\nTime Complexity: O(log(min(m, n))) because we perform binary search on the shorter array.\n\nSpace Complexity: O(1) as we only use a constant amount of extra space for variables.

python\n\def find_median_sorted_arrays(nums1, nums2):\n   """\n   Finds the median of two sorted arrays of different sizes.\n\n   Args:\n  nums1: The first sorted array.\n   nums2: The second sorted array.\n\n   Returns:\n  The median of the two sorted arrays.\n   """\n\n   m, n = len(nums1), len(nums2)\n   if m > n:\n   nums1, nums2, m, n = nums2, nums1, n, m  # Ensure m <= n\n\n   imin, imax, half_len = 0, m, (m + n + 1) // 2\n\n   while imin <= imax:\n   i = (imin + imax) // 2\n   j = half_len - i\n  if i < m and nums2[j - 1] > nums1[i]:\n  imin = i + 1  # i is too small, increase it\n   elif i > 0 and nums1[i - 1] > nums2[j]:\n  imax = i - 1  # i is too big, decrease it\n   else:\n   # i is perfect\n\n if i == 0:\n  max_left = nums2[j - 1]\n   elif j == 0:\n   max_left = nums1[i - 1]\n   else:\n   max_left = max(nums1[i - 1], nums2[j - 1])\n\n   if (m + n) % 2 == 1:\n  return max_left\n\n  if i == m:\n   min_right = nums2[j]\n  elif j == n:\n  min_right = nums1[i]\n  else:\n   min_right = min(nums1[i], nums2[j])\n\n   return (max_left + min_right) / 2.0\n\n# Example usage\nnums1 = [1, 3]\nnums2 = [2]\nmedian = find_median_sorted_arrays(nums1, nums2)\nprint(f"The median of {nums1} and {nums2} is: {median}")\n\n\n**Explanation:**\n\n1. **Handle Unequal Lengths:**\n   - If ‘nums1‘ is longer than ‘nums2‘, swap them to ensure ‘nums1‘ is always the shorter array. This simplifies the logic.\n\n2. **Binary Search:**\n   - The core idea is to use binary search on the shorter array (‘nums1‘).\n   - ‘imin‘ and ‘imax‘ define the search range for the partition point ‘i‘ in ‘nums1‘.\n   - ‘half_len‘ calculates the desired total number of elements on the left side of the partition (to find the median).\n\n3. **Partitioning:**\n   - In each iteration, ‘i‘ divides ‘nums1‘, and ‘j‘ divides ‘nums2‘ such that the total number of elements on the left side is ‘half_len‘.\n\n4. 
Adjusting the Search Range:\n   - If ‘nums2[j - 1] > nums1[i]‘, it means ‘i‘ is too small, so we need to move the partition point in ‘nums1‘ to the right (‘imin = i + 1‘).\n   - If ‘nums1[i - 1] > nums2[j]‘, ‘i‘ is too large, so we move it to the left (‘imax = i - 1‘).\n\n5. Finding the Median:\n   - When the correct partition is found (‘i‘ is in the right position), we find:\n   - ‘max_left‘: The maximum element on the left side of the partition.\n   - ‘min_right‘: The minimum element on the right side of the partition.\n   - If the total number of elements is odd, ‘max_left‘ is the median.\n   - If even, the median is the average of ‘max_left‘ and ‘min_right‘.\n\nTime Complexity: O(log(min(m, n))) because we are performing binary search on the shorter array.\n\nSpace Complexity: O(1) as we are only using a constant amount of extra space for variables.

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s answer is correct and provides a detailed explanation of the code. The assistant correctly states that the time complexity is O(log(min(m, n))) and the space complexity is O(1), which aligns with the reference answer. The assistant also provides an example usage of the function, which is helpful for understanding how to use it. The assistant’s code is a different implementation than the reference answer, but it achieves the same result and meets the same complexity requirements. However, the assistant’s answer does not address the user’s original question about achieving O(n) time complexity. The assistant should have clarified that it’s not possible to find the median of two sorted arrays of different sizes with O(1) space complexity and O(n) time complexity, as stated in the answer.

Rating: [[8]], "score": 8, "turn": 1, "tstamp": 1722737514.2011044

The assistant’s answer is correct and provides a detailed explanation of the code. The assistant correctly states that the time complexity is O(log(min(m, n))) and the space complexity is O(1), which aligns with the reference answer. The assistant also provides a step-by-step explanation of the code, which is very helpful for understanding the logic behind the solution. The assistant’s code is also correct and will correctly find the median of two sorted arrays of different sizes. Therefore, there are no corrections needed for the assistant’s response.

Rating: [[10]], "score": 10, "turn": 1, "tstamp": 1722846533.772981

Table 21: (Continued) comparison of model generations on examples from MT-Bench.

1st-turn prompt: You are given two sorted lists of size m and n. Implement a function to find the kth smallest element in the union of the two lists with linear complexity.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

‘‘‘python\n\def find_kth_smallest(list1, list2, k):\n   """\n   Finds the kth smallest element in the union of two sorted lists.\n\n   Args:\n    list1: The first sorted list.\n    list2: The second sorted list.\n    k: The desired kth smallest element.\n\n   Returns:\n    The kth smallest element in the union of the two lists.\n   """\n\n   m, n = len(list1), len(list2)\n   i, j = 0, 0\n   min_heap = []  \n\n # Build a min-heap with the first elements of both lists\n   for x in range(min(m, n)):\n    heapq.heappush(min_heap, (list1[x], x, 0)) # (value, list_index, position_in_list)\n    heapq.heappush(min_heap, (list2[x], x, 1))\n\n # Extract k smallest elements from the heap\n   for _ in range(k - 1):\n    _, _, _ = heapq.heappop(min_heap)\n\n # The top element of the heap is the kth smallest\n   return heapq.heappop(min_heap)[0]\n\n# Example usage:\nlist1 = [2, 5, 8, 12]\nlist2 = [1, 3, 7, 9, 10]\nk = 5\nkth_smallest = find_kth_smallest(list1, list2, k)\nprint(f"The {k}th smallest element is: {kth_smallest}")\n\n\n**Explanation:**\n\n1. **Initialization:**\n   - ‘m‘ and ‘n‘ store the lengths of the input lists.\n   - ‘i‘ and ‘j‘ are pointers to track the current position in each list.\n   - ‘min_heap‘ is a min-heap used to efficiently store and retrieve the smallest elements encountered so far.\n\n2. **Building the Min-Heap:**\n   - The code iterates through the first ‘min(m, n)‘ elements of both lists.\n   - For each element, a tuple ‘(value, list_index, position_in_list)‘ is pushed onto the ‘min_heap‘. This tuple helps us keep track of the element’s value, the list it belongs to, and its original position.\n\n3. **Extracting k-1 Smallest Elements:**\n   - The ‘for‘ loop runs ‘k - 1‘ times, effectively extracting the ‘k-1‘ smallest elements from the ‘min_heap‘.\n\n4. 
**Finding the kth Smallest:\n   - After the loop, the top element of the ‘min_heap‘ contains the ‘k‘th smallest element.\n   - ‘heapq.heappop(min_heap)‘ removes and returns this element.\n\nTime Complexity:**\n\n- Building the min-heap: O(m + n)\n- Extracting k-1 elements: O(k * log(k))\n- Overall: O(m + n + k * log(k)), which is approximately O(m + n) since ‘k‘ is usually much smaller than ‘m + n‘.\n\n\n\nLet me know if you have any other questions!

‘‘‘python\n\def find_kth_smallest(list1, list2, k):\n """\n Finds the kth smallest element in the union of two sorted lists.\n\n Args:\n   list1: The first sorted list.\n   list2: The second sorted list.\n   k: The desired rank of the smallest element.\n\n Returns:\n   The kth smallest element in the union of the lists.\n """\n\n i = 0  # Index for list1\n j = 0 # Index for list2\n merged = []  # Store the merged elements\n\n while i < len(list1) and j < len(list2):\n   if list1[i] <= list2[j]:\n   merged.append(list1[i])\n   i += 1\n   else:\n   merged.append(list2[j])\n   j += 1\n\n # Add any remaining elements from list1\n while i < len(list1):\n   merged.append(list1[i])\n   i += 1\n\n # Add any remaining elements from list2\n while j < len(list2):\n   merged.append(list2[j])\n   j += 1\n\n return merged[k - 1]  # Return the kth element (index k-1)\n\n# Example usage:\nlist1 = [2, 5, 8, 12]\nlist2 = [1, 3, 7, 9, 10]\nk = 5\n\nkth_smallest = find_kth_smallest(list1, list2, k)\nprint(f"The {k}th smallest element is: {kth_smallest}")\n\n\n**Explanation:**\n\n1. **Initialization:**\n - ‘i‘ and ‘j‘ are index pointers for ‘list1‘ and ‘list2‘ respectively, starting at 0.\n - ‘merged‘ is an empty list to store the merged sorted elements.\n\n2. **Merging:**\n - The ‘while‘ loop continues as long as both ‘i‘ and ‘j‘ are within the bounds of their respective lists.\n - In each iteration, it compares the elements at ‘list1[i]‘ and ‘list2[j]‘.\n - The smaller element is appended to ‘merged‘, and the corresponding index (‘i‘ or ‘j‘) is incremented.\n\n3. **Adding Remaining Elements:**\n - After one list is exhausted, the remaining elements from the other list are directly appended to ‘merged‘.\n\n4. 
**Returning the kth Element:\n - Finally, ‘merged[k - 1]‘ is returned, which represents the kth smallest element in the union (remember that list indices start from 0).\n\nLinear Complexity:**\n\n- The algorithm iterates through each element in both lists at most once during the merging process.\n- The time complexity is therefore O(m + n), which is linear in the total number of elements (m + n), making it efficient for large lists.

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s answer is incorrect. The user asked for a solution with linear complexity, but the assistant provided a solution with a time complexity of O(m + n + k * log(k)), which is not linear. The assistant’s solution uses a min-heap, which is not necessary and adds unnecessary complexity to the solution.

Rating: [[2]], "score": 2, "turn": 1, "tstamp": 1722737030.6411152

The assistant’s answer is correct and helpful. It provides a Python function that implements the two-pointer approach to find the kth smallest element in the union of two sorted lists. The assistant’s function is slightly different from the reference, but it achieves the same goal. The assistant’s function merges the two lists into a new list and then returns the kth element from the merged list. The assistant also provides a detailed explanation of how the function works and confirms that the time complexity is linear, which is what the user asked for. However, the assistant’s function uses extra space to store the merged list, which is not necessary and not as efficient as the reference.

Rating: [[8]], "score": 8, "turn": 1, "tstamp": 1722846543.867601

Table 22: (Continued) comparison of model generations on examples from MT-Bench.

2nd-turn prompt: Does there exist an algorithm with better time complexity? If so, implement it.

Answer from gemma-2-9b-it-dpo

Answer from gemma-2-9b-it-dpo-eva

You are absolutely right! There's a more efficient algorithm with a time complexity of O(m + n) without using a heap.

Here's the implementation:

```python
def find_kth_smallest_optimized(list1, list2, k):
    """
    Finds the kth smallest element in the union of two sorted lists
    with O(m + n) time complexity.

    Args:
        list1: The first sorted list.
        list2: The second sorted list.
        k: The desired kth smallest element.

    Returns:
        The kth smallest element in the union of the two lists.
    """
    m, n = len(list1), len(list2)
    i, j = 0, 0
    result = []

    # Merge the lists while keeping track of the kth smallest
    while len(result) < k:
        if i < m and j < n:
            if list1[i] <= list2[j]:
                result.append(list1[i])
                i += 1
            else:
                result.append(list2[j])
                j += 1
        elif i < m:
            result.append(list1[i])
            i += 1
        else:
            result.append(list2[j])
            j += 1

    return result[k - 1]

# Example usage:
list1 = [2, 5, 8, 12]
list2 = [1, 3, 7, 9, 10]
k = 5
kth_smallest = find_kth_smallest_optimized(list1, list2, k)
print(f"The {k}th smallest element is: {kth_smallest}")
```

**Explanation:**

1. **Initialization:**
   - `m` and `n` store the lengths of the lists.
   - `i` and `j` are pointers for `list1` and `list2`, respectively.
   - `result` is a list to store the merged and sorted elements.

2. **Merging and Tracking:**
   - The `while` loop continues until `k` elements are in the `result` list.
   - In each iteration, it compares the current elements from `list1` and `list2`.
   - The smaller element is appended to `result`, and the corresponding pointer (`i` or `j`) is incremented.
   - If one list is exhausted, the remaining elements from the other list are appended to `result`.

3. **Returning the kth Smallest:**
   - Finally, `result[k - 1]` contains the `k`th smallest element.

**Time Complexity:**

- The algorithm iterates through both lists once, resulting in a time complexity of O(m + n).

Let me know if you have any more questions!

You are right to ask! While the previous algorithm has linear time complexity, we can achieve an even better solution with O(log(min(m, n))) time complexity using a slightly more sophisticated approach.

The key idea is to leverage the sorted nature of the lists and use a binary search strategy.

Here's the implementation:

```python
def find_kth_smallest_optimized(list1, list2, k):
    """
    Finds the kth smallest element in the union of two sorted lists
    with O(log(min(m, n))) time complexity.

    Args:
        list1: The first sorted list.
        list2: The second sorted list.
        k: The desired rank of the smallest element.

    Returns:
        The kth smallest element in the union of the lists.
    """
    m, n = len(list1), len(list2)
    if m > n:
        list1, list2, m, n = list2, list1, n, m  # Ensure list1 is the shorter one

    left, right = 0, m
    while left <= right:
        partitionX = (left + right) // 2
        partitionY = k - partitionX

        maxLeftX = list1[partitionX - 1] if partitionX > 0 else float('-inf')
        minRightX = list1[partitionX] if partitionX < m else float('inf')

        maxLeftY = list2[partitionY - 1] if partitionY > 0 else float('-inf')
        minRightY = list2[partitionY] if partitionY < n else float('inf')

        if maxLeftX <= minRightY and maxLeftY <= minRightX:
            if partitionX == 0:
                return list2[partitionY - 1]
            elif partitionY == 0:
                return list1[partitionX - 1]
            else:
                return max(list1[partitionX - 1], list2[partitionY - 1])
        elif maxLeftX > minRightY:
            right = partitionX - 1  # Too many elements on the right side of list1
        else:
            left = partitionX + 1  # Too many elements on the right side of list2
```

**Explanation:**

1. **Shorten Lists:** If `list1` is longer than `list2`, swap them to ensure `list1` is always the shorter one. This simplifies the logic.

2. **Binary Search:**
   - `left` and `right` pointers define the search space within `list1`.
   - `partitionX` is the middle point in `list1`, and `partitionY` is calculated to divide the remaining elements in `list2` such that `partitionX + partitionY = k`.
   - `maxLeftX`, `minRightX`, `maxLeftY`, and `minRightY` represent the largest element to the left and smallest element to the right of the partitions in both lists.

3. **Comparison and Adjustment:**
   - If `maxLeftX` is less than or equal to `minRightY` and `maxLeftY` is less than or equal to `minRightX`, it means the kth smallest element lies within the current partitions. We return the larger of `maxLeftX` and `maxLeftY`.
   - If `maxLeftX` is greater than `minRightY`, it indicates that we need to search in the left subarray of `list1` (i.e., decrease `right`).
   - Otherwise, we search in the right subarray of `list1` (i.e., increase `left`).

4. **Termination:** The binary search continues until `left` crosses `right`, indicating the correct partition has been found.

**Time Complexity:**

- The binary search divides the search space in half at each step.
- Therefore, the time complexity is O(log(min(m, n))), which is significantly faster than linear time for large lists where one list is much smaller than the other.

Judgment (by gpt-4)

Judgment for gemma-2-9b-it-dpo

Judgment for gemma-2-9b-it-dpo-eva

The assistant’s first response to the user’s question is incorrect. The user asked for a function to find the kth smallest element in the union of two sorted lists with linear complexity. The assistant provided a function that uses a min-heap, which has a time complexity of O(m + n + k * log(k)). This is not linear complexity, as the user requested. The assistant’s second response to the user’s question is also incorrect. The user asked if there exists an algorithm with better time complexity than linear. The assistant provided a function that has a time complexity of O(m + n), which is linear, not better than linear. The assistant should have provided a function that uses a binary search approach, which has a time complexity of O(log(m) + log(n)), as the reference answer correctly does.

Rating: [[2]], "score": 2, "turn": 2, "tstamp": 1722737031.5033472

The assistant’s response is correct and helpful. The assistant provided a Python function that uses a binary search approach to find the kth smallest element in the union of two sorted lists. The assistant also provided a detailed explanation of how the function works, including the time complexity of the function. The assistant’s answer matches the reference answer in terms of the algorithm used and the explanation provided. The assistant also correctly identified that the binary search approach has a better time complexity than the linear approach. However, the assistant’s code has a minor issue. The assistant’s code does not handle the case when ‘k‘ is greater than the sum of the lengths of the two lists. In such a case, the code will raise an ‘IndexError‘. This can be fixed by adding a check at the beginning of the function to return ‘None‘ if ‘k‘ is greater than the sum of the lengths of the two lists.

Rating: [[9]], "score": 9, "turn": 2, "tstamp": 1722846556.6828268

Table 23: (Continued) comparison of model generations on examples from MT-Bench.
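The GPT-4 judgment above notes that the eva model's binary-search answer can raise an `IndexError` when `k` exceeds the combined list length. As an editorial aside (not part of either model's generation), the sketch below shows the same partition-based binary search with the guard the judge suggests, plus partition bounds clamped so `partitionY` always stays within `list2`; the function name `kth_smallest_guarded` is our own.

```python
def kth_smallest_guarded(list1, list2, k):
    """Binary-search kth smallest in the union of two sorted lists.

    Returns None when k is out of range (the judge's suggested fix),
    instead of raising IndexError.
    """
    m, n = len(list1), len(list2)
    if k < 1 or k > m + n:
        return None  # guard: k must index into the merged union
    if m > n:
        list1, list2, m, n = list2, list1, n, m  # binary-search the shorter list

    # Clamp the partition range so partitionY = k - partitionX stays in [0, n].
    left, right = max(0, k - n), min(k, m)
    while left <= right:
        px = (left + right) // 2
        py = k - px
        max_left_x = list1[px - 1] if px > 0 else float('-inf')
        min_right_x = list1[px] if px < m else float('inf')
        max_left_y = list2[py - 1] if py > 0 else float('-inf')
        min_right_y = list2[py] if py < n else float('inf')

        if max_left_x <= min_right_y and max_left_y <= min_right_x:
            # Valid partition: kth smallest is the larger left-side element.
            return max(max_left_x, max_left_y)
        elif max_left_x > min_right_y:
            right = px - 1  # shift partition in list1 leftward
        else:
            left = px + 1   # shift partition in list1 rightward

# Cross-check against a brute-force merge on the example from the table:
merged = sorted([2, 5, 8, 12] + [1, 3, 7, 9, 10])
assert all(kth_smallest_guarded([2, 5, 8, 12], [1, 3, 7, 9, 10], k) == merged[k - 1]
           for k in range(1, 10))
```

Using the `-inf`/`inf` sentinels also removes the special-casing of `partitionX == 0` and `partitionY == 0` present in the quoted answer, since `max()` over the sentinels selects the correct boundary element.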
