Title: DMoERM: Recipes of Mixture-of-Experts for Effective Reward Modeling

URL Source: https://arxiv.org/html/2403.01197


Abstract

The performance of the reward model (RM) is a critical factor in improving the effectiveness of the large language model (LLM) during alignment fine-tuning. There remain two challenges in RM training: 1) training the same RM using various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only 60% to 75%, causing training data to contain a lot of noise. To tackle these two challenges, we introduce the idea of Mixture-of-Experts (MoE) into the field of RM for the first time. We propose the Double-Layer MoE RM (DMoERM). The outer layer MoE is a sparse model. After classifying an input into task categories, we route it to the corresponding inner layer task-specific model. The inner layer MoE is a dense model. We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one. Their outputs are then synthesized by an MLP to compute the final reward. To minimize costs, we call a public LLM API to obtain the capability preference labels. Validation on manually labeled datasets confirms that our model attains superior consistency with human preference and outstrips advanced generative approaches. Meanwhile, through BoN sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art ensemble methods of RM and mitigates the overoptimization problem. Our code and dataset are available at: https://github.com/quanshr/DMoERM-v1.

1 Introduction

After an initial stage of pre-training and subsequent instruction fine-tuning, large language models (LLMs) undergo a crucial stage of high-quality alignment fine-tuning based on Reinforcement Learning with Human Feedback (RLHF) to improve their abilities (Ouyang et al., 2022; Stiennon et al., 2020). During the RLHF process, a reward model (RM) often needs to be trained, which acts as a proxy of human preferences and assigns scores to the outputs of the LLM. The scores are then used as reward signals to optimize the LLM through reinforcement learning (RL). In this process, the LLM and RM are interdependent and iteratively optimized, and the RM is expected to be highly consistent with human preferences. In addition, during the inference stage, the RM can also be augmented with Best-of-$n$ (BoN) sampling strategies to further enhance the quality of the outputs of the LLM (Ouyang et al., 2022; Nakano et al., 2021).

Training of reward models relies on data derived from human annotators who manually rank the varying outputs under a single input by their preferences. However, many studies have found that agreement rates among human annotators typically only range between 60-75% (Ziegler et al., 2019; Stiennon et al., 2020; Dubois et al., 2023), thereby introducing a minimum of 25% noise within the labeled dataset. One important reason for this phenomenon is the multifaceted nature of evaluation: it is often observed that one response may excel in one aspect while simultaneously falling short in another. This multifaceted evaluation conundrum has been exemplified in previous studies (Dai et al., 2023; Ganguli et al., 2022; Bai et al., 2022), which illustrate the inherent tensions between enhancing helpfulness and harmlessness. As these attributes can at times be inversely related, adjudicating between a response that is more helpful yet potentially less harmless poses a significant challenge for comparative assessment. We further validate this perspective through empirical studies.


Figure 1: The outer MoE routes inputs to corresponding task-specific inner MoE.

In this study, we pioneer the integration of the Mixture-of-Experts (MoE) framework (Jacobs et al., 1991; Lepikhin et al., 2021; Du et al., 2022) into reward modeling. Our approach employs a double-layer MoE architecture. The outer layer comprises a sparse MoE model specifically designed to avoid multi-task disturbance (Standley et al., 2020). As shown in Figure 1, we categorize inputs into several distinct tasks and use a pre-trained router to route the inputs to their corresponding task-specific expert. This strategy also facilitates distributed deployment and enhances the model's capacity and capabilities without a commensurate increase in computational demands (Rajbhandari et al., 2022; Xue et al., 2022).

Subsequently, within each inner layer lies a dense MoE model, which is tailored to the specific set of capabilities required for its category. For instance, in roleplay scenarios, we divide the task into six core capabilities, including personality and emotional investment, conversational sense, empathy ability, and so on (details are in Appendix Table 12). We obtain preference labels on these single capability points by calling a public API, which greatly reduces annotation costs and is sufficient to achieve satisfactory results in our experiments. Considering that capability points are equivalent to a decomposition of tasks in a low-dimensional space, low-rank adaptation (LoRA) (Hu et al., 2022) fine-tuning is very suitable. Each LoRA fine-tuned model effectively becomes an expert in scoring a singular capability point. Lastly, we aggregate the outputs from these expert models into a unified one-dimensional score with an MLP to determine the final reward value. We believe this methodology can improve the interpretability and performance of RMs, since it is analogous to Chain-of-Thought reasoning for RMs. Both preference consistency and optimization evaluations indicate that our model is more effective in optimizing LLMs and can mitigate the overoptimization problem relative to other state-of-the-art RM ensemble methods.

2 Related Work

2.1 Mixture-of-Experts

Mixture-of-Experts (MoE) was introduced early in machine learning (Jacobs et al., 1991; Jordan and Jacobs, 1994), where researchers control the allocation of different weights to different models through gate networks to mitigate interference between different types of samples. MoE can enhance a model's generalization capability by decomposing complex tasks into several subtasks, which helps avoid multi-task disturbance (Standley et al., 2020) and meanwhile confers greater flexibility in development (Ma et al., 2018). Recently, much ongoing research has focused on top-$k$ (e.g., top-1 or top-2 in many works) activation MoE models (Ramachandran and Le, 2019; Clark et al., 2022; Dai et al., 2022), since they can be leveraged to enlarge parameter count and enhance model capability while keeping computational complexity nearly unchanged for both training and inference due to their sparse activating nature (Shazeer et al., 2017; Fedus et al., 2022). While MoE has achieved great success in the field of large generative language models (Shen et al., 2023; OpenAI, 2023), how to efficiently train a more effective RM with an MoE architecture remains largely unexplored.

2.2 Reward Model Ensembling

Reward model ensembling has been tried in the field of safe RLHF (Dai et al., 2023). That research is based on a widely observed phenomenon: the pursuit of greater helpfulness and harmlessness may often conflict in practice (Ganguli et al., 2022; Bai et al., 2022). Another line of research ensembles RMs through multi-objective reward modeling (Ramé et al., 2023b, a) or weight-averaged reward modeling (Ramé et al., 2024), but these approaches struggle to formulate non-linear relationships. In addition, some works (Coste et al., 2024; Eisenstein et al., 2023; Zhai et al., 2024) have found that training multiple reward models and aggregating them by varying the data training order, batch size, and learning rate can alleviate the overoptimization problem (Gao et al., 2023) of RMs and increase their performance. However, the aggregation methods they chose were only 1) mean, 2) min, and 3) mean minus std, and the performance of aggregation depends heavily on the diversity of the ensembled models (Zhai et al., 2024), which requires many experimental attempts.

3 Empirical Study

3.1 Multi-Task Training

It has been frequently observed that using irrelevant training data to train LLMs causes their generalization performance to decrease on other tasks (Dong et al., 2023; Wen et al., 2023). This also applies to RMs. Dai et al. (2023) found that training RMs on harmlessness and helpfulness simultaneously leads to suboptimal results on both types of data. To further explore whether data of different categories interfere with each other, we selected the preference data for three tasks: roleplay, chitchat, and text creation. We train on different combinations of training sets and test on all test sets. The results are shown in Table 1.

Table 1: The results of training on different combinations of training sets and testing on all test sets. #A, #B, and #C represent roleplay, chitchat, and text creation, respectively. The best values are written in bold.

We find that using data from a single category gives the best results for that category, while adding data from other categories can degrade generalization on the original task.

3.2 Annotation Consistency

The consistency rate of manually labeled preference data is generally only 60-75% (Ziegler et al., 2019; Stiennon et al., 2020; Dubois et al., 2023), which brings a lot of noise to the training data. Inspired by the fact that Chain-of-Thought (CoT) can improve the accuracy of reasoning (Wei et al., 2022; Zhou et al., 2023), we try to improve the consistency rate through CoT. We first conduct experiments on humans. We select the text creation subtask and randomly divide 200 pairs of responses into two groups. For the first group, we let three annotators directly rank the preference and record the average agreement rate between any two annotators, which is represented as A. For the second group, we divide text creation into five capability points: 1) intent conformity, 2) expressiveness, 3) readability, 4) content richness, and 5) logic, and ask the annotators to first score the content on each single capability point, and then evaluate the overall preference. The average agreement rates on the five capability points are represented as B-1 to B-5, respectively, and the final overall agreement rate is B-f. We record these results in Figure 2, and a screenshot of the annotation interface is shown in Figure 8 in the Appendix.


Figure 2: The results of consistency study.

We find that the consistency on capability points is significantly higher than the consistency of directly evaluating the overall preference, and the method of evaluating capability points first can increase the final overall consistency rate.
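The average pairwise agreement rate used in both groups can be computed directly from the annotators' labels. Below is a minimal sketch; the annotator names and labels are hypothetical, not the paper's actual data:

```python
from itertools import combinations

def avg_pairwise_agreement(rankings):
    """Average agreement rate between every pair of annotators.
    `rankings` maps annotator name -> list of per-pair preference labels
    (e.g. "A>B" or "B>A"), one label per annotated response pair."""
    rates = []
    for a, b in combinations(rankings, 2):
        same = sum(x == y for x, y in zip(rankings[a], rankings[b]))
        rates.append(same / len(rankings[a]))
    return sum(rates) / len(rates)

# Toy example: three annotators labeling four response pairs
labels = {
    "ann1": ["A>B", "A>B", "B>A", "A>B"],
    "ann2": ["A>B", "B>A", "B>A", "A>B"],
    "ann3": ["A>B", "A>B", "B>A", "B>A"],
}
rate = avg_pairwise_agreement(labels)  # mean over the three annotator pairs
```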

4 Methodology

4.1 Outer Layer MoE


Figure 3: The training framework of each inner layer MoE. The LoRA components in the figure are only for illustration; in actual experiments we inject the LoRA layers into each layer of the transformer. Training details are in Section 4.2.2.

Our first layer is a sparse MoE structure, and only the top-1 expert is activated each time. We divide the input into five categories according to tasks: text creation, roleplay, objective knowledge QA, subjective knowledge QA, and chitchat, and train an MoE for each category. When a new input comes, we use a small frozen top-1 gating network pre-trained on category labels to act as the router. Formally, we have

$\mathrm{chosen}=\mathop{\arg\max}_{e=0}^{4}E(t_e\mid x)$ (1)

$y=\operatorname{RM}_{\mathrm{chosen}}(x)$ (2)

Here $t_0,\dots,t_4$ represent the five tasks, $x$ represents the input, $\operatorname{RM}_0,\dots,\operatorname{RM}_4$ represent the expert RMs corresponding to each task, and $y$ represents the RM's output.
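The two equations above amount to top-1 routing followed by a single expert call. A minimal sketch, assuming a hypothetical gating function that returns one score per task and expert RMs given as callables:

```python
import numpy as np

TASKS = ["text creation", "roleplay", "objective QA", "subjective QA", "chitchat"]

def route_and_score(x, gate_scores_fn, expert_rms):
    """Top-1 routing: pick the task with the highest gate score (Eq. 1),
    then evaluate only that expert RM (Eq. 2)."""
    scores = gate_scores_fn(x)        # shape (5,): one gate score per task
    chosen = int(np.argmax(scores))   # only the top-1 expert is activated
    return chosen, expert_rms[chosen](x)

# Toy usage: a fixed gate that always prefers task 1 (roleplay),
# and dummy experts that just return their own index as the "reward"
gate = lambda x: np.array([0.1, 0.9, 0.2, 0.1, 0.0])
experts = [lambda x, i=i: float(i) for i in range(5)]
chosen, y = route_and_score("some input", gate, experts)
```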

4.2 Inner Layer MoE

4.2.1 Modeling

For each task $t$, we first obtain a base task-specific model $\operatorname{RM}_{t_{\text{base}}}$ from a general RM through training on task-specific preference data. Then, we divide the task into distinct capability points. Capability points are equivalent to a decomposition of the task in a low-dimensional space. Defining the input space as $X$, we need to learn an expert $\operatorname{RM}_{t_i}\colon X\rightarrow Z_i$ for each capability point $i$. In this work, we obtain $\operatorname{RM}_{t_i}$ by performing LoRA fine-tuning on $\operatorname{RM}_{t_{\text{base}}}$.

Let $Z=Z_0\times\dots\times Z_{k-1}$. After we have learned the $k$ experts, we use an aggregation network to aggregate their outputs and produce the final result; that is, we learn $\operatorname{RM}_t\colon Z\rightarrow R$. This is a Markov process, and $X\rightarrow Z\rightarrow R$ constructs a homogeneous Markov chain.

We employ a fully connected network (FCN) following each expert $\operatorname{RM}_{t_i}$ to serve as the value head generating a one-dimensional score, which is further mapped to the range $[0,1]$ using the sigmoid activation function. Let $W^{(base)}$ represent the initial base model $\operatorname{RM}_{t_{\text{base}}}$, $\Delta W^{(i)}$ represent the fine-tuned LoRA network learned for capability point $i$, and $w_i$ and $b_i$ represent the FCN associated with capability point $i$. The score $r_i$ for capability point $i$ is then expressed as follows:

$z_i=(W^{(base)}+\Delta W^{(i)})x$ (3)

$r_i=\sigma(w_i z_i+b_i)$ (4)

To obtain a single score as the final reward from the multiple experts, we concatenate the low-dimensional representations of all experts and use a two-layer MLP to aggregate them. The final reward score $r$ is:

$z=\mathop{\oplus}_{i=0}^{k-1}z_i$ (5)

$r=\sigma(W_1\operatorname{PReLU}(W_0 z+B_0)+B_1)$ (6)

Note that our MLP does not act on the final scalar outputs, but rather on the low-dimensional decompositions before they are fed into the FCNs, as we believe there may be underlying correlations between different capability points that can be learned by the MLP in their low-dimensional embedding space.
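Equations (3), (5), and (6) can be sketched end-to-end in a few lines. The following toy implementation uses random numpy matrices in place of the actual transformer and LoRA weights; all dimensions here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def prelu(a, alpha=0.25):
    return np.where(a > 0, a, alpha * a)

def inner_moe_reward(x, W_base, deltas, W0, B0, W1, B1):
    """Eq. 3: each LoRA expert yields a low-dimensional embedding
    z_i = (W_base + dW_i) x; Eq. 5: the embeddings (not the scalar expert
    scores) are concatenated; Eq. 6: a two-layer MLP with PReLU and a
    final sigmoid maps them to the reward in [0, 1]."""
    zs = [(W_base + dW) @ x for dW in deltas]            # Eq. 3, one z_i per expert
    z = np.concatenate(zs)                               # Eq. 5
    return float(sigmoid(W1 @ prelu(W0 @ z + B0) + B1))  # Eq. 6

# Toy dimensions: input dim d, embedding dim h, k experts
rng = np.random.default_rng(0)
d, h, k = 4, 3, 6
x = rng.normal(size=d)
W_base = rng.normal(size=(h, d))
deltas = [rng.normal(size=(h, d)) * 0.1 for _ in range(k)]  # LoRA updates
W0, B0 = rng.normal(size=(8, k * h)), np.zeros(8)
W1, B1 = rng.normal(size=(1, 8)), np.zeros(1)
r = inner_moe_reward(x, W_base, deltas, W0, B0, W1, B1)
```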

We use the logsigmoid loss, which is also the most commonly used loss function for training RMs, where $k$ represents the number of responses in a piece of data:

$\mathcal{L}=-\frac{1}{\binom{k}{2}}E_{(x,y_w,y_l)\sim D}\left[\log\left(\sigma\left(r(y_w)-r(y_l)\right)\right)\right]$ (7)
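A minimal sketch of this loss for a single sample with $k$ ranked responses, in plain Python with illustrative reward values:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def pairwise_rm_loss(rewards):
    """Negative log-sigmoid loss over all C(k, 2) pairs of one sample's k
    responses; `rewards` is ordered from most to least preferred, so
    rewards[w] plays r(y_w) and rewards[l] plays r(y_l) for w < l."""
    k = len(rewards)
    pairs = [(w, l) for w in range(k) for l in range(w + 1, k)]
    total = -sum(math.log(sigmoid(rewards[w] - rewards[l])) for w, l in pairs)
    return total / len(pairs)

# A well-ranked sample (preferred responses score higher) should incur a
# smaller loss than the same rewards in reversed order
well_ranked = pairwise_rm_loss([2.0, 1.0, 0.0])
mis_ranked = pairwise_rm_loss([0.0, 1.0, 2.0])
```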

4.2.2 Training

Figure 3 illustrates the training framework of each inner layer MoE. We use a pre-trained $\operatorname{RM}_{\text{base}}$ as the base general RM, and then perform the following three phases of training in sequence:

- **Phase 1: Task-Specific Training.** Use 60% of the task-specific preference data to full-parameter fine-tune $\operatorname{RM}_{\text{base}}$, obtaining the base task-specific model $\operatorname{RM}_{t_{\text{base}}}$.

- **Phase 2: Capabilities Training.** Use the data with capability point labels (the method of obtaining these labels is introduced in Section 4.3) to LoRA fine-tune $\operatorname{RM}_{t_{\text{base}}}$, obtaining $\operatorname{RM}_{t_0},\dots,\operatorname{RM}_{t_{k-1}}$. Each time, a new linear head is learned from the original linear head of $\operatorname{RM}_{t_{\text{base}}}$.

- **Phase 3: Ensemble Training.** Remove the original FCN on each expert RM and use an MLP to aggregate $\operatorname{RM}_{t_0},\dots,\operatorname{RM}_{t_{k-1}}$ into the final model $\operatorname{RM}_t$, and train it with the remaining 40% of the task-specific preference data. During this phase, $\operatorname{RM}_{t_{\text{base}}}$ and the LoRA layers are frozen, and only the newly added MLP layer is trained.

4.3 Obtaining Capability Point Labels

Since it is costly to obtain preference labels for every capability point, instead of manually sorting or scoring, we call the public ERNIE Bot API (which has similar functions to ChatGPT but is much cheaper and achieves the same level of proficiency in Chinese: https://cloud.baidu.com/doc/WENXINWORKSHOP/s/flfmc9do2) to obtain the comparative preference on a single capability point for each response pair, which significantly reduces the labeling cost.

To avoid the positional bias inherent in LLMs, we swap the positions of the two responses and process each pair twice, selectively retaining only those pairs whose labels are consistent across both calls. This method also doubles as a data cleansing technique, as it effectively filters out pairs with minimal discrepancies, which could introduce noise into the training data. The prompt template is shown in both Chinese (Table 13) and English (Table 14) in the Appendix.
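The swap-and-retain rule can be sketched as follows, with a hypothetical `judge` callable standing in for the ERNIE Bot API call (it is assumed to return "first" or "second" for the positionally ordered pair):

```python
def consistent_label(judge, prompt, resp_a, resp_b):
    """Call the judge twice with the response order swapped; keep the pair
    only if the same underlying response wins in both orderings."""
    first_call = judge(prompt, resp_a, resp_b)   # a shown first
    second_call = judge(prompt, resp_b, resp_a)  # b shown first
    if first_call == "first" and second_call == "second":
        return "a"  # a preferred in both orderings
    if first_call == "second" and second_call == "first":
        return "b"  # b preferred in both orderings
    return None     # position-biased or near-tie: drop from training data

# Toy judge that always prefers the longer response (order-independent)
judge = lambda p, x, y: "first" if len(x) > len(y) else "second"
label = consistent_label(judge, "q", "a longer answer", "short")
```

A judge that always picks the first-shown response, by contrast, yields `None`, which is exactly the kind of position-biased pair the paper discards.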

Our approach does not require additional data since the task-specific data from training Phases 1 and 3 can be directly utilized as raw response pairs, which enables us to acquire capability point preferences based on them. In our experiments, we reutilize all the task-specific data as raw response pairs during Phase 2.

5 Experiment Setup

5.1 Model

We use Qwen-1.8B-Chat (https://huggingface.co/Qwen/Qwen-1_8B-Chat) as both the policy model and the base reward model, which is an open-source Chinese-English bilingual Transformer-based large language model proposed by Alibaba Cloud. To make a fair comparison, we use the same model as the base model of our baseline ensemble methods.

5.2 Baseline

We use a single RM baseline and collect three state-of-the-art ensembling methods for reward models from a range of papers (Coste et al., 2024; Eisenstein et al., 2023; Zhai et al., 2024). All baseline methods and our model are trained and evaluated with the same dataset.

- **Single RM**

We use the training of a single reward model as the most basic benchmark.

- **Mean Optimization**

Mean optimization simply takes the mean of the outputs of the different ensemble members:

$R_\mu(x)=\frac{1}{k}\sum_{i=0}^{k-1}r_i(x)$ (8)

- **Worst-Case Optimization**

Worst-case optimization (WCO) creates a conservative estimate by choosing the lowest reward from the ensemble at every step:

$R_{\text{WCO}}(x)=\min_{i=0}^{k-1}r_i(x)$ (9)

- **Uncertainty-Weighted Optimization**

Uncertainty-weighted optimization (UWO) calculates the reward by combining the average reward across all models in the ensemble with the intra-ensemble variance, weighted by a coefficient $\lambda$. Mathematically, this objective is given by:

$R_{\text{UWO}}(x)=\underbrace{R_\mu(x)}_{\text{mean}}-\lambda\underbrace{\frac{1}{k}\sum_{i=0}^{k-1}\left(r_i(x)-R_\mu(x)\right)^2}_{\text{variance}}$ (10)
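The three baseline aggregators (Eqs. 8-10) reduce to a few lines over the ensemble members' scores; a sketch with illustrative score values:

```python
import numpy as np

def mean_reward(scores):
    """Eq. 8: plain average of the k ensemble members' rewards."""
    return float(np.mean(scores))

def wco_reward(scores):
    """Eq. 9: worst-case optimization, the minimum reward in the ensemble."""
    return float(np.min(scores))

def uwo_reward(scores, lam):
    """Eq. 10: mean reward penalized by lam times the intra-ensemble
    (population) variance."""
    s = np.asarray(scores, dtype=float)
    return float(s.mean() - lam * ((s - s.mean()) ** 2).mean())

scores = [0.2, 0.4, 0.6]  # rewards from three hypothetical ensemble members
```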

5.3 Dataset

Our prompt dataset is very diverse and can be mainly classified into five categories: roleplay, chitchat, subjective knowledge QA, objective knowledge QA, and text creation, with some others (including logical reasoning, mathematical calculation, code understanding and generation, translation, etc.). Our dataset is over 98% Chinese. We ensure that the training and test sets have no intersection and heuristically eliminate duplicate prompts. We also filter prompts containing personally identifiable information (PII).

Our data has a turn range of 1 to 27, with an average of 3.72 (a turn contains a user query and an LLM response). Each sample's final query has multiple responses generated through either automated or manual processes. These responses are then assigned preference rankings through manual labeling. Detailed statistics of our data are shown in Table 2.

Table 2: The statistics of our dataset.

We employ a rigorous annotation process with clear documentation guides to direct our annotators (we show an example in our GitHub repository). Each piece of data is evaluated by two annotators to ensure quality, and the final ranking is established through discussion to reach a consensus. We observed that the consistency rate between each pair of annotators reached 74% on average.

6 Result

6.1 Training Phases

Figure 4: The progress of the model at different training stages. The horizontal axis of each image represents the number of training steps, and the vertical axis represents the accuracy of ranking pairs of responses on the training and test sets. Figure 4(a) shows the results of training Phase 1. Figures 4(b) to 4(g) show the results of training Phase 2. Figure 4(h) (top-right) shows the results of training Phase 3.

[Figure 4 panels: (a) Phase 1; (b) personality and emotional investment; (c) conversational sense; (d) empathy ability; (e) manifestation of relationship traits; (f) personalized characteristic expression; (g) content richness; (h) Phase 3.]

Table 3: The consistency with human preferences. Note that for all methods except GPT-4, the overall result is not simply the sum of the per-category results but is obtained by training with all data and testing. The best performance is in bold and the second best is underlined.

Figure 5: The optimization results of BoN and PPO for the roleplay task. The x-axes have a square-root scale, and the KL divergence scale differs between BoN and PPO due to differences in the algorithms and the KL calculation. All RMs are normalized to have a zero mean after training.


(a) BoN


(b) PPO

In this section, we use roleplay as an example to demonstrate the progress of the model at different training phases. By recording the accuracy of the reward model’s ranking on the pairs of responses in training and testing sets, we can intuitively display the training results. The results are shown in Figure4. We find:

- In Phase 1 (Figure 4(a)): the accuracy on the training set improves rapidly, while the improvement on the test set is slow, eventually stabilizing at 56%.

- In Phase 2 (Figures 4(b) to 4(g)): we divide the roleplay task into six capability points, namely 1) personality and emotional investment, 2) conversational sense, 3) empathy ability, 4) manifestation of relationship traits, 5) personalized characteristic expression, and 6) content richness. We find that, depending on the single capability point, an accuracy of 80-86% can be achieved on the test set. Note that the training and test labels here correspond to a single capability point, not an overall preference.

- In Phase 3 (Figure 4(h)): the improvement on the test set is rapid, reaching a peak of 68%. Compared with the 56% accuracy in Phase 1, this demonstrates that our method of training multiple experts on different capability points and aggregating those experts can significantly improve the model's performance.

6.2 Preference Consistency Evaluation

Since an RM is essentially an imperfect proxy for human preferences, testing the consistency rate of the trained RM on human-labeled preference data is a direct and effective evaluation method. Given a pair of preference data, we use the trained RMs to assign scores to each response and record the consistency between the ranking by scores and the ranking by manual labels. Higher consistency rates (or accuracy) mean better performance of RMs as proxies for human preferences. In addition to the methods introduced in Section 5.2, we add GPT-4 generative evaluation benchmarks for a more comprehensive comparison. Their implementation details are introduced in Appendix B.3.3.

We present the results in Table 3. There are two noteworthy findings. Firstly, our DMoERM achieves the best results in all categories and overall, with a 6 to 8 percentage point improvement over the other methods. This indicates that our training method can better learn human preferences without increasing the amount of training data or model parameters. Secondly, the DMoERM-w/o-Outer model removes the outer layer and instead uses the same capability point partition to train on all categories. It achieves the second-best results and still improves significantly over the other methods, making it a viable alternative when memory or task-specific data is limited.

6.3 Optimization Evaluation

In the optimization evaluation, we use BoN and PPO as optimization strategies to optimize the same policy model. Since we have verified in Section 3.1 that the outer MoE improves performance by routing different categories to different models, for the sake of fair comparison, we restrict the task to roleplay and compare the corresponding inner MoE with other aggregation methods. Under the same degree of optimization as measured by KL divergence, we use another pre-trained reward model (https://huggingface.co/IDEA-CCNL/Ziya-LLaMA-7B-Reward), fine-tuned on our task-specific data and with a larger number of parameters, as the referee to judge the outputs; we call it the gold reward model. Policies that score higher under the gold reward model are considered better. We evaluate BoN for a maximum of n_max = 60,000 samples, which roughly equals 10 nats of KL. The reported results are the mean outcomes over a set of 500 distinct prompts. For PPO, we train for 3,000 PPO steps and average the results over three distinct random seeds. We give further details on implementation and other hyperparameters in Appendix B.3.

We showcase the results in Figure 5. Both BoN and PPO optimization results demonstrate that our model consistently outperforms alternative integration approaches when policy models are optimized to the same degree. Moreover, after n = 8,000 (KL ≈ 8 nats) in the BoN optimization experiment, our model maintains stability without signs of overoptimization, unlike other ensemble methods, which exhibit varying degrees of overoptimization. These findings suggest that our model outperforms baselines in optimizing LLMs and is capable of mitigating the overoptimization problem.

7 Conclusion

In this work, we propose DMoERM to enhance RMs’ performance. The outer layer MoE divides inputs into different tasks to avoid multi-task disturbance, while the inner layer MoE reduces the impact of data noise by learning LoRA experts on different capability points. Preference consistency experiments demonstrate our model is more representative of human preferences. Optimization evaluations indicate our model is more effective in optimizing LLMs and can mitigate the overoptimization problem.

Limitations

Although we attempt to reduce annotation costs by calling a public LLM API instead of manual labeling, it is still costly: using the ERNIE Bot API costs approximately $3,000 in total, and using the ChatGPT API costs about ten times more. However, if the annotation standards for humans are predetermined in advance, our method does not significantly increase annotation costs. In our annotation process, we found that because annotators spend most of their time understanding the queries and the various responses, the proposed annotation method (first annotating the preferences on pre-determined capability points and then annotating the overall preference) only reduces annotation speed by about 10% while increasing annotation consistency by 5 percentage points.

Another limitation is the training time: about 80 NVIDIA A100 GPU hours are needed to train one inner MoE, roughly eight times longer than training a traditional single RM with the same number of parameters. While many works focus on exploring efficient training methods, we reserve this problem for future work.


Appendix A Additional Related Work

A.1 Reinforcement Learning with Human Feedback

Reinforcement Learning with Human Feedback (RLHF) is a foundational method for fine-tuning language models to align with human preferences. RLHF has been applied to a variety of tasks, including text summarization(Stiennon et al., 2020) and improving the helpfulness and harmlessness of language models(Bai et al., 2022). In particular, InstructGPT(Ouyang et al., 2022) employs a three-step RLHF process that includes a supervised learning technique and the PPO algorithm(Schulman et al., 2017), which has proven effective for ChatGPT. Despite its success, RLHF encounters several challenges, such as low sample efficiency(Snell et al., 2023; Gülçehre et al., 2023) and overoptimization(Gao et al., 2023). Since our method requires no additional data to improve performance and mitigates overoptimization, it addresses both of these aspects.

RLHF heavily depends on reward modeling as a proxy for human preferences. Recent research has attempted to bypass the reward modeling step(Yuan et al., 2023; Song et al., 2023). Specifically, Direct Policy Optimization (DPO) aims to refine policies by classifying human preference data without reward models. Although this method is simpler to implement and offers training stability, more recent studies reveal several advantages of using reward models. Investigations into the robustness of reward-model-based strategies suggest they are more resistant to overfitting due to the limitations of KL regularization(Azar et al., 2023). Moreover, in comparison to DPO, reward-model-based RLHF shows great advantages on out-of-preference samples(Li et al., 2023).

A.1.1 The Overoptimization Problem of RMs

As the learned reward model is only a proxy for the true reward function, optimizing it may not always result in an improvement according to true human preferences. In practice, optimizing a (fixed) learned reward model almost always leads to improvement according to this learned reward model but only improves according to the true reward model (i.e., humans) for some initial period, after which performance often begins to regress. This phenomenon is referred to as overoptimization.

Appendix B Additional Experiment Setup

B.1 Hyperparameters

Since different training phases have different sets of model parameters to train, we use different learning rates to better adapt to these three phases and allocate different proportions of the training set to Phase 1 and Phase 3. The settings are shown in Table 4.

Table 4: The hyperparameters used in different training phases.

We set the minibatch size to 1 and conduct an evaluation on the validation set every 100 training steps. If the best result on the validation set does not improve for 20 consecutive evaluations, we implement early stopping and use the best-performing model for the next training phase, or for the final testing if it is already Phase 3.
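The early-stopping schedule described above can be sketched as follows (the `train_step` and `evaluate` callables are placeholders for the actual training and validation routines):

```python
def train_with_early_stopping(train_step, evaluate, eval_every=100, patience=20):
    """Evaluate on the validation set every `eval_every` steps and stop
    once the best validation score has not improved for `patience`
    consecutive evaluations. Returns the step of the best checkpoint."""
    best_score, best_step, stale = float("-inf"), 0, 0
    step = 0
    while stale < patience:
        for _ in range(eval_every):
            step += 1
            train_step(step)
        score = evaluate(step)
        if score > best_score:
            best_score, best_step, stale = score, step, 0
        else:
            stale += 1
    return best_step
```

In practice, the model weights at `best_step` would be restored before the next training phase.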

The settings for the LoRA components are presented in Table 5, applicable to both our LoRA experts and the LoRA ensembles used in the baseline comparisons.

Table 5: LoRA configurations.

The PPO hyperparameters and generation hyperparameters are shown in Tables 6 and 7, respectively.

Table 6: PPO hyperparameters.

Table 7: Generation hyperparameters.

All experiments are run on a single machine with eight NVIDIA A100 80G GPUs, and we use the Adam optimizer for the optimization process.

B.2 Ensemble Creation for Baseline Methods

To create an ensemble for the Mean, WCO, and UWO baselines, following the guidance of Coste et al. (2024) and Eisenstein et al. (2023), we train a fixed number of proxy reward models using identical data and hyperparameters. Each model, however, is initialized with a different random seed, which varies both the random initialization of the scalar reward head added on top of the pre-trained language model and the data shuffling order. Table 5 presents the LoRA parameters for these models. We train an ensemble of five reward models, aligning with the configurations used in previous works. This number is also comparable to the number of LoRA experts in our model, which ranges from five to six depending on the number of capability points allocated to a category.
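Under our reading of Coste et al. (2024), the three baseline aggregations over the ensemble scores can be sketched as follows (the UWO coefficient λ is a tunable hyperparameter, and the exact variance weighting is an assumption on our part):

```python
import statistics

def mean_reward(scores):
    # Mean: average the scores of the ensemble members.
    return statistics.fmean(scores)

def wco_reward(scores):
    # Worst-case optimization: take the minimum ensemble score.
    return min(scores)

def uwo_reward(scores, lam=0.5):
    # Uncertainty-weighted optimization: penalize the mean by the
    # intra-ensemble variance, scaled by lambda.
    return statistics.fmean(scores) - lam * statistics.pvariance(scores)
```

Each function maps the list of per-member scores for one response to a single aggregated reward.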

B.3 Optimization Method

B.3.1 Best-of-n Sampling

Best-of-n (BoN) sampling, also called rejection sampling, is a simple inference-time optimization method(Nakano et al., 2021; Ouyang et al., 2022). For a given prompt, n responses are generated from the policy model, and the answer with the highest proxy reward model score is returned. To evaluate the degree of optimization, the KL divergence is defined analytically as a function of n (a recent work (Beirami et al., 2024) claims this bound is not exact and provides a tighter bound defined by the binary entropy function; it has not yet been widely adopted, and even if correct, our experimental conclusions remain valid in terms of trends):

$$\operatorname{KL}_{\text{bon}}=\log n-\frac{n-1}{n} \tag{11}$$

In our experiments, we evaluate BoN for a maximum of n_max = 60,000 samples, which roughly equals 10 nats of KL.
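Equation (11) and the BoN selection rule can be sketched as follows (the `generate` and `rm_score` callables are placeholders for the policy model and the proxy reward model):

```python
import math

def kl_bon(n):
    # Analytic KL divergence of best-of-n sampling, Eq. (11).
    return math.log(n) - (n - 1) / n

def best_of_n(prompt, generate, rm_score, n):
    # Draw n responses from the policy and return the one with the
    # highest proxy reward model score.
    responses = [generate(prompt) for _ in range(n)]
    return max(responses, key=lambda r: rm_score(prompt, r))
```

Note that `kl_bon(60000)` is indeed close to 10 nats, consistent with the setting above.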

B.3.2 Proximal Policy Optimization

Proximal Policy Optimization (PPO)(Schulman et al., 2017) is a policy-gradient-based online reinforcement learning method that maximizes a given reward function by repeatedly performing small incremental updates to the policy. PPO is the standard algorithm used in fine-tuning language models based on human feedback(Ouyang et al., 2022; Bai et al., 2022; Stiennon et al., 2020; Zheng et al., 2023). When using PPO to fine-tune a language model, a KL penalty term is added during the reward calculation to regularize the policy by preventing it from deviating far from the initial policy:

$$R^{\text{PPO}}(q,r)=R(q,r)-\beta\operatorname{KL}_{\text{PPO}} \tag{12}$$

where $\pi^{\text{PPO}}$ is the policy being optimized and $\pi^{\text{init}}$ is the initial (pre-trained) language model.

The naive way to calculate the KL divergence between the PPO-optimized policy $\pi^{\text{PPO}}$ and the initial model is as follows:

$$\operatorname{KL}_{\text{PPO}}(\pi^{\text{PPO}},\pi^{\text{init}})=\mathbb{E}_{(q,r)\sim\pi^{\text{PPO}}}\left[\log\frac{\pi^{\text{PPO}}(r|q)}{\pi^{\text{init}}(r|q)}\right] \tag{13}$$

However, this estimator suffers from high variance and may yield negative values. Consequently, we employ the following estimator (Coste et al., 2024):

$$\operatorname{KL}_{\text{PPO}}(\pi^{\text{PPO}},\pi^{\text{init}})=\mathbb{E}_{(q,r)\sim\pi^{\text{PPO}}}\left[\frac{1}{2}\left(\log\frac{\pi^{\text{PPO}}(r|q)}{\pi^{\text{init}}(r|q)}\right)^{2}\right] \tag{14}$$
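A per-sample sketch of the KL-penalized reward in Eq. (12) and the squared-log-ratio estimator in Eq. (14) follows (the β value and the log-probability inputs are illustrative, not the paper's settings):

```python
def penalized_reward(reward, logp_ppo, logp_init, beta=0.05):
    # Eq. (12): subtract a KL penalty from the proxy reward.
    # The per-sample KL term is the log-probability ratio of the
    # sampled response under the current and initial policies.
    return reward - beta * (logp_ppo - logp_init)

def kl_estimate(logp_ppo_batch, logp_init_batch):
    # Eq. (14): low-variance, non-negative KL estimator that averages
    # 0.5 * (log-ratio)^2 over sampled (q, r) pairs.
    return sum(0.5 * (lp - li) ** 2
               for lp, li in zip(logp_ppo_batch, logp_init_batch)) / len(logp_ppo_batch)
```

Because every term of `kl_estimate` is a square, the estimate can never go negative, unlike the naive log-ratio estimator.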

We train a total of 3,000 PPO steps, and the PPO parameters are shown in Table 6.

B.3.3 Generative Baseline

Apart from RM-based methods, if only the preference order of response pairs needs to be given, a simple way is to call a public language model API with zero-shot or few-shot prompting. This can also be combined with BoN or DPO to optimize the LLM. For a more comprehensive evaluation, we add the GPT-4 baseline to the preference consistency experiments.

  • GPT-4 (OpenAI, 2023): We use the most advanced gpt-4-1106-preview (https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo; we called the API in January 2024) as the evaluator for each pair of preference data. Each time, we swap the positions of the two responses and make two requests, re-requesting until the two requests return identical results, which we take as the final result.

We tried to find a well-crafted prompt and evaluated with both zero-shot and one-shot settings. We did not use few-shot for comparison, since it makes the context much longer and the results are no better than one-shot. There are also other generative evaluation methods; since their papers report performance similar to GPT-4 (Ke et al., 2023) or even worse (Wang et al., 2023), we use GPT-4 to represent them.
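The position-swapping protocol above can be sketched as follows (`ask_judge`, standing in for the GPT-4 API call, is a hypothetical placeholder that returns 1 or 2 for the preferred position):

```python
def judge_pair(query, resp_a, resp_b, ask_judge, max_tries=10):
    """Query the judge twice with the response order swapped, and
    re-request until both orderings agree on the same underlying
    response; returns "a" or "b", or None if no agreement within
    `max_tries` rounds."""
    for _ in range(max_tries):
        v1 = ask_judge(query, resp_a, resp_b)   # resp_a in position 1
        v2 = ask_judge(query, resp_b, resp_a)   # positions swapped
        # Agreement means the same response wins in both orderings.
        if v1 == 1 and v2 == 2:
            return "a"
        if v1 == 2 and v2 == 1:
            return "b"
    return None
```

Swapping positions this way controls for the well-known position bias of LLM judges.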

Appendix C Discussion on Model Interpretability

A single traditional RM only outputs the final reward score, giving no clues about why a response receives its score, which results in poor interpretability. In contrast, our DMoERM first learns latent embeddings on different capability points under a specific task, and the final score is learned from an ensemble of these embeddings. By applying the FCN trained in Phase 2 to the corresponding hidden embedding, we can obtain a response's score on each capability point. In this way, we can identify which aspects make a good response effective and which aspects make a poor response fail. Similarly, if the overall consistency of the model is unsatisfactory, we can diagnose the problem and prescribe targeted solutions by analyzing the performance on each capability point.
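This per-capability inspection can be sketched as follows (the data layout and callables are illustrative stand-ins for the Phase-2 FCNs and the Phase-3 aggregation MLP):

```python
def interpret_response(embeddings, point_heads, aggregator):
    """Return (final_reward, per_point_scores) for one response.

    `embeddings`: dict mapping capability-point names to latent vectors.
    `point_heads`: dict of per-point scoring callables (the Phase-2 FCNs).
    `aggregator`: callable over all embeddings producing the final
    reward (the Phase-3 MLP).
    """
    per_point = {name: point_heads[name](vec)
                 for name, vec in embeddings.items()}
    final = aggregator(embeddings)
    return final, per_point
```

The `per_point` dictionary is what exposes which capability points a response scores well or poorly on.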

The way our DMoERM checks the task of an input, works on different capability points, and aggregates the latent embeddings of each expert to obtain the final reward resembles a Chain-of-Thought for reward models, which can enhance both their reasoning ability (giving the reward in context) and their interpretability.

Appendix D Additional Experimental Result

D.1 Larger Model Performance

We further apply the preference consistency evaluation to the Qwen7B-Chat and Qwen14B-Chat models. Apart from adjusting the learning rates to the different parameter sizes ([5e-8, 2.5e-5, 5e-7] for the 7B model and [3e-8, 1.5e-5, 3e-7] for the 14B model, for the three training phases respectively), the other parameter settings and dataset partitioning are consistent with the paper. The specific experimental results are shown in Table 10.

We find that our DMoERM achieves the best results across all categories and all model sizes (in bold), while DMoERM-7B achieves the second-best results (underlined). Meanwhile, comparing our DMoERM across parameter sizes, the final results improve by about 1.9% for 7B and 3.6% for 14B over the smallest 1.8B model, indicating that our model structure benefits from scaling up the parameters. The increase in consistency rate when scaling up is not large, due to noise in the labeled data (we observed 74% consistency among human annotators), which makes further improvement challenging once consistency rates near the 70% mark. In addition, since the amount of training data is fixed, the larger models may not have been sufficiently trained, which may also contribute to the smaller improvement.

D.2 Human Evaluation

During both BoN and PPO optimization, we set three checkpoints (n = 30, 1,000, 6,000 for BoN and 1,000, 2,000, 3,000 PPO steps for PPO) and use 100 prompts at each checkpoint for human evaluation. Specifically, for the outputs obtained from these prompts under optimization with different RMs, we ask annotators to sort them according to their preferences, and we record the winning rates of our model against the other ensemble methods in Figure 6.

We have two findings. First, compared to BoN, the winning-rate improvement under PPO optimization is slower. Second, in both optimization experiments, the winning rates of our model increase steadily as optimization progresses, ultimately reaching approximately 87% for BoN and 68% for PPO.

Figure 6: The winning rates of DMoERM against other ensemble methods in human evaluation.


D.3 OOD Optimization Evaluation

In contrast to several previous works (Gao et al., 2023; Coste et al., 2024; Eisenstein et al., 2023; Zhai et al., 2024), where the RM training set, optimization set, and evaluation set are typically independently and identically distributed (IID), the following experiments use out-of-distribution (OOD) RM training sets in PPO I and OOD evaluation sets in PPO II. (Since BoN is a training-free optimization method used at inference, its optimization prompt set is the evaluation prompt set, so we only conduct OOD optimization experiments on PPO.) We use AlignBench (Liu et al., 2023), a public and popular Chinese alignment benchmark, to evaluate the effectiveness of different models. We use DMoERM-w/o-Outer for comparison and set the reward models and policy models to the same 7B model size. Because the models are larger, we compare the optimization results of different RMs using GPT-4 as the referee, which we believe has sufficient capability to evaluate the generated results and is naturally fair. The RM training set, optimization set, and evaluation set used in PPO I and PPO II are listed in Table 8, and the experimental results are shown in Figure 7.

Table 8: The datasets used in different periods of PPO I and PPO II. Note that the datasets are randomly sampled from overall datasets and have no intersection in different periods.

Figure 7: The winning rates of DMoERM against other ensemble methods in OOD optimization.


We have two significant findings. First, PPO I and PPO II show consistent trends, both steadily improving the winning rate against the baseline method during optimization. Second, in PPO II, the winning rate initially increases more rapidly (before 1,000 steps), while the growth in PPO I is more stable and persistent throughout all optimization periods.

An intuitive analysis: in PPO II, the optimization set and training set are IID, so the reward model provides a clear reward signal during optimization, enabling training to quickly achieve positive optimization. Conversely, in PPO I, while the optimization and evaluation sets are IID with each other, they are OOD with respect to the training set, making the reward signal less clear and strong; however, advantages gradually emerge over the course of continuous optimization.

D.4 Annotation Quantitative Experiment

Since our method requires more annotation, to provide a more comprehensive evaluation, we quantify the annotation consumption and match our method to the baseline methods in terms of annotation cost. In this experiment, we compare DMoERM-w/o-Outer with the other baseline models at all three model sizes on the entire dataset. Because our model requires five additional annotations per data point (on intent conformity, logic, conversational sense, content richness, and readability), we randomly sample only one-sixth of the data to train our model, while the other models use all the data, keeping the total number of annotations consistent. The experimental results are shown in Table 9.

Table 9: The results of labeling quantitative experiment. We record the performance of our model at the end of the training Phase 1 as DMoERM-Phase1 and at the end of three training phases as DMoERM-Phase3.

We also record the performance of our DMoERM-w/o-Outer at the end of training Phase 1 as DMoERM-Phase1. At this point, it is equivalent to a Single RM trained with only one-sixth of the data, so its performance is poor. But when the three training phases end (denoted DMoERM-Phase3), its performance improves by 9.3 percentage points on average and exceeds all baseline models at all three model sizes.

Note that this comparison is not entirely fair to our method: in our empirical experiments, we found that our annotation method only reduces the annotator's speed by about 10% while increasing annotation consistency by 5 percentage points. In this experiment, we used only one-sixth of the data yet exceeded all the baseline methods by at least 1.4 percentage points, indicating the compelling advantages of our method.

D.5 Qualitative Sample

Throughout our experiments, quantitative metrics play a pivotal role, as they enable the rapid and comprehensive evaluation of various methods. In this section, we offer a concise insight into the qualitative aspects of the approaches discussed in this study. Specifically, for a given prompt, we provide the responses from the final policies (n = 60,000 for BoN and 3,000 steps for PPO) of each method in Figure 5.

The main findings are as follows. First, the brevity of BoN's responses is due to the different manner in which policy optimization occurs compared to PPO. Second, there are indications of failure and overoptimization in the other ensemble RMs. As shown in Table 11, for BoN this manifests as inadequate answers; for PPO, it manifests as poor answers that are excessively long and repetitive. These are obvious signs of overoptimization. Lastly, we observe that even in scenarios where other ensemble RMs struggle, our DMoERM yields robust qualitative outcomes.

Table 10: The consistency of different model sizes with human preferences. In each model size, the best performance is in bold and the second best is underlined. We can observe that our DMoERM and DMoERM-w/o-Outer consistently achieve the best and second-best results at different model sizes.

Table 11: A set of example answers to an evaluation query.


Figure 8: A screenshot of the interface for the annotation consistency experiment. The content in the red box is the preference selection for the five capability points. We find that removing them leads to a drop of 5 percentage points in the overall consistency of preferences.

Table 12: The capability points partitions for each task in our experiments.

Table 13: The Chinese original version of the prompt for calling public LLM API to get the comparison on a single capability point. Note that we do not attach historical information, for it can greatly reduce the cost and is sufficient to achieve satisfactory results in our experiments. We determine the model’s choice by identifying whether the number 1 or 2 appears first in its response.

Table 14: The English translated version of the prompt for calling public LLM API to get the comparison on a single capability point. Note that we do not attach historical information, for it can greatly reduce the cost and is sufficient to achieve satisfactory results in our experiments. We determine the model’s choice by identifying whether the number 1 or 2 appears first in its response.
