---
license: mit
title: ActiveUltraFeedback
short_description: Sample-Efficient RLHF Preference data generation
paper:
  name: "ActiveUltraFeedback: Sample-Efficient RLHF Preference data generation"
  url: "https://arxiv.org/abs/2603.09692"
sdk: gradio
emoji: ⚡
colorFrom: blue
colorTo: green
sdk_version: 6.9.0
---

# ActiveUltraFeedback

This repo accompanies the paper [ActiveUltraFeedback — arXiv:2603.09692](https://arxiv.org/abs/2603.09692).

**ActiveUltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.

> **Repository Purpose:** This repository serves as a central hub for all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
## 🏆 Benchmark Results

Our experiments demonstrate that **active learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets.

### 1. UltraFeedback Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
| DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
| *Ours* | | | | | | | |
| **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
| *Ours* | | | | | |
| **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
| DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
| *Ours* | | | | | | | |
| **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
| **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
## 🔁 Pipeline Overview (How it works)

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Select the two responses to annotate based on their rewards and uncertainties using an acquisition function (e.g., Double Thompson Sampling).
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
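The loop above can be sketched in a few lines of Python. Everything here is an illustrative stub, not the paper's implementation: `generate_responses`, `UncertainRewardModel`, and `oracle` are toy stand-ins, and the optimistic (reward + uncertainty) pair selection is a simplified UCB-style proxy for the Double Thompson Sampling acquisition function.

```python
import random

random.seed(0)

def generate_responses(prompt, n_models=4):
    """Step 1: several LLMs each produce one response (stubbed as strings)."""
    return [f"{prompt}::model_{i}" for i in range(n_models)]

class UncertainRewardModel:
    """Steps 2 & 5: a toy reward model tracking a mean reward and a
    count-based uncertainty per response; in the real pipeline this is a
    neural reward model with an epistemic-uncertainty estimate."""
    def __init__(self):
        self.stats = {}  # response -> [mean_reward, n_updates]

    def predict(self, response):
        mean, n = self.stats.get(response, [0.0, 0])
        uncertainty = 1.0 / (1.0 + n)  # shrinks as labels accumulate
        return mean, uncertainty

    def update(self, winner, loser):
        # Step 5: crude incremental update from one preference label.
        for resp, delta in ((winner, +1.0), (loser, -1.0)):
            mean, n = self.stats.get(resp, [0.0, 0])
            self.stats[resp] = [(mean * n + delta) / (n + 1), n + 1]

def select_pair(responses, model):
    """Step 3: pick the two responses with the highest optimistic score
    (reward + uncertainty), a simplified stand-in for Double Thompson
    Sampling."""
    scored = sorted(responses, key=lambda r: sum(model.predict(r)), reverse=True)
    return scored[0], scored[1]

def oracle(a, b):
    """Step 4: preference oracle (random here; an LLM judge or human in practice)."""
    return (a, b) if random.random() < 0.5 else (b, a)

def active_loop(prompts):
    model = UncertainRewardModel()
    dataset = []
    for prompt in prompts:
        responses = generate_responses(prompt)          # step 1
        a, b = select_pair(responses, model)            # steps 2-3
        winner, loser = oracle(a, b)                    # step 4
        dataset.append({"prompt": prompt, "chosen": winner, "rejected": loser})
        model.update(winner, loser)                     # step 5
    return dataset, model

dataset, model = active_loop([f"prompt_{i}" for i in range(8)])
```

Each iteration spends exactly one annotation per prompt, which is where the sample efficiency comes from: the acquisition function directs that single label to the pair the reward model is most uncertain about.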
## 🤖 Source Models & Licenses

| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| **Qwen** | | |
| `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
| `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
| `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
| `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
| `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
| `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
| `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
| `Qwen/Qwen3-235B-A22B` | 234 | Apache 2.0 |
| **Llama** | | |
| `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
| `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
| `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
| `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
| **Microsoft** | | |
| `microsoft/Phi-4-mini-instruct` | 4 | MIT |
| `microsoft/phi-4` | 14 | MIT |
| **Mistral** | | |
| `mistralai/Mistral-Small-24B-Instruct-2501` | 23 | Apache 2.0 |
| `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
| **NVIDIA** | | |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
| `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | Nvidia Open Model |
| `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | Nvidia Open Model |
| **Gemma** | | |
| `google/gemma-3-1b-it` | 1 | Gemma |
| `google/gemma-3-4b-it` | 4 | Gemma |
| `google/gemma-3-12b-it` | 12 | Gemma |
| `google/gemma-3-27b-it` | 27 | Gemma |
| **AllenAI** | | |
| `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
| `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
| `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
| **Other** | | |
| `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |
---

**License:** MIT

**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.

## Citation

If you use our work or the ActiveUltraFeedback datasets or models, please cite us:

```bibtex
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
      title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
      author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
      year={2026},
      eprint={2603.09692},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.09692},
}
```