---
license: mit
title: ActiveUltraFeedback
short_description: Sample-Efficient RLHF Preference Data Generation
paper:
name: "ActiveUltraFeedback: Sample-Efficient RLHF Preference Data Generation"
url: "https://arxiv.org/abs/2603.09692"
sdk: gradio
emoji: ⚡
colorFrom: blue
colorTo: green
sdk_version: 6.9.0
---
# ActiveUltraFeedback

This repository accompanies the paper [ActiveUltraFeedback (arXiv:2603.09692)](https://arxiv.org/abs/2603.09692).
**ActiveUltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.
> **Repository Purpose:** This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
<details>
<summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>
Our experiments demonstrate that **active-learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets.
### 1. UltraFeedback Prompts (Only)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |
### 2. Skywork Prompts (Only)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
| DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
| *Ours* | | | | | | | |
| **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
| *Ours* | | | | | |
| **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |
### 3. Skywork + UltraFeedback (Combined)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
| DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
| *Ours* | | | | | | | |
| **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
| **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |
### 4. Tulu 3 Prompts
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
</details>
<details>
<summary><strong>🔁 Pipeline Overview (How it works)</strong></summary>
Given a batch of prompts, the following steps are executed:
1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of the responses. *(Note: This model is initialized randomly at the start)*.
3. **Pair Selection (Acquisition Function)**: Select which two responses should get annotated based on rewards and uncertainties using an acquisition function (e.g., Double Thompson Sampling).
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
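The loop above can be sketched with a toy linear ensemble standing in for the uncertainty-aware reward model. Everything below is illustrative: the response features, the ensemble size, the Bradley-Terry update, and `select_pair_dts` (a schematic version of Double Thompson Sampling) are stand-ins, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

N_RESPONSES = 4   # responses per prompt (one per source LLM)
ENSEMBLE = 8      # ensemble heads; their disagreement proxies uncertainty
DIM = 16          # toy feature dimension for a response

# Toy "uncertainty-aware reward model": a linear ensemble over response features.
weights = rng.normal(size=(ENSEMBLE, DIM)) * 0.1

def predict(features):
    """Per-response mean reward and uncertainty (std across ensemble heads)."""
    scores = features @ weights.T                    # (n_responses, ENSEMBLE)
    return scores.mean(axis=1), scores.std(axis=1)

def select_pair_dts(features):
    """Schematic Double Thompson Sampling: draw a random head twice and take
    the argmax response each time; fall back to a random rival on a tie."""
    first = int(np.argmax(features @ weights[rng.integers(ENSEMBLE)]))
    second = int(np.argmax(features @ weights[rng.integers(ENSEMBLE)]))
    if second == first:
        second = int(rng.choice([k for k in range(len(features)) if k != first]))
    return first, second

def oracle(i, j, true_w, features):
    """Stand-in annotator: prefers the response with higher 'true' reward."""
    return (i, j) if features[i] @ true_w >= features[j] @ true_w else (j, i)

def update(winner, loser, features, lr=0.5):
    """One Bradley-Terry gradient step per ensemble head."""
    diff = features[winner] - features[loser]
    for h in range(ENSEMBLE):
        margin = diff @ weights[h]
        grad = 1.0 - 1.0 / (1.0 + np.exp(-margin))   # d log sigmoid / d margin
        weights[h] += lr * grad * diff

true_w = rng.normal(size=DIM)
for step in range(200):                              # the active loop
    feats = rng.normal(size=(N_RESPONSES, DIM))      # 1. toy response features
    i, j = select_pair_dts(feats)                    # 3. acquisition
    winner, loser = oracle(i, j, true_w, feats)      # 4. annotation
    update(winner, loser, feats)                     # 5. reward-model training
```

In the real pipeline, steps 1 and 2 are handled by the source LLMs and a learned reward network; here they collapse into random feature vectors scored by the linear ensemble.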
</details>
<details>
<summary><strong>🤖 Source Models & Licenses</strong></summary>
| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| **Qwen** | | |
| `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
| `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
| `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
| `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
| `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
| `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
| `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
| `Qwen/Qwen3-235B-A22B` | 235 | Apache 2.0 |
| **Llama** | | |
| `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
| `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
| `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
| `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
| **Microsoft** | | |
| `microsoft/Phi-4-mini-instruct` | 4 | MIT |
| `microsoft/phi-4` | 14 | MIT |
| **Mistral** | | |
| `mistralai/Mistral-Small-24B-Instruct-2501` | 23 | Apache 2.0 |
| `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
| **NVIDIA** | | |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
| `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | Nvidia Open Model |
| `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | Nvidia Open Model |
| **Gemma** | | |
| `google/gemma-3-1b-it` | 1 | Gemma |
| `google/gemma-3-4b-it` | 4 | Gemma |
| `google/gemma-3-12b-it` | 12 | Gemma |
| `google/gemma-3-27b-it` | 27 | Gemma |
| **AllenAI** | | |
| `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
| `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
| `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
| **Other** | | |
| `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |
</details>
---
**License:** MIT
**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.
## Citation
If you use our work or the ActiveUltraFeedback datasets or models, please cite us:
```bibtex
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
year={2026},
eprint={2603.09692},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.09692},
}
``` |