---
title: ActiveUltraFeedback
emoji: ⚡
colorFrom: blue
colorTo: green
sdk: gradio
sdk_version: 6.9.0
license: mit
short_description: Sample-Efficient RLHF Preference data generation
paper:
  name: "ActiveUltraFeedback: Sample-Efficient RLHF Preference data generation"
  url: "https://arxiv.org/abs/2603.09692"
---
# ActiveUltraFeedback

This repository accompanies the paper [ActiveUltraFeedback (arXiv:2603.09692)](https://arxiv.org/abs/2603.09692).
**ActiveUltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.

> **Repository purpose:** This repository serves as a central hub for all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
<details>
<summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>

Our experiments demonstrate that active learning strategies (specifically **DRTS** and **DeltaUCB**) consistently outperform the established `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets.
### 1. UltraFeedback Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |
**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |
### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
| DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
| *Ours* | | | | | | | |
| **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |
**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
| *Ours* | | | | | |
| **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |
### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
| DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
| *Ours* | | | | | | | |
| **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |
**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
| **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |
### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |
**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
</details>
<details>
<summary><strong>🔁 Pipeline Overview (How It Works)</strong></summary>

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, multiple LLMs each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty for each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: An acquisition function (e.g., Double Thompson Sampling) uses the predicted rewards and uncertainties to select which two responses should be annotated.
4. **Oracle Annotation**: An oracle (an LLM judge or a human annotator) labels which response in the selected pair is preferred.
5. **Reward Model Training**: The uncertainty-aware reward model is trained on the new preference data, and the loop repeats.
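The loop above can be sketched in a few lines of Python. This is a toy illustration only, under loudly stated assumptions: the "reward model" is a noisy length-based ensemble, the acquisition rule is a simple UCB-style optimistic score rather than the paper's Double Thompson Sampling, and the oracle prefers longer responses. None of these stand-ins reflect the actual implementation.

```python
# Toy sketch of one iteration of the active preference-collection loop.
# Every component here is a hypothetical stand-in, not the paper's code.
import random

random.seed(0)

def reward_with_uncertainty(response: str, ensemble_size: int = 8):
    """Toy uncertainty-aware reward model: an 'ensemble' of noisy scorers.
    The mean plays the role of the reward, the spread the uncertainty.
    (A real model would be a trained network, e.g. an ensemble head.)"""
    scores = [len(response) / 100.0 + random.gauss(0, 0.1)
              for _ in range(ensemble_size)]
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std

def select_pair_ucb(responses, beta: float = 1.0):
    """UCB-style acquisition: pick the two responses with the highest
    optimistic score (mean + beta * std). Double Thompson Sampling would
    instead draw two posterior samples and pick each sample's argmax."""
    scored = [(r, *reward_with_uncertainty(r)) for r in responses]
    scored.sort(key=lambda t: t[1] + beta * t[2], reverse=True)
    return scored[0][0], scored[1][0]

def oracle_prefers(a: str, b: str) -> str:
    """Toy oracle that prefers the longer response; in the pipeline this
    is an LLM judge or a human annotator."""
    return a if len(a) >= len(b) else b

# One pass: generate (here: fixed) responses, select a pair, annotate it.
responses = ["short answer",
             "a somewhat longer answer",
             "the most detailed answer of all"]
chosen_a, chosen_b = select_pair_ucb(responses)
winner = oracle_prefers(chosen_a, chosen_b)
preference_record = {
    "chosen": winner,
    "rejected": chosen_b if winner == chosen_a else chosen_a,
}
# preference_record is what feeds the next reward-model training step.
print(preference_record)
```

In the real pipeline, step 5 retrains the uncertainty-aware reward model on the accumulated `preference_record`s before the next batch of prompts is processed.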
</details>
<details>
<summary><strong>🤖 Source Models & Licenses</strong></summary>

| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| **Qwen** | | |
| `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
| `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
| `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
| `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
| `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
| `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
| `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
| `Qwen/Qwen3-235B-A22B` | 235 | Apache 2.0 |
| **Llama** | | |
| `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
| `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
| `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
| `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
| **Microsoft** | | |
| `microsoft/Phi-4-mini-instruct` | 4 | MIT |
| `microsoft/phi-4` | 14 | MIT |
| **Mistral** | | |
| `mistralai/Mistral-Small-24B-Instruct-2501` | 24 | Apache 2.0 |
| `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
| **NVIDIA** | | |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
| `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | NVIDIA Open Model |
| `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | NVIDIA Open Model |
| **Gemma** | | |
| `google/gemma-3-1b-it` | 1 | Gemma |
| `google/gemma-3-4b-it` | 4 | Gemma |
| `google/gemma-3-12b-it` | 12 | Gemma |
| `google/gemma-3-27b-it` | 27 | Gemma |
| **AllenAI** | | |
| `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
| `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
| `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
| **Other** | | |
| `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC-BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |

</details>
| --- | |
| **License:** MIT | |
| **Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets. | |
| ## Citation | |
| If you use our work or the ActiveUltraFeedback datasets, models, please cite us: | |
```bibtex
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
      title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
      author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
      year={2026},
      eprint={2603.09692},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.09692},
}
```