---
license: mit
title: ActiveUltraFeedback
short_description: Sample-Efficient RLHF Preference data generation
paper:
  name: "ActiveUltraFeedback: Sample-Efficient RLHF Preference data generation"
  url: "https://arxiv.org/abs/2603.09692"
sdk: gradio
emoji: ⚡
colorFrom: blue
colorTo: green
sdk_version: 6.9.0
---

# ActiveUltraFeedback

This repo accompanies the paper [ActiveUltraFeedback — arXiv:2603.09692](https://arxiv.org/abs/2603.09692).

**ActiveUltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing annotation costs while outperforming standard baselines.

> **Repository Purpose:** This repository serves as a central hub for all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
## 🏆 Benchmark Results

Our experiments demonstrate that **active learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets.

### 1. UltraFeedback Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
| DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
| *Ours* | | | | | | | |
| **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
| *Ours* | | | | | |
| **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
| DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
| *Ours* | | | | | | | |
| **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
| **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

**DPO Performance**

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
## 🔁 Pipeline Overview (How it works)

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Select the two responses to annotate based on their rewards and uncertainties using an acquisition function (e.g., Double Thompson Sampling).
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
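The loop above can be sketched in a few lines of Python. Everything here is an illustrative stub, not the paper's implementation: `generate_responses`, `UncertainRewardModel`, and `oracle` are toy stand-ins, and the optimistic (reward + uncertainty) pair selection is a simplified UCB-style proxy for the Double Thompson Sampling acquisition function.

```python
import random

random.seed(0)

def generate_responses(prompt, n_models=4):
    """Step 1: several LLMs each produce one response (stubbed as strings)."""
    return [f"{prompt}::model_{i}" for i in range(n_models)]

class UncertainRewardModel:
    """Steps 2 & 5: a toy reward model tracking a mean reward and a
    count-based uncertainty per response; in the real pipeline this is a
    neural reward model with an epistemic-uncertainty estimate."""
    def __init__(self):
        self.stats = {}  # response -> [mean_reward, n_updates]

    def predict(self, response):
        mean, n = self.stats.get(response, [0.0, 0])
        uncertainty = 1.0 / (1.0 + n)  # shrinks as labels accumulate
        return mean, uncertainty

    def update(self, winner, loser):
        # Step 5: crude incremental update from one preference label.
        for resp, delta in ((winner, +1.0), (loser, -1.0)):
            mean, n = self.stats.get(resp, [0.0, 0])
            self.stats[resp] = [(mean * n + delta) / (n + 1), n + 1]

def select_pair(responses, model):
    """Step 3: pick the two responses with the highest optimistic score
    (reward + uncertainty), a simplified stand-in for Double Thompson
    Sampling."""
    scored = sorted(responses, key=lambda r: sum(model.predict(r)), reverse=True)
    return scored[0], scored[1]

def oracle(a, b):
    """Step 4: preference oracle (random here; an LLM judge or human in practice)."""
    return (a, b) if random.random() < 0.5 else (b, a)

def active_loop(prompts):
    model = UncertainRewardModel()
    dataset = []
    for prompt in prompts:
        responses = generate_responses(prompt)          # step 1
        a, b = select_pair(responses, model)            # steps 2-3
        winner, loser = oracle(a, b)                    # step 4
        dataset.append({"prompt": prompt, "chosen": winner, "rejected": loser})
        model.update(winner, loser)                     # step 5
    return dataset, model

dataset, model = active_loop([f"prompt_{i}" for i in range(8)])
```

Each iteration spends exactly one annotation per prompt, which is where the sample efficiency comes from: the acquisition function directs that single label to the pair the reward model is most uncertain about.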
## 🤖 Source Models & Licenses

| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| **Qwen** | | |
| `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
| `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
| `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
| `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
| `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
| `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
| `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
| `Qwen/Qwen3-235B-A22B` | 234 | Apache 2.0 |
| **Llama** | | |
| `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
| `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
| `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
| `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
| **Microsoft** | | |
| `microsoft/Phi-4-mini-instruct` | 4 | MIT |
| `microsoft/phi-4` | 14 | MIT |
| **Mistral** | | |
| `mistralai/Mistral-Small-24B-Instruct-2501` | 23 | Apache 2.0 |
| `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
| **NVIDIA** | | |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
| `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | Nvidia Open Model |
| `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | Nvidia Open Model |
| **Gemma** | | |
| `google/gemma-3-1b-it` | 1 | Gemma |
| `google/gemma-3-4b-it` | 4 | Gemma |
| `google/gemma-3-12b-it` | 12 | Gemma |
| `google/gemma-3-27b-it` | 27 | Gemma |
| **AllenAI** | | |
| `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
| `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
| `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
| **Other** | | |
| `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |
---

**License:** MIT

**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.

## Citation

If you use our work or the ActiveUltraFeedback datasets or models, please cite us:

```bibtex
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
      title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
      author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
      year={2026},
      eprint={2603.09692},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2603.09692},
}
```