---
license: mit
title: ActiveUltraFeedback
short_description: Sample-Efficient RLHF Preference Data Generation
paper:
  name: "ActiveUltraFeedback: Sample-Efficient RLHF Preference Data Generation"
  url: "https://arxiv.org/abs/2603.09692"
sdk: gradio
emoji:
colorFrom: blue
colorTo: green
sdk_version: 6.9.0
---
# ActiveUltraFeedback

This repo accompanies the paper [ActiveUltraFeedback — arXiv:2603.09692](https://arxiv.org/abs/2603.09692).

**ActiveUltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). It leverages **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically cutting annotation costs while outperforming standard baselines.
> **Repository Purpose:** This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
<details>
<summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>
Our experiments demonstrate that **active learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets.
### 1. UltraFeedback Prompts (Only)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |
### 2. Skywork Prompts (Only)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
| UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
| MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
| DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
| *Ours* | | | | | | | |
| **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
| UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
| MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
| DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
| *Ours* | | | | | |
| **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
| **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
| **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |
### 3. Skywork + UltraFeedback (Combined)
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
| UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
| MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
| DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
| *Ours* | | | | | | | |
| **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
| **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
| **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |
### 4. Tulu 3 Prompts
**Reward Model (RM) Performance**
| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |
**DPO Performance**
| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
</details>
<details>
<summary><strong>🔁 Pipeline Overview (How it works)</strong></summary>
Given a batch of prompts, the following steps are executed:
1. **Response Generation**: For each prompt in the batch, query multiple LLMs, each generating one response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty for each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Based on the predicted rewards and uncertainties, an acquisition function (e.g., Double Thompson Sampling) selects which two responses to annotate.
4. **Oracle Annotation**: An oracle (an LLM judge or a human) annotates which response in the selected pair is preferred.
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
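
The loop above can be sketched in a few lines of toy Python. Everything here is an illustrative stand-in, not the paper's implementation: scalar values play the role of LLM responses, a randomly initialized ensemble of linear scorers plays the role of the uncertainty-aware reward model, and a simple "larger is better" rule plays the role of the oracle.

```python
import random
import statistics

# Toy, self-contained sketch of the annotation loop described above.
# All names and shapes are hypothetical stand-ins for the real pipeline.

def generate_responses(n_models=4):
    """Step 1 stand-in: each 'response' is just a scalar feature."""
    return [random.gauss(0.0, 1.0) for _ in range(n_models)]

class EnsembleRewardModel:
    """Steps 2 & 5: uncertainty via a randomly initialized ensemble of linear scorers."""
    def __init__(self, n_heads=5):
        self.weights = [random.gauss(0.0, 1.0) for _ in range(n_heads)]

    def predict(self, response):
        scores = [w * response for w in self.weights]
        # Ensemble mean = reward estimate; ensemble spread = uncertainty estimate.
        return statistics.mean(scores), statistics.stdev(scores)

    def train(self, winner, loser, lr=0.1):
        # Nudge each head that currently ranks the annotated pair incorrectly.
        for k, w in enumerate(self.weights):
            if w * winner <= w * loser:
                self.weights[k] += lr * (winner - loser)

def double_thompson_sample(model, responses):
    """Step 3: pick two distinct responses, each via an independent posterior draw."""
    def sample_best(exclude=None):
        best_idx, best_draw = None, float("-inf")
        for idx, r in enumerate(responses):
            if idx == exclude:
                continue
            mu, sigma = model.predict(r)
            draw = random.gauss(mu, sigma)  # plausible reward under uncertainty
            if draw > best_draw:
                best_idx, best_draw = idx, draw
        return best_idx
    first = sample_best()
    return first, sample_best(exclude=first)

def oracle(a, b):
    """Step 4 stand-in: the annotator prefers the larger scalar."""
    return (a, b) if a >= b else (b, a)

# The loop: generate, predict, select, annotate, retrain.
random.seed(0)
model = EnsembleRewardModel()
for _ in range(20):  # a batch of 20 toy prompts
    responses = generate_responses()
    i, j = double_thompson_sample(model, responses)
    winner, loser = oracle(responses[i], responses[j])
    model.train(winner, loser)
```

The key design point the sketch preserves is that annotation effort goes only to the pair the acquisition function deems most informative, rather than to every possible response pair.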
</details>
<details>
<summary><strong>🤖 Source Models & Licenses</strong></summary>
| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| **Qwen** | | |
| `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
| `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
| `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
| `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
| `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
| `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
| `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
| `Qwen/Qwen3-235B-A22B` | 235 | Apache 2.0 |
| **Llama** | | |
| `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
| `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
| `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
| `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
| **Microsoft** | | |
| `microsoft/Phi-4-mini-instruct` | 4 | MIT |
| `microsoft/phi-4` | 14 | MIT |
| **Mistral** | | |
| `mistralai/Mistral-Small-24B-Instruct-2501` | 23 | Apache 2.0 |
| `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
| **NVIDIA** | | |
| `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
| `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | NVIDIA Open Model License |
| `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | NVIDIA Open Model License |
| **Gemma** | | |
| `google/gemma-3-1b-it` | 1 | Gemma |
| `google/gemma-3-4b-it` | 4 | Gemma |
| `google/gemma-3-12b-it` | 12 | Gemma |
| `google/gemma-3-27b-it` | 27 | Gemma |
| **AllenAI** | | |
| `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
| `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
| `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
| **Other** | | |
| `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |
</details>
---
**License:** MIT
**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.
## Citation
If you use our work, the ActiveUltraFeedback datasets, or the models, please cite us:
```bibtex
@misc{melikidze2026activeultrafeedbackefficientpreferencedata,
title={ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning},
author={Davit Melikidze and Marian Schneider and Jessica Lam and Martin Wertich and Ido Hakimi and Barna Pásztor and Andreas Krause},
year={2026},
eprint={2603.09692},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2603.09692},
}
```