Commit `464585e` (verified) by davmel · parent `e57001d` · Update README.md · files changed: README.md (+270 −7)
# Active UltraFeedback

**Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs), requiring only a set of prompts as input.

It leverages **uncertainty quantification** and **active learning** to identify and annotate the most informative samples, drastically reducing annotation costs while maintaining high data quality. Annotations are provided by an oracle (typically another LLM, though it can also be a human).
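
As a toy illustration of that selection principle (the names and the random stand-in scores below are hypothetical, not the repository's API), the most informative prompts can be ranked by the disagreement of an ensemble of reward heads:

```python
import random
import statistics

random.seed(0)

# Stand-in for an uncertainty-aware reward model: pretend an ensemble of
# 4 reward heads scored one response for each of 100 prompts.
ensemble_scores = {
    f"prompt_{i}": [random.gauss(0, 1) for _ in range(4)] for i in range(100)
}

# Uncertainty of a prompt = disagreement (std. dev.) across the ensemble heads.
uncertainty = {p: statistics.pstdev(s) for p, s in ensemble_scores.items()}

# Active learning step: only the k most uncertain prompts go to the oracle.
k = 10
to_annotate = sorted(uncertainty, key=uncertainty.get, reverse=True)[:k]
```

Annotation effort then concentrates on the prompts where the reward model is least sure, which is where a new label changes the model the most.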

> **Repository Purpose:** This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.

## 🏆 Key Results

Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** significantly outperform standard baselines.

The datasets generated by our pipeline for **DRTS** and **DeltaUCB** consistently beat the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets in our DPO/RM training setups with LoRA.

Below are the detailed results across 4 different prompt distributions. For acquisition functions with multiple hyperparameter configurations, we report the best-performing setting.

---

### 1. UltraFeedback Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

---

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | **+0.400** | **+0.318** |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.396 | +0.090 | +0.107 | +0.033 | +0.344 | +0.225 | +0.199 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | **+0.077** | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | **+0.495** | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.020 | +0.004 | +0.004 | +0.025 | +0.013 |
| UltraFeedback | +0.023 | +0.021 | +0.003 | +0.031 | +0.019 |
| MaxMin | +0.043 | **+0.041** | +0.017 | +0.114 | +0.053 |
| DeltaQwen | +0.043 | +0.030 | +0.023 | +0.183 | +0.069 |
| *Ours* | | | | | |
| **DRTS** | +0.065 | +0.019 | **+0.055** | **+0.197** | **+0.083** |
| **DeltaUCB** | **+0.074** | +0.028 | +0.045 | +0.173 | +0.080 |
| **DTS** | +0.003 | +0.004 | +0.002 | +0.028 | +0.009 |
| **InfoMax** | +0.013 | +0.008 | +0.003 | +0.012 | +0.009 |
| **MaxMinLCB** | -0.001 | +0.000 | +0.000 | +0.002 | -0.000 |

---

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | **+0.213** | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | +0.400 | **+0.318** |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.439 | +0.386 | +0.151 | +0.064 | +0.415 | **+0.395** | +0.308 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | **+0.092** | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.026 | +0.012 | +0.012 | +0.035 | +0.021 |
| UltraFeedback | +0.032 | -0.007 | +0.011 | +0.052 | +0.022 |
| MaxMin | +0.074 | +0.025 | +0.052 | +0.222 | +0.092 |
| DeltaQwen | +0.069 | **+0.030** | **+0.097** | **+0.299** | **+0.123** |
| *Ours* | | | | | |
| **DRTS** | +0.065 | +0.028 | +0.090 | +0.238 | +0.105 |
| **DeltaUCB** | **+0.078** | +0.010 | +0.093 | +0.246 | +0.106 |
| **DTS** | +0.011 | +0.000 | +0.006 | +0.024 | +0.010 |
| **InfoMax** | +0.004 | +0.012 | +0.004 | +0.016 | +0.009 |
| **MaxMinLCB** | -0.006 | +0.000 | +0.003 | +0.004 | -0.000 |

---

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |

---

## 🔁 Pipeline Overview

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs so that each generates a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Using an acquisition function (e.g., Double Thompson Sampling), select which two responses should be annotated based on their rewards and uncertainties.
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
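
The steps above can be sketched in miniature. Everything here (`reward_heads`, the length-based oracle, the Thompson-style draw) is an illustrative stand-in, not the actual pipeline code:

```python
import random
import statistics

random.seed(0)

# 1. Response generation: stand-ins for completions from multiple LLMs.
responses = ["short reply", "a somewhat longer reply", "the longest reply of the three"]

# 2. Uncertainty-aware reward prediction: a toy "ensemble" of random linear
#    scorers plays the role of the randomly initialised reward model.
reward_heads = [lambda r, w=random.gauss(0, 1): w * len(r) for _ in range(8)]

def predict(response):
    """Return (mean reward, uncertainty) across the ensemble heads."""
    scores = [head(response) for head in reward_heads]
    return statistics.mean(scores), statistics.pstdev(scores)

# 3. Pair selection: a Thompson-style acquisition — draw one sample per
#    response from N(mean, std) and annotate the two highest draws.
draws = {r: random.gauss(*predict(r)) for r in responses}
first, second = sorted(draws, key=draws.get, reverse=True)[:2]

# 4. Oracle annotation: a trivial heuristic stands in for an LLM/human judge.
chosen, rejected = sorted((first, second), key=len, reverse=True)
preference = {"chosen": chosen, "rejected": rejected}

# 5. The preference pair would now be appended to the training set and the
#    reward model retrained before processing the next batch of prompts.
```

The point of the sketch is the loop structure: prediction with uncertainty feeds an acquisition function, the oracle labels only the selected pair, and the label immediately improves the next round of predictions.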

---

## 🤖 Source Models

To ensure diversity and quality, we use a wide range of open-source models for completion generation. Please refer to the specific license of each model when using these datasets.

**Qwen Series:**
* `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`
* `Qwen/Qwen3-0.6B`, `1.7B`, `14B`, `32B`
* `Qwen/Qwen3-30B-A3B`, `235B-A22B`

**Llama Series:**
* `meta-llama/Llama-3.1-8B-Instruct`
* `meta-llama/Llama-3.2-1B-Instruct`, `3B-Instruct`
* `meta-llama/Llama-3.3-70B-Instruct`

**NVIDIA Nemotron:**
* `nvidia/Llama-3_3-Nemotron-Super-49B-v1`
* `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`
* `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1`

**Google Gemma:**
* `google/gemma-3-1b-it`, `4b-it`, `12b-it`, `27b-it`

**Mistral:**
* `mistralai/Mistral-Small-24B-Instruct-2501`
* `mistralai/Mistral-Large-Instruct-2411`

**Others:**
* `microsoft/Phi-4-mini-instruct`, `microsoft/phi-4`
* `HuggingFaceTB/SmolLM2-1.7B-Instruct`
* `CohereLabs/c4ai-command-a-03-2025`
* `deepseek-ai/DeepSeek-V3`
* `allenai/OLMo-2-0325-32B-Instruct`
* `allenai/Llama-3.1-Tulu-3-70B`, `405B`
* `moonshotai/Moonlight-16B-A3B-Instruct`

---

## 🚀 Quickstart

### 1. Installation
Install the package in editable mode:
```bash
pip install -e .
```

### 2. Running the Pipeline
Run the main dataset generation script:
```bash
python path/to/main_script.py
```

### 3. Configuration (Optional)
To modify the pipeline parameters and steps, edit the configuration files in the `config/` directory.

---

## 🛠 Environment Setup

### Option 1: Docker/Podman (Recommended)
Build the container image:
```bash
podman build -t activeuf:latest .
```

### Option 2: `uv` (For Local Use)
Create a `uv` environment with all dependencies:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Sync dependencies
uv sync --dev
source .venv/bin/activate
```

---

## 👨‍💻 Development Setup

For contributors and developers:

### Pre-commit Hooks
This project uses `ruff` for linting and formatting.
```bash
pre-commit install
```

### Manual Linting
```bash
# Format code
ruff format

# Lint and auto-fix
ruff check --fix
```

---

## 📄 License

This project is licensed under the **MIT License**.

**Note on Data Usage:**
While the code and curated datasets in this repository are released under MIT, the datasets contain outputs generated by third-party models (listed above). Users are responsible for adhering to the respective licenses of these source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using this data for training or commercial purposes.