Update README.md
# Active UltraFeedback

**Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). We leverage **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing costs while beating standard baselines.

### 📂 [**Click Here to View the Preference Datasets Collection**](https://huggingface.co/collections/ActiveUltraFeedback/preference-datasets-677f0e3745281481075f1073)

*(Contains all datasets generated via our DRTS, DeltaUCB, and InfoMax experiments)*

---

<details>
<summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>

Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets on our DPO/RM training setups with LoRA.

### 1. UltraFeedback Prompts (Only)

| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
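
As a quick sanity check on these tables, each row's final Avg entry equals the rounded mean of its four preceding columns (the column headers themselves are elided in this excerpt). For example, for the InfoMax row directly above:

```python
# InfoMax deltas from the four result columns of the Tulu 3 table above
deltas = [0.021, 0.008, 0.039, 0.012]
avg = round(sum(deltas) / len(deltas), 3)
print(avg)  # 0.02, matching the row's +0.020
```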

</details>

<details>
<summary><strong>🔁 Pipeline Overview (How it works)</strong></summary>

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Select which two responses should be annotated, based on the predicted rewards and uncertainties, using an acquisition function (e.g., Double Thompson Sampling).
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
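
The loop above can be sketched end to end. Everything below is an illustrative stand-in, not the repository's actual API: a small ensemble of random linear heads plays the uncertainty-aware reward model, a Thompson-sampling-style draw stands in for Double Thompson Sampling, and a hidden scoring function plays the oracle.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_RESPONSES, K_HEADS = 16, 4, 8

class EnsembleRewardModel:
    """Toy uncertainty-aware reward model: K linear heads.
    Mean over heads = predicted reward; std over heads = uncertainty."""
    def __init__(self):
        self.w = rng.normal(size=(K_HEADS, DIM))  # randomly initialized, as in step 2

    def predict(self, feats):
        scores = feats @ self.w.T                 # (n_responses, K_HEADS)
        return scores.mean(axis=1), scores.std(axis=1)

def select_pair(mean, std):
    """Thompson-sampling-style acquisition: sample a plausible reward for every
    response and take the argmax, twice, so uncertain responses get explored."""
    first = int(np.argmax(rng.normal(mean, std)))
    second = first
    while second == first:                        # force a distinct opponent
        second = int(np.argmax(rng.normal(mean, std)))
    return first, second

w_true = rng.normal(size=DIM)                     # hidden "ground-truth" preference

def oracle_prefers_first(a, b):
    """Step 4 stand-in for an LLM or human annotator."""
    return float(a @ w_true) > float(b @ w_true)

model = EnsembleRewardModel()
preferences = []
for _ in range(3):                                # a batch of 3 prompts
    # Step 1: responses, represented here by random feature vectors.
    feats = rng.normal(size=(N_RESPONSES, DIM))
    mean, std = model.predict(feats)              # Step 2: reward + uncertainty
    i, j = select_pair(mean, std)                 # Step 3: acquisition
    if oracle_prefers_first(feats[i], feats[j]):  # Step 4: annotation
        preferences.append((feats[i], feats[j]))  # (chosen, rejected)
    else:
        preferences.append((feats[j], feats[i]))
# Step 5 (omitted here): fit the ensemble on `preferences` with a
# Bradley-Terry-style loss, then run the loop on the next batch of prompts.
print(len(preferences))
```

In the real pipeline the reward model is a trained network and the oracle is an LLM judge or human; the point of the sketch is only the control flow — predict, select under uncertainty, annotate, retrain.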

</details>

<details>
<summary><strong>🤖 Source Models & Licenses</strong></summary>

To ensure diversity and quality, we utilize a wide range of open-source models for completion generation. Below is the list of models used, along with their parameters and licenses.

| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| … | … | … |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |

</details>

<details>
<summary><strong>🚀 Quickstart & Installation</strong></summary>

### 1. Installation

```bash
pip install -e .
```

### 2. Running the Pipeline

```bash
python path/to/main_script.py
```

### 3. Environment Setup (Docker)

```bash
podman build -t activeuf:latest .
```

</details>

---

**License:** MIT

**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.