Commit `464585e` (verified) by davmel · parent `e57001d` · Update README.md · files changed: README.md (+270 −7)
# Active UltraFeedback

**Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs), requiring only a set of prompts as input.

It leverages **uncertainty quantification** and **active learning** to identify and annotate the most informative samples, drastically reducing annotation costs while maintaining high data quality. Annotations are provided by an oracle (typically another LLM, though it can also be a human).
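
As a toy illustration of that selection principle (the names and the random stand-in scores below are hypothetical, not the repository's API), the most informative prompts can be ranked by the disagreement of an ensemble of reward heads:

```python
import random
import statistics

random.seed(0)

# Stand-in for an uncertainty-aware reward model: pretend an ensemble of
# 4 reward heads scored one response for each of 100 prompts.
ensemble_scores = {
    f"prompt_{i}": [random.gauss(0, 1) for _ in range(4)] for i in range(100)
}

# Uncertainty of a prompt = disagreement (std. dev.) across the ensemble heads.
uncertainty = {p: statistics.pstdev(s) for p, s in ensemble_scores.items()}

# Active learning step: only the k most uncertain prompts go to the oracle.
k = 10
to_annotate = sorted(uncertainty, key=uncertainty.get, reverse=True)[:k]
```

Annotation effort then concentrates on the prompts where the reward model is least sure, which is where a new label changes the model the most.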

> **Repository Purpose:** This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.

## 🏆 Key Results

Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** significantly outperform standard baselines.

The datasets generated by our pipeline for **DRTS** and **DeltaUCB** consistently beat the original `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets in our DPO/RM training setups with LoRA.

Below are the detailed results across 4 different prompt distributions. For acquisition functions with multiple hyperparameter configurations, we report the best-performing setting.

---

### 1. UltraFeedback Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | +0.483 | +0.156 | +0.123 | +0.370 | +0.400 | +0.318 |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.412 | +0.408 | +0.183 | +0.114 | +0.347 | +0.404 | +0.312 |
| **DeltaUCB** | +0.423 | **+0.553** | +0.132 | +0.080 | +0.435 | +0.408 | **+0.339** |
| **DTS** | +0.406 | +0.024 | +0.194 | +0.077 | +0.441 | +0.197 | +0.223 |
| **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
| **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
| UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
| MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
| DeltaQwen | +0.055 | +0.047 | +0.130 | +0.316 | **+0.137** |
| *Ours* | | | | | |
| **DRTS** | +0.055 | **+0.050** | +0.143 | +0.259 | +0.127 |
| **DeltaUCB** | **+0.065** | +0.039 | +0.113 | +0.254 | +0.117 |
| **DTS** | +0.011 | +0.034 | +0.013 | +0.037 | +0.023 |
| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

---

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | **+0.400** | **+0.318** |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.396 | +0.090 | +0.107 | +0.033 | +0.344 | +0.225 | +0.199 |
| **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
| **DTS** | +0.417 | -0.021 | +0.148 | **+0.077** | +0.450 | +0.245 | +0.219 |
| **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | **+0.495** | +0.227 | +0.244 |
| **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.020 | +0.004 | +0.004 | +0.025 | +0.013 |
| UltraFeedback | +0.023 | +0.021 | +0.003 | +0.031 | +0.019 |
| MaxMin | +0.043 | **+0.041** | +0.017 | +0.114 | +0.053 |
| DeltaQwen | +0.043 | +0.030 | +0.023 | +0.183 | +0.069 |
| *Ours* | | | | | |
| **DRTS** | +0.065 | +0.019 | **+0.055** | **+0.197** | **+0.083** |
| **DeltaUCB** | **+0.074** | +0.028 | +0.045 | +0.173 | +0.080 |
| **DTS** | +0.003 | +0.004 | +0.002 | +0.028 | +0.009 |
| **InfoMax** | +0.013 | +0.008 | +0.003 | +0.012 | +0.009 |
| **MaxMinLCB** | -0.001 | +0.000 | +0.000 | +0.002 | -0.000 |

---

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
| UltraFeedback | +0.443 | +0.188 | **+0.213** | +0.114 | +0.481 | +0.284 | +0.287 |
| MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | +0.400 | **+0.318** |
| DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
| *Ours* | | | | | | | |
| **DRTS** | +0.439 | +0.386 | +0.151 | +0.064 | +0.415 | **+0.395** | +0.308 |
| **DeltaUCB** | +0.463 | +0.350 | +0.164 | **+0.092** | +0.469 | +0.213 | +0.292 |
| **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
| **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
| **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | +0.026 | +0.012 | +0.012 | +0.035 | +0.021 |
| UltraFeedback | +0.032 | -0.007 | +0.011 | +0.052 | +0.022 |
| MaxMin | +0.074 | +0.025 | +0.052 | +0.222 | +0.092 |
| DeltaQwen | +0.069 | **+0.030** | **+0.097** | **+0.299** | **+0.123** |
| *Ours* | | | | | |
| **DRTS** | +0.065 | +0.028 | +0.090 | +0.238 | +0.105 |
| **DeltaUCB** | **+0.078** | +0.010 | +0.093 | +0.246 | +0.106 |
| **DTS** | +0.011 | +0.000 | +0.006 | +0.024 | +0.010 |
| **InfoMax** | +0.004 | +0.012 | +0.004 | +0.016 | +0.009 |
| **MaxMinLCB** | -0.006 | +0.000 | +0.003 | +0.004 | -0.000 |

---

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | | | |
| Random | **+0.465** | +0.465 | **+0.213** | +0.077 | **+0.584** | +0.355 | +0.360 |
| UltraFeedback | +0.450 | +0.441 | +0.170 | +0.077 | +0.531 | **+0.386** | +0.343 |
| MaxMin | +0.450 | +0.443 | +0.211 | +0.083 | +0.521 | +0.358 | +0.344 |
| DeltaQwen | +0.179 | -0.086 | -0.013 | +0.164 | +0.174 | +0.091 | +0.085 |
| Tulu3_PrefMix | +0.398 | +0.350 | +0.173 | +0.098 | +0.423 | +0.342 | +0.298 |
| *Ours* | | | | | | | |
| **DRTS** | +0.456 | +0.515 | +0.080 | +0.148 | +0.533 | +0.356 | +0.348 |
| **DeltaUCB** | +0.455 | **+0.537** | +0.189 | **+0.148** | +0.580 | +0.390 | **+0.383** |
| **DTS** | +0.426 | +0.140 | +0.200 | +0.036 | +0.499 | +0.160 | +0.243 |
| **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
| **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

**DPO Performance** (best hyperparameters selected)

| Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
| :--- | :---: | :---: | :---: | :---: | :---: |
| *Baselines* | | | | | |
| Random | **+0.055** | +0.041 | +0.069 | +0.046 | +0.052 |
| UltraFeedback | +0.043 | +0.052 | +0.056 | +0.057 | +0.051 |
| MaxMin | +0.022 | **+0.067** | **+0.188** | +0.279 | **+0.138** |
| DeltaQwen | +0.049 | +0.034 | +0.124 | **+0.291** | +0.124 |
| Tulu3_PrefMix | +0.037 | +0.069 | +0.046 | +0.020 | +0.042 |
| *Ours* | | | | | |
| **DRTS** | +0.050 | +0.058 | +0.118 | +0.203 | +0.107 |
| **DeltaUCB** | +0.028 | +0.060 | +0.134 | +0.235 | +0.114 |
| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |

---

## 🔁 Pipeline Overview

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs so that each generates a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Using an acquisition function (e.g., Double Thompson Sampling), select which two responses should be annotated based on their rewards and uncertainties.
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
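
The steps above can be sketched in miniature. Everything here (`reward_heads`, the length-based oracle, the Thompson-style draw) is an illustrative stand-in, not the actual pipeline code:

```python
import random
import statistics

random.seed(0)

# 1. Response generation: stand-ins for completions from multiple LLMs.
responses = ["short reply", "a somewhat longer reply", "the longest reply of the three"]

# 2. Uncertainty-aware reward prediction: a toy "ensemble" of random linear
#    scorers plays the role of the randomly initialised reward model.
reward_heads = [lambda r, w=random.gauss(0, 1): w * len(r) for _ in range(8)]

def predict(response):
    """Return (mean reward, uncertainty) across the ensemble heads."""
    scores = [head(response) for head in reward_heads]
    return statistics.mean(scores), statistics.pstdev(scores)

# 3. Pair selection: a Thompson-style acquisition — draw one sample per
#    response from N(mean, std) and annotate the two highest draws.
draws = {r: random.gauss(*predict(r)) for r in responses}
first, second = sorted(draws, key=draws.get, reverse=True)[:2]

# 4. Oracle annotation: a trivial heuristic stands in for an LLM/human judge.
chosen, rejected = sorted((first, second), key=len, reverse=True)
preference = {"chosen": chosen, "rejected": rejected}

# 5. The preference pair would now be appended to the training set and the
#    reward model retrained before processing the next batch of prompts.
```

The point of the sketch is the loop structure: prediction with uncertainty feeds an acquisition function, the oracle labels only the selected pair, and the label immediately improves the next round of predictions.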

---

## 🤖 Source Models

To ensure diversity and quality, we use a wide range of open-source models for completion generation. Please refer to the specific license of each model when using these datasets.

**Qwen Series:**
* `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`
* `Qwen/Qwen3-0.6B`, `1.7B`, `14B`, `32B`
* `Qwen/Qwen3-30B-A3B`, `235B-A22B`

**Llama Series:**
* `meta-llama/Llama-3.1-8B-Instruct`
* `meta-llama/Llama-3.2-1B-Instruct`, `3B-Instruct`
* `meta-llama/Llama-3.3-70B-Instruct`

**NVIDIA Nemotron:**
* `nvidia/Llama-3_3-Nemotron-Super-49B-v1`
* `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`
* `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1`

**Google Gemma:**
* `google/gemma-3-1b-it`, `4b-it`, `12b-it`, `27b-it`

**Mistral:**
* `mistralai/Mistral-Small-24B-Instruct-2501`
* `mistralai/Mistral-Large-Instruct-2411`

**Others:**
* `microsoft/Phi-4-mini-instruct`, `microsoft/phi-4`
* `HuggingFaceTB/SmolLM2-1.7B-Instruct`
* `CohereLabs/c4ai-command-a-03-2025`
* `deepseek-ai/DeepSeek-V3`
* `allenai/OLMo-2-0325-32B-Instruct`
* `allenai/Llama-3.1-Tulu-3-70B`, `405B`
* `moonshotai/Moonlight-16B-A3B-Instruct`

---

## 🚀 Quickstart

### 1. Installation
Install the package in editable mode:
```bash
pip install -e .
```

### 2. Running the Pipeline
Run the main dataset generation script:
```bash
python path/to/main_script.py
```

### 3. Configuration (Optional)
To modify the pipeline parameters and steps, edit the configuration files in the `config/` directory.

---

## 🛠 Environment Setup

### Option 1: Docker/Podman (Recommended)
Build the container image:
```bash
podman build -t activeuf:latest .
```

### Option 2: `uv` (For Local Use)
Create a `uv` environment with all dependencies:
```bash
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

# Sync dependencies
uv sync --dev
source .venv/bin/activate
```

---

## 👨‍💻 Development Setup

For contributors and developers:

### Pre-commit Hooks
This project uses `ruff` for linting and formatting.
```bash
pre-commit install
```

### Manual Linting
```bash
# Format code
ruff format

# Lint and auto-fix
ruff check --fix
```

---

## 📄 License

This project is licensed under the **MIT License**.

**Note on Data Usage:**
While the code and curated datasets in this repository are released under MIT, the datasets contain outputs generated by third-party models (listed above). Users are responsible for adhering to the respective licenses of these source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using this data for training or commercial purposes.