davmel committed on
Commit 714bb8f · verified · 1 Parent(s): d7884b6

Update README.md

Files changed (1):
  1. README.md +35 -33

README.md CHANGED
@@ -1,20 +1,16 @@
- Active UltraFeedback
-
- **Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs), requiring only a set of prompts as input.
-
- It leverages **uncertainty quantification** and **active learning** to identify and annotate the most informative samples, drastically reducing annotation costs while maintaining high data quality. Annotations are provided by an oracle (typically another LLM, but can also be a human).
-
- > **Repository Purpose:** This repository serves as a central hub for storing all experimental data, including the generated preference datasets and the resulting DPO/RM/IPO/SimPO models.
-
- ## 🏆 Key Results
-
- Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** significantly outperform standard baselines.
-
- The datasets generated by our pipeline for **DRTS** and **DeltaUCB** consistently beat the actual `ultrafeedback_binarized_cleaned` and `tulu3` preference mixture datasets on our DPO/RM training setups with LoRA.
-
- Below are the detailed results across 4 different prompt distributions.
-
- ---
-
  ### 1. UltraFeedback Prompts (Only)

@@ -48,8 +44,6 @@ Below are the detailed results across 4 different prompt distributions.
  | **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
  | **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

- ---
-
  ### 2. Skywork Prompts (Only)

  **Reward Model (RM) Performance**
@@ -82,8 +76,6 @@ Below are the detailed results across 4 different prompt distributions.
  | **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
  | **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

- ---
-
  ### 3. Skywork + UltraFeedback (Combined)

  **Reward Model (RM) Performance**
@@ -116,8 +108,6 @@ Below are the detailed results across 4 different prompt distributions.
  | **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
  | **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

- ---
-
  ### 4. Tulu 3 Prompts

  **Reward Model (RM) Performance**
@@ -151,24 +141,21 @@ Below are the detailed results across 4 different prompt distributions.
  | **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
  | **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
  | **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |

- ---
-
- ## 🔁 Pipeline Overview

  Given a batch of prompts, the following steps are executed:
-
  1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
  2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is randomly initialized at the start.)*
  3. **Pair Selection (Acquisition Function)**: Select which two responses should be annotated, based on their rewards and uncertainties, using an acquisition function (e.g., Double Thompson Sampling).
  4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
  5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.

- ---
-
- ## 🤖 Source Models and Licenses
-
- To ensure diversity and quality, we utilize a wide range of open-source models for completion generation. Below is the list of models used, along with their parameters and licenses.

  | Model | Parameters (B) | License |
  | :--- | :---: | :--- |
@@ -210,13 +197,28 @@ To ensure diversity and quality, we utilize a wide range of open-source models f
  | `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
  | `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
  | `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |

- ---
-
- ## 📄 License
-
- This project is licensed under the **MIT License**.
-
- **Note on Data Usage:**
- While the code and curated datasets in this repository are released under MIT, the datasets contain outputs generated by third-party models (listed above). Users are responsible for adhering to the respective licenses of these source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using this data for training or commercial purposes.
- """
+ # Active UltraFeedback
+
+ **Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). We leverage **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing costs while beating standard baselines.
+
+ ### 📂 [**Click Here to View the Preference Datasets Collection**](https://huggingface.co/collections/ActiveUltraFeedback/preference-datasets-677f0e3745281481075f1073)
+ *(Contains all datasets generated via our DRTS, DeltaUCB, and InfoMax experiments)*
+
+ ---
+
+ <details>
+ <summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>
+
+ Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** consistently beat the `ultrafeedback_binarized_cleaned` and `tulu3` preference mixture datasets.
+
  ### 1. UltraFeedback Prompts (Only)

  | **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
  | **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

  ### 2. Skywork Prompts (Only)

  **Reward Model (RM) Performance**

  | **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
  | **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

  ### 3. Skywork + UltraFeedback (Combined)

  **Reward Model (RM) Performance**

  | **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
  | **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

  ### 4. Tulu 3 Prompts

  **Reward Model (RM) Performance**

  | **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
  | **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
  | **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
+ </details>

+ <details>
+ <summary><strong>🔁 Pipeline Overview (How it works)</strong></summary>

  Given a batch of prompts, the following steps are executed:
  1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
  2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is randomly initialized at the start.)*
  3. **Pair Selection (Acquisition Function)**: Select which two responses should be annotated, based on their rewards and uncertainties, using an acquisition function (e.g., Double Thompson Sampling).
  4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
  5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
+ </details>
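
The pipeline steps above can be sketched as a minimal loop. This is an illustrative sketch only: all names here (`RewardModel`, `double_thompson_sampling`, `oracle_annotate`, `run_batch`) are hypothetical stand-ins, not the repository's actual API, and the reward model is a placeholder rather than a trained network.

```python
import random

def generate_responses(prompt, models):
    # Step 1: each model produces one response for the prompt.
    return [model(prompt) for model in models]

class RewardModel:
    # Step 2: stand-in for the uncertainty-aware reward model
    # (randomly initialized, so early predictions are noise).
    def predict(self, response):
        # Returns (mean reward, uncertainty); placeholder values here.
        return random.random(), 0.5 + random.random()

    def train(self, preferences):
        # Step 5: update on the newly annotated pairs (no-op in this sketch).
        pass

def double_thompson_sampling(responses, reward_model):
    # Step 3 (simplified): sample a plausible reward for every response,
    # take the arg-max, then repeat while excluding the first winner.
    def sample_best(exclude=None):
        scored = [
            (random.gauss(*reward_model.predict(r)), r)
            for r in responses if r is not exclude
        ]
        return max(scored, key=lambda t: t[0])[1]

    first = sample_best()
    return first, sample_best(exclude=first)

def oracle_annotate(a, b):
    # Step 4: an LLM judge (or human) picks the preferred response;
    # this placeholder arbitrarily prefers the first candidate.
    return a, b  # (chosen, rejected)

def run_batch(prompts, models, reward_model):
    preferences = []
    for prompt in prompts:
        responses = generate_responses(prompt, models)            # Step 1
        pair = double_thompson_sampling(responses, reward_model)  # Steps 2-3
        preferences.append((prompt, *oracle_annotate(*pair)))     # Step 4
    reward_model.train(preferences)                               # Step 5
    return preferences
```

Because annotation happens only on the pair the acquisition function selects, each oracle call is spent where the reward model is most uncertain, which is what keeps annotation costs low.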

+ <details>
+ <summary><strong>🤖 Source Models & Licenses</strong></summary>

  | Model | Parameters (B) | License |
  | :--- | :---: | :--- |

  | `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
  | `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
  | `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |
+ </details>

+ <details>
+ <summary><strong>🚀 Quickstart & Installation</strong></summary>
+
+ ### 1. Installation
+ ```bash
+ pip install -e .
+ ```
+
+ ### 2. Running the Pipeline
+ ```bash
+ python path/to/main_script.py
+ ```
+
+ ### 3. Environment Setup (Podman/Docker)
+ ```bash
+ podman build -t activeuf:latest .
+ ```
+ </details>
+
+ ---
+
+ **License:** MIT
+ **Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.