Update README.md
# Active UltraFeedback

**Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs). We leverage **uncertainty quantification** and **active learning** to annotate only the most informative samples, drastically reducing costs while beating standard baselines.

### 📂 [**Click Here to View the Preference Datasets Collection**](https://huggingface.co/collections/ActiveUltraFeedback/preference-datasets-677f0e3745281481075f1073)

*(Contains all datasets generated via our DRTS, DeltaUCB, and InfoMax experiments)*

---

<details>
<summary><strong>🏆 Benchmark Results (Click to Expand)</strong></summary>

Our experiments demonstrate that **Active Learning strategies (specifically DRTS and DeltaUCB)** consistently outperform the `ultrafeedback_binarized_cleaned` and `tulu3` preference-mixture datasets on our DPO/RM training setups with LoRA.

### 1. UltraFeedback Prompts (Only)

| **InfoMax** | +0.011 | +0.019 | +0.018 | +0.020 | +0.016 |
| **MaxMinLCB** | +0.015 | +0.017 | +0.006 | +0.027 | +0.016 |

### 2. Skywork Prompts (Only)

**Reward Model (RM) Performance**

| **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
| **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

### 3. Skywork + UltraFeedback (Combined)

**Reward Model (RM) Performance**

| **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
| **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

### 4. Tulu 3 Prompts

**Reward Model (RM) Performance**

| **DTS** | +0.015 | +0.012 | +0.018 | +0.024 | +0.017 |
| **InfoMax** | +0.021 | +0.008 | +0.039 | +0.012 | +0.020 |
| **MaxMinLCB** | +0.013 | -0.014 | +0.012 | +0.019 | +0.008 |
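
As a quick sanity check on these tables, each row's final Avg entry equals the rounded mean of its four preceding columns (the column headers themselves are elided in this excerpt). For example, for the InfoMax row directly above:

```python
# InfoMax deltas from the four result columns of the Tulu 3 table above
deltas = [0.021, 0.008, 0.039, 0.012]
avg = round(sum(deltas) / len(deltas), 3)
print(avg)  # 0.02, matching the row's +0.020
```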

</details>

<details>
<summary><strong>🔁 Pipeline Overview (How it works)</strong></summary>

Given a batch of prompts, the following steps are executed:

1. **Response Generation**: For each prompt in the batch, call multiple LLMs to each generate a response.
2. **Uncertainty-Aware Reward Prediction**: An uncertainty-aware reward model predicts the reward and associated uncertainty of each response. *(Note: this model is initialized randomly at the start.)*
3. **Pair Selection (Acquisition Function)**: Select which two responses should be annotated, based on the predicted rewards and uncertainties, using an acquisition function (e.g., Double Thompson Sampling).
4. **Oracle Annotation**: Annotate which response in the selected pair is preferred (via LLM or human).
5. **Reward Model Training**: Train the uncertainty-aware reward model on the new preference data, then repeat the loop.
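
The loop above can be sketched end to end. Everything below is an illustrative stand-in, not the repository's actual API: a small ensemble of random linear heads plays the uncertainty-aware reward model, a Thompson-sampling-style draw stands in for Double Thompson Sampling, and a hidden scoring function plays the oracle.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_RESPONSES, K_HEADS = 16, 4, 8

class EnsembleRewardModel:
    """Toy uncertainty-aware reward model: K linear heads.
    Mean over heads = predicted reward; std over heads = uncertainty."""
    def __init__(self):
        self.w = rng.normal(size=(K_HEADS, DIM))  # randomly initialized, as in step 2

    def predict(self, feats):
        scores = feats @ self.w.T                 # (n_responses, K_HEADS)
        return scores.mean(axis=1), scores.std(axis=1)

def select_pair(mean, std):
    """Thompson-sampling-style acquisition: sample a plausible reward for every
    response and take the argmax, twice, so uncertain responses get explored."""
    first = int(np.argmax(rng.normal(mean, std)))
    second = first
    while second == first:                        # force a distinct opponent
        second = int(np.argmax(rng.normal(mean, std)))
    return first, second

w_true = rng.normal(size=DIM)                     # hidden "ground-truth" preference

def oracle_prefers_first(a, b):
    """Step 4 stand-in for an LLM or human annotator."""
    return float(a @ w_true) > float(b @ w_true)

model = EnsembleRewardModel()
preferences = []
for _ in range(3):                                # a batch of 3 prompts
    # Step 1: responses, represented here by random feature vectors.
    feats = rng.normal(size=(N_RESPONSES, DIM))
    mean, std = model.predict(feats)              # Step 2: reward + uncertainty
    i, j = select_pair(mean, std)                 # Step 3: acquisition
    if oracle_prefers_first(feats[i], feats[j]):  # Step 4: annotation
        preferences.append((feats[i], feats[j]))  # (chosen, rejected)
    else:
        preferences.append((feats[j], feats[i]))
# Step 5 (omitted here): fit the ensemble on `preferences` with a
# Bradley-Terry-style loss, then run the loop on the next batch of prompts.
print(len(preferences))
```

In the real pipeline the reward model is a trained network and the oracle is an LLM judge or human; the point of the sketch is only the control flow — predict, select under uncertainty, annotate, retrain.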

</details>

<details>
<summary><strong>🤖 Source Models & Licenses</strong></summary>

To ensure diversity and quality, we utilize a wide range of open-source models for completion generation. Below is the list of models used, along with their parameters and licenses.

| Model | Parameters (B) | License |
| :--- | :---: | :--- |
| … | … | … |
| `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
| `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
| `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |

</details>

<details>
<summary><strong>🚀 Quickstart & Installation</strong></summary>

### 1. Installation

```bash
pip install -e .
```

### 2. Running the Pipeline

```bash
python path/to/main_script.py
```

### 3. Environment Setup (Docker)

```bash
podman build -t activeuf:latest .
```

</details>

---

**License:** MIT

**Disclaimer:** The datasets hosted here contain outputs from various third-party open-source models. Users must ensure compliance with the specific licenses of the source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using these datasets.