davmel committed
Commit 01f712a · verified · 1 parent: 464585e

Update README.md

Files changed (1): README.md (+108 −83)
README.md CHANGED
@@ -1,4 +1,10 @@
- # Active UltraFeedback

  **Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs), requiring only a set of prompts as input.

@@ -12,7 +18,7 @@ Our experiments demonstrate that **Active Learning strategies (specifically DRTS

  The datasets generated by our pipeline for **DRTS** and **DeltaUCB** consistently beat the actual `ultrafeedback_binarized_cleaned` and `tulu3` preference mixture datasets on our DPO/RM training setups with LoRA.

- Below are the detailed results across 4 different prompt distributions. For acquisition functions with multiple hyperparameter configurations, we report the best-performing setting.

  ---

@@ -33,7 +39,7 @@ Below are the detailed results across 4 different prompt distributions. For acqu
  | **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
  | **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

- **DPO Performance** (Best HPs selected)
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
@@ -56,31 +62,31 @@ Below are the detailed results across 4 different prompt distributions. For acqu
  | Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | | | |
- | Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
- | UltraFeedback | +0.443 | +0.188 | +0.213 | +0.114 | +0.481 | +0.284 | +0.287 |
- | MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | **+0.400** | **+0.318** |
- | DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
  | *Ours* | | | | | | | |
- | **DRTS** | +0.396 | +0.090 | +0.107 | +0.033 | +0.344 | +0.225 | +0.199 |
  | **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
- | **DTS** | +0.417 | -0.021 | +0.148 | **+0.077** | +0.450 | +0.245 | +0.219 |
- | **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | **+0.495** | +0.227 | +0.244 |
  | **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

- **DPO Performance** (Best HPs selected)
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
- | Random | +0.020 | +0.004 | +0.004 | +0.025 | +0.013 |
- | UltraFeedback | +0.023 | +0.021 | +0.003 | +0.031 | +0.019 |
- | MaxMin | +0.043 | **+0.041** | +0.017 | +0.114 | +0.053 |
- | DeltaQwen | +0.043 | +0.030 | +0.023 | +0.183 | +0.069 |
  | *Ours* | | | | | |
- | **DRTS** | +0.065 | +0.019 | **+0.055** | **+0.197** | **+0.083** |
- | **DeltaUCB** | **+0.074** | +0.028 | +0.045 | +0.173 | +0.080 |
- | **DTS** | +0.003 | +0.004 | +0.002 | +0.028 | +0.009 |
- | **InfoMax** | +0.013 | +0.008 | +0.003 | +0.012 | +0.009 |
- | **MaxMinLCB** | -0.001 | +0.000 | +0.000 | +0.002 | -0.000 |

  ---

@@ -90,31 +96,31 @@ Below are the detailed results across 4 different prompt distributions. For acqu
  | Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | | | |
- | Random | +0.443 | +0.209 | +0.156 | +0.133 | +0.417 | +0.310 | +0.278 |
- | UltraFeedback | +0.443 | +0.188 | **+0.213** | +0.114 | +0.481 | +0.284 | +0.287 |
- | MaxMin | +0.377 | **+0.483** | +0.156 | +0.123 | +0.370 | +0.400 | **+0.318** |
- | DeltaQwen | +0.195 | -0.034 | +0.028 | +0.067 | +0.216 | +0.126 | +0.100 |
  | *Ours* | | | | | | | |
- | **DRTS** | +0.439 | +0.386 | +0.151 | +0.064 | +0.415 | **+0.395** | +0.308 |
- | **DeltaUCB** | +0.463 | +0.350 | +0.164 | **+0.092** | +0.469 | +0.213 | +0.292 |
  | **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
  | **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
  | **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

- **DPO Performance** (Best HPs selected)
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
- | Random | +0.026 | +0.012 | +0.012 | +0.035 | +0.021 |
- | UltraFeedback | +0.032 | -0.007 | +0.011 | +0.052 | +0.022 |
- | MaxMin | +0.074 | +0.025 | +0.052 | +0.222 | +0.092 |
- | DeltaQwen | +0.069 | **+0.030** | **+0.097** | **+0.299** | **+0.123** |
  | *Ours* | | | | | |
- | **DRTS** | +0.065 | +0.028 | +0.090 | +0.238 | +0.105 |
- | **DeltaUCB** | **+0.078** | +0.010 | +0.093 | +0.246 | +0.106 |
- | **DTS** | +0.011 | +0.000 | +0.006 | +0.024 | +0.010 |
- | **InfoMax** | +0.004 | +0.012 | +0.004 | +0.016 | +0.009 |
- | **MaxMinLCB** | -0.006 | +0.000 | +0.003 | +0.004 | -0.000 |

  ---
 
@@ -136,7 +142,7 @@ Below are the detailed results across 4 different prompt distributions. For acqu
  | **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
  | **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

- **DPO Performance** (Best HPs selected)
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
@@ -166,40 +172,50 @@ Given a batch of prompts, the following steps are executed:

  ---

- ## 🤖 Source Models
-
- To ensure diversity and quality, we utilize a wide range of open-source models for completion generation. Please refer to the specific licenses for each model when using these datasets.
-
- **Qwen Series:**
- * `Qwen/Qwen2.5-0.5B-Instruct`, `Qwen/Qwen2.5-72B-Instruct`
- * `Qwen/Qwen3-0.6B`, `1.7B`, `14B`, `32B`
- * `Qwen/Qwen3-30B-A3B`, `235B-A22B`
-
- **Llama Series:**
- * `meta-llama/Llama-3.1-8B-Instruct`
- * `meta-llama/Llama-3.2-1B-Instruct`, `3B-Instruct`
- * `meta-llama/Llama-3.3-70B-Instruct`
-
- **NVIDIA Nemotron:**
- * `nvidia/Llama-3_3-Nemotron-Super-49B-v1`
- * `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF`
- * `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1`
-
- **Google Gemma:**
- * `google/gemma-3-1b-it`, `4b-it`, `12b-it`, `27b-it`
-
- **Mistral:**
- * `mistralai/Mistral-Small-24B-Instruct-2501`
- * `mistralai/Mistral-Large-Instruct-2411`
-
- **Others:**
- * `microsoft/Phi-4-mini-instruct`, `microsoft/phi-4`
- * `HuggingFaceTB/SmolLM2-1.7B-Instruct`
- * `CohereLabs/c4ai-command-a-03-2025`
- * `deepseek-ai/DeepSeek-V3`
- * `allenai/OLMo-2-0325-32B-Instruct`
- * `allenai/Llama-3.1-Tulu-3-70B`, `405B`
- * `moonshotai/Moonlight-16B-A3B-Instruct`

  ---
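The active selection referenced above ("Given a batch of prompts, the following steps are executed") ranks candidate completion pairs with an acquisition function. As a rough illustration of the UCB family that DeltaUCB and MaxMinLCB belong to, the score below adds an exploration bonus to an estimated reward gap. This is a generic sketch with hypothetical names, not the repository's actual acquisition code:

```python
import math

def ucb_score(mean_gap: float, count: int, total: int, c: float = 1.0) -> float:
    """Optimistic estimate of a pair's reward gap: observed mean plus an
    exploration bonus that shrinks as the pair is sampled more often."""
    bonus = c * math.sqrt(math.log(total + 1) / (count + 1))
    return mean_gap + bonus

# (mean observed gap, times sampled) for two candidate pairs
pairs = {"pair_a": (0.3, 10), "pair_b": (0.1, 1)}
best = max(pairs, key=lambda k: ucb_score(*pairs[k], total=11))
print(best)  # the under-sampled pair wins despite a lower mean gap
```

The point of the bonus term is exploration: rarely-evaluated pairs get an optimistic boost, so the annotator budget is not spent only on pairs that already look informative.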
 
@@ -207,15 +223,15 @@ To ensure diversity and quality, we utilize a wide range of open-source models f

  ### 1. Installation
  Install the package in editable mode:
- ```bash
  pip install -e .
- ```

  ### 2. Running the Pipeline
  Run the main dataset generation script:
- ```bash
  python path/to/main_script.py
- ```

  ### 3. Configuration (Optional)
  To modify the pipeline parameters and steps, edit the configuration files in the `config/` directory.
@@ -226,13 +242,13 @@ To modify the pipeline parameters and steps, edit the configuration files in the

  ### Option 1: Docker/Podman (Recommended)
  Build the container image:
- ```bash
  podman build -t activeuf:latest .
- ```

  ### Option 2: `uv` (For Local Use)
  Create a `uv` environment with all dependencies.
- ```bash
  # Install uv
  curl -LsSf https://astral.sh/uv/install.sh | sh
  source $HOME/.local/bin/env
@@ -240,7 +256,7 @@ source $HOME/.local/bin/env

  # Sync dependencies
  uv sync --dev
  source .venv/bin/activate
- ```

@@ -250,18 +266,18 @@ For contributors and developers:

  ### Pre-commit Hooks
  This project uses `ruff` for linting and formatting.
- ```bash
  pre-commit install
- ```

  ### Manual Linting
- ```bash
  # Format code
  ruff format

  # Lint and auto-fix
  ruff check --fix
- ```

  ---
 
@@ -271,3 +287,12 @@ This project is licensed under the **MIT License**.

  **Note on Data Usage:**
  While the code and curated datasets in this repository are released under MIT, the datasets contain outputs generated by third-party models (listed above). Users are responsible for adhering to the respective licenses of these source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using this data for training or commercial purposes.

+ # Active UltraFeedback

  **Active UltraFeedback** is a scalable pipeline for generating high-quality preference datasets to align large language models (LLMs), requiring only a set of prompts as input.
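Concretely, each record such a pipeline emits pairs a prompt with a preferred and a dispreferred completion. A minimal sketch of one record follows; the field names are illustrative assumptions, not the pipeline's exact schema:

```python
from dataclasses import dataclass, asdict

@dataclass
class PreferencePair:
    """One row of a preference dataset (illustrative field names)."""
    prompt: str
    chosen: str    # completion judged better
    rejected: str  # completion judged worse

record = PreferencePair(
    prompt="Explain gravity in one sentence.",
    chosen="Gravity is the mutual attraction between masses.",
    rejected="Gravity is a kind of magnetism.",
)
print(sorted(asdict(record)))  # ['chosen', 'prompt', 'rejected']
```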
 
 
  The datasets generated by our pipeline for **DRTS** and **DeltaUCB** consistently beat the actual `ultrafeedback_binarized_cleaned` and `tulu3` preference mixture datasets on our DPO/RM training setups with LoRA.
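The LoRA setup mentioned here trains only a low-rank factorization of each weight update, ΔW = B·A, rather than the full matrix. The core computation in plain Python (a generic sketch of the idea, not this repo's training code):

```python
def lora_delta(B, A):
    """Low-rank weight update delta_W = B @ A, with B (out x r) and A (r x in)."""
    r = len(A)  # rank of the adapter
    return [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(len(A[0]))]
            for i in range(len(B))]

# A rank-1 adapter for a 2x3 weight trains 2 + 3 values instead of 6
B = [[1.0], [2.0]]
A = [[0.5, 0.0, -0.5]]
print(lora_delta(B, A))  # [[0.5, 0.0, -0.5], [1.0, 0.0, -1.0]]
```

The saving grows with model size: for a d×d weight and rank r, LoRA trains 2·d·r parameters instead of d².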
 
+ Below are the detailed results across 4 different prompt distributions.

  ---
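For reference, the DPO objective behind the numbers below scores each preference pair by the policy-vs-reference log-ratio margin. This is a minimal sketch with made-up log-probabilities, not the project's training code:

```python
import math

def dpo_loss(lp_pi_chosen, lp_pi_rejected, lp_ref_chosen, lp_ref_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * (margin_chosen - margin_rejected))."""
    margin_chosen = lp_pi_chosen - lp_ref_chosen        # implicit reward of chosen
    margin_rejected = lp_pi_rejected - lp_ref_rejected  # implicit reward of rejected
    logits = beta * (margin_chosen - margin_rejected)
    return math.log1p(math.exp(-logits))  # fine for moderate logits

# Policy already prefers the chosen completion, so the loss is below log(2)
print(round(dpo_loss(-10.0, -14.0, -11.0, -13.0), 4))  # 0.5981
```

Widening the margin between chosen and rejected completions lowers the loss, which is what pushes the policy toward the annotator's preferences.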
 
 
  | **InfoMax** | +0.463 | +0.287 | +0.096 | **+0.129** | +0.509 | +0.296 | +0.297 |
  | **MaxMinLCB** | +0.390 | -0.025 | **+0.244** | +0.070 | +0.453 | +0.250 | +0.230 |

+ **DPO Performance**
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
 
  | Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | | | |
+ | Random | +0.407 | +0.106 | +0.151 | +0.092 | +0.422 | +0.157 | +0.223 |
+ | UltraFeedback | +0.419 | +0.068 | +0.189 | +0.058 | +0.440 | +0.228 | +0.234 |
+ | MaxMin | +0.410 | **+0.462** | +0.172 | +0.055 | **+0.531** | **+0.319** | **+0.325** |
+ | DeltaQwen | +0.238 | -0.023 | +0.011 | **+0.108** | +0.306 | +0.132 | +0.129 |
  | *Ours* | | | | | | | |
+ | **DRTS** | +0.423 | +0.233 | +0.164 | +0.055 | +0.377 | +0.285 | +0.256 |
  | **DeltaUCB** | +0.370 | +0.319 | **+0.194** | +0.033 | +0.346 | +0.310 | +0.262 |
+ | **DTS** | +0.417 | -0.021 | +0.148 | +0.077 | +0.450 | +0.245 | +0.219 |
+ | **InfoMax** | **+0.429** | +0.122 | +0.162 | +0.030 | +0.495 | +0.227 | +0.244 |
  | **MaxMinLCB** | +0.371 | -0.016 | +0.145 | +0.039 | +0.395 | +0.167 | +0.184 |

+ **DPO Performance**
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
+ | Random | +0.012 | +0.015 | +0.045 | +0.063 | +0.033 |
+ | UltraFeedback | +0.027 | **+0.054** | +0.043 | +0.071 | +0.048 |
+ | MaxMin | +0.049 | -0.011 | +0.128 | +0.270 | +0.108 |
+ | DeltaQwen | **+0.058** | +0.002 | **+0.152** | **+0.384** | **+0.149** |
  | *Ours* | | | | | |
+ | **DRTS** | +0.052 | +0.012 | +0.114 | +0.229 | +0.101 |
+ | **DeltaUCB** | +0.055 | +0.013 | +0.077 | +0.238 | +0.095 |
+ | **DTS** | +0.008 | +0.002 | +0.011 | +0.021 | +0.010 |
+ | **InfoMax** | +0.021 | +0.002 | +0.011 | +0.013 | +0.012 |
+ | **MaxMinLCB** | +0.003 | +0.010 | +0.004 | +0.018 | +0.008 |

  ---

 
  | Method | Factuality | Focus | Math | Precise IF | Safety | Ties | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | | | |
+ | Random | +0.455 | +0.216 | **+0.205** | +0.077 | +0.466 | +0.193 | +0.269 |
+ | UltraFeedback | +0.407 | +0.114 | +0.175 | +0.064 | +0.433 | +0.247 | +0.240 |
+ | MaxMin | +0.410 | **+0.467** | +0.194 | +0.083 | +0.412 | **+0.380** | **+0.325** |
+ | DeltaQwen | +0.242 | -0.007 | +0.009 | **+0.151** | +0.279 | +0.241 | +0.153 |
  | *Ours* | | | | | | | |
+ | **DRTS** | +0.427 | +0.436 | +0.156 | +0.086 | +0.475 | +0.272 | +0.309 |
+ | **DeltaUCB** | +0.463 | +0.350 | +0.164 | +0.092 | +0.469 | +0.213 | +0.292 |
  | **DTS** | +0.419 | +0.087 | +0.186 | +0.083 | +0.411 | +0.297 | +0.247 |
  | **InfoMax** | **+0.476** | +0.383 | +0.153 | +0.042 | **+0.546** | +0.199 | +0.300 |
  | **MaxMinLCB** | +0.439 | +0.048 | +0.159 | +0.030 | +0.435 | +0.201 | +0.219 |

+ **DPO Performance**
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |
+ | Random | +0.024 | +0.028 | +0.056 | +0.077 | +0.046 |
+ | UltraFeedback | +0.037 | -0.001 | +0.039 | +0.072 | +0.036 |
+ | MaxMin | +0.022 | -0.016 | **+0.150** | +0.289 | +0.111 |
+ | DeltaQwen | +0.055 | **+0.047** | +0.130 | **+0.316** | **+0.137** |
  | *Ours* | | | | | |
+ | **DRTS** | **+0.055** | +0.015 | +0.108 | +0.177 | +0.088 |
+ | **DeltaUCB** | +0.049 | +0.039 | +0.117 | +0.217 | +0.105 |
+ | **DTS** | +0.009 | +0.002 | +0.014 | +0.029 | +0.013 |
+ | **InfoMax** | +0.011 | +0.021 | +0.014 | +0.018 | +0.015 |
+ | **MaxMinLCB** | -0.010 | +0.019 | +0.010 | +0.021 | +0.009 |

  ---

 
  | **InfoMax** | +0.431 | +0.302 | +0.175 | +0.098 | +0.545 | +0.286 | +0.306 |
  | **MaxMinLCB** | +0.448 | +0.168 | +0.140 | +0.101 | +0.531 | +0.196 | +0.264 |

+ **DPO Performance**
  | Method | GSM8K | IF Eval | Truthful QA | Alpaca Eval | **Mean** |
  | :--- | :---: | :---: | :---: | :---: | :---: |
  | *Baselines* | | | | | |

  ---

 
+ ## 🤖 Source Models and Licenses
+
+ To ensure diversity and quality, we utilize a wide range of open-source models for completion generation. Below is the list of models used, along with their parameters and licenses.
+
+ | Model | Parameters (B) | License |
+ | :--- | :---: | :--- |
+ | **Qwen** | | |
+ | `Qwen/Qwen2.5-0.5B-Instruct` | 0.5 | Apache 2.0 |
+ | `Qwen/Qwen2.5-72B-Instruct` | 72 | Qwen |
+ | `Qwen/Qwen3-0.6B` | 0.6 | Apache 2.0 |
+ | `Qwen/Qwen3-1.7B` | 1.7 | Apache 2.0 |
+ | `Qwen/Qwen3-14B` | 14 | Apache 2.0 |
+ | `Qwen/Qwen3-30B-A3B` | 30 | Apache 2.0 |
+ | `Qwen/Qwen3-32B` | 32 | Apache 2.0 |
+ | `Qwen/Qwen3-235B-A22B` | 235 | Apache 2.0 |
+ | **Llama** | | |
+ | `meta-llama/Llama-3.1-8B-Instruct` | 8 | Llama 3 |
+ | `meta-llama/Llama-3.2-1B-Instruct` | 1 | Llama 3 |
+ | `meta-llama/Llama-3.2-3B-Instruct` | 3 | Llama 3 |
+ | `meta-llama/Llama-3.3-70B-Instruct` | 70 | Llama 3 |
+ | **Microsoft** | | |
+ | `microsoft/Phi-4-mini-instruct` | 4 | MIT |
+ | `microsoft/phi-4` | 14 | MIT |
+ | **Mistral** | | |
+ | `mistralai/Mistral-Small-24B-Instruct-2501` | 23 | Apache 2.0 |
+ | `mistralai/Mistral-Large-Instruct-2411` | 123 | MRL |
+ | **NVIDIA** | | |
+ | `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` | 70 | Llama 3 |
+ | `nvidia/Llama-3_3-Nemotron-Super-49B-v1` | 49 | NVIDIA Open Model |
+ | `nvidia/Llama-3_1-Nemotron-Ultra-253B-v1` | 253 | NVIDIA Open Model |
+ | **Gemma** | | |
+ | `google/gemma-3-1b-it` | 1 | Gemma |
+ | `google/gemma-3-4b-it` | 4 | Gemma |
+ | `google/gemma-3-12b-it` | 12 | Gemma |
+ | `google/gemma-3-27b-it` | 27 | Gemma |
+ | **AllenAI** | | |
+ | `allenai/OLMo-2-0325-32B-Instruct` | 32 | Apache 2.0 |
+ | `allenai/Llama-3.1-Tulu-3-70B` | 70 | Llama 3 |
+ | `allenai/Llama-3.1-Tulu-3-405B` | 405 | Llama 3 |
+ | **Other** | | |
+ | `HuggingFaceTB/SmolLM2-1.7B-Instruct` | 1.7 | Apache 2.0 |
+ | `moonshotai/Moonlight-16B-A3B-Instruct` | 16 | MIT |
+ | `CohereLabs/c4ai-command-a-03-2025` | 111 | CC BY-NC 4.0 |
+ | `deepseek-ai/DeepSeek-V3` | 671 | DeepSeek |

  ---
 
 
  ### 1. Installation
  Install the package in editable mode:
+ ```bash
  pip install -e .
+ ```

  ### 2. Running the Pipeline
  Run the main dataset generation script:
+ ```bash
  python path/to/main_script.py
+ ```

  ### 3. Configuration (Optional)
  To modify the pipeline parameters and steps, edit the configuration files in the `config/` directory.
 
  ### Option 1: Docker/Podman (Recommended)
  Build the container image:
+ ```bash
  podman build -t activeuf:latest .
+ ```

  ### Option 2: `uv` (For Local Use)
  Create a `uv` environment with all dependencies.
+ ```bash
  # Install uv
  curl -LsSf https://astral.sh/uv/install.sh | sh
  source $HOME/.local/bin/env

  # Sync dependencies
  uv sync --dev
  source .venv/bin/activate
+ ```

  ---
 
 
  ### Pre-commit Hooks
  This project uses `ruff` for linting and formatting.
+ ```bash
  pre-commit install
+ ```

  ### Manual Linting
+ ```bash
  # Format code
  ruff format

  # Lint and auto-fix
  ruff check --fix
+ ```

  ---
 
 
  **Note on Data Usage:**
  While the code and curated datasets in this repository are released under MIT, the datasets contain outputs generated by third-party models (listed above). Users are responsible for adhering to the respective licenses of these source models (e.g., Llama Community License, Apache 2.0, Qwen Research License) when using this data for training or commercial purposes.