# Preference Alignment Data
This example demonstrates **preference pair validation and preparation** for RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) training, using two widely used alignment datasets: one classic and one more recent.
## Datasets
### 1. Anthropic/hh-rlhf (Classic RLHF Data)
- **Size**: ~169K preference pairs
- **Structure**: Chosen vs. Rejected responses to prompts
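- **Use Case**: Foundational RLHF dataset from Anthropic's helpful-and-harmless assistant research; widely reused as an open baseline for reward modeling and preference tuning
- **Why It Matters**: Established standard; a pipeline that handles hh-rlhf cleanly is a reasonable starting point for other RLHF data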
**Asset**: `dpo_training_dataset`
### 2. allenai/ultrafeedback_binarized_cleaned (Modern High-Quality)
- **Size**: ~64K preference pairs
- **Structure**: (instruction, chosen, rejected) tuples
- **Quality**: Binarized + cleaned (higher quality than raw UltraFeedback)
- **Annotation**: Multi-model annotations provide diversity
- **Use Case**: DPO/ORPO training; better generalization than model-specific data
**Asset**: `ultrafeedback_preference_dataset`
## Why Both?
| Aspect | hh-rlhf | UltraFeedback |
|--------|---------|---------------|
| **Maturity** | Mature, proven | Newer, cutting-edge |
| **Quality** | Foundational | High-quality curated |
| **Model Specificity** | Anthropic-tuned | Model-agnostic |
| **Use Case** | RLHF baseline | DPO/ORPO alternative |
| **Size** | 169K pairs | 64K pairs |
| **Generalization** | Good for Claude-style | Better for diverse LLMs |
## Pipeline: Preference Pair Validation
```
hh-rlhf (raw)                  UltraFeedback (raw)
      ↓                                ↓
validate (is_valid filter)     validate (is_valid filter)
      ↓                                ↓
dpo_training_dataset           ultrafeedback_preference_dataset
(169K → ~160K)                 (64K → ~60K)
      ↓                                ↓
          Ready for DPO/ORPO training
```
## Assets
### 1. `dpo_training_dataset` → `MaterializeResult` (hh-rlhf)
**Validation Rules**:
- `chosen` and `rejected` must be non-null
- Both must be non-empty after stripping whitespace
- `chosen ≠ rejected` (cannot be identical)
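
The rules above could be expressed as a single predicate. This is a minimal sketch rather than the example's actual code: the name `is_valid` is taken from the pipeline diagram, but the body is illustrative, and it assumes `chosen` and `rejected` are plain strings and that the equality check is done on whitespace-stripped text.

```python
def is_valid(example: dict) -> bool:
    """Return True if a preference pair passes the three rules listed above."""
    chosen = example.get("chosen")
    rejected = example.get("rejected")
    # Rule 1: both fields must be present (non-null).
    if chosen is None or rejected is None:
        return False
    # Rule 2: both must be non-empty after stripping whitespace.
    chosen, rejected = chosen.strip(), rejected.strip()
    if not chosen or not rejected:
        return False
    # Rule 3: identical responses carry no preference signal.
    # (Comparing the stripped text is an assumption of this sketch.)
    return chosen != rejected


rows = [
    {"chosen": "Sure, here is how.", "rejected": "No."},
    {"chosen": "   ", "rejected": "No."},
    {"chosen": "Same", "rejected": "Same"},
    {"chosen": None, "rejected": "No."},
]
kept = [row for row in rows if is_valid(row)]
print(len(kept))
```

With a Hugging Face `Dataset`, the same predicate can be passed to `dataset.filter(is_valid)`.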
**Metadata Output** (example values):
```json
{
  "original_rows": 169000,
  "validated_rows": 160500,
  "removed_rows": 8500,
  "dataset": "Anthropic/hh-rlhf",
  "fingerprint": "abcd1234"
}
```
### 2. `ultrafeedback_preference_dataset` → `MaterializeResult`
**Validation Rules**:
- `instruction` must be non-empty (additional validation vs. hh-rlhf)
- `chosen` and `rejected` must be non-null and non-empty
- `chosen ≠ rejected`
**Metadata Output** (example values):
```json
{
  "original_rows": 64000,
  "validated_rows": 61500,
  "removed_rows": 2500,
  "dataset": "allenai/ultrafeedback_binarized_cleaned",
  "quality_note": "High-quality, model-agnostic preference pairs"
}
```
## Patterns Demonstrated
### 1. **Preference Pair Validation**
- Validates structure for DPO compatibility
- Ensures `chosen` and `rejected` differ, so every pair carries a usable preference signal (the validator does not judge which response is better)
- Handles dataset-specific field differences
### 2. **Quality Metrics**
- Tracks removed rows (logging data loss)
- Computes retention % (data quality health)
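
The quality metrics above amount to simple arithmetic over row counts. The sketch below is one way to compute them. The helper `retention_report` and the `retention_pct` key are hypothetical names for illustration and do not appear in the example's metadata output.

```python
def retention_report(original_rows: int, validated_rows: int) -> dict:
    """Summarize how many rows validation removed and what share survived."""
    removed_rows = original_rows - validated_rows
    # Guard against an empty input dataset to avoid division by zero.
    retention_pct = 100.0 * validated_rows / original_rows if original_rows else 0.0
    return {
        "original_rows": original_rows,
        "validated_rows": validated_rows,
        "removed_rows": removed_rows,
        "retention_pct": round(retention_pct, 2),  # hypothetical extra key
    }


# Row counts taken from the hh-rlhf metadata example in this README.
report = retention_report(original_rows=169000, validated_rows=160500)
print(report)
```

A sharp drop in retention between runs is usually a sign that the upstream data or the validation rules changed.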
### 3. **Multi-Dataset Support**
- Different datasets, same validation pattern
- Shows how to extend to other preference sources
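
One way to reuse the same validation pattern across datasets is a small factory that adds dataset-specific required fields on top of the shared `chosen`/`rejected` checks. This is a sketch under assumptions: `make_validator` and `extra_required` are hypothetical names, and all checked fields are assumed to be plain strings.

```python
from typing import Callable, Sequence


def make_validator(extra_required: Sequence[str] = ()) -> Callable[[dict], bool]:
    """Build a predicate for preference pairs with optional extra required fields."""

    def _non_empty(value) -> bool:
        # A field counts as present only if it is a string with visible content.
        return isinstance(value, str) and bool(value.strip())

    def is_valid(example: dict) -> bool:
        fields = (*extra_required, "chosen", "rejected")
        if not all(_non_empty(example.get(name)) for name in fields):
            return False
        # Identical responses carry no preference signal.
        return example["chosen"].strip() != example["rejected"].strip()

    return is_valid


# hh-rlhf needs only the shared checks; UltraFeedback also requires `instruction`.
hh_validator = make_validator()
ultrafeedback_validator = make_validator(extra_required=("instruction",))
```

Supporting another preference source then means choosing its required fields, not writing a new validator.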
## Running Locally
```bash
cd dagster_hf_datasets_examples
dagster dev -m preference_alignment_data.definitions
```
Materialize both assets:
1. `dpo_training_dataset` (hh-rlhf)
2. `ultrafeedback_preference_dataset`
Compare the metadata in the Dagster UI to see how retention differs between the two datasets.
## Use Cases
### Training DPO Models
```python
# After running this example
from datasets import load_from_disk
from trl import DPOTrainer  # assumes the TRL library is installed

train_data = load_from_disk(".dagster_hf_storage/dpo_training_dataset")
# or
train_data = load_from_disk(".dagster_hf_storage/ultrafeedback_preference_dataset")

# Use with your DPO trainer (`model` and `args` come from your own training script)
trainer = DPOTrainer(model=model, args=args, train_dataset=train_data)
```
### Mixing Datasets
```python
# Combine both for diverse training
from datasets import concatenate_datasets, load_from_disk

# Note: concatenate_datasets requires both datasets to share the same columns and feature types
combined = concatenate_datasets([
    load_from_disk(".dagster_hf_storage/dpo_training_dataset"),
    load_from_disk(".dagster_hf_storage/ultrafeedback_preference_dataset"),
])
```
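
Before mixing, the two datasets need matching columns. On the Hub, hh-rlhf stores each of `chosen` and `rejected` as a full transcript with `\n\nHuman:` and `\n\nAssistant:` turns, while UltraFeedback keeps the instruction in its own field. Assuming the stored hh-rlhf rows keep that transcript format, the sketch below splits off the shared prompt. `split_dialogue` is a hypothetical helper, and the output column names are illustrative, so they may need renaming to match the UltraFeedback asset.

```python
ASSISTANT_TAG = "\n\nAssistant:"


def split_dialogue(example: dict) -> dict:
    """Split hh-rlhf style transcripts into prompt / chosen / rejected columns."""
    chosen, rejected = example["chosen"], example["rejected"]
    # The prompt is everything up to and including the last assistant tag.
    cut = chosen.rfind(ASSISTANT_TAG)
    if cut == -1:
        raise ValueError("no assistant turn found in 'chosen'")
    cut += len(ASSISTANT_TAG)
    prompt = chosen[:cut]
    # Both transcripts are expected to share the same conversation prefix.
    if not rejected.startswith(prompt):
        raise ValueError("'chosen' and 'rejected' do not share a prompt")
    return {
        "prompt": prompt,
        "chosen": chosen[cut:].strip(),
        "rejected": rejected[cut:].strip(),
    }


row = {
    "chosen": "\n\nHuman: Hi\n\nAssistant: Hello! How can I help?",
    "rejected": "\n\nHuman: Hi\n\nAssistant: Go away.",
}
pair = split_dialogue(row)
```

Rows that raise here are worth inspecting by hand before deciding whether to drop them.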
## Customization
### Add More Validation Rules
```python
def is_valid_strict(example):
    # Existing checks: delegate to the base validator (renamed here to avoid infinite recursion)
    if not is_valid(example):
        return False
    # Additional: length constraints
    if len(example["chosen"]) < 10:
        return False
    # Additional: reject low-quality formats
    if "unable to" in example["rejected"].lower():
        return False
    return True
```
### Add Downstream Processing
```python
from dagster import asset
from datasets import Dataset


@asset(group_name="preference_alignment")
def dpo_formatted_pairs(dpo_training_dataset: Dataset) -> Dataset:
    """Convert to OpenAI ChatML format for fine-tuning."""

    def to_chatml(example):
        # Wrap each response in ChatML assistant-turn markers
        return {
            "chosen": f"<|im_start|>assistant\n{example['chosen']}<|im_end|>",
            "rejected": f"<|im_start|>assistant\n{example['rejected']}<|im_end|>",
        }

    return dpo_training_dataset.map(to_chatml)
```
### Add Quality Scoring
```python
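from dagster import asset
from datasets import Dataset

@asset(group_name="preference_alignment")
def preference_quality_scores(ultrafeedback_preference_dataset: Dataset) -> dict:
    """Compute simple length statistics for chosen/rejected pairs."""
    ds = ultrafeedback_preference_dataset
    # Length ratio per pair (assumes `chosen` and `rejected` are plain strings)
    ratios = [len(c) / max(len(r), 1) for c, r in zip(ds["chosen"], ds["rejected"])]
    return {"rows": len(ds), "mean_length_ratio": sum(ratios) / max(len(ratios), 1)}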
```
## References
- [hh-rlhf Dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf)
- [UltraFeedback Paper](https://arxiv.org/abs/2310.01377)
- [DPO: Direct Preference Optimization](https://arxiv.org/abs/2305.18290)
- [Related Examples](../):
  - `multi_modal_data_profiling/`: Vision-language preferences
  - `dataset_card_publishing/`: Publishing aligned datasets to Hub
## Tips
- **For RLHF Baselines**: Start with `dpo_training_dataset` (hh-rlhf)
- **For ORPO/DPO**: Try `ultrafeedback_preference_dataset` (better diversity)
- **For Large-Scale**: Combine both + add your own data collection pipeline
- **For Evaluation**: Use preference pairs to build LLM evaluation sets
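
For the evaluation tip, one common check is pairwise accuracy: how often a scoring function ranks the chosen response above the rejected one. The sketch below is illustrative only. `preference_accuracy` is a hypothetical helper, and the length-based scorer is a toy stand-in for a reward model or LLM judge.

```python
from typing import Callable, Iterable


def preference_accuracy(pairs: Iterable[dict], score: Callable[[str], float]) -> float:
    """Fraction of pairs where `score` ranks the chosen response above the rejected one."""
    total = wins = 0
    for pair in pairs:
        total += 1
        if score(pair["chosen"]) > score(pair["rejected"]):
            wins += 1
    # An empty evaluation set yields 0.0 instead of a division error.
    return wins / total if total else 0.0


# Toy scorer for demonstration only: longer responses score higher.
pairs = [
    {"chosen": "A detailed, helpful answer.", "rejected": "No."},
    {"chosen": "Ok.", "rejected": "A long but unhelpful reply."},
]
accuracy = preference_accuracy(pairs, score=len)
print(accuracy)
```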
