Buckets:
210 kB
70 files
Updated 13 days ago
Preference Alignment Data
This example demonstrates preference pair validation and preparation for RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) training, featuring two modern alignment datasets.
Datasets
1. Anthropic/hh-rlhf (Classic RLHF Data)
- Size: ~169K preference pairs
- Structure: Chosen vs. Rejected responses to prompts
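- Use Case: Foundational RLHF dataset released with Anthropic's helpful-and-harmless assistant research and reused in later open work, including Llama 2 reward modeling
- Why It Matters: Established standard; a validation pipeline that handles hh-rlhf cleanly is a sound baseline for other RLHF preference data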
Asset: dpo_training_dataset
2. allenai/ultrafeedback_binarized_cleaned (Modern High-Quality)
- Size: ~64K preference pairs
- Structure: (instruction, chosen, rejected) tuples
- Quality: Binarized + cleaned (higher quality than raw UltraFeedback)
- Annotation: Completions sampled from multiple models and scored by GPT-4, which provides diversity
- Use Case: DPO/ORPO training; better generalization than model-specific data
Asset: ultrafeedback_preference_dataset
Why Both?
| Aspect | hh-rlhf | UltraFeedback |
|---|---|---|
| Maturity | Mature, proven | Newer, cutting-edge |
| Quality | Foundational | High-quality curated |
| Model Specificity | Anthropic-tuned | Model-agnostic |
| Use Case | RLHF baseline | DPO/ORPO alternative |
| Size | 169K pairs | 64K pairs |
| Generalization | Good for Claude-style | Better for diverse LLMs |
Pipeline: Preference Pair Validation
hh-rlhf (raw)                  UltraFeedback (raw)
      ↓                                ↓
validate (is_valid filter)     validate (is_valid filter)
      ↓                                ↓
dpo_training_dataset           ultrafeedback_preference_dataset
(169K → ~160K)                 (64K → ~60K)
      ↓                                ↓
          Ready for DPO/ORPO training
Assets
1. dpo_training_dataset → MaterializeResult (hh-rlhf)
Validation Rules:
- chosen and rejected must be non-null
- Both must be non-empty after stripping whitespace
- chosen ≠ rejected (cannot be identical)
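The three rules above can be sketched as a single predicate. This is a minimal illustration, not the code in assets.py: it assumes chosen and rejected are plain strings, and the toy rows are invented for the example.

```python
from typing import Any, Mapping


def is_valid(example: Mapping[str, Any]) -> bool:
    """Return True if a preference pair passes the three rules listed above."""
    chosen = example.get("chosen")
    rejected = example.get("rejected")
    # Rule 1: both fields must be non-null
    if chosen is None or rejected is None:
        return False
    # Rule 2: both must be non-empty after stripping whitespace
    # (assumption: both fields are strings)
    if not chosen.strip() or not rejected.strip():
        return False
    # Rule 3: the two responses cannot be identical
    return chosen != rejected


# Toy rows, invented for illustration only
rows = [
    {"chosen": "Sure, here is how.", "rejected": "No."},
    {"chosen": "   ", "rejected": "No."},
    {"chosen": "Same", "rejected": "Same"},
    {"chosen": None, "rejected": "No."},
]
kept = [row for row in rows if is_valid(row)]
```

With a Hugging Face Dataset, the same predicate could be passed to Dataset.filter.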
Metadata Output:
{
  "original_rows": 169000,
  "validated_rows": 160500,
  "removed_rows": 8500,
  "dataset": "Anthropic/hh-rlhf",
  "fingerprint": "abcd1234"
}
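As a rough sketch of how the row counts in this metadata relate, the helper below derives removed_rows from the other two counts. The name build_metadata is hypothetical, and the fingerprint field is left out because the source does not say how it is computed.

```python
def build_metadata(dataset: str, original_rows: int, validated_rows: int) -> dict:
    """Assemble row-count metadata for a validated preference dataset.

    Hypothetical helper for illustration; not taken from assets.py.
    """
    if validated_rows > original_rows:
        raise ValueError("validated_rows cannot exceed original_rows")
    return {
        "original_rows": original_rows,
        "validated_rows": validated_rows,
        # Rows dropped by the validation filter
        "removed_rows": original_rows - validated_rows,
        "dataset": dataset,
    }


meta = build_metadata("Anthropic/hh-rlhf", 169000, 160500)
```

In a Dagster asset, a dict like this could be passed as the metadata argument of MaterializeResult.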
2. ultrafeedback_preference_dataset → MaterializeResult
Validation Rules:
- instruction must be non-empty (additional validation vs. hh-rlhf)
- chosen and rejected must be non-null and non-empty
- chosen ≠ rejected
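A minimal sketch of these rules, with the extra instruction check, might look like the following. The function names are hypothetical, and the sketch assumes all three fields are plain strings, which may not match the stored schema.

```python
from typing import Any, Mapping


def _non_empty(value: Any) -> bool:
    """True for a string that still has content after stripping whitespace."""
    return isinstance(value, str) and bool(value.strip())


def is_valid_ultrafeedback(example: Mapping[str, Any]) -> bool:
    """Apply the UltraFeedback rules: instruction, chosen, rejected (hypothetical name)."""
    # Additional rule vs. hh-rlhf: the instruction must be present and non-empty
    if not _non_empty(example.get("instruction")):
        return False
    chosen = example.get("chosen")
    rejected = example.get("rejected")
    # chosen and rejected must be non-null and non-empty
    if not _non_empty(chosen) or not _non_empty(rejected):
        return False
    # The pair must not be identical
    return chosen != rejected


sample = {"instruction": "Explain DPO.", "chosen": "DPO is ...", "rejected": "I don't know."}
sample_ok = is_valid_ultrafeedback(sample)
```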
Metadata Output:
{
  "original_rows": 64000,
  "validated_rows": 61500,
  "removed_rows": 2500,
  "dataset": "allenai/ultrafeedback_binarized_cleaned",
  "quality_note": "High-quality, model-agnostic preference pairs"
}
Patterns Demonstrated
1. Preference Pair Validation
- Validates structure for DPO compatibility
- Ensures chosen and rejected differ, so every pair carries a preference signal (the filter cannot verify that chosen is actually the better response)
- Handles dataset-specific field differences
2. Quality Metrics
- Tracks removed rows (logging data loss)
- Computes retention % (data quality health)
3. Multi-Dataset Support
- Different datasets, same validation pattern
- Shows how to extend to other preference sources
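One way to picture the shared pattern is a small validator factory plus a retention calculation. Both make_validator and retention_pct are hypothetical names used for illustration, and the sketch assumes every checked field is a plain string.

```python
from typing import Any, Callable, Mapping, Sequence

Example = Mapping[str, Any]


def make_validator(extra_fields: Sequence[str] = ()) -> Callable[[Example], bool]:
    """Build a validator that always checks chosen/rejected plus any extra fields."""
    fields = ("chosen", "rejected", *extra_fields)

    def validate(example: Example) -> bool:
        for field in fields:
            value = example.get(field)
            # Each required field must be a non-empty string after stripping
            if not isinstance(value, str) or not value.strip():
                return False
        # The pair must not be identical
        return example["chosen"] != example["rejected"]

    return validate


def retention_pct(original_rows: int, validated_rows: int) -> float:
    """Percentage of rows that survived validation."""
    if original_rows == 0:
        return 0.0
    return 100.0 * validated_rows / original_rows


# Same pattern, two datasets: only the extra fields differ
validate_hh = make_validator()
validate_uf = make_validator(extra_fields=("instruction",))

hh_retention = retention_pct(169000, 160500)
uf_retention = retention_pct(64000, 61500)
```

Extending to another preference source then only needs a new list of extra fields, under the same assumption about field types.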
Running Locally
cd dagster_hf_datasets_examples
dagster dev -m preference_alignment_data.definitions
Materialize both assets:
- dpo_training_dataset (hh-rlhf)
- ultrafeedback_preference_dataset
Compare metadata in the Dagster UI to see quality differences.
Use Cases
Training DPO Models
# After running this example
from datasets import load_from_disk
from trl import DPOTrainer  # requires the trl package

train_data = load_from_disk(".dagster_hf_storage/dpo_training_dataset")
# or
train_data = load_from_disk(".dagster_hf_storage/ultrafeedback_preference_dataset")

# Use with your DPO trainer; model and args must be defined beforehand
# (for example a loaded model and a trl DPOConfig)
trainer = DPOTrainer(model=model, args=args, train_dataset=train_data)
Mixing Datasets
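# Combine both for diverse training
from datasets import concatenate_datasets, load_from_disk

# concatenate_datasets requires matching features, so keep only the shared columns
# (assumes chosen/rejected have the same type in both stored datasets)
shared = ["chosen", "rejected"]
combined = concatenate_datasets([
    load_from_disk(".dagster_hf_storage/dpo_training_dataset").select_columns(shared),
    load_from_disk(".dagster_hf_storage/ultrafeedback_preference_dataset").select_columns(shared),
])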
Customization
Add More Validation Rules
def is_valid_strict(example):
    # Existing checks: delegate to the original is_valid
    # (a separate name avoids the function calling itself recursively)
    if not is_valid(example):
        return False
    # Additional: length constraints
    if len(example["chosen"]) < 10:
        return False
    # Additional: reject low-quality formats
    if "unable to" in example["rejected"].lower():
        return False
    return True
Add Downstream Processing
from dagster import asset
from datasets import Dataset

@asset(group_name="preference_alignment")
def dpo_formatted_pairs(dpo_training_dataset: Dataset) -> Dataset:
    """Convert to OpenAI ChatML format for fine-tuning."""
    def to_chatml(example):
        # Wrap each response in assistant-turn ChatML markers
        return {
            "chosen": f"<|im_start|>assistant\n{example['chosen']}<|im_end|>",
            "rejected": f"<|im_start|>assistant\n{example['rejected']}<|im_end|>",
        }
    return dpo_training_dataset.map(to_chatml)
Add Quality Scoring
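from dagster import asset
from datasets import Dataset

@asset(group_name="preference_alignment")
def preference_quality_scores(ultrafeedback_preference_dataset: Dataset) -> dict:
    """Compute simple length statistics for chosen/rejected pairs."""
    # Length ratio of chosen to rejected; extend with diversity or other metrics as needed
    ratios = [len(ex["chosen"]) / max(len(ex["rejected"]), 1) for ex in ultrafeedback_preference_dataset]
    quality_report = {"num_pairs": len(ratios), "mean_length_ratio": sum(ratios) / max(len(ratios), 1)}
    return quality_report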
References
- hh-rlhf Paper
- UltraFeedback Paper
- DPO: Direct Preference Optimization
- Related Examples:
  - multi_modal_data_profiling/: Vision-language preferences
  - dataset_card_publishing/: Publishing aligned datasets to Hub
Tips
- For RLHF Baselines: Start with dpo_training_dataset (hh-rlhf)
- For ORPO/DPO: Try ultrafeedback_preference_dataset (better diversity)
- For Large-Scale: Combine both + add your own data collection pipeline
- For Evaluation: Use preference pairs to build LLM evaluation sets
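As one possible reading of the evaluation tip, preference pairs can be used to measure how often a scoring function ranks chosen above rejected. The sketch below is an assumption about how such an evaluation set might be used: pairwise_accuracy is a hypothetical helper, and the length-based scorer and toy pairs are placeholders for a real reward model and real data.

```python
from typing import Callable, Mapping, Sequence


def pairwise_accuracy(
    pairs: Sequence[Mapping[str, str]],
    score: Callable[[str], float],
) -> float:
    """Fraction of pairs where the scorer ranks chosen strictly above rejected."""
    if not pairs:
        return 0.0
    wins = sum(1 for pair in pairs if score(pair["chosen"]) > score(pair["rejected"]))
    return wins / len(pairs)


# Toy pairs and a naive length-based scorer, for illustration only
eval_pairs = [
    {"chosen": "A detailed, helpful answer.", "rejected": "No."},
    {"chosen": "Yes.", "rejected": "A long but unhelpful reply."},
]
accuracy = pairwise_accuracy(eval_pairs, score=lambda text: float(len(text)))
```

A scorer that agrees with the human preference on every pair would reach 1.0 on this metric.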