# Tutorial 06: Reinforcement Learning from Human Feedback (RLHF)
## Overview
Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences and values. This tutorial covers the complete RLHF pipeline using the native `RewardModel`, `PreferenceDataset`, and `RLHFPipeline` utilities of the **Nexuss Transformer Framework (NTF)**.
### What is RLHF?
RLHF is a three-stage process:
1. **Supervised Fine-Tuning (SFT)**: Train on high-quality demonstrations
2. **Reward Modeling**: Learn human preferences from comparisons
3. **Reinforcement Learning**: Optimize policy using learned reward
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Pre-trained   │────▶│    SFT Model     │────▶│    RL Policy    │
│      Model      │     │  (Instruction)   │     │      (PPO)      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                        ┌──────────────────┐     ┌─────────────────┐
                        │   Reward Model   │◀────│   Generate &    │
                        │  (Preferences)   │     │      Score      │
                        └──────────────────┘     └─────────────────┘
```
### When to Use RLHF
| Scenario | Recommendation |
|----------|---------------|
| Need alignment with human values | ✅ RLHF |
| Reduce harmful outputs | ✅ RLHF |
| Improve helpfulness/honesty | ✅ RLHF |
| Simple task fine-tuning | ❌ Use SFT only |
| Limited compute budget | ❌ Use DPO/SimPO |
| Need fast iteration | ❌ Use DPO (simpler) |
---
## Section 1: The NTF RLHF Pipeline
### RLHF Workflow
1. **Supervised Fine-Tuning (SFT)**: Train on instruction-following data
2. **Reward Modeling**: Train reward model on human preference data
3. **RL Optimization**: Use PPO to optimize policy against reward model
4. **Evaluation**: Assess alignment with human preferences
NTF provides native components for each stage, ensuring consistency and reproducibility.
### Complete RLHF Example with NTF
```python
from ntf.config import RewardConfig
from ntf.models import ModelRegistry
from ntf.reward import RewardModel, PreferenceDataset, RLHFPipeline
from ntf.reward.trainer import RewardTrainer

# 1. Load base model (`model_config` is assumed to be defined elsewhere)
registry = ModelRegistry(model_config)
base_model, tokenizer = registry.load_model_and_tokenizer()

# 2. Initialize NTF's RewardModel
reward_config = RewardConfig(
    base_model_name="meta-llama/Llama-2-7b-hf",
    num_labels=1,
    pad_token_id=tokenizer.pad_token_id
)
reward_model = RewardModel(reward_config)
reward_model.load_base_model(base_model)

# 3. Load preference data with NTF utilities
pref_dataset = PreferenceDataset(
    data_path="preferences.jsonl",
    tokenizer=tokenizer,
    max_length=512
)

# 4. Train reward model
reward_trainer = RewardTrainer(
    model=reward_model,
    dataset=pref_dataset,
    config=reward_config
)
reward_trainer.train()

# 5. Use in RLHF pipeline
# (`policy_model`, `ref_model`, and `prompts` are assumed to be prepared beforehand)
pipeline = RLHFPipeline(
    policy_model=policy_model,
    reward_model=reward_model,
    reference_model=ref_model,
    tokenizer=tokenizer
)
pipeline.run_ppo(
    prompts=prompts,
    num_iterations=100,
    kl_coeff=0.2
)
```
---
## Section 2: Collecting Preference Data
### Preference Data Format
Each line of `data/preference_data.jsonl` holds one complete JSON object. JSON Lines allows neither comments nor records that span several lines:
```jsonl
{"prompt": "Write a poem about nature", "chosen": "Nature's beauty unfolds each day...", "rejected": "The natural world is nice...", "annotator_id": "worker_001", "rating_confidence": 0.9}
{"prompt": "Explain quantum physics simply", "chosen": "Imagine tiny particles that can be in multiple states...", "rejected": "Quantum physics is the study of matter and energy...", "annotator_id": "worker_002", "rating_confidence": 0.85}
```
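Before training on a preference file, it can help to reject malformed records early. The following is a minimal sketch, not part of NTF: the helper names `validate_preference_record` and `parse_preference_lines` are hypothetical, and the check that `rating_confidence` lies in [0, 1] is an assumption inferred from the sample values above.

```python
import json

# Field names taken from the preference format shown above
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")


def validate_preference_record(record):
    """Return a list of problems found in one preference record (empty if valid)."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    confidence = record.get("rating_confidence")
    # Assumption: confidence, when present, is a number in [0, 1]
    if confidence is not None and not (
        isinstance(confidence, (int, float)) and 0.0 <= confidence <= 1.0
    ):
        problems.append("rating_confidence outside [0, 1]")
    return problems


def parse_preference_lines(lines):
    """Parse JSONL lines into (valid_records, rejected).

    `rejected` is a list of (line_number, problems) tuples.
    """
    valid, rejected = [], []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            rejected.append((line_no, [f"invalid JSON: {exc.msg}"]))
            continue
        problems = validate_preference_record(record)
        if problems:
            rejected.append((line_no, problems))
        else:
            valid.append(record)
    return valid, rejected
```

Passing an open file handle as `lines` works because file objects iterate line by line.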
### Best Practices for Data Collection
**Quality Guidelines:**
- Clear preference criteria (helpfulness, honesty, harmlessness)
- Multiple annotators per sample (3-5 recommended)
- Resolve disagreements through adjudication
- Include diverse prompt types
**Quantity Requirements:**
- Minimum: 1,000 preference pairs
- Recommended: 10,000-50,000 pairs
- Prioritize quality over quantity
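The guideline of several annotators per sample implies an aggregation step before training. One possible approach is a majority vote that routes low-agreement items to adjudication, sketched below. The input keys (`response_a`, `response_b`, `preferred`), the `min_agreement` threshold, and the reuse of the agreement ratio as `rating_confidence` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def aggregate_annotations(annotations, min_agreement=2 / 3):
    """Majority-vote several annotators' labels into one preference pair per item.

    `annotations` is a list of dicts with hypothetical keys:
    'prompt', 'response_a', 'response_b', and 'preferred' ('a' or 'b').
    Items whose majority share is below `min_agreement` are returned
    separately so they can be sent to adjudication.
    """
    votes = defaultdict(list)
    for ann in annotations:
        key = (ann["prompt"], ann["response_a"], ann["response_b"])
        votes[key].append(ann["preferred"])

    resolved, needs_adjudication = [], []
    for (prompt, resp_a, resp_b), labels in votes.items():
        winner, count = Counter(labels).most_common(1)[0]
        agreement = count / len(labels)
        if agreement < min_agreement:
            needs_adjudication.append({"prompt": prompt, "votes": labels})
            continue
        chosen, rejected = (resp_a, resp_b) if winner == "a" else (resp_b, resp_a)
        resolved.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            # Assumption: the share of annotators who agreed serves as confidence
            "rating_confidence": round(agreement, 2),
        })
    return resolved, needs_adjudication
```

With three annotators and the default threshold, a 2-to-1 vote is accepted, while a 1-to-1 split between two annotators goes to adjudication.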
### Data Augmentation for Preferences
```python
# scripts/augment_preferences.py
import json
import random

from datasets import load_dataset


def paraphrase(text):
    """Simple paraphrasing (use an LLM for better quality)."""
    # In practice, use an LLM to generate paraphrases
    synonyms = {
        'explain': 'describe',
        'write': 'compose',
        'what': 'which',
        'how': 'in what way'
    }
    words = text.split()
    paraphrased = [synonyms.get(word.lower(), word) for word in words]
    return ' '.join(paraphrased)


def augment_preference_data(dataset, augmentation_factor=3):
    """Augment preference data with variations.

    `augmentation_factor` caps the records emitted per input sample,
    counting the original.
    """
    augmented = []
    for sample in dataset:
        # Original
        variants = [sample]

        # Note: never swap 'chosen' and 'rejected'. The pairwise loss reads the
        # preference from these two keys, so a swapped copy would teach the
        # reward model the opposite of what the annotator said.

        # Variation: paraphrase the prompt, keep the preference unchanged
        if random.random() < 0.2:
            paraphrased_prompt = paraphrase(sample['prompt'])
            if paraphrased_prompt != sample['prompt']:
                variants.append({
                    'prompt': paraphrased_prompt,
                    'chosen': sample['chosen'],
                    'rejected': sample['rejected'],
                    'annotator_id': f"{sample['annotator_id']}_paraphrase"
                })

        # Add more variations as needed...
        augmented.extend(variants[:augmentation_factor])
    return augmented


if __name__ == "__main__":
    raw_data = load_dataset('json', data_files='data/raw_preferences.jsonl', split='train')
    augmented_data = augment_preference_data(list(raw_data), augmentation_factor=2)

    # Save
    with open('data/augmented_preferences.jsonl', 'w') as f:
        for sample in augmented_data:
            f.write(json.dumps(sample) + '\n')
```
---
## Section 2: Building the Reward Model
### Reward Model Architecture
The reward model takes a prompt-response pair and outputs a scalar score:
```python
# src/reward_model.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification


class RewardModel(nn.Module):
    def __init__(self, base_model_name, num_labels=1, pad_token_id=None):
        super().__init__()
        # Load base model with a classification head (one logit = scalar reward)
        self.base_model = AutoModelForSequenceClassification.from_pretrained(
            base_model_name,
            num_labels=num_labels
        )
        # Decoder-only classifiers need a pad token id to locate the last
        # real token when inputs are padded
        if pad_token_id is not None:
            self.base_model.config.pad_token_id = pad_token_id
        # Get hidden size for custom head if needed
        self.hidden_size = self.base_model.config.hidden_size

    def forward(self, input_ids, attention_mask=None, labels=None):
        """
        Forward pass returning reward scores

        Args:
            input_ids: Token IDs [batch_size, seq_len]
            attention_mask: Attention mask [batch_size, seq_len]
            labels: Optional reward labels for training

        Returns:
            rewards: Scalar reward for each sample [batch_size]
        """
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        rewards = outputs.logits.squeeze(-1)  # [batch_size]
        return {'rewards': rewards, 'loss': outputs.loss}

    def compute_reward(self, input_ids, attention_mask=None):
        """Compute reward for generated sequences"""
        with torch.no_grad():
            outputs = self.forward(input_ids, attention_mask)
        return outputs['rewards']
```
### Training the Reward Model
```yaml
# configs/reward_model.yaml
model:
  base_model: "meta-llama/Llama-2-7b-hf"
  model_type: "reward_model"

data:
  preference_file: "data/preference_pairs.jsonl"
  max_seq_length: 512
  validation_split: 0.1

training:
  learning_rate: 1.0e-5
  per_device_train_batch_size: 16
  gradient_accumulation_steps: 4
  num_train_epochs: 3

  # Reward model specific
  loss_type: "pairwise_ranking"  # pairwise or regression
  margin: 0.1  # For pairwise ranking

  # Regularization
  weight_decay: 0.01
  max_grad_norm: 1.0

  fp16: true
  gradient_checkpointing: true

output:
  output_dir: "outputs/reward_model_v1"
  run_name: "llama2-7b-reward-v1"
```
### Pairwise Ranking Loss Implementation
```python
# src/reward_trainer.py
import torch
import torch.nn.functional as F
from transformers import Trainer


class RewardTrainer(Trainer):
    def __init__(self, margin=0.1, **kwargs):
        super().__init__(**kwargs)
        self.margin = margin

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        """
        Pairwise ranking loss for reward model
        L = -log(sigmoid(reward_chosen - reward_rejected - margin))

        **kwargs absorbs extra arguments that some transformers releases pass.
        """
        # Prepare chosen and rejected inputs
        chosen_inputs = {
            'input_ids': inputs['chosen_input_ids'],
            'attention_mask': inputs['chosen_attention_mask']
        }
        rejected_inputs = {
            'input_ids': inputs['rejected_input_ids'],
            'attention_mask': inputs['rejected_attention_mask']
        }

        # Get rewards for both
        chosen_outputs = model(**chosen_inputs)
        rejected_outputs = model(**rejected_inputs)
        chosen_rewards = chosen_outputs['rewards']      # [batch_size]
        rejected_rewards = rejected_outputs['rewards']  # [batch_size]

        # Pairwise ranking loss
        # We want chosen_rewards > rejected_rewards
        diff = chosen_rewards - rejected_rewards
        loss = -F.logsigmoid(diff - self.margin).mean()

        # Calculate accuracy metric (detached: for reporting only)
        accuracy = (diff > 0).float().mean()
        metrics = {
            'loss': loss.detach(),
            'accuracy': accuracy.detach(),
            'chosen_reward_mean': chosen_rewards.mean().detach(),
            'rejected_reward_mean': rejected_rewards.mean().detach(),
            'reward_margin': diff.mean().detach()
        }
        return (loss, metrics) if return_outputs else loss

    def log(self, logs, *args, **kwargs):
        """Enhanced logging for reward training"""
        if 'accuracy' in logs:
            print(f"Step {self.state.global_step}: "
                  f"Accuracy: {logs['accuracy']:.3f}, "
                  f"Loss: {logs.get('loss', float('nan')):.4f}")
        super().log(logs, *args, **kwargs)


# Training script
def train_reward_model():
    from datasets import load_dataset
    from transformers import AutoTokenizer, TrainingArguments

    from src.reward_model import RewardModel

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Prepare dataset. With batched=True each field arrives as a list,
    # so prompt and response are joined pair by pair.
    def preprocess(batch):
        # Tokenize chosen
        chosen = tokenizer(
            [p + c for p, c in zip(batch['prompt'], batch['chosen'])],
            truncation=True,
            max_length=512,
            padding='max_length'
        )
        # Tokenize rejected
        rejected = tokenizer(
            [p + r for p, r in zip(batch['prompt'], batch['rejected'])],
            truncation=True,
            max_length=512,
            padding='max_length'
        )
        return {
            'chosen_input_ids': chosen['input_ids'],
            'chosen_attention_mask': chosen['attention_mask'],
            'rejected_input_ids': rejected['input_ids'],
            'rejected_attention_mask': rejected['attention_mask']
        }

    dataset = load_dataset('json', data_files='data/preference_pairs.jsonl', split='train')
    tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

    # Split
    train_test = tokenized.train_test_split(test_size=0.1)

    # Initialize model
    model = RewardModel("meta-llama/Llama-2-7b-hf")
    model.base_model.config.pad_token_id = tokenizer.pad_token_id
    # Enable checkpointing on the wrapped model; the nn.Module wrapper
    # itself has no gradient_checkpointing_enable()
    model.base_model.gradient_checkpointing_enable()

    # Training arguments
    training_args = TrainingArguments(
        output_dir="outputs/reward_model_v1",
        per_device_train_batch_size=16,
        gradient_accumulation_steps=4,
        learning_rate=1e-5,
        num_train_epochs=3,
        evaluation_strategy="steps",  # named `eval_strategy` in newer transformers releases
        eval_steps=100,
        save_steps=500,
        logging_steps=50,
        fp16=True,
        # Keep the chosen_*/rejected_* columns; they are not forward() arguments
        remove_unused_columns=False,
    )

    # Train
    trainer = RewardTrainer(
        model=model,
        args=training_args,
        train_dataset=train_test['train'],
        eval_dataset=train_test['test']
    )
    trainer.train()

    # Save the wrapped Hugging Face model so the directory can be reloaded
    # with RewardModel("outputs/reward_model_v1")
    model.base_model.save_pretrained("outputs/reward_model_v1")
    tokenizer.save_pretrained("outputs/reward_model_v1")


if __name__ == "__main__":
    train_reward_model()
```
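To build intuition for the pairwise ranking loss `L = -log(sigmoid(reward_chosen - reward_rejected - margin))`, the sketch below evaluates it for single reward pairs using only the standard library. The function name `pairwise_ranking_loss` is illustrative and is not part of NTF or `transformers`.

```python
import math


def pairwise_ranking_loss(chosen_reward, rejected_reward, margin=0.0):
    """-log(sigmoid(chosen - rejected - margin)) in a numerically stable form."""
    x = chosen_reward - rejected_reward - margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch to avoid overflow
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


if __name__ == "__main__":
    for chosen, rejected in [(2.0, 0.0), (0.0, 0.0), (0.0, 2.0)]:
        loss = pairwise_ranking_loss(chosen, rejected)
        print(f"chosen={chosen:+.1f} rejected={rejected:+.1f} loss={loss:.4f}")
```

When the two rewards are equal and the margin is zero, the loss is `log(2)`. It shrinks toward zero as the chosen reward pulls ahead and grows roughly linearly when the ordering is wrong. A positive margin raises the loss for any given gap, which pushes the model to separate the pair by more than the margin.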
### Evaluating Reward Model
```python
# scripts/evaluate_reward_model.py
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

from src.reward_model import RewardModel


def evaluate_reward_model(model_path, test_examples):
    """Evaluate reward model on held-out test set"""
    # RewardModel wraps AutoModelForSequenceClassification, so `model_path`
    # must be a directory that its `from_pretrained` can load
    model = RewardModel(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model.eval()

    correct = 0
    total = 0
    for example in test_examples:
        # Tokenize chosen and rejected
        chosen_text = example['prompt'] + example['chosen']
        rejected_text = example['prompt'] + example['rejected']
        chosen_inputs = tokenizer(chosen_text, return_tensors='pt', truncation=True, max_length=512)
        rejected_inputs = tokenizer(rejected_text, return_tensors='pt', truncation=True, max_length=512)

        # Get rewards
        with torch.no_grad():
            chosen_reward = model.compute_reward(**chosen_inputs).item()
            rejected_reward = model.compute_reward(**rejected_inputs).item()

        # Check if model correctly prefers chosen
        if chosen_reward > rejected_reward:
            correct += 1
        total += 1

    accuracy = correct / total if total else 0.0
    print(f"Reward Model Accuracy: {accuracy:.3f} ({correct}/{total})")
    return accuracy


# Example usage
test_data = load_dataset('json', data_files='data/test_preferences.jsonl', split='train')
accuracy = evaluate_reward_model("outputs/reward_model_v1", list(test_data))
```
---
## Section 3: PPO Training Loop
### Understanding PPO for Language Models
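Proximal Policy Optimization (PPO) updates the policy to maximize reward while limiting how far each update moves away from the policy that generated the rollouts. In RLHF, a separate KL penalty additionally keeps the policy close to the frozen reference (SFT) model. The clipped objective below is maximized: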
**Objective:**
```
L_PPO = E[min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t)]
```
Where:
- `r_t = π_new(a|s) / π_old(a|s)` (probability ratio)
- `A_t` = Advantage estimate
- `ε` = clipping parameter (typically 0.2)
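The effect of clipping is easiest to see on single samples. The sketch below evaluates the clipped surrogate for one action given its old and new log-probabilities. The function name `clipped_surrogate` is illustrative and is not an NTF API.

```python
import math


def clipped_surrogate(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    old = math.log(0.10)
    for new_prob in (0.05, 0.10, 0.15):
        for advantage in (1.0, -1.0):
            value = clipped_surrogate(math.log(new_prob), old, advantage)
            print(f"ratio={new_prob / 0.10:.2f} A={advantage:+.1f} objective={value:+.3f}")
```

With a positive advantage, raising the ratio beyond `1 + ε` earns no further credit, so the update has no incentive to move the policy further in that direction. With a negative advantage, the unclipped term is kept whenever it is the more pessimistic of the two, so moves that make a bad action more likely are still fully penalized.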
### PPO Configuration
```yaml
# configs/ppo_training.yaml
policy:
  model: "outputs/sft_model_v1"      # Start from SFT model
  ref_model: "outputs/sft_model_v1"  # Reference model (frozen)

reward:
  model: "outputs/reward_model_v1"
  kl_coeff: 0.1  # KL penalty coefficient

data:
  prompts_file: "data/ppo_prompts.jsonl"
  max_prompt_length: 256
  max_response_length: 256

training:
  num_iterations: 100  # Read by scripts/run_rlhf_pipeline.py (assumed value)

  # PPO-specific
  ppo_epochs: 4
  mini_batch_size: 4
  ppo_clip_range: 0.2
  vf_clip_range: 0.2
  gamma: 1.0
  lam: 0.95  # GAE lambda

  # Generation
  num_rollouts: 128
  temperature: 1.0
  top_k: 0
  top_p: 0.9

  # Optimization
  learning_rate: 1.0e-6
  gradient_accumulation_steps: 4
  max_grad_norm: 1.0  # Read by the PPO update step (assumed value)

  # KL control
  init_kl_coeff: 0.1
  target_kl: 6.0  # Adaptive KL target

  fp16: true

output:
  output_dir: "outputs/rlhf_policy_v1"
  run_name: "llama2-7b-rlhf-ppo"
```
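The `gamma` and `lam` settings feed generalized advantage estimation (GAE). The tutorial's trainer calls a GAE helper without showing it, so here is a minimal, framework-free sketch under stated assumptions: the function name `compute_gae` and its list-based signature are hypothetical, and `values` is assumed to come from a value head that is not covered here.

```python
def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized advantage estimation over one episode.

    `values` holds one estimate per step plus a final bootstrap value
    (use 0.0 when the episode has ended), so len(values) == len(rewards) + 1.
    Returns (advantages, returns), each with one entry per step.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum of later steps
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [adv + value for adv, value in zip(advantages, values[:-1])]
    return advantages, returns


if __name__ == "__main__":
    # A three-token response whose only reward arrives on the final token
    advantages, returns = compute_gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
    print(advantages)
```

In this toy episode the reward sits on the last step, and `lam` controls how strongly that credit decays as it propagates back to earlier tokens. Setting `lam=1.0` with `gamma=1.0` gives every step the full remaining return minus its value estimate.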
### PPO Trainer Implementation
```python
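# src/ppo_trainer.py
import torch
import torch.nn.functional as F
from typing import Dict, List, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer


class PPOTrainer:
    """PPO loop skeleton.

    The private helpers it calls (`_extract_log_probs`, `_get_minibatches`,
    `_get_new_log_probs`, `_compute_value_loss`, `_compute_entropy`,
    `_compute_gae`, `_compute_mean_kl`, `_update_kl_coeff`) are not shown
    in this tutorial and must be supplied.
    """

    def __init__(
        self,
        policy_model,
        ref_model,
        reward_model,
        tokenizer,
        config
    ):
        self.policy = policy_model
        self.ref_model = ref_model
        self.reward_model = reward_model
        self.tokenizer = tokenizer
        self.config = config

        # Freeze reference model
        for param in self.ref_model.parameters():
            param.requires_grad = False

        # KL coefficient
        self.kl_coeff = config.kl_coeff

        # Optimizer over the policy parameters (used in ppo_update)
        self.optimizer = torch.optim.AdamW(
            self.policy.parameters(),
            lr=config.learning_rate
        )
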
    def generate_rollouts(self, prompts, num_samples_per_prompt=1):
        """Generate responses from current policy"""
        all_responses = []
        all_log_probs = []
        for prompt in prompts:
            # Tokenize prompt
            inputs = self.tokenizer(
                prompt,
                return_tensors='pt',
                truncation=True,
                max_length=self.config.max_prompt_length
            ).to(self.policy.device)

            # Generate response
            with torch.no_grad():
                outputs = self.policy.generate(
                    **inputs,
                    max_new_tokens=self.config.max_response_length,
                    do_sample=True,
                    temperature=self.config.temperature,
                    top_p=self.config.top_p,
                    pad_token_id=self.tokenizer.eos_token_id,
                    return_dict_in_generate=True,
                    output_scores=True
                )

            # Extract generated tokens and log probs
            generated_ids = outputs.sequences[0, inputs['input_ids'].shape[1]:]
            log_probs = self._extract_log_probs(outputs.scores, generated_ids)
            all_responses.append(generated_ids)
            all_log_probs.append(log_probs)
        return all_responses, all_log_probs

    def compute_rewards(self, prompts, responses):
        """Compute rewards including KL penalty"""
        rewards = []
        for prompt, response in zip(prompts, responses):
            # Concatenate prompt + response
            full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True)
            inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device)

            # Get reward from reward model
            with torch.no_grad():
                reward = self.reward_model.compute_reward(**inputs).item()

            # Compute KL divergence from reference model
            kl_div = self._compute_kl_divergence(prompt, response)

            # Final reward with KL penalty
            final_reward = reward - self.kl_coeff * kl_div
            rewards.append(final_reward)
        return torch.tensor(rewards)

    def _compute_kl_divergence(self, prompt, response):
        """Compute KL divergence between policy and reference"""
        # Get log probs from both models
        full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True)
        inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device)
        with torch.no_grad():
            policy_logits = self.policy(**inputs).logits
            ref_logits = self.ref_model(**inputs).logits

        # Cast to float32 so the softmax stays stable with fp16 weights
        policy_log_probs = F.log_softmax(policy_logits.float(), dim=-1)
        ref_log_probs = F.log_softmax(ref_logits.float(), dim=-1)

        # KL(policy || reference) = sum_x p(x) * (log p(x) - log q(x)),
        # summed over the vocabulary and averaged over positions
        kl = policy_log_probs.exp() * (policy_log_probs - ref_log_probs)
        kl = kl.sum(dim=-1).mean()
        return kl.item()

    def ppo_update(self, rollouts, advantages):
        """Perform PPO update on collected rollouts"""
        # Assumes `_get_minibatches` slices every entry, including 'advantages',
        # and exposes the rollout log probs as 'old_log_probs'
        rollouts = {**rollouts, 'advantages': advantages}
        total_loss = 0.0
        num_updates = 0
        for epoch in range(self.config.ppo_epochs):
            for minibatch in self._get_minibatches(rollouts, self.config.mini_batch_size):
                # Compute probability ratios
                old_log_probs = minibatch['old_log_probs']
                new_log_probs = self._get_new_log_probs(minibatch)
                mb_advantages = minibatch['advantages']
                ratio = (new_log_probs - old_log_probs).exp()

                # Clipped surrogate objective
                surr1 = ratio * mb_advantages
                surr2 = torch.clamp(ratio, 1 - self.config.ppo_clip_range,
                                    1 + self.config.ppo_clip_range) * mb_advantages
                policy_loss = -torch.min(surr1, surr2).mean()

                # Value function loss
                value_loss = self._compute_value_loss(minibatch)

                # Entropy bonus (encourages exploration)
                entropy = self._compute_entropy(minibatch)

                # Total loss
                loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

                # Backward pass
                self.optimizer.zero_grad()
                loss.backward()
                torch.nn.utils.clip_grad_norm_(
                    self.policy.parameters(),
                    getattr(self.config, 'max_grad_norm', 1.0)
                )

                # Optimizer step
                self.optimizer.step()
                total_loss += loss.item()
                num_updates += 1

        # Average over the number of minibatch updates actually performed
        return total_loss / max(num_updates, 1)

    def train(self, prompts, num_iterations=100):
        """Main PPO training loop"""
        for iteration in range(num_iterations):
            print(f"\n=== Iteration {iteration + 1}/{num_iterations} ===")

            # 1. Generate rollouts
            responses, log_probs = self.generate_rollouts(prompts)

            # 2. Compute rewards
            rewards = self.compute_rewards(prompts, responses)

            # 3. Compute advantages (GAE)
            advantages = self._compute_gae(rewards, log_probs)

            # 4. PPO update
            rollouts = {
                'responses': responses,
                'log_probs': log_probs,
                'rewards': rewards
            }
            loss = self.ppo_update(rollouts, advantages)

            # 5. Logging
            print(f"Iteration {iteration + 1}:")
            print(f"  Mean Reward: {rewards.mean().item():.3f}")
            print(f"  KL Divergence: {self._compute_mean_kl():.4f}")
            print(f"  PPO Loss: {loss:.4f}")

            # 6. Adaptive KL coefficient
            self._update_kl_coeff()

            # 7. Save checkpoint
            if (iteration + 1) % 10 == 0:
                self.save_checkpoint(f"checkpoint_{iteration + 1}")

    def save_checkpoint(self, path):
        """Save model checkpoint"""
        self.policy.save_pretrained(path)
        self.tokenizer.save_pretrained(path)
        print(f"✓ Checkpoint saved to {path}")
```
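The training loop adjusts the KL coefficient each iteration, and the configuration names an `init_kl_coeff` and a `target_kl`, but the update rule itself is not shown. One common choice is the proportional controller described by Ziegler et al. (2019), sketched below. The class name `AdaptiveKLController` and the `horizon` and `max_error` parameters are assumptions for illustration, not the tutorial's own implementation.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target KL."""

    def __init__(self, init_kl_coeff=0.1, target_kl=6.0, horizon=10000, max_error=0.2):
        self.kl_coeff = init_kl_coeff
        self.target_kl = target_kl
        self.horizon = horizon      # larger horizon = slower adaptation
        self.max_error = max_error  # clip so one noisy batch cannot swing the coefficient

    def update(self, observed_kl, n_steps):
        """Scale the coefficient up when KL exceeds the target, down when below."""
        error = observed_kl / self.target_kl - 1.0
        error = max(-self.max_error, min(error, self.max_error))
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff


if __name__ == "__main__":
    controller = AdaptiveKLController(init_kl_coeff=0.1, target_kl=6.0, horizon=100)
    for observed in (12.0, 12.0, 6.0, 3.0):
        coeff = controller.update(observed, n_steps=10)
        print(f"observed KL={observed:5.1f} -> kl_coeff={coeff:.5f}")
```

When the observed KL sits above the target, the penalty grows and pulls the policy back toward the reference model. When it sits below the target, the penalty relaxes so the policy can keep improving its reward.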
### Complete RLHF Pipeline
```python
# scripts/run_rlhf_pipeline.py
import json
from types import SimpleNamespace

import torch
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer

from src.ppo_trainer import PPOTrainer
from src.reward_model import RewardModel


def load_prompts(path):
    """Read a JSONL file and return each record's 'prompt' field (assumed key)."""
    with open(path, 'r') as f:
        return [json.loads(line)['prompt'] for line in f if line.strip()]


def run_complete_rlhf():
    # Load configuration
    with open('configs/ppo_training.yaml', 'r') as f:
        config = yaml.safe_load(f)

    print("=" * 80)
    print("STARTING COMPLETE RLHF PIPELINE")
    print("=" * 80)

    # Step 1: Load SFT model (already trained)
    print("\n[1/4] Loading SFT model...")
    policy_model = AutoModelForCausalLM.from_pretrained(
        config['policy']['model'],
        torch_dtype=torch.float16,
        device_map="auto"
    )
    ref_model = AutoModelForCausalLM.from_pretrained(
        config['policy']['ref_model'],
        torch_dtype=torch.float16,
        device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(config['policy']['model'])
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    print(f"✓ Loaded policy model from {config['policy']['model']}")

    # Step 2: Load reward model
    print("\n[2/4] Loading reward model...")
    # RewardModel is an nn.Module wrapper, so it is built from the saved
    # directory and then cast and moved like any other module
    reward_model = RewardModel(config['reward']['model'])
    reward_model = reward_model.half().to(policy_model.device)
    reward_model.eval()
    print(f"✓ Loaded reward model from {config['reward']['model']}")

    # Step 3: Load prompts for RLHF
    print("\n[3/4] Loading prompts...")
    prompts = load_prompts(config['data']['prompts_file'])
    print(f"✓ Loaded {len(prompts)} prompts")

    # Step 4: Initialize PPO trainer
    print("\n[4/4] Initializing PPO trainer...")
    # PPOTrainer reads settings as attributes, so flatten the relevant sections
    trainer_config = SimpleNamespace(
        **config['training'],
        **config['data'],
        kl_coeff=config['reward']['kl_coeff']
    )
    trainer = PPOTrainer(
        policy_model=policy_model,
        ref_model=ref_model,
        reward_model=reward_model,
        tokenizer=tokenizer,
        config=trainer_config
    )

    # Step 5: Run PPO training
    print("\n" + "=" * 80)
    print("STARTING PPO TRAINING")
    print("=" * 80)
    trainer.train(
        prompts=prompts,
        num_iterations=config['training'].get('num_iterations', 100)
    )

    # Step 6: Save final model
    print("\n" + "=" * 80)
    print("SAVING FINAL MODEL")
    print("=" * 80)
    trainer.save_checkpoint(config['output']['output_dir'])
    print(f"\n✓ RLHF complete! Model saved to {config['output']['output_dir']}")


if __name__ == "__main__":
    run_complete_rlhf()
```
---
## Section 4: Monitoring and Debugging RLHF
### Key Metrics to Track
```python
# src/rlhf_monitor.py
class RLHFMonitor:
    def __init__(self):
        # Keys match the names used in the `metrics` dict passed to log_iteration
        self.metrics_history = {
            'reward': [],
            'kl_divergence': [],
            'policy_loss': [],
            'value_loss': [],
            'entropy': [],
            'generation_length': []
        }

    def log_iteration(self, iteration, metrics):
        """Log metrics for each iteration"""
        for key in self.metrics_history:
            if key in metrics:
                self.metrics_history[key].append({
                    'iteration': iteration,
                    'value': metrics[key]
                })

        # Print summary
        print(f"\nIteration {iteration}:")
        print(f"  Reward: {metrics.get('reward', 0):.3f}")
        print(f"  KL Div: {metrics.get('kl_divergence', 0):.4f}")
        print(f"  Policy Loss: {metrics.get('policy_loss', 0):.4f}")
        print(f"  Entropy: {metrics.get('entropy', 0):.4f}")

    def check_for_issues(self, metrics):
        """Detect common RLHF issues"""
        issues = []

        # KL divergence too high
        if metrics.get('kl_divergence', 0) > 10.0:
            issues.append("⚠️ High KL divergence - policy drifting too far")

        # Reward hacking (high reward but low quality)
        if metrics.get('reward', 0) > 5.0 and metrics.get('kl_divergence', 0) > 5.0:
            issues.append("⚠️ Possible reward hacking detected")

        # Collapsing entropy (only checked when entropy is actually reported)
        if 'entropy' in metrics and metrics['entropy'] < 0.1:
            issues.append("⚠️ Low entropy - policy may be collapsing")

        # Negative rewards
        if metrics.get('reward', 0) < 0:
            issues.append("⚠️ Negative rewards - check reward model")

        return issues


# Usage in training loop (fragment: `num_iterations`, `rewards`, `kl_div`,
# `loss`, and `entropy` come from the surrounding PPO code)
monitor = RLHFMonitor()
for iteration in range(num_iterations):
    # ... training code ...
    metrics = {
        'reward': rewards.mean().item(),
        'kl_divergence': kl_div,
        'policy_loss': loss,
        'entropy': entropy.item()
    }
    monitor.log_iteration(iteration, metrics)
    issues = monitor.check_for_issues(metrics)
    for issue in issues:
        print(issue)
```
### Common RLHF Issues and Solutions
#### Issue 1: KL Divergence Too High
**Symptoms:**
- KL > 10 nats
- Generated text becomes nonsensical
- Reward increases but quality decreases
**Solutions:**
```yaml
# Increase KL penalty
kl_coeff: 0.2 # Increase from 0.1
# Use adaptive KL
target_kl: 6.0
adaptive_kl: true
# Lower learning rate
learning_rate: 5.0e-7 # Reduce from 1e-6
```
#### Issue 2: Reward Hacking
**Symptoms:**
- Rewards increase rapidly
- Generated text exploits reward model quirks
- Human evaluators rate outputs poorly
**Solutions:**
```yaml
# Stronger KL penalty
kl_coeff: 0.5
# Better reward model
# Retrain with more diverse examples
# Add adversarial examples
# Limit generation length
max_response_length: 128 # Prevent verbose exploitation
# Ensemble reward models
# Use multiple reward models and average
```
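The ensemble suggestion can be sketched without any framework by treating each reward model as a callable. In the sketch below, the function name `ensemble_reward` and the optional `disagreement_penalty` are assumptions added for illustration. With the default penalty of `0.0` the result is the plain average described above.

```python
from statistics import mean, pstdev


def ensemble_reward(reward_fns, prompt, response, disagreement_penalty=0.0):
    """Average several reward functions and report how much they disagree.

    Each entry of `reward_fns` is a callable (prompt, response) -> float.
    Returns (score, spread), where spread is the population standard
    deviation of the individual scores.
    """
    scores = [fn(prompt, response) for fn in reward_fns]
    spread = pstdev(scores) if len(scores) > 1 else 0.0
    # Optional extension: discount responses the models disagree on
    score = mean(scores) - disagreement_penalty * spread
    return score, spread


if __name__ == "__main__":
    # Stand-in scorers; in practice these would wrap trained reward models
    scorers = [lambda p, r: 1.0, lambda p, r: 3.0]
    print(ensemble_reward(scorers, "prompt", "response"))
    print(ensemble_reward(scorers, "prompt", "response", disagreement_penalty=0.5))
```

A large spread is itself a useful signal: a response that one reward model loves and another dislikes is a plausible candidate for an exploit of a single model's quirks.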
#### Issue 3: Policy Collapse
**Symptoms:**
- Entropy drops to near zero
- Model generates same response repeatedly
- No diversity in outputs
**Solutions:**
```yaml
# Add entropy bonus
entropy_coefficient: 0.05 # Increase exploration
# Lower KL penalty temporarily
kl_coeff: 0.05
# Restart from earlier checkpoint
# Reduce learning rate
learning_rate: 5.0e-7
```
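To make the entropy symptom concrete, the sketch below computes the Shannon entropy (in nats) of a next-token distribution. The helper name `entropy_nats` and the toy four-token distributions are illustrative only.

```python
import math


def entropy_nats(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]     # healthy: many tokens remain plausible
    peaked = [0.997, 0.001, 0.001, 0.001]  # collapsing: one token dominates
    print(f"uniform: {entropy_nats(uniform):.4f} nats")
    print(f"peaked:  {entropy_nats(peaked):.4f} nats")
```

The peaked distribution falls below the 0.1 threshold that the monitor earlier in this section uses as its low-entropy warning, while the uniform one sits at `log(4)`, the maximum for four outcomes.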
---
## Section 5: Evaluation After RLHF
### Comprehensive Evaluation Suite
```python
# scripts/evaluate_rlhf.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class RLHFEvaluator:
    def __init__(self, model_path, baseline_path=None):
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        if baseline_path:
            self.baseline = AutoModelForCausalLM.from_pretrained(
                baseline_path,
                torch_dtype=torch.float16,
                device_map="auto"
            )
        else:
            self.baseline = None

    def evaluate_helpfulness(self, test_prompts):
        """Evaluate helpfulness on standard benchmarks"""
        results = []
        for prompt in test_prompts:
            response = self.generate(prompt, max_new_tokens=256)
            # Rate helpfulness (in practice, use human eval or LLM judge)
            helpfulness_score = self._rate_helpfulness(prompt, response)
            results.append({
                'prompt': prompt,
                'response': response,
                'helpfulness': helpfulness_score
            })
        avg_helpfulness = sum(r['helpfulness'] for r in results) / max(len(results), 1)
        return {'avg_helpfulness': avg_helpfulness, 'samples': results}

    def evaluate_safety(self, safety_prompts):
        """Evaluate safety on adversarial prompts"""
        safe_count = 0
        for prompt in safety_prompts:
            response = self.generate(prompt, max_new_tokens=256)
            # Check for harmful content
            is_safe = self._check_safety(response)
            if is_safe:
                safe_count += 1
        safety_rate = safe_count / max(len(safety_prompts), 1)
        return {'safety_rate': safety_rate}

    def compare_with_baseline(self, comparison_prompts):
        """Compare RLHF model with baseline"""
        if self.baseline is None:
            # Without this check, generate() would silently fall back to the
            # RLHF model and compare it against itself
            raise ValueError("compare_with_baseline requires baseline_path")
        wins = 0
        losses = 0
        ties = 0
        for prompt in comparison_prompts:
            rlhf_response = self.generate(prompt)
            baseline_response = self.generate(prompt, model=self.baseline)
            # Compare (use human eval or LLM judge)
            result = self._compare_responses(rlhf_response, baseline_response)
            if result == 'rlhf_wins':
                wins += 1
            elif result == 'baseline_wins':
                losses += 1
            else:
                ties += 1
        return {
            'wins': wins,
            'losses': losses,
            'ties': ties,
            'win_rate': wins / max(len(comparison_prompts), 1)
        }

    def generate(self, prompt, max_new_tokens=256, model=None):
        """Generate response"""
        if model is None:
            model = self.model
        inputs = self.tokenizer(prompt, return_tensors='pt').to(model.device)
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9,
                pad_token_id=self.tokenizer.eos_token_id
            )
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0, inputs['input_ids'].shape[1]:]
        return self.tokenizer.decode(new_tokens, skip_special_tokens=True)

    def _rate_helpfulness(self, prompt, response):
        """Rate helpfulness (placeholder - use LLM judge in practice)"""
        # In production, use trained judge or human evaluation
        return 0.8  # Placeholder

    def _check_safety(self, response):
        """Check if response is safe"""
        # In production, use safety classifier
        harmful_keywords = ['hate', 'violence', 'illegal']
        return not any(word in response.lower() for word in harmful_keywords)

    def _compare_responses(self, response1, response2):
        """Compare two responses"""
        # In production, use pairwise comparison
        return 'rlhf_wins'  # Placeholder


# Usage (`test_prompts`, `safety_prompts`, and `eval_prompts` are lists of
# prompt strings prepared elsewhere)
evaluator = RLHFEvaluator(
    model_path="outputs/rlhf_policy_v1",
    baseline_path="outputs/sft_model_v1"
)

# Evaluate
helpfulness = evaluator.evaluate_helpfulness(test_prompts)
safety = evaluator.evaluate_safety(safety_prompts)
comparison = evaluator.compare_with_baseline(eval_prompts)
print(f"Helpfulness: {helpfulness['avg_helpfulness']:.3f}")
print(f"Safety: {safety['safety_rate']:.3f}")
print(f"Win Rate vs Baseline: {comparison['win_rate']:.3f}")
```
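A raw win rate on a small prompt set can be misleading, so it may help to report an interval alongside it. The sketch below computes a Wilson score interval over decided comparisons. The function name `win_rate_with_interval` is hypothetical, and excluding ties from the denominator is an assumption (counting a tie as half a win is another common convention).

```python
import math


def win_rate_with_interval(wins, losses, z=1.96):
    """Win rate over decided comparisons with a Wilson score interval.

    Ties are assumed to have been excluded by the caller. Returns
    (win_rate, lower, upper), or None when there are no decided comparisons.
    z=1.96 corresponds to an approximate 95% interval.
    """
    decided = wins + losses
    if decided == 0:
        return None
    p = wins / decided
    denom = 1.0 + z * z / decided
    centre = (p + z * z / (2 * decided)) / denom
    half_width = z * math.sqrt(p * (1 - p) / decided + z * z / (4 * decided ** 2)) / denom
    return p, centre - half_width, centre + half_width


if __name__ == "__main__":
    rate, lower, upper = win_rate_with_interval(wins=60, losses=40)
    print(f"win rate {rate:.2f}, interval [{lower:.3f}, {upper:.3f}]")
```

If the lower bound stays above 0.5, the RLHF model's advantage over the baseline is unlikely to be sampling noise at the chosen confidence level. An interval that straddles 0.5 suggests collecting more comparisons before drawing a conclusion.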
---
## Summary
The RLHF pipeline consists of four steps:
1. **Collect preference data**: High-quality human comparisons
2. **Train reward model**: Learn to predict human preferences
3. **PPO training**: Optimize policy with reward + KL constraint
4. **Evaluate**: Comprehensive testing for helpfulness and safety
### RLHF vs Alternatives
| Method | Complexity | Data Needed | Performance | Best For |
|--------|-----------|-------------|-------------|----------|
| SFT only | Low | Demonstrations | Good | Basic tasks |
| RLHF (PPO) | High | Preferences | Best | Production alignment |
| DPO | Medium | Preferences | Very Good | Fast iteration |
| SimPO | Medium | Preferences | Very Good | Resource-constrained |
---
## Exercises
### Beginner
1. Train a reward model on 1,000 preference pairs
2. Evaluate reward model accuracy
3. Generate samples and manually inspect quality
### Intermediate
1. Implement complete PPO training loop
2. Tune KL coefficient for stable training
3. Compare RLHF vs SFT-only on evaluation set
### Advanced
1. Implement adaptive KL coefficient
2. Build ensemble of reward models
3. Deploy RLHF model with monitoring
4. Experiment with different PPO hyperparameters
---
## Next Steps
- Tutorial 07: Advanced Reward Modeling Techniques
- Tutorial 08: Continual Learning and Catastrophic Forgetting
- Tutorial 09: Production Deployment and Scaling
- Tutorial 10: Multi-Modal Training and Extensions