Tutorial 06: Reinforcement Learning from Human Feedback (RLHF)
Overview
Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences and values. This tutorial covers the complete RLHF pipeline using Nexuss Transformer Framework's native RewardModel, PreferenceDataset, and RLHFPipeline utilities.
What is RLHF?
RLHF is a three-stage process:
- Supervised Fine-Tuning (SFT): Train on high-quality demonstrations
- Reward Modeling: Learn human preferences from comparisons
- Reinforcement Learning: Optimize policy using learned reward
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Pre-trained   │────▶│    SFT Model     │────▶│    RL Policy    │
│      Model      │     │  (Instruction)   │     │      (PPO)      │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                        │
                                 ▼                        ▼
                        ┌──────────────────┐     ┌─────────────────┐
                        │   Reward Model   │◀────│   Generate &    │
                        │  (Preferences)   │     │      Score      │
                        └──────────────────┘     └─────────────────┘
When to Use RLHF
| Scenario | Recommendation |
|---|---|
| Need alignment with human values | ✅ RLHF |
| Reduce harmful outputs | ✅ RLHF |
| Improve helpfulness/honesty | ✅ RLHF |
| Simple task fine-tuning | ❌ Use SFT only |
| Limited annotation budget | ❌ Use DPO/SimPO |
| Need fast iteration | ❌ Use DPO (simpler) |
Section 1: The NTF RLHF Pipeline
RLHF Workflow
- Supervised Fine-Tuning (SFT): Train on instruction-following data
- Reward Modeling: Train reward model on human preference data
- RL Optimization: Use PPO to optimize policy against reward model
- Evaluation: Assess alignment with human preferences
NTF provides native components for each stage, ensuring consistency and reproducibility.
Complete RLHF Example with NTF
from ntf.reward import RewardModel, PreferenceDataset, RLHFPipeline
from ntf.models import ModelRegistry
from ntf.config import RewardConfig
# 1. Load base model (model_config is assumed to be defined beforehand)
registry = ModelRegistry(model_config)
base_model, tokenizer = registry.load_model_and_tokenizer()
# 2. Initialize NTF's RewardModel
reward_config = RewardConfig(
base_model_name="meta-llama/Llama-2-7b-hf",
num_labels=1,
pad_token_id=tokenizer.pad_token_id
)
reward_model = RewardModel(reward_config)
reward_model.load_base_model(base_model)
# 3. Load preference data with NTF utilities
pref_dataset = PreferenceDataset(
data_path="preferences.jsonl",
tokenizer=tokenizer,
max_length=512
)
# 4. Train reward model
from ntf.reward.trainer import RewardTrainer
reward_trainer = RewardTrainer(
model=reward_model,
dataset=pref_dataset,
config=reward_config
)
reward_trainer.train()
# 5. Use in RLHF pipeline (policy_model, ref_model, and prompts are assumed to be loaded beforehand)
pipeline = RLHFPipeline(
policy_model=policy_model,
reward_model=reward_model,
reference_model=ref_model,
tokenizer=tokenizer
)
pipeline.run_ppo(
prompts=prompts,
num_iterations=100,
kl_coeff=0.2
)
Section 2: Collecting Preference Data
Preference Data Format
# data/preference_data.jsonl
{"prompt": "Write a poem about nature", "chosen": "Nature's beauty unfolds each day...", "rejected": "The natural world is nice...", "annotator_id": "worker_001", "rating_confidence": 0.9}
{"prompt": "Explain quantum physics simply", "chosen": "Imagine tiny particles that can be in multiple states...", "rejected": "Quantum physics is the study of matter and energy...", "annotator_id": "worker_002", "rating_confidence": 0.85}
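Before training, it can help to check that every line of the preference file parses and carries the fields shown above. The following is a minimal sketch, not NTF functionality: `validate_record` and `parse_jsonl` are hypothetical helper names, `annotator_id` and `rating_confidence` are treated as optional, and the [0, 1] range for `rating_confidence` is an assumption based on the sample values.

```python
import json

REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def validate_record(record):
    """Return a list of problems found in one preference record (empty if valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # Only compare the two responses when both are present
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    confidence = record.get("rating_confidence")
    if confidence is not None:
        # Assumed range: confidence is a number between 0 and 1
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            problems.append("rating_confidence outside [0, 1]")
    return problems

def parse_jsonl(text):
    """Parse JSONL text (one JSON object per line) into valid and rejected records."""
    valid, rejected = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        problems = validate_record(record)
        if problems:
            rejected.append((line_no, problems))
        else:
            valid.append(record)
    return valid, rejected

sample = "\n".join([
    json.dumps({"prompt": "Write a poem about nature",
                "chosen": "Nature's beauty unfolds each day...",
                "rejected": "The natural world is nice...",
                "rating_confidence": 0.9}),
    json.dumps({"prompt": "Explain quantum physics simply",
                "chosen": "",
                "rejected": "Quantum physics is the study of matter and energy..."}),
])
valid_records, rejected_records = parse_jsonl(sample)
print(len(valid_records), rejected_records)
```

Records that fail validation are reported with their line number, which makes it easier to send them back for re-annotation rather than silently dropping them.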
Best Practices for Data Collection
Quality Guidelines:
- Clear preference criteria (helpfulness, honesty, harmlessness)
- Multiple annotators per sample (3-5 recommended)
- Resolve disagreements through adjudication
- Include diverse prompt types
Quantity Requirements:
- Minimum: 1,000 preference pairs
- Recommended: 10,000-50,000 pairs
- Quality matters more than quantity
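One way to act on the multiple-annotator and adjudication guidelines above is to aggregate votes per pair and flag low-agreement items. This is a sketch under assumptions: `aggregate_votes`, the `'a'`/`'b'` vote encoding, and the 0.6 agreement threshold are illustrative choices, not values prescribed by NTF.

```python
from collections import Counter

def aggregate_votes(votes, min_agreement=0.6):
    """Combine several annotators' votes ('a' or 'b') for one response pair.

    Returns (winner, agreement, needs_adjudication). The pair is flagged for
    adjudication when the vote is tied or the majority share is below min_agreement.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    winner, top_count = ranked[0]
    agreement = top_count / len(votes)
    tie = len(ranked) > 1 and ranked[1][1] == top_count
    needs_adjudication = tie or agreement < min_agreement
    return (None if tie else winner), agreement, needs_adjudication

print(aggregate_votes(["a", "a", "b"]))       # clear majority for 'a'
print(aggregate_votes(["a", "b", "b", "a"]))  # tie, needs adjudication
```

Pairs flagged for adjudication can be routed to a senior annotator or dropped, depending on the annotation budget.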
Data Augmentation for Preferences
# scripts/augment_preferences.py
import json
import random
from datasets import load_dataset
def augment_preference_data(dataset, augmentation_factor=3):
"""Augment preference data with variations (augmentation_factor is accepted but not yet used)"""
augmented = []
for sample in dataset:
# Original
augmented.append(sample)
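# Variation 1: Swap chosen/rejected
# Caution: this format has no separate label field, so a swapped pair asserts the
# opposite preference and acts as label noise. Keep it only for deliberate noise-robustness experiments.
if random.random() < 0.3: # Only sometimes to avoid bias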
augmented.append({
'prompt': sample['prompt'],
'chosen': sample['rejected'],
'rejected': sample['chosen'],
'annotator_id': f"{sample['annotator_id']}_swap"
})
# Variation 2: Paraphrase prompt
if random.random() < 0.2:
paraphrased_prompt = paraphrase(sample['prompt'])
augmented.append({
'prompt': paraphrased_prompt,
'chosen': sample['chosen'],
'rejected': sample['rejected'],
'annotator_id': f"{sample['annotator_id']}_paraphrase"
})
# Add more variations as needed...
return augmented
def paraphrase(text):
"""Simple paraphrasing (use LLM for better quality)"""
# In practice, use an LLM to generate paraphrases
synonyms = {
'explain': 'describe',
'write': 'compose',
'what': 'which',
'how': 'in what way'
}
words = text.split()
paraphrased = [synonyms.get(word.lower(), word) for word in words]
return ' '.join(paraphrased)
# Usage
raw_data = load_dataset('json', data_files='data/raw_preferences.jsonl', split='train')
augmented_data = augment_preference_data(list(raw_data), augmentation_factor=2)
# Save
with open('data/augmented_preferences.jsonl', 'w') as f:
for sample in augmented_data:
f.write(json.dumps(sample) + '\n')
Section 2: Building the Reward Model
Reward Model Architecture
The reward model takes a prompt-response pair and outputs a scalar score:
# src/reward_model.py
import torch
import torch.nn as nn
from transformers import AutoModelForSequenceClassification
class RewardModel(nn.Module):
def __init__(self, base_model_name, num_labels=1):
super().__init__()
# Load base model with classification head
self.base_model = AutoModelForSequenceClassification.from_pretrained(
base_model_name,
num_labels=num_labels
)
# Get hidden size for custom head if needed
config = self.base_model.config
self.hidden_size = config.hidden_size
def forward(self, input_ids, attention_mask=None, labels=None):
"""
Forward pass returning reward scores
Args:
input_ids: Token IDs [batch_size, seq_len]
attention_mask: Attention mask [batch_size, seq_len]
labels: Optional reward labels for training
Returns:
rewards: Scalar reward for each sample [batch_size]
"""
outputs = self.base_model(
input_ids=input_ids,
attention_mask=attention_mask,
labels=labels
)
rewards = outputs.logits.squeeze(-1) # [batch_size]
return {'rewards': rewards, 'loss': outputs.loss}
def compute_reward(self, input_ids, attention_mask=None):
"""Compute reward for generated sequences"""
with torch.no_grad():
outputs = self.forward(input_ids, attention_mask)
return outputs['rewards']
Training the Reward Model
# configs/reward_model.yaml
model:
base_model: "meta-llama/Llama-2-7b-hf"
model_type: "reward_model"
data:
preference_file: "data/preference_pairs.jsonl"
max_seq_length: 512
validation_split: 0.1
training:
learning_rate: 1.0e-5
per_device_train_batch_size: 16
gradient_accumulation_steps: 4
num_train_epochs: 3
# Reward model specific
loss_type: "pairwise_ranking" # pairwise or regression
margin: 0.1 # For pairwise ranking
# Regularization
weight_decay: 0.01
max_grad_norm: 1.0
fp16: true
gradient_checkpointing: true
output:
output_dir: "outputs/reward_model_v1"
run_name: "llama2-7b-reward-v1"
Pairwise Ranking Loss Implementation
# src/reward_trainer.py
import torch
import torch.nn.functional as F
from datasets import load_dataset
from transformers import Trainer, TrainingArguments
from src.reward_model import RewardModel
class RewardTrainer(Trainer):
def __init__(self, margin=0.1, **kwargs):
super().__init__(**kwargs)
self.margin = margin
def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None): # num_items_in_batch is passed by newer transformers releases
"""
Pairwise ranking loss for reward model
L = -log(sigmoid(reward_chosen - reward_rejected - margin))
"""
# Prepare chosen and rejected inputs
chosen_inputs = {
'input_ids': inputs['chosen_input_ids'],
'attention_mask': inputs['chosen_attention_mask']
}
rejected_inputs = {
'input_ids': inputs['rejected_input_ids'],
'attention_mask': inputs['rejected_attention_mask']
}
# Get rewards for both
chosen_outputs = model(**chosen_inputs)
rejected_outputs = model(**rejected_inputs)
chosen_rewards = chosen_outputs['rewards'] # [batch_size]
rejected_rewards = rejected_outputs['rewards'] # [batch_size]
# Pairwise ranking loss
# We want chosen_rewards > rejected_rewards
diff = chosen_rewards - rejected_rewards
loss = -F.logsigmoid(diff - self.margin).mean()
# Calculate accuracy metric
accuracy = (diff > 0).float().mean()
metrics = {
'loss': loss,
'accuracy': accuracy,
'chosen_reward_mean': chosen_rewards.mean(),
'rejected_reward_mean': rejected_rewards.mean(),
'reward_margin': diff.mean()
}
return (loss, metrics) if return_outputs else loss
def log(self, logs, *args, **kwargs):
"""Enhanced logging for reward training"""
if 'accuracy' in logs:
print(f"Step {self.state.global_step}: "
f"Accuracy: {logs['accuracy']:.3f}, "
f"Loss: {logs.get('loss', float('nan')):.4f}")
super().log(logs, *args, **kwargs)
# Training script
def train_reward_model():
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
# Prepare dataset
def preprocess(example):
# Tokenize chosen
chosen = tokenizer(
example['prompt'] + example['chosen'],
truncation=True,
max_length=512,
padding='max_length'
)
# Tokenize rejected
rejected = tokenizer(
example['prompt'] + example['rejected'],
truncation=True,
max_length=512,
padding='max_length'
)
return {
'chosen_input_ids': chosen['input_ids'],
'chosen_attention_mask': chosen['attention_mask'],
'rejected_input_ids': rejected['input_ids'],
'rejected_attention_mask': rejected['attention_mask']
}
dataset = load_dataset('json', data_files='data/preference_pairs.jsonl', split='train')
tokenized = dataset.map(preprocess, remove_columns=dataset.column_names) # per-example mapping: preprocess concatenates strings, which breaks on batched (list) inputs
# Split
train_test = tokenized.train_test_split(test_size=0.1)
# Initialize model
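model = RewardModel("meta-llama/Llama-2-7b-hf")
# The classification head needs a pad token id to locate the last real token in padded batches
model.base_model.config.pad_token_id = tokenizer.pad_token_id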
# Training arguments
training_args = TrainingArguments(
output_dir="outputs/reward_model_v1",
per_device_train_batch_size=16,
gradient_accumulation_steps=4,
learning_rate=1e-5,
num_train_epochs=3,
evaluation_strategy="steps", # named eval_strategy in newer transformers releases
eval_steps=100,
save_steps=500,
logging_steps=50,
fp16=True,
gradient_checkpointing=True,
remove_unused_columns=False, # keep the chosen_*/rejected_* columns that compute_loss reads
)
# Train
trainer = RewardTrainer(
model=model,
args=training_args,
train_dataset=train_test['train'],
eval_dataset=train_test['test']
)
trainer.train()
model.base_model.save_pretrained("outputs/reward_model_v1") # RewardModel is a plain nn.Module, so save the wrapped Hugging Face model directly
tokenizer.save_pretrained("outputs/reward_model_v1")
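To build intuition for the pairwise ranking loss used by the trainer above, the formula `-log(sigmoid(reward_chosen - reward_rejected - margin))` can be evaluated by hand for a few reward pairs. The sketch below uses only the standard library; `pairwise_ranking_loss` is an illustrative name, and the margin of 0.1 matches the value in the config.

```python
import math

def pairwise_ranking_loss(chosen_reward, rejected_reward, margin=0.1):
    """-log(sigmoid(chosen - rejected - margin)) for a single preference pair."""
    x = chosen_reward - rejected_reward - margin
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch on the sign to avoid overflow
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

# Correct ordering with a wide gap, a gap equal to the margin, and a wrong ordering
for chosen, rejected in [(2.0, 0.0), (0.1, 0.0), (0.0, 2.0)]:
    loss = pairwise_ranking_loss(chosen, rejected)
    print(f"chosen={chosen:.1f} rejected={rejected:.1f} loss={loss:.4f}")
```

When the chosen reward exceeds the rejected one by exactly the margin, the loss equals log 2 (about 0.693). It shrinks as the gap widens and grows roughly linearly when the ordering is wrong.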
Evaluating Reward Model
# scripts/evaluate_reward_model.py
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from src.reward_model import RewardModel
def evaluate_reward_model(model_path, test_examples):
"""Evaluate reward model on held-out test set"""
model = RewardModel(model_path) # RewardModel defines no from_pretrained; its constructor loads the weights from model_path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.eval()
correct = 0
total = 0
for example in test_examples:
# Tokenize chosen and rejected
chosen_text = example['prompt'] + example['chosen']
rejected_text = example['prompt'] + example['rejected']
chosen_inputs = tokenizer(chosen_text, return_tensors='pt', truncation=True, max_length=512)
rejected_inputs = tokenizer(rejected_text, return_tensors='pt', truncation=True, max_length=512)
# Get rewards
with torch.no_grad():
chosen_reward = model.compute_reward(**chosen_inputs).item()
rejected_reward = model.compute_reward(**rejected_inputs).item()
# Check if model correctly prefers chosen
if chosen_reward > rejected_reward:
correct += 1
total += 1
accuracy = correct / total
print(f"Reward Model Accuracy: {accuracy:.3f} ({correct}/{total})")
return accuracy
# Example usage
test_data = load_dataset('json', data_files='data/test_preferences.jsonl', split='train')
accuracy = evaluate_reward_model("outputs/reward_model_v1", list(test_data))
Section 3: PPO Training Loop
Understanding PPO for Language Models
Proximal Policy Optimization (PPO) optimizes the policy to maximize reward while staying close to the reference model:
Objective:
L_PPO = E[min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t)]
Where:
- r_t = π_new(a|s) / π_old(a|s): probability ratio
- A_t: advantage estimate
- ε: clipping parameter (typically 0.2)
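The effect of clipping is easiest to see on single samples. The following sketch evaluates the clipped objective for one action; `clipped_surrogate` is an illustrative helper, not part of NTF, and ε = 0.2 follows the typical value quoted above.

```python
import math

def clipped_surrogate(new_log_prob, old_log_prob, advantage, epsilon=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(new_log_prob - old_log_prob)
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Ratios of 1.5 and 0.5 combined with positive and negative advantages
for ratio in (1.5, 0.5):
    for advantage in (1.0, -1.0):
        value = clipped_surrogate(math.log(ratio), 0.0, advantage)
        print(f"ratio={ratio} advantage={advantage:+.1f} objective={value:+.2f}")
```

With a positive advantage the objective stops rewarding ratios above 1 + ε, and with a negative advantage it stops rewarding ratios below 1 - ε. In both cases the update gains nothing from moving further away from the old policy.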
PPO Configuration
# configs/ppo_training.yaml
policy:
model: "outputs/sft_model_v1" # Start from SFT model
ref_model: "outputs/sft_model_v1" # Reference model (frozen)
reward:
model: "outputs/reward_model_v1"
kl_coeff: 0.1 # KL penalty coefficient
data:
prompts_file: "data/ppo_prompts.jsonl"
max_prompt_length: 256
max_response_length: 256
training:
# PPO-specific
ppo_epochs: 4
mini_batch_size: 4
ppo_clip_range: 0.2
vf_clip_range: 0.2
gamma: 1.0
lam: 0.95 # GAE lambda
# Generation
num_rollouts: 128
temperature: 1.0
top_k: 0
top_p: 0.9
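# Optimization
learning_rate: 1.0e-6
gradient_accumulation_steps: 4
max_grad_norm: 1.0 # read by the PPO trainer for gradient clipping
num_iterations: 100 # read by the pipeline script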
# KL control
init_kl_coeff: 0.1
target_kl: 6.0 # Adaptive KL target
fp16: true
output:
output_dir: "outputs/rlhf_policy_v1"
run_name: "llama2-7b-rlhf-ppo"
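The `init_kl_coeff` and `target_kl` settings suggest an adaptive KL penalty. A common scheme in PPO-based RLHF code is a clipped proportional controller, sketched below. This is an assumption about how such a controller could look, not NTF's actual implementation: the class name, the `horizon` parameter, and the ±0.2 clip are illustrative.

```python
class AdaptiveKLController:
    """Nudges the KL coefficient so the observed KL drifts toward target_kl."""

    def __init__(self, init_kl_coeff=0.1, target_kl=6.0, horizon=10000):
        self.kl_coeff = init_kl_coeff
        self.target_kl = target_kl
        self.horizon = horizon  # assumed smoothing horizon (not in the config above)

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing the coefficient
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.kl_coeff *= 1.0 + error * n_steps / self.horizon
        return self.kl_coeff

controller = AdaptiveKLController()
for observed_kl in (12.0, 12.0, 3.0):
    coeff = controller.update(observed_kl, n_steps=128)
    print(f"observed_kl={observed_kl:.1f} kl_coeff={coeff:.6f}")
```

The coefficient rises while the observed KL is above the target and falls when it is below, so the penalty tightens only when the policy is drifting.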
PPO Trainer Implementation
# src/ppo_trainer.py
import torch
import torch.nn.functional as F
from typing import Dict, List, Optional
from transformers import AutoModelForCausalLM, AutoTokenizer
class PPOTrainer:
def __init__(
self,
policy_model,
ref_model,
reward_model,
tokenizer,
config
):
self.policy = policy_model
self.ref_model = ref_model
self.reward_model = reward_model
self.tokenizer = tokenizer
self.config = config
# Freeze reference model
for param in self.ref_model.parameters():
param.requires_grad = False
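# KL coefficient
self.kl_coeff = config.kl_coeff
# Optimizer for the policy (AdamW is an assumed choice; ppo_update calls self.optimizer)
self.optimizer = torch.optim.AdamW(self.policy.parameters(), lr=config.learning_rate)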
def generate_rollouts(self, prompts, num_samples_per_prompt=1):
"""Generate responses from current policy"""
all_responses = []
all_log_probs = []
for prompt in prompts:
# Tokenize prompt
inputs = self.tokenizer(
prompt,
return_tensors='pt',
truncation=True,
max_length=self.config.max_prompt_length
).to(self.policy.device)
# Generate response
with torch.no_grad():
outputs = self.policy.generate(
**inputs,
max_new_tokens=self.config.max_response_length,
do_sample=True,
temperature=self.config.temperature,
top_p=self.config.top_p,
pad_token_id=self.tokenizer.eos_token_id,
return_dict_in_generate=True,
output_scores=True
)
# Extract generated tokens and log probs
generated_ids = outputs.sequences[0, inputs['input_ids'].shape[1]:]
log_probs = self._extract_log_probs(outputs.scores, generated_ids)
all_responses.append(generated_ids)
all_log_probs.append(log_probs)
return all_responses, all_log_probs
def compute_rewards(self, prompts, responses):
"""Compute rewards including KL penalty"""
rewards = []
for prompt, response in zip(prompts, responses):
# Concatenate prompt + response
full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True)
inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device)
# Get reward from reward model
with torch.no_grad():
reward = self.reward_model.compute_reward(**inputs).item()
# Compute KL divergence from reference model
kl_div = self._compute_kl_divergence(prompt, response)
# Final reward with KL penalty
final_reward = reward - self.kl_coeff * kl_div
rewards.append(final_reward)
return torch.tensor(rewards)
def _compute_kl_divergence(self, prompt, response):
"""Compute KL divergence between policy and reference"""
# Get log probs from both models
full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True)
inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device)
with torch.no_grad():
policy_logits = self.policy(**inputs).logits
ref_logits = self.ref_model(**inputs).logits
policy_log_probs = F.log_softmax(policy_logits, dim=-1)
ref_log_probs = F.log_softmax(ref_logits, dim=-1)
# KL(policy || ref) = sum_x p_policy(x) * (log p_policy(x) - log p_ref(x)), averaged over positions
kl = policy_log_probs.exp() * (policy_log_probs - ref_log_probs)
kl = kl.sum(dim=-1).mean()
return kl.item()
def ppo_update(self, rollouts, advantages):
"""Perform PPO update on collected rollouts"""
total_loss = 0
for epoch in range(self.config.ppo_epochs):
for minibatch in self._get_minibatches(rollouts, self.config.mini_batch_size):
# Compute probability ratios
old_log_probs = minibatch['old_log_probs']
new_log_probs = self._get_new_log_probs(minibatch)
ratio = (new_log_probs - old_log_probs).exp()
# Clipped surrogate objective
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1-self.config.ppo_clip_range,
1+self.config.ppo_clip_range) * advantages
policy_loss = -torch.min(surr1, surr2).mean()
# Value function loss
value_loss = self._compute_value_loss(minibatch)
# Entropy bonus (encourages exploration)
entropy = self._compute_entropy(minibatch)
# Total loss
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
# Backward pass
loss.backward()
torch.nn.utils.clip_grad_norm_(
self.policy.parameters(),
self.config.max_grad_norm
)
# Optimizer step
self.optimizer.step()
self.optimizer.zero_grad()
total_loss += loss.item()
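# len(rollouts) would count dict keys, so derive the number of minibatches from the rollout count
num_minibatches = -(-len(rollouts['responses']) // self.config.mini_batch_size) # ceiling division
return total_loss / (self.config.ppo_epochs * num_minibatches)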
def train(self, prompts, num_iterations=100):
"""Main PPO training loop"""
for iteration in range(num_iterations):
print(f"\n=== Iteration {iteration + 1}/{num_iterations} ===")
# 1. Generate rollouts
responses, log_probs = self.generate_rollouts(prompts)
# 2. Compute rewards
rewards = self.compute_rewards(prompts, responses)
# 3. Compute advantages (GAE)
advantages = self._compute_gae(rewards, log_probs)
# 4. PPO update
rollouts = {
'responses': responses,
'old_log_probs': log_probs, # key read by ppo_update
'rewards': rewards
}
loss = self.ppo_update(rollouts, advantages)
# 5. Logging
print(f"Iteration {iteration + 1}:")
print(f" Mean Reward: {rewards.mean().item():.3f}")
print(f" KL Divergence: {self._compute_mean_kl():.4f}")
print(f" PPO Loss: {loss:.4f}")
# 6. Adaptive KL coefficient
self._update_kl_coeff()
# 7. Save checkpoint
if (iteration + 1) % 10 == 0:
self.save_checkpoint(f"checkpoint_{iteration + 1}")
def save_checkpoint(self, path):
"""Save model checkpoint"""
self.policy.save_pretrained(path)
self.tokenizer.save_pretrained(path)
print(f"✓ Checkpoint saved to {path}")
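The trainer above calls `_compute_gae` without showing it. A minimal sketch of one way to fill that gap follows, assuming per-token value estimates are available (the trainer shown does not define a value head, so `values` is a hypothetical input). Unlike `compute_rewards` above, which returns one scalar per response, this sketch places the reward-model score on the final token and a KL penalty on every token, a common convention in PPO-based RLHF, and then applies Generalized Advantage Estimation with the `gamma` and `lam` values from the config. `shape_rewards` and `compute_gae` are illustrative names.

```python
def shape_rewards(score, policy_log_probs, ref_log_probs, kl_coeff=0.1):
    """Per-token rewards: KL penalty on every token, reward-model score on the last."""
    rewards = [-kl_coeff * (p - r) for p, r in zip(policy_log_probs, ref_log_probs)]
    rewards[-1] += score
    return rewards

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over the tokens of one response.

    rewards[t] is the reward at token t and values[t] the value estimate there.
    The value after the final token is taken to be 0 because the episode ends.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values)]  # value-function targets
    return advantages, returns

token_rewards = shape_rewards(
    score=1.0,
    policy_log_probs=[-1.0, -2.0, -0.5],
    ref_log_probs=[-1.0, -2.0, -0.5],  # identical to the policy, so no KL penalty
)
advantages, returns = compute_gae(token_rewards, values=[0.5, 0.5, 0.5])
print(advantages, returns)
```

Because the only non-zero reward sits on the last token here, the advantages decay by a factor of `gamma * lam` per step when moving back toward the start of the response.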
Complete RLHF Pipeline
# scripts/run_rlhf_pipeline.py
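import torch
import yaml
from transformers import AutoModelForCausalLM, AutoTokenizer
from src.reward_model import RewardModel
from src.ppo_trainer import PPOTrainer
# load_prompts is assumed to be a project helper that returns a list of prompt strings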
def run_complete_rlhf():
# Load configuration
with open('configs/ppo_training.yaml', 'r') as f:
config = yaml.safe_load(f)
print("=" * 80)
print("STARTING COMPLETE RLHF PIPELINE")
print("=" * 80)
# Step 1: Load SFT model (already trained)
print("\n[1/4] Loading SFT model...")
policy_model = AutoModelForCausalLM.from_pretrained(
config['policy']['model'],
torch_dtype=torch.float16,
device_map="auto"
)
ref_model = AutoModelForCausalLM.from_pretrained(
config['policy']['ref_model'],
torch_dtype=torch.float16,
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(config['policy']['model'])
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
print(f"✓ Loaded policy model from {config['policy']['model']}")
# Step 2: Load reward model
print("\n[2/4] Loading reward model...")
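reward_model = RewardModel(config['reward']['model']) # the constructor loads the weights; RewardModel defines no from_pretrained
reward_model.half().to(policy_model.device) # match the policy's fp16 precision and device
reward_model.eval()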
print(f"✓ Loaded reward model from {config['reward']['model']}")
# Step 3: Load prompts for RLHF
print("\n[3/4] Loading prompts...")
prompts = load_prompts(config['data']['prompts_file'])
print(f"✓ Loaded {len(prompts)} prompts")
# Step 4: Initialize PPO trainer
print("\n[4/4] Initializing PPO trainer...")
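# PPOTrainer reads settings as attributes, so merge the YAML sections it needs into one namespace
from types import SimpleNamespace
ppo_config = SimpleNamespace(
**config['training'],
**config['data'],
kl_coeff=config['reward']['kl_coeff']
)
trainer = PPOTrainer(
policy_model=policy_model,
ref_model=ref_model,
reward_model=reward_model,
tokenizer=tokenizer,
config=ppo_config
)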
# Step 5: Run PPO training
print("\n" + "=" * 80)
print("STARTING PPO TRAINING")
print("=" * 80)
trainer.train(
prompts=prompts,
num_iterations=config['training']['num_iterations']
)
# Step 6: Save final model
print("\n" + "=" * 80)
print("SAVING FINAL MODEL")
print("=" * 80)
trainer.save_checkpoint(config['output']['output_dir'])
print(f"\n✓ RLHF complete! Model saved to {config['output']['output_dir']}")
if __name__ == "__main__":
run_complete_rlhf()
Section 4: Monitoring and Debugging RLHF
Key Metrics to Track
# src/rlhf_monitor.py
class RLHFMonitor:
def __init__(self):
self.metrics_history = {
'reward': [], # matches the 'reward' key used in the metrics dict
'kl_divergence': [],
'policy_loss': [],
'value_loss': [],
'entropy': [],
'generation_length': []
}
def log_iteration(self, iteration, metrics):
"""Log metrics for each iteration"""
for key in self.metrics_history:
if key in metrics:
self.metrics_history[key].append({
'iteration': iteration,
'value': metrics[key]
})
# Print summary
print(f"\nIteration {iteration}:")
print(f" Reward: {metrics.get('reward', 0):.3f}")
print(f" KL Div: {metrics.get('kl_divergence', 0):.4f}")
print(f" Policy Loss: {metrics.get('policy_loss', 0):.4f}")
print(f" Entropy: {metrics.get('entropy', 0):.4f}")
def check_for_issues(self, metrics):
"""Detect common RLHF issues"""
issues = []
# KL divergence too high
if metrics.get('kl_divergence', 0) > 10.0:
issues.append("⚠️ High KL divergence - policy drifting too far")
# Reward hacking (high reward but low quality)
if metrics.get('reward', 0) > 5.0 and metrics.get('kl_divergence', 0) > 5.0:
issues.append("⚠️ Possible reward hacking detected")
# Collapsing entropy
if metrics.get('entropy', 0) < 0.1:
issues.append("⚠️ Low entropy - policy may be collapsing")
# Negative rewards
if metrics.get('reward', 0) < 0:
issues.append("⚠️ Negative rewards - check reward model")
return issues
# Usage in training loop
monitor = RLHFMonitor()
for iteration in range(num_iterations):
# ... training code ...
metrics = {
'reward': rewards.mean().item(),
'kl_divergence': kl_div,
'policy_loss': loss,
'entropy': entropy.item()
}
monitor.log_iteration(iteration, metrics)
issues = monitor.check_for_issues(metrics)
for issue in issues:
print(issue)
Common RLHF Issues and Solutions
Issue 1: KL Divergence Too High
Symptoms:
- KL > 10 nats
- Generated text becomes nonsensical
- Reward increases but quality decreases
Solutions:
# Increase KL penalty
kl_coeff: 0.2 # Increase from 0.1
# Use adaptive KL
target_kl: 6.0
adaptive_kl: true
# Lower learning rate
learning_rate: 5.0e-7 # Reduce from 1e-6
Issue 2: Reward Hacking
Symptoms:
- Rewards increase rapidly
- Generated text exploits reward model quirks
- Human evaluators rate outputs poorly
Solutions:
# Stronger KL penalty
kl_coeff: 0.5
# Better reward model
# Retrain with more diverse examples
# Add adversarial examples
# Limit generation length
max_response_length: 128 # Prevent verbose exploitation
# Ensemble reward models
# Use multiple reward models and average
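The ensemble suggestion can be sketched in a few lines. The version below averages the scores of several reward models, as suggested above, and optionally subtracts a disagreement penalty. The penalty term and the name `ensemble_reward` are assumptions added for illustration; the default of 0.0 reduces to a plain average.

```python
from statistics import mean, pstdev

def ensemble_reward(scores, disagreement_penalty=0.0):
    """Combine the scores several reward models gave to one response.

    With disagreement_penalty > 0, responses on which the models disagree
    strongly receive a lower combined reward.
    """
    if not scores:
        raise ValueError("at least one score is required")
    return mean(scores) - disagreement_penalty * pstdev(scores)

consistent = [1.0, 1.2, 0.8]   # models roughly agree
suspicious = [3.0, 0.2, -0.2]  # one model is far more enthusiastic than the rest
for name, scores in (("consistent", consistent), ("suspicious", suspicious)):
    plain = ensemble_reward(scores)
    penalized = ensemble_reward(scores, disagreement_penalty=1.0)
    print(f"{name}: mean={plain:.3f} penalized={penalized:.3f}")
```

Both example responses have the same mean score, but the penalized variant ranks the one with strong disagreement lower, which is the pattern that exploiting a single model's quirks tends to produce.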
Issue 3: Policy Collapse
Symptoms:
- Entropy drops to near zero
- Model generates same response repeatedly
- No diversity in outputs
Solutions:
# Add entropy bonus
entropy_coefficient: 0.05 # Increase exploration
# Lower KL penalty temporarily
kl_coeff: 0.05
# Restart from earlier checkpoint
# Reduce learning rate
learning_rate: 5.0e-7
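The entropy symptom can be made concrete with a small calculation. The sketch below computes the Shannon entropy of next-token distributions using only the standard library; the function names and the example distributions are illustrative, and the 0.1 threshold mirrors the check in `RLHFMonitor`.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average entropy over the token positions of a response."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

healthy = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]
collapsed = [[0.99, 0.005, 0.005], [1.0, 0.0, 0.0]]

for name, dists in (("healthy", healthy), ("collapsed", collapsed)):
    entropy = mean_entropy(dists)
    status = "below 0.1, possible collapse" if entropy < 0.1 else "ok"
    print(f"{name}: mean entropy {entropy:.4f} ({status})")
```

A uniform distribution over four tokens has entropy log 4 (about 1.386 nats), while a distribution that puts almost all mass on one token approaches zero.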
Section 5: Evaluation After RLHF
Comprehensive Evaluation Suite
# scripts/evaluate_rlhf.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
class RLHFEvaluator:
def __init__(self, model_path, baseline_path=None):
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map="auto"
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path)
if baseline_path:
self.baseline = AutoModelForCausalLM.from_pretrained(
baseline_path,
torch_dtype=torch.float16,
device_map="auto"
)
else:
self.baseline = None
def evaluate_helpfulness(self, test_prompts):
"""Evaluate helpfulness on standard benchmarks"""
results = []
for prompt in test_prompts:
response = self.generate(prompt, max_new_tokens=256)
# Rate helpfulness (in practice, use human eval or LLM judge)
helpfulness_score = self._rate_helpfulness(prompt, response)
results.append({
'prompt': prompt,
'response': response,
'helpfulness': helpfulness_score
})
avg_helpfulness = sum(r['helpfulness'] for r in results) / len(results)
return {'avg_helpfulness': avg_helpfulness, 'samples': results}
def evaluate_safety(self, safety_prompts):
"""Evaluate safety on adversarial prompts"""
safe_count = 0
for prompt in safety_prompts:
response = self.generate(prompt, max_new_tokens=256)
# Check for harmful content
is_safe = self._check_safety(response)
if is_safe:
safe_count += 1
safety_rate = safe_count / len(safety_prompts)
return {'safety_rate': safety_rate}
def compare_with_baseline(self, comparison_prompts):
"""Compare RLHF model with baseline"""
wins = 0
losses = 0
ties = 0
for prompt in comparison_prompts:
rlhf_response = self.generate(prompt)
baseline_response = self.generate(prompt, model=self.baseline)
# Compare (use human eval or LLM judge)
result = self._compare_responses(rlhf_response, baseline_response)
if result == 'rlhf_wins':
wins += 1
elif result == 'baseline_wins':
losses += 1
else:
ties += 1
return {
'wins': wins,
'losses': losses,
'ties': ties,
'win_rate': wins / len(comparison_prompts)
}
def generate(self, prompt, max_new_tokens=256, model=None):
"""Generate response"""
if model is None:
model = self.model
inputs = self.tokenizer(prompt, return_tensors='pt').to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.7,
top_p=0.9,
pad_token_id=self.tokenizer.eos_token_id
)
return self.tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True) # drop the prompt tokens so only the response is returned
def _rate_helpfulness(self, prompt, response):
"""Rate helpfulness (placeholder - use LLM judge in practice)"""
# In production, use trained judge or human evaluation
return 0.8 # Placeholder
def _check_safety(self, response):
"""Check if response is safe"""
# In production, use safety classifier
harmful_keywords = ['hate', 'violence', 'illegal']
return not any(word in response.lower() for word in harmful_keywords)
def _compare_responses(self, response1, response2):
"""Compare two responses"""
# In production, use pairwise comparison
return 'rlhf_wins' # Placeholder
# Usage
evaluator = RLHFEvaluator(
model_path="outputs/rlhf_policy_v1",
baseline_path="outputs/sft_model_v1"
)
# Evaluate
helpfulness = evaluator.evaluate_helpfulness(test_prompts)
safety = evaluator.evaluate_safety(safety_prompts)
comparison = evaluator.compare_with_baseline(eval_prompts)
print(f"Helpfulness: {helpfulness['avg_helpfulness']:.3f}")
print(f"Safety: {safety['safety_rate']:.3f}")
print(f"Win Rate vs Baseline: {comparison['win_rate']:.3f}")
Summary
The RLHF pipeline consists of:
- Collect preference data: High-quality human comparisons
- Train reward model: Learn to predict human preferences
- PPO training: Optimize policy with reward + KL constraint
- Evaluate: Comprehensive testing for helpfulness and safety
RLHF vs Alternatives
| Method | Complexity | Data Needed | Performance | Best For |
|---|---|---|---|---|
| SFT only | Low | Demonstrations | Good | Basic tasks |
| RLHF (PPO) | High | Preferences | Best | Production alignment |
| DPO | Medium | Preferences | Very Good | Fast iteration |
| SimPO | Medium | Preferences | Very Good | Resource-constrained |
Exercises
Beginner
- Train reward model on 1000 preference pairs
- Evaluate reward model accuracy
- Generate samples and manually inspect quality
Intermediate
- Implement complete PPO training loop
- Tune KL coefficient for stable training
- Compare RLHF vs SFT-only on evaluation set
Advanced
- Implement adaptive KL coefficient
- Build ensemble of reward models
- Deploy RLHF model with monitoring
- Experiment with different PPO hyperparameters
Next Steps
- Tutorial 07: Advanced Reward Modeling Techniques
- Tutorial 08: Continual Learning and Catastrophic Forgetting
- Tutorial 09: Production Deployment and Scaling
- Tutorial 10: Multi-Modal Training and Extensions