| # Tutorial 06: Reinforcement Learning from Human Feedback (RLHF) |
|
|
| ## Overview |
|
|
| Reinforcement Learning from Human Feedback (RLHF) aligns language models with human preferences and values. This tutorial covers the complete RLHF pipeline using **Nexuss Transformer Framework's** native `RewardModel`, `PreferenceDataset`, and `RLHFPipeline` utilities. |
|
|
| ### What is RLHF? |
|
|
| RLHF is a three-stage process: |
|
|
| 1. **Supervised Fine-Tuning (SFT)**: Train on high-quality demonstrations |
| 2. **Reward Modeling**: Learn human preferences from comparisons |
| 3. **Reinforcement Learning**: Optimize policy using learned reward |
|
|
| ``` |
| ┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐ |
| │   Pre-trained   │────▶│    SFT Model     │────▶│    RL Policy    │ |
| │      Model      │     │  (Instruction)   │     │      (PPO)      │ |
| └─────────────────┘     └──────────────────┘     └─────────────────┘ |
|                                  │                        │ |
|                                  ▼                        ▼ |
|                         ┌──────────────────┐     ┌─────────────────┐ |
|                         │   Reward Model   │◀────│   Generate &    │ |
|                         │  (Preferences)   │     │      Score      │ |
|                         └──────────────────┘     └─────────────────┘ |
| ``` |
|
|
| ### When to Use RLHF |
|
|
| | Scenario | Recommendation | |
| |----------|---------------| |
| | Need alignment with human values | ✅ RLHF | |
| | Reduce harmful outputs | ✅ RLHF | |
| | Improve helpfulness/honesty | ✅ RLHF | |
| | Simple task fine-tuning | ❌ Use SFT only | |
| | Limited annotation budget | ❌ Use DPO/SimPO | |
| | Need fast iteration | ❌ Use DPO (simpler) | |
|
|
| --- |
|
|
| ## Section 1: The NTF RLHF Pipeline |
|
|
| ### RLHF Workflow |
|
|
| 1. **Supervised Fine-Tuning (SFT)**: Train on instruction-following data |
| 2. **Reward Modeling**: Train reward model on human preference data |
| 3. **RL Optimization**: Use PPO to optimize policy against reward model |
| 4. **Evaluation**: Assess alignment with human preferences |
|
|
| NTF provides native components for each stage, ensuring consistency and reproducibility. |
|
|
| ### Complete RLHF Example with NTF |
|
|
| ```python |
| from ntf.reward import RewardModel, PreferenceDataset, RLHFPipeline |
| from ntf.models import ModelRegistry |
| from ntf.config import RewardConfig |
| |
| # 1. Load base model (model_config is assumed to be defined earlier) |
| registry = ModelRegistry(model_config) |
| base_model, tokenizer = registry.load_model_and_tokenizer() |
| |
| # 2. Initialize NTF's RewardModel |
| reward_config = RewardConfig( |
| base_model_name="meta-llama/Llama-2-7b-hf", |
| num_labels=1, |
| pad_token_id=tokenizer.pad_token_id |
| ) |
| reward_model = RewardModel(reward_config) |
| reward_model.load_base_model(base_model) |
| |
| # 3. Load preference data with NTF utilities |
| pref_dataset = PreferenceDataset( |
| data_path="preferences.jsonl", |
| tokenizer=tokenizer, |
| max_length=512 |
| ) |
| |
| # 4. Train reward model |
| from ntf.reward.trainer import RewardTrainer |
| reward_trainer = RewardTrainer( |
| model=reward_model, |
| dataset=pref_dataset, |
| config=reward_config |
| ) |
| reward_trainer.train() |
| |
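| # 5. Use in RLHF pipeline (policy_model, ref_model, and prompts are assumed to be loaded already, e.g. from the SFT checkpoint) |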
| pipeline = RLHFPipeline( |
| policy_model=policy_model, |
| reward_model=reward_model, |
| reference_model=ref_model, |
| tokenizer=tokenizer |
| ) |
| |
| pipeline.run_ppo( |
| prompts=prompts, |
| num_iterations=100, |
| kl_coeff=0.2 |
| ) |
| ``` |
|
|
| --- |
|
|
| ## Section 2: Collecting Preference Data |
|
|
| ### Preference Data Format |
|
|
| Each line of `data/preference_data.jsonl` holds one JSON object: |
|
|
| ```jsonl |
| {"prompt": "Write a poem about nature", "chosen": "Nature's beauty unfolds each day...", "rejected": "The natural world is nice...", "annotator_id": "worker_001", "rating_confidence": 0.9} |
| {"prompt": "Explain quantum physics simply", "chosen": "Imagine tiny particles that can be in multiple states...", "rejected": "Quantum physics is the study of matter and energy...", "annotator_id": "worker_002", "rating_confidence": 0.85} |
| ``` |
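
Records in this format can be checked before training. The following is a minimal sketch, assuming one JSON object per line; the helper name `load_preference_pairs` and its validation rules are illustrative and not part of NTF.

```python
import json
from typing import Dict, Iterable, List

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def load_preference_pairs(lines: Iterable[str]) -> List[Dict]:
    """Parse JSONL lines into preference records, skipping blank lines."""
    records = []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            raise ValueError(f"line {line_number}: missing keys {missing}")
        # A pair whose two responses are identical carries no preference signal
        if record["chosen"] == record["rejected"]:
            raise ValueError(f"line {line_number}: chosen and rejected are identical")
        records.append(record)
    return records

sample_lines = [
    '{"prompt": "Write a poem about nature", "chosen": "Nature\'s beauty unfolds each day...", "rejected": "The natural world is nice..."}',
    "",
]
pairs = load_preference_pairs(sample_lines)
print(len(pairs), pairs[0]["prompt"])
```

With a real file, pass the open file handle as `lines`, since iterating a text file yields one line at a time.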
|
|
| ### Best Practices for Data Collection |
|
|
| **Quality Guidelines:** |
| - Clear preference criteria (helpfulness, honesty, harmlessness) |
| - Multiple annotators per sample (3-5 recommended) |
| - Resolve disagreements through adjudication |
| - Include diverse prompt types |
|
|
| **Quantity Requirements:** |
| - Minimum: 1,000 preference pairs |
| - Recommended: 10,000-50,000 pairs |
| - High-quality > quantity |
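
The multi-annotator and adjudication guidelines above can be sketched as a majority vote with an agreement threshold. This is an illustration only: the function name `aggregate_votes`, the `"a"`/`"b"` vote labels, and the 0.66 threshold are assumptions, not values from this tutorial.

```python
from collections import Counter
from typing import Dict, List

def aggregate_votes(votes: List[str], min_agreement: float = 0.66) -> Dict[str, object]:
    """Majority-vote the annotator choices ("a" or "b") for one comparison."""
    counts = Counter(votes)
    # most_common(1) returns the label with the highest count
    winner, winner_count = counts.most_common(1)[0]
    agreement = winner_count / len(votes)
    return {
        "winner": winner,
        "agreement": agreement,
        # Low agreement means the pair should go to an adjudicator
        "needs_adjudication": agreement < min_agreement,
    }

unanimous = aggregate_votes(["a", "a", "a"])
split = aggregate_votes(["a", "b", "a", "b", "b"])
print(unanimous)
print(split)
```

Using an odd number of annotators per sample avoids exact ties, which this sketch does not handle specially.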
|
|
| ### Data Augmentation for Preferences |
|
|
| ```python |
| # scripts/augment_preferences.py |
| import json |
| import random |
| |
| from datasets import load_dataset |
| |
| |
| def augment_preference_data(dataset, augmentation_factor=3): |
|     """Augment preference data with variations. |
| |
|     Each sample contributes at most `augmentation_factor` records: |
|     the original plus up to `augmentation_factor - 1` variations. |
|     """ |
|     augmented = [] |
| |
|     for sample in dataset: |
|         variations = [] |
| |
|         # Variation 1: swap chosen/rejected. |
|         # Caution: the swapped record states the opposite preference, so it is |
|         # only useful if downstream code inverts the label for "_swap" records. |
|         if random.random() < 0.3:  # Only sometimes to avoid bias |
|             variations.append({ |
|                 'prompt': sample['prompt'], |
|                 'chosen': sample['rejected'], |
|                 'rejected': sample['chosen'], |
|                 'annotator_id': f"{sample['annotator_id']}_swap" |
|             }) |
| |
|         # Variation 2: paraphrase the prompt |
|         if random.random() < 0.2: |
|             variations.append({ |
|                 'prompt': paraphrase(sample['prompt']), |
|                 'chosen': sample['chosen'], |
|                 'rejected': sample['rejected'], |
|                 'annotator_id': f"{sample['annotator_id']}_paraphrase" |
|             }) |
| |
|         # Add more variations as needed... |
| |
|         # Original first, then as many variations as the factor allows |
|         augmented.append(sample) |
|         augmented.extend(variations[:max(augmentation_factor - 1, 0)]) |
| |
|     return augmented |
| |
| |
| def paraphrase(text): |
|     """Simple word-substitution paraphrasing (use an LLM for better quality).""" |
|     synonyms = { |
|         'explain': 'describe', |
|         'write': 'compose', |
|         'what': 'which', |
|         'how': 'in what way' |
|     } |
| |
|     words = text.split() |
|     paraphrased = [synonyms.get(word.lower(), word) for word in words] |
|     return ' '.join(paraphrased) |
| |
| |
| # Usage |
| raw_data = load_dataset('json', data_files='data/raw_preferences.jsonl', split='train') |
| augmented_data = augment_preference_data(list(raw_data), augmentation_factor=2) |
| |
| # Save one JSON object per line |
| with open('data/augmented_preferences.jsonl', 'w') as f: |
|     for sample in augmented_data: |
|         f.write(json.dumps(sample) + '\n') |
| ``` |
|
|
| --- |
|
|
| ## Section 2: Building the Reward Model |
|
|
| ### Reward Model Architecture |
|
|
| The reward model takes a prompt-response pair and outputs a scalar score: |
|
|
| ```python |
| # src/reward_model.py |
| import torch |
| import torch.nn as nn |
| from transformers import AutoModelForSequenceClassification |
| |
| class RewardModel(nn.Module): |
| def __init__(self, base_model_name, num_labels=1): |
| super().__init__() |
| |
| # Load base model with classification head |
| self.base_model = AutoModelForSequenceClassification.from_pretrained( |
| base_model_name, |
| num_labels=num_labels |
| ) |
| |
| # Get hidden size for custom head if needed |
| config = self.base_model.config |
| self.hidden_size = config.hidden_size |
| |
| def forward(self, input_ids, attention_mask=None, labels=None): |
| """ |
| Forward pass returning reward scores |
| |
| Args: |
| input_ids: Token IDs [batch_size, seq_len] |
| attention_mask: Attention mask [batch_size, seq_len] |
| labels: Optional reward labels for training |
| |
| Returns: |
| rewards: Scalar reward for each sample [batch_size] |
| """ |
| outputs = self.base_model( |
| input_ids=input_ids, |
| attention_mask=attention_mask, |
| labels=labels |
| ) |
| |
| rewards = outputs.logits.squeeze(-1) # [batch_size] |
| |
| return {'rewards': rewards, 'loss': outputs.loss} |
| |
| def compute_reward(self, input_ids, attention_mask=None): |
| """Compute reward for generated sequences""" |
| with torch.no_grad(): |
| outputs = self.forward(input_ids, attention_mask) |
| return outputs['rewards'] |
| ``` |
|
|
| ### Training the Reward Model |
|
|
| ```yaml |
| # configs/reward_model.yaml |
| model: |
| base_model: "meta-llama/Llama-2-7b-hf" |
| model_type: "reward_model" |
| |
| data: |
| preference_file: "data/preference_pairs.jsonl" |
| max_seq_length: 512 |
| validation_split: 0.1 |
| |
| training: |
| learning_rate: 1.0e-5 |
| per_device_train_batch_size: 16 |
| gradient_accumulation_steps: 4 |
| num_train_epochs: 3 |
| |
| # Reward model specific |
| loss_type: "pairwise_ranking" # pairwise or regression |
| margin: 0.1 # For pairwise ranking |
| |
| # Regularization |
| weight_decay: 0.01 |
| max_grad_norm: 1.0 |
| |
| fp16: true |
| gradient_checkpointing: true |
| |
| output: |
| output_dir: "outputs/reward_model_v1" |
| run_name: "llama2-7b-reward-v1" |
| ``` |
|
|
| ### Pairwise Ranking Loss Implementation |
|
|
| ```python |
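| # src/reward_trainer.py |
| import torch |
| import torch.nn.functional as F |
| from datasets import load_dataset |
| from transformers import Trainer, TrainingArguments |
| |
| from src.reward_model import RewardModel |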
| |
| class RewardTrainer(Trainer): |
| def __init__(self, margin=0.1, **kwargs): |
| super().__init__(**kwargs) |
| self.margin = margin |
| |
| def compute_loss(self, model, inputs, return_outputs=False): |
| """ |
| Pairwise ranking loss for reward model |
| |
| L = -log(sigmoid(reward_chosen - reward_rejected - margin)) |
| """ |
| |
| # Prepare chosen and rejected inputs |
| chosen_inputs = { |
| 'input_ids': inputs['chosen_input_ids'], |
| 'attention_mask': inputs['chosen_attention_mask'] |
| } |
| |
| rejected_inputs = { |
| 'input_ids': inputs['rejected_input_ids'], |
| 'attention_mask': inputs['rejected_attention_mask'] |
| } |
| |
| # Get rewards for both |
| chosen_outputs = model(**chosen_inputs) |
| rejected_outputs = model(**rejected_inputs) |
| |
| chosen_rewards = chosen_outputs['rewards'] # [batch_size] |
| rejected_rewards = rejected_outputs['rewards'] # [batch_size] |
| |
| # Pairwise ranking loss |
| # We want chosen_rewards > rejected_rewards |
| diff = chosen_rewards - rejected_rewards |
| loss = -F.logsigmoid(diff - self.margin).mean() |
| |
| # Calculate accuracy metric |
| accuracy = (diff > 0).float().mean() |
| |
| metrics = { |
| 'loss': loss, |
| 'accuracy': accuracy, |
| 'chosen_reward_mean': chosen_rewards.mean(), |
| 'rejected_reward_mean': rejected_rewards.mean(), |
| 'reward_margin': diff.mean() |
| } |
| |
| return (loss, metrics) if return_outputs else loss |
| |
| def log(self, logs): |
| """Enhanced logging for reward training""" |
| if 'accuracy' in logs: |
| print(f"Step {self.state.global_step}: " |
| f"Accuracy: {logs['accuracy']:.3f}, " |
| f"Loss: {logs['loss']:.4f}") |
| super().log(logs) |
| |
| # Training script |
| def train_reward_model(): |
| from transformers import AutoTokenizer |
| |
| # Load tokenizer |
| tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf") |
| if tokenizer.pad_token is None: |
| tokenizer.pad_token = tokenizer.eos_token |
| |
| # Prepare dataset |
| def preprocess(example): |
| # Tokenize chosen |
| chosen = tokenizer( |
| example['prompt'] + example['chosen'], |
| truncation=True, |
| max_length=512, |
| padding='max_length' |
| ) |
| |
| # Tokenize rejected |
| rejected = tokenizer( |
| example['prompt'] + example['rejected'], |
| truncation=True, |
| max_length=512, |
| padding='max_length' |
| ) |
| |
| return { |
| 'chosen_input_ids': chosen['input_ids'], |
| 'chosen_attention_mask': chosen['attention_mask'], |
| 'rejected_input_ids': rejected['input_ids'], |
| 'rejected_attention_mask': rejected['attention_mask'] |
| } |
| |
| dataset = load_dataset('json', data_files='data/preference_pairs.jsonl', split='train') |
| # preprocess concatenates strings for one example at a time, so map without batched=True |
| tokenized = dataset.map(preprocess, remove_columns=dataset.column_names) |
| |
| # Split |
| train_test = tokenized.train_test_split(test_size=0.1) |
| |
| # Initialize model |
| model = RewardModel("meta-llama/Llama-2-7b-hf") |
| |
| # Training arguments |
| training_args = TrainingArguments( |
| output_dir="outputs/reward_model_v1", |
| per_device_train_batch_size=16, |
| gradient_accumulation_steps=4, |
| learning_rate=1e-5, |
| num_train_epochs=3, |
| evaluation_strategy="steps", |
| eval_steps=100, |
| save_steps=500, |
| logging_steps=50, |
| fp16=True, |
| gradient_checkpointing=True, |
| remove_unused_columns=False, # keep the chosen_*/rejected_* columns that compute_loss reads |
| ) |
| |
| # Train |
| trainer = RewardTrainer( |
| model=model, |
| args=training_args, |
| train_dataset=train_test['train'], |
| eval_dataset=train_test['test'] |
| ) |
| |
| trainer.train() |
| trainer.save_model("outputs/reward_model_v1") |
| tokenizer.save_pretrained("outputs/reward_model_v1") |
| ``` |
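
To make the loss formula concrete, here is a small worked example in plain Python (no PyTorch) that evaluates `-log(sigmoid(reward_chosen - reward_rejected - margin))` for single pairs. The function name and the sample reward values are illustrative assumptions.

```python
import math

def pairwise_ranking_loss(chosen_reward: float, rejected_reward: float, margin: float = 0.1) -> float:
    """-log(sigmoid(chosen - rejected - margin)) for a single preference pair."""
    x = chosen_reward - rejected_reward - margin
    # -log(sigmoid(x)) = log(1 + exp(-x)); branch on the sign of x to avoid overflow
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

well_ranked = pairwise_ranking_loss(2.0, 0.0)   # chosen scored clearly higher
borderline = pairwise_ranking_loss(0.5, 0.4)    # gap equals the margin
mis_ranked = pairwise_ranking_loss(0.0, 2.0)    # rejected scored higher
print(round(well_ranked, 4), round(borderline, 4), round(mis_ranked, 4))
```

The loss shrinks as the chosen response outscores the rejected one by more than the margin, and it grows roughly linearly when the ranking is wrong. When the gap equals the margin exactly, the loss is `log(2)`.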
|
|
| ### Evaluating Reward Model |
|
|
| ```python |
| # scripts/evaluate_reward_model.py |
| import torch |
| from datasets import load_dataset |
| from transformers import AutoTokenizer |
| from src.reward_model import RewardModel |
| |
| def evaluate_reward_model(model_path, test_examples): |
| """Evaluate reward model on held-out test set""" |
| |
| model = RewardModel.from_pretrained(model_path) |
| tokenizer = AutoTokenizer.from_pretrained(model_path) |
| model.eval() |
| |
| correct = 0 |
| total = 0 |
| |
| for example in test_examples: |
| # Tokenize chosen and rejected |
| chosen_text = example['prompt'] + example['chosen'] |
| rejected_text = example['prompt'] + example['rejected'] |
| |
| chosen_inputs = tokenizer(chosen_text, return_tensors='pt', truncation=True, max_length=512) |
| rejected_inputs = tokenizer(rejected_text, return_tensors='pt', truncation=True, max_length=512) |
| |
| # Get rewards |
| with torch.no_grad(): |
| chosen_reward = model.compute_reward(**chosen_inputs).item() |
| rejected_reward = model.compute_reward(**rejected_inputs).item() |
| |
| # Check if model correctly prefers chosen |
| if chosen_reward > rejected_reward: |
| correct += 1 |
| total += 1 |
| |
| accuracy = correct / total |
| print(f"Reward Model Accuracy: {accuracy:.3f} ({correct}/{total})") |
| |
| return accuracy |
| |
| # Example usage |
| test_data = load_dataset('json', data_files='data/test_preferences.jsonl', split='train') |
| accuracy = evaluate_reward_model("outputs/reward_model_v1", list(test_data)) |
| ``` |
|
|
| --- |
|
|
| ## Section 3: PPO Training Loop |
|
|
| ### Understanding PPO for Language Models |
|
|
| Proximal Policy Optimization (PPO) optimizes the policy to maximize reward while staying close to the reference model: |
|
|
| **Objective:** |
| ``` |
| L_PPO = E[min(r_t * A_t, clip(r_t, 1-ε, 1+ε) * A_t)] |
| ``` |
|
|
| Where: |
| - `r_t = π_new(a|s) / π_old(a|s)` (probability ratio) |
| - `A_t` = Advantage estimate |
| - `ε` = clipping parameter (typically 0.2) |
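
A worked example of the clipped objective for single samples may help. This is a plain-Python sketch with illustrative numbers; the function name `clipped_surrogate` is an assumption.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage, ratio above 1 + eps: the gain is capped at (1 + eps) * A
capped_gain = clipped_surrogate(1.5, 1.0)
# Negative advantage, ratio below 1 - eps: the objective stays at (1 - eps) * A
capped_loss = clipped_surrogate(0.5, -1.0)
# Negative advantage, ratio above 1 + eps: min() keeps the unclipped, more pessimistic term
unclipped = clipped_surrogate(1.5, -1.0)
print(capped_gain, capped_loss, unclipped)
```

Taking the minimum makes the objective a pessimistic bound: the policy gets no extra credit for moving the ratio outside the clip range, but it is still penalized when such a move makes things worse.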
|
|
| ### PPO Configuration |
|
|
| ```yaml |
| # configs/ppo_training.yaml |
| policy: |
| model: "outputs/sft_model_v1" # Start from SFT model |
| ref_model: "outputs/sft_model_v1" # Reference model (frozen) |
| |
| reward: |
| model: "outputs/reward_model_v1" |
| kl_coeff: 0.1 # KL penalty coefficient |
| |
| data: |
| prompts_file: "data/ppo_prompts.jsonl" |
| max_prompt_length: 256 |
| max_response_length: 256 |
| |
| training: |
| # PPO-specific |
| ppo_epochs: 4 |
| mini_batch_size: 4 |
| ppo_clip_range: 0.2 |
| vf_clip_range: 0.2 |
| gamma: 1.0 |
| lam: 0.95 # GAE lambda |
| |
| # Generation |
| num_rollouts: 128 |
| temperature: 1.0 |
| top_k: 0 |
| top_p: 0.9 |
| |
| # Optimization |
| learning_rate: 1.0e-6 |
| gradient_accumulation_steps: 4 |
| |
| # KL control |
| init_kl_coeff: 0.1 |
| target_kl: 6.0 # Adaptive KL target |
| |
| fp16: true |
| |
| output: |
| output_dir: "outputs/rlhf_policy_v1" |
| run_name: "llama2-7b-rlhf-ppo" |
| ``` |
|
|
| ### PPO Trainer Implementation |
|
|
| ```python |
| # src/ppo_trainer.py |
| import torch |
| import torch.nn.functional as F |
| from typing import Dict, List, Optional |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| class PPOTrainer: |
| def __init__( |
| self, |
| policy_model, |
| ref_model, |
| reward_model, |
| tokenizer, |
| config |
| ): |
| self.policy = policy_model |
| self.ref_model = ref_model |
| self.reward_model = reward_model |
| self.tokenizer = tokenizer |
| self.config = config |
| |
| # Freeze reference model |
| for param in self.ref_model.parameters(): |
| param.requires_grad = False |
| |
| # KL coefficient |
| self.kl_coeff = config.kl_coeff |
| |
| def generate_rollouts(self, prompts, num_samples_per_prompt=1): |
| """Generate responses from current policy""" |
| |
| all_responses = [] |
| all_log_probs = [] |
| |
| for prompt in prompts: |
| # Tokenize prompt |
| inputs = self.tokenizer( |
| prompt, |
| return_tensors='pt', |
| truncation=True, |
| max_length=self.config.max_prompt_length |
| ).to(self.policy.device) |
| |
| # Generate response |
| with torch.no_grad(): |
| outputs = self.policy.generate( |
| **inputs, |
| max_new_tokens=self.config.max_response_length, |
| do_sample=True, |
| temperature=self.config.temperature, |
| top_p=self.config.top_p, |
| pad_token_id=self.tokenizer.eos_token_id, |
| return_dict_in_generate=True, |
| output_scores=True |
| ) |
| |
| # Extract generated tokens and log probs |
| generated_ids = outputs.sequences[0, inputs['input_ids'].shape[1]:] |
| log_probs = self._extract_log_probs(outputs.scores, generated_ids) |
| |
| all_responses.append(generated_ids) |
| all_log_probs.append(log_probs) |
| |
| return all_responses, all_log_probs |
| |
| def compute_rewards(self, prompts, responses): |
| """Compute rewards including KL penalty""" |
| |
| rewards = [] |
| |
| for prompt, response in zip(prompts, responses): |
| # Concatenate prompt + response |
| full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True) |
| inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device) |
| |
| # Get reward from reward model |
| with torch.no_grad(): |
| reward = self.reward_model.compute_reward(**inputs).item() |
| |
| # Compute KL divergence from reference model |
| kl_div = self._compute_kl_divergence(prompt, response) |
| |
| # Final reward with KL penalty |
| final_reward = reward - self.kl_coeff * kl_div |
| |
| rewards.append(final_reward) |
| |
| return torch.tensor(rewards) |
| |
| def _compute_kl_divergence(self, prompt, response): |
| """Compute KL divergence between policy and reference""" |
| |
| # Get log probs from both models |
| full_text = prompt + self.tokenizer.decode(response, skip_special_tokens=True) |
| inputs = self.tokenizer(full_text, return_tensors='pt').to(self.policy.device) |
| |
| with torch.no_grad(): |
| policy_logits = self.policy(**inputs).logits |
| ref_logits = self.ref_model(**inputs).logits |
| |
| policy_log_probs = F.log_softmax(policy_logits, dim=-1) |
| ref_log_probs = F.log_softmax(ref_logits, dim=-1) |
| |
| # KL(policy || reference) = sum_x p_policy(x) * (log p_policy(x) - log p_ref(x)) |
| kl = policy_log_probs.exp() * (policy_log_probs - ref_log_probs) |
| kl = kl.sum(dim=-1).mean() |
| |
| return kl.item() |
| |
| def ppo_update(self, rollouts, advantages): |
| """Perform PPO update on collected rollouts""" |
| |
| total_loss = 0 |
| |
| for epoch in range(self.config.ppo_epochs): |
| for minibatch in self._get_minibatches(rollouts, self.config.mini_batch_size): |
| |
| # Compute probability ratios |
| old_log_probs = minibatch['old_log_probs'] |
| new_log_probs = self._get_new_log_probs(minibatch) |
| |
| ratio = (new_log_probs - old_log_probs).exp() |
| |
| # Clipped surrogate objective |
| surr1 = ratio * advantages |
| surr2 = torch.clamp(ratio, 1-self.config.ppo_clip_range, |
| 1+self.config.ppo_clip_range) * advantages |
| |
| policy_loss = -torch.min(surr1, surr2).mean() |
| |
| # Value function loss |
| value_loss = self._compute_value_loss(minibatch) |
| |
| # Entropy bonus (encourages exploration) |
| entropy = self._compute_entropy(minibatch) |
| |
| # Total loss |
| loss = policy_loss + 0.5 * value_loss - 0.01 * entropy |
| |
| # Backward pass |
| loss.backward() |
| torch.nn.utils.clip_grad_norm_( |
| self.policy.parameters(), |
| self.config.max_grad_norm |
| ) |
| |
| # Optimizer step |
| self.optimizer.step() |
| self.optimizer.zero_grad() |
| |
| total_loss += loss.item() |
| |
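| # rollouts is a dict, so len(rollouts) would count its keys; average over optimizer steps instead |
| num_minibatches = -(-len(rollouts['responses']) // self.config.mini_batch_size) # ceiling division |
| return total_loss / (self.config.ppo_epochs * num_minibatches) |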
| |
| def train(self, prompts, num_iterations=100): |
| """Main PPO training loop""" |
| |
| for iteration in range(num_iterations): |
| print(f"\n=== Iteration {iteration + 1}/{num_iterations} ===") |
| |
| # 1. Generate rollouts |
| responses, log_probs = self.generate_rollouts(prompts) |
| |
| # 2. Compute rewards |
| rewards = self.compute_rewards(prompts, responses) |
| |
| # 3. Compute advantages (GAE) |
| advantages = self._compute_gae(rewards, log_probs) |
| |
| # 4. PPO update |
| rollouts = { |
| 'responses': responses, |
| 'log_probs': log_probs, |
| 'rewards': rewards |
| } |
| loss = self.ppo_update(rollouts, advantages) |
| |
| # 5. Logging |
| print(f"Iteration {iteration + 1}:") |
| print(f" Mean Reward: {rewards.mean().item():.3f}") |
| print(f" KL Divergence: {self._compute_mean_kl():.4f}") |
| print(f" PPO Loss: {loss:.4f}") |
| |
| # 6. Adaptive KL coefficient |
| self._update_kl_coeff() |
| |
| # 7. Save checkpoint |
| if (iteration + 1) % 10 == 0: |
| self.save_checkpoint(f"checkpoint_{iteration + 1}") |
| |
| def save_checkpoint(self, path): |
| """Save model checkpoint""" |
| self.policy.save_pretrained(path) |
| self.tokenizer.save_pretrained(path) |
| print(f"✓ Checkpoint saved to {path}") |
| ``` |
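
The training loop above calls `self._compute_gae(...)` without showing it. The following is a minimal sketch of Generalized Advantage Estimation for one finished trajectory, using the `gamma` and `lam` values from the PPO configuration. It assumes per-step value estimates from a value head, which the trainer above does not define, so the signature here is hypothetical.

```python
from typing import List

def compute_gae(rewards: List[float], values: List[float],
                gamma: float = 1.0, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one finished trajectory.

    `values` holds one value estimate per step; the value after the final
    step is taken to be 0 because the episode has ended.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the discounted sum from later steps
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Sparse reward: only the final token receives the reward-model score
advantages = compute_gae(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
print([round(a, 5) for a in advantages])
```

With `lam` below 1, credit for the final reward decays for earlier tokens, which trades a little bias for lower variance.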
|
|
| ### Complete RLHF Pipeline |
|
|
| ```python |
| # scripts/run_rlhf_pipeline.py |
| import torch |
| import yaml |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| from src.reward_model import RewardModel |
| from src.ppo_trainer import PPOTrainer |
| |
| def run_complete_rlhf(): |
| # Load configuration |
| with open('configs/ppo_training.yaml', 'r') as f: |
| config = yaml.safe_load(f) |
| |
| print("=" * 80) |
| print("STARTING COMPLETE RLHF PIPELINE") |
| print("=" * 80) |
| |
| # Step 1: Load SFT model (already trained) |
| print("\n[1/4] Loading SFT model...") |
| policy_model = AutoModelForCausalLM.from_pretrained( |
| config['policy']['model'], |
| torch_dtype=torch.float16, |
| device_map="auto" |
| ) |
| |
| ref_model = AutoModelForCausalLM.from_pretrained( |
| config['policy']['ref_model'], |
| torch_dtype=torch.float16, |
| device_map="auto" |
| ) |
| |
| tokenizer = AutoTokenizer.from_pretrained(config['policy']['model']) |
| if tokenizer.pad_token is None: |
| tokenizer.pad_token = tokenizer.eos_token |
| |
| print(f"✓ Loaded policy model from {config['policy']['model']}") |
| |
| # Step 2: Load reward model |
| print("\n[2/4] Loading reward model...") |
| reward_model = RewardModel.from_pretrained( |
| config['reward']['model'], |
| torch_dtype=torch.float16, |
| device_map="auto" |
| ) |
| reward_model.eval() |
| print(f"✓ Loaded reward model from {config['reward']['model']}") |
| |
| # Step 3: Load prompts for RLHF |
| print("\n[3/4] Loading prompts...") |
| prompts = load_prompts(config['data']['prompts_file']) |
| print(f"✓ Loaded {len(prompts)} prompts") |
| |
| # Step 4: Initialize PPO trainer |
| print("\n[4/4] Initializing PPO trainer...") |
| trainer = PPOTrainer( |
| policy_model=policy_model, |
| ref_model=ref_model, |
| reward_model=reward_model, |
| tokenizer=tokenizer, |
| config=config['training'] |
| ) |
| |
| # Step 5: Run PPO training |
| print("\n" + "=" * 80) |
| print("STARTING PPO TRAINING") |
| print("=" * 80) |
| |
| trainer.train( |
| prompts=prompts, |
| num_iterations=config['training'].get('num_iterations', 100) # not set in ppo_training.yaml; 100 matches train()'s default |
| ) |
| |
| # Step 6: Save final model |
| print("\n" + "=" * 80) |
| print("SAVING FINAL MODEL") |
| print("=" * 80) |
| |
| trainer.save_checkpoint(config['output']['output_dir']) |
| print(f"\n✓ RLHF complete! Model saved to {config['output']['output_dir']}") |
| |
| if __name__ == "__main__": |
| run_complete_rlhf() |
| ``` |
|
|
| --- |
|
|
| ## Section 4: Monitoring and Debugging RLHF |
|
|
| ### Key Metrics to Track |
|
|
| ```python |
| # src/rlhf_monitor.py |
| class RLHFMonitor: |
|     def __init__(self): |
|         # Keys must match the metric names passed to log_iteration() |
|         self.metrics_history = { |
|             'reward': [], |
|             'kl_divergence': [], |
|             'policy_loss': [], |
|             'value_loss': [], |
|             'entropy': [], |
|             'generation_length': [] |
|         } |
| |
|     def log_iteration(self, iteration, metrics): |
|         """Log metrics for each iteration""" |
| |
|         for key in self.metrics_history: |
|             if key in metrics: |
|                 self.metrics_history[key].append({ |
|                     'iteration': iteration, |
|                     'value': metrics[key] |
|                 }) |
| |
|         # Print summary |
|         print(f"\nIteration {iteration}:") |
|         print(f" Reward: {metrics.get('reward', 0):.3f}") |
|         print(f" KL Div: {metrics.get('kl_divergence', 0):.4f}") |
|         print(f" Policy Loss: {metrics.get('policy_loss', 0):.4f}") |
|         print(f" Entropy: {metrics.get('entropy', 0):.4f}") |
| |
|     def check_for_issues(self, metrics): |
|         """Detect common RLHF issues""" |
|         issues = [] |
| |
|         # KL divergence too high |
|         if metrics.get('kl_divergence', 0) > 10.0: |
|             issues.append("⚠️ High KL divergence - policy drifting too far") |
| |
|         # Reward hacking (high reward but low quality) |
|         if metrics.get('reward', 0) > 5.0 and metrics.get('kl_divergence', 0) > 5.0: |
|             issues.append("⚠️ Possible reward hacking detected") |
| |
|         # Collapsing entropy (skip the check when entropy was not reported) |
|         if 'entropy' in metrics and metrics['entropy'] < 0.1: |
|             issues.append("⚠️ Low entropy - policy may be collapsing") |
| |
|         # Negative rewards |
|         if metrics.get('reward', 0) < 0: |
|             issues.append("⚠️ Negative rewards - check reward model") |
| |
|         return issues |
| |
| |
| # Usage in training loop |
| monitor = RLHFMonitor() |
| |
| for iteration in range(num_iterations): |
|     # ... training code ... |
| |
|     metrics = { |
|         'reward': rewards.mean().item(), |
|         'kl_divergence': kl_div, |
|         'policy_loss': loss, |
|         'entropy': entropy.item() |
|     } |
| |
|     monitor.log_iteration(iteration, metrics) |
| |
|     issues = monitor.check_for_issues(metrics) |
|     for issue in issues: |
|         print(issue) |
| ``` |
|
|
| ### Common RLHF Issues and Solutions |
|
|
| #### Issue 1: KL Divergence Too High |
|
|
| **Symptoms:** |
| - KL > 10 nats |
| - Generated text becomes nonsensical |
| - Reward increases but quality decreases |
|
|
| **Solutions:** |
| ```yaml |
| # Increase KL penalty |
| kl_coeff: 0.2 # Increase from 0.1 |
| |
| # Use adaptive KL |
| target_kl: 6.0 |
| adaptive_kl: true |
| |
| # Lower learning rate |
| learning_rate: 5.0e-7 # Reduce from 1e-6 |
| ``` |
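
The `adaptive_kl` setting above, and the `_update_kl_coeff()` call in the PPO trainer, are not spelled out in this tutorial. One common approach is a proportional controller that scales the coefficient by the clipped relative error between observed and target KL. The sketch below assumes that approach; the class name and the `horizon` parameter are hypothetical and do not appear in the configs above.

```python
class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target KL."""

    def __init__(self, init_kl_coeff: float = 0.1, target_kl: float = 6.0, horizon: int = 10000):
        self.kl_coeff = init_kl_coeff
        self.target_kl = target_kl
        self.horizon = horizon  # assumed setting: larger values adapt more slowly

    def update(self, observed_kl: float, batch_size: int) -> float:
        # Relative error, clipped so one noisy batch cannot swing the coefficient
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.kl_coeff *= 1.0 + error * batch_size / self.horizon
        return self.kl_coeff

controller = AdaptiveKLController()
after_high_kl = controller.update(observed_kl=12.0, batch_size=128)  # KL above target
after_low_kl = controller.update(observed_kl=3.0, batch_size=128)    # KL below target
print(after_high_kl, after_low_kl)
```

The coefficient rises when the policy drifts past the target and falls when it stays well inside it, so the penalty tracks the observed KL.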
|
|
| #### Issue 2: Reward Hacking |
|
|
| **Symptoms:** |
| - Rewards increase rapidly |
| - Generated text exploits reward model quirks |
| - Human evaluators rate outputs poorly |
|
|
| **Solutions:** |
| ```yaml |
| # Stronger KL penalty |
| kl_coeff: 0.5 |
| |
| # Better reward model |
| # Retrain with more diverse examples |
| # Add adversarial examples |
| |
| # Limit generation length |
| max_response_length: 128 # Prevent verbose exploitation |
| |
| # Ensemble reward models |
| # Use multiple reward models and average |
| ``` |
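
The ensemble suggestion above can be sketched as follows. Plain averaging is what the text describes; subtracting a disagreement penalty is an added assumption for illustration, and setting `uncertainty_penalty=0` recovers the plain average. The function name is hypothetical.

```python
import statistics
from typing import List

def ensemble_reward(scores: List[float], uncertainty_penalty: float = 1.0) -> float:
    """Combine per-model reward scores as the mean minus a penalty on disagreement."""
    mean = statistics.fmean(scores)
    # Population standard deviation measures how much the reward models disagree
    spread = statistics.pstdev(scores) if len(scores) > 1 else 0.0
    return mean - uncertainty_penalty * spread

agreeing = ensemble_reward([1.0, 1.1, 0.9])   # models agree: close to the mean
outlier = ensemble_reward([4.0, 0.2, 0.1])    # one model scores far higher than the rest
print(round(agreeing, 4), round(outlier, 4))
```

A response that only one reward model scores highly is a typical sign of an exploited quirk, so discounting disagreement makes it less attractive to the policy.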
|
|
| #### Issue 3: Policy Collapse |
|
|
| **Symptoms:** |
| - Entropy drops to near zero |
| - Model generates same response repeatedly |
| - No diversity in outputs |
|
|
| **Solutions:** |
| ```yaml |
| # Add entropy bonus |
| entropy_coefficient: 0.05 # Increase exploration |
| |
| # Lower KL penalty temporarily |
| kl_coeff: 0.05 |
| |
| # Restart from earlier checkpoint |
| # Reduce learning rate |
| learning_rate: 5.0e-7 |
| ``` |
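
To give the entropy symptom a sense of scale, this plain-Python sketch computes the Shannon entropy of a single next-token distribution. The two distributions are made-up examples over a four-token vocabulary.

```python
import math
from typing import List

def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    # Terms with p == 0 contribute nothing and would break log(), so skip them
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = token_entropy([0.25, 0.25, 0.25, 0.25])
peaked = token_entropy([0.997, 0.001, 0.001, 0.001])
print(round(uniform, 4), round(peaked, 4))
```

The peaked distribution falls below the 0.1 threshold used by `check_for_issues`, which is the regime where the policy emits nearly the same token every time.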
|
|
| --- |
|
|
| ## Section 5: Evaluation After RLHF |
|
|
| ### Comprehensive Evaluation Suite |
|
|
| ```python |
| # scripts/evaluate_rlhf.py |
| import torch |
| from transformers import AutoModelForCausalLM, AutoTokenizer |
| |
| class RLHFEvaluator: |
| def __init__(self, model_path, baseline_path=None): |
| self.model = AutoModelForCausalLM.from_pretrained( |
| model_path, |
| torch_dtype=torch.float16, |
| device_map="auto" |
| ) |
| self.tokenizer = AutoTokenizer.from_pretrained(model_path) |
| |
| if baseline_path: |
| self.baseline = AutoModelForCausalLM.from_pretrained( |
| baseline_path, |
| torch_dtype=torch.float16, |
| device_map="auto" |
| ) |
| else: |
| self.baseline = None |
| |
| def evaluate_helpfulness(self, test_prompts): |
| """Evaluate helpfulness on standard benchmarks""" |
| |
| results = [] |
| |
| for prompt in test_prompts: |
| response = self.generate(prompt, max_new_tokens=256) |
| |
| # Rate helpfulness (in practice, use human eval or LLM judge) |
| helpfulness_score = self._rate_helpfulness(prompt, response) |
| |
| results.append({ |
| 'prompt': prompt, |
| 'response': response, |
| 'helpfulness': helpfulness_score |
| }) |
| |
| avg_helpfulness = sum(r['helpfulness'] for r in results) / len(results) |
| return {'avg_helpfulness': avg_helpfulness, 'samples': results} |
| |
| def evaluate_safety(self, safety_prompts): |
| """Evaluate safety on adversarial prompts""" |
| |
| safe_count = 0 |
| |
| for prompt in safety_prompts: |
| response = self.generate(prompt, max_new_tokens=256) |
| |
| # Check for harmful content |
| is_safe = self._check_safety(response) |
| |
| if is_safe: |
| safe_count += 1 |
| |
| safety_rate = safe_count / len(safety_prompts) |
| return {'safety_rate': safety_rate} |
| |
| def compare_with_baseline(self, comparison_prompts): |
| """Compare RLHF model with baseline""" |
| |
| wins = 0 |
| losses = 0 |
| ties = 0 |
| |
| for prompt in comparison_prompts: |
| rlhf_response = self.generate(prompt) |
| baseline_response = self.generate(prompt, model=self.baseline) |
| |
| # Compare (use human eval or LLM judge) |
| result = self._compare_responses(rlhf_response, baseline_response) |
| |
| if result == 'rlhf_wins': |
| wins += 1 |
| elif result == 'baseline_wins': |
| losses += 1 |
| else: |
| ties += 1 |
| |
| return { |
| 'wins': wins, |
| 'losses': losses, |
| 'ties': ties, |
| 'win_rate': wins / len(comparison_prompts) |
| } |
| |
| def generate(self, prompt, max_new_tokens=256, model=None): |
| """Generate response""" |
| if model is None: |
| model = self.model |
| |
| inputs = self.tokenizer(prompt, return_tensors='pt').to(model.device) |
| |
| with torch.no_grad(): |
| outputs = model.generate( |
| **inputs, |
| max_new_tokens=max_new_tokens, |
| do_sample=True, |
| temperature=0.7, |
| top_p=0.9, |
| pad_token_id=self.tokenizer.eos_token_id |
| ) |
| |
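| # Decode only the newly generated tokens so the prompt is not echoed back |
| response_ids = outputs[0][inputs['input_ids'].shape[1]:] |
| return self.tokenizer.decode(response_ids, skip_special_tokens=True) |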
| |
| def _rate_helpfulness(self, prompt, response): |
| """Rate helpfulness (placeholder - use LLM judge in practice)""" |
| # In production, use trained judge or human evaluation |
| return 0.8 # Placeholder |
| |
| def _check_safety(self, response): |
| """Check if response is safe""" |
| # In production, use safety classifier |
| harmful_keywords = ['hate', 'violence', 'illegal'] |
| return not any(word in response.lower() for word in harmful_keywords) |
| |
| def _compare_responses(self, response1, response2): |
| """Compare two responses""" |
| # In production, use pairwise comparison |
| return 'rlhf_wins' # Placeholder |
| |
| # Usage |
| evaluator = RLHFEvaluator( |
| model_path="outputs/rlhf_policy_v1", |
| baseline_path="outputs/sft_model_v1" |
| ) |
| |
| # Evaluate |
| helpfulness = evaluator.evaluate_helpfulness(test_prompts) |
| safety = evaluator.evaluate_safety(safety_prompts) |
| comparison = evaluator.compare_with_baseline(eval_prompts) |
| |
| print(f"Helpfulness: {helpfulness['avg_helpfulness']:.3f}") |
| print(f"Safety: {safety['safety_rate']:.3f}") |
| print(f"Win Rate vs Baseline: {comparison['win_rate']:.3f}") |
| ``` |
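
A raw win rate from a few hundred comparisons can be misleading, so it helps to report an interval alongside it. The sketch below computes a Wilson score interval. Unlike `compare_with_baseline` above, it excludes ties from the denominator; that choice and the function name are assumptions made for illustration.

```python
import math
from typing import Tuple

def win_rate_interval(wins: int, losses: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Win rate over decisive comparisons (ties excluded) with a Wilson score interval."""
    n = wins + losses
    if n == 0:
        raise ValueError("need at least one decisive comparison")
    p = wins / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, center - half_width, center + half_width

rate, low, high = win_rate_interval(wins=60, losses=40)
print(f"win rate {rate:.2f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the lower bound sits at or below 0.5, the comparison set is too small to conclude that the RLHF model beats the baseline.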
|
|
| --- |
|
|
| ## Summary |
|
|
| The RLHF pipeline consists of four steps: |
|
|
| 1. **Collect preference data**: High-quality human comparisons |
| 2. **Train reward model**: Learn to predict human preferences |
| 3. **PPO training**: Optimize policy with reward + KL constraint |
| 4. **Evaluate**: Comprehensive testing for helpfulness and safety |
|
|
| ### RLHF vs Alternatives |
|
|
| | Method | Complexity | Data Needed | Performance | Best For | |
| |--------|-----------|-------------|-------------|----------| |
| | SFT only | Low | Demonstrations | Good | Basic tasks | |
| | RLHF (PPO) | High | Preferences | Best | Production alignment | |
| | DPO | Medium | Preferences | Very Good | Fast iteration | |
| | SimPO | Medium | Preferences | Very Good | Resource-constrained | |
|
|
| --- |
|
|
| ## Exercises |
|
|
| ### Beginner |
| 1. Train reward model on 1000 preference pairs |
| 2. Evaluate reward model accuracy |
| 3. Generate samples and manually inspect quality |
|
|
| ### Intermediate |
| 1. Implement complete PPO training loop |
| 2. Tune KL coefficient for stable training |
| 3. Compare RLHF vs SFT-only on evaluation set |
|
|
| ### Advanced |
| 1. Implement adaptive KL coefficient |
| 2. Build ensemble of reward models |
| 3. Deploy RLHF model with monitoring |
| 4. Experiment with different PPO hyperparameters |
|
|
| --- |
|
|
| ## Next Steps |
|
|
| - Tutorial 07: Advanced Reward Modeling Techniques |
| - Tutorial 08: Continual Learning and Catastrophic Forgetting |
| - Tutorial 09: Production Deployment and Scaling |
| - Tutorial 10: Multi-Modal Training and Extensions |
|
|