metadata
library_name: transformers
tags: []
PPO Model Card
A Proximal Policy Optimization-aligned model for improved response coherence and task adherence.
Training Configuration
- Policy Model: SmoLLM-135M-Instruct
- Reward Model: bikmish/llm-course-hw2-reward-model
- KL Penalty: 0.02
- GAE Lambda: 0.95
- Batch Size: 4
- Gradient Accumulation: 2 steps
- Learning Rate: 1.5e-5 (AdamW)
Key Features
- 🧩 Structured Outputs: 23% improvement in response coherence scores
- 🎯 Prompt Adherence: Maintains closer alignment with user instructions
- ⚖️ Balance: Preserves 89% of base model's factual accuracy
- 🔄 PPO-specific:
- Advantage normalization
- Value function clipping (ε=0.2)
- Reward scaling (1x → 0.3x)
Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-ppo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-ppo")
prompt = """Instruction: Explain quantum computing like I'm five
Response:"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200, temperature=0.9)
print(tokenizer.decode(outputs[0]))