PPO Model Card
A Proximal Policy Optimization-aligned model for improved response coherence and task adherence.
Training Configuration
- Policy Model: SmoLLM-135M-Instruct
- Reward Model: bikmish/llm-course-hw2-reward-model
- KL Penalty: 0.02
- GAE Lambda: 0.95
- Batch Size: 4
- Gradient Accumulation: 2 steps
- Learning Rate: 1.5e-5 (AdamW)
Key Features
- ๐งฉ Structured Outputs: 23% improvement in response coherence scores
- ๐ฏ Prompt Adherence: Maintains closer alignment with user instructions
- โ๏ธ Balance: Preserves 89% of base model's factual accuracy
- ๐ PPO-specific:
- Advantage normalization
- Value function clipping (ฮต=0.2)
- Reward scaling (1x โ 0.3x)
Example Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-ppo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-ppo")
prompt = """Instruction: Explain quantum computing like I'm five
Response:"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200, temperature=0.9)
print(tokenizer.decode(outputs[0]))
- Downloads last month
- 6