PPO Model Card

A Proximal Policy Optimization-aligned model for improved response coherence and task adherence.

Training Configuration

Policy Model: SmoLLM-135M-Instruct
Reward Model: bikmish/llm-course-hw2-reward-model
KL Penalty: 0.02
GAE Lambda: 0.95
Batch Size: 4
Gradient Accumulation: 2 steps
Learning Rate: 1.5e-5 (AdamW)

Key Features

🧩 Structured Outputs: 23% improvement in response coherence scores
🎯 Prompt Adherence: Maintains closer alignment with user instructions
⚖️ Balance: Preserves 89% of base model's factual accuracy
🔄 PPO-specific:
- Advantage normalization
- Value function clipping (ε=0.2)
- Reward scaling (1x → 0.3x)

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-ppo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-ppo")

prompt = """Instruction: Explain quantum computing like I'm five
Response:"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200, temperature=0.9)
print(tokenizer.decode(outputs[0]))

Downloads last month: 2

Safetensors

Model size

0.1B params

Tensor type

F32