PPO Model Card

A Proximal Policy Optimization-aligned model for improved response coherence and task adherence.

Training Configuration

  • Policy Model: SmoLLM-135M-Instruct
  • Reward Model: bikmish/llm-course-hw2-reward-model
  • KL Penalty: 0.02
  • GAE Lambda: 0.95
  • Batch Size: 4
  • Gradient Accumulation: 2 steps
  • Learning Rate: 1.5e-5 (AdamW)

Key Features

  • ๐Ÿงฉ Structured Outputs: 23% improvement in response coherence scores
  • ๐ŸŽฏ Prompt Adherence: Maintains closer alignment with user instructions
  • โš–๏ธ Balance: Preserves 89% of base model's factual accuracy
  • ๐Ÿ”„ PPO-specific:
    • Advantage normalization
    • Value function clipping (ฮต=0.2)
    • Reward scaling (1x โ†’ 0.3x)

Example Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("bikmish/llm-course-hw2-ppo")
tokenizer = AutoTokenizer.from_pretrained("bikmish/llm-course-hw2-ppo")

prompt = """Instruction: Explain quantum computing like I'm five
Response:"""
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=200, temperature=0.9)
print(tokenizer.decode(outputs[0]))
Downloads last month
6
Safetensors
Model size
0.1B params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support