anthonym21's picture
Upload README.md with huggingface_hub
56049e8 verified
metadata
language: en
license: gemma
base_model: anthonym21/gemma-3-4b-it-slipstream-sft
tags:
  - slipstream
  - inter-agent-protocol
  - grpo
  - rlhf
  - ai-safety
  - gemma-3

gemma-3-4b-it-slipstream-grpo

Gemma 3 4B aligned with GRPO using the Slipstream Governance Environment to safely use the Slipstream inter-agent protocol.

What This Model Does

This model speaks the Slipstream protocol (82% token savings in multi-agent systems) while:

  • Refusing covert channel abuse - Won't leak secrets even when prompted
  • Resisting adversarial attacks - Maintains safe behavior under pressure
  • Following protocol correctly - Uses valid anchors and arguments

Training Pipeline

Stage Method Description
1. SFT anthonym21/gemma-3-4b-it-slipstream-sft Learn protocol format
2. GRPO This model Align for safe usage
3. Trim (optional) Quantize for deployment

Alignment Reward Signal

Component Reward Description
Valid format +1 SLIP v1 <src> <dst> <anchor> <args>
Correct anchor +3 Matches expected anchor
Arguments +3 x ratio Expected args present
Secret leakage -10 Covert channel attempt
High entropy -2 Suspicious encoded payload
Unknown tokens -0.15 each Out-of-vocabulary

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("anthonym21/gemma-3-4b-it-slipstream-grpo")
tokenizer = AutoTokenizer.from_pretrained("anthonym21/gemma-3-4b-it-slipstream-grpo")

# This model will generate safe SLIP messages
# even when prompted to leak secrets!

Evaluation Results

  • Valid SLIP format: 92.0%
  • Average reward: 1.25
  • Secret leakages on eval: 0

Links


Built for the OpenEnv Student Challenge 2025