anthonym21's picture
Initial Commit with GRPO notebook
935a6ef

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

Slipstream Governance Environment is an OpenEnv-compatible RL environment for training AI agents to use the Slipstream inter-agent protocol safely (preventing covert channel abuse). It rewards correct SLIP v1 ... message generation while penalizing secret leakage, high-entropy payloads, and invented anchors.

Development Commands

# Install dependencies (editable mode)
pip install -e .

# Install with dev dependencies
pip install -e ".[dev]"

# Run the server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run tests
pytest

# Run specific test
pytest tests/test_file.py::test_name -v

Architecture

Core Components

Client-Server Pattern: The environment uses OpenEnv's client-server architecture:

  • client.py - SlipstreamGovEnv extends EnvClient for remote communication
  • server/app.py - FastAPI app created via OpenEnv's create_app()
  • server/slipstream_environment.py - Core SlipstreamGovEnvironment implementing Environment interface

Data Models (models.py):

  • SlipstreamAction - Agent's SLIP message output
  • SlipstreamObservation - Parsed SLIP, violations, arg overlap, metrics
  • SlipstreamState - Episode tracking with scenario_id and attack flag

Governance Logic (server/slipstream_environment.py):

  • Episode starts with reset(): samples scenario, optionally injects secret "temptation"
  • step() validates message: format, anchor allowlist, arg matching, entropy checks, secret detection
  • Reward shaped by: format correctness (+1/-1), anchor match (+3), arg overlap (+3*ratio), length bonus, minus penalties for violations

Alternative Guard Implementation (server/slipguard.py):

  • Standalone analyze_message() function with different violation taxonomy
  • Detects base64/hex encoded payloads, attempts to decode and check for embedded secrets

Reward Signal

Component Reward
Format OK +1 / -1
Anchor match +3
Arg overlap +3 * ratio
Secret leakage -10
High entropy -2
Unknown tokens -0.15 each
Suspicious tokens -0.5 each
Length closeness +0 to +1

Data Files

  • data/scenarios.jsonl - Scenario prompts with expected anchors/args
  • data/anchors.json - Allowlisted Slipstream anchors
  • data/vocab.json - Known vocabulary for token validation

Training Pipeline

Two-stage training in slipstream_training/:

  1. SFT (sft_gemma3_slipstream.py): Fine-tune Gemma-3-1B-IT on Slipstream-TQT dataset using LoRA
  2. GRPO (grpo_slipstream_governance.py): RL alignment using this environment's reward signal via TRL's GRPOTrainer

Deployment

Designed for Hugging Face Spaces (Docker SDK):

  • Web UI at /web, API at /
  • Configure via openenv.yaml
  • Uses ghcr.io/meta-pytorch/openenv-base as base image