title: FinePrint-Env
emoji: 📜
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - policy-compliance
  - drift-detection
  - customer-service
pinned: false

FinePrint-Env: Consumer Policy Drift Detection Environment

Live Demo & API | Training Notebook (Colab)

Overview

FinePrint-Env is a reinforcement learning environment where AI agents learn to detect policy changes and maintain compliance in customer service workflows. Built for the Meta PyTorch OpenEnv Hackathon x Scaler School of Technology, it provides a realistic simulation of policy drift scenarios ranging from simple quoting to adversarial multi-version silent drift.

Motivation

  • Policies change constantly: pricing, return windows, and subscription terms shift weekly. An agent quoting a return policy that was updated 10 minutes ago creates legal and financial liability.
  • No existing RL environment tests drift detection. FinePrint-Env fills this gap with 8 policy versions, 5 customer workflows, and deterministic compliance grading.
  • 70% of drifts are silent: no system notification is sent. The agent must learn to detect drift from user-level signals and staleness alone.

The Problem

Production LLMs assume static knowledge. In reality, policies, pricing, and rules change constantly. An agent quoting a return policy that was updated 10 minutes ago creates legal and financial liability. No existing benchmark tests or trains this capability.

The Solution

FinePrint teaches models a single critical meta-skill: when to call request_verification(). This binary decision separates safe agents from dangerous ones. Rather than memorizing policies, the model learns to recognize drift signals (user contradictions, staleness, system notifications) and re-ground itself before responding.

Why Not Just RAG? Why Not Agentic Workflows?

This is the question everyone asks. Here's why neither solves the actual problem:

RAG (Retrieval-Augmented Generation)

RAG retrieves fresh documents at query time. Sounds perfect, until you realize:

  • RAG doesn't know when to retrieve. It either retrieves every time (wasteful, slow, expensive) or relies on a fixed schedule (misses urgent changes). There's no learned judgment about staleness.
  • RAG has no concept of drift severity. A return window changing from 30 → 14 days is catastrophic. A FAQ typo fix is irrelevant. RAG treats both the same: it just fetches.
  • RAG doesn't penalize stale answers. If the retriever returns a cached/stale chunk, the model quotes it confidently. There's no feedback loop teaching it that "this information might be outdated."
  • RAG is reactive, not proactive. It responds to queries. It never says "wait, I should double-check this before answering". That is a learned meta-skill, not a retrieval pattern.

Agentic Workflows (Tool-Using LLMs)

Agents with tools can call APIs, search databases, and verify information. But:

  • Tool availability ≠ tool wisdom. Giving a model a verify_policy() tool doesn't mean it knows when to call it. Without training, agents either never verify (dangerous) or verify every step (unusable in production).
  • No reward signal for drift detection. Agentic frameworks like LangChain/CrewAI provide tools but no RL reward for using them at the right moment. The agent has no incentive to develop timing intuition.
  • Hardcoded verification rules are brittle. You could write if steps_since_verify > 5: verify(), but that's a heuristic, not intelligence. It doesn't adapt to context (high-stakes question vs casual chat).
  • No benchmark exists to measure this. How do you evaluate whether your agent verifies at the right time? There's no leaderboard, no graded task, no compliance score. You just hope it works.
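
To make the brittleness of the fixed-threshold rule from the list above concrete, here is a minimal sketch. The function name `should_verify_fixed` and the threshold of 5 are illustrative and are not part of FinePrint.

```python
def should_verify_fixed(steps_since_verify: int, threshold: int = 5) -> bool:
    """Hardcoded rule: verify only once the cache is older than `threshold` steps."""
    return steps_since_verify > threshold


# The rule sees only the step counter, so a high-stakes refund question and
# casual small talk get the same decision at the same staleness.
for context in ("customer asks about refund eligibility", "customer says thanks"):
    print(context, "->", should_verify_fixed(steps_since_verify=3))
```

Both contexts produce the same answer because the rule has no input that describes the conversation, which is the gap a learned verification policy is meant to close.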

What FinePrint Actually Does Differently

FinePrint doesn't retrieve documents or provide tools: it trains the judgment layer that sits above both.

| Capability | RAG | Agentic | FinePrint |
| --- | --- | --- | --- |
| Access to fresh data | ✅ retrieves | ✅ tools | ✅ request_verification() |
| Knows when to refresh | ❌ always/never | ❌ hardcoded | ✅ learned via RL |
| Drift severity awareness | ❌ | ❌ | ✅ reward-weighted |
| Penalizes stale answers | ❌ | ❌ | ✅ -8.0 per stale quote |
| Trains verification timing | ❌ | ❌ | ✅ +3.0 timely, +1.0 late |
| Graded compliance tasks | ❌ | ❌ | ✅ 3 difficulty levels |
| Works with RAG/agents | n/a | n/a | ✅ trains the meta-skill they lack |

The insight: RAG and agentic workflows solve access to fresh information. FinePrint solves judgment about when that access matters. They're complementary: FinePrint trains the decision layer that makes RAG and tool-use actually safe.

Environment Description

The environment simulates a customer service agent handling consumer workflows (shopping, returns, subscriptions, bookings, complaints) while company policies change silently in the background. The agent must use available commands to inspect policies, quote values accurately, detect drift, and maintain compliance.

Action Space

| Command | Arguments | Description |
| --- | --- | --- |
| view_policies | (none) | View currently cached policy values |
| view_workflow | (none) | View current workflow state and conversation |
| check_compliance | (none) | Check current compliance status |
| request_verification | (none) | Refresh policy cache and detect drift |
| quote_policy | policy_field, quoted_value | Quote a specific policy field to customer |
| respond_to_user | message | Send a general message to the customer |
| take_action | message | Process a workflow action (checkout, return, etc.) |
| escalate | message | Escalate to supervisor (only when drift detected) |
| abort_workflow | message | Abort current workflow (only when justified) |
| clarify | message | Ask customer for clarification |
| submit | (none) | Submit for final grading |
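
As a rough illustration of how the commands listed above map onto request bodies, the sketch below validates a command name and its arguments before wrapping them in the `{"action": {"command": ..., "args": ...}}` shape used in the API Usage section. `COMMAND_ARGS` and `build_action` are hypothetical helpers, not part of the FinePrint codebase, and the server may accept or reject arguments differently.

```python
from typing import Any

# Commands and their argument names, copied from the action list above.
COMMAND_ARGS: dict[str, tuple[str, ...]] = {
    "view_policies": (),
    "view_workflow": (),
    "check_compliance": (),
    "request_verification": (),
    "quote_policy": ("policy_field", "quoted_value"),
    "respond_to_user": ("message",),
    "take_action": ("message",),
    "escalate": ("message",),
    "abort_workflow": ("message",),
    "clarify": ("message",),
    "submit": (),
}


def build_action(command: str, **args: Any) -> dict[str, Any]:
    """Hypothetical helper: validate a command and wrap it as a /step payload."""
    if command not in COMMAND_ARGS:
        raise ValueError(f"unknown command: {command}")
    expected = set(COMMAND_ARGS[command])
    if set(args) != expected:
        raise ValueError(
            f"{command} expects arguments {sorted(expected)}, got {sorted(args)}"
        )
    return {"action": {"command": command, "args": dict(args)}}


payload = build_action(
    "quote_policy", policy_field="return.window_days", quoted_value="30"
)
print(payload)
```

Validating on the client side keeps malformed actions from consuming one of the limited episode steps.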

Observation Space

Each step returns an observation containing:

  • output -- Command result text (policy values, compliance status, workflow state, etc.)
  • task_description -- Current task description and objectives
  • workflow_names -- List of available workflows
  • available_commands -- Available actions the agent can take
  • done -- Whether the episode is complete
  • reward -- Score (0.0--1.0) returned on submission
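
The field list above can be mirrored in a small typed container on the client side. This is only a sketch: the `Observation` dataclass and `parse_observation` function are hypothetical (the project's own Pydantic models live in models.py and may differ), and treating `reward` as absent before submission is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Observation:
    """Hypothetical container mirroring the observation fields listed above."""
    output: str
    task_description: str
    workflow_names: list[str]
    available_commands: list[str]
    done: bool
    reward: Optional[float] = None  # assumed to be missing or null before submission


def parse_observation(raw: dict[str, Any]) -> Observation:
    """Build an Observation from a decoded JSON response, with defensive defaults."""
    return Observation(
        output=raw.get("output", ""),
        task_description=raw.get("task_description", ""),
        workflow_names=list(raw.get("workflow_names", [])),
        available_commands=list(raw.get("available_commands", [])),
        done=bool(raw.get("done", False)),
        reward=raw.get("reward"),
    )


obs = parse_observation({"output": "return.window_days: 30", "done": False})
print(obs.done, obs.reward)
```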

Tasks

Task 1: quote_accuracy (Easy)

Quote policies correctly across shop and return workflows with no drift.

  • Expected difficulty: Easy
  • Max steps: 20

Task 2: drift_detection (Medium)

Handle 3 workflows while detecting policy changes. 30% drift probability with 50% silent ratio.

  • Expected difficulty: Medium
  • Max steps: 30

Task 3: compliance_storm (Hard)

All 5 workflows under aggressive silent drift across 8 policy versions. 50% drift probability with 80% silent ratio.

  • Expected difficulty: Hard
  • Max steps: 45
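
The three tasks above differ only in a handful of numbers, which the following sketch collects in one place. `TaskSpec` and its field names are illustrative, not the structures defined in server/tasks.py. The helper assumes the silent ratio is the share of triggered drifts that arrive without a notification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical summary of one graded task; field names are illustrative."""
    task_id: str
    difficulty: str
    max_steps: int
    drift_probability: float
    silent_ratio: float


TASKS = {
    spec.task_id: spec
    for spec in (
        TaskSpec("quote_accuracy", "easy", 20, 0.0, 0.0),  # no drift
        TaskSpec("drift_detection", "medium", 30, 0.30, 0.50),
        TaskSpec("compliance_storm", "hard", 45, 0.50, 0.80),
    )
}


def silent_drift_probability(spec: TaskSpec) -> float:
    """Chance that a drift is both triggered and unannounced (assumed product)."""
    return spec.drift_probability * spec.silent_ratio


for spec in TASKS.values():
    print(spec.task_id, round(silent_drift_probability(spec), 2))
```

Under that assumption, the hard task produces unannounced drift at 0.5 × 0.8 = 0.4, compared with 0.3 × 0.5 = 0.15 for the medium task.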

Reward Function

score = 0.3 * compliance_accuracy + 0.5 * workflow_completion + 0.2 * drift_responsiveness

| Component | Weight | Description |
| --- | --- | --- |
| Compliance accuracy | 0.3 | Proportion of policy quotes that are correct |
| Workflow completion | 0.5 | Proportion of workflows completed |
| Drift responsiveness | 0.2 | Proportion of drifts detected via verification |
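
A short worked example of the weighted sum above, written as a sketch rather than the grader's actual code. The function name `final_score` and the range check are assumptions; the weights come from the formula.

```python
def final_score(compliance_accuracy: float, workflow_completion: float,
                drift_responsiveness: float) -> float:
    """Weighted sum from the formula above; each input is a proportion in [0, 1]."""
    for value in (compliance_accuracy, workflow_completion, drift_responsiveness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return (0.3 * compliance_accuracy
            + 0.5 * workflow_completion
            + 0.2 * drift_responsiveness)


# 4 of 5 quotes correct, 1 of 2 workflows completed, 1 of 1 drifts detected:
# 0.3 * 0.8 + 0.5 * 0.5 + 0.2 * 1.0 = 0.24 + 0.25 + 0.20 = 0.69
print(round(final_score(0.8, 0.5, 1.0), 2))
```

Because workflow completion carries half the weight, an agent that verifies perfectly but finishes no workflows cannot score above 0.5.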

Step-Level Rewards (14 signals: 7 positive, 7 negative)

| Event | Reward |
| --- | --- |
| Correct policy quote | +10.0 |
| Timely drift detection (≤ 2 steps) | +3.0 |
| Late drift detection (> 2 steps) | +1.0 |
| Freshness bonus (verified ≤ 2 steps ago) | +1.0 |
| High user satisfaction | +2.0 |
| Zero compliance failures (terminal) | +20.0 |
| Stale policy citation (HIGH severity) | −8.0 |
| Incorrect value quoted | −4.0 |
| User satisfaction < 0.3 | −5.0 |
| Unnecessary escalation | −4.0 |
| Unnecessary abort | −3.0 |
| Unnecessary verification | −0.5 |
| Any compliance failure (terminal) | −30.0 |
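
To see how these signals trade off, the sketch below totals the rewards for a short episode. It assumes the step rewards are simply added together and that stale citations and incorrect quotes are what count as compliance failures; the event keys and the `episode_return` function are hypothetical and do not come from fineprint/rewards.py.

```python
# Per-event rewards copied from the list above; the event keys are hypothetical labels.
STEP_REWARDS = {
    "correct_quote": 10.0,
    "timely_drift_detection": 3.0,
    "late_drift_detection": 1.0,
    "freshness_bonus": 1.0,
    "high_user_satisfaction": 2.0,
    "stale_citation_high": -8.0,
    "incorrect_value": -4.0,
    "low_user_satisfaction": -5.0,
    "unnecessary_escalation": -4.0,
    "unnecessary_abort": -3.0,
    "unnecessary_verification": -0.5,
}
TERMINAL_CLEAN = 20.0     # zero compliance failures
TERMINAL_FAILURE = -30.0  # any compliance failure

# Assumption: these two events are the ones that count as compliance failures.
FAILURE_EVENTS = {"stale_citation_high", "incorrect_value"}


def episode_return(events: list[str]) -> float:
    """Sum step rewards, then add the terminal bonus or penalty."""
    total = sum(STEP_REWARDS[event] for event in events)
    failed = any(event in FAILURE_EVENTS for event in events)
    return total + (TERMINAL_FAILURE if failed else TERMINAL_CLEAN)


# Verify in time, then quote while fresh: 3 + 1 + 10 + 20 = 34
print(episode_return(["timely_drift_detection", "freshness_bonus", "correct_quote"]))
# Quote a stale value without verifying: -8 - 30 = -38
print(episode_return(["stale_citation_high"]))
```

The asymmetry is the point: one unnecessary verification costs 0.5, while one stale quote costs 38 under these assumptions, so verifying when in doubt is the cheaper mistake.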

Setup & Usage

Local Development

pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload

Docker

docker build -t fineprint-env .
docker run -p 7860:7860 fineprint-env

API Usage

import requests

BASE = "http://localhost:7860"

# Reset with a specific task
obs = requests.post(f"{BASE}/reset", json={"options": {"task_id": "quote_accuracy"}}).json()

# View policies
obs = requests.post(f"{BASE}/step", json={"action": {"command": "view_policies", "args": {}}}).json()
print(obs["output"])

# Quote a policy
obs = requests.post(f"{BASE}/step", json={
    "action": {"command": "quote_policy", "args": {"policy_field": "return.window_days", "quoted_value": "30"}}
}).json()

# Submit for grading
obs = requests.post(f"{BASE}/step", json={"action": {"command": "submit", "args": {}}}).json()
print(f"Score: {obs['reward']}")

Python Client

from client import FinePrintClient

client = FinePrintClient(base_url="http://localhost:7860")
client.reset(task_id="drift_detection")
obs = client.step("view_policies")
obs = client.step("quote_policy", policy_field="return.window_days", quoted_value="30")
obs = client.step("submit")

Gymnasium Interface (standalone)

import gymnasium as gym

env = gym.make("FinePrint-v0")
obs, info = env.reset(seed=42)

action = {"action_type": 0}  # request_verification
obs, reward, terminated, truncated, info = env.step(action)

Baseline Scores

| Task | Score | Steps |
| --- | --- | --- |
| quote_accuracy | ~0.80 | 8--12 |
| drift_detection | ~0.55 | 15--20 |
| compliance_storm | ~0.25 | 25--35 |

Running the Baseline

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export ENV_URL="http://localhost:7860"
python inference.py

Policy Drift

Eight policy versions are composed via delta merging: each version overrides specific fields from the base while inheriting the rest.

| Version | Change | Severity |
| --- | --- | --- |
| v1_base | Baseline policies | n/a |
| v2_return_change | Return window 30 → 14 days, refund → store credit | HIGH |
| v3_shipping_change | Free shipping threshold $50 → $75 | MEDIUM |
| v4_subscription_change | Auto-renewal: off → mandatory | HIGH |
| v5_cancellation_fee | Booking cancellation fee $0 → $25 | MEDIUM |
| v6_complaint_change | Max compensation $200 → $50, escalation removed | HIGH |
| v7_scope_change | Electronics returns eliminated, price match removed | CRITICAL |
| v8_pricing_change | Tax included in price, bulk discount removed | MEDIUM |
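
Delta merging can be sketched as a recursive dictionary merge in which a version file lists only the fields it changes. This is an illustration of the idea, not the code in fineprint/policies.py; the function `merge_delta`, the nested layout, and key names such as `refund_method` and `free_threshold_usd` are assumptions. Only the values come from the version list above.

```python
import copy
from typing import Any


def merge_delta(base: dict[str, Any], delta: dict[str, Any]) -> dict[str, Any]:
    """Return a new policy dict: `delta` overrides `base`, nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_delta(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative values taken from the version list; the key names are assumptions.
v1_base = {
    "return": {"window_days": 30, "refund_method": "refund"},
    "shipping": {"free_threshold_usd": 50},
}
v2_return_change = {"return": {"window_days": 14, "refund_method": "store_credit"}}

v2 = merge_delta(v1_base, v2_return_change)
print(v2["return"]["window_days"], v2["shipping"]["free_threshold_usd"])
```

The merge copies the base instead of mutating it, so every version can be rebuilt from `v1_base` plus its own delta.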

Drift Signals

The agent receives 4 types of signals that policies may have changed:

| Signal | Explicitness | Example |
| --- | --- | --- |
| System notification | Explicit | "POLICY UPDATE: Version v3 is now active" |
| User contradiction | Implicit | "But the website says 14 days, not 30..." |
| User confusion | Implicit | "That doesn't match what I was told" |
| Staleness counter | Passive | Steps since last request_verification() |

In the default training configuration, 70% of drifts are silent (silent drift ratio 0.70): no system notification is sent, so the agent must learn to detect drift from user-level signals and staleness alone.
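
One way to picture the split between announced and silent drift is a two-stage draw: first decide whether a drift happens, then decide whether it is announced. The sketch below is an assumption about how such a scheduler could work, not the logic in fineprint/drift.py. The function `sample_drift_event` is hypothetical, and the values 0.25 and 0.70 are the drift probability and silent drift ratio from the default configuration.

```python
import random


def sample_drift_event(rng: random.Random, drift_probability: float,
                       silent_ratio: float) -> str:
    """Return "none", "announced", or "silent" for one drift opportunity."""
    if rng.random() >= drift_probability:
        return "none"
    return "silent" if rng.random() < silent_ratio else "announced"


rng = random.Random(0)  # fixed seed so the simulation is repeatable
events = [sample_drift_event(rng, drift_probability=0.25, silent_ratio=0.70)
          for _ in range(10_000)]
drifts = [event for event in events if event != "none"]
silent_share = drifts.count("silent") / len(drifts)
print(f"drifts: {len(drifts)}, silent share: {silent_share:.2f}")
```

Over many draws the silent share should settle near the configured ratio, which means most drifts reach the agent only through user contradictions, user confusion, or a growing staleness counter.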

Training

FinePrint uses GRPO (Group Relative Policy Optimization) to fine-tune a language model with LoRA adapters.
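
The core idea of GRPO is that each rollout is scored relative to the other rollouts in its group instead of against a learned value baseline. The sketch below shows the standard group-normalized advantage, (r − mean) / (std + eps). It is a generic illustration with a hypothetical function name, and the training script in this repository may compute advantages differently.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]


# Rewards for four rollouts of the same episode (made-up numbers).
advantages = group_relative_advantages([-3.0, 1.0, 5.0, 9.0])
print([round(a, 3) for a in advantages])
```

Rollouts that beat the group average receive a positive advantage and are reinforced, and a group in which every rollout scores the same contributes no gradient signal.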

Default Configuration

| Parameter | Value |
| --- | --- |
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| LoRA rank / alpha | 16 / 32 |
| Episodes | 200 |
| Rollouts per update | 8 |
| Learning rate | 2e-5 |
| Discount (γ) | 0.99 |
| PPO clip (ε) | 0.2 |
| Entropy coefficient | 0.01 |
| Drift probability | 0.25 |
| Silent drift ratio | 0.70 |
| Max episode steps | 60 |
| Workflows per episode | 5 |
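
For readers who want these defaults in code, the sketch below restates them as a dataclass. It is a stand-in only: the real `TrainingConfig` in config.py may use different field names, and the `num_updates` property assumes one policy update per `rollouts_per_update` episodes, which the README does not state.

```python
from dataclasses import dataclass


@dataclass
class TrainingConfigSketch:
    """Illustrative stand-in; the real TrainingConfig in config.py may differ."""
    base_model: str = "Qwen/Qwen2.5-1.5B-Instruct"
    lora_rank: int = 16
    lora_alpha: int = 32
    episodes: int = 200
    rollouts_per_update: int = 8
    learning_rate: float = 2e-5
    gamma: float = 0.99          # discount factor
    clip_epsilon: float = 0.2    # PPO-style clipping range
    entropy_coef: float = 0.01
    drift_probability: float = 0.25
    silent_drift_ratio: float = 0.70
    max_episode_steps: int = 60
    workflows_per_episode: int = 5

    @property
    def num_updates(self) -> int:
        """Policy updates, assuming one update per `rollouts_per_update` episodes."""
        return self.episodes // self.rollouts_per_update


config = TrainingConfigSketch()
print(config.num_updates)
```

Overriding a field at construction time, for example `TrainingConfigSketch(episodes=80)`, leaves the remaining defaults untouched.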

Running Training

# Local (requires GPU + Unsloth)
python training/train_unsloth.py

# Google Colab
# Open FinePrint_Colab.ipynb

# HuggingFace Jobs
# Open FinePrint_HFJobs.ipynb

Results

Training on Qwen2.5-1.5B-Instruct for 80 episodes (20 GRPO updates):

| Updates | Avg Reward |
| --- | --- |
| 1–4 | −3.4 |
| 5–8 | +0.6 |
| 9–12 | +5.7 |
| 13–16 | +6.7 |
| 17–20 | +7.2 |

The model improved from −2.4 to +7.8 reward over training, with entropy staying healthy (1.15 → 1.22, no mode collapse) and valid output samples increasing (81 → 106).

Technical Details

  • Built with FastAPI + Pydantic for typed request/response models
  • Core environment logic uses Gymnasium interface with numpy observations
  • HTTP wrapper exposes standard OpenEnv endpoints for remote agent interaction
  • 8 policy versions loaded via JSON delta-merging from policies/ directory
  • Deterministic compliance grading via field-level policy comparison
  • Supports concurrent sessions via session_id parameter
  • Runs on 2 vCPU / 8 GB RAM within 20 minutes

Project Structure

fineprint/
├── server/                  # HTTP API layer (FastAPI)
│   ├── app.py               #   FastAPI endpoints + landing page
│   ├── fineprint_environment.py  #   HTTP environment wrapper
│   └── tasks.py             #   3 graded task definitions
├── fineprint/               # Core package
│   ├── env.py               #   Gymnasium-compatible RL environment
│   ├── policies.py          #   Policy loading, versioning, delta merging
│   ├── drift.py             #   Drift scheduling (when/how policies change)
│   ├── state.py             #   Episode state management
│   ├── workflows.py         #   5 consumer workflow definitions
│   ├── checker.py           #   Compliance validation engine
│   ├── rewards.py           #   Reward shaping calculator (14 signals)
│   └── utils.py             #   Shared utilities
├── policies/                # 8 policy versions (JSON) + manifest
├── training/                # GRPO training & evaluation scripts
│   ├── train_unsloth.py     #   Training loop (Unsloth + LoRA)
│   └── eval.py              #   Post-training evaluation
├── tests/                   # Unit tests (pytest)
├── models.py                # Typed Pydantic models (Action, Observation, State)
├── client.py                # HTTP client for remote interaction
├── inference.py             # Baseline inference script with mandatory logging
├── openenv.yaml             # OpenEnv spec configuration
├── Dockerfile               # HuggingFace Spaces container
├── pyproject.toml           # Modern build configuration
├── config.py                # TrainingConfig dataclass
└── requirements.txt         # Dependencies

OpenEnv Spec Compliance

  • step(action) returns observation, reward, done
  • reset() returns initial observation
  • state() returns episode metadata
  • openenv.yaml with spec_version 1
  • Typed Pydantic models for all request/response schemas
  • Containerized with Docker
  • Deployed to HuggingFace Spaces
  • Mandatory stdout logging: [START], [STEP], [END]
  • 3 graded tasks with deterministic scoring
  • Baseline inference script included

Blog

Read the detailed writeup: FinePrint: Teaching Language Models That Knowledge Has an Expiration Date

License

MIT


Built for Meta PyTorch OpenEnv Hackathon × Scaler School of Technology: Consumer Policy Drift Detection