---
title: FinePrint-Env
emoji: "\U0001F4DC"
colorFrom: yellow
colorTo: red
sdk: docker
app_port: 7860
tags:
  - openenv
  - reinforcement-learning
  - policy-compliance
  - drift-detection
  - customer-service
pinned: false
---

# FinePrint-Env: Consumer Policy Drift Detection Environment

> **[Live Demo & API](https://huggingface.co/spaces/PraneshkumarR/fineprint-env)** | **[Training Notebook (Colab)](https://colab.research.google.com/drive/1s6bUXezbqFNDYnQvmMs5pnQY232QXvJk?usp=sharing)**

## Overview

FinePrint-Env is a reinforcement learning environment where AI agents learn to detect policy changes and maintain compliance in customer service workflows. Built for the **Meta PyTorch OpenEnv Hackathon x Scaler School of Technology**, it provides a realistic simulation of policy drift scenarios ranging from simple quoting to adversarial multi-version silent drift.

## Motivation

- **Policies change constantly.** Pricing, return windows, and subscription terms shift weekly. An agent quoting a return policy updated 10 minutes ago creates **legal and financial liability**.
- **No existing RL environment tests drift detection.** FinePrint-Env fills this gap with 8 policy versions, 5 customer workflows, and deterministic compliance grading.
- **70% of drifts are silent** in the default training configuration: no system notification is sent. The agent must learn to detect drift from user-level signals and staleness alone.

## The Problem

Production LLMs assume static knowledge. In reality, policies, pricing, and rules change constantly. An agent quoting a return policy that was updated 10 minutes ago creates **legal and financial liability**. No existing benchmark tests or trains this capability.

## The Solution

FinePrint teaches models a single critical meta-skill: **when to call `request_verification()`**, the binary decision that separates safe agents from dangerous ones. Rather than memorizing policies, the model learns to recognize *drift signals* (user contradictions, staleness, system notifications) and re-ground itself before responding.

## Why Not Just RAG? Why Not Agentic Workflows?

This is the question everyone asks. Here's why neither solves the actual problem:

### RAG (Retrieval-Augmented Generation)

RAG retrieves fresh documents at query time. Sounds perfect, until you realize:

- **RAG doesn't know *when* to retrieve.** It either retrieves every time (wasteful, slow, expensive) or relies on a fixed schedule (misses urgent changes). There's no learned judgment about *staleness*.
- **RAG has no concept of drift severity.** A return window changing from 30 → 14 days is catastrophic. A FAQ typo fix is irrelevant. RAG treats both the same: it just fetches.
- **RAG doesn't penalize stale answers.** If the retriever returns a cached/stale chunk, the model quotes it confidently. There's no feedback loop teaching it that "this information might be outdated."
- **RAG is reactive, not proactive.** It responds to queries. It never says *"wait, I should double-check this before answering"*; that's a learned meta-skill, not a retrieval pattern.

### Agentic Workflows (Tool-Using LLMs)

Agents with tools can call APIs, search databases, and verify information. But:

- **Tool availability ≠ tool wisdom.** Giving a model a `verify_policy()` tool doesn't mean it knows *when* to call it. Without training, agents either never verify (dangerous) or verify every step (unusable in production).
- **No reward signal for drift detection.** Agentic frameworks like LangChain/CrewAI provide tools but no RL reward for using them at the right moment. The agent has no incentive to develop timing intuition.
- **Hardcoded verification rules are brittle.** You could write `if steps_since_verify > 5: verify()`, but that's a heuristic, not intelligence. It doesn't adapt to context (high-stakes question vs. casual chat).
- **No benchmark exists to measure this.** How do you evaluate whether your agent verifies at the right time? There's no leaderboard, no graded task, no compliance score. You just hope it works.

### What FinePrint Actually Does Differently

FinePrint doesn't retrieve documents or provide tools; it **trains the judgment layer** that sits above both:

| Capability | RAG | Agentic | FinePrint |
|---|---|---|---|
| Access to fresh data | ✅ retrieves | ✅ tools | ✅ `request_verification()` |
| Knows *when* to refresh | ❌ always/never | ❌ hardcoded | ✅ **learned via RL** |
| Drift severity awareness | ❌ | ❌ | ✅ reward-weighted |
| Penalizes stale answers | ❌ | ❌ | ✅ −8.0 per stale quote |
| Trains verification timing | ❌ | ❌ | ✅ +3.0 timely, +1.0 late |
| Graded compliance tasks | ❌ | ❌ | ✅ 3 difficulty levels |
| Works *with* RAG/agents | n/a | n/a | ✅ trains the meta-skill they lack |

**The insight:** RAG and agentic workflows solve *access* to fresh information. FinePrint solves *judgment* about when that access matters. They're complementary: FinePrint trains the decision layer that makes RAG and tool-use actually safe.

## Environment Description

The environment simulates a customer service agent handling consumer workflows (shopping, returns, subscriptions, bookings, complaints) while company policies change silently in the background. The agent must use available commands to inspect policies, quote values accurately, detect drift, and maintain compliance.

## Action Space

| Command | Arguments | Description |
|---------|-----------|-------------|
| `view_policies` | (none) | View currently cached policy values |
| `view_workflow` | (none) | View current workflow state and conversation |
| `check_compliance` | (none) | Check current compliance status |
| `request_verification` | (none) | Refresh policy cache and detect drift |
| `quote_policy` | `policy_field`, `quoted_value` | Quote a specific policy field to customer |
| `respond_to_user` | `message` | Send a general message to the customer |
| `take_action` | `message` | Process a workflow action (checkout, return, etc.) |
| `escalate` | `message` | Escalate to supervisor (only when drift detected) |
| `abort_workflow` | `message` | Abort current workflow (only when justified) |
| `clarify` | `message` | Ask customer for clarification |
| `submit` | (none) | Submit for final grading |

## Observation Space

Each step returns an observation containing:

- **output** -- Command result text (policy values, compliance status, workflow state, etc.)
- **task_description** -- Current task description and objectives
- **workflow_names** -- List of available workflows
- **available_commands** -- Available actions the agent can take
- **done** -- Whether the episode is complete
- **reward** -- Score (0.0--1.0) returned on submission
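
As a rough illustration, the fields above could be modeled on the client side as shown below. The `Observation` dataclass and the `parse_observation` helper are hypothetical names for this sketch, not the contents of the repository's `models.py`, and it assumes the server returns a flat JSON object with these keys.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Observation:
    """Client-side view of one step result (hypothetical; mirrors the list above)."""
    output: str = ""
    task_description: str = ""
    workflow_names: list[str] = field(default_factory=list)
    available_commands: list[str] = field(default_factory=list)
    done: bool = False
    reward: Optional[float] = None  # assumed to be unset until submission


def parse_observation(payload: dict[str, Any]) -> Observation:
    """Build an Observation from a decoded JSON dict, ignoring unknown keys."""
    known = Observation.__dataclass_fields__.keys()
    return Observation(**{k: v for k, v in payload.items() if k in known})


# Example payload; the "extra" key stands in for fields this sketch does not model.
sample = {"output": "return.window_days = 30", "done": False, "extra": 1}
obs = parse_observation(sample)
```

Dropping unknown keys keeps the sketch tolerant of extra fields the real response may carry.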

## Tasks

### Task 1: quote_accuracy (Easy)

Quote policies correctly across `shop` and `return` workflows with no drift.

- **Expected difficulty:** Easy
- **Max steps:** 20

### Task 2: drift_detection (Medium)

Handle 3 workflows while detecting policy changes. 30% drift probability with 50% silent ratio.

- **Expected difficulty:** Medium
- **Max steps:** 30

### Task 3: compliance_storm (Hard)

All 5 workflows under aggressive silent drift across 8 policy versions. 50% drift probability with 80% silent ratio.

- **Expected difficulty:** Hard
- **Max steps:** 45
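
To make the drift settings concrete, the sketch below collects the numbers from the three task descriptions and shows how drift probability and silent ratio combine. `TaskSettings`, `sample_drift`, and `silent_rate` are hypothetical names, and the sketch assumes one independent random draw per drift opportunity; the actual scheduling in `drift.py` may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSettings:
    max_steps: int
    drift_prob: float    # chance that a drift event occurs on one draw (assumed)
    silent_ratio: float  # share of drift events sent without a notification


# Values copied from the task descriptions above.
TASKS = {
    "quote_accuracy": TaskSettings(max_steps=20, drift_prob=0.0, silent_ratio=0.0),
    "drift_detection": TaskSettings(max_steps=30, drift_prob=0.3, silent_ratio=0.5),
    "compliance_storm": TaskSettings(max_steps=45, drift_prob=0.5, silent_ratio=0.8),
}


def sample_drift(cfg: TaskSettings, rng: random.Random) -> str:
    """Return 'none', 'announced', or 'silent' for one draw."""
    if rng.random() >= cfg.drift_prob:
        return "none"
    return "silent" if rng.random() < cfg.silent_ratio else "announced"


def silent_rate(cfg: TaskSettings) -> float:
    """Probability that a single draw yields a silent drift."""
    return cfg.drift_prob * cfg.silent_ratio
```

Under these assumptions a single draw produces a silent drift with probability 0.3 × 0.5 = 0.15 in `drift_detection` and 0.5 × 0.8 = 0.40 in `compliance_storm`.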

## Reward Function

```
score = 0.3 * (compliance_accuracy) + 0.5 * (workflow_completion) + 0.2 * (drift_responsiveness)
```

| Component | Weight | Description |
|-----------|--------|-------------|
| Compliance accuracy | 0.3 | Proportion of policy quotes that are correct |
| Workflow completion | 0.5 | Proportion of workflows completed |
| Drift responsiveness | 0.2 | Proportion of drifts detected via verification |
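
A worked example of the formula, as a minimal sketch: `final_score` is a hypothetical helper, and it assumes each component is already expressed as a proportion in [0, 1], as the table describes.

```python
def final_score(compliance_accuracy: float,
                workflow_completion: float,
                drift_responsiveness: float) -> float:
    """Weighted sum from the formula above; inputs are proportions in [0, 1]."""
    for value in (compliance_accuracy, workflow_completion, drift_responsiveness):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each component must lie in [0, 1]")
    return (0.3 * compliance_accuracy
            + 0.5 * workflow_completion
            + 0.2 * drift_responsiveness)


# 4 of 5 quotes correct, 2 of 3 workflows completed, 1 of 2 drifts detected.
example = final_score(4 / 5, 2 / 3, 1 / 2)
```

Here the three terms contribute 0.24, about 0.33, and 0.10, for a score of roughly 0.67. Because workflow completion carries half the weight, an agent that verifies perfectly but finishes no workflow cannot score above 0.5.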

### Step-Level Rewards (14 signals: 7 positive, 7 negative)

| Event | Reward |
|-------|--------|
| Correct policy quote | **+10.0** |
| Timely drift detection (≤ 2 steps) | **+3.0** |
| Late drift detection (> 2 steps) | **+1.0** |
| Freshness bonus (verified ≤ 2 steps ago) | **+1.0** |
| High user satisfaction | **+2.0** |
| Zero compliance failures (terminal) | **+20.0** |
| Stale policy citation (HIGH severity) | **−8.0** |
| Incorrect value quoted | **−4.0** |
| User satisfaction < 0.3 | **−5.0** |
| Unnecessary escalation | **−4.0** |
| Unnecessary abort | **−3.0** |
| Unnecessary verification | **−0.5** |
| Any compliance failure (terminal) | **−30.0** |
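
The table can be read as a simple lookup. The sketch below copies the reward values from it; the event keys and the helpers `detection_reward` and `episode_return` are hypothetical names chosen for illustration and are not taken from `rewards.py`.

```python
# Reward values copied from the table above; the event keys are hypothetical.
STEP_REWARDS = {
    "correct_quote": 10.0,
    "timely_drift_detection": 3.0,
    "late_drift_detection": 1.0,
    "freshness_bonus": 1.0,
    "high_user_satisfaction": 2.0,
    "zero_compliance_failures": 20.0,   # terminal
    "stale_policy_citation": -8.0,
    "incorrect_value": -4.0,
    "low_user_satisfaction": -5.0,
    "unnecessary_escalation": -4.0,
    "unnecessary_abort": -3.0,
    "unnecessary_verification": -0.5,
    "any_compliance_failure": -30.0,    # terminal
}


def detection_reward(steps_since_drift: int) -> float:
    """Detection within 2 steps counts as timely; anything later counts as late."""
    key = "timely_drift_detection" if steps_since_drift <= 2 else "late_drift_detection"
    return STEP_REWARDS[key]


def episode_return(events: list[str]) -> float:
    """Sum the shaped rewards for a sequence of event names."""
    return sum(STEP_REWARDS[name] for name in events)


total = episode_return(["correct_quote", "unnecessary_verification", "timely_drift_detection"])
```

The asymmetry is the point of the design: a needless verification costs 0.5, while a stale citation costs 8.0, so verifying when in doubt is cheap relative to quoting an outdated value.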

## Setup & Usage

### Local Development

```bash
pip install -r requirements.txt
uvicorn server.app:app --host 0.0.0.0 --port 7860 --reload
```

### Docker

```bash
docker build -t fineprint-env .
docker run -p 7860:7860 fineprint-env
```

### API Usage

```python
import requests

BASE = "http://localhost:7860"

# Reset with a specific task
obs = requests.post(f"{BASE}/reset", json={"options": {"task_id": "quote_accuracy"}}).json()

# View policies
obs = requests.post(f"{BASE}/step", json={"action": {"command": "view_policies", "args": {}}}).json()
print(obs["output"])

# Quote a policy
obs = requests.post(f"{BASE}/step", json={
    "action": {"command": "quote_policy", "args": {"policy_field": "return.window_days", "quoted_value": "30"}}
}).json()

# Submit for grading
obs = requests.post(f"{BASE}/step", json={"action": {"command": "submit", "args": {}}}).json()
print(f"Score: {obs['reward']}")
```

### Python Client

```python
from client import FinePrintClient

client = FinePrintClient(base_url="http://localhost:7860")
client.reset(task_id="drift_detection")
obs = client.step("view_policies")
obs = client.step("quote_policy", policy_field="return.window_days", quoted_value="30")
obs = client.step("submit")
```

### Gymnasium Interface (standalone)

```python
import gymnasium as gym

env = gym.make("FinePrint-v0")
obs, info = env.reset(seed=42)

action = {"action_type": 0}  # request_verification
obs, reward, terminated, truncated, info = env.step(action)
```

## Baseline Scores

| Task | Score | Steps |
|------|-------|-------|
| quote_accuracy | ~0.80 | 8--12 |
| drift_detection | ~0.55 | 15--20 |
| compliance_storm | ~0.25 | 25--35 |

## Running the Baseline

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export OPENAI_API_KEY="your-key"
export ENV_URL="http://localhost:7860"
python inference.py
```

## Policy Drift

Eight policy versions are composed via **delta merging**: each version overrides specific fields from the base while inheriting the rest:

| Version | Change | Severity |
|---------|--------|----------|
| `v1_base` | Baseline policies | n/a |
| `v2_return_change` | Return window 30 → 14 days, refund → store credit | HIGH |
| `v3_shipping_change` | Free shipping threshold $50 → $75 | MEDIUM |
| `v4_subscription_change` | Auto-renewal: off → mandatory | HIGH |
| `v5_cancellation_fee` | Booking cancellation fee $0 → $25 | MEDIUM |
| `v6_complaint_change` | Max compensation $200 → $50, escalation removed | HIGH |
| `v7_scope_change` | Electronics returns eliminated, price match removed | CRITICAL |
| `v8_pricing_change` | Tax included in price, bulk discount removed | MEDIUM |
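
Delta merging can be sketched as a recursive dictionary overlay. This is an illustration under assumptions, not the code in `policies.py`: `merge_delta` is a hypothetical helper, and apart from `return.window_days` the field names in the sample data are invented for the example.

```python
import copy
from typing import Any


def merge_delta(base: dict[str, Any], delta: dict[str, Any]) -> dict[str, Any]:
    """Return a new policy dict: delta fields override base, the rest is inherited."""
    merged = copy.deepcopy(base)
    for key, value in delta.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Recurse so that sibling fields inside a section survive the override.
            merged[key] = merge_delta(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical base and delta, shaped after the v2_return_change row above.
base = {
    "return": {"window_days": 30, "refund_method": "original_payment"},
    "shipping": {"free_threshold": 50},
}
v2_delta = {"return": {"window_days": 14, "refund_method": "store_credit"}}

v2 = merge_delta(base, v2_delta)
```

After the merge, `v2` carries the shortened return window while the shipping section is inherited untouched, and `base` itself is not mutated.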

## Drift Signals

The agent receives 4 types of signals that policies may have changed:

| Signal | Explicitness | Example |
|--------|-------------|---------|
| System notification | Explicit | `"POLICY UPDATE: Version v3 is now active"` |
| User contradiction | Implicit | `"But the website says 14 days, not 30..."` |
| User confusion | Implicit | `"That doesn't match what I was told"` |
| Staleness counter | Passive | Steps since last `request_verification()` |

**70% of drifts are silent** in the default training configuration (silent drift ratio 0.70): no system notification is sent. The agent must learn to detect drift from user-level signals and staleness alone.

## Training

FinePrint uses **GRPO (Group Relative Policy Optimization)** to fine-tune a language model with LoRA adapters.
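
The core idea of GRPO is that each rollout is scored relative to the other rollouts in its group rather than against a learned value function. The sketch below shows the commonly used normalization (reward minus group mean, divided by group standard deviation). The function name and the sample rewards are hypothetical, and whether `train_unsloth.py` uses exactly this form is an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of 8 rollouts, with made-up episode rewards for illustration.
group = [-3.0, 1.0, 5.0, 7.0, -1.0, 2.0, 6.0, 3.0]
advantages = group_relative_advantages(group)
```

Rollouts that beat the group average get a positive advantage and are reinforced; those below it get a negative one. A group in which all rollouts earn the same reward yields zero advantages and therefore no gradient signal.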

### Default Configuration

| Parameter | Value |
|-----------|-------|
| Base model | `Qwen/Qwen2.5-1.5B-Instruct` |
| LoRA rank / alpha | 16 / 32 |
| Episodes | 200 |
| Rollouts per update | 8 |
| Learning rate | 2e-5 |
| Discount (γ) | 0.99 |
| PPO clip (ε) | 0.2 |
| Entropy coefficient | 0.01 |
| Drift probability | 0.25 |
| Silent drift ratio | 0.70 |
| Max episode steps | 60 |
| Workflows per episode | 5 |

### Running Training

```bash
# Local (requires GPU + Unsloth)
python training/train_unsloth.py

# Google Colab
# Open FinePrint_Colab.ipynb

# HuggingFace Jobs
# Open FinePrint_HFJobs.ipynb
```

## Results

Training on **Qwen2.5-1.5B-Instruct** for 80 episodes (20 GRPO updates):

| Updates | Avg Reward |
|---------|-----------|
| 1–4 | −3.4 |
| 5–8 | +0.6 |
| 9–12 | +5.7 |
| 13–16 | +6.7 |
| 17–20 | +7.2 |

The model improved from **−2.4 to +7.8** reward over training, with entropy staying healthy (1.15 → 1.22, no mode collapse) and valid output samples increasing (81 → 106).

## Technical Details

- Built with **FastAPI** + **Pydantic** for typed request/response models
- Core environment logic uses **Gymnasium** interface with numpy observations
- HTTP wrapper exposes standard OpenEnv endpoints for remote agent interaction
- 8 policy versions loaded via JSON delta-merging from `policies/` directory
- Deterministic compliance grading via field-level policy comparison
- Supports concurrent sessions via `session_id` parameter
- Runs on 2 vCPU / 8 GB RAM within 20 minutes

## Project Structure

```
fineprint/
├── server/                  # HTTP API layer (FastAPI)
│   ├── app.py               #   FastAPI endpoints + landing page
│   ├── fineprint_environment.py  #   HTTP environment wrapper
│   └── tasks.py             #   3 graded task definitions
├── fineprint/               # Core package
│   ├── env.py               #   Gymnasium-compatible RL environment
│   ├── policies.py          #   Policy loading, versioning, delta merging
│   ├── drift.py             #   Drift scheduling (when/how policies change)
│   ├── state.py             #   Episode state management
│   ├── workflows.py         #   5 consumer workflow definitions
│   ├── checker.py           #   Compliance validation engine
│   ├── rewards.py           #   Reward shaping calculator (14 signals)
│   └── utils.py             #   Shared utilities
├── policies/                # 8 policy versions (JSON) + manifest
├── training/                # GRPO training & evaluation scripts
│   ├── train_unsloth.py     #   Training loop (Unsloth + LoRA)
│   └── eval.py              #   Post-training evaluation
├── tests/                   # Unit tests (pytest)
├── models.py                # Typed Pydantic models (Action, Observation, State)
├── client.py                # HTTP client for remote interaction
├── inference.py             # Baseline inference script with mandatory logging
├── openenv.yaml             # OpenEnv spec configuration
├── Dockerfile               # HuggingFace Spaces container
├── pyproject.toml           # Modern build configuration
├── config.py                # TrainingConfig dataclass
└── requirements.txt         # Dependencies
```

## OpenEnv Spec Compliance

- step(action) returns observation, reward, done
- reset() returns initial observation
- state() returns episode metadata
- openenv.yaml with spec_version 1
- Typed Pydantic models for all request/response schemas
- Containerized with Docker
- Deployed to HuggingFace Spaces
- Mandatory stdout logging: `[START]`, `[STEP]`, `[END]`
- 3 graded tasks with deterministic scoring
- Baseline inference script included

## Blog

Read the detailed writeup: [FinePrint: Teaching Language Models That Knowledge Has an Expiration Date](blog.md)

## License

[MIT](LICENSE)

---

<div align="center">

Built for **Meta PyTorch OpenEnv Hackathon × Scaler School of Technology**: Consumer Policy Drift Detection

</div>