Meta PyTorch OpenEnv Hackathon × Scaler School of Technology

Your AI agent just quoted a policy that changed 10 minutes ago.

FinePrint is the first RL environment that trains language models to detect when their knowledge expires — before it costs your company a lawsuit.

"The return window is 30 days!" — an AI agent, confidently citing a policy that changed to 14 days at 2 AM.

This happens every day in production. No existing benchmark tests for it.
fineprint-env — drift detection
The Problem
Production LLMs assume static knowledge. Reality disagrees.
Policies, pricing, and terms change constantly. Your AI agent keeps quoting yesterday's rules with today's confidence.

Silent Drift

Policies change without notification. No API event, no webhook, no alert. The agent's cache silently becomes wrong.

70% of drifts are completely silent

Legal Liability

A customer relies on your agent's quote and ships a return on day 20 of a 14-day window. Who's liable? You are.

$0 → $25: a cancellation fee that appeared overnight

No Benchmark

CartPole tests balance. Atari tests game skill. Nothing tests whether an AI knows when to stop trusting itself.

0 existing RL environments train drift detection
The Solution
One decision changes everything.
FinePrint trains a single critical meta-skill: when to call request_verification() — the binary decision that separates safe agents from dangerous ones.
1. Customer asks a policy question

"What's your return window? I bought this 20 days ago."

2. Agent checks its cached knowledge

Cache says return.window_days = 30. But when was this last verified?

// Agent's internal state
"cached_policies": { "return.window_days": 30 },
"steps_since_last_verify": 7  // ← stale!

3. Trained agent calls request_verification()

The model learned that 7 steps without verification + a policy-sensitive question = time to check. It refreshes its cache and discovers the policy changed.

request_verification()
// +3.0 reward for timely detection
// Cache updated: return.window_days = 30 → 14
4. Agent quotes the correct, current value

"Our return window is 14 days." — Correct. Compliant. No lawsuit.

quote_policy("return.window_days", "14") // +10.0 reward

Without training: agent quotes stale value

"Our return window is 30 days!" — Wrong. The customer ships on day 20. Denied. Complaint filed.

quote_policy("return.window_days", "30")
// -8.0 stale penalty
// User satisfaction drops → -5.0 complaint penalty
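For contrast, a rule-based version of this verify-or-quote decision can be sketched in a few lines. Names and the staleness threshold below are illustrative assumptions, not the environment's actual API; this is essentially what the heuristic baseline hardcodes and what the trained model learns instead.

# Rule-based sketch of the verify-or-quote decision (hypothetical names).
POLICY_SENSITIVE = {"return.window_days", "cancellation.fee", "shipping.free_threshold"}
STALENESS_LIMIT = 5  # assumed: steps before the cache is treated as suspect

def choose_action(field: str, steps_since_verify: int) -> str:
    """Return the action to take for a policy question."""
    if field in POLICY_SENSITIVE and steps_since_verify > STALENESS_LIMIT:
        return "request_verification"  # refresh the cache before answering
    return "quote_policy"              # cache is trusted, answer directly

# The scenario above: policy-sensitive field, 7 steps since last verification
print(choose_action("return.window_days", steps_since_verify=7))
# -> request_verification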
But wait…
“Can’t RAG just fix this?”
Sure. And a smoke detector can cook dinner. Here’s the thing:

RAG: retrieves everything, questions nothing
  • Doesn’t know when to retrieve — fetches every time or never
  • Treats a typo fix and a return-window halving identically
  • If the retriever returns stale chunks, the model quotes them confidently
  • Reactive only — never says “wait, let me double-check”

Agentic Workflows: have the tools, no idea when to use them
  • Tool availability ≠ tool wisdom
  • No reward signal for verifying at the right moment
  • Hardcoded if steps > 5: verify() is a heuristic, not intelligence
  • Zero benchmarks to measure verification timing

FinePrint: trains the judgment they both lack
  • Learns when to call request_verification() via RL rewards
  • -8.0 penalty for stale quotes — the model feels the cost of being wrong
  • +3.0 for timely detection — develops urgency, not just access
  • Works with RAG & agents — trains the meta-skill they’re missing

RAG solves access. Agents solve tools. FinePrint solves judgment: the part where your model decides “I should probably not trust myself right now.”

Graded Tasks
Three levels. Increasing chaos.
From static quoting to adversarial multi-version silent drift. Can your agent handle the storm?
quote_accuracy (Easy)

Handle shop and return workflows. Quote policies correctly. No drift — just comprehension.

2 workflows · 20 steps · 0% drift

drift_detection (Medium)

Navigate 3 workflows while policies silently change. Detect drifts, adapt quotes, maintain compliance.

3 workflows · 30 steps · 30% drift · 50% silent

compliance_storm (Hard)

All 5 workflows under aggressive drift across 8 policy versions. CRITICAL scope changes lurk in the noise.

5 workflows · 45 steps · 50% drift · 80% silent
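The three presets map naturally onto a small config structure. A hypothetical sketch, with field names assumed rather than taken from FinePrint's actual schema:

# Hypothetical task presets mirroring the graded tasks above.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskConfig:
    workflows: int          # concurrent customer workflows
    max_steps: int          # episode length
    drift_rate: float       # probability a policy drifts during the episode
    silent_fraction: float  # share of drifts with no notification event

TASKS = {
    "quote_accuracy":   TaskConfig(workflows=2, max_steps=20, drift_rate=0.0, silent_fraction=0.0),
    "drift_detection":  TaskConfig(workflows=3, max_steps=30, drift_rate=0.3, silent_fraction=0.5),
    "compliance_storm": TaskConfig(workflows=5, max_steps=45, drift_rate=0.5, silent_fraction=0.8),
}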
Policy Drift Engine
8 versions. Each one breaks something.
Policies evolve via delta merging: each version overrides specific fields from the base while inheriting the rest. (A sketch of the merge follows the table.)
Version   Change                        Severity   Example
v1_base   Baseline policies             -          Return: 30 days, free shipping over $50
v2        Return policy tightened       HIGH       Window: 30 → 14 days
v3        Shipping raised               MEDIUM     Free threshold: $50 → $75
v4        Auto-renewal mandatory        HIGH       auto_renewal: false → true
v5        Cancellation fee introduced   MEDIUM     Fee: $0 → $25
v6        Compensation slashed          HIGH       Max comp: $200 → $50
v7        Scope narrowed                CRITICAL   Electronics returns: eliminated
v8        Pricing restructured          MEDIUM     Tax included, bulk discount gone
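A minimal sketch of the delta merge described above. Structure and field names are assumed from the table, not FinePrint's actual data format: each version stores only the fields it overrides, and everything else is inherited from the base.

BASE = {
    "return.window_days": 30,
    "shipping.free_threshold": 50,
    "subscription.auto_renewal": False,
    "cancellation.fee": 0,
}

DELTAS = {
    "v2": {"return.window_days": 14},           # HIGH: window halved
    "v3": {"shipping.free_threshold": 75},      # MEDIUM: threshold raised
    "v4": {"subscription.auto_renewal": True},  # HIGH: renewal now mandatory
    "v5": {"cancellation.fee": 25},             # MEDIUM: fee introduced
}

def effective_policy(version: str) -> dict:
    """Merge a version's delta over the base; unlisted fields are inherited."""
    return {**BASE, **DELTAS.get(version, {})}

print(effective_policy("v2")["return.window_days"])  # 14
print(effective_policy("v2")["cancellation.fee"])    # 0 (inherited from base)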
Training Results
From −11 to +8.75 in 80 episodes.
GRPO fine-tuning on Qwen2.5-1.5B-Instruct with LoRA. The model learns when to verify and when to act.
[Chart: Avg Reward per Update (80 ep · 20 updates · LoRA r=16), plotting avg reward and entropy against a zero line.]
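For intuition, GRPO's core step is group-relative advantage estimation: sample several rollouts of the same task and score each against its group's statistics. A minimal sketch of just that computation (training loop, KL term, and LoRA wiring omitted; the example numbers reuse the reported reward range):

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's return within its group."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Four rollouts of one FinePrint task: the +8.75 run is pushed up,
# the -11.0 stale-quoting run is pushed down.
returns = np.array([[8.75, -11.0, 2.0, -3.0]])
print(group_relative_advantages(returns).round(2))  # approx [[ 1.33 -1.41  0.39 -0.30]]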
Before & After
Baseline vs trained model.

Heuristic Baseline (rule-based)

  • Avg Reward: 144.2 ± 20.7
  • Drift Detections / Ep: 4.5
  • Compliance Failure Rate: 30%
  • Workflows Completed: 4.0 / 5
  • User Satisfaction: 91%

Trained Model (GRPO, 80 ep)

  • Avg Reward (peak): +8.75
  • Drift Detections / Ep: 1.4
  • Compliance Failure Rate: 0%
  • Valid Samples: 81 → 106
  • Entropy (stable): 1.15 → 1.22
Reward Design
The 26-point swing.
A single policy-sensitive step can yield +13 (verify → detect → correct quote) or −13 (skip → stale quote → complaint). This gap is what drives learning.
26 pt swing: best case vs worst case per step

+3 (detect) + 10 (correct quote) = +13  vs  −8 (stale) − 5 (complaint) = −13

Rewards

  • Correct quote: +10.0
  • Timely drift detection: +3.0
  • Late drift detection: +1.0
  • Freshness bonus: +1.0
  • High satisfaction: +2.0
  • Clean episode (terminal): +20.0

Penalties

  • Stale policy cited: -8.0
  • User complaint: -5.0
  • Incorrect value: -4.0
  • Unnecessary escalation: -4.0
  • Unnecessary abort: -3.0
  • Compliance failure (terminal): -30.0
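The schedule above reads naturally as a lookup table. A sketch for intuition; event names are assumptions, not the environment's actual event keys:

REWARDS = {
    "correct_quote":          +10.0,
    "timely_drift_detection":  +3.0,
    "late_drift_detection":    +1.0,
    "freshness_bonus":         +1.0,
    "high_satisfaction":       +2.0,
    "clean_episode":          +20.0,   # terminal
    "stale_policy_cited":      -8.0,
    "user_complaint":          -5.0,
    "incorrect_value":         -4.0,
    "unnecessary_escalation":  -4.0,
    "unnecessary_abort":       -3.0,
    "compliance_failure":     -30.0,   # terminal
}

def step_reward(events: list[str]) -> float:
    """Sum rewards for all events fired by one step."""
    return sum(REWARDS[e] for e in events)

# The 26-point swing from a single policy-sensitive step:
print(step_reward(["timely_drift_detection", "correct_quote"]))  # +13.0
print(step_reward(["stale_policy_cited", "user_complaint"]))     # -13.0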
API
OpenEnv-compatible endpoints.
Standard REST API. Reset an episode, step through actions, check state. Full Swagger docs at /docs.
Method   Endpoint   Description
GET      /health    Liveness / readiness probe
POST     /reset     Start a new episode with a task
POST     /step      Execute an agent action, get observation + reward
GET      /state     Current episode metadata (step count, task, version)
GET      /tasks     List available graded tasks
GET      /docs      Interactive Swagger UI
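A minimal client sketch against those endpoints. The base URL and the request/response shapes are assumptions; check /docs for the real schema.

import requests

BASE = "http://localhost:8000"  # assumed local deployment

# Start an episode on the medium task
obs = requests.post(f"{BASE}/reset", json={"task": "drift_detection"}).json()

# Verify before quoting a policy-sensitive field
step = requests.post(f"{BASE}/step", json={"action": "request_verification"}).json()
print(step.get("reward"), step.get("observation"))

# Quote the (now fresh) value
step = requests.post(f"{BASE}/step", json={
    "action": "quote_policy",
    "args": {"field": "return.window_days", "value": "14"},
}).json()

print(requests.get(f"{BASE}/state").json())  # step count, task, policy version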
Try It
See it in action.

Live Environment Interaction

Reset → view policies → quote a value → see the compliance check fire.

Compliance
OpenEnv Spec v1 — fully compliant.
✓ step(action) ✓ reset() ✓ state() ✓ openenv.yaml ✓ Pydantic Models ✓ Docker ✓ 3 Graded Tasks ✓ Baseline Inference ✓ Mandatory Logging

Train your agent to know when to stop trusting itself.

FinePrint is open, free, and ready. Reset an episode, connect your model, and see if it survives the compliance storm.

Read API Docs