Meta PyTorch OpenEnv Hackathon × Scaler School of Technology

Your AI agent just quoted a policy that changed 10 minutes ago.

FinePrint is the first RL environment that trains language models to detect when their knowledge expires — before it costs your company a lawsuit.

"The return window is 30 days!" — an AI agent, confidently citing a policy that changed to 14 days at 2 AM.

This happens every day in production. No existing benchmark tests for it.
fineprint-env — drift detection
The Problem
Production LLMs assume static knowledge. Reality disagrees.
Policies, pricing, and terms change constantly. Your AI agent keeps quoting yesterday's rules with today's confidence.

Silent Drift

Policies change without notification. No API event, no webhook, no alert. The agent's cache silently becomes wrong.

70% of drifts are completely silent

Legal Liability

A customer relies on your agent's quote and ships a return on day 20 of a 14-day window. Who's liable? You are.

$0 → $25: a cancellation fee that appeared overnight

No Benchmark

CartPole tests balance. Atari tests game skill. Nothing tests whether an AI knows when to stop trusting itself.

0 existing RL environments train drift detection
The Solution
One decision changes everything.
FinePrint trains a single critical meta-skill: when to call request_verification() — the binary decision that separates safe agents from dangerous ones.
1. Customer asks a policy question

"What's your return window? I bought this 20 days ago."

2. Agent checks its cached knowledge

Cache says return.window_days = 30. But when was this last verified?

// Agent's internal state
"cached_policies": { "return.window_days": 30 },
"steps_since_last_verify": 7  // ← stale!

3. Trained agent calls request_verification()

The model learned that 7 steps without verification + a policy-sensitive question = time to check. It refreshes its cache and discovers the policy changed.

request_verification()
// +3.0 reward for timely detection
// Cache updated: return.window_days = 30 → 14
4. Agent quotes the correct, current value

"Our return window is 14 days." — Correct. Compliant. No lawsuit.

quote_policy("return.window_days", "14") // +10.0 reward

Without training: agent quotes stale value

"Our return window is 30 days!" — Wrong. The customer ships on day 20. Denied. Complaint filed.

quote_policy("return.window_days", "30")
// -8.0 stale penalty
// User satisfaction drops → -5.0 complaint penalty
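For contrast, a rule-based version of this verify-or-quote decision can be sketched in a few lines. Names and the staleness threshold below are illustrative assumptions, not the environment's actual API; this is essentially what the heuristic baseline hardcodes and what the trained model learns instead.

# Rule-based sketch of the verify-or-quote decision (hypothetical names).
POLICY_SENSITIVE = {"return.window_days", "cancellation.fee", "shipping.free_threshold"}
STALENESS_LIMIT = 5  # assumed: steps before the cache is treated as suspect

def choose_action(field: str, steps_since_verify: int) -> str:
    """Return the action to take for a policy question."""
    if field in POLICY_SENSITIVE and steps_since_verify > STALENESS_LIMIT:
        return "request_verification"  # refresh the cache before answering
    return "quote_policy"              # cache is trusted, answer directly

# The scenario above: policy-sensitive field, 7 steps since last verification
print(choose_action("return.window_days", steps_since_verify=7))
# -> request_verification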
But wait…
“Can’t RAG just fix this?”
Sure. And a smoke detector can cook dinner. Here’s the thing:

RAG: retrieves everything, questions nothing
  • Doesn’t know when to retrieve — fetches every time or never
  • Treats a typo fix and a return-window halving identically
  • If the retriever returns stale chunks, the model quotes them confidently
  • Reactive only — never says “wait, let me double-check”

Agentic Workflows: have the tools, no idea when to use them
  • Tool availability ≠ tool wisdom
  • No reward signal for verifying at the right moment
  • Hardcoded if steps > 5: verify() is a heuristic, not intelligence
  • Zero benchmarks to measure verification timing

FinePrint: trains the judgment they both lack
  • Learns when to call request_verification() via RL rewards
  • -8.0 penalty for stale quotes — the model feels the cost of being wrong
  • +3.0 for timely detection — develops urgency, not just access
  • Works with RAG & agents — trains the meta-skill they’re missing

RAG solves access. Agents solve tools. FinePrint solves judgment: the part where your model decides “I should probably not trust myself right now.”

Graded Tasks
Three levels. Increasing chaos.
From static quoting to adversarial multi-version silent drift. Can your agent handle the storm?
quote_accuracy (Easy)

Handle shop and return workflows. Quote policies correctly. No drift — just comprehension.

2 workflows · 20 steps · 0% drift

drift_detection (Medium)

Navigate 3 workflows while policies silently change. Detect drifts, adapt quotes, maintain compliance.

3 workflows · 30 steps · 30% drift · 50% silent

compliance_storm (Hard)

All 5 workflows under aggressive drift across 8 policy versions. CRITICAL scope changes lurk in the noise.

5 workflows · 45 steps · 50% drift · 80% silent
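The three presets map naturally onto a small config structure. A hypothetical sketch, with field names assumed rather than taken from FinePrint's actual schema:

# Hypothetical task presets mirroring the graded tasks above.
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskConfig:
    workflows: int          # concurrent customer workflows
    max_steps: int          # episode length
    drift_rate: float       # probability a policy drifts during the episode
    silent_fraction: float  # share of drifts with no notification event

TASKS = {
    "quote_accuracy":   TaskConfig(workflows=2, max_steps=20, drift_rate=0.0, silent_fraction=0.0),
    "drift_detection":  TaskConfig(workflows=3, max_steps=30, drift_rate=0.3, silent_fraction=0.5),
    "compliance_storm": TaskConfig(workflows=5, max_steps=45, drift_rate=0.5, silent_fraction=0.8),
}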
Policy Drift Engine
8 versions. Each one breaks something.
Policies evolve via delta merging: each version overrides specific fields from the base while inheriting the rest. (A sketch of the merge follows the table.)
Version   Change                        Severity   Example
v1_base   Baseline policies             -          Return: 30 days, free shipping over $50
v2        Return policy tightened       HIGH       Window: 30 → 14 days
v3        Shipping raised               MEDIUM     Free threshold: $50 → $75
v4        Auto-renewal mandatory        HIGH       auto_renewal: false → true
v5        Cancellation fee introduced   MEDIUM     Fee: $0 → $25
v6        Compensation slashed          HIGH       Max comp: $200 → $50
v7        Scope narrowed                CRITICAL   Electronics returns: eliminated
v8        Pricing restructured          MEDIUM     Tax included, bulk discount gone
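A minimal sketch of the delta merge described above. Structure and field names are assumed from the table, not FinePrint's actual data format: each version stores only the fields it overrides, and everything else is inherited from the base.

BASE = {
    "return.window_days": 30,
    "shipping.free_threshold": 50,
    "subscription.auto_renewal": False,
    "cancellation.fee": 0,
}

DELTAS = {
    "v2": {"return.window_days": 14},           # HIGH: window halved
    "v3": {"shipping.free_threshold": 75},      # MEDIUM: threshold raised
    "v4": {"subscription.auto_renewal": True},  # HIGH: renewal now mandatory
    "v5": {"cancellation.fee": 25},             # MEDIUM: fee introduced
}

def effective_policy(version: str) -> dict:
    """Merge a version's delta over the base; unlisted fields are inherited."""
    return {**BASE, **DELTAS.get(version, {})}

print(effective_policy("v2")["return.window_days"])  # 14
print(effective_policy("v2")["cancellation.fee"])    # 0 (inherited from base)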
Training Results
From −11 to +8.75 in 80 episodes.
GRPO fine-tuning on Qwen2.5-1.5B-Instruct with LoRA. The model learns when to verify and when to act.
[Chart: Avg Reward per Update (80 ep · 20 updates · LoRA r=16), plotting avg reward and entropy against a zero line.]
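For intuition, GRPO's core step is group-relative advantage estimation: sample several rollouts of the same task and score each against its group's statistics. A minimal sketch of just that computation (training loop, KL term, and LoRA wiring omitted; the example numbers reuse the reported reward range):

import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: normalize each rollout's return within its group."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Four rollouts of one FinePrint task: the +8.75 run is pushed up,
# the -11.0 stale-quoting run is pushed down.
returns = np.array([[8.75, -11.0, 2.0, -3.0]])
print(group_relative_advantages(returns).round(2))  # approx [[ 1.33 -1.41  0.39 -0.30]]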
Before & After
Baseline vs trained model.

Heuristic Baseline (rule-based)

  • Avg Reward: 144.2 ± 20.7
  • Drift Detections / Ep: 4.5
  • Compliance Failure Rate: 30%
  • Workflows Completed: 4.0 / 5
  • User Satisfaction: 91%

Trained Model (GRPO, 80 ep)

  • Avg Reward (peak): +8.75
  • Drift Detections / Ep: 1.4
  • Compliance Failure Rate: 0%
  • Valid Samples: 81 → 106
  • Entropy (stable): 1.15 → 1.22
Reward Design
The 26-point swing.
A single policy-sensitive step can yield +13 (verify → detect → correct quote) or −13 (skip → stale quote → complaint). This gap is what drives learning.
26 pt swing: best case vs worst case per step

+3 (detect) + 10 (correct quote) = +13  vs  −8 (stale) − 5 (complaint) = −13

Rewards

  • Correct quote: +10.0
  • Timely drift detection: +3.0
  • Late drift detection: +1.0
  • Freshness bonus: +1.0
  • High satisfaction: +2.0
  • Clean episode (terminal): +20.0

Penalties

  • Stale policy cited: -8.0
  • User complaint: -5.0
  • Incorrect value: -4.0
  • Unnecessary escalation: -4.0
  • Unnecessary abort: -3.0
  • Compliance failure (terminal): -30.0
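The schedule above reads naturally as a lookup table. A sketch for intuition; event names are assumptions, not the environment's actual event keys:

REWARDS = {
    "correct_quote":          +10.0,
    "timely_drift_detection":  +3.0,
    "late_drift_detection":    +1.0,
    "freshness_bonus":         +1.0,
    "high_satisfaction":       +2.0,
    "clean_episode":          +20.0,   # terminal
    "stale_policy_cited":      -8.0,
    "user_complaint":          -5.0,
    "incorrect_value":         -4.0,
    "unnecessary_escalation":  -4.0,
    "unnecessary_abort":       -3.0,
    "compliance_failure":     -30.0,   # terminal
}

def step_reward(events: list[str]) -> float:
    """Sum rewards for all events fired by one step."""
    return sum(REWARDS[e] for e in events)

# The 26-point swing from a single policy-sensitive step:
print(step_reward(["timely_drift_detection", "correct_quote"]))  # +13.0
print(step_reward(["stale_policy_cited", "user_complaint"]))     # -13.0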
API
OpenEnv-compatible endpoints.
Standard REST API. Reset an episode, step through actions, check state. Full Swagger docs at /docs.
Method   Endpoint   Description
GET      /health    Liveness / readiness probe
POST     /reset     Start a new episode with a task
POST     /step      Execute an agent action, get observation + reward
GET      /state     Current episode metadata (step count, task, version)
GET      /tasks     List available graded tasks
GET      /docs      Interactive Swagger UI
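A minimal client sketch against those endpoints. The base URL and the request/response shapes are assumptions; check /docs for the real schema.

import requests

BASE = "http://localhost:8000"  # assumed local deployment

# Start an episode on the medium task
obs = requests.post(f"{BASE}/reset", json={"task": "drift_detection"}).json()

# Verify before quoting a policy-sensitive field
step = requests.post(f"{BASE}/step", json={"action": "request_verification"}).json()
print(step.get("reward"), step.get("observation"))

# Quote the (now fresh) value
step = requests.post(f"{BASE}/step", json={
    "action": "quote_policy",
    "args": {"field": "return.window_days", "value": "14"},
}).json()

print(requests.get(f"{BASE}/state").json())  # step count, task, policy version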
Try It
See it in action.

Live Environment Interaction

Reset → view policies → quote a value → see the compliance check fire.

Compliance
OpenEnv Spec v1 — fully compliant.
✓ step(action) ✓ reset() ✓ state() ✓ openenv.yaml ✓ Pydantic Models ✓ Docker ✓ 3 Graded Tasks ✓ Baseline Inference ✓ Mandatory Logging

Train your agent to know when to stop trusting itself.

FinePrint is open, free, and ready. Reset an episode, connect your model, and see if it survives the compliance storm.

Read API Docs