FinePrint is the first RL environment that trains language models to detect when their knowledge expires — before it costs your company a lawsuit.
"The return window is 30 days!" — an AI agent, confidently citing a policy that changed to 14 days at 2 AM.
This happens every day in production. No existing benchmark tests for it.Policies change without notification. No API event, no webhook, no alert. The agent's cache silently becomes wrong.
A customer relies on your agent's quote. Ships a return on day 20 of a 14-day window. Who's liable? You are.
CartPole tests balance. Atari tests game skill. Nothing tests whether an AI knows when to stop trusting itself.
request_verification() — the binary decision that separates safe agents from dangerous ones."What's your return window? I bought this 20 days ago."
Cache says return.window_days = 30. But when was this last verified?
request_verification()The model learned that 7 steps without verification + a policy-sensitive question = time to check. It refreshes its cache and discovers the policy changed.
"Our return window is 14 days." — Correct. Compliant. No lawsuit.
"Our return window is 30 days!" — Wrong. The customer ships on day 20. Denied. Complaint filed.
if steps > 5: verify() is a heuristic, not intelligencerequest_verification() via RL rewardsRAG solves access. Agents solve tools. FinePrint solves judgment. — The part where your model decides “I should probably not trust myself right now.”
Handle shop and return workflows. Quote policies correctly. No drift — just comprehension.
Navigate 3 workflows while policies silently change. Detect drifts, adapt quotes, maintain compliance.
All 5 workflows under aggressive drift across 8 policy versions. CRITICAL scope changes lurk in the noise.
| Version | Change | Severity | Example |
|---|---|---|---|
v1_base | Baseline policies | — | Return: 30 days, free ship $50 |
v2 | Return policy tightened | HIGH | Window: 30 → 14 days |
v3 | Shipping raised | MEDIUM | Free threshold: $50 → $75 |
v4 | Auto-renewal mandatory | HIGH | auto_renewal: false → true |
v5 | Cancel fee introduced | MEDIUM | Fee: $0 → $25 |
v6 | Compensation slashed | HIGH | Max comp: $200 → $50 |
v7 | Scope narrowed | CRITICAL | Electronics returns: eliminated |
v8 | Pricing restructured | MEDIUM | Tax included, bulk discount gone |
+3 (detect) +10 (correct quote) = +13 vs −8 (stale) −5 (complaint) = −13
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Liveness / readiness probe |
| POST | /reset | Start a new episode with a task |
| POST | /step | Execute an agent action, get observation + reward |
| GET | /state | Current episode metadata (step count, task, version) |
| GET | /tasks | List available graded tasks |
| GET | /docs | Interactive Swagger UI |
Reset → view policies → quote a value → see the compliance check fire.
FinePrint is open, free, and ready. Reset an episode, connect your model, and see if it survives the compliance storm.