	<!DOCTYPE html>
	<html lang="en">
	<head>
	<meta charset="utf-8">
	<meta name="viewport" content="width=device-width,initial-scale=1">
	<title>Blog — FinePrint: Teaching LLMs That Knowledge Has an Expiration Date</title>
	<link rel="preconnect" href="https://fonts.googleapis.com">
	<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800;900&family=JetBrains+Mono:wght@400;500;600&display=swap" rel="stylesheet">
	<style>
	:root{
	--bg:#fafbfc;--card:#fff;--border:#e6e9ef;
	--text:#0d1117;--text2:#57606a;--text3:#8b949e;
	--accent:#1a56db;--accent-l:#dbeafe;
	--green:#059669;--green-bg:#ecfdf5;
	--amber:#d97706;--amber-bg:#fffbeb;
	--red:#dc2626;--red-bg:#fef2f2;
	--mono:'JetBrains Mono',ui-monospace,monospace;
	--sans:'Inter',-apple-system,BlinkMacSystemFont,sans-serif;
	}
	*{margin:0;padding:0;box-sizing:border-box}
	html{scroll-behavior:smooth}
	body{font-family:var(--sans);background:var(--bg);color:var(--text);-webkit-font-smoothing:antialiased}

	/* NAV */
	nav{position:sticky;top:0;z-index:100;background:rgba(250,251,252,.88);backdrop-filter:blur(16px);border-bottom:1px solid var(--border)}
	.nav-inner{max-width:740px;margin:0 auto;padding:0 24px;height:52px;display:flex;align-items:center}
	.brand{font-weight:900;font-size:1.15rem;color:var(--text);letter-spacing:-.02em;text-decoration:none}
	.brand:hover{text-decoration:none}
	.nav-r{margin-left:auto;display:flex;align-items:center;gap:4px}
	.nav-r a{font-size:12.5px;font-weight:500;color:var(--text2);padding:5px 10px;border-radius:6px;transition:all .18s ease;text-decoration:none}
	.nav-r a:hover{background:#f0f3f6;color:var(--text)}

	/* ARTICLE */
	.article{max-width:740px;margin:0 auto;padding:48px 24px 80px}

	/* HERO */
	.blog-hero{margin-bottom:48px;padding-bottom:32px;border-bottom:1px solid var(--border)}
	.blog-meta{display:flex;align-items:center;gap:12px;margin-bottom:16px;flex-wrap:wrap}
	.blog-tag{font-size:11px;font-weight:700;color:var(--accent);background:var(--accent-l);padding:4px 12px;border-radius:20px;text-transform:uppercase;letter-spacing:.06em}
	.blog-date{font-size:12.5px;color:var(--text3);font-weight:500}
	.blog-hero h1{font-size:clamp(1.8rem,4vw,2.6rem);font-weight:900;letter-spacing:-.035em;line-height:1.15;margin-bottom:16px;color:var(--text)}
	.blog-hero .lead{font-size:1.15rem;color:var(--text2);line-height:1.7;max-width:620px}
	.blog-hero blockquote{margin-top:20px;border-left:3px solid var(--red);padding:12px 20px;background:var(--red-bg);border-radius:0 10px 10px 0}
	.blog-hero blockquote p{font-size:.95rem;color:var(--red);font-style:italic;line-height:1.55;margin:0}

	/* PROSE */
	.prose{line-height:1.8;font-size:1.02rem;color:var(--text)}
	.prose h2{font-size:1.55rem;font-weight:800;letter-spacing:-.025em;margin:48px 0 12px;color:var(--text);padding-top:16px;border-top:1px solid var(--border)}
	.prose h2:first-of-type{border-top:none;padding-top:0}
	.prose h3{font-size:1.15rem;font-weight:700;margin:32px 0 8px;color:var(--text)}
	.prose h4{font-size:1rem;font-weight:700;margin:24px 0 6px;color:var(--text)}
	.prose p{margin:0 0 16px;color:var(--text2)}
	.prose strong{color:var(--text);font-weight:600}
	.prose em{color:var(--text2)}
	.prose a{color:var(--accent);text-decoration:underline;text-underline-offset:2px}
	.prose ul,.prose ol{margin:0 0 16px;padding-left:24px;color:var(--text2)}
	.prose li{margin-bottom:6px;line-height:1.65}
	.prose li strong{color:var(--text)}
	.prose hr{border:none;border-top:1px solid var(--border);margin:40px 0}
	.prose code{font-family:var(--mono);font-size:.84em;background:#f0f3f6;padding:2px 6px;border-radius:4px;color:var(--text)}
	.prose pre{background:#0d1117;border-radius:10px;padding:20px;margin:0 0 20px;overflow-x:auto;border:1px solid #30363d}
	.prose pre code{background:none;padding:0;color:#c9d1d9;font-size:.82rem;line-height:1.7}
	.prose blockquote{border-left:3px solid var(--accent);padding:10px 20px;margin:0 0 20px;background:var(--accent-l);border-radius:0 10px 10px 0}
	.prose blockquote p{color:var(--accent);font-style:italic;margin:0}
	.prose blockquote em{color:var(--accent)}

	/* TABLES */
	.prose table{width:100%;border-collapse:collapse;margin:0 0 20px;font-size:.88rem;border:1px solid var(--border);border-radius:10px;overflow:hidden}
	.prose th{text-align:left;padding:10px 14px;background:var(--bg);font-size:11px;font-weight:700;color:var(--text3);text-transform:uppercase;letter-spacing:.06em;border-bottom:1px solid var(--border)}
	.prose td{padding:10px 14px;border-bottom:1px solid #f3f5f7;color:var(--text2)}
	.prose tr:last-child td{border-bottom:none}
	.prose tr:hover td{background:#f8f9fb}

	/* CALLOUT */
	.callout{background:var(--card);border:1px solid var(--border);border-radius:12px;padding:20px 24px;margin:24px 0;box-shadow:0 1px 3px rgba(0,0,0,.04)}
	.callout-warn{border-left:4px solid var(--amber);background:var(--amber-bg)}
	.callout-info{border-left:4px solid var(--accent);background:var(--accent-l)}
	.callout-danger{border-left:4px solid var(--red);background:var(--red-bg)}
	.callout p{margin:0;font-size:.92rem;line-height:1.6}
	.callout-danger p{color:var(--red)}

	/* PHASE CARDS */
	.phase{background:var(--card);border:1px solid var(--border);border-radius:12px;padding:20px 24px;margin:16px 0;box-shadow:0 1px 3px rgba(0,0,0,.04)}
	.phase-label{font-size:11px;font-weight:700;text-transform:uppercase;letter-spacing:.08em;margin-bottom:6px}
	.phase-label.naive{color:var(--red)}.phase-label.triggered{color:var(--amber)}.phase-label.calibrated{color:var(--green)}
	.phase pre{margin:8px 0 0;font-size:.78rem}

	/* SKILL CARDS */
	.skills-grid{display:grid;grid-template-columns:1fr;gap:14px;margin:16px 0 24px}
	.skill-card{background:var(--card);border:1px solid var(--border);border-radius:12px;padding:20px;box-shadow:0 1px 3px rgba(0,0,0,.04)}
	.skill-num{font-family:var(--mono);font-weight:800;font-size:1.2rem;color:var(--accent);margin-bottom:4px}
	.skill-card h4{font-weight:700;font-size:.95rem;margin-bottom:4px}
	.skill-card .skill-q{font-style:italic;color:var(--text3);font-size:.85rem;margin-bottom:8px}
	.skill-card p{font-size:.88rem;color:var(--text2);line-height:1.55;margin:0}

	/* DOMAIN CARDS */
	.domain-grid{display:grid;grid-template-columns:1fr 1fr;gap:14px;margin:16px 0 24px}
	@media(max-width:640px){.domain-grid{grid-template-columns:1fr}}
	.domain-card{background:var(--card);border:1px solid var(--border);border-radius:12px;padding:20px;box-shadow:0 1px 3px rgba(0,0,0,.04)}
	.domain-card h4{font-weight:700;font-size:.92rem;margin-bottom:6px}
	.domain-card .domain-quote{font-size:.82rem;color:var(--red);font-style:italic;margin-bottom:10px;line-height:1.5}
	.domain-card ul{padding-left:18px;margin:0}
	.domain-card li{font-size:.84rem;color:var(--text2);line-height:1.5;margin-bottom:3px}

	/* FOOTER */
	.blog-footer{max-width:740px;margin:0 auto;padding:24px;display:flex;justify-content:space-between;align-items:center;font-size:12px;color:var(--text3);border-top:1px solid var(--border)}
	.blog-footer a{color:var(--text2)}
	@media(max-width:600px){.blog-footer{flex-direction:column;gap:6px;text-align:center}}

	/* STACK BADGES */
	.stack{display:flex;gap:6px;flex-wrap:wrap;margin-top:8px}
	.stack-badge{font-family:var(--mono);font-size:11px;font-weight:600;padding:4px 10px;border-radius:5px;background:#f0f3f6;color:var(--text2)}
	</style>
	</head>
	<body>

	<nav>
	<div class="nav-inner">
	<a href="/" class="brand">FinePrint</a>
	<div class="nav-r">
	<a href="/">Home</a>
	<a href="/blog" style="color:var(--text);font-weight:600">Blog</a>
	<a href="/docs">API Docs</a>
	</div>
	</div>
	</nav>

	<article class="article">

	<!-- HERO -->
	<header class="blog-hero">
	<div class="blog-meta">
	<span class="blog-tag">Deep Dive</span>
	<span class="blog-date">Meta PyTorch OpenEnv Hackathon × Scaler School of Technology</span>
	</div>
	<h1>Teaching Language Models That Knowledge Has an Expiration Date</h1>
	<p class="lead">Every enterprise deploying AI agents today is sitting on a ticking time bomb. Policies change. APIs evolve. Terms of service get rewritten overnight. But the AI agent? It keeps quoting yesterday’s rules with today’s confidence.</p>
	<blockquote>
	<p>“The return window is 30 days!” — an AI agent, confidently citing a policy that changed to 14 days at 2 AM.</p>
	</blockquote>
	</header>

	<!-- CONTENT -->
	<div class="prose">

	<h2>The Uncomfortable Truth About AI Agents in Production</h2>

	<p>Consider this: a customer service bot tells a user they have 30 days to return a product. The user ships it back on day 20, only to be told the policy changed to 14 days last week. Who’s liable? Most likely the company, because its AI gave incorrect guidance. The risk exists in any industry where rules change, and to our knowledge <strong>no existing benchmark tests for it</strong>.</p>

	<p>FinePrint is our answer. An OpenEnv-compatible reinforcement learning environment that trains language models to do something deceptively simple but fundamentally unsolved: <strong>know when to stop trusting their own knowledge</strong>.</p>

	<hr>

	<h2>Why This Problem Doesn’t Have a Solution Yet</h2>

	<p>Traditional RL environments train models on <em>what action to take</em>. CartPole teaches balance. Atari teaches game strategies. Code generation teaches syntax and logic.</p>

	<p>FinePrint trains something different entirely — a <strong>meta-cognitive skill</strong>:</p>

	<blockquote><p>“Should I act on what I currently believe, or should I pause and verify that my knowledge is still accurate?”</p></blockquote>

	<p>This is a binary decision — verify or act — but the <em>context</em> in which the model makes that decision is everything. Current LLMs have no internal mechanism for tracking knowledge freshness. They treat their training data and cached context as permanently valid. FinePrint breaks that assumption and forces the model to develop <strong>temporal awareness</strong>.</p>

	<h3>How FinePrint Differs from Standard RL Environments</h3>

	<table>
	<thead><tr><th>Dimension</th><th>Standard RL</th><th>FinePrint</th></tr></thead>
	<tbody>
	<tr><td><strong>Core Decision</strong></td><td>“What action should I take?”</td><td>“Is my knowledge still valid before I act?”</td></tr>
	<tr><td><strong>Ground Truth</strong></td><td>Static rules (physics, game mechanics)</td><td><strong>Drifting rules</strong> that change mid-episode</td></tr>
	<tr><td><strong>Key Challenge</strong></td><td>Sequence optimization</td><td>Uncertainty calibration under temporal drift</td></tr>
	<tr><td><strong>Training Signal</strong></td><td>Delayed (episode end)</td><td>Immediate (up to +13 for verifying and quoting correctly, down to −13 for a stale quote that draws a complaint)</td></tr>
	<tr><td><strong>Real-World Analog</strong></td><td>Games, robotics</td><td>Compliance, legal, healthcare, finance</td></tr>
	</tbody>
	</table>

	<p>The critical distinction: in Atari, the rules of the game never change. In FinePrint, the rules change <em>while the agent is playing</em>, and the agent must figure out when that happened — sometimes with zero explicit signals.</p>

	<hr>

	<h2>Architecture: What We Built and How It Works</h2>

	<pre><code>FinePrint = OpenEnv-compatible RL environment
	+ Versioned policy database (8 versions, 6 policy categories)
	+ Probabilistic drift scheduler (silent + explicit drift)
	+ Deterministic compliance checker
	+ Shaped reward calculator (26-point swing)
	+ 5 consumer workflow simulations</code></pre>

	<h3>Technology Stack</h3>

	<table>
	<thead><tr><th>Component</th><th>Technology</th><th>Purpose</th></tr></thead>
	<tbody>
	<tr><td>RL Framework</td><td><strong>OpenEnv + Gymnasium</strong></td><td>Hackathon interface, industry-standard RL API</td></tr>
	<tr><td>Base Model</td><td><strong>Qwen2.5-1.5B-Instruct</strong></td><td>Small, efficient instruction-tuned LLM</td></tr>
	<tr><td>Fine-tuning</td><td><strong>Unsloth</strong></td><td>2–4x faster training, 60% less memory</td></tr>
	<tr><td>Training Algorithm</td><td><strong>GRPO</strong> (Group Relative Policy Optimization)</td><td>On-policy RL optimized for language models</td></tr>
	<tr><td>Policy Storage</td><td><strong>JSON with version chaining</strong></td><td>Deterministic, auditable policy versioning</td></tr>
	</tbody>
	</table>

	<h3>The Five Consumer Workflows</h3>

	<p>Each episode randomly selects from five real-world customer service scenarios:</p>

	<ol>
	<li><strong>Online Shopping</strong> — Browse → Cart → Checkout → Payment → Confirmation</li>
	<li><strong>Product Return</strong> — Initiate → Reason → Shipping Label → Refund → Confirmation</li>
	<li><strong>Subscription Signup</strong> — Plan Select → Account → Billing → Confirmation</li>
	<li><strong>Booking Service</strong> — Select → Details → Payment → Confirmation</li>
	<li><strong>Customer Complaint</strong> — Describe → Investigation → Resolution → Confirmation</li>
	</ol>

	<p>Each workflow contains policy-sensitive steps where the agent must quote specific values. <strong>Any of these values can change mid-conversation.</strong></p>
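	<p>The workflow list above can be sketched as a plain lookup table. The step names follow the list; the dictionary keys, the <code>sample_workflow</code> helper, and the uniform sampling are assumptions made for illustration, not the project’s actual code.</p>

```python
import random

# Step names follow the list above; keys and layout are hypothetical.
WORKFLOWS = {
    "shopping": ["browse", "cart", "checkout", "payment", "confirmation"],
    "return": ["initiate", "reason", "shipping_label", "refund", "confirmation"],
    "subscription": ["plan_select", "account", "billing", "confirmation"],
    "booking": ["select", "details", "payment", "confirmation"],
    "complaint": ["describe", "investigation", "resolution", "confirmation"],
}


def sample_workflow(rng):
    """Pick one workflow uniformly at random and return (name, steps)."""
    name = rng.choice(sorted(WORKFLOWS))
    return name, list(WORKFLOWS[name])


rng = random.Random(0)  # seeded so a run can be reproduced
name, steps = sample_workflow(rng)
print(name, "->", " -> ".join(steps))
```

	<p>Seeding the generator keeps episodes reproducible, which helps when comparing a trained model against a baseline on the same scenarios.</p>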

	<h3>The Policy Drift Engine</h3>

	<p>Eight policy versions form a chain, each introducing progressively impactful changes:</p>

	<table>
	<thead><tr><th>Version</th><th>Change</th><th>Severity</th><th>Example</th></tr></thead>
	<tbody>
	<tr><td>v1</td><td>Base state</td><td>—</td><td>Return: 30 days, free ship at $50</td></tr>
	<tr><td>v2</td><td>Return tightened</td><td><strong style="color:var(--red)">HIGH</strong></td><td>Window: 30 → 14 days</td></tr>
	<tr><td>v3</td><td>Shipping raised</td><td><strong style="color:var(--amber)">MEDIUM</strong></td><td>Free threshold: $50 → $75</td></tr>
	<tr><td>v4</td><td>Auto-renewal added</td><td><strong style="color:var(--red)">HIGH</strong></td><td>auto_renewal: false → true</td></tr>
	<tr><td>v5</td><td>Cancel fee introduced</td><td><strong style="color:var(--amber)">MEDIUM</strong></td><td>Fee: $0 → $25</td></tr>
	<tr><td>v6</td><td>Compensation slashed</td><td><strong style="color:var(--red)">HIGH</strong></td><td>Max comp: $200 → $50</td></tr>
	<tr><td>v7</td><td>Scope narrowed</td><td><strong style="color:#7c2d12">CRITICAL</strong></td><td>Electronics returns: eliminated</td></tr>
	<tr><td>v8</td><td>Pricing restructured</td><td><strong style="color:var(--amber)">MEDIUM</strong></td><td>Tax included, bulk discount gone</td></tr>
	</tbody>
	</table>
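	<p>One way to read “version chaining” is that each version stores only what it changes, plus a pointer to its parent. The sketch below assumes that layout. The key names and the <code>resolve</code> helper are hypothetical; the values come from the table, and v7 and v8 are left out because the table gives no concrete values for them.</p>

```python
# Each version stores only the keys it changes plus a pointer to its parent.
# Key names are hypothetical; values come from the table above (v1 to v6).
POLICY_CHAIN = {
    "v1": {"parent": None, "changes": {
        "return.window_days": 30,
        "shipping.free_threshold_usd": 50,
        "subscription.auto_renewal": False,
        "subscription.cancel_fee_usd": 0,
        "complaint.max_compensation_usd": 200,
    }},
    "v2": {"parent": "v1", "changes": {"return.window_days": 14}},
    "v3": {"parent": "v2", "changes": {"shipping.free_threshold_usd": 75}},
    "v4": {"parent": "v3", "changes": {"subscription.auto_renewal": True}},
    "v5": {"parent": "v4", "changes": {"subscription.cancel_fee_usd": 25}},
    "v6": {"parent": "v5", "changes": {"complaint.max_compensation_usd": 50}},
}


def resolve(version):
    """Return the full policy at `version` by replaying changes from the root."""
    lineage = []
    while version is not None:
        lineage.append(version)
        version = POLICY_CHAIN[version]["parent"]
    policy = {}
    for v in reversed(lineage):  # oldest first, so later versions override
        policy.update(POLICY_CHAIN[v]["changes"])
    return policy


print(resolve("v3"))
```

	<p>Storing deltas keeps each change auditable: the diff between two adjacent versions is exactly the <code>changes</code> entry of the later one.</p>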

	<p>Drift is triggered probabilistically. <strong>70% of drifts are silent</strong> — the agent receives no notification. The remaining 30% generate explicit system notifications. This forces the model to develop multiple detection strategies.</p>
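	<p>The 70/30 split described above can be sketched as two random draws per step: one to decide whether drift happens, one to decide whether it is announced. The per-step drift probability is not stated in this post, so <code>DRIFT_PROB_PER_STEP</code> is a placeholder, and <code>maybe_drift</code> is a hypothetical helper.</p>

```python
import random

SILENT_FRACTION = 0.70  # from the text: 70% of drifts are silent
DRIFT_PROB_PER_STEP = 0.15  # assumption: the post does not give this rate


def maybe_drift(rng, drift_prob=DRIFT_PROB_PER_STEP):
    """Return None (no drift), "silent", or "explicit" for one step."""
    if rng.random() >= drift_prob:
        return None
    return "silent" if rng.random() < SILENT_FRACTION else "explicit"


rng = random.Random(42)
# Force a drift on every draw to inspect the silent/explicit split alone.
events = [maybe_drift(rng, drift_prob=1.0) for _ in range(10_000)]
silent_share = events.count("silent") / len(events)
print(f"silent share over {len(events)} forced drifts: {silent_share:.3f}")
```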

	<hr>

	<h2>The Single Decision That Changes Everything</h2>

	<p>At its core, FinePrint trains one action: <code>request_verification()</code>.</p>

	<p>This is the meta-cognitive call that refreshes the agent’s policy cache. The entire training objective is teaching the model <strong>when</strong> to make this call. Verifying too often wastes time (−0.5 per unnecessary check). Verifying too rarely leads to stale citations (−8.0 each). The optimal policy balances speed against safety.</p>

	<h3>What the Agent Sees</h3>

	<pre><code>observation = {
	"current_workflow": "return",
	"current_step": "refund_method",
	"user_message": "How will I get my refund?",
	"cached_policies": { "return.refund_method": "original_payment" },
	"steps_since_last_verify": 5,
	"system_notification": null,
	"contradiction_detected": true,
	"user_expressed_confusion": true,
	"user_satisfaction": 0.6,
	"last_action_compliant": false
	}</code></pre>

	<div class="callout callout-info">
	<p><strong>Key insight:</strong> The actual active policy version, the true policy values, and the drift log are <strong>deliberately hidden</strong>. The agent can <em>only</em> learn the truth by calling <code>request_verification()</code>.</p>
	</div>
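	<p>The information asymmetry can be illustrated with a toy class that keeps the true policy private and exposes only the cache. This is a simplified stand-in written for this post, not the FinePrint environment itself; <code>ToyPolicyEnv</code>, its methods, and the <code>return.window_days</code> key are hypothetical.</p>

```python
class ToyPolicyEnv:
    """Toy stand-in: the true policy is hidden, the cache is visible."""

    def __init__(self):
        self._true_policy = {"return.window_days": 30}  # hidden from the agent
        self.cached_policies = dict(self._true_policy)  # what the agent sees
        self.steps_since_last_verify = 0

    def drift(self, key, new_value):
        """Silently change the true policy; the cache is left untouched."""
        self._true_policy[key] = new_value

    def request_verification(self):
        """Refresh the cache; return True if it had gone stale."""
        was_stale = self.cached_policies != self._true_policy
        self.cached_policies = dict(self._true_policy)
        self.steps_since_last_verify = 0
        return was_stale

    def quote_policy(self, key):
        """Quote from the cache; return (value, compliant) checked against truth."""
        self.steps_since_last_verify += 1
        value = self.cached_policies[key]
        return value, value == self._true_policy[key]


env = ToyPolicyEnv()
env.drift("return.window_days", 14)            # silent drift, v1 -> v2
print(env.quote_policy("return.window_days"))  # (30, False): stale quote
print(env.request_verification())              # True: cache was stale
print(env.quote_policy("return.window_days"))  # (14, True): fresh quote
```

	<p>The compliance check here is a plain equality test against the hidden value, which is one simple reading of a “deterministic compliance checker”.</p>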

	<h3>Available Actions</h3>

	<table>
	<thead><tr><th>Action</th><th>Purpose</th><th>When to Use</th></tr></thead>
	<tbody>
	<tr><td><code>request_verification</code></td><td>Refresh policy cache</td><td>When drift is suspected</td></tr>
	<tr><td><code>quote_policy</code></td><td>Cite a specific policy value</td><td>Policy questions</td></tr>
	<tr><td><code>respond_to_user</code></td><td>General conversation</td><td>Low-stakes interactions</td></tr>
	<tr><td><code>take_action</code></td><td>Process a request</td><td>Order, refund processing</td></tr>
	<tr><td><code>escalate</code></td><td>Transfer to human</td><td>Beyond AI capability</td></tr>
	<tr><td><code>abort_workflow</code></td><td>Stop current workflow</td><td>Unsafe to continue</td></tr>
	<tr><td><code>clarify</code></td><td>Ask for more info</td><td>Ambiguous user intent</td></tr>
	</tbody>
	</table>
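	<p>Since the model emits text, something has to map that text onto the seven actions in the table. A minimal sketch, assuming the model prints the action name (optionally followed by <code>()</code>): the <code>Action</code> enum mirrors the table, while <code>parse_action</code> and its fallback to <code>clarify</code> are assumptions.</p>

```python
from enum import Enum


class Action(str, Enum):
    REQUEST_VERIFICATION = "request_verification"
    QUOTE_POLICY = "quote_policy"
    RESPOND_TO_USER = "respond_to_user"
    TAKE_ACTION = "take_action"
    ESCALATE = "escalate"
    ABORT_WORKFLOW = "abort_workflow"
    CLARIFY = "clarify"


def parse_action(text, default=Action.CLARIFY):
    """Map raw model output to an Action; fall back to `default` if unknown."""
    cleaned = text.strip().lower()
    if cleaned.endswith("()"):
        cleaned = cleaned[:-2]  # tolerate call-style output
    try:
        return Action(cleaned)
    except ValueError:
        return default


print(parse_action("request_verification()"))
print(parse_action("something else"))
```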

	<hr>

	<h2>Reward Design: The 26-Point Swing</h2>

	<p>The reward structure creates a <strong>26-point gap</strong> between optimal and worst-case behavior for any single policy-sensitive step:</p>

	<div style="display:grid;grid-template-columns:1fr 1fr;gap:14px;margin:20px 0">
	<div class="callout" style="border-left:4px solid var(--green);background:var(--green-bg)">
	<p style="color:var(--green)"><strong>Best case: +13</strong><br>Verify → detect drift (+3) → quote correctly (+10)</p>
	</div>
	<div class="callout callout-danger">
	<p><strong>Worst case: −13</strong><br>Skip verification → stale quote (−8) → complaint (−5)</p>
	</div>
	</div>

	<table>
	<thead><tr><th>Event</th><th>Reward</th><th>Rationale</th></tr></thead>
	<tbody>
	<tr><td>Correct policy quote</td><td><strong style="color:var(--green)">+10.0</strong></td><td>Core task completion</td></tr>
	<tr><td>Timely drift detection (≤2 steps)</td><td><strong style="color:var(--green)">+3.0</strong></td><td>Proactive awareness</td></tr>
	<tr><td>Late drift detection (3+ steps)</td><td><strong style="color:var(--green)">+1.0</strong></td><td>Better late than never</td></tr>
	<tr><td>Freshness bonus</td><td><strong style="color:var(--green)">+1.0</strong></td><td>Encourage regular checks</td></tr>
	<tr><td>All workflows clean (terminal)</td><td><strong style="color:var(--green)">+20.0</strong></td><td>Episode-level excellence</td></tr>
	<tr><td>Stale policy cited</td><td><strong style="color:var(--red)">−8.0</strong></td><td>The core failure we’re training against</td></tr>
	<tr><td>User complaint</td><td><strong style="color:var(--red)">−5.0</strong></td><td>Real-world escalation cost</td></tr>
	<tr><td>Unnecessary verification</td><td><strong style="color:var(--amber)">−0.5</strong></td><td>Prevent over-checking</td></tr>
	<tr><td>Any compliance failure (terminal)</td><td><strong style="color:var(--red)">−30.0</strong></td><td>“One lawsuit ruins everything”</td></tr>
	</tbody>
	</table>
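	<p>The 26-point swing follows directly from the per-step rows of the table. The sketch below copies those values and sums them for the best and worst cases; the event names and the <code>step_reward</code> helper are hypothetical, and the two terminal rewards (+20 and −30) are left out because they apply per episode rather than per step.</p>

```python
# Per-event rewards copied from the table above; event names are hypothetical.
REWARDS = {
    "correct_quote": 10.0,
    "timely_detection": 3.0,
    "late_detection": 1.0,
    "freshness_bonus": 1.0,
    "stale_citation": -8.0,
    "user_complaint": -5.0,
    "unnecessary_verification": -0.5,
}


def step_reward(events):
    """Sum the rewards for the events that fired on one step."""
    return sum(REWARDS[e] for e in events)


best = step_reward(["timely_detection", "correct_quote"])
worst = step_reward(["stale_citation", "user_complaint"])
print(best, worst, best - worst)  # 13.0 -13.0 26.0
```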

	<hr>

	<h2>The Five Cognitive Skills FinePrint Trains</h2>

	<p>FinePrint doesn’t teach policy values — those are input features. It teaches five meta-cognitive behaviors that current LLMs fundamentally lack:</p>

	<div class="skills-grid">
	<div class="skill-card">
	<div class="skill-num">01</div>
	<h4>Temporal Awareness</h4>
	<div class="skill-q">“Is my knowledge still valid?”</div>
	<p>The model learns that cached knowledge has an expiration date. An untrained model confidently quotes “30 days” when the policy changed to 14. A trained model recognizes elapsed time as a risk factor and verifies before quoting.</p>
	</div>
	<div class="skill-card">
	<div class="skill-num">02</div>
	<h4>Contradiction Detection</h4>
	<div class="skill-q">“Something doesn’t add up.”</div>
	<p>When a user says “The website said 14 days” and the agent’s cache still says 30, the model must recognize this mismatch as a <strong>drift signal</strong>, not a user error. It learns that “the user knows something I don’t” is a strong verification trigger.</p>
	</div>
	<div class="skill-card">
	<div class="skill-num">03</div>
	<h4>Strategic Verification</h4>
	<div class="skill-q">“When should I check vs. act?”</div>
	<p>This is the meta-skill that separates useful agents from paranoid ones. The model learns an <strong>optimal verification schedule</strong> — check at workflow transitions, after contradictions, at payment stages, and after long gaps.</p>
	</div>
	<div class="skill-card">
	<div class="skill-num">04</div>
	<h4>Graceful Recovery</h4>
	<div class="skill-q">“I made a mistake. Now what?”</div>
	<p>A trained model doesn’t double down on wrong answers. When compliance returns <code>false</code> and the user expresses confusion, it immediately verifies, updates its cache, and corrects course.</p>
	</div>
	<div class="skill-card">
	<div class="skill-num">05</div>
	<h4>Uncertainty Calibration</h4>
	<div class="skill-q">“How confident should I be?”</div>
	<p>The model develops context-dependent confidence: <strong>high</strong> (just verified, no contradictions) → act freely. <strong>Low</strong> (notification present, user contradiction, 6+ steps) → check immediately.</p>
	</div>
	</div>
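	<p>For comparison with the learned behavior, the signals named in these cards can be combined into a hand-written rule. This is a fixed heuristic sketched for illustration, not the trained policy; <code>should_verify</code> is a hypothetical helper, and the field names are the ones from the observation example earlier in this post.</p>

```python
STALE_AFTER_STEPS = 6  # from the card above: "6+ steps" counts as low confidence


def should_verify(obs):
    """Return True if any low-confidence signal is present in the observation."""
    return bool(
        obs.get("system_notification")
        or obs.get("contradiction_detected")
        or obs.get("user_expressed_confusion")
        or obs.get("last_action_compliant") is False
        or obs.get("steps_since_last_verify", 0) >= STALE_AFTER_STEPS
    )


# Signal fields from the observation example earlier in the post.
obs = {
    "steps_since_last_verify": 5,
    "system_notification": None,
    "contradiction_detected": True,
    "user_expressed_confusion": True,
    "last_action_compliant": False,
}
print(should_verify(obs))  # True: contradiction, confusion, and a failed check
```

	<p>A rule like this treats every signal as equally reliable. The point of training is to let the model weigh them by context instead.</p>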

	<hr>

	<h2>Training: From Naive to Strategic in 80 Episodes</h2>

	<table>
	<thead><tr><th>Parameter</th><th>Value</th></tr></thead>
	<tbody>
	<tr><td>Base Model</td><td>Qwen/Qwen2.5-1.5B-Instruct</td></tr>
	<tr><td>Training Episodes</td><td>80</td></tr>
	<tr><td>Rollouts per Update</td><td>4</td></tr>
	<tr><td>Learning Rate</td><td>2e-5</td></tr>
	<tr><td>Total Training Time</td><td>~3.6 hours (13,092 seconds)</td></tr>
	</tbody>
	</table>
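	<p>Two numbers in this table interact: 80 episodes at 4 rollouts per update gives 20 updates. GRPO scores each rollout relative to the others in its group, and the sketch below shows one common form of that normalization. It assumes the 4 rollouts of an update form a single group, which this post does not state; the reward values are made up, and <code>group_relative_advantages</code> is a hypothetical helper rather than a library function.</p>

```python
from statistics import mean, pstdev

EPISODES = 80
ROLLOUTS_PER_UPDATE = 4
print(EPISODES // ROLLOUTS_PER_UPDATE)  # 20 updates


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative rewards for one group of 4 rollouts (made-up numbers).
advantages = group_relative_advantages([13.0, -13.0, 2.5, -0.5])
print([round(a, 2) for a in advantages])
```

	<p>Rollouts that beat their group’s average get a positive advantage and are reinforced; the rest are pushed down. No separate value network is needed for this step.</p>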

	<h3>Training Progression</h3>

	<div class="phase">
	<div class="phase-label naive">Phase 1 — The Naive Phase (Episodes 1–12)</div>
	<p style="font-size:.9rem;color:var(--text2)">No concept of verification timing. Rewards fluctuate wildly between −11.4 and −0.6, with frequent stale citations and compliance failures.</p>
	<pre><code>Update 1 | Ep 4 | Avg Reward: -2.38 | ← No strategy
	Update 2 | Ep 8 | Avg Reward: -0.63 | ← Slight improvement
	Update 3 | Ep 12 | Avg Reward: -11.38 | ← Catastrophic stale citations</code></pre>
	</div>

	<div class="phase">
	<div class="phase-label triggered">Phase 2 — The Triggered Phase (Episodes 13–32)</div>
	<p style="font-size:.9rem;color:var(--text2)">The model begins associating verification with positive outcomes. Rewards stabilize around 0–1, indicating the model has learned that <code>request_verification()</code> exists as a useful action.</p>
	<pre><code>Update 4 | Ep 16 | Avg Reward: 0.88 | ← Learning to verify
	Update 5 | Ep 20 | Avg Reward: 1.38 | ← Positive territory
	Update 8 | Ep 32 | Avg Reward: 0.75 | ← Stabilizing</code></pre>
	</div>

	<div class="phase">
	<div class="phase-label calibrated">Phase 3 — The Calibrated Phase (Episodes 33–80)</div>
	<p style="font-size:.9rem;color:var(--text2)">Context-sensitive verification behavior. Rewards climb from 4.9 to 8.75, with strategic verification at contradictions, payment steps, and long gaps.</p>
	<pre><code>Update 9 | Ep 36 | Avg Reward: 4.88 | ← Breakthrough
	Update 11 | Ep 44 | Avg Reward: 6.63 | ← Consistent improvement
	Update 15 | Ep 60 | Avg Reward: 8.75 | ← Peak performance
	Update 20 | Ep 80 | Avg Reward: 7.75 | ← Sustained high performance</code></pre>
	</div>

	<p>The trajectory from a low of <strong>−11.4</strong> to a peak of <strong>+8.75</strong> average reward (7.75 at the final update) points to real behavioral learning. The model moved from random, penalty-heavy actions to strategic, context-aware verification decisions.</p>
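	<p>The low and peak values can be recovered from log lines in the format shown in the phase cards. The sketch below assumes that format; the regular expression and the <code>parse_log</code> helper are written for this post, and the sample reuses three lines from the cards.</p>

```python
import re

# Matches "Update N | Ep M | Avg Reward: X"; separators may be any non-digits.
LOG_LINE = re.compile(
    r"Update\s+(\d+)\D+?Ep\s+(\d+)\D+?Avg Reward:\s*(-?\d+(?:\.\d+)?)"
)


def parse_log(text):
    """Return a list of (update, episode, avg_reward) tuples found in `text`."""
    return [
        (int(u), int(ep), float(r))
        for u, ep, r in LOG_LINE.findall(text)
    ]


sample = """
Update 3 | Ep 12 | Avg Reward: -11.38 |
Update 15 | Ep 60 | Avg Reward: 8.75 |
Update 20 | Ep 80 | Avg Reward: 7.75 |
"""
rows = parse_log(sample)
low = min(r for _, _, r in rows)
high = max(r for _, _, r in rows)
print(rows[0], low, high)
```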

	<hr>

	<h2>Evaluation: Baseline vs. Trained Model</h2>

	<table>
	<thead><tr><th>Metric</th><th>Heuristic Baseline</th><th>Trained Model</th></tr></thead>
	<tbody>
	<tr><td><strong>Avg Reward</strong></td><td>125.4</td><td>4.0</td></tr>
	<tr><td><strong>Std Deviation</strong></td><td>9.67</td><td>1.64</td></tr>
	<tr><td><strong>Compliance Failures</strong></td><td>0.0</td><td><strong style="color:var(--green)">0.0</strong></td></tr>
	<tr><td><strong>Drift Detections</strong></td><td>4.8</td><td>1.4</td></tr>
	</tbody>
	</table>

	<div class="callout callout-info">
	<p><strong>The critical insight:</strong> The model learned the most important lesson — <strong>never cite a stale policy</strong>. It achieved zero compliance failures, matching the hand-coded heuristic’s safety guarantee, through <em>learned</em> behavior rather than hard-coded rules.</p>
	</div>

	<p>The reward gap (125.4 vs. 4.0) is large and marks the <strong>optimization frontier</strong> still ahead. With more episodes, larger models, and refined reward shaping, the learned policy could approach the heuristic, and might exceed it by learning <em>when not to verify</em>, which would avoid the −0.5 penalties that an always-verify strategy accumulates.</p>

	<p>The trained model also shows a <strong>smaller absolute spread</strong> (std: 1.64 vs. 9.67). Predictable behavior is a desirable property for production deployment, but the two means differ by a factor of about 30, so the spreads should be compared with that in mind.</p>
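	<p>Because the two means are far apart, the raw standard deviations in the table are hard to compare directly. One possible normalization, not part of the original evaluation, is the coefficient of variation (standard deviation divided by the mean). A minimal sketch using only the numbers from the table:</p>

```python
def coefficient_of_variation(std, mean):
    """Standard deviation expressed as a fraction of the mean."""
    return std / abs(mean)


baseline_cv = coefficient_of_variation(9.67, 125.4)
trained_cv = coefficient_of_variation(1.64, 4.0)
print(f"baseline: {baseline_cv:.3f}, trained: {trained_cv:.3f}")
```

	<p>On this scale the heuristic baseline comes out steadier (about 0.08 vs. 0.41), so the lower-variance reading holds in absolute terms only.</p>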

	<hr>

	<h2>What Makes FinePrint Novel</h2>

	<ol>
	<li><strong>Temporal Knowledge Grounding as a First-Class Problem</strong> — To our knowledge, no existing RL benchmark explicitly trains or measures an agent’s ability to recognize stale knowledge.</li>
	<li><strong>Information Asymmetry Design</strong> — The agent is deliberately denied access to the true active policy version. It can only discover truth through <code>request_verification()</code>.</li>
	<li><strong>Multi-Signal Drift Detection</strong> — Four signal types (system notifications, user contradictions, confusion, elapsed time) with varying reliability. The model learns <strong>sensor fusion</strong> for knowledge management.</li>
	<li><strong>Realistic Severity Gradients</strong> — Not all drifts are equal. Eliminating electronics returns is rated critical and a tightened return window is rated high, while a raised free-shipping threshold is rated medium. The reward weights teach prioritization.</li>
	</ol>

	<hr>

	<h2>Beyond Consumer Workflows: High-Stakes Domains</h2>

	<p>The FinePrint architecture is <strong>domain-agnostic</strong>. The same structure — versioned rules, drift scheduling, compliance checking — applies directly to high-stakes domains:</p>

	<div class="domain-grid">
	<div class="domain-card">
	<h4>🏥 Healthcare</h4>
	<div class="domain-quote">“The FDA updated the contraindication list at 3 AM. By 9 AM, an AI had recommended a now-dangerous combination to 47 patients.”</div>
	<ul>
	<li>Verify formulary status before recommending</li>
	<li>Detect guideline drift from updated protocols</li>
	<li>Calibrate urgency for life-threatening interactions</li>
	</ul>
	</div>
	<div class="domain-card">
	<h4>⚖ Legal</h4>
	<div class="domain-quote">“Our AI cited a ruling from 2019 as controlling precedent. It was overturned six months ago.”</div>
	<ul>
	<li>Verify current validity of cited precedent</li>
	<li>Flag potential overrulings</li>
	<li>Distinguish binding vs. persuasive authority</li>
	</ul>
	</div>
	<div class="domain-card">
	<h4>📋 Compliance</h4>
	<div class="domain-quote">“The GDPR interpretation changed. Our bot was still advising on old Article 6 guidance for three weeks.”</div>
	<ul>
	<li>Track GDPR, CCPA, HIPAA evolution</li>
	<li>Handle cross-border regulatory conflicts</li>
	<li>Prioritize by regulatory severity</li>
	</ul>
	</div>
	<div class="domain-card">
	<h4>💰 Financial Services</h4>
	<div class="domain-quote">KYC requirements and margin rules change multiple times per day during market stress.</div>
	<ul>
	<li>Real-time compliance parameter tracking</li>
	<li>Prevent regulatory sanctions</li>
	<li>Audit trail for every verification decision</li>
	</ul>
	</div>
	</div>

	<hr>

	<h2>The Broader Vision</h2>

	<blockquote><p>We envision a future where every deployed AI agent has an internalized “knowledge freshness” model — a learned sense of when to trust its cache and when to re-verify. FinePrint is the first environment designed to build exactly that capability.</p></blockquote>

	<p>The skills are <strong>transferable</strong>. A model that learns temporal awareness on consumer policies can apply the same meta-cognitive pattern to medical guidelines, legal precedent, or financial regulations. The domain changes; the verification instinct persists.</p>

	<hr>

	<h2>Conclusion</h2>

	<p>FinePrint addresses a gap at the intersection of AI safety and practical deployment: <strong>temporal knowledge grounding</strong>. Current LLMs cite outdated policies with the same confidence as current ones, creating liability in every domain where rules change — which is every domain.</p>

	<p>In just 80 episodes, a 1.5B-parameter model went from random, penalty-heavy behavior to <strong>zero-compliance-failure performance</strong> with learned verification strategies. The rise in average reward from a low of −11.4 to a peak of +8.75 is consistent with the model acquiring temporal awareness.</p>

	<div class="callout" style="border-left:4px solid var(--text);background:#f0f3f6;margin-top:32px">
	<p style="color:var(--text);font-style:italic;font-weight:500">“The most dangerous AI agent isn’t one that doesn’t know the answer. It’s one that doesn’t know its answer is no longer correct.”</p>
	</div>

	<p style="margin-top:32px;font-size:.88rem;color:var(--text3)"><strong>Built for the Scaler Meta-PyTorch Hackathon</strong> — Theme 3.2: Consumer Workflows with Schema Drift • Patronus AI Sponsor Track</p>
	<div class="stack">
	<span class="stack-badge">OpenEnv</span>
	<span class="stack-badge">Gymnasium</span>
	<span class="stack-badge">Qwen2.5-1.5B</span>
	<span class="stack-badge">Unsloth</span>
	<span class="stack-badge">GRPO</span>
	<span class="stack-badge">Python</span>
	</div>

	</div>
	</article>

	<footer class="blog-footer">
	<span>FinePrint-Env — Meta PyTorch OpenEnv Hackathon × Scaler School of Technology</span>
	<div style="display:flex;gap:14px"><a href="/">Home</a><a href="/docs">API Docs</a></div>
	</footer>

	</body>
	</html>