Spaces:

PraneshkumarR
/

fineprint-env

Sleeping

App Files Files Community

vigneshmoovendhan commited on Apr 26

Commit

c01210a

1 Parent(s): 0b6a889

blog added

Browse files

Files changed (2) hide show

README.md +4 -0
blog.md +452 -0

README.md CHANGED Viewed

@@ -331,6 +331,10 @@ fineprint/
 - 3 graded tasks with deterministic scoring
 - Baseline inference script included
 ## License
 [MIT](LICENSE)

 - 3 graded tasks with deterministic scoring
 - Baseline inference script included
+## Blog
+Read the detailed writeup: [FinePrint: Teaching Language Models That Knowledge Has an Expiration Date](blog.md)
 ## License
 [MIT](LICENSE)

blog.md ADDED Viewed

	@@ -0,0 +1,452 @@

+# FinePrint: Teaching Language Models That Knowledge Has an Expiration Date
+> *"The return window is 30 days!" — an AI agent, confidently citing a policy that changed to 14 days at 2 AM.*
+---
+## The Uncomfortable Truth About AI Agents in Production
+Every enterprise deploying AI agents today is sitting on a ticking time bomb. Policies change. APIs evolve. Terms of service get rewritten overnight. But the AI agent? It keeps quoting yesterday's rules with today's confidence.
+Consider this: a customer service bot tells a user they have 30 days to return a product. The user ships it back on day 20 — only to be told the policy changed to 14 days last week. Who's liable? The company, because their AI gave incorrect guidance. This isn't hypothetical. It's happening right now, across industries, and **no existing benchmark even tests for it**.
+FinePrint is our answer. An OpenEnv-compatible reinforcement learning environment that trains language models to do something deceptively simple but fundamentally unsolved: **know when to stop trusting their own knowledge**.
+---
+## Why This Problem Doesn't Have a Solution Yet
+Traditional RL environments train models on *what action to take*. CartPole teaches balance. Atari teaches game strategies. Code generation teaches syntax and logic.
+FinePrint trains something different entirely — a **meta-cognitive skill**:
+> *"Should I act on what I currently believe, or should I pause and verify that my knowledge is still accurate?"*
+This is a binary decision — verify or act — but the *context* in which the model makes that decision is everything. Current LLMs have no internal mechanism for tracking knowledge freshness. They treat their training data and cached context as permanently valid. FinePrint breaks that assumption and forces the model to develop **temporal awareness**.
+### How FinePrint Differs from Standard RL Environments
+| Dimension | Standard RL Environment | FinePrint |
+|---|---|---|
+| **Core Decision** | "What action should I take?" | "Is my knowledge still valid before I act?" |
+| **Action Space** | Complex (move, jump, attack, etc.) | Binary meta-decision + workflow actions |
+| **Ground Truth** | Static rules (physics, game mechanics) | **Drifting rules** that change mid-episode |
+| **Key Challenge** | Sequence optimization | Uncertainty calibration under temporal drift |
+| **Training Signal** | Delayed (episode end) | Immediate (+13 for correct detection, -13 for missed drift) |
+| **Real-World Analog** | Games, robotics | Compliance, legal, healthcare, finance |
+The critical distinction: in Atari, the rules of the game never change. In FinePrint, the rules change *while the agent is playing*, and the agent must figure out when that happened — sometimes with zero explicit signals.
+---
+## Architecture: What We Built and How It Works
+### The Environment at a Glance
+```
+FinePrint = OpenEnv-compatible RL environment
+          + Versioned policy database (8 versions, 6 policy categories)
+          + Probabilistic drift scheduler (silent + explicit drift)
+          + Deterministic compliance checker
+          + Shaped reward calculator (26-point swing)
+          + 5 consumer workflow simulations
+```
+### Technology Stack
+| Component | Technology | Purpose |
+|---|---|---|
+| RL Framework | **OpenEnv** + **Gymnasium** | Hackathon-required interface, industry-standard RL API |
+| Base Model | **Qwen/Qwen2.5-1.5B-Instruct** | Small, efficient instruction-tuned LLM |
+| Fine-tuning | **Unsloth** | 2-4x faster training with 60% less memory |
+| Training Algorithm | **GRPO** (Group Relative Policy Optimization) | On-policy RL optimized for language models |
+| Policy Storage | **JSON** with version chaining | Deterministic, auditable policy versioning |
+| Evaluation | Custom rollout engine | Before/after behavioral comparison |
+### The Five Consumer Workflows
+Each episode randomly selects from five real-world customer service scenarios:
+1. **Online Shopping** — Browse → Cart → Checkout → Payment → Confirmation
+2. **Product Return** — Initiate → Reason → Shipping Label → Refund → Confirmation
+3. **Subscription Signup** — Plan Select → Account → Billing → Confirmation
+4. **Booking Service** — Select → Details → Payment → Confirmation
+5. **Customer Complaint** — Describe → Investigation → Resolution → Confirmation
+Each workflow contains policy-sensitive steps where the agent must quote specific values — return windows, shipping thresholds, subscription terms, cancellation fees, compensation limits. **Any of these values can change mid-conversation.**
+### The Policy Drift Engine
+Eight policy versions form a chain, each introducing progressively impactful changes:
+| Version | What Changed | Severity | Example |
+|---|---|---|---|
+| v1 | Base state | — | Return: 30 days, free ship at $50 |
+| v2 | Return policy tightened | **HIGH** | Window: 30 → 14 days, refund → store credit |
+| v3 | Shipping thresholds raised | MEDIUM | Free threshold: $50 → $75 |
+| v4 | Auto-renewal added | **HIGH** | `auto_renewal`: false → true |
+| v5 | Cancellation fee introduced | MEDIUM | Fee: $0 → $25 |
+| v6 | Compensation slashed | **HIGH** | Max compensation: $200 → $50 |
+| v7 | Scope narrowed | **CRITICAL** | Electronics returns: eliminated |
+| v8 | Pricing restructured | MEDIUM | Tax included, bulk discount removed |
+Drift is triggered probabilistically during episodes. **70% of drifts are silent** — the agent receives no notification. The remaining 30% generate explicit system notifications. This forces the model to develop multiple detection strategies rather than relying on a single signal.
+---
+## The Single Decision That Changes Everything
+At its core, FinePrint trains one action: `request_verification()`.
+This is the meta-cognitive call that refreshes the agent's policy cache. The entire training objective is teaching the model **when** to make this call. Too often wastes time (-0.5 penalty per unnecessary check). Too rarely leads to stale citations (-8.0 penalty per violation). The optimal policy balances speed against safety.
+### What the Agent Sees (Observation Space)
+```
+observation = {
+    "current_workflow": "return",
+    "current_step": "refund_method",
+    "user_message": "How will I get my refund?",
+    "cached_policies": { "return.refund_method": "original_payment" },
+    "steps_since_last_verify": 5,
+    "system_notification": null,
+    "contradiction_detected": true,
+    "user_expressed_confusion": true,
+    "user_satisfaction": 0.6,
+    "last_action_compliant": false
+}
+```
+> Notice what's **deliberately hidden**: the actual active policy version, the true policy values, and the drift log. The agent can *only* learn the truth by calling `request_verification()`.
+### What the Agent Can Do (Action Space)
+| Action | Purpose | When to Use |
+|---|---|---|
+| `request_verification` | Refresh policy cache | When drift is suspected |
+| `quote_policy` | Cite a specific policy value | When answering policy questions |
+| `respond_to_user` | General conversation | Low-stakes interactions |
+| `take_action` | Process a request | Order placement, refund processing |
+| `escalate` | Transfer to human | Beyond AI capability |
+| `abort_workflow` | Stop current workflow | Unsafe to continue |
+| `clarify` | Ask for more information | Ambiguous user intent |
+---
+## Reward Design: The 26-Point Swing
+The reward structure creates a **26-point gap** between optimal and worst-case behavior for any single policy-sensitive step:
+**Best case:** Verify before quoting → detect drift (+3) → quote correctly (+10) = **+13**
+**Worst case:** Skip verification → quote stale policy (-8) → user complaint (-5) = **-13**
+### Complete Reward Table
+| Event | Reward | Rationale |
+|---|---|---|
+| Correct policy quote | +10.0 | Core task completion |
+| Timely drift detection (≤2 steps) | +3.0 | Proactive awareness |
+| Late drift detection (3+ steps) | +1.0 | Better late than never |
+| Freshness bonus (verified recently) | +1.0 | Encourage regular checks |
+| All workflows clean (terminal) | +20.0 | Episode-level excellence |
+| Stale policy cited | -8.0 | The core failure we're training against |
+| User complaint (satisfaction < 0.3) | -5.0 | Real-world escalation cost |
+| Unnecessary verification | -0.5 | Prevent over-checking |
+| Any compliance failure (terminal) | -30.0 | "One lawsuit ruins everything" |
+The terminal penalty of -30 for *any* compliance failure in an episode reflects a harsh but realistic truth: in production, a single wrong policy citation can trigger regulatory action, regardless of how many correct answers preceded it.
+---
+## The Five Cognitive Skills FinePrint Trains
+FinePrint doesn't teach policy values — those are input features. It teaches five meta-cognitive behaviors that current LLMs fundamentally lack:
+### 1. Temporal Awareness — *"Is my knowledge still valid?"*
+The model learns that cached knowledge has an expiration date. An untrained model confidently quotes "30 days" when the policy changed to 14. A trained model recognizes elapsed time as a risk factor and verifies before quoting.
+> *Training signal: -8 for stale answers, +3 for fresh verification. Over hundreds of episodes, the model internalizes: "check before quoting when uncertain."*
+### 2. Contradiction Detection — *"Something doesn't add up."*
+When a user says *"The website said 30 days when I bought this..."* and the agent's cache says 14 days, the model must recognize this mismatch as a **drift signal**, not a user error.
+> *The model learns that "the user seems to know something I don't" is one of the strongest indicators that verification is needed.*
+### 3. Strategic Verification — *"When should I check vs. act?"*
+This is the meta-skill that separates useful agents from paranoid ones. Checking every step wastes time. Checking too rarely misses drifts. The model learns an **optimal verification schedule** — check at workflow transitions, after contradictions, at payment stages, and after long gaps.
+### 4. Graceful Recovery — *"I made a mistake. Now what?"*
+A trained model doesn't double down on wrong answers. When `last_action_compliant` returns `false` and the user expresses confusion, the model learns to immediately verify, update its cache, and correct course — preventing cascading failures across remaining workflow steps.
+### 5. Uncertainty Calibration — *"How confident should I be?"*
+The model develops context-dependent confidence levels:
+- **High confidence** (act without checking): Just verified, no contradiction signals, low-stakes step
+- **Medium confidence** (check if convenient): 3-5 steps since verification, workflow transition
+- **Low confidence** (check immediately): System notification present, user contradiction, payment/billing step, 6+ steps since last check
+---
+## Training: From Naive to Strategic in 80 Episodes
+### Setup
+We trained **Qwen/Qwen2.5-1.5B-Instruct** using **Unsloth** for efficient fine-tuning with **GRPO** (Group Relative Policy Optimization) — an on-policy RL algorithm well-suited for language model alignment.
+| Parameter | Value |
+|---|---|
+| Base Model | Qwen/Qwen2.5-1.5B-Instruct |
+| Training Episodes | 80 |
+| Rollouts per Update | 4 |
+| Learning Rate | 2e-5 |
+| Total Training Time | ~3.6 hours (13,092 seconds) |
+### Training Progression
+The training logs tell a clear story of behavioral learning across 20 policy updates:
+**Phase 1 — The Naive Phase (Episodes 1-12):**
+The model starts with no concept of verification timing. Average rewards fluctuate wildly between -11.4 and -0.6, with the model frequently citing stale policies and accumulating compliance failures.
+```
+Update 1  | Episodes  4 | Avg Reward: -2.38  | ← No strategy, random behavior
+Update 2  | Episodes  8 | Avg Reward: -0.63  | ← Slight improvement
+Update 3  | Episodes 12 | Avg Reward: -11.38 | ← Catastrophic stale citations
+```
+**Phase 2 — The Triggered Phase (Episodes 13-32):**
+The model begins associating verification with positive outcomes. Rewards stabilize around 0-1, indicating the model has learned that `request_verification()` exists as a useful action, though it hasn't optimized when to use it.
+```
+Update 4  | Episodes 16 | Avg Reward:  0.88  | ← Learning to verify
+Update 5  | Episodes 20 | Avg Reward:  1.38  | ← Positive territory
+Update 8  | Episodes 32 | Avg Reward:  0.75  | ← Stabilizing
+```
+**Phase 3 — The Calibrated Phase (Episodes 33-80):**
+The model develops context-sensitive verification behavior. Rewards climb steadily from 4.9 to 8.75, with the model learning to verify at strategic moments — after contradictions, at payment steps, and after long gaps.
+```
+Update 9  | Episodes 36 | Avg Reward:  4.88  | ← Breakthrough
+Update 11 | Episodes 44 | Avg Reward:  6.63  | ← Consistent improvement
+Update 15 | Episodes 60 | Avg Reward:  8.75  | ← Peak performance
+Update 20 | Episodes 80 | Avg Reward:  7.75  | ← Sustained high performance
+```
+### The Reward Curve
+```
+Avg Reward
+    ↑
+ 10 ┤                                        ■         ■
+    │                                  ■  ■     ■   ■     ■  ■
+  5 ┤                           ■  ■        ■
+    │              ■  ■
+  0 ┤     ■  ■              ■
+    │  ■
+ -5 ┤
+    │
+-10 ┤        ■
+    │
+-15 ┤
+    └──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──┬──→ Update
+       1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
+    ←── Naive ──→←── Triggered ──→←──── Calibrated ────────────→
+```
+The trajectory from **-11.4 to +8.75** average reward demonstrates clear behavioral learning. The model moved from random, penalty-heavy actions to strategic, context-aware verification decisions.
+---
+## Evaluation: Baseline vs. Trained Model
+### The Heuristic Baseline
+Our heuristic baseline implements an "always-verify" strategy — checking policies at every opportunity. This represents a perfectly safe but inefficient agent:
+| Metric | Heuristic Baseline | Trained Model (Qwen 1.5B) |
+|---|---|---|
+| **Avg Reward** | 125.4 | 4.0 |
+| **Std Deviation** | 9.67 | 1.64 |
+| **Compliance Failures** | 0.0 | 0.0 |
+| **Drift Detections** | 4.8 | 1.4 |
+### Reading the Results
+The heuristic baseline's high reward (125.4) reflects the advantage of a hand-coded strategy that *always* verifies — it never misses a drift and accumulates maximum correct-quote bonuses. This is the ceiling: perfect compliance through brute-force checking.
+The trained model achieves **zero compliance failures** — matching the baseline's safety guarantee — while operating with a learned, selective verification strategy (1.4 detections vs. 4.8). The lower absolute reward reflects that a 1.5B parameter model trained for only 80 episodes is still developing its verification scheduling, but the critical insight is this:
+> *The model learned the most important lesson: never cite a stale policy. It achieved zero compliance failures — the same safety standard as the hand-coded heuristic — through learned behavior rather than hard-coded rules.*
+The reward gap (125.4 vs. 4.0) represents the **optimization frontier** — with more training episodes, larger models, and refined reward shaping, the learned policy can approach and potentially exceed the heuristic by learning *when not to verify*, avoiding the -0.5 penalties that the always-verify strategy accumulates.
+### Consistency
+The trained model also shows significantly lower variance (std: 1.64 vs. 9.67), indicating more predictable, stable behavior — a desirable property for production deployment.
+---
+## What Makes FinePrint Novel
+### 1. Temporal Knowledge Grounding as a First-Class Problem
+No existing RL benchmark explicitly trains or measures an agent's ability to recognize that its knowledge has become stale. FinePrint isolates this capability and provides a clean, measurable signal for it.
+### 2. The Information Asymmetry Design
+The agent is deliberately denied access to the true active policy version. It can only discover the truth through `request_verification()`. This creates a genuine information-seeking problem where the agent must reason about what it *doesn't know* — a capability distinct from standard question-answering.
+### 3. Multi-Signal Drift Detection
+Rather than providing a single "your knowledge is outdated" flag, FinePrint presents four distinct signal types with varying reliability:
+- **System notifications** (30% of drifts) — explicit but not always present
+- **User contradictions** — strong signal, requires interpretation
+- **User confusion** — moderate signal, could be unrelated
+- **Elapsed time** — weak signal, requires calibration
+The model must learn to weigh and combine these signals — a form of **learned sensor fusion** for knowledge management.
+### 4. Realistic Severity Gradients
+Not all policy drifts are equal. Quoting a wrong return window (HIGH severity) is far worse than quoting an old express shipping surcharge (MEDIUM). The environment's reward weights reflect this reality, teaching the model to prioritize verification for high-stakes policy areas.
+---
+## Future Scope: Where FinePrint Goes Next
+### Scaling the Training
+Our current results use 80 episodes on a 1.5B parameter model. The clear upward trajectory in rewards suggests significant room for improvement:
+- **More episodes** (1000-5000): Allow the model to reach the strategic phase described in our training curriculum
+- **Larger models** (7B, 13B): Greater capacity for nuanced context-dependent verification strategies
+- **Curriculum learning**: Start with explicit-only drifts, progressively introduce silent drifts
+- **Multi-agent evaluation**: Test whether verification skills transfer across different environment configurations
+### Domain-Specific Extensions
+The FinePrint architecture is **domain-agnostic** by design. The same environment structure — versioned rules, drift scheduling, compliance checking, and verification mechanics — applies directly to high-stakes domains where knowledge currency is literally a matter of life and safety.
+---
+## Beyond Consumer Workflows: High-Stakes Domains
+### Healthcare: When Drug Interactions Change
+> *"The FDA updated the contraindication list for Drug X at 3 AM. By 9 AM, an AI clinical decision support system had recommended a now-dangerous combination to 47 patients."*
+Medical guidelines update constantly. Drug interactions get reclassified. Dosage recommendations change based on new trial data. A FinePrint-style environment for healthcare would train clinical AI to:
+- **Verify formulary status** before recommending medications
+- **Detect guideline drift** when patient symptoms trigger a check against updated protocols
+- **Calibrate urgency** — verify immediately for life-threatening interactions, batch-check for routine prescriptions
+- **Recover gracefully** when a recommendation was based on superseded guidelines
+The policy categories would shift from `return.window_days` to `drug_x.contraindications`, `treatment_y.dosage_mg`, and `guideline_z.eligibility_criteria`. The environment mechanics remain identical.
+### Legal: When Precedent Shifts
+> *"Our AI legal research tool cited a ruling from 2019 as controlling precedent. It was overturned six months ago."*
+Legal AI faces the exact temporal grounding problem FinePrint addresses:
+- **Statute amendments** that change regulatory requirements
+- **Case law evolution** where appellate decisions override lower court rulings
+- **Regulatory guidance updates** from agencies like the SEC, EPA, or FTC
+- **Jurisdictional variations** where rules differ and change on different timelines
+A legal FinePrint environment would train AI research assistants to verify the current validity of cited precedent, flag potential overrulings, and distinguish between binding and persuasive authority — all under temporal drift conditions.
+### Corporate Compliance: When Regulations Evolve
+> *"The GDPR interpretation changed. Our compliance bot was still advising teams based on the old Article 6 guidance for three weeks."*
+Compliance officers manage an ever-shifting landscape of regulations:
+- **GDPR, CCPA, HIPAA** requirements that evolve through regulatory guidance
+- **Internal policies** that change with quarterly reviews
+- **Industry standards** (SOC 2, ISO 27001) that get updated versions
+- **Cross-border regulations** that create conflicting requirements
+A corporate compliance FinePrint variant would simulate an internal compliance chatbot facing simultaneous drift across multiple regulatory frameworks, training the model to prioritize verification based on regulatory severity and recency of last check.
+### Financial Services: When Market Rules Change
+Trading compliance, KYC (Know Your Customer) requirements, anti-money laundering thresholds, and margin requirements all change — sometimes multiple times per day during market stress. An AI operating on stale compliance parameters in financial services doesn't just create liability; it can trigger regulatory sanctions and market manipulation charges.
+---
+## The Broader Vision: Verification as a Core LLM Capability
+FinePrint demonstrates something we believe will become a standard component of LLM training: **learned verification behavior**.
+Today's models are trained on static datasets and evaluated on static benchmarks. But production environments are inherently dynamic. The gap between "knows the answer" and "knows whether its answer is still correct" is where real-world AI failures live.
+> *We envision a future where every deployed AI agent has an internalized "knowledge freshness" model — a learned sense of when to trust its cache and when to re-verify. FinePrint is the first environment designed to build exactly that capability.*
+### What We're Not Training (And Why That Matters)
+FinePrint's scope is deliberately narrow:
+- **Not trained:** Domain knowledge, conversation skills, JSON parsing, policy memorization
+- **Trained:** When to check, how to recognize drift signals, speed-safety tradeoffs, recovery behavior, uncertainty calibration
+This focus means the skills are **transferable**. A model that learns temporal awareness on consumer policies can apply the same meta-cognitive pattern to medical guidelines, legal precedent, or financial regulations. The domain changes; the verification instinct persists.
+---
+## Technical Implementation Highlights
+### OpenEnv Compatibility
+FinePrint implements the full OpenEnv interface — `reset()`, `step()`, `render()`, `close()` — ensuring drop-in compatibility with the hackathon ecosystem and any OpenEnv-compatible training pipeline:
+```python
+class FinePrintEnv(gym.Env):
+    def reset(self, seed=None, options=None) -> tuple[dict, dict]: ...
+    def step(self, action: dict) -> tuple[dict, float, bool, bool, dict]: ...
+    def render(self, mode='human') -> str | None: ...
+    def close(self): ...
+```
+### Modular Component Design
+Each subsystem operates independently and can be extended or replaced:
+```
+PolicyStore     → Loads, versions, and merges JSON policies
+DriftScheduler  → Probabilistically triggers policy changes
+ComplianceChecker → Validates agent actions against ground truth
+RewardCalculator  → Computes step and terminal rewards
+FinePrintState   → Manages observation and internal state
+```
+### Reproducibility
+Every component is seeded. Policy files are deterministic JSON. Drift scheduling uses controlled randomness. Training runs are reproducible given the same seed, ensuring scientific validity of results.
+---
+## Conclusion
+FinePrint addresses a gap that sits at the intersection of AI safety and practical deployment: **temporal knowledge grounding**. Current LLMs have no mechanism for recognizing when their cached knowledge has become stale. They cite outdated policies with the same confidence as current ones, creating liability in every domain where rules change — which is every domain.
+Our environment proves that this capability can be **learned through reinforcement**. In just 80 episodes of training, a 1.5B parameter model went from random, penalty-heavy behavior to zero-compliance-failure performance with learned verification strategies. The reward trajectory from -11.4 to +8.75 demonstrates clear behavioral acquisition of temporal awareness.
+The path forward is clear: more training, bigger models, harder drift scenarios, and extension to domains where the stakes are measured not in refund errors but in patient safety, legal liability, and regulatory compliance.
+> *The most dangerous AI agent isn't one that doesn't know the answer. It's one that doesn't know its answer is no longer correct.*
+---
+**Built for the Cerebral Valley × Scaler Meta-PyTorch Hackathon | Theme 3.2: Consumer Workflows with Schema Drift | Patronus AI Sponsor Track**
+*Stack: OpenEnv · Gymnasium · Qwen2.5-1.5B-Instruct · Unsloth · GRPO · Python*