# SchemaQuake: Teaching Agents To Notice When The World Changes

Most AI agent demos quietly make one huge assumption: the world stays still.

The API response shape stays the same. The policy document stays the same. A number keeps meaning the same unit. A boolean keeps meaning the same thing. The agent can make a plan at the beginning of the task and then execute that plan as if nothing will change.

That is not how production systems behave.

In real software, APIs evolve. A field named `price_rupees` may become `ticket_price`. A backend may switch from rupees to paise. A policy document may be updated while a workflow is in progress. A simple `refundable=true` field may become a tiered concept: full refund, partial refund, or no refund.

The scary part is that an LLM agent may not crash when this happens. It may continue confidently, book the wrong thing, and report success.

SchemaQuake is an RL environment for that failure mode.

## The Simple Story

SchemaQuake looks like a travel-booking task on the surface.

The user asks:

> Book a refundable flight from BLR to DEL under 8000 rupees.

The agent has tools:

- search flights
- read cancellation policy
- inspect API schema
- book a flight
- cancel a booking
- ask the user
- submit the final answer

At first, the world looks stable. Then, during the episode, the environment silently changes. The agent is not told that anything changed. It has to infer that from observations.

For example, the agent may search flights and see one format. Later, after drift, the same data source may return a renamed price field or a different refundability representation. A careless agent keeps going. A careful agent re-checks schema or policy before committing.

That is the core question:

> Can an agent notice that its assumptions about the world are no longer reliable?
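
The re-check described above can be sketched as a comparison between the fields the agent planned around and the fields it actually observes. This is a minimal illustration, not SchemaQuake's own code: `EXPECTED_FIELDS`, `detect_drift`, `should_recheck`, and the `flight_id` field are hypothetical names; only `price_rupees`, `ticket_price`, and `refundable` come from the examples in this post.

```python
# Hypothetical sketch: compare an observed flight offer against the fields
# the agent assumed when it made its plan.
EXPECTED_FIELDS = {"flight_id", "price_rupees", "refundable"}  # assumed schema


def detect_drift(offer: dict) -> dict:
    """Report expected fields that vanished and new fields that appeared."""
    observed = set(offer)
    return {
        "missing": sorted(EXPECTED_FIELDS - observed),
        "unexpected": sorted(observed - EXPECTED_FIELDS),
    }


def should_recheck(offer: dict) -> bool:
    """A careful agent re-reads schema or policy when anything looks off."""
    report = detect_drift(offer)
    return bool(report["missing"] or report["unexpected"])


before = {"flight_id": "F1", "price_rupees": 7500, "refundable": True}
after = {"flight_id": "F1", "ticket_price": 7500, "refundable": True}

print(should_recheck(before))  # False
print(detect_drift(after))     # {'missing': ['price_rupees'], 'unexpected': ['ticket_price']}
```

A key-set check like this only catches renames. A unit change keeps the same field name, so the agent still has to re-read the schema or policy before committing.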

## Why Travel Booking?

The point is not to build the world's best travel agent.

Travel booking is a clean, understandable professional workflow. Judges can immediately understand the stakes:

- the ticket must be under budget
- the ticket must be refundable
- policy matters
- wrong assumptions can lead to a costly booking

Underneath that simple surface, SchemaQuake tests a general enterprise-agent skill: updating beliefs in a partially observable, changing system.

## What The Agent Sees And Does

SchemaQuake is implemented as an OpenEnv-compatible environment with a standard `reset`, `step`, and `state` loop.

At reset, the environment creates:

- a user request
- flight offers
- a policy document
- a hidden drift schedule
- a hidden drift type

At each step, the agent submits an action. The environment returns an observation and reward metadata.
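
To make the loop concrete, here is a toy stand-in with the same three methods. It is a sketch under stated assumptions, not the SchemaQuake or OpenEnv API: the class `ToyDriftEnv`, the action format `{"tool": ...}`, the tool names, and the `(observation, reward, done)` return shape are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDriftEnv:
    """Hypothetical stand-in for an environment with reset/step/state."""

    drift_step: int = 2  # hidden from the agent: drift starts at this step
    step_count: int = 0
    done: bool = False
    log: list = field(default_factory=list)

    def reset(self) -> dict:
        self.step_count = 0
        self.done = False
        self.log = []
        return {"user_request": "Book a refundable flight from BLR to DEL under 8000 rupees."}

    def step(self, action: dict) -> tuple:
        self.step_count += 1
        self.log.append(action["tool"])
        drifted = self.step_count >= self.drift_step
        if action["tool"] == "search_flights":
            # After drift the same data comes back under a renamed field.
            price_key = "ticket_price" if drifted else "price_rupees"
            obs = {"offers": [{"flight_id": "F1", price_key: 7500}]}
        elif action["tool"] == "submit":
            self.done = True
            obs = {"status": "submitted"}
        else:
            obs = {"status": "ok"}
        return obs, 0.0, self.done  # reward is a placeholder in this toy

    def state(self) -> dict:
        return {"step_count": self.step_count, "done": self.done, "actions": list(self.log)}


env = ToyDriftEnv()
env.reset()
first, _, _ = env.step({"tool": "search_flights"})
second, _, _ = env.step({"tool": "search_flights"})
print(first["offers"][0])   # price appears under "price_rupees"
print(second["offers"][0])  # same price, now under "ticket_price"
```

Nothing in either observation announces the change. The only evidence is that the second response no longer matches the first.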

The hidden drift may be one of:

- field rename
- unit change
- enum mutation
- policy update

The observation never says, “drift happened.” That would make the task too easy. The model has to discover that something is inconsistent by using tools.
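
The four drift types can be pictured as small transformations of the hidden world. The sketch below is illustrative only: the `world` layout, the `schema` and `refund_tier` keys, the mapping from `refundable` to a tier, and the policy wording are assumptions made for this example, not the environment's real data model.

```python
import copy


def field_rename(world: dict) -> dict:
    drifted = copy.deepcopy(world)
    for offer in drifted["offers"]:
        offer["ticket_price"] = offer.pop("price_rupees")
    return drifted


def unit_change(world: dict) -> dict:
    drifted = copy.deepcopy(world)
    for offer in drifted["offers"]:
        # Field name is unchanged, so the number silently means paise now.
        offer["price_rupees"] *= 100  # 1 rupee = 100 paise
    drifted["schema"]["price_unit"] = "paise"
    return drifted


def enum_mutation(world: dict) -> dict:
    drifted = copy.deepcopy(world)
    for offer in drifted["offers"]:
        # Assumed mapping from the old boolean to a tier, for illustration.
        offer["refund_tier"] = "full" if offer.pop("refundable") else "none"
    return drifted


def policy_update(world: dict) -> dict:
    drifted = copy.deepcopy(world)
    drifted["policy"] = "Refundable tickets are refunded at 50 percent."  # made-up text
    return drifted


DRIFTS = {
    "field_rename": field_rename,
    "unit_change": unit_change,
    "enum_mutation": enum_mutation,
    "policy_update": policy_update,
}

world = {
    "offers": [{"flight_id": "F1", "price_rupees": 7500, "refundable": True}],
    "schema": {"price_unit": "rupees"},
    "policy": "Refundable tickets are refunded in full.",
}
drifted = DRIFTS["unit_change"](world)
print(drifted["offers"][0]["price_rupees"], drifted["schema"]["price_unit"])  # 750000 paise
```

The unit change is the clearest example of a silent failure: an agent that still compares `750000` against a budget of 8000 rupees rejects a valid flight, and an agent that assumes paise when the backend sends rupees does the opposite.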

## Reward Design

The reward is not just “success” or “failure.”

SchemaQuake rewards several behaviors:

- completing the user's real task
- staying within budget
- booking a fully refundable option when required
- detecting drift by re-reading schema or policy
- asking the user only when uncertainty is meaningful
- finishing efficiently

It penalizes:

- silent violations
- malformed or invalid actions
- over-probing the schema
- submitting without a valid booking
- unnecessary high-confidence user interruptions
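
One way to combine these signals into a single number is sketched below. The weights, the `EpisodeSummary` fields, and the `probe_budget` and `max_steps` parameters are all hypothetical, chosen only to show the shape of the trade-off. The environment's real reward values are not reproduced here, and the ask-the-user terms are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class EpisodeSummary:
    booked: bool
    within_budget: bool
    fully_refundable: bool
    rechecked_after_drift: bool
    invalid_actions: int
    schema_probes: int
    steps: int


def shaped_reward(ep: EpisodeSummary, probe_budget: int = 2, max_steps: int = 10) -> float:
    """Illustrative shaping; every weight here is an assumption."""
    reward = 0.0
    if ep.booked and ep.within_budget and ep.fully_refundable:
        reward += 1.0   # the user's real task is complete
    elif ep.booked:
        reward -= 1.0   # booked something that violates the request
    else:
        reward -= 0.5   # submitted without a valid booking
    if ep.rechecked_after_drift:
        reward += 0.3   # re-read schema or policy after the change
    reward -= 0.1 * ep.invalid_actions
    reward -= 0.05 * max(0, ep.schema_probes - probe_budget)  # over-probing
    reward += 0.2 * max(0, max_steps - ep.steps) / max_steps  # efficiency bonus
    return round(reward, 3)


careful = EpisodeSummary(True, True, True, True, invalid_actions=0, schema_probes=1, steps=6)
careless = EpisodeSummary(True, False, True, False, invalid_actions=0, schema_probes=0, steps=4)
print(shaped_reward(careful), shaped_reward(careless))
```

In this sketch the careless episode finishes faster but still scores far lower, because the efficiency bonus is deliberately small compared with the penalty for a wrong booking.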

The most important metric is **silent violation rate**.

Silent violation rate measures how often the agent confidently submits something wrong. This is the failure mode that matters most for production agents.
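
As a rough sketch, the metric can be computed from per-episode records like this. The record keys `submitted` and `constraints_met` are hypothetical, and the definition used (submitted episodes that break a user constraint, divided by all episodes) is an assumption about how the rate is counted.

```python
def silent_violation_rate(episodes: list) -> float:
    """Fraction of episodes where the agent submitted a wrong result."""
    if not episodes:
        return 0.0
    silent = sum(1 for ep in episodes if ep["submitted"] and not ep["constraints_met"])
    return silent / len(episodes)


episodes = [
    {"submitted": True, "constraints_met": True},    # correct booking
    {"submitted": True, "constraints_met": False},   # silent violation
    {"submitted": False, "constraints_met": False},  # gave up, not silent
    {"submitted": True, "constraints_met": True},    # correct booking
]
print(silent_violation_rate(episodes))  # 0.25
```

An episode where the agent fails without submitting is not counted: it is a visible failure, not a silent one.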

## Training

We use Hugging Face TRL / GRPO with rollout rewards from the SchemaQuake environment.

The core submission is the environment. Training is included to show that the environment exposes a useful learning signal and can be used to post-train agents. Producing the strongest possible final model is not the goal.

The training pipeline does three things:

1. Generate heuristic traces that show a drift-aware workflow.
2. Use those traces as a light SFT-style warm start.
3. Run GRPO where generated actions are scored by executing them in SchemaQuake.
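
Step 3 relies on GRPO's group-relative scoring: several candidate actions are sampled for the same prompt, each is scored, and each reward is normalized against the others in its group. The sketch below shows only that normalization in plain Python, independent of TRL. The function name, the `eps` term, the use of a population standard deviation, and the example rewards are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled actions for one prompt, scored by running them in an environment.
rewards = [1.0, -1.0, 1.0, -1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Actions that beat their group's average get a positive advantage and are reinforced; the rest are pushed down. If every sampled action earns the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason the reward needs to separate careful behavior from careless behavior.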

For the submitted run, we used:

- model: `Qwen/Qwen2.5-0.5B-Instruct`
- curriculum: `mixed`
- steps: `50`
- generated heuristic traces: `250`
- trained model repo: [realambuj2001/schemaquake1-lora](https://huggingface.co/realambuj2001/schemaquake1-lora)

The live Space is here:

[https://huggingface.co/spaces/realambuj2001/schemaquake1](https://huggingface.co/spaces/realambuj2001/schemaquake1)

## Results

In the normal evaluation run:

- random agent task success: `0.17`
- random silent violation rate: `0.83`
- imperfect heuristic task success: `0.90`
- imperfect heuristic silent violation rate: `0.10`
- heuristic task success: `1.00`
- heuristic silent violation rate: `0.00`

In hard drift mode:

- base Qwen silent violation rate: `1.00`
- imperfect heuristic silent violation rate: `0.18`
- SFT policy silent violation rate: `0.16`
- GRPO-style policy silent violation rate: `0.00`
- heuristic upper bound silent violation rate: `0.00`

The imperfect heuristic is intentionally useful but fallible. It gives judges a middle baseline: stronger than random, weaker than the upper-bound policy, and close enough to a real partially trained agent that improvements are visible in the chart.

The important takeaway is not that one short run creates a perfect model. The important takeaway is that SchemaQuake creates a measurable RL environment where safer behavior can be trained and compared.

## What The Demo Shows

The Space has four judge-facing views:

1. A single episode trace showing every action and reward.
2. Normal evaluation comparing the random, imperfect heuristic, and heuristic baselines.
3. Hard drift evaluation where silent violations become obvious.
4. Live GRPO training logs and a training curve. This view also includes a GPU-only Model Demo that shows the small flight dataset, lets the judge inject drift, and runs the pushed trained model repo against the same environment.

The Colab notebook also includes a notebook-friendly drift demo and an optional GPU trained-model cell using `realambuj2001/schemaquake1-lora`. The result artifacts are stored under the `results/` folder, including training logs, output JSON, benchmark metrics, and the training curve.

## Why This Matters

The next generation of LLM agents will not only answer questions. They will operate software.

They will touch billing systems, HR tools, support dashboards, finance workflows, APIs, and policy engines. Those systems change. If the agent cannot notice change, it will fail in quiet and expensive ways.

SchemaQuake turns that production risk into a training environment.

It teaches an agent a simple but powerful instinct:

> Before acting with confidence, check whether the world still means what you think it means.

That is the behavior we want from real professional agents.