metadata
title: Context Corruption Env
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit

ContextCorruption-Env

OpenEnv Hackathon | Meta x Hugging Face x PyTorch

ContextCorruption-Env is an OpenEnv environment for training epistemic robustness in LLMs. The agent receives a factual question plus retrieved documents, some of which are deliberately corrupted. It must answer the question and flag unreliable sources.

This submission targets Theme #3.1: World Modeling / Professional Tasks. The environment simulates a partially observable information workspace where some evidence is trustworthy and some evidence lies.

Required Materials

Environment Summary

Each episode contains:

  • 1 factual question
  • 8 retrieved documents
  • 1-4 corrupted documents
  • 12-step budget
  • deterministic reward
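The episode shape above can be sketched as a small data structure. This is an illustration only: the names `EpisodeLayout` and `sample_layout`, the 0-7 document ids, and the uniform sampling of the corrupted-document count are assumptions, not the environment's actual code.

```python
import random
from dataclasses import dataclass

NUM_DOCS = 8                       # retrieved documents per episode
MIN_CORRUPT, MAX_CORRUPT = 1, 4    # 1-4 of them are corrupted
STEP_BUDGET = 12                   # steps available to the agent


@dataclass(frozen=True)
class EpisodeLayout:
    """Hypothetical container for what one episode holds."""
    question: str
    corrupt_ids: frozenset
    budget: int = STEP_BUDGET


def sample_layout(question: str, seed: int) -> EpisodeLayout:
    """Pick which documents are corrupted, reproducibly for a given seed.

    Assumption: the count is drawn uniformly from 1..4; the real
    environment may sample it differently.
    """
    rng = random.Random(seed)
    k = rng.randint(MIN_CORRUPT, MAX_CORRUPT)
    corrupt = frozenset(rng.sample(range(NUM_DOCS), k))
    return EpisodeLayout(question=question, corrupt_ids=corrupt)
```

Seeding the generator per episode keeps the layout reproducible, which fits the deterministic reward.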

The agent can take four actions:

  • read_doc: spend budget to inspect a document;
  • flag_suspicious: mark a document as likely corrupted;
  • unflag_doc: remove a flag;
  • submit_answer: finish with an answer and confidence score.
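One way to picture the action space is a single record type with a validator. The four action names come from the list above; the `Action` class, its field names, and the 0-1 confidence range are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = ("read_doc", "flag_suspicious", "unflag_doc", "submit_answer")


@dataclass
class Action:
    """Hypothetical action record; field names are illustrative."""
    kind: str
    doc_id: Optional[int] = None        # used by read/flag/unflag
    answer: Optional[str] = None        # used by submit_answer
    confidence: Optional[float] = None  # used by submit_answer


def validate(action: Action, num_docs: int = 8) -> None:
    """Raise ValueError if the action is malformed."""
    if action.kind not in ACTION_TYPES:
        raise ValueError(f"unknown action: {action.kind!r}")
    if action.kind == "submit_answer":
        if not action.answer:
            raise ValueError("submit_answer needs a non-empty answer")
        # Assumption: confidence is a probability in [0, 1].
        if action.confidence is None or not 0.0 <= action.confidence <= 1.0:
            raise ValueError("submit_answer needs a confidence in [0, 1]")
    else:
        if action.doc_id is None or not 0 <= action.doc_id < num_docs:
            raise ValueError(f"{action.kind} needs a doc_id in 0..{num_docs - 1}")
```

For example, `validate(Action("flag_suspicious", doc_id=3))` passes, while `validate(Action("submit_answer", answer="Paris"))` raises because no confidence was given.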

The environment is intentionally simple to run but hard to master. A weak agent can guess an answer. A stronger agent must notice contradictions and avoid over-flagging clean documents.

Interactive Demo UI

The FastAPI app serves a lightweight frontend at /. It lets users start an episode, inspect the eight retrieved documents, spend read budget, flag suspicious documents, submit an answer with confidence, and optionally call the trained model through /model/infer.

Run locally with:

uvicorn environment.server:app --host 0.0.0.0 --port 7860
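Once the server is running, a quick smoke check can confirm that the frontend at / responds. The helper name `frontend_is_up` is hypothetical, and the default URL assumes the host and port from the command above.

```python
import urllib.error
import urllib.request


def frontend_is_up(base_url: str = "http://127.0.0.1:7860", timeout: float = 2.0) -> bool:
    """Return True if GET / answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(base_url + "/", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx HTTP errors.
        return False
```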

Reward

The reward is deterministic and compositional. There is no hidden LLM judge.

| Component | What It Rewards | Weight |
|---|---|---|
| Answer correctness | exact match after normalization | +0.40 |
| Corruption recall | fraction of corrupt docs found | +0.30 |
| Precision | avoids false accusations | +0.20 |
| Confidence calibration | confidence helps only when correct | +/-0.10 |
| Efficiency | small bonus for conserving budget | +0.05 |

Reward range: -0.5 to 1.05.
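To show how the components might combine, here is a minimal sketch that uses the weights listed above. Everything else is an assumption: the function names, the normalization rules (lowercase, strip punctuation and articles), precision counting as 0 when nothing is flagged, and calibration scaling linearly with confidence. This sketch bottoms out at -0.10, so it does not reproduce the stated -0.5 lower bound; the real reward evidently applies penalties that are not described here.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def sketch_reward(answer, gold, flagged, corrupt, confidence, steps_used, budget=12):
    """Weighted sum of the five listed components (illustrative only)."""
    correct = normalize(answer) == normalize(gold)
    flagged, corrupt = set(flagged), set(corrupt)
    hits = len(flagged & corrupt)

    recall = hits / len(corrupt) if corrupt else 0.0
    # Assumption: an agent that flags nothing earns no precision credit.
    precision = hits / len(flagged) if flagged else 0.0
    # Confidence helps when correct and hurts when wrong.
    calibration = confidence if correct else -confidence
    efficiency = max(0, budget - steps_used) / budget

    return (0.40 * correct
            + 0.30 * recall
            + 0.20 * precision
            + 0.10 * calibration
            + 0.05 * efficiency)
```

With a correct answer, every corrupted document flagged and no false flags, full confidence, and no budget spent, the five weights sum to the stated maximum of 1.05.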

Results

We trained Qwen2-1.5B-Instruct with GRPO using Unsloth / TRL. The run was sized for hackathon constraints, but its final logged reward (0.3289) was about 2.5 times the random baseline's average (0.1302).

| Agent | Reward | Evidence |
|---|---|---|
| Random baseline | 0.1302 | avg reward over 100 episodes |
| Qwen2-1.5B GRPO | 0.3289 | final logged reward in the finished WandB run |
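The baseline figure is an average over 100 episodes. A harness for that kind of measurement could look like the sketch below. Both `random_action` and `average_reward` are hypothetical names, the `run_episode` callable stands in for whatever drives the environment, and the actual policy used in eval/ may differ.

```python
import random
import statistics
from typing import Callable


def random_action(rng: random.Random, num_docs: int = 8) -> dict:
    """Pick a non-terminal action and a target document uniformly at random."""
    kind = rng.choice(("read_doc", "flag_suspicious", "unflag_doc"))
    return {"kind": kind, "doc_id": rng.randrange(num_docs)}


def average_reward(run_episode: Callable[[int], float], n_episodes: int = 100) -> float:
    """Average the episode reward over seeds 0..n_episodes-1.

    `run_episode(seed)` is supplied by the caller and must return the
    final reward of one episode played with that seed.
    """
    rewards = [run_episode(seed) for seed in range(n_episodes)]
    return statistics.fmean(rewards)
```

Using the episode index as the seed keeps the evaluation repeatable, so two agents can be compared on the same 100 episodes.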

The trained LoRA adapter is pushed to the Hub and is loaded by the hosted Space through /model/infer for a live sanity check.

Reward curve

Loss curve

Additional exported charts:

The WandB run was exported into this repo so judges do not need access to a private project. See the raw log, scalar history, config, summary, and completion tables under assets/wandb_run_rl5jygl8/.

Repo Structure

environment/   # OpenEnv environment, actions, reward, server, model inference
data/          # QA loading, corruptions, document generation
training/      # GRPO training script and notebook
eval/          # random baseline evaluation
assets/        # charts, exported training logs, completion samples