---
title: Context Corruption Env
emoji: 🔍
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
---
# ContextCorruption-Env
> OpenEnv Hackathon | Meta x Hugging Face x PyTorch
ContextCorruption-Env is an OpenEnv environment for training epistemic robustness in LLMs. The agent receives a factual question plus retrieved documents, some of which are deliberately corrupted. It must answer the question and flag unreliable sources.
This submission targets **Theme #3.1: World Modeling / Professional Tasks**. The environment simulates a partially observable information workspace where some evidence is trustworthy and some evidence lies.
## Required Materials
- **Environment Space:** https://huggingface.co/spaces/Siddh12334/context-corruption-env
- **Mini-blog / writeup:** [`BLOG.md`](BLOG.md)
- **Training Space:** https://huggingface.co/spaces/Siddh12334/context-corruption-training
- **Trained LoRA checkpoint:** https://huggingface.co/Siddh12334/qwen-1.5b-context-corruption
- **Training logs/history:** [`assets/training_history_rl5jygl8.csv`](assets/training_history_rl5jygl8.csv)
- **Raw training output log:** [`assets/wandb_run_rl5jygl8/output.log`](assets/wandb_run_rl5jygl8/output.log)
- **Completion samples:** [`assets/completions_samples.md`](assets/completions_samples.md)
- **Training script:** [`training/train_grpo.py`](training/train_grpo.py)
- **Notebook:** [`training/ContextCorruption_GRPO.ipynb`](training/ContextCorruption_GRPO.ipynb)
## Environment Summary
Each episode contains:
- **1 factual question**
- **8 retrieved documents**
- **1-4 corrupted documents**
- **12-step budget**
- **deterministic reward**
The agent can take four actions:
- `read_doc`: spend budget to inspect a document;
- `flag_suspicious`: mark a document as likely corrupted;
- `unflag_doc`: remove a flag;
- `submit_answer`: finish with an answer and confidence score.
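The four actions and the step budget can be pictured with a small sketch. This is illustrative only: the class names, field names (`doc_id`, `answer`, `confidence`), and the rule that only `read_doc` consumes budget are assumptions made for the example, not the environment's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class ActionType(str, Enum):
    READ_DOC = "read_doc"
    FLAG_SUSPICIOUS = "flag_suspicious"
    UNFLAG_DOC = "unflag_doc"
    SUBMIT_ANSWER = "submit_answer"


@dataclass
class Action:
    # Hypothetical fields; the real action schema lives in environment/.
    type: ActionType
    doc_id: Optional[int] = None        # target of read/flag/unflag
    answer: Optional[str] = None        # used only by submit_answer
    confidence: Optional[float] = None  # used only by submit_answer


@dataclass
class EpisodeState:
    num_docs: int = 8
    budget: int = 12
    read: Set[int] = field(default_factory=set)
    flagged: Set[int] = field(default_factory=set)
    done: bool = False

    def apply(self, action: Action) -> None:
        if self.done:
            raise RuntimeError("episode already finished")
        if action.type is ActionType.SUBMIT_ANSWER:
            self.done = True
            return
        if action.doc_id is None or not 0 <= action.doc_id < self.num_docs:
            raise ValueError("doc_id out of range")
        if action.type is ActionType.READ_DOC:
            # Assumption: each read costs exactly one step of budget.
            if self.budget <= 0:
                raise RuntimeError("budget exhausted")
            self.budget -= 1
            self.read.add(action.doc_id)
        elif action.type is ActionType.FLAG_SUSPICIOUS:
            self.flagged.add(action.doc_id)
        elif action.type is ActionType.UNFLAG_DOC:
            self.flagged.discard(action.doc_id)
```

The real environment may charge budget for flagging as well; check `environment/` for the actual rules.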
The environment is intentionally simple to run but hard to master. A weak agent can guess an answer. A stronger agent must notice contradictions and avoid over-flagging clean documents.
## Interactive Demo UI
The FastAPI app serves a lightweight frontend at `/`. It lets users start an episode, inspect the eight retrieved documents, spend read budget, flag suspicious documents, submit an answer with confidence, and optionally call the trained model through `/model/infer`.
Run locally with:
```bash
uvicorn environment.server:app --host 0.0.0.0 --port 7860
```
## Reward
The reward is deterministic and compositional. There is no hidden LLM judge.
| Component | What It Rewards | Weight |
|---|---|---:|
| Answer correctness | exact match after normalization | +0.40 |
| Corruption recall | fraction of corrupt docs found | +0.30 |
| Precision | avoids false accusations | +0.20 |
| Confidence calibration | confidence helps only when correct | ±0.10 |
| Efficiency | small bonus for conserving budget | +0.05 |
Reward range: **-0.5 to 1.05**.
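As a rough illustration of how the weighted components could combine, here is a minimal sketch. Only the weights come from the table above. The function name, the normalization rule, the zero precision for an empty flag set, the signed-confidence calibration term, and the linear efficiency bonus are all assumptions; the actual formulas are in the environment's reward code. The sketch reproduces the documented maximum of 1.05 but does not model whatever penalties produce the documented minimum of -0.5.

```python
from typing import Set

# Weights taken from the reward table; everything else is assumed.
WEIGHTS = {
    "answer": 0.40,
    "recall": 0.30,
    "precision": 0.20,
    "calibration": 0.10,
    "efficiency": 0.05,
}


def normalize(text: str) -> str:
    # Assumed normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def sketch_reward(
    answer: str,
    gold: str,
    flagged: Set[int],
    corrupted: Set[int],
    confidence: float,
    budget_left: int,
    budget_total: int = 12,
) -> float:
    correct = normalize(answer) == normalize(gold)
    true_pos = len(flagged & corrupted)
    # Recall: fraction of corrupted documents that were flagged.
    recall = true_pos / len(corrupted) if corrupted else 0.0
    # Precision: fraction of flags that were right (assumed 0 with no flags).
    precision = true_pos / len(flagged) if flagged else 0.0
    # Assumed calibration: confidence counts for you if right, against if wrong.
    calibration = confidence if correct else -confidence
    # Assumed efficiency: share of the budget left unspent.
    efficiency = budget_left / budget_total
    return (
        WEIGHTS["answer"] * float(correct)
        + WEIGHTS["recall"] * recall
        + WEIGHTS["precision"] * precision
        + WEIGHTS["calibration"] * calibration
        + WEIGHTS["efficiency"] * efficiency
    )
```

Under these assumptions, a correct, fully confident answer with every corrupted document flagged, no false flags, and no budget spent scores 0.40 + 0.30 + 0.20 + 0.10 + 0.05 = 1.05.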
## Results
We trained **Qwen2-1.5B-Instruct** with GRPO using Unsloth / TRL. The run was sized for hackathon constraints, but its final logged reward (0.3289) is roughly 2.5 times the random baseline's average reward (0.1302).
| Agent | Reward Evidence |
|---|---:|
| Random baseline | **0.1302 avg reward** over 100 episodes |
| Qwen2-1.5B GRPO | **0.3289 final logged reward** in the finished WandB run |
The trained LoRA adapter is pushed to the Hub and is loaded by the hosted Space through `/model/infer` for a live sanity check.
![Reward curve](assets/reward_curve.png)
![Loss curve](assets/loss_curve.png)
Additional exported charts:
- [Policy entropy](assets/entropy_curve.png)
- [Mean completion length](assets/completion_length_curve.png)
- [Gradient norm](assets/grad_norm_curve.png)
- [Learning rate](assets/learning_rate_curve.png)
The WandB run was exported into this repo so judges do not need access to a private project. See the raw log, scalar history, config, summary, and completion tables under [`assets/wandb_run_rl5jygl8/`](assets/wandb_run_rl5jygl8/).
## Repo Structure
```text
environment/ # OpenEnv environment, actions, reward, server, model inference
data/ # QA loading, corruptions, document generation
training/ # GRPO training script and notebook
eval/ # random baseline evaluation
assets/ # charts, exported training logs, completion samples
```