---
title: Context Corruption Env
emoji: ๐
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
license: mit
---
# ContextCorruption-Env

> OpenEnv Hackathon | Meta x Hugging Face x PyTorch

ContextCorruption-Env is an OpenEnv environment for training epistemic robustness in LLMs. The agent receives a factual question plus retrieved documents, some of which are deliberately corrupted. It must answer the question and flag unreliable sources.

This submission targets **Theme #3.1: World Modeling / Professional Tasks**. The environment simulates a partially observable information workspace where some evidence is trustworthy and some evidence lies.
## Required Materials

- **Environment Space:** https://huggingface.co/spaces/Siddh12334/context-corruption-env
- **Mini-blog / writeup:** [`BLOG.md`](BLOG.md)
- **Training Space:** https://huggingface.co/spaces/Siddh12334/context-corruption-training
- **Trained LoRA checkpoint:** https://huggingface.co/Siddh12334/qwen-1.5b-context-corruption
- **Training logs/history:** [`assets/training_history_rl5jygl8.csv`](assets/training_history_rl5jygl8.csv)
- **Raw training output log:** [`assets/wandb_run_rl5jygl8/output.log`](assets/wandb_run_rl5jygl8/output.log)
- **Completion samples:** [`assets/completions_samples.md`](assets/completions_samples.md)
- **Training script:** [`training/train_grpo.py`](training/train_grpo.py)
- **Notebook:** [`training/ContextCorruption_GRPO.ipynb`](training/ContextCorruption_GRPO.ipynb)
## Environment Summary

Each episode contains:

- **1 factual question**
- **8 retrieved documents**
- **1-4 corrupted documents**
- **12-step budget**
- **deterministic reward**

The agent can take four actions:

- `read_doc`: spend budget to inspect a document;
- `flag_suspicious`: mark a document as likely corrupted;
- `unflag_doc`: remove a flag;
- `submit_answer`: finish with an answer and confidence score.

The environment is intentionally simple to run but hard to master. A weak agent can guess an answer. A stronger agent must notice contradictions and avoid over-flagging clean documents.
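
The episode loop described above can be sketched as a small state machine. This is a minimal illustration, not the code in `environment/`: the `EpisodeState` class, its field names, the zero-based document ids, the 0 to 1 confidence range, and the rule that every action costs one step of the 12-step budget are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# The four action names come from the list above; everything else is hypothetical.
ACTIONS = {"read_doc", "flag_suspicious", "unflag_doc", "submit_answer"}


@dataclass
class EpisodeState:
    num_docs: int = 8              # documents retrieved per episode
    budget: int = 12               # remaining steps
    read: Set[int] = field(default_factory=set)
    flagged: Set[int] = field(default_factory=set)
    answer: Optional[str] = None
    confidence: Optional[float] = None
    done: bool = False

    def step(self, action: str, doc_id: Optional[int] = None,
             answer: Optional[str] = None,
             confidence: Optional[float] = None) -> "EpisodeState":
        if self.done:
            raise RuntimeError("episode already finished")
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")

        if action == "submit_answer":
            if answer is None or confidence is None:
                raise ValueError("submit_answer needs an answer and a confidence")
            if not 0.0 <= confidence <= 1.0:
                raise ValueError("confidence is assumed to lie in [0, 1]")
            self.answer, self.confidence = answer, confidence
            self.done = True
        else:
            if doc_id is None or not 0 <= doc_id < self.num_docs:
                raise ValueError("doc_id out of range")
            if action == "read_doc":
                self.read.add(doc_id)
            elif action == "flag_suspicious":
                self.flagged.add(doc_id)
            else:  # unflag_doc
                self.flagged.discard(doc_id)

        # Assumption: every action, including submission, consumes one step.
        self.budget -= 1
        if self.budget == 0:
            self.done = True
        return self
```

A typical trajectory under these assumptions reads a few documents, flags the ones that contradict the rest, and submits before the budget runs out, for example `EpisodeState().step("read_doc", doc_id=0).step("flag_suspicious", doc_id=3).step("submit_answer", answer="Paris", confidence=0.9)`.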
## Interactive Demo UI

The FastAPI app serves a lightweight frontend at `/`. It lets users start an episode, inspect the eight retrieved documents, spend read budget, flag suspicious documents, submit an answer with confidence, and optionally call the trained model through `/model/infer`.

Run locally with:

```bash
uvicorn environment.server:app --host 0.0.0.0 --port 7860
```
## Reward

The reward is deterministic and compositional. There is no hidden LLM judge.

| Component | What It Rewards | Weight |
|---|---|---:|
| Answer correctness | exact match after normalization | +0.40 |
| Corruption recall | fraction of corrupt docs found | +0.30 |
| Precision | avoids false accusations | +0.20 |
| Confidence calibration | confidence helps only when correct | ±0.10 |
| Efficiency | small bonus for conserving budget | +0.05 |

Reward range: **-0.5 to 1.05**.
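
To make the weights concrete, here is a rough sketch of how such a compositional reward could be computed. It is not the environment's actual reward code. The function names, the normalization rule (lowercase, strip punctuation, collapse whitespace), the linear scaling of the calibration and efficiency terms, and the zero precision score when nothing is flagged are all assumptions. The sketch reproduces the stated maximum of 1.05, but its minimum is only -0.10, so the stated lower bound of -0.5 presumably comes from penalties that the table does not list and that are not modeled here.

```python
import string


def normalize(text: str) -> str:
    # Assumed normalization: lowercase, drop ASCII punctuation, collapse whitespace.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def sketch_reward(answer, gold, flagged, corrupted, confidence,
                  steps_used, budget=12):
    """Hypothetical reward using the weights from the table above."""
    correct = normalize(answer) == normalize(gold)
    flagged, corrupted = set(flagged), set(corrupted)
    hits = len(flagged & corrupted)

    # Recall: fraction of corrupted documents that were flagged.
    recall = hits / len(corrupted) if corrupted else 0.0
    # Precision: fraction of flags that were right (assumed 0 with no flags).
    precision = hits / len(flagged) if flagged else 0.0
    # Calibration: confidence helps when correct and hurts when wrong.
    calibration = confidence if correct else -confidence
    # Efficiency: share of the step budget left unused.
    efficiency = max(budget - steps_used, 0) / budget

    return (0.40 * float(correct)
            + 0.30 * recall
            + 0.20 * precision
            + 0.10 * calibration
            + 0.05 * efficiency)
```

Under these assumptions, flagging one clean document alongside two corrupted ones lowers precision to 2/3 while recall stays at 1, which is how the precision term discourages over-flagging.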
## Results

We trained **Qwen2-1.5B-Instruct** with GRPO using Unsloth / TRL. The run was sized for hackathon constraints, but its final logged reward of 0.3289 is about 2.5 times the random baseline's average of 0.1302.

| Agent | Reward Evidence |
|---|---:|
| Random baseline | **0.1302 avg reward** over 100 episodes |
| Qwen2-1.5B GRPO | **0.3289 final logged reward** in the finished WandB run |

The trained LoRA adapter is pushed to the Hub and is loaded by the hosted Space through `/model/infer` for a live sanity check.
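
For readers who want to load the adapter outside the Space, the following is a minimal sketch using `transformers` and `peft`. It is not the Space's own inference code. The adapter id is taken from the checkpoint link in this README. The base model id `Qwen/Qwen2-1.5B-Instruct` is an assumption about which checkpoint the adapter was trained on, so check the adapter's config before relying on it. `load_adapter` is a hypothetical helper name.

```python
# Assumption: the adapter was trained on top of this base checkpoint.
BASE_MODEL_ID = "Qwen/Qwen2-1.5B-Instruct"
# Adapter repository listed in this README.
ADAPTER_ID = "Siddh12334/qwen-1.5b-context-corruption"


def load_adapter(base_model_id: str = BASE_MODEL_ID,
                 adapter_id: str = ADAPTER_ID):
    """Download the base model and attach the LoRA adapter (needs network access)."""
    # Imported lazily so that defining this helper does not require the
    # heavy dependencies to be installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model_id)
    base = AutoModelForCausalLM.from_pretrained(base_model_id)
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()  # inference mode: disables dropout
    return model, tokenizer
```

Calling `model, tokenizer = load_adapter()` downloads both repositories from the Hub. The prompt format expected by the trained policy is not specified here, so see `training/train_grpo.py` for how prompts were built.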
|  | |
|  | |
| Additional exported charts: | |
| - [Policy entropy](assets/entropy_curve.png) | |
| - [Mean completion length](assets/completion_length_curve.png) | |
| - [Gradient norm](assets/grad_norm_curve.png) | |
| - [Learning rate](assets/learning_rate_curve.png) | |
| The WandB run was exported into this repo so judges do not need access to a private project. See the raw log, scalar history, config, summary, and completion tables under [`assets/wandb_run_rl5jygl8/`](assets/wandb_run_rl5jygl8/). | |
## Repo Structure

```text
environment/   # OpenEnv environment, actions, reward, server, model inference
data/          # QA loading, corruptions, document generation
training/      # GRPO training script and notebook
eval/          # random baseline evaluation
assets/        # charts, exported training logs, completion samples
```