| title: Propagationshield Env | |
| emoji: ๐ | |
| colorFrom: blue | |
| colorTo: gray | |
| sdk: docker | |
| pinned: false | |
| license: mit | |
| # 🛡️ PropagationShield: Halting Hallucination Cascades in Multi-Agent Systems | |
| **PropagationShield** is the first reinforcement learning training environment that teaches LLM agents to detect, flag, and resist hallucinations injected by other agents in a multi-agent pipeline before those hallucinations cascade into a wrong final output. | |
| ### Quick Links | |
| * **Live Demo Space:** [PropagationShield-Demo](https://huggingface.co/spaces/pragunk/PropagationShield-Demo) | |
| * **OpenEnv Space:** [propagationshield-env](https://huggingface.co/spaces/pragunk/propagationshield-env) | |
| * **Training Rig Space:** [propagationshield-training](https://huggingface.co/spaces/pragunk/propagationshield-training) | |
| * **Colab Training Notebook:** [Run the Pipeline](https://colab.research.google.com/drive/1FYvcTLn_Et3CXNxX_zhXFudsSEH6Lwaj#scrollTo=NIRnE-F_WfiW) | |
| * **Demo Video:** [Demo Video](https://youtu.be/08_nrGPRLjU) | |
| --- | |
| ## 1. The Problem: Hallucination Cascades | |
| When large language models are deployed in production pipelines, a critical vulnerability emerges: **hallucination propagation**. | |
| A single hallucination injected upstream does not stay upstream. For example, in a financial pipeline, if a data-fetching agent hallucinates a revenue figure, the analysis agent builds a report on it, and the output agent sends it to a client without ever questioning the number. Confident wrongness combined with blind downstream trust leads to compounding, catastrophic failures. | |
| Existing benchmarks only measure whether a model hallucinates on its own; none measure whether a model can detect and resist hallucinations it receives from its environment. | |
| ## 2. Environment Innovation & Architecture | |
| To solve this, we built a first-of-its-kind, OpenEnv-compatible testing ground featuring a custom multi-stage pipeline: | |
| * **Targeted Hallucination Engine (Red Team Simulator):** A parameter-driven Python engine that intercepts clean context and algorithmically injects deadly clinical hallucinations (e.g., Factual Omissions, Pediatric Weight Drift, Falsified Lab Results). | |
| * **The Target Agent (Blue Team):** The model being trained (`pragunk/PropagationShield`). It is forced to evaluate upstream data and must output its verdict strictly in a structured JSON `suspicion_flags` schema, justifying which passages it distrusts. | |
| * **Propagation Cascade Simulator:** An evaluation pipeline that simulates a multi-agent handoff (Retrieval Agent -> Audit Agent). We use this to measure the **Propagation Containment Rate (PCR)**, our novel metric tracking the percentage of injected hallucinations caught before they can corrupt downstream decision-making. | |
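The structured verdict described above could be validated along the following lines. This is a minimal sketch, not the environment's actual parser: only the `suspicion_flags` key comes from the description above, while the `answer`, `passage_id`, and `reason` fields and the `parse_verdict` helper are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class SuspicionFlag:
    passage_id: int  # hypothetical field: index of the distrusted passage
    reason: str      # hypothetical field: the agent's justification


def parse_verdict(raw: str) -> tuple[str, list[SuspicionFlag]]:
    """Parse the agent's JSON output; raise ValueError if it is malformed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"verdict is not valid JSON: {exc}") from exc

    if not isinstance(data, dict) or "suspicion_flags" not in data:
        raise ValueError("verdict must be an object with 'suspicion_flags'")
    if not isinstance(data["suspicion_flags"], list):
        raise ValueError("'suspicion_flags' must be a list")

    flags = []
    for item in data["suspicion_flags"]:
        if (
            not isinstance(item, dict)
            or not isinstance(item.get("passage_id"), int)
            or not isinstance(item.get("reason"), str)
        ):
            raise ValueError(f"malformed flag entry: {item!r}")
        flags.append(SuspicionFlag(item["passage_id"], item["reason"]))

    # 'answer' is an assumed field; it defaults to an empty string if absent.
    return str(data.get("answer", "")), flags
```

Rejecting malformed output outright, rather than repairing it, keeps the "strictly structured" requirement enforceable during training; whether the real environment does this is an assumption.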
| ## 3. Training Pipeline & Reward Function | |
| We utilized **Unsloth** for memory-efficient 4-bit quantization and **Hugging Face TRL** to run Group Relative Policy Optimization (GRPO) on a `Qwen2.5-7B-Instruct` model. | |
| The target agent's reward signal combines three components that pull in different directions: | |
| 1. **Task Accuracy:** Graded by a deterministic ground-truth verifier. | |
| 2. **Hallucination Detection F1:** Rewarding true positive detection flags while penalizing false alarms on clean episodes. | |
| 3. **Propagation Penalty:** A harsh penalty applied when the agent gets the answer wrong *and* flags nothing, punishing blind acceptance. | |
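One way the three components above might be combined is sketched below. The weights, the penalty magnitude, the function names, and the choice to score a correctly silent clean episode as F1 = 1.0 are all assumptions made for illustration; the actual values and formula used in training are not stated here.

```python
def detection_f1(flagged: set[int], injected: set[int]) -> float:
    """F1 between flagged passage ids and truly injected passage ids."""
    if not flagged and not injected:
        # Assumption: staying silent on a clean episode counts as a perfect score.
        return 1.0
    true_positives = len(flagged & injected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)


def episode_reward(
    answer_correct: bool,
    flagged: set[int],
    injected: set[int],
    w_task: float = 1.0,               # assumed weight
    w_detect: float = 1.0,             # assumed weight
    propagation_penalty: float = 2.0,  # assumed magnitude
) -> float:
    """Combine task accuracy, detection F1, and the propagation penalty."""
    reward = w_task * float(answer_correct)
    reward += w_detect * detection_f1(flagged, injected)
    # Blind acceptance: a wrong answer with nothing flagged.
    if not answer_correct and not flagged:
        reward -= propagation_penalty
    return reward
```

With these assumed defaults, a false alarm on a clean episode earns no detection credit, while a wrong answer with no flags is the only case that goes negative.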
| ## 4. Showing Reward Improvement | |
|  | |
|  | |
| *Note on logging artifact: the train/loss curve shows a known W&B logging artifact in which train loss drops to zero at regular evaluation intervals (steps 5, 10, 15). The trend of the peaks nonetheless shows steady convergence, from an initial loss of ~3.5 down to ~2.5 over the run.* | |
| Our training demonstrated dramatic improvements across our three core evaluation metrics: | |
| * **Task Accuracy:** Improved from **62%** to **94%**. The model learned to answer the underlying query correctly despite the presence of adversarial noise. | |
| * **Detection F1:** Rose from a near-blind **4%** to **87%**, indicating that the model learned to be epistemically skeptical and to flag contradictions accurately. | |
| * **Propagation Containment Rate (PCR):** In our simulated pipeline, the baseline model blindly passed hallucinations downstream 88% of the time. PropagationShield raised the containment rate from **12%** to **82%**, neutralizing the large majority of hallucination cascades. | |
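Following the definition of PCR given earlier (the share of injected hallucinations caught before the downstream handoff), the metric could be computed roughly as follows. This is a sketch under assumptions: the function name and the per-episode `(flagged, injected)` representation are hypothetical, not the evaluation pipeline's actual interface.

```python
def propagation_containment_rate(
    episodes: list[tuple[set[int], set[int]]],
) -> float:
    """Fraction of injected hallucinations that the agent flagged.

    Each episode is a (flagged_ids, injected_ids) pair; this layout is an
    assumption made for illustration.
    """
    injected_total = 0
    caught_total = 0
    for flagged, injected in episodes:
        injected_total += len(injected)
        caught_total += len(flagged & injected)
    if injected_total == 0:
        raise ValueError("PCR is undefined when nothing was injected")
    return caught_total / injected_total


# Example: 4 hallucinations injected across 3 episodes, 2 of them caught.
episodes = [({1}, {1}), (set(), {2}), ({5}, {4, 5})]
pcr = propagation_containment_rate(episodes)
```

Under this definition the pass-through rate is simply `1 - pcr`, which matches the baseline figures above: 88% passed downstream corresponds to 12% contained.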
| ## 5. Why It Matters | |
| PropagationShield is not just a benchmark; it is a defensive training environment. Any team building multi-agent AI systems can plug PropagationShield into their pipeline to produce models that are skeptical by design, halting hallucinations before they can corrupt downstream decision-making. | |
| --- | |
| *Built for the Meta PyTorch OpenEnv Hackathon.* |