	---
	title: Propagationshield Env
	emoji: 📉
	colorFrom: blue
	colorTo: gray
	sdk: docker
	pinned: false
	license: mit
	---

	# 🛡️ PropagationShield: Halting Hallucination Cascades in Multi-Agent Systems


	PropagationShield is the first reinforcement learning training environment that teaches LLM agents to detect, flag, and resist hallucinations injected by other agents in a multi-agent pipeline before those hallucinations cascade into a wrong final output.

	### 🔗 Quick Links
	* Live Demo Space: [PropagationShield-Demo](https://huggingface.co/spaces/pragunk/PropagationShield-Demo)
	* OpenEnv Space: [propagationshield-env](https://huggingface.co/spaces/pragunk/propagationshield-env)
	* Training Rig Space: [propagationshield-training](https://huggingface.co/spaces/pragunk/propagationshield-training)
	* Colab Training Notebook: [Run the Pipeline](https://colab.research.google.com/drive/1FYvcTLn_Et3CXNxX_zhXFudsSEH6Lwaj#scrollTo=NIRnE-F_WfiW)
	* Demo Video: [YouTube](https://youtu.be/08_nrGPRLjU)


	---

	## 1. The Problem: Hallucination Cascades
	When large language models are deployed in production pipelines, a critical vulnerability emerges: hallucination propagation.

	A single hallucination injected upstream does not stay upstream. For example, in a financial pipeline, if a data-fetching agent hallucinates a revenue figure, the analysis agent builds a report on it, and the output agent sends it to a client without ever questioning the number. Confident wrongness combined with blind downstream trust leads to compounding, catastrophic failures.

	Existing benchmarks only measure whether a model hallucinates on its own; none measure whether a model can detect and resist hallucinations it receives from its environment.

	## 2. Environment Innovation & Architecture
	To solve this, we built a first-of-its-kind, OpenEnv-compatible testing ground featuring a custom multi-stage pipeline:

	* Targeted Hallucination Engine (Red Team Simulator): A parameter-driven Python engine that intercepts clean context and algorithmically injects deadly clinical hallucinations (e.g., Factual Omissions, Pediatric Weight Drift, Falsified Lab Results).
	* The Target Agent (Blue Team): The model being trained (`pragunk/PropagationShield`). It is forced to evaluate upstream data and must output its verdict strictly in a structured JSON `suspicion_flags` schema, justifying which passages it distrusts.
	* Propagation Cascade Simulator: An evaluation pipeline that simulates a multi-agent handoff (Retrieval Agent -> Audit Agent). We use it to measure the Propagation Containment Rate (PCR), our novel metric tracking the percentage of injected hallucinations caught before they can corrupt downstream decision-making.
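
As a rough illustration of how the `suspicion_flags` output and the PCR metric could relate, here is a minimal sketch in Python. Only the `suspicion_flags` key comes from the description above; the `Episode` class, the `passage_id` and `reason` fields, and both function names are assumptions made for this example and may not match the environment's actual schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Episode:
    """One pipeline run. All field names are illustrative assumptions."""
    injected_ids: set[str]  # passages the red-team engine corrupted (empty = clean)
    agent_output: str       # raw JSON string emitted by the target agent


def parse_flags(agent_output: str) -> set[str]:
    """Return the flagged passage ids; malformed output counts as no flags."""
    try:
        payload = json.loads(agent_output)
        return {str(flag["passage_id"]) for flag in payload["suspicion_flags"]}
    except (json.JSONDecodeError, KeyError, TypeError):
        return set()


def propagation_containment_rate(episodes: list[Episode]) -> float:
    """Fraction of injected hallucinations that were flagged before handoff."""
    injected = 0
    caught = 0
    for ep in episodes:
        flagged = parse_flags(ep.agent_output)
        injected += len(ep.injected_ids)
        caught += len(ep.injected_ids & flagged)
    if injected == 0:
        raise ValueError("PCR is undefined without injected hallucinations")
    return caught / injected


# Toy episodes (made up for this example, not taken from the training data).
episodes = [
    Episode({"p2"}, '{"suspicion_flags": [{"passage_id": "p2", "reason": "lab value contradicts chart"}]}'),
    Episode({"p1"}, '{"suspicion_flags": []}'),
    Episode(set(), '{"suspicion_flags": []}'),
]
pcr = propagation_containment_rate(episodes)
```

In this sketch a hallucination counts as contained only if the agent flags the exact passage that was corrupted, and unparseable output is treated the same as flagging nothing.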

	## 3. Training Pipeline & Reward Function
	We used Unsloth for memory-efficient 4-bit quantization and Hugging Face TRL to run Group Relative Policy Optimization (GRPO) on a `Qwen2.5-7B-Instruct` model.
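
GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. The sketch below shows that group-relative normalization in plain Python. It is not the TRL implementation: the function name, the `eps` value, and the use of the population standard deviation are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for one prompt, with made-up rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Completions that beat their group's average get a positive advantage and are reinforced; those below it get a negative one. When all completions score the same, every advantage is zero and the group contributes no learning signal.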

	The target agent's reward signal combines three components that balance answering correctly against flagging suspicious input:
	1. Task Accuracy: Graded by a deterministic ground-truth verifier.
	2. Hallucination Detection F1: Rewarding true positive detection flags while penalizing false alarms on clean episodes.
	3. Propagation Penalty: A harsh penalty applied when the agent gets the answer wrong and flags nothing, punishing blind acceptance.
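
One way these three components could be combined is sketched below. The weights, the penalty size, the function names, and the choice to score a clean episode with no flags as a perfect detection are all assumptions for illustration, not the values or rules used in training.

```python
def detection_f1(flagged: set[str], injected: set[str]) -> float:
    """F1 of flagged passages against the injected ones."""
    if not flagged and not injected:
        return 1.0  # assumption: staying silent on a clean episode is correct
    true_positives = len(flagged & injected)
    if true_positives == 0:
        return 0.0  # missed everything, or raised only false alarms
    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)


def episode_reward(
    answer_correct: bool,
    flagged: set[str],
    injected: set[str],
    w_task: float = 1.0,               # hypothetical weight
    w_detect: float = 1.0,             # hypothetical weight
    propagation_penalty: float = 1.0,  # hypothetical penalty size
) -> float:
    """Task accuracy plus detection F1, minus a penalty for blind acceptance."""
    reward = w_task * float(answer_correct) + w_detect * detection_f1(flagged, injected)
    if not answer_correct and not flagged:
        # Wrong answer and nothing flagged: the agent passed bad data along unquestioned.
        reward -= propagation_penalty
    return reward
```

With these placeholder weights, a correct answer that flags the corrupted passage scores highest, a false alarm on a clean episode forfeits the detection term, and a wrong answer with no flags ends up below zero.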



	## 4. Showing Reward Improvement

	![Training Loss Curve](LossCurve.png)
	![Baseline Vs Trained](BaseLineVsTrained.png)

	Note on Logging Artifact: The train/loss curve shows a known W&B logging artifact in which the logged train loss drops to zero at the regular evaluation intervals (steps 5, 10, 15). Ignoring those points, the peaks trend steadily downward, from an initial loss of ~3.5 to ~2.5 over the run.

	Training improved all three of our core evaluation metrics:
	* Task Accuracy: Improved from 62% to 94%. The model learned to answer the underlying query correctly despite the presence of adversarial noise.
	* Detection F1: Rose from a near-blind 4% to 87%, indicating that the model learned to be epistemically skeptical and to flag contradictions accurately.
	* Propagation Containment Rate (PCR): In our simulated pipeline, the baseline model passed hallucinations downstream 88% of the time, a containment rate of 12%. PropagationShield raised the containment rate to 82%, stopping the large majority of hallucination cascades.

	## 5. Why It Matters
	PropagationShield is not just a benchmark; it is a defensive training environment. Any team building multi-agent AI systems can plug PropagationShield into their pipeline to produce models that are skeptical by design, halting hallucinations before they can corrupt downstream decision-making.

	---
	Built for the Meta PyTorch OpenEnv Hackathon.