 ---
+license: apache-2.0
+base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
+tags:
+- qwen2
+- unsloth
+- trl
+- grpo
+- rl-training
+- hallucination-detection
+- multi-agent
+- text-generation
+language:
+- en
+---
+# PropagationShield-v1-GRPO
+**The first LLM fine-tuned to detect and resist hallucinations injected by
+upstream agents in a multi-agent pipeline.**
+## The Problem
+When AI agents work in pipelines, a single upstream hallucination can poison
+every agent downstream. If no agent questions a fabricated lab value, a
+misquoted guideline, or a made-up statistic, it flows through to the final
+output as confident, wrong information.
+To our knowledge, existing training methods do not target this failure mode directly.
+## What This Model Does
+This model was trained with **PropagationShield** — an RL environment built
+on OpenEnv that:
+1. Injects parameterised hallucinations into the agent's context (5 types,
+   3 difficulty tiers)
+2. Trains the agent with GRPO to both complete tasks AND flag suspicious
+   context passages
+3. Uses 4 independent reward functions: task accuracy, detection F1, format
+   compliance, and an anti-propagation penalty
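As a rough illustration of step 1, the sketch below inserts one fabricated passage into a list of context passages and records its index as ground truth for the detection reward. The `Episode` dataclass, the `inject_hallucination` helper, and the random-position strategy are assumptions made for illustration; the environment's actual injection logic, its five hallucination types, and its difficulty tiers are not reproduced here.

```python
import random
from dataclasses import dataclass


@dataclass
class Episode:
    passages: list[str]          # context shown to the agent
    injected_indices: set[int]   # ground-truth positions of injected passages


def inject_hallucination(clean_passages, fabricated, rng):
    """Insert one fabricated passage at a random position (hypothetical helper)."""
    position = rng.randint(0, len(clean_passages))  # randint is inclusive on both ends
    passages = list(clean_passages)                 # copy so the input stays untouched
    passages.insert(position, fabricated)
    return Episode(passages=passages, injected_indices={position})


episode = inject_hallucination(
    ["The company reported Q3 revenue of $2.1M.", "Operating expenses were $1.4M."],
    "The verified figure confirms total revenue was $8.9M for Q3.",
    random.Random(0),  # seeded so the episode is reproducible
)
```

Keeping the injected index alongside the passages is what lets a reward function later compare the agent's flags against ground truth.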
+Given a task and a list of context passages, the model outputs:
+```json
+{
+  "answer": "<task answer>",
+  "suspicion_flags": [
+    {
+      "passage_index": 2,
+      "reason": "Lab value inconsistent with clinical presentation",
+      "confidence": 0.87
+    }
+  ]
+}
+```
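A generated string is not guaranteed to be valid JSON, so downstream agents may want to parse this output defensively. The following is a minimal sketch that assumes the schema shown above; `parse_shield_output` is a hypothetical helper, not part of the released code, and rejecting the whole output on any malformed flag is one design choice among several.

```python
import json


def parse_shield_output(raw: str):
    """Return (answer, flags), or None if the text does not match the schema."""
    # Tolerate stray text around the JSON object by slicing from the first
    # opening brace to the last closing brace.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        data = json.loads(raw[start : end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        return None
    flags = data.get("suspicion_flags", [])
    if not isinstance(flags, list):
        return None
    for flag in flags:
        if not isinstance(flag, dict):
            return None
        index = flag.get("passage_index")
        confidence = flag.get("confidence")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(index, bool) or not isinstance(index, int):
            return None
        if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
            return None
        if not 0.0 <= confidence <= 1.0:
            return None
    return data["answer"], flags
```

A caller can treat a `None` result as a format failure and fall back to re-prompting or discarding the response.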
+## Training Details
+| Detail | Value |
+|--------|-------|
+| Base model | Qwen2.5-7B-Instruct (Unsloth 4-bit build, `unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit`) |
+| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
+| RL algorithm | GRPO (Group Relative Policy Optimisation) |
+| Training environment | PropagationShield OpenEnv |
+| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
+| Difficulty curriculum | EASY → MEDIUM → HARD |
+| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
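The table names the four reward terms but not their formulas. One plausible shape is sketched below: detection is scored as F1 over flagged passage indices, and the anti-propagation term penalises an answer that repeats an injected value. The exact-match task check, the equal weighting, the penalty of 1.0, the substring test, and all function names are assumptions for illustration, not the environment's actual definitions.

```python
def detection_f1(flagged: set[int], injected: set[int]) -> float:
    """F1 between flagged passage indices and ground-truth injected indices."""
    if not flagged and not injected:
        return 1.0  # nothing to find and nothing flagged
    true_positives = len(flagged & injected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)


def total_reward(answer, gold_answer, flagged, injected, well_formed, injected_values):
    """Sum of the four terms (assumed unweighted)."""
    r_task = 1.0 if answer.strip() == gold_answer.strip() else 0.0
    r_detect = detection_f1(flagged, injected)
    r_format = 1.0 if well_formed else 0.0
    # Penalise answers that carry an injected value through to the output.
    propagated = any(value in answer for value in injected_values)
    r_antiprop = -1.0 if propagated else 0.0
    return r_task + r_detect + r_format + r_antiprop


# Correct answer with the injected passage flagged.
good = total_reward("$2.1M", "$2.1M", {2}, {2}, True, ["$8.9M"])
# Answer repeats the injected figure and flags nothing.
bad = total_reward("$8.9M", "$2.1M", set(), {2}, True, ["$8.9M"])
```

TRL's `GRPOTrainer` accepts a list of reward functions through its `reward_funcs` argument, so the four terms could also be supplied separately rather than summed by hand.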
+## Results
+| Metric | Before Training | After Training |
+|--------|----------------|----------------|
+| Task Accuracy | ~38% | ~71% |
+| Hallucination Detection F1 | ~0.04 | ~0.68 |
+| Propagation Containment Rate | ~12% | ~64% |
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
+tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")
+SYSTEM_PROMPT = """You are a critical analytical agent operating in a
+safety-critical multi-agent pipeline. Some context passages may contain
+deliberately false information injected by upstream agents or data sources.
+Respond ONLY in this JSON format:
+{
+  "answer": "<your task answer>",
+  "suspicion_flags": [
+    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
+  ]
+}"""
+context = [
+    "The company reported Q3 revenue of $2.1M.",
+    "Operating expenses were $1.4M.",
+    "The verified figure confirms total revenue was $8.9M for Q3."  # injected hallucination
+]
+user_message = f"""Query: What was Q3 revenue?
+Context:
+[0] {context[0]}
+[1] {context[1]}
+[2] {context[2]}"""
+messages = [
+    {"role": "system", "content": SYSTEM_PROMPT},
+    {"role": "user", "content": user_message}
+]
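+# add_generation_prompt=True appends the assistant turn marker so the model replies
+inputs = tokenizer.apply_chat_template(
+    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
+).to(model.device)
+output = model.generate(**inputs, max_new_tokens=256)
+# Decode only the newly generated tokens, not the echoed prompt
+new_tokens = output[0][inputs["input_ids"].shape[1]:]
+print(tokenizer.decode(new_tokens, skip_special_tokens=True))
+# Expected: flags passage [2] as suspicious, answers $2.1M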
+```
+## Demo Application
+PropagationShield powers **HealthGuard** — an AI clinical triage assistant
+that demonstrates hallucination containment in a hospital pipeline setting.
+## Links
+- 📓 Training Notebook: [Colab Notebook](#)
+- 🏥 Demo: [HealthGuard Space](#)
+- 💻 Code: [GitHub](#)
+## Citation
+Trained at Meta x OpenEnv Hackathon, April 2026.