---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---

# PropagationShield-v1-GRPO

**The first LLM fine-tuned to detect and resist hallucinations injected by 
upstream agents in a multi-agent pipeline.**

## The Problem

When AI agents work in pipelines, one hallucination upstream poisons every 
agent downstream. A fabricated lab value, a misquoted guideline, a made-up 
statistic: if no agent questions it, it flows through to the final output 
as confident but wrong information.

Standard fine-tuning does not teach an agent to question its own context. 
PropagationShield targets this failure mode directly.

## What This Model Does

This model was trained with **PropagationShield**, an RL environment built 
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types, 
   3 difficulty tiers)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious 
   context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format 
   compliance, and an anti-propagation penalty
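To make step 1 concrete, here is a minimal sketch of one injection type (STAT_DRIFT): perturbing a numeric value in a context passage by a difficulty-dependent factor. The function name, drift factors, and number-matching heuristic are illustrative assumptions, not the environment's actual implementation.

```python
import random

# Hypothetical drift magnitudes per difficulty tier (assumption: harder
# tiers drift numbers less, making the fabrication subtler).
DRIFT_FACTORS = {"EASY": 4.0, "MEDIUM": 1.5, "HARD": 1.1}

def inject_stat_drift(passages, difficulty="EASY", seed=0):
    """Return (passages, injected_index) with one numeric value drifted."""
    rng = random.Random(seed)
    out = list(passages)
    # Only passages that contain a digit can receive a STAT_DRIFT injection.
    candidates = [i for i, p in enumerate(out) if any(ch.isdigit() for ch in p)]
    if not candidates:
        return out, None
    idx = rng.choice(candidates)
    factor = DRIFT_FACTORS[difficulty]
    tokens = out[idx].split()
    for j, tok in enumerate(tokens):
        num = tok.strip("$,.%")
        if num.replace(".", "", 1).isdigit():
            drifted = float(num) * factor
            tokens[j] = tok.replace(num, f"{drifted:g}")
            break
    out[idx] = " ".join(tokens)
    return out, idx
```

The environment would record `injected_index` as ground truth so the detection reward can be computed against the model's `suspicion_flags`.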

Given any task + context, this model outputs:
```json
{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```
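A downstream consumer needs to parse this structured output robustly. The sketch below is an assumption about sensible handling, not part of the released model: it tolerates prose around the JSON object and falls back to "no flags" when the output is unparseable.

```python
import json

def parse_shield_output(text):
    """Extract the answer and suspicion flags from a model response."""
    try:
        # Tolerate prose before/after the JSON object.
        start, end = text.index("{"), text.rindex("}") + 1
        payload = json.loads(text[start:end])
    except ValueError:  # covers both .index() misses and JSONDecodeError
        return {"answer": text.strip(), "suspicion_flags": []}
    return {
        "answer": payload.get("answer", ""),
        "suspicion_flags": [
            f for f in payload.get("suspicion_flags", [])
            if isinstance(f, dict) and "passage_index" in f
        ],
    }
```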

## Training Details

| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
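The card only states that four independent reward terms are combined; the weights, the detection-F1 computation, and the anti-propagation penalty below are illustrative assumptions of how such a scalar GRPO reward could be assembled.

```python
def detection_f1(flagged, injected):
    """F1 between flagged passage indices and actually-injected ones."""
    flagged, injected = set(flagged), set(injected)
    if not flagged and not injected:
        return 1.0  # nothing injected, nothing flagged: perfect
    tp = len(flagged & injected)
    if tp == 0:
        return 0.0
    precision = tp / len(flagged)
    recall = tp / len(injected)
    return 2 * precision * recall / (precision + recall)

def total_reward(r_task, flagged, injected, r_format, propagated,
                 weights=(1.0, 1.0, 0.2, 1.0)):
    """Weighted sum of the four reward terms (weights are hypothetical)."""
    w_task, w_detect, w_format, w_antiprop = weights
    r_detect = detection_f1(flagged, injected)
    # Anti-propagation: penalise if an injected fact reached the answer.
    r_antiprop = -1.0 if propagated else 0.0
    return (w_task * r_task + w_detect * r_detect
            + w_format * r_format + w_antiprop * r_antiprop)
```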

## Results

| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a 
safety-critical multi-agent pipeline. Some context passages may contain 
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3."  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```

## Demo Application

PropagationShield powers **HealthGuard**, an AI clinical triage assistant 
that demonstrates hallucination containment in a hospital pipeline setting.

## Links

- 📓 Training Notebook: [Colab Notebook](#)
- 🏥 Demo: [HealthGuard Space](#)
- 💻 Code: [GitHub](#)

## Citation

Trained at Meta x OpenEnv Hackathon, April 2026.