---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---
# PropagationShield-v1-GRPO
**The first LLM fine-tuned to detect and resist hallucinations injected by
upstream agents in a multi-agent pipeline.**
## The Problem
When AI agents work in pipelines, one hallucination upstream poisons every
agent downstream. A fabricated lab value, a misquoted guideline, a made-up
statistic: if no agent questions it, it flows through to the final output
as confident, wrong information.
No existing training method addresses this. Until now.
## What This Model Does
This model was trained with **PropagationShield**, an RL environment built
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types,
3 difficulty tiers; see the sketch after this list)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious
context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format
compliance, and an anti-propagation penalty
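As a toy illustration of step 1: one injection type, STAT_DRIFT, might
perturb a numeric figure in a randomly chosen passage. The function below
is a sketch invented for this README; its name, signature, and drift
factor are not the environment's actual API.
```python
import random
import re

def inject_stat_drift(passages, drift=3.0, seed=None):
    """Toy STAT_DRIFT injection: scale the first number in a randomly
    chosen passage by `drift`. Returns the poisoned context plus the index
    of the corrupted passage, which becomes the detection target."""
    rng = random.Random(seed)
    candidates = [i for i, p in enumerate(passages) if re.search(r"\d+(?:\.\d+)?", p)]
    if not candidates:
        return list(passages), -1  # nothing numeric to corrupt
    target = rng.choice(candidates)
    out = list(passages)
    out[target] = re.sub(
        r"\d+(?:\.\d+)?",
        lambda m: f"{float(m.group()) * drift:.1f}",
        out[target],
        count=1,
    )
    return out, target
```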
Given any task + context, this model outputs:
```json
{
"answer": "<task answer>",
"suspicion_flags": [
{
"passage_index": 2,
"reason": "Lab value inconsistent with clinical presentation",
"confidence": 0.87
}
]
}
```
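Downstream agents can act on `suspicion_flags` before context propagates
further. A minimal consumer sketch (the helper name and the 0.5 confidence
threshold are assumptions, not part of the model):
```python
import json

def contain(raw_reply, passages, threshold=0.5):
    """Parse the model's JSON reply and drop every passage it flags with
    confidence >= threshold before forwarding the context downstream."""
    reply = json.loads(raw_reply)
    flagged = {
        f["passage_index"]
        for f in reply.get("suspicion_flags", [])
        if f.get("confidence", 0.0) >= threshold
    }
    clean = [p for i, p in enumerate(passages) if i not in flagged]
    return reply["answer"], clean
```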
## Training Details
| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
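The card lists the reward signals but not their exact formulas or weights.
A plausible shape, with the F1 computed over flagged passage indices and
illustrative weights (both assumptions):
```python
def detection_f1(flagged, injected):
    """F1 between the passage indices the agent flagged and the indices
    the environment actually corrupted."""
    flagged, injected = set(flagged), set(injected)
    if not flagged and not injected:
        return 1.0  # clean episode, no flags: perfect score
    tp = len(flagged & injected)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(injected) if injected else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def total_reward(r_task, r_detect, r_format, r_antiprop, w=(1.0, 1.0, 0.5, 1.0)):
    """Hypothetical weighted sum of the four independent reward signals."""
    return w[0] * r_task + w[1] * r_detect + w[2] * r_format + w[3] * r_antiprop
```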
## Results
| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "pragunk/PropagationShield", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")
SYSTEM_PROMPT = """You are a critical analytical agent operating in a
safety-critical multi-agent pipeline. Some context passages may contain
deliberately false information injected by upstream agents or data sources.
Respond ONLY in this JSON format:
{
"answer": "<your task answer>",
"suspicion_flags": [
{"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
]
}"""
context = [
"The company reported Q3 revenue of $2.1M.",
"Operating expenses were $1.4M.",
"The verified figure confirms total revenue was $8.9M for Q3." # injected hallucination
]
user_message = f"""Query: What was Q3 revenue?
Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```
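The decoded reply is the JSON object described above; a downstream stage
can feed it to a gate like the `contain` sketch earlier to strip flagged
passages before the next agent sees them.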
## Demo Application
PropagationShield powers **HealthGuard**, an AI clinical triage assistant
that demonstrates hallucination containment in a hospital pipeline setting.
## Links
- πŸ““ Training Notebook: [Colab Notebook](#)
- πŸ₯ Demo: [HealthGuard Space](#)
- πŸ’» Code: [GitHub](#)
## Citation
Trained at Meta x OpenEnv Hackathon, April 2026.