---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---

# PropagationShield-v1-GRPO

**The first LLM fine-tuned to detect and resist hallucinations injected by 
upstream agents in a multi-agent pipeline.**

## The Problem

When AI agents work in pipelines, one hallucination upstream poisons every 
agent downstream. A fabricated lab value, a misquoted guideline, a made-up 
statistic: if no agent questions it, it flows through to the final output 
as confident but wrong information.

Standard fine-tuning does not teach an agent to question its own context. 
PropagationShield targets this failure mode directly.

## What This Model Does

This model was trained with **PropagationShield**, an RL environment built 
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types, 
   3 difficulty tiers)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious 
   context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format 
   compliance, and an anti-propagation penalty
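To make step 1 concrete, here is a minimal sketch of one injection type (STAT_DRIFT): perturbing a numeric value in a context passage by a difficulty-dependent factor. The function name, drift factors, and number-matching heuristic are illustrative assumptions, not the environment's actual implementation.

```python
import random

# Hypothetical drift magnitudes per difficulty tier (assumption: harder
# tiers drift numbers less, making the fabrication subtler).
DRIFT_FACTORS = {"EASY": 4.0, "MEDIUM": 1.5, "HARD": 1.1}

def inject_stat_drift(passages, difficulty="EASY", seed=0):
    """Return (passages, injected_index) with one numeric value drifted."""
    rng = random.Random(seed)
    out = list(passages)
    # Only passages that contain a digit can receive a STAT_DRIFT injection.
    candidates = [i for i, p in enumerate(out) if any(ch.isdigit() for ch in p)]
    if not candidates:
        return out, None
    idx = rng.choice(candidates)
    factor = DRIFT_FACTORS[difficulty]
    tokens = out[idx].split()
    for j, tok in enumerate(tokens):
        num = tok.strip("$,.%")
        if num.replace(".", "", 1).isdigit():
            drifted = float(num) * factor
            tokens[j] = tok.replace(num, f"{drifted:g}")
            break
    out[idx] = " ".join(tokens)
    return out, idx
```

The environment would record `injected_index` as ground truth so the detection reward can be computed against the model's `suspicion_flags`.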

Given any task + context, this model outputs:
```json
{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```
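A downstream consumer needs to parse this structured output robustly. The sketch below is an assumption about sensible handling, not part of the released model: it tolerates prose around the JSON object and falls back to "no flags" when the output is unparseable.

```python
import json

def parse_shield_output(text):
    """Extract the answer and suspicion flags from a model response."""
    try:
        # Tolerate prose before/after the JSON object.
        start, end = text.index("{"), text.rindex("}") + 1
        payload = json.loads(text[start:end])
    except ValueError:  # covers both .index() misses and JSONDecodeError
        return {"answer": text.strip(), "suspicion_flags": []}
    return {
        "answer": payload.get("answer", ""),
        "suspicion_flags": [
            f for f in payload.get("suspicion_flags", [])
            if isinstance(f, dict) and "passage_index" in f
        ],
    }
```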

## Training Details

| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
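The card only states that four independent reward terms are combined; the weights, the detection-F1 computation, and the anti-propagation penalty below are illustrative assumptions of how such a scalar GRPO reward could be assembled.

```python
def detection_f1(flagged, injected):
    """F1 between flagged passage indices and actually-injected ones."""
    flagged, injected = set(flagged), set(injected)
    if not flagged and not injected:
        return 1.0  # nothing injected, nothing flagged: perfect
    tp = len(flagged & injected)
    if tp == 0:
        return 0.0
    precision = tp / len(flagged)
    recall = tp / len(injected)
    return 2 * precision * recall / (precision + recall)

def total_reward(r_task, flagged, injected, r_format, propagated,
                 weights=(1.0, 1.0, 0.2, 1.0)):
    """Weighted sum of the four reward terms (weights are hypothetical)."""
    w_task, w_detect, w_format, w_antiprop = weights
    r_detect = detection_f1(flagged, injected)
    # Anti-propagation: penalise if an injected fact reached the answer.
    r_antiprop = -1.0 if propagated else 0.0
    return (w_task * r_task + w_detect * r_detect
            + w_format * r_format + w_antiprop * r_antiprop)
```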

## Results

| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a 
safety-critical multi-agent pipeline. Some context passages may contain 
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3."  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```

## Demo Application

PropagationShield powers **HealthGuard**, an AI clinical triage assistant 
that demonstrates hallucination containment in a hospital pipeline setting.

## Links

- 📓 Training Notebook: [Colab Notebook](#)
- 🏥 Demo: [HealthGuard Space](#)
- 💻 Code: [GitHub](#)

## Citation

Trained at Meta x OpenEnv Hackathon, April 2026.