PropagationShield-v1-GRPO

The first LLM fine-tuned to detect and resist hallucinations injected by upstream agents in a multi-agent pipeline.

The Problem

When AI agents work in pipelines, one hallucination upstream poisons every agent downstream. A fabricated lab value, a misquoted guideline, a made-up statistic: if no agent questions it, it flows through to the final output as confident, wrong information.

No existing training method addresses this. Until now.

What This Model Does

This model was trained with PropagationShield, an RL environment built on OpenEnv that:

  1. Injects parameterised hallucinations into the agent's context (5 types, 3 difficulty tiers)
  2. Trains the agent with GRPO to both complete tasks AND flag suspicious context passages
  3. Uses 4 independent reward functions: task accuracy, detection F1, format compliance, and an anti-propagation penalty (a rough sketch follows below)
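
As an illustration of how these rewards combine, the sketch below scores one sampled completion and sums the four terms. The helper logic and penalty values are hypothetical, not the released training code:

import json

def detection_f1(pred: set, gold: set) -> float:
    # F1 between flagged passage indices and the truly injected ones.
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def combined_reward(completion: str, gold_answer: str, injected: set) -> float:
    # One scalar per sampled completion, as GRPO expects.
    try:
        out = json.loads(completion)
    except json.JSONDecodeError:
        return -1.0                      # R_format: unparseable output
    r_format = 1.0
    r_task = 1.0 if out.get("answer") == gold_answer else 0.0
    flagged = {f["passage_index"] for f in out.get("suspicion_flags", [])}
    r_detect = detection_f1(flagged, injected)
    # R_antiprop: penalise missing an injection AND getting the task wrong,
    # a rough proxy for "the false fact propagated into the answer".
    r_antiprop = -1.0 if injected and not (flagged & injected) and r_task == 0.0 else 0.0
    return r_task + r_detect + r_format + r_antiprop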

Given any task + context, this model outputs:

{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
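
A downstream agent can act on these flags before consuming the context. A minimal sketch of that consumer side (the 0.5 threshold and function name are illustrative):

import json

def filter_context(model_output: str, context: list, min_confidence: float = 0.5) -> list:
    # Drop passages the model flagged at or above the confidence threshold.
    result = json.loads(model_output)
    flagged = {
        flag["passage_index"]
        for flag in result.get("suspicion_flags", [])
        if flag.get("confidence", 0.0) >= min_confidence
    }
    return [p for i, p in enumerate(context) if i not in flagged]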

Training Details

| Detail | Value |
|---|---|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
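
For context, a minimal sketch of how a GRPO stage like this can be wired up with TRL's GRPOTrainer. The dataset row, reward callable, and config values are illustrative, not the actual training setup:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Each reward callable scores a batch of completions; extra dataset
# columns (here gold_answer) are forwarded as keyword arguments.
def r_task(completions, gold_answer, **kwargs):
    return [1.0 if gold in c else 0.0 for c, gold in zip(completions, gold_answer)]

train_dataset = Dataset.from_list([
    {"prompt": "Query: What was Q3 revenue?\nContext: [0] ...", "gold_answer": "$2.1M"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[r_task],  # pass all four reward callables in practice
    args=GRPOConfig(output_dir="grpo-out", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()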

Results

| Metric | Before Training | After Training |
|---|---|---|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "pragunk/PropagationShield", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a 
safety-critical multi-agent pipeline. Some context passages may contain 
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3."  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
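
For the context above, a well-behaved run should produce output along these lines (illustrative; exact wording and confidence will vary):

{
  "answer": "$2.1M",
  "suspicion_flags": [
    {"passage_index": 2, "reason": "Claimed $8.9M conflicts with the $2.1M reported in passage [0]", "confidence": 0.9}
  ]
}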

Demo Application

PropagationShield powers HealthGuard, an AI clinical triage assistant that demonstrates hallucination containment in a hospital pipeline setting.

Citation

Trained at Meta x OpenEnv Hackathon, April 2026.
