PropagationShield-v1-GRPO

The first LLM fine-tuned to detect and resist hallucinations injected by upstream agents in a multi-agent pipeline.

The Problem

When AI agents work in pipelines, one hallucination upstream poisons every agent downstream. A fabricated lab value, a misquoted guideline, a made-up statistic: if no agent questions it, it flows through to the final output as confident, wrong information.

No existing training method addresses this. Until now.

What This Model Does

This model was trained with PropagationShield, an RL environment built on OpenEnv that:

  1. Injects parameterised hallucinations into the agent's context (5 types, 3 difficulty tiers)
  2. Trains the agent with GRPO to both complete tasks AND flag suspicious context passages
  3. Uses 4 independent reward functions: task accuracy, detection F1, format compliance, and an anti-propagation penalty (a rough sketch follows below)
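
As an illustration of how these rewards combine, the sketch below scores one sampled completion and sums the four terms. The helper logic and penalty values are hypothetical, not the released training code:

import json

def detection_f1(pred: set, gold: set) -> float:
    # F1 between flagged passage indices and the truly injected ones.
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def combined_reward(completion: str, gold_answer: str, injected: set) -> float:
    # One scalar per sampled completion, as GRPO expects.
    try:
        out = json.loads(completion)
    except json.JSONDecodeError:
        return -1.0                      # R_format: unparseable output
    r_format = 1.0
    r_task = 1.0 if out.get("answer") == gold_answer else 0.0
    flagged = {f["passage_index"] for f in out.get("suspicion_flags", [])}
    r_detect = detection_f1(flagged, injected)
    # R_antiprop: penalise missing an injection AND getting the task wrong,
    # a rough proxy for "the false fact propagated into the answer".
    r_antiprop = -1.0 if injected and not (flagged & injected) and r_task == 0.0 else 0.0
    return r_task + r_detect + r_format + r_antiprop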

Given any task + context, this model outputs:

{
  "answer": "<task answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
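
A downstream agent can act on these flags before consuming the context. A minimal sketch of that consumer side (the 0.5 threshold and function name are illustrative):

import json

def filter_context(model_output: str, context: list, min_confidence: float = 0.5) -> list:
    # Drop passages the model flagged at or above the confidence threshold.
    result = json.loads(model_output)
    flagged = {
        flag["passage_index"]
        for flag in result.get("suspicion_flags", [])
        if flag.get("confidence", 0.0) >= min_confidence
    }
    return [p for i, p in enumerate(context) if i not in flagged]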

Training Details

| Detail | Value |
|---|---|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
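
For context, a minimal sketch of how a GRPO stage like this can be wired up with TRL's GRPOTrainer. The dataset row, reward callable, and config values are illustrative, not the actual training setup:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Each reward callable scores a batch of completions; extra dataset
# columns (here gold_answer) are forwarded as keyword arguments.
def r_task(completions, gold_answer, **kwargs):
    return [1.0 if gold in c else 0.0 for c, gold in zip(completions, gold_answer)]

train_dataset = Dataset.from_list([
    {"prompt": "Query: What was Q3 revenue?\nContext: [0] ...", "gold_answer": "$2.1M"},
])

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=[r_task],  # pass all four reward callables in practice
    args=GRPOConfig(output_dir="grpo-out", num_generations=8),
    train_dataset=train_dataset,
)
trainer.train()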

Results

| Metric | Before Training | After Training |
|---|---|---|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "pragunk/PropagationShield", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a 
safety-critical multi-agent pipeline. Some context passages may contain 
deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<your task answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3."  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message}
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
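
For the context above, a well-behaved run should produce output along these lines (illustrative; exact wording and confidence will vary):

{
  "answer": "$2.1M",
  "suspicion_flags": [
    {"passage_index": 2, "reason": "Claimed $8.9M conflicts with the $2.1M reported in passage [0]", "confidence": 0.9}
  ]
}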

Demo Application

PropagationShield powers HealthGuard, an AI clinical triage assistant that demonstrates hallucination containment in a hospital pipeline setting.

Citation

Trained at Meta x OpenEnv Hackathon, April 2026.
