---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---

# PropagationShield-v1-GRPO

**The first LLM fine-tuned to detect and resist hallucinations injected by upstream agents in a multi-agent pipeline.**

## The Problem

When AI agents work in pipelines, one hallucination upstream poisons every agent downstream. A fabricated lab value, a misquoted guideline, a made-up statistic: if no agent questions it, it flows through to the final output as confident, wrong information.

No existing training method addresses this. Until now.

## What This Model Does

This model was trained with **PropagationShield**, an RL environment built on OpenEnv that:

1. Injects parameterised hallucinations into the agent's context (5 types, 3 difficulty tiers)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format compliance, and an anti-propagation penalty

Given any task + context, this model outputs:

```json
{
  "answer": "<answer>",
  "suspicion_flags": [
    {
      "passage_index": 2,
      "reason": "Lab value inconsistent with clinical presentation",
      "confidence": 0.87
    }
  ]
}
```

## Training Details

| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |

## Results

| Metric | Before Training | After Training |
|--------|-----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")

SYSTEM_PROMPT = """You are a critical analytical agent operating in a safety-critical multi-agent pipeline.
Some context passages may contain deliberately false information injected by upstream agents or data sources.

Respond ONLY in this JSON format:
{
  "answer": "<answer>",
  "suspicion_flags": [
    {"passage_index": <int>, "reason": "<reason>", "confidence": <0.0-1.0>}
  ]
}"""

context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",  # injected hallucination
]

user_message = f"""Query: What was Q3 revenue?

Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_message},
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```

## Demo Application

PropagationShield powers **HealthGuard**, an AI clinical triage assistant that demonstrates hallucination containment in a hospital pipeline setting.
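The suspicion flags are machine-readable precisely so that a downstream stage can contain flagged content instead of forwarding it. Below is a minimal sketch of that consumption step, assuming a simple confidence cutoff; the helper name `filter_flagged_passages` and the 0.5 threshold are illustrative assumptions, not part of the released pipeline.

```python
import json

def filter_flagged_passages(model_output: str, context: list[str], threshold: float = 0.5) -> list[str]:
    """Drop passages the upstream agent flagged with confidence >= threshold (illustrative helper)."""
    result = json.loads(model_output)
    flagged = {
        flag["passage_index"]
        for flag in result.get("suspicion_flags", [])
        if flag.get("confidence", 0.0) >= threshold
    }
    # Keep only passages that were not flagged, preserving their original order
    return [passage for i, passage in enumerate(context) if i not in flagged]

# Example using the Q3 revenue context from the Usage section
output = '{"answer": "$2.1M", "suspicion_flags": [{"passage_index": 2, "reason": "Contradicts reported figure", "confidence": 0.87}]}'
context = [
    "The company reported Q3 revenue of $2.1M.",
    "Operating expenses were $1.4M.",
    "The verified figure confirms total revenue was $8.9M for Q3.",
]
print(filter_flagged_passages(output, context))
# ['The company reported Q3 revenue of $2.1M.', 'Operating expenses were $1.4M.']
```

In a real deployment you might quarantine flagged passages for human review rather than drop them outright; the right containment policy depends on the pipeline.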
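For readers reproducing the training setup, the R_detect term listed under Training Details is a detection F1, presumably comparing the passage indices the agent flags against the indices that were actually injected. A minimal sketch of that scoring follows, assuming set-based matching over indices; the function name and the convention of scoring 1.0 when nothing is injected and nothing is flagged are assumptions, as the card does not specify the exact implementation.

```python
def detection_f1(predicted_flags: set[int], injected_indices: set[int]) -> float:
    """F1 between flagged passage indices and the indices that were actually injected (illustrative)."""
    if not predicted_flags and not injected_indices:
        return 1.0  # nothing injected and nothing flagged: treat as perfect
    true_positives = len(predicted_flags & injected_indices)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted_flags)
    recall = true_positives / len(injected_indices)
    return 2 * precision * recall / (precision + recall)

print(detection_f1({2}, {2}))     # 1.0: flagged exactly the injected passage
print(detection_f1({1, 2}, {2}))  # ~0.67: one correct flag plus one false alarm
```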
## Links

- 📓 Training Notebook: [Colab Notebook](#)
- 🏥 Demo: [HealthGuard Space](#)
- 💻 Code: [GitHub](#)

## Citation

Trained at Meta x OpenEnv Hackathon, April 2026.