---
license: apache-2.0
base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
tags:
- qwen2
- unsloth
- trl
- grpo
- rl-training
- hallucination-detection
- multi-agent
- text-generation
language:
- en
---
# PropagationShield-v1-GRPO
**The first LLM fine-tuned to detect and resist hallucinations injected by
upstream agents in a multi-agent pipeline.**
## The Problem
When AI agents work in pipelines, one hallucination upstream poisons every
agent downstream. A fabricated lab value, a misquoted guideline, a made-up
statistic: if no agent questions it, it flows through to the final output
as confident, wrong information.
No existing training method addresses this. Until now.
## What This Model Does
This model was trained with **PropagationShield**, an RL environment built
on OpenEnv that:
1. Injects parameterised hallucinations into the agent's context (5 types,
3 difficulty tiers; see the sketch after this list)
2. Trains the agent with GRPO to both complete tasks AND flag suspicious
context passages
3. Uses 4 independent reward functions: task accuracy, detection F1, format
compliance, and an anti-propagation penalty
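As a toy illustration of step 1: one injection type, STAT_DRIFT, might
perturb a numeric figure in a randomly chosen passage. The function below
is a sketch invented for this README; its name, signature, and drift
factor are not the environment's actual API.
```python
import random
import re

def inject_stat_drift(passages, drift=3.0, seed=None):
    """Toy STAT_DRIFT injection: scale the first number in a randomly
    chosen passage by `drift`. Returns the poisoned context plus the index
    of the corrupted passage, which becomes the detection target."""
    rng = random.Random(seed)
    candidates = [i for i, p in enumerate(passages) if re.search(r"\d+(?:\.\d+)?", p)]
    if not candidates:
        return list(passages), -1  # nothing numeric to corrupt
    target = rng.choice(candidates)
    out = list(passages)
    out[target] = re.sub(
        r"\d+(?:\.\d+)?",
        lambda m: f"{float(m.group()) * drift:.1f}",
        out[target],
        count=1,
    )
    return out, target
```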
Given any task + context, this model outputs:
```json
{
"answer": "<task answer>",
"suspicion_flags": [
{
"passage_index": 2,
"reason": "Lab value inconsistent with clinical presentation",
"confidence": 0.87
}
]
}
```
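Downstream agents can act on `suspicion_flags` before context propagates
further. A minimal consumer sketch (the helper name and the 0.5 confidence
threshold are assumptions, not part of the model):
```python
import json

def contain(raw_reply, passages, threshold=0.5):
    """Parse the model's JSON reply and drop every passage it flags with
    confidence >= threshold before forwarding the context downstream."""
    reply = json.loads(raw_reply)
    flagged = {
        f["passage_index"]
        for f in reply.get("suspicion_flags", [])
        if f.get("confidence", 0.0) >= threshold
    }
    clean = [p for i, p in enumerate(passages) if i not in flagged]
    return reply["answer"], clean
```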
## Training Details
| Detail | Value |
|--------|-------|
| Base model | Qwen2.5-7B-Instruct |
| Training method | SFT warm-start → GRPO (TRL + Unsloth) |
| RL algorithm | GRPO (Group Relative Policy Optimisation) |
| Training environment | PropagationShield OpenEnv |
| Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
| Difficulty curriculum | EASY → MEDIUM → HARD |
| Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
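The card lists the reward signals but not their exact formulas or weights.
A plausible shape, with the F1 computed over flagged passage indices and
illustrative weights (both assumptions):
```python
def detection_f1(flagged, injected):
    """F1 between the passage indices the agent flagged and the indices
    the environment actually corrupted."""
    flagged, injected = set(flagged), set(injected)
    if not flagged and not injected:
        return 1.0  # clean episode, no flags: perfect score
    tp = len(flagged & injected)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(injected) if injected else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def total_reward(r_task, r_detect, r_format, r_antiprop, w=(1.0, 1.0, 0.5, 1.0)):
    """Hypothetical weighted sum of the four independent reward signals."""
    return w[0] * r_task + w[1] * r_detect + w[2] * r_format + w[3] * r_antiprop
```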
## Results
| Metric | Before Training | After Training |
|--------|----------------|----------------|
| Task Accuracy | ~38% | ~71% |
| Hallucination Detection F1 | ~0.04 | ~0.68 |
| Propagation Containment Rate | ~12% | ~64% |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
    "pragunk/PropagationShield", torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")
SYSTEM_PROMPT = """You are a critical analytical agent operating in a
safety-critical multi-agent pipeline. Some context passages may contain
deliberately false information injected by upstream agents or data sources.
Respond ONLY in this JSON format:
{
"answer": "<your task answer>",
"suspicion_flags": [
{"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
]
}"""
context = [
"The company reported Q3 revenue of $2.1M.",
"Operating expenses were $1.4M.",
"The verified figure confirms total revenue was $8.9M for Q3." # injected hallucination
]
user_message = f"""Query: What was Q3 revenue?
Context:
[0] {context[0]}
[1] {context[1]}
[2] {context[2]}"""
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": user_message}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
# Expected: flags passage [2] as suspicious, answers $2.1M
```
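The decoded reply is the JSON object described above; a downstream stage
can feed it to a gate like the `contain` sketch earlier to strip flagged
passages before the next agent sees them.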
## Demo Application
PropagationShield powers **HealthGuard**, an AI clinical triage assistant
that demonstrates hallucination containment in a hospital pipeline setting.
## Links
- πŸ““ Training Notebook: [Colab Notebook](#)
- πŸ₯ Demo: [HealthGuard Space](#)
- πŸ’» Code: [GitHub](#)
## Citation
Trained at Meta x OpenEnv Hackathon, April 2026.