pragunk committed on
Commit
e9a8e7b
·
verified ·
1 Parent(s): 729f0dc

Update README.md

Files changed (1)
  1. README.md +131 -2
README.md CHANGED
@@ -1,3 +1,132 @@
1
  ---
2
- pipeline_tag: text-generation
3
- ---
1
  ---
2
+ license: apache-2.0
3
+ base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
4
+ tags:
5
+ - qwen2
6
+ - unsloth
7
+ - trl
8
+ - grpo
9
+ - rl-training
10
+ - hallucination-detection
11
+ - multi-agent
12
+ - text-generation
13
+ language:
14
+ - en
15
+ ---
16
+
17
+ # PropagationShield-v1-GRPO
18
+
19
+ **The first LLM fine-tuned to detect and resist hallucinations injected by
20
+ upstream agents in a multi-agent pipeline.**
21
+
22
+ ## The Problem
23
+
24
+ When AI agents work in pipelines, one hallucination upstream poisons every
25
+ agent downstream. A fabricated lab value, a misquoted guideline, a made-up
26
+ statistic: if no agent questions it, it flows through to the final output
27
+ as confident, wrong information.
28
+
29
+ No existing training method addresses this. Until now.
30
+
31
+ ## What This Model Does
32
+
33
+ This model was trained with **PropagationShield**, an RL environment built
34
+ on OpenEnv that:
35
+ 1. Injects parameterised hallucinations into the agent's context (5 types,
36
+ 3 difficulty tiers)
37
+ 2. Trains the agent with GRPO to both complete tasks AND flag suspicious
38
+ context passages
39
+ 3. Uses 4 independent reward functions: task accuracy, detection F1, format
40
+ compliance, and an anti-propagation penalty
41
+
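The detection reward in item 3 is described only as an F1 score; its exact definition is not given in this README. A minimal sketch, assuming F1 is computed over sets of passage indices (flagged by the model versus actually injected by the environment). The name `detection_f1` and the handling of the empty case are illustrative assumptions, not the environment's actual code:

```python
def detection_f1(flagged, injected):
    """F1 between flagged passage indices and the truly injected ones.

    Assumption: both arguments are iterables of integer passage indices.
    """
    flagged, injected = set(flagged), set(injected)
    if not flagged and not injected:
        # Nothing was injected and nothing was flagged: treated as perfect here.
        return 1.0
    true_positives = len(flagged & injected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(flagged)
    recall = true_positives / len(injected)
    return 2 * precision * recall / (precision + recall)
```

Flagging every passage would push recall to 1 but drag precision down, so an F1-style score discourages blanket flagging.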
42
+ Given any task + context, this model outputs:
43
+ ```json
44
+ {
45
+ "answer": "<task answer>",
46
+ "suspicion_flags": [
47
+ {
48
+ "passage_index": 2,
49
+ "reason": "Lab value inconsistent with clinical presentation",
50
+ "confidence": 0.87
51
+ }
52
+ ]
53
+ }
54
+ ```
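Downstream code would typically parse and validate this JSON before acting on it. A minimal sketch, assuming the raw completion may wrap the JSON object in extra text such as chat markers; `parse_shield_output` is a hypothetical helper, not part of the released code:

```python
import json


def parse_shield_output(text):
    """Extract and validate the answer/suspicion_flags object from raw model text."""
    # Assumption: the completion contains one top-level JSON object,
    # possibly surrounded by extra tokens.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    data = json.loads(text[start:end + 1])

    if not isinstance(data.get("answer"), str):
        raise ValueError("'answer' must be a string")
    flags = data.get("suspicion_flags", [])
    if not isinstance(flags, list):
        raise ValueError("'suspicion_flags' must be a list")
    for flag in flags:
        if not isinstance(flag, dict) or not isinstance(flag.get("passage_index"), int):
            raise ValueError("each flag needs an integer 'passage_index'")
        confidence = flag.get("confidence")
        if not isinstance(confidence, (int, float)) or not 0.0 <= confidence <= 1.0:
            raise ValueError("'confidence' must be a number in [0, 1]")
    return data
```

Rejecting malformed output early keeps a badly formatted completion from being passed further down the pipeline as if it were a clean answer.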
55
+
56
+ ## Training Details
57
+
58
+ | Detail | Value |
59
+ |--------|-------|
60
+ | Base model | Qwen2.5-7B-Instruct |
61
+ | Training method | SFT warm-start → GRPO (TRL + Unsloth) |
62
+ | RL algorithm | GRPO (Group Relative Policy Optimisation) |
63
+ | Training environment | PropagationShield OpenEnv |
64
+ | Hallucination types | FACTUAL_FABRICATION, FALSE_ATTRIBUTION, STAT_DRIFT, ENTITY_SUBSTITUTION, FABRICATED_CONSENSUS |
65
+ | Difficulty curriculum | EASY → MEDIUM → HARD |
66
+ | Reward functions | R_task + R_detect + R_format + R_antiprop (4 independent) |
67
+
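GRPO scores each sampled completion relative to the other completions drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalisation, assuming the four rewards are summed with equal weight as the table row suggests (the actual weighting is not stated here) and using a population standard deviation for simplicity. The function names are hypothetical and this is not the TRL implementation:

```python
from statistics import mean, pstdev


def total_reward(r_task, r_detect, r_format, r_antiprop):
    """Combine the four reward terms; equal weighting is an assumption."""
    return r_task + r_detect + r_format + r_antiprop


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of completions for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When all completions in a group receive the same reward, every advantage is zero and that group contributes no learning signal.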
68
+ ## Results
69
+
70
+ | Metric | Before Training | After Training |
71
+ |--------|----------------|----------------|
72
+ | Task Accuracy | ~38% | ~71% |
73
+ | Hallucination Detection F1 | ~0.04 | ~0.68 |
74
+ | Propagation Containment Rate | ~12% | ~64% |
75
+
76
+ ## Usage
77
+
78
+ ```python
79
+ from transformers import AutoModelForCausalLM, AutoTokenizer
80
+
81
+ model = AutoModelForCausalLM.from_pretrained("pragunk/PropagationShield")
82
+ tokenizer = AutoTokenizer.from_pretrained("pragunk/PropagationShield")
83
+
84
+ SYSTEM_PROMPT = """You are a critical analytical agent operating in a
85
+ safety-critical multi-agent pipeline. Some context passages may contain
86
+ deliberately false information injected by upstream agents or data sources.
87
+
88
+ Respond ONLY in this JSON format:
89
+ {
90
+ "answer": "<your task answer>",
91
+ "suspicion_flags": [
92
+ {"passage_index": <int>, "reason": "<why suspicious>", "confidence": <0.0-1.0>}
93
+ ]
94
+ }"""
95
+
96
+ context = [
97
+ "The company reported Q3 revenue of $2.1M.",
98
+ "Operating expenses were $1.4M.",
99
+ "The verified figure confirms total revenue was $8.9M for Q3." # injected hallucination
100
+ ]
101
+
102
+ user_message = f"""Query: What was Q3 revenue?
103
+
104
+ Context:
105
+ [0] {context[0]}
106
+ [1] {context[1]}
107
+ [2] {context[2]}"""
108
+
109
+ messages = [
110
+ {"role": "system", "content": SYSTEM_PROMPT},
111
+ {"role": "user", "content": user_message}
112
+ ]
113
+
114
+ response = model.generate(tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt"), max_new_tokens=256)
115
+ print(tokenizer.decode(response[0], skip_special_tokens=True))
116
+ # Expected: flags passage [2] as suspicious, answers $2.1M
117
+ ```
118
+
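The usage example above indexes three passages by hand. For a context list of arbitrary length, the same layout can be produced with a small helper; `format_user_message` is a hypothetical name introduced for illustration:

```python
def format_user_message(query, context):
    """Render a query plus zero-indexed context passages in the layout used above."""
    passages = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(context))
    return f"Query: {query}\n\nContext:\n{passages}"
```

The bracketed indices matter because the model's `passage_index` values refer back to them.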
119
+ ## Demo Application
120
+
121
+ PropagationShield powers **HealthGuard**, an AI clinical triage assistant
122
+ that demonstrates hallucination containment in a hospital pipeline setting.
123
+
124
+ ## Links
125
+
126
+ - 📓 Training Notebook: [Colab Notebook](#)
127
+ - 🏥 Demo: [HealthGuard Space](#)
128
+ - 💻 Code: [GitHub](#)
129
+
130
+ ## Citation
131
+
132
+ Trained at Meta x OpenEnv Hackathon, April 2026.