Commit ce44031 (verified) · Xunzhuo · Parent: 8dccd4e

Update README.md

Files changed (1): README.md (+216 −44)
tags:
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
…
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      …
      name: F1 Score
---
# HaluGate Sentinel: Prompt Fact-Check Switch for the Hallucination Gatekeeper

**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.

It does *not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing/gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.

The model classifies prompts into:

- **`FACT_CHECK_NEEDED`**: information-seeking queries that depend on external or world knowledge
  - e.g., "When was the Eiffel Tower built?"
  - e.g., "What was the GDP of Japan in 2023?"
- **`NO_FACT_CHECK_NEEDED`**: creative, coding, opinion, or pure reasoning/math tasks
  - e.g., "Write a poem about spring"
  - e.g., "Implement quicksort in Python"
  - e.g., "What is the meaning of life?"

This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.

---
## Model Details

- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation accuracy**: 96.4%
- **Validation F1 score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types

---
## Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:

1. **Stage 0: HaluGate Sentinel (this model)**
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).

2. **Stage 1+: answer-level hallucination models (e.g., "HaluGate Verifier")**
   Operate on *(query, answer, evidence)* triples to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on **prompt intent classification**, minimizing unnecessary compute while preserving safety for factual queries. The sketch below shows how the two stages fit together.
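This is a minimal sketch, not the project's actual gateway code: `classify_prompt` is the helper defined under Usage below, while `generate`, `retrieve_evidence`, `verify_answer`, and `regenerate_or_escalate` are hypothetical placeholders for your own components:

```python
def handle_request(prompt: str) -> str:
    # Stage 0: HaluGate Sentinel decides whether fact-checking is needed.
    label, _confidence = classify_prompt(prompt)
    if label == "NO_FACT_CHECK_NEEDED":
        return generate(prompt)  # direct generation, no verification overhead

    # Fact-check path: ground the answer, then verify it (Stage 1+).
    evidence = retrieve_evidence(prompt)             # RAG / tools
    answer = generate(prompt, context=evidence)
    if not verify_answer(prompt, answer, evidence):  # answer-level verifier
        answer = regenerate_or_escalate(prompt, answer, evidence)
    return answer
```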
---
## Usage

### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```
### Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```
---

## Training Data

A balanced dataset of **50,000** prompts:

### FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

* **NISQ-ISQ**: gold-standard information-seeking questions
* **HaluEval**: hallucination-focused QA benchmark
* **FaithDial**: information-seeking dialogue questions
* **FactCHD**: fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: standard factual QA datasets
* **TruthfulQA**: high-risk factual queries
* **CoQA**: conversational factual questions
### NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do **not** require external factual verification:

* **NISQ-NonISQ**: non-information-seeking questions
* **Databricks Dolly**: creative writing, summarization, brainstorming
* **WritingPrompts**: creative writing prompts
* **Alpaca**: coding, math, opinion, and general instructions

The objective is to approximate "does this prompt require world knowledge / external facts?" rather than "is the answer true?".
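For illustration, a balanced two-class mix of this kind can be assembled roughly as follows. This is a sketch under stated assumptions, not the exact recipe: only two of the source datasets are shown, and the per-source sample counts are invented.

```python
import random
from datasets import load_dataset  # Hugging Face `datasets` library

random.seed(42)
N_PER_SOURCE = 10_000  # invented budget; the card reports 25,000 per class overall

# FACT_CHECK_NEEDED (label 1): information-seeking questions, e.g. from SQuAD.
squad = load_dataset("squad", split="train")
factual = [{"text": ex["question"], "label": 1} for ex in squad]

# NO_FACT_CHECK_NEEDED (label 0): instructions, e.g. from Databricks Dolly.
dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
non_factual = [{"text": ex["instruction"], "label": 0} for ex in dolly]

examples = random.sample(factual, N_PER_SOURCE) + random.sample(non_factual, N_PER_SOURCE)
random.shuffle(examples)
```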
---
## Intended Use

### Primary Use Cases

* **LLM gateway / router**
  * Decide whether a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative, coding, and opinion tasks.
* **Hallucination Gatekeeper frontline**
  * Enable expensive hallucination detection only for prompts labeled `FACT_CHECK_NEEDED`.
  * Apply different safety and latency policies to the two classes.
* **Traffic analytics and risk scoring**
  * Monitor the proportion of factual vs. non-factual traffic (see the sketch below).
  * Size infrastructure for retrieval- and tool-heavy pipelines accordingly.
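A minimal sketch of such traffic analytics, reusing the `classify_prompt` helper from the Usage section (the example percentages in the comment are illustrative):

```python
from collections import Counter

def traffic_report(prompts: list[str]) -> dict[str, float]:
    """Share of prompts per routing label over a batch of traffic."""
    counts = Counter(classify_prompt(p)[0] for p in prompts)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# e.g. traffic_report(recent_prompts)
# → {'FACT_CHECK_NEEDED': 0.62, 'NO_FACT_CHECK_NEEDED': 0.38}  (illustrative)
```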
### Non-Goals

* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity or safety classifier.
* It does not handle non-English prompts reliably (trained on English only).

---
## How It Works

* **Architecture**
  * ModernBERT-base encoder
  * Classification head on top of the `[CLS]` / pooled representation
* **Fine-tuning** (a sketch follows this list)
  * LoRA adapters on the base encoder
  * Cross-entropy loss over the two labels
  * Balanced sampling between `FACT_CHECK_NEEDED` and `NO_FACT_CHECK_NEEDED`
* **Decision boundary**
  * Borderline, philosophical, or highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to treat the **confidence score** as a soft signal, not a hard oracle.
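A minimal fine-tuning sketch with the `peft` library; the rank and alpha match the card, while the dropout and target modules are assumptions about the setup rather than confirmed training details:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    id2label={0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"},
    label2id={"NO_FACT_CHECK_NEEDED": 0, "FACT_CHECK_NEEDED": 1},
)

lora = LoraConfig(
    task_type=TaskType.SEQ_CLS,     # keeps the classification head trainable
    r=16,                           # rank, as reported in Model Details
    lora_alpha=32,                  # alpha, as reported in Model Details
    lora_dropout=0.1,               # assumption
    target_modules=["Wqkv", "Wo"],  # ModernBERT attention projections (assumption)
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only adapters + head are trainable
```

From here, a standard `Trainer` loop over the balanced dataset completes the picture.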
---
## Limitations

* **Language**
  * Trained on English data only; performance on other languages is not guaranteed.
* **Borderline queries**
  * Philosophical or hybrid prompts (e.g., "Is time travel possible?") may be ambiguous.
  * In such cases, inspect the model confidence and implement a "default-to-safe" policy, as sketched below.
* **Domain coverage**
  * General-purpose factual tasks are well covered; highly specialized verticals (e.g., niche scientific domains) were not explicitly targeted during fine-tuning.
* **Not a verifier**
  * This model only decides whether a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
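A minimal "default-to-safe" sketch, building on `classify_prompt` from the Usage section (the confidence cutoff is an illustrative value to tune on your own traffic):

```python
SAFE_CONFIDENCE = 0.8  # illustrative; tune per deployment

def route_default_to_safe(prompt: str) -> str:
    label, confidence = classify_prompt(prompt)
    # When the classifier is unsure, prefer the safer fact-checked path.
    if label == "FACT_CHECK_NEEDED" or confidence < SAFE_CONFIDENCE:
        return "hallucination_gatekeeper"
    return "direct_generation"
```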
---
## Ethical Considerations

* **Risk trade-off**
  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying prompts as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.
* **Deployment recommendation**
  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

---
## Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```
---

## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and the others listed above.
* Designed for integration with the **vLLM Semantic Router** and the broader **Hallucination Gatekeeper** ecosystem.