---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      name: Validation Accuracy
    - type: f1
      value: 0.965
      name: F1 Score
---

# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.

It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.

The model classifies prompts into:

- **`FACT_CHECK_NEEDED`**:  
  Information-seeking queries that depend on external/world knowledge  
  - e.g., “When was the Eiffel Tower built?”  
  - e.g., “What is the GDP of Japan in 2023?”

- **`NO_FACT_CHECK_NEEDED`**:  
  Creative, coding, opinion, or pure reasoning/math tasks  
  - e.g., “Write a poem about spring”  
  - e.g., “Implement quicksort in Python”  
  - e.g., “What is the meaning of life?”

This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.

---

## Model Details

- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types

---

## Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:

1. **Stage 0 — HaluGate Sentinel (this model)**  
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).

2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**  
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
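The two-stage flow above can be sketched as a small dispatcher. This is an illustrative stub, not part of this repository: `classify_prompt` stands in for HaluGate Sentinel (here a trivial keyword heuristic), and `run_gatekeeper` / `generate_directly` are hypothetical placeholders for the downstream pipelines.

```python
# Illustrative sketch of the Stage 0 -> Stage 1 hand-off.
# All three callables are placeholders, not real APIs.

def classify_prompt(text: str) -> str:
    """Placeholder stage-0 classifier: a trivial keyword heuristic."""
    factual_cues = ("when", "who", "what is the", "how many")
    if text.lower().startswith(factual_cues):
        return "FACT_CHECK_NEEDED"
    return "NO_FACT_CHECK_NEEDED"

def run_gatekeeper(text: str) -> str:
    # RAG / tools / answer-level verifiers would run here (Stage 1+).
    return f"[gatekeeper] {text}"

def generate_directly(text: str) -> str:
    # Plain LLM generation, no fact-checking overhead.
    return f"[direct] {text}"

def handle(text: str) -> str:
    # Stage 0: only factual prompts pay the gatekeeper's extra cost.
    if classify_prompt(text) == "FACT_CHECK_NEEDED":
        return run_gatekeeper(text)
    return generate_directly(text)

print(handle("When was the Eiffel Tower built?"))  # → [gatekeeper] When was the Eiffel Tower built?
print(handle("Write a poem about spring"))         # → [direct] Write a poem about spring
```

The point of the structure is that the expensive path runs only when Stage 0 asks for it; everything else short-circuits to direct generation.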

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```

### Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```

---

## Training Data

The model was fine-tuned on a balanced dataset of **50,000** prompts:

### FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions

### NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do **not** require external factual verification:

* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions

The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
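A balanced two-class set of this kind can be assembled mechanically. The sketch below uses tiny placeholder prompt pools (the real data comes from the datasets listed above) just to show the labeling and balancing step:

```python
import random

# Placeholder pools standing in for the real source datasets.
fact_check_pool = [
    "When was the Eiffel Tower built?",
    "What is the GDP of Japan in 2023?",
    "Who wrote War and Peace?",
]
no_fact_check_pool = [
    "Write a poem about spring",
    "Implement quicksort in Python",
    "Brainstorm names for a coffee shop",
]

def build_balanced_dataset(pos_pool, neg_pool, n_per_class, seed=0):
    """Sample n_per_class prompts from each pool and attach labels
    (1 = FACT_CHECK_NEEDED, 0 = NO_FACT_CHECK_NEEDED)."""
    rng = random.Random(seed)
    pos = [(p, 1) for p in rng.choices(pos_pool, k=n_per_class)]
    neg = [(p, 0) for p in rng.choices(neg_pool, k=n_per_class)]
    data = pos + neg
    rng.shuffle(data)
    return data

dataset = build_balanced_dataset(fact_check_pool, no_fact_check_pool, n_per_class=4)
print(len(dataset), sum(y for _, y in dataset))  # → 8 4  (8 examples, exactly 4 positives)
```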

---

## Intended Use

### Primary Use Cases

* **LLM Gateway / Router**

  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.

* **Hallucination Gatekeeper Frontline**

  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.

* **Traffic Analytics & Risk Scoring**

  * Monitor proportion of factual vs non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.

### Non-Goals

* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).

---

## How It Works

* **Architecture**:

  * ModernBERT-base encoder
  * Classification head on top of `[CLS]` / pooled representation

* **Fine-tuning**:

  * LoRA on the base encoder
  * Cross-entropy loss over the two labels
  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
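A `peft` configuration matching the stated hyperparameters (rank 16, alpha 32) could look roughly like the sketch below. The `target_modules` and dropout used for this model are not documented here, so the values shown for those are assumptions:

```python
from peft import LoraConfig, TaskType

# Config sketch consistent with the card's stated hyperparameters.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification head
    r=16,                        # LoRA rank (from the model card)
    lora_alpha=32,               # LoRA alpha (from the model card)
    lora_dropout=0.1,            # assumed; not stated in the card
    target_modules=["Wqkv"],     # assumed projection name; not confirmed for this model
)

# The adapter would then be attached with:
# model = get_peft_model(base_model, lora_config)
```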

* **Decision Boundary**:

  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
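One way to treat the confidence as a soft signal is a three-band policy: trust high-confidence predictions either way, and default low-confidence cases to the safe (fact-checked) path. The threshold below is an illustrative placeholder, not a tuned value:

```python
def route_with_confidence(label: str, confidence: float,
                          high_conf: float = 0.9) -> str:
    """Map (label, confidence) to a route, defaulting to the safe
    fact-checked path when the model is unsure. The high_conf
    threshold is an illustrative placeholder."""
    if confidence >= high_conf:
        # Trust confident predictions either way.
        if label == "FACT_CHECK_NEEDED":
            return "hallucination_gatekeeper"
        return "direct_generation"
    # Low confidence: default to the safe, fact-checked path.
    return "hallucination_gatekeeper"

print(route_with_confidence("NO_FACT_CHECK_NEEDED", 0.97))  # → direct_generation
print(route_with_confidence("NO_FACT_CHECK_NEEDED", 0.62))  # → hallucination_gatekeeper
```

Compared with the single-threshold router shown under Usage, this variant errs toward the gatekeeper whenever confidence is low, which suits safety-critical deployments.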

---

## Limitations

* **Language**:

  * Trained on English data only.
  * Performance on other languages is not guaranteed.

* **Borderline Queries**:

  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.

* **Domain Coverage**:

  * General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.

* **Not a Verifier**:

  * This model only decides if a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).

---

## Ethical Considerations

* **Risk Trade-off**:

  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.

* **Deployment Recommendation**:

  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

---

## Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```

---

## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.