---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
results:
- task:
type: text-classification
name: Fact-Check Need Classification
metrics:
- type: accuracy
value: 0.964
name: Validation Accuracy
- type: f1
value: 0.965
name: F1 Score
---
# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper
**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.
It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.
The model classifies prompts into:
- **`FACT_CHECK_NEEDED`**:
  Information-seeking queries that depend on external/world knowledge
  - e.g., “When was the Eiffel Tower built?”
  - e.g., “What is the GDP of Japan in 2023?”
- **`NO_FACT_CHECK_NEEDED`**:
  Creative, coding, opinion, or pure reasoning/math tasks
  - e.g., “Write a poem about spring”
  - e.g., “Implement quicksort in Python”
  - e.g., “What is the meaning of life?”
This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.
---
## Model Details
- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types
---
## Position in a Hallucination Mitigation Pipeline
HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:
1. **Stage 0 — HaluGate Sentinel (this model)**
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).
2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.
HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
---
## Usage
### Basic Inference
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model.eval()  # disable dropout for deterministic inference

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)
print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)
print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```
### Example: Integrating with a Router / Gateway
Pseudocode for a routing decision:
```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```
---
## Training Data
Balanced dataset of **50,000** prompts:
### FACT_CHECK_NEEDED (25,000 samples)
Information-seeking and knowledge-intensive questions drawn from:
* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions
### NO_FACT_CHECK_NEEDED (25,000 samples)
Tasks that typically do **not** require external factual verification:
* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions
The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
---
## Intended Use
### Primary Use Cases
* **LLM Gateway / Router**
  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.
* **Hallucination Gatekeeper Frontline**
  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.
* **Traffic Analytics & Risk Scoring**
  * Monitor the proportion of factual vs. non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
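
As a sketch of the traffic-analytics use case, per-prompt labels can be aggregated into a traffic mix. The `results` list below is hypothetical; in practice it would come from calling `classify_prompt` over real gateway traffic:

```python
from collections import Counter

# Hypothetical (label, confidence) outputs from classify_prompt()
results = [
    ("FACT_CHECK_NEEDED", 0.99),
    ("NO_FACT_CHECK_NEEDED", 0.97),
    ("FACT_CHECK_NEEDED", 0.88),
    ("NO_FACT_CHECK_NEEDED", 0.95),
]

counts = Counter(label for label, _ in results)
factual_share = counts["FACT_CHECK_NEEDED"] / len(results)

print(f"factual traffic share: {factual_share:.0%}")  # → factual traffic share: 50%
```

The resulting share can feed capacity planning for the retrieval / verifier pipelines.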
### Non-Goals
* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).
---
## How It Works
* **Architecture**:
  * ModernBERT-base encoder
  * Classification head on top of the `[CLS]` / pooled representation
* **Fine-tuning**:
  * LoRA adapters on the base encoder
  * Cross-entropy loss over the two labels
  * Balanced sampling between `FACT_CHECK_NEEDED` and `NO_FACT_CHECK_NEEDED`
* **Decision Boundary**:
  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
---
## Limitations
* **Language**:
  * Trained on English data only.
  * Performance on other languages is not guaranteed.
* **Borderline Queries**:
  * Philosophical or hybrid prompts (e.g., “Is time travel possible?”) may be ambiguous.
  * In such cases, inspect the model confidence and implement a “default-to-safe” policy.
* **Domain Coverage**:
  * General-purpose factual tasks are well covered; highly specialized verticals (e.g., niche scientific domains) were not explicitly targeted during fine-tuning.
* **Not a Verifier**:
  * This model only decides whether a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
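
The “default-to-safe” policy for borderline queries can be sketched as a small routing helper. The function name and the `min_confidence` value are illustrative, not part of the model or any router API:

```python
def route_with_fallback(label: str, confidence: float,
                        min_confidence: float = 0.8) -> str:
    """Route to direct generation only when the classifier is confident
    the prompt needs no fact-checking; otherwise default to the safe path."""
    if label == "NO_FACT_CHECK_NEEDED" and confidence >= min_confidence:
        return "direct_generation"
    # Low-confidence or factual prompts take the fact-checking path.
    return "hallucination_gatekeeper"

print(route_with_fallback("NO_FACT_CHECK_NEEDED", 0.95))  # → direct_generation
print(route_with_fallback("NO_FACT_CHECK_NEEDED", 0.55))  # → hallucination_gatekeeper
print(route_with_fallback("FACT_CHECK_NEEDED", 0.99))     # → hallucination_gatekeeper
```

Note the asymmetry: an ambiguous prompt is treated as factual, trading extra compute for safety.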
---
## Ethical Considerations
* **Risk Trade-off**:
  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.
* **Deployment Recommendation**:
  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.
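
For illustration, one way to express such domain-dependent conservatism is a per-domain threshold table: lower thresholds send more traffic through the fact-checking path. The values below are placeholders, not recommendations:

```python
# Hypothetical per-domain fact-check thresholds. A lower threshold routes
# more borderline traffic into the fact-checking path (more conservative).
DOMAIN_THRESHOLDS = {
    "general": 0.6,
    "finance": 0.3,
    "healthcare": 0.2,
    "legal": 0.2,
}

def fact_check_threshold(domain: str) -> float:
    # Unknown domains fall back to the most conservative configured threshold.
    return DOMAIN_THRESHOLDS.get(domain, min(DOMAIN_THRESHOLDS.values()))

print(fact_check_threshold("finance"))  # → 0.3
print(fact_check_threshold("unknown"))  # → 0.2
```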
---
## Citation
If you use HaluGate Sentinel in academic work or production systems, please cite:
```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```
```
---
## Acknowledgements
* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.