---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      name: Validation Accuracy
    - type: f1
      value: 0.965
      name: F1 Score
---

# HaluGate Sentinel: Prompt Fact-Check Switch for Hallucination Gatekeeper

**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.

It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.

The model classifies each prompt into one of two classes:

- **`FACT_CHECK_NEEDED`**:
  Information-seeking queries that depend on external/world knowledge
  - e.g., “When was the Eiffel Tower built?”
  - e.g., “What is the GDP of Japan in 2023?”

- **`NO_FACT_CHECK_NEEDED`**:
  Creative, coding, opinion, or pure reasoning/math tasks
  - e.g., “Write a poem about spring”
  - e.g., “Implement quicksort in Python”
  - e.g., “What is the meaning of life?”

This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.

---

## Model Details

- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types
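
For readers who want to reproduce numbers like the ones above on their own labeled prompts, here is a minimal sketch of how accuracy and F1 can be computed from label ids. It assumes F1 is measured on the positive class `1` (`FACT_CHECK_NEEDED`); the card does not say which averaging was used. The function name `accuracy_and_f1` and the toy labels are illustrative, not evaluation data from this model.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with `positive` as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # defined as 0.0 when there are no positives at all
    return accuracy, f1


# Toy labels for illustration only (1 = FACT_CHECK_NEEDED, 0 = NO_FACT_CHECK_NEEDED).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} f1={f1:.3f}")
# → accuracy=0.750 f1=0.750
```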

---

## Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:

1. **Stage 0: HaluGate Sentinel (this model)**
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).

2. **Stage 1+: Answer-level hallucination models (e.g., “HaluGate Verifier”)**
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}


def classify_prompt(text: str) -> tuple[str, float]:
    """Return the predicted label and its softmax probability for one prompt."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence


# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```

### Example: Integrating with a Router / Gateway

A minimal routing decision, reusing `classify_prompt` from the snippet above:

```python
FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

user_prompt = "What is the GDP of Japan in 2023?"  # the incoming request text

# `prob` is the probability of the predicted label, not of FACT_CHECK_NEEDED specifically.
label, prob = classify_prompt(user_prompt)

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```

---

## Training Data

The model was trained on a balanced dataset of **50,000** prompts:

### FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions

### NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do **not** require external factual verification:

* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions

The labels approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
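
The balanced 25,000 / 25,000 split described above could be assembled along the following lines. This is a sketch under stated assumptions, not the project's actual data pipeline: `build_balanced_dataset`, the two prompt pools, and the fixed seed are hypothetical, and loading prompts from the source datasets is left out.

```python
import random


def build_balanced_dataset(fact_prompts, no_fact_prompts, per_class, seed=0):
    """Draw `per_class` prompts from each pool and return shuffled (text, label) rows.

    Label 1 = FACT_CHECK_NEEDED, label 0 = NO_FACT_CHECK_NEEDED.
    """
    if len(fact_prompts) < per_class or len(no_fact_prompts) < per_class:
        raise ValueError("each pool must contain at least `per_class` prompts")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    rows = [(text, 1) for text in rng.sample(fact_prompts, per_class)]
    rows += [(text, 0) for text in rng.sample(no_fact_prompts, per_class)]
    rng.shuffle(rows)  # interleave the two classes
    return rows


# Tiny hypothetical pools; the real pools would hold prompts from the datasets listed above.
fact_pool = [
    "When was the Eiffel Tower built?",
    "What is the GDP of Japan in 2023?",
    "Who wrote Hamlet?",
]
no_fact_pool = [
    "Write a poem about spring",
    "Implement quicksort in Python",
    "What is the meaning of life?",
]

dataset = build_balanced_dataset(fact_pool, no_fact_pool, per_class=2)
print(len(dataset), sum(label for _, label in dataset))
# → 4 2
```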

---

## Intended Use

### Primary Use Cases

* **LLM Gateway / Router**

  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.

* **Hallucination Gatekeeper Frontline**

  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.

* **Traffic Analytics & Risk Scoring**

  * Monitor proportion of factual vs non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
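
For the traffic-analytics use case, the share of factual traffic can be tracked directly from the predicted labels. The sketch below assumes labels are collected per time window by the gateway; `fact_check_share` and the sample window are hypothetical.

```python
from collections import Counter


def fact_check_share(labels):
    """Return the fraction of labels equal to FACT_CHECK_NEEDED (0.0 for an empty window)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["FACT_CHECK_NEEDED"] / total


# Hypothetical window of predicted labels logged by the gateway.
window = [
    "FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
    "FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
]

share = fact_check_share(window)
print(f"{share:.0%} of recent prompts need the fact-checking path")
# → 40% of recent prompts need the fact-checking path
```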

### Non-Goals

* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).

---

## How It Works

* **Architecture**:

  * ModernBERT-base encoder
  * Classification head on top of `[CLS]` / pooled representation

* **Fine-tuning**:

  * LoRA on the base encoder
  * Binary cross-entropy / cross-entropy loss on the two labels
  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED

* **Decision Boundary**:

  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
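
The “binary cross-entropy / cross-entropy” wording in the fine-tuning bullet can be read as a single objective. For a head with two logits, softmax cross-entropy is the same as binary cross-entropy on the logit difference, as the identity below shows. The symbols $z_0$, $z_1$ (logits for labels `0` and `1`) and $y$ (the gold label) are introduced here for illustration; the card does not state which formulation the training code used.

```latex
\begin{align}
p_1 &= \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}} = \sigma(z_1 - z_0),
\qquad p_0 = 1 - p_1 \\
\mathcal{L}(y) &= -\,y \log p_1 - (1 - y) \log (1 - p_1),
\qquad y \in \{0, 1\}
\end{align}
```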

---

## Limitations

* **Language**:

  * Trained on English data only.
  * Performance on other languages is not guaranteed.

* **Borderline Queries**:

  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.

* **Domain Coverage**:

  * General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.

* **Not a Verifier**:

  * This model only decides if a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
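
One way to read the “default-to-safe” suggestion for borderline queries is to skip fact-checking only when the model is confidently `NO_FACT_CHECK_NEEDED`. The sketch below is an assumption about how such a policy might look: `route_default_to_safe` and the `0.8` floor are illustrative, the route names follow the routing example in the Usage section, and the confidence values are made up rather than real model outputs.

```python
def route_default_to_safe(label, confidence, min_confidence=0.8):
    """Route to direct generation only on a confident NO_FACT_CHECK_NEEDED prediction.

    Anything else, including low-confidence NO_FACT_CHECK_NEEDED, takes the safe path.
    """
    if label == "NO_FACT_CHECK_NEEDED" and confidence >= min_confidence:
        return "direct_generation"
    return "hallucination_gatekeeper"


# Hypothetical (label, confidence) pairs as returned by a classifier call.
print(route_default_to_safe("NO_FACT_CHECK_NEEDED", 0.98))  # → direct_generation
print(route_default_to_safe("NO_FACT_CHECK_NEEDED", 0.62))  # → hallucination_gatekeeper
print(route_default_to_safe("FACT_CHECK_NEEDED", 0.55))     # → hallucination_gatekeeper
```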

---

## Ethical Considerations

* **Risk Trade-off**:

  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.

* **Deployment Recommendation**:

  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.
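
Conservative thresholds for safety-critical domains could be expressed as a per-domain cutoff on the probability of `FACT_CHECK_NEEDED`. In the sketch below, the threshold values, the domain keys, and both function names are hypothetical. It relies only on the fact that, with two classes, the probability of the other label is one minus the reported confidence.

```python
# Hypothetical cutoffs: lower values send more traffic through the fact-checking path.
DOMAIN_THRESHOLDS = {"healthcare": 0.2, "finance": 0.2, "legal": 0.2}
DEFAULT_THRESHOLD = 0.5


def fact_check_probability(label, confidence):
    """Convert (predicted label, its probability) into P(FACT_CHECK_NEEDED)."""
    return confidence if label == "FACT_CHECK_NEEDED" else 1.0 - confidence


def needs_fact_check(label, confidence, domain=None):
    """Return True when P(FACT_CHECK_NEEDED) reaches the domain's threshold."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    return fact_check_probability(label, confidence) >= threshold


# The same hypothetical prediction is routed differently depending on the domain.
print(needs_fact_check("NO_FACT_CHECK_NEEDED", 0.7, domain="healthcare"))  # → True
print(needs_fact_check("NO_FACT_CHECK_NEEDED", 0.7))                       # → False
```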

---

## Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```

---

## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.