---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
results:
- task:
type: text-classification
name: Fact-Check Need Classification
metrics:
- type: accuracy
value: 0.964
name: Validation Accuracy
- type: f1
value: 0.965
name: F1 Score
---
# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper
**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.
It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.
The model classifies prompts into two labels:

- **`FACT_CHECK_NEEDED`**: information-seeking queries that depend on external/world knowledge
  - e.g., “When was the Eiffel Tower built?”
  - e.g., “What is the GDP of Japan in 2023?”
- **`NO_FACT_CHECK_NEEDED`**: creative, coding, opinion, or pure reasoning/math tasks
  - e.g., “Write a poem about spring”
  - e.g., “Implement quicksort in Python”
  - e.g., “What is the meaning of life?”
This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.
---
## Model Details
- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types
---
## Position in a Hallucination Mitigation Pipeline
HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:
1. **Stage 0 — HaluGate Sentinel (this model)**
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).
2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.
HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
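To make the staging concrete, here is a minimal sketch of how Stage 0 and Stage 1 could compose inside a gateway. `generate`, `run_rag_pipeline`, and `verify_answer` are hypothetical placeholders for your own downstream components; `classify_prompt` is defined in the Usage section below:

```python
def handle_request(user_prompt: str) -> str:
    # Stage 0: prompt-level switch (this model).
    label, confidence = classify_prompt(user_prompt)

    if label == "NO_FACT_CHECK_NEEDED":
        # Fast path: generate directly, skipping retrieval and verification.
        return generate(user_prompt)  # hypothetical direct-generation call

    # Stage 1+: fact-aware path (hypothetical components).
    answer, evidence = run_rag_pipeline(user_prompt)  # RAG / tools
    if verify_answer(user_prompt, answer, evidence):  # answer-level verifier
        return answer
    return "Sorry, I could not verify a reliable answer to this question."
```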
---
## Usage
### Basic Inference
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)
print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)
print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```
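For higher-throughput gateways, it is often worth classifying prompts in batches rather than one at a time. A minimal sketch that reuses the `model`, `tokenizer`, and `id2label` loaded above (the padding settings are illustrative, not tuned values):

```python
from typing import List, Tuple

def classify_prompts(texts: List[str]) -> List[Tuple[str, float]]:
    # Tokenize the whole batch at once; padding aligns sequence lengths.
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    pred_ids = probs.argmax(dim=-1)
    return [
        (id2label.get(int(i), str(int(i))), float(p[int(i)]))
        for i, p in zip(pred_ids, probs)
    ]

print(classify_prompts([
    "When was the Eiffel Tower built?",
    "Write a poem about spring",
]))
```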
### Example: Integrating with a Router / Gateway
Pseudocode for a routing decision:
```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```
---
## Training Data
The model was trained on a balanced dataset of **50,000** prompts:
### FACT_CHECK_NEEDED (25,000 samples)
Information-seeking and knowledge-intensive questions drawn from:
* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions
### NO_FACT_CHECK_NEEDED (25,000 samples)
Tasks that typically do **not** require external factual verification:
* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions
The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
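The exact preprocessing pipeline is not published in this repository. As a rough sketch, a balanced binary dataset of this shape could be assembled with the `datasets` library; the source selection and sample counts below are illustrative only:

```python
from datasets import load_dataset, concatenate_datasets

# FACT_CHECK_NEEDED (label 1): information-seeking questions.
squad = load_dataset("squad", split="train[:10000]")
factual = squad.map(
    lambda ex: {"text": ex["question"], "label": 1},
    remove_columns=squad.column_names,
)

# NO_FACT_CHECK_NEEDED (label 0): creative / coding / general instructions.
alpaca = load_dataset("tatsu-lab/alpaca", split="train[:10000]")
non_factual = alpaca.map(
    lambda ex: {"text": ex["instruction"], "label": 0},
    remove_columns=alpaca.column_names,
)

dataset = concatenate_datasets([factual, non_factual]).shuffle(seed=42)
print(dataset)  # 20,000 rows with 'text' and 'label' columns
```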
---
## Intended Use
### Primary Use Cases
* **LLM Gateway / Router**
  * Decide whether a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.
* **Hallucination Gatekeeper Frontline**
  * Enable expensive hallucination detection only for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.
* **Traffic Analytics & Risk Scoring**
  * Monitor the proportion of factual vs. non-factual traffic, as in the sketch below.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
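A minimal sketch of the analytics use case, assuming incoming prompts are logged and `classify_prompt` from the Usage section is available:

```python
from collections import Counter

def traffic_report(prompts):
    # Tally predicted labels over a window of logged prompts.
    counts = Counter(classify_prompt(p)[0] for p in prompts)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label}: {n} ({n / total:.1%})")

traffic_report([
    "When was the Eiffel Tower built?",
    "Write a haiku about autumn",
    "What is the capital of Australia?",
])
```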
### Non-Goals
* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).
---
## How It Works
* **Architecture**:
  * ModernBERT-base encoder
  * Classification head on top of the `[CLS]` / pooled representation
* **Fine-tuning** (see the sketch below):
  * LoRA adapters on the base encoder (rank = 16, alpha = 32)
  * Cross-entropy loss over the two labels
  * Balanced sampling between `FACT_CHECK_NEEDED` and `NO_FACT_CHECK_NEEDED`
* **Decision Boundary**:
  * Borderline, philosophical, or highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
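The training code is not part of this repository. The following is a minimal sketch of a comparable LoRA setup with the `peft` library, using the rank and alpha stated above; the target modules, dropout, and other hyperparameters are assumptions rather than the published recipe:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSequenceClassification

base = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-base",
    num_labels=2,
    id2label={0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"},
    label2id={"NO_FACT_CHECK_NEEDED": 0, "FACT_CHECK_NEEDED": 1},
)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,                           # LoRA rank, per the model card
    lora_alpha=32,                  # LoRA alpha, per the model card
    lora_dropout=0.1,               # assumption: not stated on the card
    target_modules=["Wqkv", "Wo"],  # assumption: ModernBERT attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapters and head are trained
```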
---
## Limitations
* **Language**:
  * Trained on English data only.
  * Performance on other languages is not guaranteed.
* **Borderline Queries**:
  * Philosophical or hybrid prompts (e.g., “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.
* **Domain Coverage**:
  * General-purpose factual tasks are well covered; highly specialized verticals (e.g., niche scientific domains) were not explicitly targeted during fine-tuning.
* **Not a Verifier**:
  * This model only decides whether a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
---
## Ethical Considerations
* **Risk Trade-off**:
  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.
* **Deployment Recommendation**:
  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path, as in the sketch below.
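A minimal sketch of such a “default-to-safe” policy, inverting the earlier routing example so that traffic skips fact-checking only when the model is highly confident (the threshold is illustrative and should be calibrated on your own traffic):

```python
SAFE_SKIP_THRESHOLD = 0.9  # illustrative; calibrate per deployment

def choose_route(user_prompt: str) -> str:
    label, confidence = classify_prompt(user_prompt)
    # Skip the fact-checking path only when the model is highly confident
    # that no verification is needed; otherwise default to the safe path.
    if label == "NO_FACT_CHECK_NEEDED" and confidence >= SAFE_SKIP_THRESHOLD:
        return "direct_generation"
    return "hallucination_gatekeeper"
```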
---
## Citation
If you use HaluGate Sentinel in academic work or production systems, please cite:
```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```
---
## Acknowledgements
* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.