---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      name: Validation Accuracy
    - type: f1
      value: 0.965
      name: F1 Score
---

# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.

It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.

The model classifies prompts into:

- **`FACT_CHECK_NEEDED`**:  
  Information-seeking queries that depend on external/world knowledge  
  - e.g., “When was the Eiffel Tower built?”  
  - e.g., “What is the GDP of Japan in 2023?”

- **`NO_FACT_CHECK_NEEDED`**:  
  Creative, coding, opinion, or pure reasoning/math tasks  
  - e.g., “Write a poem about spring”  
  - e.g., “Implement quicksort in Python”  
  - e.g., “What is the meaning of life?”

This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.

---

## Model Details

- **Model name**: `HaluGate Sentinel`
- **Repository**: `llm-semantic-router/halugate-sentinel`
- **Task**: Binary text classification (prompt-level)
- **Labels**:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
- **Validation Accuracy**: 96.4%
- **Validation F1 Score**: 0.965
- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types

---

## Position in a Hallucination Mitigation Pipeline

HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:

1. **Stage 0 — HaluGate Sentinel (this model)**  
   Classifies user prompts and decides whether **fact-checking is needed**:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).

2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**  
   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.

HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
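The two-stage flow above can be sketched as a small dispatcher. This is an illustrative stub, not part of this repository: `classify_prompt` stands in for HaluGate Sentinel (here a trivial keyword heuristic), and `run_gatekeeper` / `generate_directly` are hypothetical placeholders for the downstream pipelines.

```python
# Illustrative sketch of the Stage 0 -> Stage 1 hand-off.
# All three callables are placeholders, not real APIs.

def classify_prompt(text: str) -> str:
    """Placeholder stage-0 classifier: a trivial keyword heuristic."""
    factual_cues = ("when", "who", "what is the", "how many")
    if text.lower().startswith(factual_cues):
        return "FACT_CHECK_NEEDED"
    return "NO_FACT_CHECK_NEEDED"

def run_gatekeeper(text: str) -> str:
    # RAG / tools / answer-level verifiers would run here (Stage 1+).
    return f"[gatekeeper] {text}"

def generate_directly(text: str) -> str:
    # Plain LLM generation, no fact-checking overhead.
    return f"[direct] {text}"

def handle(text: str) -> str:
    # Stage 0: only factual prompts pay the gatekeeper's extra cost.
    if classify_prompt(text) == "FACT_CHECK_NEEDED":
        return run_gatekeeper(text)
    return generate_directly(text)

print(handle("When was the Eiffel Tower built?"))  # → [gatekeeper] When was the Eiffel Tower built?
print(handle("Write a poem about spring"))         # → [direct] Write a poem about spring
```

The point of the structure is that the expensive path runs only when Stage 0 asks for it; everything else short-circuits to direct generation.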

---

## Usage

### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```

### Example: Integrating with a Router / Gateway

Pseudocode for a routing decision:

```python
label, prob = classify_prompt(user_prompt)

FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
    route = "hallucination_gatekeeper"  # RAG / tools / verifiers
else:
    route = "direct_generation"

# Use `route` to select downstream pipelines in your LLM gateway.
```

---

## Training Data

The model was fine-tuned on a balanced dataset of **50,000** prompts:

### FACT_CHECK_NEEDED (25,000 samples)

Information-seeking and knowledge-intensive questions drawn from:

* **NISQ-ISQ**: Gold-standard information-seeking questions
* **HaluEval**: Hallucination-focused QA benchmark
* **FaithDial**: Information-seeking dialogue questions
* **FactCHD**: Fact-conflicting / hallucination-prone queries
* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
* **TruthfulQA**: High-risk factual queries
* **CoQA**: Conversational factual questions

### NO_FACT_CHECK_NEEDED (25,000 samples)

Tasks that typically do **not** require external factual verification:

* **NISQ-NonISQ**: Non-information-seeking questions
* **Databricks Dolly**: Creative writing, summarization, brainstorming
* **WritingPrompts**: Creative writing prompts
* **Alpaca**: Coding, math, opinion, and general instructions

The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
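A balanced two-class set of this kind can be assembled mechanically. The sketch below uses tiny placeholder prompt pools (the real data comes from the datasets listed above) just to show the labeling and balancing step:

```python
import random

# Placeholder pools standing in for the real source datasets.
fact_check_pool = [
    "When was the Eiffel Tower built?",
    "What is the GDP of Japan in 2023?",
    "Who wrote War and Peace?",
]
no_fact_check_pool = [
    "Write a poem about spring",
    "Implement quicksort in Python",
    "Brainstorm names for a coffee shop",
]

def build_balanced_dataset(pos_pool, neg_pool, n_per_class, seed=0):
    """Sample n_per_class prompts from each pool and attach labels
    (1 = FACT_CHECK_NEEDED, 0 = NO_FACT_CHECK_NEEDED)."""
    rng = random.Random(seed)
    pos = [(p, 1) for p in rng.choices(pos_pool, k=n_per_class)]
    neg = [(p, 0) for p in rng.choices(neg_pool, k=n_per_class)]
    data = pos + neg
    rng.shuffle(data)
    return data

dataset = build_balanced_dataset(fact_check_pool, no_fact_check_pool, n_per_class=4)
print(len(dataset), sum(y for _, y in dataset))  # → 8 4  (8 examples, exactly 4 positives)
```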

---

## Intended Use

### Primary Use Cases

* **LLM Gateway / Router**

  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.

* **Hallucination Gatekeeper Frontline**

  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.

* **Traffic Analytics & Risk Scoring**

  * Monitor proportion of factual vs non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.

### Non-Goals

* It does *not* verify the correctness of any answer.
* It should not be used as a generic toxicity / safety classifier.
* It does not handle non-English prompts reliably (trained on English only).

---

## How It Works

* **Architecture**:

  * ModernBERT-base encoder
  * Classification head on top of `[CLS]` / pooled representation

* **Fine-tuning**:

  * LoRA on the base encoder
  * Cross-entropy loss over the two labels
  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
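A `peft` configuration matching the stated hyperparameters (rank 16, alpha 32) could look roughly like the sketch below. The `target_modules` and dropout used for this model are not documented here, so the values shown for those are assumptions:

```python
from peft import LoraConfig, TaskType

# Config sketch consistent with the card's stated hyperparameters.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence classification head
    r=16,                        # LoRA rank (from the model card)
    lora_alpha=32,               # LoRA alpha (from the model card)
    lora_dropout=0.1,            # assumed; not stated in the card
    target_modules=["Wqkv"],     # assumed projection name; not confirmed for this model
)

# The adapter would then be attached with:
# model = get_peft_model(base_model, lora_config)
```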

* **Decision Boundary**:

  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
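One way to treat the confidence as a soft signal is a three-band policy: trust high-confidence predictions either way, and default low-confidence cases to the safe (fact-checked) path. The threshold below is an illustrative placeholder, not a tuned value:

```python
def route_with_confidence(label: str, confidence: float,
                          high_conf: float = 0.9) -> str:
    """Map (label, confidence) to a route, defaulting to the safe
    fact-checked path when the model is unsure. The high_conf
    threshold is an illustrative placeholder."""
    if confidence >= high_conf:
        # Trust confident predictions either way.
        if label == "FACT_CHECK_NEEDED":
            return "hallucination_gatekeeper"
        return "direct_generation"
    # Low confidence: default to the safe, fact-checked path.
    return "hallucination_gatekeeper"

print(route_with_confidence("NO_FACT_CHECK_NEEDED", 0.97))  # → direct_generation
print(route_with_confidence("NO_FACT_CHECK_NEEDED", 0.62))  # → hallucination_gatekeeper
```

Compared with the single-threshold router shown under Usage, this variant errs toward the gatekeeper whenever confidence is low, which suits safety-critical deployments.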

---

## Limitations

* **Language**:

  * Trained on English data only.
  * Performance on other languages is not guaranteed.

* **Borderline Queries**:

  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.

* **Domain Coverage**:

  * General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.

* **Not a Verifier**:

  * This model only decides if a prompt **needs factual support**.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).

---

## Ethical Considerations

* **Risk Trade-off**:

  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.

* **Deployment Recommendation**:

  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.

---

## Citation

If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```

---

## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.