llm-semantic-router/halugate-sentinel

@@ -8,6 +8,8 @@ tags:
 - hallucination-detection
 - modernbert
 - lora
 datasets:
 - squad
 - trivia_qa
@@ -23,11 +25,11 @@ metrics:
 - accuracy
 - f1
 model-index:
-- name: fact-check-classifier-modernbert
   results:
   - task:
       type: text-classification
-      name: Fact-Check Classification
     metrics:
     - type: accuracy
       value: 0.964
@@ -37,81 +39,251 @@ model-index:
       name: F1 Score
 ---
-# Fact-Check Classifier (ModernBERT + LoRA)
-A fine-tuned ModernBERT model that classifies user prompts into:
-- **FACT_CHECK_NEEDED**: Questions requiring factual verification (e.g., "When was the Eiffel Tower built?")
-- **NO_FACT_CHECK_NEEDED**: Creative, coding, opinion, or math requests (e.g., "Write a poem about spring")
 ## Model Details
-- **Base Model**: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
-- **Fine-tuning Method**: LoRA (rank=16, alpha=32)
-- **Training Data**: 50,000 balanced samples from 14 datasets
 - **Validation Accuracy**: 96.4%
-- **Edge Case Accuracy**: 100% (27/27 test cases)
 ## Usage
 ```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 import torch
-model = AutoModelForSequenceClassification.from_pretrained("rootfs/fact-check-classifier-modernbert")
-tokenizer = AutoTokenizer.from_pretrained("rootfs/fact-check-classifier-modernbert")
-def classify(text):
-    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
     with torch.no_grad():
         outputs = model(**inputs)
-    probs = torch.softmax(outputs.logits, dim=-1)
-    pred_class = torch.argmax(probs, dim=-1).item()
-    label = "FACT_CHECK_NEEDED" if pred_class == 1 else "NO_FACT_CHECK_NEEDED"
-    return label, probs[0, pred_class].item()
 # Examples
-print(classify("When was the Eiffel Tower built?"))  # FACT_CHECK_NEEDED
-print(classify("Write a poem about spring"))         # NO_FACT_CHECK_NEEDED
-print(classify("What is the meaning of life?"))      # NO_FACT_CHECK_NEEDED
 ```
 ## Training Data
 ### FACT_CHECK_NEEDED (25,000 samples)
-- **NISQ-ISQ**: Gold standard Information-Seeking Questions
-- **HaluEval**: QA questions from hallucination benchmark
-- **FaithDial**: Information-seeking dialogue questions
-- **FactCHD**: Fact-conflicting hallucination queries
-- **SQuAD, TriviaQA, HotpotQA**: Factual QA datasets
-- **TruthfulQA**: High-risk factual queries
-- **CoQA**: Conversational factual questions
 ### NO_FACT_CHECK_NEEDED (25,000 samples)
-- **NISQ-NonISQ**: Gold standard Non-Information-Seeking Questions
-- **Dolly**: Creative writing, brainstorming, summarization
-- **WritingPrompts**: Creative writing prompts
-- **Alpaca**: Coding, math, opinion instructions
 ## Intended Use
-This model is designed for use in LLM gateway/router systems to:
-1. Classify incoming prompts to determine if fact-checking is needed
-2. Trigger hallucination detection only for factual queries
-3. Reduce unnecessary compute by skipping fact-check for creative/coding tasks
 ## Limitations
-- Borderline cases (philosophical questions) may have lower confidence
-- Trained on English data only
-- Best used as part of a larger hallucination mitigation pipeline
 ## Citation
 ```bibtex
-@software{fact_check_classifier_2024,
-  title = {Fact-Check Classifier for LLM Hallucination Mitigation},
   author = {vLLM Project},
-  year = {2024},
-  url = {https://github.com/vllm-project/semantic-router}
 }
-```

 - hallucination-detection
 - modernbert
 - lora
+- llm-routing
+- llm-gateway
 datasets:
 - squad
 - trivia_qa
 - accuracy
 - f1
 model-index:
+- name: HaluGate-Sentinel
   results:
   - task:
       type: text-classification
+      name: Fact-Check Need Classification
     metrics:
     - type: accuracy
       value: 0.964
       name: F1 Score
 ---
+# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper
+**HaluGate Sentinel** is a ModernBERT + LoRA classifier that decides whether an incoming user prompt **requires factual verification**.
+It *does not* check facts itself. Instead, it acts as a **frontline switch** in an LLM routing / gateway system, deciding whether a request should enter a **fact-checking / RAG / hallucination-mitigation pipeline**.
+The model classifies prompts into:
+- **`FACT_CHECK_NEEDED`**:
+  Information-seeking queries that depend on external/world knowledge
+  - e.g., “When was the Eiffel Tower built?”
+  - e.g., “What is the GDP of Japan in 2023?”
+- **`NO_FACT_CHECK_NEEDED`**:
+  Creative, coding, opinion, or pure reasoning/math tasks
+  - e.g., “Write a poem about spring”
+  - e.g., “Implement quicksort in Python”
+  - e.g., “What is the meaning of life?”
+
+This model is part of the **Hallucination Gatekeeper** stack for `llm-semantic-router`.
+
+---
 ## Model Details
+- **Model name**: `HaluGate Sentinel`
+- **Repository**: `llm-semantic-router/halugate-sentinel`
+- **Task**: Binary text classification (prompt-level)
+- **Labels**:
+  - `0` → `NO_FACT_CHECK_NEEDED`
+  - `1` → `FACT_CHECK_NEEDED`
+- **Base model**: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
+- **Fine-tuning method**: LoRA (rank = 16, alpha = 32)
 - **Validation Accuracy**: 96.4%
+- **Validation F1 Score**: 0.965
+- **Edge-case accuracy**: 100% on a 27-sample curated test set of borderline prompt types
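Model Details reports validation accuracy and F1 for the two label ids listed above. The card does not say which F1 averaging was used. The sketch below assumes binary F1 with `FACT_CHECK_NEEDED` (id 1) as the positive class. The `LABELS` dict, helper names, and toy label lists are illustrative and are not the validation data.

```python
# Label ids as listed in Model Details: 0 -> NO_FACT_CHECK_NEEDED, 1 -> FACT_CHECK_NEEDED
LABELS = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}


def accuracy(y_true: list, y_pred: list) -> float:
    """Fraction of predictions that match the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_positive(y_true: list, y_pred: list, positive: int = 1) -> float:
    """Binary F1 with `positive` (assumed to be FACT_CHECK_NEEDED) as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # also avoids dividing by zero when nothing is predicted positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example (illustrative labels, not real validation data)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))     # 4 of 6 correct
print(f1_positive(y_true, y_pred))  # precision = recall = 2/3
```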
+---
+## Position in a Hallucination Mitigation Pipeline
+HaluGate Sentinel is designed as **Stage 0** in a multi-stage hallucination mitigation architecture:
+1. **Stage 0 — HaluGate Sentinel (this model)**
+   Classifies user prompts and decides whether **fact-checking is needed**:
+   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
+   - `FACT_CHECK_NEEDED` → Route into the **Hallucination Gatekeeper** path (RAG, tools, verifiers).
+2. **Stage 1+ — Answer-level hallucination models (e.g., “HaluGate Verifier”)**
+   Operate on *(query, answer, evidence)* to detect hallucinations and enforce trust policies.
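The two-stage flow in the list above can be sketched as plain control flow. `classify_prompt_stub` is a hypothetical keyword heuristic standing in for the real model; see Usage for actual inference. `retrieve_evidence`, `generate_answer`, and `verify_answer` are hypothetical placeholders for your RAG layer, LLM, and answer-level verifier. None of them are APIs shipped with this model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class PipelineResult:
    route: str
    answer: str
    evidence: List[str] = field(default_factory=list)
    verified: Optional[bool] = None  # None when Stage 1+ verification was skipped


def classify_prompt_stub(text: str) -> Tuple[str, float]:
    """Hypothetical stand-in for HaluGate Sentinel (Stage 0); swap in real model inference."""
    factual_openers = ("when ", "who ", "where ", "how many ", "what is the ")
    if text.lower().startswith(factual_openers):
        return "FACT_CHECK_NEEDED", 0.95
    return "NO_FACT_CHECK_NEEDED", 0.95


def retrieve_evidence(prompt: str) -> List[str]:
    """Hypothetical RAG / tool lookup."""
    return [f"retrieved passage for: {prompt}"]


def generate_answer(prompt: str, evidence: Optional[List[str]] = None) -> str:
    """Hypothetical LLM call; a real one would condition on the evidence."""
    return f"answer to: {prompt}"


def verify_answer(prompt: str, answer: str, evidence: List[str]) -> bool:
    """Hypothetical Stage 1+ verifier over (query, answer, evidence)."""
    return len(evidence) > 0


def handle(prompt: str) -> PipelineResult:
    label, _confidence = classify_prompt_stub(prompt)  # Stage 0
    if label == "NO_FACT_CHECK_NEEDED":
        # Route directly to LLM generation: no retrieval or verification cost.
        return PipelineResult(route="direct_generation", answer=generate_answer(prompt))
    # Hallucination Gatekeeper path: retrieve evidence, generate, then verify (Stage 1+).
    evidence = retrieve_evidence(prompt)
    answer = generate_answer(prompt, evidence)
    verified = verify_answer(prompt, answer, evidence)
    return PipelineResult("hallucination_gatekeeper", answer, evidence, verified)


print(handle("When was the Eiffel Tower built?").route)  # hallucination_gatekeeper
print(handle("Write a poem about spring").route)         # direct_generation
```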
+
+HaluGate Sentinel focuses solely on **prompt intent classification** to minimize unnecessary compute while preserving safety for factual queries.
+
+---
 ## Usage
+### Basic Inference
 ```python
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 import torch
+MODEL_ID = "llm-semantic-router/halugate-sentinel"
+model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}
+def classify_prompt(text: str):
+    inputs = tokenizer(
+        text,
+        return_tensors="pt",
+        truncation=True,
+        max_length=512,
+    )
     with torch.no_grad():
         outputs = model(**inputs)
+    probs = torch.softmax(outputs.logits, dim=-1)[0]
+    pred_id = int(torch.argmax(probs).item())
+    label = id2label.get(pred_id, str(pred_id))
+    confidence = float(probs[pred_id].item())
+    return label, confidence
 # Examples
+print(classify_prompt("When was the Eiffel Tower built?"))
+# → ('FACT_CHECK_NEEDED', 0.99...)
+print(classify_prompt("Write a poem about spring"))
+# → ('NO_FACT_CHECK_NEEDED', 0.98...)
+print(classify_prompt("Implement a binary search in Python"))
+# → ('NO_FACT_CHECK_NEEDED', 0.97...)
+```
+### Example: Integrating with a Router / Gateway
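+A routing decision built on `classify_prompt` from the block above:
+```python
+FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite
+def route_prompt(user_prompt: str, threshold: float = FACT_CHECK_THRESHOLD) -> str:
+    """Return the downstream route for a prompt, using classify_prompt() defined above."""
+    label, prob = classify_prompt(user_prompt)
+    # Two-class softmax: recover P(FACT_CHECK_NEEDED) from the winning label's confidence,
+    # so a threshold below 0.5 also routes borderline NO_FACT_CHECK_NEEDED prompts to fact-checking.
+    p_fact = prob if label == "FACT_CHECK_NEEDED" else 1.0 - prob
+    if p_fact >= threshold:
+        return "hallucination_gatekeeper"  # RAG / tools / verifiers
+    return "direct_generation"
+# Use the returned route to select downstream pipelines in your LLM gateway.
+route = route_prompt("When was the Eiffel Tower built?")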
 ```
+---
 ## Training Data
+Balanced dataset of **50,000** prompts:
 ### FACT_CHECK_NEEDED (25,000 samples)
+Information-seeking and knowledge-intensive questions drawn from:
+* **NISQ-ISQ**: Gold-standard information-seeking questions
+* **HaluEval**: Hallucination-focused QA benchmark
+* **FaithDial**: Information-seeking dialogue questions
+* **FactCHD**: Fact-conflicting / hallucination-prone queries
+* **SQuAD, TriviaQA, HotpotQA**: Standard factual QA datasets
+* **TruthfulQA**: High-risk factual queries
+* **CoQA**: Conversational factual questions
 ### NO_FACT_CHECK_NEEDED (25,000 samples)
+Tasks that typically do **not** require external factual verification:
+* **NISQ-NonISQ**: Non-information-seeking questions
+* **Databricks Dolly**: Creative writing, summarization, brainstorming
+* **WritingPrompts**: Creative writing prompts
+* **Alpaca**: Coding, math, opinion, and general instructions
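The training set is described as 50,000 prompts split evenly between the two labels. One common way to reach that balance is to pool prompts per label, de-duplicate, and downsample each pool to the same size. The sketch assumes each source has already been mapped to a label id. The card does not say how per-source proportions were chosen, so `build_balanced_dataset`, the toy pools, and the fixed seed are illustrative.

```python
import random


def build_balanced_dataset(pools: dict, per_label: int, seed: int = 0) -> list:
    """Downsample each label's prompt pool to `per_label` items and shuffle.

    `pools` maps a label id (0 = NO_FACT_CHECK_NEEDED, 1 = FACT_CHECK_NEEDED)
    to a list of prompts. Returns a list of (prompt, label) pairs.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    rows = []
    for label, prompts in pools.items():
        unique = sorted(set(prompts))  # drop exact duplicates, deterministic order
        if len(unique) < per_label:
            raise ValueError(f"label {label} has only {len(unique)} unique prompts")
        rows.extend((p, label) for p in rng.sample(unique, per_label))
    rng.shuffle(rows)
    return rows


pools = {
    1: ["When was the Eiffel Tower built?", "Who wrote Hamlet?", "What is the GDP of Japan in 2023?"],
    0: ["Write a poem about spring", "Implement quicksort in Python",
        "What is the meaning of life?", "Brainstorm names for a bakery"],
}
data = build_balanced_dataset(pools, per_label=3)
# The card's set corresponds to per_label=25_000 (50,000 prompts in total).
```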
+
+The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
+
+---
 ## Intended Use
+### Primary Use Cases
+* **LLM Gateway / Router**
+  * Decide if a prompt must be routed into a **fact-aware pipeline** (RAG, tools, knowledge base, verifiers).
+  * Avoid unnecessary compute for creative / coding / opinion tasks.
+* **Hallucination Gatekeeper Frontline**
+  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
+  * Implement different safety and latency policies for the two classes.
+* **Traffic Analytics & Risk Scoring**
+  * Monitor the proportion of factual vs. non-factual traffic.
+  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
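For the traffic-analytics use case, the factual vs. non-factual share can be aggregated directly from Sentinel labels. This is a minimal sketch that assumes the gateway logs one label string per request; the log format and `traffic_mix` are assumptions, not part of the model.

```python
from collections import Counter


def traffic_mix(labels: list) -> dict:
    """Return the share of each Sentinel label within a window of requests."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


window = [
    "FACT_CHECK_NEEDED", "NO_FACT_CHECK_NEEDED", "FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED", "NO_FACT_CHECK_NEEDED",
]
mix = traffic_mix(window)
print(mix)  # {'FACT_CHECK_NEEDED': 0.4, 'NO_FACT_CHECK_NEEDED': 0.6}
# e.g. size retrieval / tool-heavy capacity in proportion to mix.get("FACT_CHECK_NEEDED", 0.0)
```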
+### Non-Goals
+* It does *not* verify the correctness of any answer.
+* It should not be used as a generic toxicity / safety classifier.
+* It does not handle non-English prompts reliably (trained on English only).
+---
+## How It Works
+* **Architecture**:
+  * ModernBERT-base encoder
+  * Classification head on top of `[CLS]` / pooled representation
+* **Fine-tuning**:
+  * LoRA on the base encoder
+  * Binary cross-entropy / cross-entropy loss on the two labels
+  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED
+* **Decision Boundary**:
+  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
+  * Downstream systems are encouraged to use the **confidence score** as a soft signal, not a hard oracle.
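The LoRA settings in Model Details (rank 16, alpha 32) fix the size and scale of the adapter. This back-of-the-envelope sketch uses the standard LoRA formulation, in which the adapted weight is W + (alpha / r) · B·A, with B of shape d_out × r and A of shape r × d_in. The card does not list which modules were adapted. The 768 × 768 shape below is ModernBERT-base's hidden size and serves only as an example.

```python
def lora_scaling(rank: int, alpha: float) -> float:
    """Standard LoRA scaling factor applied to the low-rank update B @ A."""
    return alpha / rank


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for one adapted d_out x d_in weight (B: d_out x r, A: r x d_in)."""
    return rank * (d_out + d_in)


rank, alpha = 16, 32  # values from Model Details
hidden = 768          # ModernBERT-base hidden size, used as an example shape
full = hidden * hidden
adapter = lora_param_count(hidden, hidden, rank)
print(lora_scaling(rank, alpha))  # 2.0
print(adapter, full)              # 24576 589824 (adapter is 1/24 of the full matrix)
```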
+---
 ## Limitations
+* **Language**:
+  * Trained on English data only.
+  * Performance on other languages is not guaranteed.
+* **Borderline Queries**:
+  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
+  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.
+* **Domain Coverage**:
+  * General-purpose factual tasks are well covered; highly specialized verticals (e.g. niche scientific domains) were not explicitly targeted during fine-tuning.
+* **Not a Verifier**:
+  * This model only decides if a prompt **needs factual support**.
+  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).
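The “default-to-safe” policy suggested for borderline queries can be written as a small routing rule. A prompt bypasses fact-checking only when the model is confidently `NO_FACT_CHECK_NEEDED`; everything else takes the safe path. `route_default_to_safe` and the 0.8 cutoff are illustrative and are not values from this card.

```python
FACT = "FACT_CHECK_NEEDED"
NO_FACT = "NO_FACT_CHECK_NEEDED"


def route_default_to_safe(label: str, confidence: float, min_no_fact_confidence: float = 0.8) -> str:
    """Skip fact-checking only for confidently non-factual prompts (hypothetical policy)."""
    if label == NO_FACT and confidence >= min_no_fact_confidence:
        return "direct_generation"
    # Factual prompts and low-confidence non-factual prompts both take the safe path.
    return "hallucination_gatekeeper"


print(route_default_to_safe(NO_FACT, 0.97))  # direct_generation
print(route_default_to_safe(NO_FACT, 0.62))  # hallucination_gatekeeper (ambiguous prompt)
print(route_default_to_safe(FACT, 0.55))     # hallucination_gatekeeper
```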
+---
+## Ethical Considerations
+* **Risk Trade-off**:
+  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
+  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.
+* **Deployment Recommendation**:
+  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.
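Conservative thresholds can be made domain-specific so that safety-critical traffic leans toward the fact-checking path. This sketch assumes the gateway already knows each request's domain, for example from tenant or route metadata. It also assumes a two-class softmax, so P(FACT_CHECK_NEEDED) can be recovered from the predicted label and its confidence. The domain names and threshold values are hypothetical placeholders to tune, not recommendations from this card.

```python
# Hypothetical per-domain thresholds on P(FACT_CHECK_NEEDED); lower = more traffic fact-checked.
DOMAIN_THRESHOLDS = {
    "finance": 0.2,
    "healthcare": 0.2,
    "legal": 0.2,
    "default": 0.5,
}


def p_fact_check(label: str, confidence: float) -> float:
    """Recover P(FACT_CHECK_NEEDED) from (label, confidence) for a two-class softmax."""
    return confidence if label == "FACT_CHECK_NEEDED" else 1.0 - confidence


def needs_fact_check(label: str, confidence: float, domain: str) -> bool:
    """Compare P(FACT_CHECK_NEEDED) against the domain's threshold (falls back to default)."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DOMAIN_THRESHOLDS["default"])
    return p_fact_check(label, confidence) >= threshold


# A prompt classified NO_FACT_CHECK_NEEDED with 0.7 confidence has P(fact) of about 0.3:
print(needs_fact_check("NO_FACT_CHECK_NEEDED", 0.7, "healthcare"))  # True  (0.3 >= 0.2)
print(needs_fact_check("NO_FACT_CHECK_NEEDED", 0.7, "marketing"))   # False (0.3 < 0.5)
```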
+---
 ## Citation
+If you use HaluGate Sentinel in academic work or production systems, please cite:
 ```bibtex
+@software{halugate_sentinel_2024,
+  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
   author = {vLLM Project},
+  year   = {2024},
+  url    = {https://github.com/vllm-project/semantic-router}
 }
+```
+---
+## Acknowledgements
+* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
+* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and others listed above.
+* Designed for integration with the **vLLM Semantic Router** and broader **Hallucination Gatekeeper** ecosystem.