---
license: apache-2.0
language:
- en
tags:
- text-classification
- fact-checking
- hallucination-detection
- modernbert
- lora
- llm-routing
- llm-gateway
datasets:
- squad
- trivia_qa
- hotpot_qa
- truthful_qa
- databricks/databricks-dolly-15k
- tatsu-lab/alpaca
- pminervini/HaluEval
- neural-bridge/rag-dataset-12000
pipeline_tag: text-classification
base_model: answerdotai/ModernBERT-base
metrics:
- accuracy
- f1
model-index:
- name: HaluGate-Sentinel
  results:
  - task:
      type: text-classification
      name: Fact-Check Need Classification
    metrics:
    - type: accuracy
      value: 0.964
      name: Validation Accuracy
    - type: f1
      value: 0.965
      name: F1 Score
---

	# HaluGate Sentinel — Prompt Fact-Check Switch for Hallucination Gatekeeper

	HaluGate Sentinel is a ModernBERT + LoRA classifier that decides whether an incoming user prompt requires factual verification.

	It does not check facts itself. Instead, it acts as a frontline switch in an LLM routing / gateway system, deciding whether a request should enter a fact-checking / RAG / hallucination-mitigation pipeline.

	The model classifies prompts into:

- `FACT_CHECK_NEEDED`: Information-seeking queries that depend on external/world knowledge
  - e.g., “When was the Eiffel Tower built?”
  - e.g., “What is the GDP of Japan in 2023?”

- `NO_FACT_CHECK_NEEDED`: Creative, coding, opinion, or pure reasoning/math tasks
  - e.g., “Write a poem about spring”
  - e.g., “Implement quicksort in Python”
  - e.g., “What is the meaning of life?”

	This model is part of the Hallucination Gatekeeper stack for `llm-semantic-router`.

	---

	## Model Details

- Model name: `HaluGate Sentinel`
- Repository: `llm-semantic-router/halugate-sentinel`
- Task: Binary text classification (prompt-level)
- Labels:
  - `0` → `NO_FACT_CHECK_NEEDED`
  - `1` → `FACT_CHECK_NEEDED`
- Base model: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
- Fine-tuning method: LoRA (rank = 16, alpha = 32)
- Validation Accuracy: 96.4%
- Validation F1 Score: 0.965
- Edge-case accuracy: 100% on a 27-sample curated test set of borderline prompt types
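
To make the LoRA setting concrete, the sketch below computes the scaling factor and the number of extra trainable parameters that one adapter adds to a single weight matrix. It assumes the standard LoRA formulation, in which the low-rank update is scaled by alpha / rank. The 768 x 768 shape is an illustrative assumption, and this card does not state which modules the adapters were attached to.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Scaling applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank


def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


RANK, ALPHA = 16, 32  # values from this model card

scaling = lora_scaling(RANK, ALPHA)
# Illustrative shape only (assumption): a square 768 x 768 projection matrix.
extra = lora_extra_params(768, 768, RANK)
full = 768 * 768

print(f"scaling = {scaling}")                            # scaling = 2.0
print(f"adapter params = {extra}")                       # adapter params = 24576
print(f"fraction of full matrix = {extra / full:.3f}")   # fraction of full matrix = 0.042
```

Under these assumptions each adapted matrix trains roughly 4% as many parameters as full fine-tuning of that matrix would.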

	---

	## Position in a Hallucination Mitigation Pipeline

	HaluGate Sentinel is designed as Stage 0 in a multi-stage hallucination mitigation architecture:

1. Stage 0: HaluGate Sentinel (this model)
   Classifies user prompts and decides whether fact-checking is needed:
   - `NO_FACT_CHECK_NEEDED` → Route directly to LLM generation.
   - `FACT_CHECK_NEEDED` → Route into the Hallucination Gatekeeper path (RAG, tools, verifiers).

2. Stage 1+: Answer-level hallucination models (e.g., “HaluGate Verifier”)
   Operate on (query, answer, evidence) to detect hallucinations and enforce trust policies.

	HaluGate Sentinel focuses solely on prompt intent classification to minimize unnecessary compute while preserving safety for factual queries.
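
One way to wire the two stages together is sketched below. Every name in it (`Response`, `handle_request`, and the stub components) is hypothetical and chosen for illustration. The actual gateway interfaces are not described in this card, so treat the callables as placeholders for your own sentinel, retriever, generator, and verifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Response:
    answer: str
    route: str
    verified: Optional[bool]  # None when the answer-level stage was skipped


def handle_request(
    prompt: str,
    sentinel: Callable[[str], Tuple[str, float]],   # Stage 0: prompt -> (label, confidence)
    generate: Callable[[str, Optional[str]], str],  # (prompt, evidence or None) -> answer
    retrieve: Callable[[str], str],                 # prompt -> evidence
    verify: Callable[[str, str, str], bool],        # Stage 1+: (query, answer, evidence) -> supported?
) -> Response:
    label, _confidence = sentinel(prompt)
    if label == "NO_FACT_CHECK_NEEDED":
        # Skip retrieval and verification entirely.
        return Response(generate(prompt, None), "direct_generation", None)
    evidence = retrieve(prompt)
    answer = generate(prompt, evidence)
    return Response(answer, "hallucination_gatekeeper", verify(prompt, answer, evidence))


# Stub components so the sketch runs without any model (illustration only).
def stub_sentinel(prompt):
    is_question = prompt.rstrip().endswith("?")
    return ("FACT_CHECK_NEEDED" if is_question else "NO_FACT_CHECK_NEEDED", 0.9)


def stub_generate(prompt, evidence):
    return f"answer({prompt!r}, evidence={evidence!r})"


def stub_retrieve(prompt):
    return "stub evidence"


def stub_verify(query, answer, evidence):
    return evidence in answer


result = handle_request(
    "When was the Eiffel Tower built?",
    stub_sentinel, stub_generate, stub_retrieve, stub_verify,
)
print(result.route, result.verified)  # hallucination_gatekeeper True
```

Passing the components as callables keeps Stage 0 independent of whichever retrieval and verification models sit behind it.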

	---

	## Usage

	### Basic Inference

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

MODEL_ID = "llm-semantic-router/halugate-sentinel"

model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

id2label = model.config.id2label  # {0: 'NO_FACT_CHECK_NEEDED', 1: 'FACT_CHECK_NEEDED'}

def classify_prompt(text: str):
    """Return (label, confidence) for a single prompt."""
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        outputs = model(**inputs)
    # Softmax over the two logits; confidence is the probability of the predicted class.
    probs = torch.softmax(outputs.logits, dim=-1)[0]
    pred_id = int(torch.argmax(probs).item())
    label = id2label.get(pred_id, str(pred_id))
    confidence = float(probs[pred_id].item())
    return label, confidence

# Examples
print(classify_prompt("When was the Eiffel Tower built?"))
# → ('FACT_CHECK_NEEDED', 0.99...)

print(classify_prompt("Write a poem about spring"))
# → ('NO_FACT_CHECK_NEEDED', 0.98...)

print(classify_prompt("Implement a binary search in Python"))
# → ('NO_FACT_CHECK_NEEDED', 0.97...)
```
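
The confidence returned by `classify_prompt` is the softmax probability of the predicted class. As a dependency-free illustration of that step, the sketch below applies the same softmax and argmax to a pair of made-up logits. The logit values are invented for the example and are not real model outputs.

```python
import math

ID2LABEL = {0: "NO_FACT_CHECK_NEEDED", 1: "FACT_CHECK_NEEDED"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_from_logits(logits):
    """Return (label, confidence): the argmax class and its softmax probability."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[pred_id], probs[pred_id]


# Hypothetical logits ordered as [NO_FACT_CHECK_NEEDED, FACT_CHECK_NEEDED].
label, confidence = label_from_logits([-2.0, 3.0])
print(label, round(confidence, 4))  # FACT_CHECK_NEEDED 0.9933
```

With only two classes, the confidence of the predicted label is never below 0.5, so a threshold on that value only has an effect when it is set above 0.5.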

	### Example: Integrating with a Router / Gateway

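A minimal routing decision, reusing `classify_prompt` from the Basic Inference example:

```python
FACT_CHECK_THRESHOLD = 0.6  # configurable based on your risk appetite

def choose_route(user_prompt: str) -> str:
    """Pick a downstream pipeline from the Sentinel's label and confidence."""
    label, prob = classify_prompt(user_prompt)
    if label == "FACT_CHECK_NEEDED" and prob >= FACT_CHECK_THRESHOLD:
        return "hallucination_gatekeeper"  # RAG / tools / verifiers
    return "direct_generation"

# Use the returned route to select downstream pipelines in your LLM gateway.
route = choose_route("When was the Eiffel Tower built?")
```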

	---

	## Training Data

	Balanced dataset of 50,000 prompts:

	### FACT_CHECK_NEEDED (25,000 samples)

	Information-seeking and knowledge-intensive questions drawn from:

	* NISQ-ISQ: Gold-standard information-seeking questions
	* HaluEval: Hallucination-focused QA benchmark
	* FaithDial: Information-seeking dialogue questions
	* FactCHD: Fact-conflicting / hallucination-prone queries
	* SQuAD, TriviaQA, HotpotQA: Standard factual QA datasets
	* TruthfulQA: High-risk factual queries
	* CoQA: Conversational factual questions

	### NO_FACT_CHECK_NEEDED (25,000 samples)

	Tasks that typically do not require external factual verification:

	* NISQ-NonISQ: Non-information-seeking questions
	* Databricks Dolly: Creative writing, summarization, brainstorming
	* WritingPrompts: Creative writing prompts
	* Alpaca: Coding, math, opinion, and general instructions

	The objective is to approximate “does this prompt require world knowledge / external facts?” rather than “is the answer true?”.
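
This card does not describe how the 25,000 / 25,000 split was produced. The sketch below shows one common way to build such a balanced set: draw the same number of prompts from a positive and a negative pool, then shuffle. The function name, pool contents, and seed are hypothetical.

```python
import random


def build_balanced_dataset(positive_pool, negative_pool, per_class, seed=0):
    """Sample `per_class` prompts from each pool and return shuffled (text, label) pairs."""
    if per_class > min(len(positive_pool), len(negative_pool)):
        raise ValueError("per_class exceeds the size of the smaller pool")
    rng = random.Random(seed)
    positives = [(text, 1) for text in rng.sample(positive_pool, per_class)]  # FACT_CHECK_NEEDED
    negatives = [(text, 0) for text in rng.sample(negative_pool, per_class)]  # NO_FACT_CHECK_NEEDED
    dataset = positives + negatives
    rng.shuffle(dataset)
    return dataset


# Toy pools standing in for the real sources (hypothetical prompts).
fact_pool = [f"factual question {i}?" for i in range(10)]
other_pool = [f"creative task {i}" for i in range(6)]

data = build_balanced_dataset(fact_pool, other_pool, per_class=5)
print(sum(label for _, label in data), len(data))  # 5 10
```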

	---

	## Intended Use

	### Primary Use Cases

* LLM Gateway / Router
  * Decide if a prompt must be routed into a fact-aware pipeline (RAG, tools, knowledge base, verifiers).
  * Avoid unnecessary compute for creative / coding / opinion tasks.

* Hallucination Gatekeeper Frontline
  * Only enable expensive hallucination detection for prompts labeled `FACT_CHECK_NEEDED`.
  * Implement different safety and latency policies for the two classes.

* Traffic Analytics & Risk Scoring
  * Monitor proportion of factual vs non-factual traffic.
  * Adjust infrastructure sizing for retrieval / tool-heavy pipelines accordingly.
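
For the traffic-analytics use case, a gateway only needs to count the labels it has already produced. A minimal sketch follows, assuming labels are collected into a list per time window. The `factual_share` helper and the sample window are hypothetical.

```python
from collections import Counter


def factual_share(labels):
    """Fraction of prompts labeled FACT_CHECK_NEEDED; 0.0 for an empty window."""
    counts = Counter(labels)
    total = sum(counts.values())
    return counts["FACT_CHECK_NEEDED"] / total if total else 0.0


# Hypothetical window of Sentinel outputs collected by a gateway.
window = [
    "FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
    "FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
    "NO_FACT_CHECK_NEEDED",
]
print(factual_share(window))  # 0.4
```

Tracking this ratio over time gives a rough estimate of how much traffic the retrieval and verification path has to absorb.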

	### Non-Goals

	* It does not verify the correctness of any answer.
	* It should not be used as a generic toxicity / safety classifier.
	* It does not handle non-English prompts reliably (trained on English only).

	---

	## How It Works

* Architecture:
  * ModernBERT-base encoder
  * Classification head on top of `[CLS]` / pooled representation

* Fine-tuning:
  * LoRA on the base encoder
  * Binary cross-entropy / cross-entropy loss on the two labels
  * Balanced sampling between FACT_CHECK_NEEDED and NO_FACT_CHECK_NEEDED

* Decision Boundary:
  * Borderline / philosophical / highly abstract questions may be assigned lower confidence.
  * Downstream systems are encouraged to use the confidence score as a soft signal, not a hard oracle.
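
Treating the confidence as a soft signal can be sketched as a three-band rule on the probability of `FACT_CHECK_NEEDED`: clearly factual, clearly non-factual, and an ambiguous middle band that falls back to a configurable route. The band edges (0.3 and 0.7) and the function names are hypothetical and would need tuning on your own traffic.

```python
def p_fact_check(label: str, confidence: float) -> float:
    """Convert (predicted label, its confidence) into P(FACT_CHECK_NEEDED) for the binary case."""
    return confidence if label == "FACT_CHECK_NEEDED" else 1.0 - confidence


def soft_route(label, confidence, low=0.3, high=0.7,
               uncertain_route="hallucination_gatekeeper"):
    """Route on P(FACT_CHECK_NEEDED); ambiguous prompts go to `uncertain_route`."""
    p = p_fact_check(label, confidence)
    if p >= high:
        return "hallucination_gatekeeper"
    if p <= low:
        return "direct_generation"
    return uncertain_route  # ambiguous band: default to the safer path


print(soft_route("FACT_CHECK_NEEDED", 0.95))     # hallucination_gatekeeper
print(soft_route("NO_FACT_CHECK_NEEDED", 0.95))  # direct_generation
print(soft_route("NO_FACT_CHECK_NEEDED", 0.60))  # hallucination_gatekeeper
```

In the last call the model predicts `NO_FACT_CHECK_NEEDED` with only 0.60 confidence, so the prompt lands in the ambiguous band and is sent down the safer path.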

	---

	## Limitations

* Language:
  * Trained on English data only.
  * Performance on other languages is not guaranteed.

* Borderline Queries:
  * Philosophical or hybrid prompts (e.g. “Is time travel possible?”) may be ambiguous.
  * In such cases, consider inspecting the model confidence and implementing a “default-to-safe” policy.

* Domain Coverage:
  * General-purpose factual tasks are well-covered; highly specialized verticals (e.g. niche scientific domains) are not explicitly targeted during fine-tuning.

* Not a Verifier:
  * This model only decides if a prompt needs factual support.
  * Actual hallucination detection and answer verification must be handled by separate models (e.g., answer-level verifiers).

	---

	## Ethical Considerations

* Risk Trade-off:
  * Over-classifying prompts as `NO_FACT_CHECK_NEEDED` may reduce safety for borderline factual tasks.
  * Over-classifying as `FACT_CHECK_NEEDED` increases compute cost but is safer in high-risk environments.

* Deployment Recommendation:
  * For safety-critical domains (finance, healthcare, legal, etc.), configure conservative thresholds and fallbacks that favor routing more traffic through the fact-checking path.
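
A conservative configuration might look like the sketch below, which lowers the bar for fact-checking in safety-critical domains. Here `p_fact` denotes the model's probability for `FACT_CHECK_NEEDED`. The domain names come from the recommendation above, while the threshold values and the `needs_fact_check` helper are hypothetical.

```python
DEFAULT_THRESHOLD = 0.5

# Hypothetical values: a lower threshold sends more traffic to fact-checking.
DOMAIN_THRESHOLDS = {
    "finance": 0.2,
    "healthcare": 0.1,
    "legal": 0.2,
}


def needs_fact_check(p_fact: float, domain: str = "general") -> bool:
    """Return True when P(FACT_CHECK_NEEDED) meets the threshold for this domain."""
    threshold = DOMAIN_THRESHOLDS.get(domain, DEFAULT_THRESHOLD)
    return p_fact >= threshold


print(needs_fact_check(0.3))                # False
print(needs_fact_check(0.3, "healthcare"))  # True
```

The same prompt score is routed differently depending on the deployment domain, which keeps the risk policy in configuration rather than in the model.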

	---

	## Citation

	If you use HaluGate Sentinel in academic work or production systems, please cite:

```bibtex
@software{halugate_sentinel_2024,
  title  = {HaluGate Sentinel: Prompt-Level Fact-Check Switch for Hallucination Gatekeepers},
  author = {vLLM Project},
  year   = {2024},
  url    = {https://github.com/vllm-project/semantic-router}
}
```

	---

	## Acknowledgements

* Base encoder: [`answerdotai/ModernBERT-base`](https://huggingface.co/answerdotai/ModernBERT-base)
* Training datasets: SQuAD, TriviaQA, HotpotQA, TruthfulQA, CoQA, Dolly, Alpaca, WritingPrompts, HaluEval, and the others listed above.
* Designed for integration with the vLLM Semantic Router and the broader Hallucination Gatekeeper ecosystem.