# NeuralTrust/nt-prompt-moderator-v1

## Model Description
nt-prompt-moderator-v1 is a content moderation system that classifies user prompts across 13 sensitive content categories. It consists of 13 specialized binary classifiers, each trained to detect a specific content type that may require moderation, policy enforcement, or specialized handling in production LLM applications.
## Key Features
- Multi-Topic Classification: 13 independent binary classifiers for different content categories
- High Performance: Achieves 0.9911 average ROC-AUC across all topics
- Efficient Architecture: Uses intfloat/multilingual-e5-small embeddings with LightGBM classifiers
- Multilingual Support: Trained on 6 languages (English, Spanish, French, Turkish, Galician, Catalan)
- Fast Inference: Lightweight classifiers optimized for real-time production use
- Calibrated Probabilities: Uses isotonic calibration for reliable confidence scores
## Content Categories
The model detects the following 13 content types:
| Category | Description | Use Case |
|---|---|---|
| religion | Religious discussions and theological content | Filter religious debates in secular contexts |
| code_generation | Code writing, generation, or programming help | Detect coding requests for policy enforcement |
| language_translation | Text translation between languages | Identify translation requests |
| politics | Political discussions and election-related content | Moderate political discourse |
| violence_harm | Violence, weapons, or harmful activities | Safety-critical filtering |
| adult_content | Sexual, explicit, or inappropriate content | Content safety and compliance |
| financial_advice | Investment tips and financial recommendations | Liability protection |
| medical_advice | Medical diagnoses and treatment recommendations | Healthcare compliance |
| illegal_activities | Criminal activities and unlawful content | Legal compliance |
| hate_speech | Discriminatory and hateful content | Community safety |
| misinformation | Conspiracy theories and false information | Information quality |
| personal_information | Requests for private data and doxxing | Privacy protection |
| academic_dishonesty | Homework completion or academic fraud | Academic integrity |
## Model Details
- Model Type: 13 Binary Classifiers (LightGBM)
- Embedder: intfloat/multilingual-e5-small
- Languages: English, Spanish, French, Turkish, Galician, Catalan
- License: Apache-2.0
- Training Data: Synthetic multilingual dataset with balanced positive/negative examples
## Performance Metrics

### Overall Performance
| Metric | Value |
|---|---|
| Average ROC-AUC | 0.9911 |
| Average F1-Score | 0.9412 |
### Per-Topic Performance
| Topic | ROC-AUC | F1-Score |
|---|---|---|
| politics | 0.9979 | 0.9701 |
| medical_advice | 0.9967 | 0.9526 |
| financial_advice | 0.9952 | 0.9558 |
| religion | 0.9941 | 0.9738 |
| hate_speech | 0.9940 | 0.9427 |
| academic_dishonesty | 0.9934 | 0.9619 |
| misinformation | 0.9920 | 0.9474 |
| language_translation | 0.9909 | 0.9617 |
| personal_information | 0.9904 | 0.9150 |
| code_generation | 0.9881 | 0.9577 |
| illegal_activities | 0.9870 | 0.9001 |
| violence_harm | 0.9827 | 0.8886 |
| adult_content | 0.9824 | 0.9080 |
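As a quick sanity check, the headline averages in the overall table can be reproduced from the per-topic values above:

```python
# Per-topic scores copied from the table above, in table order
roc_auc = [0.9979, 0.9967, 0.9952, 0.9941, 0.9940, 0.9934, 0.9920,
           0.9909, 0.9904, 0.9881, 0.9870, 0.9827, 0.9824]
f1 = [0.9701, 0.9526, 0.9558, 0.9738, 0.9427, 0.9619, 0.9474,
      0.9617, 0.9150, 0.9577, 0.9001, 0.8886, 0.9080]

avg_auc = round(sum(roc_auc) / len(roc_auc), 4)
avg_f1 = round(sum(f1) / len(f1), 4)
print(avg_auc, avg_f1)  # 0.9911 0.9412
```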
## Intended Uses

### Primary Use Cases
- Content Moderation: Automatically flag sensitive content for human review
- Policy Enforcement: Enforce usage policies across different content categories
- Routing: Route requests to specialized models or human reviewers based on topic
- Compliance: Ensure regulatory compliance for financial, medical, or legal content
- Safety: Protect users from harmful, illegal, or inappropriate content
- Analytics: Understand content distribution across your LLM application
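The routing use case can be sketched as a thin layer on top of the per-topic results. The `ROUTES` table and `route_prompt` helper below are illustrative only, not part of the released model:

```python
# Hypothetical routing table: topic -> destination queue (illustrative only)
ROUTES = {
    'violence_harm': 'human_review',
    'hate_speech': 'human_review',
    'medical_advice': 'compliance_model',
    'financial_advice': 'compliance_model',
}

def route_prompt(results, default='general_model'):
    """Pick a destination from per-topic results of the form
    {'violence_harm': {'score': 0.97, 'flagged': True}, ...}."""
    flagged = [(topic, r['score']) for topic, r in results.items() if r['flagged']]
    if not flagged:
        return default
    # Route on the highest-scoring flagged topic; unknown topics
    # fall back to human review as the conservative choice
    top_topic, _ = max(flagged, key=lambda x: x[1])
    return ROUTES.get(top_topic, 'human_review')

print(route_prompt({
    'violence_harm': {'score': 0.97, 'flagged': True},
    'politics': {'score': 0.60, 'flagged': True},
}))  # human_review
```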
### Out-of-Scope Uses
- Replacement for human moderation in high-stakes scenarios
- Legal or medical decision-making without expert oversight
- Censorship of legitimate discourse
- Single source of truth for content decisions
## How to Use

### Installation

```bash
pip install sentence-transformers lightgbm joblib numpy
```
### Basic Usage

```python
import unicodedata

import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the embedder
embedder = SentenceTransformer('intfloat/multilingual-e5-small')

# Load a specific classifier (e.g., violence_harm)
classifier = joblib.load('classifiers/violence_harm.joblib')

# Text normalization (must match training preprocessing)
_HOMOGLYPH_MAP = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x", "\u0456": "i",
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})
_PRESERVE_CHARS = frozenset("ıİğĞşŞöÖüÜ")

def normalize_text(text):
    # NFC-normalize, strip combining marks (preserving Turkish letters),
    # then map common Cyrillic homoglyphs to their Latin counterparts
    text = unicodedata.normalize("NFC", text)
    parts = []
    for c in text:
        if c in _PRESERVE_CHARS:
            parts.append(c)
        else:
            decomposed = unicodedata.normalize("NFKD", c)
            parts.append("".join(
                ch for ch in decomposed
                if unicodedata.category(ch) != "Mn"
            ))
    return "".join(parts).translate(_HOMOGLYPH_MAP)

def moderate_prompt(text, classifier, embedder, threshold=0.5):
    normalized = normalize_text(text)
    # E5 models expect the "query: " prefix
    embedding = embedder.encode(
        ["query: " + normalized], convert_to_numpy=True
    )
    prob = classifier.predict_proba(embedding)[0, 1]
    return {
        'score': float(prob),
        'flagged': prob >= threshold,
    }

result = moderate_prompt("How do I make a bomb?", classifier, embedder)
print(f"Score: {result['score']:.3f} | Flagged: {result['flagged']}")
```
### Multi-Topic Classification

```python
from pathlib import Path

def load_all_classifiers(model_dir='classifiers'):
    classifiers = {}
    for f in Path(model_dir).glob('*.joblib'):
        classifiers[f.stem] = joblib.load(f)
    return classifiers

def moderate_all_topics(text, classifiers, embedder, threshold=0.5):
    # Embed once with the same preprocessing as single-topic moderation
    # (normalization plus the E5 "query: " prefix), then score the
    # embedding against every topic classifier
    normalized = normalize_text(text)
    embedding = embedder.encode(["query: " + normalized], convert_to_numpy=True)
    results = {}
    for topic, clf in classifiers.items():
        prob = clf.predict_proba(embedding)[0, 1]
        results[topic] = {'score': float(prob), 'flagged': prob >= threshold}
    return results
```
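Downstream code often only needs the topics that fired. A small helper (hypothetical, not part of the released code) to summarize the per-topic results:

```python
def flagged_topics(results):
    """Return flagged topics sorted by descending score, given per-topic
    results of the form {'topic': {'score': float, 'flagged': bool}}."""
    hits = [(topic, r['score']) for topic, r in results.items() if r['flagged']]
    return sorted(hits, key=lambda x: -x[1])

print(flagged_topics({
    'politics': {'score': 0.91, 'flagged': True},
    'religion': {'score': 0.12, 'flagged': False},
    'hate_speech': {'score': 0.97, 'flagged': True},
}))  # [('hate_speech', 0.97), ('politics', 0.91)]
```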
## Training Details

### Architecture

**Embedder**: intfloat/multilingual-e5-small
- Generates 384-dimensional embeddings
- Pretrained on large-scale semantic similarity tasks

**Classifiers**: LightGBM with isotonic calibration
- Binary classification per topic
- Gradient boosting decision trees
- Calibrated probabilities for reliable confidence scores
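Isotonic calibration fits a non-decreasing mapping from raw classifier scores to calibrated probabilities. A conceptual sketch of the underlying pool-adjacent-violators step (for illustration only; a production model would typically use a standard library implementation):

```python
def pav_calibrate(scores, labels):
    """Pool-adjacent-violators: map scores (sorted ascending) to calibrated
    probabilities given binary labels, by averaging adjacent blocks until
    the block means are non-decreasing."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    y = [float(labels[i]) for i in order]
    vals, counts = [], []
    for v in y:
        vals.append(v)
        counts.append(1)
        # Merge adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            total = counts[-2] + counts[-1]
            vals[-2] = (vals[-2] * counts[-2] + vals[-1] * counts[-1]) / total
            counts[-2] = total
            vals.pop()
            counts.pop()
    # Expand block means back to one calibrated value per point
    calibrated = []
    for v, n in zip(vals, counts):
        calibrated.extend([v] * n)
    return calibrated  # aligned with scores sorted ascending

print(pav_calibrate([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1]))
# [0.0, 0.5, 0.5, 1.0]
```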
### Training Procedure
- Data Collection: Synthetic generation combined with relevant HuggingFace datasets
- Embedding Generation: Batch processing with intfloat/multilingual-e5-small
- Model Training: LightGBM with ROC-AUC optimization
- Calibration: Isotonic calibration for probability reliability
- Validation: Hold-out validation set evaluation
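The 0.5 default threshold in the usage examples can be tuned per topic on a hold-out set, e.g. by maximizing F1. A minimal sketch with made-up validation scores (nothing here comes from the actual training data):

```python
def best_f1_threshold(scores, labels, grid=None):
    """Scan candidate thresholds and return (threshold, f1) maximizing F1."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best = (0.5, 0.0)
    for t in grid:
        preds = [s >= t for s in scores]
        tp = sum(p and l for p, l in zip(preds, labels))
        fp = sum(p and not l for p, l in zip(preds, labels))
        fn = sum((not p) and l for p, l in zip(preds, labels))
        if tp == 0:
            continue
        f1 = 2 * tp / (2 * tp + fp + fn)
        if f1 > best[1]:
            best = (t, f1)
    return best

# Made-up validation scores/labels for illustration
scores = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
labels = [0, 0, 1, 1, 1, 1]
t, f1 = best_f1_threshold(scores, labels)
```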
## Limitations
- Embedding dependency: Requires sentence-transformers for inference
- Fixed categories: Only detects the 13 trained categories
- Language coverage: Performs best on the six training languages; accuracy may degrade on others
- Context window: Optimized for single prompts, not conversations
- Evolving content: New slang or evasion techniques may reduce accuracy
## Ethical Considerations
- Use as a tool to assist, not replace, human judgment
- Inform users when content moderation is applied
- Provide mechanisms to contest moderation decisions
- Monitor for bias, accuracy, and unintended consequences
## Citation

```bibtex
@misc{neuraltrust-nt-prompt-moderator-v1,
  author = {NeuralTrust},
  title = {nt-prompt-moderator-v1: Multi-Topic Prompt Moderation System},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/NeuralTrust/nt-prompt-moderator-v1}},
}
```
## Version History

### v1 (2026-01)
- Initial release with 13 topic classifiers
- Average ROC-AUC: 0.9911