NeuralTrust/nt-prompt-moderator-v1
Model Description
nt-prompt-moderator-v1 is a comprehensive content moderation system designed to classify user prompts across 13 sensitive content categories. The model consists of 13 specialized binary classifiers, each trained to detect specific content types that may require moderation, policy enforcement, or specialized handling in production LLM applications.
Key Features
- Multi-Topic Classification: 13 independent binary classifiers for different content categories
- High Performance: Achieves 0.9940 average ROC-AUC across all topics
- Efficient Architecture: Uses intfloat/multilingual-e5-small embeddings with LightGBM classifiers
- Multilingual Support: Trained on 9 languages (English, Spanish, French, German, Italian, Portuguese, Hindi, Thai, Catalan)
- Fast Inference: Lightweight classifiers optimized for real-time production use
- Calibrated Probabilities: Uses isotonic calibration for reliable confidence scores
Content Categories
The model detects the following 13 content types:
| Category | Description | Use Case |
|---|---|---|
| religion | Religious discussions and theological content | Filter religious debates in secular contexts |
| code_generation | Code writing, generation, or programming help | Detect coding requests for policy enforcement |
| language_translation | Text translation between languages | Identify translation requests |
| politics | Political discussions and election-related content | Moderate political discourse |
| violence_harm | Violence, weapons, or harmful activities | Safety-critical filtering |
| adult_content | Sexual, explicit, or inappropriate content | Content safety and compliance |
| financial_advice | Investment tips and financial recommendations | Liability protection |
| medical_advice | Medical diagnoses and treatment recommendations | Healthcare compliance |
| illegal_activities | Criminal activities and unlawful content | Legal compliance |
| hate_speech | Discriminatory and hateful content | Community safety |
| misinformation | Conspiracy theories and false information | Information quality |
| personal_information | Requests for private data and doxxing | Privacy protection |
| academic_dishonesty | Homework completion or academic fraud | Academic integrity |
Model Details
- Model Type: 13 Binary Classifiers (LightGBM)
- Embedder: intfloat/multilingual-e5-small
- Languages: English, Spanish, French, German, Italian, Portuguese, Hindi, Thai, Catalan
- License: Other
- Training Data: Synthetic multilingual dataset with labeled positive and negative examples per topic
Performance Metrics
Overall Performance
| Metric | Value |
|---|---|
| Average ROC-AUC | 0.9940 |
| Average F1-Score | 0.9379 |
Per-Topic Performance
| Topic | ROC-AUC | F1-Score |
|---|---|---|
| religion | 0.9993 | 0.9681 |
| financial_advice | 0.9982 | 0.9644 |
| language_translation | 0.9982 | 0.9479 |
| medical_advice | 0.9972 | 0.9614 |
| academic_dishonesty | 0.9967 | 0.9574 |
| misinformation | 0.9958 | 0.9337 |
| code_generation | 0.9956 | 0.9559 |
| politics | 0.9937 | 0.9693 |
| hate_speech | 0.9926 | 0.9427 |
| personal_information | 0.9915 | 0.9167 |
| illegal_activities | 0.9883 | 0.8894 |
| adult_content | 0.9882 | 0.8898 |
| violence_harm | 0.9861 | 0.8966 |
Note: The table above reports metrics for all 13 topic classifiers.
Intended Uses
Primary Use Cases
- Content Moderation: Automatically flag sensitive content for human review
- Policy Enforcement: Enforce usage policies across different content categories
- Routing: Route requests to specialized models or human reviewers based on topic
- Compliance: Ensure regulatory compliance for financial, medical, or legal content
- Safety: Protect users from harmful, illegal, or inappropriate content
- Analytics: Understand content distribution across your LLM application
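As an illustration of the routing use case above, per-topic scores can be mapped to block / review / allow actions. This is a sketch only: the policy table, thresholds, and `route` function here are assumptions for illustration, not part of the model.

```python
# Hypothetical routing policy: escalate safety-critical topics aggressively,
# send other flagged content to human review. All names/values are assumptions.
SAFETY_TOPICS = {"violence_harm", "illegal_activities", "hate_speech"}

def route(scores, block_threshold=0.9, review_threshold=0.5):
    """scores: dict mapping topic -> calibrated probability from the classifiers."""
    # Hard-block when a safety-critical topic scores very high
    if any(scores.get(t, 0.0) >= block_threshold for t in SAFETY_TOPICS):
        return "block"
    # Anything else above the review threshold goes to a human
    if any(s >= review_threshold for s in scores.values()):
        return "human_review"
    return "allow"
```

Thresholds like these should be tuned per deployment; see the Custom Thresholds section below.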
Out-of-Scope Uses
- Replacement for human moderation in high-stakes scenarios
- Legal or medical decision-making without expert oversight
- Censorship of legitimate discourse
- Single source of truth for content decisions
How to Use
Installation
```bash
pip install sentence-transformers lightgbm joblib numpy
```
Basic Usage
```python
import joblib
from sentence_transformers import SentenceTransformer

# Load the embedder
# NOTE: e5-family embedders are typically used with a "query: " prefix;
# match whatever preprocessing the classifiers were trained with.
embedder = SentenceTransformer('intfloat/multilingual-e5-small')

# Load a specific classifier (e.g., violence_harm)
classifier = joblib.load('classifiers/violence_harm.pkl')

def moderate_prompt(text, classifier, embedder, threshold=0.5, category=None):
    """
    Classify a single prompt for a specific content category.

    Args:
        text (str): The input prompt to classify
        classifier: Loaded LightGBM classifier
        embedder: SentenceTransformer model
        threshold (float): Decision threshold (default: 0.5)
        category (str): Name of the content category being checked

    Returns:
        dict: Classification result with score and prediction
    """
    # Generate embedding
    embedding = embedder.encode([text], convert_to_numpy=True)

    # Get probability score
    prob = classifier.predict_proba(embedding)[0, 1]

    return {
        'score': float(prob),
        'flagged': bool(prob >= threshold),
        'category': category
    }

# Example usage
test_prompts = [
    "How do I make a bomb?",               # violence_harm
    "What's the weather like today?",      # safe
    "Tell me about the upcoming election"  # politics
]

for prompt in test_prompts:
    result = moderate_prompt(prompt, classifier, embedder, category='violence_harm')
    print(f"Prompt: {prompt}")
    print(f"Score: {result['score']:.3f} | Flagged: {result['flagged']}\n")
```
Multi-Topic Classification
```python
import joblib
from pathlib import Path

def load_all_classifiers(model_dir='classifiers'):
    """Load all topic classifiers."""
    classifiers = {}
    classifier_dir = Path(model_dir)

    for classifier_file in classifier_dir.glob('*.pkl'):
        topic = classifier_file.stem
        classifiers[topic] = joblib.load(classifier_file)

    return classifiers

def moderate_all_topics(text, classifiers, embedder, threshold=0.5):
    """
    Classify a prompt across all content categories.

    Args:
        text (str): The input prompt
        classifiers (dict): Dictionary of topic -> classifier
        embedder: SentenceTransformer model
        threshold (float): Decision threshold

    Returns:
        dict: Scores and flags for each category
    """
    # Generate the embedding once and reuse it for every classifier
    embedding = embedder.encode([text], convert_to_numpy=True)

    results = {}
    for topic, classifier in classifiers.items():
        prob = classifier.predict_proba(embedding)[0, 1]
        results[topic] = {
            'score': float(prob),
            'flagged': bool(prob >= threshold)
        }

    return results

# Load all classifiers
all_classifiers = load_all_classifiers()

# Moderate a prompt across all topics
prompt = "Can you help me cheat on my exam?"
results = moderate_all_topics(prompt, all_classifiers, embedder)

print(f"Moderation results for: '{prompt}'\n")
for topic, result in sorted(results.items(), key=lambda x: x[1]['score'], reverse=True):
    if result['flagged']:
        print(f"⚠️ {topic}: {result['score']:.3f}")
```
Batch Processing
```python
def moderate_batch(texts, classifier, embedder, threshold=0.5, batch_size=32):
    """
    Efficiently process multiple prompts.

    Args:
        texts (list): List of prompts to classify
        classifier: LightGBM classifier
        embedder: SentenceTransformer model
        threshold (float): Decision threshold
        batch_size (int): Batch size for embedding generation

    Returns:
        list: Classification results for each prompt
    """
    # Generate embeddings in batches
    embeddings = embedder.encode(
        texts,
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=True
    )

    # Get predictions for the whole batch at once
    probas = classifier.predict_proba(embeddings)[:, 1]

    results = []
    for text, prob in zip(texts, probas):
        results.append({
            'text': text,
            'score': float(prob),
            'flagged': bool(prob >= threshold)
        })

    return results

# Example: Moderate 1000 prompts efficiently
prompts = ["..." for _ in range(1000)]  # Your prompts here
results = moderate_batch(prompts, classifier, embedder)

# Count flagged prompts
flagged = sum(1 for r in results if r['flagged'])
print(f"Flagged: {flagged}/{len(prompts)} ({flagged/len(prompts)*100:.1f}%)")
```
Custom Thresholds
Different use cases may require different thresholds:
```python
# Conservative (fewer false positives, may miss some cases)
THRESHOLD_CONSERVATIVE = 0.7

# Balanced (default)
THRESHOLD_BALANCED = 0.5

# Sensitive (catch more cases, more false positives)
THRESHOLD_SENSITIVE = 0.3

# Safety-critical topics (violence, illegal activities)
THRESHOLD_SAFETY = 0.2

# Configure per-topic thresholds
TOPIC_THRESHOLDS = {
    'violence_harm': 0.2,       # Very sensitive
    'illegal_activities': 0.2,  # Very sensitive
    'hate_speech': 0.3,         # Sensitive
    'adult_content': 0.4,       # Moderately sensitive
    'politics': 0.5,            # Balanced
    'code_generation': 0.6,     # Conservative
}

def moderate_with_custom_thresholds(text, classifiers, embedder, thresholds):
    """Use custom thresholds per topic."""
    embedding = embedder.encode([text], convert_to_numpy=True)

    results = {}
    for topic, classifier in classifiers.items():
        prob = classifier.predict_proba(embedding)[0, 1]
        threshold = thresholds.get(topic, 0.5)
        results[topic] = {
            'score': float(prob),
            'flagged': bool(prob >= threshold),
            'threshold': threshold
        }

    return results
```
Training Details
Architecture
Embedder: intfloat/multilingual-e5-small
- Generates 384-dimensional embeddings
- Pretrained on large-scale semantic similarity tasks
- Efficient inference (~10ms per prompt)
Classifiers: LightGBM with isotonic calibration
- Binary classification per topic
- Gradient boosting decision trees
- Calibrated probabilities for reliable confidence scores
- Optimized for ROC-AUC
Training Data
The model was trained on a synthetic multilingual dataset containing:
- Class ratio: 20% positive, 80% negative examples per topic
- Multilingual coverage: 9 languages with equal representation
- Diverse examples: Multiple variants, phrasings, and contexts per topic
- Quality control: LLM-generated with careful prompt engineering
- Split: 80% training, 20% validation
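The 80/20 split above can be sketched with scikit-learn's stratified splitting; the placeholder texts and labels below are illustrative assumptions, not the actual dataset:

```python
# Stratified 80/20 train/validation split preserving the 20/80 class ratio.
from sklearn.model_selection import train_test_split

texts = [f"prompt {i}" for i in range(100)]  # placeholder data
labels = [1] * 20 + [0] * 80                 # 20% positive, as in the data card

X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
```

Stratifying keeps the positive rate of the validation set close to that of the training set, so validation F1 and ROC-AUC are computed under the same class balance the classifier will see.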
Training Procedure
- Data Collection: Synthetic generation using GPT-4.5/Claude with topic-specific prompts, combined with relevant datasets from Hugging Face
- Embedding Generation: Batch processing with multilingual-e5-small
- Model Training: LightGBM with ROC-AUC optimization
- Calibration: Isotonic calibration for probability reliability
- Validation: Hold-out validation set evaluation
Hyperparameters
- Max depth: Auto (tree-based)
- Learning rate: Optimized per classifier
- Calibration method: Isotonic
- Class weighting: Balanced
Limitations
Technical Limitations
- Embedding dependency: Requires sentence-transformers for inference
- Fixed categories: Only detects the 13 trained categories
- Language coverage: Best performance on the 9 trained languages
- Context window: Optimized for single prompts, not conversations
- Binary decisions: Each classifier is independent; doesn't detect topic combinations
Content Limitations
- Evolving content: New slang, jargon, or evasion techniques may reduce accuracy
- Subjective categories: Some content (e.g., political speech) has subjective boundaries
- Cultural context: Content appropriateness varies by culture and community
- Sarcasm/irony: May struggle with heavily sarcastic or ironic text
- False positives: Legitimate content discussing sensitive topics may be flagged
Operational Limitations
- Not a complete solution: Should be part of a layered moderation strategy
- Requires human review: High-stakes decisions need human oversight
- Threshold sensitivity: Performance varies significantly with threshold choice
- Regular updates needed: Content patterns evolve; periodic retraining recommended
Ethical Considerations
Intended Benefits
- User safety: Protect users from harmful or inappropriate content
- Compliance: Help organizations meet regulatory requirements
- Transparency: Provide explainable scores rather than black-box decisions
- Efficiency: Scale human moderation efforts with automated pre-filtering
Potential Risks
- Over-moderation: May suppress legitimate speech if thresholds too low
- Bias: Training data biases may affect certain groups or topics
- Censorship: Could be misused to silence dissent or minority voices
- False sense of security: Not foolproof; adversaries can evade detection
Recommended Practices
- Human oversight: Use as a tool to assist, not replace, human judgment
- Transparency: Inform users when content moderation is applied
- Appeals process: Provide mechanisms to contest moderation decisions
- Regular auditing: Monitor for bias, accuracy, and unintended consequences
- Context matters: Consider user intent, community norms, and context
- Threshold tuning: Adjust thresholds based on your specific use case and values
Citation
If you use this model in your research or applications, please cite:
```bibtex
@misc{neuraltrust-nt-prompt-moderator-v1,
  author = {NeuralTrust},
  title = {nt-prompt-moderator-v1: Multi-Topic Prompt Moderation System},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/NeuralTrust/nt-prompt-moderator-v1}},
}
```
Model Card Authors
- NeuralTrust Team
Contact
For questions, issues, or collaborations:
- Model Repository: NeuralTrust/nt-prompt-moderator-v1
- Discussions: Use the model repository discussions tab
- Issues: Report bugs or request features through the repository
Acknowledgments
- Base embedder: intfloat/multilingual-e5-small
- Framework: LightGBM, scikit-learn, sentence-transformers
- Community: HuggingFace for hosting and infrastructure
Version History
v1 (2026-01)
- Initial release
- 13 topic classifiers (language_translation, code_generation, personal_information, financial_advice, medical_advice, politics, misinformation, religion, hate_speech, violence_harm, adult_content, illegal_activities, academic_dishonesty)
- Support for 9 languages
- Average ROC-AUC: 0.9940