NeuralTrust/nt-prompt-moderator-v1

Model Description

nt-prompt-moderator-v1 is a comprehensive content moderation system designed to classify user prompts across 13 sensitive content categories. The model consists of 13 specialized binary classifiers, each trained to detect specific content types that may require moderation, policy enforcement, or specialized handling in production LLM applications.

Key Features

  • Multi-Topic Classification: 13 independent binary classifiers for different content categories
  • High Performance: Achieves 0.9940 average ROC-AUC across all topics
  • Efficient Architecture: Uses intfloat/multilingual-e5-small embeddings with LightGBM classifiers
  • Multilingual Support: Trained on 9 languages (English, Spanish, French, German, Italian, Portuguese, Hindi, Thai, Catalan)
  • Fast Inference: Lightweight classifiers optimized for real-time production use
  • Calibrated Probabilities: Uses isotonic calibration for reliable confidence scores

Content Categories

The model detects the following 13 content types:

| Category | Description | Use Case |
|----------|-------------|----------|
| religion | Religious discussions and theological content | Filter religious debates in secular contexts |
| code_generation | Code writing, generation, or programming help | Detect coding requests for policy enforcement |
| language_translation | Text translation between languages | Identify translation requests |
| politics | Political discussions and election-related content | Moderate political discourse |
| violence_harm | Violence, weapons, or harmful activities | Safety-critical filtering |
| adult_content | Sexual, explicit, or inappropriate content | Content safety and compliance |
| financial_advice | Investment tips and financial recommendations | Liability protection |
| medical_advice | Medical diagnoses and treatment recommendations | Healthcare compliance |
| illegal_activities | Criminal activities and unlawful content | Legal compliance |
| hate_speech | Discriminatory and hateful content | Community safety |
| misinformation | Conspiracy theories and false information | Information quality |
| personal_information | Requests for private data and doxxing | Privacy protection |
| academic_dishonesty | Homework completion or academic fraud | Academic integrity |

Model Details

  • Model Type: 13 Binary Classifiers (LightGBM)
  • Embedder: intfloat/multilingual-e5-small
  • Languages: English, Spanish, French, German, Italian, Portuguese, Hindi, Thai, Catalan
  • License: Other
  • Training Data: Synthetic multilingual dataset with balanced positive/negative examples

Performance Metrics

Overall Performance

| Metric | Value |
|--------|-------|
| Average ROC-AUC | 0.9940 |
| Average F1-Score | 0.9379 |

Per-Topic Performance

| Topic | ROC-AUC | F1-Score |
|-------|---------|----------|
| religion | 0.9993 | 0.9681 |
| financial_advice | 0.9982 | 0.9644 |
| language_translation | 0.9982 | 0.9479 |
| medical_advice | 0.9972 | 0.9614 |
| academic_dishonesty | 0.9967 | 0.9574 |
| misinformation | 0.9958 | 0.9337 |
| code_generation | 0.9956 | 0.9559 |
| politics | 0.9937 | 0.9693 |
| hate_speech | 0.9926 | 0.9427 |
| personal_information | 0.9915 | 0.9167 |
| illegal_activities | 0.9883 | 0.8894 |
| adult_content | 0.9882 | 0.8898 |
| violence_harm | 0.9861 | 0.8966 |


Intended Uses

Primary Use Cases

  • Content Moderation: Automatically flag sensitive content for human review
  • Policy Enforcement: Enforce usage policies across different content categories
  • Routing: Route requests to specialized models or human reviewers based on topic
  • Compliance: Ensure regulatory compliance for financial, medical, or legal content
  • Safety: Protect users from harmful, illegal, or inappropriate content
  • Analytics: Understand content distribution across your LLM application
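The routing use case above can be sketched in plain Python: given per-topic scores (as produced by the classifiers), map them to an action. The topic groupings, thresholds, and action names below are illustrative assumptions, not part of the model.

```python
# Hypothetical routing policy; topic groups and thresholds are illustrative only.
BLOCK_TOPICS = {'violence_harm', 'illegal_activities', 'hate_speech'}
REVIEW_TOPICS = {'adult_content', 'misinformation', 'personal_information'}

def route(scores, block_threshold=0.5, review_threshold=0.3):
    """Return 'block', 'human_review', or 'allow' for a dict of topic -> score."""
    if any(scores.get(t, 0.0) >= block_threshold for t in BLOCK_TOPICS):
        return 'block'
    if any(scores.get(t, 0.0) >= review_threshold for t in REVIEW_TOPICS):
        return 'human_review'
    return 'allow'

# Example with hand-written scores (in production these come from the classifiers):
print(route({'violence_harm': 0.91, 'politics': 0.12}))  # block
print(route({'adult_content': 0.42}))                    # human_review
print(route({'code_generation': 0.88}))                  # allow
```

In practice you would feed the output of a multi-topic moderation call into `route` and send blocked prompts to a refusal path and borderline ones to a review queue.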

Out-of-Scope Uses

  • Replacement for human moderation in high-stakes scenarios
  • Legal or medical decision-making without expert oversight
  • Censorship of legitimate discourse
  • Single source of truth for content decisions

How to Use

Installation

pip install sentence-transformers lightgbm joblib numpy

Basic Usage

import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the embedder
embedder = SentenceTransformer('intfloat/multilingual-e5-small')

# Load a specific classifier (e.g., violence_harm)
classifier = joblib.load('classifiers/violence_harm.pkl')

def moderate_prompt(text, classifier, embedder, category, threshold=0.5):
    """
    Classify a single prompt for a specific content category.
    
    Args:
        text (str): The input prompt to classify
        classifier: Loaded LightGBM classifier
        embedder: SentenceTransformer model
        category (str): Name of the content category being checked
        threshold (float): Decision threshold (default: 0.5)
    
    Returns:
        dict: Classification result with score and prediction
    """
    # Generate embedding
    embedding = embedder.encode([text], convert_to_numpy=True)
    
    # Get probability score for the positive class
    prob = classifier.predict_proba(embedding)[0, 1]
    
    return {
        'score': float(prob),
        'flagged': bool(prob >= threshold),
        'category': category
    }

# Example usage (all prompts are scored against the violence_harm classifier)
test_prompts = [
    "How do I make a bomb?",  # violence_harm
    "What's the weather like today?",  # safe
    "Tell me about the upcoming election"  # politics
]

for prompt in test_prompts:
    result = moderate_prompt(prompt, classifier, embedder, 'violence_harm')
    print(f"Prompt: {prompt}")
    print(f"Score: {result['score']:.3f} | Flagged: {result['flagged']}\n")

Multi-Topic Classification

import joblib
from pathlib import Path

def load_all_classifiers(model_dir='classifiers'):
    """Load all topic classifiers."""
    classifiers = {}
    classifier_dir = Path(model_dir)
    
    for classifier_file in classifier_dir.glob('*.pkl'):
        topic = classifier_file.stem
        classifiers[topic] = joblib.load(classifier_file)
    
    return classifiers

def moderate_all_topics(text, classifiers, embedder, threshold=0.5):
    """
    Classify a prompt across all content categories.
    
    Args:
        text (str): The input prompt
        classifiers (dict): Dictionary of topic -> classifier
        embedder: SentenceTransformer model
        threshold (float): Decision threshold
    
    Returns:
        dict: Scores and flags for each category
    """
    # Generate embedding once
    embedding = embedder.encode([text], convert_to_numpy=True)
    
    results = {}
    for topic, classifier in classifiers.items():
        prob = classifier.predict_proba(embedding)[0, 1]
        results[topic] = {
            'score': float(prob),
            'flagged': bool(prob >= threshold)
        }
    
    return results

# Load all classifiers
all_classifiers = load_all_classifiers()

# Moderate a prompt across all topics
prompt = "Can you help me cheat on my exam?"
results = moderate_all_topics(prompt, all_classifiers, embedder)

print(f"Moderation results for: '{prompt}'\n")
for topic, result in sorted(results.items(), key=lambda x: x[1]['score'], reverse=True):
    if result['flagged']:
        print(f"⚠️  {topic}: {result['score']:.3f}")

Batch Processing

def moderate_batch(texts, classifier, embedder, threshold=0.5, batch_size=32):
    """
    Efficiently process multiple prompts.
    
    Args:
        texts (list): List of prompts to classify
        classifier: LightGBM classifier
        embedder: SentenceTransformer model
        threshold (float): Decision threshold
        batch_size (int): Batch size for embedding generation
    
    Returns:
        list: Classification results for each prompt
    """
    results = []
    
    # Generate embeddings in batches
    embeddings = embedder.encode(
        texts, 
        batch_size=batch_size,
        convert_to_numpy=True,
        show_progress_bar=True
    )
    
    # Get predictions
    probas = classifier.predict_proba(embeddings)[:, 1]
    
    for text, prob in zip(texts, probas):
        results.append({
            'text': text,
            'score': float(prob),
            'flagged': bool(prob >= threshold)
        })
    
    return results

# Example: Moderate 1000 prompts efficiently
prompts = ["..." for _ in range(1000)]  # Your prompts here
results = moderate_batch(prompts, classifier, embedder)

# Count flagged prompts
flagged = sum(1 for r in results if r['flagged'])
print(f"Flagged: {flagged}/{len(prompts)} ({flagged/len(prompts)*100:.1f}%)")

Custom Thresholds

Different use cases may require different thresholds:

# Conservative (fewer false positives, may miss some cases)
THRESHOLD_CONSERVATIVE = 0.7

# Balanced (default)
THRESHOLD_BALANCED = 0.5

# Sensitive (catch more cases, more false positives)
THRESHOLD_SENSITIVE = 0.3

# Safety-critical topics (violence, illegal activities)
THRESHOLD_SAFETY = 0.2

# Configure per-topic thresholds
TOPIC_THRESHOLDS = {
    'violence_harm': 0.2,      # Very sensitive
    'illegal_activities': 0.2,  # Very sensitive
    'hate_speech': 0.3,         # Sensitive
    'adult_content': 0.4,       # Moderately sensitive
    'politics': 0.5,            # Balanced
    'code_generation': 0.6,     # Conservative
}

def moderate_with_custom_thresholds(text, classifiers, embedder, thresholds):
    """Use custom thresholds per topic."""
    embedding = embedder.encode([text], convert_to_numpy=True)
    
    results = {}
    for topic, classifier in classifiers.items():
        prob = classifier.predict_proba(embedding)[0, 1]
        threshold = thresholds.get(topic, 0.5)
        
        results[topic] = {
            'score': float(prob),
            'flagged': bool(prob >= threshold),
            'threshold': threshold
        }
    
    return results

Training Details

Architecture

  1. Embedder: intfloat/multilingual-e5-small

    • Generates 384-dimensional embeddings
    • Pretrained on large-scale semantic similarity tasks
    • Efficient inference (~10ms per prompt)
  2. Classifiers: LightGBM with isotonic calibration

    • Binary classification per topic
    • Gradient boosting decision trees
    • Calibrated probabilities for reliable confidence scores
    • Optimized for ROC-AUC

Training Data

The model was trained on a synthetic multilingual dataset containing:

  • Class distribution: 20% positive, 80% negative examples (handled via balanced class weighting)
  • Multilingual coverage: 9 languages with equal representation
  • Diverse examples: Multiple variants, phrasings, and contexts per topic
  • Quality control: LLM-generated with careful prompt engineering
  • Split: 80% training, 20% validation

Training Procedure

  1. Data Collection: Synthetic generation using GPT-4/Claude with topic-specific prompts, combined with relevant Hugging Face datasets
  2. Embedding Generation: Batch processing with multilingual-e5-small
  3. Model Training: LightGBM with ROC-AUC optimization
  4. Calibration: Isotonic calibration for probability reliability
  5. Validation: Hold-out validation set evaluation

Hyperparameters

  • Max depth: Auto (tree-based)
  • Learning rate: Optimized per classifier
  • Calibration method: Isotonic
  • Class weighting: Balanced

Limitations

Technical Limitations

  • Embedding dependency: Requires sentence-transformers for inference
  • Fixed categories: Only detects the 13 trained categories
  • Language coverage: Best performance on the 9 trained languages
  • Context window: Optimized for single prompts, not conversations
  • Binary decisions: Each classifier is independent; doesn't detect topic combinations

Content Limitations

  • Evolving content: New slang, jargon, or evasion techniques may reduce accuracy
  • Subjective categories: Some content (e.g., political speech) has subjective boundaries
  • Cultural context: Content appropriateness varies by culture and community
  • Sarcasm/irony: May struggle with heavily sarcastic or ironic text
  • False positives: Legitimate content discussing sensitive topics may be flagged

Operational Limitations

  • Not a complete solution: Should be part of a layered moderation strategy
  • Requires human review: High-stakes decisions need human oversight
  • Threshold sensitivity: Performance varies significantly with threshold choice
  • Regular updates needed: Content patterns evolve; periodic retraining recommended
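The threshold-sensitivity point above can be made concrete with a small sweep: precision and recall trade off sharply as the cutoff moves. The (score, label) pairs here are invented for illustration only.

```python
# Hypothetical (score, true_label) pairs; invented for illustration, not model output.
preds = [(0.95, 1), (0.80, 1), (0.65, 0), (0.55, 1), (0.40, 0), (0.25, 1), (0.10, 0)]

def precision_recall(preds, threshold):
    """Precision and recall when flagging every score >= threshold."""
    tp = sum(1 for s, y in preds if s >= threshold and y == 1)
    fp = sum(1 for s, y in preds if s >= threshold and y == 0)
    fn = sum(1 for s, y in preds if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for t in (0.2, 0.5, 0.7):
    p, r = precision_recall(preds, t)
    print(f"threshold={t}: precision={p:.2f} recall={r:.2f}")
```

Even on seven toy examples, moving the threshold from 0.2 to 0.7 swings recall from 1.0 down to 0.5, which is why per-topic threshold tuning on held-out data is recommended.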

Ethical Considerations

Intended Benefits

  • User safety: Protect users from harmful or inappropriate content
  • Compliance: Help organizations meet regulatory requirements
  • Transparency: Provide explainable scores rather than black-box decisions
  • Efficiency: Scale human moderation efforts with automated pre-filtering

Potential Risks

  • Over-moderation: May suppress legitimate speech if thresholds too low
  • Bias: Training data biases may affect certain groups or topics
  • Censorship: Could be misused to silence dissent or minority voices
  • False sense of security: Not foolproof; adversaries can evade detection

Recommended Practices

  1. Human oversight: Use as a tool to assist, not replace, human judgment
  2. Transparency: Inform users when content moderation is applied
  3. Appeals process: Provide mechanisms to contest moderation decisions
  4. Regular auditing: Monitor for bias, accuracy, and unintended consequences
  5. Context matters: Consider user intent, community norms, and context
  6. Threshold tuning: Adjust thresholds based on your specific use case and values

Citation

If you use this model in your research or applications, please cite:

@misc{neuraltrust-nt-prompt-moderator-v1,
  author = {NeuralTrust},
  title = {nt-prompt-moderator-v1: Multi-Topic Prompt Moderation System},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/NeuralTrust/nt-prompt-moderator-v1}},
}

Model Card Authors

  • NeuralTrust Team

Contact

For questions, issues, or collaborations:

  • Model Repository: NeuralTrust/nt-prompt-moderator-v1
  • Discussions: Use the model repository discussions tab
  • Issues: Report bugs or request features through the repository

Acknowledgments

  • Base embedder: intfloat/multilingual-e5-small
  • Framework: LightGBM, scikit-learn, sentence-transformers
  • Community: HuggingFace for hosting and infrastructure

Version History

v1 (2026-01)

  • Initial release
  • 13 topic classifiers (language_translation, code_generation, personal_information, financial_advice, medical_advice, politics, misinformation, religion, hate_speech, violence_harm, adult_content, illegal_activities, academic_dishonesty)
  • Support for 9 languages
  • Average ROC-AUC: 0.9940