NeuralTrust/nt-prompt-moderator-v1

Model Description

nt-prompt-moderator-v1 is a comprehensive content moderation system designed to classify user prompts across 13 sensitive content categories. The model consists of 13 specialized binary classifiers, each trained to detect specific content types that may require moderation, policy enforcement, or specialized handling in production LLM applications.

Key Features

  • Multi-Topic Classification: 13 independent binary classifiers for different content categories
  • High Performance: Achieves 0.9911 average ROC-AUC across all topics
  • Efficient Architecture: Uses intfloat/multilingual-e5-small embeddings with LightGBM classifiers
  • Multilingual Support: Trained on 6 languages (English, Spanish, French, Turkish, Galician, Catalan)
  • Fast Inference: Lightweight classifiers optimized for real-time production use
  • Calibrated Probabilities: Uses isotonic calibration for reliable confidence scores

Content Categories

The model detects the following 13 content types:

| Category | Description | Use Case |
|----------|-------------|----------|
| religion | Religious discussions and theological content | Filter religious debates in secular contexts |
| code_generation | Code writing, generation, or programming help | Detect coding requests for policy enforcement |
| language_translation | Text translation between languages | Identify translation requests |
| politics | Political discussions and election-related content | Moderate political discourse |
| violence_harm | Violence, weapons, or harmful activities | Safety-critical filtering |
| adult_content | Sexual, explicit, or inappropriate content | Content safety and compliance |
| financial_advice | Investment tips and financial recommendations | Liability protection |
| medical_advice | Medical diagnoses and treatment recommendations | Healthcare compliance |
| illegal_activities | Criminal activities and unlawful content | Legal compliance |
| hate_speech | Discriminatory and hateful content | Community safety |
| misinformation | Conspiracy theories and false information | Information quality |
| personal_information | Requests for private data and doxxing | Privacy protection |
| academic_dishonesty | Homework completion or academic fraud | Academic integrity |

Model Details

  • Model Type: 13 Binary Classifiers (LightGBM)
  • Embedder: intfloat/multilingual-e5-small
  • Languages: English, Spanish, French, Turkish, Galician, Catalan
  • License: Apache-2.0
  • Training Data: Synthetic multilingual dataset with balanced positive/negative examples

Performance Metrics

Overall Performance

| Metric | Value |
|--------|-------|
| Average ROC-AUC | 0.9911 |
| Average F1-Score | 0.9412 |

Per-Topic Performance

| Topic | ROC-AUC | F1-Score |
|-------|---------|----------|
| politics | 0.9979 | 0.9701 |
| medical_advice | 0.9967 | 0.9526 |
| financial_advice | 0.9952 | 0.9558 |
| religion | 0.9941 | 0.9738 |
| hate_speech | 0.9940 | 0.9427 |
| academic_dishonesty | 0.9934 | 0.9619 |
| misinformation | 0.9920 | 0.9474 |
| language_translation | 0.9909 | 0.9617 |
| personal_information | 0.9904 | 0.9150 |
| code_generation | 0.9881 | 0.9577 |
| illegal_activities | 0.9870 | 0.9001 |
| violence_harm | 0.9827 | 0.8886 |
| adult_content | 0.9824 | 0.9080 |

Intended Uses

Primary Use Cases

  • Content Moderation: Automatically flag sensitive content for human review
  • Policy Enforcement: Enforce usage policies across different content categories
  • Routing: Route requests to specialized models or human reviewers based on topic
  • Compliance: Ensure regulatory compliance for financial, medical, or legal content
  • Safety: Protect users from harmful, illegal, or inappropriate content
  • Analytics: Understand content distribution across your LLM application
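
As an illustration of the routing use case, the sketch below dispatches a prompt to a handler keyed by its highest-scoring flagged category. The per-topic results dict mirrors the output shape of the moderation functions shown later in this card; the scores and handler names here are hypothetical, not part of the model.

```python
def route_prompt(results, handlers, default="general"):
    """Pick the handler for the highest-scoring flagged topic.

    results: {topic: {"score": float, "flagged": bool}} from a
    per-topic moderation pass; handlers maps topic -> handler name.
    Falls back to `default` when nothing is flagged.
    """
    flagged = {t: r["score"] for t, r in results.items() if r["flagged"]}
    if not flagged:
        return default
    top_topic = max(flagged, key=flagged.get)
    return handlers.get(top_topic, default)

# Hypothetical moderation output and routing table
results = {
    "medical_advice": {"score": 0.91, "flagged": True},
    "politics": {"score": 0.12, "flagged": False},
}
handlers = {"medical_advice": "clinical_review_queue"}
print(route_prompt(results, handlers))  # clinical_review_queue
```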

Out-of-Scope Uses

  • Replacement for human moderation in high-stakes scenarios
  • Legal or medical decision-making without expert oversight
  • Censorship of legitimate discourse
  • Single source of truth for content decisions

How to Use

Installation

pip install sentence-transformers lightgbm joblib numpy

Basic Usage

import unicodedata
import joblib
import numpy as np
from sentence_transformers import SentenceTransformer

# Load the embedder
embedder = SentenceTransformer('intfloat/multilingual-e5-small')

# Load a specific classifier (e.g., violence_harm)
classifier = joblib.load('classifiers/violence_harm.joblib')

# Text normalization (must match training preprocessing)
_HOMOGLYPH_MAP = str.maketrans({
    "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
    "\u0441": "c", "\u0443": "y", "\u0445": "x", "\u0456": "i",
    "\u0410": "A", "\u0412": "B", "\u0415": "E", "\u041a": "K",
    "\u041c": "M", "\u041d": "H", "\u041e": "O", "\u0420": "P",
    "\u0421": "C", "\u0422": "T", "\u0425": "X",
})
_PRESERVE_CHARS = frozenset("ıİğĞşŞöÖüÜ")

def normalize_text(text):
    text = unicodedata.normalize("NFC", text)
    parts = []
    for c in text:
        if c in _PRESERVE_CHARS:
            parts.append(c)
        else:
            decomposed = unicodedata.normalize("NFKD", c)
            parts.append("".join(
                ch for ch in decomposed
                if unicodedata.category(ch) != "Mn"
            ))
    return "".join(parts).translate(_HOMOGLYPH_MAP)

def moderate_prompt(text, classifier, embedder, threshold=0.5):
    normalized = normalize_text(text)
    embedding = embedder.encode(
        ["query: " + normalized], convert_to_numpy=True
    )
    prob = classifier.predict_proba(embedding)[0, 1]
    return {
        'score': float(prob),
        'flagged': prob >= threshold,
    }

result = moderate_prompt("How do I make a bomb?", classifier, embedder)
print(f"Score: {result['score']:.3f} | Flagged: {result['flagged']}")

Multi-Topic Classification

from pathlib import Path

def load_all_classifiers(model_dir='classifiers'):
    classifiers = {}
    for f in Path(model_dir).glob('*.joblib'):
        classifiers[f.stem] = joblib.load(f)
    return classifiers

def moderate_all_topics(text, classifiers, embedder, threshold=0.5):
    # Apply the same normalization and "query: " prefix as moderate_prompt,
    # so inference matches the training preprocessing
    normalized = normalize_text(text)
    embedding = embedder.encode(
        ["query: " + normalized], convert_to_numpy=True
    )
    results = {}
    for topic, clf in classifiers.items():
        prob = clf.predict_proba(embedding)[0, 1]
        results[topic] = {'score': float(prob), 'flagged': prob >= threshold}
    return results
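
Since per-topic F1 varies (compare politics at 0.9701 with violence_harm at 0.8886), a single 0.5 cutoff may not suit every category. The sketch below, a hypothetical variant not shipped with the model, applies a per-topic threshold map with a shared fallback; the scores shown are illustrative.

```python
DEFAULT_THRESHOLD = 0.5

def apply_thresholds(scores, thresholds=None):
    """Flag each topic using its own threshold, falling back to the default.

    scores: {topic: probability}; thresholds: {topic: threshold}.
    """
    thresholds = thresholds or {}
    return {
        topic: {
            "score": score,
            "flagged": score >= thresholds.get(topic, DEFAULT_THRESHOLD),
        }
        for topic, score in scores.items()
    }

# Illustrative scores, with a stricter cutoff for violence_harm
scores = {"violence_harm": 0.42, "politics": 0.55}
out = apply_thresholds(scores, {"violence_harm": 0.3})
```

Lowering a topic's threshold trades precision for recall, which may be the right trade for safety-critical categories.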

Training Details

Architecture

  1. Embedder: intfloat/multilingual-e5-small

    • Generates 384-dimensional embeddings
    • Pretrained on large-scale semantic similarity tasks
  2. Classifiers: LightGBM with isotonic calibration

    • Binary classification per topic
    • Gradient boosting decision trees
    • Calibrated probabilities for reliable confidence scores
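
The card does not ship calibration code, but as a rough illustration of what isotonic calibration does, here is a minimal pool-adjacent-violators (PAV) fit on toy scores: it replaces raw label values with the closest non-decreasing sequence of averages, yielding a monotone score-to-probability mapping. In practice this step would be handled by a library such as scikit-learn (e.g. `IsotonicRegression`) wrapped around the LightGBM classifiers, not hand-rolled.

```python
def pav_fit(labels):
    """Fit isotonic (non-decreasing) calibrated values to binary labels
    via pool-adjacent-violators. `labels` must already be ordered by
    raw classifier score; returns one calibrated probability per point.
    """
    # Each block tracks [sum_of_labels, count]
    blocks = [[y, 1] for y in labels]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] / blocks[i][1] > blocks[i + 1][0] / blocks[i + 1][1]:
            # Monotonicity violated: merge the two blocks and step back
            blocks[i][0] += blocks[i + 1][0]
            blocks[i][1] += blocks[i + 1][1]
            del blocks[i + 1]
            if i > 0:
                i -= 1
        else:
            i += 1
    out = []
    for total, count in blocks:
        out.extend([total / count] * count)
    return out

# Toy labels sorted by raw score; the dip at position 2 gets pooled away
calibrated = pav_fit([0, 1, 0, 1, 1])  # [0.0, 0.5, 0.5, 1.0, 1.0]
```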

Training Procedure

  1. Data Collection: Synthetic generation combined with relevant HuggingFace datasets
  2. Embedding Generation: Batch processing with intfloat/multilingual-e5-small
  3. Model Training: LightGBM with ROC-AUC optimization
  4. Calibration: Isotonic calibration for probability reliability
  5. Validation: Hold-out validation set evaluation
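
Step 5 evaluates ROC-AUC on a hold-out set. ROC-AUC has a convenient rank-based reading: it is the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties counting half). A minimal stdlib-only sketch of that computation, useful for sanity-checking a small hold-out split:

```python
def roc_auc(labels, scores):
    """ROC-AUC as the rank probability that a positive outranks a
    negative (ties count as 0.5). O(P*N); fine for small hold-out sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy hold-out set: one positive is ranked below one negative
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```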

Limitations

  • Embedding dependency: Requires sentence-transformers for inference
  • Fixed categories: Only detects the 13 trained categories
  • Language coverage: Best performance on 6 trained languages
  • Context window: Optimized for single prompts, not conversations
  • Evolving content: New slang or evasion techniques may reduce accuracy

Ethical Considerations

  • Use as a tool to assist, not replace, human judgment
  • Inform users when content moderation is applied
  • Provide mechanisms to contest moderation decisions
  • Monitor for bias, accuracy, and unintended consequences

Citation

@misc{neuraltrust-nt-prompt-moderator-v1,
  author = {NeuralTrust},
  title = {nt-prompt-moderator-v1: Multi-Topic Prompt Moderation System},
  year = {2026},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/NeuralTrust/nt-prompt-moderator-v1}},
}

Version History

v1 (2026-01)

  • Initial release with 13 topic classifiers
  • Average ROC-AUC: 0.9911