davidmcmahon committed on
Commit 9d962c8 · verified · 1 Parent(s): e1c449d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +5 -63
README.md CHANGED
@@ -5,72 +5,14 @@ tags:
  - safety
  - guardrail
  - content-filtering
- - moderation
- datasets:
- - custom
  license: mit
  ---
 
  # NuGuard - LLM Prompt Safety Classifier
 
- A guardrail classification model to detect and block harmful prompts to LLMs.
+ A machine learning model for detecting potentially harmful prompts.
 
- ## Model Description
- This model is designed to identify potentially harmful or malicious prompts sent to language models. It uses a combination of keyword detection and text pattern recognition to flag content that might:
- 
- - Request private or sensitive information
- - Contain harmful content
- - Attempt to bypass security measures
- 
- ## Usage
- ```python
- import joblib
- import json
- import numpy as np
- 
- # Load model components
- classifier = joblib.load("classifier.joblib")
- vectorizer = joblib.load("vectorizer.joblib")
- with open("features.json", "r") as f:
-     feature_names = json.load(f)
- 
- # Function to predict if a prompt is malicious
- def predict_prompt(prompt, threshold=0.5):
-     # Preprocess
-     clean_prompt = " ".join(str(prompt).lower().split())
- 
-     # Extract features
-     features = []
-     features.append(len(clean_prompt.split()))  # word_count
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['password', 'credential', 'login', 'username', 'authentication', 'account']) else 0)
-     features.append(1 if any(pattern in prompt.lower() for pattern in
-         ['provide me', 'share with me', 'give me', 'send me', 'tell me']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['hack', 'exploit', 'vulnerability', 'bypass', 'attack', 'security', 'breach', 'malware']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['personal', 'address', 'email', 'private', 'contact', 'phone', 'details', 'information']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['admin', 'administrator', 'root', 'superuser', 'system']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['kill', 'harm', 'hurt', 'murder', 'weapon', 'bomb', 'destroy']) else 0)
- 
-     # Vectorize text
-     text_vector = vectorizer.transform([clean_prompt])
- 
-     # Combine features
-     features_array = np.array([features])
-     X_combined = np.hstack((text_vector.toarray(), features_array))
- 
-     # Predict
-     prediction = classifier.predict(X_combined)[0]
-     probability = classifier.predict_proba(X_combined)[0, 1]
- 
-     return {
-         'is_malicious': bool(prediction),
-         'probability': float(probability),
-         'should_block': probability >= threshold
-     }
- ```
- 
- Learn more about the project at [https://github.com/davidmcmahon/nuguard](https://github.com/davidmcmahon/nuguard)
+ ## Model Details
+ - Detects malicious content
+ - Uses text and feature-based classification
+ - Scikit-learn 1.6.1 compatible
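For reference, the heuristic feature extraction in the removed Usage snippet can be sketched as a self-contained function. The keyword lists are copied verbatim from the removed README code; the trained classifier, the TF-IDF vectorizer, and `features.json` are assumed to ship with the model and are not reproduced here, so this sketch covers only the hand-crafted features that get stacked alongside the text vector.

```python
# Sketch of the heuristic features from the removed Usage snippet.
# Keyword lists are taken verbatim from the original README; the trained
# classifier/vectorizer artifacts (classifier.joblib, vectorizer.joblib)
# are assumed to be downloaded separately and are not loaded here.

KEYWORD_GROUPS = [
    # credentials
    ['password', 'credential', 'login', 'username', 'authentication', 'account'],
    # solicitation patterns
    ['provide me', 'share with me', 'give me', 'send me', 'tell me'],
    # security/attack terms
    ['hack', 'exploit', 'vulnerability', 'bypass', 'attack', 'security', 'breach', 'malware'],
    # personal data terms
    ['personal', 'address', 'email', 'private', 'contact', 'phone', 'details', 'information'],
    # privileged-account terms
    ['admin', 'administrator', 'root', 'superuser', 'system'],
    # violence terms
    ['kill', 'harm', 'hurt', 'murder', 'weapon', 'bomb', 'destroy'],
]

def extract_features(prompt):
    """Return [word_count, flag_1, ..., flag_6], mirroring the removed code."""
    lowered = str(prompt).lower()
    clean_prompt = " ".join(lowered.split())
    features = [len(clean_prompt.split())]  # word_count
    for group in KEYWORD_GROUPS:
        # 1 if any keyword from the group appears as a substring, else 0
        features.append(1 if any(kw in lowered for kw in group) else 0)
    return features
```

In the removed snippet these seven values are horizontally stacked (`np.hstack`) with the TF-IDF vector of the cleaned prompt before being passed to `classifier.predict` / `classifier.predict_proba`.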