davidmcmahon committed on
Commit 9d962c8 · verified · 1 Parent(s): e1c449d

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +5 -63
README.md CHANGED
@@ -5,72 +5,14 @@ tags:
  - safety
  - guardrail
  - content-filtering
- - moderation
- datasets:
- - custom
  license: mit
  ---
 
  # NuGuard - LLM Prompt Safety Classifier
 
- A guardrail classification model to detect and block harmful prompts to LLMs.
+ A machine learning model for detecting potentially harmful prompts.
 
- ## Model Description
- This model is designed to identify potentially harmful or malicious prompts sent to language models. It uses a combination of keyword detection and text pattern recognition to flag content that might:
- 
- - Request private or sensitive information
- - Contain harmful content
- - Attempt to bypass security measures
- 
- ## Usage
- ```python
- import joblib
- import json
- import numpy as np
- 
- # Load model components
- classifier = joblib.load("classifier.joblib")
- vectorizer = joblib.load("vectorizer.joblib")
- with open("features.json", "r") as f:
-     feature_names = json.load(f)
- 
- # Function to predict if a prompt is malicious
- def predict_prompt(prompt, threshold=0.5):
-     # Preprocess
-     clean_prompt = " ".join(str(prompt).lower().split())
- 
-     # Extract features
-     features = []
-     features.append(len(clean_prompt.split()))  # word_count
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['password', 'credential', 'login', 'username', 'authentication', 'account']) else 0)
-     features.append(1 if any(pattern in prompt.lower() for pattern in
-         ['provide me', 'share with me', 'give me', 'send me', 'tell me']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['hack', 'exploit', 'vulnerability', 'bypass', 'attack', 'security', 'breach', 'malware']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['personal', 'address', 'email', 'private', 'contact', 'phone', 'details', 'information']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['admin', 'administrator', 'root', 'superuser', 'system']) else 0)
-     features.append(1 if any(kw in prompt.lower() for kw in
-         ['kill', 'harm', 'hurt', 'murder', 'weapon', 'bomb', 'destroy']) else 0)
- 
-     # Vectorize text
-     text_vector = vectorizer.transform([clean_prompt])
- 
-     # Combine features
-     features_array = np.array([features])
-     X_combined = np.hstack((text_vector.toarray(), features_array))
- 
-     # Predict
-     prediction = classifier.predict(X_combined)[0]
-     probability = classifier.predict_proba(X_combined)[0, 1]
- 
-     return {
-         'is_malicious': bool(prediction),
-         'probability': float(probability),
-         'should_block': probability >= threshold
-     }
- ```
- 
- Learn more about the project at [https://github.com/davidmcmahon/nuguard](https://github.com/davidmcmahon/nuguard)
+ ## Model Details
+ - Detects malicious content
+ - Uses text and feature-based classification
+ - Scikit-learn 1.6.1 compatible
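For reference, the heuristic feature extraction in the removed Usage snippet can be sketched as a self-contained function. The keyword lists are copied verbatim from the removed README code; the trained classifier, the TF-IDF vectorizer, and `features.json` are assumed to ship with the model and are not reproduced here, so this sketch covers only the hand-crafted features that get stacked alongside the text vector.

```python
# Sketch of the heuristic features from the removed Usage snippet.
# Keyword lists are taken verbatim from the original README; the trained
# classifier/vectorizer artifacts (classifier.joblib, vectorizer.joblib)
# are assumed to be downloaded separately and are not loaded here.

KEYWORD_GROUPS = [
    # credentials
    ['password', 'credential', 'login', 'username', 'authentication', 'account'],
    # solicitation patterns
    ['provide me', 'share with me', 'give me', 'send me', 'tell me'],
    # security/attack terms
    ['hack', 'exploit', 'vulnerability', 'bypass', 'attack', 'security', 'breach', 'malware'],
    # personal data terms
    ['personal', 'address', 'email', 'private', 'contact', 'phone', 'details', 'information'],
    # privileged-account terms
    ['admin', 'administrator', 'root', 'superuser', 'system'],
    # violence terms
    ['kill', 'harm', 'hurt', 'murder', 'weapon', 'bomb', 'destroy'],
]

def extract_features(prompt):
    """Return [word_count, flag_1, ..., flag_6], mirroring the removed code."""
    lowered = str(prompt).lower()
    clean_prompt = " ".join(lowered.split())
    features = [len(clean_prompt.split())]  # word_count
    for group in KEYWORD_GROUPS:
        # 1 if any keyword from the group appears as a substring, else 0
        features.append(1 if any(kw in lowered for kw in group) else 0)
    return features
```

In the removed snippet these seven values are horizontally stacked (`np.hstack`) with the TF-IDF vector of the cleaned prompt before being passed to `classifier.predict` / `classifier.predict_proba`.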