belrem committed · Commit f3cafa4 · verified · 1 Parent(s): e5dbca4

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +89 -0
README.md ADDED
@@ -0,0 +1,89 @@
---
language: en
tags:
- text-classification
- sentence-transformers
- prompt-classification
- ai-safety
- llm
license: apache-2.0
datasets:
- belrem/llm-prompt-intent
---

# LLM Prompt Intent Classifier

Classifies user prompts sent to LLMs into four intent categories using
[all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
sentence embeddings and a **Logistic Regression** classification head.

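
The two-stage design above (frozen sentence embeddings, then a linear head) can be sketched as follows. This is a minimal illustration, not the author's training script: the 80/20 stratified split, `max_iter=1000`, the seed, and the helper name `train_head` are all assumptions, and random 384-dimensional vectors stand in for real all-MiniLM-L6-v2 embeddings so the block runs without downloading a model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def train_head(embeddings, label_ids, seed=0):
    """Fit a logistic-regression head on precomputed embeddings.

    The split ratio and hyperparameters are illustrative assumptions.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, label_ids, test_size=0.2, stratify=label_ids, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
    return clf, macro_f1


# Synthetic stand-in data: four clusters in 384 dimensions, the embedding
# size of all-MiniLM-L6-v2. Real usage would pass embedder.encode(prompts).
rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 384))
label_ids = np.repeat(np.arange(4), 50)
embeddings = centers[label_ids] + 0.5 * rng.normal(size=(200, 384))

clf, macro_f1 = train_head(embeddings, label_ids)
print(f"macro F1 on synthetic data: {macro_f1:.2f}")
```

With real data, `embeddings` would come from `SentenceTransformer.encode` and `label_ids` from the dataset's integer labels; the synthetic score says nothing about the reported results.
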
## Labels

| ID | Label | Description |
|----|-------|-------------|
| 0 | `creative` | Fiction, brainstorming, roleplay, poetry |
| 1 | `informational` | Factual questions, explanations, definitions |
| 2 | `task` | Code, translation, summarisation, editing |
| 3 | `adversarial` | Jailbreaks, prompt injection, manipulation |

## Classifier comparison

| Classifier | Accuracy | F1 macro | F1 weighted |
|---|---|---|---|
| Logistic Regression | 0.8218 | 0.8222 | 0.8209 |
| Linear SVM | 0.7816 | 0.7824 | 0.7816 |
| MLP | 0.8103 | 0.8090 | 0.8100 |

## Best model: Logistic Regression

```
               precision    recall  f1-score   support

     creative       0.78      0.89      0.83        45
informational       0.84      0.77      0.80        48
         task       0.80      0.89      0.85        37
  adversarial       0.87      0.75      0.80        44

     accuracy                           0.82       174
    macro avg       0.82      0.83      0.82       174
 weighted avg       0.83      0.82      0.82       174
```

## Confusion matrix (best model)

```
                          Predicted →
True ↓         creative  info  task  adversarial
creative             40     2     3            0
informational         3    37     5            3
task                  0     2    33            2
adversarial           8     3     0           33
```

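
As a cross-check, the headline metrics can be recomputed from the confusion matrix alone. The sketch below uses only the standard library and takes rows as true labels and columns as predictions, which matches the per-class support counts in the report; the helper name `per_class_scores` is illustrative.

```python
LABELS = ["creative", "informational", "task", "adversarial"]

# Rows are true labels, columns are predicted labels (copied from the matrix above).
CM = [
    [40, 2, 3, 0],
    [3, 37, 5, 3],
    [0, 2, 33, 2],
    [8, 3, 0, 33],
]


def per_class_scores(cm):
    """Return (precision, recall, f1, support) for each class."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        support = sum(cm[k])                      # all samples whose true label is k
        predicted = sum(cm[i][k] for i in range(n))  # all samples predicted as k
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1, support))
    return scores


scores = per_class_scores(CM)
total = sum(s[3] for s in scores)
accuracy = sum(CM[k][k] for k in range(len(CM))) / total
macro_f1 = sum(s[2] for s in scores) / len(scores)
weighted_f1 = sum(s[2] * s[3] for s in scores) / total

for label, (p, r, f1, support) in zip(LABELS, scores):
    print(f"{label:<14} {p:.2f} {r:.2f} {f1:.2f} {support}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f} weighted_f1={weighted_f1:.4f}")
# → accuracy=0.8218 macro_f1=0.8222 weighted_f1=0.8209
```

These values agree with the Logistic Regression row of the comparison table.
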
## Inference

```python
from sentence_transformers import SentenceTransformer
import joblib
from huggingface_hub import hf_hub_download

# Load the embedding model and download the trained classification head.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
clf_path = hf_hub_download(repo_id="belrem/llm-prompt-intent-classifier", filename="classifier.joblib")
clf = joblib.load(clf_path)

# Label IDs follow the order of the Labels table.
labels = ["creative", "informational", "task", "adversarial"]

prompt = "Write a poem about the ocean."
vec = embedder.encode([prompt])  # array of shape (1, 384)
label_id = int(clf.predict(vec)[0])
print(labels[label_id])  # → creative
```

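
For more than one prompt, the same two calls can be wrapped in a small helper. This is a sketch under assumptions: `classify_prompts` is a hypothetical name, and it expects `embedder` and `clf` to be the objects loaded in the snippet above, that is, anything exposing `encode(list_of_str)` and `predict(vectors)`.

```python
LABELS = ["creative", "informational", "task", "adversarial"]


def classify_prompts(prompts, embedder, clf):
    """Return one (prompt, label) pair per input prompt."""
    if not prompts:
        return []
    vectors = embedder.encode(list(prompts))  # one embedding per prompt
    label_ids = clf.predict(vectors)          # one integer label ID per prompt
    return [(prompt, LABELS[int(i)]) for prompt, i in zip(prompts, label_ids)]
```

Calling `classify_prompts(["Write a poem about the ocean.", "What is a transformer?"], embedder, clf)` encodes both prompts in a single batch, which is usually faster than encoding them one at a time.
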
## Limitations

- Adversarial prompts have the lowest recall (0.75): sophisticated jailbreaks using
  creative or hypothetical framing may be misclassified as `creative` or `task`.
  In the confusion matrix above, 8 of 44 adversarial prompts were predicted as
  `creative` and 3 as `informational`.
- Intent is inherently ambiguous: a prompt can be simultaneously creative and
  a task. The model predicts only the dominant intent.
- Dataset skew: adversarial examples from AdvBench may not reflect real-world
  jailbreak distributions.
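
Because the model reports only the dominant intent, one way to surface ambiguous prompts is to inspect class probabilities and abstain when the top score is low. The sketch below is an illustration, not part of the released model: it assumes the loaded classifier is a scikit-learn `LogisticRegression` (so `clf.predict_proba(vec)[0]` yields four probabilities in label-ID order), and the `0.5` threshold and the `"uncertain"` fallback are hypothetical choices.

```python
LABELS = ["creative", "informational", "task", "adversarial"]


def dominant_intent(probabilities, threshold=0.5):
    """Return (label, confidence), or ("uncertain", confidence) below the threshold."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per label")
    best = max(range(len(LABELS)), key=lambda k: probabilities[k])
    confidence = float(probabilities[best])
    label = LABELS[best] if confidence >= threshold else "uncertain"
    return label, confidence


print(dominant_intent([0.10, 0.20, 0.60, 0.10]))  # → ('task', 0.6)
print(dominant_intent([0.40, 0.35, 0.15, 0.10]))  # → ('uncertain', 0.4)
```

A suitable threshold would need to be tuned on held-out data; the value here is only a placeholder.
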