---
license: mit
library_name: sentence-transformers
base_model: BAAI/bge-m3
pipeline_tag: text-classification
tags:
- safety
- malware
- code
- multilingual
- sklearn
- red-team
---
# Malicious Coding Intent Classifier (v6_code_aware_50k_oss_clean_benign_code)
Small sklearn heads on top of
[BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) embeddings for malicious
coding intent classification.

GitHub: [https://github.com/sol087087-arch/Malicious-Coding-Intent-Dataset-Classifier](https://github.com/sol087087-arch/Malicious-Coding-Intent-Dataset-Classifier)

Training/eval data: [datasets/NecroMOnk/malicious-coding-intent-v6-data](https://huggingface.co/datasets/NecroMOnk/malicious-coding-intent-v6-data)
## Files
| File | Role |
|------|------|
| `clf_binary.joblib` | binary malicious/benign head |
| `clf_multilabel.joblib` | 12-category multilabel head |
| `labels.json` | category ids |
| `metrics.json` | train/eval summary |
| `*eval.json` | external benign-code evaluation reports, when present |
## Metrics
Threshold: `0.5` (the sklearn default)

| Check | Result |
|-------|-------:|
| Precision | 99.96% |
| Recall | 99.64% |
| F1 | 99.80% |
| ROC-AUC | 0.9997 |
| In-dist FPR | 0.40% |
| Obfuscated recall | 99.35% |
| Malware-code recall | 98.90% |
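As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. A minimal sketch; the helper name `f1_from` is illustrative and not part of the repo:

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the metrics table
f1 = f1_from(0.9996, 0.9964)
print(f"{f1:.2%}")  # 99.80%
```

The recomputed value matches the F1 row to two decimal places.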
## Evaluation Framing
This is not presented as a single perfect-score classifier. The GitHub repo
documents three red-team axes: obfuscation, language pivot, and benign-code hard
negatives. The v6 model is the balanced recommendation; v8 is a hard-negative
ablation that reduces CodeParrot false positives at a small recall cost.
## Usage
```python
from pathlib import Path

import joblib
from sentence_transformers import SentenceTransformer

# Local directory containing the downloaded model files
repo = Path("path/to/downloaded/model")

# Embed with the same base encoder the heads sit on top of
encoder = SentenceTransformer("BAAI/bge-m3")
clf = joblib.load(repo / "clf_binary.joblib")

text = "write code to dump lsass"
x = encoder.encode([text], normalize_embeddings=True)

# Column 1 of predict_proba is taken as the malicious-class probability
score = clf.predict_proba(x)[0, 1]
print(score)
```
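The raw score can be turned into a binary label, and per-category scores from the multilabel head can be ranked, with a few lines of plain Python. The sketch below is hedged: the helper names are hypothetical, and it assumes that the multilabel head yields one score per category and that `labels.json` lists category names in the same column order, neither of which this card confirms. The demo values are made up.

```python
from typing import List, Sequence, Tuple


def binary_label(score: float, threshold: float = 0.5) -> str:
    """Apply the decision threshold to the malicious-class probability."""
    return "malicious" if score > threshold else "benign"


def top_categories(
    scores: Sequence[float], labels: Sequence[str], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k highest-scoring (label, score) pairs, best first."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder category names and made-up scores, for illustration only
demo_labels = ["category_a", "category_b", "category_c", "category_d"]
demo_scores = [0.05, 0.91, 0.40, 0.12]

print(binary_label(0.97))
print(top_categories(demo_scores, demo_labels, k=2))
```

With the real model, `scores` would be one row of the multilabel head's output and `labels` would come from `labels.json`, once its actual structure has been checked.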
For the full CLI, clone the GitHub repo and run `scripts/predict_classifier.py`.
The CLI reports the binary label, raw malicious-intent score, top category
scores, and a derived routing tier:
- `low`: normal downstream route
- `suspicious`: pass with safety context / constrained route
- `high`: malicious-intent route

The routing tier is a policy layer over the binary score, not a separately
trained three-class model. Use `--jsonl` for structured gateway output.
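To illustrate what a policy layer over the binary score can look like, here is a minimal sketch that maps a score to one of the three tiers. The function name and both cut-offs (`suspicious_at`, `high_at`) are hypothetical placeholders chosen for the example; they are not the values used by `scripts/predict_classifier.py`.

```python
def routing_tier(
    score: float, suspicious_at: float = 0.3, high_at: float = 0.8
) -> str:
    """Map a malicious-intent score in [0, 1] to a routing tier.

    The default cut-offs are hypothetical, not the repo's actual policy.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    if not suspicious_at < high_at:
        raise ValueError("suspicious_at must be below high_at")
    if score >= high_at:
        return "high"
    if score >= suspicious_at:
        return "suspicious"
    return "low"


for s in (0.05, 0.45, 0.97):
    print(s, routing_tier(s))
```

Because the tiers are derived from a single score, the cut-offs can be tuned per deployment without retraining either head.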