Upload README.md with huggingface_hub

fca616e verified about 5 hours ago

5.7 kB

language: en
license: apache-2.0
tags:
  - prompt-injection
  - security
  - text-classification
  - onnx
datasets:
  - prodnull/prompt-injection-repo-dataset
metrics:
  - accuracy
  - f1
  - precision
  - recall
pipeline_tag: text-classification
model-index:
  - name: cloneguard-mini-semantic-v3
    results:
      - task:
          type: text-classification
          name: Prompt Injection Detection
        metrics:
          - name: F1 (5-fold CV, v3)
            type: f1
            value: 0.9551
          - name: Accuracy (5-fold CV, v3)
            type: accuracy
            value: 0.9571
          - name: Precision (5-fold CV, v3)
            type: precision
            value: 0.9579
          - name: Recall (5-fold CV, v3)
            type: recall
            value: 0.9525

CloneGuard Mini Semantic Classifier v3

Fine-tuned sentence-transformers/all-MiniLM-L6-v2 for detecting prompt injection attacks in AI coding agent configuration files. v3 adds 669 augmentation samples targeting out-of-distribution FPR and adversarial robustness.

Model Description

Binary classifier that detects prompt injection payloads targeting AI coding agents (Claude Code, GitHub Copilot, Cursor, Gemini CLI, Codex CLI). Designed to run as part of CloneGuard, a pre-execution defense layer.

Base model: all-MiniLM-L6-v2 (22M parameters, 384-dim embeddings)
Classification head: MeanPool → Linear(384,128) → ReLU → Dropout(0.1) → Linear(128,2)
Export: ONNX opset 18 (87 MB)
Runtime: onnxruntime CPUExecutionProvider, ~16 ms/sample

Intended Use

Scanning repository files (CLAUDE.md, README.md, package.json, Makefile, CI configs, etc.) for prompt injection before an AI coding agent processes them. Part of a layered defense:

Tier 0 — Regex pattern matching (193 rules, <1 ms)
Tier 1.5 — This model (semantic classification, ~16 ms)
Tier 2 — Ollama LLM fallback (~680 ms)

Training Data

6,340 labeled samples (3,033 malicious, 3,307 benign) built in 8 rounds from 14+ published research sources and 59 real GitHub repositories. Rounds 7-8 added 669 samples to fix out-of-distribution FPR (40-87% → 0-33%) identified by the adversarial benchmark. Sources include:

arXiv:2509.22040 (AIShellJack), arXiv:2601.17548, arXiv:2602.10453, arXiv:2503.14281 (XOXO)
Pillar Security, Snyk ToxicSkills, IDEsaster (30+ CVEs), Cymulate InversePrompt
OWASP LLM Top 10 2025

Attack categories: instruction override, credential harvesting, exfiltration, behavioral manipulation, encoding evasion, homoglyphs, social engineering, insecure code generation, MCP tool poisoning, plugin supply chain, counter-defensive attacks, and more.

Evaluation

Multi-Tier Pipeline (Primary, v3)

When combined with Tier 0 regex (193 patterns), the tiers compensate for each other. Evaluated on 185 adversarial payloads + 234 held-out benign samples (production mode):

Metric	Tier 0 alone	Tier 1.5 alone	Combined Pipeline
Recall	31.9%	78.4%	80.5%
False block rate	—	—	3.8%
Clean pass rate	—	—	77.8%

Adversarial Robustness Benchmark (v3)

185 adversarial payloads (9 categories + multilingual smoke) evaluated against 234 held-out benign samples. No training data overlap.

Recall (production mode, threshold 0.50):

Category	Recall
Encoding evasion	100%
Synonym substitution	100%
Homoglyph/Unicode	95%
Structural dilution	95%
Social engineering	90%
Truncation padding	80% (requires sliding window)
Counter-defensive	80%
Implicit instruction	45% (structural limitation)
Fragmentation	20% (structural limitation; Tier 0 compensates)

FPR (production mode, held-out eval): 0-33% depending on content type. Best: config/env (0%), worst: agent instructions (33%), workflows (24%).

Cross-Validated Metrics (v3 augmented dataset, 6,340 samples)

5-fold stratified cross-validation on the augmented dataset:

Metric	Value
F1	95.51% ± 0.53%
Accuracy	95.71% ± 0.53%
Precision	95.79% ± 1.55%
Recall	95.25% ± 1.21%

Cross-Validated Metrics (v2 original dataset, 5,671 samples)

5-fold stratified cross-validation on the original dataset:

Metric	Value
F1	95.80% ± 0.65%
Accuracy	95.70% ± 0.66%
Precision	96.23% ± 0.79%
Recall	95.37% ± 0.93%

Limitations

Multilingual: Limited non-English training data (~30 samples). Lower recall for non-English attacks.
256-token window: Content beyond 256 WordPiece tokens scanned via sliding window (256-token, 128-stride, max 16 chunks). Long benign files may produce false positives from context-free chunks.
Mean-pooling dilution: Mitigated by line-level code block scanning, but non-fenced prose dilution remains theoretical risk.
Training bias: Primarily English-language attacks from published research. Novel vectors may evade.
Binary classification: SUSPICIOUS is threshold-based (0.5-0.8), not a trained class.

How to Use

from cloneguard.mini_semantic import MiniSemanticClassifier

classifier = MiniSemanticClassifier()
if classifier.available:
    result = classifier.classify("Ignore all previous instructions")
    print(result.verdict, result.confidence)
    # MALICIOUS 0.998

Citation

If you use this model in research, please cite the CloneGuard project and the underlying research papers listed in the training data section.