# LLM Defence Scanner – LFM2.5 1.2B
A fine-tuned LFM2.5-1.2B-Instruct model for AI guardrail classification. Single LLM, six categories, structured JSON verdicts. Works as an input guard (scanning user prompts before they reach an LLM) and as an output guard (scanning LLM responses before they reach the user).

Runnable end-to-end with all six categories.
## Capabilities
| Category | What it catches |
|---|---|
| `pii` | names, emails, phones, IDs, cards, addresses, credentials |
| `prompt_injection` | jailbreaks, instruction overrides, role-hijack attempts |
| `topic_ban` | tenant-configured restricted topics |
| `competitor` | tenant-configured competitor mentions |
| `code` | code snippets in disallowed programming languages |
| `malicious_url` | phishing, typosquats, known-bad domains |
Categories are scoped per request via `applied_policies` – different tenants enable different scanners with different parameters.
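The exact shape of `applied_policies` is not documented here; the fragment below is an illustrative sketch of what a per-tenant request scope might look like (all field names are assumptions):

```json
{
  "applied_policies": {
    "pii": {"enabled": true},
    "prompt_injection": {"enabled": true},
    "topic_ban": {"enabled": true, "topics": ["gambling", "weapons"]},
    "competitor": {"enabled": true, "competitors": ["AcmeCorp"]},
    "code": {"enabled": true, "code_allowlist": ["python"]},
    "malicious_url": {"enabled": false}
  }
}
```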
## Benchmarks
Held-out test set: 841 records, English only. Model output is post-processed deterministically (`apply_policy_postprocess`) to apply the code allow-list and competitor-list filters.
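The source names `apply_policy_postprocess` but does not show its implementation; the sketch below is a plausible reconstruction of the two filters it describes, with the policy field names (`code_allowlist`, `competitors`) as assumptions:

```python
def apply_policy_postprocess(verdict: dict, policy: dict) -> dict:
    """Deterministically filter raw model matches against tenant policy.

    A sketch: `policy` shape ({"code_allowlist": [...], "competitors": [...]})
    is an assumption, not from the source.
    """
    allowed = {lang.lower() for lang in policy.get("code_allowlist", [])}
    competitors = {c.lower() for c in policy.get("competitors", [])}
    for cat in verdict["categories"]:
        if cat["name"] == "code":
            # Drop code matches whose language the tenant explicitly allows.
            cat["matches"] = [m for m in cat["matches"]
                              if m["kind"].lower() not in allowed]
        elif cat["name"] == "competitor":
            # Keep only mentions on the tenant's competitor list.
            cat["matches"] = [m for m in cat["matches"]
                              if m["text"].lower() in competitors]
        cat["matched"] = bool(cat["matches"])
    verdict["overall_blocked"] = any(c["matched"] for c in verdict["categories"])
    return verdict
```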
| metric | value |
|---|---|
| mean_total_score | 0.991 |
| schema_validity_rate | 1.000 |
| exact_category_set_rate | 0.999 |
| mean_matched_score | 0.990 |
| mean_span_score | 0.961 |
| exact_overall_blocked_rate | 0.999 |
| hard_fail_rate | 0.001 |
Per-category matched-accuracy:
| category | matched-acc | span F1 |
|---|---|---|
| pii | 99.5% | 0.913 |
| competitor | 99.7% | 0.995 |
| topic_ban | 98.1% | 0.937 |
| code | 100.0% | 0.982 |
| prompt_injection | 99.3% | 0.873 |
| malicious_url | 96.4% | 0.907 |
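The card does not define how span F1 is computed; a common exact-match formulation (each predicted span counts only if it matches a gold span verbatim) can be sketched as:

```python
def span_f1(predicted: set[str], gold: set[str]) -> float:
    """Exact-match span F1 over verbatim match strings (an assumed definition)."""
    if not predicted and not gold:
        return 1.0  # both empty: perfect agreement
    tp = len(predicted & gold)  # true positives: exact string matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```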
## Inference
End-to-end usage (model load, prompt rendering, post-processor, six category examples for both the input leg and the output leg) is in `release_v1_demo.ipynb`. Download and run it in Jupyter or Colab.
For production serving with vLLM:
```shell
vllm serve FrameByFrame/llm-defence-scanner-lfm2.5-1.2b \
  --served-model-name llm-defence-scanner \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --no-enable-prefix-caching
```
p50 latency is ~360 ms on an RTX PRO 6000 Blackwell; p95 is ~870 ms.
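Once served, the model is reachable through vLLM's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request body can be built as below; the real system prompt (policy rendering) lives in the notebook, so the message content here is a placeholder assumption:

```python
import json

# Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint.
# The rendered scanner/policy prompt is defined in release_v1_demo.ipynb;
# the content strings below are placeholders.
payload = {
    "model": "llm-defence-scanner",  # matches --served-model-name above
    "messages": [
        {"role": "system", "content": "<rendered scanner policy prompt>"},
        {"role": "user", "content": "My email is alice.tan@example.com"},
    ],
    "temperature": 0.0,  # deterministic verdicts
    "max_tokens": 512,
}
body = json.dumps(payload)
# POST `body` to http://localhost:8000/v1/chat/completions (e.g. with `requests`).
```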
## Output schema
```json
{
  "overall_blocked": true,
  "severity": "high",
  "language": {"dominant": "en", "alternates": [], "script": "latin", "code_mixed": false},
  "scenario": {"name": "banking", "profile": "retail_banking_kyc"},
  "categories": [
    {
      "name": "pii",
      "matched": true,
      "matches": [
        {"text": "alice.tan@example.com", "kind": "email"},
        {"text": "+1-415-555-2244", "kind": "phone"}
      ]
    }
  ],
  "reason": "The input contains personally identifiable information."
}
```
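A gateway has to turn the raw completion into this schema before acting on it. A minimal parser under assumptions (the fail-closed fallback is a design choice, not from the source):

```python
import json

REQUIRED_KEYS = {"overall_blocked", "severity", "language",
                 "scenario", "categories", "reason"}

def parse_verdict(raw: str) -> dict:
    """Parse a model completion into a verdict dict, failing closed on bad JSON."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        # Fail closed: treat unparseable output as blocked (an assumed policy).
        return {"overall_blocked": True, "severity": "high",
                "categories": [], "reason": "unparseable model output"}
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"verdict missing keys: {sorted(missing)}")
    return verdict
```

Failing closed means a malformed verdict blocks the request rather than silently allowing it through.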
| field | role |
|---|---|
| `overall_blocked` | gateway decision: refuse or allow |
| `categories[].matched` | per-category fire flag |
| `categories[].matches[].text` | verbatim substring – ready to redact |
| `categories[].matches[].kind` | subtype label (e.g. `email`, `phone`, `python`, `phishing`) |
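Because `matches[].text` is a verbatim substring of the scanned text, redaction is a straight string replace; the `[KIND]` tag format below is an illustrative choice, not prescribed by the source:

```python
def redact(text: str, verdict: dict) -> str:
    """Replace every matched span with a [KIND] tag (tag format is illustrative)."""
    for cat in verdict["categories"]:
        for m in cat["matches"]:
            text = text.replace(m["text"], f"[{m['kind'].upper()}]")
    return text
```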
## Training
- Base: LiquidAI/LFM2.5-1.2B-Instruct
- Method: LoRA (r=16, α=16) on q/k/v/o + w1/w2/w3, then merged
- Dataset (v5): ~11k records spanning synthetic multi-category inputs, ai4privacy/pii-masking-65k PII spans, bantopics safety pairs, embedded prompt-injection synthesis. English only, 8-gram leakage-filtered against the test split.
- Hyperparameters: effective batch=32, LR=2e-4, 2 epochs
- Hardware: single RTX PRO 6000 Blackwell (96GB)
- Final eval_loss: 0.0015
## Citation
```bibtex
@misc{mariappan2026llmdefence,
  author       = {Mariappan, Vijayachandran},
  title        = {LLM Defence Scanner -- LFM2.5 1.2B},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/FrameByFrame/llm-defence-scanner-lfm2.5-1.2b}}
}
```
## License
Based on LFM2.5 and subject to the LFM Open License v1.0.