IntentGuard: Building a Production-Grade Vertical Intent Classifier for LLM Safety

Community Article · Published March 12, 2026

A deep technical dive into three-way classification, margin-based decision logic, adversarial input defense, and calibrated confidence scoring — all in a 22M-parameter model that runs in under 20ms on CPU.

[Figure: three-way classification overview]


When you deploy an LLM into a regulated domain — financial services, healthcare, legal — one of the first compliance questions is deceptively simple: "What topics will this system answer?"

Most teams discover the hard way that a general-purpose LLM will happily answer anything. A financial chatbot will offer medical advice. A healthcare navigator will discuss stock picks. These aren't adversarial attacks — they're the natural consequence of deploying a model trained on the entire internet into a vertical-specific context without a topical boundary layer.

IntentGuard is an open-source intent classification system that solves this. It sits between users and your LLM, classifying every incoming message as ALLOW, DENY, or ABSTAIN before it reaches the downstream model. The system is built on DeBERTa-v3-xsmall, exported to ONNX with INT8 quantization, and designed to add less than 20ms of latency at the p99 on commodity hardware.

This article walks through the architecture, the decision logic, the adversarial defenses, and the evaluation methodology in detail — with enough code and configuration examples to reproduce everything.


Table of Contents

  1. Why Binary Classification Fails at the Margins
  2. Three-Way Classification: The Abstain Class
  3. The Model: DeBERTa-v3-xsmall with Vertical Context Priming
  4. Margin-Based Decision Logic
  5. Input Normalization as Adversarial Defense
  6. Temperature Scaling and Probability Calibration
  7. Policy-Driven Configuration
  8. Multi-Vertical Routing
  9. Shipping Gates: When Is a Model Ready for Production?
  10. Deployment Architecture
  11. Performance Benchmarks
  12. Limitations and Future Work

Why Binary Classification Fails at the Margins

The naive approach to topical guardrails is a binary classifier: on-topic or off-topic. Train a model, pick a threshold, deploy. This works in clean conditions but fails precisely where you need it most — at the decision boundary.

Consider these queries sent to a financial services chatbot:

| Query | Binary classifier | Correct decision |
|---|---|---|
| "What are current mortgage rates?" | ON-TOPIC ✓ | ALLOW |
| "Best recipe for chocolate cake" | OFF-TOPIC ✓ | DENY |
| "How does HIPAA affect medical billing disputes?" | ??? | Depends on context |
| "What are HSA contribution limits for 2025?" | ??? | ALLOW (financial planning) |
| "Symptoms of a drug interaction" (preceded by HSA discussion) | ??? | DENY (medical, not financial) |

The third and fourth queries are where binary classifiers break. "HIPAA billing disputes" touches healthcare and finance. A strict topic filter trained on financial queries might block it — a false block that frustrates a legitimate user. An HSA question is clearly financial, but a poorly-trained model might flag "contribution limits" as non-financial jargon.

The fifth query is worse: an off-topic medical question embedded in financial conversation context. A binary classifier that considers conversation history might pass it because of the surrounding financial framing.

Both error types matter, but they matter differently. A false block (denying a legitimate query) frustrates users. A false pass (allowing an off-topic query) exposes the business to liability — a financial chatbot dispensing medical advice is a compliance nightmare. Binary classifiers optimize for a single threshold that trades off both error types against each other. For regulated industries, that's the wrong trade-off.


Three-Way Classification: The Abstain Class

IntentGuard adds a third classification output: ABSTAIN.

ALLOW   → Forward to LLM, message is on-topic
DENY    → Return polite refusal with topic suggestions
ABSTAIN → Ask for clarification before deciding

The abstain class captures queries where the model's confidence doesn't cross either decision threshold. Instead of forcing a binary choice on uncertain inputs, the system surfaces the uncertainty to the user: "Can you clarify how this relates to financial services?"

This maps naturally to a UX pattern. The system doesn't fail closed (block everything uncertain) or fail open (pass everything uncertain). It asks. In production, this reduces the legitimate-block rate significantly on borderline queries while maintaining the safety boundary for clearly off-topic inputs.

The three classes are defined per-vertical in a policy configuration:

{
  "labeling_rules": {
    "allow_definition": "Clearly in-scope for financial services without needing more context.",
    "abstain_definition": "Could be in-scope depending on intent. Ask for clarification.",
    "deny_definition": "Clearly out-of-scope even under charitable interpretation."
  }
}

These definitions guide both the synthetic data generation pipeline and human annotators, ensuring consistent labeling across the training set.


The Model: DeBERTa-v3-xsmall with Vertical Context Priming

Why DeBERTa-v3-xsmall?

IntentGuard uses Microsoft's DeBERTa-v3-xsmall — a 22M-parameter encoder model. This was a deliberate engineering choice, not a compromise:

| Model | Parameters | ONNX INT8 size | p99 latency (CPU) | Accuracy |
|---|---|---|---|---|
| DeBERTa-v3-xsmall | 22M | ~10MB | <20ms | 95-98% |
| DeBERTa-v3-base | 86M | ~45MB | >100ms | 96-99% |
| Prompted LLM (GPT-4) | ~1.7T | N/A | 500-2000ms | 98-99% |

Most guardrail deployments operate in a sidecar configuration — the classifier runs on every single request before the LLM sees it. At high throughput, adding 100ms+ of latency is unacceptable. Adding 500ms+ for an LLM-based classifier defeats the purpose. DeBERTa-v3-xsmall fits under 10MB quantized, runs on CPU at p99 < 20ms, and achieves 95-98% accuracy across our evaluation suite. The 1-3% accuracy gap compared to larger models is recovered through margin-based thresholds that route uncertain predictions to ABSTAIN rather than making wrong decisions.

Sentence-Pair Input with Vertical Context

The model uses DeBERTa's sentence-pair input format. The first sequence is the user query; the second is a vertical context string constructed from the policy configuration:

VERTICAL=finance; CONTEXT_VERSION=ctv1;
CORE_TOPICS=[banking,lending,credit,payments,investing,insurance,tax,personal finance,retirement,mortgages,financial planning,budgeting];
CONDITIONAL_ALLOW=[healthcare: only when related to financial planning, insurance, HSA/FSA, medical debt; legal: only when related to financial regulation, contracts, or compliance];
HARD_EXCLUSIONS=[sports,entertainment,cooking,gaming,celebrity gossip,fashion,travel_leisure]

This design means the model doesn't bake scope into its weights — it reads the vertical's scope definition as conditioning context at inference time. The context string is built from the policy JSON, so updating what topics are allowed or excluded can often be done by editing configuration rather than retraining.
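The context string above can be assembled from the policy JSON with a small helper. This is an illustrative sketch, assuming a `build_context` function and a trimmed-down policy dict; the real builder in the IntentGuard repository may differ in details:

```python
def build_context(policy: dict, version: str = "ctv1") -> str:
    """Assemble the second-sequence vertical context string from a policy dict.
    (Sketch; field names follow the policy JSON shown later in the article.)"""
    scope = policy["scope"]
    core = ",".join(scope["core_topics"])
    cond = "; ".join(
        f"{c['topic']}: {c['condition']}" for c in scope.get("conditional_allow", [])
    )
    excl = ",".join(scope["hard_exclusions"])
    return (
        f"VERTICAL={policy['vertical']}; CONTEXT_VERSION={version}; "
        f"CORE_TOPICS=[{core}]; CONDITIONAL_ALLOW=[{cond}]; "
        f"HARD_EXCLUSIONS=[{excl}]"
    )

policy = {
    "vertical": "finance",
    "scope": {
        "core_topics": ["banking", "lending", "credit"],
        "conditional_allow": [
            {"topic": "healthcare",
             "condition": "only when related to HSA/FSA, medical debt, insurance"}
        ],
        "hard_exclusions": ["sports", "cooking"],
    },
}
context = build_context(policy)
```

At inference time this string is passed as the second sequence of the tokenizer's sentence-pair encoding, with the user query as the first.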

Training Configuration

Fine-tuning uses HuggingFace's Trainer with carefully tuned hyperparameters:

# training/train_config.yaml
base_model: microsoft/deberta-v3-xsmall
learning_rate: 2.0e-05
num_train_epochs: 3
per_device_train_batch_size: 8
max_seq_length: 128
warmup_ratio: 0.1
label_smoothing_factor: 0.05
weight_decay: 0.01
freeze_embeddings: true    # Prevent overfitting on small datasets
class_weights: inverse     # Inverse-frequency weighting for class imbalance

Two details worth highlighting:

  • Embedding freezing: The word embedding layer is frozen during fine-tuning. With synthetic training data that may not be perfectly diverse, freezing embeddings prevents the model from overfitting to surface lexical patterns rather than learning semantic scope boundaries.

  • Inverse-frequency class weighting: Training sets typically have more ALLOW examples than DENY or ABSTAIN. Rather than using an explicit cost matrix, we weight the cross-entropy loss inversely proportional to class frequency: weight_i = total / (num_classes × count_i). This naturally upweights underrepresented classes.
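The inverse-frequency formula above is a one-liner in practice. A minimal sketch, assuming string labels and the counts shown in the comment:

```python
from collections import Counter

def inverse_frequency_weights(labels: list[str]) -> dict[str, float]:
    """weight_i = total / (num_classes * count_i), per the formula above."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}

# 600 ALLOW, 300 DENY, 100 ABSTAIN: the majority class is downweighted,
# the rare ABSTAIN class is upweighted
labels = ["allow"] * 600 + ["deny"] * 300 + ["abstain"] * 100
weights = inverse_frequency_weights(labels)
# weights["allow"] ~ 0.56, weights["deny"] ~ 1.11, weights["abstain"] ~ 3.33
```

These weights are then handed to the cross-entropy loss as per-class weights during fine-tuning.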


Margin-Based Decision Logic

[Figure: margin-based decision thresholds]

Standard classifiers take the argmax — the class with the highest logit wins. This ignores how confident the decision is. If the model produces allow=0.52, deny=0.48, the argmax is ALLOW, but the margin is only 0.04 — that's a coin flip, not a confident prediction.

IntentGuard uses margin-based thresholds. A decision requires two conditions:

  1. The winning class must exceed a minimum confidence threshold (τ)
  2. The gap between the winner and the runner-up must exceed a margin (m)

Here's the exact decision logic from classifier.py:

def _apply_thresholds(self, probs, tricks_detected=False):
    """
    ALLOW if: p_allow >= tau_allow AND (p_allow - max(p_deny, p_abstain)) >= margin_allow
    DENY  if: p_deny  >= tau_deny  AND (p_deny  - max(p_allow, p_abstain)) >= margin_deny
    Otherwise: ABSTAIN
    """
    p_allow, p_deny, p_abstain = probs["allow"], probs["deny"], probs["abstain"]
    t = self.policy.thresholds

    # Encoding tricks detected → always abstain
    if tricks_detected:
        return Decision.ABSTAIN, p_abstain

    # Check ALLOW first (bias toward allowing legitimate queries)
    if p_allow >= t.tau_allow and (p_allow - max(p_deny, p_abstain)) >= t.margin_allow:
        return Decision.ALLOW, p_allow

    # Check DENY
    if p_deny >= t.tau_deny and (p_deny - max(p_allow, p_abstain)) >= t.margin_deny:
        return Decision.DENY, p_deny

    # Default to ABSTAIN
    return Decision.ABSTAIN, max(p_abstain, 1.0 - p_allow - p_deny)

Asymmetric Thresholds

Notice the thresholds are asymmetric by default:

{
  "decision": {
    "tau_allow": 0.80,
    "tau_deny": 0.90,
    "margin_allow": 0.10,
    "margin_deny": 0.10
  }
}

The tau_allow is set to 0.80 (80% confidence required to allow) while tau_deny is set to 0.90 (90% confidence required to deny). Both margins require a 10% gap between the winning class and the runner-up.

The DENY threshold (0.90) is higher than the ALLOW threshold (0.80). This is intentional — it's harder to block a user than to let them through. A false block frustrates a legitimate user; a false allow can be caught by other safety layers downstream. This asymmetry encodes a preference for user access over aggressive blocking, while still maintaining a hard safety boundary for clearly off-topic queries.

Worked Examples

| Probabilities | Margin | Decision | Reasoning |
|---|---|---|---|
| allow=0.92, deny=0.04, abstain=0.04 | 0.88 | ALLOW | p_allow (0.92) ≥ τ_allow (0.80) and margin (0.88) ≥ 0.10 |
| allow=0.50, deny=0.45, abstain=0.05 | 0.05 | ABSTAIN | p_allow (0.50) < τ_allow (0.80); margin too small |
| allow=0.10, deny=0.85, abstain=0.05 | 0.75 | ABSTAIN | p_deny (0.85) < τ_deny (0.90); close but not confident enough |
| allow=0.03, deny=0.95, abstain=0.02 | 0.92 | DENY | p_deny (0.95) ≥ τ_deny (0.90) and margin (0.92) ≥ 0.10 |
| allow=0.35, deny=0.33, abstain=0.32 | 0.02 | ABSTAIN | No dominant class; maximum uncertainty |

The thresholds and margins are configurable in the policy JSON, not hardcoded in the model. A consumer-facing chatbot and an internal clinical tool have different tolerances — operators tune behavior without retraining.
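The worked examples above can be checked against a standalone version of the rule. A self-contained sketch using the default thresholds from the policy JSON (the function name `decide` is illustrative):

```python
def decide(p_allow: float, p_deny: float, p_abstain: float,
           tau_allow: float = 0.80, tau_deny: float = 0.90,
           margin_allow: float = 0.10, margin_deny: float = 0.10) -> str:
    """Margin-based three-way decision with the default policy thresholds."""
    if p_allow >= tau_allow and (p_allow - max(p_deny, p_abstain)) >= margin_allow:
        return "ALLOW"
    if p_deny >= tau_deny and (p_deny - max(p_allow, p_abstain)) >= margin_deny:
        return "DENY"
    return "ABSTAIN"

# The five rows from the worked-examples table
assert decide(0.92, 0.04, 0.04) == "ALLOW"
assert decide(0.50, 0.45, 0.05) == "ABSTAIN"   # below tau_allow
assert decide(0.10, 0.85, 0.05) == "ABSTAIN"   # deny below tau_deny
assert decide(0.03, 0.95, 0.02) == "DENY"
assert decide(0.35, 0.33, 0.32) == "ABSTAIN"   # no dominant class
```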


Input Normalization as Adversarial Defense

[Figure: input normalization pipeline]

Before any text reaches the model, it passes through a normalization pipeline designed to reduce the adversarial surface area. This isn't a replacement for adversarial training — it's a deterministic preprocessing step that eliminates entire categories of evasion techniques.

The Normalization Pipeline

import re
import unicodedata

# Module-level patterns. This _ZERO_WIDTH set is abbreviated for readability;
# the full pattern in classifier.py covers 18 invisible-character categories.
_ZERO_WIDTH = re.compile(r"[\u00ad\u200b-\u200f\u2028-\u202e\u2060-\u2064\ufeff]")
_WHITESPACE = re.compile(r"\s+")

def normalize(text: str, max_chars: int = 2000) -> str:
    # 1. Unicode NFKC normalization
    text = unicodedata.normalize("NFKC", text)

    # 2. Strip zero-width and invisible characters
    text = _ZERO_WIDTH.sub("", text)

    # 3. Collapse whitespace
    text = _WHITESPACE.sub(" ", text).strip()

    # 4. Truncate
    if len(text) > max_chars:
        text = text[:max_chars]

    return text

Step 1: Unicode NFKC normalization. This collapses fullwidth characters (ｆｉｎａｎｃｅ → finance), compatibility characters, superscripts, subscripts, and ligatures into their canonical forms. An attacker can't evade classification by writing "ﬁnance" with a ligature instead of "finance" — NFKC maps them to the same string.

Step 2: Zero-width character stripping. The normalization pipeline removes 18 categories of invisible Unicode characters: zero-width spaces, joiners, directional marks, word joiners, invisible mathematical operators, byte-order marks, soft hyphens, and various language-specific filler characters. These are commonly used to break tokenization without visible changes to the text.

Step 3: Whitespace collapse. Runs of spaces, tabs, and newlines are collapsed to a single space. This prevents evasion through whitespace injection that might cause token boundary shifts.
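The three steps can be seen in action with a compressed, self-contained version of the pipeline (the `_ZERO_WIDTH` pattern here is a small illustrative subset, not the full 18-category set):

```python
import re
import unicodedata

_ZERO_WIDTH = re.compile(r"[\u00ad\u200b-\u200f\u2060\ufeff]")  # illustrative subset
_WHITESPACE = re.compile(r"\s+")

def normalize(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # step 1
    text = _ZERO_WIDTH.sub("", text)             # step 2
    return _WHITESPACE.sub(" ", text).strip()    # step 3

assert normalize("ｆｉｎａｎｃｅ") == "finance"        # fullwidth -> ASCII
assert normalize("ﬁnance") == "finance"              # fi ligature (U+FB01) expanded
assert normalize("fin\u200bance") == "finance"       # zero-width space stripped
assert normalize("mortgage \t\n rates") == "mortgage rates"  # whitespace collapsed
```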

Encoding Trick Detection

Beyond normalization, the system detects patterns that suggest encoding-based evasion:

import re

# 20+ consecutive characters from the base64 alphabet
_BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{20,}")

def has_encoding_tricks(text: str) -> bool:
    # Base64 blobs: 20+ consecutive base64-alphabet chars
    if _BASE64_BLOB.search(text):
        return True

    # High non-ASCII ratio in short text (obfuscation, not CJK prose)
    non_ascii = sum(1 for c in text if ord(c) > 127)
    ratio = non_ascii / len(text) if text else 0
    if ratio > 0.6 and len(text) < 200:
        return True

    return False

When encoding tricks are detected, the classification pipeline immediately routes to ABSTAIN regardless of model output. This is a conservative defense — if the input looks like it's been deliberately obfuscated, don't trust the classifier's prediction on it. Ask for clarification instead.

The len(text) < 200 guard is important: legitimate CJK text (Chinese, Japanese, Korean) is naturally high in non-ASCII characters but tends to appear in longer messages. Short bursts of non-ASCII are more likely to be encoding attacks.
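A self-contained sketch makes both heuristics concrete; the `_BASE64_BLOB` pattern here is an illustrative guess at "20+ consecutive base64-alphabet chars" and may differ from the one in the repository:

```python
import re

_BASE64_BLOB = re.compile(r"[A-Za-z0-9+/=]{20,}")  # illustrative pattern

def has_encoding_tricks(text: str) -> bool:
    if _BASE64_BLOB.search(text):
        return True
    non_ascii = sum(1 for c in text if ord(c) > 127)
    ratio = non_ascii / len(text) if text else 0
    return ratio > 0.6 and len(text) < 200

assert has_encoding_tricks("V2hhdCBhcmUgbW9ydGdhZ2UgcmF0ZXM/")   # base64 blob -> abstain
assert not has_encoding_tricks("What are current mortgage rates?")
assert has_encoding_tricks("价格")          # short non-ASCII burst is flagged
assert not has_encoding_tricks("价" * 250)  # longer CJK prose passes the length guard
```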


Temperature Scaling and Probability Calibration

blog-calibration

A model that outputs allow=0.90 should be correct about 90% of the time when it does so. This property — calibration — is critical for margin-based decision logic. If the model is systematically overconfident, the margin thresholds won't work as intended.

The Problem: Neural Networks Are Overconfident

Modern neural networks, including DeBERTa, tend to produce overconfident predictions. A model might output allow=0.95 on a query it's actually only 70% likely to classify correctly. This miscalibration means your thresholds are operating on unreliable probability estimates.

The Solution: Temperature Scaling

Temperature scaling is a post-hoc calibration technique that fits a single parameter T to adjust the logit distribution:

\text{scaled\_logits} = \frac{\text{logits}}{T}

\text{probabilities} = \text{softmax}(\text{scaled\_logits})

If T > 1, the distribution is softened (reducing overconfidence). If T < 1, the distribution is sharpened. The optimal T is found by minimizing negative log-likelihood on a held-out calibration set using L-BFGS optimization:

import torch
import torch.nn as nn

class TemperatureScaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

# Fit using L-BFGS on a held-out calibration set (val_logits, val_labels)
scaler = TemperatureScaler()
optimizer = torch.optim.LBFGS([scaler.temperature], lr=0.01, max_iter=100)
criterion = nn.CrossEntropyLoss()

def closure():
    optimizer.zero_grad()
    loss = criterion(scaler(val_logits), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)

Quality Gate: Expected Calibration Error

We measure calibration quality using Expected Calibration Error (ECE). This buckets predictions by confidence, computes the gap between confidence and accuracy in each bucket, and takes the weighted average:

ECE = \sum_{b=1}^{B} \frac{|B_b|}{N} \cdot |acc(B_b) - conf(B_b)|
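The formula can be computed with a short numpy sketch. Equal-width bins and `n_bins=10` are assumptions here, not specified by the source:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # bin B_b
        if mask.any():
            # |B_b|/N weight times the accuracy-confidence gap in this bin
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Perfectly calibrated: 75% confidence, 6/8 correct -> ECE = 0
ece_cal = expected_calibration_error([0.75] * 8, [1, 1, 1, 1, 1, 1, 0, 0])

# Overconfident: 95% confidence, 5/10 correct -> ECE = 0.45
ece_over = expected_calibration_error([0.95] * 10, [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```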

The shipping gate requires ECE < 0.03 — meaning the average gap between predicted confidence and actual accuracy is less than 3 percentage points. The calibrated temperature and both pre- and post-calibration ECE values are stored in calibration_params.json:

{
  "temperature": 1.2847,
  "pre_calibration_ece": 0.0412,
  "post_calibration_ece": 0.0189,
  "calibration_set_size": 350
}

At inference time, the ONNX classifier applies the fitted temperature before softmax — a single division operation that adds negligible latency but meaningfully improves the reliability of probability estimates and, by extension, the margin-based decision logic.
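The inference-time application is just that single division. A numpy sketch of the effect (the example logits and the temperature value from the JSON above are illustrative):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = x - np.max(x)  # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, -1.0, -1.5])
T = 1.2847  # fitted temperature, as in calibration_params.json

raw = softmax(logits)
calibrated = softmax(logits / T)  # single division before softmax

# T > 1 softens the distribution: the top probability shrinks toward uniform
assert calibrated.max() < raw.max()
```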


Policy-Driven Configuration

Every deployment decision in IntentGuard is driven by a policy JSON file, not hardcoded logic. This is the single most important design decision for production flexibility.

Policy Structure

A complete policy defines scope, thresholds, responses, privacy settings, and downstream tool permissions:

{
  "vertical": "finance",
  "version": "1.0",
  "display_name": "Financial Services",

  "scope": {
    "core_topics": [
      "banking", "lending", "credit", "payments", "investing",
      "insurance", "tax", "personal finance", "retirement",
      "mortgages", "financial planning", "budgeting"
    ],
    "conditional_allow": [
      {
        "topic": "healthcare",
        "condition": "only when related to HSA/FSA, medical debt, insurance",
        "examples_allow": ["What are HSA contribution limits?"],
        "examples_deny": ["Explain my MRI results"],
        "disambiguation_questions": [
          "Is your question about healthcare costs, insurance, or financial planning?"
        ]
      }
    ],
    "hard_exclusions": [
      "sports", "entertainment", "cooking", "gaming",
      "celebrity gossip", "fashion", "travel_leisure"
    ]
  },

  "decision": {
    "tau_allow": 0.80,
    "tau_deny": 0.90,
    "margin_allow": 0.10,
    "margin_deny": 0.10
  },

  "privacy": {
    "log_query_text_default": false,
    "pii_redaction_default": true,
    "log_sampling_rate": 0.1
  }
}

Conditional Allow: The Hard Part

The conditional_allow section handles the nuanced cases that break binary classifiers. Healthcare questions sent to a financial chatbot should be allowed when they relate to HSA/FSA accounts or medical debt, but denied when they're asking for clinical advice.

Each conditional topic includes:

  • Condition: A human-readable rule describing when the topic is in-scope
  • Examples (allow/deny): Concrete examples used in training data generation
  • Disambiguation questions: Templates for ABSTAIN responses specific to this topic

This structure serves double duty: it configures the classifier's behavior and feeds the synthetic data generation pipeline that creates training examples for these boundary cases.
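The data-generation side of that double duty can be sketched in a few lines. This is an assumption about how the pipeline consumes the policy, not a copy of it: the `seed_examples` helper and the mapping of `examples_allow`/`examples_deny` to labels are illustrative.

```python
def seed_examples(policy: dict) -> list[tuple[str, str]]:
    """Turn conditional_allow examples into labeled seeds for data generation.
    (Illustrative; labels follow the example fields shown in the policy above.)"""
    pairs = []
    for cond in policy["scope"].get("conditional_allow", []):
        pairs += [(t, "allow") for t in cond.get("examples_allow", [])]
        pairs += [(t, "deny") for t in cond.get("examples_deny", [])]
    return pairs

policy = {"scope": {"conditional_allow": [{
    "topic": "healthcare",
    "examples_allow": ["What are HSA contribution limits?"],
    "examples_deny": ["Explain my MRI results"],
}]}}
seeds = seed_examples(policy)
# [("What are HSA contribution limits?", "allow"), ("Explain my MRI results", "deny")]
```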

Policy Packs: Per-Decision Downstream Configuration

Each decision carries a policy pack — metadata that tells the downstream system what tools and guardrails to apply:

{
  "policy_packs": {
    "allow": {
      "allowed_tools": ["calculator", "market_data", "account_lookup"],
      "guardrails": ["no_trade_execution", "no_pii_disclosure", "disclaimer_required"],
      "metadata": {"risk_level": "standard"}
    },
    "deny": {
      "allowed_tools": [],
      "guardrails": ["block_response", "log_attempt"],
      "metadata": {"risk_level": "blocked"}
    },
    "abstain": {
      "allowed_tools": ["document_search"],
      "guardrails": ["require_clarification", "no_pii_disclosure"],
      "metadata": {"risk_level": "elevated"}
    }
  }
}

This means the classification decision doesn't just gate access — it configures the downstream LLM's permissions. An ALLOW decision might grant access to market data tools. An ABSTAIN decision might restrict the LLM to document search while it asks for clarification. A DENY decision grants no tool access at all.


Multi-Vertical Routing

[Figure: multi-vertical router]

For organizations that deploy multiple domain-specific chatbots, IntentGuard supports a two-stage routing architecture:

Stage 1: Vertical Router

A lightweight N-way classifier routes incoming queries to the most appropriate vertical:

class VerticalRouter:
    def route(self, text: str) -> str:
        """Route query to best vertical. Returns vertical name."""
        probs = self._run_router(text)
        return max(probs, key=probs.get)

    def route_scores(self, text: str) -> dict[str, float]:
        """Return confidence scores for all verticals."""
        return self._run_router(text)

Stage 2: Per-Vertical Classification

Once routed, the query is classified by the vertical-specific model with its own policy, thresholds, and calibration:

def classify(self, text: str) -> tuple[ClassifyResponse, str, dict]:
    """Route + classify. Returns (response, routed_vertical, router_scores)."""
    scores = self.route_scores(text)
    vertical = max(scores, key=scores.get)

    classifier = self.classifiers[vertical]
    response = classifier.classify(text)

    return response, vertical, scores

Router Configuration

The entire multi-vertical setup is defined in a single JSON config:

{
  "router_model": "models/router/model.onnx",
  "router_tokenizer": "models/router/tokenizer",
  "verticals": {
    "finance": {
      "model": "dist/finance/model.onnx",
      "tokenizer": "dist/finance/tokenizer",
      "policy": "policies/finance.json",
      "calibration": "dist/finance/calibration_params.json"
    },
    "healthcare": { "..." : "..." },
    "legal": { "..." : "..." }
  }
}

In debug mode, the API response includes the router scores so operators can see which vertical was selected and how confidently:

{
  "decision": "allow",
  "confidence": 0.94,
  "vertical": "finance",
  "routed_vertical": "finance",
  "router_scores": {
    "finance": 0.87,
    "healthcare": 0.08,
    "legal": 0.05
  }
}

Shipping Gates: When Is a Model Ready for Production?

IntentGuard defines explicit, automated shipping gates — quantitative criteria that must pass before a model can be deployed. This removes subjectivity from the release process.

The Three Core Metrics

1. LBR (Legitimate-Block Rate). Gate: < 0.5%

LBR = P(\text{predicted}=\text{DENY} \mid \text{gold}=\text{ALLOW})

This measures how often the model blocks legitimate on-topic queries. Even a 1% LBR means 1 in 100 legitimate users gets blocked. The gate is set aggressively low at 0.5%.

2. OPR (Off-Topic-Pass Rate). Gate: < 2%

OPR = P(\text{predicted}=\text{ALLOW} \mid \text{gold}=\text{DENY})

This measures how often off-topic queries slip through. The gate (2%) is deliberately less aggressive than LBR — reflecting the asymmetric cost design where false blocks are worse than false passes (which can be caught by other safety layers).

3. AOC (Abstain-On-Clean). Gate: < 10%

AOC = P(\text{predicted}=\text{ABSTAIN} \mid \text{gold}=\text{ALLOW}, \text{category} \in \{\text{positive}, \text{clean}\})

This measures unnecessary abstentions on cleanly on-topic queries. A model that abstains too often on clear-cut queries is adding friction without safety benefit. The 10% gate ensures the model is decisive on unambiguous inputs.

Ship Decision

All three gates must pass. The decision is binary:

DEFAULT_GATES = {"lbr_max": 0.005, "opr_max": 0.02, "aoc_max": 0.10}

def check_gates(metrics, gates=DEFAULT_GATES):
    checks = {
        "lbr": metrics["lbr"] <= gates["lbr_max"],      # 0.5%
        "opr": metrics["opr"] <= gates["opr_max"],      # 2.0%
        "aoc": metrics["aoc"] <= gates["aoc_max"],      # 10%
    }
    ship = all(checks.values())
    return "SHIP" if ship else "NO-SHIP"

Current Vertical Performance

| Vertical | Accuracy | LBR | OPR | AOC | Ship? |
|---|---|---|---|---|---|
| Finance | 98.3% | 0.37% | 0.00% | 4.2% | SHIP |
| Healthcare | 97.7% | 0.00% | 0.00% | 6.1% | SHIP |
| Legal | 95.3% | 0.41% | 0.50% | 8.3% | SHIP |

Adversarial Evaluation Categories

The evaluation suite tests across multiple attack categories:

  • Clean on-topic (150 examples): Straightforward domain queries
  • Clean off-topic (100 examples): Clearly irrelevant topics
  • Lexical overlap (50 examples): Off-topic queries using domain keywords ("What's the interest rate of this TV show?")
  • Context wrapping (50 examples): Off-topic queries wrapped in domain framing
  • Multi-intent (50 examples): Queries mixing on-topic and off-topic content
  • Polysemy (30 examples): Words with domain-dependent meanings
  • Encoding tricks (20 examples): Unicode manipulation, base64, zero-width injection

Each category's accuracy is reported independently so operators can see exactly where the model excels and where it's vulnerable.


Deployment Architecture

IntentGuard supports three deployment modes:

Mode 1: Sidecar Classification

User App → POST /v1/classify → IntentGuard → Decision
                                                ↓
                              ALLOW → Forward to LLM
                              DENY  → Return refusal
                              ABSTAIN → Return clarification

The /v1/classify endpoint accepts an OpenAI-compatible messages array and returns the classification decision with confidence, vertical, and policy pack. Response headers include latency and decision metadata for observability.

Mode 2: Transparent Proxy

User App → POST /v1/chat/completions → IntentGuard → LLM (if ALLOW)
                                                    → Refusal (if DENY/ABSTAIN)

Set DOWNSTREAM_URL to your LLM endpoint. IntentGuard acts as a drop-in replacement that classifies before proxying. The response format is OpenAI-compatible, so clients don't need to change.

Mode 3: Shadow Mode

User App → POST /v1/classify?mode=shadow → IntentGuard → Always returns ALLOW
                                                        → Real decision in headers

Shadow mode classifies every request but always returns ALLOW. The real classification result appears in the X-Classification-Shadow header. This enables non-blocking evaluation in production before switching to enforcement — you can measure accuracy, latency, and false-block rates without impacting users.

Observability

Prometheus metrics are exported at /metrics:

intentguard_requests_total{decision="allow", vertical="finance"} 12847
intentguard_requests_total{decision="deny", vertical="finance"} 423
intentguard_requests_total{decision="abstain", vertical="finance"} 891
intentguard_latency_seconds_bucket{vertical="finance", le="0.02"} 13891
intentguard_model_loaded 1
intentguard_feedback_total{expected="allow", actual="deny"} 7

The feedback endpoint (POST /v1/feedback) records expected-vs-actual decision pairs, enabling continuous monitoring and identifying drift.


Performance Benchmarks

[Figure: performance comparison]

Latency

| Configuration | p50 | p95 | p99 |
|---|---|---|---|
| ONNX INT8, 4-core CPU | 8ms | 15ms | 19ms |
| PyTorch FP32, 4-core CPU | 22ms | 38ms | 47ms |
| ONNX INT8, 2-core CPU | 12ms | 21ms | 28ms |

Model Size

| Format | Size |
|---|---|
| PyTorch FP32 | ~88MB |
| ONNX FP32 | ~45MB |
| ONNX INT8 (quantized) | ~10MB |
| Docker image (complete) | <500MB |

ONNX Export and Quantization

The export pipeline converts the fine-tuned model to ONNX with INT8 dynamic quantization:

# Mandatory sanity gate: ONNX output must match PyTorch output
pytorch_probs = run_pytorch_inference(model, test_inputs)
onnx_probs = run_onnx_inference(session, test_inputs)
assert np.allclose(pytorch_probs, onnx_probs, atol=0.01), \
    "ONNX model output diverges from PyTorch — export failed"

The sanity gate ensures quantization doesn't silently degrade accuracy. If the ONNX outputs diverge from PyTorch by more than 1% on the test set, the export fails and the model isn't shipped.


Limitations and Future Work

IntentGuard solves one specific problem: topical boundary enforcement. It is explicitly not:

  • A prompt injection detector
  • A harmful content filter
  • A hallucination detector
  • A factual accuracy checker
  • An output-side constraint on what the LLM says after classification

It's one layer in a defense-in-depth architecture. It answers the question "Should this LLM be talking about this topic?" — not "Is this LLM saying something safe?"

Known Limitations

  • Single-turn only: The current context_window: 1 means each message is classified independently. Multi-turn context would help with follow-up queries but adds complexity around context poisoning.
  • Three verticals: Finance, healthcare, and legal are the initial release. Additional verticals need their own training data, policies, and evaluation suites.
  • Adversarial hardness: The current adversarial test suite covers the main attack categories but needs to be continuously expanded as new evasion techniques emerge.
  • Policy expressiveness: The conditional_allow format handles the common case but can't express arbitrarily complex scope rules. Future versions may support a more expressive DSL.

What We'd Like to Build Next

  • Multi-turn context support with conversation-level classification
  • Automatic hard negative mining from production feedback data
  • Support for more granular decision levels (e.g., "allow with disclaimer")
  • Expanded vertical coverage (education, government, customer support)
  • Active learning loop from shadow mode + feedback data

Getting Started

Models on HuggingFace Hub

All three models are published under Apache 2.0.

Quick Start with Docker

docker run -p 8080:8080 \
  -e POLICY_PATH=policies/finance.json \
  -e DEBUG=true \
  ghcr.io/perfecxion/intentguard:finance-latest

Test a Query

curl -X POST http://localhost:8080/v1/classify \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What are current mortgage rates?"}
    ]
  }'
{
  "decision": "allow",
  "confidence": 0.94,
  "vertical": "finance",
  "message": "",
  "probabilities": {
    "allow": 0.94,
    "deny": 0.03,
    "abstain": 0.03
  }
}

Full Source

The complete training pipeline, evaluation suite, adversarial test sets, gating reports, and Docker configurations are in the GitHub repository.

We're publishing this as a starting point, not a finished product. The adversarial test sets need to be harder, the policy format could be more expressive, and there are verticals beyond the initial three that need coverage. Contributions and issue reports are welcome.

If you're building guardrails for LLM deployments — or thinking about how to approach topical scope enforcement — we'd like to hear what problems you're running into.


Scott Thornton is an AI security researcher at perfecXion.ai focusing on defensive tooling for LLM deployments. Follow the project on GitHub.
