ScopeGuard — A Governance SLM for Multilingual Scope Classification

ScopeGuard is a small language model (SLM) designed for a specific aspect of AI governance: multilingual scope classification.

Instead of optimizing for open-ended generation, ScopeGuard is trained to make reliable, consistent, low-latency, policy-driven decisions. Being small compared to large language models (LLMs) is not a limitation, but an intentional design choice: ScopeGuard is built to be cheaper, faster, and easier to deploy as an inline decision layer in production systems.



What ScopeGuard does

Given a user request, ScopeGuard decides whether the request is:

  • within scope of an AI service (supported use cases, allowed domains, interaction boundaries), or
  • out of scope / restricted, so it should be blocked, redirected, or routed elsewhere.

ScopeGuard can be used as an early gate in enterprise AI deployments to ensure requests are handled only when they match the intended purpose of the system.
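For example, a deployment might run ScopeGuard before the main model and only forward requests that pass the check. A minimal sketch, assuming the validate API shown below; the answer_with_llm helper is a hypothetical placeholder, and treating every non-"Restricted" label as in scope is an assumption of this sketch:

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

SERVICE_DESCRIPTION = """
You are a virtual assistant for a parcel delivery service.
You can only answer questions about package tracking.
"""

def answer_with_llm(user_query: str) -> str:
    # Hypothetical placeholder for the production model behind the gate.
    return f"(answer to: {user_query})"

def handle_request(user_query: str) -> str:
    """Gate a request through ScopeGuard before it reaches the main model."""
    result = sg.validate(user_query, SERVICE_DESCRIPTION)
    # "Restricted" matches the label shown in the usage examples below;
    # other label names in the ScopeClass enum are not documented here.
    if result.scope_class.value == "Restricted":
        return "Sorry, that request is outside the scope of this service."
    return answer_with_llm(user_query)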



Show me the code

Install

pip install orbitals[scope-guard-vllm]

# Or, if you'd like to use hf as a backend
# pip install orbitals[scope-guard-hf]
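If you installed the hf extra instead, the backend can presumably be swapped at construction time; a minimal sketch, assuming backend="hf" mirrors the vLLM constructor shown below:

from orbitals.scope_guard import ScopeGuard

# Same model, served through the Hugging Face backend instead of vLLM.
sg = ScopeGuard(backend="hf", model="principled-intelligence/scope-guard-4B-g-2601")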

Use

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

ai_service_description = """
You are a virtual assistant for a parcel delivery service.
You can only answer questions about package tracking.
Never respond to requests for refunds.
"""

user_query = "If the package hasn't arrived by tomorrow, can I get my money back?"
result = sg.validate(user_query, ai_service_description)

print(f"Scope: {result.scope_class.value}")
if result.evidences:
    print("Evidences:")
    for evidence in result.evidences:
        print(f"  - {evidence}")

# Scope: Restricted
# Evidences:
#   - Never respond to requests for refunds.

Structured AI Service Description (Suggested)

from orbitals.types import AIServiceDescription
from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

ai_service_description_complete = AIServiceDescription(
    identity_role=(
        "You are PackAssist, a virtual assistant designed to help users understand and "
        "track their parcel shipments. Your objective is to interpret tracking data and "
        "guide users through delivery-related questions."
    ),
    context=(
        "The service operates within a parcel-delivery environment where users interact "
        "to check the status of shipments sent domestically or internationally. Typical "
        "users are customers awaiting deliveries or sending parcels."
    ),
    functionalities=(
        "Retrieve tracking updates; explain the meaning of tracking events; provide "
        "estimated delivery windows; assist users in understanding delays or routing steps."
    ),
    knowledge_scope=(
        "Public tracking information, standard logistics workflows, typical transit times, "
        "and general procedures for parcel movement. No access to payment, refund, or claim "
        "processing systems."
    ),
    principles=(
        "Cannot modify shipments, initiate refunds, open claims, contact drivers, or view "
        "internal logistics notes. Limited strictly to interpreting publicly available "
        "tracking data."
    ),
    website_url="https://www.trackmate-delivery.com",
)

user_query = "If the package hasn't arrived by tomorrow, can I get my money back?"
result = sg.validate(user_query, ai_service_description_complete)

print(f"Scope: {result.scope_class.value}")
if result.evidences:
    print("Evidences:")
    for evidence in result.evidences:
        print(f"  - {evidence}")

# Scope: Restricted
# Evidences:
#   - No access to payment, refund, or claim processing systems.
#   - Cannot modify shipments, initiate refunds, open claims, contact drivers, or view internal logistics notes.

ScopeGuard model family

Our initial family of ScopeGuard models includes:

  • scope-guard-4B-q-2601
    Open ScopeGuard model based on Qwen3-4B-Instruct-2507 (distilled version of scope-guard-pro-2601)

  • scope-guard-4B-g-2601
    Open ScopeGuard model based on gemma-3-4b-it (distilled version of scope-guard-pro-2601)

  • scope-guard-pro-2601 (Coming soon!)
    Closed-source model representing the most performant configuration, used as a reference point in our evaluation

The open-weight models are distilled from the closed model.


How we evaluated ScopeGuard

We evaluated ScopeGuard in realistic guardrailing scenarios, comparing it against:

  • Commercial frontier LLMs: GPT-5.2 (medium thinking), Claude Sonnet 4.5, Gemini 3 Pro
  • Commercial “fast” variants: GPT-5 Mini, Claude Haiku 4.5, Gemini 3 Flash
  • Open-weight safety baseline: NVIDIA Nemotron-Content-Safety-Reasoning-4B

The goal is not to match LLMs on general-purpose capabilities, but to show that small, specialized models can outperform general-purpose systems on governance tasks while offering strong deployment advantages.

Tasks

Our evaluation covers three governance-critical tasks:

  1. Multilingual Scope Classification
    Decide whether a user request falls inside or outside the intended scope of an AI system.

  2. Vanilla Safety Classification
    Detect whether a user request violates generic safety policies (toxicity, discrimination, abusive language).

  3. Custom Safety Classification
    Identify whether a user request violates explicit custom-defined policies, varying across products, services, and organizational constraints.


Experimental Results

1) Multilingual scope classification (primary task)

We created an internal multilingual scope classification benchmark reflecting real user traffic and decision boundaries, covering five languages: English, Spanish, Italian, French, and German.
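In practice this means a single (English) service description can gate queries in any of these languages. A hedged sketch reusing the API from above; the Spanish query and the expected label are illustrative, not benchmark samples:

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

ai_service_description = """
You are a virtual assistant for a parcel delivery service.
You can only answer questions about package tracking.
Never respond to requests for refunds.
"""

# Spanish refund request; the policy itself stays in English.
user_query = "Si el paquete no llega mañana, ¿me devuelven el dinero?"
result = sg.validate(user_query, ai_service_description)
print(result.scope_class.value)  # expected: Restricted (illustrative)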

Key result: both open-weight ScopeGuard models outperform frontier commercial LLMs on multilingual scope classification.

| Provider | Model | Type | Macro F1 | Avg. latency |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-5.2 (medium thinking) | Closed LLM | 87.4 | 51.1s |
| Anthropic | Claude Sonnet 4.5 | Closed LLM | 85.4 | 27.8s |
| Google | Gemini 3 Pro | Closed LLM | 88.4 | 31.8s |
| OpenAI | GPT-5 Mini | Closed LLM | 86.6 | 23.8s |
| Anthropic | Claude Haiku 4.5 | Closed LLM | 75.3 | 19.8s |
| Google | Gemini 3 Flash | Closed LLM | 87.1 | 12.5s |
| Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 89.1 | 0.47s |
| Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 90.1 | 0.68s |
| Principled Intelligence | scope-guard-pro-2601 | Proprietary SLM | 91.9 | 0.23s |

ScopeGuard models surpass all frontier LLMs on this multilingual scope classification task, while being orders of magnitude faster on single-request latency.
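For reference, macro F1 averages per-class F1 scores with equal weight, so rare classes count as much as frequent ones. A minimal sketch with scikit-learn (the labels are illustrative, not the benchmark's actual label set):

from sklearn.metrics import f1_score

y_true = ["in_scope", "restricted", "in_scope", "restricted", "in_scope"]
y_pred = ["in_scope", "restricted", "restricted", "restricted", "in_scope"]

# average="macro" computes F1 per class, then takes the unweighted mean.
print(f1_score(y_true, y_pred, average="macro"))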


2) Vanilla safety classification (Toxic Chat)

We evaluate vanilla safety using the Toxic Chat benchmark. To keep the comparison realistic, we use simple prompts and rely on each model’s default safety behavior (no extra custom structure or policies injected).

| Provider | Model | Type | Harmful F1 |
| --- | --- | --- | --- |
| OpenAI | GPT-5.2 (medium thinking) | Closed LLM | 80.8 |
| Anthropic | Claude Sonnet 4.5 | Closed LLM | 80.7 |
| Google | Gemini 3 Pro | Closed LLM | 81.2 |
| OpenAI | GPT-5 Mini | Closed LLM | 78.8 |
| Anthropic | Claude Haiku 4.5 | Closed LLM | 77.8 |
| Google | Gemini 3 Flash | Closed LLM | 80.2 |
| NVIDIA | Nemotron-Content-Safety-Reasoning-4B | Open SLM | 75.9 |
| Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 79.1 |
| Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 78.0 |
| Principled Intelligence | scope-guard-pro-2601 | Proprietary SLM | 81.8 |

Key result: the open ScopeGuard models remain competitive with frontier LLMs even in generic moderation settings, and scope-guard-pro-2601 slightly surpasses all commercial models on Toxic Chat.


3) Custom safety classification (DynaGuardrail + CoSApien)

Custom safety evaluates whether models can enforce explicit user-defined policies (common in enterprise deployments), rather than generic safety heuristics.

We evaluate custom safety using:

  • DynaGuardrail (dynamic policy constraints across diverse scenarios)
  • CoSApien (structured safety policies embedded in the dataset)

In both benchmarks, policies are provided as part of the setup, and models are framed as general-purpose assistants, so the evaluation measures policy application and constraint following.

| Provider | Model | Type | DynaGuardrail | CoSApien |
| --- | --- | --- | --- | --- |
| OpenAI | GPT-5.2 (medium thinking) | Closed LLM | 89.5 | 91.3 |
| Anthropic | Claude Sonnet 4.5 | Closed LLM | 88.3 | 90.9 |
| Google | Gemini 3 Pro | Closed LLM | 87.8 | 90.0 |
| OpenAI | GPT-5 Mini | Closed LLM | 88.4 | 87.0 |
| Anthropic | Claude Haiku 4.5 | Closed LLM | 84.4 | 89.5 |
| Google | Gemini 3 Flash | Closed LLM | 88.2 | 88.8 |
| NVIDIA | Nemotron-Content-Safety-Reasoning-4B | Open SLM | 87.6 | 86.2 |
| Principled Intelligence | scope-guard-4B-q-2601 | Open SLM | 88.7 | 91.9 |
| Principled Intelligence | scope-guard-4B-g-2601 | Open SLM | 87.8 | 88.2 |
| Principled Intelligence | scope-guard-pro-2601 | Proprietary SLM | 91.6 | 92.4 |

Key result: scope-guard-pro-2601 leads on both benchmarks, and the open ScopeGuard models remain competitive with frontier LLMs when policies are complex, dynamic, or tightly specified, confirming that custom safety is not just an extension of vanilla moderation.


Inference speed and deployment

To enforce AI governance in production, guardrails are executed on every user interaction, making latency and throughput critical.

To keep results realistic and actionable, we measured ScopeGuard inference on a single consumer-grade GPU (RTX 4090), widely available on cloud marketplaces (e.g., Vast.ai) for ~€0.20/hour. You can obtain good results even with L4-class GPUs, available on GCP and other marketplaces for ~$0.70/hour, often with significant discounts.

In this setting, ScopeGuard consistently achieves sub-second latency, making it suitable for inline deployment as a real-time decision layer, without requiring expensive proprietary infrastructure or introducing noticeable delays.
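To reproduce latency numbers on your own hardware, a simple wall-clock loop over single requests is enough. A minimal sketch, with an illustrative query set and a warm-up call to exclude model loading:

import time

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

ai_service_description = """
You are a virtual assistant for a parcel delivery service.
You can only answer questions about package tracking.
"""

queries = ["Where is my package?", "Can I get a refund?"] * 10

sg.validate(queries[0], ai_service_description)  # warm-up request

start = time.perf_counter()
for q in queries:
    sg.validate(q, ai_service_description)
elapsed = time.perf_counter() - start
print(f"avg latency: {elapsed / len(queries):.3f}s per request")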


Intended use

ScopeGuard is intended for:

  • Scope enforcement for customer-facing assistants (route/deny out-of-scope queries)
  • Inline guardrailing for agentic systems (pre-checks before tool execution; see the sketch after this list)
  • Enterprise governance layers where behavior must follow explicit boundaries and policies
  • Analytics & routing pipelines where explainable classification supports monitoring and reporting
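For the agentic pre-check case, one option is to describe the intended tool call as text and validate it like any other request. A hedged sketch; whether validate accepts an action description rather than a raw user message, and the tool names used, are assumptions of this example:

from orbitals.scope_guard import ScopeGuard

sg = ScopeGuard(backend="vllm", model="principled-intelligence/scope-guard-4B-g-2601")

agent_policy = """
You are a booking agent for a hotel chain.
You may search availability and create reservations.
Never issue refunds or modify payment details.
"""

def guarded_tool_call(tool_name: str, tool_args: dict) -> bool:
    """Return True only if the intended tool call is in scope for the agent."""
    # Describe the intended action as text so ScopeGuard can classify it.
    intent = f"The agent wants to call `{tool_name}` with arguments {tool_args}."
    result = sg.validate(intent, agent_policy)
    return result.scope_class.value != "Restricted"

if guarded_tool_call("issue_refund", {"booking_id": "B-1042"}):
    print("tool call allowed")
else:
    print("tool call blocked")  # expected here (illustrative)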

Limitations

  • ScopeGuard is designed for governance decisions, not for open-ended generation.
  • Performance on safety benchmarks can be sensitive to prompting and evaluation conditions; strong results on one benchmark may not transfer automatically to all setups.

Cite ScopeGuard

If you use this model in academic work or evaluations, we would love it if you cited ScopeGuard (our tech report is coming soon!).


Want to integrate ScopeGuard to safeguard your AI? Get in touch!

If you are thinking about integrating ScopeGuard into your pipeline to safeguard your AI and your systems, we can help you! Contact us directly or write to orbitals@principled-intelligence.com to learn more about ScopeGuard Pro or how we can support you.


Built with ❤️ by Principled Intelligence
