---
language:
  - en
license: other
base_model: microsoft/deberta-v3-small
library_name: transformers
pipeline_tag: text-classification
tags:
  - tool-use
  - function-calling
  - tool-calling
  - guardrails
  - verifier
  - text-classification
  - onnx
  - rust
  - shadow-mode
metrics:
  - accuracy
  - f1
  - precision
  - recall
model-index:
  - name: toolcall-verifier-classifier-production
    results:
      - task:
          type: text-classification
          name: Tool-call verification
        dataset:
          name: toolcall-verifier-dataset
          type: cowWhySo/toolcall-verifier-dataset
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9729105322763307
          - name: Macro F1
            type: f1
            value: 0.9796981345691861
          - name: Macro Precision
            type: precision
            value: 0.9801958127901861
          - name: Macro Recall
            type: recall
            value: 0.9793045374079483
---

# Tool-call Verifier Classifier Production

This repository contains a production-mode text-classification sidecar for tool-call guardrails. The model scores serialized tool-call candidates after deterministic validation has already handled syntax, JSON schema, unknown tool names, required-step enforcement, prerequisite checks, unsafe batches, and terminal-tool rules.

The intended deployment pattern is conservative: deterministic guardrails remain authoritative, while this classifier starts in `shadow` mode and is promoted only after repository-specific eval replay proves that it improves nudging or routing without introducing false blocks on valid tool calls.

## Model summary

| Field | Value |
|---|---|
| Base model | `microsoft/deberta-v3-small` |
| Model kind | Text-classification cross-encoder |
| Label mode | `production` |
| Input schema | `toolcall-verifier-input/v1` |
| Serializer | `serialize_state_v1` |
| Max sequence length | `1280` |
| Deployment default | `shadow` |
| Primary artifact | `model.onnx` |
| Quantized artifact | `model_quantized.onnx` |
| Required tokenizer files | `tokenizer_config.json`, `special_tokens_map.json`, `spm.model` |
| Threshold file | `thresholds.json` |
| Manifest file | `artifact_manifest.json` |

## Intended use

Use this model to classify a candidate tool call in the context of:

- the original user request,
- available tool definitions,
- required workflow steps,
- completed and pending steps,
- terminal tools,
- recent errors,
- and the candidate tool call.

It is meant to support:

- shadow telemetry for semantic tool-call quality,
- advisory nudges when the selected tool or arguments look semantically wrong,
- eval-backed enforcement for high-confidence semantic errors,
- Rust-side inference through ONNX Runtime.

It is not meant to replace deterministic guardrails. It should not accept malformed calls, override JSON-schema validation, rewrite arguments, execute tools, or relax required workflow rules.

## Labels

Production mode uses six labels:

| Label | Meaning | Deployment guidance |
|---|---|---|
| `valid` | Candidate call appears appropriate for the request and workflow state. | Allow. |
| `wrong_tool_semantic` | Candidate uses the wrong tool for the request or workflow state. | Conservative; currently disabled for advisory/enforcement by thresholds. |
| `wrong_arguments_semantic` | Candidate uses a plausible tool but semantically wrong arguments. | Advisory first; enforce only after eval proof. |
| `tool_not_needed` | Candidate calls a tool when no tool call is needed. | Advisory first; enforce only after eval proof. |
| `needs_clarification` | Request is underspecified and should be clarified before tool use. | Advisory first; enforce only after eval proof. |
| `deterministic_invalid` | Collapsed bucket for failures owned by deterministic validation. | Deterministic-only. Do not enforce from ML. |

In production mode, the following raw labels are collapsed into `deterministic_invalid`: `invalid_args_schema`, `missing_required_args`, `unknown_tool`, `premature_terminal`, `missing_prerequisite`, `unsafe_parallel_batch`, and `malformed_tool_call`.
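The collapse described above can be expressed as a small lookup. This is a minimal sketch: the label names come from this card, while `collapse_label` and the constant names are hypothetical helpers, not part of the published artifacts. The tuple order below follows the table and is not guaranteed to match the model's `id2label` order.

```python
# Raw labels that production mode folds into one bucket (list taken from this card).
DETERMINISTIC_RAW_LABELS = frozenset({
    "invalid_args_schema",
    "missing_required_args",
    "unknown_tool",
    "premature_terminal",
    "missing_prerequisite",
    "unsafe_parallel_batch",
    "malformed_tool_call",
})

# Table order from this card; NOT necessarily the model's id2label order.
PRODUCTION_LABELS = (
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
)


def collapse_label(raw_label: str) -> str:
    """Map a raw label to its production-mode label (hypothetical helper)."""
    if raw_label in DETERMINISTIC_RAW_LABELS:
        return "deterministic_invalid"
    if raw_label in PRODUCTION_LABELS:
        return raw_label
    raise ValueError(f"unknown raw label: {raw_label!r}")
```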

## Training configuration

Latest production run:

| Field | Value |
|---|---:|
| GPU profile | `high_vram_quality` |
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| GPU memory | 95.0 GB |
| Precision | bf16 + tf32 |
| Seed | `42` |
| Max per source | `40000` |
| Max sequence length | `1280` |
| Epochs requested | `5` |
| Per-device train batch | `64` |
| Eval batch | `128` |
| Gradient accumulation | `1` |
| Learning rate | `6e-6` |
| Warmup ratio | `0.08` |
| Early stopping patience | `2` |
| Optimizer | `adamw_torch_fused` |
| Gradient checkpointing | `false` |
| Class weights | disabled |
| Forge augmentation | enabled |
| Final-response verifier training | enabled in the notebook, but separate from this tool-call classifier |

Split sizes:

| Split | Rows |
|---|---:|
| Train | 176,705 |
| Validation | 10,819 |
| Test | 22,075 |

Best validation checkpoint:

| Metric | Value |
|---|---:|
| Best checkpoint | `/content/toolcall-verifier/model/checkpoint-15505` |
| Selection metric | `macro_f1` |
| Validation loss | 0.07549399137496948 |
| Validation accuracy | 0.973380164525372 |
| Validation macro precision | 0.980623200539544 |
| Validation macro recall | 0.9800955285359482 |
| Validation macro F1 | 0.9803291996618319 |

## Test metrics

Held-out test set metrics:

| Metric | Value |
|---|---:|
| Test loss | 0.0744357779622078 |
| Accuracy | 0.9729105322763307 |
| Macro precision | 0.9801958127901861 |
| Macro recall | 0.9793045374079483 |
| Macro F1 | 0.9796981345691861 |
| Samples/sec | 686.863 |
| Steps/sec | 5.383 |

Per-label test report:

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| `valid` | 0.94 | 0.97 | 0.95 | 4,955 |
| `wrong_tool_semantic` | 0.97 | 0.96 | 0.96 | 4,960 |
| `wrong_arguments_semantic` | 0.98 | 0.98 | 0.98 | 5,005 |
| `tool_not_needed` | 1.00 | 0.99 | 1.00 | 2,029 |
| `needs_clarification` | 1.00 | 1.00 | 1.00 | 10 |
| `deterministic_invalid` | 0.99 | 0.98 | 0.99 | 5,116 |
| **Macro avg** | **0.98** | **0.98** | **0.98** | **22,075** |
| **Weighted avg** | **0.97** | **0.97** | **0.97** | **22,075** |

Per-source test accuracy:

| Source | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `Salesforce/xlam-function-calling-60k` | 14,234 | 0.973444 | 0.983573 |
| `glaiveai/glaive-function-calling-v2` | 5,449 | 0.971738 | 0.981286 |
| `Team-ACE/ToolACE` | 2,380 | 0.973109 | 0.982948 |
| `forge_augmented` | 12 | 1.000000 | 0.999698 |

Per-label test accuracy:

| True label | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `deterministic_invalid` | 5,116 | 0.980258 | 0.992830 |
| `wrong_arguments_semantic` | 5,005 | 0.980819 | 0.989488 |
| `wrong_tool_semantic` | 4,960 | 0.956855 | 0.971546 |
| `valid` | 4,955 | 0.965691 | 0.972742 |
| `tool_not_needed` | 2,029 | 0.992607 | 0.994770 |
| `needs_clarification` | 10 | 1.000000 | 0.971681 |

## Confusion matrix

Rows are true labels. Columns are predicted labels.

| True \\ Predicted | `valid` | `wrong_tool_semantic` | `wrong_arguments_semantic` | `tool_not_needed` | `needs_clarification` | `deterministic_invalid` |
|---|---:|---:|---:|---:|---:|---:|
| `valid` | 4,785 | 76 | 74 | 0 | 0 | 20 |
| `wrong_tool_semantic` | 195 | 4,746 | 8 | 0 | 0 | 11 |
| `wrong_arguments_semantic` | 75 | 18 | 4,909 | 0 | 0 | 3 |
| `tool_not_needed` | 1 | 13 | 0 | 2,014 | 0 | 1 |
| `needs_clarification` | 0 | 0 | 0 | 0 | 10 | 0 |
| `deterministic_invalid` | 49 | 50 | 2 | 0 | 0 | 5,015 |

## Threshold policy

The exported default mode is `shadow`, with default action `allow`. These thresholds should be treated as deployment policy metadata, not as proof that enforcement is safe in a new environment.

```json
{
  "schema_version": "toolcall-verifier-thresholds/v1",
  "mode": "shadow",
  "default_action": "allow",
  "temperature": 1.1155287027359009,
  "notes": [
    "Deterministic guardrails remain authoritative.",
    "Use ML in shadow mode first, then advisory nudges, then high-confidence enforcement only after eval proof.",
    "deterministic_invalid is never enforced by ML in this default config.",
    "wrong_tool_semantic stays conservative because current Forge telemetry showed high-confidence false positives on valid terminal/summarize calls."
  ],
  "labels": {
    "valid": {
      "action": "allow",
      "advisory_min_confidence": 0.0,
      "enforce_min_confidence": 1.01
    },
    "wrong_tool_semantic": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "wrong_arguments_semantic": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.9,
      "enforce_min_confidence": 0.995
    },
    "tool_not_needed": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.8,
      "enforce_min_confidence": 0.95
    },
    "needs_clarification": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.8,
      "enforce_min_confidence": 0.95
    },
    "deterministic_invalid": {
      "action": "deterministic_only",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    }
  }
}
```
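
One way a runtime might interpret this file is sketched below. Two things here are assumptions rather than documented behavior: that `temperature` is applied by dividing logits before the softmax (standard temperature scaling), and that labels whose `action` is `allow` or `deterministic_only` never produce an advisory or enforcement signal. The function names and the `"none"`/`"advisory"`/`"enforce"` tier values are hypothetical.

```python
import math


def _policy(action, advisory, enforce):
    return {
        "action": action,
        "advisory_min_confidence": advisory,
        "enforce_min_confidence": enforce,
    }


STAGED = "advisory_then_enforce_after_eval"

# Values copied from the thresholds.json shown in this card.
THRESHOLDS = {
    "temperature": 1.1155287027359009,
    "labels": {
        "valid": _policy("allow", 0.0, 1.01),
        "wrong_tool_semantic": _policy(STAGED, 1.01, 1.01),
        "wrong_arguments_semantic": _policy(STAGED, 0.9, 0.995),
        "tool_not_needed": _policy(STAGED, 0.8, 0.95),
        "needs_clarification": _policy(STAGED, 0.8, 0.95),
        "deterministic_invalid": _policy("deterministic_only", 1.01, 1.01),
    },
}


def calibrated_probs(logits, temperature):
    """Softmax over logits / temperature (assumed calibration scheme)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def threshold_tier(label, confidence, thresholds):
    """Return which threshold a verdict clears: 'none', 'advisory', or 'enforce'."""
    policy = thresholds["labels"][label]
    if policy["action"] != STAGED:
        return "none"  # allow / deterministic_only labels never signal from ML
    if confidence >= policy["enforce_min_confidence"]:
        return "enforce"
    if confidence >= policy["advisory_min_confidence"]:
        return "advisory"
    return "none"
```

Because a probability can never reach `1.01`, `wrong_tool_semantic` always resolves to `"none"` under this reading, which matches the card's statement that the label is disabled by thresholds.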

## Input format

The classifier expects the canonical serialized format produced by `serialize_state_v1`.

```text
SCHEMA_VERSION:
toolcall-verifier-input/v1

USER_REQUEST:
Generate a sales report from the Q4 2024 dataset.

WORKFLOW_STATE:
required_steps=['fetch_sales_data', 'analyze_sales']
completed_steps=[]
pending_steps=['fetch_sales_data', 'analyze_sales']
terminal_tools=['report']
recent_errors=[]

AVAILABLE_TOOLS:
report: Produce the final report from findings.
PARAMETERS: {"properties": {"summary": {"type": "string"}}, "required": ["summary"], "type": "object"}

fetch_sales_data: Fetch sales data for a given quarter and year.
PARAMETERS: {"properties": {"quarter": {"type": "integer"}, "year": {"type": "integer"}}, "required": ["quarter", "year"], "type": "object"}

analyze_sales: Analyze the loaded sales data and produce findings.
PARAMETERS: {"properties": {}, "type": "object"}

CANDIDATE_CALL:
{"arguments": {"summary": "Done."}, "name": "report"}
```

For this fixture, both PyTorch and ONNX selected `deterministic_invalid`, with a reported max absolute logit difference of `6.67572021484375e-06`.
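
To make the layout easier to read, here is a sketch that rebuilds the example above from structured inputs. It is inferred only from that one example and is not the canonical `serialize_state_v1`: the function name, argument shapes, key-sorted JSON, Python-style list rendering, and the absence of a trailing newline are all assumptions. Any real integration should still be byte-compared against `serializer_fixture.json`.

```python
import json

STATE_KEYS = ("required_steps", "completed_steps", "pending_steps",
              "terminal_tools", "recent_errors")


def serialize_sketch(user_request, state, tools, candidate):
    """Hypothetical re-creation of the v1 layout, inferred from the card's example."""
    def dumps(obj):
        return json.dumps(obj, sort_keys=True)  # assumed: keys sorted, default separators

    state_lines = [f"{key}={state[key]!r}" for key in STATE_KEYS]
    tool_blocks = [
        f"{tool['name']}: {tool['description']}\nPARAMETERS: {dumps(tool['parameters'])}"
        for tool in tools  # tool order is preserved, not sorted
    ]
    sections = [
        "SCHEMA_VERSION:\ntoolcall-verifier-input/v1",
        "USER_REQUEST:\n" + user_request,
        "WORKFLOW_STATE:\n" + "\n".join(state_lines),
        "AVAILABLE_TOOLS:\n" + "\n\n".join(tool_blocks),
        "CANDIDATE_CALL:\n" + dumps(candidate),
    ]
    return "\n\n".join(sections)


text = serialize_sketch(
    "Generate a sales report from the Q4 2024 dataset.",
    {
        "required_steps": ["fetch_sales_data", "analyze_sales"],
        "completed_steps": [],
        "pending_steps": ["fetch_sales_data", "analyze_sales"],
        "terminal_tools": ["report"],
        "recent_errors": [],
    },
    [
        {"name": "report", "description": "Produce the final report from findings.",
         "parameters": {"type": "object", "required": ["summary"],
                        "properties": {"summary": {"type": "string"}}}},
        {"name": "analyze_sales",
         "description": "Analyze the loaded sales data and produce findings.",
         "parameters": {"type": "object", "properties": {}}},
    ],
    {"name": "report", "arguments": {"summary": "Done."}},
)
print(text)
```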

## Inference

### Transformers pipeline

```python
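import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

repo_id = "cowWhySo/toolcall-verifier-classifier-production"

# Keep the slow tokenizer unless fast-tokenizer parity has been verified separately.
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

clf = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return a score for every label
    device=0 if torch.cuda.is_available() else -1,  # first GPU if present, else CPU
)

# Replace with the canonical text produced by serialize_state_v1 (see "Input format").
serialized_tool_call = "..."

scores = clf(serialized_tool_call, truncation=True, max_length=1280)[0]
scores = sorted(scores, key=lambda item: item["score"], reverse=True)
print(scores[:5])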
```

### ONNX Runtime

The ONNX path is the recommended Rust/runtime deployment path. Load the model with the same tokenizer behavior and the same serialized input format used during training.

Required files:

```text
model.onnx
model_quantized.onnx
labels.json
thresholds.json
artifact_manifest.json
input_schema.json
serializer_fixture.json
tokenizer_config.json
special_tokens_map.json
spm.model
```

Runtime integrations should byte-compare their serializer output against `serializer_fixture.json` before trusting model scores. This catches train/inference drift.
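
A byte-level comparison could look like the sketch below. The internal layout of `serializer_fixture.json` is not described in this card, so loading the expected text is left to the caller; `first_byte_mismatch` and `assert_serializer_parity` are hypothetical helper names.

```python
def first_byte_mismatch(expected: str, actual: str):
    """Return None when the UTF-8 bytes match, else the offset of the first difference."""
    expected_bytes = expected.encode("utf-8")
    actual_bytes = actual.encode("utf-8")
    if expected_bytes == actual_bytes:
        return None
    for offset, (want, got) in enumerate(zip(expected_bytes, actual_bytes)):
        if want != got:
            return offset
    # One input is a strict prefix of the other.
    return min(len(expected_bytes), len(actual_bytes))


def assert_serializer_parity(expected: str, actual: str) -> None:
    """Raise before any scoring happens if the serializer output has drifted."""
    offset = first_byte_mismatch(expected, actual)
    if offset is not None:
        context = expected.encode("utf-8")[max(0, offset - 20):offset + 20]
        raise RuntimeError(
            f"serializer drift at byte {offset}; expected bytes near it: {context!r}"
        )
```

Comparing bytes rather than normalized strings is deliberate: differences such as `\r\n` line endings or a trailing newline change the token sequence even when the text looks identical.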

## Rust deployment guidance

Recommended integration order:

```text
1. Parse provider response.
2. Validate format, known tool names, and JSON-schema arguments.
3. Enforce required steps, prerequisites, terminal rules, and unsafe batches.
4. If the call is still valid-looking, run the classifier.
5. Shadow mode: log classifier verdict only.
6. Advisory mode: use classifier verdict to choose better nudges.
7. Enforce mode: block only high-confidence semantic labels after eval proof.
```
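
The ordering above can be sketched as a single decision function. It is written in Python for consistency with the rest of this card, although the production runtime is Rust. The action names and the `tier` argument (`"none"`, `"advisory"`, or `"enforce"`, meaning which confidence threshold the classifier verdict cleared) are hypothetical.

```python
def runtime_action(deterministic_ok: bool, mode: str, tier: str) -> str:
    """Combine deterministic validation, deployment mode, and classifier tier."""
    if not deterministic_ok:
        # Steps 2-3: deterministic guardrails are authoritative; the classifier is not run.
        return "reject_deterministic"
    if mode == "off":
        return "allow"
    if mode == "shadow":
        return "log_only"  # step 5: record the verdict, never act on it
    if mode == "advisory":
        # Step 6: the verdict may shape a nudge but never blocks.
        return "nudge" if tier in ("advisory", "enforce") else "log_only"
    if mode == "enforce":
        # Step 7: block only when the enforcement threshold was cleared.
        if tier == "enforce":
            return "block"
        return "nudge" if tier == "advisory" else "log_only"
    raise ValueError(f"unknown classifier mode: {mode!r}")
```

Keeping the deterministic check as the first branch makes it structurally impossible for a classifier verdict to rescue a call that deterministic validation rejected.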

Suggested runtime flags:

```text
--classifier-dir <path>
--classifier-mode off|shadow|advisory|enforce
--classifier-max-latency-ms <n>
FORGE_CLASSIFIER_DIR
FORGE_CLASSIFIER_MODE
FORGE_CLASSIFIER_MAX_LATENCY_MS
```

Default should be `off` unless a classifier directory is explicitly provided. First rollout should use `shadow`.
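
A sketch of how those defaults might resolve follows. The flag and environment variable names come from the list above; everything else is an assumption, in particular that flags take precedence over environment variables, that providing a directory without a mode selects `shadow`, and that requesting a non-`off` mode without a directory is an error.

```python
import argparse

MODES = ("off", "shadow", "advisory", "enforce")


def resolve_classifier_config(argv, env):
    """Resolve classifier settings from CLI flags first, then environment variables."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--classifier-dir")
    parser.add_argument("--classifier-mode")
    parser.add_argument("--classifier-max-latency-ms", type=int)
    args, _unknown = parser.parse_known_args(argv)

    directory = args.classifier_dir or env.get("FORGE_CLASSIFIER_DIR")
    mode = args.classifier_mode or env.get("FORGE_CLASSIFIER_MODE")
    latency = args.classifier_max_latency_ms
    if latency is None and env.get("FORGE_CLASSIFIER_MAX_LATENCY_MS"):
        latency = int(env["FORGE_CLASSIFIER_MAX_LATENCY_MS"])

    if mode is None:
        # Assumed default: off without a directory, shadow once one is provided.
        mode = "shadow" if directory else "off"
    if mode not in MODES:
        raise ValueError(f"unknown classifier mode: {mode!r}")
    if mode != "off" and not directory:
        raise ValueError("a classifier directory is required unless mode is off")
    return {"dir": directory, "mode": mode, "max_latency_ms": latency}
```

For example, `resolve_classifier_config(sys.argv[1:], os.environ)` would be the typical call site.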

Recommended artifact-loader checks:

```text
artifact_manifest.json exists and includes training_run_summary/test_metrics provenance
artifact_schema_version == "toolcall-verifier-artifact/v1"
input_schema_version == "toolcall-verifier-input/v1"
serializer == "serialize_state_v1"
labels.json labels match model config
thresholds.json has every deployed label
tokenizer files exist
ONNX file exists
```
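
Most of these checks can be sketched over already-parsed JSON. Assumptions: `labels.json` is treated as a plain list of label strings, `thresholds.json` has the `labels` mapping shown in "Threshold policy", and the caller supplies the set of file names present in the artifact directory. The provenance check is omitted because the manifest layout for it is not specified here. `artifact_problems` is a hypothetical helper name.

```python
EXPECTED_MANIFEST = {
    "artifact_schema_version": "toolcall-verifier-artifact/v1",
    "input_schema_version": "toolcall-verifier-input/v1",
    "serializer": "serialize_state_v1",
}
TOKENIZER_FILES = ("tokenizer_config.json", "special_tokens_map.json", "spm.model")


def artifact_problems(manifest, labels, thresholds, model_labels,
                      present_files, onnx_file="model.onnx"):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    for key, expected in EXPECTED_MANIFEST.items():
        if manifest.get(key) != expected:
            problems.append(f"{key}: expected {expected!r}, got {manifest.get(key)!r}")
    if list(labels) != list(model_labels):
        problems.append("labels.json does not match the model config labels")
    threshold_labels = thresholds.get("labels", {})
    for label in labels:
        if label not in threshold_labels:
            problems.append(f"thresholds.json has no entry for label {label!r}")
    for name in (*TOKENIZER_FILES, onnx_file):
        if name not in present_files:
            problems.append(f"missing file: {name}")
    return problems
```

Returning every problem instead of raising on the first one makes loader failures easier to diagnose from a single log line.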

Loading failures should fail closed for strict deployment modes. Scoring failures should fail open in `shadow` and `advisory` modes, with telemetry.

## Calibration and safety notes

- Keep the model in `shadow` mode until eval replay confirms behavior on your real traffic and workflow families.
- Do not use `deterministic_invalid` predictions to enforce blocks. Deterministic Rust guardrails own those decisions.
- `wrong_tool_semantic` is intentionally disabled by threshold values above `1.0` because the current telemetry showed high-confidence false positives on otherwise valid terminal/summarize calls.
- High-confidence mistakes were observed, including valid calls predicted as `deterministic_invalid` or `wrong_arguments_semantic`. Evaluate with per-family replay, not only aggregate F1, before promotion.
- The `needs_clarification` test support is small (`10` rows), so treat that label as under-validated despite the perfect held-out score.
- Validate public dataset licenses and any Forge-derived traces before publishing derived artifacts broadly.
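
The small-support caveat for `needs_clarification` can be made concrete. If every one of `n` independent test rows is classified correctly, the exact one-sided lower confidence bound on the true accuracy is `alpha ** (1 / n)`, because a true accuracy of `p` produces `n` straight successes with probability `p ** n`. The sketch below assumes independent rows drawn from the same distribution as deployment traffic, which may not hold; the function name is illustrative.

```python
def zero_failure_lower_bound(n_rows: int, confidence: float = 0.95) -> float:
    """Exact one-sided lower bound on accuracy when all n_rows were correct."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    alpha = 1.0 - confidence
    # Smallest accuracy p that still yields n straight successes with probability >= alpha.
    return alpha ** (1.0 / n_rows)


for rows in (10, 12, 100, 1000):
    print(rows, round(zero_failure_lower_bound(rows), 4))
```

With only 10 rows the bound sits near 0.74, so a perfect held-out score on that label is compatible with a materially lower true accuracy.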

## Tokenizer notes

The training run emitted tokenizer warnings around slow-to-fast conversion and regex/tokenization behavior. For parity-sensitive deployment, prefer the tokenizer path used by the notebook and artifact tests, and keep `use_fast=False` unless you have separately verified byte-for-byte or score-level parity.

If your Transformers version emits a Mistral regex warning for the local artifact, load with the appropriate `fix_mistral_regex=True` setting where supported. For Rust deployment, verify whether `tokenizer.json` is present and equivalent. If tokenizer parity is uncertain, use a sidecar scorer process until the tokenizer path is proven.

## ONNX parity and latency smoke check

A smoke test from the latest run reported:

| Check | Value |
|---|---:|
| Example latency | 126.49 ms |
| PyTorch top label | `deterministic_invalid` |
| ONNX top label | `deterministic_invalid` |
| Max absolute difference | `6.67572021484375e-06` |

This is a single-fixture smoke check, not a full deployment benchmark. Before using quantized artifacts in advisory or enforcement mode, review the exported `onnx_parity_report.json` and run larger parity checks on real replay traces.
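
A larger parity check over replay traces could aggregate the same two signals reported in the table: maximum absolute logit difference and top-label agreement. The sketch below takes logits as plain nested lists so it stays independent of any runtime; how the logits are produced (PyTorch, ONNX fp32, or quantized ONNX) is left to the caller, and `parity_stats` is a hypothetical helper name.

```python
def parity_stats(reference_logits, candidate_logits):
    """Compare two lists of per-example logit rows from different backends."""
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("both backends must score the same number of examples")
    if not reference_logits:
        raise ValueError("at least one example is required")

    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)

    max_abs_diff = 0.0
    agreements = 0
    for ref_row, cand_row in zip(reference_logits, candidate_logits):
        if len(ref_row) != len(cand_row):
            raise ValueError("logit rows must have the same number of labels")
        row_diff = max(abs(r - c) for r, c in zip(ref_row, cand_row))
        max_abs_diff = max(max_abs_diff, row_diff)
        if argmax(ref_row) == argmax(cand_row):
            agreements += 1
    return {
        "rows": len(reference_logits),
        "max_abs_diff": max_abs_diff,
        "top_label_agreement": agreements / len(reference_logits),
    }
```

For quantized artifacts, top-label agreement is usually the more meaningful number, since logit differences are expected to be larger than for the fp32 export.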

## Related final-response verifier

The notebook can also train a separate final-response verifier with labels such as `valid_final_response`, `missing_tool_fact`, `contradicts_tool_result`, `unsupported_claim`, and `failed_to_acknowledge_data_gap`. That verifier is a separate artifact family and should be documented, evaluated, and deployed independently from this tool-call verifier.

The latest final-response run was small: `90` total rows split into `70` train, `10` validation, and `10` test rows. Its validation macro F1 remained low in the shown run, so it should stay experimental/shadow-only until the dataset is materially expanded.

## Limitations

- This model was trained on serialized tool-call contexts, not arbitrary natural language.
- It assumes deterministic validation has already run.
- It is sensitive to serializer drift, tokenizer drift, and tool-list truncation.
- Aggregate metrics are strong, but valid-call false positives are more important than headline macro F1 for enforcement.
- The `forge_augmented` test slice shown in the run contains only `12` rows, so it is useful as a smoke signal, not as sufficient Forge coverage.
- The final-response verifier path in the notebook is not mature enough for enforcement based on the shown data.

## Recommended eval replay before promotion

Run at least these variants before changing deployment mode:

```text
no_classifier
classifier_fp32_onnx_shadow
classifier_quantized_onnx_shadow
classifier_fp32_onnx_advisory
classifier_quantized_onnx_advisory
```

Promotion criteria should include:

- zero or near-zero false objections on valid calls,
- no regression in terminal-tool workflows,
- no regression in summarize/report workflows,
- improved targeted scenario-family scores,
- acceptable p95/p99 latency,
- PyTorch/ONNX/quantized parity on replay traces,
- stable behavior across real tool schemas, not only public function-calling datasets.
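
The first criterion can be measured per workflow family rather than in aggregate. The sketch below assumes each replay record is a dict with hypothetical keys `family`, `true_label`, and `action`, where `action` is whatever the runtime did with the classifier verdict (`"nudge"` and `"block"` are treated as objections here). None of these names come from the published artifacts.

```python
from collections import defaultdict

OBJECTING_ACTIONS = frozenset({"nudge", "block"})  # hypothetical action names


def false_objection_rates(records):
    """Per-family share of valid calls that the classifier objected to."""
    totals = defaultdict(int)
    objections = defaultdict(int)
    for record in records:
        if record["true_label"] != "valid":
            continue  # only valid calls can produce a false objection
        family = record["family"]
        totals[family] += 1
        if record["action"] in OBJECTING_ACTIONS:
            objections[family] += 1
    return {family: objections[family] / totals[family] for family in totals}


replay = [
    {"family": "report", "true_label": "valid", "action": "log_only"},
    {"family": "report", "true_label": "valid", "action": "nudge"},
    {"family": "search", "true_label": "valid", "action": "log_only"},
    {"family": "search", "true_label": "wrong_tool_semantic", "action": "block"},
]
print(false_objection_rates(replay))
```

Reporting the rate per family keeps a regression in, say, terminal or summarize workflows from being averaged away by larger families.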