cowWhySo's picture
Update README.md
7986dbc verified
---
language:
- en
license: other
base_model: microsoft/deberta-v3-small
library_name: transformers
pipeline_tag: text-classification
tags:
- tool-use
- function-calling
- tool-calling
- guardrails
- verifier
- text-classification
- onnx
- rust
- shadow-mode
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: toolcall-verifier-classifier-production
results:
- task:
type: text-classification
name: Tool-call verification
dataset:
name: toolcall-verifier-dataset
type: cowWhySo/toolcall-verifier-dataset
metrics:
- name: Accuracy
type: accuracy
value: 0.9729105322763307
- name: Macro F1
type: f1
value: 0.9796981345691861
- name: Macro Precision
type: precision
value: 0.9801958127901861
- name: Macro Recall
type: recall
value: 0.9793045374079483
---
# Tool-call Verifier Classifier Production
This repository contains a production-mode text-classification sidecar for tool-call guardrails. The model scores serialized tool-call candidates after deterministic validation has already handled syntax, JSON schema, unknown tool names, required-step enforcement, prerequisite checks, unsafe batches, and terminal-tool rules.
The intended deployment pattern is conservative: deterministic guardrails remain authoritative, while this classifier starts in `shadow` mode and is promoted only after repository-specific eval replay proves that it improves nudging or routing without introducing false blocks on valid tool calls.
## Model summary
| Field | Value |
|---|---|
| Base model | `microsoft/deberta-v3-small` |
| Model kind | Text-classification cross-encoder |
| Label mode | `production` |
| Input schema | `toolcall-verifier-input/v1` |
| Serializer | `serialize_state_v1` |
| Max sequence length | `1280` |
| Deployment default | `shadow` |
| Primary artifact | `model.onnx` |
| Quantized artifact | `model_quantized.onnx` |
| Required tokenizer files | `tokenizer_config.json`, `special_tokens_map.json`, `spm.model` |
| Threshold file | `thresholds.json` |
| Manifest file | `artifact_manifest.json` |
## Intended use
Use this model to classify a candidate tool call in the context of:
- the original user request,
- available tool definitions,
- required workflow steps,
- completed and pending steps,
- terminal tools,
- recent errors,
- and the candidate tool call.
It is meant to support:
- shadow telemetry for semantic tool-call quality,
- advisory nudges when the selected tool or arguments look semantically wrong,
- eval-backed enforcement for high-confidence semantic errors,
- Rust-side inference through ONNX Runtime.
It is not meant to replace deterministic guardrails. It should not accept malformed calls, override JSON-schema validation, rewrite arguments, execute tools, or relax required workflow rules.
## Labels
Production mode uses six labels:
| Label | Meaning | Deployment guidance |
|---|---|---|
| `valid` | Candidate call appears appropriate for the request and workflow state. | Allow. |
| `wrong_tool_semantic` | Candidate uses the wrong tool for the request or workflow state. | Conservative; currently disabled for advisory/enforcement by thresholds. |
| `wrong_arguments_semantic` | Candidate uses a plausible tool but semantically wrong arguments. | Advisory first; enforce only after eval proof. |
| `tool_not_needed` | Candidate calls a tool when no tool call is needed. | Advisory first; enforce only after eval proof. |
| `needs_clarification` | Request is underspecified and should be clarified before tool use. | Advisory first; enforce only after eval proof. |
| `deterministic_invalid` | Collapsed bucket for failures owned by deterministic validation. | Deterministic-only. Do not enforce from ML. |
In production mode, the following raw labels are collapsed into `deterministic_invalid`: `invalid_args_schema`, `missing_required_args`, `unknown_tool`, `premature_terminal`, `missing_prerequisite`, `unsafe_parallel_batch`, and `malformed_tool_call`.
## Training configuration
Latest production run:
| Field | Value |
|---|---:|
| GPU profile | `high_vram_quality` |
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| GPU memory | 95.0 GB |
| Precision | bf16 + tf32 |
| Seed | `42` |
| Max per source | `40000` |
| Max sequence length | `1280` |
| Epochs requested | `5` |
| Per-device train batch | `64` |
| Eval batch | `128` |
| Gradient accumulation | `1` |
| Learning rate | `6e-6` |
| Warmup ratio | `0.08` |
| Early stopping patience | `2` |
| Optimizer | `adamw_torch_fused` |
| Gradient checkpointing | `false` |
| Class weights | disabled |
| Forge augmentation | enabled |
| Final-response verifier training | enabled in the notebook, but separate from this tool-call classifier |
Split sizes:
| Split | Rows |
|---|---:|
| Train | 176,705 |
| Validation | 10,819 |
| Test | 22,075 |
Best validation checkpoint:
| Metric | Value |
|---|---:|
| Best checkpoint | `/content/toolcall-verifier/model/checkpoint-15505` |
| Selection metric | `macro_f1` |
| Validation loss | 0.07549399137496948 |
| Validation accuracy | 0.973380164525372 |
| Validation macro precision | 0.980623200539544 |
| Validation macro recall | 0.9800955285359482 |
| Validation macro F1 | 0.9803291996618319 |
## Test metrics
Held-out test set metrics:
| Metric | Value |
|---|---:|
| Test loss | 0.0744357779622078 |
| Accuracy | 0.9729105322763307 |
| Macro precision | 0.9801958127901861 |
| Macro recall | 0.9793045374079483 |
| Macro F1 | 0.9796981345691861 |
| Samples/sec | 686.863 |
| Steps/sec | 5.383 |
Per-label test report:
| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| `valid` | 0.94 | 0.97 | 0.95 | 4,955 |
| `wrong_tool_semantic` | 0.97 | 0.96 | 0.96 | 4,960 |
| `wrong_arguments_semantic` | 0.98 | 0.98 | 0.98 | 5,005 |
| `tool_not_needed` | 1.00 | 0.99 | 1.00 | 2,029 |
| `needs_clarification` | 1.00 | 1.00 | 1.00 | 10 |
| `deterministic_invalid` | 0.99 | 0.98 | 0.99 | 5,116 |
| **Macro avg** | **0.98** | **0.98** | **0.98** | **22,075** |
| **Weighted avg** | **0.97** | **0.97** | **0.97** | **22,075** |
Per-source test accuracy:
| Source | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `Salesforce/xlam-function-calling-60k` | 14,234 | 0.973444 | 0.983573 |
| `glaiveai/glaive-function-calling-v2` | 5,449 | 0.971738 | 0.981286 |
| `Team-ACE/ToolACE` | 2,380 | 0.973109 | 0.982948 |
| `forge_augmented` | 12 | 1.000000 | 0.999698 |
Per-label test accuracy:
| True label | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `deterministic_invalid` | 5,116 | 0.980258 | 0.992830 |
| `wrong_arguments_semantic` | 5,005 | 0.980819 | 0.989488 |
| `wrong_tool_semantic` | 4,960 | 0.956855 | 0.971546 |
| `valid` | 4,955 | 0.965691 | 0.972742 |
| `tool_not_needed` | 2,029 | 0.992607 | 0.994770 |
| `needs_clarification` | 10 | 1.000000 | 0.971681 |
## Confusion matrix
Rows are true labels. Columns are predicted labels.
| True \\ Predicted | `valid` | `wrong_tool_semantic` | `wrong_arguments_semantic` | `tool_not_needed` | `needs_clarification` | `deterministic_invalid` |
|---|---:|---:|---:|---:|---:|---:|
| `valid` | 4,785 | 76 | 74 | 0 | 0 | 20 |
| `wrong_tool_semantic` | 195 | 4,746 | 8 | 0 | 0 | 11 |
| `wrong_arguments_semantic` | 75 | 18 | 4,909 | 0 | 0 | 3 |
| `tool_not_needed` | 1 | 13 | 0 | 2,014 | 0 | 1 |
| `needs_clarification` | 0 | 0 | 0 | 0 | 10 | 0 |
| `deterministic_invalid` | 49 | 50 | 2 | 0 | 0 | 5,015 |
## Threshold policy
The exported default mode is `shadow`, with default action `allow`. These thresholds should be treated as deployment policy metadata, not as proof that enforcement is safe in a new environment.
```json
{
"schema_version": "toolcall-verifier-thresholds/v1",
"mode": "shadow",
"default_action": "allow",
"temperature": 1.1155287027359009,
"notes": [
"Deterministic guardrails remain authoritative.",
"Use ML in shadow mode first, then advisory nudges, then high-confidence enforcement only after eval proof.",
"deterministic_invalid is never enforced by ML in this default config.",
"wrong_tool_semantic stays conservative because current Forge telemetry showed high-confidence false positives on valid terminal/summarize calls."
],
"labels": {
"valid": {
"action": "allow",
"advisory_min_confidence": 0.0,
"enforce_min_confidence": 1.01
},
"wrong_tool_semantic": {
"action": "advisory_then_enforce_after_eval",
"advisory_min_confidence": 1.01,
"enforce_min_confidence": 1.01
},
"wrong_arguments_semantic": {
"action": "advisory_then_enforce_after_eval",
"advisory_min_confidence": 0.9,
"enforce_min_confidence": 0.995
},
"tool_not_needed": {
"action": "advisory_then_enforce_after_eval",
"advisory_min_confidence": 0.8,
"enforce_min_confidence": 0.95
},
"needs_clarification": {
"action": "advisory_then_enforce_after_eval",
"advisory_min_confidence": 0.8,
"enforce_min_confidence": 0.95
},
"deterministic_invalid": {
"action": "deterministic_only",
"advisory_min_confidence": 1.01,
"enforce_min_confidence": 1.01
}
}
}
```
## Input format
The classifier expects the canonical serialized format produced by `serialize_state_v1`.
```text
SCHEMA_VERSION:
toolcall-verifier-input/v1
USER_REQUEST:
Generate a sales report from the Q4 2024 dataset.
WORKFLOW_STATE:
required_steps=['fetch_sales_data', 'analyze_sales']
completed_steps=[]
pending_steps=['fetch_sales_data', 'analyze_sales']
terminal_tools=['report']
recent_errors=[]
AVAILABLE_TOOLS:
report: Produce the final report from findings.
PARAMETERS: {"properties": {"summary": {"type": "string"}}, "required": ["summary"], "type": "object"}
fetch_sales_data: Fetch sales data for a given quarter and year.
PARAMETERS: {"properties": {"quarter": {"type": "integer"}, "year": {"type": "integer"}}, "required": ["quarter", "year"], "type": "object"}
analyze_sales: Analyze the loaded sales data and produce findings.
PARAMETERS: {"properties": {}, "type": "object"}
CANDIDATE_CALL:
{"arguments": {"summary": "Done."}, "name": "report"}
```
For this fixture, both PyTorch and ONNX selected `deterministic_invalid`, with a reported max absolute logit difference of `6.67572021484375e-06`.
## Inference
### Transformers pipeline
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
repo_id = "cowWhySo/toolcall-verifier-classifier-production"
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)
clf = pipeline(
"text-classification",
model=model,
tokenizer=tokenizer,
top_k=None,
device=0, # use -1 for CPU
)
scores = clf(serialized_tool_call, truncation=True, max_length=1280)[0]
scores = sorted(scores, key=lambda item: item["score"], reverse=True)
print(scores[:5])
```
### ONNX Runtime
The ONNX path is the recommended Rust/runtime deployment path. Load the model with the same tokenizer behavior and the same serialized input text used during training.
Required files:
```text
model.onnx
model_quantized.onnx
labels.json
thresholds.json
artifact_manifest.json
input_schema.json
serializer_fixture.json
tokenizer_config.json
special_tokens_map.json
spm.model
```
Runtime integrations should byte-compare their serializer output against `serializer_fixture.json` before trusting model scores. This catches train/inference drift.
## Rust deployment guidance
Recommended integration order:
```text
1. Parse provider response.
2. Validate format, known tool names, and JSON-schema arguments.
3. Enforce required steps, prerequisites, terminal rules, and unsafe batches.
4. If the call is still valid-looking, run the classifier.
5. Shadow mode: log classifier verdict only.
6. Advisory mode: use classifier verdict to choose better nudges.
7. Enforce mode: block only high-confidence semantic labels after eval proof.
```
Suggested runtime flags:
```text
--classifier-dir <path>
--classifier-mode off|shadow|advisory|enforce
--classifier-max-latency-ms <n>
FORGE_CLASSIFIER_DIR
FORGE_CLASSIFIER_MODE
FORGE_CLASSIFIER_MAX_LATENCY_MS
```
Default should be `off` unless a classifier directory is explicitly provided. First rollout should use `shadow`.
Recommended artifact-loader checks:
```text
artifact_manifest.json exists and includes training_run_summary/test_metrics provenance
artifact_schema_version == "toolcall-verifier-artifact/v1"
input_schema_version == "toolcall-verifier-input/v1"
serializer == "serialize_state_v1"
labels.json labels match model config
thresholds.json has every deployed label
tokenizer files exist
ONNX file exists
```
Loading failures should fail closed for strict deployment modes. Scoring failures should fail open in `shadow` and `advisory` modes, with telemetry.
## Calibration and safety notes
- Keep the model in `shadow` mode until eval replay confirms behavior on your real traffic and workflow families.
- Do not use `deterministic_invalid` predictions to enforce blocks. Deterministic Rust guardrails own those decisions.
- `wrong_tool_semantic` is intentionally disabled by threshold values above `1.0` because the current telemetry showed high-confidence false positives on otherwise valid terminal/summarize calls.
- High-confidence mistakes were observed, including valid calls predicted as deterministic or wrong-argument failures. Use per-family replay, not only aggregate F1, before promotion.
- The `needs_clarification` test support is small (`10` rows), so treat that label as under-validated despite the perfect held-out score.
- Validate public dataset licenses and any Forge-derived traces before publishing derived artifacts broadly.
## Tokenizer notes
The training run emitted tokenizer warnings around slow-to-fast conversion and regex/tokenization behavior. For parity-sensitive deployment, prefer the tokenizer path used by the notebook and artifact tests, and keep `use_fast=False` unless you have separately verified byte-for-byte or score-level parity.
If your Transformers version emits a Mistral regex warning for the local artifact, load with the appropriate `fix_mistral_regex=True` setting where supported. For Rust deployment, verify whether `tokenizer.json` is present and equivalent. If tokenizer parity is uncertain, use a sidecar scorer process until the tokenizer path is proven.
## ONNX parity and latency smoke check
A smoke test from the latest run reported:
| Check | Value |
|---|---:|
| Example latency | 126.49 ms |
| PyTorch top label | `deterministic_invalid` |
| ONNX top label | `deterministic_invalid` |
| Max absolute difference | `6.67572021484375e-06` |
This is a single-fixture smoke check, not a full deployment benchmark. Run larger parity checks on the exported `onnx_parity_report.json` and real replay traces before using quantized artifacts in advisory or enforcement mode.
## Related final-response verifier
The notebook can also train a separate final-response verifier with labels such as `valid_final_response`, `missing_tool_fact`, `contradicts_tool_result`, `unsupported_claim`, and `failed_to_acknowledge_data_gap`. That verifier is a separate artifact family and should be documented, evaluated, and deployed independently from this tool-call verifier.
The latest final-response run was small: `90` total rows split into `70` train, `10` validation, and `10` test rows. Its validation macro F1 remained low in the shown run, so it should stay experimental/shadow-only until the dataset is materially expanded.
## Limitations
- This model was trained on serialized tool-call contexts, not arbitrary natural language.
- It assumes deterministic validation has already run.
- It is sensitive to serializer drift, tokenizer drift, and tool-list truncation.
- Aggregate metrics are strong, but valid-call false positives are more important than headline macro F1 for enforcement.
- The `forge_augmented` test slice shown in the run contains only `12` rows, so it is useful as a smoke signal, not as sufficient Forge coverage.
- The final-response verifier path in the notebook is not mature enough for enforcement based on the shown data.
## Recommended eval replay before promotion
Run at least these variants before changing deployment mode:
```text
no_classifier
classifier_fp32_onnx_shadow
classifier_quantized_onnx_shadow
classifier_fp32_onnx_advisory
classifier_quantized_onnx_advisory
```
Promotion criteria should include:
- zero or near-zero false objections on valid calls,
- no regression in terminal-tool workflows,
- no regression in summarize/report workflows,
- improved targeted scenario-family scores,
- acceptable p95/p99 latency,
- PyTorch/ONNX/quantized parity on replay traces,
- stable behavior across real tool schemas, not only public function-calling datasets.