---
language:
- en
license: other
base_model: microsoft/deberta-v3-small
library_name: transformers
pipeline_tag: text-classification
tags:
- tool-use
- function-calling
- tool-calling
- guardrails
- verifier
- text-classification
- onnx
- rust
- shadow-mode
metrics:
- accuracy
- f1
- precision
- recall
model-index:
- name: toolcall-verifier-classifier-production
  results:
  - task:
      type: text-classification
      name: Tool-call verification
    dataset:
      name: toolcall-verifier-dataset
      type: cowWhySo/toolcall-verifier-dataset
    metrics:
    - name: Accuracy
      type: accuracy
      value: 0.9729105322763307
    - name: Macro F1
      type: f1
      value: 0.9796981345691861
    - name: Macro Precision
      type: precision
      value: 0.9801958127901861
    - name: Macro Recall
      type: recall
      value: 0.9793045374079483
---

# Tool-call Verifier Classifier Production

This repository contains a production-mode text-classification sidecar for tool-call guardrails. The model scores serialized tool-call candidates after deterministic validation has already handled syntax, JSON schema, unknown tool names, required-step enforcement, prerequisite checks, unsafe batches, and terminal-tool rules.

The intended deployment pattern is conservative: deterministic guardrails remain authoritative, while this classifier starts in `shadow` mode and is promoted only after repository-specific eval replay proves that it improves nudging or routing without introducing false blocks on valid tool calls.

## Model summary

| Field | Value |
|---|---|
| Base model | `microsoft/deberta-v3-small` |
| Model kind | Text-classification cross-encoder |
| Label mode | `production` |
| Input schema | `toolcall-verifier-input/v1` |
| Serializer | `serialize_state_v1` |
| Max sequence length | `1280` |
| Deployment default | `shadow` |
| Primary artifact | `model.onnx` |
| Quantized artifact | `model_quantized.onnx` |
| Required tokenizer files | `tokenizer_config.json`, `special_tokens_map.json`, `spm.model` |
| Threshold file | `thresholds.json` |
| Manifest file | `artifact_manifest.json` |

## Intended use

Use this model to classify a candidate tool call in the context of:

- the original user request,
- available tool definitions,
- required workflow steps,
- completed and pending steps,
- terminal tools,
- recent errors,
- and the candidate tool call.

It is meant to support:

- shadow telemetry for semantic tool-call quality,
- advisory nudges when the selected tool or arguments look semantically wrong,
- eval-backed enforcement for high-confidence semantic errors,
- Rust-side inference through ONNX Runtime.

It is not meant to replace deterministic guardrails. It should not accept malformed calls, override JSON-schema validation, rewrite arguments, execute tools, or relax required workflow rules.

## Labels

Production mode uses six labels:

| Label | Meaning | Deployment guidance |
|---|---|---|
| `valid` | Candidate call appears appropriate for the request and workflow state. | Allow. |
| `wrong_tool_semantic` | Candidate uses the wrong tool for the request or workflow state. | Conservative; currently disabled for advisory/enforcement by thresholds. |
| `wrong_arguments_semantic` | Candidate uses a plausible tool but semantically wrong arguments. | Advisory first; enforce only after eval proof. |
| `tool_not_needed` | Candidate calls a tool when no tool call is needed. | Advisory first; enforce only after eval proof. |
| `needs_clarification` | Request is underspecified and should be clarified before tool use. | Advisory first; enforce only after eval proof. |
| `deterministic_invalid` | Collapsed bucket for failures owned by deterministic validation. | Deterministic-only. Do not enforce from ML. |

In production mode, the following raw labels are collapsed into `deterministic_invalid`: `invalid_args_schema`, `missing_required_args`, `unknown_tool`, `premature_terminal`, `missing_prerequisite`, `unsafe_parallel_batch`, and `malformed_tool_call`.
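
The collapse described above amounts to a small lookup. The following is a minimal sketch, not the notebook's code; `to_production_label` and the two constant names are hypothetical, while the label strings are copied from this section.

```python
# Raw labels that production mode folds into one bucket (names copied from the list above).
DETERMINISTIC_RAW_LABELS = frozenset({
    "invalid_args_schema",
    "missing_required_args",
    "unknown_tool",
    "premature_terminal",
    "missing_prerequisite",
    "unsafe_parallel_batch",
    "malformed_tool_call",
})

# The six production labels from the table above.
PRODUCTION_LABELS = (
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
)


def to_production_label(raw_label: str) -> str:
    """Map a raw label to its production label (hypothetical helper)."""
    if raw_label in DETERMINISTIC_RAW_LABELS:
        return "deterministic_invalid"
    if raw_label in PRODUCTION_LABELS:
        return raw_label
    # Unknown names are rejected rather than silently bucketed.
    raise ValueError(f"unknown raw label: {raw_label!r}")
```

Raising on unknown names is a design choice of this sketch: it surfaces label drift between dataset and model early instead of hiding it in the collapsed bucket.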

## Training configuration

Latest production run:

| Field | Value |
|---|---:|
| GPU profile | `high_vram_quality` |
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| GPU memory | 95.0 GB |
| Precision | bf16 + tf32 |
| Seed | `42` |
| Max per source | `40000` |
| Max sequence length | `1280` |
| Epochs requested | `5` |
| Per-device train batch | `64` |
| Eval batch | `128` |
| Gradient accumulation | `1` |
| Learning rate | `6e-6` |
| Warmup ratio | `0.08` |
| Early stopping patience | `2` |
| Optimizer | `adamw_torch_fused` |
| Gradient checkpointing | `false` |
| Class weights | disabled |
| Forge augmentation | enabled |
| Final-response verifier training | enabled in the notebook, but separate from this tool-call classifier |

Split sizes:

| Split | Rows |
|---|---:|
| Train | 176,705 |
| Validation | 10,819 |
| Test | 22,075 |

Best validation checkpoint:

| Metric | Value |
|---|---:|
| Best checkpoint | `/content/toolcall-verifier/model/checkpoint-15505` |
| Selection metric | `macro_f1` |
| Validation loss | 0.07549399137496948 |
| Validation accuracy | 0.973380164525372 |
| Validation macro precision | 0.980623200539544 |
| Validation macro recall | 0.9800955285359482 |
| Validation macro F1 | 0.9803291996618319 |

## Test metrics

Held-out test set metrics:

| Metric | Value |
|---|---:|
| Test loss | 0.0744357779622078 |
| Accuracy | 0.9729105322763307 |
| Macro precision | 0.9801958127901861 |
| Macro recall | 0.9793045374079483 |
| Macro F1 | 0.9796981345691861 |
| Samples/sec | 686.863 |
| Steps/sec | 5.383 |

Per-label test report:

| Label | Precision | Recall | F1 | Support |
|---|---:|---:|---:|---:|
| `valid` | 0.94 | 0.97 | 0.95 | 4,955 |
| `wrong_tool_semantic` | 0.97 | 0.96 | 0.96 | 4,960 |
| `wrong_arguments_semantic` | 0.98 | 0.98 | 0.98 | 5,005 |
| `tool_not_needed` | 1.00 | 0.99 | 1.00 | 2,029 |
| `needs_clarification` | 1.00 | 1.00 | 1.00 | 10 |
| `deterministic_invalid` | 0.99 | 0.98 | 0.99 | 5,116 |
| **Macro avg** | **0.98** | **0.98** | **0.98** | **22,075** |
| **Weighted avg** | **0.97** | **0.97** | **0.97** | **22,075** |

Per-source test accuracy:

| Source | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `Salesforce/xlam-function-calling-60k` | 14,234 | 0.973444 | 0.983573 |
| `glaiveai/glaive-function-calling-v2` | 5,449 | 0.971738 | 0.981286 |
| `Team-ACE/ToolACE` | 2,380 | 0.973109 | 0.982948 |
| `forge_augmented` | 12 | 1.000000 | 0.999698 |

Per-label test accuracy:

| True label | Rows | Accuracy | Avg confidence |
|---|---:|---:|---:|
| `deterministic_invalid` | 5,116 | 0.980258 | 0.992830 |
| `wrong_arguments_semantic` | 5,005 | 0.980819 | 0.989488 |
| `wrong_tool_semantic` | 4,960 | 0.956855 | 0.971546 |
| `valid` | 4,955 | 0.965691 | 0.972742 |
| `tool_not_needed` | 2,029 | 0.992607 | 0.994770 |
| `needs_clarification` | 10 | 1.000000 | 0.971681 |

## Confusion matrix

Rows are true labels. Columns are predicted labels.

| True \ Predicted | `valid` | `wrong_tool_semantic` | `wrong_arguments_semantic` | `tool_not_needed` | `needs_clarification` | `deterministic_invalid` |
|---|---:|---:|---:|---:|---:|---:|
| `valid` | 4,785 | 76 | 74 | 0 | 0 | 20 |
| `wrong_tool_semantic` | 195 | 4,746 | 8 | 0 | 0 | 11 |
| `wrong_arguments_semantic` | 75 | 18 | 4,909 | 0 | 0 | 3 |
| `tool_not_needed` | 1 | 13 | 0 | 2,014 | 0 | 1 |
| `needs_clarification` | 0 | 0 | 0 | 0 | 10 | 0 |
| `deterministic_invalid` | 49 | 50 | 2 | 0 | 0 | 5,015 |
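
The per-label precision and recall figures can be re-derived from this matrix, along with the share of valid calls that receive any non-`valid` prediction. The sketch below uses only the counts shown above; the function names are illustrative and not part of the repository.

```python
LABELS = [
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
]

# Rows are true labels, columns are predicted labels (counts copied from the table above).
CONFUSION = [
    [4785, 76, 74, 0, 0, 20],
    [195, 4746, 8, 0, 0, 11],
    [75, 18, 4909, 0, 0, 3],
    [1, 13, 0, 2014, 0, 1],
    [0, 0, 0, 0, 10, 0],
    [49, 50, 2, 0, 0, 5015],
]


def per_label_metrics(matrix, labels):
    """Return precision, recall, and support for every label."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positives = matrix[i][i]
        support = sum(matrix[i])                    # row sum: rows with this true label
        predicted = sum(row[i] for row in matrix)   # column sum: rows predicted as this label
        metrics[label] = {
            "precision": true_positives / predicted if predicted else 0.0,
            "recall": true_positives / support if support else 0.0,
            "support": support,
        }
    return metrics


def valid_false_objection_rate(matrix, labels):
    """Fraction of truly valid calls that were predicted as any other label."""
    i = labels.index("valid")
    row = matrix[i]
    return (sum(row) - row[i]) / sum(row)


if __name__ == "__main__":
    for label, m in per_label_metrics(CONFUSION, LABELS).items():
        print(f"{label}: precision={m['precision']:.4f} recall={m['recall']:.4f} support={m['support']}")
    print(f"valid false-objection rate: {valid_false_objection_rate(CONFUSION, LABELS):.4f}")
```

On this matrix, 170 of the 4,955 valid calls (76 + 74 + 20, about 3.4%) are predicted as something other than `valid`, which is the figure that matters most when considering enforcement.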

## Threshold policy

The exported default mode is `shadow`, with default action `allow`. These thresholds should be treated as deployment policy metadata, not as proof that enforcement is safe in a new environment.

```json
{
  "schema_version": "toolcall-verifier-thresholds/v1",
  "mode": "shadow",
  "default_action": "allow",
  "temperature": 1.1155287027359009,
  "notes": [
    "Deterministic guardrails remain authoritative.",
    "Use ML in shadow mode first, then advisory nudges, then high-confidence enforcement only after eval proof.",
    "deterministic_invalid is never enforced by ML in this default config.",
    "wrong_tool_semantic stays conservative because current Forge telemetry showed high-confidence false positives on valid terminal/summarize calls."
  ],
  "labels": {
    "valid": {
      "action": "allow",
      "advisory_min_confidence": 0.0,
      "enforce_min_confidence": 1.01
    },
    "wrong_tool_semantic": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    },
    "wrong_arguments_semantic": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.9,
      "enforce_min_confidence": 0.995
    },
    "tool_not_needed": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.8,
      "enforce_min_confidence": 0.95
    },
    "needs_clarification": {
      "action": "advisory_then_enforce_after_eval",
      "advisory_min_confidence": 0.8,
      "enforce_min_confidence": 0.95
    },
    "deterministic_invalid": {
      "action": "deterministic_only",
      "advisory_min_confidence": 1.01,
      "enforce_min_confidence": 1.01
    }
  }
}
```
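
One way a runtime might turn these thresholds into a decision is sketched below. Two things are assumptions rather than documented behavior: that `temperature` divides the logits before the softmax, and the action names `allow`, `advise`, and `block`. The threshold values are copied from the JSON above.

```python
import math

# Per-label (advisory_min_confidence, enforce_min_confidence), copied from thresholds.json above.
THRESHOLDS = {
    "valid": (0.0, 1.01),
    "wrong_tool_semantic": (1.01, 1.01),
    "wrong_arguments_semantic": (0.9, 0.995),
    "tool_not_needed": (0.8, 0.95),
    "needs_clarification": (0.8, 0.95),
    "deterministic_invalid": (1.01, 1.01),
}
TEMPERATURE = 1.1155287027359009


def calibrated_confidences(logits, temperature=TEMPERATURE):
    """Temperature-scaled softmax (assumption: the temperature divides the logits)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def decide(label, confidence, mode):
    """Return 'allow', 'advise', or 'block' (hypothetical action names)."""
    if mode in ("off", "shadow") or label == "valid":
        return "allow"
    advisory_min, enforce_min = THRESHOLDS[label]
    if mode == "enforce" and confidence >= enforce_min:
        return "block"
    if confidence >= advisory_min:
        return "advise"
    return "allow"
```

Because a probability never exceeds `1.0`, a threshold of `1.01` can never be met, so `wrong_tool_semantic` and `deterministic_invalid` always resolve to `allow` under this sketch, whatever the mode.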

## Input format

The classifier expects the canonical serialized format produced by `serialize_state_v1`.

```text
SCHEMA_VERSION:
toolcall-verifier-input/v1
USER_REQUEST:
Generate a sales report from the Q4 2024 dataset.
WORKFLOW_STATE:
required_steps=['fetch_sales_data', 'analyze_sales']
completed_steps=[]
pending_steps=['fetch_sales_data', 'analyze_sales']
terminal_tools=['report']
recent_errors=[]
AVAILABLE_TOOLS:
report: Produce the final report from findings.
PARAMETERS: {"properties": {"summary": {"type": "string"}}, "required": ["summary"], "type": "object"}
fetch_sales_data: Fetch sales data for a given quarter and year.
PARAMETERS: {"properties": {"quarter": {"type": "integer"}, "year": {"type": "integer"}}, "required": ["quarter", "year"], "type": "object"}
analyze_sales: Analyze the loaded sales data and produce findings.
PARAMETERS: {"properties": {}, "type": "object"}
CANDIDATE_CALL:
{"arguments": {"summary": "Done."}, "name": "report"}
```

For this fixture, both PyTorch and ONNX selected `deterministic_invalid`, with a reported max absolute logit difference of `6.67572021484375e-06`.
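
The layout of the example can be approximated in a few lines of Python. This is a reconstruction from the example text only, not the real `serialize_state_v1`: the function name, argument names, and dictionary keys are hypothetical, and any indentation or separators not visible in the example may differ. Treat `serializer_fixture.json` as the source of truth.

```python
import json

WORKFLOW_KEYS = ("required_steps", "completed_steps", "pending_steps", "terminal_tools", "recent_errors")


def serialize_state_sketch(user_request, workflow, tools, candidate_call):
    """Approximate the example layout; NOT the canonical serializer."""
    lines = [
        "SCHEMA_VERSION:",
        "toolcall-verifier-input/v1",
        "USER_REQUEST:",
        user_request,
        "WORKFLOW_STATE:",
    ]
    for key in WORKFLOW_KEYS:
        # The example prints lists in Python repr form, e.g. ['a', 'b'].
        lines.append(f"{key}={list(workflow.get(key, []))!r}")
    lines.append("AVAILABLE_TOOLS:")
    for tool in tools:
        lines.append(f"{tool['name']}: {tool['description']}")
        # The example shows JSON with alphabetically sorted keys.
        lines.append("PARAMETERS: " + json.dumps(tool["parameters"], sort_keys=True))
    lines.append("CANDIDATE_CALL:")
    lines.append(json.dumps(candidate_call, sort_keys=True))
    return "\n".join(lines)


example_text = serialize_state_sketch(
    "Generate a sales report from the Q4 2024 dataset.",
    {
        "required_steps": ["fetch_sales_data", "analyze_sales"],
        "pending_steps": ["fetch_sales_data", "analyze_sales"],
        "terminal_tools": ["report"],
    },
    [
        {
            "name": "report",
            "description": "Produce the final report from findings.",
            "parameters": {"type": "object", "properties": {"summary": {"type": "string"}}, "required": ["summary"]},
        },
        {
            "name": "analyze_sales",
            "description": "Analyze the loaded sales data and produce findings.",
            "parameters": {"type": "object", "properties": {}},
        },
    ],
    {"name": "report", "arguments": {"summary": "Done."}},
)
```

The sorted JSON keys and the Python-style list rendering are the two details most likely to drift in a Rust port, which is why a byte-level comparison against the fixture is worth running.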

## Inference

### Transformers pipeline

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

repo_id = "cowWhySo/toolcall-verifier-classifier-production"

# Keep the slow tokenizer for parity with training (see "Tokenizer notes").
tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(repo_id)

clf = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    top_k=None,  # return a score for every label
    device=0,  # use -1 for CPU
)

# `serialized_tool_call` must hold the text produced by serialize_state_v1 (see "Input format").
scores = clf(serialized_tool_call, truncation=True, max_length=1280)[0]
scores = sorted(scores, key=lambda item: item["score"], reverse=True)
print(scores[:5])
```

### ONNX Runtime

The ONNX path is the recommended Rust/runtime deployment path. Load the model with the same tokenizer behavior and the same serialized input text used during training.

Required files:

```text
model.onnx
model_quantized.onnx
labels.json
thresholds.json
artifact_manifest.json
input_schema.json
serializer_fixture.json
tokenizer_config.json
special_tokens_map.json
spm.model
```

Runtime integrations should byte-compare their serializer output against `serializer_fixture.json` before trusting model scores. This catches train/inference drift.
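
A byte-level comparison might look like the sketch below. The fixture layout is an assumption: the keys `input` and `expected_text` are hypothetical and should be replaced with whatever `serializer_fixture.json` actually contains.

```python
def first_mismatch_offset(fixture, serialize):
    """Return None when the serializer output matches the fixture byte for byte,
    otherwise the offset of the first differing byte.

    fixture: dict with hypothetical keys 'input' (structured state) and 'expected_text'.
    serialize: the runtime's serializer, taking the structured state and returning text.
    """
    actual = serialize(fixture["input"]).encode("utf-8")
    expected = fixture["expected_text"].encode("utf-8")
    if actual == expected:
        return None
    shared = min(len(actual), len(expected))
    for offset in range(shared):
        if actual[offset] != expected[offset]:
            return offset
    # One output is a strict prefix of the other (e.g. a trailing newline).
    return shared
```

Reporting the offset instead of a plain boolean makes whitespace and key-ordering drift much quicker to locate. In practice the fixture would be loaded with `json.load` from the artifact directory before calling this helper.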

## Rust deployment guidance

Recommended integration order:

```text
1. Parse provider response.
2. Validate format, known tool names, and JSON-schema arguments.
3. Enforce required steps, prerequisites, terminal rules, and unsafe batches.
4. If the call is still valid-looking, run the classifier.
5. Shadow mode: log classifier verdict only.
6. Advisory mode: use classifier verdict to choose better nudges.
7. Enforce mode: block only high-confidence semantic labels after eval proof.
```
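
The ordering above can be sketched as a single function, shown here in Python for brevity even though the target runtime is Rust. Every name is hypothetical: the deterministic checks, the `classify` callable, and the `should_block` policy are stand-ins supplied by the caller.

```python
def verify_tool_call(call, deterministic_checks, classify, mode, should_block=None, log=print):
    """Run deterministic checks first, then consult the classifier according to `mode`.

    deterministic_checks: callables returning an error string, or None when the call passes.
    classify: callable returning (label, confidence) for a call that passed every check.
    should_block: callable (label, confidence) -> bool, consulted only in enforce mode.
    """
    # Steps 2-3: deterministic guardrails stay authoritative and always run first.
    for check in deterministic_checks:
        error = check(call)
        if error is not None:
            return {"decision": "reject", "source": "deterministic", "detail": error}

    # Step 4: only calls that still look valid reach the classifier.
    if mode == "off":
        return {"decision": "allow", "source": "deterministic", "detail": None}
    label, confidence = classify(call)

    # Step 5: every mode records the verdict; shadow mode does nothing else.
    log({"label": label, "confidence": confidence, "mode": mode})
    if mode == "shadow" or label == "valid":
        return {"decision": "allow", "source": "classifier", "detail": None}

    # Step 7: block only when the caller-supplied policy says the verdict is strong enough.
    if mode == "enforce" and should_block is not None and should_block(label, confidence):
        return {"decision": "reject", "source": "classifier", "detail": label}

    # Step 6: otherwise allow the call and pass the label along as advice for nudging.
    return {"decision": "allow", "source": "classifier", "detail": label}
```

The classifier never sees a call that a deterministic check rejected, so it cannot override those decisions in either direction.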

Suggested runtime flags:

```text
--classifier-dir <path>
--classifier-mode off|shadow|advisory|enforce
--classifier-max-latency-ms <n>
FORGE_CLASSIFIER_DIR
FORGE_CLASSIFIER_MODE
FORGE_CLASSIFIER_MAX_LATENCY_MS
```

Default should be `off` unless a classifier directory is explicitly provided. First rollout should use `shadow`.
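
A possible resolution rule for these settings is sketched below. It rests on two assumptions that the text does not state: command-line values take precedence over environment variables, and a provided directory with no explicit mode falls back to `shadow`. The function name is hypothetical.

```python
import os

VALID_MODES = ("off", "shadow", "advisory", "enforce")


def resolve_classifier_config(cli_dir=None, cli_mode=None, env=None):
    """Return (classifier_dir, mode) from CLI values and environment variables."""
    env = os.environ if env is None else env
    # Assumption: CLI flags win over environment variables.
    classifier_dir = cli_dir or env.get("FORGE_CLASSIFIER_DIR")
    if not classifier_dir:
        # No directory means the classifier stays off, whatever mode was requested.
        return None, "off"
    # Assumption: with a directory but no explicit mode, start in shadow.
    mode = cli_mode or env.get("FORGE_CLASSIFIER_MODE") or "shadow"
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported classifier mode: {mode!r}")
    return classifier_dir, mode
```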

Recommended artifact-loader checks:

```text
artifact_manifest.json exists and includes training_run_summary/test_metrics provenance
artifact_schema_version == "toolcall-verifier-artifact/v1"
input_schema_version == "toolcall-verifier-input/v1"
serializer == "serialize_state_v1"
labels.json labels match model config
thresholds.json has every deployed label
tokenizer files exist
ONNX file exists
```

Loading failures should fail closed for strict deployment modes. Scoring failures should fail open in `shadow` and `advisory` modes, with telemetry.
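
Most of the loader checks listed above can be expressed as one function that returns a list of problems. The sketch makes several assumptions: the three version fields are top-level keys of `artifact_manifest.json`, `labels.json` is a plain JSON list of label strings, and the function name is hypothetical. The provenance check is omitted because its key layout is not shown here.

```python
import json
from pathlib import Path

TOKENIZER_FILES = ("tokenizer_config.json", "special_tokens_map.json", "spm.model")
EXPECTED_MANIFEST = {
    "artifact_schema_version": "toolcall-verifier-artifact/v1",
    "input_schema_version": "toolcall-verifier-input/v1",
    "serializer": "serialize_state_v1",
}


def check_artifacts(directory, model_labels, onnx_name="model.onnx"):
    """Return a list of human-readable problems; an empty list means every check passed."""
    root = Path(directory)
    required = (onnx_name, "labels.json", "thresholds.json", "artifact_manifest.json", *TOKENIZER_FILES)
    problems = [f"missing file: {name}" for name in required if not (root / name).is_file()]
    if problems:
        return problems  # content checks are pointless while files are missing

    manifest = json.loads((root / "artifact_manifest.json").read_text(encoding="utf-8"))
    for key, expected in EXPECTED_MANIFEST.items():
        if manifest.get(key) != expected:
            problems.append(f"manifest {key} is {manifest.get(key)!r}, expected {expected!r}")

    # Assumption: labels.json is a JSON list in model output order.
    labels = json.loads((root / "labels.json").read_text(encoding="utf-8"))
    if list(labels) != list(model_labels):
        problems.append("labels.json does not match the model config labels")

    thresholds = json.loads((root / "thresholds.json").read_text(encoding="utf-8"))
    missing = [label for label in model_labels if label not in thresholds.get("labels", {})]
    if missing:
        problems.append(f"thresholds.json lacks labels: {missing}")
    return problems
```

In a strict deployment mode a non-empty result would abort startup (fail closed); in `shadow` or `advisory` the runtime could log the problems and continue without the classifier.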

## Calibration and safety notes

- Keep the model in `shadow` mode until eval replay confirms behavior on your real traffic and workflow families.
- Do not use `deterministic_invalid` predictions to enforce blocks. Deterministic Rust guardrails own those decisions.
- `wrong_tool_semantic` is intentionally disabled by threshold values above `1.0` because the current telemetry showed high-confidence false positives on otherwise valid terminal/summarize calls.
- High-confidence mistakes were observed, including valid calls predicted as `deterministic_invalid` or `wrong_arguments_semantic`. Use per-family replay, not only aggregate F1, before promotion.
- The `needs_clarification` test support is small (`10` rows), so treat that label as under-validated despite the perfect held-out score.
- Validate public dataset licenses and any Forge-derived traces before publishing derived artifacts broadly.

## Tokenizer notes

The training run emitted tokenizer warnings around slow-to-fast conversion and regex/tokenization behavior. For parity-sensitive deployment, prefer the tokenizer path used by the notebook and artifact tests, and keep `use_fast=False` unless you have separately verified byte-for-byte or score-level parity.

If your Transformers version emits a Mistral regex warning for the local artifact, load with the appropriate `fix_mistral_regex=True` setting where supported. For Rust deployment, verify whether `tokenizer.json` is present and equivalent. If tokenizer parity is uncertain, use a sidecar scorer process until the tokenizer path is proven.

## ONNX parity and latency smoke check

A smoke test from the latest run reported:

| Check | Value |
|---|---:|
| Example latency | 126.49 ms |
| PyTorch top label | `deterministic_invalid` |
| ONNX top label | `deterministic_invalid` |
| Max absolute difference | `6.67572021484375e-06` |

This is a single-fixture smoke check, not a full deployment benchmark. Run larger parity checks on the exported `onnx_parity_report.json` and real replay traces before using quantized artifacts in advisory or enforcement mode.
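
A larger parity check reduces to comparing two sets of logits trace by trace. The sketch below assumes the logits have already been collected as lists of floats from each backend; the function name and the returned keys are hypothetical and do not describe the layout of `onnx_parity_report.json`.

```python
def parity_report(reference_logits, candidate_logits):
    """Compare per-trace logits from two backends (e.g. PyTorch vs. ONNX, or fp32 vs. quantized).

    Both arguments are lists of equal length, one list of floats per replay trace.
    """
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("trace counts differ")
    max_abs_diff = 0.0
    label_mismatches = 0
    for ref, cand in zip(reference_logits, candidate_logits):
        if len(ref) != len(cand):
            raise ValueError("logit vectors differ in length")
        # Largest element-wise gap seen so far across all traces.
        max_abs_diff = max(max_abs_diff, max(abs(r - c) for r, c in zip(ref, cand)))
        # A top-label flip matters more than a small numeric gap.
        if ref.index(max(ref)) != cand.index(max(cand)):
            label_mismatches += 1
    return {
        "traces": len(reference_logits),
        "max_abs_diff": max_abs_diff,
        "label_mismatches": label_mismatches,
    }
```

Tracking top-label flips separately is deliberate: a quantized model can show a larger numeric gap than the fp32 export while still agreeing on every label, or the reverse near a decision boundary.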

## Related final-response verifier

The notebook can also train a separate final-response verifier with labels such as `valid_final_response`, `missing_tool_fact`, `contradicts_tool_result`, `unsupported_claim`, and `failed_to_acknowledge_data_gap`. That verifier is a separate artifact family and should be documented, evaluated, and deployed independently from this tool-call verifier.

The latest final-response run was small: `90` total rows split into `70` train, `10` validation, and `10` test rows. Its validation macro F1 remained low in the shown run, so it should stay experimental/shadow-only until the dataset is materially expanded.

## Limitations

- This model was trained on serialized tool-call contexts, not arbitrary natural language.
- It assumes deterministic validation has already run.
- It is sensitive to serializer drift, tokenizer drift, and tool-list truncation.
- Aggregate metrics are strong, but valid-call false positives are more important than headline macro F1 for enforcement.
- The `forge_augmented` test slice shown in the run contains only `12` rows, so it is useful as a smoke signal, not as sufficient Forge coverage.
- The final-response verifier path in the notebook is not mature enough for enforcement based on the shown data.

## Recommended eval replay before promotion

Run at least these variants before changing deployment mode:

```text
no_classifier
classifier_fp32_onnx_shadow
classifier_quantized_onnx_shadow
classifier_fp32_onnx_advisory
classifier_quantized_onnx_advisory
```

Promotion criteria should include:

- zero or near-zero false objections on valid calls,
- no regression in terminal-tool workflows,
- no regression in summarize/report workflows,
- improved targeted scenario-family scores,
- acceptable p95/p99 latency,
- PyTorch/ONNX/quantized parity on replay traces,
- stable behavior across real tool schemas, not only public function-calling datasets.
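
Two of these criteria, false objections on valid calls and tail latency, are easy to compute from replay records. The sketch below is illustrative only: the record keys (`true_label`, `objected`, `latency_ms`) and the default limits in `passes_gate` are hypothetical, since no concrete limits are given here.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def replay_summary(records):
    """Summarize replay records (dicts with hypothetical keys true_label, objected, latency_ms)."""
    valid = [r for r in records if r["true_label"] == "valid"]
    false_objections = sum(1 for r in valid if r["objected"])
    latencies = [r["latency_ms"] for r in records]
    return {
        "valid_calls": len(valid),
        "false_objection_rate": false_objections / len(valid) if valid else 0.0,
        "p95_latency_ms": percentile(latencies, 95),
        "p99_latency_ms": percentile(latencies, 99),
    }


def passes_gate(summary, max_false_objection_rate=0.0, max_p99_latency_ms=200.0):
    """Apply placeholder limits; choose real values from your own replay baselines."""
    return (
        summary["false_objection_rate"] <= max_false_objection_rate
        and summary["p99_latency_ms"] <= max_p99_latency_ms
    )
```

Running the same summary per scenario family, rather than once over the whole replay set, is what exposes regressions in terminal-tool or summarize/report workflows that an aggregate number would hide.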