	---
	language:
	- en
	license: other
	base_model: microsoft/deberta-v3-small
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- tool-use
	- function-calling
	- tool-calling
	- guardrails
	- verifier
	- text-classification
	- onnx
	- rust
	- shadow-mode
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: toolcall-verifier-classifier-production
	  results:
	  - task:
	      type: text-classification
	      name: Tool-call verification
	    dataset:
	      name: toolcall-verifier-dataset
	      type: cowWhySo/toolcall-verifier-dataset
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9729105322763307
	    - name: Macro F1
	      type: f1
	      value: 0.9796981345691861
	    - name: Macro Precision
	      type: precision
	      value: 0.9801958127901861
	    - name: Macro Recall
	      type: recall
	      value: 0.9793045374079483
	---

	# Tool-call Verifier Classifier Production

	This repository contains a production-mode text-classification sidecar for tool-call guardrails. The model scores serialized tool-call candidates after deterministic validation has already handled syntax, JSON schema, unknown tool names, required-step enforcement, prerequisite checks, unsafe batches, and terminal-tool rules.

	The intended deployment pattern is conservative: deterministic guardrails remain authoritative, while this classifier starts in `shadow` mode and is promoted only after repository-specific eval replay proves that it improves nudging or routing without introducing false blocks on valid tool calls.

	## Model summary

	| Field | Value |
	|---|---|
	| Base model | `microsoft/deberta-v3-small` |
	| Model kind | Text-classification cross-encoder |
	| Label mode | `production` |
	| Input schema | `toolcall-verifier-input/v1` |
	| Serializer | `serialize_state_v1` |
	| Max sequence length | `1280` |
	| Deployment default | `shadow` |
	| Primary artifact | `model.onnx` |
	| Quantized artifact | `model_quantized.onnx` |
	| Required tokenizer files | `tokenizer_config.json`, `special_tokens_map.json`, `spm.model` |
	| Threshold file | `thresholds.json` |
	| Manifest file | `artifact_manifest.json` |

	## Intended use

	Use this model to classify a candidate tool call in the context of:

	- the original user request,
	- available tool definitions,
	- required workflow steps,
	- completed and pending steps,
	- terminal tools,
	- recent errors,
	- and the candidate tool call.

	It is meant to support:

	- shadow telemetry for semantic tool-call quality,
	- advisory nudges when the selected tool or arguments look semantically wrong,
	- eval-backed enforcement for high-confidence semantic errors,
	- Rust-side inference through ONNX Runtime.

	It is not meant to replace deterministic guardrails. It should not accept malformed calls, override JSON-schema validation, rewrite arguments, execute tools, or relax required workflow rules.

	## Labels

	Production mode uses six labels:

	| Label | Meaning | Deployment guidance |
	|---|---|---|
	| `valid` | Candidate call appears appropriate for the request and workflow state. | Allow. |
	| `wrong_tool_semantic` | Candidate uses the wrong tool for the request or workflow state. | Conservative; currently disabled for advisory/enforcement by thresholds. |
	| `wrong_arguments_semantic` | Candidate uses a plausible tool but semantically wrong arguments. | Advisory first; enforce only after eval proof. |
	| `tool_not_needed` | Candidate calls a tool when no tool call is needed. | Advisory first; enforce only after eval proof. |
	| `needs_clarification` | Request is underspecified and should be clarified before tool use. | Advisory first; enforce only after eval proof. |
	| `deterministic_invalid` | Collapsed bucket for failures owned by deterministic validation. | Deterministic-only. Do not enforce from ML. |

	In production mode, the following raw labels are collapsed into `deterministic_invalid`: `invalid_args_schema`, `missing_required_args`, `unknown_tool`, `premature_terminal`, `missing_prerequisite`, `unsafe_parallel_batch`, and `malformed_tool_call`.
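The collapse described above can be sketched as a small lookup. This is a minimal illustration, not the training notebook's code: `collapse_label`, `DETERMINISTIC_RAW_LABELS`, and `PRODUCTION_LABELS` are hypothetical names, and the tuple order follows the table above rather than the model's `id2label` order, which should be read from the model config.

```python
# Raw labels that production mode folds into `deterministic_invalid`
# (list copied from the paragraph above).
DETERMINISTIC_RAW_LABELS = frozenset({
    "invalid_args_schema",
    "missing_required_args",
    "unknown_tool",
    "premature_terminal",
    "missing_prerequisite",
    "unsafe_parallel_batch",
    "malformed_tool_call",
})

# Table order only; NOT necessarily the model's id2label order.
PRODUCTION_LABELS = (
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
)


def collapse_label(raw_label: str) -> str:
    """Map a raw label to its production-mode label (hypothetical helper)."""
    if raw_label in DETERMINISTIC_RAW_LABELS:
        return "deterministic_invalid"
    if raw_label in PRODUCTION_LABELS:
        return raw_label
    raise ValueError(f"unknown raw label: {raw_label!r}")
```

Raising on unknown labels, rather than defaulting to `valid`, keeps a mislabeled row from silently entering the allow bucket.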

	## Training configuration

	Latest production run:

	| Field | Value |
	|---|---:|
	| GPU profile | `high_vram_quality` |
	| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
	| GPU memory | 95.0 GB |
	| Precision | bf16 + tf32 |
	| Seed | `42` |
	| Max per source | `40000` |
	| Max sequence length | `1280` |
	| Epochs requested | `5` |
	| Per-device train batch | `64` |
	| Eval batch | `128` |
	| Gradient accumulation | `1` |
	| Learning rate | `6e-6` |
	| Warmup ratio | `0.08` |
	| Early stopping patience | `2` |
	| Optimizer | `adamw_torch_fused` |
	| Gradient checkpointing | `false` |
	| Class weights | disabled |
	| Forge augmentation | enabled |
	| Final-response verifier training | enabled in the notebook, but separate from this tool-call classifier |

	Split sizes:

	| Split | Rows |
	|---|---:|
	| Train | 176,705 |
	| Validation | 10,819 |
	| Test | 22,075 |

	Best validation checkpoint:

	| Metric | Value |
	|---|---:|
	| Best checkpoint | `/content/toolcall-verifier/model/checkpoint-15505` |
	| Selection metric | `macro_f1` |
	| Validation loss | 0.07549399137496948 |
	| Validation accuracy | 0.973380164525372 |
	| Validation macro precision | 0.980623200539544 |
	| Validation macro recall | 0.9800955285359482 |
	| Validation macro F1 | 0.9803291996618319 |

	## Test metrics

	Held-out test set metrics:

	| Metric | Value |
	|---|---:|
	| Test loss | 0.0744357779622078 |
	| Accuracy | 0.9729105322763307 |
	| Macro precision | 0.9801958127901861 |
	| Macro recall | 0.9793045374079483 |
	| Macro F1 | 0.9796981345691861 |
	| Samples/sec | 686.863 |
	| Steps/sec | 5.383 |

	Per-label test report:

	| Label | Precision | Recall | F1 | Support |
	|---|---:|---:|---:|---:|
	| `valid` | 0.94 | 0.97 | 0.95 | 4,955 |
	| `wrong_tool_semantic` | 0.97 | 0.96 | 0.96 | 4,960 |
	| `wrong_arguments_semantic` | 0.98 | 0.98 | 0.98 | 5,005 |
	| `tool_not_needed` | 1.00 | 0.99 | 1.00 | 2,029 |
	| `needs_clarification` | 1.00 | 1.00 | 1.00 | 10 |
	| `deterministic_invalid` | 0.99 | 0.98 | 0.99 | 5,116 |
	| Macro avg | 0.98 | 0.98 | 0.98 | 22,075 |
	| Weighted avg | 0.97 | 0.97 | 0.97 | 22,075 |

	Per-source test accuracy:

	| Source | Rows | Accuracy | Avg confidence |
	|---|---:|---:|---:|
	| `Salesforce/xlam-function-calling-60k` | 14,234 | 0.973444 | 0.983573 |
	| `glaiveai/glaive-function-calling-v2` | 5,449 | 0.971738 | 0.981286 |
	| `Team-ACE/ToolACE` | 2,380 | 0.973109 | 0.982948 |
	| `forge_augmented` | 12 | 1.000000 | 0.999698 |

	Per-label test accuracy:

	| True label | Rows | Accuracy | Avg confidence |
	|---|---:|---:|---:|
	| `deterministic_invalid` | 5,116 | 0.980258 | 0.992830 |
	| `wrong_arguments_semantic` | 5,005 | 0.980819 | 0.989488 |
	| `wrong_tool_semantic` | 4,960 | 0.956855 | 0.971546 |
	| `valid` | 4,955 | 0.965691 | 0.972742 |
	| `tool_not_needed` | 2,029 | 0.992607 | 0.994770 |
	| `needs_clarification` | 10 | 1.000000 | 0.971681 |

	## Confusion matrix

	Rows are true labels. Columns are predicted labels.

	| True \ Predicted | `valid` | `wrong_tool_semantic` | `wrong_arguments_semantic` | `tool_not_needed` | `needs_clarification` | `deterministic_invalid` |
	|---|---:|---:|---:|---:|---:|---:|
	| `valid` | 4,785 | 76 | 74 | 0 | 0 | 20 |
	| `wrong_tool_semantic` | 195 | 4,746 | 8 | 0 | 0 | 11 |
	| `wrong_arguments_semantic` | 75 | 18 | 4,909 | 0 | 0 | 3 |
	| `tool_not_needed` | 1 | 13 | 0 | 2,014 | 0 | 1 |
	| `needs_clarification` | 0 | 0 | 0 | 0 | 10 | 0 |
	| `deterministic_invalid` | 49 | 50 | 2 | 0 | 0 | 5,015 |
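Per-label precision and recall can be recomputed from this matrix, as can the share of truly valid calls that the model flags as something else. The following is a small worked sketch using the counts above; the helper name `per_label_metrics` is hypothetical.

```python
LABELS = [
    "valid",
    "wrong_tool_semantic",
    "wrong_arguments_semantic",
    "tool_not_needed",
    "needs_clarification",
    "deterministic_invalid",
]

# Rows are true labels, columns are predicted labels (counts from the table above).
CONFUSION = [
    [4785, 76, 74, 0, 0, 20],
    [195, 4746, 8, 0, 0, 11],
    [75, 18, 4909, 0, 0, 3],
    [1, 13, 0, 2014, 0, 1],
    [0, 0, 0, 0, 10, 0],
    [49, 50, 2, 0, 0, 5015],
]


def per_label_metrics(labels, matrix):
    """Compute precision and recall for each label from a confusion matrix."""
    metrics = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        row_total = sum(matrix[i])                 # rows whose true label is `label`
        col_total = sum(row[i] for row in matrix)  # rows predicted as `label`
        metrics[label] = {
            "precision": true_positive / col_total if col_total else 0.0,
            "recall": true_positive / row_total if row_total else 0.0,
        }
    return metrics


metrics = per_label_metrics(LABELS, CONFUSION)

# Share of truly valid calls that were predicted as any non-valid label.
valid_row = CONFUSION[0]
false_objection_rate = (sum(valid_row) - valid_row[0]) / sum(valid_row)

for label, values in metrics.items():
    print(f"{label}: precision={values['precision']:.2f} recall={values['recall']:.2f}")
print(f"valid calls predicted as non-valid: {false_objection_rate:.4f}")
```

On this test set, 170 of 4,955 valid calls (about 3.4%) received a non-valid top label. That figure matters more for enforcement decisions than the macro averages.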

	## Threshold policy

	The exported default mode is `shadow`, with default action `allow`. These thresholds should be treated as deployment policy metadata, not as proof that enforcement is safe in a new environment.

	```json
	{
	  "schema_version": "toolcall-verifier-thresholds/v1",
	  "mode": "shadow",
	  "default_action": "allow",
	  "temperature": 1.1155287027359009,
	  "notes": [
	    "Deterministic guardrails remain authoritative.",
	    "Use ML in shadow mode first, then advisory nudges, then high-confidence enforcement only after eval proof.",
	    "deterministic_invalid is never enforced by ML in this default config.",
	    "wrong_tool_semantic stays conservative because current Forge telemetry showed high-confidence false positives on valid terminal/summarize calls."
	  ],
	  "labels": {
	    "valid": {
	      "action": "allow",
	      "advisory_min_confidence": 0.0,
	      "enforce_min_confidence": 1.01
	    },
	    "wrong_tool_semantic": {
	      "action": "advisory_then_enforce_after_eval",
	      "advisory_min_confidence": 1.01,
	      "enforce_min_confidence": 1.01
	    },
	    "wrong_arguments_semantic": {
	      "action": "advisory_then_enforce_after_eval",
	      "advisory_min_confidence": 0.9,
	      "enforce_min_confidence": 0.995
	    },
	    "tool_not_needed": {
	      "action": "advisory_then_enforce_after_eval",
	      "advisory_min_confidence": 0.8,
	      "enforce_min_confidence": 0.95
	    },
	    "needs_clarification": {
	      "action": "advisory_then_enforce_after_eval",
	      "advisory_min_confidence": 0.8,
	      "enforce_min_confidence": 0.95
	    },
	    "deterministic_invalid": {
	      "action": "deterministic_only",
	      "advisory_min_confidence": 1.01,
	      "enforce_min_confidence": 1.01
	    }
	  }
	}
	```
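One way a runtime might turn these thresholds into a decision is sketched below. This is an illustration under assumptions, not the project's implementation: the function name `decide` and the return values `"allow"`, `"advise"`, and `"block"` are hypothetical, and only the threshold numbers and `action` strings come from the JSON above.

```python
# Threshold values copied from the JSON above; the dict is trimmed to the
# fields this sketch reads.
THRESHOLDS = {
    "default_action": "allow",
    "labels": {
        "valid": {"action": "allow",
                  "advisory_min_confidence": 0.0, "enforce_min_confidence": 1.01},
        "wrong_tool_semantic": {"action": "advisory_then_enforce_after_eval",
                                "advisory_min_confidence": 1.01, "enforce_min_confidence": 1.01},
        "wrong_arguments_semantic": {"action": "advisory_then_enforce_after_eval",
                                     "advisory_min_confidence": 0.9, "enforce_min_confidence": 0.995},
        "tool_not_needed": {"action": "advisory_then_enforce_after_eval",
                            "advisory_min_confidence": 0.8, "enforce_min_confidence": 0.95},
        "needs_clarification": {"action": "advisory_then_enforce_after_eval",
                                "advisory_min_confidence": 0.8, "enforce_min_confidence": 0.95},
        "deterministic_invalid": {"action": "deterministic_only",
                                  "advisory_min_confidence": 1.01, "enforce_min_confidence": 1.01},
    },
}


def decide(label, confidence, mode, thresholds=THRESHOLDS):
    """Return 'allow', 'advise', or 'block' for one classifier verdict (hypothetical names)."""
    policy = thresholds["labels"].get(label)
    # Off and shadow modes never change the outcome; unknown labels fall back to the default.
    if mode in ("off", "shadow") or policy is None:
        return thresholds.get("default_action", "allow")
    # These action kinds are never acted on from ML output.
    if policy["action"] in ("allow", "deterministic_only"):
        return "allow"
    if mode == "enforce" and confidence >= policy["enforce_min_confidence"]:
        return "block"
    if confidence >= policy["advisory_min_confidence"]:
        return "advise"
    return "allow"
```

Because a probability cannot exceed `1.0`, a minimum confidence of `1.01` can never be met, which is how `wrong_tool_semantic` and `deterministic_invalid` stay disabled without a separate flag.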

	## Input format

	The classifier expects the canonical serialized format produced by `serialize_state_v1`.

	```text
	SCHEMA_VERSION:
	toolcall-verifier-input/v1

	USER_REQUEST:
	Generate a sales report from the Q4 2024 dataset.

	WORKFLOW_STATE:
	required_steps=['fetch_sales_data', 'analyze_sales']
	completed_steps=[]
	pending_steps=['fetch_sales_data', 'analyze_sales']
	terminal_tools=['report']
	recent_errors=[]

	AVAILABLE_TOOLS:
	report: Produce the final report from findings.
	PARAMETERS: {"properties": {"summary": {"type": "string"}}, "required": ["summary"], "type": "object"}

	fetch_sales_data: Fetch sales data for a given quarter and year.
	PARAMETERS: {"properties": {"quarter": {"type": "integer"}, "year": {"type": "integer"}}, "required": ["quarter", "year"], "type": "object"}

	analyze_sales: Analyze the loaded sales data and produce findings.
	PARAMETERS: {"properties": {}, "type": "object"}

	CANDIDATE_CALL:
	{"arguments": {"summary": "Done."}, "name": "report"}
	```

	For this fixture, both PyTorch and ONNX selected `deterministic_invalid`, with a reported max absolute logit difference of `6.67572021484375e-06`.
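For orientation, the layout above can be approximated in a few lines of Python. This is a hypothetical re-creation inferred only from the example shown here, not the real `serialize_state_v1`: the function name, argument names, and exact whitespace are assumptions, so any real integration should be byte-compared against `serializer_fixture.json`.

```python
import json


def serialize_state_sketch(user_request, required_steps, completed_steps,
                           pending_steps, terminal_tools, recent_errors,
                           tools, candidate_call):
    """Approximate the v1 text layout (hypothetical; NOT the real serializer)."""
    lines = [
        "SCHEMA_VERSION:",
        "toolcall-verifier-input/v1",
        "",
        "USER_REQUEST:",
        user_request,
        "",
        "WORKFLOW_STATE:",
        f"required_steps={required_steps!r}",
        f"completed_steps={completed_steps!r}",
        f"pending_steps={pending_steps!r}",
        f"terminal_tools={terminal_tools!r}",
        f"recent_errors={recent_errors!r}",
        "",
        "AVAILABLE_TOOLS:",
    ]
    for tool in tools:
        lines.append(f"{tool['name']}: {tool['description']}")
        # Keys appear alphabetically sorted in the example, hence sort_keys=True.
        lines.append("PARAMETERS: " + json.dumps(tool["parameters"], sort_keys=True))
        lines.append("")
    lines.append("CANDIDATE_CALL:")
    lines.append(json.dumps(candidate_call, sort_keys=True))
    return "\n".join(lines)


example_text = serialize_state_sketch(
    user_request="Generate a sales report from the Q4 2024 dataset.",
    required_steps=["fetch_sales_data", "analyze_sales"],
    completed_steps=[],
    pending_steps=["fetch_sales_data", "analyze_sales"],
    terminal_tools=["report"],
    recent_errors=[],
    tools=[
        {"name": "report",
         "description": "Produce the final report from findings.",
         "parameters": {"type": "object",
                        "properties": {"summary": {"type": "string"}},
                        "required": ["summary"]}},
        {"name": "analyze_sales",
         "description": "Analyze the loaded sales data and produce findings.",
         "parameters": {"type": "object", "properties": {}}},
    ],
    candidate_call={"name": "report", "arguments": {"summary": "Done."}},
)
print(example_text)
```

The example passes only two of the three tools to keep the sketch short. In the fixture, the candidate calls the terminal `report` tool while both required steps are still pending, which is consistent with the `deterministic_invalid` verdict reported above.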

	## Inference

	### Transformers pipeline

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	repo_id = "cowWhySo/toolcall-verifier-classifier-production"

	# Keep the slow tokenizer for parity with training (see "Tokenizer notes").
	tokenizer = AutoTokenizer.from_pretrained(repo_id, use_fast=False)
	model = AutoModelForSequenceClassification.from_pretrained(repo_id)

	clf = pipeline(
	    "text-classification",
	    model=model,
	    tokenizer=tokenizer,
	    top_k=None,  # return a score for every label
	    device=0,  # use -1 for CPU
	)

	# `serialized_tool_call` must hold the text produced by `serialize_state_v1`
	# (see "Input format").
	scores = clf(serialized_tool_call, truncation=True, max_length=1280)[0]
	scores = sorted(scores, key=lambda item: item["score"], reverse=True)
	print(scores[:5])
	```

	### ONNX Runtime

	The ONNX path is the recommended Rust/runtime deployment path. Load the model with the same tokenizer behavior and the same serialized input text used during training.

	Required files:

	```text
	model.onnx
	model_quantized.onnx
	labels.json
	thresholds.json
	artifact_manifest.json
	input_schema.json
	serializer_fixture.json
	tokenizer_config.json
	special_tokens_map.json
	spm.model
	```
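Before porting to Rust, it can help to exercise the ONNX artifact from Python. The sketch below assumes the files above sit in a local directory, that `onnxruntime`, `numpy`, and `transformers` are installed, that the first graph output holds the logits, and that the caller supplies `labels` in the model's `id2label` order. The helper names are hypothetical.

```python
def build_feeds(encoded, input_names):
    """Keep only the tokenizer outputs that the ONNX graph declares as inputs."""
    return {name: value for name, value in encoded.items() if name in input_names}


def top_label(logits, labels):
    """Return the label whose logit is largest."""
    best = max(range(len(logits)), key=lambda index: logits[index])
    return labels[best]


def classify_with_onnx(artifact_dir, serialized_text, labels, max_length=1280):
    """Tokenize one serialized input, run model.onnx, and return (label, logits)."""
    # Imported here so the helpers above work without these packages installed.
    import numpy as np
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(artifact_dir, use_fast=False)
    session = ort.InferenceSession(
        f"{artifact_dir}/model.onnx", providers=["CPUExecutionProvider"]
    )
    encoded = tokenizer(
        serialized_text, truncation=True, max_length=max_length, return_tensors="np"
    )
    # Some exports omit token_type_ids, so feed only what the graph asks for.
    input_names = {node.name for node in session.get_inputs()}
    feeds = {
        name: np.asarray(value, dtype=np.int64)
        for name, value in build_feeds(encoded, input_names).items()
    }
    # Assumption: the first graph output holds logits shaped (batch, num_labels).
    logits = session.run(None, feeds)[0][0].tolist()
    return top_label(logits, labels), logits
```

Filtering the feeds by the session's declared inputs avoids a hard failure when the tokenizer emits a tensor the exported graph does not accept.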

	Runtime integrations should byte-compare their serializer output against `serializer_fixture.json` before trusting model scores. This catches train/inference drift.
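A byte-level comparison of this kind could look like the sketch below. It takes the expected and actual text as plain strings because the layout of `serializer_fixture.json` is not shown here. Both function names are hypothetical.

```python
def first_byte_mismatch(expected: str, actual: str):
    """Return None when the UTF-8 bytes match, else the first differing offset."""
    expected_bytes = expected.encode("utf-8")
    actual_bytes = actual.encode("utf-8")
    if expected_bytes == actual_bytes:
        return None
    for offset, (left, right) in enumerate(zip(expected_bytes, actual_bytes)):
        if left != right:
            return offset
    # One input is a strict prefix of the other.
    return min(len(expected_bytes), len(actual_bytes))


def assert_serializer_parity(expected: str, actual: str) -> None:
    """Raise if the runtime serializer output drifts from the fixture text."""
    offset = first_byte_mismatch(expected, actual)
    if offset is not None:
        raise AssertionError(f"serializer drift at byte offset {offset}")
```

Reporting the offset makes common causes of drift, such as a trailing newline or a different key order, quicker to locate.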

	## Rust deployment guidance

	Recommended integration order:

	```text
	1. Parse provider response.
	2. Validate format, known tool names, and JSON-schema arguments.
	3. Enforce required steps, prerequisites, terminal rules, and unsafe batches.
	4. If the call is still valid-looking, run the classifier.
	5. Shadow mode: log classifier verdict only.
	6. Advisory mode: use classifier verdict to choose better nudges.
	7. Enforce mode: block only high-confidence semantic labels after eval proof.
	```
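The ordering above can be sketched as a single function. This is an illustration in Python rather than the Rust integration itself: `guard_tool_call`, `Outcome`, and the injected callables are hypothetical, and re-raising scoring errors in `enforce` mode is an assumption, since the guidance only specifies fail-open behavior for `shadow` and `advisory`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Outcome:
    action: str                                   # "allow", "advise", or "block"
    source: str                                   # "deterministic", "classifier", or "none"
    verdict: Optional[Tuple[str, float]] = None   # (label, confidence) when scored


def guard_tool_call(call, deterministic_check, classify, decide, mode="shadow", log=print):
    """Run deterministic checks first; consult the classifier only afterwards."""
    # Steps 1-3: deterministic validation is authoritative.
    error = deterministic_check(call)
    if error is not None:
        return Outcome("block", "deterministic")
    if mode == "off":
        return Outcome("allow", "none")

    # Step 4: the call still looks valid, so score it.
    try:
        label, confidence = classify(call)
    except Exception as exc:
        if mode in ("shadow", "advisory"):
            # Fail open with telemetry in the non-enforcing modes.
            log(f"classifier scoring failed, allowing call: {exc}")
            return Outcome("allow", "none")
        raise  # assumption: enforce mode surfaces the failure to the caller

    log(f"classifier verdict: {label} ({confidence:.3f})")

    # Step 5: shadow mode only records the verdict.
    if mode == "shadow":
        return Outcome("allow", "classifier", (label, confidence))

    # Steps 6-7: advisory and enforce modes defer to the threshold policy.
    return Outcome(decide(label, confidence, mode), "classifier", (label, confidence))
```

Passing `deterministic_check`, `classify`, and `decide` in as callables keeps the ordering testable without loading a model.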

	Suggested runtime flags:

	```text
	--classifier-dir <path>
	--classifier-mode off|shadow|advisory|enforce
	--classifier-max-latency-ms <n>
	FORGE_CLASSIFIER_DIR
	FORGE_CLASSIFIER_MODE
	FORGE_CLASSIFIER_MAX_LATENCY_MS
	```

	The default mode should be `off` unless a classifier directory is explicitly provided. The first rollout should use `shadow`.
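The flag and environment-variable handling might be resolved as in the following sketch, written in Python for illustration. Two behaviors are assumptions rather than documented rules: command-line flags take precedence over environment variables, and a provided directory with no explicit mode resolves to `shadow`. The function name `load_classifier_config` is hypothetical.

```python
import argparse
import os

MODES = ("off", "shadow", "advisory", "enforce")


def load_classifier_config(argv=None, env=None):
    """Resolve classifier settings from flags, falling back to environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    # Environment values act as defaults, so explicit flags override them.
    parser.add_argument("--classifier-dir", default=env.get("FORGE_CLASSIFIER_DIR"))
    parser.add_argument("--classifier-mode", default=env.get("FORGE_CLASSIFIER_MODE"))
    parser.add_argument(
        "--classifier-max-latency-ms",
        type=int,
        default=env.get("FORGE_CLASSIFIER_MAX_LATENCY_MS"),
    )
    args, _unknown = parser.parse_known_args(argv)

    directory = args.classifier_dir
    mode = args.classifier_mode
    if mode is not None and mode not in MODES:
        raise ValueError(f"unsupported classifier mode: {mode!r}")
    if directory is None:
        if mode not in (None, "off"):
            raise ValueError("a classifier mode other than 'off' needs a classifier directory")
        mode = "off"
    elif mode is None:
        mode = "shadow"  # assumption: safest default once artifacts are provided

    return {
        "dir": directory,
        "mode": mode,
        "max_latency_ms": args.classifier_max_latency_ms,
    }
```

Rejecting an active mode without a directory surfaces misconfiguration at startup instead of at the first scoring call.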

	Recommended artifact-loader checks:

	```text
	artifact_manifest.json exists and includes training_run_summary/test_metrics provenance
	artifact_schema_version == "toolcall-verifier-artifact/v1"
	input_schema_version == "toolcall-verifier-input/v1"
	serializer == "serialize_state_v1"
	labels.json labels match model config
	thresholds.json has every deployed label
	tokenizer files exist
	ONNX file exists
	```
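Most of these checks can be expressed as one function that returns a list of problems. The sketch below works on already-parsed data and makes several assumptions: the three version fields are top-level keys of `artifact_manifest.json`, `labels` is an ordered list of label names, and `thresholds` has the `labels` mapping shown in the threshold policy. The provenance check is left out because the manifest layout is not shown here.

```python
REQUIRED_TOKENIZER_FILES = ("tokenizer_config.json", "special_tokens_map.json", "spm.model")

EXPECTED_MANIFEST_FIELDS = {
    "artifact_schema_version": "toolcall-verifier-artifact/v1",
    "input_schema_version": "toolcall-verifier-input/v1",
    "serializer": "serialize_state_v1",
}


def check_artifacts(manifest, labels, model_labels, thresholds, present_files,
                    onnx_file="model.onnx"):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for key, expected in EXPECTED_MANIFEST_FIELDS.items():
        actual = manifest.get(key)
        if actual != expected:
            problems.append(f"manifest {key} is {actual!r}, expected {expected!r}")
    # Order matters: label index i must mean the same thing in both places.
    if list(labels) != list(model_labels):
        problems.append("labels.json does not match the model config labels")
    threshold_labels = thresholds.get("labels", {})
    for label in labels:
        if label not in threshold_labels:
            problems.append(f"thresholds.json has no entry for label {label!r}")
    for name in REQUIRED_TOKENIZER_FILES + (onnx_file,):
        if name not in present_files:
            problems.append(f"missing file: {name}")
    return problems
```

Collecting every problem, instead of stopping at the first, gives operators a complete picture in a single startup log line.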

	Loading failures should fail closed for strict deployment modes. Scoring failures should fail open in `shadow` and `advisory` modes, with telemetry.

	## Calibration and safety notes

	- Keep the model in `shadow` mode until eval replay confirms behavior on your real traffic and workflow families.
	- Do not use `deterministic_invalid` predictions to enforce blocks. Deterministic Rust guardrails own those decisions.
	- `wrong_tool_semantic` is intentionally disabled by threshold values above `1.0` because the current telemetry showed high-confidence false positives on otherwise valid terminal/summarize calls.
	- High-confidence mistakes were observed, including valid calls predicted as `deterministic_invalid` or `wrong_arguments_semantic`. Use per-family replay, not only aggregate F1, before promotion.
	- The `needs_clarification` test support is small (`10` rows), so treat that label as under-validated despite the perfect held-out score.
	- Validate public dataset licenses and any Forge-derived traces before publishing derived artifacts broadly.
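The `temperature` field in `thresholds.json` suggests that confidences are calibrated with temperature scaling. The sketch below assumes the standard form, a softmax over `logits / temperature`; this card does not confirm that the runtime applies it this way. The logits are made-up illustrative values.

```python
import math

TEMPERATURE = 1.1155287027359009  # "temperature" value from thresholds.json


def calibrated_probabilities(logits, temperature=TEMPERATURE):
    """Softmax over logits divided by the temperature (standard temperature scaling)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


illustrative_logits = [4.0, 1.0, 0.5]  # invented for the example, not model output
raw = calibrated_probabilities(illustrative_logits, temperature=1.0)
calibrated = calibrated_probabilities(illustrative_logits)
print(f"top confidence without scaling: {max(raw):.4f}")
print(f"top confidence with scaling:    {max(calibrated):.4f}")
```

A temperature above `1.0` flattens the distribution, so the top confidence drops while the predicted label stays the same. Thresholds such as `0.995` therefore only make sense when compared against confidences computed the same way as during calibration.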

	## Tokenizer notes

	The training run emitted tokenizer warnings around slow-to-fast conversion and regex/tokenization behavior. For parity-sensitive deployment, prefer the tokenizer path used by the notebook and artifact tests, and keep `use_fast=False` unless you have separately verified byte-for-byte or score-level parity.

	If your Transformers version emits a Mistral regex warning for the local artifact, load with the appropriate `fix_mistral_regex=True` setting where supported. For Rust deployment, verify whether `tokenizer.json` is present and equivalent. If tokenizer parity is uncertain, use a sidecar scorer process until the tokenizer path is proven.

	## ONNX parity and latency smoke check

	A smoke test from the latest run reported:

	| Check | Value |
	|---|---:|
	| Example latency | 126.49 ms |
	| PyTorch top label | `deterministic_invalid` |
	| ONNX top label | `deterministic_invalid` |
	| Max absolute difference | `6.67572021484375e-06` |

	This is a single-fixture smoke check, not a full deployment benchmark. Review the exported `onnx_parity_report.json` and run larger parity checks on real replay traces before using quantized artifacts in advisory or enforcement mode.
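A per-trace parity comparison could be as simple as the sketch below, which takes two logit vectors for the same input (for example PyTorch versus ONNX, or fp32 versus quantized). The function name and the `1e-4` tolerance are assumptions; pick a tolerance from your own replay data.

```python
def parity_report(reference_logits, candidate_logits, tolerance=1e-4):
    """Compare two logit vectors for the same input."""
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("logit vectors must have the same length")
    max_abs_diff = max(
        abs(left - right) for left, right in zip(reference_logits, candidate_logits)
    )
    reference_top = max(range(len(reference_logits)), key=lambda i: reference_logits[i])
    candidate_top = max(range(len(candidate_logits)), key=lambda i: candidate_logits[i])
    return {
        "max_abs_diff": max_abs_diff,
        "same_top_label": reference_top == candidate_top,
        "within_tolerance": max_abs_diff <= tolerance,
    }
```

Tracking top-label agreement separately from the numeric difference matters for quantized artifacts, where logits may shift noticeably while the decision stays the same, or the reverse near a decision boundary.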

	## Related final-response verifier

	The notebook can also train a separate final-response verifier with labels such as `valid_final_response`, `missing_tool_fact`, `contradicts_tool_result`, `unsupported_claim`, and `failed_to_acknowledge_data_gap`. That verifier is a separate artifact family and should be documented, evaluated, and deployed independently from this tool-call verifier.

	The latest final-response run was small: `90` total rows split into `70` train, `10` validation, and `10` test rows. Its validation macro F1 remained low in the shown run, so it should stay experimental/shadow-only until the dataset is materially expanded.

	## Limitations

	- This model was trained on serialized tool-call contexts, not arbitrary natural language.
	- It assumes deterministic validation has already run.
	- It is sensitive to serializer drift, tokenizer drift, and tool-list truncation.
	- Aggregate metrics are strong, but valid-call false positives are more important than headline macro F1 for enforcement.
	- The `forge_augmented` test slice shown in the run contains only `12` rows, so it is useful as a smoke signal, not as sufficient Forge coverage.
	- The final-response verifier path in the notebook is not mature enough for enforcement based on the shown data.

	## Recommended eval replay before promotion

	Run at least these variants before changing deployment mode:

	```text
	no_classifier
	classifier_fp32_onnx_shadow
	classifier_quantized_onnx_shadow
	classifier_fp32_onnx_advisory
	classifier_quantized_onnx_advisory
	```

	Promotion criteria should include:

	- zero or near-zero false objections on valid calls,
	- no regression in terminal-tool workflows,
	- no regression in summarize/report workflows,
	- improved targeted scenario-family scores,
	- acceptable p95/p99 latency,
	- PyTorch/ONNX/quantized parity on replay traces,
	- stable behavior across real tool schemas, not only public function-calling datasets.
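Two of these criteria, false objections on valid calls and tail latency, can be computed from replay records with a sketch like the one below. The record shape (`true_label`, `action`, `latency_ms`) and both function names are hypothetical. The percentile uses the nearest-rank method.

```python
import math


def percentile(values, fraction):
    """Nearest-rank percentile; `fraction` is in (0, 1]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(fraction * len(ordered)))
    return ordered[rank - 1]


def replay_summary(records):
    """Summarize replay records shaped like
    {"true_label": str, "action": str, "latency_ms": float} (hypothetical shape)."""
    valid = [record for record in records if record["true_label"] == "valid"]
    objections = [record for record in valid if record["action"] != "allow"]
    latencies = [record["latency_ms"] for record in records]
    return {
        "valid_calls": len(valid),
        "false_objection_rate": len(objections) / len(valid) if valid else 0.0,
        "p95_latency_ms": percentile(latencies, 0.95),
        "p99_latency_ms": percentile(latencies, 0.99),
    }
```

Running the same summary for each variant listed above, and per scenario family rather than only in aggregate, makes regressions in terminal-tool or summarize/report workflows visible.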