Final Response Verifier Classifier Production

Model summary

cowWhySo/final-response-verifier-classifier-production is an experimental DeBERTa-v3-small sequence classifier for verifying candidate final responses in tool-using workflows. It is designed as a sidecar verifier after a workflow has completed its required tool calls. The model takes a serialized representation of the user request, workflow state, required facts, tool trace, tool results, and candidate final response, then predicts whether the final response is grounded in the tool evidence.

This model is not a generative model. It does not execute tools, retrieve evidence, or rewrite responses. It only scores a serialized candidate response against the supplied workflow context.

Current status: shadow-only. This artifact is useful for integration testing, serializer compatibility checks, ONNX deployment validation, and telemetry collection. It should not be used for production blocking or autonomous enforcement yet. The current test set is very small and the measured classification quality is poor: 0.20 accuracy and 0.0667 macro F1 on a 10-row test split.

Intended use

Use this model to evaluate candidate terminal responses in systems that already maintain structured tool workflow state.

Appropriate uses:

  • Shadow logging for final-answer verifier experiments.
  • Eval replay comparing no-classifier, FP32 ONNX shadow, quantized ONNX shadow, and advisory variants.
  • Checking whether Rust-side serialization and ONNX inference match the Python training artifact.
  • Building telemetry for future promotion decisions.

Do not use this model for:

  • Production enforcement without additional evaluation.
  • General-purpose hallucination detection.
  • Safety-critical factuality, compliance, medical, legal, or financial decisions.
  • Replacing deterministic tool-call guardrails, JSON-schema checks, workflow-state enforcement, or source/citation validation.

Labels

The classifier predicts one of five labels:

| Label | Meaning | Default handling |
| --- | --- | --- |
| valid_final_response | Candidate response is grounded in the required facts and tool outputs. | Allow |
| missing_tool_fact | Candidate response omits one or more required facts from the tool evidence. | Shadow/advisory only |
| contradicts_tool_result | Candidate response conflicts with a tool result. | Shadow/advisory only |
| unsupported_claim | Candidate response adds a claim not supported by the supplied tool results. | Shadow/advisory only |
| failed_to_acknowledge_data_gap | Candidate response fails to acknowledge missing required data, or treats missing data as known. | Shadow/advisory only |

Input contract

The artifact uses:

  • Input schema: final-response-verifier-input/v1
  • Serializer: serialize_final_response_state_v1
  • Max sequence length: 768
  • Base model: microsoft/deberta-v3-small

The structured input contains:

{
  "schema_version": "final-response-verifier-input/v1",
  "user_request": "...",
  "workflow_state": {
    "required_steps": ["..."],
    "completed_steps": ["..."],
    "pending_steps": ["..."],
    "terminal_tools": ["..."],
    "recent_errors": ["..."]
  },
  "required_facts": ["..."],
  "tool_trace": ["..."],
  "tool_results": [
    {"tool_name": "...", "content": "..."}
  ],
  "candidate_final_response": "...",
  "metadata": {
    "scenario_family": "...",
    "requires_transform": false,
    "requires_synthesis": true,
    "requires_all_tool_facts": true,
    "must_acknowledge_missing_data": false
  }
}

The text serializer emits a sectioned prompt-like string:

SCHEMA_VERSION:
final-response-verifier-input/v1

USER_REQUEST:
...

WORKFLOW_STATE:
required_steps=[...]
completed_steps=[...]
pending_steps=[...]
terminal_tools=[...]
recent_errors=[...]

REQUIRED_FACTS:
[...]

TOOL_TRACE:
[...]

TOOL_RESULTS:
tool_name: "tool output text"

CANDIDATE_FINAL_RESPONSE:
...

SCORING_METADATA:
scenario_family="..."
requires_transform=false
requires_synthesis=true
requires_all_tool_facts=true
must_acknowledge_missing_data=false

For deployment, the Rust or Python caller must reproduce this serializer exactly. The model was trained on this one serialization format, so inferring with a different section order, layout, or quoting style makes its predictions unreliable.
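As a rough illustration of the sectioned layout, the sketch below rebuilds the string from the structured input. It is not the shipped serialize_final_response_state_v1: the function name serialize_sketch is hypothetical, and the quoting rules (Python repr for lists, json.dumps for quoted strings) are assumptions inferred from the example in this card. Any real caller should be checked byte-for-byte against the training serializer.

```python
import json

WORKFLOW_KEYS = ["required_steps", "completed_steps", "pending_steps",
                 "terminal_tools", "recent_errors"]
FLAG_KEYS = ["requires_transform", "requires_synthesis",
             "requires_all_tool_facts", "must_acknowledge_missing_data"]


def serialize_sketch(state: dict) -> str:
    """Hypothetical re-creation of the sectioned layout (not the shipped serializer)."""
    wf, meta = state["workflow_state"], state["metadata"]
    # Assumption: lists are rendered with Python repr, as in the card's example.
    workflow = "\n".join(f"{key}={wf[key]!r}" for key in WORKFLOW_KEYS)
    # Assumption: tool output and scenario_family are double-quoted via json.dumps.
    results = "\n".join(
        f'{r["tool_name"]}: {json.dumps(r["content"])}' for r in state["tool_results"]
    )
    scoring = [f'scenario_family={json.dumps(meta["scenario_family"])}']
    # Booleans appear lowercase in the documented output.
    scoring += [f"{key}={'true' if meta[key] else 'false'}" for key in FLAG_KEYS]
    sections = [
        ("SCHEMA_VERSION", state["schema_version"]),
        ("USER_REQUEST", state["user_request"]),
        ("WORKFLOW_STATE", workflow),
        ("REQUIRED_FACTS", repr(state["required_facts"])),
        ("TOOL_TRACE", repr(state["tool_trace"])),
        ("TOOL_RESULTS", results),
        ("CANDIDATE_FINAL_RESPONSE", state["candidate_final_response"]),
        ("SCORING_METADATA", "\n".join(scoring)),
    ]
    # Each section is a header line followed by its body; sections are blank-line separated.
    return "\n\n".join(f"{name}:\n{body}" for name, body in sections)


example_state = {
    "schema_version": "final-response-verifier-input/v1",
    "user_request": "Generate a sales report from the Q4 2024 dataset.",
    "workflow_state": {
        "required_steps": ["fetch_sales_data", "analyze_sales"],
        "completed_steps": ["fetch_sales_data", "analyze_sales"],
        "pending_steps": [],
        "terminal_tools": ["report"],
        "recent_errors": [],
    },
    "required_facts": ["23% YoY growth", "Widget Pro", "APAC"],
    "tool_trace": ["fetch_sales_data", "analyze_sales", "report"],
    "tool_results": [{
        "tool_name": "analyze_sales",
        "content": "Revenue grew 23% YoY. Top product: Widget Pro. Weakest region: APAC.",
    }],
    "candidate_final_response": "Revenue grew 23% YoY. Top product was Widget Pro, and APAC was the weakest region.",
    "metadata": {
        "scenario_family": "sequential_3step",
        "requires_transform": False,
        "requires_synthesis": False,
        "requires_all_tool_facts": True,
        "must_acknowledge_missing_data": False,
    },
}

print(serialize_sketch(example_state))
```

Comparing this output against strings produced by the training pipeline on the same inputs is one way to catch serializer drift before it reaches the model.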

Repository layout

The repository contains two deployment surfaces:

hf_model/
  artifact_manifest.json
  config.json
  input_schema.json
  labels.json
  model.safetensors
  onnx_parity_report.json
  special_tokens_map.json
  spm.model
  thresholds.json
  tokenizer_config.json
  training_args.bin
  training_provenance.json

onnx/
  artifact_manifest.json
  config.json
  input_schema.json
  labels.json
  model.onnx
  model_quantized.onnx
  onnx_parity_report.json
  special_tokens_map.json
  spm.model
  thresholds.json
  tokenizer_config.json
  training_provenance.json

Use hf_model/ for Transformers/PyTorch inference and onnx/ for ONNX Runtime deployment.
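Before loading anything, a deployment script might confirm that a local copy of the repository has the files listed above. The sketch below is a minimal check under that assumption; the helper name missing_files and the idea of a local root directory are illustrative and not part of the artifact.

```python
from pathlib import Path

# File names copied from the onnx/ listing in this card.
ONNX_FILES = [
    "artifact_manifest.json", "config.json", "input_schema.json", "labels.json",
    "model.onnx", "model_quantized.onnx", "onnx_parity_report.json",
    "special_tokens_map.json", "spm.model", "thresholds.json",
    "tokenizer_config.json", "training_provenance.json",
]


def missing_files(root, subfolder, expected):
    """Return the expected file names that are absent under root/subfolder."""
    base = Path(root) / subfolder
    return [name for name in expected if not (base / name).is_file()]


# Usage: an empty list means the local onnx/ folder matches the documented layout.
# print(missing_files("/path/to/local/repo", "onnx", ONNX_FILES))
```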

Training data

The final-response verifier artifact was trained on a small, balanced fixture dataset:

| Split | Rows | Groups |
| --- | --- | --- |
| Train | 70 | 14 |
| Validation | 10 | 2 |
| Test | 10 | 2 |
| Total | 90 | 18 |

Label counts:

| Label | Rows |
| --- | --- |
| valid_final_response | 18 |
| missing_tool_fact | 18 |
| contradicts_tool_result | 18 |
| unsupported_claim | 18 |
| failed_to_acknowledge_data_gap | 18 |

The dataset is intentionally small and fixture-heavy. Treat all metrics as smoke-test metrics, not as evidence of production readiness.
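The split table reports groups as well as rows, which suggests that whole groups are assigned to a single split so that related rows never straddle train and test. The sketch below shows one way to do that. The field name group_id, the seed, and the five-rows-per-group shape are assumptions (90 rows over 18 groups is consistent with it, but the card does not state it), and this is not the training pipeline's actual splitting code.

```python
import random


def split_by_group(rows, n_val_groups=2, n_test_groups=2, seed=0):
    """Assign whole groups to train/validation/test so no group spans two splits."""
    groups = sorted({row["group_id"] for row in rows})
    random.Random(seed).shuffle(groups)  # deterministic shuffle for reproducibility
    test_groups = set(groups[:n_test_groups])
    val_groups = set(groups[n_test_groups:n_test_groups + n_val_groups])
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        if row["group_id"] in test_groups:
            splits["test"].append(row)
        elif row["group_id"] in val_groups:
            splits["validation"].append(row)
        else:
            splits["train"].append(row)
    return splits


# Hypothetical fixture shape: 18 groups, each contributing one row per label.
rows = [{"group_id": g, "label": label} for g in range(18) for label in range(5)]
splits = split_by_group(rows)
print({name: len(part) for name, part in splits.items()})
```

With that fixture shape the split sizes come out as 70, 10, and 10 rows, matching the table.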

Training configuration

| Field | Value |
| --- | --- |
| Base model | microsoft/deberta-v3-small |
| Run profile | high_vram_quality |
| Final-response max length | 768 |
| Configured epochs | 5 |
| Recorded training epoch | 3.0 |
| Train batch size | 16 |
| Eval batch size | 32 |
| Gradient accumulation | 4 |
| Max rows per label | 5000 |
| Force retrain | false |
| CPU-only ONNX export | true |
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition |
| GPU memory | 95.0 GB |
| Precision | bf16 and tf32 enabled, fp16 disabled |

Evaluation

Test metrics from the artifact provenance:

| Metric | Value |
| --- | --- |
| Eval loss | 1.6188628673553467 |
| Accuracy | 0.2 |
| Macro precision | 0.04 |
| Macro recall | 0.2 |
| Macro F1 | 0.06666666666666667 |
| Eval samples/s | 27.032 |

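Because the test split has only 10 examples and the dataset has five balanced labels, 0.20 accuracy is chance-level. The macro precision (0.04), macro recall (0.20), and macro F1 (0.0667) are also consistent with a classifier that predicts one label for every row, assuming the test split holds two examples per label. Do not promote this model to advisory or enforcement mode based on the current metrics.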
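The chance-level reading can be checked with a few lines of arithmetic. The sketch below scores a degenerate classifier that always predicts the same label on a balanced 10-row split. It assumes two test rows per label and scores undefined precision as 0; which label the real model favors is not stated in this card, so the choice of LABELS[0] is arbitrary.

```python
LABELS = [
    "valid_final_response",
    "missing_tool_fact",
    "contradicts_tool_result",
    "unsupported_claim",
    "failed_to_acknowledge_data_gap",
]


def macro_metrics(y_true, y_pred, labels):
    """Accuracy plus macro precision/recall/F1, scoring undefined ratios as 0."""
    precisions, recalls, f1s = [], [], []
    pairs = list(zip(y_true, y_pred))
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


# Assumed balanced test split: two rows per label, ten rows in total.
y_true = [label for label in LABELS for _ in range(2)]
# Degenerate classifier: the same label for every row.
y_pred = [LABELS[0]] * len(y_true)

metrics = macro_metrics(y_true, y_pred, LABELS)
print(metrics)
```

Under these assumptions the four values come out at 0.2, 0.04, 0.2, and roughly 0.0667, the same figures reported in the provenance.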

ONNX parity

The exported ONNX artifacts passed a small parity smoke check:

| Check | Value |
| --- | --- |
| Rows | 10 |
| PyTorch vs FP32 ONNX top-label agreement | 1.0 |
| PyTorch vs FP32 ONNX max absolute diff | 2.980232238769531e-07 |
| Quantized ONNX present | true |
| FP32 ONNX vs quantized ONNX top-label agreement | 1.0 |
| FP32 ONNX vs quantized ONNX disagreements | 0 |
| FP32 ONNX vs quantized ONNX max absolute diff | 0.017383113503456116 |

This only validates export parity on a tiny sample. It does not validate model quality.
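The quantities in the parity table can be computed from two sets of model outputs for the same rows. The sketch below is a plain-Python illustration, not the code that produced onnx_parity_report.json. The card does not say whether the reported differences are over logits or probabilities, so the function simply compares whatever score rows it receives, and the key names are hypothetical.

```python
def argmax(row):
    """Index of the largest score in one output row."""
    return max(range(len(row)), key=row.__getitem__)


def parity_report(reference, candidate):
    """Compare two lists of per-row score vectors from different runtimes."""
    if len(reference) != len(candidate):
        raise ValueError("both runtimes must score the same rows")
    agreements = sum(argmax(a) == argmax(b) for a, b in zip(reference, candidate))
    max_abs_diff = max(
        abs(x - y) for a, b in zip(reference, candidate) for x, y in zip(a, b)
    )
    rows = len(reference)
    return {
        "rows": rows,
        "top_label_agreement": agreements / rows,
        "disagreements": rows - agreements,
        "max_abs_diff": max_abs_diff,
    }


# Toy scores for two rows; real inputs would be PyTorch and ONNX outputs.
reference = [[0.1, 0.9], [2.0, -1.0]]
candidate = [[0.15, 0.85], [1.9, -1.0]]
print(parity_report(reference, candidate))
```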

Threshold policy

The included thresholds.json is shadow-first:

{
  "schema_version": "final-response-verifier-thresholds/v1",
  "mode": "shadow",
  "default_action": "allow"
}

Default label policy:

| Label | Action | Advisory threshold | Enforcement threshold |
| --- | --- | --- | --- |
| valid_final_response | allow | 0.0 | 1.01 |
| missing_tool_fact | advisory_then_enforce_after_eval | 0.90 | 0.995 |
| contradicts_tool_result | advisory_then_enforce_after_eval | 0.90 | 0.995 |
| unsupported_claim | advisory_then_enforce_after_eval | 0.90 | 0.995 |
| failed_to_acknowledge_data_gap | advisory_then_enforce_after_eval | 0.90 | 0.995 |

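This model card's recommendation is stricter than these threshold fields: keep the model in shadow mode until a larger held-out evaluation shows useful precision and recall. The 1.01 enforcement threshold for valid_final_response exceeds any possible probability, so that label can never trigger enforcement.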
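One possible reading of the threshold policy is sketched below. The decision logic, the mode names other than "shadow", and the returned field names are assumptions for illustration; thresholds.json defines only the values, and the card does not document how a caller is expected to combine them.

```python
NON_VALID = {"advisory": 0.90, "enforce": 0.995}

# Threshold values copied from the default label policy in this card.
POLICY = {
    "valid_final_response": {"advisory": 0.0, "enforce": 1.01},
    "missing_tool_fact": dict(NON_VALID),
    "contradicts_tool_result": dict(NON_VALID),
    "unsupported_claim": dict(NON_VALID),
    "failed_to_acknowledge_data_gap": dict(NON_VALID),
}


def decide(label, prob, mode="shadow"):
    """Return the action the thresholds suggest and the action actually applied."""
    policy = POLICY[label]
    if prob >= policy["enforce"]:
        candidate = "enforce"
    elif label != "valid_final_response" and prob >= policy["advisory"]:
        candidate = "advisory"
    else:
        candidate = "allow"

    if mode == "shadow":
        applied = "allow"  # shadow mode only records what would have happened
    elif mode == "advisory" and candidate == "enforce":
        applied = "advisory"  # hypothetical advisory mode never blocks
    else:
        applied = candidate
    return {"candidate_action": candidate, "applied_action": applied}


print(decide("unsupported_claim", 0.999))
```

In shadow mode the applied action is always allow, while the candidate action is still available for telemetry.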

Transformers usage

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo_id = "cowWhySo/final-response-verifier-classifier-production"
subfolder = "hf_model"

# use_fast=False is recommended for parity with the training/export code path.
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder, use_fast=False)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, subfolder=subfolder)
model.eval()

text = """SCHEMA_VERSION:
final-response-verifier-input/v1

USER_REQUEST:
Generate a sales report from the Q4 2024 dataset.

WORKFLOW_STATE:
required_steps=['fetch_sales_data', 'analyze_sales']
completed_steps=['fetch_sales_data', 'analyze_sales']
pending_steps=[]
terminal_tools=['report']
recent_errors=[]

REQUIRED_FACTS:
['23% YoY growth', 'Widget Pro', 'APAC']

TOOL_TRACE:
['fetch_sales_data', 'analyze_sales', 'report']

TOOL_RESULTS:
analyze_sales: "Revenue grew 23% YoY. Top product: Widget Pro. Weakest region: APAC."

CANDIDATE_FINAL_RESPONSE:
Revenue grew 23% YoY. Top product was Widget Pro, and APAC was the weakest region.

SCORING_METADATA:
scenario_family="sequential_3step"
requires_transform=false
requires_synthesis=false
requires_all_tool_facts=true
must_acknowledge_missing_data=false"""

inputs = tokenizer(
    [text],
    return_tensors="pt",
    truncation=True,
    max_length=768,
    padding=True,
)

with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]

id2label = model.config.id2label
for idx, score in sorted(enumerate(probs.tolist()), key=lambda x: x[1], reverse=True):
    print(id2label[idx], score)

ONNX Runtime usage

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

repo_id = "cowWhySo/final-response-verifier-classifier-production"

# Use model.onnx for FP32 or model_quantized.onnx for smaller CPU deployment.
onnx_path = hf_hub_download(repo_id, filename="onnx/model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="onnx", use_fast=False)

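session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

# `text` is the serialized input string built in the Transformers example above.
inputs = tokenizer([text], return_tensors="np", truncation=True, max_length=768, padding=True)

# Pass only the tensors the exported graph declares as inputs.
input_names = {item.name for item in session.get_inputs()}
ort_inputs = {key: value for key, value in inputs.items() if key in input_names}
logits = session.run(None, ort_inputs)[0]

# Numerically stable softmax: subtract the row maximum before exponentiating.
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
print(probs[0])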

Rust deployment notes

A Rust integration should load the following from onnx/:

  • model.onnx or model_quantized.onnx
  • tokenizer files: tokenizer_config.json, special_tokens_map.json, spm.model
  • labels.json
  • thresholds.json
  • artifact_manifest.json
  • input_schema.json
  • training_provenance.json
  • onnx_parity_report.json

Recommended integration sequence:

  1. Validate workflow state and deterministic guardrails first.
  2. Build the final-response scoring context from already-completed tool calls.
  3. Serialize with serialize_final_response_state_v1.
  4. Run the model in shadow mode.
  5. Log the predicted label, confidence, raw logits, model version, serializer version, and threshold decision.
  6. Do not block or rewrite responses until offline eval replay proves the model improves target scenarios without false objections on valid final responses.
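Step 5 of the sequence above can be sketched as a small function that turns raw logits into a log record. The field names, the version strings, and the label order in the usage example are hypothetical; a real integration would read the label order from labels.json and would most likely do this on the Rust side.

```python
import json
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_shadow_record(logits, labels, model_version, serializer_version,
                        threshold_decision):
    """Assemble the fields listed in step 5 into a JSON-serializable dict."""
    probs = softmax(logits)
    top = max(range(len(probs)), key=probs.__getitem__)
    return {
        "predicted_label": labels[top],
        "confidence": probs[top],
        "raw_logits": list(logits),
        "model_version": model_version,
        "serializer_version": serializer_version,
        "threshold_decision": threshold_decision,
    }


# Placeholder label order and version strings, for illustration only.
labels = ["valid_final_response", "missing_tool_fact", "contradicts_tool_result",
          "unsupported_claim", "failed_to_acknowledge_data_gap"]
record = build_shadow_record(
    [0.1, 2.0, -1.0, 0.0, 0.3], labels,
    model_version="example-model-version",
    serializer_version="serialize_final_response_state_v1",
    threshold_decision="allow",
)
print(json.dumps(record))
```

Logging raw logits alongside the confidence keeps the option of recalibrating probabilities later without re-running inference.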

Limitations

  • The training dataset has only 90 rows.
  • The test split has only 10 examples.
  • Current test accuracy is 0.20, which is chance-level for a balanced five-label task.
  • The examples are synthetic or fixture-like and do not represent broad real-world final-response behavior.
  • The model depends on the caller supplying accurate required_facts, tool_trace, and tool_results.
  • The model does not independently verify source truth, tool correctness, or external facts.
  • The model has not been validated for multilingual use, adversarial prompts, long multi-tool traces, or production traffic.

Recommended next steps

Before any advisory or enforcement rollout:

  1. Expand the dataset with real Forge/tool-workflow traces.
  2. Add hard negatives for subtle omissions, numeric drift, unsupported causal claims, and missing-data overclaims.
  3. Build a larger group-held-out test split with per-scenario metrics.
  4. Calibrate probabilities after training.
  5. Compare PyTorch, FP32 ONNX, and quantized ONNX on the same replay set.
  6. Track valid-final-response false objection rate as the primary promotion gate.
  7. Keep deterministic tool/workflow guardrails authoritative.
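The promotion gate in step 6 can be made concrete with a short metric function. The sketch below takes the false objection rate to be the share of truly valid final responses that the model labels as anything other than valid_final_response. That definition is an assumption, since the card names the metric without defining it.

```python
VALID = "valid_final_response"


def false_objection_rate(y_true, y_pred, valid_label=VALID):
    """Share of truly valid responses that the model flagged as a problem."""
    preds_on_valid = [p for t, p in zip(y_true, y_pred) if t == valid_label]
    if not preds_on_valid:
        return None  # undefined when the replay set has no valid examples
    objections = sum(p != valid_label for p in preds_on_valid)
    return objections / len(preds_on_valid)


# Toy replay set: four valid responses, one of them wrongly flagged.
y_true = [VALID, VALID, "unsupported_claim", VALID, VALID]
y_pred = [VALID, "missing_tool_fact", "unsupported_claim", VALID, VALID]
print(false_objection_rate(y_true, y_pred))
```

Computing the same rate separately for each scenario family would show whether objections cluster in particular workflows.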

Citation and provenance

This model was produced from the toolcall_verifier_training_production_colab_v4 workflow and uploaded to the Hugging Face repository cowWhySo/final-response-verifier-classifier-production. The artifact is marked deployment_default: shadow because it is an experimental final-response verifier that should be promoted only after eval replay.
