Kyzlo-4b

Kyzlo-4b is a Qwen/Qwen3.5-4B finetune (QLoRA SFT, with the adapter merged into the base weights) tuned for search and tool-use scout behavior.

This update replaces the earlier public preview with a merged full model that includes a small, targeted format/schema repair SFT patch. It is still a preview: the results below come from a partial local BFCL-harness run, not the official BFCL leaderboard, and the private ScoutSearch eval is still pending.

Model Status

  • Artifact type: merged full model
  • Base model: Qwen/Qwen3.5-4B
  • Initial training: QLoRA SFT, then adapter merge
  • Frozen initial SFT adapter: artifacts/adapters/sft-v0
  • Repair patch: targeted format/schema QLoRA SFT
  • Repair dataset: 240 synthetic leakage-checked examples
  • Repair merged artifact path before upload: artifacts/merged-format-repair-sft-v1
  • Release status: public preview with partial local-harness results

What Improved

The repair patch improved accuracy on the same partial BFCL local-harness slice (Qwen-FC mode) used during development:

Model                       | Slice                                  | Accuracy
Qwen/Qwen3.5-4B base        | first 100 simple_python, Qwen-FC mode  | 87/100
previous merged SFT preview | first 100 simple_python, Qwen-FC mode  | 90/100
repaired merged model       | first 100 simple_python, Qwen-FC mode  | 92/100

Delta:

  • +2/100 over the previous merged SFT preview
  • +5/100 over the base model

Important: this is a custom local-harness partial run using bfcl-eval==2025.12.17 with a runtime-added local model entry. It must not be read as an official BFCL leaderboard result.

Smoke Eval Caveat

The repaired model still needs output-control work before a final release claim.

On the 8-prompt ScoutSearch smoke set:

  • Tool-call signal outputs: 8/8
  • XML <tool_call> outputs: 8/8
  • Single XML <tool_call> outputs: 7/8
  • Outputs parseable as JSON by the older smoke parser: 1/8
  • Outputs with fake continuation/tool-response-like text: 2/8

This means the Qwen-FC path improved, but the ScoutSearch runtime still needs an XML-aware scorer and/or stricter stop conditions.
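An XML-aware scorer would recover the JSON payload from inside the `<tool_call>` wrapper before parsing, instead of feeding the whole output to a JSON parser. A minimal sketch (the tag name matches the outputs described above; the helper name is illustrative):

```python
import json
import re

# Qwen-style outputs wrap a JSON object in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every parseable JSON tool call found in the model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # count as unparseable rather than crashing the scorer
    return calls

raw = '<tool_call>\n{"name": "web_search", "arguments": {"query": "BFCL"}}\n</tool_call>'
print(extract_tool_calls(raw))
```

A scorer built this way would have counted the 8/8 XML outputs above as candidate tool calls rather than the 1/8 that happened to survive the older JSON-only parser.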

Intended Use

This model is intended for experiments with a narrow search/tool-use scout model. The target behavior is:

  • decide when a prompt needs search or external evidence
  • emit tool calls for source-grounded lookup workflows
  • reduce unsupported citations and fabricated sources
  • support a Queen+Scout style agent architecture where a scout gathers current evidence
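For experiments along these lines, a scout-facing search tool can be described with a JSON-schema function definition of the kind chat templates with tool support accept. The tool name and parameters below are illustrative, not part of the released model:

```python
# Hypothetical search tool definition; the name and fields are illustrative.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Look up current, source-grounded evidence on the web.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
                "max_results": {"type": "integer", "description": "Result cap."},
            },
            "required": ["query"],
        },
    },
}

# Definitions like this are typically passed to the chat template, e.g.
# tokenizer.apply_chat_template(messages, tools=[web_search_tool], ...)
print(web_search_tool["function"]["name"])
```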

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clarkkitchen22/Kyzlo-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place weights across available devices
    trust_remote_code=True,
)
model.eval()  # inference mode

Limitations

  • Not an official BFCL result.
  • Private ScoutSearch eval is still pending.
  • Full benchmark contamination checks against the original training data are still pending.
  • May emit tool calls in XML format rather than JSON depending on prompt/template path.
  • May continue into fake tool-response-like text without appropriate stop conditions.
  • Requires external retrieval for current or source-grounded facts.
  • Should not be used as a final authority for medical, legal, financial, or other high-stakes advice.
  • Training data provenance and license notes should be finalized before a full release statement.
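Until stricter stop conditions are wired into the runtime, the fake-continuation failure mode noted above can be mitigated post hoc by truncating generations at the first closing tag. A minimal sketch (function name is illustrative):

```python
def truncate_at_tool_call(text: str, end_tag: str = "</tool_call>") -> str:
    """Drop everything after the first closing tool-call tag, including any
    fake tool-response-like text the model appends afterwards."""
    idx = text.find(end_tag)
    if idx == -1:
        return text  # no tool call emitted; leave the output untouched
    return text[: idx + len(end_tag)]

raw = (
    '<tool_call>{"name": "web_search"}</tool_call>\n'
    "<tool_response>fabricated result</tool_response>"
)
print(truncate_at_tool_call(raw))
```

If the installed transformers version supports it, passing `stop_strings=["</tool_call>"]` (together with `tokenizer=`) to `model.generate` stops at decode time instead of trimming afterwards.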

Credits

  • Base model: Qwen/Qwen3.5-4B
  • Training data included a filtered subset of nvidia/Nemotron-SFT-Agentic-v2 plus custom synthetic ScoutSearch examples.