Kyzlo-4b

Kyzlo-4b is a Qwen/Qwen3.5-4B finetune (QLoRA SFT, with the adapter merged into the base weights) tuned for search and tool-use scout behavior.

This update replaces the earlier public preview with a merged full model that includes a small, targeted format/schema repair SFT patch. It is still a preview: the results below come from a partial local BFCL-harness run, not the official BFCL leaderboard, and the private ScoutSearch eval is still pending.

Model Status

  • Artifact type: merged full model
  • Base model: Qwen/Qwen3.5-4B
  • Initial training: QLoRA SFT, then adapter merge
  • Frozen initial SFT adapter: artifacts/adapters/sft-v0
  • Repair patch: targeted format/schema QLoRA SFT
  • Repair dataset: 240 synthetic leakage-checked examples
  • Repair merged artifact path before upload: artifacts/merged-format-repair-sft-v1
  • Release status: public preview with partial local-harness results

What Improved

The repair patch improved accuracy on the same partial BFCL local-harness slice (Qwen-FC mode) used during development:

Model                       | Slice                                  | Accuracy
Qwen/Qwen3.5-4B base        | first 100 simple_python, Qwen-FC mode  | 87/100
previous merged SFT preview | first 100 simple_python, Qwen-FC mode  | 90/100
repaired merged model       | first 100 simple_python, Qwen-FC mode  | 92/100

Delta:

  • +2/100 over the previous merged SFT preview
  • +5/100 over the base model

Important: this is a custom local-harness partial run using bfcl-eval==2025.12.17 with a runtime-added local model entry. It must not be read as an official BFCL leaderboard result.

Smoke Eval Caveat

The repaired model still needs output-control work before a final release claim.

On the 8-prompt ScoutSearch smoke set:

  • Tool-call signal outputs: 8/8
  • XML <tool_call> outputs: 8/8
  • Single XML <tool_call> outputs: 7/8
  • Outputs parseable as JSON by the older smoke parser: 1/8
  • Outputs with fake continuation/tool-response-like text: 2/8

This means the Qwen-FC path improved, but the ScoutSearch runtime still needs an XML-aware scorer and/or stricter stop conditions.
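An XML-aware scorer would recover the JSON payload from inside the `<tool_call>` wrapper before parsing, instead of feeding the whole output to a JSON parser. A minimal sketch (the tag name matches the outputs described above; the helper name is illustrative):

```python
import json
import re

# Qwen-style outputs wrap a JSON object in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every parseable JSON tool call found in the model output."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(payload))
        except json.JSONDecodeError:
            continue  # count as unparseable rather than crashing the scorer
    return calls

raw = '<tool_call>\n{"name": "web_search", "arguments": {"query": "BFCL"}}\n</tool_call>'
print(extract_tool_calls(raw))
```

A scorer built this way would have counted the 8/8 XML outputs above as candidate tool calls rather than the 1/8 that happened to survive the older JSON-only parser.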

Intended Use

This model is intended for experiments with a narrow search/tool-use scout model. The target behavior is:

  • decide when a prompt needs search or external evidence
  • emit tool calls for source-grounded lookup workflows
  • reduce unsupported citations and fabricated sources
  • support a Queen+Scout style agent architecture where a scout gathers current evidence
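For experiments along these lines, a scout-facing search tool can be described with a JSON-schema function definition of the kind chat templates with tool support accept. The tool name and parameters below are illustrative, not part of the released model:

```python
# Hypothetical search tool definition; the name and fields are illustrative.
web_search_tool = {
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Look up current, source-grounded evidence on the web.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
                "max_results": {"type": "integer", "description": "Result cap."},
            },
            "required": ["query"],
        },
    },
}

# Definitions like this are typically passed to the chat template, e.g.
# tokenizer.apply_chat_template(messages, tools=[web_search_tool], ...)
print(web_search_tool["function"]["name"])
```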

Loading

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clarkkitchen22/Kyzlo-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # place weights across available devices
    trust_remote_code=True,
)
model.eval()  # inference mode

Limitations

  • Not an official BFCL result.
  • Private ScoutSearch eval is still pending.
  • Full benchmark contamination checks against the original training data are still pending.
  • May emit tool calls in XML format rather than JSON depending on prompt/template path.
  • May continue into fake tool-response-like text without appropriate stop conditions.
  • Requires external retrieval for current or source-grounded facts.
  • Should not be used as a final authority for medical, legal, financial, or other high-stakes advice.
  • Training data provenance and license notes should be finalized before a full release statement.
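Until stricter stop conditions are wired into the runtime, the fake-continuation failure mode noted above can be mitigated post hoc by truncating generations at the first closing tag. A minimal sketch (function name is illustrative):

```python
def truncate_at_tool_call(text: str, end_tag: str = "</tool_call>") -> str:
    """Drop everything after the first closing tool-call tag, including any
    fake tool-response-like text the model appends afterwards."""
    idx = text.find(end_tag)
    if idx == -1:
        return text  # no tool call emitted; leave the output untouched
    return text[: idx + len(end_tag)]

raw = (
    '<tool_call>{"name": "web_search"}</tool_call>\n'
    "<tool_response>fabricated result</tool_response>"
)
print(truncate_at_tool_call(raw))
```

If the installed transformers version supports it, passing `stop_strings=["</tool_call>"]` (together with `tokenizer=`) to `model.generate` stops at decode time instead of trimming afterwards.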

Credits

  • Base model: Qwen/Qwen3.5-4B
  • Training data included a filtered subset of nvidia/Nemotron-SFT-Agentic-v2 plus custom synthetic ScoutSearch examples.