# Kyzlo-4b
Kyzlo-4b is a merged Qwen/Qwen3.5-4B model tuned for search and tool-use scout behavior.
This update replaces the earlier public preview with a merged full model that includes a small, targeted format/schema repair SFT patch. It is still a preview: the result below is a partial local BFCL-harness measurement, not an official BFCL leaderboard score, and the private ScoutSearch eval is still pending.
## Model Status

- Artifact type: merged full model
- Base model: `Qwen/Qwen3.5-4B`
- Initial training: QLoRA SFT, then adapter merge
- Frozen initial SFT adapter: `artifacts/adapters/sft-v0`
- Repair patch: targeted format/schema QLoRA SFT
- Repair dataset: 240 synthetic leakage-checked examples
- Repair merged artifact path before upload: `artifacts/merged-format-repair-sft-v1`
- Release status: public preview with partial local-harness results
## What Improved

The repair patch improved accuracy on the same partial Qwen-FC BFCL local-harness slice used during development:
| Model | Slice | Accuracy |
|---|---|---|
| `Qwen/Qwen3.5-4B` base | first 100 simple_python, Qwen-FC mode | 87/100 |
| previous merged SFT preview | first 100 simple_python, Qwen-FC mode | 90/100 |
| repaired merged model | first 100 simple_python, Qwen-FC mode | 92/100 |
Delta: +2/100 over the previous merged SFT preview; +5/100 over the base model.

Important: this is a custom local-harness partial run using `bfcl-eval==2025.12.17` with a runtime-added local model entry. It must not be read as an official BFCL leaderboard result.
## Smoke Eval Caveat
The repaired model still needs output-control work before a final release claim.
On the 8-prompt ScoutSearch smoke set:
- Tool-call signal outputs: 8/8
- XML `<tool_call>` outputs: 8/8
- Single XML `<tool_call>` outputs: 7/8
- Outputs parseable as JSON by the older smoke parser: 1/8
- Outputs with fake continuation/tool-response-like text: 2/8
This means the Qwen-FC path improved, but the ScoutSearch runtime still needs an XML-aware scorer and/or stricter stop conditions.
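The XML-aware scoring described above can be sketched as a small parser that extracts the first `<tool_call>` block and truncates everything after it. This is only a sketch under one assumption: that the model wraps a single JSON object in Qwen-style `<tool_call>...</tool_call>` tags (the exact payload format is not specified in this card).

```python
import json
import re

# Matches the first Qwen-style <tool_call> block. The JSON-object payload
# inside the tags is an assumption of this sketch.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_first_tool_call(text: str):
    """Return (call_dict, truncated_text) for the first <tool_call> block,
    or (None, text) if no parseable block is found."""
    match = TOOL_CALL_RE.search(text)
    if match is None:
        return None, text
    try:
        call = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None, text
    # Cut at the closing tag so any fake tool-response-like continuation
    # after the call is dropped.
    return call, text[: match.end()]
```

Truncating at the closing tag doubles as a crude stop condition, discarding the fake tool-response-like continuations seen in 2/8 smoke outputs.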
## Intended Use
This model is intended for experiments with a narrow search/tool-use scout model. The target behavior is:
- decide when a prompt needs search or external evidence
- emit tool calls for source-grounded lookup workflows
- reduce unsupported citations and fabricated sources
- support a Queen+Scout style agent architecture where a scout gathers current evidence
## Loading

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "clarkkitchen22/Kyzlo-4b"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)
model.eval()
```
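To exercise the tool-use behavior, the model needs tool schemas in the prompt. A minimal sketch, assuming Qwen-family chat templates accept OpenAI-style function definitions (the `web_search` tool and all of its fields below are illustrative, not part of this release):

```python
# Illustrative OpenAI-style tool schema; the `web_search` name, description,
# and parameters are hypothetical examples, not shipped with the model.
tools = [
    {
        "type": "function",
        "function": {
            "name": "web_search",
            "description": "Search the web and return source-grounded snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query."}
                },
                "required": ["query"],
            },
        },
    }
]

messages = [{"role": "user", "content": "Who won the 2024 Tour de France?"}]

# With a recent transformers release, the schema can be passed through the
# chat template before generation, e.g.:
#   prompt = tokenizer.apply_chat_template(
#       messages, tools=tools, add_generation_prompt=True, tokenize=False
#   )
```

Whether the resulting tool call comes back as XML or JSON depends on the prompt/template path, per the limitations below.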
## Limitations
- Not an official BFCL result.
- Private ScoutSearch eval is still pending.
- Full benchmark contamination checks against the original training data are still pending.
- May emit tool calls in XML format rather than JSON depending on prompt/template path.
- May continue into fake tool-response-like text without appropriate stop conditions.
- Requires external retrieval for current or source-grounded facts.
- Should not be used as a final authority for medical, legal, financial, or other high-stakes advice.
- Training data provenance and license notes should be finalized before a full release statement.
## Credits

- Base model: `Qwen/Qwen3.5-4B`
- Training data included a filtered subset of `nvidia/Nemotron-SFT-Agentic-v2` plus custom synthetic ScoutSearch examples.