File size: 13,844 Bytes
e6d643c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
4b63a1c
 
e6d643c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d39c40a
e6d643c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
---
language:
- tr
- en
license: apache-2.0
library_name: transformers
base_model: AlicanKiraz0/Kara-Kumru-v1.0-2B
pipeline_tag: text-generation
tags:
- turkish
- tool-calling
- function-calling
- hermes
- kara-kumru
- mistral
- gguf
---

# Roka — Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B

Roka is a supervised fine-tune of `AlicanKiraz0/Kara-Kumru-v1.0-2B` that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style `<tool_call>…</tool_call>` output format.

This is a **v0.2 research preview**, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see *Limitations*).

The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.

## Model at a glance

| | |
|---|---|
| **Base model** | `AlicanKiraz0/Kara-Kumru-v1.0-2B` (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| **Upstream base** | `vngrs-ai/Kumru-2B` |
| **Parameters** | ~2.15B |
| **Fine-tuning** | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| **Hardware** | Single NVIDIA A6000 (~65 min / epoch ~22 min) |
| **Languages** | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| **License** | Apache 2.0 (inherited from base chain) |

## Tool set

| Tool | Description |
|---|---|
| `web_search` | Internet search (DuckDuckGo) |
| `calculator` | Arithmetic expression evaluator |
| `datetime` | Date/time and calendar arithmetic (9 actions: `today`, `now`, `day_of_week`, `add_days`, `date_diff`, `days_until`, `day_of_year`, `end_of_month`, `days_until_weekday`) |
| `hava_durumu` | Weather query by city name |
| `sayfa_oku` | URL content reader |

The model is trained to emit tool calls as:

```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```

Tool results are fed back to the model wrapped in `<tool_response>…</tool_response>` inside a user turn, and the model synthesizes a final Turkish answer.

## Evaluation

The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (`scripts/rescore_aligned.py`) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.

### Overall results (Roka v0.2, April 2026)

| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| **All test (held-out)** | 260 | **73.5%** | 93.1% | 71.9% | 60.6% |

Every test query was verified to be absent from both `data/train.jsonl` and `data/val.jsonl`, so the 73.5% number above is a genuinely held-out measurement. See *Decontamination history* below for why this is lower than an earlier, un-decontaminated run.

### Per-subcategory results

| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | **93.3%** |
| simple/weather | 20 | **100.0%** |
| simple/url_reader | 15 | **100.0%** |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | **80.0%** |
| multiple | 45 | 64.4% |
| parallel | 15 | **0.0%** |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | **100.0%** |
| irrelevance/identity | 10 | **100.0%** |
| irrelevance/opinion | 10 | **100.0%** |

**Parallel tool calls score 0% because the training mix does not contain parallel-call examples.** This is a known gap, not a reproducibility failure.

### Decontamination history

During preparation for this release we audited the training set and found that **44 of the 260 test queries appeared verbatim in train/val** (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, and the rest in irrelevance/identity and irrelevance/greeting). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining is the model reported above.

For transparency we also report the before-and-after numbers on the 216 test queries that were **not** affected by the decontamination (i.e., the genuinely held-out subset from the *pre-cleanup* model's perspective):

| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| **v0.2** (released) | decontaminated | **73.6%** |

The ~4.6-point drop is informative: it is *not* contamination-inflation. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. The cost of honest decontamination was larger than the narrow definition of "memorization gain" would predict. We report the post-decontamination number because it is the only one that is defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.

## Development journey (brief)

Arriving at the final model required an honest amount of dead-ends.

1. **Baseline (Run 10)** — 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
2. **Phase A v1–v4 collapse** — four consecutive training runs where loss converged to near-zero but test-set Full-Match stayed at 0/260. All of them passed `loss` sanity checks, so the failure was invisible from inside the run.
3. **Root cause** — TRL issue [#3910](https://github.com/huggingface/trl/issues/3910): the `max_seq_length` argument was silently renamed to `max_length` (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was being truncated before it contributed to the loss. The model trained to completion on fragments, not on full tool-calling traces. Fix: pass `max_length=4096` explicitly.
4. **Data iterations**
   - Removed the `unit` argument from all `hava_durumu` training examples (the test set does not supply it). `simple/weather` Full-Match rose from 10% to 100%.
   - Added 45 supplementary `datetime` examples covering `day_of_year`, `end_of_month`, and `days_until_weekday` — test actions that were absent from the R10 training data.
   - Those supplementary examples caused a regression on `day_of_week` queries ("23 Nisan hangi güne denk geliyor?" was mis-routed to `day_of_year`). A targeted set of 30 `day_of_week` contrast examples fixed it.
5. **Final v0.1 model** — 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
6. **v0.2 — decontamination** — 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination — see *Decontamination history*.

Total compute used across Phase A and v0.2: ~5 A6000-hours.

## Limitations

- **Multi-turn pattern lock-in.** The SFT mix contains very few multi-turn tool-calling sequences. If the user starts with a chit-chat turn ("selam"), the model tends to stay in plain-chat mode on subsequent turns and skip the tool call. The provided `scripts/serve_ui.py` works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
- **Parallel tool calls: 0%.** Not trained.
- **`hava_durumu` has no temporal parameter.** Queries like "yarın İstanbul'da hava" still produce `{"city": "İstanbul"}` because that is what the schema allows. The fix is a schema change + data regeneration, not a prompt change.
- **Adversarial/ambiguous: 40%.** The model is easily nudged off-task by ambiguous phrasing.
- **Long-passage synthesis is brittle.** When `sayfa_oku` returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
- **Hermes parser coupling.** Native OpenAI-style `tool_calls` parsing via `llama-server` requires the provided `training/roka_tool_template.jinja` chat template and requires the client to pass the full list of 5 tools. Passing a subset confuses llama.cpp's Hermes detector.
- **Scoring discrepancy.** The in-training `training/eval.py` scorer disagrees slightly with the alignment-aware rescorer. Only the rescored numbers are reported above. Resolving the discrepancy is open work.

## Training data

- **4,778 train / 509 validation** examples, Hermes-format chat turns.
- **~72% Turkish, ~13% English, ~15% short/symbolic.** The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
- **Deterministic generators** for `calculator`, `datetime`, `hava_durumu` (in `training/generators/`).
- **Real DuckDuckGo search results** cached in `data/ddg_cache.json` and used to construct `web_search` fullflow examples.
- **PII scan**: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses found.

## Contamination verification

The released v0.2 model is trained on a split where **no test query appears verbatim** in either train or validation. The decontamination script (`scripts/decontaminate.py`) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:

| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| adversarial/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |

Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.

This decontamination is **exact-string**, not fuzzy. Near-duplicates (paraphrases that return the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.

## Repository layout

```
src/                    Inference clients (transformers & llama-server)
training/
  tools.py              Tool schemas + training system prompt
  train.py              TRL SFTTrainer entry point
  eval.py               Test-set scorer (in-training)
  roka_tool_template.jinja   llama-server chat template with Hermes detection hook
  generators/           Deterministic data generators per tool
scripts/
  work_pipeline.py      End-to-end pod orchestration
  pod_run_and_dump.py   On-pod training → prediction dump → HF upload
  rescore_aligned.py    Alignment-aware rescorer (authoritative numbers)
  serve_ui.py           FastAPI chat UI wrapping the agent
data/
  train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/   Spec, plan, and task list for this iteration
```

Github: (https://github.com/bilersan/roka)

## Reproducibility

1. Clone the repo and install requirements:
   ```bash
   pip install -r requirements.txt
   ```
2. Regenerate the training set (deterministic):
   ```bash
   python -m training.build_dataset
   ```
3. Train (RunPod-hosted, ~1 GPU-hour on an A6000):
   ```bash
   python -m scripts.work_pipeline
   ```
4. Rescore predictions with the alignment-aware harness:
   ```bash
   python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json
   ```

The training recipe is fully specified in `training/config.yaml`. The only hyperparameter that is unusually specific is `max_length: 4096` in `training/train.py` — removing it reproduces the Phase A v1–v4 collapse described above.

## Intended use and out-of-scope use

**Intended**: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.

**Out of scope**:
- Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
- Parallel / agentic planning over large tool catalogs.
- Multi-turn conversational agents that need to preserve long prior context.
- Any application that requires the model to use tools not present in the training schema.

## License

This repository and the released weights are distributed under the **Apache License 2.0**, inherited from both `AlicanKiraz0/Kara-Kumru-v1.0-2B` and its upstream base `vngrs-ai/Kumru-2B`. See `LICENSE`.

## Citation

If you use Roka in research, please cite both the base model and this work:

```bibtex
@misc{roka_2026,
  title  = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
  author = {Bilik, Ersan},
  year   = {2026},
  url    = {https://huggingface.co/ersanbil/roka}
}

@misc{karakumru_2025,
  title  = {Kara-Kumru-v1.0-2B},
  author = {Kiraz, Alican},
  year   = {2025},
  url    = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```

## Acknowledgements

- **vngrs-ai** for the open Turkish base model `Kumru-2B`.
- **Alican Kiraz** for the Turkish-conversational fine-tune `Kara-Kumru-v1.0-2B`.
- **Hugging Face TRL / Unsloth** for the training stack.
- **Glaive-AI function-calling dataset** for the English portion of the multi-tool synthetic mix.

## Contact and feedback

Issues and pull requests are welcome on the GitHub mirror. This is a research preview — please file bugs for any behavior that contradicts the documented limitations above; those are the interesting cases.