---
language:
- tr
- en
license: apache-2.0
library_name: transformers
base_model: AlicanKiraz0/Kara-Kumru-v1.0-2B
pipeline_tag: text-generation
tags:
- turkish
- tool-calling
- function-calling
- hermes
- kara-kumru
- mistral
- gguf
---
# Roka — Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B
Roka is a supervised fine-tune of `AlicanKiraz0/Kara-Kumru-v1.0-2B` that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style `<tool_call>…</tool_call>` output format.
This is a **v0.2 research preview**, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see *Limitations*).
The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.
## Model at a glance
| | |
|---|---|
| **Base model** | `AlicanKiraz0/Kara-Kumru-v1.0-2B` (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| **Upstream base** | `vngrs-ai/Kumru-2B` |
| **Parameters** | ~2.15B |
| **Fine-tuning** | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| **Hardware** | Single NVIDIA A6000 (~65 min total, ~22 min per epoch) |
| **Languages** | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| **License** | Apache 2.0 (inherited from base chain) |
## Tool set
| Tool | Description |
|---|---|
| `web_search` | Internet search (DuckDuckGo) |
| `calculator` | Arithmetic expression evaluator |
| `datetime` | Date/time and calendar arithmetic (9 actions: `today`, `now`, `day_of_week`, `add_days`, `date_diff`, `days_until`, `day_of_year`, `end_of_month`, `days_until_weekday`) |
| `hava_durumu` | Weather query by city name |
| `sayfa_oku` | URL content reader |
The model is trained to emit tool calls as:
```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```
Tool results are fed back to the model wrapped in `<tool_response>…</tool_response>` inside a user turn, and the model synthesizes a final Turkish answer.
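As a minimal sketch of the client side of this protocol, the snippet below extracts `<tool_call>` blocks from generated text and wraps a tool result as a user turn. The helper names `parse_tool_calls` and `wrap_tool_response` are hypothetical, and serializing the result as JSON inside `<tool_response>` is an assumption; the repository's own inference clients in `src/` may do this differently.

```python
import json
import re

# Matches the body between <tool_call> and </tool_call>, across newlines.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of raising
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {"name": payload["name"], "arguments": payload.get("arguments", {})}
            )
    return calls


def wrap_tool_response(result) -> dict:
    """Wrap a tool result as a user turn (JSON body is an assumption)."""
    body = json.dumps(result, ensure_ascii=False)
    return {"role": "user", "content": f"<tool_response>\n{body}\n</tool_response>"}
```

An empty list from `parse_tool_calls` can be treated as "the model answered directly", which is the expected behavior for the irrelevance categories.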
## Evaluation
The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (`scripts/rescore_aligned.py`) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.
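To illustrate what "semantically equivalent arithmetic expressions" could mean in practice, here is a hedged sketch that compares two calculator arguments by value rather than by string. It is not the logic of `scripts/rescore_aligned.py`; the function names and the restriction to `+ - * / %` are assumptions made for the example.

```python
import ast
import math
import operator

# Only plain arithmetic is allowed; anything else is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_eval(expr: str) -> float:
    """Evaluate a numeric expression without calling eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


def expressions_match(predicted: str, expected: str) -> bool:
    """True if both expressions evaluate to (nearly) the same number."""
    try:
        return math.isclose(safe_eval(predicted), safe_eval(expected), rel_tol=1e-9)
    except (ValueError, SyntaxError, ZeroDivisionError, OverflowError):
        # Fall back to a literal comparison when either side cannot be evaluated.
        return predicted.strip() == expected.strip()
```

With this kind of check, `15 * 4` and `4*15` count as the same argument, while `2+3` and `6` do not.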
### Overall results (Roka v0.2, April 2026)
| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| **All test (held-out)** | 260 | **73.5%** | 93.1% | 71.9% | 60.6% |
Every test query was verified to be absent from both `data/train.jsonl` and `data/val.jsonl`, so the 73.5% number above is a genuinely held-out measurement. See *Decontamination history* below for why this is lower than an earlier, un-decontaminated run.
### Per-subcategory results
| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | **93.3%** |
| simple/weather | 20 | **100.0%** |
| simple/url_reader | 15 | **100.0%** |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | **80.0%** |
| multiple | 45 | 64.4% |
| parallel | 15 | **0.0%** |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | **100.0%** |
| irrelevance/identity | 10 | **100.0%** |
| irrelevance/opinion | 10 | **100.0%** |
**Parallel tool calls score 0% because the training mix does not contain parallel-call examples.** This is a known gap, not a reproducibility failure.
### Decontamination history
During preparation for this release we audited the training set and found that **44 of the 260 test queries appeared verbatim in train/val** (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, 19 in irrelevance/identity and irrelevance/greeting, and 4 spread across other subcategories). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining is the model reported above.
For transparency we also report the before-and-after numbers on the 216 test queries that were **not** affected by the decontamination (i.e., the genuinely held-out subset from the *pre-cleanup* model's perspective):
| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| **v0.2** (released) | decontaminated | **73.6%** |
The ~4.6-point drop is informative. It cannot be contamination inflation, because none of these 216 queries ever appeared in the training data. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. Honest decontamination therefore cost more than a narrow "memorization gain" estimate would predict. We report the post-decontamination number because it is the only one that is defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.
## Development journey (brief)
Arriving at the final model involved a fair number of dead ends.
1. **Baseline (Run 10)** — 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
2. **Phase A v1–v4 collapse** — four consecutive training runs where loss converged to near-zero but test-set Full-Match stayed at 0/260. All of them passed `loss` sanity checks, so the failure was invisible from inside the run.
3. **Root cause** — TRL issue [#3910](https://github.com/huggingface/trl/issues/3910): the `max_seq_length` argument was silently renamed to `max_length` (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was being truncated before it contributed to the loss. The model trained to completion on fragments, not on full tool-calling traces. Fix: pass `max_length=4096` explicitly.
4. **Data iterations**
- Removed the `unit` argument from all `hava_durumu` training examples (the test set does not supply it). `simple/weather` Full-Match rose from 10% to 100%.
- Added 45 supplementary `datetime` examples covering `day_of_year`, `end_of_month`, and `days_until_weekday` — test actions that were absent from the R10 training data.
- Those supplementary examples caused a regression on `day_of_week` queries ("23 Nisan hangi güne denk geliyor?" was mis-routed to `day_of_year`). A targeted set of 30 `day_of_week` contrast examples fixed it.
5. **Final v0.1 model** — 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
6. **v0.2 — decontamination** — 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination — see *Decontamination history*.
Total compute used across Phase A and v0.2: ~5 A6000-hours.
## Limitations
- **Multi-turn pattern lock-in.** The SFT mix contains very few multi-turn tool-calling sequences. If the user starts with a chit-chat turn ("selam"), the model tends to stay in plain-chat mode on subsequent turns and skip the tool call. The provided `scripts/serve_ui.py` works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
- **Parallel tool calls: 0%.** Not trained.
- **`hava_durumu` has no temporal parameter.** Queries like "yarın İstanbul'da hava" still produce `{"city": "İstanbul"}` because that is what the schema allows. The fix is a schema change + data regeneration, not a prompt change.
- **Adversarial/ambiguous: 26.7%.** The model is easily nudged off-task by ambiguous phrasing.
- **Long-passage synthesis is brittle.** When `sayfa_oku` returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
- **Hermes parser coupling.** Native OpenAI-style `tool_calls` parsing via `llama-server` requires the provided `training/roka_tool_template.jinja` chat template and requires the client to pass the full list of 5 tools. Passing a subset confuses llama.cpp's Hermes detector.
- **Scoring discrepancy.** The in-training `training/eval.py` scorer disagrees slightly with the alignment-aware rescorer. Only the rescored numbers are reported above. Resolving the discrepancy is open work.
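The workaround for multi-turn pattern lock-in described above can be sketched as follows. This is not the code in `scripts/serve_ui.py`; the function name `tool_decision_messages` and the presence of a system prompt in the tool-decision call are assumptions for illustration.

```python
def tool_decision_messages(history: list[dict], system_prompt: str) -> list[dict]:
    """Build the prompt for the tool-decision step from the latest user turn only.

    Earlier turns (for example a greeting such as "selam") are dropped so that
    they cannot pull the model into plain-chat mode.
    """
    for message in reversed(history):
        if message["role"] == "user":
            return [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": message["content"]},
            ]
    raise ValueError("history contains no user message")
```

The full history can still be shown in the UI; only the input to the tool-decision step is trimmed.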
## Training data
- **4,702 train / 503 validation** examples in the released v0.2 split (4,778 / 509 before decontamination), Hermes-format chat turns.
- **~72% Turkish, ~13% English, ~15% short/symbolic.** The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
- **Deterministic generators** for `calculator`, `datetime`, `hava_durumu` (in `training/generators/`).
- **Real DuckDuckGo search results** cached in `data/ddg_cache.json` and used to construct `web_search` fullflow examples.
- **PII scan**: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses found.
## Contamination verification
The released v0.2 model is trained on a split where **no test query appears verbatim** in either train or validation. The decontamination script (`scripts/decontaminate.py`) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:
| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| irrelevance/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |
Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.
This decontamination is **exact-string**, not fuzzy. Near-duplicates (paraphrases that return the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.
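The matching rule described in this section (normalize whitespace and case, then compare the first user turn against the test queries) can be sketched as below. This is an illustration rather than the contents of `scripts/decontaminate.py`; the `messages` / `role` / `content` record layout and the function names are assumptions.

```python
def normalize(text: str) -> str:
    """Collapse whitespace runs and fold case before comparison."""
    return " ".join(text.split()).casefold()


def first_user_turn(example: dict) -> str:
    """Return the content of the first user message, or '' if there is none."""
    for message in example["messages"]:
        if message["role"] == "user":
            return message["content"]
    return ""


def decontaminate(examples: list[dict], test_queries: list[str]):
    """Split examples into (kept, removed) by exact match after normalization."""
    blocked = {normalize(query) for query in test_queries}
    kept, removed = [], []
    for example in examples:
        if normalize(first_user_turn(example)) in blocked:
            removed.append(example)
        else:
            kept.append(example)
    return kept, removed
```

Because every augmentation variant of a base query shares the same first user turn, a single blocked query can remove several examples, which is consistent with 44 queries mapping to 76 train and 6 val removals.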
## Repository layout
```
src/ Inference clients (transformers & llama-server)
training/
tools.py Tool schemas + training system prompt
train.py TRL SFTTrainer entry point
eval.py Test-set scorer (in-training)
roka_tool_template.jinja llama-server chat template with Hermes detection hook
generators/ Deterministic data generators per tool
scripts/
work_pipeline.py End-to-end pod orchestration
pod_run_and_dump.py On-pod training → prediction dump → HF upload
rescore_aligned.py Alignment-aware rescorer (authoritative numbers)
serve_ui.py FastAPI chat UI wrapping the agent
data/
train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/ Spec, plan, and task list for this iteration
```
GitHub: https://github.com/bilersan/roka
## Reproducibility
1. Clone the repo and install requirements:
```bash
pip install -r requirements.txt
```
2. Regenerate the training set (deterministic):
```bash
python -m training.build_dataset
```
3. Train (RunPod-hosted, ~1 GPU-hour on an A6000):
```bash
python -m scripts.work_pipeline
```
4. Rescore predictions with the alignment-aware harness:
```bash
python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json
```
The training recipe is fully specified in `training/config.yaml`. The only hyperparameter that is unusually specific is `max_length: 4096` in `training/train.py` — removing it reproduces the Phase A v1–v4 collapse described above.
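Since the collapse was invisible in the loss curve, a cheap pre-training audit of sequence lengths would have surfaced it. The sketch below assumes you already have one token count per training example (for instance from tokenizing each conversation with the model's chat template); `truncation_report` is a hypothetical helper, not part of the repository.

```python
def truncation_report(token_lengths: list[int], max_length: int) -> dict:
    """Summarize how many examples a given max_length would cut short.

    token_lengths: one tokenized length per training example (assumed to be
    computed beforehand with the same tokenizer and chat template as training).
    """
    truncated = [n for n in token_lengths if n > max_length]
    total = len(token_lengths)
    return {
        "total": total,
        "truncated": len(truncated),
        "fraction": len(truncated) / total if total else 0.0,
        "tokens_lost": sum(n - max_length for n in truncated),
    }


# Toy lengths for illustration only, not measurements from the Roka dataset.
lengths = [500, 1500, 3000, 900]
print(truncation_report(lengths, max_length=1024))
print(truncation_report(lengths, max_length=4096))
```

Running such a report with both the library default and the intended value makes a silent default change visible before any GPU time is spent.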
## Intended use and out-of-scope use
**Intended**: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.
**Out of scope**:
- Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
- Parallel / agentic planning over large tool catalogs.
- Multi-turn conversational agents that need to preserve long prior context.
- Any application that requires the model to use tools not present in the training schema.
## License
This repository and the released weights are distributed under the **Apache License 2.0**, inherited from both `AlicanKiraz0/Kara-Kumru-v1.0-2B` and its upstream base `vngrs-ai/Kumru-2B`. See `LICENSE`.
## Citation
If you use Roka in research, please cite both the base model and this work:
```bibtex
@misc{roka_2026,
title = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
author = {Bilik, Ersan},
year = {2026},
url = {https://huggingface.co/ersanbil/roka}
}
@misc{karakumru_2025,
title = {Kara-Kumru-v1.0-2B},
author = {Kiraz, Alican},
year = {2025},
url = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```
## Acknowledgements
- **vngrs-ai** for the open Turkish base model `Kumru-2B`.
- **Alican Kiraz** for the Turkish-conversational fine-tune `Kara-Kumru-v1.0-2B`.
- **Hugging Face TRL / Unsloth** for the training stack.
- **Glaive-AI function-calling dataset** for the English portion of the multi-tool synthetic mix.
## Contact and feedback
Issues and pull requests are welcome on the GitHub mirror. This is a research preview — please file bugs for any behavior that contradicts the documented limitations above; those are the interesting cases.