## How to use

Instructions for using ersanbil/roka with libraries, inference providers, and local apps.

### Transformers

Use a pipeline as a high-level helper:

```python
from transformers import pipeline

pipe = pipeline("text-generation", model="ersanbil/roka")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

Or load the model directly:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ersanbil/roka")
model = AutoModelForCausalLM.from_pretrained("ersanbil/roka")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
### llama-cpp-python

```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="ersanbil/roka",
    filename="gguf/roka-v0.2-Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ]
)
```
### llama.cpp

Install via Homebrew (macOS/Linux) or WinGet (Windows), download a pre-built binary from https://github.com/ggerganov/llama.cpp/releases, or build from source:

```shell
# Homebrew (macOS / Linux)
brew install llama.cpp

# WinGet (Windows)
winget install llama.cpp

# Build from source
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
```

Then serve or chat with the model:

```shell
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf ersanbil/roka:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf ersanbil/roka:Q4_K_M
```
### vLLM

Install from pip and serve the model:

```shell
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "ersanbil/roka"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ersanbil/roka",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```
### SGLang

Install from pip and serve the model:

```shell
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "ersanbil/roka" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ersanbil/roka",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Or use the Docker image:

```shell
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "ersanbil/roka" \
    --host 0.0.0.0 \
    --port 30000
```
### Ollama

```shell
ollama run hf.co/ersanbil/roka:Q4_K_M
```
### Unsloth Studio

Install on macOS, Linux, or WSL:

```shell
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# and search for ersanbil/roka to start chatting
```

Install on Windows:

```shell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# and search for ersanbil/roka to start chatting
```

Or use the Hugging Face Space (no setup required): open https://huggingface.co/spaces/unsloth/studio in your browser and search for ersanbil/roka to start chatting.
### Docker Model Runner

```shell
docker model run hf.co/ersanbil/roka:Q4_K_M
```
### Lemonade

```shell
# Download Lemonade from https://lemonade-server.ai/

# Pull the model
lemonade pull ersanbil/roka:Q4_K_M

# Run and chat with the model
lemonade run user.roka-Q4_K_M

# List all available models
lemonade list
```
Model card metadata:

```yaml
---
language:
- tr
- en
license: apache-2.0
library_name: transformers
base_model: AlicanKiraz0/Kara-Kumru-v1.0-2B
pipeline_tag: text-generation
tags:
- turkish
- tool-calling
- function-calling
- hermes
- kara-kumru
- mistral
- gguf
---
```
# Roka — Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B

Roka is a supervised fine-tune of `AlicanKiraz0/Kara-Kumru-v1.0-2B` that teaches a 2B-parameter Turkish language model to use five tools (web search, calculator, date/time, weather, URL reader) via a Hermes-style `<tool_call>…</tool_call>` output format.

This is a **v0.2 research preview**, released for reproducibility and community feedback. It is not a production-grade tool-calling agent and has known weaknesses (see *Limitations*).

The v0.2 training set is fully decontaminated against the evaluation set: no test-set query appears verbatim in train or validation.
## Model at a glance

| | |
|---|---|
| **Base model** | `AlicanKiraz0/Kara-Kumru-v1.0-2B` (Mistral architecture, Llama-3 chat template, Turkish-pretrained) |
| **Upstream base** | `vngrs-ai/Kumru-2B` |
| **Parameters** | ~2.15B |
| **Fine-tuning** | Full fine-tuning, 3 epochs, LR 5e-5 linear, bf16, TRL SFTTrainer |
| **Hardware** | Single NVIDIA A6000 (~65 min total, ~22 min per epoch) |
| **Languages** | Primarily Turkish; ~13% of the training mix is English (Glaive-sourced synthetic tool-calling examples) |
| **License** | Apache 2.0 (inherited from base chain) |
## Tool set

| Tool | Description |
|---|---|
| `web_search` | Internet search (DuckDuckGo) |
| `calculator` | Arithmetic expression evaluator |
| `datetime` | Date/time and calendar arithmetic (9 actions: `today`, `now`, `day_of_week`, `add_days`, `date_diff`, `days_until`, `day_of_year`, `end_of_month`, `days_until_weekday`) |
| `hava_durumu` | Weather query by city name |
| `sayfa_oku` | URL content reader |
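The schemas for these tools live in `training/tools.py`. As a rough illustration only (field names and descriptions below are assumptions, not the repository's actual schema), a Hermes/OpenAI-style function schema for `hava_durumu` might look like:

```python
# Hypothetical sketch of a Hermes-style tool schema for hava_durumu.
# The real schema is defined in training/tools.py and may differ in detail.
hava_durumu_schema = {
    "type": "function",
    "function": {
        "name": "hava_durumu",
        "description": "Returns the current weather for the given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "City name to query, e.g. 'Ankara'.",
                },
            },
            "required": ["city"],
        },
    },
}

print(hava_durumu_schema["function"]["name"])  # hava_durumu
```

Note that this schema, like the real one, has no temporal parameter; that gap is discussed under *Limitations*.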
The model is trained to emit tool calls as:

```
<tool_call>
{"name": "datetime", "arguments": {"action": "today"}}
</tool_call>
```

Tool results are fed back to the model wrapped in `<tool_response>…</tool_response>` inside a user turn, and the model synthesizes a final Turkish answer.
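A minimal agent loop around this format might look like the sketch below. Helper names are illustrative, not the repository's API, and it assumes a single `<tool_call>` block per model turn:

```python
import json
import re

# Non-greedy body plus the closing-tag anchor lets the regex span
# nested JSON braces inside the tool call.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_call(model_output: str):
    """Return the parsed {"name": ..., "arguments": ...} dict, or None."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None
    return json.loads(match.group(1))

def make_tool_response_turn(result: dict) -> dict:
    """Wrap a tool result as a user turn, as the training data does."""
    payload = json.dumps(result, ensure_ascii=False)
    return {
        "role": "user",
        "content": f"<tool_response>\n{payload}\n</tool_response>",
    }

output = '<tool_call>\n{"name": "datetime", "arguments": {"action": "today"}}\n</tool_call>'
call = extract_tool_call(output)
print(call["name"])  # datetime

turn = make_tool_response_turn({"date": "2026-04-23"})
print(turn["role"])  # user
```

In a real loop the extracted call would be dispatched to the matching tool, and the wrapped response appended to the conversation before generating the final answer.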
## Evaluation

The test set contains 260 Turkish prompts spread over six categories (simple tool calls, fullflow multi-step, parallel, multiple tools, irrelevance, adversarial). Scoring uses an alignment-aware harness (`scripts/rescore_aligned.py`) that normalizes equivalent datetime actions and accepts semantically equivalent arithmetic expressions.
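The normalization idea can be sketched as follows. Function names and the equivalence table below are illustrative assumptions; `scripts/rescore_aligned.py` is authoritative:

```python
# Illustrative sketch of alignment-aware scoring normalization.
# The real logic lives in scripts/rescore_aligned.py.

# Treat datetime actions that answer the same question as equivalent
# (mapping chosen here purely for illustration).
EQUIVALENT_DATETIME_ACTIONS = {"now": "today"}

def normalize_call(call: dict) -> dict:
    name = call["name"]
    args = dict(call.get("arguments", {}))
    if name == "datetime":
        action = args.get("action")
        args["action"] = EQUIVALENT_DATETIME_ACTIONS.get(action, action)
    elif name == "calculator":
        # Accept semantically equivalent arithmetic by evaluating both
        # expressions. eval() is fine for trusted eval-harness data only.
        expr = args.get("expression", "")
        args["expression"] = str(eval(expr, {"__builtins__": {}}, {}))
    return {"name": name, "arguments": args}

pred = {"name": "calculator", "arguments": {"expression": "2*(3+4)"}}
gold = {"name": "calculator", "arguments": {"expression": "14"}}
print(normalize_call(pred) == normalize_call(gold))  # True
```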
### Overall results (Roka v0.2, April 2026)

| View | n | Full-Match | Tool-Call Acc. | Name Acc. | Arg Acc. |
|---|---|---|---|---|---|
| **All test (held-out)** | 260 | **73.5%** | 93.1% | 71.9% | 60.6% |

Every test query was verified to be absent from both `data/train.jsonl` and `data/val.jsonl`, so the 73.5% number above is a genuinely held-out measurement. See *Decontamination history* below for why this is lower than an earlier, un-decontaminated run.
### Per-subcategory results

| Subcategory | n | Full-Match |
|---|---|---|
| simple/web_search | 30 | **93.3%** |
| simple/weather | 20 | **100.0%** |
| simple/url_reader | 15 | **100.0%** |
| simple/calculator | 20 | 70.0% |
| simple/datetime | 15 | 46.7% |
| fullflow | 35 | **80.0%** |
| multiple | 45 | 64.4% |
| parallel | 15 | **0.0%** |
| adversarial/turkish_special | 10 | 90.0% |
| adversarial/edge_case | 5 | 40.0% |
| adversarial/ambiguous | 15 | 26.7% |
| irrelevance/greeting | 15 | **100.0%** |
| irrelevance/identity | 10 | **100.0%** |
| irrelevance/opinion | 10 | **100.0%** |

**Parallel tool calls score 0% because the training mix does not contain parallel-call examples.** This is a known gap, not a reproducibility failure.
### Decontamination history

During preparation for this release we audited the training set and found that **44 of the 260 test queries appeared verbatim in train/val** (8 in simple/datetime, 6 in simple/web_search, 7 in multiple, and the remainder mostly in irrelevance/identity and irrelevance/greeting; the full breakdown is under *Contamination verification*). We removed all 76 matching train examples and 6 matching val examples, and retrained on the clean split. That retraining produced the model reported above.

For transparency we also report the before-and-after numbers on the 216 test queries that were **not** affected by the decontamination (i.e., the genuinely held-out subset from the *pre-cleanup* model's perspective):
| Model | Training data | Clean-216 FM |
|---|---|---|
| v0.1 pre-clean | original (with 76 overlaps) | 78.2% |
| **v0.2** (released) | decontaminated | **73.6%** |

The ~4.6-point drop is informative: it is *not* contamination inflation. The removed training examples were pattern-providing (datetime variants, fullflow web-search turns, distractor augmentations of the same base queries), and losing them cost about 4.6 points of generalization even on held-out queries. The cost of honest decontamination was larger than a narrow definition of "memorization gain" would predict. We report the post-decontamination number because it is the only one that is defensible as a held-out measurement. A future v0.3 will attempt to recover the gap by adding clean synthetic replacements for the removed examples.
## Development journey (brief)

Arriving at the final model required an honest number of dead ends.

1. **Baseline (Run 10)** — 62.7% aligned FM with an earlier pipeline, before any of the spec-005 data work.
2. **Phase A v1–v4 collapse** — four consecutive training runs where loss converged to near zero but test-set Full-Match stayed at 0/260. All of them passed `loss` sanity checks, so the failure was invisible from inside the run.
3. **Root cause** — TRL issue [#3910](https://github.com/huggingface/trl/issues/3910): the `max_seq_length` argument was silently renamed to `max_length` (default 1024) in TRL 0.20+. Every assistant turn longer than 1024 tokens (≈75% of our fullflow examples) was truncated before it contributed to the loss. The model trained to completion on fragments, not on full tool-calling traces. Fix: pass `max_length=4096` explicitly.
4. **Data iterations**
   - Removed the `unit` argument from all `hava_durumu` training examples (the test set does not supply it). `simple/weather` Full-Match rose from 10% to 100%.
   - Added 45 supplementary `datetime` examples covering `day_of_year`, `end_of_month`, and `days_until_weekday`, test actions that were absent from the R10 training data.
   - Those supplementary examples caused a regression on `day_of_week` queries ("23 Nisan hangi güne denk geliyor?", i.e. "Which day of the week does 23 April fall on?", was mis-routed to `day_of_year`). A targeted set of 30 `day_of_week` contrast examples fixed it.
5. **Final v0.1 model** — 4,778 training / 509 validation examples, 795 optimizer steps. Reported 76.9% all-test, 78.2% on the clean-216 subset.
6. **v0.2 decontamination** — 76 train and 6 val examples whose first user turn matched a test query were removed, producing a 4,702 / 503 split. Retraining on this split gave the 73.5% number now reported above. The 4.6-point drop on the clean-216 subset between v0.1 and v0.2 is the cost of honest decontamination (see *Decontamination history*).

Total compute used across Phase A and v0.2: ~5 A6000-hours.
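The truncation failure described in step 3 above can be illustrated with plain Python. Token IDs here are fake; the point is that an assistant turn starting past the cutoff contributes nothing to the loss:

```python
# Toy illustration of the TRL max_length pitfall described above.
# With max_length=1024, any tokens past position 1024 are cut before
# the loss is computed, so a tool-calling tail is never trained on.

def truncate(input_ids, max_length):
    return input_ids[:max_length]

prompt_ids = list(range(1200))        # long fullflow context (fake IDs)
answer_ids = list(range(1200, 1500))  # the assistant tool-calling turn

example = prompt_ids + answer_ids

survivors_1024 = [t for t in truncate(example, 1024) if t in answer_ids]
survivors_4096 = [t for t in truncate(example, 4096) if t in answer_ids]

print(len(survivors_1024))  # 0   -> the tool-calling turn never reaches the loss
print(len(survivors_4096))  # 300 -> the full turn survives
```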
## Limitations

- **Multi-turn pattern lock-in.** The SFT mix contains very few multi-turn tool-calling sequences. If the user starts with a chit-chat turn ("selam", "hi"), the model tends to stay in plain-chat mode on subsequent turns and skip the tool call. The provided `scripts/serve_ui.py` works around this by feeding only the current user message (without prior turns) into the tool-decision loop.
- **Parallel tool calls: 0%.** Not trained.
- **`hava_durumu` has no temporal parameter.** Queries like "yarın İstanbul'da hava" ("tomorrow's weather in Istanbul") still produce `{"city": "İstanbul"}` because that is all the schema allows. The fix is a schema change plus data regeneration, not a prompt change.
- **Adversarial/ambiguous: 26.7%.** The model is easily nudged off-task by ambiguous phrasing.
- **Long-passage synthesis is brittle.** When `sayfa_oku` returns several paragraphs, the synthesized summary sometimes fragments quotes in an unnatural way.
- **Hermes parser coupling.** Native OpenAI-style `tool_calls` parsing via `llama-server` requires the provided `training/roka_tool_template.jinja` chat template and requires the client to pass the full list of 5 tools. Passing a subset confuses llama.cpp's Hermes detector.
- **Scoring discrepancy.** The in-training `training/eval.py` scorer disagrees slightly with the alignment-aware rescorer. Only the rescored numbers are reported above. Resolving the discrepancy is open work.
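The multi-turn workaround mentioned under *Multi-turn pattern lock-in* can be sketched as follows. This is an illustrative sketch, not the actual `scripts/serve_ui.py` implementation:

```python
# Illustrative sketch of the serve_ui.py workaround: the tool-decision
# pass sees only the latest user message, so earlier chit-chat turns
# cannot lock the model into plain-chat mode.

def latest_user_message(history):
    """history is a list of {"role", "content"} chat turns."""
    for turn in reversed(history):
        if turn["role"] == "user":
            return turn["content"]
    return None

def tool_decision_prompt(history, system_prompt):
    # Deliberately drop prior turns; keep only the current request.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": latest_user_message(history)},
    ]

history = [
    {"role": "user", "content": "selam"},
    {"role": "assistant", "content": "Merhaba! Size nasıl yardımcı olabilirim?"},
    {"role": "user", "content": "Ankara'da hava nasıl?"},
]
msgs = tool_decision_prompt(history, "You can call tools...")
print(msgs[-1]["content"])  # Ankara'da hava nasıl?
```

The trade-off is that the tool-decision pass loses legitimate prior context, which is why multi-turn agents remain out of scope for this release.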
## Training data

- **4,702 train / 503 validation** examples after decontamination (4,778 / 509 before), Hermes-format chat turns.
- **~72% Turkish, ~13% English, ~15% short/symbolic.** The English fraction is Glaive-sourced synthetic tool-calling data retained for multi-tool pattern coverage.
- **Deterministic generators** for `calculator`, `datetime`, `hava_durumu` (in `training/generators/`).
- **Real DuckDuckGo search results** cached in `data/ddg_cache.json` and used to construct `web_search` fullflow examples.
- **PII scan**: only two flagged matches in user-facing content, both false positives (embedded WSJ article IDs). No email addresses, Turkish ID numbers, credit cards, or IP addresses found.
## Contamination verification

The released v0.2 model is trained on a split where **no test query appears verbatim** in either train or validation. The decontamination script (`scripts/decontaminate.py`) normalizes whitespace and case before matching. The pre-decontamination overlap distribution (all removed in v0.2) was:

| Subcategory | Overlap (removed) |
|---|---|
| irrelevance/identity | 8 / 10 |
| irrelevance/greeting | 11 / 15 |
| simple/datetime | 8 / 15 |
| simple/web_search | 6 / 30 |
| multiple | 7 / 45 |
| adversarial/turkish_special | 1 / 10 |
| irrelevance/opinion | 1 / 10 |
| simple/weather | 1 / 20 |
| fullflow | 1 / 35 |

Because augmentation variants of each base query (masked/distractor versions) shared the same user turn, removing 44 unique queries deleted 76 train examples and 6 val examples in total. The remaining 4,702 / 503 split is what v0.2 was trained on.

This decontamination is **exact-string**, not fuzzy. Near-duplicates (paraphrases that trigger the same tool call) are still present. Closing the paraphrase loophole requires a more elaborate embedding-based deduplication pass, which is left for v0.3.
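The exact-string matching amounts to something like the following sketch. The real logic lives in `scripts/decontaminate.py`; helper names here are illustrative:

```python
# Sketch of exact-string decontamination: normalize whitespace and case,
# then drop any training example whose first user turn matches a test query.

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def first_user_turn(example: dict) -> str:
    return next(t["content"] for t in example["messages"] if t["role"] == "user")

def decontaminate(train_examples, test_queries):
    banned = {normalize(q) for q in test_queries}
    return [ex for ex in train_examples
            if normalize(first_user_turn(ex)) not in banned]

train = [
    {"messages": [{"role": "user", "content": "Bugün günlerden ne?"}]},
    {"messages": [{"role": "user", "content": "2+2 kaç eder?"}]},
]
clean = decontaminate(train, ["bugün   günlerden ne?"])
print(len(clean))  # 1 -> the overlapping example was removed
```

As noted above, this catches only verbatim overlaps; a paraphrase of the same query would survive this filter.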
## Repository layout

```
src/                        Inference clients (transformers & llama-server)
training/
  tools.py                  Tool schemas + training system prompt
  train.py                  TRL SFTTrainer entry point
  eval.py                   Test-set scorer (in-training)
  roka_tool_template.jinja  llama-server chat template with Hermes detection hook
  generators/               Deterministic data generators per tool
scripts/
  work_pipeline.py          End-to-end pod orchestration
  pod_run_and_dump.py       On-pod training → prediction dump → HF upload
  rescore_aligned.py        Alignment-aware rescorer (authoritative numbers)
  serve_ui.py               FastAPI chat UI wrapping the agent
data/
  train.jsonl, val.jsonl, test_set.json
specs/005-post-run10-75/    Spec, plan, and task list for this iteration
```

GitHub: https://github.com/bilersan/roka
## Reproducibility

1. Clone the repo and install requirements:

   ```bash
   pip install -r requirements.txt
   ```

2. Regenerate the training set (deterministic):

   ```bash
   python -m training.build_dataset
   ```

3. Train (RunPod-hosted, ~1 GPU-hour on an A6000):

   ```bash
   python -m scripts.work_pipeline
   ```

4. Rescore predictions with the alignment-aware harness:

   ```bash
   python -m scripts.rescore_aligned --predictions .work/artifacts/predictions/<run_id>.json
   ```

The training recipe is fully specified in `training/config.yaml`. The only unusually specific hyperparameter is `max_length: 4096` in `training/train.py`; removing it reproduces the Phase A v1–v4 collapse described above.
## Intended use and out-of-scope use

**Intended**: Turkish-language tool-calling agents for well-defined tools, research on small-model function calling, educational demonstrations of the SFT pipeline.

**Out of scope**:

- Safety-critical applications. The model has not been evaluated for harmful-content refusal beyond what Kara-Kumru inherits from its base.
- Parallel / agentic planning over large tool catalogs.
- Multi-turn conversational agents that need to preserve long prior context.
- Any application that requires the model to use tools not present in the training schema.
## License

This repository and the released weights are distributed under the **Apache License 2.0**, inherited from both `AlicanKiraz0/Kara-Kumru-v1.0-2B` and its upstream base `vngrs-ai/Kumru-2B`. See `LICENSE`.
## Citation

If you use Roka in research, please cite both the base model and this work:

```bibtex
@misc{roka_2026,
  title  = {Roka: Turkish Tool-Calling Fine-Tune of Kara-Kumru 2B},
  author = {Bilik, Ersan},
  year   = {2026},
  url    = {https://huggingface.co/ersanbil/roka}
}

@misc{karakumru_2025,
  title  = {Kara-Kumru-v1.0-2B},
  author = {Kiraz, Alican},
  year   = {2025},
  url    = {https://huggingface.co/AlicanKiraz0/Kara-Kumru-v1.0-2B}
}
```
## Acknowledgements

- **vngrs-ai** for the open Turkish base model `Kumru-2B`.
- **Alican Kiraz** for the Turkish-conversational fine-tune `Kara-Kumru-v1.0-2B`.
- **Hugging Face TRL / Unsloth** for the training stack.
- **Glaive-AI function-calling dataset** for the English portion of the multi-tool synthetic mix.
## Contact and feedback

Issues and pull requests are welcome on the GitHub mirror. This is a research preview: please file bugs for any behavior that contradicts the documented limitations above, since those are the interesting cases.