---
language:
- kk
- ru
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- kazakh
- multilingual
- instruction-tuning
- tool-calling
- function-calling
- agent
- conversational
base_model: nur-dev/farabi-0.6B-base
license: apache-2.0
---
# Farabi-0.6B
**Farabi-0.6B** is a compact, multilingual instruction-tuned language model with a
primary focus on **Kazakh**, alongside strong **Russian** and **English** support.
It is designed for everyday assistant use, reasoning, retrieval-grounded answering,
and **tool / function calling** in agentic applications.
The model speaks fluent Kazakh and is intended to make high-quality conversational
AI more accessible for the Kazakh language, where well-aligned models remain scarce.
Created by **[Nurgali Kadyrbek](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/)**.
It is built on **[`nur-dev/farabi-0.6B-base`](https://huggingface.co/nur-dev/farabi-0.6B-base)**,
a Kazakh-adapted base model that was itself continually pre-trained from Qwen3-0.6B, and was then
instruction-tuned to produce this assistant.
---
## Highlights
- 🇰🇿 **Kazakh-first** — the majority of the instruction data is native Kazakh, with
Russian and English mixed in for cross-lingual robustness.
- 🧠 **Reasoning** — supports optional step-by-step "thinking" mode that can be toggled
on or off at request time.
- 🔧 **Tool calling** — emits Hermes-style `<tool_call>` blocks and is compatible with
the OpenAI-style function-calling interface and agent frameworks.
- 📚 **Grounded answering** — trained to answer from provided documents and context,
including longer inputs.
- 🪶 **Small & deployable** — 0.6B parameters, runs comfortably on a single modest GPU.
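When the model is used without a server-side parser, the `<tool_call>` blocks mentioned above have to be extracted from the raw text. The following is a minimal sketch, assuming the common Hermes-style layout in which each block wraps one JSON object with `name` and `arguments` keys. The helper name `extract_tool_calls` and the `get_weather` tool are illustrative and not part of the model. Malformed blocks are skipped rather than raised, since a small model can occasionally emit invalid JSON.

```python
import json
import re

# Assumed Hermes-style layout:
# <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the JSON payload of every well-formed <tool_call> block."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(payload, dict) and "name" in payload:
            calls.append(payload)
    return calls


# Hypothetical model output, for illustration only.
sample = (
    "Let me check.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Almaty"}}\n</tool_call>'
)
print(extract_tool_calls(sample))
```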
---
## Languages
| Language | Approx. share of instruction data |
|----------|-----------------------------------|
| Kazakh (kk) | ~56% |
| English (en) | ~33% |
| Russian (ru) | ~10% |
---
## Data coverage by domain
The model was instruction-tuned on a broad, internally curated mixture. The approximate
composition by domain, given in general terms without technical specifics, is:
| Domain | Approx. share |
|--------|---------------|
| General instruction following & multi-turn conversation | ~45% |
| Reasoning & step-by-step problem solving | ~27% |
| Retrieval-grounded answering, long context & document Q&A | ~13% |
| Tool use, function calling & agentic interaction | ~7% |
| Knowledge, culture, news & encyclopedic content | ~4% |
| Mathematics, language tasks (grammar / translation), safety & appropriate refusal, device & environment control, and assistant identity | ~4% |
*Shares are approximate and reflect general domain proportions rather than exact figures.*
---
## Data provenance & acknowledgments
The training datasets were **created internally by the author**, including original
synthesis as well as additionally processed and enriched material.
Approximately **5.4%** of all data used for instruction tuning was derived (with
additional processing and enrichment) from resources of two organizations, whose
contributions to the Kazakh language are gratefully acknowledged:
1. **Институт языкознания имени А. Байтурсынова** (*Institute of Linguistics named after A. Baitursynov*)
2. **ННПЦ «Тіл-Қазына» имени Шайсултана Шаяхметова** (*Sh. Shayakhmetov National Research and Practical Center "Til-Qazyna"*)
---
## Recommended sampling parameters
A good starting point for general use:
```json
{
  "temperature": 0.15,
  "top_p": 0.95,
  "max_tokens": 1024,
  "repetition_penalty": 1.05,
  "stream": true,
  "chat_template_kwargs": {
    "enable_thinking": true
  },
  "continue_final_message": true
}
```
Set `"enable_thinking": false` to get direct answers without an explicit reasoning step.
Raise `temperature` for more creative / open-ended generation.
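
The JSON above mixes standard OpenAI request fields with server-specific ones. With the OpenAI Python client, the non-standard fields are usually passed through `extra_body`. The sketch below shows one way to split them and to toggle thinking per request. The helper name `split_params` and the `STANDARD_KEYS` set are assumptions made for illustration, and the parameter values are copied from the JSON above.

```python
# Fields the OpenAI Python client accepts as regular keyword arguments.
STANDARD_KEYS = {"temperature", "top_p", "max_tokens", "stream"}

# Recommended values, copied from the JSON above.
RECOMMENDED = {
    "temperature": 0.15,
    "top_p": 0.95,
    "max_tokens": 1024,
    "repetition_penalty": 1.05,
    "stream": True,
    "chat_template_kwargs": {"enable_thinking": True},
    "continue_final_message": True,
}


def split_params(params: dict, enable_thinking: bool = True) -> tuple[dict, dict]:
    """Return (client_kwargs, extra_body) for an OpenAI-client request."""
    client_kwargs = {k: v for k, v in params.items() if k in STANDARD_KEYS}
    extra_body = {k: v for k, v in params.items() if k not in STANDARD_KEYS}
    # Copy the nested dict so the caller's params are not mutated.
    template_kwargs = dict(extra_body.get("chat_template_kwargs", {}))
    template_kwargs["enable_thinking"] = enable_thinking
    extra_body["chat_template_kwargs"] = template_kwargs
    return client_kwargs, extra_body


client_kwargs, extra_body = split_params(RECOMMENDED, enable_thinking=False)
print(client_kwargs)
print(extra_body)
```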
---
## Serving with vLLM
Start an OpenAI-compatible server with tool-calling enabled:
```bash
vllm serve nur-dev/farabi-0.6B \
  --served-model-name farabi-0.6b \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
Query it with the standard OpenAI client (and the recommended sampling params):
```python
from openai import OpenAI

# vLLM does not check the API key, but the client requires a non-empty value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="farabi-0.6b",
    messages=[
        # "You are a helpful and accurate assistant."
        {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
        # "Tell me briefly about Almaty."
        {"role": "user", "content": "Алматы туралы қысқаша айтып бер."},
    ],
    temperature=0.15,
    top_p=0.95,
    max_tokens=1024,
    # Fields outside the standard OpenAI schema go through extra_body.
    extra_body={
        "repetition_penalty": 1.05,
        "chat_template_kwargs": {"enable_thinking": True},
    },
    stream=True,
)

# Print tokens as they arrive.
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```
Tool calling works through the standard `tools=[...]` argument — the model returns
function calls that the server parses into structured `tool_calls`.
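
A minimal sketch of that round trip, assuming the server started above is running on `localhost:8000`. The `get_weather` tool, its schema, and the helper names `run_tool_call` and `ask_with_tools` are hypothetical and exist only for illustration. Because the model can emit a call even when no tool fits (see the irrelevance results under Evaluation), the dispatcher rejects unknown tool names and bad arguments instead of raising.

```python
import json


# Hypothetical tool for illustration; not part of the model or of vLLM.
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# OpenAI-style tool schema describing the hypothetical tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments: str) -> str:
    """Execute one parsed tool call; `arguments` is the JSON string from the API."""
    if name not in REGISTRY:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        result = REGISTRY[name](**json.loads(arguments))
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": str(exc)})
    return json.dumps(result, ensure_ascii=False)


def ask_with_tools(question: str) -> list[str]:
    """Send one request to the local server and run any returned tool calls."""
    from openai import OpenAI  # imported lazily so the helpers above work offline

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
    resp = client.chat.completions.create(
        model="farabi-0.6b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
        temperature=0.15,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```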
---
## Serving with PyTorch / Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nur-dev/farabi-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "You are a helpful and accurate assistant."
    {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
    # "Which city is the capital of Kazakhstan?"
    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
]

# return_dict=True yields input_ids together with the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=True,  # set False for direct answers
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.15,
    top_p=0.95,
    repetition_penalty=1.05,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
```
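
With `enable_thinking=True`, the decoded text may contain the reasoning as well as the final answer. The sketch below separates the two, assuming the reasoning is wrapped in `<think>...</think>` tags as in the Qwen3 family this model derives from. This card does not confirm the tag format, so treat it as an assumption; `split_thinking` is an illustrative helper name.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumes reasoning is wrapped in <think>...</think>. If no closing tag
    is found, the whole text is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    head, sep, tail = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical decoded output, for illustration only.
reasoning, answer = split_thinking("<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4.")
print(answer)
```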
---
## Evaluation
> ⚠️ **Interim results.** The numbers below were measured on an early checkpoint
> (~17% through instruction tuning). They are expected to improve as training
> continues, but already show meaningful capability.
### Tool / function calling — BFCL v4
Berkeley Function-Calling Leaderboard (v4), 1,040 cases, evaluated with the
HuggingFace backend.
| Category | Accuracy | n | What it measures |
|----------|----------|---|------------------|
| Simple | 80.5% | 322/400 | one call, one tool available |
| Multiple | 71.5% | 143/200 | pick the right tool from several |
| Parallel | 65.5% | 131/200 | several calls in one turn |
| Irrelevance | 5.4% | 13/240 | abstain when no tool fits |
| **Overall** | **58.6%** | 609/1040 | |
| **Function-calling avg** | **74.5%** | 596/800 | excludes irrelevance |
**Takeaways:**
- **Strong calling ability for a 0.6B model.** When a call is warranted it is correct
~74.5% of the time — right tool, valid arguments, clean JSON — including 65.5% on the
hard parallel / multi-call category.
- **The weakness is abstention, not calling.** On queries that match no available tool,
the model still tends to emit a call: it abstains correctly in only 13 of 240 irrelevance
cases (5.4%). This over-triggering is the main driver of the lower overall score and the
clearest area for improvement.
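
The two summary rows can be reproduced from the per-category counts in the table. The short sketch below pools correct and total cases across categories; the function name `pooled_accuracy` is illustrative.

```python
# (correct, total) per category, copied from the table above.
results = {
    "simple": (322, 400),
    "multiple": (143, 200),
    "parallel": (131, 200),
    "irrelevance": (13, 240),
}


def pooled_accuracy(categories):
    """Pool correct/total counts over the given categories."""
    correct = sum(results[c][0] for c in categories)
    total = sum(results[c][1] for c in categories)
    return correct, total, 100 * correct / total


overall = pooled_accuracy(results)
calling = pooled_accuracy(["simple", "multiple", "parallel"])
print("overall: %d/%d = %.1f%%" % overall)           # overall: 609/1040 = 58.6%
print("function-calling: %d/%d = %.1f%%" % calling)  # function-calling: 596/800 = 74.5%
```

Both figures are pooled over cases, so the 400-case simple category carries twice the weight of the 200-case categories. An unweighted mean of the three calling categories would be 72.5%.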
### Multilingual comprehension — 4-way multiple choice
Multiple-choice comprehension across the model's three languages (random baseline = 25%),
evaluated with the chat template and `enable_thinking=False`.
| Language | Accuracy |
|----------|----------|
| English | 53.7% ±1.7 |
| Russian | 50.0% ±1.7 |
| Kazakh | 41.8% ±1.6 |
**Takeaways:**
- Well above the 25% random baseline in all three languages — real comprehension in
English, Russian, and Kazakh.
- Resource ordering (en > ru > kk) is as expected; Kazakh at 41.8% is clearly non-trivial.
- Evaluating with the chat template and `enable_thinking=False` adds ~5–6 points per
language versus a raw prompt — another reason to serve the model with its chat template
(see serving instructions above).
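
The card does not state what the ± values represent or how many questions each language set contains. If they are binomial standard errors, as is common in evaluation harnesses, they could be computed as in the sketch below. The counts used here are hypothetical and are not the actual evaluation data.

```python
import math


def accuracy_with_stderr(correct: int, total: int) -> tuple[float, float]:
    """Return (accuracy, binomial standard error), both in percent."""
    p = correct / total
    stderr = math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * stderr


# Hypothetical counts, for illustration only.
acc, se = accuracy_with_stderr(correct=430, total=1000)
print(f"{acc:.1f}% ±{se:.1f}")
```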
---
## Intended use & limitations
Farabi-0.6B is intended as a helpful general-purpose and agentic assistant, with a
focus on Kazakh-language use cases. As a small model, it can make factual mistakes,
and outputs should be verified in high-stakes or accuracy-critical applications. It
should be used responsibly and in accordance with applicable laws and the base model's
license.
---
## Citation
If you use this model, please credit the author:
> Nurgali Kadyrbek — Farabi-0.6B.
> https://www.linkedin.com/in/nurgali-kadyrbek-504260231/