---
license: apache-2.0
language:
- kk
- ru
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- kazakh
- russian
- rag
- tool-calling
- agent
- qwen3
---
# Farabi-4B
A 4B-parameter instruction model for **Kazakh, Russian, and English**, focused on
**grounded RAG** (answer from provided passages, cite, and abstain when evidence is
insufficient) and **Hermes-style tool calling / agentic use**. Qwen3-4B architecture.
- **Languages:** Kazakh (kk), Russian (ru), English (en)
- **Context length:** 8192 tokens
- **Precision:** bf16
- **Tool-call format:** Hermes (vLLM `--tool-call-parser hermes`)
## Serving
### vLLM (recommended — enables tool calling)
```bash
vllm serve nur-dev/farabi-4b \
  --dtype bfloat16 --max-model-len 8192 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --chat-template chat_template.jinja
```
### OpenAI-compatible client / Agents SDK
```python
from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    # Prompt in Kazakh: "What is the weather like in Astana today?"
    messages=[{"role": "user", "content": "Астанада бүгін ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)

# Any tool calls the model made are in `resp.choices[0].message.tool_calls`.
print(resp.choices[0].message)
```
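The request above only prints the assistant message. A minimal sketch of the next step, executing the requested tool locally and building the `role="tool"` reply, could look like the following. The `get_weather` body, the `TOOLS` registry, and the helper names `run_tool_call` and `tool_messages` are hypothetical and exist only for illustration; the returned weather values are made up.

```python
import json


def get_weather(city: str) -> dict:
    """Hypothetical stand-in for a real weather lookup (values are made up)."""
    return {"city": city, "temp_c": -5, "condition": "snow"}


# Registry mapping tool names (as declared in `tools=[...]`) to local callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(name: str, arguments: str) -> str:
    """Run one tool call and return a JSON string for the tool message."""
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        kwargs = json.loads(arguments)  # arguments arrive as a JSON string
    except json.JSONDecodeError as exc:
        return json.dumps({"error": f"invalid JSON arguments: {exc}"})
    return json.dumps(TOOLS[name](**kwargs), ensure_ascii=False)


def tool_messages(assistant_message) -> list:
    """Build one role="tool" message per call in the assistant message."""
    return [
        {
            "role": "tool",
            "tool_call_id": call.id,
            "content": run_tool_call(call.function.name, call.function.arguments),
        }
        for call in (assistant_message.tool_calls or [])
    ]
```

To continue the conversation, append `resp.choices[0].message` and then the output of `tool_messages(resp.choices[0].message)` to `messages`, and call `client.chat.completions.create` again with the same `tools`.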
### transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained(
    "nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto"
)

# Prompt in Kazakh: "What is a satellite?"
msgs = [{"role": "user", "content": "Спутник деген не?"}]
ids = tok.apply_chat_template(msgs, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(ids, max_new_tokens=512)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```
## Benchmarks
Evaluated on public Kazakh/Russian benchmarks against **Sherkala-8B-Chat**
(`inceptionai/Llama-3.1-Sherkala-8B-Chat`, an 8B Kazakh chat model), with both models run
through the same harness. Kazakh and Russian reasoning use the ISSAI QOLDA suite (n=250);
knowledge is measured with standard multiple-choice sets.
**Summary:** Farabi-4B is a **tool-calling** model: it scores **78.3%** overall on BFCL v4
(Berkeley Function-Calling Leaderboard), while Sherkala-8B-Chat has no function-calling
interface and cannot be evaluated on it. Despite being half the size, Farabi-4B also
**leads on aggregate Kazakh reasoning** (light-kk mean 46.9 vs 43.2) and **on every
Russian-language benchmark** (by +1.0 to +20.0 pt overall; +4.8 to +20.0 pt on the QOLDA
Russian reasoning sets). Sherkala-8B, trained on substantially more native-Kazakh text,
leads on native Kazakh knowledge MC (KazMMLU-kk, TUMLU-kk) and on RAG free-generation
(chrF); it is also ahead on ARC-kk, GSM8K-kk, and Belebele-en.
### Tool / function calling — BFCL v4 (Berkeley Function-Calling Leaderboard, AST, %)
| Category | Farabi-4B | Sherkala-8B |
|---|---|---|
| Simple | 92.5 | **unsupported** |
| Multiple | 91.0 | **unsupported** |
| Parallel | 87.0 | **unsupported** |
| Irrelevance | 36.7 | **unsupported** |
| **Overall** | **78.3** | **unsupported** |
> **unsupported** = Sherkala-8B-Chat's chat template has no `tools` / tool-call mechanism;
> it emits zero function calls on every BFCL category, so function calling cannot be
> evaluated. Farabi-4B is served with vLLM `--tool-call-parser hermes`.
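
The Overall figure is not the plain average of the four category scores, which comes to 76.8. The sketch below shows that 78.3 is consistent with a sample-weighted average, assuming category sizes of 400 / 200 / 200 / 240 test cases. Those counts are an assumption made here for illustration; this card does not state them.

```python
# Per-category AST accuracy (%) for Farabi-4B, copied from the table above.
scores = {"Simple": 92.5, "Multiple": 91.0, "Parallel": 87.0, "Irrelevance": 36.7}

# ASSUMPTION: number of test cases per category (not stated in this card).
counts = {"Simple": 400, "Multiple": 200, "Parallel": 200, "Irrelevance": 240}

# Plain mean treats every category equally.
unweighted = sum(scores.values()) / len(scores)

# Sample-weighted mean treats every test case equally.
weighted = sum(scores[c] * counts[c] for c in scores) / sum(counts.values())

print(f"unweighted mean:      {unweighted:.1f}")
print(f"sample-weighted mean: {weighted:.1f}")
```

Under these assumed counts the weighted mean rounds to 78.3, matching the table. With different category sizes the result would differ.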
### Kazakh reasoning — ISSAI QOLDA (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| **light-kk mean** | **46.9** | 43.2 |
| MMLU-kk | 50.0 | 47.2 |
| MMLU-Pro-kk | 30.0 | 20.8 |
| GPQA-kk | 34.4 | 30.0 |
| PolyMath-kk | 26.0 | 21.6 |
| ARC-kk | 73.2 | 74.8 |
| GSM8K-kk | 66.4 | 68.8 |
| RAGBench-kk (chrF) | 30.6 | 41.9 |
### Russian reasoning — ISSAI QOLDA (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| ARC-ru | 92.8 | 78.4 |
| MMLU-Pro-ru | 42.8 | 22.8 |
| GPQA-ru | 32.4 | 25.2 |
| GSM8K-ru | 84.4 | 79.6 |
### Standard multiple-choice (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| Belebele-kk | 70.5 | 69.0 |
| Belebele-ru | 80.5 | 79.5 |
| Belebele-en | 90.5 | 94.5 |
| KazMMLU-kk | 35.3 | 40.2 |
| KazMMLU-ru | 39.9 | 36.6 |
| TUMLU-kk | 30.5 | 37.5 |
| TruthfulQA-mc2 | 51.4 | 50.6 |
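
The per-benchmark gaps on the Russian-language rows can be recomputed directly from the tables in this section. The snippet below is a small sketch that copies those six rows and prints the Farabi-4B minus Sherkala-8B difference in points; the variable names are illustrative.

```python
# (Farabi-4B, Sherkala-8B) scores for the Russian-language rows in the tables above.
russian = {
    "ARC-ru": (92.8, 78.4),
    "MMLU-Pro-ru": (42.8, 22.8),
    "GPQA-ru": (32.4, 25.2),
    "GSM8K-ru": (84.4, 79.6),
    "Belebele-ru": (80.5, 79.5),
    "KazMMLU-ru": (39.9, 36.6),
}

# Point difference per benchmark, rounded to the tables' one-decimal precision.
deltas = {name: round(farabi - sherkala, 1) for name, (farabi, sherkala) in russian.items()}

# Print largest gap first.
for name, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {delta:+.1f}")
```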
## Intended use
Grounded question answering over retrieved passages (RAG), tool-augmented assistants /
agents (Hermes tool calls), and Kazakh/Russian/English chat. For grounded RAG the model
is trained to answer only from provided evidence and to abstain when evidence is
insufficient.
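
As a rough illustration of the grounded-RAG use case, the sketch below packs retrieved passages into chat messages. This card does not specify a prompt layout, so the numbered-passage format, the system instruction, and the `build_rag_messages` helper are all assumptions rather than the format the model was trained on.

```python
def build_rag_messages(question: str, passages: list[str]) -> list[dict]:
    """Format retrieved passages into chat messages (layout is an assumption)."""
    # Number the passages so the answer can cite them as [1], [2], ...
    numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = (
        "Answer only from the passages below and cite them as [n]. "
        "If the passages do not contain the answer, say so."
    )
    user = f"Passages:\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed as `messages` to `client.chat.completions.create` or to `tok.apply_chat_template`, as in the Serving examples.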
## License
Apache-2.0.