---
license: apache-2.0
language:
  - kk
  - ru
  - en
pipeline_tag: text-generation
library_name: transformers
tags:
  - kazakh
  - russian
  - rag
  - tool-calling
  - agent
  - qwen3
---

Farabi-4B

Farabi-4B is a 4B-parameter instruction model for Kazakh, Russian, and English. It focuses on grounded RAG (answering from provided passages, citing them, and abstaining when the evidence is insufficient) and on Hermes-style tool calling for agentic use. It uses the Qwen3-4B architecture.

  • Languages: Kazakh (kk), Russian (ru), English (en)
  • Context length: 8192 tokens
  • Precision: bf16
  • Tool-call format: Hermes (vLLM --tool-call-parser hermes)

Serving

vLLM (recommended — enables tool calling)

vllm serve nur-dev/farabi-4b \
  --dtype bfloat16 --max-model-len 8192 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --chat-template chat_template.jinja

OpenAI-compatible client / Agents SDK

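from openai import OpenAI

# Placeholder key: vLLM only checks it when the server is started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    # Kazakh: "What is the weather like in Astana today?"
    messages=[{"role": "user", "content": "Астанада бүгін ауа райы қандай?"}],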
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
# A requested call, if any, is in message.tool_calls; otherwise message.content holds the reply
print(resp.choices[0].message)
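
The response above only carries the model's request to call `get_weather`; the application still has to run the function and return the result. Below is a minimal sketch of that step, assuming the standard OpenAI chat-completions tool-call objects (`id`, `function.name`, and `function.arguments` as a JSON string). The `get_weather` stub, the `TOOLS` registry, and `run_tool_calls` are hypothetical names introduced for illustration; they are not part of the model, vLLM, or the OpenAI client.

```python
import json

def get_weather(city: str) -> dict:
    # Hypothetical stub: replace with a real weather lookup.
    return {"city": city, "forecast": "unknown"}

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}

def run_tool_calls(tool_calls):
    """Execute each requested call and build the matching `tool` messages."""
    messages = []
    for call in tool_calls or []:
        fn = TOOLS.get(call.function.name)
        if fn is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string and may be malformed.
                args = json.loads(call.function.arguments)
                result = fn(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages
```

To finish the turn, append `resp.choices[0].message` and the messages returned by `run_tool_calls(resp.choices[0].message.tool_calls)` to the conversation, then call `client.chat.completions.create` again so the model can write its final answer.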

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")

# Kazakh: "What is a satellite?"
msgs = [{"role": "user", "content": "Спутник деген не?"}]
# return_dict=True returns input_ids together with attention_mask for generate()
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Benchmarks

Farabi-4B is evaluated on public Kazakh and Russian benchmarks against Sherkala-8B-Chat (inceptionai/Llama-3.1-Sherkala-8B-Chat, an 8B Kazakh chat model), with both models run through the same harness. Kazakh and Russian reasoning use the ISSAI QOLDA suite (n=250 per benchmark); knowledge is measured with standard multiple-choice sets.

Summary: Farabi-4B is a tool-calling model. It scores 78.3% on BFCL v4 (Berkeley Function-Calling Leaderboard), while Sherkala-8B-Chat has no function-calling interface and cannot be evaluated on it. Despite being half the size, Farabi-4B also leads on aggregate Kazakh reasoning (light-kk mean 46.9 vs 43.2) and on every Russian-language benchmark: by 4.8 to 20.0 points on the QOLDA Russian reasoning sets, and by 1.0 and 3.3 points on Belebele-ru and KazMMLU-ru. Sherkala-8B, trained on substantially more native-Kazakh text, leads on native Kazakh knowledge MC (KazMMLU-kk, TUMLU-kk), on RAG free-generation (RAGBench-kk, chrF), and on ARC-kk, GSM8K-kk, and Belebele-en.

Tool / function calling — BFCL v4 (Berkeley Function-Calling Leaderboard, AST, %)

| Category | Farabi-4B | Sherkala-8B |
|---|---|---|
| Simple | 92.5 | unsupported |
| Multiple | 91.0 | unsupported |
| Parallel | 87.0 | unsupported |
| Irrelevance | 36.7 | unsupported |
| Overall | 78.3 | unsupported |

unsupported = Sherkala-8B-Chat's chat template has no tools / tool-call mechanism; it emits zero function calls on every BFCL category, so function calling cannot be evaluated. Farabi-4B is served with vLLM --tool-call-parser hermes.
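
AST scoring judges the structure of the emitted call (function name and argument values) rather than executing it. The sketch below is a simplified illustration of that idea and is not the BFCL implementation: the dictionary layout and the `call_matches` and `score_case` helpers are assumptions made for this example. It treats an Irrelevance case as one where no call is expected, so the model is correct only if it emits none.

```python
def call_matches(call, expected):
    """Return True if one predicted call satisfies one expected call."""
    if call["name"] != expected["name"]:
        return False
    allowed = expected["arguments"]  # parameter -> list of accepted values
    args = call["arguments"]
    # Simplification: require exactly the expected parameters, no more, no fewer.
    if set(args) != set(allowed):
        return False
    return all(args[p] in allowed[p] for p in allowed)

def score_case(predicted_calls, expected_calls):
    """Score one test case; an empty expectation models an irrelevance case."""
    if not expected_calls:
        return not predicted_calls  # correct only if the model abstains
    if len(predicted_calls) != len(expected_calls):
        return False
    remaining = list(predicted_calls)
    # Order-insensitive greedy matching, enough for a sketch of parallel calls.
    for exp in expected_calls:
        hit = next((c for c in remaining if call_matches(c, exp)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return True
```

For example, `score_case([{"name": "get_weather", "arguments": {"city": "Astana"}}], [])` is `False`, because the model called a function when none was relevant.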

Kazakh reasoning — ISSAI QOLDA (accuracy, %)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| light-kk mean | 46.9 | 43.2 |
| MMLU-kk | 50.0 | 47.2 |
| MMLU-Pro-kk | 30.0 | 20.8 |
| GPQA-kk | 34.4 | 30.0 |
| PolyMath-kk | 26.0 | 21.6 |
| ARC-kk | 73.2 | 74.8 |
| GSM8K-kk | 66.4 | 68.8 |
| RAGBench-kk (chrF) | 30.6 | 41.9 |

Russian reasoning — ISSAI QOLDA (accuracy, %)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| ARC-ru | 92.8 | 78.4 |
| MMLU-Pro-ru | 42.8 | 22.8 |
| GPQA-ru | 32.4 | 25.2 |
| GSM8K-ru | 84.4 | 79.6 |
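
The Russian reasoning margins quoted in the summary can be recomputed from the table above. The short script below is only a convenience check; the `scores` and `margins` names are introduced here, and the numbers are copied from the table.

```python
# Scores copied from the Russian reasoning table (accuracy, %):
# benchmark -> (Farabi-4B, Sherkala-8B)
scores = {
    "ARC-ru": (92.8, 78.4),
    "MMLU-Pro-ru": (42.8, 22.8),
    "GPQA-ru": (32.4, 25.2),
    "GSM8K-ru": (84.4, 79.6),
}

# Margin in percentage points: Farabi-4B minus Sherkala-8B.
margins = {
    name: round(farabi - sherkala, 1)
    for name, (farabi, sherkala) in scores.items()
}

for name, delta in margins.items():
    print(f"{name}: {delta:+.1f}")
# ARC-ru: +14.4
# MMLU-Pro-ru: +20.0
# GPQA-ru: +7.2
# GSM8K-ru: +4.8
```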

Standard multiple-choice (accuracy, %)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| Belebele-kk | 70.5 | 69.0 |
| Belebele-ru | 80.5 | 79.5 |
| Belebele-en | 90.5 | 94.5 |
| KazMMLU-kk | 35.3 | 40.2 |
| KazMMLU-ru | 39.9 | 36.6 |
| TUMLU-kk | 30.5 | 37.5 |
| TruthfulQA-mc2 | 51.4 | 50.6 |

Intended use

Grounded question answering over retrieved passages (RAG), tool-augmented assistants / agents (Hermes tool calls), and Kazakh/Russian/English chat. For grounded RAG the model is trained to answer only from provided evidence and to abstain when evidence is insufficient.
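
For grounded RAG, the retrieved passages have to reach the model inside the chat messages. This model card does not specify the prompt layout used in training, so the sketch below is an assumption: a hypothetical `build_rag_messages` helper that numbers the passages so an answer can cite them as `[n]`, with instruction wording invented for this example.

```python
def build_rag_messages(question, passages):
    """Build chat messages that pair a question with numbered evidence passages."""
    # Assumed layout: number each passage so the answer can cite it as [n].
    evidence = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    # Assumed instruction wording; adjust to whatever format works best in practice.
    system = (
        "Answer using only the passages below and cite them as [n]. "
        "If the passages do not contain the answer, say so."
    )
    user = f"Passages:\n{evidence}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed as `messages` to `client.chat.completions.create(model="nur-dev/farabi-4b", ...)` or to `tok.apply_chat_template`.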

License

Apache-2.0.