Farabi-4B
A 4B Kazakh / Russian / English assistant built on Qwen3-4B and strengthened for Kazakh & Russian knowledge and grounded RAG / agentic tool-use, while retaining the base model's function-calling ability. It drops into agent stacks that expect OpenAI-style function calling and emits clean Hermes tool calls.
Capabilities
- Stronger Kazakh & Russian knowledge. Improves over the Qwen3-4B base and surpasses the larger ISSAI Sherkala-8B on three of the four Kazakh knowledge benchmarks and on their mean (see below).
- Grounded RAG. Answers from provided passages, attributes claims to the supporting text, and abstains when the evidence is insufficient.
- Tool-calling (Hermes / OpenAI function calling). Decides when a tool is needed, asks
for missing required arguments, emits valid calls, and grounds the final answer in the
tool result. Competitive with the Qwen3-4B base on common call patterns.
- Parallel tool-calling — multiple independent calls in a single turn.
- Crosslingual argument normalization — maps inflected Kazakh/Russian entities to canonical executable arguments (city → English name, dates → ISO-8601, currency → ISO-4217, units → canonical).
- Error recovery — retries repairable failures and reports non-repairable ones (not-found / permission-denied / empty) instead of inventing success.
- Clean outputs — no hidden chain-of-thought; final answers and tool calls only, suitable for production serving.
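As an illustration of the crosslingual argument normalization listed above, the sketch below checks on the client side that tool arguments arrive in canonical form before a tool runs. The helper name `check_normalized_args` and the argument keys `date` and `currency` are hypothetical, chosen only for this example. The currency check verifies just the three-uppercase-letter shape of an ISO-4217 code, not that the code is actually assigned.

```python
import re
from datetime import date


def check_normalized_args(args: dict) -> list[str]:
    """Return a list of problems found in hypothetical 'date'/'currency' arguments."""
    problems = []
    if "date" in args:
        try:
            # Accepts ISO-8601 calendar dates such as 2025-05-01
            date.fromisoformat(args["date"])
        except (TypeError, ValueError):
            problems.append(f"date is not ISO-8601: {args['date']!r}")
    if "currency" in args:
        # Shape check only: three uppercase ASCII letters, e.g. KZT, RUB, USD
        if not re.fullmatch(r"[A-Z]{3}", str(args["currency"])):
            problems.append(f"currency is not an ISO-4217-shaped code: {args['currency']!r}")
    return problems


normalized = check_normalized_args({"date": "2025-05-01", "currency": "KZT"})
raw = check_normalized_args({"date": "1 мамыр", "currency": "теңге"})
print(normalized, raw, sep="\n")
```

A non-empty result can be sent back to the model as a tool error so it has a chance to repair the call.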
How to use
Serve with vLLM (OpenAI-compatible, Hermes tool parser)
# Run from a directory that contains the repo's chat_template.jinja, or pass its full path
vllm serve nur-dev/farabi-4b \
  --chat-template chat_template.jinja \
  --enable-auto-tool-choice --tool-call-parser hermes
Call it with the OpenAI SDK (and the OpenAI Agents SDK)
from openai import OpenAI

# Any placeholder key works unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    # Kazakh: "What is the weather like in Almaty today?"
    messages=[{"role": "user", "content": "Бүгін Алматыда ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "Canonical English city name."}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
# tool_calls is None when the model answers directly instead of calling a tool
print(resp.choices[0].message.tool_calls)
The OpenAI Agents SDK works the same way via
openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x").
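The OpenAI SDK example above stops at printing the tool calls. A minimal sketch of the next step follows: run each call locally and build the `role: "tool"` messages that go back to the model. The helper `run_tool_calls`, the `registry` mapping, and the stubbed `get_weather` implementation are hypothetical names for this example. Failures are reported in the tool message rather than hidden, so the model can retry or report them.

```python
import json
from types import SimpleNamespace


def run_tool_calls(tool_calls, registry):
    """Execute each tool call and return OpenAI-style tool messages."""
    messages = []
    for call in tool_calls or []:  # tool_calls may be None
        name = call.function.name
        if name not in registry:
            payload = {"ok": False, "error": f"unknown tool: {name}"}
        else:
            try:
                # The SDK delivers arguments as a JSON string
                args = json.loads(call.function.arguments)
                payload = {"ok": True, "result": registry[name](**args)}
            except Exception as exc:  # surface the failure to the model
                payload = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(payload, ensure_ascii=False),
        })
    return messages


# Stand-in for a real weather lookup (hypothetical stub, no real data)
def get_weather(city: str) -> dict:
    return {"city": city, "status": "stubbed"}


# Stand-in for an item of resp.choices[0].message.tool_calls
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Almaty"}'),
)
print(run_tool_calls([fake_call], {"get_weather": get_weather}))
```

To continue the conversation, append `resp.choices[0].message` and then these tool messages to the message list and call `client.chat.completions.create` again.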
Quick chat with Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained(
    "nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")

# Kazakh: "Which city is the capital of Kazakhstan?"
msgs = [{"role": "user", "content": "Қазақстанның астанасы қай қала?"}]
# return_dict=True yields input_ids plus attention_mask for generate()
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the prompt
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
The canonical chat_template.jinja ships in this repo. Use it when serving so tool calls render in the Hermes format the parser expects.
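When generating with Transformers directly there is no server-side parser, so tool calls have to be extracted from the decoded text. The sketch below assumes the Hermes convention of a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags, and assumes those tags are present in the decoded string. Both should be checked against chat_template.jinja and the tokenizer settings. `parse_hermes_tool_calls` is a hypothetical helper name.

```python
import json
import re

# Non-greedy match of everything between one pair of tags, across newlines
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls from generated text; skip malformed ones."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, ignore this block
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


sample = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Almaty"}}\n</tool_call>'
print(parse_hermes_tool_calls(sample))
```

Silently skipping malformed blocks keeps the sketch short; a production parser would more likely log them or return them as errors.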
Benchmarks
Kazakh knowledge — ISSAI QOLDA suite (n=250/benchmark, accuracy %)
Against the Qwen3-4B base (same size) and the larger ISSAI Sherkala-8B-Chat.
| Benchmark | Farabi-4B | Qwen3-4B (base) | Sherkala-8B |
|---|---|---|---|
| ARC-kk | 69.6 | 68.4 | 74.8 |
| MMLU-kk | 50.8 | 42.8 | 47.6 |
| MMLU-Pro-kk | 28.8 | 25.6 | 20.4 |
| GPQA-kk | 36.0 | 30.8 | 30.0 |
| mean | 46.3 | 41.9 | 43.2 |
Farabi-4B improves Kazakh knowledge over its own base (+4.4 mean) and beats the 8B Sherkala on 3 of 4 benchmarks and on the mean, at roughly half Sherkala's size — only ARC-kk still trails Sherkala.
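The summary figures above can be rechecked from the table rows. This short sketch recomputes the unweighted means, the gain over the base, and the number of benchmarks on which Farabi-4B is ahead of Sherkala. The scores are copied from the table; the variable names are only for this example.

```python
# Rows in table order: ARC-kk, MMLU-kk, MMLU-Pro-kk, GPQA-kk
scores = {
    "Farabi-4B": [69.6, 50.8, 28.8, 36.0],
    "Qwen3-4B (base)": [68.4, 42.8, 25.6, 30.8],
    "Sherkala-8B": [74.8, 47.6, 20.4, 30.0],
}

# Unweighted mean per model, rounded to one decimal as in the table
means = {model: round(sum(vals) / len(vals), 1) for model, vals in scores.items()}

# Mean gain of Farabi-4B over its base model
gain = round(means["Farabi-4B"] - means["Qwen3-4B (base)"], 1)

# Number of benchmarks where Farabi-4B scores above Sherkala-8B
wins = sum(f > s for f, s in zip(scores["Farabi-4B"], scores["Sherkala-8B"]))

print(means, gain, wins)
```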
Kazakh academic (additional, accuracy %)
| Belebele-kk | KazMMLU-kk | TUMLU-kk |
|---|---|---|
| 69.5 | 36.1 | 36.5 |
Russian & math (accuracy %)
| Benchmark | Farabi-4B | Qwen3-4B (base) |
|---|---|---|
| ARC-ru | 92.4 | 92.0 |
| MMLU-Pro-ru | 42.4 | 35.2 |
| GPQA-ru | 32.4 | 31.6 |
| GSM8K-ru | 84.0 | 91.6 |
| GSM8K-kk | 68.4 | 68.4 |
Farabi-4B leads on the Russian multiple-choice knowledge benchmarks (notably MMLU-Pro-ru, +7.2 points); the base remains stronger on Russian grade-school math (GSM8K-ru, 91.6 vs 84.0), and the two models tie on GSM8K-kk.
Function calling — BFCL (Berkeley Function Calling Leaderboard, V4 non-live, accuracy %)
| Category | Farabi-4B | Qwen3-4B (base) |
|---|---|---|
| Simple AST | 77.6 | 75.8 |
| • Python | 95.8 | 96.3 |
| • Java | 65.0 | 61.0 |
| • JavaScript | 72.0 | 70.0 |
| Multiple | 95.5 | 96.5 |
| Parallel | 88.5 | 91.5 |
| Parallel-Multiple | 64.0 | 87.5 |
| Irrelevance | 47.9 | 82.1 |
| Non-live overall | 81.4 | 87.8 |
Farabi-4B stays within about three points of the Qwen3-4B base on the common call patterns: simple (77.6 vs 75.8), multiple (95.5 vs 96.5), and parallel (88.5 vs 91.5) calls, and it is slightly ahead on Java and JavaScript. The base stays well ahead on the compositional parallel-multiple case (87.5 vs 64.0) and on irrelevance detection (82.1 vs 47.9), and its overall non-live score is higher (87.8 vs 81.4). In short: Farabi-4B largely keeps the base's everyday function-calling while adding the Kazakh/Russian knowledge and grounded-RAG behavior above.
Known limitation. The model's relative weak point is abstention / irrelevance detection — when no tool or evidence is appropriate, it tends to act (answer or call a tool) rather than decline (BFCL irrelevance 47.9%). For high-stakes or credential-bearing contexts, pair it with explicit guardrails or an output filter.
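One simple form of the guardrail suggested above is to validate every proposed call against the tool list sent with the request before executing anything. The sketch below is a minimal example, with `filter_tool_calls` as a hypothetical helper name. It rejects calls to undeclared tools and calls missing required arguments. It cannot tell whether a well-formed call was actually warranted, so it narrows the irrelevance problem rather than solving it.

```python
import json


def filter_tool_calls(calls, tools):
    """Split proposed calls into (accepted, rejected) using the declared tool schemas."""
    schemas = {t["function"]["name"]: t["function"].get("parameters", {}) for t in tools}
    accepted, rejected = [], []
    for call in calls:
        name, args = call.get("name"), call.get("arguments")
        if isinstance(args, str):  # arguments may arrive as a JSON string
            try:
                args = json.loads(args)
            except json.JSONDecodeError:
                args = None
        if name not in schemas:
            rejected.append((call, "tool not declared"))
        elif not isinstance(args, dict):
            rejected.append((call, "arguments are not a JSON object"))
        else:
            missing = [k for k in schemas[name].get("required", []) if k not in args]
            if missing:
                rejected.append((call, f"missing required arguments: {missing}"))
            else:
                accepted.append({"name": name, "arguments": args})
    return accepted, rejected


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    },
}]
proposed = [
    {"name": "get_weather", "arguments": '{"city": "Almaty"}'},
    {"name": "delete_account", "arguments": "{}"},  # never declared
    {"name": "get_weather", "arguments": {}},       # required "city" missing
]
accepted, rejected = filter_tool_calls(proposed, tools)
print(accepted)
print([reason for _, reason in rejected])
```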
Serving compatibility
Works with vLLM's OpenAI-compatible server using the Hermes tool-call parser
(--enable-auto-tool-choice --tool-call-parser hermes) and with the OpenAI Agents SDK
via openai.AsyncOpenAI(base_url=..., api_key="x").
Languages
Kazakh (kk), Russian (ru), English (en).
License
CC BY-NC 4.0 — non-commercial use only. Released for research, education, and evaluation; commercial use is not permitted. Built on Qwen3-4B (Apache-2.0); the base-model components remain under their original Apache-2.0 terms.