Instructions for using nur-dev/farabi-4b with libraries, inference servers, and local apps.

Libraries

How to use nur-dev/farabi-4b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Render the chat template and tokenize in one step
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Local Apps

vLLM

How to use nur-dev/farabi-4b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-4b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use nur-dev/farabi-4b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-4b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-4b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nur-dev/farabi-4b with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-4b
```

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Farabi-4B

A 4B-parameter Kazakh / Russian / English assistant built on Qwen3-4B and strengthened for Kazakh and Russian knowledge and for grounded RAG and agentic tool use, while retaining the base model's function-calling ability on common call patterns. It drops into agent stacks that expect OpenAI-style function calling and emits tool calls in the Hermes format.

Capabilities

- Stronger Kazakh & Russian knowledge. Improves over the Qwen3-4B base and outscores the roughly twice-as-large ISSAI Sherkala-8B on three of the four Kazakh knowledge benchmarks and on their mean (see below).
- Grounded RAG. Answers from provided passages, attributes claims to the supporting text, and abstains when the evidence is insufficient.
- Tool-calling (Hermes / OpenAI function calling). Decides when a tool is needed, asks for missing required arguments, emits valid calls, and grounds the final answer in the tool result. Competitive with the Qwen3-4B base on common call patterns.
  - Parallel tool-calling: multiple independent calls in a single turn.
  - Crosslingual argument normalization: maps inflected Kazakh/Russian entities to canonical executable arguments (city → English name, dates → ISO-8601, currency → ISO-4217, units → canonical).
  - Error recovery: retries repairable failures and reports non-repairable ones (not-found / permission-denied / empty) instead of inventing success.
- Clean outputs. No hidden chain-of-thought; final answers and tool calls only, suitable for production serving.
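
The argument normalization described above can be spot-checked on the client before a call is executed. The following is a minimal sketch, not part of this repository: the function name `check_canonical` and the argument keys `city`, `date`, and `currency` are hypothetical, and the currency check only tests the three-letter shape of an ISO-4217 code rather than looking it up in a registry.

```python
import re
from datetime import date

# Shape of an ISO-4217 code (three uppercase letters); not a registry lookup.
CURRENCY_RE = re.compile(r"^[A-Z]{3}$")

def check_canonical(args: dict) -> list[str]:
    """Return a list of problems found in a tool call's arguments.

    The keys checked here ("city", "date", "currency") are illustrative only.
    """
    problems = []
    if "city" in args and not str(args["city"]).isascii():
        problems.append(f"city is not an ASCII (English) name: {args['city']!r}")
    if "date" in args:
        try:
            date.fromisoformat(args["date"])  # accepts YYYY-MM-DD
        except (TypeError, ValueError):
            problems.append(f"date is not ISO-8601: {args['date']!r}")
    if "currency" in args:
        code = args["currency"]
        if not (isinstance(code, str) and CURRENCY_RE.match(code)):
            problems.append(f"currency is not a 3-letter code: {code!r}")
    return problems

# Canonical arguments pass; raw Kazakh surface forms are flagged.
print(check_canonical({"city": "Almaty", "date": "2025-03-01", "currency": "KZT"}))
print(check_canonical({"city": "Алматы", "date": "1 наурыз", "currency": "теңге"}))
```

A check like this does not replace the model's own normalization; it only catches calls that slipped through in non-canonical form.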

How to use

Serve with vLLM (OpenAI-compatible, Hermes tool parser)

vllm serve nur-dev/farabi-4b \
  --chat-template chat_template.jinja \
  --enable-auto-tool-choice --tool-call-parser hermes

Call it with the OpenAI SDK (and the OpenAI Agents SDK)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    messages=[{"role": "user", "content": "Бүгін Алматыда ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string", "description": "Canonical English city name."}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
)
print(resp.choices[0].message.tool_calls)

The OpenAI Agents SDK works the same way via openai.AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x").
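
After the model returns tool calls, the client runs them and sends the results back as `role: "tool"` messages so the model can ground its final answer. The sketch below shows that step under some assumptions: `get_weather` is a stub rather than a real lookup, `TOOLS` and `tool_messages` are hypothetical names, and the calls are passed as plain dicts in the OpenAI chat-completions shape.

```python
import json

def get_weather(city: str) -> dict:
    # Stub standing in for a real weather lookup (hypothetical).
    return {"city": city, "temp_c": 21}

# Registry mapping tool names to local Python callables.
TOOLS = {"get_weather": get_weather}

def tool_messages(tool_calls: list[dict]) -> list[dict]:
    """Run each requested tool and build the `role: tool` replies."""
    replies = []
    for call in tool_calls:
        name = call["function"]["name"]
        try:
            fn = TOOLS[name]
            # Arguments arrive as a JSON-encoded string.
            result = fn(**json.loads(call["function"]["arguments"]))
        except Exception as exc:
            # Report the failure to the model instead of hiding it.
            result = {"error": f"{type(exc).__name__}: {exc}"}
        replies.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result, ensure_ascii=False),
        })
    return replies

example_calls = [{
    "id": "call_1",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Almaty"}'},
}]
print(tool_messages(example_calls))
```

With the OpenAI SDK, the dicts can likely be obtained with `[c.model_dump() for c in resp.choices[0].message.tool_calls]`; append the assistant message and these replies to `messages` and call `client.chat.completions.create` again for the final answer.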

Quick chat with Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained(
    "nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")

# Prompt: "Which city is the capital of Kazakhstan?"
msgs = [{"role": "user", "content": "Қазақстанның астанасы қай қала?"}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
# Decode only the tokens generated after the prompt
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The canonical chat_template.jinja ships in this repo. Use it when serving so tool calls render in the Hermes format the parser expects.
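
When generating with Transformers directly, there is no server-side parser, so tool calls arrive as raw text. The sketch below extracts them, assuming the model wraps each call as a JSON object inside `<tool_call>...</tool_call>` tags, which is the usual Hermes convention; the function name `parse_hermes_tool_calls` is hypothetical, and the exact tag layout should be confirmed against `chat_template.jinja`.

```python
import json
import re

# One JSON object per <tool_call>...</tool_call> block (assumed Hermes layout).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract {"name", "arguments"} dicts from raw model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than guessing
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls

sample = (
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Almaty"}}\n</tool_call>\n'
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Astana"}}\n</tool_call>'
)
print(parse_hermes_tool_calls(sample))
```

When serving through vLLM with `--tool-call-parser hermes`, this step is handled by the server and the parsed calls appear in `message.tool_calls`.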

Benchmarks

Kazakh knowledge — ISSAI QOLDA suite (n=250/benchmark, accuracy %)

Compared against the Qwen3-4B base (same size) and the larger ISSAI Sherkala-8B-Chat.

Benchmark	Farabi-4B	Qwen3-4B (base)	Sherkala-8B
ARC-kk	69.6	68.4	74.8
MMLU-kk	50.8	42.8	47.6
MMLU-Pro-kk	28.8	25.6	20.4
GPQA-kk	36.0	30.8	30.0
mean	46.3	41.9	43.2

Farabi-4B improves Kazakh knowledge over its own base (+4.4 mean) and beats the 8B Sherkala on 3 of 4 benchmarks and on the mean, at roughly half Sherkala's size. It trails Sherkala only on ARC-kk (69.6 vs 74.8).
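
The summary figures can be re-derived from the table. This short sketch copies the scores from the rows above and assumes the "mean" row is an unweighted average of the four benchmarks, which is what the numbers are consistent with.

```python
# Scores copied from the QOLDA table above (accuracy %).
scores = {
    "Farabi-4B":   {"ARC-kk": 69.6, "MMLU-kk": 50.8, "MMLU-Pro-kk": 28.8, "GPQA-kk": 36.0},
    "Qwen3-4B":    {"ARC-kk": 68.4, "MMLU-kk": 42.8, "MMLU-Pro-kk": 25.6, "GPQA-kk": 30.8},
    "Sherkala-8B": {"ARC-kk": 74.8, "MMLU-kk": 47.6, "MMLU-Pro-kk": 20.4, "GPQA-kk": 30.0},
}

# Unweighted mean per model, rounded to one decimal as in the table.
means = {m: round(sum(s.values()) / len(s), 1) for m, s in scores.items()}

# Benchmarks where Farabi-4B scores above Sherkala-8B.
wins = [b for b, v in scores["Farabi-4B"].items() if v > scores["Sherkala-8B"][b]]

# Gain of Farabi-4B over its base on the mean.
gain = round(means["Farabi-4B"] - means["Qwen3-4B"], 1)

print(means, wins, gain)
```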

Kazakh academic (additional, accuracy %)

Belebele-kk	KazMMLU-kk	TUMLU-kk
69.5	36.1	36.5

Russian & math (accuracy %)

Benchmark	Farabi-4B	Qwen3-4B (base)
ARC-ru	92.4	92.0
MMLU-Pro-ru	42.4	35.2
GPQA-ru	32.4	31.6
GSM8K-ru	84.0	91.6
GSM8K-kk	68.4	68.4

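Farabi-4B is ahead on all three Russian multiple-choice knowledge benchmarks, most clearly on MMLU-Pro-ru (+7.2); ARC-ru and GPQA-ru are within one point of the base. The base remains stronger on Russian grade-school math (GSM8K-ru, 91.6 vs 84.0), and the two models tie on GSM8K-kk.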

Function calling — BFCL (Berkeley Function Calling Leaderboard, V4 non-live, accuracy %)

Category	Farabi-4B	Qwen3-4B (base)
Simple AST	77.6	75.8
• Python	95.8	96.3
• Java	65.0	61.0
• JavaScript	72.0	70.0
Multiple	95.5	96.5
Parallel	88.5	91.5
Parallel-Multiple	64.0	87.5
Irrelevance	47.9	82.1
Non-live overall	81.4	87.8

Farabi-4B stays close to the Qwen3-4B base on the common call patterns: it is ahead on Simple AST (77.6 vs 75.8, driven by Java and JavaScript), one point behind on multiple calls (95.5 vs 96.5), and three points behind on parallel calls (88.5 vs 91.5). The base is well ahead on the compositional parallel-multiple case (87.5 vs 64.0) and on irrelevance detection (82.1 vs 47.9), and its overall non-live score is higher (87.8 vs 81.4). In short: Farabi-4B keeps most of the base's everyday function-calling while adding the Kazakh/Russian knowledge and grounded-RAG behavior above.
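
The aggregate rows in the BFCL table can be reproduced from the per-category rows. The sketch below is an observation about the reported numbers, not a statement of how BFCL defines its scores: Simple AST matches the unweighted mean of the three language rows, and the non-live overall matches the unweighted mean of Simple AST, Multiple, Parallel, and Parallel-Multiple, with Irrelevance left out.

```python
# Per-category scores copied from the BFCL table above (accuracy %).
bfcl = {
    "Farabi-4B": {"python": 95.8, "java": 65.0, "javascript": 72.0,
                  "multiple": 95.5, "parallel": 88.5, "parallel_multiple": 64.0},
    "Qwen3-4B":  {"python": 96.3, "java": 61.0, "javascript": 70.0,
                  "multiple": 96.5, "parallel": 91.5, "parallel_multiple": 87.5},
}

def aggregates(row: dict) -> tuple[float, float]:
    """Return (simple_ast, non_live_overall), each rounded to one decimal."""
    simple = (row["python"] + row["java"] + row["javascript"]) / 3
    overall = (simple + row["multiple"] + row["parallel"] + row["parallel_multiple"]) / 4
    return round(simple, 1), round(overall, 1)

for model, row in bfcl.items():
    print(model, aggregates(row))
```

Read this way, the gap in the overall score comes almost entirely from the parallel-multiple category.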

Known limitation. The model's relative weak point is abstention / irrelevance detection — when no tool or evidence is appropriate, it tends to act (answer or call a tool) rather than decline (BFCL irrelevance 47.9%). For high-stakes or credential-bearing contexts, pair it with explicit guardrails or an output filter.
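
One simple form of output filter is an allow-list applied to tool calls before they are executed. The following is a minimal sketch under stated assumptions: `filter_tool_calls` and the `allowed` mapping are hypothetical, and calls are plain dicts with `name` and `arguments` keys. Real deployments will usually need stricter schema validation.

```python
def filter_tool_calls(calls: list[dict], allowed: dict[str, list[str]]):
    """Split calls into (accepted, rejected).

    `allowed` maps each permitted tool name to its required argument names.
    Rejected entries are (call, reason) pairs for logging or a fallback reply.
    """
    accepted, rejected = [], []
    for call in calls:
        required = allowed.get(call.get("name"))
        args = call.get("arguments") or {}
        if required is None:
            rejected.append((call, "tool not allowed in this context"))
            continue
        missing = [key for key in required if key not in args]
        if missing:
            rejected.append((call, f"missing required arguments: {missing}"))
        else:
            accepted.append(call)
    return accepted, rejected

allowed = {"get_weather": ["city"]}
calls = [
    {"name": "get_weather", "arguments": {"city": "Almaty"}},
    {"name": "transfer_funds", "arguments": {"amount": 100}},
    {"name": "get_weather", "arguments": {}},
]
accepted, rejected = filter_tool_calls(calls, allowed)
print(accepted)
print([reason for _, reason in rejected])
```

Passing only the tools relevant to the current turn in `tools`, and rejecting anything outside that set, limits the effect of a spurious call when the model should have declined.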

Serving compatibility

Works with vLLM's OpenAI-compatible server using the Hermes tool-call parser (--enable-auto-tool-choice --tool-call-parser hermes) and with the OpenAI Agents SDK via openai.AsyncOpenAI(base_url=..., api_key="x").

Languages

Kazakh (kk), Russian (ru), English (en).

License

CC BY-NC 4.0 — non-commercial use only. Released for research, education, and evaluation; commercial use is not permitted. Built on Qwen3-4B (Apache-2.0); the base-model components remain under their original Apache-2.0 terms.

Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for nur-dev/farabi-4b

Base model: Qwen/Qwen3-4B-Base
Finetuned: Qwen/Qwen3-4B
Finetuned: this model (one of 727 listed finetunes of Qwen/Qwen3-4B)