Instructions to use nur-dev/farabi-4b with libraries, inference providers, notebooks, and local apps.

Libraries

How to use nur-dev/farabi-4b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nur-dev/farabi-4b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-4b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
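
The same request can also be sent from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `send` are hypothetical, and the sketch assumes the vLLM server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: the default port used by the curl call above


def build_chat_request(prompt, model="nur-dev/farabi-4b", base_url=BASE_URL):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage, once the server is up:
#   print(send(build_chat_request("What is the capital of France?")))
```

Building the request separately from sending it keeps the payload easy to inspect before any network call is made.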

Use Docker

docker model run hf.co/nur-dev/farabi-4b

SGLang

How to use nur-dev/farabi-4b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-4b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-4b" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nur-dev/farabi-4b with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-4b
```



---
license: apache-2.0
language:
- kk
- ru
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- kazakh
- russian
- rag
- tool-calling
- agent
- qwen3
---

# Farabi-4B

A 4B-parameter instruction model for Kazakh, Russian, and English, focused on
grounded RAG (answer from provided passages, cite, and abstain when evidence is
insufficient) and Hermes-style tool calling / agentic use. It uses the Qwen3-4B architecture.

- Languages: Kazakh (kk), Russian (ru), English (en)
- Context length: 8192 tokens
- Precision: bf16
- Tool-call format: Hermes (vLLM `--tool-call-parser hermes`)
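
Because the prompt and the generated reply share the 8192-token window, it can help to budget tokens before sending a request. The following is a minimal sketch; the helper names `max_prompt_tokens` and `fits` are hypothetical, and the token counts are assumed to come from the model's own tokenizer.

```python
CONTEXT_LENGTH = 8192  # context length stated in the list above


def max_prompt_tokens(max_new_tokens, context_length=CONTEXT_LENGTH):
    """Return how many prompt tokens fit when `max_new_tokens` are reserved for the reply."""
    if not 0 < max_new_tokens < context_length:
        raise ValueError("max_new_tokens must be between 1 and context_length - 1")
    return context_length - max_new_tokens


def fits(prompt_token_count, max_new_tokens, context_length=CONTEXT_LENGTH):
    """Check whether a prompt of the given length leaves room for the reply."""
    return prompt_token_count <= max_prompt_tokens(max_new_tokens, context_length)


print(max_prompt_tokens(512))  # 7680 tokens remain for the prompt
```

For RAG use, this budget bounds how many retrieved passages can be placed in a single prompt.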

## Serving

### vLLM (recommended; enables tool calling)

```bash
# --chat-template takes a path to a local Jinja template file.
vllm serve nur-dev/farabi-4b \
    --dtype bfloat16 --max-model-len 8192 \
    --enable-auto-tool-choice --tool-call-parser hermes \
    --chat-template chat_template.jinja
```
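
With `--tool-call-parser hermes`, the server converts raw model text into structured tool calls. The sketch below illustrates what such parsing involves, assuming the common Hermes convention of a JSON object wrapped in `<tool_call>...</tool_call>` tags. It is an illustration only, not vLLM's implementation, and the sample string is made up.

```python
import json
import re

# Assumption: Hermes-style output wraps each call in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Extract {"name", "arguments"} dicts from Hermes-style tool-call blocks."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


# Hypothetical model output, for illustration only.
sample = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Astana"}}\n</tool_call>'
print(parse_hermes_tool_calls(sample))
```

When the model is served through vLLM as shown above, this step happens server-side and clients receive the parsed calls directly.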

### OpenAI-compatible client / Agents SDK

```python
from openai import OpenAI

# Placeholder key; a vLLM server started without --api-key does not validate it.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    # Kazakh: "What is the weather like in Astana today?"
    messages=[{"role": "user", "content": "Астанада бүгін ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
# If the model decides to call the tool, the call appears in message.tool_calls.
print(resp.choices[0].message)
```
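
Once the model returns a tool call, the client is responsible for executing it and sending the result back. The following is a minimal sketch of that step, working on a plain dict shaped like the assistant message in an OpenAI-style chat response. The `get_weather` stub, the `TOOLS` registry, and `run_tool_calls` are hypothetical names, and the example message is hand-written rather than real model output.

```python
import json


def get_weather(city):
    """Hypothetical local implementation of the `get_weather` tool (stub data)."""
    return {"city": city, "forecast": "unknown"}


TOOLS = {"get_weather": get_weather}  # maps tool names to local callables


def run_tool_calls(message):
    """Execute each tool call in an assistant message dict and return `tool` role messages."""
    results = []
    for call in message.get("tool_calls") or []:
        fn = TOOLS.get(call["function"]["name"])
        args = json.loads(call["function"]["arguments"] or "{}")
        output = fn(**args) if fn else {"error": "unknown tool"}
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(output, ensure_ascii=False),
        })
    return results


# Hand-written example of an assistant message carrying one tool call.
example_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Астана\"}"},
    }],
}
tool_messages = run_tool_calls(example_message)
print(tool_messages)
```

To finish the turn, append the assistant message and the returned `tool` messages to `messages` and call `client.chat.completions.create` again so the model can write the final answer.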

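### transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
# Load in bf16 and place the weights on the available device(s).
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")

# Kazakh: "What is a satellite?"
msgs = [{"role": "user", "content": "Спутник деген не?"}]
# return_dict=True yields input_ids and attention_mask, which generate() takes as keyword arguments.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```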

## Benchmarks

Evaluated on public Kazakh/Russian benchmarks against Sherkala-8B-Chat
(`inceptionai/Llama-3.1-Sherkala-8B-Chat`, an 8B Kazakh chat model), with both models run
through the identical harness. Kazakh reasoning uses the ISSAI QOLDA suite (n=250); knowledge is
measured with standard multiple-choice sets.

Summary: Farabi-4B is a tool-calling model. It scores 78.3% on BFCL v4
(Berkeley Function-Calling Leaderboard), while Sherkala-8B-Chat has no function-calling
interface and cannot be evaluated on it. Despite being half the size, Farabi-4B also
leads on aggregate Kazakh reasoning (light-kk mean 46.9 vs 43.2) and **on every
Russian-language benchmark** (by 1.0 to 20.0 points across all tables, and by 4.8 to 20.0
points on the Russian reasoning set). Sherkala-8B, trained on substantially more
native-Kazakh text, leads on native Kazakh knowledge MC (KazMMLU-kk, TUMLU-kk) and on RAG
free-generation (chrF). It also edges ahead on ARC-kk, GSM8K-kk, and Belebele-en.

### Tool / function calling: BFCL v4 (Berkeley Function-Calling Leaderboard, AST, %)

| Category | Farabi-4B | Sherkala-8B |
|---|---|---|
| Simple | 92.5 | unsupported |
| Multiple | 91.0 | unsupported |
| Parallel | 87.0 | unsupported |
| Irrelevance | 36.7 | unsupported |
| Overall | 78.3 | unsupported |

> unsupported = Sherkala-8B-Chat's chat template has no `tools` / tool-call mechanism;
> it emits zero function calls on every BFCL category, so function calling cannot be
> evaluated. Farabi-4B is served with vLLM `--tool-call-parser hermes`.
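
The Overall row is not the plain average of the four category scores, which would be 76.8. One reading that reproduces 78.3 is a sample-weighted average. The sketch below checks this under assumed per-category test counts of 400, 200, 200, and 240; these counts are an assumption for illustration and are not stated in this model card.

```python
# Category scores for Farabi-4B, copied from the table above.
scores = {"Simple": 92.5, "Multiple": 91.0, "Parallel": 87.0, "Irrelevance": 36.7}

# ASSUMPTION: number of test cases per category; not given in the model card.
assumed_counts = {"Simple": 400, "Multiple": 200, "Parallel": 200, "Irrelevance": 240}

unweighted = sum(scores.values()) / len(scores)
weighted = sum(scores[c] * assumed_counts[c] for c in scores) / sum(assumed_counts.values())

print(f"unweighted mean: {unweighted:.1f}")
print(f"sample-weighted mean: {weighted:.1f}")
```

If the actual evaluation used different category sizes or a different aggregation, the weighted figure would change accordingly.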

### Kazakh reasoning: ISSAI QOLDA (accuracy, %; chrF for RAGBench-kk)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| light-kk mean | 46.9 | 43.2 |
| MMLU-kk | 50.0 | 47.2 |
| MMLU-Pro-kk | 30.0 | 20.8 |
| GPQA-kk | 34.4 | 30.0 |
| PolyMath-kk | 26.0 | 21.6 |
| ARC-kk | 73.2 | 74.8 |
| GSM8K-kk | 66.4 | 68.8 |
| RAGBench-kk (chrF) | 30.6 | 41.9 |

### Russian reasoning: ISSAI QOLDA (accuracy, %)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| ARC-ru | 92.8 | 78.4 |
| MMLU-Pro-ru | 42.8 | 22.8 |
| GPQA-ru | 32.4 | 25.2 |
| GSM8K-ru | 84.4 | 79.6 |

### Standard multiple-choice (accuracy, %)

| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| Belebele-kk | 70.5 | 69.0 |
| Belebele-ru | 80.5 | 79.5 |
| Belebele-en | 90.5 | 94.5 |
| KazMMLU-kk | 35.3 | 40.2 |
| KazMMLU-ru | 39.9 | 36.6 |
| TUMLU-kk | 30.5 | 37.5 |
| TruthfulQA-mc2 | 51.4 | 50.6 |
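
The per-benchmark margins on the Russian-language rows can be recomputed directly from the tables above. This short sketch copies those six rows and prints the difference between the two models; only the variable names are introduced here.

```python
# (Farabi-4B, Sherkala-8B) scores for every Russian-language row in the tables above.
russian = {
    "ARC-ru": (92.8, 78.4),
    "MMLU-Pro-ru": (42.8, 22.8),
    "GPQA-ru": (32.4, 25.2),
    "GSM8K-ru": (84.4, 79.6),
    "Belebele-ru": (80.5, 79.5),
    "KazMMLU-ru": (39.9, 36.6),
}

# Positive values mean Farabi-4B is ahead.
margins = {name: round(farabi - sherkala, 1) for name, (farabi, sherkala) in russian.items()}

for name, delta in sorted(margins.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {delta:+.1f}")
```

The margins range from 1.0 points (Belebele-ru) to 20.0 points (MMLU-Pro-ru), all in favor of Farabi-4B.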

## Intended use

Grounded question answering over retrieved passages (RAG), tool-augmented assistants /
agents (Hermes tool calls), and Kazakh/Russian/English chat. For grounded RAG the model
is trained to answer only from provided evidence and to abstain when evidence is
insufficient.
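
For the grounded RAG use case, retrieved passages have to be placed in the prompt in a form the model can cite. The following is a minimal sketch of one way to do that. The function name `build_rag_messages`, the numbered `[n]` layout, and the instruction wording are all assumptions, since this model card does not specify the prompt format used in training.

```python
def build_rag_messages(question, passages):
    """Format retrieved passages as a numbered context block followed by the question."""
    if not passages:
        raise ValueError("at least one passage is required for grounded answering")
    # Number the passages so the answer can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1))
    # ASSUMPTION: illustrative instruction wording, not the model's training prompt.
    instruction = (
        "Answer using only the passages below and cite them as [n]. "
        "If they do not contain the answer, say that the evidence is insufficient."
    )
    return [
        {"role": "system", "content": instruction},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ]


messages = build_rag_messages(
    "When was the university founded?",
    ["The university was founded in 1934.", "It is located in Almaty."],
)
print(messages[1]["content"])
```

The resulting `messages` list can be passed to the chat completions endpoint or to `apply_chat_template` in the same way as the examples above.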

## License

Apache-2.0.