Instructions for using nur-dev/farabi-4b with libraries and local apps.

Libraries

How to use nur-dev/farabi-4b with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nur-dev/farabi-4b with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-4b"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nur-dev/farabi-4b

SGLang

How to use nur-dev/farabi-4b with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-4b" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-4b",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-4b" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use nur-dev/farabi-4b with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-4b
```



metadata

license: apache-2.0
language:
  - kk
  - ru
  - en
pipeline_tag: text-generation
library_name: transformers
tags:
  - kazakh
  - russian
  - rag
  - tool-calling
  - agent
  - qwen3

Farabi-4B

A 4B-parameter instruction-following model for Kazakh, Russian, and English, built on the Qwen3-4B architecture. It focuses on grounded RAG (answering from provided passages, citing them, and abstaining when evidence is insufficient) and on Hermes-style tool calling for agentic use.

Languages: Kazakh (kk), Russian (ru), English (en)
Context length: 8192 tokens
Precision: bf16
Tool-call format: Hermes (vLLM --tool-call-parser hermes)
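
When the model is run without vLLM's parser (for example directly through transformers), tool calls arrive as raw text and have to be extracted by hand. The sketch below assumes the conventional Hermes layout, a JSON object wrapped in `<tool_call>` tags; this card does not show the raw format, so treat the tag name and the `name` / `arguments` keys as assumptions to check against real output. `parse_hermes_tool_calls` is a hypothetical helper, not part of any library.

```python
import json
import re

# Assumed wire format (conventional Hermes style, not shown in this card):
#   <tool_call>{"name": "...", "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1).strip())
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of raising
        if isinstance(payload, dict) and "name" in payload:
            calls.append({
                "name": payload["name"],
                "arguments": payload.get("arguments", {}),
            })
    return calls
```

Malformed blocks are skipped rather than raised so that one bad call does not discard the others in the same response.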

Serving

vLLM (recommended — enables tool calling)

vllm serve nur-dev/farabi-4b \
  --dtype bfloat16 --max-model-len 8192 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --chat-template chat_template.jinja
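
With `--max-model-len 8192`, the prompt and the generated tokens share one 8192-token budget. A minimal sketch of a client-side guard follows; `clamp_max_tokens` is a hypothetical helper, and how the server reacts to an oversized request may differ between vLLM versions.

```python
CONTEXT_LEN = 8192  # matches --max-model-len in the serve command


def clamp_max_tokens(prompt_tokens, requested, context_len=CONTEXT_LEN):
    """Limit the generation budget to what is left after the prompt."""
    remaining = context_len - prompt_tokens
    if remaining <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, context holds {context_len}"
        )
    return min(requested, remaining)
```

For example, a prompt of 8000 tokens leaves room for at most 192 generated tokens, whatever `max_tokens` the caller asked for.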

OpenAI-compatible client / Agents SDK

from openai import OpenAI

# Any placeholder key works unless the vLLM server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")

resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    messages=[{"role": "user", "content": "Астанада бүгін ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
print(resp.choices[0].message)
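
The response above only contains the model's request to call a tool; the client still has to run the tool and send the result back. The following is a minimal sketch of that step. `get_weather` here is a hypothetical stand-in that returns placeholder data, and `TOOL_REGISTRY` and `run_tool_calls` are illustrative names, not part of the OpenAI SDK.

```python
import json


def get_weather(city):
    # Hypothetical stand-in; a real implementation would query a weather service.
    return {"city": city, "forecast": "unknown"}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and return OpenAI-style "tool" messages."""
    messages = []
    for call in tool_calls or []:
        func = registry.get(call.function.name)
        if func is None:
            result = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string, not a dict
                args = json.loads(call.function.arguments)
                result = func(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                result = {"error": str(exc)}
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result, ensure_ascii=False),
        })
    return messages
```

To finish the turn, append `resp.choices[0].message` and the messages returned by `run_tool_calls(resp.choices[0].message.tool_calls)` to the conversation, then call `client.chat.completions.create` again with the same `tools` list.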

transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")
msgs = [{"role": "user", "content": "Спутник деген не?"}]
# return_dict=True also returns the attention mask alongside input_ids
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Benchmarks

Evaluated on public Kazakh/Russian benchmarks against Sherkala-8B-Chat (inceptionai/Llama-3.1-Sherkala-8B-Chat, an 8B Kazakh chat model), with both models run through the same evaluation harness. Kazakh and Russian reasoning are measured with the ISSAI QOLDA suite (n=250); knowledge is measured with standard multiple-choice sets.

Summary: Farabi-4B supports tool calling and scores 78.3% on BFCL v4 (Berkeley Function-Calling Leaderboard); Sherkala-8B-Chat has no function-calling interface and cannot be evaluated on it. Despite being half the size, Farabi-4B also leads on aggregate Kazakh reasoning (light-kk mean 46.9 vs 43.2) and on every Russian-language benchmark: by 4.8 to 20.0 points on the QOLDA reasoning sets, and by 1.0 and 3.3 points on Belebele-ru and KazMMLU-ru. Sherkala-8B, trained on substantially more native-Kazakh text, leads on native Kazakh knowledge multiple-choice (KazMMLU-kk, TUMLU-kk) and on RAG free-generation (RAGBench-kk, chrF); it is also ahead on ARC-kk, GSM8K-kk, and Belebele-en.

Tool / function calling — BFCL v4 (Berkeley Function-Calling Leaderboard, AST, %)

Category	Farabi-4B	Sherkala-8B
Simple	92.5	unsupported
Multiple	91.0	unsupported
Parallel	87.0	unsupported
Irrelevance	36.7	unsupported
Overall	78.3	unsupported

unsupported = Sherkala-8B-Chat's chat template has no tools / tool-call mechanism; it emits zero function calls on every BFCL category, so function calling cannot be evaluated. Farabi-4B is served with vLLM --tool-call-parser hermes.

Kazakh reasoning — ISSAI QOLDA (accuracy, %)

Benchmark	Farabi-4B	Sherkala-8B
light-kk mean	46.9	43.2
MMLU-kk	50.0	47.2
MMLU-Pro-kk	30.0	20.8
GPQA-kk	34.4	30.0
PolyMath-kk	26.0	21.6
ARC-kk	73.2	74.8
GSM8K-kk	66.4	68.8
RAGBench-kk (chrF)	30.6	41.9

Russian reasoning — ISSAI QOLDA (accuracy, %)

Benchmark	Farabi-4B	Sherkala-8B
ARC-ru	92.8	78.4
MMLU-Pro-ru	42.8	22.8
GPQA-ru	32.4	25.2
GSM8K-ru	84.4	79.6
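
The margins on the Russian reasoning sets can be recomputed directly from the table above. The snippet only restates the published scores as (Farabi-4B, Sherkala-8B) pairs and subtracts them; the dictionary name is illustrative.

```python
# (Farabi-4B, Sherkala-8B) accuracy in %, copied from the table above
russian_qolda = {
    "ARC-ru": (92.8, 78.4),
    "MMLU-Pro-ru": (42.8, 22.8),
    "GPQA-ru": (32.4, 25.2),
    "GSM8K-ru": (84.4, 79.6),
}

# Margin in percentage points, rounded to the table's one-decimal precision
margins = {
    name: round(farabi - sherkala, 1)
    for name, (farabi, sherkala) in russian_qolda.items()
}

for name, margin in margins.items():
    print(f"{name}: {margin:+.1f}")
# ARC-ru: +14.4
# MMLU-Pro-ru: +20.0
# GPQA-ru: +7.2
# GSM8K-ru: +4.8
```

The smallest margin is 4.8 points (GSM8K-ru) and the largest is 20.0 points (MMLU-Pro-ru).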

Standard multiple-choice (accuracy, %)

Benchmark	Farabi-4B	Sherkala-8B
Belebele-kk	70.5	69.0
Belebele-ru	80.5	79.5
Belebele-en	90.5	94.5
KazMMLU-kk	35.3	40.2
KazMMLU-ru	39.9	36.6
TUMLU-kk	30.5	37.5
TruthfulQA-mc2	51.4	50.6

Intended use

Grounded question answering over retrieved passages (RAG), tool-augmented assistants / agents (Hermes tool calls), and Kazakh/Russian/English chat. For grounded RAG the model is trained to answer only from provided evidence and to abstain when evidence is insufficient.
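
A grounded RAG request might be assembled as in the sketch below. The card does not document the prompt layout used in training, so the system instruction, the `[n]` passage numbering, and the `build_grounded_messages` helper are all assumptions for illustration; adapt them to whatever format works best in practice.

```python
def build_grounded_messages(question, passages):
    """Build chat messages that ask for an answer grounded in numbered passages."""
    # Number the passages so the answer can cite them as [1], [2], ...
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    # Assumed instruction wording; not taken from the model card
    system = (
        "Answer using only the passages below. Cite passages by their "
        "bracketed number. If the passages do not contain the answer, say so."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": f"Passages:\n{numbered}\n\nQuestion: {question}"},
    ]
```

The returned list can be passed as `messages` to the OpenAI-compatible client or to `apply_chat_template`.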

License

Apache-2.0.