license: apache-2.0
language:
- kk
- ru
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- kazakh
- russian
- rag
- tool-calling
- agent
- qwen3
Farabi-4B
A 4B-parameter instruction model for Kazakh, Russian, and English, focused on grounded RAG (answer from provided passages, cite, and abstain when evidence is insufficient) and Hermes-style tool calling / agentic use. Qwen3-4B architecture.
- Languages: Kazakh (kk), Russian (ru), English (en)
- Context length: 8192 tokens
- Precision: bf16
- Tool-call format: Hermes (vLLM --tool-call-parser hermes)
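When the model is run without a server-side parser (for example, directly through transformers), tool calls come back as plain text. The sketch below assumes the usual Hermes convention of a JSON object with "name" and "arguments" keys wrapped in <tool_call>...</tool_call> tags. The exact output of Farabi-4B is not confirmed here, and parse_tool_calls is a hypothetical helper, not part of any library.

```python
import json
import re

# Assumption: Hermes-style calls look like
#   <tool_call>{"name": "...", "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)


def parse_tool_calls(text: str) -> list[dict]:
    """Extract well-formed tool calls from raw generated text (hypothetical helper)."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of failing the whole turn
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"], "arguments": obj.get("arguments", {})})
    return calls


sample = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Astana"}}\n</tool_call>'
print(parse_tool_calls(sample))
```

When serving through vLLM with --tool-call-parser hermes, this step should not be needed, since the server returns structured tool calls.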
Serving
vLLM (recommended — enables tool calling)
vllm serve nur-dev/farabi-4b \
--dtype bfloat16 --max-model-len 8192 \
--enable-auto-tool-choice --tool-call-parser hermes \
--chat-template chat_template.jinja
OpenAI-compatible client / Agents SDK
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="nur-dev/farabi-4b",
    messages=[{"role": "user", "content": "Астанада бүгін ауа райы қандай?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
)
# Any requested tool calls are in resp.choices[0].message.tool_calls.
print(resp.choices[0].message)
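The request above only returns the model's tool call; the application still has to run the tool and send the result back. A minimal sketch of that step follows, assuming the standard OpenAI-style message shape (tool_calls entries with id, function.name, and a JSON-string function.arguments). The get_weather stub, the TOOLS registry, and run_tool_calls are hypothetical names, and a SimpleNamespace stands in for resp.choices[0].message so the snippet runs without a server.

```python
import json
from types import SimpleNamespace


def get_weather(city: str) -> dict:
    # Hypothetical stub; a real implementation would query a weather service.
    return {"city": city, "forecast": "unknown"}


TOOLS = {"get_weather": get_weather}


def run_tool_calls(message, registry=TOOLS) -> list[dict]:
    """Execute each requested tool and return the matching role="tool" messages."""
    results = []
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = registry[call.function.name](**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output, ensure_ascii=False),
        })
    return results


# Stand-in for resp.choices[0].message so the sketch runs offline.
fake_message = SimpleNamespace(tool_calls=[SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"city": "Астана"}'),
)])
tool_messages = run_tool_calls(fake_message)
print(tool_messages)
```

With a live server, the assistant message and these tool messages would be appended to the conversation and sent in a second client.chat.completions.create call so the model can phrase the final answer.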
transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
tok = AutoTokenizer.from_pretrained("nur-dev/farabi-4b")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-4b", torch_dtype="bfloat16", device_map="auto")
msgs = [{"role": "user", "content": "Спутник деген не?"}]
# return_dict=True yields input_ids plus attention_mask, so generate() receives both.
inputs = tok.apply_chat_template(msgs, add_generation_prompt=True, return_dict=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Benchmarks
Evaluated on public Kazakh/Russian benchmarks against Sherkala-8B-Chat
(inceptionai/Llama-3.1-Sherkala-8B-Chat, an 8B Kazakh chat model), with both models run through the
same harness. Kazakh and Russian reasoning use the ISSAI QOLDA suite (n=250); knowledge is
measured with standard multiple-choice sets.
Summary: Farabi-4B is a tool-calling model that scores 78.3% overall on BFCL v4 (Berkeley Function-Calling Leaderboard), while Sherkala-8B-Chat has no function-calling interface and cannot be evaluated on it. Despite being half the size, Farabi-4B also leads on aggregate Kazakh reasoning (light-kk mean 46.9 vs 43.2) and on every Russian-language benchmark (by 4.8 to 20.0 points on the QOLDA Russian sets, and by 1.0 and 3.3 points on Belebele-ru and KazMMLU-ru). Sherkala-8B, trained on substantially more native-Kazakh text, leads on native Kazakh knowledge multiple choice (KazMMLU-kk, TUMLU-kk), on ARC-kk and GSM8K-kk, on Belebele-en, and on RAG free-generation (RAGBench-kk, chrF).
Tool / function calling — BFCL v4 (Berkeley Function-Calling Leaderboard, AST, %)
| Category | Farabi-4B | Sherkala-8B |
|---|---|---|
| Simple | 92.5 | unsupported |
| Multiple | 91.0 | unsupported |
| Parallel | 87.0 | unsupported |
| Irrelevance | 36.7 | unsupported |
| Overall | 78.3 | unsupported |
unsupported = Sherkala-8B-Chat's chat template has no tools / tool-call mechanism; it emits zero function calls on every BFCL category, so function calling cannot be evaluated. Farabi-4B is served with vLLM --tool-call-parser hermes.
Kazakh reasoning — ISSAI QOLDA (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| light-kk mean | 46.9 | 43.2 |
| MMLU-kk | 50.0 | 47.2 |
| MMLU-Pro-kk | 30.0 | 20.8 |
| GPQA-kk | 34.4 | 30.0 |
| PolyMath-kk | 26.0 | 21.6 |
| ARC-kk | 73.2 | 74.8 |
| GSM8K-kk | 66.4 | 68.8 |
| RAGBench-kk (chrF) | 30.6 | 41.9 |
Russian reasoning — ISSAI QOLDA (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| ARC-ru | 92.8 | 78.4 |
| MMLU-Pro-ru | 42.8 | 22.8 |
| GPQA-ru | 32.4 | 25.2 |
| GSM8K-ru | 84.4 | 79.6 |
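The per-benchmark margins between the two models in the Russian reasoning table can be recomputed with a few lines of Python. This is only a convenience sketch; the scores are copied from the table above and nothing else is assumed.

```python
# Scores copied from the Russian reasoning table: (Farabi-4B, Sherkala-8B).
russian_scores = {
    "ARC-ru": (92.8, 78.4),
    "MMLU-Pro-ru": (42.8, 22.8),
    "GPQA-ru": (32.4, 25.2),
    "GSM8K-ru": (84.4, 79.6),
}

# Round to one decimal to hide floating-point noise in the subtraction.
margins = {name: round(farabi - sherkala, 1)
           for name, (farabi, sherkala) in russian_scores.items()}

for name, margin in margins.items():
    print(f"{name}: {margin:+.1f} pt")
```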
Standard multiple-choice (accuracy, %)
| Benchmark | Farabi-4B | Sherkala-8B |
|---|---|---|
| Belebele-kk | 70.5 | 69.0 |
| Belebele-ru | 80.5 | 79.5 |
| Belebele-en | 90.5 | 94.5 |
| KazMMLU-kk | 35.3 | 40.2 |
| KazMMLU-ru | 39.9 | 36.6 |
| TUMLU-kk | 30.5 | 37.5 |
| TruthfulQA-mc2 | 51.4 | 50.6 |
Intended use
Grounded question answering over retrieved passages (RAG), tool-augmented assistants / agents (Hermes tool calls), and Kazakh/Russian/English chat. For grounded RAG the model is trained to answer only from provided evidence and to abstain when evidence is insufficient.
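As an illustration of the grounded RAG use case, the sketch below packs retrieved passages into chat messages with numbered citations. The prompt layout, the system instruction wording, and the build_rag_messages helper are all assumptions made for this example; the model card does not specify the format used in training, so treat this as a starting point to adapt.

```python
def build_rag_messages(question: str, passages: list[str]) -> list[dict]:
    """Number the passages so an answer can cite them as [1], [2], ... (assumed layout)."""
    context = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    system = (
        "Answer using only the passages below and cite them by number. "
        "If they do not contain the answer, say that the evidence is insufficient."
    )
    user = f"Passages:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


messages = build_rag_messages(
    "What is the capital of Kazakhstan?",
    ["Astana is the capital of Kazakhstan.", "Almaty is the largest city in Kazakhstan."],
)
print(messages[1]["content"])
```

The resulting list can be passed as the messages argument of an OpenAI-compatible chat completion request or to the tokenizer's apply_chat_template.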
License
Apache-2.0.