How to use nur-dev/farabi-0.6B with libraries and local apps.

Libraries

How to use nur-dev/farabi-0.6B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nur-dev/farabi-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nur-dev/farabi-0.6B")
model = AutoModelForCausalLM.from_pretrained("nur-dev/farabi-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use nur-dev/farabi-0.6B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nur-dev/farabi-0.6B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/nur-dev/farabi-0.6B

SGLang

How to use nur-dev/farabi-0.6B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nur-dev/farabi-0.6B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nur-dev/farabi-0.6B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nur-dev/farabi-0.6B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nur-dev/farabi-0.6B with Docker Model Runner:
```
docker model run hf.co/nur-dev/farabi-0.6B
```


	---
	language:
	- kk
	- ru
	- en
	pipeline_tag: text-generation
	library_name: transformers
	tags:
	- kazakh
	- multilingual
	- instruction-tuning
	- tool-calling
	- function-calling
	- agent
	- conversational
	base_model: nur-dev/farabi-0.6B-base
	license: apache-2.0
	---

	# Farabi-0.6B

	Farabi-0.6B is a compact, multilingual instruction-tuned language model with a
	primary focus on Kazakh, alongside strong Russian and English support.
	It is designed for everyday assistant use, reasoning, retrieval-grounded answering,
	and tool / function calling in agentic applications.

	The model speaks fluent Kazakh and is intended to make high-quality conversational
	AI more accessible for the Kazakh language, where well-aligned models remain scarce.

	Created by [Nurgali Kadyrbek](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/).

	It is built on [`nur-dev/farabi-0.6B-base`](https://huggingface.co/nur-dev/farabi-0.6B-base) —
	a Kazakh-adapted base model that was itself continually pre-trained from Qwen3-0.6B — and then
	instruction-tuned to produce this assistant.

	---

	## Highlights

	- 🇰🇿 Kazakh-first — the majority of the instruction data is native Kazakh, with
	Russian and English mixed in for cross-lingual robustness.
	- 🧠 Reasoning — supports optional step-by-step "thinking" mode that can be toggled
	on or off at request time.
	- 🔧 Tool calling — emits Hermes-style `<tool_call>` blocks and is compatible with
	the OpenAI-style function-calling interface and agent frameworks.
	- 📚 Grounded answering — trained to answer from provided documents and context,
	including longer inputs.
	- 🪶 Small & deployable — 0.6B parameters, runs comfortably on a single modest GPU.
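As a rough illustration of the Hermes-style format mentioned in the list above, the sketch below pulls `<tool_call>` blocks out of raw generated text. It assumes each block wraps one JSON object with `name` and `arguments` keys, which is the usual Hermes convention; the helper name `extract_tool_calls` and the sample string are hypothetical. When the model is served through vLLM with a tool-call parser, the server does this parsing and the helper is not needed.

```python
import json
import re

# Assumption: each call is a single JSON object wrapped in <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> list[dict]:
    """Return the JSON payload of every well-formed <tool_call> block in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            # Skip malformed blocks instead of failing the whole response.
            continue
    return calls


# Hypothetical raw model output containing one call.
raw = '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Almaty"}}\n</tool_call>'
print(extract_tool_calls(raw))
```

Malformed blocks are skipped silently here; a stricter caller might prefer to log or raise instead.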

	---

	## Languages

	| Language | Approx. share of instruction data |
	|----------|-----------------------------------|
	| Kazakh (kk) | ~56% |
	| English (en) | ~33% |
	| Russian (ru) | ~10% |

	---

	## Data coverage by domain

	The model was instruction-tuned on a broad, internally curated mixture. In general
	terms, without dataset-level specifics, the approximate domain composition is:

	| Domain | Approx. share |
	|--------|---------------|
	| General instruction following & multi-turn conversation | ~45% |
	| Reasoning & step-by-step problem solving | ~27% |
	| Retrieval-grounded answering, long context & document Q&A | ~13% |
	| Tool use, function calling & agentic interaction | ~7% |
	| Knowledge, culture, news & encyclopedic content | ~4% |
	| Mathematics, language tasks (grammar / translation), safety & appropriate refusal, device & environment control, and assistant identity | ~4% |

	Shares are approximate and reflect general domain proportions rather than exact figures.

	---

	## Data provenance & acknowledgments

	The training datasets were created internally by the author, including original
	synthesis as well as additionally processed and enriched material.

	Approximately 5.4% of all data used for instruction tuning was derived (with
	additional processing and enrichment) from resources of two organizations, whose
	contributions to the Kazakh language are gratefully acknowledged:

	1. Институт языкознания имени А. Байтурсынова — Institute of Linguistics named after A. Baitursynov
	2. ННПЦ «Тіл-Қазына» имени Шайсултана Шаяхметова — Sh. Shayakhmetov National Research and Practical Center "Til-Qazyna"

	---

	## Recommended sampling parameters

	A good starting point for general use:

	```json
	{
	  "temperature": 0.15,
	  "top_p": 0.95,
	  "max_tokens": 1024,
	  "repetition_penalty": 1.05,
	  "stream": true,
	  "chat_template_kwargs": {
	    "enable_thinking": true
	  },
	  "continue_final_message": true
	}
	```

	Set `"enable_thinking": false` to get direct answers without an explicit reasoning step.
	Raise `temperature` for more creative / open-ended generation.
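One way to apply these settings consistently is a small request builder, sketched below. The helper name `build_request` is hypothetical, and the `model` value assumes the server exposes the model under its repository id. `stream` and `continue_final_message` are left out so the caller can set them per request.

```python
import copy

# Defaults copied from the recommended sampling parameters above.
DEFAULT_PARAMS = {
    "temperature": 0.15,
    "top_p": 0.95,
    "max_tokens": 1024,
    "repetition_penalty": 1.05,
    "chat_template_kwargs": {"enable_thinking": True},
}


def build_request(messages, enable_thinking=True, **overrides):
    """Build a chat-completions payload from the defaults plus per-call overrides."""
    params = copy.deepcopy(DEFAULT_PARAMS)  # never mutate the shared defaults
    params["chat_template_kwargs"]["enable_thinking"] = enable_thinking
    params.update(overrides)
    return {"model": "nur-dev/farabi-0.6B", "messages": messages, **params}


# Direct answer with a higher temperature for open-ended generation.
request = build_request(
    [{"role": "user", "content": "Who are you?"}],
    enable_thinking=False,
    temperature=0.7,
)
print(request)
```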

	---

	## Serving with vLLM

	Start an OpenAI-compatible server with tool-calling enabled:

	```bash
	vllm serve nur-dev/farabi-0.6B \
	  --served-model-name farabi-0.6b \
	  --enable-auto-tool-choice \
	  --tool-call-parser hermes
	```

	Query it with the standard OpenAI client (and the recommended sampling params):

	```python
	from openai import OpenAI

	client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

	resp = client.chat.completions.create(
	    model="farabi-0.6b",
	    messages=[
	        {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
	        {"role": "user", "content": "Алматы туралы қысқаша айтып бер."},
	    ],
	    temperature=0.15,
	    top_p=0.95,
	    max_tokens=1024,
	    extra_body={
	        "repetition_penalty": 1.05,
	        "chat_template_kwargs": {"enable_thinking": True},
	    },
	    stream=True,
	)
	# Print tokens as they arrive.
	for chunk in resp:
	    delta = chunk.choices[0].delta.content
	    if delta:
	        print(delta, end="", flush=True)
	```

	Tool calling works through the standard `tools=[...]` argument — the model returns
	function calls that the server parses into structured `tool_calls`.
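A minimal sketch of that flow follows. The `get_weather` tool, the `REGISTRY` mapping, and the helper names are hypothetical and exist only for illustration; the model name `farabi-0.6b` assumes the `--served-model-name` used in the command above. `ask_with_tools` expects an OpenAI-compatible client object, such as one created with the `openai` package.

```python
import json

# Hypothetical tool, described in the OpenAI function-calling schema.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]


def get_weather(city: str) -> str:
    # Placeholder implementation; a real tool would query a weather service.
    return f"Weather for {city} is not available in this sketch."


REGISTRY = {"get_weather": get_weather}


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one parsed tool call; arguments arrive as a JSON string."""
    if name not in REGISTRY:
        return f"Unknown tool: {name}"
    return REGISTRY[name](**json.loads(arguments_json))


def ask_with_tools(client, question: str) -> list[str]:
    """Send one question and run whatever tool calls come back."""
    resp = client.chat.completions.create(
        model="farabi-0.6b",
        messages=[{"role": "user", "content": question}],
        tools=TOOLS,
    )
    calls = resp.choices[0].message.tool_calls or []
    return [run_tool_call(c.function.name, c.function.arguments) for c in calls]
```

The unknown-tool guard in `run_tool_call` is a deliberate choice: it keeps a call to a tool that was never registered from raising inside the application.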

	---

	## Serving with PyTorch / Transformers

	```python
	import torch
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_id = "nur-dev/farabi-0.6B"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForCausalLM.from_pretrained(
	    model_id,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	messages = [
	    {"role": "system", "content": "Сіз пайдалы әрі дәл көмекшісіз."},
	    {"role": "user", "content": "Қазақстанның астанасы қай қала?"},
	]

	# return_dict=True yields input_ids and attention_mask together.
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    enable_thinking=True,  # set False for direct answers
	    tokenize=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)

	outputs = model.generate(
	    **inputs,
	    max_new_tokens=1024,
	    do_sample=True,
	    temperature=0.15,
	    top_p=0.95,
	    repetition_penalty=1.05,
	)
	# Decode only the newly generated tokens, skipping the prompt.
	prompt_len = inputs["input_ids"].shape[-1]
	print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))
	```
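With thinking enabled, the decoded text may contain the reasoning as well as the final answer. The sketch below separates the two. It assumes the reasoning is wrapped in `<think>...</think>` tags, as in the Qwen3 family this model descends from; that delimiter is an assumption and is not confirmed by this README. The helper name `split_thinking` is hypothetical.

```python
def split_thinking(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) around a closing </think> tag.

    Assumption: reasoning, if present, ends with a literal "</think>" tag.
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No reasoning block found: treat everything as the answer.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Hypothetical decoded output with a reasoning block.
reasoning, answer = split_thinking("<think>\n2 + 2 = 4\n</think>\n\n4")
print(answer)
```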

	---

	## Evaluation

	> ⚠️ Interim results. The numbers below were measured on an early checkpoint
	> (~17% through instruction tuning). They are expected to improve as training
	> continues, but already show meaningful capability.

	### Tool / function calling — BFCL v4

	Berkeley Function-Calling Leaderboard (v4), 1,040 cases, evaluated with the
	HuggingFace backend.

	| Category | Accuracy | n | What it measures |
	|----------|----------|---|------------------|
	| Simple | 80.5% | 322/400 | one call, one tool available |
	| Multiple | 71.5% | 143/200 | pick the right tool from several |
	| Parallel | 65.5% | 131/200 | several calls in one turn |
	| Irrelevance | 5.4% | 13/240 | abstain when no tool fits |
	| Overall | 58.6% | 609/1040 | |
	| Function-calling avg | 74.5% | 596/800 | excludes irrelevance |

	Takeaways:
	- Strong calling ability for a 0.6B model. When a call is warranted it is correct
	~74.5% of the time — right tool, valid arguments, clean JSON — including 65.5% on the
	hard parallel / multi-call category.
	- The weakness is abstention, not calling. On queries that match no available tool,
	the model still tends to emit a call (irrelevance 5.4% → it over-triggers). This is the
	main driver of the lower overall score and the clearest area for improvement.
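The aggregate rows can be reproduced from the per-category counts, as the short sketch below shows. Both aggregates are pooled over cases, so each test case counts equally. An unweighted mean of the three calling categories would instead give (80.5 + 71.5 + 65.5) / 3 = 72.5%. The function name `pooled_accuracy` is illustrative.

```python
# (correct, total) per category, copied from the BFCL v4 table above.
RESULTS = {
    "Simple": (322, 400),
    "Multiple": (143, 200),
    "Parallel": (131, 200),
    "Irrelevance": (13, 240),
}


def pooled_accuracy(categories):
    """Pooled (micro-averaged) accuracy in percent over the given categories."""
    correct = sum(RESULTS[name][0] for name in categories)
    total = sum(RESULTS[name][1] for name in categories)
    return 100 * correct / total


overall = pooled_accuracy(RESULTS)  # 609 / 1040
calling_only = pooled_accuracy(["Simple", "Multiple", "Parallel"])  # 596 / 800
print(f"Overall: {overall:.1f}%")
print(f"Function-calling avg: {calling_only:.1f}%")
```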

	### Multilingual comprehension — 4-way multiple choice

	Multiple-choice comprehension across the model's three languages (random baseline = 25%),
	evaluated with the chat template and `enable_thinking=False`.

	| Language | Accuracy |
	|----------|----------|
	| English | 53.7% ±1.7 |
	| Russian | 50.0% ±1.7 |
	| Kazakh | 41.8% ±1.6 |

	Takeaways:
	- Well above the 25% random baseline in all three languages — real comprehension in
	English, Russian, and Kazakh.
	- Resource ordering (en > ru > kk) is as expected; Kazakh at 41.8% is clearly non-trivial.
	- Evaluating with the chat template and `enable_thinking=False` adds ~5–6 points per
	language versus a raw prompt — another reason to serve the model with its chat template
	(see serving instructions above).

	---

	## Intended use & limitations

	Farabi-0.6B is intended as a helpful general-purpose and agentic assistant, with a
	focus on Kazakh-language use cases. As a small model, it can make factual mistakes,
	so its outputs should be verified before use in high-stakes or accuracy-critical
	applications. It should be used responsibly and in accordance with applicable laws
	and the base model's license.

	---

	## Citation

	If you use this model, please credit the author:

	> Nurgali Kadyrbek — Farabi-0.6B.
	> https://www.linkedin.com/in/nurgali-kadyrbek-504260231/