
Domyn Small

Domyn Small is a 10B-parameter open-weight reasoning model designed for resource-constrained, agentic, and fine-tunable deployments. It pairs a dual-mode (thinking on/off) inference design with grouped-query attention, a native 32k context window (extensible to 131k via YaRN), and tool calling. On reasoning benchmarks it reaches accuracy comparable to leading 7–10B reasoning peers while spending roughly 2–4× fewer reasoning tokens — placing it on a favourable accuracy/cost Pareto frontier for production inference and downstream fine-tuning.

Fine-tune Domyn Small to your domain to unlock its real power and to retain full ownership and control over the resulting model.

Highlights

  • Token-efficient reasoning — ~32% of Qwen3.5-9B's reasoning-token budget and ~35% of OLMo-3-7B-Think's at comparable accuracy on several reasoning tasks (Token Efficiency).
  • Dual-mode inference — thinking on for deep multi-step reasoning, thinking off for fast, compact output. Toggleable from the system prompt or the API.
  • Tool calling — first-class function calling via <tool_call> XML tags, with a chat template that handles tool injection automatically. Strong BFCL V3 single-turn results (75.9 Non-Live / 68.3 Live) at ~280 mean tokens per problem.
  • Expandable context — 32,768 tokens natively, extensible to 131,072 (128k) via YaRN at inference time.
  • Multilingual — 50+ languages with explicit training coverage; optimised for English and the Tier-A European set (Italian, Spanish, French, German).

Model Overview

  • Developed by: Domyn S.p.A.
  • Version: 1.0
  • Released and last updated: May 2026
  • Input / Output: Text-only / Text-only
  • Model size: ~10B parameters
  • Attention: Grouped-Query Attention (48 query heads, 8 KV heads)
  • Tokenizer: 256,000-token SentencePiece BPE vocabulary
  • Native context: 32,768 tokens
  • Extended context: 131,072 tokens (YaRN, 4× at inference time)
  • Language(s): 50+ languages; optimised for English and the Tier-A European set (Italian, Spanish, French, German)
  • Base model: Initialised from Italia 10B and continually pre-trained on 503B tokens
  • Knowledge cut-off date: September 2024 (based on pre-training dataset cut-off)
  • License: MIT

A full architecture and training-recipe specification is available in the Domyn Small technical report.

Quickstart

from openai import OpenAI

client = OpenAI(
    base_url="http://<your-vllm-host>/v1",  # your vLLM OpenAI-compatible endpoint
    api_key="none",  # vLLM does not check the key unless started with --api-key
)

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
        {"role": "user", "content": "What is the capital of Italy?"},
    ],
)
print(response.choices[0].message.content)

Deployment

We recommend vLLM ≥ 0.9.2 for all the snippets below.

vLLM — Basic

vllm serve domyn/Domyn-Small-v1.0 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9

vLLM — With Reasoning Parsing

To have vLLM automatically extract the model's <think> blocks and expose them as a structured reasoning_content field, add --reasoning-parser olmo3. Domyn Small emits the same <think>…</think> format as OLMo 3, so the OLMo 3 parser works directly — no Domyn-specific parser is required.

vllm serve domyn/Domyn-Small-v1.0 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9 \
    --reasoning-parser olmo3
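
With the parser active, the reasoning trace arrives as a separate field on the message. A minimal sketch against the server above (reasoning_content is a vLLM extension, not part of the standard OpenAI schema, so it is read defensively):

from openai import OpenAI

client = OpenAI(base_url="http://<your-vllm-host>/v1", api_key="none")

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {
            "role": "system",
            "content": "You are Domyn Small, a helpful assistant. thinking on",
        },
        {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
    ],
)

msg = response.choices[0].message
# Populated by vLLM when --reasoning-parser is active; absent otherwise.
print("Reasoning:", getattr(msg, "reasoning_content", None))
print("Answer:", msg.content)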

vLLM — Extended Context with YaRN

YaRN scaling may impact model quality on inputs shorter than 32k. Enable it only when you actually need contexts beyond the native 32,768-token window.

# vLLM < 0.12.0
vllm serve domyn/Domyn-Small-v1.0 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --rope-scaling '{"rope_type": "yarn", "factor": 4, "original_max_position_embeddings": 32768}' \
    --max-model-len 131072

# vLLM >= 0.12.0
vllm serve domyn/Domyn-Small-v1.0 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --hf-overrides '{"rope_parameters": {"rope_type": "yarn", "factor": 4.0, "original_max_position_embeddings": 32768}}' \
    --max-model-len 131072

vLLM — With Tool Calling

Tool calling requires four extra flags and the bundled plugin files (shipped with this model checkpoint):

vllm serve domyn/Domyn-Small-v1.0 \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --max-num-seqs 256 \
    --gpu-memory-utilization 0.9 \
    --enable-auto-tool-choice \
    --tool-call-parser xml_tool_call \
    --tool-parser-plugin /path/to/tool_parser_plugin.py \
    --chat-template /path/to/chat_template.jinja

Replace /path/to/ with the actual paths to the files bundled with the checkpoint.

Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "domyn/Domyn-Small-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": "You are Domyn Small, a helpful assistant. thinking on",
    },
    {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,  # needed so inputs unpacks into generate() below
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128)
print(
    tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
    )
)

Thinking Mode

Domyn Small supports chain-of-thought reasoning controlled by a directive in the system prompt:

  • Thinking off (default): omit the directive, or include thinking off.
  • Thinking on: append thinking on to your system prompt, as in:

messages = [
    {"role": "system", "content": "You are Domyn Small, a helpful assistant. thinking on"},
    {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
]

When thinking is on, the model emits its reasoning inside <think>…</think> tags before the final answer.
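
If the server is not running with --reasoning-parser, a small client-side split recovers both parts. A minimal sketch (the sample completion string is illustrative):

import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a thinking-on completion into (reasoning, final answer)."""
    m = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if m is None:
        return "", text.strip()
    return m.group(1).strip(), text[m.end():].strip()

raw = "<think>17 × 24 = 17 × 25 - 17 = 425 - 17 = 408</think>17 × 24 = 408."
reasoning, answer = split_thinking(raw)
print(reasoning)  # the chain of thought between the tags
print(answer)     # 17 × 24 = 408.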

Alternatively, you can control reasoning by passing enable_thinking as an extra request parameter. This has the same effect as adding thinking on to the system prompt. Because enable_thinking is not part of the standard OpenAI schema, it must be forwarded to vLLM via the OpenAI client's extra_body field:

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
    ],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

Recommended Sampling Parameters

Mode          temperature  top_p  top_k  min_p
Thinking off  0.1          0.95   50     0.1
Thinking on   0.6          0.90   25     0.1

Do not use greedy decoding in thinking mode — it degrades reasoning quality and may cause repetition.
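
Applied with the OpenAI client against vLLM, the thinking-on row of the table looks like this; top_k and min_p are not part of the standard OpenAI schema, so they are forwarded through extra_body (vLLM accepts them as extra sampling parameters):

from openai import OpenAI

client = OpenAI(base_url="http://<your-vllm-host>/v1", api_key="none")

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {
            "role": "system",
            "content": "You are Domyn Small, a helpful assistant. thinking on",
        },
        {"role": "user", "content": "Solve step by step: what is 17 × 24?"},
    ],
    temperature=0.6,
    top_p=0.90,
    extra_body={"top_k": 25, "min_p": 0.1},  # vLLM-specific sampling knobs
)
print(response.choices[0].message.content)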

Tool Calling

How It Works

Domyn Small has been trained to call functions using <tool_call> XML tags. The chat template handles tool formatting automatically: you do not need to write tool instructions in your system prompt.

When you pass a tools list to the API, the chat template prepends a structured tool-instruction block to the system prompt automatically. Your own system message (for persona or context) is appended after that block. The final rendered system block looks like:

<auto-generated tool instruction containing the tools JSON>
<your system message>
thinking on/off

This means your system prompt stays clean — just describe the assistant's persona or context.
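
To see the rendered prompt for yourself, apply the bundled chat template with tokenize=False and a toy tool schema (the get_time function below is purely illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("domyn/Domyn-Small-v1.0")

# Minimal OpenAI-style tool schema, for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_time",
            "description": "Get the current time in a timezone.",
            "parameters": {
                "type": "object",
                "properties": {"tz": {"type": "string"}},
                "required": ["tz"],
            },
        },
    }
]

# Prints the auto-generated tool block, followed by your system message
# and the thinking directive.
print(
    tokenizer.apply_chat_template(
        [
            {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
            {"role": "user", "content": "What time is it in Rome?"},
        ],
        tools=tools,
        tokenize=False,
        add_generation_prompt=True,
    )
)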

Python Example

from openai import OpenAI

client = OpenAI(
    base_url="http://<your-vllm-host>/v1",
    api_key="none",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather_forecast",
            "description": "Get the weather forecast for a location on a given date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
                },
                "required": ["location", "date"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="domyn/Domyn-Small-v1.0",
    messages=[
        {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
        {"role": "user", "content": "What's the weather like in Rome today?"},
    ],
    tools=tools,
    temperature=0.0,
)

choice = response.choices[0]
if choice.finish_reason == "tool_calls":
    for tc in choice.message.tool_calls:
        print(f"Function: {tc.function.name}")
        print(f"Arguments: {tc.function.arguments}")
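
To complete the loop, execute the function locally and return its output as a tool message. A sketch continuing the example above, with a stub implementation of get_weather_forecast (the stub values are illustrative):

import json

# Illustrative stub; a real implementation would call a weather service.
def get_weather_forecast(location: str, date: str) -> dict:
    return {"location": location, "date": date, "forecast": "sunny", "temp_c": 24}

if choice.finish_reason == "tool_calls":
    tc = choice.message.tool_calls[0]
    result = get_weather_forecast(**json.loads(tc.function.arguments))

    followup = client.chat.completions.create(
        model="domyn/Domyn-Small-v1.0",
        messages=[
            {"role": "system", "content": "You are Domyn Small, a helpful assistant."},
            {"role": "user", "content": "What's the weather like in Rome today?"},
            choice.message,  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": tc.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)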

Evaluations

Domyn Small is evaluated against four peer models in the 7–10B parameter class: Qwen3.5-9B, OLMo-3-7B-Think, Llama-3.1-Nemotron-Nano-8B-v1, and Ministral-3-8B-Reasoning. All scores are in thinking-on mode at 32,768-token sequence length (RULER extends to 131,072 via YaRN).

Category           Benchmark               Domyn Small  Qwen3.5-9B  OLMo-3-7B-Think  Nemotron-Nano  Ministral-3-8B
Reasoning          MATH-500                93.2         97.4        96.8             95.4           89.2
                   AIME 2025 (avg@48)      35.7         90.0        70.4             51.2           32.3
                   GPQA-Diamond            50.0         82.7        50.8             42.4           43.9
Code               HumanEval (pass@1)      96.3         93.3        95.7             91.5           86.6
                   LiveCodeBench (pass@1)  55.0         86.2        74.8             67.2           46.0
                   MBPP (pass@1)           76.8         76.8        86.6             77.6           66.6
General Knowledge  MMLU                    80.3         84.6        75.2             56.0           75.3
                   MMLU-PRO                67.7         84.4        64.0             28.8           62.0
Instruction        IFEval (strict)         79.9         91.0        83.7             70.4           62.5
Multilingual       MGSM                    73.1         88.9        64.0             19.9           75.5
Long context       RULER 32k               59.5         89.8        69.8             34.0           88.7
                   RULER 64k               29.6         87.9        17.2             18.7           85.9
Tool calling       BFCL V3 Non-Live        75.9         78.1        61.1             63.3           —
                   BFCL V3 Live            68.3         78.4        66.9             40.2           —
                   BFCL V3 Multi-Turn      7.0          50.6        2.1              0.1            —

Domyn Small attains its single-turn BFCL results at ~280 mean tokens per problem against ~590 for Qwen3.5-9B and ~2,429 for OLMo-3-7B-Think — the best accuracy-per-token tool-calling profile in the peer set among models that fully engage the reasoning path. Ministral-3-8B is excluded from the BFCL comparison: during evaluation it consistently failed to close the [/THINK] reasoning delimiter, making its structured outputs unparseable by the benchmark.

Token Efficiency

The table below compares mean generated tokens per problem (thinking on, lower is better) against the strongest accuracy peer in the set, Qwen3.5-9B. Grand means weight each benchmark by its problem count.

Category           Benchmark     Domyn Small  Qwen3.5-9B
Reasoning          MATH-500      2,261        7,614
                   AIME 2025     5,190        18,668
                   GPQA-Diamond  3,396        8,976
                   Grand mean    2,690        8,440
Code               HumanEval     1,884        1,144
                   LCB-Gen       5,010        12,739
                   MBPP          2,420        1,927
                   Grand mean    3,312        5,870
General Knowledge  MMLU          1,236        3,262
                   MMLU-PRO      2,947        4,666
                   Grand mean    2,026        3,910
Instruction        IFEval        775          3,874
Multilingual       MGSM          796          3,140

On the reasoning suite Domyn Small produces approximately 32% of Qwen3.5-9B's token budget — a 3.1× saving at comparable accuracy on several benchmarks.
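
As a sanity check, the reasoning grand means follow from weighting each benchmark by its problem count (500 for MATH-500, 30 for AIME 2025, 198 for GPQA-Diamond):

# Mean tokens per problem from the table above, weighted by problem count.
counts = {"MATH-500": 500, "AIME 2025": 30, "GPQA-Diamond": 198}
domyn = {"MATH-500": 2261, "AIME 2025": 5190, "GPQA-Diamond": 3396}
qwen = {"MATH-500": 7614, "AIME 2025": 18668, "GPQA-Diamond": 8976}

def grand_mean(tokens: dict) -> float:
    return sum(tokens[b] * counts[b] for b in counts) / sum(counts.values())

print(round(grand_mean(domyn)), round(grand_mean(qwen)))  # 2690 8440
print(f"{grand_mean(domyn) / grand_mean(qwen):.0%}")      # 32%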

Dual-Mode Comparison (Thinking ON vs. OFF)

Effect of the reasoning toggle on Domyn Small. Same evaluation harness; thinking-on AIME 2025 is reported as avg@48, other thinking-on entries are single-pass.

Benchmark               Thinking off  Thinking on  Δ
MATH-500                91.4          93.2         +1.8
AIME 2025               31.0          35.7         +4.7
LiveCodeBench           33.8          55.0         +21.2
MBPP                    54.6          76.8         +22.2
HumanEval               69.5          96.3         +26.8
GPQA-Diamond            40.0          50.0         +10.0
MMLU-PRO                60.0          67.7         +7.7
MGSM                    59.7          73.1         +13.4
IFEval (prompt strict)  78.6          79.9         +1.3

The toggle helps most when the bottleneck is multi-step search or program synthesis (code, science reasoning, multilingual math); it helps least when the bottleneck is recall or format compliance.

Intended Uses

Primary Use Cases

Domyn Small is intended for commercial and research use in multiple languages:

  • Regulated-industry use cases in resource-constrained environments that need reduced computational cost and faster response times in production.
  • Fine-tuning on domain-specific knowledge in any industry, equipping the model with the context and expertise needed to excel in real-world applications.
  • Agentic applications, especially agents that need to solve coding and mathematical problems and perform sequential, tool-calling tasks.

Out-of-Scope Use Cases

Domyn Small is not specifically designed or evaluated for all downstream purposes. As with any language model, developers should carefully evaluate accuracy, safety, and fairness before applying it to specific downstream scenarios, particularly high-risk ones. Developers should also ensure compliance with all applicable laws and regulations (including, but not limited to, privacy and trade compliance) relevant to their use case.

EU AI Act Compliance

Domyn Small is released as a general-purpose AI (GPAI) model under the EU AI Act. Article 53 transparency obligations are discharged via this model card, the Domyn Small technical report (architecture, training data composition, training stages, evaluations, and known limitations end-to-end), and the MIT-licensed open-weights release. The training-data summary required by Article 53(1)(d) is provided as a companion artefact to the model release.

To uphold data-subject rights and comply with the AI Act and EU copyright framework, we operate an opt-out procedure for rights holders. Anyone who believes their copyrighted material was inadvertently included in our training corpora can contact copyright@domyn.com, and we will exclude the affected data from subsequent model iterations.

Citation

If you find this work valuable, please consider citing it:

@misc{domynsmall2026,
  title  = {Domyn Small},
  author = {Domyn S.p.A.},
  year   = {2026},
  eprint = {TBD},
  note   = {Technical report, forthcoming},
}

Contacts

  • For general inquiries about Domyn Small, please contact: models@domyn.com
  • For copyright-related complaints, please contact: copyright@domyn.com

Affected rightsholders and their authorised representatives, including collective management organisations, may submit sufficiently precise and adequately substantiated complaints electronically concerning any non-compliance with our commitments under the Copyright Chapter of the GPAI Code of Practice. We commit to handling such complaints diligently, impartially, and within a reasonable timeframe, except in cases where the complaint is manifestly unfounded or has already been addressed. This mechanism complements, but does not limit, the available legal measures, remedies, and sanctions under Union and national copyright law.
