Instructions to use useitone/OCC-RAG-0.6B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use useitone/OCC-RAG-0.6B with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="useitone/OCC-RAG-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("useitone/OCC-RAG-0.6B")
model = AutoModelForCausalLM.from_pretrained("useitone/OCC-RAG-0.6B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt).
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```


vLLM

How to use useitone/OCC-RAG-0.6B with vLLM:

Install from pip and serve model

```bash
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "useitone/OCC-RAG-0.6B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "useitone/OCC-RAG-0.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker

```bash
docker model run hf.co/useitone/OCC-RAG-0.6B
```

SGLang

How to use useitone/OCC-RAG-0.6B with SGLang:

Install from pip and serve model

```bash
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "useitone/OCC-RAG-0.6B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "useitone/OCC-RAG-0.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Use Docker images

```bash
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "useitone/OCC-RAG-0.6B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
    -H "Content-Type: application/json" \
    --data '{
        "model": "useitone/OCC-RAG-0.6B",
        "messages": [
            {
                "role": "user",
                "content": "What is the capital of France?"
            }
        ]
    }'
```

Docker Model Runner
How to use useitone/OCC-RAG-0.6B with Docker Model Runner:
```
docker model run hf.co/useitone/OCC-RAG-0.6B
```

Duplicate from occ-ai/OCC-RAG-0.6B


	---
	license: mit
	language:
	- en
	- ru
	library_name: transformers
	pipeline_tag: text-generation
	base_model: Qwen/Qwen3-0.6B-Base
	tags:
	- rag
	- faithful-qa
	- occ
	---

	# OCC-RAG-0.6B

	<p align="center">
	<img src="figures/occ.png" alt="OCC-RAG" width="320"/>
	</p>

	<p align="center">
	<a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a>  |
	<a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a>  |
	<a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
	</p>

	OCC-RAG-0.6B is a 0.6B-parameter small language model specialized for faithful, context-grounded question answering. Along with OCC-RAG-1.7B, it belongs to the first generation of Optimal Cognitive Core (OCC) specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.

	Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models 2–6× larger on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.

	## Highlights

	- Faithful by design — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
	- Calibrated abstention — outputs `Not enough information` when the context does not support an answer.
	- Structured, citable reasoning — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
	- Compact — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.

	## Model overview

	OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of ~3.25M QA pairs (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
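
As a quick sanity check on the corpus figures above, the sketch below sums the rounded subset sizes and prints each subset's share of the raw corpus. It uses only the approximate counts quoted in this section; the oversampling factors applied to the multi-hop and multi-context subsets are not stated here, so the shares are for the corpus before oversampling.

```python
# Approximate subset sizes quoted in the model overview (rounded figures).
subset_sizes = {
    "single-hop": 2_780_000,
    "multi-hop single-context": 262_000,
    "multi-hop multi-context": 165_000,
    "abstain": 43_000,
}

total = sum(subset_sizes.values())
print(f"total: {total:,}")  # 3,250,000, matching the ~3.25M quoted above

# Share of each subset in the raw corpus (before any oversampling).
for name, size in subset_sizes.items():
    print(f"{name:>26}: {size / total:6.1%}")
```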

	## Evaluation

	Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.

	| Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
	|---|---|---|---|---|---|---|
	| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
	| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
	| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
	| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
	| OCC-RAG-0.6B | 57.6 | 36.6 | 75.0 | 79.9 | 5.2 | 86.9 |

	OCC-RAG-0.6B outperforms gemma-3-4b-it on every metric in the table and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among the models listed. The authors report the same outcome against SmolLM-3-3B, which is not shown in this table.
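
The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the normalization (lowercasing, whitespace tokenization) and the way a refusal is detected (the phrase `Not enough information` appearing in the answer) are assumptions, and the function names are hypothetical.

```python
from collections import Counter


def in_acc(prediction: str, gold: str) -> bool:
    """In-Acc for one example: the gold answer is a substring of the prediction."""
    return gold.strip().lower() in prediction.strip().lower()


def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between prediction and gold (whitespace tokens, assumed)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # occurs in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def refusal_accuracy(answers, refusal_phrase="Not enough information") -> float:
    """Share of answers to unanswerable questions that contain the refusal phrase."""
    if not answers:
        return 0.0
    refused = sum(refusal_phrase.lower() in a.lower() for a in answers)
    return refused / len(answers)


print(in_acc("The answer is Canada.", "Canada"))
print(f"{token_f1('in Canada', 'Canada'):.2f}")
print(refusal_accuracy(["Not enough information", "Paris"]))
```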

	## Input / output format

	OCC-RAG uses a structured prompt format with special tokens. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.

	The response is split into five sections, each delimited by special tokens:

	| Section | Tokens | Content |
	|---|---|---|
	| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
	| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
	| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
	| Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
	| Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |
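
Because every section has its own start and end token, a decoded response can be split with one regular expression per section. The sketch below is one possible parser, not part of the released code: `parse_response` is a hypothetical helper, and the example trace is hand-written for illustration rather than real model output. A missing end token is tolerated so that a response truncated by `max_new_tokens` still yields its last section.

```python
import re

# Section names, in the order they appear in the response.
SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")


def parse_response(response: str) -> dict:
    """Return {section_name: text or None} for the five response sections."""
    parsed = {}
    for name in SECTIONS:
        # Match up to the end token, or to end of string if it is missing.
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, response, re.DOTALL)
        parsed[name] = matches[-1].strip() if matches else None
    return parsed


# Hand-written example trace (illustrative only).
example = (
    "<|query_analysis_start|>Find the country where Bell is buried.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates Nova Scotia.<|source_analysis_end|>"
    "<|reasoning_start|>Bell is buried in Nova Scotia, a province of Canada.<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)

parsed = parse_response(example)
print(parsed["status"], "->", parsed["answer"])
```

Decode with `skip_special_tokens=False` before parsing; otherwise the delimiters are stripped and every section comes back as `None`.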

	## Quickstart (Transformers)

	The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.

	```python
	import re
	from transformers import AutoModelForCausalLM, AutoTokenizer

	MODEL = "occ-ai/OCC-RAG-0.6B"

	tokenizer = AutoTokenizer.from_pretrained(MODEL)
	model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

	question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
	documents = [
	    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
	    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
	    {"text": "Nova Scotia is a province on the east coast of Canada."},
	]

	text = tokenizer.apply_chat_template(
	    [{"role": "user", "content": question}],
	    documents=documents,
	    tokenize=False,
	    add_generation_prompt=True,
	    enable_thinking=False,
	)

	# Alternative: assemble the structural tokens yourself.
	#
	# query_start, query_end = "<|query_start|>", "<|query_end|>"
	# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
	#
	# def build_user_content(question, sources):
	#     content = f"{query_start}{question}{query_end}\n"
	#     for i, s in enumerate(sources, start=1):
	#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
	#     return content
	#
	# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
	# text = tokenizer.apply_chat_template(
	#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
	# )

	inputs = tokenizer([text], return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=2048)
	# Keep special tokens so the section delimiters survive decoding.
	response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
	print(response)

	# Take the last answer section; tolerate a missing end token on truncated output.
	m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
	print("Answer:", m[-1].strip() if m else "")  # -> Canada
	```
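
Since the trace cites sources by id, the cited ids can be mapped back to the `documents` list to show which passages the answer rests on. The sketch below is an illustration under two assumptions: ids are 1-based and follow the order of `documents` (as in the manual prompt-building alternative above), and `cited_sources` is a hypothetical helper, not part of the model's tooling. The sample response string is hand-written.

```python
import re


def cited_sources(response: str, documents: list) -> dict:
    """Map each cited source id to its document text (1-based ids assumed)."""
    ids = sorted({int(n) for n in re.findall(r"<\|source_id\|>\s*(\d+)", response)})
    # Ignore ids that do not correspond to a supplied document.
    return {i: documents[i - 1]["text"] for i in ids if 1 <= i <= len(documents)}


documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor."},
    {"text": "Bell was buried near Baddeck, Nova Scotia."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

# Hand-written fragment of a source-analysis section (illustrative only).
sample = (
    "<|source_analysis_start|><|source_id|>2 states the burial place. "
    "<|source_id|>3 places Nova Scotia in Canada.<|source_analysis_end|>"
)

for source_id, source_text in cited_sources(sample, documents).items():
    print(source_id, source_text)
```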

	> [!NOTE]
	> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.

	## Deployment

	OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. At only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU from system RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.

	When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:

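	```python
	from openai import OpenAI

	# Assumes a local OpenAI-compatible server, e.g. `vllm serve` on port 8000;
	# `question` and `documents` are defined as in the Quickstart.
	client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
	response = client.chat.completions.create(
	    model="occ-ai/OCC-RAG-0.6B",
	    messages=[{"role": "user", "content": question}],
	    extra_body={"chat_template_kwargs": {"documents": documents}},
	)
	```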
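
For clients that do not use the OpenAI SDK, the same request can be sent as plain JSON, since `extra_body` fields are merged into the top level of the request body. The sketch below uses only the standard library. `build_chat_request` and `post_chat` are hypothetical helpers, the URL assumes a local vLLM server on port 8000, and passing `skip_special_tokens` as a top-level request field is an assumption about the server's extra sampling parameters that should be checked against the server version in use.

```python
import json
import urllib.request


def build_chat_request(question, documents, model="occ-ai/OCC-RAG-0.6B"):
    """Build the JSON body for /v1/chat/completions with sources attached."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        # Forwarded to the chat template, which wraps the query and sources.
        "chat_template_kwargs": {"documents": documents},
        # Assumed server-side extra parameter: keep structural tokens in the output.
        "skip_special_tokens": False,
        # Temperature 0 requests greedy decoding.
        "temperature": 0,
    }


def post_chat(payload, url="http://localhost:8000/v1/chat/completions"):
    """Send the request and return the raw assistant message text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]


payload = build_chat_request(
    "Which country is Alexander Graham Bell buried in?",
    [{"text": "Bell was buried near Baddeck, Nova Scotia."},
     {"text": "Nova Scotia is a province on the east coast of Canada."}],
)
print(json.dumps(payload, indent=2))
# With a server running: print(post_chat(payload))
```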

	## Limitations

	- Context-grounded only. The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
	- Reasoning depth. Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

	## Citation

	If you find our work helpful, please cite it.

	```bibtex
	@misc{savkin2026occragoptimalcognitivecore,
	title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
	author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
	year = {2026},
	eprint = {2606.00683},
	archivePrefix = {arXiv},
	primaryClass = {cs.CL},
	url = {https://arxiv.org/abs/2606.00683}
	}
	```