---
license: mit
language:
- en
- ru
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-0.6B-Base
tags:
- rag
- faithful-qa
- occ
---
# OCC-RAG-0.6B
<p align="center">
<img src="figures/occ.png" alt="OCC-RAG" width="320"/>
</p>
<p align="center">
<a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a> |
<a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a> |
<a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
</p>
**OCC-RAG-0.6B** is a 0.6B-parameter small language model specialized for **faithful, context-grounded question answering**. Along with OCC-RAG-1.7B, it belongs to the first generation of **Optimal Cognitive Core (OCC)** specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.
Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models **2–6× larger** on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.
## Highlights
- **Faithful by design** — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- **Calibrated abstention** — outputs `Not enough information` when the context does not support an answer.
- **Structured, citable reasoning** — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- **Compact** — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.
## Model overview
OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of **~3.25M QA pairs** (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
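As a quick arithmetic check on the corpus composition, the sketch below computes each subset's raw share from the rounded figures quoted above. The oversampling weights for the multi-hop and multi-context subsets are not stated here, so only the pre-oversampling shares are shown; the dictionary keys and the `raw_shares` helper are illustrative names, not part of the released code.

```python
# Approximate subset sizes as quoted in the model card (rounded figures).
subset_sizes = {
    "single-hop": 2_780_000,
    "multi-hop single-context": 262_000,
    "multi-hop multi-context": 165_000,
    "abstain": 43_000,
}


def raw_shares(sizes):
    """Return each subset's fraction of the total, before any oversampling."""
    n = sum(sizes.values())
    return {name: count / n for name, count in sizes.items()}


total = sum(subset_sizes.values())
shares = raw_shares(subset_sizes)

# Print one row per subset: name, example count, share of the corpus.
for name, share in shares.items():
    print(f"{name:28s} {subset_sizes[name]:>9,d}  {share:6.1%}")
print(f"{'total':28s} {total:>9,d}")
```

The rounded subset sizes add up to the quoted ~3.25M total, with single-hop examples making up the large majority before oversampling.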
## Evaluation
The model is evaluated on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
| Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
| **OCC-RAG-0.6B** | **57.6** | **36.6** | **75.0** | **79.9** | **5.2** | **86.9** |
OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B (the latter is not listed in the table above) on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.
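The metric definitions above can be sketched as follows. This is not the evaluation code behind the reported numbers: the normalization (lowercasing and trimming punctuation at token edges), the function names, and the reading of R-Acc as "share of predictions on unanswerable questions that contain the refusal phrase" are assumptions made for illustration. The memorization ratio is left out because its formula is not given here.

```python
import string
from collections import Counter


def tokens(text):
    """Lowercase, split on whitespace, trim punctuation at token edges (an assumption)."""
    stripped = (tok.strip(string.punctuation) for tok in text.lower().split())
    return [tok for tok in stripped if tok]


def normalize(text):
    return " ".join(tokens(text))


def in_acc(prediction, gold):
    """1.0 if the normalized gold answer is a substring of the normalized prediction."""
    gold_norm = normalize(gold)
    return float(bool(gold_norm) and gold_norm in normalize(prediction))


def token_f1(prediction, gold):
    """Token-level F1 between prediction and gold answer."""
    pred_tokens, gold_tokens = tokens(prediction), tokens(gold)
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def refusal_acc(predictions, refusal_phrase="Not enough information"):
    """Share of predictions (for unanswerable questions) that contain the refusal phrase."""
    if not predictions:
        return 0.0
    target = normalize(refusal_phrase)
    return sum(target in normalize(p) for p in predictions) / len(predictions)
```

Trimming punctuation only at token edges keeps decimal values such as `3.5` intact, which matters for numeric answers like those in TAT-QA.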
## Input / output format
OCC-RAG uses a **structured prompt format with special tokens**. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.
The response is split into five sections, each delimited by special tokens:
| Section | Tokens | Content |
|---|---|---|
| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
| Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
| Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |
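The five-section layout in the table above lends itself to a small parser. The sketch below is one possible way to split a raw response (decoded with `skip_special_tokens=False`) into its sections and to collect the cited source ids. `parse_response`, `cited_source_ids`, `is_answerable`, and the sample string are hypothetical names and data for illustration, and the code assumes citations appear literally as `<|source_id|>N` in the trace.

```python
import re

SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")


def parse_response(response):
    """Map each section name to its text, or None if the section is missing."""
    parsed = {}
    for name in SECTIONS:
        # Tolerate a missing end token (e.g. generation cut off by max_new_tokens).
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, response, re.DOTALL)
        parsed[name] = matches[-1].strip() if matches else None
    return parsed


def cited_source_ids(parsed):
    """Collect the distinct source ids cited in the analysis and reasoning sections."""
    text = " ".join(parsed.get(key) or "" for key in ("source_analysis", "reasoning"))
    return sorted({int(n) for n in re.findall(r"<\|source_id\|>(\d+)", text)})


def is_answerable(parsed):
    # Compare for equality: "UNANSWERABLE" contains "ANSWERABLE" as a substring.
    return parsed.get("status") == "ANSWERABLE"


# Hypothetical response, written by hand to match the documented layout.
sample = (
    "<|query_analysis_start|>Find where Bell is buried, then the country.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates Nova Scotia.<|source_analysis_end|>"
    "<|reasoning_start|>Bell was buried in Nova Scotia, which is in Canada.<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)

parsed = parse_response(sample)
print(parsed["status"], "|", parsed["answer"], "|", cited_source_ids(parsed))
```

Checking the status section with an equality test rather than a substring test avoids misreading an `UNANSWERABLE` verdict as answerable.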
## Quickstart (Transformers)
The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.
```python
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# Keep special tokens so the structural sections can be parsed from the output.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")  # -> Canada
```
> [!NOTE]
> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.
## Deployment
OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU from system RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.
When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:
```python
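from openai import OpenAI

# Assumes a local OpenAI-compatible server (e.g. vLLM on its default port 8000).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
# `question` and `documents` are defined as in the Quickstart above.
response = client.chat.completions.create(
    model="occ-ai/OCC-RAG-0.6B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)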
```
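For clients that send raw JSON instead of using the OpenAI SDK, the same request can be assembled by hand. The sketch below builds the payload only and sends nothing. `build_chat_request` is a hypothetical helper. Placing `chat_template_kwargs` at the top level mirrors how the SDK merges `extra_body` into the request body, while the top-level `skip_special_tokens` field and `temperature: 0.0` for greedy decoding are assumptions to verify against your server version.

```python
import json


def build_chat_request(model, question, documents, max_tokens=2048):
    """Build a chat-completions payload that forwards sources to the chat template."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": max_tokens,
        # Greedy decoding, matching the recommended default (assumed mapping).
        "temperature": 0.0,
        # Forwarded to the chat template, which emits the source tokens.
        "chat_template_kwargs": {"documents": documents},
        # Assumption: the server accepts this field and keeps structural tokens in the output.
        "skip_special_tokens": False,
    }


docs = [
    {"text": "Bell died near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]
payload = build_chat_request(
    "occ-ai/OCC-RAG-0.6B",
    "Which country is Alexander Graham Bell buried in?",
    docs,
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be posted to the server's `/v1/chat/completions` endpoint with any HTTP client.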
## Limitations
- **Context-grounded only.** The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
- **Reasoning depth.** Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.
## Citation
If you find our work helpful, please cite it:
```bibtex
@misc{savkin2026occragoptimalcognitivecore,
title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
year = {2026},
eprint = {2606.00683},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2606.00683}
}
```