---
license: mit
language:
- en
- ru
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-0.6B-Base
tags:
- rag
- faithful-qa
- occ
---

# OCC-RAG-0.6B

<p align="center">
  <img src="figures/occ.png" alt="OCC-RAG" width="320"/>
</p>

<p align="center">
  <a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a> &nbsp;|&nbsp;
  <a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a> &nbsp;|&nbsp;
  <a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
</p>

**OCC-RAG-0.6B** is a 0.6B-parameter small language model specialized for **faithful, context-grounded question answering**. Along with OCC-RAG-1.7B, it belongs to the first generation of **Optimal Cognitive Core (OCC)** specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.

Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models **2–6× larger** on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.

## Highlights

- **Faithful by design** — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- **Calibrated abstention** — outputs `Not enough information` when the context does not support an answer.
- **Structured, citable reasoning** — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- **Compact** — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.

## Model overview

OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of **~3.25M QA pairs** (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
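As a quick sanity check on the corpus composition, the subset sizes quoted above can be summed and turned into shares. This is only arithmetic on the rounded figures from the paragraph. The oversampling weights are not given here, so the shares describe the listed counts, not the effective training mix, and the dictionary keys are illustrative names.

```python
# Rounded subset sizes as quoted in the model overview (approximate figures).
counts = {
    "single_hop": 2_780_000,
    "multi_hop_single_context": 262_000,
    "multi_hop_multi_context": 165_000,
    "abstain": 43_000,
}

total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}

# Print each subset with its count and its share of the listed total.
for name, share in shares.items():
    print(f"{name:26s} {counts[name]:>9,d}  {share:6.1%}")
print(f"{'total':26s} {total:>9,d}")
```

The listed counts sum to 3.25M, with single-hop examples making up roughly 85% of them, which is consistent with the stated need to oversample the multi-hop and multi-context subsets.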

## Evaluation

The model is evaluated on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc counts a prediction as correct when the gold answer appears in it as a substring; F1 is the token-level overlap between prediction and gold answer; M_R is the memorization ratio (lower is more faithful); R-Acc is refusal accuracy.

| Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
| **OCC-RAG-0.6B** | **57.6** | **36.6** | **75.0** | **79.9** | **5.2** | **86.9** |

OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B (the latter is not listed in the table above) on every metric and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.
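For readers who want to reproduce the two answer-level metrics on their own data, here is a minimal sketch of In-Acc and token-level F1 as described in this section. The normalization step (lowercasing, removing punctuation, whitespace tokenization) is an assumption and may differ from the evaluation code behind the reported numbers. The function names are illustrative.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def in_acc(prediction: str, gold: str) -> float:
    """1.0 if the normalized gold answer is a substring of the normalized prediction."""
    gold_norm = normalize(gold)
    if not gold_norm:
        return 0.0  # an empty gold answer would otherwise match everything
    return float(gold_norm in normalize(prediction))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

In-Acc is lenient toward verbose answers because it only checks containment, whereas F1 penalizes extra tokens, so the two are best read together.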

## Input / output format

OCC-RAG uses a **structured prompt format with special tokens**. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.

The response is split into five sections, each delimited by special tokens:

| Section | Tokens | Content |
|---|---|---|
| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
| Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
| Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |
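The five-section layout above lends itself to a small parser. The sketch below splits a raw response (decoded with special tokens kept) into its sections, collects the cited source ids, and flags abstentions. The helper names are hypothetical, the sample string is hand-written for illustration and is not real model output, and the exact text the model places inside each section may differ.

```python
import re

SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")
REFUSAL = "Not enough information"  # refusal phrase named in the Highlights


def parse_response(raw: str) -> dict:
    """Return the text of each section, or None if the section is missing."""
    parsed = {}
    for name in SECTIONS:
        # Tolerate a missing end token, e.g. when generation was truncated.
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, raw, re.DOTALL)
        parsed[name] = matches[-1].strip() if matches else None
    return parsed


def cited_source_ids(text) -> list:
    """Collect the distinct N values from <|source_id|>N citations, sorted."""
    return sorted({int(n) for n in re.findall(r"<\|source_id\|>\s*(\d+)", text or "")})


def is_abstention(parsed: dict) -> bool:
    """True if the status is UNANSWERABLE or the answer contains the refusal phrase."""
    status = (parsed.get("status") or "").upper()
    answer = (parsed.get("answer") or "").lower()
    return "UNANSWERABLE" in status or REFUSAL.lower() in answer


# Hand-written illustrative response, not actual model output.
sample = (
    "<|query_analysis_start|>Find the country where Bell is buried.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 maps it to a country.<|source_analysis_end|>"
    "<|reasoning_start|>Bell was buried in Nova Scotia, which is in Canada.<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)
parsed = parse_response(sample)
print(parsed["answer"], cited_source_ids(parsed["source_analysis"]), is_abstention(parsed))
```

Checking both the status verdict and the answer text makes the abstention test robust to a response that is cut off before the answer section is emitted.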

## Quickstart (Transformers)

The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.

```python
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")   # -> Canada
```

> [!NOTE]
> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.

## Deployment

OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop machines that run inference on CPU from system RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.

When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:

```python
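from openai import OpenAI

# Assumption: a local OpenAI-compatible server; `question` and `documents` come from the Quickstart.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="occ-ai/OCC-RAG-0.6B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)
print(response.choices[0].message.content)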
```

## Limitations

- **Context-grounded only.** The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
- **Reasoning depth.** Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

## Citation

If you find our work helpful, please cite it:

```bibtex
@misc{savkin2026occragoptimalcognitivecore,
  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year          = {2026},
  eprint        = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.00683}
}
```