useitone andreuka18 committed
Commit 42e031d · 0 Parent(s)

Duplicate from occ-ai/OCC-RAG-0.6B

Co-authored-by: Andrey Galichin <andreuka18@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,36 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,159 @@
+ ---
+ license: mit
+ language:
+ - en
+ - ru
+ library_name: transformers
+ pipeline_tag: text-generation
+ base_model: Qwen/Qwen3-0.6B-Base
+ tags:
+ - rag
+ - faithful-qa
+ - occ
+ ---
+
+ # OCC-RAG-0.6B
+
+ <p align="center">
+ <img src="figures/occ.png" alt="OCC-RAG" width="320"/>
+ </p>
+
+ <p align="center">
+ <a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a> &nbsp;|&nbsp;
+ <a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a> &nbsp;|&nbsp;
+ <a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
+ </p>
+
+ **OCC-RAG-0.6B** is a 0.6B-parameter small language model specialized for **faithful, context-grounded question answering**. Along with OCC-RAG-1.7B, it belongs to the first generation of **Optimal Cognitive Core (OCC)** specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.
+
+ Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models **2–6× larger** on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.
+
+ ## Highlights
+
+ - **Faithful by design**: answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
+ - **Calibrated abstention**: outputs `Not enough information` when the context does not support an answer.
+ - **Structured, citable reasoning**: every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
+ - **Compact**: a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.
+
+ ## Model overview
+
+ OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of **~3.25M QA pairs** (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
+
+ ## Evaluation
+
+ Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
+
+ | Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
+ |---|---|---|---|---|---|---|
+ | gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
+ | Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
+ | Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
+ | Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
+ | **OCC-RAG-0.6B** | **57.6** | **36.6** | **75.0** | **79.9** | **5.2** | **86.9** |
+
+ OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B (the latter is not listed in the table above) on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.
+
+ ## Input / output format
+
+ OCC-RAG uses a **structured prompt format with special tokens**. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.
+
+ The response is split into five sections, each delimited by special tokens:
+
+ | Section | Tokens | Content |
+ |---|---|---|
+ | Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
+ | Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
+ | Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
+ | Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
+ | Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |
+
71
+
72
+ The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.
73
+
74
+ ```python
75
+ import re
76
+ from transformers import AutoModelForCausalLM, AutoTokenizer
77
+
78
+ MODEL = "occ-ai/OCC-RAG-0.6B"
79
+
80
+ tokenizer = AutoTokenizer.from_pretrained(MODEL)
81
+ model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")
82
+
83
+ question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
84
+ documents = [
85
+ {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
86
+ {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
87
+ {"text": "Nova Scotia is a province on the east coast of Canada."},
88
+ ]
89
+
90
+ text = tokenizer.apply_chat_template(
91
+ [{"role": "user", "content": question}],
92
+ documents=documents,
93
+ tokenize=False,
94
+ add_generation_prompt=True,
95
+ enable_thinking=False,
96
+ )
97
+
98
+ # Alternative: assemble the structural tokens yourself.
99
+ #
100
+ # query_start, query_end = "<|query_start|>", "<|query_end|>"
101
+ # source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
102
+ #
103
+ # def build_user_content(question, sources):
104
+ # content = f"{query_start}{question}{query_end}\n"
105
+ # for i, s in enumerate(sources, start=1):
106
+ # content += f"{source_start}{source_id}{i} {s}{source_end}\n"
107
+ # return content
108
+ #
109
+ # messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
110
+ # text = tokenizer.apply_chat_template(
111
+ # messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
112
+ # )
113
+
114
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
115
+ outputs = model.generate(**inputs, max_new_tokens=2048)
116
+ response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
117
+ print(response)
118
+
119
+ m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
120
+ print("Answer:", m[-1].strip() if m else "") # -> Canada
121
+ ```
122
+
123
+ > [!NOTE]
124
+ > We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.
125
+
+ ## Deployment
+
+ OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be readily deployed in constrained infrastructure, including desktop systems running on CPU RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.
+
+ When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:
+
+ ```python
+ client.chat.completions.create(
+     model="occ-ai/OCC-RAG-0.6B",
+     messages=[{"role": "user", "content": question}],
+     extra_body={"chat_template_kwargs": {"documents": documents}},
+ )
+ ```
+
+ ## Limitations
+
+ - **Context-grounded only.** The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
+ - **Reasoning depth.** Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.
+
+ ## Citation
+
+ If you find our work helpful, please cite it.
+
+ ```bibtex
+ @misc{savkin2026occragoptimalcognitivecore,
+   title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
+   author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
+   year = {2026},
+   eprint = {2606.00683},
+   archivePrefix = {arXiv},
+   primaryClass = {cs.CL},
+   url = {https://arxiv.org/abs/2606.00683}
+ }
+ ```
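The five-section response format described in the README can be split back into its parts with a few string searches. The sketch below is an illustration, not code from the repository: `parse_response`, `cited_sources`, and the sample completion are hypothetical. It treats the opening `<|query_analysis_start|>` token as optional, because the generation prompt in `chat_template.jinja` already ends with that token, so the decoded continuation may begin inside the first section.

```python
import re

# Section names taken from the README's response-format table.
SECTIONS = ["query_analysis", "source_analysis", "reasoning", "status", "answer"]


def parse_response(response):
    """Split a raw completion into its delimited sections (None if a section is absent)."""
    parsed = {}
    for i, name in enumerate(SECTIONS):
        start_tok, end_tok = f"<|{name}_start|>", f"<|{name}_end|>"
        start = response.find(start_tok)
        if start != -1:
            begin = start + len(start_tok)
        elif i == 0 and end_tok in response:
            begin = 0  # first section was opened by the generation prompt
        else:
            parsed[name] = None
            continue
        end = response.find(end_tok, begin)
        # A truncated generation may lack the end token; keep what is there.
        body = response[begin:end] if end != -1 else response[begin:]
        parsed[name] = body.strip()
    return parsed


def cited_sources(parsed):
    """Return the source ids cited in the source-analysis section, in order."""
    text = parsed.get("source_analysis") or ""
    return [int(n) for n in re.findall(r"<\|source_id\|>(\d+)", text)]


# Hypothetical completion, written by hand to exercise the parser.
sample = (
    "\nFind the country where Bell is buried.<|query_analysis_end|>\n"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates Nova Scotia.<|source_analysis_end|>\n"
    "<|reasoning_start|>Bell was buried near Baddeck, Nova Scotia, "
    "which is in Canada.<|reasoning_end|>\n"
    "<|status_start|>ANSWERABLE<|status_end|>\n"
    "<|answer_start|>Canada<|answer_end|>"
)

parsed = parse_response(sample)
print(parsed["status"], "|", parsed["answer"], "|", cited_sources(parsed))
```

Returning `None` for a missing section, rather than an empty string, lets a caller tell an absent section apart from one the model left empty.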
chat_template.jinja ADDED
@@ -0,0 +1,20 @@
+ {%- for message in messages -%}
+ {%- if message['role'] == 'system' -%}
+ {{ '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' }}
+ {%- elif message['role'] == 'user' -%}
+ {%- if documents and loop.last -%}
+ {{ '<|im_start|>user\n<|query_start|>' + message['content'] + '<|query_end|>\n' }}
+ {%- for doc in documents -%}
+ {{ '<|source_start|><|source_id|>' + (loop.index | string) + ' ' + doc['text'] + '<|source_end|>\n' }}
+ {%- endfor -%}
+ {{ '<|im_end|>\n' }}
+ {%- else -%}
+ {{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}
+ {%- endif -%}
+ {%- elif message['role'] == 'assistant' -%}
+ {{ '<|im_start|>assistant\n<think>\n\n</think>\n\n' + message['content'] + '<|im_end|>\n' }}
+ {%- endif -%}
+ {%- endfor -%}
+ {%- if add_generation_prompt -%}
+ {{ '<|im_start|>assistant\n<think>\n\n</think>\n\n<|query_analysis_start|>\n' }}
+ {%- endif -%}
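To make the template's output concrete, the following sketch mirrors it in plain Python for the common case of a single user turn followed by the generation prompt. It is a hand-written approximation, not the template itself: `render_prompt` is a hypothetical helper, and it assumes that the `{%- ... -%}` whitespace control strips everything between tags, so the emitted strings are simply concatenated.

```python
def render_prompt(question, documents):
    """Approximate the chat template for one user message plus the generation prompt."""
    if documents:
        parts = ["<|im_start|>user\n<|query_start|>" + question + "<|query_end|>\n"]
        # loop.index in Jinja is 1-based, so source ids start at 1.
        for index, doc in enumerate(documents, start=1):
            parts.append(
                "<|source_start|><|source_id|>" + str(index) + " " + doc["text"] + "<|source_end|>\n"
            )
        parts.append("<|im_end|>\n")
    else:
        # Without documents the template falls back to a plain user turn.
        parts = ["<|im_start|>user\n" + question + "<|im_end|>\n"]
    # Generation prompt: empty think block, then the first response section is opened.
    parts.append("<|im_start|>assistant\n<think>\n\n</think>\n\n<|query_analysis_start|>\n")
    return "".join(parts)


prompt = render_prompt("Q?", [{"text": "A"}, {"text": "B"}])
print(prompt)
```

Because the prompt already opens the query-analysis section, a completion decoded from the new tokens alone would not be expected to contain `<|query_analysis_start|>`.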
config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "architectures": [
+     "Qwen3ForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": null,
+   "dtype": "bfloat16",
+   "eos_token_id": 151643,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_types": [
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention",
+     "full_attention"
+   ],
+   "max_position_embeddings": 32768,
+   "max_window_layers": 28,
+   "model_type": "qwen3",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 28,
+   "num_key_value_heads": 8,
+   "pad_token_id": 151643,
+   "rms_norm_eps": 1e-06,
+   "rope_parameters": {
+     "rope_theta": 1000000,
+     "rope_type": "default"
+   },
+   "sliding_window": null,
+   "tie_word_embeddings": true,
+   "transformers_version": "5.5.4",
+   "use_cache": true,
+   "use_sliding_window": false,
+   "vocab_size": 151936
+ }
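The values in `config.json` are enough for a back-of-the-envelope estimate of the parameter count and the KV-cache footprint. The sketch below is an estimate under stated assumptions, not an official breakdown: it assumes the usual Qwen3 decoder layout (bias-free q/k/v/o projections, per-head RMSNorm on queries and keys, a gated MLP with three projections, two RMSNorms per layer, and a final norm), and `count_parameters` and `kv_cache_bytes_per_token` are hypothetical helpers.

```python
# Values copied from config.json.
cfg = {
    "hidden_size": 1024,
    "intermediate_size": 3072,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
    "head_dim": 128,
    "vocab_size": 151936,
}


def count_parameters(cfg):
    """Estimate the parameter count, assuming the layer layout described above."""
    h = cfg["hidden_size"]
    q_dim = cfg["num_attention_heads"] * cfg["head_dim"]
    kv_dim = cfg["num_key_value_heads"] * cfg["head_dim"]
    attention = h * q_dim + 2 * h * kv_dim + q_dim * h  # q, k, v, o (attention_bias is false)
    qk_norm = 2 * cfg["head_dim"]                       # assumed RMSNorm on queries and keys
    mlp = 3 * h * cfg["intermediate_size"]              # assumed gate, up, and down projections
    layer_norms = 2 * h                                 # pre-attention and pre-MLP RMSNorm
    per_layer = attention + qk_norm + mlp + layer_norms
    embeddings = cfg["vocab_size"] * h                  # counted once: tie_word_embeddings is true
    return embeddings + cfg["num_hidden_layers"] * per_layer + h  # + final norm


def kv_cache_bytes_per_token(cfg, bytes_per_value=2):
    """Keys plus values across all layers for one token (2 bytes per value for bfloat16)."""
    return 2 * cfg["num_hidden_layers"] * cfg["num_key_value_heads"] * cfg["head_dim"] * bytes_per_value


params = count_parameters(cfg)
print(f"parameters: {params:,}")
print(f"bfloat16 weights: {params * 2:,} bytes")
print(f"KV cache per token: {kv_cache_bytes_per_token(cfg):,} bytes")
```

Under these assumptions the estimate comes to 596,049,920 parameters, or 1,192,099,840 bytes in bfloat16, which is within about 35 KB of the 1,192,135,096-byte `model.safetensors` recorded in this commit. The KV cache works out to 114,688 bytes (112 KiB) per token, or roughly 3.5 GiB at the full 32,768-token `max_position_embeddings`.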
figures/github-mark.png ADDED
figures/occ.png ADDED
generation_config.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "do_sample": false,
+   "temperature": 0.0,
+   "eos_token_id": [
+     151643,
+     151645,
+     151683
+   ],
+   "max_new_tokens": 2048,
+   "pad_token_id": 151643,
+   "transformers_version": "5.5.4"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f8f1d583afd08756cc40273d9c63d63580000852e47aa64d535bc77c872533ee
+ size 1192135096
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:672e331460a05e2ea9888810a7a37f0c775429fe05fddc6330ee0dc9147a1370
+ size 11425566
tokenizer_config.json ADDED
@@ -0,0 +1,44 @@
+ {
+   "add_prefix_space": false,
+   "backend": "tokenizers",
+   "bos_token": null,
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "<|endoftext|>",
+   "errors": "replace",
+   "extra_special_tokens": [
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|object_ref_start|>",
+     "<|object_ref_end|>",
+     "<|box_start|>",
+     "<|box_end|>",
+     "<|quad_start|>",
+     "<|quad_end|>",
+     "<|vision_start|>",
+     "<|vision_end|>",
+     "<|vision_pad|>",
+     "<|image_pad|>",
+     "<|video_pad|>",
+     "<|query_start|>",
+     "<|query_end|>",
+     "<|source_start|>",
+     "<|source_end|>",
+     "<|source_id|>",
+     "<|query_analysis_start|>",
+     "<|query_analysis_end|>",
+     "<|source_analysis_start|>",
+     "<|source_analysis_end|>",
+     "<|reasoning_start|>",
+     "<|reasoning_end|>",
+     "<|status_start|>",
+     "<|status_end|>",
+     "<|answer_start|>",
+     "<|answer_end|>"
+   ],
+   "is_local": false,
+   "model_max_length": 131072,
+   "pad_token": "<|endoftext|>",
+   "split_special_tokens": false,
+   "tokenizer_class": "Qwen2Tokenizer",
+   "unk_token": null
+ }