useitone

andreuka18 committed about 16 hours ago

Commit

42e031d

0 Parent(s):

Duplicate from occ-ai/OCC-RAG-0.6B


Co-authored-by: Andrey Galichin <andreuka18@users.noreply.huggingface.co>

Files changed (10)

.gitattributes +36 -0
README.md +159 -0
chat_template.jinja +20 -0
config.json +63 -0
figures/github-mark.png +0 -0
figures/occ.png +0 -0
generation_config.json +12 -0
model.safetensors +3 -0
tokenizer.json +3 -0
tokenizer_config.json +44 -0

.gitattributes ADDED Viewed

	@@ -0,0 +1,36 @@

+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED Viewed

	@@ -0,0 +1,159 @@

+---
+license: mit
+language:
+- en
+- ru
+library_name: transformers
+pipeline_tag: text-generation
+base_model: Qwen/Qwen3-0.6B-Base
+tags:
+- rag
+- faithful-qa
+- occ
+---
+# OCC-RAG-0.6B
+<p align="center">
+  <img src="figures/occ.png" alt="OCC-RAG" width="320"/>
+</p>
+<p align="center">
+  <a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a> &nbsp;|&nbsp;
+  <a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a> &nbsp;|&nbsp;
+  <a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
+</p>
+**OCC-RAG-0.6B** is a 0.6B-parameter small language model specialized for **faithful, context-grounded question answering**. Along with OCC-RAG-1.7B, it belongs to the first generation of **Optimal Cognitive Core (OCC)** specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.
+Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models **2–6× larger** on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.
+## Highlights
+- **Faithful by design** — answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
+- **Calibrated abstention** — outputs `Not enough information` when the context does not support an answer.
+- **Structured, citable reasoning** — every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
+- **Compact** — a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.
+## Model overview
+OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of **~3.25M QA pairs** (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.
+## Evaluation
+Evaluated across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.
+| Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
+|---|---|---|---|---|---|---|
+| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
+| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
+| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
+| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
+| **OCC-RAG-0.6B** | **57.6** | **36.6** | **75.0** | **79.9** | **5.2** | **86.9** |
+OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B (the latter is not listed in the table above) on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.
+## Input / output format
+OCC-RAG uses a **structured prompt format with special tokens**. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.
+The response is split into five sections, each delimited by special tokens:
+| Section | Tokens | Content |
+|---|---|---|
+| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
+| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
+| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
+| Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
+| Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |
+## Quickstart (Transformers)
+The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically — pass the user message as plain text and the sources as a list of dicts.
+```python
+import re
+from transformers import AutoModelForCausalLM, AutoTokenizer
+MODEL = "occ-ai/OCC-RAG-0.6B"
+tokenizer = AutoTokenizer.from_pretrained(MODEL)
+model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")
+question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
+documents = [
+    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
+    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
+    {"text": "Nova Scotia is a province on the east coast of Canada."},
+]
+text = tokenizer.apply_chat_template(
+    [{"role": "user", "content": question}],
+    documents=documents,
+    tokenize=False,
+    add_generation_prompt=True,
+    enable_thinking=False,
+)
+# Alternative: assemble the structural tokens yourself.
+#
+# query_start, query_end = "<|query_start|>", "<|query_end|>"
+# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
+#
+# def build_user_content(question, sources):
+#     content = f"{query_start}{question}{query_end}\n"
+#     for i, s in enumerate(sources, start=1):
+#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
+#     return content
+#
+# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
+# text = tokenizer.apply_chat_template(
+#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
+# )
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+outputs = model.generate(**inputs, max_new_tokens=2048)
+response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
+print(response)
+m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
+print("Answer:", m[-1].strip() if m else "")   # -> Canada
+```
+> [!NOTE]
+> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.
+## Deployment
+OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU from system RAM. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.
+When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:
+```python
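+from openai import OpenAI
+# Assumes a local OpenAI-compatible server (vLLM's default port shown); adjust base_url as needed.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
+# `question` and `documents` are the variables defined in the Quickstart above.
+response = client.chat.completions.create(
+    model="occ-ai/OCC-RAG-0.6B",
+    messages=[{"role": "user", "content": question}],
+    extra_body={"chat_template_kwargs": {"documents": documents}},
+)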
+```
+## Limitations
+- **Context-grounded only.** The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
+- **Reasoning depth.** Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.
+## Citation
+If you find our work helpful, please cite it.
+```bibtex
+@misc{savkin2026occragoptimalcognitivecore,
+  title         = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
+  author        = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
+  year          = {2026},
+  eprint        = {2606.00683},
+  archivePrefix = {arXiv},
+  primaryClass  = {cs.CL},
+  url           = {https://arxiv.org/abs/2606.00683}
+}
+```
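
The README describes a five-section response but only shows how to extract the answer. The sketch below is one way to split a raw response into all five sections. `parse_response` and the sample string are hypothetical: the sample is hand-written for illustration and is not real model output. It assumes the response was decoded with `skip_special_tokens=False`. It also allows for a missing opening `<|query_analysis_start|>`, because the generation prompt in `chat_template.jinja` already emits that token.

```python
import re

# Section names as listed in the README's response-format table.
SECTIONS = ["query_analysis", "source_analysis", "reasoning", "status", "answer"]


def parse_response(response: str) -> dict:
    """Split a raw response into its delimited sections (hypothetical helper).

    Sections that never open map to None.
    """
    # The generation prompt already ends with <|query_analysis_start|>,
    # so the decoded continuation may not contain it.
    if "<|query_analysis_start|>" not in response:
        response = "<|query_analysis_start|>" + response
    parsed = {}
    for name in SECTIONS:
        start = re.escape(f"<|{name}_start|>")
        end = re.escape(f"<|{name}_end|>")
        # If the end token is missing, capture up to the end of the string.
        match = re.search(rf"{start}(.*?)(?:{end}|\Z)", response, re.DOTALL)
        parsed[name] = match.group(1).strip() if match else None
    return parsed


# Hand-written sample, not real model output.
sample = (
    "\nFind where Bell is buried.<|query_analysis_end|>\n"
    "<|source_analysis_start|>\n<|source_id|>2 gives the burial place.<|source_analysis_end|>\n"
    "<|reasoning_start|>\nBaddeck is in Nova Scotia, which is in Canada.<|reasoning_end|>\n"
    "<|status_start|>\nANSWERABLE<|status_end|>\n"
    "<|answer_start|>\nCanada"
)
result = parse_response(sample)
print(result["status"], "->", result["answer"])
```

Checking `status` before using `answer` lets a caller treat `UNANSWERABLE` as an abstention without having to match the refusal phrase.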

chat_template.jinja ADDED Viewed

	@@ -0,0 +1,20 @@

+{%- for message in messages -%}
+    {%- if message['role'] == 'system' -%}
+        {{ '<|im_start|>system\n' + message['content'] + '<|im_end|>\n' }}
+    {%- elif message['role'] == 'user' -%}
+        {%- if documents and loop.last -%}
+            {{ '<|im_start|>user\n<|query_start|>' + message['content'] + '<|query_end|>\n' }}
+            {%- for doc in documents -%}
+                {{ '<|source_start|><|source_id|>' + (loop.index | string) + ' ' + doc['text'] + '<|source_end|>\n' }}
+            {%- endfor -%}
+            {{ '<|im_end|>\n' }}
+        {%- else -%}
+            {{ '<|im_start|>user\n' + message['content'] + '<|im_end|>\n' }}
+        {%- endif -%}
+    {%- elif message['role'] == 'assistant' -%}
+        {{ '<|im_start|>assistant\n<think>\n\n</think>\n\n' + message['content'] + '<|im_end|>\n' }}
+    {%- endif -%}
+{%- endfor -%}
+{%- if add_generation_prompt -%}
+    {{ '<|im_start|>assistant\n<think>\n\n</think>\n\n<|query_analysis_start|>\n' }}
+{%- endif -%}
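
To make the template's output concrete, here is a hand-written Python mirror of the Jinja logic above. `render_prompt` is a hypothetical function for illustration only; Transformers renders the real template with Jinja. It assumes each message has `role` and `content` keys and each document has a `text` key, as the template does.

```python
def render_prompt(messages, documents=None, add_generation_prompt=True):
    """Illustrative mirror of chat_template.jinja (not used by Transformers)."""
    out = []
    for i, message in enumerate(messages):
        role, content = message["role"], message["content"]
        is_last = i == len(messages) - 1  # plays the role of loop.last
        if role == "system":
            out.append(f"<|im_start|>system\n{content}<|im_end|>\n")
        elif role == "user":
            if documents and is_last:
                # Sources attach only to the final message and are numbered from 1.
                out.append(f"<|im_start|>user\n<|query_start|>{content}<|query_end|>\n")
                for n, doc in enumerate(documents, start=1):
                    out.append(f"<|source_start|><|source_id|>{n} {doc['text']}<|source_end|>\n")
                out.append("<|im_end|>\n")
            else:
                out.append(f"<|im_start|>user\n{content}<|im_end|>\n")
        elif role == "assistant":
            # Past assistant turns are prefixed with an empty <think> block.
            out.append(f"<|im_start|>assistant\n<think>\n\n</think>\n\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # The prompt itself opens the first response section.
        out.append("<|im_start|>assistant\n<think>\n\n</think>\n\n<|query_analysis_start|>\n")
    return "".join(out)


prompt = render_prompt(
    [{"role": "user", "content": "Q?"}],
    documents=[{"text": "A."}, {"text": "B."}],
)
print(prompt)
```

The mirror makes two behaviors easy to see. Without `documents`, the user message is not wrapped in `<|query_start|>`/`<|query_end|>`. With `add_generation_prompt=True`, the model's continuation starts inside the query-analysis section.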

config.json ADDED Viewed

	@@ -0,0 +1,63 @@

+{
+  "architectures": [
+    "Qwen3ForCausalLM"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "bos_token_id": null,
+  "dtype": "bfloat16",
+  "eos_token_id": 151643,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 1024,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_types": [
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention",
+    "full_attention"
+  ],
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen3",
+  "num_attention_heads": 16,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 8,
+  "pad_token_id": 151643,
+  "rms_norm_eps": 1e-06,
+  "rope_parameters": {
+    "rope_theta": 1000000,
+    "rope_type": "default"
+  },
+  "sliding_window": null,
+  "tie_word_embeddings": true,
+  "transformers_version": "5.5.4",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "vocab_size": 151936
+}
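
As a sanity check on the "0.6B" label, the parameter count can be estimated from the fields above. This is a back-of-the-envelope sketch, not code from the repository. It assumes the usual Qwen3 layer layout: bias-free q/k/v/o projections, per-head q/k RMSNorm weights of size `head_dim`, a gated MLP with three projections, and two RMSNorm weights per layer. `tie_word_embeddings: true` means the embedding matrix is counted once.

```python
# Values copied from config.json above.
cfg = {
    "vocab_size": 151936, "hidden_size": 1024, "intermediate_size": 3072,
    "num_hidden_layers": 28, "num_attention_heads": 16,
    "num_key_value_heads": 8, "head_dim": 128,
}


def count_parameters(c):
    h, d = c["hidden_size"], c["head_dim"]
    q_out = c["num_attention_heads"] * d            # query/output projection width
    kv_out = c["num_key_value_heads"] * d           # key/value projection width (GQA)
    embed = c["vocab_size"] * h                     # shared with the LM head (tied)
    attn = h * q_out + 2 * h * kv_out + q_out * h   # q, k, v, o projections, no bias
    qk_norm = 2 * d                                 # assumed per-head q/k RMSNorm weights
    mlp = 3 * h * c["intermediate_size"]            # gate, up, down projections
    norms = 2 * h                                   # two RMSNorm weights per layer
    per_layer = attn + qk_norm + mlp + norms
    return embed + c["num_hidden_layers"] * per_layer + h  # + final norm


total = count_parameters(cfg)
print(f"{total:,} parameters")             # 596,049,920
print(f"{2 * total:,} bytes in bfloat16")  # 1,192,099,840
```

At two bytes per bfloat16 weight, this estimate is 35,256 bytes below the 1,192,135,096-byte `model.safetensors` size recorded in this commit. A gap that small would be consistent with file header overhead.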

figures/github-mark.png ADDED Viewed

figures/occ.png ADDED Viewed

generation_config.json ADDED Viewed

	@@ -0,0 +1,12 @@

+{
+  "do_sample": false,
+  "temperature": 0.0,
+  "eos_token_id": [
+    151643,
+    151645,
+    151683
+  ],
+  "max_new_tokens": 2048,
+  "pad_token_id": 151643,
+  "transformers_version": "5.5.4"
+}
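
`eos_token_id` is a list here, so generation stops as soon as any of the three ids is produced. In a batch, sequences that finish early are then filled with `pad_token_id`. A minimal sketch of trimming a generated id sequence at its first stop id follows. `truncate_at_eos` is a hypothetical helper, and the example ids other than the stop ids are made up.

```python
# Stop ids copied from generation_config.json above.
EOS_TOKEN_IDS = frozenset({151643, 151645, 151683})


def truncate_at_eos(token_ids, eos_ids=EOS_TOKEN_IDS, keep_eos=False):
    """Return token_ids up to the first stop id (hypothetical helper)."""
    for i, tok in enumerate(token_ids):
        if tok in eos_ids:
            return token_ids[: i + 1] if keep_eos else token_ids[:i]
    return token_ids  # no stop id found, e.g. max_new_tokens was reached


generated = [101, 102, 151683, 151643, 151643]  # made-up ids with trailing padding
print(truncate_at_eos(generated))                 # [101, 102]
print(truncate_at_eos(generated, keep_eos=True))  # [101, 102, 151683]
```

Use `keep_eos=True` when the stop token itself needs to survive decoding, for example when a downstream parser expects a closing structural token.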

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f8f1d583afd08756cc40273d9c63d63580000852e47aa64d535bc77c872533ee
+size 1192135096

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:672e331460a05e2ea9888810a7a37f0c775429fe05fddc6330ee0dc9147a1370
+size 11425566

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,44 @@

+{
+  "add_prefix_space": false,
+  "backend": "tokenizers",
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|endoftext|>",
+  "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<|query_start|>",
+    "<|query_end|>",
+    "<|source_start|>",
+    "<|source_end|>",
+    "<|source_id|>",
+    "<|query_analysis_start|>",
+    "<|query_analysis_end|>",
+    "<|source_analysis_start|>",
+    "<|source_analysis_end|>",
+    "<|reasoning_start|>",
+    "<|reasoning_end|>",
+    "<|status_start|>",
+    "<|status_end|>",
+    "<|answer_start|>",
+    "<|answer_end|>"
+  ],
+  "is_local": false,
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null
+}