---
license: mit
language:
- en
- ru
library_name: transformers
pipeline_tag: text-generation
base_model: Qwen/Qwen3-0.6B-Base
tags:
- rag
- faithful-qa
- occ
---

# OCC-RAG-0.6B

<p align="center">
<img src="figures/occ.png" alt="OCC-RAG" width="320"/>
</p>

<p align="center">
<a href="https://github.com/optimal-cognitive-core/OCC-RAG"><b>GitHub</b></a> |
<a href="https://arxiv.org/abs/2606.00683"><b>Technical Report</b></a> |
<a href="https://cloud.ru/products/evolution-ml-inference"><b>Cloud</b></a>
</p>

**OCC-RAG-0.6B** is a 0.6B-parameter small language model specialized for **faithful, context-grounded question answering**. Along with OCC-RAG-1.7B, it belongs to the first generation of **Optimal Cognitive Core (OCC)** specialized reasoning models. Given a question and a set of sources, it produces a structured reasoning trace with explicit source citations, decides whether the context actually supports an answer, and either answers from the context or abstains.

Despite its size, OCC-RAG-0.6B matches or exceeds general-purpose models **2–6× larger** on multi-hop reasoning, faithfulness, and refusal benchmarks. It is mid-trained from `Qwen/Qwen3-0.6B-Base` on a large synthetic corpus of multi-context, multi-hop QA with citation-anchored reasoning traces.

## Highlights

- **Faithful by design**: answers only from the supplied context; achieves the best faithfulness (lowest memorization ratio) across all evaluated scales, including 32B models.
- **Calibrated abstention**: outputs `Not enough information` when the context does not support an answer.
- **Structured, citable reasoning**: every answer comes with a transparent trace (query analysis → source analysis → reasoning → status → answer) that cites sources by id.
- **Compact**: a small model that delivers chain-of-thought-level transparency at a fraction of full thinking-mode inference cost.

## Model overview

OCC-RAG-0.6B is mid-trained from `Qwen/Qwen3-0.6B-Base` via supervised fine-tuning on a synthetic corpus of **~3.25M QA pairs** (~2.78M single-hop, ~262k multi-hop single-context, ~165k multi-hop multi-context, and ~43k abstain examples), distilled from a larger teacher with citation-anchored reasoning traces. Multi-hop and multi-context subsets are oversampled to emphasize compositional reasoning. The prompt/response format is identical at training and inference time, so no train–test mismatch is introduced.

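
The corpus composition quoted above can be checked with a few lines of arithmetic. This is only a sketch over the approximate subset sizes as stated; the oversampling factors are not given here, so the shares below describe the quoted counts, not the effective sampling mix used during training.

```python
# Approximate subset sizes quoted in the model overview (in QA pairs).
corpus = {
    "single-hop": 2_780_000,
    "multi-hop single-context": 262_000,
    "multi-hop multi-context": 165_000,
    "abstain": 43_000,
}

total = sum(corpus.values())
shares = {name: count / total for name, count in corpus.items()}

# Print each subset's share of the quoted total.
for name, share in shares.items():
    print(f"{name:<26} {share:6.1%}")
print(f"{'total':<26} {total:,}")
```

The subset counts add up to the quoted ~3.25M pairs, with single-hop examples making up roughly 86% of them.
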
## Evaluation

The model is evaluated on multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un). In-Acc = the gold answer appears as a substring of the prediction; F1 = token-level overlap between prediction and gold answer; M_R = memorization ratio (lower = more faithful); R-Acc = refusal accuracy.

| Model | HotpotQA<br>In-Acc | MuSiQue<br>In-Acc | TAT-QA<br>F1 | ConFiQA<br>In-Acc | ConFiQA<br>M_R ↓ | MuSiQue-Un<br>R-Acc |
|---|---|---|---|---|---|---|
| gemma-3-4b-it | 55.8 | 30.1 | 65.3 | 69.8 | 8.9 | 55.8 |
| Qwen3-1.7B (think) | 60.9 | 30.7 | 74.8 | 70.4 | 8.3 | 82.8 |
| Qwen3-4B (think) | 67.1 | 41.5 | 79.1 | 74.1 | 7.5 | 84.0 |
| Pleias-RAG-1.2B | 48.5 | 15.0 | 8.4 | 37.3 | 25.3 | 21.9 |
| **OCC-RAG-0.6B** | **57.6** | **36.6** | **75.0** | **79.9** | **5.2** | **86.9** |

OCC-RAG-0.6B exceeds Gemma-3-4B and SmolLM-3-3B on every dimension and attains the strongest faithfulness (highest ConFiQA In-Acc, lowest M_R) among all evaluated models.

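
The In-Acc and F1 definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the evaluation code behind the table: the `normalize` helper (lowercasing and punctuation stripping) and the function names are assumptions. M_R and R-Acc are left out because their exact computation is not spelled out here.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace.

    The exact normalization used for the reported numbers is an assumption.
    """
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def in_acc(prediction: str, gold: str) -> bool:
    """True if the normalized gold answer is a substring of the prediction."""
    return " ".join(normalize(gold)) in " ".join(normalize(prediction))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(in_acc("The answer is Canada.", "Canada"))
print(token_f1("Nova Scotia Canada", "Canada"))
```

In-Acc rewards any prediction that contains the gold string, while F1 also penalizes extra tokens, which is why a verbose but correct answer can score 1 on In-Acc and well below 1 on F1.
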
## Input / output format

OCC-RAG uses a **structured prompt format with special tokens**. The question is wrapped in `<|query_start|> … <|query_end|>` and each source in `<|source_start|><|source_id|>N … <|source_end|>`.

The response is split into five sections, each delimited by special tokens:

| Section | Tokens | Content |
|---|---|---|
| Query analysis | `<\|query_analysis_start\|> … <\|query_analysis_end\|>` | Decomposes the question into what must be found. |
| Source analysis | `<\|source_analysis_start\|> … <\|source_analysis_end\|>` | Assesses each source's relevance, citing by `<\|source_id\|>N`. |
| Reasoning | `<\|reasoning_start\|> … <\|reasoning_end\|>` | Composes evidence across sources into a multi-hop chain. |
| Status | `<\|status_start\|> … <\|status_end\|>` | `ANSWERABLE` / `UNANSWERABLE` verdict. |
| Answer | `<\|answer_start\|> … <\|answer_end\|>` | The final answer span, or the refusal phrase. |

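
The five-section layout above lends itself to a small parser. The sketch below is one possible way to split a raw response into its sections and collect the cited source ids; the names `parse_response` and `cited_source_ids` are hypothetical, and the `raw` string is hand-written for illustration, not actual model output.

```python
import re

# Section names taken from the table above, in response order.
SECTIONS = ("query_analysis", "source_analysis", "reasoning", "status", "answer")


def parse_response(raw: str) -> dict[str, str]:
    """Map each section name to its text; sections that never open are omitted."""
    parsed = {}
    for name in SECTIONS:
        # Tolerate a missing end token (e.g. generation cut off by max_new_tokens).
        pattern = rf"<\|{name}_start\|>(.*?)(?:<\|{name}_end\|>|\Z)"
        matches = re.findall(pattern, raw, re.DOTALL)
        if matches:
            parsed[name] = matches[-1].strip()
    return parsed


def cited_source_ids(text: str) -> list[int]:
    """Return the sorted, de-duplicated source ids cited in a section."""
    return sorted({int(n) for n in re.findall(r"<\|source_id\|>(\d+)", text)})


# Hand-written example response (illustrative only).
raw = (
    "<|query_analysis_start|>Find the country where Bell is buried.<|query_analysis_end|>"
    "<|source_analysis_start|><|source_id|>2 gives the burial place; "
    "<|source_id|>3 locates it.<|source_analysis_end|>"
    "<|reasoning_start|>Bell was buried near Baddeck, Nova Scotia, "
    "which is in Canada.<|reasoning_end|>"
    "<|status_start|>ANSWERABLE<|status_end|>"
    "<|answer_start|>Canada<|answer_end|>"
)

parsed = parse_response(raw)
print(parsed["status"], "|", parsed["answer"])
print(cited_source_ids(parsed["source_analysis"]))
```

Checking `parsed["status"]` before using `parsed["answer"]` gives downstream code an explicit signal for abstention instead of string-matching the refusal phrase.
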
## Quickstart (Transformers)

The chat template accepts a `documents=` kwarg and emits the structural tokens for the query and sources automatically; pass the user message as plain text and the sources as a list of dicts.

```python
import re

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "occ-ai/OCC-RAG-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype="auto", device_map="auto")

question = "Which country is the inventor of the telephone, Alexander Graham Bell, buried in?"
documents = [
    {"text": "Alexander Graham Bell was a Scottish-born inventor best known for patenting the first practical telephone."},
    {"text": "Bell died on August 2, 1922, at his estate Beinn Bhreagh, near Baddeck, Nova Scotia, and was buried there."},
    {"text": "Nova Scotia is a province on the east coast of Canada."},
]

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    documents=documents,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

# Alternative: assemble the structural tokens yourself.
#
# query_start, query_end = "<|query_start|>", "<|query_end|>"
# source_start, source_end, source_id = "<|source_start|>", "<|source_end|>", "<|source_id|>"
#
# def build_user_content(question, sources):
#     content = f"{query_start}{question}{query_end}\n"
#     for i, s in enumerate(sources, start=1):
#         content += f"{source_start}{source_id}{i} {s}{source_end}\n"
#     return content
#
# messages = [{"role": "user", "content": build_user_content(question, [d["text"] for d in documents])}]
# text = tokenizer.apply_chat_template(
#     messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
# )

inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048)

# Decode only the newly generated tokens and keep the structural tokens for parsing.
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)

m = re.findall(r"<\|answer_start\|>(.*?)(?:<\|answer_end\|>|\Z)", response, re.DOTALL)
print("Answer:", m[-1].strip() if m else "")  # -> Canada
```

> [!NOTE]
> We recommend greedy decoding (`do_sample=False`), which is the training/evaluation default and is baked into `generation_config.json`. Qwen3's default sampling parameters ([best practices](https://huggingface.co/Qwen/Qwen3-0.6B#best-practices)) also work fine.

## Deployment

OCC-RAG-0.6B is a standard Qwen3 causal LM and is compatible with vLLM, SGLang, and other Transformers-based serving stacks. With only 0.6B parameters, it can be deployed on constrained infrastructure, including desktop systems that run inference on CPU. When serving, keep `skip_special_tokens=False` if you need to parse the structural tokens out of the raw output.

When using an OpenAI-compatible server (vLLM ≥0.6, SGLang ≥0.4.7), the `documents=` kwarg is reachable from the client via `chat_template_kwargs`:

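```python
from openai import OpenAI

# Assumed local endpoint; point base_url at wherever your server listens.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# `question` and `documents` are the ones defined in the Quickstart.
response = client.chat.completions.create(
    model="occ-ai/OCC-RAG-0.6B",
    messages=[{"role": "user", "content": question}],
    extra_body={"chat_template_kwargs": {"documents": documents}},
)
print(response.choices[0].message.content)
```
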
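
For clients that do not use the OpenAI SDK, the same request can be assembled as plain JSON. The sketch below is a minimal example under stated assumptions: the helper name `build_chat_request` is hypothetical, and it assumes the server accepts `skip_special_tokens` as an extra top-level request field (vLLM's OpenAI-compatible server does; check the documentation of other servers). `temperature=0.0` is used here to request greedy decoding.

```python
import json


def build_chat_request(question, documents, model="occ-ai/OCC-RAG-0.6B"):
    """Build an OpenAI-style chat completion payload for OCC-RAG (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        # Forwarded to the chat template, which emits the source tokens.
        "chat_template_kwargs": {"documents": documents},
        # Assumption: the server honours this extra field and returns the
        # structural tokens so the response sections can be parsed.
        "skip_special_tokens": False,
        # Greedy decoding.
        "temperature": 0.0,
    }


payload = build_chat_request(
    "Which country is Alexander Graham Bell buried in?",
    [
        {"text": "Bell was buried at Beinn Bhreagh, near Baddeck, Nova Scotia."},
        {"text": "Nova Scotia is a province on the east coast of Canada."},
    ],
)
print(json.dumps(payload, indent=2))
```

The payload can then be sent as the body of a `POST` to the server's `/v1/chat/completions` route with a `Content-Type: application/json` header.
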
## Limitations

- **Context-grounded only.** The model is trained to answer from the supplied sources and to ignore parametric knowledge. It is not a general-purpose chat or knowledge model.
- **Reasoning depth.** Training and evaluation are capped at three-hop reasoning; longer chains are out of distribution.

## Citation

If you find our work helpful, please cite it.

```bibtex
@misc{savkin2026occragoptimalcognitivecore,
  title = {OCC-RAG: Optimal Cognitive Core for Faithful Question Answering},
  author = {Maksim Savkin and Mikhail Goncharov and Alexander Gambashidze and Alla Chepurova and Dmitrii Tarasov and Nikita Andriianov and Daria Pugacheva and Vasily Konovalov and Andrey Galichin and Ivan Oseledets},
  year = {2026},
  eprint = {2606.00683},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url = {https://arxiv.org/abs/2606.00683}
}
```