---
language:
- en
- zh
library_name: transformers
pipeline_tag: text-generation
tags:
- recommendation
- generative-recommendation
- reasoning
- itemic-token
- qwen3
- pretraining
---

# OneReason

Reasoning Foundation Models for Generative Recommendation

[Paper](https://arxiv.org/abs/2606.06260) | [Model Zoo](#model-zoo) | [Quick Start](#quick-start) | [Citation](#citation)

Figure 1: The pre-training, SFT, RL, and reasoning-evaluation pipeline of OneReason.

## Introduction

OneReason is a recommendation foundation model that connects large language models with generative recommender systems. It represents items as compact **itemic tokens** and trains the model to align itemic-token semantics with natural language, user behavior, and recommendation-oriented reasoning traces.

The OneReason training stack contains three stages:

- **Pre-training:** builds itemic-token perception through four-granularity itemic-text alignment data, covering token-, item-, relational-, and user-level signals.
- **Supervised Fine-Tuning (SFT):** teaches recommendation cognition with coarse-to-fine Chain-of-Thought (CoT) traces over user profiles, behavior histories, and itemic-token evidence.
- **Reinforcement Learning (RL):** uses a specialize-then-unify recipe to improve thinking-mode recommendation while balancing performance across multiple recommendation domains.

This repository currently releases the **OneReason-0.8B Pretrain checkpoint**. The OneReason-0.8B SFT/RL checkpoints and the OneReason-8B series will be released later.

## News

- **[2026.06]** OneReason-0.8B Pretrain checkpoint is released.
- **Coming soon:** OneReason-0.8B SFT checkpoint.
- **Coming soon:** OneReason-0.8B RL checkpoint.
- **Coming soon:** OneReason-8B checkpoints.

## Model Zoo

| Model | Stage | Parameters | Status | Description |
|---|---:|---:|---|---|
| OneReason-0.8B-Pretrain | Pre-training | 0.8B | Released | Foundation checkpoint after itemic-text alignment pre-training. Suitable for research, continued pre-training, and downstream SFT. |
| OneReason-0.8B-SFT | SFT | 0.8B | Coming soon | Instruction-tuned checkpoint with recommendation perception, derivation, evolution, and recommendation supervision. |
| OneReason-0.8B-RL | RL | 0.8B | Coming soon | Post-trained checkpoint optimized for recommendation-oriented reasoning. |
| OneReason-8B | Pretrain/SFT/RL | 8B | Coming soon | Larger OneReason model family with stronger reasoning and recommendation performance. |

## Method Overview

### Itemic Tokens

OneReason represents each item with one domain-aware begin token and three hierarchical sub-tokens:

```text
<|domain_begin|><s_a_*><s_b_*><s_c_*>
```

| Domain | Begin token | Example |
|---|---|---|
| Video | `<\|video_begin\|>` | `<\|video_begin\|><s_a_3334><s_b_4643><s_c_625>` |
| E-commerce product | `<\|prod_begin\|>` | `<\|prod_begin\|><s_a_2147><s_b_7978><s_c_5031>` |
| Advertisement | `<\|ad_begin\|>` | `<\|ad_begin\|><s_a_7939><s_b_6234><s_c_4978>` |
| Live streaming | `<\|living_begin\|>` | `<\|living_begin\|><s_a_4515><s_b_6234><s_c_6278>` |
| General multimodal item | `<\|sid_begin\|>` | `<\|sid_begin\|><s_a_340><s_b_6566><s_c_5603>` |

Each itemic token sequence is produced by a three-layer codebook, where each layer contains 8192 codes. The released checkpoint can process these itemic-token strings through its tokenizer. Mapping raw items to itemic tokens, or mapping generated itemic tokens back to real item IDs, requires the corresponding itemic tokenizer and item catalog.
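
The string format above can be built and checked with a few lines of standard-library Python. This is a minimal sketch, not part of the released code: the helper names `format_item` and `parse_items` are hypothetical, and it assumes code indices are zero-based (0 to 8191), which the card does not state explicitly.

```python
import re

CODEBOOK_SIZE = 8192  # codes per layer, as stated above
DOMAINS = ("video", "prod", "ad", "living", "sid")  # begin tokens from the table

# One begin token followed by exactly three layered sub-tokens.
ITEM_PATTERN = re.compile(
    r"<\|(?P<domain>[a-z]+)_begin\|>"
    r"<s_a_(?P<a>\d+)><s_b_(?P<b>\d+)><s_c_(?P<c>\d+)>"
)


def format_item(domain, codes):
    """Build the itemic-token string for one item (hypothetical helper)."""
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    a, b, c = codes
    for code in (a, b, c):
        # Assumption: indices are zero-based, i.e. 0 <= code < 8192.
        if not 0 <= code < CODEBOOK_SIZE:
            raise ValueError(f"code out of range: {code}")
    return f"<|{domain}_begin|><s_a_{a}><s_b_{b}><s_c_{c}>"


def parse_items(text):
    """Return (domain, (a, b, c)) for every well-formed item found in text."""
    items = []
    for m in ITEM_PATTERN.finditer(text):
        codes = (int(m["a"]), int(m["b"]), int(m["c"]))
        if m["domain"] in DOMAINS and all(0 <= x < CODEBOOK_SIZE for x in codes):
            items.append((m["domain"], codes))
    return items


history = (
    "<|prod_begin|><s_a_2147><s_b_7978><s_c_5031> then "
    "<|ad_begin|><s_a_7939><s_b_6234><s_c_4978>"
)
print(parse_items(history))

# Three layers of 8192 codes give 8192**3 distinct code triples per domain.
print(CODEBOOK_SIZE ** 3)
```

Parsing only recovers the code triples. Resolving them to real item IDs still needs the itemic tokenizer and item catalog mentioned above.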
### Pre-training Data Design
OneReason pre-training uses **578B tokens** to align itemic-token and text-token semantic spaces. The recommendation part follows a four-granularity corpus design:
- **Token granularity:** aligns individual and compositional sub-token semantics.
- **Item granularity:** aligns complete itemic patterns with natural-language captions and multi-perspective item QA.
- **Relational granularity:** injects item-to-item collaborative relations with natural-language transition explanations.
- **User granularity:** models user behavior sequences with domain-grouped and chronologically interleaved itemic-text formats.
General-domain text and multimodal corpora are mixed in to preserve instruction-following, reasoning, code, math, and broad semantic capabilities while injecting recommendation-specific knowledge.
### Training Recipe
The pre-training recipe contains three stages:
| Stage | Trainable parameters | Token budget | Purpose |
|---|---|---:|---|
| Stage 1 | Extended vocabulary + LM head | 110B | Warm up newly introduced itemic-token embeddings. |
| Stage 2 | All parameters | 449B | Inject four-granularity recommendation knowledge. |
| Stage 3 | All parameters | 19B | Extend long-context user behavior modeling. |
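
The schedule in the table can be written down as data, which also makes it easy to check that the three budgets add up to the 578B tokens quoted earlier. The sketch below is illustrative only and is not the authors' training code. `Stage`, `STAGES`, and `apply_stage` are hypothetical names, and the parameter-name keywords `embed_tokens` and `lm_head` are an assumption based on the naming commonly used by Qwen-style models in `transformers`. `apply_stage` works on any object exposing a PyTorch-style `named_parameters()`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    token_budget_b: int        # budget in billions of tokens
    trainable_keywords: tuple  # empty tuple means "train all parameters"


STAGES = (
    # Assumed keywords for "Extended vocabulary + LM head".
    Stage("stage1", 110, ("embed_tokens", "lm_head")),
    Stage("stage2", 449, ()),
    Stage("stage3", 19, ()),
)


def apply_stage(model, stage):
    """Toggle requires_grad for one stage and return the trainable names."""
    trainable = []
    for name, param in model.named_parameters():
        keep = not stage.trainable_keywords or any(
            key in name for key in stage.trainable_keywords
        )
        param.requires_grad = keep
        if keep:
            trainable.append(name)
    return trainable


# The three stage budgets sum to the 578B tokens used for pre-training.
total_budget_b = sum(stage.token_budget_b for stage in STAGES)
print(total_budget_b)
```

Matching by name unfreezes the whole embedding matrix and output head. Restricting updates to the newly added itemic-token rows only would need extra gradient masking, which this sketch does not attempt.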
## OneReason-Bench
OneReason is evaluated with **OneReason-Bench**, a reasoning-oriented recommendation benchmark organized into four layers:
| Layer | Capability | Representative tasks |
|---|---|---|
| R0: Perception | Ground itemic tokens into semantic content. | Item understanding, itemic pattern grounding, item QA. |
| R1: Derivation | Reason over item-to-item relations. | Item2Item relation derivation. |
| R2: Evolution | Model user interests as temporal processes. | Evolution action selection, topic generation, direct evolution generation. |
| R3: Recommendation | Combine perception, derivation, and evolution for recommendation. | Single-domain and cross-domain recommendation. |
## Performance
The released **OneReason-0.8B-Pretrain** checkpoint is the foundation checkpoint before SFT/RL. It is designed to provide strong itemic-token perception and a good initialization for downstream recommendation tuning.
The tables below report the full OneReason-8B system results from the technical report. We will update this model card with checkpoint-specific numbers as the OneReason-0.8B SFT/RL and OneReason-8B checkpoints become available.
Figure 2: Performance overview of OneReason-8B. The radar chart summarizes general, perception, derivation, evolution, and recommendation capabilities; the bar charts show thinking-mode gains and the effect of thinking-data supervision.
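
The result tables in this section report Pass@64 and Recall@64, but this card does not define them. The sketch below shows one common reading, stated here as an assumption rather than the benchmark's official definition: Pass@k is 1 when any of the top-k generated items matches a ground-truth item, and Recall@k is the fraction of ground-truth items found among the top-k. The function names are hypothetical.

```python
def pass_at_k(candidates, targets, k=64):
    """1.0 if any of the first k candidates is a ground-truth item, else 0.0."""
    return 1.0 if set(candidates[:k]) & set(targets) else 0.0


def recall_at_k(candidates, targets, k=64):
    """Fraction of ground-truth items that appear in the first k candidates."""
    wanted = set(targets)
    if not wanted:
        return 0.0
    return len(set(candidates[:k]) & wanted) / len(wanted)


def mean_percent(values):
    """Average per-user scores and express the result as a percentage."""
    values = list(values)
    return 100.0 * sum(values) / len(values) if values else 0.0


# Toy example with plain strings standing in for itemic-token sequences.
generated = ["item_a", "item_b", "item_c"]
ground_truth = ["item_b", "item_z"]
print(pass_at_k(generated, ground_truth, k=3))
print(recall_at_k(generated, ground_truth, k=3))
```

Under this reading Pass@k is never lower than Recall@k for the same user, which is consistent with every row of the cross-domain table.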

### Results on Cross-Domain Recommendation

Cross-domain recommendation results are reported as percentages. Best results are **bolded**; second-best results are <u>underlined</u>.

| Category | Model | C-Video Pass@64 | C-Video Recall@64 | C-Product Pass@64 | C-Product Recall@64 | C-Ad Pass@64 | C-Ad Recall@64 | C-Live Pass@64 | C-Live Recall@64 |
|---|---|---:|---:|---:|---:|---:|---:|---:|---:|
| ID-Based | SASRec | 0.03 | 0.01 | 0.31 | 0.25 | 1.04 | 0.37 | 1.76 | 0.40 |
| ID-Based | HSTU | 0.10 | 0.01 | 0.32 | 0.24 | 2.79 | 0.78 | 2.32 | 2.14 |
| Text-Based | Qwen3-8B | 0.05 | 0.01 | 0.15 | 0.12 | 0.48 | 0.09 | 2.10 | 1.85 |
| Text-Based | Qwen3-32B | 0.33 | 0.03 | 0.84 | 0.63 | 1.21 | 0.30 | 5.64 | 5.10 |
| Text-Based | Qwen3-235B-A22B | 0.24 | 0.02 | 0.64 | 0.49 | 0.77 | 0.19 | 5.10 | 4.66 |
| Text-Based | Deepseek-V3.2 | 0.11 | 0.01 | 0.38 | 0.31 | 0.62 | 0.13 | 3.46 | 3.12 |
| Text-Based | Claude-Opus-4.6 | 0.14 | 0.01 | 0.23 | 0.17 | 0.50 | 0.11 | 3.02 | 2.66 |
| Text-Based | Gemini-3-Preview | 0.29 | 0.03 | 0.74 | 0.59 | 1.22 | 0.27 | 3.92 | 3.44 |
| Text-Based | GPT-4o-mini | 0.19 | 0.02 | 0.73 | 0.55 | 1.21 | 0.28 | 4.01 | 3.57 |
| Text-Based | GPT-5.4 | 0.24 | 0.02 | 1.43 | 1.15 | 1.64 | 0.43 | 7.20 | 6.38 |
| Itemic Token-Based | TIGER | 0.88 | 0.07 | 0.21 | 0.17 | 7.65 | 2.39 | 2.32 | 1.78 |
| Itemic Token-Based | LC-Rec-SFT-Only-8B | 0.22 | 0.02 | 0.06 | 0.05 | 2.83 | 0.67 | 0.89 | 0.71 |
| Itemic Token-Based | LC-Rec-SFT-Only-14B | 0.20 | 0.01 | 1.03 | 0.73 | 5.99 | 1.94 | 3.76 | 3.09 |
| Itemic Token-Based | LC-Rec-PT-SFT-8B | 1.49 | 0.13 | 3.95 | 3.00 | 15.85 | 6.55 | 19.32 | 16.70 |
| Itemic Token-Based | OneReason SFT non-thinking | 1.33 | 0.11 | 3.94 | 2.96 | 15.73 | 6.49 | 18.05 | 15.52 |
| Itemic Token-Based | OneReason SFT thinking | 0.71 | 0.06 | 2.18 | 1.65 | 9.16 | 3.41 | 16.43 | 14.32 |
| Itemic Token-Based | OneReason RFT non-thinking | <u>2.08</u> | <u>0.19</u> | <u>5.20</u> | <u>3.96</u> | <u>17.56</u> | <u>7.26</u> | <u>21.01</u> | <u>18.17</u> |
| Itemic Token-Based | OneReason RFT thinking | **2.41** | **0.24** | **5.47** | **4.19** | **17.78** | **7.50** | **21.10** | **18.35** |

### Results on R0-R2 Reasoning Tasks

R0-R2 results on OneReason-Bench are reported as percentages. For R0 tasks, results are macro-averaged over all domains. Grounding is reported as Pass@64.

| Category | Model | R0 Item Und. | R0 Ground. | R0 QA | R1 I2I | R2 Select. | R2 Topic Gen. | R2 Direct Gen. |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| Text-Based | Qwen3-8B | - | - | - | - | 40.70 | 25.49 | 8.60 |
| Text-Based | Qwen3-32B | - | - | - | - | 51.96 | 28.05 | 7.73 |
| Text-Based | Deepseek-V3.2 | - | - | - | - | 57.18 | 27.13 | 11.32 |
| Text-Based | Claude-Opus-4.6 | - | - | - | - | 56.84 | 17.16 | 13.46 |
| Text-Based | Gemini-3-Preview | - | - | - | - | 56.83 | 33.68 | 14.76 |
| Text-Based | GPT-5.4 | - | - | - | - | **58.92** | **41.41** | 17.61 |
| Itemic Token-Based | LC-Rec-SFT-Only-8B | 22.98 | 0.00 | 0.40 | 3.43 | 0.00 | 0.00 | 0.00 |
| Itemic Token-Based | LC-Rec-SFT-Only-14B | 26.48 | 0.00 | 56.45 | 16.21 | 0.00 | 0.00 | 0.00 |
| Itemic Token-Based | LC-Rec-PT-SFT-8B | 35.41 | 5.21 | 63.90 | 25.54 | 3.32 | 8.60 | 4.46 |
| Itemic Token-Based | OneReason SFT non-thinking | 36.84 | 3.95 | 66.55 | 28.36 | 35.07 | 33.87 | 15.42 |
| Itemic Token-Based | OneReason SFT thinking | **36.91** | 1.06 | 64.60 | 23.88 | 32.18 | 31.60 | 14.31 |
| Itemic Token-Based | OneReason RFT non-thinking | 36.82 | **5.24** | **67.25** | 23.99 | 38.92 | 39.33 | 20.31 |
| Itemic Token-Based | OneReason RFT thinking | 36.78 | 1.35 | 65.65 | **28.60** | 42.42 | 39.57 | **21.23** |

## Quick Start

Install dependencies:

```bash
pip install "transformers>=4.51.0" accelerate safetensors torch
```

Load the model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OpenOneRec/OneReason-0.8B-Pretrain"  # or the local path to this repository

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
```

### Item Understanding Example

```python
prompt = "<|prod_begin|>