---
license: mit
---
# **Scaling Reasoning without Attention**
[Paper (arXiv:2505.22425)](http://arxiv.org/abs/2505.22425)
[Code (GitHub: inclusionAI/PromptCoT)](https://github.com/inclusionAI/PromptCoT)
---
## Overview
**PromptCoT-Mamba** is the first **attention-free foundation model** to surpass strong Transformer baselines across a broad suite of competition-level math and code reasoning tasks. It is built on the **Mamba-2** architecture and trained with a structured, two-stage curriculum using the [**PromptCoT**](http://arxiv.org/abs/2503.02324) pipeline. Because it needs no KV cache, it delivers **high accuracy with constant-memory inference**.
---
## Key Results
### 🔹 General Performance
| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| ---------------------- | -------- | -------- | -------- | ------------- | --------- | ---------- | ------------- |
| **PromptCoT-Mamba-7B** | 84.6 | **35.2** | **24.6** | 50.7 | 81.7 | 75.0 | **29.9** |
| Gemma3-27B | **89.0** | 32.6 | 24.0 | **54.2** | **86.0** | **78.0** | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |
> **PromptCoT-Mamba-7B** outperforms the 7B-scale Transformer baselines and the hybrid Mamba-Transformer baselines listed above on nearly every benchmark; the one exception is MATH-500, where Sky-T1-7B scores 85.0 against 84.6.
---
### 🔹 Math Specialization vs. Generalist
| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --------------------------- | -------- | -------- | -------- | ------------- | --------- | ---------- | ------------- |
| **PromptCoT-Mamba-Math-7B** | **88.0** | **42.9** | **30.8** | **52.1** | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | **81.7** | **75.0** | **29.9** |
> 🎯 The math-specialized variant improves AIME 24 by **+7.7 points** and AIME 25 by **+6.2 points**, at the cost of lower code performance (HumanEval 81.7 → 71.3, LiveCodeBench 29.9 → 20.3).
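
The gains quoted above are absolute differences in accuracy (percentage points), which can be checked directly from the two table rows. A minimal sketch; the dictionary names are illustrative, and the scores are copied from the table:

```python
# Scores copied from the table above (accuracy, in percent).
math_7b = {"AIME 24": 42.9, "AIME 25": 30.8, "HumanEval": 71.3, "LiveCodeBench": 20.3}
general_7b = {"AIME 24": 35.2, "AIME 25": 24.6, "HumanEval": 81.7, "LiveCodeBench": 29.9}

# Absolute difference in percentage points (math-specialized minus generalist).
deltas = {name: round(math_7b[name] - general_7b[name], 1) for name in math_7b}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f} points")
```

Positive values favor the math-specialized model; negative values favor the generalist.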
---
### ⚡ Inference Efficiency
Using `vLLM` under constrained memory, PromptCoT-Mamba-7B demonstrates substantial speedups over the S1.1-7B Transformer baseline:
* 💡 **3.66× faster** at long-sequence generation on a **24GB GPU**
* 💡 **1.69× faster** under **72GB memory**
> Practical for cost-sensitive or long-context inference workloads at scale.
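
To illustrate why dropping the KV cache matters for long generations, the sketch below compares how a Transformer's KV cache grows with sequence length against a fixed-size recurrent state. All architecture numbers (layer counts, head count, head dimension, state size) are hypothetical placeholders chosen for illustration, not the actual configurations of PromptCoT-Mamba-7B or S1.1-7B.

```python
def kv_cache_bytes(seq_len, n_layers=28, n_kv_heads=4, head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence (hypothetical Transformer config).

    Keys and values (factor 2) are stored for every layer and every token,
    so the total grows linearly with seq_len.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def recurrent_state_bytes(n_layers=64, state_elems_per_layer=1_000_000, bytes_per_elem=2):
    """Approximate recurrent-state size for one sequence (hypothetical SSM config).

    The state has a fixed size per layer, so it does not depend on seq_len.
    """
    return n_layers * state_elems_per_layer * bytes_per_elem


for seq_len in (1_024, 16_384, 65_536):
    kv_mib = kv_cache_bytes(seq_len) / 2**20
    state_mib = recurrent_state_bytes() / 2**20
    print(f"{seq_len:>6} tokens: KV cache ~{kv_mib:,.0f} MiB, recurrent state ~{state_mib:,.0f} MiB")
```

With these placeholder numbers, the KV cache scales in proportion to the number of tokens while the recurrent state stays the same size, which is the property that frees memory for larger batches or longer outputs.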
---
## 🧪 Quick Start
### Install Requirements
```bash
pip install transformers vllm torch accelerate
```
### Load and Run the Model
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-Mamba-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)
# Chat-style prompt that asks for the final answer inside \boxed{}.
prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    # do_sample=True is needed for temperature to take effect.
    output = model.generate(**inputs, max_length=65536, do_sample=True, temperature=0.8)

generated_solution = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_solution)
```
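Since the prompt asks for the final answer inside `\boxed{}`, a small helper can pull that answer out of the generated text. This is a minimal sketch rather than part of the released code; `extract_boxed` is a hypothetical name. It takes the last `\boxed{...}` in the text and handles nested braces.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never closed


print(extract_boxed("Blue: 2, white: 1, so the total is \\boxed{3}."))  # prints 3
```

The decoded output from `model.generate` includes the prompt, which itself contains an empty `\boxed{}`; an empty string returned by the helper therefore means the model produced no boxed answer of its own.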
---
## ⚡ Fast Inference with vLLM
```python
from vllm import LLM, SamplingParams

model_name = "xl-zhao/PromptCoT-Mamba-Math-7B"
llm = LLM(model=model_name, tensor_parallel_size=1)

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)
# Same chat-style prompt as in the Transformers example.
prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=65536)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
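Because `llm.generate` accepts a list of prompts, several problems can be submitted in one call. The sketch below factors the prompt template from the examples above into a helper; `build_prompt` is a hypothetical name, and the second problem is made up for illustration.

```python
def build_prompt(problem_statement):
    """Wrap a problem in the chat-style template used in the examples above."""
    return (
        f"<|im_start|>user\n{problem_statement}\n"
        "Please reason step by step, and put your final answer within \\boxed{}.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


problems = [
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
    "What is 12 * 13?",  # illustrative extra problem, not from the original card
]
prompts = [build_prompt(p) for p in problems]

# With the `llm` and `sampling_params` objects from the vLLM example:
# outputs = llm.generate(prompts, sampling_params)
# for out in outputs:
#     print(out.outputs[0].text)
```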
---
## Citation
```bibtex
@article{zhao2025scaling,
  author  = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}
```