---
license: mit
---

# **Scaling Reasoning without Attention**

[Paper](http://arxiv.org/abs/2505.22425) | [Code](https://github.com/inclusionAI/PromptCoT)

---

## Overview

**PromptCoT-Mamba** establishes the first **attention-free foundation model** capable of surpassing strong Transformer baselines across a broad suite of competition-level math and code reasoning tasks. Built on the **Mamba-2** architecture and trained through a structured, two-stage curriculum using the [**PromptCoT**](http://arxiv.org/abs/2503.02324) pipeline, it delivers **high accuracy with constant-memory inference**, eliminating the need for KV caching.

---

## Key Results

### General Performance

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| ---------------------- | -------- | -------- | -------- | ------------- | --------- | ---------- | ------------- |
| **PromptCoT-Mamba-7B** | 84.6 | **35.2** | **24.6** | 50.7 | 81.7 | 75.0 | **29.9** |
| Gemma3-27B | **89.0** | 32.6 | 24.0 | **54.2** | **86.0** | **78.0** | 26.9 |
| Gemma3-12B | 83.8 | 22.9 | 19.2 | 49.9 | 81.1 | 73.2 | 22.2 |
| Sky-T1-7B | 85.0 | 19.2 | 19.2 | 49.2 | 41.5 | 37.2 | 18.3 |
| S1.1-7B | 82.0 | 19.2 | 17.5 | 43.1 | 64.0 | 56.7 | 13.3 |
| Bespoke-Stratos-7B | 81.2 | 18.3 | 16.3 | 45.0 | 73.2 | 68.3 | 8.6 |
| Nemotron-H-8B | 77.6 | -- | -- | -- | 79.3 | 74.4 | -- |
| M1-3B | 81.7 | 23.0 | 22.0 | 43.6 | -- | -- | -- |

> **PromptCoT-Mamba-7B** outperforms the 7B-scale Transformer and hybrid Mamba-Transformer baselines on nearly every task.

---

### Math Specialization vs. Generalist

| Model | MATH-500 | AIME 24 | AIME 25 | OlympiadBench | HumanEval | HumanEval+ | LiveCodeBench |
| --------------------------- | -------- | -------- | -------- | ------------- | --------- | ---------- | ------------- |
| **PromptCoT-Mamba-Math-7B** | **88.0** | **42.9** | **30.8** | **52.1** | 71.3 | 66.5 | 20.3 |
| PromptCoT-Mamba-7B | 84.6 | 35.2 | 24.6 | 50.7 | **81.7** | **75.0** | **29.9** |

> The math-specialized variant improves AIME 24 by **+7.7 points** and AIME 25 by **+6.2 points**, with a slight trade-off in code performance.

---

### Inference Efficiency

Using `vLLM` under constrained memory, PromptCoT-Mamba-7B demonstrates substantial speedups over the S1.1-7B Transformer baseline:

* **3.66× faster** at long-sequence generation on a **24 GB GPU**
* **1.69× faster** under **72 GB** of memory

> Practical for cost-sensitive or long-context inference workloads at scale.
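
These constrained-memory settings map onto vLLM's standard configuration knobs. The sketch below is illustrative only: the memory fraction and context length are assumptions, not the exact configuration behind the reported numbers, and it reuses the checkpoint id referenced elsewhere in this card.

```python
from vllm import LLM, SamplingParams

# Illustrative memory-capped setup; the values below are assumptions,
# not the exact configuration used for the benchmark numbers above.
llm = LLM(
    model="xl-zhao/PromptCoT-Mamba-Math-7B",
    gpu_memory_utilization=0.5,  # fraction of GPU memory vLLM may claim
    max_model_len=32768,         # upper bound on prompt + generation length
)

# Sampling settings consistent with the examples later in this card.
sampling_params = SamplingParams(temperature=0.8, max_tokens=8192)
```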

---

## Quick Start

### Install Requirements

```bash
pip install transformers vllm torch accelerate
```

### Load and Run the Model

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "xl-zhao/PromptCoT-Mamba-Math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda")

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)

# Wrap the problem in the chat-style prompt format expected by the model.
prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.no_grad():
    output = model.generate(**inputs, max_length=65536, do_sample=True, temperature=0.8)

generated_solution = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_solution)
```
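
If the released tokenizer ships a chat template (an assumption; check `tokenizer.chat_template`), the same prompt can also be built with `apply_chat_template` instead of writing the `<|im_start|>` markers by hand. A minimal sketch:

```python
from transformers import AutoTokenizer

# Hedged sketch: only applies if the checkpoint actually includes a chat template.
tokenizer = AutoTokenizer.from_pretrained("xl-zhao/PromptCoT-Mamba-Math-7B")

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)
messages = [
    {
        "role": "user",
        "content": f"{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.",
    }
]

# tokenize=False returns the formatted prompt string;
# add_generation_prompt=True appends the assistant-turn opener.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```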

---

## Fast Inference with vLLM

```python
from vllm import LLM, SamplingParams

model_name = "xl-zhao/PromptCoT-Mamba-Math-7B"
llm = LLM(model=model_name, tensor_parallel_size=1)

problem_statement = (
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?"
)

prompt = (
    f"<|im_start|>user\n{problem_statement}\nPlease reason step by step, and put your final answer within \\boxed{{}}.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=65536)
outputs = llm.generate([prompt], sampling_params)

print(outputs[0].outputs[0].text)
```
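
Because `llm.generate` accepts a list of prompts, several problems can be decoded in a single call. The sketch below is a small variation on the snippet above; the second problem is only an illustrative placeholder.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="xl-zhao/PromptCoT-Mamba-Math-7B", tensor_parallel_size=1)
sampling_params = SamplingParams(temperature=0.8, max_tokens=65536)

# Wrap each problem in the same prompt template used above.
template = (
    "<|im_start|>user\n{problem}\nPlease reason step by step, and put your final answer "
    "within \\boxed{{}}.<|im_end|>\n<|im_start|>assistant\n"
)
problems = [
    "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?",
    "What is the greatest common divisor of 48 and 180?",  # illustrative second problem
]
prompts = [template.format(problem=p) for p in problems]

# Outputs are returned in the same order as the prompts.
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```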

---

## Citation

```bibtex
@article{zhao2025scaling,
  author  = {Xueliang Zhao and Wei Wu and Lingpeng Kong},
  title   = {Scaling Reasoning without Attention},
  journal = {arXiv preprint arXiv:2505.22425},
  year    = {2025},
  url     = {https://arxiv.org/abs/2505.22425}
}
```