|
|
--- |
|
|
license: apache-2.0 |
|
|
pipeline_tag: text-generation |
|
|
library_name: transformers |
|
|
tags: |
|
|
- test-time-scaling |
|
|
- reflective-model |
|
|
- mathematics |
|
|
- code |
|
|
- reasoning |
|
|
--- |
|
|
|
|
|
# MetaStone-S1: Test-Time Scaling with Reflective Generative Model |
|
|
|
|
|
**Paper:** [Test-Time Scaling with Reflective Generative Model](https://huggingface.co/papers/2507.01951) |
|
|
**Project page:** [wenxiaobai.com](https://www.wenxiaobai.com/) |
|
|
**Code:** [MetaStone-AI/MetaStone-S1](https://github.com/MetaStone-AI/MetaStone-S1) |
|
|
|
|
|
## Introduction |
|
|
We release our first reflective generative model: MetaStone-S1. |
|
|
With only 32B parameters, MetaStone-S1 performs comparably to the OpenAI-o3 series on mathematics, coding, and Chinese reasoning tasks. |
|
|
<img src="./figures/performance.jpg" alt="Performance compared with OpenAI-o3-mini" width="800"> |
|
|
|
|
|
MetaStone‑S1 is trained with our proposed **reflective generative form, which combines "Long-CoT Reinforcement Learning" and "Process Reward Learning" into a unified training framework**.
|
|
This form enables a single model to simultaneously achieve deep reasoning and high-quality reasoning trajectory selection. |
|
|
By sharing the backbone network between the PRMs and policy models, MetaStone‑S1 significantly reduces the inference cost of PRMs by 99%, resulting in faster and higher-quality responses. |
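The saving comes from reuse: the SPRM head scores trajectories from hidden states the policy backbone has already computed, so judging a step adds only a lightweight head forward instead of a second full model pass. A minimal NumPy sketch of this idea, with made-up shapes and weights (the real head's architecture and aggregation are defined in the official repository):

```python
import numpy as np

# Hidden states for 4 reasoning steps from the shared backbone (hidden size 8).
# In the real model these come from the policy network itself, for free.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(4, 8))

# Hypothetical SPRM head: a single linear scorer + sigmoid per step.
w = rng.normal(size=(8,))
step_scores = 1.0 / (1.0 + np.exp(-(hidden_states @ w)))

# Aggregate per-step scores into one trajectory score (mean here; the
# actual aggregation rule is defined in the official repo).
trajectory_score = float(step_scores.mean())
print(round(trajectory_score, 3))
```

Because the head is this small relative to a 32B backbone, scoring adds almost nothing to the cost of generation.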
|
|
|
|
|
<img src="./figures/intro.jpg" alt="Introduction" width="800"> |
|
|
|
|
|
This repository contains the training and evaluation code for MetaStone-S1. For full details, please refer to our [paper](https://huggingface.co/papers/2507.01951) and [official website](https://www.wenxiaobai.com/). |
|
|
|
|
|
## Usage |
|
|
You can load the model using the `transformers` library for basic text generation. |
|
|
|
|
|
```python |
|
|
import torch |
|
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
|
|
# Load model and tokenizer |
|
|
# Note: For full functionality of MetaStone-S1's reflective generative capabilities |
|
|
# (e.g., using the Process Reward Model for enhanced reasoning modes and test-time scaling), |
|
|
# please refer to the official GitHub repository for detailed inference pipeline. |
|
|
model_name = "MetaStoneTec/MetaStone-S1-32B" # Use MetaStoneTec/MetaStone-S1-7B or MetaStoneTec/MetaStone-S1-1.5B for other sizes |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
model_name, |
|
|
torch_dtype=torch.bfloat16, # Use torch.float16 if bfloat16 is not supported by your GPU |
|
|
device_map="auto" |
|
|
) |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
|
|
# Example text generation |
|
|
prompt = "What is the capital of France?" |
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
|
|
|
# Generate text |
|
|
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7) |
|
|
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
print(generated_text) |
|
|
|
|
|
# Example with a specific prompt format (if applicable, adjust as per model's fine-tuning) |
|
|
# For models fine-tuned with specific chat templates, use tokenizer.apply_chat_template: |
|
|
# messages = [{"role": "user", "content": "Hello, how are you today?"}] |
|
|
# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) |
|
|
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
|
# outputs = model.generate(**inputs, max_new_tokens=50) |
|
|
# generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
# print(generated_text) |
|
|
``` |
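At test time, MetaStone-S1 scales by sampling several candidate reasoning trajectories and letting the SPRM select the best one; the low/medium/high variants in the table below correspond to different test-time compute budgets. The selection step can be sketched as follows, with made-up candidates and scores standing in for real SPRM outputs (the actual scoring pipeline lives in the official GitHub repository):

```python
# Best-of-N selection: sample N trajectories, score each with the SPRM,
# and keep the highest-scoring one. The scores below are illustrative;
# real scores come from the SPRM head sharing the policy backbone.

def select_best_of_n(candidates, scores):
    """Return the candidate whose SPRM score is highest."""
    best_idx = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best_idx]

candidates = ["trajectory A", "trajectory B", "trajectory C"]
scores = [0.31, 0.87, 0.54]
print(select_best_of_n(candidates, scores))  # -> trajectory B
```

Larger candidate budgets trade more compute for a better chance that a high-quality trajectory is in the pool.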
|
|
|
|
|
## Performance |
|
|
|
|
|
| Model | AIME24 | AIME25 | LiveCodeBench | C-EVAL | |
|
|
|------------------------------|--------|--------|----------------|--------| |
|
|
| DeepScaleR-1.5B-Preview | 43.1 | 30.0 | - | - | |
|
|
| R1-Distill-Qwen-1.5B | 28.9 | 22.8 | 16.9 | 27.1 | |
|
|
| R1-Distill-Qwen-7B | 55.5 | - | 37.6 | - | |
|
|
| R1-Distill-Llama-8B | 50.4 | - | 39.6 | - | |
|
|
| **MetaStone-S1-7B-low** | 60.7 | 45.4 | 41.7 | 55.1 | |
|
|
| **MetaStone-S1-7B-medium** | <ins>66.3</ins> | <ins>48.3</ins> | <ins>44.1</ins> | <ins>57.5</ins> | |
|
|
| **MetaStone-S1-7B-high** | **70.2** | **48.6** | **44.4** | **57.8** | |
|
|
|
|
|
|
|
|
## Model |
|
|
|
|
|
We store the policy model and the SPRM head in two separate files:

- `model.safetensors` is the checkpoint of the policy model.

- `score_module.pt` is the checkpoint of the SPRM head.
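To experiment with the SPRM head on its own, the two checkpoints can be handled separately. A hedged sketch, assuming `score_module.pt` is an ordinary PyTorch state dict for a small scoring head; the `ScoreHead` class below is a hypothetical stand-in, and the real architecture is defined in the official GitHub repository:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the SPRM head: a linear scorer over hidden states.
# The actual head architecture is defined in the MetaStone-S1 repository.
class ScoreHead(nn.Module):
    def __init__(self, hidden_size: int = 8):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(hidden_states)).squeeze(-1)

# Round-trip the head through a "score_module.pt"-style checkpoint.
head = ScoreHead()
torch.save(head.state_dict(), "score_module_demo.pt")

restored = ScoreHead()
restored.load_state_dict(torch.load("score_module_demo.pt"))
scores = restored(torch.randn(4, 8))  # one score in (0, 1) per step
print(scores.shape)  # torch.Size([4])
```

Keeping the head in its own file lets the policy checkpoint load unchanged through `transformers` while the scorer is attached only when trajectory selection is needed.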
|
|
|
|
|
|
|
|
You can find other sizes of MetaStone‑S1 below: |
|
|
|
|
|
| Model | Transformers (HF) | ModelScope |
|
|
|---------------|---------|---------| |
|
|
|MetaStone-S1-1.5B|[MetaStone-S1-1.5B](https://huggingface.co/MetaStoneTec/MetaStone-S1-1.5B)|[MetaStone-S1-1.5B](https://modelscope.cn/models/MetaStoneTec/MetaStone-S1-1.5B)| |
|
|
|MetaStone-S1-7B|[MetaStone-S1-7B](https://huggingface.co/MetaStoneTec/MetaStone-S1-7B)|[MetaStone-S1-7B](https://modelscope.cn/models/MetaStoneTec/MetaStone-S1-7B)| |
|
|
|MetaStone-S1-32B|[MetaStone-S1-32B](https://huggingface.co/MetaStoneTec/MetaStone-S1-32B)|[MetaStone-S1-32B](https://modelscope.cn/models/MetaStoneTec/MetaStone-S1-32B)| |
|
|
|
|
|
|
|
|
## Evaluation |
|
|
Hugging Face `transformers` does not directly support inference with the SPRM. Please refer to the [MetaStone-S1 GitHub repository](https://github.com/MetaStone-AI/MetaStone-S1) for the detailed training and evaluation pipeline.
|
|
|
|
|
|
|
|
## Citation |
|
|
If you find our work helpful, please cite it as follows.
|
|
```bibtex
|
|
@misc{wang2025testtimescalingreflectivegenerative, |
|
|
title={Test-Time Scaling with Reflective Generative Model}, |
|
|
author={Zixiao Wang and Yuxin Wang and Xiaorui Wang and Mengting Xing and Jie Gao and Jianjun Xu and Guangcan Liu and Chenhui Jin and Zhuo Wang and Shengzhuo Zhang and Hongtao Xie}, |
|
|
year={2025}, |
|
|
eprint={2507.01951}, |
|
|
archivePrefix={arXiv}, |
|
|
primaryClass={cs.LG}, |
|
|
url={https://arxiv.org/abs/2507.01951}, |
|
|
} |
|
|
``` |