
Libraries

How to use deepdream-ai/DeepScaleR-7B-GRPO-8k with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="deepdream-ai/DeepScaleR-7B-GRPO-8k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so the generated text is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("deepdream-ai/DeepScaleR-7B-GRPO-8k")
model = AutoModelForCausalLM.from_pretrained("deepdream-ai/DeepScaleR-7B-GRPO-8k")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use deepdream-ai/DeepScaleR-7B-GRPO-8k with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "deepdream-ai/DeepScaleR-7B-GRPO-8k"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepdream-ai/DeepScaleR-7B-GRPO-8k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/deepdream-ai/DeepScaleR-7B-GRPO-8k

SGLang

How to use deepdream-ai/DeepScaleR-7B-GRPO-8k with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "deepdream-ai/DeepScaleR-7B-GRPO-8k" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepdream-ai/DeepScaleR-7B-GRPO-8k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "deepdream-ai/DeepScaleR-7B-GRPO-8k" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "deepdream-ai/DeepScaleR-7B-GRPO-8k",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use deepdream-ai/DeepScaleR-7B-GRPO-8k with Docker Model Runner:
```
docker model run hf.co/deepdream-ai/DeepScaleR-7B-GRPO-8k
```

DeepScaleR-7B-GRPO-8k

DeepScaleR-7B-GRPO-8k Overview

DeepScaleR-7B-GRPO-8k is a language model fine-tuned from DeepSeek-R1-Distill-Qwen-7B using distributed reinforcement learning (RL) to scale up to long context lengths. We follow the training process of the original DeepScaleR and extend it to a 7B model.

This model is the 8K-context version, corresponding to the stage-1 training of DeepScaleR-1.5B-Preview. The full 8K-context stage consists of 400 training steps, run until the learning curves converge. This release is a snapshot taken at step 240, where the test results on AIME 2023/2024 were relatively better.

Data

As with DeepScaleR-1.5B-Preview, the training dataset consists of approximately 40,000 unique problem-answer pairs compiled from:

- AIME problems (1984-2023)
- AMC problems (prior to 2023)
- Omni-MATH dataset
- Still dataset

Training Recipe

We employ DeepSeek's Group Relative Policy Optimization (GRPO), a simplified RL algorithm that extends PPO by:

- Normalizing the advantage function over all samples generated from the same prompt.
- Applying KL divergence regularization on top of PPO's surrogate loss to prevent significant policy drift.
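
The group-wise advantage normalization in the first point can be sketched as follows. This is a minimal illustration that assumes the standard-score form (reward minus group mean, divided by group standard deviation) commonly used for GRPO. The function name `group_normalized_advantages` and the `eps` term are illustrative and not taken from the training code, and the KL regularization term is not shown.

```python
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """Normalize rewards across the samples generated from one prompt.

    Assumed form: (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: six samples for one prompt, two of them correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
advantages = group_normalized_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Under this form, a prompt whose samples all receive the same reward yields zero advantages and therefore contributes no policy-gradient signal.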

Reward Function: Our reward function is simple but effective:

- 1 for correct answers passing LaTeX/Sympy checks
- 0 for incorrect or improperly formatted answers
Note: No partial rewards (such as PRMs) or intermediate feedback.
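
A binary outcome reward of this kind might look like the sketch below. It is not the checker used in training: it assumes the final answer appears inside `\boxed{...}` and substitutes a whitespace-insensitive string comparison for the LaTeX/Sympy equivalence checks. The names `extract_boxed` and `binary_reward` are hypothetical.

```python
import re


def extract_boxed(text):
    """Return the contents of the last boxed expression in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as improperly formatted


def strip_whitespace(s):
    return re.sub(r"\s+", "", s)


def binary_reward(response, ground_truth):
    """1.0 for a matching boxed answer, 0.0 otherwise (no partial credit)."""
    answer = extract_boxed(response)
    if answer is None:
        return 0.0
    # Assumption: plain string equality stands in for symbolic equivalence.
    matches = strip_whitespace(answer) == strip_whitespace(ground_truth)
    return 1.0 if matches else 0.0


print(binary_reward("The answer is \\boxed{42}.", "42"))  # 1.0
print(binary_reward("The answer is 42.", "42"))           # 0.0
```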

Iterative Context Lengthening:

Initial 8K Context (0-400 steps):
- 38% -> ~50% Pass@1 on AIME 2023
- Trained on 4 H20 GPUs, batch size = (prompts) × (samples per prompt) = 128 × 6 = 768
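
The batch arithmetic can be checked with a few lines of Python. The even split across GPUs and the assumption of one batch of rollouts per training step are illustrative, not stated in this card; the step counts are taken from the overview.

```python
PROMPTS_PER_STEP = 128
SAMPLES_PER_PROMPT = 6
NUM_GPUS = 4
TOTAL_STEPS = 400     # full 8K-context stage
SNAPSHOT_STEP = 240   # step of this released snapshot

rollouts_per_step = PROMPTS_PER_STEP * SAMPLES_PER_PROMPT
# Assumption: rollouts are divided evenly across the GPUs.
rollouts_per_gpu = rollouts_per_step // NUM_GPUS

print(rollouts_per_step)                  # 768
print(rollouts_per_gpu)                   # 192
# Assumption: one batch of rollouts is generated per training step.
print(rollouts_per_step * SNAPSHOT_STEP)  # 184320
print(rollouts_per_step * TOTAL_STEPS)    # 307200
```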

We will share our findings, recipe, and wandb logs in our upcoming blog post.

Evaluation

We report Pass@1 accuracy averaged over 16 samples for each problem.
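
As a rough illustration of this metric, the sketch below averages per-problem accuracy over the samples drawn for each problem and then averages across problems. The function name `pass_at_1` and the toy results are hypothetical and do not come from the evaluation harness.

```python
def pass_at_1(results):
    """Compute Pass@1 (in percent) from per-sample correctness flags.

    results maps a problem id to a list of booleans, one per sampled answer.
    """
    per_problem = [sum(flags) / len(flags) for flags in results.values()]
    return 100.0 * sum(per_problem) / len(per_problem)


# Toy data: two problems, 16 samples each.
results = {
    "problem_a": [True] * 8 + [False] * 8,   # 8/16 correct
    "problem_b": [True] * 4 + [False] * 12,  # 4/16 correct
}
print(pass_at_1(results))  # 37.5
```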

| Model | AIME 2024 | MATH 500 | AMC 2023 | Minerva Math | OlympiadBench | Avg. |
|---|---|---|---|---|---|---|
| Qwen-2.5-7B-Instruct | 13.3 | 79.8 | 50.6 | 34.6 | 40.7 | 43.8 |
| rStar-Math-7B | 26.7 | 78.4 | 47.5 | - | 47.1 | - |
| Eurus-2-7B-PRIME | 26.7 | 79.2 | 57.8 | 38.6 | 42.1 | 48.9 |
| Qwen2.5-7B-SimpleRL | 26.7 | 82.4 | 62.5 | 39.7 | 43.3 | 50.9 |
| DeepSeek-R1-Distill-Qwen-1.5B | 28.8 | 82.8 | 62.9 | 26.5 | 43.3 | 48.9 |
| Still-1.5B | 32.5 | 84.4 | 66.7 | 29.0 | 45.4 | 51.6 |
| DeepScaleR-1.5B-Preview | 43.1 | 87.8 | 73.6 | 30.2 | 50.0 | 57.0 |
| O1-Preview | 40.0 | 81.4 | - | - | - | - |

Serving DeepScaleR

Our model can be served using popular high-performance inference systems:

- vLLM
- Hugging Face Text Generation Inference (TGI)
- SGLang
- TensorRT-LLM

All these systems support the OpenAI Chat Completions API format.
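
Because the request format is shared, a single client can talk to any of these servers. The sketch below uses only the Python standard library and assumes a server is already running locally on port 8000 (the port in the vLLM example on this page; the SGLang example uses 30000). The helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "deepdream-ai/DeepScaleR-7B-GRPO-8k"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request in the OpenAI Chat Completions format."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the prompt and return the assistant's reply text."""
    request = build_chat_request(prompt, base_url)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running, `print(chat("What is the capital of France?"))` prints the model's reply; pass `base_url="http://localhost:30000"` for the SGLang setup.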

License

This project is released under the MIT License, reflecting our commitment to open and accessible AI development. We believe in democratizing AI technology by making our work freely available for anyone to use, modify, and build upon. This permissive license ensures that researchers, developers, and enthusiasts worldwide can leverage and extend our work without restrictions, fostering innovation and collaboration in the AI community.

Acknowledgement

- Our training experiments are powered by our heavily modified fork of Verl, an open-source RLHF library.
- Our model is trained on top of DeepSeek-R1-Distill-Qwen-7B.
- Our work is done as part of Berkeley Sky Computing Lab and Berkeley AI Research.

Citation

@misc{deepscaler2025,
  title={DeepScaleR: Surpassing O1-Preview with a 1.5B Model by Scaling RL},
  author={Michael Luo and Sijun Tan and Justin Wong and Xiaoxiang Shi and William Y. Tang and Manan Roongta and Colin Cai and Jeffrey Luo and Tianjun Zhang and Li Erran Li and Raluca Ada Popa and Ion Stoica},
  year={2025},
  howpublished={\url{https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2}},
  note={Notion Blog}
}


Safetensors

Model size: 8B params
Tensor type: F32

Model tree for deepdream-ai/DeepScaleR-7B-GRPO-8k

Base model: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Finetuned (635): this model (deepdream-ai/DeepScaleR-7B-GRPO-8k)