Instructions for using haoranhe/ROVER-countdown-3B with libraries and local apps.

Libraries

How to use haoranhe/ROVER-countdown-3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="haoranhe/ROVER-countdown-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so it is visible when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("haoranhe/ROVER-countdown-3B")
model = AutoModelForCausalLM.from_pretrained("haoranhe/ROVER-countdown-3B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use haoranhe/ROVER-countdown-3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "haoranhe/ROVER-countdown-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "haoranhe/ROVER-countdown-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
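
For callers who prefer Python over curl, the same request can be sketched with the standard library alone. This is a minimal sketch, not part of the official instructions: the helper names `build_chat_request` and `chat` are hypothetical, and the default `base_url` assumes the vLLM server started above is listening on port 8000 (pass `base_url="http://localhost:30000"` for the SGLang server described later).

```python
import json
import urllib.request

MODEL_ID = "haoranhe/ROVER-countdown-3B"


def build_chat_request(prompt, model=MODEL_ID, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the text of the first choice.

    Requires a running server; raises urllib.error.URLError otherwise.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` would return the model's reply as a string.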

Use Docker

docker model run hf.co/haoranhe/ROVER-countdown-3B

SGLang

How to use haoranhe/ROVER-countdown-3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "haoranhe/ROVER-countdown-3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "haoranhe/ROVER-countdown-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "haoranhe/ROVER-countdown-3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "haoranhe/ROVER-countdown-3B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use haoranhe/ROVER-countdown-3B with Docker Model Runner:
```
docker model run hf.co/haoranhe/ROVER-countdown-3B
```

Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards

This repository contains the model presented in the paper Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards.

ROVER (Random Policy Valuation for Diverse Reasoning) is a minimalist yet highly effective Reinforcement Learning (RL) method for Large Language Model (LLM) reasoning. It achieves superior optimality and diversity by evaluating uniform-policy Q-values, bypassing complex policy iteration loops typically found in methods like PPO and GRPO. This approach is particularly effective for math reasoning tasks, preserving diversity throughout training for sustained exploration of multiple valid pathways.
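
To make "evaluating uniform-policy Q-values" concrete, here is a minimal sketch on a toy countdown-style puzzle: combine numbers with +, absolute difference, and * until one number is left, with reward 1 if it equals the target. It illustrates the general idea only and is not the ROVER training code. The puzzle rules, the function names (`actions`, `step`, `uniform_q`, `softmax_policy`), and the `temperature` parameter are all assumptions made for this example. The Q-value of a move is the probability that choosing uniformly at random afterwards reaches the target; a softmax over these values is one simple way to prefer promising moves while keeping every move possible.

```python
import math
from functools import lru_cache
from itertools import combinations

# Toy operations (an assumption for this sketch); "-" is absolute difference.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: abs(a - b),
    "*": lambda a, b: a * b,
}


def actions(state):
    """All (i, j, op) moves that combine two of the remaining numbers."""
    return [(i, j, op)
            for i, j in combinations(range(len(state)), 2)
            for op in OPS]


def step(state, action):
    """Apply a move and return the next state as a sorted tuple."""
    i, j, op = action
    rest = [x for k, x in enumerate(state) if k not in (i, j)]
    return tuple(sorted(rest + [OPS[op](state[i], state[j])]))


@lru_cache(maxsize=None)
def uniform_q(state, action, target):
    """Q-value of `action` when all later moves are chosen uniformly at random."""
    nxt = step(state, action)
    if len(nxt) == 1:  # terminal: verifiable 0/1 reward
        return 1.0 if nxt[0] == target else 0.0
    nxt_actions = actions(nxt)
    return sum(uniform_q(nxt, a, target) for a in nxt_actions) / len(nxt_actions)


def softmax_policy(state, target, temperature=1.0):
    """Turn uniform-policy Q-values into move probabilities via a softmax."""
    acts = actions(state)
    qs = [uniform_q(state, a, target) for a in acts]
    top = max(qs)  # subtract the max for numerical stability
    weights = [math.exp((q - top) / temperature) for q in qs]
    total = sum(weights)
    return {a: w / total for a, w in zip(acts, weights)}


if __name__ == "__main__":
    for move, prob in softmax_policy((1, 2, 3), target=6).items():
        print(move, round(prob, 3))
```

In this toy setting no policy is ever improved iteratively: the values are computed once for the fixed random policy, and several distinct first moves that can still reach the target receive equal probability.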

Paper: Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards
Code: https://github.com/tinnerhrhe/ROVER

Main Results and Features

*Figure 1: (a) ROVER achieves superior performance in terms of both pass@1 and pass@256 (trained on Qwen3-8B-Base, averaged over the AIME24, AIME25 and HMMT25 tasks). (b) Illustrative example demonstrating that ROVER achieves high-quality solutions with a lightweight procedure (see the table below for details) while maintaining diversity. (c) ROVER achieves higher diversity.*
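
For readers unfamiliar with the pass@k metric in the caption, the sketch below shows the commonly used unbiased estimator: given n sampled solutions of which c are correct, it estimates the chance that at least one of k samples is correct. Whether the paper used exactly this estimator is an assumption here, and the function name `pass_at_k` is chosen for this example.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: number of samples drawn per problem
    c: number of those samples that are correct
    k: budget of attempts being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:  # fewer than k incorrect samples: some correct one is always drawn
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A method with diverse outputs tends to score well at large k (such as pass@256) because different samples succeed on different problems, while pass@1 reflects single-attempt accuracy.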

ROVER has low GPU memory and computation costs, which leaves more room for the KV cache. This lets ROVER run on smaller-memory setups and speeds up training:

| Method | Memory Usage of Model Parameters |
|---|---|
| ROVER (Ours) | Low (actor model ONLY!😊) |
| GRPO | Medium (actor + reference model) |
| PPO | High (actor + reference + critic model) |
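
As a rough back-of-the-envelope illustration of the table, the sketch below estimates weight memory per method. It rests on simplifying assumptions that are not from the paper: every model copy (actor, reference, critic) has the same size as the actor, and only weights are counted, ignoring gradients, optimizer state, activations, and the KV cache. The names `MODEL_COPIES` and `weight_memory_gb` are hypothetical.

```python
# Bytes per parameter for two common tensor types.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Number of model copies held in memory, following the table above.
MODEL_COPIES = {"ROVER": 1, "GRPO": 2, "PPO": 3}


def weight_memory_gb(num_params, dtype, copies):
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] * copies / 1e9


if __name__ == "__main__":
    # 3B parameters, matching the size of this model.
    for method, copies in MODEL_COPIES.items():
        print(method, weight_memory_gb(3e9, "bfloat16", copies), "GB")
```

Under these assumptions the weight footprint scales linearly with the number of model copies, which is the gap the table summarizes as Low, Medium, and High.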

For installation, training, and evaluation instructions, please refer to the GitHub repository.

Citation

If you find the project useful, please consider citing our paper:

@article{he2025randompolicyvaluation,
      title={Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards}, 
      author={Haoran He and Yuxiao Ye and Qingpeng Cai and Chen Hu and Binxing Jiao and Daxin Jiang and Ling Pan},
      journal={arXiv preprint arXiv:2509.24981},
      year={2025}
}

Safetensors
Model size: 3B params
Tensor type: F32

Model tree for haoranhe/ROVER-countdown-3B
Base model: Qwen/Qwen2.5-3B
Finetuned: this model

Paper for haoranhe/ROVER-countdown-3B
Random Policy Valuation is Enough for LLM Reasoning with Verifiable Rewards (arXiv 2509.24981, published Sep 29, 2025)