Rio 3.0 Open Mini

Rio 3.0 Open Mini is a compact reasoning model developed by IplanRIO, the municipal IT company of Rio de Janeiro's city government. Built by distilling Qwen3-4B-Thinking-2507 on reasoning traces from our yet-to-be-announced Rio 3.0 model, Rio 3.0 Open Mini achieves strong results across mathematics, STEM, and code benchmarks, surpassing its base model by significant margins and competing with models far larger than itself.

Rio 3.0 Open Mini features SwiReasoning (Shi et al., 2025), a training-free inference framework that dynamically switches between explicit chain-of-thought and latent-space reasoning, guided by entropy-based confidence signals. This yields both higher accuracy and substantially better token efficiency; the model was explicitly trained to maximize the gains from latent reasoning.

Key Features

  • 4B total parameters
  • 262,144 token context window
  • SwiReasoning integration — dynamic explicit/latent reasoning switching for Pareto-superior accuracy and efficiency
  • Distilled from Qwen3-4B-Thinking-2507 with traces from Rio 3.0
  • Multilingual — strong performance in Portuguese, English, Chinese, and dozens of other languages
  • MIT License — fully open for commercial and research use

Benchmark Results

Mathematics & STEM

| Model | GPQA Diamond | LiveCodeBench | Composite Math* | AIME 2025 | AIME 2026 I | HMMT 2025 I | HMMT 2025 II | BRUMO 2025 | CMIMC 2025 | SMT 2025 |
|---|---|---|---|---|---|---|---|---|---|---|
| Rio 3.0 Open Mini | 71.90% | 63.50% | 78.11% | 89.17% | 75.00% | 73.33% | 79.17% | 85.83% | 66.88% | 77.36% |
| Rio 3.0 Open Mini (w/o latent) | 70.10% | 62.00% | 75.53% | 85.83% | 75.83% | 66.67% | 74.17% | 84.17% | 63.75% | 78.30% |
| Qwen3-4B-2507 (base) | 65.80% | 55.20% | 71.12% | 81.67% | 70.83% | 55.83% | 73.33% | 81.67% | 60.00% | 74.53% |
| Qwen3-30B-A3B-2507 | 73.40% | 66.00% | 76.08% | 82.50% | 76.67% | 70.83% | 75.83% | 85.00% | 66.25% | 75.47% |
| GPT OSS 20B | 71.50% | 70.26% | 82.34% | 89.17% | 85.00% | 76.67% | 83.33% | 86.67% | 72.50% | 83.02% |

*Composite Math is the average across all other mathematics benchmarks in this table.
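As a sanity check, the Composite Math definition can be reproduced from the per-contest scores in the table (values below are hard-coded from the Rio 3.0 Open Mini row):

```python
# Recompute Composite Math for Rio 3.0 Open Mini as the mean of the
# seven individual math contests listed in the benchmark table.
math_scores = {
    "AIME 2025": 89.17,
    "AIME 2026 I": 75.00,
    "HMMT 2025 I": 73.33,
    "HMMT 2025 II": 79.17,
    "BRUMO 2025": 85.83,
    "CMIMC 2025": 66.88,
    "SMT 2025": 77.36,
}
composite = sum(math_scores.values()) / len(math_scores)
print(f"{composite:.2f}%")  # 78.11%, matching the reported Composite Math
```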

Rio Model Family Comparison

| Model | GPQA Diamond | LiveCodeBench | Composite Math* | AIME 2025 |
|---|---|---|---|---|
| Rio 3.0 Open | 85.10% | 76.00% | 91.78% | 96.67% |
| Rio 2.5 Open | 77.20% | 69.60% | 87.53% | 93.33% |
| Rio 3.0 Open Mini | 71.90% | 63.50% | 78.11% | 89.17% |

Gains Over Base Model (Qwen3-4B-2507)

| Benchmark | Base Model | Rio 3.0 Open Mini | Δ |
|---|---|---|---|
| GPQA Diamond | 65.80% | 71.90% | +6.10% |
| LiveCodeBench | 55.20% | 63.50% | +8.30% |
| Composite Math | 71.12% | 78.11% | +6.99% |
| AIME 2025 | 81.67% | 89.17% | +7.50% |
| AIME 2026 I | 70.83% | 75.00% | +4.17% |
| HMMT 2025 I | 55.83% | 73.33% | +17.50% |
| HMMT 2025 II | 73.33% | 79.17% | +5.84% |
| BRUMO 2025 | 81.67% | 85.83% | +4.16% |
| CMIMC 2025 | 60.00% | 66.88% | +6.88% |
| SMT 2025 | 74.53% | 77.36% | +2.83% |
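The deltas follow directly by subtraction; a short script (scores copied from the benchmark table, shown here for a subset of rows) reproduces them:

```python
# Per-benchmark gains of Rio 3.0 Open Mini over its base model,
# computed from the scores reported above.
base = {"GPQA Diamond": 65.80, "LiveCodeBench": 55.20,
        "AIME 2025": 81.67, "HMMT 2025 I": 55.83}
mini = {"GPQA Diamond": 71.90, "LiveCodeBench": 63.50,
        "AIME 2025": 89.17, "HMMT 2025 I": 73.33}
for name in base:
    delta = mini[name] - base[name]
    print(f"{name}: +{delta:.2f}%")  # e.g. "HMMT 2025 I: +17.50%"
```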

SwiReasoning: Latent/Explicit Reasoning

Rio 3.0 Open Mini integrates SwiReasoning (Shi et al., 2025), a training-free inference framework that dynamically alternates between two reasoning modes:

  • Explicit reasoning — standard chain-of-thought in natural language, where the model commits tokens to a single reasoning path
  • Latent reasoning — continuous reasoning in hidden space, where the model explores multiple implicit paths simultaneously without emitting tokens

The switching is governed by block-wise confidence estimated from entropy trends in the next-token distribution. When confidence is low (entropy trending upward), the model enters latent mode to explore alternatives. When confidence recovers, it switches back to explicit mode to commit to a solution.

This approach achieves a Pareto-superior trade-off: higher accuracy at unlimited budgets and dramatically better token efficiency under constrained budgets.
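The switching mechanism is implemented inside the inference stack, not in this card, but the entropy-based confidence signal can be illustrated with a minimal sketch. The window size, mode names, and switch rule below are illustrative assumptions, not the actual SwiReasoning hyperparameters:

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_mode(entropy_history, window=3):
    """Illustrative switch rule: enter latent mode when entropy has
    risen monotonically over the last `window` steps (confidence
    falling); otherwise stay in explicit chain-of-thought."""
    if len(entropy_history) < window:
        return "explicit"
    recent = entropy_history[-window:]
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    return "latent" if rising else "explicit"

# A peaked distribution has low entropy (high confidence);
# a uniform one is maximal at ln(vocab_size).
print(entropy([0.97, 0.01, 0.01, 0.01]))  # low
print(entropy([0.25, 0.25, 0.25, 0.25]))  # ln(4) ≈ 1.386
print(choose_mode([0.4, 0.9, 1.3]))       # rising entropy -> "latent"
```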

The benchmark table above includes a (w/o latent) row showing performance with standard explicit-only reasoning, demonstrating gains from SwiReasoning on most benchmarks (AIME 2026 I and SMT 2025 are slightly higher without it).

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prefeitura-rio/Rio-3.0-Open-Mini"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

prompt = "Write a poem about Rio de Janeiro."

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=81920,
    temperature=0.6,
    top_p=0.95,
)

response = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
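Assuming the model inherits the Qwen3-Thinking chat template, the generated ids contain a `</think>` boundary (token id 151668 in the Qwen3 vocabulary) separating the reasoning trace from the final answer. A helper like the following, mirroring the Qwen3 model-card recipe, splits the two:

```python
def split_thinking(output_ids, think_end_id=151668):
    """Split generated token ids at the last </think> marker.
    Returns (thinking_ids, answer_ids); if no marker is present,
    everything is treated as the answer."""
    try:
        # index of the first token *after* the last </think>
        boundary = len(output_ids) - output_ids[::-1].index(think_end_id)
    except ValueError:
        return [], output_ids
    return output_ids[:boundary], output_ids[boundary:]

# Usage with the generation above (identifiers from that snippet):
# gen_ids = outputs[0][inputs.input_ids.shape[-1]:].tolist()
# think_ids, answer_ids = split_thinking(gen_ids)
# print(tokenizer.decode(answer_ids, skip_special_tokens=True))
```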

Using with vLLM

vllm serve prefeitura-rio/Rio-3.0-Open-Mini \
    --tensor-parallel-size 4 \
    --max-model-len 262144 \
    --trust-remote-code

Using with SGLang

python -m sglang.launch_server \
    --model-path prefeitura-rio/Rio-3.0-Open-Mini \
    --tp 4 \
    --context-length 262144 \
    --trust-remote-code

Model Details

| Detail | Value |
|---|---|
| Developer | IplanRIO (Empresa Municipal de Informática e Planejamento S.A.) |
| Base Model | Qwen3-4B-Thinking-2507 |
| Architecture | Transformer |
| Total Parameters | ~4B |
| Context Length | 262,144 tokens |
| Default Max Output Length | 81,920 tokens |
| Training Method | Distillation |
| Inference Enhancement | SwiReasoning (latent/explicit switching) |
| License | MIT |
| Languages | Multilingual (en, pt, zh, ja, ko, fr, de, es, ar, and more) |

Citation

If you use SwiReasoning, please also cite:

@misc{shi2025swireasoning,
    title={SwiReasoning: Switch-Thinking in Latent and Explicit for Pareto-Superior Reasoning LLMs},
    author={Dachuan Shi et al.},
    year={2025},
    eprint={2510.05069},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Acknowledgments

Rio 3.0 Open Mini is built upon the exceptional work of the Qwen Team and their Qwen3 model family. We also acknowledge the authors of SwiReasoning for their innovative inference framework.

Developed in Rio de Janeiro 🇧🇷 by IplanRIO.
