Instructions to use newmindai/QwQ-32B-r1 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use newmindai/QwQ-32B-r1 with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="newmindai/QwQ-32B-r1")# Load model directly from transformers import AutoModel model = AutoModel.from_pretrained("newmindai/QwQ-32B-r1", dtype="auto") - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use newmindai/QwQ-32B-r1 with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "newmindai/QwQ-32B-r1" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "newmindai/QwQ-32B-r1", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker
docker model run hf.co/newmindai/QwQ-32B-r1
- SGLang
How to use newmindai/QwQ-32B-r1 with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "newmindai/QwQ-32B-r1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "newmindai/QwQ-32B-r1", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "newmindai/QwQ-32B-r1" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "newmindai/QwQ-32B-r1", "prompt": "Once upon a time,", "max_tokens": 512, "temperature": 0.5 }' - Docker Model Runner
How to use newmindai/QwQ-32B-r1 with Docker Model Runner:
docker model run hf.co/newmindai/QwQ-32B-r1
Overview
newmindai/QwQ-32B-r1 is a LoRA adapter, fine-tuned via Reinforcement Learning (RL) on top of the base model QwQ-32B. It incorporates:
- ORMs (Open Reward Modules)
- DAPO (Decoder Appearance Optimization)
- SimpleScaling (Multi-objective loss balancing)
This is an adapter, not a fully merged model. To use it, you must load it on top of the base model (
Qwen/QwQ-32B) using thepeftlibrary.
Training Setup
Base Model
- Architecture:
QwQ-32B(Qwen-style transformer) - Libraries:
transformers,trl,deepspeed,accelerate,vllm - Tokenizer: Custom-trained (compatible with Hugging Face format)
Reward Modules (ORMs)
| Reward Function | Description |
|---|---|
math |
Evaluates symbolic math correctness (MathORM) |
accuracy |
Targets numeric accuracy (MathAccuracy) |
format |
Enforces strict formatting constraints |
cosine |
Measures similarity to gold responses |
repetition |
Penalizes repeated or degenerate outputs |
soft_overlong |
Soft penalty for overly long generations |
These were combined and scaled during training with adaptive weighting.
Scaling Techniques
- DAPO (Appearance Optimization): Regularizes attention and layout structure in decoder outputs.
- SimpleScaling (
newmindai/simplescaling): Controls optimizer behavior and reward balance across multiple objectives.
Training Regime
- Stage 1 (Wait #1): Model explores reward landscape; initial rewards unstable.
- Stage 2 (Wait #2): Convergence improves as ORM signals align.
- Aha Moment: Clear gains in math and formatting scores around ~2K steps after warm-up.
Evaluation
🐍 Mezura-SnakeBench Benchmarking
Final performance was benchmarked using the Mezura SnakeBench framework — a standardized evaluation suite developed by NewmindAI for structured Turkish NLP tasks.
Usage Example (LoRA Adapter)
This adapter must be loaded on top of the base model Qwen/QwQ-32B using the peft library:
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
base_model_id = "Qwen/QwQ-32B"
adapter_id = "newmindai/QwQ-32B-r1"
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
base_model_id,
torch_dtype=torch.float16,
device_map="auto"
)
# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, adapter_id)
# Inference
prompt = "Türkiye'nin en yüksek dağı nedir?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
- Downloads last month
- 8