```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="rushigulum/grpo")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("rushigulum/grpo")
model = AutoModelForCausalLM.from_pretrained("rushigulum/grpo")

messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

Uploaded model
- Developed by: rushigulum
- License: apache-2.0
- Finetuned from model: Qwen/Qwen2.5-3B-Instruct

This Qwen2.5 model was trained 2x faster with Unsloth and Hugging Face's TRL library.
Nano R1 is a fine-tuned variant of Qwen2.5-3B-Instruct, aligned using Group Relative Preference Optimization (GRPO) for reasoning-intensive tasks such as math problem-solving. The model is trained with Unsloth + TRL + vLLM to ensure efficient fine-tuning, faster inference, and improved contextual accuracy.
Key Highlights:
- Base Model: Qwen2.5-3B-Instruct (via Hugging Face)
- Fine-Tuning: GRPO reinforcement learning with custom reward functions
- Optimizations: LoRA adapters, 4-bit quantization, vLLM inference
- Dataset: GSM8K (math reasoning) with structured XML reasoning prompts
- Deployment: Hugging Face Hub integration
Model Loading with LoRA
- Base: Qwen/Qwen2.5-3B-Instruct
- Optimizations: 4-bit quantization, LoRA rank=64, gradient checkpointing
Reward Functions
- Semantic Correctness → via Sentence-BERT embeddings.
- Strict XML Compliance → ensures reasoning/answer separation.
- Numerical Answer Check → enforces valid math outputs.
- Length & Format Penalty → prevents overly long or unstructured responses.
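The XML-compliance and numerical-answer rewards can be sketched with plain regular expressions. This is an illustration only: the exact tag names (`<reasoning>`/`<answer>`) and scoring values are assumptions, since the card does not publish the training code.

```python
import re

# Assumed XML schema for illustration; the actual tags used in training
# may differ.
XML_PATTERN = re.compile(
    r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>$",
    re.DOTALL,
)

def xml_compliance_reward(completion: str) -> float:
    """Return 1.0 if the completion keeps reasoning and answer separated
    in the expected XML structure, else 0.0."""
    return 1.0 if XML_PATTERN.match(completion.strip()) else 0.0

def numeric_answer_reward(completion: str, target: str) -> float:
    """Return 1.0 if the <answer> block contains exactly the target number."""
    match = re.search(r"<answer>\n(.*?)\n</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == target else 0.0
```

In GRPO these per-completion scores are typically summed with the other reward terms before computing group-relative advantages.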
GRPO Training
- Optimizer: AdamW (8-bit)
- Batch size: 1 (with gradient accumulation)
- Learning rate: 5e-6
- Steps: 150 (demo run)
- Inference engine: vLLM for efficiency
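The "group relative" part of GRPO scores each sampled completion against the other completions drawn for the same prompt. A minimal sketch of that normalization (standard GRPO math, not the repository's actual training code; the `eps` stabilizer is an assumption):

```python
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-4) -> list[float]:
    """Normalize each completion's reward against the mean and standard
    deviation of its sampling group, as GRPO does when weighting
    policy-gradient updates. `eps` guards against zero-variance groups."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Completions that beat their group's average get positive advantages and are reinforced; below-average completions are pushed down, without needing a separate value network.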
Evaluation
Benchmarked on the GSM8K validation set. Metrics:
- Final Answer Accuracy (semantic similarity above a threshold)
- Format Compliance (% of responses following the XML structure)
- Average Reward Score across completions
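The accuracy metric can be sketched as thresholded cosine similarity over embedding vectors. The card uses Sentence-BERT embeddings; here the vectors are taken as given, and the 0.9 threshold is an assumed value for illustration:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def final_answer_accuracy(similarities: list[float], threshold: float = 0.9) -> float:
    """Fraction of predictions whose similarity to the gold answer clears
    the threshold (the threshold value here is an assumption)."""
    return sum(s >= threshold for s in similarities) / len(similarities)
```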
Results
- Improved reasoning structure with a consistent reasoning/answer format.
- Higher semantic accuracy vs. the baseline Qwen2.5-3B.
- Optimized inference speed using vLLM.
Gated model: log in with a HF token that has gated-access permission:

```shell
hf auth login
```