# Qwen2.5-7B-Instruct — INT4 NF4 Quantized
Alibaba's Qwen2.5-7B-Instruct quantized to 4-bit NF4 with double quantization for robotic reasoning and planning. 2.7x smaller — from 14.5 GB to 5.3 GB — while preserving instruction-following and reasoning capabilities.
This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.
## Why This Model Exists

Robots need to reason about instructions, plan multi-step tasks, and generate structured outputs — all in real time on edge hardware. Qwen2.5-7B is one of the strongest open-source instruction-following models at this scale, with excellent performance on reasoning, coding, and structured output generation. At 14.5 GB, however, it is too large for edge GPUs. INT4 NF4 double quantization brings it to 5.3 GB — small enough to fit on a single L4 24GB alongside vision models.
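The footprint can be sanity-checked with back-of-envelope arithmetic: 4 bits per quantized weight plus one absmax scale per 64-weight block (the bitsandbytes default). The ~7.6B parameter count and the exclusion of the 16-bit embedding/output layers are assumptions here, which is why this rough estimate lands below the measured 5.3 GB:

```python
def nf4_size_gb(params: float, double_quant: bool = True, block_size: int = 64) -> float:
    """Rough NF4 footprint: 4-bit weights plus per-block absmax scales.

    Double quantization stores each scale in ~8 bits instead of FP32,
    shaving a few hundred MB at this scale. Layers kept in 16-bit
    (embeddings, output head) are not counted.
    """
    weight_bytes = params * 0.5                # 4 bits per weight
    scale_bits = 8 if double_quant else 32     # per-block absmax scale
    scale_bytes = params / block_size * scale_bits / 8
    return (weight_bytes + scale_bytes) / 1e9

print(f"{nf4_size_gb(7.6e9):.1f} GB")  # 3.9 GB for the quantized layers
```

The gap up to the measured 5,301 MB is roughly what the unquantized 16-bit tensors account for.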
## Model Details
| Property | Value |
|---|---|
| Architecture | Qwen2 (decoder-only transformer) |
| Parameters | 7B |
| Hidden Dimension | 3584 |
| Layers | 28 |
| Attention Heads | 28 (4 KV heads, GQA) |
| MLP Dimension | 18944 (SiLU activation) |
| Context Length | 32,768 tokens |
| Vocabulary | 152,064 tokens |
| RoPE | θ = 1,000,000 |
| Quantization | NF4 double quantization (bitsandbytes) |
| Original Model | Qwen/Qwen2.5-7B-Instruct |
| License | Apache-2.0 |
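One practical consequence of the GQA layout in the table is a small KV cache: only the 4 KV heads are cached per layer, not all 28 query heads. A quick sketch of the cache size in BF16, where head_dim = 3584 / 28 = 128 is inferred from the table rather than stated in it:

```python
def kv_cache_bytes(tokens: int, layers: int = 28, kv_heads: int = 4,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """K and V tensors per layer, per KV head, in BF16 (2 bytes each)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

print(kv_cache_bytes(1))                            # 57344 bytes (~56 KiB) per token
print(f"{kv_cache_bytes(32_768) / 2**30:.2f} GiB")  # 1.75 GiB at the full 32K context
```

With full multi-head attention (28 KV heads) the same 32K context would need 7x as much cache, which is why GQA matters for fitting next to the 5.3 GB weights on a 24 GB card.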
## Compression Results
Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.
| Metric | Original | INT4 Quantized | Change |
|---|---|---|---|
| Total Size | 14,537 MB | 5,301 MB | 2.7x smaller |
| Quantization | BF16 | NF4 + double quant | 4-bit weights |
| Compute Dtype | BF16 | BF16 | Preserved at inference |
| Format | SafeTensors | SafeTensors | Direct HF loading |
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "robotflowlabs/qwen2.5-7b-instruct-int4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("robotflowlabs/qwen2.5-7b-instruct-int4")

messages = [
    {"role": "system", "content": "You are a robotic task planner."},
    {"role": "user", "content": "Plan the steps to pick up the red cup and place it on the shelf."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### With FORGE (ANIMA Integration)

```python
from forge.language import LanguageModelRegistry

planner = LanguageModelRegistry.load("qwen2.5-7b-instruct-int4")
plan = planner.generate("Pick up the red cup and place it on the shelf")
```
## Use Cases in ANIMA
Qwen2.5-7B serves as the reasoning backbone in ANIMA:
- Task Planning — Decompose natural language instructions into executable step sequences
- Code Generation — Generate robot control scripts and action sequences
- Structured Output — Produce JSON task plans, waypoint lists, and parameter configs
- Safety Reasoning — Evaluate whether proposed actions are safe before execution
- Error Recovery — Diagnose failures and generate recovery plans
- Human Dialogue — Natural language interaction with operators
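For the structured-output use case, a common pattern is to prompt the model for a JSON step list and parse the reply defensively, since chat models often wrap JSON in prose. A minimal sketch, where the `parse_task_plan` helper and the `"action"`/`"target"` step schema are illustrative assumptions, not part of ANIMA's API:

```python
import json

def parse_task_plan(raw: str) -> list[dict]:
    """Pull a JSON array of steps out of a model reply that may
    contain surrounding prose, then sanity-check each step."""
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in model output")
    steps = json.loads(raw[start:end + 1])
    for step in steps:
        if "action" not in step:
            raise ValueError(f"step missing 'action': {step}")
    return steps

# Example reply with prose around the JSON payload
reply = (
    "Here is the plan:\n"
    '[{"action": "grasp", "target": "red_cup"},'
    ' {"action": "place", "target": "shelf"}]'
)
plan = parse_task_plan(reply)
print([s["action"] for s in plan])  # ['grasp', 'place']
```

Validating before execution keeps a malformed generation from reaching the robot's action layer.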
## About ANIMA
ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.
## Other Collections
- ANIMA Vision — SAM2, DINOv2, CLIP, SigLIP, Depth Anything
- ANIMA Language — Qwen2.5, SmolLM2
- ANIMA VLM — Qwen2.5-VL
- ANIMA VLA — SmolVLA, RDT2-FM, FORGE students
## Intended Use

### Designed For
- On-device robotic task planning and reasoning
- Instruction following in manipulation and navigation pipelines
- Structured output generation (JSON, code, action sequences)
- Multi-turn dialogue with human operators
### Limitations
- INT4 quantization may slightly reduce performance on complex reasoning benchmarks
- 32K context window may not be sufficient for very long interaction histories
- Requires GPU (bitsandbytes NF4 does not run on CPU)
- Inherits biases from Qwen2.5 training data
### Out of Scope
- Safety-critical autonomous decision making without human oversight
- Medical or legal advice generation
- Generation of harmful content
## Technical Details

### Compression Pipeline

```text
Original Qwen2.5-7B-Instruct (BF16, 14.5 GB)
 │
 └─→ bitsandbytes NF4 double quantization
      ├─→ bnb_4bit_quant_type: nf4
      ├─→ bnb_4bit_use_double_quant: true
      ├─→ bnb_4bit_compute_dtype: bfloat16
      └─→ model.safetensors (5.3 GB)
```
- Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
- Compute: BF16 at inference — weights dequantized on-the-fly
- Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14
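The pipeline settings above map directly onto transformers' `BitsAndBytesConfig`. A minimal sketch of reproducing the quantization with those settings; it requires a CUDA GPU, downloads the full BF16 checkpoint, and the output path is illustrative:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit
    bnb_4bit_use_double_quant=True,         # quantize the quantization scales too
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 at inference
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
model.save_pretrained("qwen2.5-7b-instruct-int4")  # illustrative output path
```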
## Attribution

- Original Model: `Qwen/Qwen2.5-7B-Instruct` by Alibaba Cloud (License: Apache-2.0)
- Paper: Qwen2.5 Technical Report — Qwen Team, 2024
- Compressed by: RobotFlowLabs using FORGE
## Citation

```bibtex
@article{qwen2.5,
  title={Qwen2.5 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2412.15115},
  year={2024}
}
```
Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.