Qwen2.5-7B-Instruct — INT4 NF4 Quantized

Alibaba's Qwen2.5-7B-Instruct quantized to 4-bit NF4 with double quantization for robotic reasoning and planning. 2.7x smaller — from 14.5 GB to 5.3 GB — while preserving instruction-following and reasoning capabilities.

This model is part of the RobotFlowLabs model library, built for the ANIMA agentic robotics platform — a modular ROS2-native AI system that brings foundation model intelligence to real robots operating in the real world.

Why This Model Exists

Robots need to reason about instructions, plan multi-step tasks, and generate structured outputs — all in real time on edge hardware. Qwen2.5-7B is one of the strongest open-source instruction-following models at this scale, with excellent performance on reasoning, coding, and structured output generation. At 14.5 GB, however, it is too large for edge GPUs. INT4 NF4 double quantization brings it down to 5.3 GB, fitting on a single L4 24GB alongside vision models.

Model Details

Property          Value
Architecture      Qwen2 (decoder-only transformer)
Parameters        7B
Hidden Dimension  3584
Layers            28
Attention Heads   28 (4 KV heads, GQA)
MLP Dimension     18944 (SiLU activation)
Context Length    32,768 tokens
Vocabulary        152,064 tokens
RoPE              θ = 1,000,000
Quantization      NF4 double quantization (bitsandbytes)
Original Model    Qwen/Qwen2.5-7B-Instruct
License           Apache-2.0
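
The sizes in the next section follow directly from these architecture numbers. A back-of-envelope check (ignoring biases and RMSNorm weights, so it slightly undercounts):

```python
# Rough parameter count from the architecture table above.
# Biases and RMSNorm weights are ignored, so this slightly undercounts.
hidden, layers, vocab = 3584, 28, 152064
mlp_dim = 18944
head_dim = hidden // 28          # 28 query heads
kv_dim = 4 * head_dim            # 4 KV heads (GQA)

embed = vocab * hidden                               # input embeddings
lm_head = vocab * hidden                             # output projection (untied)
attn = 2 * hidden * hidden + 2 * hidden * kv_dim     # Q, O + K, V projections
mlp = 3 * hidden * mlp_dim                           # gate, up, down (SwiGLU)
total = embed + lm_head + layers * (attn + mlp)

print(f"{total / 1e9:.2f}B parameters")              # ≈ 7.6B
print(f"BF16 size ≈ {total * 2 / 2**20:,.0f} MB")    # ≈ 14,500 MB, matching the table below
```

At 2 bytes per parameter in BF16 this lands within a few MB of the reported 14,537 MB original size.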

Compression Results

Quantized on an NVIDIA L4 24GB GPU using bitsandbytes NF4 with double quantization.

Metric         Original      INT4 Quantized       Change
Total Size     14,537 MB     5,301 MB             2.7x smaller
Quantization   BF16          NF4 + double quant   4-bit weights
Compute Dtype  BF16          BF16                 Preserved at inference
Format         SafeTensors   SafeTensors          Direct HF loading
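
For reference, the settings in this table map onto transformers' `BitsAndBytesConfig`. A minimal sketch of the equivalent configuration (the exact quantization script used here is not published):

```python
import torch
from transformers import BitsAndBytesConfig

# NF4 + double quantization with BF16 compute: the settings from the table above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # Normal Float 4-bit
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Passing quantization_config=bnb_config to AutoModelForCausalLM.from_pretrained
# would quantize the BF16 original on load; this repo ships the weights
# pre-quantized, so the Quick Start below needs no config at all.
```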

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Weights are stored pre-quantized (NF4); transformers reads the
# bitsandbytes settings from the checkpoint automatically.
model = AutoModelForCausalLM.from_pretrained(
    "robotflowlabs/qwen2.5-7b-instruct-int4",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("robotflowlabs/qwen2.5-7b-instruct-int4")

messages = [
    {"role": "system", "content": "You are a robotic task planner."},
    {"role": "user", "content": "Plan the steps to pick up the red cup and place it on the shelf."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, not the echoed prompt.
response = outputs[0][inputs.input_ids.shape[1]:]
print(tokenizer.decode(response, skip_special_tokens=True))

With FORGE (ANIMA Integration)

from forge.language import LanguageModelRegistry

planner = LanguageModelRegistry.load("qwen2.5-7b-instruct-int4")
plan = planner.generate("Pick up the red cup and place it on the shelf")

Use Cases in ANIMA

Qwen2.5-7B serves as the reasoning backbone in ANIMA:

  • Task Planning — Decompose natural language instructions into executable step sequences
  • Code Generation — Generate robot control scripts and action sequences
  • Structured Output — Produce JSON task plans, waypoint lists, and parameter configs
  • Safety Reasoning — Evaluate whether proposed actions are safe before execution
  • Error Recovery — Diagnose failures and generate recovery plans
  • Human Dialogue — Natural language interaction with operators
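
The structured-output use case in practice means prompting the model for JSON and validating it before anything executes. A minimal sketch with a hypothetical plan schema (the `action`/`target` fields are illustrative, not ANIMA's actual interface):

```python
import json

# Hypothetical plan schema: a JSON list of steps, each with an action and a target.
REQUIRED_KEYS = {"action", "target"}

def parse_plan(raw: str) -> list[dict]:
    """Parse and validate a JSON task plan emitted by the model.

    Raises ValueError on invalid JSON or a malformed step, so the caller
    can re-prompt the model instead of executing a bad plan.
    """
    try:
        steps = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(steps, list):
        raise ValueError("expected a JSON list of steps")
    for i, step in enumerate(steps):
        missing = REQUIRED_KEYS - step.keys()
        if missing:
            raise ValueError(f"step {i} missing keys: {sorted(missing)}")
    return steps

plan = parse_plan('[{"action": "pick", "target": "red cup"},'
                  ' {"action": "place", "target": "shelf"}]')
print([s["action"] for s in plan])  # ['pick', 'place']
```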

About ANIMA

ANIMA is a modular, ROS2-native agentic robotics platform developed by RobotFlowLabs. It combines 58 specialized AI modules into a unified system for real-world robotic autonomy.

Intended Use

Designed For

  • On-device robotic task planning and reasoning
  • Instruction following in manipulation and navigation pipelines
  • Structured output generation (JSON, code, action sequences)
  • Multi-turn dialogue with human operators

Limitations

  • INT4 quantization may slightly reduce performance on complex reasoning benchmarks
  • 32K context window may not be sufficient for very long interaction histories
  • Requires GPU (bitsandbytes NF4 does not run on CPU)
  • Inherits biases from Qwen2.5 training data

Out of Scope

  • Safety-critical autonomous decision making without human oversight
  • Medical or legal advice generation
  • Generation of harmful content

Technical Details

Compression Pipeline

Original Qwen2.5-7B-Instruct (BF16, 14.5 GB)
    │
    └─→ bitsandbytes NF4 double quantization
        ├─→ bnb_4bit_quant_type: nf4
        ├─→ bnb_4bit_use_double_quant: true
        ├─→ bnb_4bit_compute_dtype: bfloat16
        └─→ model.safetensors (5.3 GB)

  • Quantization: NF4 (Normal Float 4-bit) with double quantization via bitsandbytes
  • Compute: BF16 at inference — weights dequantized on the fly
  • Hardware: NVIDIA L4 24GB, CUDA 13.0, PyTorch 2.10, Python 3.14

Attribution

Citation

@article{qwen2.5,
  title={Qwen2.5 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2412.15115},
  year={2024}
}

Built with FORGE by RobotFlowLabs
Optimizing foundation models for real robots.
