# Agentic Qwen GGUF Models

Quantized Qwen2.5-Coder-Instruct models optimized for agentic CLI tasks on 6GB VRAM GPUs.

## Models

| Model | Size | Context | Use Case |
|---|---|---|---|
| `qwen-0.5b-q4_k_m.gguf` | 380MB | 32k | Fast, simple tool calls |
| `qwen-1.5b-q4_k_m.gguf` | 941MB | 32k | Smarter, still fast |

## Quick Start with Ollama

```bash
# Download the model and its Modelfile
wget https://huggingface.co/antoniostepien/agentic-qwen-gguf/resolve/main/qwen-1.5b-q4_k_m.gguf
wget https://huggingface.co/antoniostepien/agentic-qwen-gguf/resolve/main/Modelfile-1.5b

# Create and run the Ollama model
ollama create agentic-1.5b -f Modelfile-1.5b
ollama run agentic-1.5b
```

## Use with Claude Code / OpenAI API

```bash
# Ollama serves an OpenAI-compatible API on port 11434
claude --model ollama/agentic-1.5b

# Or point any OpenAI client at the local endpoint
export OPENAI_API_BASE=http://localhost:11434/v1
```
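Because the endpoint is OpenAI-compatible, any OpenAI-style client can talk to it. A minimal sketch of the request body for Ollama's `POST /v1/chat/completions` endpoint — the system and user messages are illustrative assumptions, and no network call is made here:

```python
import json

# Request body for Ollama's OpenAI-compatible chat endpoint:
#   POST http://localhost:11434/v1/chat/completions
# The prompts below are illustrative, not part of the model card.
body = {
    "model": "agentic-1.5b",
    "messages": [
        {"role": "system", "content": "You are an agentic CLI assistant."},
        {"role": "user", "content": "List the files in the current directory."},
    ],
    "stream": False,
}
payload = json.dumps(body)
print(payload)
```

With `ollama serve` running, `payload` can be POSTed to `http://localhost:11434/v1/chat/completions` by any HTTP client.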

## Tool Calling Format

The models emit tool calls wrapped in `<tool_call>` XML tags:

```xml
<tool_call>{"name": "shell", "arguments": {"command": "ls -la"}}</tool_call>
```
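On the client side, these tags can be pulled out of a reply and decoded as JSON. A sketch of how a harness might do it — the regex-based extraction is an assumption, not part of the model card:

```python
import json
import re

def parse_tool_calls(reply: str) -> list[dict]:
    """Extract and decode every <tool_call>...</tool_call> payload in a reply."""
    return [
        json.loads(payload)
        for payload in re.findall(r"<tool_call>(.*?)</tool_call>", reply, re.DOTALL)
    ]

reply = (
    "Listing files now.\n"
    '<tool_call>{"name": "shell", "arguments": {"command": "ls -la"}}</tool_call>'
)
calls = parse_tool_calls(reply)
print(calls[0]["name"], calls[0]["arguments"]["command"])  # → shell ls -la
```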

## VRAM Usage (32k context)

- 0.5B Q4: ~2-3GB total
- 1.5B Q4: ~4-5GB total

Perfect for RTX 3060, RTX 4060, or similar 6GB cards.
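Much of the footprint beyond the quantized weights is the KV cache at 32k context. A back-of-envelope sketch — the layer/head counts below are assumed Qwen2.5-1.5B hyperparameters (verify against the model's `config.json`), and real usage adds compute buffers and runtime overhead on top:

```python
# Assumed Qwen2.5-1.5B architecture values (verify against config.json):
layers = 28          # transformer blocks
kv_heads = 2         # grouped-query attention KV heads
head_dim = 128       # dimension per attention head
context = 32768      # 32k context window
bytes_per_value = 2  # fp16 cache entries

# K and V caches: one entry per layer, KV head, head dim, and position.
kv_bytes = 2 * layers * kv_heads * head_dim * context * bytes_per_value
kv_gb = kv_bytes / 1024**3
print(f"KV cache at 32k: ~{kv_gb:.2f} GiB on top of the 941MB weights")
```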

## License

Apache 2.0 (same as the base Qwen2.5-Coder models).
