Qwen3.5-2B Quantization for Small Devices

The entire quantization was performed autonomously by NEO, an autonomous ML agent, with no human intervention.

Python 3.8+ | License: MIT | Format: GGUF

Optimized quantized LLM deployment for Raspberry Pi and resource-constrained edge devices

This repository contains aggressively quantized builds of the Qwen3.5-2B model for deployment on edge devices with limited RAM (<2GB). The models are converted to GGUF format and quantized with llama.cpp.


📊 Model Variants

| Model | Size | BPW | RAM Required | Target Device | Quality |
|---|---|---|---|---|---|
| Q4_K_S | 1.15 GB | 4.37 | ~1.4 GB | High-end edge (2GB+ RAM) | ⭐⭐⭐⭐ Best |
| Q3_K_S | 973 MB | 4.29 | ~1.2 GB | Mid-range (1.5GB+ RAM) | ⭐⭐⭐⭐ Good |
| Q2_K | 873 MB | 3.85 | ~1.0 GB | Low-end (<1GB RAM) | ⭐⭐⭐ Compressed |
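
The RAM figures in the table can be turned into a simple selection rule. The following is a minimal sketch, not part of `edge_deploy.py`: the function name `pick_variant` and the `headroom_gb` safety margin are hypothetical, and the RAM values are the approximate ones listed above.

```python
from typing import Optional

# Approximate RAM requirements in GB, copied from the Model Variants table.
# Ordered from highest to lowest quality.
RAM_REQUIRED_GB = {
    "Q4_K_S": 1.4,
    "Q3_K_S": 1.2,
    "Q2_K": 1.0,
}


def pick_variant(free_ram_gb: float, headroom_gb: float = 0.2) -> Optional[str]:
    """Return the highest-quality variant that fits, or None if none fits.

    headroom_gb is a hypothetical margin left for the OS and other processes.
    """
    budget = free_ram_gb - headroom_gb
    # Dicts preserve insertion order, so the first fit is the best quality.
    for name, required in RAM_REQUIRED_GB.items():
        if required <= budget:
            return name
    return None


if __name__ == "__main__":
    for free in (0.9, 1.3, 1.5, 2.0):
        print(f"{free:.1f} GB free -> {pick_variant(free)}")
```

A result of `None` means that, under these assumptions, none of the variants fits without swap.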

Size Reduction

Original FP16:    3,600 MB (baseline)
Q4_K_S:           1,152 MB (68% reduction)
Q3_K_S:             973 MB (73% reduction)
Q2_K:               873 MB (76% reduction)
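
The percentages above follow from `1 - quantized / baseline`. As a quick check, this small sketch recomputes them from the listed sizes (the helper name `reduction_percent` is illustrative only):

```python
FP16_MB = 3600  # baseline size from the listing above

# Quantized sizes in MB, as listed above.
QUANTIZED_MB = {"Q4_K_S": 1152, "Q3_K_S": 973, "Q2_K": 873}


def reduction_percent(quantized_mb: float, baseline_mb: float = FP16_MB) -> float:
    """Size reduction relative to the baseline, in percent."""
    return (1 - quantized_mb / baseline_mb) * 100


if __name__ == "__main__":
    for name, size in QUANTIZED_MB.items():
        print(f"{name}: {size} MB ({reduction_percent(size):.0f}% reduction)")
```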

🏗️ Architecture

Qwen3.5-2B Specifications

  • Parameters: 2.0B (1.5B active during inference)
  • Architecture: Transformer with SwiGLU activation, RoPE embeddings
  • Context Length: 32K tokens (limited to 2K for edge deployment)
  • Vocabulary: 151,936 tokens
  • Hidden Size: 2,048
  • Layers: 36
  • Attention Heads: 16 (Q), 16 (KV)

Quantization Method

  • Tool: llama-quantize (llama.cpp)
  • Formats: Q2_K, Q3_K_S, Q4_K_S (K-quantization)
  • K-quantization: Block-wise quantization in 256-weight super-blocks with per-block scales, applied at mixed precision across tensors
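
As a rough illustration of how the three formats could be produced, the sketch below builds `llama-quantize` command lines of the form `llama-quantize <input.gguf> <output.gguf> <type>`. It is not taken from `convert_qwen35_2b_to_gguf.py`: the FP16 input file name is a hypothetical placeholder, and the binary and output paths are assumed from the project layout in this README.

```python
import shlex
from pathlib import Path
from typing import List

# Assumed location of the compiled binary (see Project Structure).
QUANTIZE_BIN = Path("llama.cpp/build/bin/llama-quantize")
# Hypothetical name for the unquantized FP16 GGUF; adjust to your own file.
F16_MODEL = Path("output/qwen3.5-2b-f16.gguf")


def quantize_command(quant_type: str) -> List[str]:
    """Build the argument list: binary, input GGUF, output GGUF, quant type."""
    out_path = Path("output") / f"qwen3.5-2b-{quant_type}.gguf"
    return [str(QUANTIZE_BIN), str(F16_MODEL), str(out_path), quant_type]


if __name__ == "__main__":
    # Print the commands rather than running them; pass a list to
    # subprocess.run(cmd, check=True) to execute one.
    for quant_type in ("Q2_K", "Q3_K_S", "Q4_K_S"):
        print(shlex.join(quantize_command(quant_type)))
```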

🚀 Quick Start

Prerequisites

# Python 3.8+
pip install -r requirements.txt

# Build llama.cpp (if not already built)
cd llama.cpp && mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

List Available Models

python edge_deploy.py --list-models

Run Inference

# Best quality (Q4_K_S)
python edge_deploy.py --model Q4_K_S --prompt "What is a Raspberry Pi?"

# Balanced (Q3_K_S)
python edge_deploy.py --model Q3_K_S --prompt "Explain quantum computing"

# Smallest size (Q2_K)
python edge_deploy.py --model Q2_K --prompt "Hello world" -n 64

Interactive Mode

# Chat with the model
python edge_deploy.py --model Q3_K_S --interactive

API Server

# Start REST API server
python edge_deploy.py --model Q2_K --server --port 8080

# Test with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'
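
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names `build_chat_request` and `chat` are hypothetical, and the response is returned as raw JSON because its exact schema is not documented here.

```python
import json
import urllib.request


def build_chat_request(prompt: str, host: str = "localhost", port: int = 8080) -> urllib.request.Request:
    """Build a POST request equivalent to the curl example."""
    url = f"http://{host}:{port}/v1/chat/completions"
    body = json.dumps({"messages": [{"role": "user", "content": prompt}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, timeout: float = 120.0) -> dict:
    """Send the prompt to a running server and return the decoded JSON reply."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return json.load(resp)


# Usage, with the server from the previous step running:
#   reply = chat("Hello")
#   print(json.dumps(reply, indent=2))
```

The generous default timeout is an assumption, chosen because generation on edge hardware can take a while.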

📱 Hardware Requirements

Raspberry Pi 4 (4GB RAM)

| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q4_K_S | ~1.4 GB | 5-8 t/s | ⭐ Recommended |
| Q3_K_S | ~1.2 GB | 6-10 t/s | ⭐ Recommended |
| Q2_K | ~1.0 GB | 8-12 t/s | Good for multi-tasking |

Raspberry Pi 3 (1GB RAM)

| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q3_K_S | ~1.2 GB | 3-5 t/s | ⭐ Recommended |
| Q2_K | ~1.0 GB | 4-6 t/s | Good for stability |

Raspberry Pi Zero (512MB RAM)

| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q2_K | ~1.0 GB | 1-2 t/s | ⚠️ Requires swap |

Older Mobile Phones (<2GB RAM)

  • Recommended: Q2_K with 512-1024 context
  • Threads: 2 (to prevent UI lag)
  • Expected: 2-5 tokens/second

📊 Benchmark Results

Measured Performance (Tesla V100)

| Model | Size | Load Time | Tokens/sec | Peak RAM |
|---|---|---|---|---|
| Q4_K_S | 1.15 GB | ~3s | 15-25 t/s | ~2.5 GB |
| Q3_K_S | 973 MB | ~2.5s | 18-28 t/s | ~2.2 GB |
| Q2_K | 873 MB | ~2s | 20-30 t/s | ~2.0 GB |

Raspberry Pi 4 Estimated Performance

| Model | Tokens/sec | Power Draw |
|---|---|---|
| Q4_K_S | 5-8 t/s | ~6-7W |
| Q3_K_S | 6-10 t/s | ~5-6W |
| Q2_K | 8-12 t/s | ~4-5W |
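
For battery-powered deployments, the two columns can be combined into energy per token (watts divided by tokens per second gives joules per token). The sketch below does this arithmetic on the estimated ranges above. It is illustrative only: the inputs are estimates, not measurements, and the helper name is hypothetical.

```python
from typing import Dict, Tuple

Range = Tuple[float, float]

# (tokens/sec range, power draw range in watts), from the estimates above.
ESTIMATES: Dict[str, Tuple[Range, Range]] = {
    "Q4_K_S": ((5, 8), (6, 7)),
    "Q3_K_S": ((6, 10), (5, 6)),
    "Q2_K": ((8, 12), (4, 5)),
}


def energy_range(tps: Range, watts: Range) -> Range:
    """Return (best, worst) joules per token for the given ranges."""
    best = min(watts) / max(tps)   # lowest power at highest throughput
    worst = max(watts) / min(tps)  # highest power at lowest throughput
    return best, worst


if __name__ == "__main__":
    for name, (tps, watts) in ESTIMATES.items():
        best, worst = energy_range(tps, watts)
        print(f"{name}: {best:.2f} to {worst:.2f} J/token")
```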

🔧 Project Structure

Quantized/
├── edge_deploy.py                  # Main deployment script
├── convert_qwen35_2b_to_gguf.py    # Model conversion utilities
├── run_evaluation.py               # Benchmarking script
├── requirements.txt                # Python dependencies
├── llama.cpp/                      # llama.cpp source (build artifacts cleaned)
│   └── build/bin/                  # Compiled binaries
├── output/                         # Quantized GGUF models
│   ├── qwen3.5-2b-Q2_K.gguf
│   ├── qwen3.5-2b-Q3_K_S.gguf
│   └── qwen3.5-2b-Q4_K_S.gguf
└── data/                           # Test data and calibration

🐛 Troubleshooting

Model Not Found

# Check if model files exist
ls -lh output/*.gguf

# If missing, check the output directory path in edge_deploy.py

Out of Memory

# Reduce context size
python edge_deploy.py --model Q2_K --prompt "Hello" -c 512

# Use fewer threads
python edge_deploy.py --model Q2_K --prompt "Hello" -t 1
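
To see why a smaller context helps, the KV cache can be estimated from the specifications listed in this README. The sketch below is a rough upper bound under stated assumptions: standard attention in every layer, head dimension equal to hidden size divided by attention heads (2048 / 16 = 128), and an FP16 cache. Actual usage depends on the real architecture and on llama.cpp's cache settings, so treat the numbers as indicative only.

```python
# Values from the Architecture section of this README.
LAYERS = 36
KV_HEADS = 16
HIDDEN_SIZE = 2048
ATTENTION_HEADS = 16

# Assumptions, not confirmed by the README:
HEAD_DIM = HIDDEN_SIZE // ATTENTION_HEADS  # 128
BYTES_PER_VALUE = 2                        # FP16 cache entries


def kv_cache_mib(context_tokens: int) -> float:
    """Estimated KV cache size in MiB for a given context length."""
    # Factor 2 covers keys and values; one entry per layer, head, and dim.
    bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return context_tokens * bytes_per_token / (1024 ** 2)


if __name__ == "__main__":
    for ctx in (512, 1024, 2048):
        print(f"context {ctx}: ~{kv_cache_mib(ctx):.0f} MiB KV cache")
```

Under these assumptions the cache grows linearly with context, so going from `-c 2048` to `-c 512` cuts it to a quarter.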

Slow Performance

# Check CPU frequency (Raspberry Pi)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# Enable performance governor
sudo cpufreq-set -g performance

llama-cli Not Found

# Build llama.cpp
cd llama.cpp && mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)

📝 License

This project uses:

  • Qwen3.5-2B: Licensed under Qwen License (see HuggingFace)
  • llama.cpp: MIT License
  • Quantization scripts: MIT License

🀝 Contributing

Contributions welcome! Areas for improvement:

  • Importance matrix (imatrix) quantization for <800MB models
  • ARM NEON optimizations
  • Mobile app wrappers (iOS/Android)
  • Benchmarking on actual Raspberry Pi hardware

📧 Contact

For questions or issues:

  • Open an issue on GitHub
  • Check the troubleshooting section above

The entire quantization was done autonomously by NEO, your autonomous AI engineering agent.
