YAML Metadata Warning: empty or missing yaml metadata in repo card
Check out the documentation for more information.
Qwen3.5-2B Quantization for Small Devices
Whole Quantization was done autonomously by NEO an autonomous ML agent !! totally on its own !!
Optimized quantized LLM deployment for Raspberry Pi and resource-constrained edge devices
This repository contains extreme quantization of the Qwen3.5-2B model for deployment on edge devices with limited RAM (<2GB). The models are converted to GGUF format using llama.cpp quantization techniques.
π Model Variants
| Model | Size | BPW | RAM Required | Target Device | Quality |
|---|---|---|---|---|---|
| Q4_K_S | 1.15 GB | 4.37 | ~1.4 GB | High-end edge (2GB+ RAM) | ββββ Best |
| Q3_K_S | 973 MB | 4.29 | ~1.2 GB | Mid-range (1.5GB+ RAM) | ββββ Good |
| Q2_K | 873 MB | 3.85 | ~1.0 GB | Low-end (<1GB RAM) | βββ Compressed |
Size Reduction
Original FP16: 3,600 MB (baseline)
Q4_K_S: 1,152 MB (68% reduction)
Q3_K_S: 973 MB (73% reduction)
Q2_K: 873 MB (76% reduction)
ποΈ Architecture
Qwen3.5-2B Specifications
- Parameters: 2.0B (1.5B active during inference)
- Architecture: Transformer with SwiGLU activation, RoPE embeddings
- Context Length: 32K tokens (limited to 2K for edge deployment)
- Vocabulary: 151,936 tokens
- Hidden Size: 2,048
- Layers: 36
- Attention Heads: 16 (Q), 16 (KV)
Quantization Method
- Tool: llama-quantize (llama.cpp)
- Formats: Q2_K, Q3_K_S, Q4_K_S (K-quantization)
- K-quantization: Mixed precision with importance-aware weight clustering
π Quick Start
Prerequisites
# Python 3.8+
pip install -r requirements.txt
# Build llama.cpp (if not already built)
cd llama.cpp && mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
List Available Models
python edge_deploy.py --list-models
Run Inference
# Best quality (Q4_K_S)
python edge_deploy.py --model Q4_K_S --prompt "What is a Raspberry Pi?"
# Balanced (Q3_K_S)
python edge_deploy.py --model Q3_K_S --prompt "Explain quantum computing"
# Smallest size (Q2_K)
python edge_deploy.py --model Q2_K --prompt "Hello world" -n 64
Interactive Mode
# Chat with the model
python edge_deploy.py --model Q3_K_S --interactive
API Server
# Start REST API server
python edge_deploy.py --model Q2_K --server --port 8080
# Test with curl
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": "Hello"}]}'
π± Hardware Requirements
Raspberry Pi 4 (4GB RAM)
| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q4_K_S | ~1.4 GB | 5-8 t/s | β Recommended |
| Q3_K_S | ~1.2 GB | 6-10 t/s | β Recommended |
| Q2_K | ~1.0 GB | 8-12 t/s | Good for multi-tasking |
Raspberry Pi 3 (1GB RAM)
| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q3_K_S | ~1.2 GB | 3-5 t/s | β Recommended |
| Q2_K | ~1.0 GB | 4-6 t/s | Good for stability |
Raspberry Pi Zero (512MB RAM)
| Model | RAM Usage | Tokens/sec | Recommendation |
|---|---|---|---|
| Q2_K | ~1.0 GB | 1-2 t/s | β οΈ Requires swap |
Older Mobile Phones (<2GB RAM)
- Recommended: Q2_K with 512-1024 context
- Threads: 2 (to prevent UI lag)
- Expected: 2-5 tokens/second
π Benchmark Results
Measured Performance (Tesla V100)
| Model | Size | Load Time | Tokens/sec | Peak RAM |
|---|---|---|---|---|
| Q4_K_S | 1.15 GB | ~3s | 15-25 t/s | ~2.5 GB |
| Q3_K_S | 973 MB | ~2.5s | 18-28 t/s | ~2.2 GB |
| Q2_K | 873 MB | ~2s | 20-30 t/s | ~2.0 GB |
Raspberry Pi 4 Estimated Performance
| Model | Tokens/sec | Power Draw |
|---|---|---|
| Q4_K_S | 5-8 t/s | ~6-7W |
| Q3_K_S | 6-10 t/s | ~5-6W |
| Q2_K | 8-12 t/s | ~4-5W |
π§ Project Structure
Quantized/
βββ edge_deploy.py # Main deployment script
βββ convert_qwen35_2b_to_gguf.py # Model conversion utilities
βββ run_evaluation.py # Benchmarking script
βββ requirements.txt # Python dependencies
βββ llama.cpp/ # llama.cpp source (build artifacts cleaned)
β βββ build/bin/ # Compiled binaries
βββ output/ # Quantized GGUF models
β βββ qwen3.5-2b-Q2_K.gguf
β βββ qwen3.5-2b-Q3_K_S.gguf
β βββ qwen3.5-2b-Q4_K_S.gguf
βββ data/ # Test data and calibration
π Troubleshooting
Model Not Found
# Check if model files exist
ls -lh output/*.gguf
# If missing, check the output directory path in edge_deploy.py
Out of Memory
# Reduce context size
python edge_deploy.py --model Q2_K --prompt "Hello" -c 512
# Use fewer threads
python edge_deploy.py --model Q2_K --prompt "Hello" -t 1
Slow Performance
# Check CPU frequency (Raspberry Pi)
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
# Enable performance governor
sudo cpufreq-set -g performance
llama-cli Not Found
# Build llama.cpp
cd llama.cpp && mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
π License
This project uses:
- Qwen3.5-2B: Licensed under Qwen License (see HuggingFace)
- llama.cpp: MIT License
- Quantization scripts: MIT License
π€ Contributing
Contributions welcome! Areas for improvement:
- Importance matrix (imatrix) quantization for <800MB models
- ARM NEON optimizations
- Mobile app wrappers (iOS/Android)
- Benchmarking on actual Raspberry Pi hardware
π§ Contact
For questions or issues:
- Open an issue on GitHub
- Check the troubleshooting section above
Whole Quantization was done autonomously by NEO - Your Autonomous AI Engineering Agent
- Downloads last month
- 22
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support