Bopalv committed on Mar 22

Commit ffe4b6a · verified

1 Parent(s): 0627dd9

Upload Qwen3-0.6B-Comparison.md with huggingface_hub

Files changed (1)

Qwen3-0.6B-Comparison.md +168 -0

Qwen3-0.6B-Comparison.md ADDED

	@@ -0,0 +1,168 @@

+# Qwen3-0.6B Quantized Models Comparison
+## Summary
+Three quantized versions of Qwen3-0.6B were created on March 21, 2026 and uploaded to Hugging Face at:
+**https://huggingface.co/Bopalv/Qwen3-0.6B-quantized**
+## Models Overview
+| Model | Format | Quantization | File Size | Status |
+|-------|--------|--------------|-----------|--------|
+| GGUF Q4_K_M | GGUF | 4-bit K-quant | 462 MB | ✅ Complete |
+| GPTQ-Int4 | Safetensors | 4-bit GPTQ | 517 MB | ✅ Complete |
+| GPTQ-Int8 | Safetensors | 8-bit GPTQ | 727 MB | ✅ Complete |
+## Technical Specifications
+### Common Properties
+- **Base Model**: Qwen3-0.6B
+- **Parameters**: 0.6B (about 0.44B excluding embeddings)
+- **Architecture**: Qwen3ForCausalLM
+- **Hidden Size**: 1024
+- **Layers**: 28
+- **Attention Heads**: 16
+- **KV Heads**: 8
+- **Max Context**: 40,960 tokens
+- **Vocab Size**: 151,936
+### Quantization Details
+| Model | Bits | Group Size | Symmetric | Quantizer | Pack Dtype |
+|-------|------|------------|-----------|-----------|------------|
+| GGUF Q4_K_M | 4 | 32 (in 256-weight super-blocks) | No (per-block scale and minimum) | llama.cpp | N/A |
+| GPTQ-Int4 | 4 | 128 | Yes | gptqmodel 4.0.0 | int32 |
+| GPTQ-Int8 | 8 | 128 | Yes | gptqmodel 2.2.0 | int32 |
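To make the "Group Size" and "Symmetric" columns concrete, here is a minimal sketch of symmetric, group-wise round-to-nearest quantization. It is an illustration only, not the GPTQ algorithm (which additionally compensates rounding error using calibration data) and not the llama.cpp K-quant layout. The function names are hypothetical, and the signed code range is a simplification.

```python
def quantize_group(weights, bits):
    """Symmetric round-to-nearest quantization of one group of weights.

    Returns (integer codes, scale). Symmetric means zero maps to code 0 and
    one scale covers both signs, so no zero-point is stored.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize(weights, bits, group_size=128):
    """Quantize a flat weight list group by group; each group gets its own scale."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def max_error(weights, bits, group_size=128):
    """Largest absolute difference between original and restored weights."""
    restored = []
    for codes, scale in quantize(weights, bits, group_size):
        restored.extend(dequantize_group(codes, scale))
    return max(abs(a - b) for a, b in zip(weights, restored))


if __name__ == "__main__":
    # Synthetic weights in [-1, 1]; real model weights are not used here.
    demo = [((i * 37) % 101 - 50) / 50 for i in range(256)]
    print("4-bit max error:", max_error(demo, bits=4))
    print("8-bit max error:", max_error(demo, bits=8))
```

With the same group size, the 8-bit variant has a much finer grid per group, which is the intuition behind ranking GPTQ-Int8 above GPTQ-Int4 for expected quality.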
+## File Size Comparison
+```
+GGUF Q4_K_M  ████████████████████ 462 MB (Smallest)
+GPTQ-Int4    ████████████████████████ 517 MB (+12%)
+GPTQ-Int8    ████████████████████████████████████ 727 MB (+57%)
+```
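The percentages in the chart can be re-derived from the file sizes in the Models Overview table. The short sketch below does that and also renders bars scaled to the largest file; the helper names are illustrative.

```python
# File sizes in MB, taken from the Models Overview table.
SIZES_MB = {"GGUF Q4_K_M": 462, "GPTQ-Int4": 517, "GPTQ-Int8": 727}


def relative_overhead(sizes):
    """Return each model's size increase over the smallest file, in whole percent."""
    smallest = min(sizes.values())
    return {name: round((mb / smallest - 1) * 100) for name, mb in sizes.items()}


def bar_chart(sizes, width=36):
    """Render one text bar per model, scaled so the largest file spans `width` cells."""
    largest = max(sizes.values())
    lines = []
    for name, mb in sizes.items():
        cells = round(mb / largest * width)
        lines.append(f"{name:<12} {'█' * cells} {mb} MB")
    return lines


if __name__ == "__main__":
    print(relative_overhead(SIZES_MB))
    print("\n".join(bar_chart(SIZES_MB)))
```

The computed overheads are 12% for GPTQ-Int4 and 57% for GPTQ-Int8, matching the labels in the chart.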
+## Theoretical Performance Analysis
+### Memory Usage
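+- **GGUF Q4_K_M**: ~462 MB for weights, plus KV cache and runtime overhead
+- **GPTQ-Int4**: ~517 MB for weights, plus KV cache and runtime overhead
+- **GPTQ-Int8**: ~727 MB for weights, plus KV cache and runtime overhead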
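Beyond the weights, memory use grows with context length because of the KV cache. The rough estimate below uses the layer and KV-head counts from Common Properties. The head dimension (128) and a 16-bit cache are assumptions that this document does not state, so check the model's `config.json` and your runtime settings before relying on the numbers.

```python
# Values from the Common Properties list.
NUM_LAYERS = 28
NUM_KV_HEADS = 8
# Assumptions (not stated in this document): head dimension and cache precision.
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # 16-bit cache entries


def kv_cache_bytes(num_tokens):
    """Bytes needed to cache keys and values for `num_tokens` positions."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


def estimated_ram_mb(weights_mb, num_tokens):
    """Weights plus KV cache in MB (1 MB = 10**6 bytes); ignores other overhead."""
    return weights_mb + kv_cache_bytes(num_tokens) / 1e6


if __name__ == "__main__":
    for ctx in (1024, 4096, 40960):
        print(f"{ctx:>6} tokens: ~{round(estimated_ram_mb(462, ctx))} MB")
```

Under these assumptions, the GGUF file with a 1,024-token context comes to roughly 579 MB, which is in line with the ~600 MB estimate in the Storage Requirements table. Filling the full 40,960-token context would need several gigabytes for the cache alone.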
+### Expected Quality (Lower bits = More compression, Potentially lower quality)
+1. **GPTQ-Int8**: Best quality (8-bit precision)
+2. **GPTQ-Int4**: Good quality (4-bit with group quantization)
+3. **GGUF Q4_K_M**: Good quality (4-bit K-quant, optimized for llama.cpp)
+### Expected Speed (CPU-based)
+1. **GGUF Q4_K_M**: Fastest (optimized for llama.cpp, smallest size)
+2. **GPTQ-Int4**: Medium (requires dequantization overhead)
+3. **GPTQ-Int8**: Slowest (largest size, more computation)
+## Compatibility
+### GGUF Q4_K_M
+- ✅ llama.cpp
+- ✅ prima.cpp (if Qwen3 architecture is supported)
+- ✅ Ollama
+- ✅ LM Studio
+- ✅ Text Generation WebUI
+### GPTQ-Int4 & GPTQ-Int8
+- ✅ HuggingFace Transformers
+- ✅ AutoGPTQ
+- ✅ vLLM
+- ✅ Text Generation WebUI
+- ⚠️ llama.cpp (requires conversion)
+## Usage Recommendations
+### For CPU-only systems
+**Recommended: GGUF Q4_K_M**
+- Smallest file size
+- Optimized for CPU inference
+- Fastest loading time
+- Compatible with llama.cpp ecosystem
+### For GPU systems
+**Recommended: GPTQ-Int4**
+- Good balance of quality and size
+- Works with AutoGPTQ and Transformers
+- Faster than GPTQ-Int8
+- Expected to suit GPU runtimes better than GGUF (not benchmarked here)
+### For Maximum Quality
+**Recommended: GPTQ-Int8**
+- Highest precision (8-bit)
+- Best output quality
+- Requires more memory
+- Slower inference
+## Benchmarking Notes
+The model-efficiency tool from `bopalvelut-prog/model-efficiency` requires:
+1. Ollama running on port 11434, OR
+2. An OpenAI-compatible API server
+To benchmark these models:
+### Option 1: Using Ollama
+```bash
+# Install Ollama
+curl -fsSL https://ollama.com/install.sh | sh
+# Import GGUF model (the Modelfile needs a FROM line pointing at the .gguf file)
+ollama create qwen3-0.6b-gguf -f Modelfile
+# Run benchmark
+cd model-efficiency
+python model_efficiency_comparator.py -p "Your test prompt"
+```
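If the model-efficiency tool is not at hand, generation speed can also be read directly from Ollama's HTTP API. The sketch below is independent of that tool: it posts one non-streaming request to `/api/generate` on the port mentioned above and derives tokens per second from the `eval_count` and `eval_duration` fields of the response. The function names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(response):
    """Compute generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the
    generation time in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(model, prompt, url=OLLAMA_URL):
    """Send one non-streaming request and return the parsed JSON response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Example (requires a running Ollama instance with the model imported):
#   result = generate("qwen3-0.6b-gguf", "Your test prompt")
#   print(f"{tokens_per_second(result):.1f} tokens/s")
```

Reading the server-reported duration avoids counting HTTP and model-load time as generation time.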
+### Option 2: Using prima.cpp
+```bash
+# Start server with GGUF model
+/home/ma/prima.cpp/llama-server \
+  -m /home/ma/models/Qwen3-0.6B-GGUF/Qwen3-0.6B.Q4_K_M.gguf \
+  --port 8080
+# Test with curl (from a second terminal, once the server is listening)
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":50}'
+```
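The curl check above can be turned into a simple timing measurement. The following sketch sends the same chat-completions request from Python and reports wall-clock tokens per second. It assumes the server fills in the `usage.completion_tokens` field of the OpenAI-style response; the function names are illustrative.

```python
import json
import time
import urllib.request


def build_request(base_url, prompt, max_tokens=50):
    """Build the same chat-completions request as the curl command."""
    body = {"messages": [{"role": "user", "content": prompt}], "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )


def timed_completion(base_url, prompt, max_tokens=50):
    """Return (reply text, elapsed seconds, completion tokens per second)."""
    request = build_request(base_url, prompt, max_tokens)
    start = time.perf_counter()
    with urllib.request.urlopen(request) as reply:
        result = json.load(reply)
    elapsed = time.perf_counter() - start
    text = result["choices"][0]["message"]["content"]
    # Assumption: the server reports token usage in the response.
    generated = result["usage"]["completion_tokens"]
    return text, elapsed, generated / elapsed


# Example (requires the server from the block above to be running):
#   text, seconds, speed = timed_completion("http://localhost:8080", "Hello")
#   print(f"{seconds:.2f}s, {speed:.1f} tokens/s")
```

Because the timer wraps the whole HTTP round trip, the figure includes prompt processing and network overhead, so it is best used to compare models under identical conditions.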
+### Option 3: Using Transformers (for GPTQ models)
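+```python
+import time
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_path = "/home/ma/models/Qwen3-0.6B-GPTQ-Int4"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
+
+# Move the inputs to the model's device; device_map="auto" may place it on a GPU.
+inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
+start = time.perf_counter()
+outputs = model.generate(**inputs, max_new_tokens=100)
+elapsed = time.perf_counter() - start
+
+# Count only newly generated tokens, not the prompt.
+new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
+print(f"Time: {elapsed:.2f}s ({new_tokens / elapsed:.1f} tokens/s)")
+```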
+## Storage Requirements
+| Model | File Size | Disk Space Needed | RAM Needed (Est.) |
+|-------|-----------|-------------------|-------------------|
+| GGUF Q4_K_M | 462 MB | 462 MB | ~600 MB |
+| GPTQ-Int4 | 517 MB | 517 MB | ~700 MB |
+| GPTQ-Int8 | 727 MB | 727 MB | ~900 MB |
+| **All Models** | **1.7 GB** | **1.7 GB** | **~2.2 GB** |
+## Conclusion
+- **Best for CPU/Embedded**: GGUF Q4_K_M (smallest, fastest)
+- **Best for GPU**: GPTQ-Int4 (balanced)
+- **Best Quality**: GPTQ-Int8 (highest precision)
+All models are available at:
+**https://huggingface.co/Bopalv/Qwen3-0.6B-quantized**