Instructions to use chill123/gemma3-smart-q4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use chill123/gemma3-smart-q4 with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="chill123/gemma3-smart-q4",
    filename="gemma3-1b-q4_0.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Ciao! Chi sei?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use chill123/gemma3-smart-q4 with llama.cpp:
Install with Homebrew (macOS/Linux)
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
./llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
./build/bin/llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Use Docker
docker model run hf.co/chill123/gemma3-smart-q4:Q4_0
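Once llama-server is running, it can be queried from any OpenAI-compatible client. A minimal sketch in Python, assuming the server is listening on its default port 8080 (adjust `SERVER_URL` if you passed a different --port):

```python
import json
import urllib.request

# llama-server's default listen address; change if you used --host/--port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(messages, temperature=0.7, top_p=0.9):
    """Assemble the JSON body for an OpenAI-style /v1/chat/completions call."""
    return {
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(messages):
    """POST the request to the local server and return the reply text."""
    body = json.dumps(build_chat_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(chat([{"role": "user", "content": "Ciao! Chi sei?"}]))
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at `http://localhost:8080/v1`.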
- LM Studio
- Jan
- Ollama
How to use chill123/gemma3-smart-q4 with Ollama:
ollama run hf.co/chill123/gemma3-smart-q4:Q4_0
- Unsloth Studio
How to use chill123/gemma3-smart-q4 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
Use Hugging Face Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
- Docker Model Runner
How to use chill123/gemma3-smart-q4 with Docker Model Runner:
docker model run hf.co/chill123/gemma3-smart-q4:Q4_0
- Lemonade
How to use chill123/gemma3-smart-q4 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull chill123/gemma3-smart-q4:Q4_0
Run and chat with the model
lemonade run user.gemma3-smart-q4-Q4_0
List all available models
lemonade list
Gemma3 Smart Q4: Bilingual Offline Assistant for Raspberry Pi
Gemma3 Smart Q4 is a quantized bilingual (Italian/English) variant of Google's Gemma 3 1B model, optimized for edge devices like the Raspberry Pi 4 and 5. It runs completely offline with Ollama or llama.cpp, ensuring privacy and speed without external dependencies.
Optimized for Raspberry Pi
- Tested on Raspberry Pi 4 (4 GB): average speed 3.56-3.67 tokens/s
- Fully offline: no external APIs, no internet required
- Lightweight: under 800 MB in Q4 quantization
- Bilingual: switches seamlessly between Italian and English
Key Features
- Bilingual AI: automatically detects and responds in Italian or English
- Edge-optimized: fine-tuned parameters for low-power ARM devices
- Privacy-first: all inference happens locally on your device
- Two quantizations available:
  - Q4_K_M (~769 MB): better quality, more coherent reasoning
  - Q4_0 (~687 MB): roughly 3% faster in the benchmarks below, ideal for real-time interactions
Benchmark Results
Tested on Raspberry Pi 4 (4GB RAM) with Ollama:
| Model | Avg Speed | Individual Results | File Size | Use Case |
|---|---|---|---|---|
| gemma3-1b-q4_k_m.gguf | 3.56 tokens/s | 3.71, 3.58, 3.40 t/s | 769 MB | Better quality, long conversations |
| gemma3-1b-q4_0.gguf | 3.67 tokens/s | 3.65, 3.67, 3.70 t/s | 687 MB | Default choice, general use |
Test details:
- Hardware: Raspberry Pi 4 (4GB RAM)
- OS: Raspberry Pi OS (Debian Bookworm)
- Runtime: Ollama 0.x
- Prompts: Mixed Italian/English, typical assistant queries
Recommendation: Use Q4_0 as the default (3% faster, 82 MB smaller, comparable quality). Use Q4_K_M only if you need slightly better coherence in very long conversations (1000+ tokens).
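The deltas behind this recommendation can be reproduced from the table above. The snippet also shows how per-run tokens/s figures are typically derived with Ollama, using the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields its API reports; the specific token counts below are illustrative, not from the original benchmark:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count/eval_duration response fields."""
    return eval_count / (eval_duration_ns / 1e9)

q4_k_m_speed, q4_0_speed = 3.56, 3.67   # avg tokens/s from the table above
q4_k_m_size, q4_0_size = 769, 687       # file sizes in MB

speedup = (q4_0_speed / q4_k_m_speed - 1) * 100
size_saved = q4_k_m_size - q4_0_size

print(f"Q4_0 is {speedup:.1f}% faster and {size_saved} MB smaller")
# -> Q4_0 is 3.1% faster and 82 MB smaller

# Illustrative: 110 tokens generated in 30 s works out to 3.67 tokens/s
print(round(tokens_per_second(110, 30_000_000_000), 2))
```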
Quick Start with Ollama
Option 1: Pull from Hugging Face
Create a Modelfile:
cat > Modelfile <<'MODELFILE'
FROM hf.co/chill123/gemma3-smart-q4/gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful. If a task requires internet access or external services, clearly state this and suggest local alternatives when possible.
Sei un assistente AI offline che opera su Raspberry Pi. Rileva automaticamente la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile. Se un compito richiede accesso a internet o servizi esterni, indicalo chiaramente e suggerisci alternative locali quando possibile.
"""
MODELFILE
Then run:
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Ciao! Chi sei?"
Option 2: Download and Use Locally
# Download the model
wget https://huggingface.co/chill123/gemma3-smart-q4/resolve/main/gemma3-1b-q4_0.gguf
# Create Modelfile
cat > Modelfile <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful.
Sei un assistente AI offline su Raspberry Pi. Rileva la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile.
"""
MODELFILE
# Create and run
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Hello! Introduce yourself."
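Beyond the interactive CLI, the created model can be driven programmatically over Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port 11434 and the model was created with the name `gemma3-smart-q4` as above:

```python
import json
import urllib.request

# Ollama serves a local REST API on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(prompt: str, model: str = "gemma3-smart-q4") -> dict:
    """JSON body for a non-streaming /api/chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """Send the prompt to the local model and return its reply."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# Example (requires `ollama serve` and the model created above):
# print(chat("Qual e la capitale d'Italia?"))
```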
Recommended Parameters
For Raspberry Pi 4/5, use these optimized settings:
Temperature: 0.7 # Balanced creativity vs consistency
Top-p: 0.9 # Nucleus sampling for diverse responses
Context Length: 1024 # Optimal for Pi 4 memory
Threads: 4 # Utilizes all Pi 4 cores
Batch Size: 32 # Optimized for throughput
Repeat Penalty: 1.05 # Reduces repetitive outputs
For faster responses (e.g., voice assistant), reduce num_ctx to 512.
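If you drive the model through Ollama's REST API rather than baking the parameters into a Modelfile, the same settings go in the request's `options` object, whose keys match the Modelfile PARAMETER names. A sketch of the settings above, including the smaller-context voice-assistant variant:

```python
# Modelfile PARAMETER names map one-to-one onto the `options` object
# accepted by Ollama's /api/generate and /api/chat endpoints.

PI4_OPTIONS = {
    "temperature": 0.7,      # balanced creativity vs consistency
    "top_p": 0.9,            # nucleus sampling for diverse responses
    "num_ctx": 1024,         # context window sized for Pi 4 memory
    "num_thread": 4,         # one thread per Pi 4 core
    "num_batch": 32,         # batch size tuned for throughput
    "repeat_penalty": 1.05,  # reduces repetitive outputs
}

def voice_assistant_options() -> dict:
    """Variant with the smaller context suggested above for faster replies."""
    return {**PI4_OPTIONS, "num_ctx": 512}

# Request body using these options with Ollama's /api/chat endpoint:
request_body = {
    "model": "gemma3-smart-q4",
    "messages": [{"role": "user", "content": "Ciao!"}],
    "options": PI4_OPTIONS,
    "stream": False,
}
```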
Files Included
- gemma3-1b-q4_k_m.gguf: Q4_K_M quantization (~769 MB), better quality
- gemma3-1b-q4_0.gguf: Q4_0 quantization (~687 MB), faster speed
License & Attribution
This is a derivative work of Google's Gemma 3 1B. Please review and comply with the Gemma License.
Quantization, optimization, and bilingual configuration by Antonio.
Links
- GitHub Repository: antonio/gemma3-smart-q4 (code, demos, benchmark scripts)
- Original Model: Google Gemma 3 1B IT
- Ollama Library: Coming soon (pending submission)
Use Cases
- Privacy-focused personal assistant: all data stays on your device
- Offline home automation: control IoT devices without cloud dependencies
- Educational projects: learn AI/ML without expensive hardware
- Voice assistants: fast enough for real-time speech interaction
- Embedded systems: industrial applications requiring offline inference
Built with ❤️ by Antonio 🇮🇹. Empowering privacy and edge computing, one model at a time.