Instructions to use chill123/gemma3-smart-q4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use chill123/gemma3-smart-q4 with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="chill123/gemma3-smart-q4",
    filename="gemma3-1b-q4_0.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Ciao! Chi sei?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use chill123/gemma3-smart-q4 with llama.cpp:
Install with Homebrew (macOS/Linux)
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Install from WinGet (Windows)
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Use pre-built binary
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
./llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf chill123/gemma3-smart-q4:Q4_0

# Run inference directly in the terminal:
./build/bin/llama-cli -hf chill123/gemma3-smart-q4:Q4_0
Use Docker
docker model run hf.co/chill123/gemma3-smart-q4:Q4_0
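Once llama-server is running, it can be queried from any OpenAI-compatible client. A minimal sketch in Python, assuming the server is listening on its default port 8080 (adjust `SERVER_URL` if you passed a different --port):

```python
import json
import urllib.request

# llama-server's default listen address; change if you used --host/--port.
SERVER_URL = "http://localhost:8080/v1/chat/completions"

def build_chat_request(messages, temperature=0.7, top_p=0.9):
    """Assemble the JSON body for an OpenAI-style /v1/chat/completions call."""
    return {
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
    }

def chat(messages):
    """POST the request to the local server and return the reply text."""
    body = json.dumps(build_chat_request(messages)).encode("utf-8")
    req = urllib.request.Request(
        SERVER_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(chat([{"role": "user", "content": "Ciao! Chi sei?"}]))
```

Because the endpoint is OpenAI-compatible, the official `openai` Python client also works by pointing its `base_url` at `http://localhost:8080/v1`.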
- LM Studio
- Jan
- Ollama
How to use chill123/gemma3-smart-q4 with Ollama:
ollama run hf.co/chill123/gemma3-smart-q4:Q4_0
- Unsloth Studio
How to use chill123/gemma3-smart-q4 with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
Use Hugging Face Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for chill123/gemma3-smart-q4 to start chatting
- Docker Model Runner
How to use chill123/gemma3-smart-q4 with Docker Model Runner:
docker model run hf.co/chill123/gemma3-smart-q4:Q4_0
- Lemonade
How to use chill123/gemma3-smart-q4 with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull chill123/gemma3-smart-q4:Q4_0
Run and chat with the model
lemonade run user.gemma3-smart-q4-Q4_0
List all available models
lemonade list
Gemma3 Smart Q4: Bilingual Offline Assistant for Raspberry Pi
Gemma3 Smart Q4 is a quantized bilingual (Italian/English) variant of Google's Gemma 3 1B model, optimized for edge devices like the Raspberry Pi 4 and 5. It runs completely offline with Ollama or llama.cpp, ensuring privacy and speed without external dependencies.
Optimized for Raspberry Pi
- Tested on Raspberry Pi 4 (4 GB): average speed 3.56-3.67 tokens/s
- Fully offline: no external APIs, no internet required
- Lightweight: under 800 MB in Q4 quantization
- Bilingual: switches seamlessly between Italian and English
Key Features
- Bilingual AI: automatically detects and responds in Italian or English
- Edge-optimized: fine-tuned parameters for low-power ARM devices
- Privacy-first: all inference happens locally on your device
- Two quantizations available:
  - Q4_K_M (~769 MB): better quality, more coherent reasoning
  - Q4_0 (~687 MB): roughly 3% faster in the benchmarks below, ideal for real-time interactions
Benchmark Results
Tested on Raspberry Pi 4 (4GB RAM) with Ollama:
| Model | Avg Speed | Individual Results | File Size | Use Case |
|---|---|---|---|---|
| gemma3-1b-q4_k_m.gguf | 3.56 tokens/s | 3.71, 3.58, 3.40 t/s | 769 MB | Better quality, long conversations |
| gemma3-1b-q4_0.gguf | 3.67 tokens/s | 3.65, 3.67, 3.70 t/s | 687 MB | Default choice, general use |
Test details:
- Hardware: Raspberry Pi 4 (4GB RAM)
- OS: Raspberry Pi OS (Debian Bookworm)
- Runtime: Ollama 0.x
- Prompts: Mixed Italian/English, typical assistant queries
Recommendation: Use Q4_0 as the default (3% faster, 82 MB smaller, comparable quality). Use Q4_K_M only if you need slightly better coherence in very long conversations (1000+ tokens).
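The deltas behind this recommendation can be reproduced from the table above. The snippet also shows how per-run tokens/s figures are typically derived with Ollama, using the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields its API reports; the specific token counts below are illustrative, not from the original benchmark:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count/eval_duration response fields."""
    return eval_count / (eval_duration_ns / 1e9)

q4_k_m_speed, q4_0_speed = 3.56, 3.67   # avg tokens/s from the table above
q4_k_m_size, q4_0_size = 769, 687       # file sizes in MB

speedup = (q4_0_speed / q4_k_m_speed - 1) * 100
size_saved = q4_k_m_size - q4_0_size

print(f"Q4_0 is {speedup:.1f}% faster and {size_saved} MB smaller")
# -> Q4_0 is 3.1% faster and 82 MB smaller

# Illustrative: 110 tokens generated in 30 s works out to 3.67 tokens/s
print(round(tokens_per_second(110, 30_000_000_000), 2))
```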
Quick Start with Ollama
Option 1: Pull from Hugging Face
Create a Modelfile:
cat > Modelfile <<'MODELFILE'
FROM hf.co/chill123/gemma3-smart-q4/gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful. If a task requires internet access or external services, clearly state this and suggest local alternatives when possible.
Sei un assistente AI offline che opera su Raspberry Pi. Rileva automaticamente la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile. Se un compito richiede accesso a internet o servizi esterni, indicalo chiaramente e suggerisci alternative locali quando possibile.
"""
MODELFILE
Then run:
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Ciao! Chi sei?"
Option 2: Download and Use Locally
# Download the model
wget https://huggingface.co/chill123/gemma3-smart-q4/resolve/main/gemma3-1b-q4_0.gguf
# Create Modelfile
cat > Modelfile <<'MODELFILE'
FROM ./gemma3-1b-q4_0.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 1024
PARAMETER num_thread 4
PARAMETER num_batch 32
PARAMETER repeat_penalty 1.05
SYSTEM """
You are an offline AI assistant running on a Raspberry Pi. Automatically detect the user's language (Italian or English) and respond in the same language. Be concise, practical, and helpful.
Sei un assistente AI offline su Raspberry Pi. Rileva la lingua dell'utente (italiano o inglese) e rispondi nella stessa lingua. Sii conciso, pratico e utile.
"""
MODELFILE
# Create and run
ollama create gemma3-smart-q4 -f Modelfile
ollama run gemma3-smart-q4 "Hello! Introduce yourself."
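Beyond the interactive CLI, the created model can be driven programmatically over Ollama's local REST API. A minimal sketch, assuming Ollama is serving on its default port 11434 and the model was created with the name `gemma3-smart-q4` as above:

```python
import json
import urllib.request

# Ollama serves a local REST API on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/api/chat"

def build_chat_request(prompt: str, model: str = "gemma3-smart-q4") -> dict:
    """JSON body for a non-streaming /api/chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def chat(prompt: str) -> str:
    """Send the prompt to the local model and return its reply."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]

# Example (requires `ollama serve` and the model created above):
# print(chat("Qual e la capitale d'Italia?"))
```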
Recommended Parameters
For Raspberry Pi 4/5, use these optimized settings:
Temperature: 0.7 # Balanced creativity vs consistency
Top-p: 0.9 # Nucleus sampling for diverse responses
Context Length: 1024 # Optimal for Pi 4 memory
Threads: 4 # Utilizes all Pi 4 cores
Batch Size: 32 # Optimized for throughput
Repeat Penalty: 1.05 # Reduces repetitive outputs
For faster responses (e.g., voice assistant), reduce num_ctx to 512.
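If you drive the model through Ollama's REST API rather than baking the parameters into a Modelfile, the same settings go in the request's `options` object, whose keys match the Modelfile PARAMETER names. A sketch of the settings above, including the smaller-context voice-assistant variant:

```python
# Modelfile PARAMETER names map one-to-one onto the `options` object
# accepted by Ollama's /api/generate and /api/chat endpoints.

PI4_OPTIONS = {
    "temperature": 0.7,      # balanced creativity vs consistency
    "top_p": 0.9,            # nucleus sampling for diverse responses
    "num_ctx": 1024,         # context window sized for Pi 4 memory
    "num_thread": 4,         # one thread per Pi 4 core
    "num_batch": 32,         # batch size tuned for throughput
    "repeat_penalty": 1.05,  # reduces repetitive outputs
}

def voice_assistant_options() -> dict:
    """Variant with the smaller context suggested above for faster replies."""
    return {**PI4_OPTIONS, "num_ctx": 512}

# Request body using these options with Ollama's /api/chat endpoint:
request_body = {
    "model": "gemma3-smart-q4",
    "messages": [{"role": "user", "content": "Ciao!"}],
    "options": PI4_OPTIONS,
    "stream": False,
}
```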
Files Included
- gemma3-1b-q4_k_m.gguf: Q4_K_M quantization (~769 MB), better quality
- gemma3-1b-q4_0.gguf: Q4_0 quantization (~687 MB), faster speed
License & Attribution
This is a derivative work of Google's Gemma 3 1B. Please review and comply with the Gemma License.
Quantization, optimization, and bilingual configuration by Antonio.
Links
- GitHub Repository: antonio/gemma3-smart-q4 (code, demos, benchmark scripts)
- Original Model: Google Gemma 3 1B IT
- Ollama Library: Coming soon (pending submission)
Use Cases
- Privacy-focused personal assistant: all data stays on your device
- Offline home automation: control IoT devices without cloud dependencies
- Educational projects: learn AI/ML without expensive hardware
- Voice assistants: fast enough for real-time speech interaction
- Embedded systems: industrial applications requiring offline inference
Built with ❤️ by Antonio 🇮🇹. Empowering privacy and edge computing, one model at a time.