How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="thinktecture/gemma3-4b-ft-nextera-q4_k_m",
	filename="gemma3-4b-ft-nextera-q4_k_m.gguf",
)
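# Run one chat turn and keep the response instead of discarding it
response = llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The reply text is in the first choice's message
print(response["choices"][0]["message"]["content"])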

⚠️ Conference talk demo: not production weights.

This model accompanies a conference keynote on local on-device AI. It is published as a reference for the fine-tuning patterns shown on stage, not as a deployable artefact. It has had no security audit, carries no SLA, and is pinned to the state of the talk.


Gemma3-4B FT (Q4_K_M) - RAG Synthesis (+ Vision) - production

Base model: google/gemma-3-4b-it (4.3B params; multimodal: text + vision via mmproj)
License: Gemma Terms of Use
Provenance: llama-quantize from the F16 sibling GGUF (no separate training run; quantization only). See finetune/convert_gemma3_4b_to_gguf.sh.
File size: 2.49 GB (vs 7.77 GB for F16), i.e. ~3× memory-bandwidth headroom on decode
Hardware tested: RTX PRO 6000 (Blackwell sm_120), MBP M5 Max (Metal), DGX Spark (GB10 sm_121), Strix Halo (Vulkan/RDNA 3.5); byte-deterministic across all four
Intended use: Production RAG response synthesis. Points-of-use: scenarios/<scenario>.json:synthesis_4b_gguf_ft. The vision channel uses the same GGUF (multimodal via the same mmproj as the F16 variant).
Out of scope: Tool calling (delegated to Qwen3.5-4B FT). Free-form chat without retrieved context.
Reference eval (Nextera, 2026-05-17): Identical quality to F16 on the 80-query RAG groundtruth set (MBP same-machine F16-vs-Q4_K_M A/B: 55/80 vs 54/80 = 1-query phrasing noise, zero semantic regression). Realized perf gains vs F16: RAG p50 -25% to -57%, image-query p50 -32% to -61% across the four-backend fleet.
Known failure modes: Same as F16 sibling. Q4_K_M-specific quantization artifacts were not observed in our evals; we would expect them most on rare-token tail behavior.
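Because the intended use is RAG synthesis and free-form chat without retrieved context is out of scope, a call closer to the intended use would carry the retrieved passages in the prompt. The sketch below shows one way to assemble such a message list. The helper name `build_rag_messages`, the instruction wording, and the passage numbering are illustrative assumptions; the prompt format the fine-tune was actually trained on is not shown in this card.

```python
def build_rag_messages(question, passages):
    """Build a single-turn chat payload that carries retrieved context.

    Hypothetical helper: the wording and layout are assumptions,
    not the prompt format used to train this model.
    """
    # Number each passage so an answer can point back at its source.
    context = "\n\n".join(
        f"[{i}] {text.strip()}" for i, text in enumerate(passages, start=1)
    )
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question.strip()}"
    )
    # Everything goes into one user turn, so nothing depends on how the
    # chat template treats a system role (an assumption made for safety).
    return [{"role": "user", "content": prompt}]


messages = build_rag_messages(
    "Which quantization does this file use?",
    ["The file gemma3-4b-ft-nextera-q4_k_m.gguf is a Q4_K_M quantization."],
)
print(messages[0]["content"])
```

The resulting list can be passed to the model loaded in the usage snippet, for example `llm.create_chat_completion(messages=messages)`.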