How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

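# Assumes your llama.cpp build can load this GGUF variant
llm = Llama.from_pretrained(
	repo_id="cstr/embeddinggemma-300m-GGUF",
	filename="embeddinggemma-300m-q8_0.gguf",  # any file from the Files table below
	embedding=True,  # this is an embedding model, not a text generator
)
# Compute an embedding for the input string
output = llm.embed("Hello world")
print(output)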

embeddinggemma-300m GGUF

GGUF format of google/embeddinggemma-300m for use with CrispEmbed and Ollama-compatible runtimes.

Google EmbeddingGemma 300M is a lightweight multilingual embedding model based on Gemma 3, optimized for search, retrieval, and semantic similarity across 100+ languages.

Model details

  • Architecture: Gemma 3 transformer (300M params), bidirectional attention
  • Pooling: Mean pooling + Dense projection (768→3072→768) + L2 normalize
  • Embedding dimension: 768
  • Languages: 100+ languages
  • Context length: 2,048 tokens
  • License: Gemma
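
The pooling pipeline listed above can be sketched in NumPy as follows. This is a minimal illustration, not the CrispEmbed implementation: the function name `pool_and_project`, the random placeholder weights, and the assumption that the two dense layers have no bias and no activation between them are all hypothetical.

```python
import numpy as np

def pool_and_project(token_embeddings, attention_mask, w_up, w_down):
    """Mean-pool token vectors, apply two dense layers, then L2-normalize."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)
    # Mean over real (non-padding) tokens only.
    pooled = (token_embeddings * mask).sum(axis=0) / max(float(mask.sum()), 1.0)
    # Dense projection 768 -> 3072 -> 768 (assumed: no bias, no activation).
    projected = (pooled @ w_up) @ w_down
    # L2 normalization, so dot products between embeddings equal cosine similarities.
    return projected / np.linalg.norm(projected)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 768)).astype(np.float32)  # 5 tokens, hidden size 768
mask = np.array([1, 1, 1, 1, 0])                           # last token is padding
# Placeholder weights; the real ones are stored in the GGUF file.
w_up = (rng.standard_normal((768, 3072)) * 0.02).astype(np.float32)
w_down = (rng.standard_normal((3072, 768)) * 0.02).astype(np.float32)

embedding = pool_and_project(tokens, mask, w_up, w_down)
print(embedding.shape)
```

Because the mask zeroes out padding positions before averaging, the result does not depend on what the padded token vectors contain.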

Files

| File | Quantization | Size | Parity (cos vs HF) |
|---|---|---|---|
| embeddinggemma-300m.gguf | F32 | ~1.2 GB | 1.0000 |
| embeddinggemma-300m-q8_0.gguf | Q8_0 | ~327 MB | 0.9998 |
| embeddinggemma-300m-q5_k.gguf | Q5_K | ~289 MB | 0.9954 |
| embeddinggemma-300m-q4_k.gguf | Q4_K | ~277 MB | 0.9834 |

Q4_K shows mild degradation (cosine similarity about 1.7% below the HF reference), which is typical for 4-bit quantization of embedding models. Use Q8_0 or Q5_K when the highest fidelity is needed.
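
A parity figure of this kind can be computed as the cosine similarity between an embedding from the GGUF file and the reference embedding for the same input. The sketch below assumes that definition; the helper names `cosine_similarity` and `degradation_percent` are illustrative, and the exact evaluation inputs used for the table are not stated here.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def degradation_percent(parity):
    """Shortfall from perfect parity (1.0), expressed in percent."""
    return 100.0 * (1.0 - parity)

# Parity values copied from the Files table.
parity_by_quant = {"F32": 1.0000, "Q8_0": 0.9998, "Q5_K": 0.9954, "Q4_K": 0.9834}
for name, parity in parity_by_quant.items():
    print(f"{name}: {degradation_percent(parity):.2f}% below the reference")
```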

Quick Start

# With CrispEmbed
crispembed -m embeddinggemma-300m.gguf "Hello world"

See CrispEmbed for full documentation.

Notes

These GGUFs use the Ollama-compatible format with CrispEmbed extension keys:

  • gemma3.is_bidirectional = 1: bidirectional (no causal mask)
  • gemma3.pooling_type = 1: mean pooling
  • gemma3.rope.freq_base_local = 10000.0: sliding-window RoPE theta
  • Dense projection weights stored in F32 for correctness across all quant levels
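
To illustrate what a flag such as gemma3.is_bidirectional changes, the following sketch builds the two attention masks in NumPy. It only illustrates the concept; the function `attention_mask` is hypothetical and does not reflect how CrispEmbed or any other runtime reads the key.

```python
import numpy as np

def attention_mask(seq_len, bidirectional):
    """Boolean matrix; entry [i, j] is True if token i may attend to token j."""
    if bidirectional:
        # No causal mask: every token sees every other token.
        return np.ones((seq_len, seq_len), dtype=bool)
    # Causal mask: token i sees only tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(attention_mask(3, bidirectional=False).astype(int))
print(attention_mask(3, bidirectional=True).astype(int))
```

With the bidirectional mask, every token's representation can draw on the whole input, which is what mean pooling over all tokens relies on.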